27 May BA IT outage miscellaneous discussions thread
#61
Join Date: Oct 2005
Location: London
Posts: 726
This sort of incident needs to be investigated in the manner of a flight operations incident, so analysis of what happened, direct causes and contributing factors, and analysis of how to stop these same causes acting again to cause another incident. Taking a first action of deploying either the HR disciplinary procedure (on internal staff) or lawyers with damages briefs (on external suppliers) will achieve little for better reliability. I encourage anyone who investigates this sort of IT failure, or wants to know how to do it better, to read AAIB accident reports and books on Just Culture in technical operations.
#62
Join Date: Feb 2010
Location: London
Programs: BA GGL (for now) and Lifetime Gold, Marriott fan thanks to Bonvoy Moments
Posts: 5,115
How many empty positioning flights are running? If I were BA, I would be really tempted to just board whoever wants to go anywhere. 400 seats available to Japan? Show a passport and go. If the gate agents can't determine whether a visa is required, don't board that passenger. Maybe have people prove that they have a return ticket (a printout or such). If they did this, I doubt many people would game the system... e.g. if I had a round trip LHR-ZRH, I am not going to jump on a plane to Japan, since getting back would be a problem.
It'd be a way to clear it out. Put Richard Branson in charge of things and there would be a solution. I have boarded third-world airlines with a hand-written ticket, and BA is looking fairly third world at the moment. Time to break out the hand-written tickets.
Depending on how long this lasts it'll be interesting (you know what I mean) to see where this ranks vs. the RBS / NatWest / Ulster Bank failure a few years ago and the recent NHS disaster.
#63
Join Date: Jun 2008
Location: Northern Nevada
Programs: DL,EK
Posts: 1,652
#64
Join Date: Dec 2014
Location: UK
Programs: BA, U2+, SK, AF/KL, IHG, Hilton, others gathering dust...
Posts: 2,552
The reason is understandable, but this forum is the busiest I've seen it in the couple of years I've been on here. I opened the forum homepage and thought there was something wrong with the formatting on my iPad - but no, it was just a massive list of active users. As I write there are 1322 active users on the BA FT forum, over 300 registered users and over 1000 lurkers.
I hope Alex Cruz sent out a few hi-viz jackets for the mods to get them through the night, though FT's IT seems to be coping admirably...
#66
Moderator: British Airways Executive Club, Iberia Airlines, Airport Lounges and Environmentally Friendly Travel
Join Date: Jan 2003
Location: London, UK
Posts: 22,212
#68
Join Date: Mar 2013
Location: US of A
Programs: Delta Diamond, United 1K, BA Blue, Marriott Titanium, Hilton Gold, Amex Platinum
Posts: 1,775
What seems not to be talked about is that this event has been, and probably will continue for a few days to be, a very large stress test for Heathrow airport, at least T3 and T5. There is a difference between testing things with a bunch of volunteers, as they did before T2 opened, and having many thousands stranded with luggage in potentially complex travel circumstances.
In these situations, there are a few things most passengers are looking for:
- Somewhere to be comfortable (standing around for hours even for young people is unpleasant)
- Plenty of food and drink
- Abundant power outlets for portable devices
- Fast wifi (not everyone would have cellular service and even then the mobile networks can be overwhelmed)
- Roomy bathroom facilities
While this is, and will continue to be, an expensive lesson for BA, I hope that Heathrow airport will learn a thing or two as well.
#69
FlyerTalk Evangelist
Join Date: Oct 2006
Location: London, UK
Programs: BA Gold, SQ Gold, KQ Platinum, IHG Diamond Ambassador, Hilton Gold, Marriott Silver, Accor Silver
Posts: 16,348
HAL have head office staff on standby to assist in the terminals when things go wrong, as they did today, as well as the flexibility to deploy additional operational staff; both of these things happened today and presumably will continue for the next day or two. HAL also have contingency stocks of items such as water and snack bars, which help to a degree when passengers are queuing, as well as the ability to hand out refreshment vouchers.
#70
FlyerTalk Evangelist
Join Date: Aug 2002
Location: London
Programs: Mucci. Nothing else matters.
Posts: 38,644
#71
Join Date: Feb 2008
Location: SFO
Programs: Free agent, UA 1K, Bonvoy Gold
Posts: 353
There have been a number of declarations of how it should be done better, with independent data center locations and so on (many in the main thread about this event). Those points are valid, but orchestrating the failover or load distribution between the datacenters is complex. Knowing which services are provided from where, avoiding split-brain situations (ensuring that authoritative data sources live in one place or the other, but not both), and so on is not easy.
In particular, while you can have two isolated copies of a system, two sets of services, etc., in each datacenter, the system that orchestrates which is running at any time has to have knowledge of, and connections into, both datacenters. The fail-over mechanisms are not simple twins where if one dies, the other lives; they are Siamese twins, connected to both datacenters and subject to problems in either: if one dies, the other can get sick.
Comments so far indicate that BA has several locations and uses a private cloud infrastructure. That private cloud will need a cloud resource manager, job scheduler, service availability manager, etc., and those cannot be isolated to one datacenter or another. If an unexpected failure occurs in these cloud management systems, the entire cloud can stop working. This sort of failure has been at the root of several Amazon Web Services outages, as well as other well-known outages, and the possibility of it happening is impossible to remove, because the resource manager must have a connection to every datacenter, and therefore a vulnerability in every datacenter, in order to function.
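To make the split-brain point concrete, here is a toy sketch (entirely hypothetical, and not any airline's actual design) of the standard mitigation: a datacenter may only promote itself to primary if it can reach a strict majority of independent arbiter nodes, so a severed link between datacenters cannot leave both sides believing they are authoritative.

```python
# Toy illustration of split-brain avoidance via majority quorum.
# All names and numbers here are invented for illustration only.

def should_promote(reachable_arbiters: int, total_arbiters: int) -> bool:
    """Promote to primary only if a strict majority of arbiters is visible."""
    return reachable_arbiters > total_arbiters // 2

# After a network partition, datacenter A sees 2 of 3 arbiters and
# datacenter B sees only 1 of 3: at most one side may promote itself.
print(should_promote(2, 3))  # True  -> A becomes primary
print(should_promote(1, 3))  # False -> B stays standby
# With an even arbiter count split down the middle (2 of 4 visible to
# each side), neither side promotes: availability is lost, but the
# split-brain scenario -- two authoritative copies -- is avoided.
print(should_promote(2, 4))  # False
```

This is why such arbiter or witness sets are almost always deployed with an odd member count: an even split can leave the whole system without any primary at all.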
In summary, I am afraid I am tempted, on a bad day, to answer those blunt comments in the other thread with "I think you'll find it's a bit more complicated than that."
Meanwhile, such an outage should obviously be quicker to repair than this one has been, and this is where most of the lessons will lie: how to restart systems, investigation of previously unknown dependencies, reducing complexity to improve separability, and therefore reliability, in the case of partial outages, and so on.
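One of those restart questions can be sketched directly: if you know (or have painfully rediscovered) the dependency graph between services, a safe restart order is a topological sort of that graph. The service names below are invented for illustration; a real airline's graph would be far larger and, as this outage suggests, partly undocumented.

```python
# Hypothetical sketch: restarting services in dependency order.
# The dependency map is invented; it is not BA's real architecture.
from graphlib import TopologicalSorter

# service -> set of services it depends on (which must be up first)
deps = {
    "check_in":     {"reservations", "auth"},
    "boarding":     {"check_in", "baggage"},
    "baggage":      {"reservations"},
    "reservations": {"database"},
    "auth":         {"database"},
    "database":     set(),
}

# static_order() yields every service only after all its dependencies.
restart_order = list(TopologicalSorter(deps).static_order())
print(restart_order)  # "database" comes first, "boarding" last
```

`graphlib` is in the Python standard library (3.9+), and `TopologicalSorter` raises `CycleError` if the graph contains a cycle, which is itself a useful diagnostic: a circular dependency between services is exactly the kind of thing that makes a cold restart so slow.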
Additionally, those who call for 'heads to roll' and for people to be sacked simply because it happened are calling for the wrong thing, and acting in a way that will not improve reliability. Personal vengeance rarely makes complex systems any better.
First discover what went wrong, why it went wrong, and then if, and only if, there was gross negligence or malfeasance do you fire or sue people.
Power loss, disk failures, fiber cuts, hell, even the complete loss of availability of major services in multiple regions, are all expected to happen (and will eventually happen...), and should be appropriately planned for. It boggles my mind that 12+ hours into this and BA still hasn't recovered its systems.
#72
Join Date: Mar 2013
Location: US of A
Programs: Delta Diamond, United 1K, BA Blue, Marriott Titanium, Hilton Gold, Amex Platinum
Posts: 1,775
Lots of empty spaces land-side are great if you want to keep pax flowing from one area to another, but they become a hindrance when said pax become bunched up like bees in a hive.
#73
Join Date: Apr 2014
Location: London
Programs: Don't even mention it. Grrrrrrr.
Posts: 968
As has been said: "I think you'll find that it's a bit more complicated than that".
As the joke goes.
Driver stopping for directions: "Which is the way to <my destination> please?"
Pedestrian: "Well, I wouldn't start from here if I were you"
#74
Join Date: Feb 2008
Location: SFO
Programs: Free agent, UA 1K, Bonvoy Gold
Posts: 353
If you started in 2017 then yes, it's easier. Try walking into a company with legacy systems dating from the 1960s (as most large companies have); add in a few systems acquired as part of a takeover or merger; a few rationalisations of suppliers; a few upgrades here and there of operating systems; a decision by a dogmatic CIO to change database suppliers.
#75
Join Date: Mar 2013
Location: US of A
Programs: Delta Diamond, United 1K, BA Blue, Marriott Titanium, Hilton Gold, Amex Platinum
Posts: 1,775