27 May BA IT outage miscellaneous discussions thread

Old May 27, 2017, 2:29 pm
  #46  
 
Join Date: Apr 2010
Posts: 11
Originally Posted by drb1979
True. I would have thought so from an application perspective especially. Not so much from an infra/DNS/AD perspective, given the nature of those systems, i.e. you don't fail over AD. It's up, it's distributed, and it's pretty resilient.

From the symptoms I've read today, I can't reconcile a data centre power issue with people being able to use their email only if they were already logged in. It seems like something cached is enabling access, but it's all speculation as we've got limited visibility at the moment about what happened!
If they're O365 users, then only being able to access their email if already logged in would make sense.
funkyfreddy is offline  
Old May 27, 2017, 2:30 pm
  #47  
 
Join Date: Jan 2016
Location: LON
Programs: BAEC
Posts: 3,918
Just to throw another likely cause into the ring.... I would probably suspect storage as being either the root cause or a delaying factor in all of this.

Ultimately, if there is a power issue, everything gets a disorderly shutdown. When the power is restored (and we don't know how long it was out for), or there is a failover to the backup data centre, everything has to be restarted in a sensible sequence: things like networking and DNS first, then any shared storage platforms, and only then the applications that depend upon that storage.

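That sequencing point can be sketched as a simple dependency graph. Purely illustrative, with made-up service names; it says nothing about what BA actually runs:

Code:
# Illustrative only: a dependency-ordered restart, with made-up service
# names (this is not BA's actual estate or tooling).
from graphlib import TopologicalSorter  # Python 3.9+

# Each service lists what it depends on.
dependencies = {
    "networking":     [],
    "dns":            ["networking"],
    "authentication": ["networking", "dns"],
    "shared_storage": ["networking", "dns"],
    "virtualisation": ["shared_storage", "authentication"],
    "booking_app":    ["virtualisation", "authentication"],
    "check_in_app":   ["virtualisation", "authentication"],
}

# A topological order is a safe restart sequence: nothing is started
# before everything it depends on is already up.
for service in TopologicalSorter(dependencies).static_order():
    print("restarting", service)
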
But if the sudden shutdown compromised the integrity of the stored data, you could have to wait on the order of hours for rebuilds to take place from parity (checksum) disks. And at the scale of infrastructure that BA probably has, this could be why it has taken a considerable time to restart services.

Fundamentally, something fairly low level in the IT plumbing didn't like the power failure and/or restart. To my mind it would be somewhere along this chain: power, networking, address resolution (DNS), authentication, storage, virtualisation, applications.
plunet is offline  
Old May 27, 2017, 2:31 pm
  #48  
 
Join Date: Apr 2010
Posts: 11
Originally Posted by Banana4321
I assumed from the symptoms (and I know little about O365, but it appears we do have a resident expert on FT) that MS caches the credentials for a period of time, i.e. if you log in to O365, BA's authentication servers in BA data centres are called, and if they give the OK the user can use O365 for a few hours, maybe even a day. However, after those authentication servers disappear, no one can log in to O365 for the first time 'today', as their previous credentials would have expired.
Correct, I know that's how most people seem to do O365 authentication.
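
To illustrate the behaviour being described, a toy sketch only; the token lifetime and names are invented, and this is not how ADFS/O365 is really implemented:

Code:
# Toy sketch of cached-credential behaviour: existing sessions keep
# working on a cached token, fresh logins need the (down) auth servers.
# Lifetimes and names are invented for the example.
import time

TOKEN_LIFETIME = 8 * 3600  # pretend a token is good for 8 hours

class Session:
    def __init__(self, issued_at):
        self.issued_at = issued_at

    def token_valid(self, now):
        return now - self.issued_at < TOKEN_LIFETIME

def can_use_mail(session, now, auth_servers_reachable):
    if session is not None and session.token_valid(now):
        return True                    # already logged in -> still works
    return auth_servers_reachable      # first login of the day -> fails

now = time.time()
print(can_use_mail(Session(now - 2 * 3600), now, auth_servers_reachable=False))  # True
print(can_use_mail(None, now, auth_servers_reachable=False))                     # False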
funkyfreddy is offline  
Old May 27, 2017, 2:31 pm
  #49  
 
Join Date: Aug 2015
Location: Somewhere around Europe...
Programs: BA Gold; MB Ti; HH Diamond; IHG Plat; RR Gold
Posts: 530
Originally Posted by Fitch
What happened to your AY flights then? Were they affected as well?
Quite, I'm in TLL at the moment having done this route this morning - no issues whatsoever!
dakaix is offline  
Old May 27, 2017, 2:33 pm
  #50  
 
Join Date: Oct 2005
Location: London
Posts: 726
Originally Posted by funkyfreddy
If they're O365 users, then only being able to access their email if already logged in would make sense.
Yes, so the network was up. But why couldn't others log in? That points to an authentication or DNS issue, which doesn't make sense if the root cause is a data centre power issue hours earlier. Both systems should be distributed. Unless I'm missing something!
SW7London is offline  
Old May 27, 2017, 2:37 pm
  #51  
 
Join Date: Aug 2015
Location: Somewhere around Europe...
Programs: BA Gold; MB Ti; HH Diamond; IHG Plat; RR Gold
Posts: 530
Originally Posted by drb1979
Yes, so the network was up. But why couldn't others log in? That points to an authentication or DNS issue, which doesn't make sense if the root cause is a data centre power issue hours earlier. Both systems should be distributed. Unless I'm missing something!
I may be missing something, but where have all of these comments about being "unable to login" come from?

Most customer-facing agents would have already been logged into their critical applications (i.e. FLY et al.) when they started their shifts; I passed much of the early-shift BA staff coming in to T5 at 5am this morning.
dakaix is offline  
Old May 27, 2017, 2:37 pm
  #52  
 
Join Date: Apr 2010
Posts: 11
Originally Posted by drb1979
Yes, so the network was up. But why couldn't others log in? That points to an authentication or DNS issue, which doesn't make sense if the root cause is a data centre power issue hours earlier. Both systems should be distributed. Unless I'm missing something!
Indeed. It sounds like AD/DNS/ADFS wasn't being handled fully in the alternate data centre for some reason.

Although I've heard many times before of AD/DNS/ADFS servers in some data centres silently ceasing to handle requests and going unnoticed until some kind of disaster. It seems people don't set up enough monitoring on those kinds of things.
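
The sort of basic probe that would catch a silently failed DNS or federation endpoint isn't complicated. A rough sketch; the hostname and URL are placeholders, not anyone's real infrastructure:

Code:
# Rough sketch of an availability probe for DNS and a federation endpoint.
# The hostname and URL are placeholders, not real infrastructure.
import socket
import urllib.request

def dns_resolves(hostname):
    try:
        socket.gethostbyname(hostname)
        return True
    except socket.gaierror:
        return False

def https_ok(url, timeout=5):
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except Exception:
        return False

checks = {
    "dns": dns_resolves("sts.example.internal"),
    "federation metadata": https_ok(
        "https://sts.example.internal/FederationMetadata/2007-06/FederationMetadata.xml"
    ),
}
for name, ok in checks.items():
    print(name, "OK" if ok else "ALERT")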
funkyfreddy is offline  
Old May 27, 2017, 2:38 pm
  #53  
 
Join Date: Apr 2014
Location: London
Programs: Don't even mention it. Grrrrrrr.
Posts: 968
Originally Posted by drb1979
Yes, so the network was up. But why couldn't others log in? That points to an authentication or DNS issue, which doesn't make sense if the root cause is a data centre power issue hours earlier. Both systems should be distributed. Unless I'm missing something!
It's been said before: what you think is distributed might not be; what you think doesn't get routed via the "down" datacentre is routed through the "down" datacentre; there has already been an undetected failure in the datacentre that you are failing over to; changes made after the last DR rehearsal have prevented DR from working as expected. And many, many more.

No doubt BA have a large legacy infrastructure. You can't click your fingers, or even just spend money, and get a perfect technology infrastructure. Every other company of comparable size and history will have similar issues.
Banana4321 is offline  
Old May 27, 2017, 2:41 pm
  #54  
 
Join Date: Feb 2010
Location: London
Programs: BA GGL (for now) and Lifetime Gold, Marriott fan thanks to Bonvoy Moments
Posts: 5,115
Originally Posted by Fitch
What happened to your AY flights then? Were they affected as well?
No issues. It was a 10:20 flight, so I was in the gate area by the time the issue arose and didn't see or hear anything. It's more about whether the system that gets partner flight details for posting Avios/TPs has fallen over too. Maybe the typical lag in data being sent will help here.
lorcancoyle is offline  
Old May 27, 2017, 2:44 pm
  #55  
 
Join Date: Dec 2009
Location: Flatland
Programs: AA Lifetime Gold 1MM, BA Gold, UA Peon
Posts: 6,112
There have been a number of declarations of how it should be done better, with independent data center locations and so on (many in the main thread about this event). Those points are valid, but orchestrating the failover or load distribution between the datacenters is complex. Knowing what services are provided from where, avoiding split-brain situations (ensuring that authoritative data sources are in one place or the other, but not both) and so on is not easy.

In particular, while you can have two isolated copies of a system, two sets of services, etc, in each datacenter, the system that orchestrates which is running at any time has to have knowledge and connections in both datacenters. The fail-over mechanisms are not simple twins where if one dies, the other lives, but siamese twins, connected to both datacenters and subject to problems in either datacenter - if one dies, the other can get sick.

Comments so far indicate that BA has several locations, and uses a private cloud infrastructure. That private cloud will need a cloud resource manager, job scheduler, service availability manager, etc., and that layer cannot be isolated to one datacenter or another. If an unexpected failure occurs in these cloud management systems, then the entire cloud can stop working. This sort of failure has been the root of several Amazon Web Services outages, as well as other well-known outages, and the possibility of it happening is impossible to remove, because the resource manager must have a connection with all datacenters, and so a vulnerability in all datacenters, to function.
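
As a toy illustration of the split-brain point: a site should only promote itself to primary if a strict majority of independent voters can see it, which is exactly the sort of logic that cannot live wholly inside either datacenter. Voter and site names here are invented for the example:

Code:
# Toy illustration of quorum-based failover to avoid split brain.
# Voter and site names are invented for the example.
def may_become_primary(site, votes):
    """Promote only with a strict majority; an even split promotes nobody."""
    in_favour = sum(1 for choice in votes.values() if choice == site)
    return in_favour > len(votes) // 2

votes = {"witness-1": "dc-a", "witness-2": "dc-a", "witness-3": "dc-b"}
print(may_become_primary("dc-a", votes))  # True  (2 of 3)
print(may_become_primary("dc-b", votes))  # False (1 of 3)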

In summary, I am afraid I am tempted, on a bad day, to answer those blunt comments in the other thread with "I think you'll find it's a bit more complicated than that."

Meanwhile, such an outage obviously should have been quicker for BA to repair, and this is where most of the learnings will lie: questions of how to restart systems, investigation into previously unknown dependencies, reducing complexity to improve separability and therefore reliability in the case of partial outages, and so on.

Additionally, those who call for 'heads to roll' and for people to be sacked simply because it happened are also calling for the wrong thing and acting in a way that will not improve reliability. Personal vengeance rarely makes complex systems any better.

This sort of incident needs to be investigated in the manner of a flight operations incident: analysis of what happened, direct causes and contributing factors, and analysis of how to stop these same causes acting again to cause another incident. Taking a first action of deploying either the HR disciplinary procedure (on internal staff) or lawyers with damages briefs (on external suppliers) will achieve little for better reliability. I encourage anyone who investigates this sort of IT failure, or wants to know how to do it better, to read AAIB accident reports and books on Just Culture in technical operations.

First discover what went wrong, why it went wrong, and then if, and only if, there was gross negligence or malfeasance do you fire or sue people.
flatlander is offline  
Old May 27, 2017, 2:45 pm
  #56  
 
Join Date: Sep 2010
Location: Las Vegas
Programs: BA Gold; Hilton Honors Diamond
Posts: 3,228
Originally Posted by plunet
Just to throw another likely cause into the ring.... I would probably suspect storage as being either the root cause or a delaying factor in all of this.

Ultimately, if there is a power issue, everything gets a disorderly shutdown. When the power is restored (and we don't know how long it was out for), or there is a failover to the backup data centre, everything has to be restarted in a sensible sequence: things like networking and DNS first, then any shared storage platforms, and only then the applications that depend upon that storage.

But if the sudden shutdown compromised the integrity of the stored data, you could have to wait on the order of hours for rebuilds to take place from parity (checksum) disks. And at the scale of infrastructure that BA probably has, this could be why it has taken a considerable time to restart services.

Fundamentally, something fairly low level in the IT plumbing didn't like the power failure and/or restart. To my mind it would be somewhere along this chain: power, networking, address resolution (DNS), authentication, storage, virtualisation, applications.
I would certainly agree with this. Having had some storage shenanigans in the past, I know it takes a long time for an array that has gone down hard to recover and validate the integrity of the data stored on it. That is in addition to any application integrity checks (such as SQL or Exchange, for example). Until the array has validated the data it won't make the volumes available, and so any applications relying on them won't be available either.
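
A back-of-envelope sum shows why "hours" is the right unit here; the figures are illustrative, not a claim about BA's arrays:

Code:
# Back-of-envelope parity rebuild estimate; figures are illustrative only.
disk_size_tb = 4            # capacity of the failed/suspect disk
rebuild_rate_mb_s = 150     # effective rebuild throughput

disk_size_mb = disk_size_tb * 1024 * 1024
hours = disk_size_mb / rebuild_rate_mb_s / 3600
print(f"~{hours:.1f} hours to rebuild one {disk_size_tb} TB disk")  # ~7.8 hours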
Geordie405 is offline  
Old May 27, 2017, 2:48 pm
  #57  
 
Join Date: Dec 2009
Location: London
Programs: Mucci Petit Four de Pucci, RedVee's Navigator Badge, BA Gold, Hilton Diamond
Posts: 3,123
Originally Posted by T8191
It will be interesting to see how this disaster pans out for those affected. I'm hugely grateful I have 2 weeks to go before being exposed to Cruz's BA, which I hope may be functional by then.

Damn, I have 3 BA bookings in MMB, plus another one connecting to AA. I have really lost faith in the operation.
I thought Cruz looked truly ridiculous in his reflective jacket in the official Twitter et al video. He didn't say much of value either.
Fruitcake is offline  
Old May 27, 2017, 2:48 pm
  #58  
 
Join Date: Dec 2016
Programs: BA Gold
Posts: 487
Originally Posted by flatlander
Additionally, those who call for 'heads to roll' and for people to be sacked simply because it happened are also calling for the wrong thing and acting in a way that will not improve reliability. Personal vengeance rarely makes complex systems any better.

This sort of incident needs to be investigated in the manner of a flight operations incident: analysis of what happened, direct causes and contributing factors, and analysis of how to stop these same causes acting again to cause another incident. Taking a first action of deploying either the HR disciplinary procedure (on internal staff) or lawyers with damages briefs (on external suppliers) will achieve little for better reliability. I encourage anyone who investigates this sort of IT failure, or wants to know how to do it better, to read AAIB accident reports and books on Just Culture in technical operations.

First discover what went wrong, why it went wrong, and then if, and only if, there was gross negligence or malfeasance do you fire or sue people.
Completely agree. Perhaps this is a situation where the aviation industry could learn lessons from the aviation industry?
doctoravios is offline  
Old May 27, 2017, 2:48 pm
  #59  
 
Join Date: Sep 2010
Location: Las Vegas
Programs: BA Gold; Hilton Honors Diamond
Posts: 3,228
Originally Posted by flatlander
There have been a number of declarations of how it should be done better, with independent data center locations and so on (many in the main thread about this event). Those points are valid, but orchestrating the failover or load distribution between the datacenters is complex. Knowing what services are provided from where, avoiding split-brain situations (ensuring that authoritative data sources are in one place or the other, but not both) and so on is not easy.

In particular, while you can have two isolated copies of a system, two sets of services, etc, in each datacenter, the system that orchestrates which is running at any time has to have knowledge and connections in both datacenters. The fail-over mechanisms are not simple twins where if one dies, the other lives, but siamese twins, connected to both datacenters and subject to problems in either datacenter - if one dies, the other can get sick.

Comments so far indicate that BA has several locations, and uses a private cloud infrastructure. That private cloud will need a cloud resource manager, job scheduler, service availability manager, etc., and that layer cannot be isolated to one datacenter or another. If an unexpected failure occurs in these cloud management systems, then the entire cloud can stop working. This sort of failure has been the root of several Amazon Web Services outages, as well as other well-known outages, and the possibility of it happening is impossible to remove, because the resource manager must have a connection with all datacenters, and so a vulnerability in all datacenters, to function.

In summary, I am afraid I am tempted, on a bad day, to answer those blunt comments in the other thread with "I think you'll find it's a bit more complicated than that."

Meanwhile, such an outage obviously should have been quicker for BA to repair, and this is where most of the learnings will lie: questions of how to restart systems, investigation into previously unknown dependencies, reducing complexity to improve separability and therefore reliability in the case of partial outages, and so on.

Additionally, those who call for 'heads to roll' and for people to be sacked simply because it happened are also calling for the wrong thing and acting in a way that will not improve reliability. Personal vengeance rarely makes complex systems any better.

This sort of incident needs to be investigated in the manner of a flight operations incident: analysis of what happened, direct causes and contributing factors, and analysis of how to stop these same causes acting again to cause another incident. Taking a first action of deploying either the HR disciplinary procedure (on internal staff) or lawyers with damages briefs (on external suppliers) will achieve little for better reliability. I encourage anyone who investigates this sort of IT failure, or wants to know how to do it better, to read AAIB accident reports and books on Just Culture in technical operations.

First discover what went wrong, why it went wrong, and then if, and only if, there was gross negligence or malfeasance do you fire or sue people.
^

I agree wholeheartedly with this. "A little more complicated" is putting it mildly; it's often a lot more complicated than that.
Geordie405 is offline  
Old May 27, 2017, 2:49 pm
  #60  
 
Join Date: Jun 2008
Location: Northern Nevada
Programs: DL,EK
Posts: 1,652
How many empty positioning flights are running? If I were BA, I would be really tempted to just board whoever wants to go anywhere. 400 seats available to Japan. Show a passport and go. If the gate agents could not determine if a visa was required, then don't board them. Maybe have them prove that they have a return ticket (printout or such). If they did this, I doubt many people would game the system... e.g. if I had a round trip LHR-ZRH, I am not going to jump on a plane to Japan since getting back would be a problem.

It'd be a way to clear it out. Put Richard Branson in charge of things and there would be a solution. I have boarded third world airlines with a hand written ticket and BA is looking fairly third world at the moment. Time to break out the hand-written tickets.
DesertNomad is offline  

