27 May BA IT outage miscellaneous discussions thread

Old May 27, 2017, 2:29 pm
  #46  
 
Join Date: Apr 2010
Posts: 11
Originally Posted by drb1979
True. I would have thought so from an application perspective especially. Not so much from an infra/DNS/AD perspective, given the nature of those systems, i.e. you don't fail over AD. It's up, it's distributed, and it's pretty resilient.

From the symptoms I've read today, I can't reconcile a data centre power issue with people being able to use their email only if they were already logged in. It seems like something cached is enabling access, but it's all speculation as we've got limited visibility at the moment about what happened!
If they're O365 users, then only being able to access their email if already logged in would make sense.
funkyfreddy is offline  
Old May 27, 2017, 2:30 pm
  #47  
 
Join Date: Jan 2016
Location: LON
Programs: BAEC
Posts: 3,918
Just to throw another likely cause into the ring.... I would probably suspect storage as being either the root cause or a delaying factor in all of this.

Ultimately, if there is a power issue, everything gets a disorderly shutdown. When the power is restored (and we don't know how long it was out for), or there is a failover to the backup data centre, everything has to be restarted in a sensible sequence: things like networking and DNS first, then any shared storage platforms, and only then the applications that depend upon that storage.

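That sequencing point can be sketched as a simple dependency graph. Purely illustrative, with made-up service names; it says nothing about what BA actually runs:

Code:
# Illustrative only: a dependency-ordered restart, with made-up service
# names (this is not BA's actual estate or tooling).
from graphlib import TopologicalSorter  # Python 3.9+

# Each service lists what it depends on.
dependencies = {
    "networking":     [],
    "dns":            ["networking"],
    "authentication": ["networking", "dns"],
    "shared_storage": ["networking", "dns"],
    "virtualisation": ["shared_storage", "authentication"],
    "booking_app":    ["virtualisation", "authentication"],
    "check_in_app":   ["virtualisation", "authentication"],
}

# A topological order is a safe restart sequence: nothing is started
# before everything it depends on is already up.
for service in TopologicalSorter(dependencies).static_order():
    print("restarting", service)
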
But if the sudden shutdown compromised the integrity of the stored data, you could have to wait on the order of hours for rebuilds to take place from parity (checksum) disks. And at the scale of infrastructure that BA probably has, this could be why it has taken a considerable time to restart services.

Fundamentally, something fairly low level in the IT plumbing didn't like the power failure and/or restart. To my mind it would be somewhere along this chain: power, networking, address resolution (DNS), authentication, storage, virtualisation, applications.
plunet is offline  
Old May 27, 2017, 2:31 pm
  #48  
 
Join Date: Apr 2010
Posts: 11
Originally Posted by Banana4321
I assumed from the symptoms (and I know little about O365, but it appears we do have a resident expert on FT) that MS caches the credentials for a period of time, i.e. if you log in to O365, BA's authentication servers in BA data centres are called, and if they give the OK the user can use O365 for a few hours, maybe even a day. However, after those authentication servers disappear, no one can log in to O365 for the first time 'today', as their previous credentials would have expired.
Correct, I know that's how most people seem to do O365 authentication.
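
To illustrate the behaviour being described, a toy sketch only; the token lifetime and names are invented, and this is not how ADFS/O365 is really implemented:

Code:
# Toy sketch of cached-credential behaviour: existing sessions keep
# working on a cached token, fresh logins need the (down) auth servers.
# Lifetimes and names are invented for the example.
import time

TOKEN_LIFETIME = 8 * 3600  # pretend a token is good for 8 hours

class Session:
    def __init__(self, issued_at):
        self.issued_at = issued_at

    def token_valid(self, now):
        return now - self.issued_at < TOKEN_LIFETIME

def can_use_mail(session, now, auth_servers_reachable):
    if session is not None and session.token_valid(now):
        return True                    # already logged in -> still works
    return auth_servers_reachable      # first login of the day -> fails

now = time.time()
print(can_use_mail(Session(now - 2 * 3600), now, auth_servers_reachable=False))  # True
print(can_use_mail(None, now, auth_servers_reachable=False))                     # False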
funkyfreddy is offline  
Old May 27, 2017, 2:31 pm
  #49  
 
Join Date: Aug 2015
Location: Somewhere around Europe...
Programs: BA Gold; MB Ti; HH Diamond; IHG Plat; RR Gold
Posts: 530
Originally Posted by Fitch
What happened to your AY flights then? Were they affected as well?
Quite, I'm in TLL at the moment having done this route this morning - no issues whatsoever!
dakaix is offline  
Old May 27, 2017, 2:33 pm
  #50  
 
Join Date: Oct 2005
Location: London
Posts: 726
Originally Posted by funkyfreddy
If they're O365 users, then only being able to access their email if already logged in would make sense.
Yes, so the network was up. But why couldn't others log in? That points to an authentication or DNS issue, which doesn't make sense if the root cause is a data centre power issue hours earlier. Both systems should be distributed. Unless I'm missing something!
SW7London is offline  
Old May 27, 2017, 2:37 pm
  #51  
 
Join Date: Aug 2015
Location: Somewhere around Europe...
Programs: BA Gold; MB Ti; HH Diamond; IHG Plat; RR Gold
Posts: 530
Originally Posted by drb1979
Yes, so the network was up. But why couldn't others log in? That points to an authentication or DNS issue, which doesn't make sense if the root cause is a data centre power issue hours earlier. Both systems should be distributed. Unless I'm missing something!
I may be missing something, but where have all of these comments about being "unable to login" come from?

Most customer-facing agents would have already been logged into their critical applications (i.e. FLY et al.) when they started their shifts; I passed much of the early-shift BA staff coming in to T5 at 5am this morning.
dakaix is offline  
Old May 27, 2017, 2:37 pm
  #52  
 
Join Date: Apr 2010
Posts: 11
Originally Posted by drb1979
Yes, so the network was up. But why couldn't others log in? That points to an authentication or DNS issue, which doesn't make sense if the root cause is a data centre power issue hours earlier. Both systems should be distributed. Unless I'm missing something!
Indeed. It sounds like AD/DNS/ADFS wasn't being handled fully in the alternate data centre for some reason.

Although I've heard many times before of AD/DNS/ADFS servers in some data centres silently ceasing to handle requests and going unnoticed until some kind of disaster. It seems people don't set up enough monitoring on those kinds of things.
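
The sort of basic probe that would catch a silently failed DNS or federation endpoint isn't complicated. A rough sketch; the hostname and URL are placeholders, not anyone's real infrastructure:

Code:
# Rough sketch of an availability probe for DNS and a federation endpoint.
# The hostname and URL are placeholders, not real infrastructure.
import socket
import urllib.request

def dns_resolves(hostname):
    try:
        socket.gethostbyname(hostname)
        return True
    except socket.gaierror:
        return False

def https_ok(url, timeout=5):
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except Exception:
        return False

checks = {
    "dns": dns_resolves("sts.example.internal"),
    "federation metadata": https_ok(
        "https://sts.example.internal/FederationMetadata/2007-06/FederationMetadata.xml"
    ),
}
for name, ok in checks.items():
    print(name, "OK" if ok else "ALERT")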
funkyfreddy is offline  
Old May 27, 2017, 2:38 pm
  #53  
 
Join Date: Apr 2014
Location: London
Programs: Don't even mention it. Grrrrrrr.
Posts: 968
Originally Posted by drb1979
Yes, so the network was up. But why couldn't others log in? That points to an authentication or DNS issue, which doesn't make sense if the root cause is a data centre power issue hours earlier. Both systems should be distributed. Unless I'm missing something!
It's been said before: what you think is distributed might not be; what you think doesn't get routed via the "down" datacentre is routed through the "down" datacentre; there has already been an undetected failure in the datacentre that you are failing over to; changes made after the last DR rehearsal have prevented DR from working as expected. And many, many more.

No doubt BA have a large legacy infrastructure. You can't click your fingers, or even just spend money, and get a perfect technology infrastructure. Every other company of comparable size and history will have similar issues.
Banana4321 is offline  
Old May 27, 2017, 2:41 pm
  #54  
 
Join Date: Feb 2010
Location: London
Programs: BA GGL (for now) and Lifetime Gold, Marriott fan thanks to Bonvoy Moments
Posts: 5,115
Originally Posted by Fitch
What happened to your AY flights then? Were they affected as well?
No issues. It was a 10:20 flight, so I was in the gate area by the time the issue arose and didn't see or hear anything. It's more about whether the system that gets partner flight details for posting Avios/TPs has fallen over too. Maybe the typical lag in data being sent will help here.
lorcancoyle is offline  
Old May 27, 2017, 2:44 pm
  #55  
 
Join Date: Dec 2009
Location: Flatland
Programs: AA Lifetime Gold 1MM, BA Gold, UA Peon
Posts: 6,112
There have been a number of declarations of how it should be done better, with independent data center locations and so on (many in the main thread about this event). Those points are valid, but orchestrating the failover or load distribution between the datacenters is complex. Knowing what services are provided from where, avoiding split-brain situations (ensuring that authoritative data sources are in one place or the other, but not both) and so on is not easy.

In particular, while you can have two isolated copies of a system, two sets of services, etc, in each datacenter, the system that orchestrates which is running at any time has to have knowledge and connections in both datacenters. The fail-over mechanisms are not simple twins where if one dies, the other lives, but siamese twins, connected to both datacenters and subject to problems in either datacenter - if one dies, the other can get sick.

Comments so far indicate that BA has several locations, and uses a private cloud infrastructure. That private cloud will need a cloud resource manager, job scheduler, service availability manager, etc., and that layer cannot be isolated to one datacenter or another. If an unexpected failure occurs in these cloud management systems, then the entire cloud can stop working. This sort of failure has been the root of several Amazon Web Services outages, as well as other well-known outages, and the possibility of it happening is impossible to remove, because the resource manager must have a connection with all datacenters, and so a vulnerability in all datacenters, to function.
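
As a toy illustration of the split-brain point: a site should only promote itself to primary if a strict majority of independent voters can see it, which is exactly the sort of logic that cannot live wholly inside either datacenter. Voter and site names here are invented for the example:

Code:
# Toy illustration of quorum-based failover to avoid split brain.
# Voter and site names are invented for the example.
def may_become_primary(site, votes):
    """Promote only with a strict majority; an even split promotes nobody."""
    in_favour = sum(1 for choice in votes.values() if choice == site)
    return in_favour > len(votes) // 2

votes = {"witness-1": "dc-a", "witness-2": "dc-a", "witness-3": "dc-b"}
print(may_become_primary("dc-a", votes))  # True  (2 of 3)
print(may_become_primary("dc-b", votes))  # False (1 of 3)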

In summary, I am afraid I am tempted, on a bad day, to answer those blunt comments in the other thread with "I think you'll find it's a bit more complicated than that."

Meanwhile, such an outage obviously should have been quicker for BA to repair, and this is where most of the learnings will lie: questions of how to restart systems, investigation into previously unknown dependencies, reducing complexity to improve separability and therefore reliability in the case of partial outages, and so on.

Additionally, those who call for 'heads to roll' and for people to be sacked simply because it happened are also calling for the wrong thing and acting in a way that will not improve reliability. Personal vengeance rarely makes complex systems any better.

This sort of incident needs to be investigated in the manner of a flight operations incident: analysis of what happened, direct causes and contributing factors, and analysis of how to stop these same causes acting again to cause another incident. Taking a first action of deploying either the HR disciplinary procedure (on internal staff) or lawyers with damages briefs (on external suppliers) will achieve little for better reliability. I encourage anyone who investigates this sort of IT failure, or wants to know how to do it better, to read AAIB accident reports and books on Just Culture in technical operations.

First discover what went wrong, why it went wrong, and then if, and only if, there was gross negligence or malfeasance do you fire or sue people.
flatlander is offline  
Old May 27, 2017, 2:45 pm
  #56  
 
Join Date: Sep 2010
Location: Las Vegas
Programs: BA Gold; Hilton Honors Diamond
Posts: 3,228
Originally Posted by plunet
Just to throw another likely cause into the ring.... I would probably suspect storage as being either the root cause or a delaying factor in all of this.

Ultimately, if there is a power issue, everything gets a disorderly shutdown. When the power is restored (and we don't know how long it was out for), or there is a failover to the backup data centre, everything has to be restarted in a sensible sequence: things like networking and DNS first, then any shared storage platforms, and only then the applications that depend upon that storage.

But if the sudden shutdown compromised the integrity of the stored data, you could have to wait on the order of hours for rebuilds to take place from parity (checksum) disks. And at the scale of infrastructure that BA probably has, this could be why it has taken a considerable time to restart services.

Fundamentally, something fairly low level in the IT plumbing didn't like the power failure and/or restart. To my mind it would be somewhere along this chain: power, networking, address resolution (DNS), authentication, storage, virtualisation, applications.
I would certainly agree with this. Having had some storage shenanigans in the past, I know it takes a long time for an array that has gone down hard to recover and validate the integrity of the data stored on it. That is in addition to any application integrity checks (such as SQL or Exchange, for example). Until the array has validated the data it won't make the volumes available, and so any applications relying on them won't be available either.
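
A back-of-envelope sum shows why "hours" is the right unit here; the figures are illustrative, not a claim about BA's arrays:

Code:
# Back-of-envelope parity rebuild estimate; figures are illustrative only.
disk_size_tb = 4            # capacity of the failed/suspect disk
rebuild_rate_mb_s = 150     # effective rebuild throughput

disk_size_mb = disk_size_tb * 1024 * 1024
hours = disk_size_mb / rebuild_rate_mb_s / 3600
print(f"~{hours:.1f} hours to rebuild one {disk_size_tb} TB disk")  # ~7.8 hours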
Geordie405 is offline  
Old May 27, 2017, 2:48 pm
  #57  
 
Join Date: Dec 2009
Location: London
Programs: Mucci Petit Four de Pucci, RedVee's Navigator Badge, BA Gold, Hilton Diamond
Posts: 3,123
Originally Posted by T8191
It will be interesting to see how this disaster pans out for those affected. I'm hugely grateful I have 2 weeks to go before being exposed to Cruz's BA, which I hope may be functional by then.

Damn, I have 3 BA bookings in MMB, plus another one connecting to AA. I have really lost faith in the operation.
I thought Cruz looked truly ridiculous in his reflective jacket in the official Twitter et al video. He didn't say much of value either.
Fruitcake is offline  
Old May 27, 2017, 2:48 pm
  #58  
 
Join Date: Dec 2016
Programs: BA Gold
Posts: 487
Originally Posted by flatlander
Additionally, those who call for 'heads to roll' and for people to be sacked simply because it happened are also calling for the wrong thing and acting in a way that will not improve reliability. Personal vengeance rarely makes complex systems any better.

This sort of incident needs to be investigated in the manner of a flight operations incident: analysis of what happened, direct causes and contributing factors, and analysis of how to stop these same causes acting again to cause another incident. Taking a first action of deploying either the HR disciplinary procedure (on internal staff) or lawyers with damages briefs (on external suppliers) will achieve little for better reliability. I encourage anyone who investigates this sort of IT failure, or wants to know how to do it better, to read AAIB accident reports and books on Just Culture in technical operations.

First discover what went wrong, why it went wrong, and then if, and only if, there was gross negligence or malfeasance do you fire or sue people.
Completely agree. Perhaps this is a situation where the aviation industry could learn lessons from the aviation industry?
doctoravios is offline  
Old May 27, 2017, 2:48 pm
  #59  
 
Join Date: Sep 2010
Location: Las Vegas
Programs: BA Gold; Hilton Honors Diamond
Posts: 3,228
Originally Posted by flatlander
There have been a number of declarations of how it should be done better, with independent data center locations and so on (many in the main thread about this event). Those points are valid, but orchestrating the failover or load distribution between the datacenters is complex. Knowing what services are provided from where, avoiding split-brain situations (ensuring that authoritative data sources are in one place or the other, but not both) and so on is not easy.

In particular, while you can have two isolated copies of a system, two sets of services, etc, in each datacenter, the system that orchestrates which is running at any time has to have knowledge and connections in both datacenters. The fail-over mechanisms are not simple twins where if one dies, the other lives, but siamese twins, connected to both datacenters and subject to problems in either datacenter - if one dies, the other can get sick.

Comments so far indicate that BA has several locations, and uses a private cloud infrastructure. That private cloud will need a cloud resource manager, job scheduler, service availability manager, etc., and that layer cannot be isolated to one datacenter or another. If an unexpected failure occurs in these cloud management systems, then the entire cloud can stop working. This sort of failure has been the root of several Amazon Web Services outages, as well as other well-known outages, and the possibility of it happening is impossible to remove, because the resource manager must have a connection with all datacenters, and so a vulnerability in all datacenters, to function.

In summary, I am afraid I am tempted, on a bad day, to answer those blunt comments in the other thread with "I think you'll find it's a bit more complicated than that."

Meanwhile, such an outage obviously should have been quicker for BA to repair, and this is where most of the learnings will lie: questions of how to restart systems, investigation into previously unknown dependencies, reducing complexity to improve separability and therefore reliability in the case of partial outages, and so on.

Additionally, those who call for 'heads to roll' and for people to be sacked simply because it happened are also calling for the wrong thing and acting in a way that will not improve reliability. Personal vengeance rarely makes complex systems any better.

This sort of incident needs to be investigated in the manner of a flight operations incident: analysis of what happened, direct causes and contributing factors, and analysis of how to stop these same causes acting again to cause another incident. Taking a first action of deploying either the HR disciplinary procedure (on internal staff) or lawyers with damages briefs (on external suppliers) will achieve little for better reliability. I encourage anyone who investigates this sort of IT failure, or wants to know how to do it better, to read AAIB accident reports and books on Just Culture in technical operations.

First discover what went wrong, why it went wrong, and then if, and only if, there was gross negligence or malfeasance do you fire or sue people.
^

I agree wholeheartedly with this. "A little more complicated" is putting it mildly; it's often a lot more complicated than that.
Geordie405 is offline  
Old May 27, 2017, 2:49 pm
  #60  
 
Join Date: Jun 2008
Location: Northern Nevada
Programs: DL,EK
Posts: 1,652
How many empty positioning flights are running? If I were BA, I would be really tempted to just board whoever wants to go anywhere. 400 seats available to Japan. Show a passport and go. If the gate agents could not determine if a visa was required, then don't board them. Maybe have them prove that they have a return ticket (printout or such). If they did this, I doubt many people would game the system... e.g. if I had a round trip LHR-ZRH, I am not going to jump on a plane to Japan since getting back would be a problem.

It'd be a way to clear it out. Put Richard Branson in charge of things and there would be a solution. I have boarded third world airlines with a hand written ticket and BA is looking fairly third world at the moment. Time to break out the hand-written tickets.
DesertNomad is offline  

