27 May BA IT outage miscellaneous discussions thread
#61
Join Date: Oct 2005
Location: London
Posts: 726
This sort of incident needs to be investigated in the manner of a flight operations incident, so analysis of what happened, direct causes and contributing factors, and analysis of how to stop these same causes acting again to cause another incident. Taking a first action of deploying either the HR disciplinary procedure (on internal staff) or lawyers with damages briefs (on external suppliers) will achieve little for better reliability. I encourage anyone who investigates this sort of IT failure, or wants to know how to do it better, to read AAIB accident reports and books on Just Culture in technical operations.
#62
Join Date: Feb 2010
Location: London
Programs: BA GGL (for now) and Lifetime Gold, Marriott fan thanks to Bonvoy Moments
Posts: 5,115
How many empty positioning flights are running? If I were BA, I would be really tempted to just board whoever wants to go anywhere. 400 seats available to Japan? Show a passport and go. If the gate agents can't determine whether a visa is required, don't board that passenger. Maybe have people prove that they have a return ticket (a printout or such). If they did this, I doubt many people would game the system... e.g. if I had a round trip LHR-ZRH, I am not going to jump on a plane to Japan, since getting back would be a problem.
It'd be a way to clear it out. Put Richard Branson in charge of things and there would be a solution. I have boarded third-world airlines with a hand-written ticket, and BA is looking fairly third world at the moment. Time to break out the hand-written tickets.
Depending on how long this lasts it'll be interesting (you know what I mean) to see where this ranks vs. the RBS / NatWest / Ulster Bank failure a few years ago and the recent NHS disaster.
#63
Join Date: Jun 2008
Location: Northern Nevada
Programs: DL,EK
Posts: 1,652
#64
Join Date: Dec 2014
Location: UK
Programs: BA, U2+, SK, AF/KL, IHG, Hilton, others gathering dust...
Posts: 2,552
The reason is understandable, but this forum is the busiest I've seen it in the couple of years I've been on here. I opened the forum homepage and thought there was something wrong with the formatting on my iPad - but no, it was just a massive list of active users. As I write there are 1322 active users on the BA FT forum, over 300 registered users and over 1000 lurkers.
I hope Alex Cruz sent out a few hi-viz jackets for the mods to get them through the night, though FT's IT seems to be coping admirably...
#66
Moderator: British Airways Executive Club, Iberia Airlines, Airport Lounges and Environmentally Friendly Travel
Join Date: Jan 2003
Location: London, UK
Posts: 22,212
#68
Join Date: Mar 2013
Location: US of A
Programs: Delta Diamond, United 1K, BA Blue, Marriott Titanium, Hilton Gold, Amex Platinum
Posts: 1,775
What seems not to be talked about is that this event has been, and probably will continue for a few days to be, a very large stress test for Heathrow airport, at least T3 and T5. There is a difference between testing things with a bunch of volunteers, as they did before T2 opened, and having many thousands stranded with luggage in potentially complex travel circumstances.
In these situations, there are a few things most passengers are looking for:
- Somewhere to be comfortable (standing around for hours even for young people is unpleasant)
- Plenty of food and drink
- Abundant power outlets for portable devices
- Fast wifi (not everyone would have cellular service and even then the mobile networks can be overwhelmed)
- Roomy bathroom facilities
While this is, and will continue to be, an expensive lesson for BA, I hope that Heathrow airport will learn a thing or two as well.
#69
FlyerTalk Evangelist
Join Date: Oct 2006
Location: London, UK
Programs: BA Gold, SQ Gold, KQ Platinum, IHG Diamond Ambassador, Hilton Gold, Marriott Silver, Accor Silver
Posts: 16,348
HAL have head office staff on standby to assist in the terminals when things go wrong, as they did today, as well as the flexibility to deploy additional operational staff; both of these things happened today and presumably will continue for the next day or two. HAL also have contingency stocks of items such as water and snack bars, which help to a degree when passengers are queuing, as well as the ability to hand out refreshment vouchers.
#70
FlyerTalk Evangelist
Join Date: Aug 2002
Location: London
Programs: Mucci. Nothing else matters.
Posts: 38,644
#71
Join Date: Feb 2008
Location: SFO
Programs: Free agent, UA 1K, Bonvoy Gold
Posts: 353
There have been a number of declarations of how it should be done better, with independent data center locations and so on (many in the main thread about this event). Those points are valid, but orchestrating the failover or load distribution between the datacenters is complex. Knowing which services are provided from where, avoiding split-brain situations (ensuring that authoritative data sources live in one place or the other, but not both), and so on is not easy.
In particular, while you can have two isolated copies of a system, two sets of services, etc., in each datacenter, the system that orchestrates which is running at any time has to have knowledge of, and connections into, both datacenters. The fail-over mechanisms are not simple twins where if one dies, the other lives; they are Siamese twins, connected to both datacenters and subject to problems in either: if one dies, the other can get sick.
Comments so far indicate that BA has several locations and uses a private cloud infrastructure. That private cloud will need a cloud resource manager, job scheduler, service availability manager, etc., and those cannot be isolated to one datacenter or another. If an unexpected failure occurs in these cloud management systems, the entire cloud can stop working. This sort of failure has been at the root of several Amazon Web Services outages, as well as other well-known outages, and the possibility of it happening is impossible to remove, because the resource manager must have a connection to every datacenter, and therefore a vulnerability in every datacenter, in order to function.
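To make the split-brain point concrete, here is a toy sketch (entirely hypothetical, and not any airline's actual design) of the standard mitigation: a datacenter may only promote itself to primary if it can reach a strict majority of independent arbiter nodes, so a severed link between datacenters cannot leave both sides believing they are authoritative.

```python
# Toy illustration of split-brain avoidance via majority quorum.
# All names and numbers here are invented for illustration only.

def should_promote(reachable_arbiters: int, total_arbiters: int) -> bool:
    """Promote to primary only if a strict majority of arbiters is visible."""
    return reachable_arbiters > total_arbiters // 2

# After a network partition, datacenter A sees 2 of 3 arbiters and
# datacenter B sees only 1 of 3: at most one side may promote itself.
print(should_promote(2, 3))  # True  -> A becomes primary
print(should_promote(1, 3))  # False -> B stays standby
# With an even arbiter count split down the middle (2 of 4 visible to
# each side), neither side promotes: availability is lost, but the
# split-brain scenario -- two authoritative copies -- is avoided.
print(should_promote(2, 4))  # False
```

This is why such arbiter or witness sets are almost always deployed with an odd member count: an even split can leave the whole system without any primary at all.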
In summary, I am afraid I am tempted, on a bad day, to answer those blunt comments in the other thread with "I think you'll find it's a bit more complicated than that."
Meanwhile, such an outage should obviously be quicker to repair than this one has been, and this is where most of the lessons will lie: how to restart systems, investigation of previously unknown dependencies, reducing complexity to improve separability, and therefore reliability, in the case of partial outages, and so on.
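One of those restart questions can be sketched directly: if you know (or have painfully rediscovered) the dependency graph between services, a safe restart order is a topological sort of that graph. The service names below are invented for illustration; a real airline's graph would be far larger and, as this outage suggests, partly undocumented.

```python
# Hypothetical sketch: restarting services in dependency order.
# The dependency map is invented; it is not BA's real architecture.
from graphlib import TopologicalSorter

# service -> set of services it depends on (which must be up first)
deps = {
    "check_in":     {"reservations", "auth"},
    "boarding":     {"check_in", "baggage"},
    "baggage":      {"reservations"},
    "reservations": {"database"},
    "auth":         {"database"},
    "database":     set(),
}

# static_order() yields every service only after all its dependencies.
restart_order = list(TopologicalSorter(deps).static_order())
print(restart_order)  # "database" comes first, "boarding" last
```

`graphlib` is in the Python standard library (3.9+), and `TopologicalSorter` raises `CycleError` if the graph contains a cycle, which is itself a useful diagnostic: a circular dependency between services is exactly the kind of thing that makes a cold restart so slow.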
Additionally, those who call for 'heads to roll' and for people to be sacked simply because it happened are calling for the wrong thing, and acting in a way that will not improve reliability. Personal vengeance rarely makes complex systems any better.
First discover what went wrong, why it went wrong, and then if, and only if, there was gross negligence or malfeasance do you fire or sue people.
Power loss, disk failures, fiber cuts, hell, even the complete loss of availability of major services in multiple regions, are all expected to happen (and will eventually happen...), and should be appropriately planned for. It boggles my mind that 12+ hours into this and BA still hasn't recovered its systems.
#72
Join Date: Mar 2013
Location: US of A
Programs: Delta Diamond, United 1K, BA Blue, Marriott Titanium, Hilton Gold, Amex Platinum
Posts: 1,775
Lots of empty spaces land-side are great if you want to keep pax flowing from one area to another, but they become a hindrance when said pax become bunched up like bees in a hive.
#73
Join Date: Apr 2014
Location: London
Programs: Don't even mention it. Grrrrrrr.
Posts: 968
As has been said: "I think you'll find that it's a bit more complicated than that".
As the joke goes.
Driver stopping for directions: "Which is the way to <my destination> please?"
Pedestrian: "Well, I wouldn't start from here if I were you"
#74
Join Date: Feb 2008
Location: SFO
Programs: Free agent, UA 1K, Bonvoy Gold
Posts: 353
If you started in 2017 then yes, it's easier. Try walking into a company with legacy systems dating from the 1960s (as most large companies have); add in a few systems acquired as part of a takeover or merger; a few rationalisations of suppliers; a few upgrades here and there of operating systems; a decision by a dogmatic CIO to change database suppliers.
#75
Join Date: Mar 2013
Location: US of A
Programs: Delta Diamond, United 1K, BA Blue, Marriott Titanium, Hilton Gold, Amex Platinum
Posts: 1,775