27 May BA IT outage miscellaneous discussions thread

Old May 27, 2017, 2:53 pm
  #61  
 
Join Date: Oct 2005
Location: London
Posts: 726
Originally Posted by flatlander

This sort of incident needs to be investigated in the manner of a flight operations incident, so analysis of what happened, direct causes and contributing factors, and analysis of how to stop these same causes acting again to cause another incident. Taking a first action of deploying either the HR disciplinary procedure (on internal staff) or lawyers with damages briefs (on external suppliers) will achieve little for better reliability. I encourage anyone who investigates this sort of IT failure, or wants to know how to do it better, to read AAIB accident reports and books on Just Culture in technical operations.
I agree this certainly isn't the time to start threatening staff or suppliers. But don't forget that accidents investigated by the AAIB are seldom caused by cost cutting, whereas a lot of technology/IT failures are. In my view there absolutely should be consequences for short-term-thinking managers who take the easy option of cutting costs and increasing risks within IT.
SW7London is offline  
Old May 27, 2017, 3:08 pm
  #62  
 
Join Date: Feb 2010
Location: London
Programs: BA GGL (for now) and Lifetime Gold, Marriott fan thanks to Bonvoy Moments
Posts: 5,115
Originally Posted by DesertNomad
How many empty positioning flights are running? If I were BA, I would be really tempted to just board whoever wants to go anywhere. 400 seats available to Japan. Show a passport and go. If the gate agents could not determine if a visa was required, then don't board them. Maybe have them prove that they have a return ticket (printout or such). If they did this, I doubt many people would game the system... e.g. if I had a round trip LHR-ZRH, I am not going to jump on a plane to Japan since getting back would be a problem.

It'd be a way to clear it out. Put Richard Branson in charge of things and there would be a solution. I have boarded third world airlines with a hand written ticket and BA is looking fairly third world at the moment. Time to break out the hand-written tickets.
Unfortunately I don't think many other countries would permit flights to arrive unless the manifest had been provided in advance - Spain and the US are two with API requirements that spring to mind.

Depending on how long this lasts it'll be interesting (you know what I mean) to see where this ranks vs. the RBS / NatWest / Ulster Bank failure a few years ago and the recent NHS disaster.
lorcancoyle is offline  
Old May 27, 2017, 3:21 pm
  #63  
 
Join Date: Jun 2008
Location: Northern Nevada
Programs: DL,EK
Posts: 1,652
Originally Posted by lorcancoyle
Unfortunately I don't think many other countries would permit flights to arrive unless the manifest had been provided in advance - Spain and the US are two with API requirements that spring to mind.
That's why we have pens, paper and fax machines.
DesertNomad is offline  
Old May 27, 2017, 3:29 pm
  #64  
 
Join Date: Dec 2014
Location: UK
Programs: BA, U2+, SK, AF/KL, IHG, Hilton, others gathering dust...
Posts: 2,552
The reason is understandable, but this forum is the busiest I've seen it in the couple of years I've been on here. I opened the forum homepage and thought there was something wrong with the formatting on my iPad - but no, it was just a massive list of active users. As I write there are 1322 active users on the BA FT forum, over 300 registered users and over 1000 lurkers.

I hope Alex Cruz sent out a few hi-viz jackets for the mods to get them through the night, though FT's IT seems to be coping admirably...
Oaxaca is offline  
Old May 27, 2017, 3:39 pm
  #65  
 
Join Date: Jun 2014
Location: Heathrow
Posts: 218
OK so some bits seem to be coming back online.

That includes the page on the intranet where colleagues can get info about the operational performance.

Risk status: low. Just in case anyone was wondering...
alextheengineer is offline  
Old May 27, 2017, 3:40 pm
  #66  
Moderator: British Airways Executive Club, Iberia Airlines, Airport Lounges and Environmentally Friendly Travel
 
Join Date: Jan 2003
Location: London, UK
Posts: 22,212
Originally Posted by Oaxaca
The reason is understandable, but this forum is the busiest I've seen it in the couple of years I've been on here. I opened the forum homepage and thought there was something wrong with the formatting on my iPad - but no, it was just a massive list of active users. As I write there are 1322 active users on the BA FT forum, over 300 registered users and over 1000 lurkers.

I hope Alex Cruz sent out a few hi-viz jackets for the mods to get them through the night, though FT's IT seems to be coping admirably...
We were approaching 1,700 active users earlier this evening. That is a remarkable figure for any FlyerTalk forum, matched only by the UA forum several weeks ago.
Prospero is offline  
Old May 27, 2017, 3:41 pm
  #67  
 
Join Date: Aug 2016
Programs: BAEC - Lowly blue
Posts: 282
Seeing a few flights scheduled to depart at 1-2am. Guessing Heathrow have allowed BA to break the curfew?
Lewis King is offline  
Old May 27, 2017, 4:08 pm
  #68  
 
Join Date: Mar 2013
Location: US of A
Programs: Delta Diamond, United 1K, BA Blue, Marriott Titanium, Hilton Gold, Amex Platinum
Posts: 1,775
What doesn't seem to be talked about is that this event has been, and probably will continue for a few days to be, a very large stress test for Heathrow airport, at least T3 and T5. Testing things with a bunch of volunteers, as they did before T2 opened, is one thing; many thousands of passengers stranded with luggage in potentially complex travel circumstances is quite another.

In these situations, there are a few things most passengers are looking for:
- Somewhere to be comfortable (standing around for hours even for young people is unpleasant)
- Plenty of food and drink
- Abundant power outlets for portable devices
- Fast wifi (not everyone would have cellular service and even then the mobile networks can be overwhelmed)
- Roomy bathroom facilities

While this is and will be an expensive lesson for BA, I hope that Heathrow airport will learn a thing or two as well.
techie is offline  
Old May 27, 2017, 4:12 pm
  #69  
FlyerTalk Evangelist
 
Join Date: Oct 2006
Location: London, UK
Programs: BA Gold, SQ Gold, KQ Platinum, IHG Diamond Ambassador, Hilton Gold, Marriott Silver, Accor Silver
Posts: 16,348
Originally Posted by techie
While this is and will be an expensive lesson for BA, I hope that Heathrow airport will learn a thing or two as well.
HAL have head office staff on standby to assist in the terminals when things go wrong like today, as well as the flexibility to deploy additional operational staff - both of these things happened today and presumably will continue for the next day or two. HAL have contingency stocks containing items such as water and snack bars which helps to a degree when passengers are queuing, as well as the ability to hand out refreshment vouchers.
Genius1 is offline  
Old May 27, 2017, 4:24 pm
  #70  
FlyerTalk Evangelist
 
Join Date: Aug 2002
Location: London
Programs: Mucci. Nothing else matters.
Posts: 38,644
Originally Posted by Lewis King
Seeing a few flights scheduled to depart at 1-2am. Guessing Heathrow have allowed BA to break the curfew?
It seems that that's something that's fairly readily permitted when there's a big problem.
Globaliser is offline  
Old May 27, 2017, 4:55 pm
  #71  
 
Join Date: Feb 2008
Location: SFO
Programs: Free agent, UA 1K, Bonvoy Gold
Posts: 353
Originally Posted by flatlander
There have been a number of declarations of how it should be done better with independent data center locations and so on (many in the main thread about this event). Those points are valid, but orchestrating the failover or load distribution between the datacenters is complex. Knowing what services are provided from where, avoiding split brain situations (ensuring that authoritative data sources are in one place or the other but not both) and so on is not easy.

In particular, while you can have two isolated copies of a system, two sets of services, etc, in each datacenter, the system that orchestrates which is running at any time has to have knowledge and connections in both datacenters. The fail-over mechanisms are not simple twins where if one dies, the other lives, but siamese twins, connected to both datacenters and subject to problems in either datacenter - if one dies, the other can get sick.

Comments so far indicate that BA has several locations, and uses a private cloud infrastructure. That private cloud will need a cloud resource manager, job scheduler, service availability manager, etc, and that cannot be isolated to one datacenter or another. If an unexpected failure occurred in these cloud management systems then the entire cloud can stop working. This sort of failure has been the root of several Amazon Web Services outages, as well as other well-known outages, and the possibility of it happening is impossible to remove because the resource manager must have a connection with all datacenters, and so a vulnerability in all datacenters, to function.
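
To make the point about shared coordination concrete, here is a minimal Python sketch. It is not a description of BA's actual setup: the `LeaseCoordinator` class, the `Datacenter` class and the site names `dc-a` and `dc-b` are all hypothetical, invented for illustration. Only the site holding an unexpired lease may accept writes, which avoids split brain, but because both sites depend on the same coordinator, losing the coordinator stops writes everywhere.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LeaseCoordinator:
    """Hypothetical coordinator that every datacenter must be able to reach."""
    lease_seconds: float = 10.0
    available: bool = True
    holder: Optional[str] = None
    expires_at: float = 0.0

    def acquire_or_renew(self, site: str, now: float) -> bool:
        """Grant the lease if it is free, expired, or already held by `site`."""
        if not self.available:
            raise ConnectionError("coordinator unreachable")
        if self.holder in (None, site) or now >= self.expires_at:
            self.holder = site
            self.expires_at = now + self.lease_seconds
            return True
        return False


@dataclass
class Datacenter:
    name: str
    lease_valid_until: float = 0.0

    def heartbeat(self, coordinator: LeaseCoordinator, now: float) -> None:
        try:
            if coordinator.acquire_or_renew(self.name, now):
                self.lease_valid_until = now + coordinator.lease_seconds
        except ConnectionError:
            # Never extend the lease locally: that is how split brain starts.
            pass

    def may_accept_writes(self, now: float) -> bool:
        return now < self.lease_valid_until


def demo() -> list:
    coord = LeaseCoordinator()
    a, b = Datacenter("dc-a"), Datacenter("dc-b")
    states = []

    # Normal operation: dc-a wins the lease, dc-b stays passive.
    a.heartbeat(coord, 0)
    b.heartbeat(coord, 0)
    states.append((a.may_accept_writes(1), b.may_accept_writes(1)))

    # dc-a dies and stops heartbeating; dc-b takes over once the lease expires.
    b.heartbeat(coord, 10)
    states.append((a.may_accept_writes(11), b.may_accept_writes(11)))

    # The coordinator fails: the healthy site cannot renew and goes quiet too.
    coord.available = False
    b.heartbeat(coord, 15)
    b.heartbeat(coord, 25)
    states.append((a.may_accept_writes(26), b.may_accept_writes(26)))
    return states


if __name__ == "__main__":
    for writes_a, writes_b in demo():
        print(f"dc-a writes: {writes_a}, dc-b writes: {writes_b}")
```

The sketch passes timestamps in by hand and ignores clock skew, network partitions and coordinator replication, all of which a real system would have to handle. Its only purpose is to show why the component that prevents split brain is itself a dependency shared by every datacenter.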

In summary, I am afraid I am tempted, on a bad day, to answer those blunt comments in the other thread with "I think you'll find it's a bit more complicated than that."

Meanwhile, obviously such an outage should be quicker to repair in BA, and this is where most of the learnings will lie: questions of how to restart systems, investigation into previously unknown dependencies, reducing complexity to improve separability and therefore reliability in case of partial outages, and so on.
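
As a rough illustration of the restart question, the sketch below derives a restart order from a dependency map using Python's standard `graphlib` module (Python 3.9+). The service names and their dependencies are entirely hypothetical and say nothing about BA's real systems; the point is only that once dependencies are written down, the restart order and any circular dependency can be computed rather than discovered during an outage.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical services; each key maps to the services it needs running first.
DEPENDENCIES = {
    "network": set(),
    "storage": {"network"},
    "database": {"storage"},
    "message-bus": {"network"},
    "booking": {"database", "message-bus"},
    "check-in": {"booking"},
}


def restart_waves(deps: dict) -> list:
    """Return groups of services, in order, where each group can restart in parallel."""
    sorter = TopologicalSorter(deps)
    sorter.prepare()  # raises CycleError if the dependencies are circular
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)
    return waves


if __name__ == "__main__":
    try:
        for number, wave in enumerate(restart_waves(DEPENDENCIES), start=1):
            print(f"wave {number}: {', '.join(wave)}")
    except CycleError as err:
        print(f"circular dependency, no safe restart order: {err.args[1]}")
```

A previously unknown dependency shows up here as a missing edge in the map, which is exactly the kind of thing a postmortem would be expected to find and record.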

Additionally, those who call for "heads to roll" and for people to be sacked simply because it happened are also calling for the wrong thing and acting in a way that will not improve reliability. Personal vengeance rarely makes complex systems any better.

This sort of incident needs to be investigated in the manner of a flight operations incident, so analysis of what happened, direct causes and contributing factors, and analysis of how to stop these same causes acting again to cause another incident. Taking a first action of deploying either the HR disciplinary procedure (on internal staff) or lawyers with damages briefs (on external suppliers) will achieve little for better reliability. I encourage anyone who investigates this sort of IT failure, or wants to know how to do it better, to read AAIB accident reports and books on Just Culture in technical operations.

First discover what went wrong, why it went wrong, and then if, and only if, there was gross negligence or malfeasance do you fire or sue people.
This will be one epic postmortem. I agree with your premise that an investigation needs to happen before heads roll, but I disagree with your fundamental argument that building a stable, resilient architecture is difficult. It's 2017, for goodness' sake; even for a company like IAG where IT isn't its core competency, it's relatively straightforward to design, develop and implement a robust infrastructure.

Power loss, disk failures, fiber cuts, hell, even the complete loss of availability of major services in multiple regions, are all expected to happen (and will eventually happen...), and should be appropriately planned for. It boggles my mind that 12+ hours into this, BA still hasn't recovered its systems.
bioyuki is offline  
Old May 27, 2017, 4:57 pm
  #72  
 
Join Date: Mar 2013
Location: US of A
Programs: Delta Diamond, United 1K, BA Blue, Marriott Titanium, Hilton Gold, Amex Platinum
Posts: 1,775
Originally Posted by Genius1
HAL have head office staff on standby to assist in the terminals when things go wrong like today, as well as the flexibility to deploy additional operational staff - both of these things happened today and presumably will continue for the next day or two. HAL have contingency stocks containing items such as water and snack bars which helps to a degree when passengers are queuing, as well as the ability to hand out refreshment vouchers.
In situations where hundreds of people are queuing land-side and thousands of people are air-side, the distribution of supplies needs to be done with military precision and speed. BA have extensive catering available to them, with items that would have been loaded on planes if those planes were to fly.

Lots of empty spaces land-side are great if you want to keep pax flowing from one area to another, but they become a hindrance when said pax become bunched up like bees in a hive.
techie is offline  
Old May 27, 2017, 5:04 pm
  #73  
 
Join Date: Apr 2014
Location: London
Programs: Don't even mention it. Grrrrrrr.
Posts: 968
Originally Posted by bioyuki
This will be one epic postmortem. I agree with your premise that an investigation needs to happen before heads roll, but I disagree with your fundamental argument that building a stable, resilient architecture is difficult. It's 2017 for goodness sake;
If you started in 2017 then yes it's easier. Try walking into a company with legacy systems dating from the 1960s (which most large companies do); add in a few systems acquired as part of a takeover/merger; a few rationalisations of suppliers; a few upgrades here and there of operating systems; a decision based on a dogmatic CIO to change database suppliers.

As has been said: "I think you'll find that it's a bit more complicated than that".

As the joke goes.
Driver stopping for directions: "Which is the way to <my destination> please?"
Pedestrian: "Well, I wouldn't start from here if I were you"
Banana4321 is offline  
Old May 27, 2017, 5:18 pm
  #74  
 
Join Date: Feb 2008
Location: SFO
Programs: Free agent, UA 1K, Bonvoy Gold
Posts: 353
Originally Posted by Banana4321
If you started in 2017 then yes it's easier. Try walking into a company with legacy systems dating from the 1960s (which most large companies do); add in a few systems acquired as part of a takeover/merger; a few rationalisations of suppliers; a few upgrades here and there of operating systems; a decision based on a dogmatic CIO to change database suppliers.

As has been said: "I think you'll find that it's a bit more complicated than that".

As the joke goes.
Driver stopping for directions: "Which is the way to <my destination> please?"
Pedestrian: "Well, I wouldn't start from here if I were you"
Still not an excuse... I don't work in the airline space, but I've worked on plenty of old and crufty systems, acquired or homegrown, and while it's significantly harder, that doesn't mean it can't be done. From an outsider's perspective, it seems like Southwest has done a good job over the last five years of modernizing systems that date back to its inception in the 1960s.
bioyuki is offline  
Old May 27, 2017, 5:29 pm
  #75  
 
Join Date: Mar 2013
Location: US of A
Programs: Delta Diamond, United 1K, BA Blue, Marriott Titanium, Hilton Gold, Amex Platinum
Posts: 1,775
Originally Posted by Banana4321
If you started in 2017 then yes it's easier. Try walking into a company with legacy systems dating from the 1960s (which most large companies do); add in a few systems acquired as part of a takeover/merger; a few rationalisations of suppliers; a few upgrades here and there of operating systems; a decision based on a dogmatic CIO to change database suppliers.

As has been said: "I think you'll find that it's a bit more complicated than that".

As the joke goes.
Driver stopping for directions: "Which is the way to <my destination> please?"
Pedestrian: "Well, I wouldn't start from here if I were you"
Zero sympathy. With the right levels of investment, the commitment, the right people and the right processes, it can be done. It will not be done in a month or a year, but it can be gradually transformed and be brought into the 21st century. But BA has never struck me as an airline that is a progressive thinker.
techie is offline  

