
27 May BA IT outage miscellaneous discussions thread

Old May 27, 2017, 1:08 pm
  #16  
Moderator: British Airways Executive Club, Iberia Airlines, Airport Lounges and Environmentally Friendly Travel
 
Join Date: Jan 2003
Location: London, UK
Posts: 22,210
Originally Posted by Mapman
He may well have said this. I am not saying that this is not the case, nor that a cyber attack is the reason. I am simply passing on what colleagues working in critical national infrastructure have told me this afternoon. Time will tell.
What's mind-blowing is the combination of multiple IT systems going down together: sales, check-in, dispatch, crew roster and so on.
Prospero is offline  
Old May 27, 2017, 1:11 pm
  #17  
 
Join Date: Aug 2009
Posts: 461
Originally Posted by Mapman
He may well have said this. I am not saying that this is not the case, nor that a cyber attack is the reason. I am simply passing on what colleagues working in critical national infrastructure have told me this afternoon. Time will tell.
Sounds like, if there was a cyber attack, it was targeting power-distribution SCADA systems. Thoughts?
BATLV is offline  
Old May 27, 2017, 1:12 pm
  #18  
 
Join Date: Dec 2014
Programs: BAEC (although I might just cut up the card)
Posts: 338
Originally Posted by T8191
Power failure = EU261 for inadequate provision of backup. No?

Cyber attack = get out of jail relatively free. Yes?

Watching with interest, as I'm not flying BA for a couple of weeks. But I become less enamoured with BA every month. So sad.
This was my thought when they pretty much refused to confirm those rumours early on.

Maybe it will suddenly be a cyber attack.
r00ty is offline  
Old May 27, 2017, 1:12 pm
  #19  
FlyerTalk Evangelist
 
Join Date: Mar 2010
Location: JER
Programs: BA Gold/OWE, several MUCCI, and assorted Pensions!
Posts: 32,140
It will be interesting to see how this disaster pans out for those affected. I'm hugely grateful I have 2 weeks to go before being exposed to Cruz's BA, which I hope may be functional by then.

Damn, I have 3 BA bookings in MMB, plus another one connecting to AA. I have really lost faith in the operation.
T8191 is offline  
Old May 27, 2017, 1:13 pm
  #20  
 
Join Date: Dec 2014
Programs: BAEC (although I might just cut up the card)
Posts: 338
Originally Posted by Prospero
What's mind-blowing is the combination of multiple IT systems going down together: sales, check-in, dispatch, crew roster and so on.
Not really surprising: all of those systems require authentication, so that's a single point of failure. However, authentication should be something that is highly distributed.
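To illustrate what "distributed" could look like from the client side, here's a minimal Python sketch - the hostnames and token API are entirely made up for illustration, not anything BA actually runs:

Code:
# Minimal sketch: client-side failover across replicated auth endpoints.
# Hostnames and the token API are hypothetical.
import requests

AUTH_ENDPOINTS = [
    "https://auth-eu-west.example.com/token",
    "https://auth-eu-central.example.com/token",
    "https://auth-us-east.example.com/token",
]

def get_token(username: str, password: str) -> str:
    last_error = None
    for url in AUTH_ENDPOINTS:
        try:
            resp = requests.post(
                url,
                data={"username": username, "password": password},
                timeout=3,
            )
            resp.raise_for_status()
            return resp.json()["access_token"]
        except requests.RequestException as exc:
            last_error = exc  # this replica is down or slow; try the next one
    raise RuntimeError(f"all auth endpoints unreachable: {last_error}")

With replicas in different locations, losing one site costs you a timeout, not the whole login flow.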
r00ty is offline  
Old May 27, 2017, 1:14 pm
  #21  
 
Join Date: Dec 2016
Programs: BA Gold
Posts: 487
Originally Posted by T8191
Power failure = EU261 for inadequate provision of backup. No?

Cyber attack = get out of jail relatively free. Yes?

Watching with interest, as I'm not flying BA for a couple of weeks. But I become less enamoured with BA every month. So sad.
The lawyers are going to have a field day with this. This is their dream come true: an unanticipated event that could be argued in any direction in court. Let's hope it doesn't come to that and BA do the sensible thing and just compensate everyone affected. The lawyers earn enough as it is.
doctoravios is offline  
Old May 27, 2017, 1:17 pm
  #22  
 
Join Date: Apr 2014
Location: London
Programs: Don't even mention it. Grrrrrrr.
Posts: 968
Originally Posted by r00ty
Not really surprising: all of those systems require authentication, so that's a single point of failure. However, authentication should be something that is highly distributed.
Authentication
Addressing (e.g. DNS)
Routing

Any issue in the above has the potential to be catastrophic.

As for moving to a disaster-recovery environment, one established method of activating a failover is a push to DNS of new IP addresses (the IP addresses of the back-up services). If that doesn't work (e.g. no DNS server(s)), then you are in a world of hurt.
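A rough sketch of what that DNS push can look like, using dnspython-style dynamic updates (RFC 2136) - the zone, names and addresses here are invented, and this is not a claim about BA's actual tooling:

Code:
# Sketch: activate failover by repointing a service name at the DR site's IP.
# Zone, hostnames and addresses are hypothetical.
import dns.query
import dns.update

def fail_over_to_dr(zone: str, name: str, dr_ip: str, dns_server: str) -> None:
    update = dns.update.Update(zone)
    update.delete(name, "A")          # remove the primary site's address
    update.add(name, 60, "A", dr_ip)  # short TTL so clients repoint quickly
    dns.query.tcp(update, dns_server, timeout=5)

# e.g. fail_over_to_dr("example.com", "checkin", "203.0.113.10", "198.51.100.53")

And that last step is exactly where it falls apart if the authoritative DNS servers themselves are down: the update has nowhere to go.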

Last edited by Banana4321; May 27, 2017 at 1:32 pm
Banana4321 is offline  
Old May 27, 2017, 1:20 pm
  #23  
 
Join Date: Aug 2015
Location: Somewhere around Europe...
Programs: BA Gold; MB Ti; HH Diamond; IHG Plat; RR Gold
Posts: 530
Originally Posted by Prospero
What's mind-blowing is the combination of multiple IT systems going down together: sales, check-in, dispatch, crew roster and so on.
Precisely why a power failure and a failed DR attempt (we assume) make much more sense. I would highlight an eerily similar case at the Microsoft and Amazon data centres in Dublin back in 2011.

http://www.datacenterknowledge.com/archives/2011/08/07/lightning-in-dublin-knocks-amazon-microsoft-data-centers-offline/

Speaking professionally, this in no way excuses not having a functional backup or a well-drilled recovery plan. But then I can count on one hand the number of customers I have come across who are actually that well prepared.

Originally Posted by Banana4321
Authentication
Addressing (e.g. DNS)
Routing

Any issue in the above has the potential to be catastrophic.

As for moving to a disaster-recovery environment, one established method of activating a failover is a push to DNS of new IP addresses. If that doesn't work (e.g. no DNS server(s)), then you are in a world of hurt.
I'd also throw "Web Filtering" in there: if their user web proxy services are down, then nobody would have external access either.
dakaix is offline  
Old May 27, 2017, 1:22 pm
  #24  
 
Join Date: Dec 2014
Programs: BAEC (although I might just cut up the card)
Posts: 338
Originally Posted by Banana4321
Authentication
Addressing (e.g. DNS)
Routing

Any issue in the above has the potential to be catastrophic.
DNS: distributed. I have my own network, which I operate for fun, and even my DNS is distributed.
Routing: internet routing is actually designed to safeguard against exactly this, and all of the main routing protocols should handle it.

DNS is a no-brainer; name servers should ALWAYS be geographically distributed.

For routing, it's best practice to ensure you don't have a single point of routing failure (a single hub with multiple spokes). When you have a worldwide network running your entire airline operation, it would be terrible to think they had built it that way.

So none of these would be acceptable failures either.

EDIT: Mind you, we've not heard the full story or the real cause, so it's all hypothetical right now. Nothing wrong with a bit of speculation though.
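If you want to eyeball how spread out a zone's name servers actually are, a few lines of dnspython will list them - the domain below is just a placeholder, and you'd still need a geo-IP lookup on top to see where those addresses physically sit:

Code:
# List a zone's NS records and the addresses they resolve to,
# as a quick check of name-server redundancy. Domain is a placeholder.
import dns.resolver

def list_nameservers(zone: str) -> None:
    for ns in dns.resolver.resolve(zone, "NS"):
        host = str(ns.target)
        addrs = [str(a) for a in dns.resolver.resolve(host, "A")]
        print(f"{host}: {', '.join(addrs)}")

list_nameservers("example.com")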
r00ty is offline  
Old May 27, 2017, 1:25 pm
  #25  
Moderator: British Airways Executive Club, Iberia Airlines, Airport Lounges and Environmentally Friendly Travel
 
Join Date: Jan 2003
Location: London, UK
Posts: 22,210
Thank you r00ty, Banana4321, and dakaix for educating this non-tech-savvy individual about complex IT systems in such understandable terms.
Prospero is offline  
Old May 27, 2017, 1:29 pm
  #26  
 
Join Date: Oct 2016
Posts: 698
Glad this miscellaneous discussion thread is here.

I remember that during the gradual FLY introduction last year, most outages seemed to occur at weekends. It's interesting that this one also happened on a Saturday.

Looking for info here, but wasn't there an outage of FLY at Gatwick on this weekend last year? I could be wrong, but I seem to remember long queues on a bank holiday weekend.

If IT staff are mainly on Mon-Fri contracts, that seems wrong for a global airline.
MarkFlies is offline  
Old May 27, 2017, 1:38 pm
  #27  
 
Join Date: Apr 2014
Location: London
Programs: Don't even mention it. Grrrrrrr.
Posts: 968
Originally Posted by r00ty
DNS: distributed. I have my own network, which I operate for fun, and even my DNS is distributed.
Routing: internet routing is actually designed to safeguard against exactly this, and all of the main routing protocols should handle it.

DNS is a no-brainer; name servers should ALWAYS be geographically distributed.

For routing, it's best practice to ensure you don't have a single point of routing failure (a single hub with multiple spokes). When you have a worldwide network running your entire airline operation, it would be terrible to think they had built it that way.

So none of these would be acceptable failures either.

EDIT: Mind you, we've not heard the full story or the real cause, so it's all hypothetical right now. Nothing wrong with a bit of speculation though.
DNS is distributed across the internet, but not all names are in all DNS servers; BA won't have the addresses of their internal systems in Virgin's DNS servers. And this kind of glib comment is all very fine in theory, but practical engineering in a complex environment that is designed to serve millions of people, be secure, and cope with edge-case tail events (i.e. catastrophic events in the many forms they may come) is... complicated. Extremely complicated.

If you think BA is unique in being caught out by this, then you are making an ill-informed judgment, although their failure is large and public - which is most unfortunate for them. I have been closely involved in the annual failover "rehearsals" for organisations that provide a more critical service to the public than an airline. None of them were 100% successful. There was always a list of things to fix. The next year, due to the changes we had made over the course of that year, there was a list of different things to fix. And we didn't test firing a lightning bolt at the power supply (say); we did only what was necessary to send a letter to the regulatory body saying "we did that and passed". Also, sometimes when we "flicked the switch" something unexpected and bad happened; in those cases we flicked the switch back and tried again a few weeks later.

And some things are just too dangerous to test, unless you want to stop flying planes, stop people withdrawing cash, or stop paying salaries while you perform the testing. Failover, and testing it, in organisations with 24x7 operations is... hard.

That is the reality.

Last edited by Banana4321; May 27, 2017 at 1:53 pm
Banana4321 is offline  
Old May 27, 2017, 1:41 pm
  #28  
 
Join Date: Oct 2015
Location: next to HAM
Programs: LH M+M
Posts: 960
I posted this in the first thread already, but it got buried. Here's a description of what happened at Delta last year, leading to the cancellation of some 1,000-ish flights and a huge financial loss.

http://perspectives.mvdirona.com/201...ts-arent-rare/

(and just think of what happened to AWS S3 last month...)

So as for all the 'rage' over incompetent decision-making, outsourcing and so on: it might be that - but it could be something different, too.
PAX_fips is offline  
Old May 27, 2017, 1:50 pm
  #29  
 
Join Date: Aug 2015
Location: Somewhere around Europe...
Programs: BA Gold; MB Ti; HH Diamond; IHG Plat; RR Gold
Posts: 530
Originally Posted by Banana4321
If you think BA is unique in being caught out by this, then you are making an ill-informed judgment, although their failure is large and public - which is most unfortunate for them. I have been closely involved in the annual failover "rehearsals" for organisations that provide a more critical service to the public than an airline. None of them were 100% successful. There was always a list of things to fix. The next year, due to the changes we had made over the course of that year, there was a list of different things to fix. And we didn't test firing a lightning bolt at the power supply (say); we did only what was necessary to send a letter to the regulatory body saying "we did that and passed". Also, sometimes when we "flicked the switch" something unexpected and bad happened; in those cases we flicked the switch back and tried again a few weeks later.

That is the reality.
Glad I'm not the only one to have had that experience. The reality is that while many companies try (read: feign) to prepare for DR, few have a documented plan, and fewer still actually perform any form of testing, let alone regular testing.

Sure, you can test an application in isolation, maybe even one rack, but until you practically verify a complete data centre or geographical outage, there's simply no way to know for certain. Will routing work if the primary site is off? Are we sure all the fibres connecting the sites are diverse (so someone in a JCB doesn't knock out all our connectivity in one stroke)? If a site is entirely cratered, how do we go about recovering? Would we have any staff left to do it?

The fact is, very few companies would pass all of those criteria. At some point someone asks "well, is this really worth the cost?", because the risk isn't tangible to them.

The first company I worked for was a data centre operator; they had a bi-weekly practice of performing "black building testing". As the name suggests, they would literally disconnect the power supply to prove that everything worked: generators, UPS, failover, etc. Theory is all fine, but it means nothing unless you test it. Regularly.

Last edited by dakaix; May 27, 2017 at 1:55 pm
dakaix is offline  
Old May 27, 2017, 1:53 pm
  #30  
 
Join Date: Oct 2015
Location: next to HAM
Programs: LH M+M
Posts: 960
The airline said in response that it would “never compromise the integrity and security of our IT systems,” adding that outsourcing of IT services was a “very common practice across all industries.”

Yep - and companies across all of those industries have had their 'disruptive' experiences with it.
PAX_fips is offline  

