27 May BA IT outage miscellaneous discussions thread
#16
Moderator: British Airways Executive Club, Iberia Airlines, Airport Lounges and Environmentally Friendly Travel
Join Date: Jan 2003
Location: London, UK
Posts: 22,210
What’s mind-blowing is the combination of multiple IT systems going down together: sales, check-in, dispatch, crew roster et al.
#17
Join Date: Aug 2009
Posts: 461
Sounds like, if there was a cyber attack, it was targeting the power-distribution SCADA systems. Thoughts?
#18
Join Date: Dec 2014
Programs: BAEC (although I might just cut up the card)
Posts: 338
Maybe it will suddenly be a cyber attack.
#19
FlyerTalk Evangelist
Join Date: Mar 2010
Location: JER
Programs: BA Gold/OWE, several MUCCI, and assorted Pensions!
Posts: 32,140
It will be interesting to see how this disaster pans out for those affected. I'm hugely grateful I have 2 weeks to go before being exposed to Cruz's BA, which I hope may be functional by then.
Damn, I have 3 BA bookings in MMB, plus another one connecting to AA. I have really lost faith in the operation.
#20
Join Date: Dec 2014
Programs: BAEC (although I might just cut up the card)
Posts: 338
Not really surprising: all those systems require authentication, so it's a single point of failure. However, that should be something that is very distributed.
#21
Join Date: Dec 2016
Programs: BA Gold
Posts: 487
The lawyers are going to have a field day with this. This is their dream come true: an unanticipated event which could be argued in any direction in court. Let's hope it doesn't come to that and BA do the sensible thing and just compensate everyone affected. The lawyers earn enough as it is.
#22
Join Date: Apr 2014
Location: London
Programs: Don't even mention it. Grrrrrrr.
Posts: 968
Authentication
Addressing (e.g. DNS)
Routing
Any issue in the above has the potential to be catastrophic.
As for moving to a disaster-recovery environment, one established method of activating a failover is a push to DNS of new IP addresses (the IP addresses of the back-up services). If that doesn't work (e.g. no DNS server(s)), then you are in a world of hurt.
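To make that concrete, here is a rough sketch of what such a "push to DNS" could look like as a dynamic update (RFC 2136) using Python's dnspython library. The zone name, TSIG key, server address and IP addresses are all invented for illustration; a real failover runbook would differ.

```python
import dns.query
import dns.rcode
import dns.tsigkeyring
import dns.update

# All names, keys and addresses below are hypothetical.
keyring = dns.tsigkeyring.from_text({"failover-key.": "c2VjcmV0LWtleS1tYXRlcmlhbA=="})

# Repoint the check-in service at the back-up (DR) site's address.
update = dns.update.Update("example-airline.test", keyring=keyring)
update.replace("checkin", 60, "A", "198.51.100.20")  # short TTL so clients move over quickly

# Send the dynamic update to the authoritative server.
response = dns.query.tcp(update, "203.0.113.53", timeout=10)
print("update rcode:", dns.rcode.to_text(response.rcode()))
```

And, as above, this presupposes the authoritative DNS servers are still up and reachable; if they have gone down with everything else, pushing new records gets you nowhere.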
Last edited by Banana4321; May 27, 2017 at 1:32 pm
#23
Join Date: Aug 2015
Location: Somewhere around Europe...
Programs: BA Gold; MB Ti; HH Diamond; IHG Plat; RR Gold
Posts: 530
http://www.datacenterknowledge.com/archives/2011/08/07/lightning-in-dublin-knocks-amazon-microsoft-data-centers-offline/
Speaking professionally, this in no way excuses not having a functional backup or a well-drilled recovery plan. But then I can count on one hand the number of customers I have come across that are actually that well prepared, either.
Authentication
Addressing (e.g. DNS)
Routing
Any issue in the above has the potential to be catastrophic.
As for moving to a disaster-recovery environment, one established method of activating a failover is a push of new DNS addresses. If that doesn't work (e.g. no DNS server(s)), then you are in a world of hurt.
#24
Join Date: Dec 2014
Programs: BAEC (although I might just cut up the card)
Posts: 338
Routing: internet routing is actually designed to safeguard against this. All of the main routing protocols should handle it.
DNS is a no-brainer; DNS servers should ALWAYS be geographically distributed.
For routing, it's best practice to ensure that you don't have a single point of routing failure (a single hub with multiple spokes). When you have a worldwide network running your entire airline operation, it would be terrible to think they'd done that.
So none of these are acceptable types of failure either.
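As a trivial illustration of the "geographically distributed DNS" point, here is a small sketch using Python's dnspython (2.x) that looks up a domain's name servers and counts how many distinct networks their addresses fall in. The domain is made up, and grouping by /16 is only a crude stand-in for real geographic/topological diversity.

```python
import dns.resolver

domain = "example-airline.test"  # hypothetical domain

# Find the authoritative name servers for the domain.
nameservers = [ns.target.to_text() for ns in dns.resolver.resolve(domain, "NS")]

# Group their A records into crude /16 "networks".
networks = set()
for ns_name in nameservers:
    for answer in dns.resolver.resolve(ns_name, "A"):
        octets = answer.to_text().split(".")
        networks.add(".".join(octets[:2]))

print(f"{len(nameservers)} name servers across {len(networks)} distinct /16 networks")
if len(networks) < 2:
    print("Warning: every name server sits in the same network - a single point of failure")
```

Run against a real zone it gives a very rough first answer to "is our DNS actually distributed?" - nothing more.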
EDIT: Mind you, we've not heard the real full version and cause. So it's all hypothetical right now. Nothing wrong with a bit of speculation though..
#25
Moderator: British Airways Executive Club, Iberia Airlines, Airport Lounges and Environmentally Friendly Travel
Join Date: Jan 2003
Location: London, UK
Posts: 22,210
Thank you r00ty, Banana4321, and dakaix for educating this non-tech-savvy individual about complex IT systems in such understandable terms
#26
Join Date: Oct 2016
Posts: 698
Glad this miscellaneous discussion thread is here.
I remember last year with the gradual FLY introduction that most outages seemed to occur on weekends. It's interesting that this one also happened on a Saturday.
Looking for info here, but on this weekend last year, wasn't there an outage of FLY at Gatwick? I could be wrong but I seem to remember long queues on a bank holiday weekend.
If IT people are mainly on Mon-Fri contracts this seems wrong for a global airline.
#27
Join Date: Apr 2014
Location: London
Programs: Don't even mention it. Grrrrrrr.
Posts: 968
DNS - distributed. I have my own network that I operate for fun, and my DNS is distributed.
Routing: internet routing is actually designed to safeguard against this. All of the main routing protocols should handle it.
DNS is a no-brainer; DNS servers should ALWAYS be geographically distributed.
For routing, it's best practice to ensure that you don't have a single point of routing failure (a single hub with multiple spokes). When you have a worldwide network running your entire airline operation, it would be terrible to think they'd done that.
So none of these are acceptable types of failure either.
EDIT: Mind you, we've not heard the real full version and cause. So it's all hypothetical right now. Nothing wrong with a bit of speculation though..
If you think that BA is unique in being caught out by this then you are making an ill-informed judgment, although theirs is a large and public failure - which is most unfortunate for them. I have been closely involved in the annual failover "rehearsals" for organisations with a more critical service to the public than an airline. None of them were 100% successful. There was always a list of things to fix. The next year, due to the changes we had made over the course of that year, there was a list of different things to fix. And we didn't test firing a lightning bolt at the power supply (say); we did only what was necessary to send a letter to the regulatory body saying "we did that and passed". Also, sometimes when we "flicked the switch" something unexpected and bad happened; in those cases we flicked the switch back and tried again a few weeks later.
And some things are just too dangerous to test, unless you want to stop flying planes, stop people withdrawing cash, or stop paying salaries whilst you perform the testing. Failover, and testing it, in organisations with 24x7 operations is... hard.
That is the reality.
Last edited by Banana4321; May 27, 2017 at 1:53 pm
#28
Join Date: Oct 2015
Location: next to HAM
Programs: LH M+M
Posts: 960
Posted this in the first thread already, but it got buried. Here's a description of what happened at Delta last year, leading to the cancellation of some 1,000-ish flights and a huge financial loss.
http://perspectives.mvdirona.com/201...ts-arent-rare/
(and just think of what happened to AWS S3 last month..)
So all the 'rage' over incompetent decisions, outsourcing and so on: it might be that - but it could be something different, too.
#29
Join Date: Aug 2015
Location: Somewhere around Europe...
Programs: BA Gold; MB Ti; HH Diamond; IHG Plat; RR Gold
Posts: 530
If you think that BA is unique in being caught out by this then you are making an ill-informed judgment, although theirs is a large and public failure - which is most unfortunate for them. I have been closely involved in the annual failover "rehearsals" for organisations with a more critical service to the public than an airline. None of them were 100% successful. There was always a list of things to fix. The next year, due to the changes we had made over the course of that year, there was a list of different things to fix. And we didn't test firing a lightning bolt at the power supply (say); we did only what was necessary to send a letter to the regulatory body saying "we did that and passed". Also, sometimes when we "flicked the switch" something unexpected and bad happened; in those cases we flicked the switch back and tried again a few weeks later.
That is the reality.
Sure you can test an application in isolation, maybe even one rack, but until you practically verify a complete data centre or geographical outage, there's simply no way to know for certain. Will routing work if the primary site is off? Are we sure all the fibres connecting the sites are diverse (so someone in a JCB doesn't knock out all our connectivity in one stroke)? If a site is entirely cratered, how do we go about recovering? Would we have any staff left to do it?
The fact is, very few companies would pass all of those criteria. At some point someone says "well, is this really worth the cost?", because it's not tangible to them.
The first company I worked for was a data centre operator; they had a bi-weekly practice of performing "black building testing". As the name suggests, they would literally disconnect the power supply to prove that everything worked: generators, UPS, failover etc. Theory is all fine, but it means nothing unless you test it. Regularly.
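In the same "test it, regularly" spirit, here is a very small Python sketch of the sort of scheduled probe you might run so a standby site can't quietly rot. The hostnames and ports are hypothetical, and a real drill would obviously go far beyond a bare TCP connect.

```python
import socket
import sys

# Hypothetical primary and standby endpoints; both should always answer.
SITES = {
    "primary": ("checkin.example-airline.test", 443),
    "dr-standby": ("checkin-dr.example-airline.test", 443),
}

failed = []
for name, (host, port) in SITES.items():
    try:
        # A plain TCP connect: the weakest possible "is it alive?" check.
        with socket.create_connection((host, port), timeout=5):
            print(f"{name}: reachable")
    except OSError as exc:
        failed.append(name)
        print(f"{name}: UNREACHABLE ({exc})")

# Non-zero exit so a scheduler or monitoring agent can raise an alert.
sys.exit(1 if failed else 0)
```

Scheduled from cron or a monitoring system, even something this simple catches the "DR site has been dead for months and nobody noticed" failure mode.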
Last edited by dakaix; May 27, 2017 at 1:55 pm
#30
Join Date: Oct 2015
Location: next to HAM
Programs: LH M+M
Posts: 960
The airline said in response that it would “never compromise the integrity and security of our IT systems,” adding that outsourcing of IT services was a “very common practice across all industries.”
Yep - and companies all across those industries have had their 'disruptive' experiences with it.