27 May BA IT outage miscellaneous discussions thread
#16
Moderator: British Airways Executive Club, Iberia Airlines, Airport Lounges and Environmentally Friendly Travel
Join Date: Jan 2003
Location: London, UK
Posts: 22,210
What’s mind-blowing is the combination of multiple IT systems going down together: sales, check-in, dispatch, crew roster et al.
#17
Join Date: Aug 2009
Posts: 461
Sounds like, if there was a cyber attack, it was targeting the power-distribution SCADA systems. Thoughts?
#18
Join Date: Dec 2014
Programs: BAEC (although I might just cut up the card)
Posts: 338
Maybe it will suddenly be a cyber attack.
#19
FlyerTalk Evangelist
Join Date: Mar 2010
Location: JER
Programs: BA Gold/OWE, several MUCCI, and assorted Pensions!
Posts: 32,140
It will be interesting to see how this disaster pans out for those affected. I'm hugely grateful I have 2 weeks to go before being exposed to Cruz's BA, which I hope may be functional by then.
Damn, I have 3 BA bookings in MMB, plus another one connecting to AA. I have really lost faith in the operation.
#20
Join Date: Dec 2014
Programs: BAEC (although I might just cut up the card)
Posts: 338
Not really surprising: all those systems require authentication, so it's a single point of failure. However, that should be something that is very distributed.
#21
Join Date: Dec 2016
Programs: BA Gold
Posts: 487
The lawyers are going to have a field day with this. This is their dream come true: an unanticipated event which could be argued in any direction in court. Let's hope it doesn't come to that and BA do the sensible thing and just compensate everyone affected. The lawyers earn enough as it is.
#22
Join Date: Apr 2014
Location: London
Programs: Don't even mention it. Grrrrrrr.
Posts: 968
Authentication
Addressing (e.g. DNS)
Routing
Any issue in the above has the potential to be catastrophic.
As for moving to a disaster-recovery environment, one established method of activating a failover is a push to DNS of new IP addresses (the IP addresses of the back-up services). If that doesn't work (e.g. no DNS server(s)), then you are in a world of hurt.
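To make that concrete, here is a rough sketch of what such a "push to DNS" could look like as a dynamic update (RFC 2136) using Python's dnspython library. The zone name, TSIG key, server address and IP addresses are all invented for illustration; a real failover runbook would differ.

```python
import dns.query
import dns.rcode
import dns.tsigkeyring
import dns.update

# All names, keys and addresses below are hypothetical.
keyring = dns.tsigkeyring.from_text({"failover-key.": "c2VjcmV0LWtleS1tYXRlcmlhbA=="})

# Repoint the check-in service at the back-up (DR) site's address.
update = dns.update.Update("example-airline.test", keyring=keyring)
update.replace("checkin", 60, "A", "198.51.100.20")  # short TTL so clients move over quickly

# Send the dynamic update to the authoritative server.
response = dns.query.tcp(update, "203.0.113.53", timeout=10)
print("update rcode:", dns.rcode.to_text(response.rcode()))
```

And, as above, this presupposes the authoritative DNS servers are still up and reachable; if they have gone down with everything else, pushing new records gets you nowhere.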
Last edited by Banana4321; May 27, 2017 at 1:32 pm
#23
Join Date: Aug 2015
Location: Somewhere around Europe...
Programs: BA Gold; MB Ti; HH Diamond; IHG Plat; RR Gold
Posts: 530
http://www.datacenterknowledge.com/archives/2011/08/07/lightning-in-dublin-knocks-amazon-microsoft-data-centers-offline/
Speaking professionally, this in no way excuses not having a functional backup or a well-drilled recovery plan. But then I can count on one hand the number of customers I have come across that are actually that well prepared, either.
Authentication
Addressing (e.g. DNS)
Routing
Any issue in the above has the potential to be catastrophic.
As for moving to a disaster-recovery environment, one established method of activating a failover is a push of new DNS addresses. If that doesn't work (e.g. no DNS server(s)), then you are in a world of hurt.
#24
Join Date: Dec 2014
Programs: BAEC (although I might just cut up the card)
Posts: 338
Routing: internet routing is actually designed to safeguard against this. All of the main routing protocols should handle it.
DNS is a no-brainer; DNS servers should ALWAYS be geographically distributed.
For routing, it's best practice to ensure that you don't have a single point of routing failure (a single hub with multiple spokes). When you have a worldwide network running your entire airline operation, it would be terrible to think they'd done that.
So none of these are acceptable types of failure either.
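As a trivial illustration of the "geographically distributed DNS" point, here is a small sketch using Python's dnspython (2.x) that looks up a domain's name servers and counts how many distinct networks their addresses fall in. The domain is made up, and grouping by /16 is only a crude stand-in for real geographic/topological diversity.

```python
import dns.resolver

domain = "example-airline.test"  # hypothetical domain

# Find the authoritative name servers for the domain.
nameservers = [ns.target.to_text() for ns in dns.resolver.resolve(domain, "NS")]

# Group their A records into crude /16 "networks".
networks = set()
for ns_name in nameservers:
    for answer in dns.resolver.resolve(ns_name, "A"):
        octets = answer.to_text().split(".")
        networks.add(".".join(octets[:2]))

print(f"{len(nameservers)} name servers across {len(networks)} distinct /16 networks")
if len(networks) < 2:
    print("Warning: every name server sits in the same network - a single point of failure")
```

Run against a real zone it gives a very rough first answer to "is our DNS actually distributed?" - nothing more.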
EDIT: Mind you, we've not heard the real full version and cause. So it's all hypothetical right now. Nothing wrong with a bit of speculation though..
#25
Moderator: British Airways Executive Club, Iberia Airlines, Airport Lounges and Environmentally Friendly Travel
Join Date: Jan 2003
Location: London, UK
Posts: 22,210
Thank you r00ty, Banana4321, and dakaix for educating this non-tech-savvy individual about complex IT systems in such understandable terms
#26
Join Date: Oct 2016
Posts: 698
Glad this miscellaneous discussion thread is here.
I remember last year with the gradual FLY introduction that most outages seemed to occur on weekends. It's interesting that this one also happened on a Saturday.
Looking for info here, but on this weekend last year, wasn't there an outage of FLY at Gatwick? I could be wrong but I seem to remember long queues on a bank holiday weekend.
If IT people are mainly on Mon-Fri contracts this seems wrong for a global airline.
#27
Join Date: Apr 2014
Location: London
Programs: Don't even mention it. Grrrrrrr.
Posts: 968
DNS - distributed. I have my own network that I operate for fun, and my DNS is distributed.
Routing: internet routing is actually designed to safeguard against this. All of the main routing protocols should handle it.
DNS is a no-brainer; DNS servers should ALWAYS be geographically distributed.
For routing, it's best practice to ensure that you don't have a single point of routing failure (a single hub with multiple spokes). When you have a worldwide network running your entire airline operation, it would be terrible to think they'd done that.
So none of these are acceptable types of failure either.
EDIT: Mind you, we've not heard the real full version and cause. So it's all hypothetical right now. Nothing wrong with a bit of speculation though..
If you think that BA is unique in being caught out by this then you are making an ill-informed judgment, although theirs is a large and public failure - which is most unfortunate for them. I have been closely involved in the annual failover "rehearsals" for organisations with a more critical service to the public than an airline. None of them were 100% successful. There was always a list of things to fix. The next year, due to the changes we had made over the course of that year, there was a list of different things to fix. And we didn't test firing a lightning bolt at the power supply (say); we did only what was necessary to send a letter to the regulatory body saying "we did that and passed". Also, sometimes when we "flicked the switch" something unexpected and bad happened; in those cases we flicked the switch back and tried again a few weeks later.
And some things are just too dangerous to test, unless you want to stop flying planes, stop people withdrawing cash, or stop paying salaries whilst you perform the testing. Failover, and testing it, in organisations with 24x7 operations is... hard.
That is the reality.
Last edited by Banana4321; May 27, 2017 at 1:53 pm
#28
Join Date: Oct 2015
Location: next to HAM
Programs: LH M+M
Posts: 960
Posted this in the first thread already, but it got buried. Here's a description of what happened at Delta last year, leading to the cancellation of some 1,000-ish flights and a huge financial loss.
http://perspectives.mvdirona.com/201...ts-arent-rare/
(and just think of what happened to AWS S3 last month..)
So all the 'rage' over incompetent decisions, outsourcing and so on: it might be that - but it could be something different, too.
#29
Join Date: Aug 2015
Location: Somewhere around Europe...
Programs: BA Gold; MB Ti; HH Diamond; IHG Plat; RR Gold
Posts: 530
If you think that BA is unique in being caught out by this then you are making an ill-informed judgment, although theirs is a large and public failure - which is most unfortunate for them. I have been closely involved in the annual failover "rehearsals" for organisations with a more critical service to the public than an airline. None of them were 100% successful. There was always a list of things to fix. The next year, due to the changes we had made over the course of that year, there was a list of different things to fix. And we didn't test firing a lightning bolt at the power supply (say); we did only what was necessary to send a letter to the regulatory body saying "we did that and passed". Also, sometimes when we "flicked the switch" something unexpected and bad happened; in those cases we flicked the switch back and tried again a few weeks later.
That is the reality.
Sure you can test an application in isolation, maybe even one rack, but until you practically verify a complete data centre or geographical outage, there's simply no way to know for certain. Will routing work if the primary site is off? Are we sure all the fibres connecting the sites are diverse (so someone in a JCB doesn't knock out all our connectivity in one stroke)? If a site is entirely cratered, how do we go about recovering? Would we have any staff left to do it?
The fact is, very few companies would pass all of those criteria. At some point someone says "well, is this really worth the cost?", because it's not tangible to them.
The first company I worked for was a data centre operator; they had a bi-weekly practice of performing "black building testing". As the name suggests, they would literally disconnect the power supply to prove that everything worked: generators, UPS, failover etc. Theory is all fine, but it means nothing unless you test it. Regularly.
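In the same "test it, regularly" spirit, here is a very small Python sketch of the sort of scheduled probe you might run so a standby site can't quietly rot. The hostnames and ports are hypothetical, and a real drill would obviously go far beyond a bare TCP connect.

```python
import socket
import sys

# Hypothetical primary and standby endpoints; both should always answer.
SITES = {
    "primary": ("checkin.example-airline.test", 443),
    "dr-standby": ("checkin-dr.example-airline.test", 443),
}

failed = []
for name, (host, port) in SITES.items():
    try:
        # A plain TCP connect: the weakest possible "is it alive?" check.
        with socket.create_connection((host, port), timeout=5):
            print(f"{name}: reachable")
    except OSError as exc:
        failed.append(name)
        print(f"{name}: UNREACHABLE ({exc})")

# Non-zero exit so a scheduler or monitoring agent can raise an alert.
sys.exit(1 if failed else 0)
```

Scheduled from cron or a monitoring system, even something this simple catches the "DR site has been dead for months and nobody noticed" failure mode.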
Last edited by dakaix; May 27, 2017 at 1:55 pm
#30
Join Date: Oct 2015
Location: next to HAM
Programs: LH M+M
Posts: 960
The airline said in response that it would “never compromise the integrity and security of our IT systems,” adding that outsourcing of IT services was a “very common practice across all industries.”
Yep - and companies all across those industries have had their 'disruptive' experiences with it.