27 May BA IT outage miscellaneous discussions thread
#31
Join Date: Dec 2016
Programs: BA Gold
Posts: 487
That's the tricky thing. No system can ever be 100% perfect, and the law of diminishing returns is reached quickly given the exorbitant cost of building in multiple layers of redundancy. One approach is to invest a huge amount up front as a sunk cost; the other is to invest very little (which is what I suspect BA have done) and hope everything just goes OK.
#32
Join Date: Dec 2014
Programs: BAEC (although I might just cut up the card)
Posts: 338
DNS is distributed across the internet, but not all names are in all DNS servers; BA won't have the addresses of their internal systems in Virgin's DNS servers. And these kinds of glib comments are all very fine in theory, but practical engineering in complex environments that are designed to serve millions of people, be secure, and cope with edge-case, tail events (i.e. catastrophic events in the many forms they may take) is... complicated. Extremely complicated.
Say I have a domain, me.com, with a load of hosts: here.me.com, there.me.com, and so on. I delegate a whole subdomain, de.me.com, to my German operation, which has its primary server in Germany instead of the UK.
The main domain has its primary DNS in the UK and the German subdomain has its primary in Germany; we each act as secondary for the other, and I have an extra secondary running in France and, say, one in the USA. All secondaries pull the entire zone whenever there are changes, so if the main server fails, the secondary servers can provide authoritative answers for any query with no problem.
Now, if we're talking internal DNS, the same system works fine. They put their main DNS in head office, and have secondary servers in multiple locations, all on the internal network space.
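The setup described above can be sketched as a toy simulation: any healthy secondary holding a full copy of the zone can answer authoritatively. All names and addresses are illustrative (me.com is just the example from the post; 192.0.2.x is documentation address space).

```python
# Toy simulation of the primary/secondary setup described in this post.
# All names and addresses are illustrative; 192.0.2.x is documentation space.

ZONE = {
    "here.me.com": "192.0.2.10",
    "there.me.com": "192.0.2.11",
}

class NameServer:
    def __init__(self, name, zone):
        self.name = name
        self.zone = dict(zone)  # every secondary pulls a full copy of the zone
        self.up = True

def resolve(qname, servers):
    """Ask each authoritative server in turn; any healthy one can answer."""
    for ns in servers:
        if ns.up:
            return ns.zone.get(qname)
    raise RuntimeError("no authoritative server reachable")

primary = NameServer("ns1 (UK)", ZONE)
secondaries = [NameServer("ns2 (DE)", ZONE), NameServer("ns3 (FR)", ZONE)]

primary.up = False  # the main server fails...
addr = resolve("here.me.com", [primary] + secondaries)
# ...and a secondary still returns an authoritative answer for here.me.com
```

In real DNS the zone copy would arrive via AXFR/IXFR zone transfers and resolvers would retry the other NS records automatically; the dictionary here just stands in for that.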
The same applies to AD. Even the modest multinational I work for has AD backups spread across the globe. If my local office exploded tomorrow, I could VPN into a server in, say, the USA and authenticate fine, albeit without access to any local resources.
If you think BA is unique in being caught out by this, you are making an ill-informed judgment; theirs is just large and public, which is most unfortunate for them. I have been closely involved in the annual failover "rehearsals" of organisations providing a more critical service to the public than an airline. None of them was 100% successful. There was always a list of things to fix, and the next year, thanks to the changes we had made over the course of that year, there was a list of different things to fix. We didn't test firing a lightning bolt at the power supply (say); we did only what was necessary to send a letter to the regulatory body saying "we did that and passed". And sometimes, when we "flicked the switch", something unexpected and bad happened; in those cases we flicked the switch back and tried again a few weeks later.
And some things are just too dangerous to test, unless you want to stop flying planes, stop people withdrawing cash, or stop paying salaries while you perform the testing. Failover, and testing it, in organisations with 24x7 operations is... hard.
That is the reality.
#33
Join Date: Dec 2014
Programs: BAEC (although I might just cut up the card)
Posts: 338
#34
Join Date: Aug 2015
Location: Somewhere around Europe...
Programs: BA Gold; MB Ti; HH Diamond; IHG Plat; RR Gold
Posts: 530
Never truer than when talking about enterprise IT!
For a moment I was tempted to suggest that in a modern design you'd plan to expect and contain failure... you architect it into the applications so they can work around the underlying infrastructure issues.
Then I came to my senses and realised this was a legacy airline we're talking about.
#35
Join Date: Oct 2005
Location: London
Posts: 726
Glad I'm not the only one to have had that experience. The reality is that while many companies try (read: feign) to prepare for DR, few have a documented plan, and fewer still actually perform any form of testing, let alone regularly.
Sure, you can test an application in isolation, maybe even one rack, but until you practically verify a complete data centre or geographical outage, there's simply no way to know for certain. Will routing work if the primary site is off? Are we sure all the fibres connecting the sites are diverse (so someone in a JCB doesn't knock out all our connectivity at a stroke)? If a site is entirely cratered, how do we go about recovering? Would we have any staff left to do it?
The fact is very few companies would pass all of those criteria. At some point someone asks "is this really worth the cost?", because the risk isn't tangible to them.
Datacentre isolation tests are common too; however, as Banana4321 says, there is a limit. I used to work at a UK high street bank, and preparation and planning were extensive, but even then we had a few occasions where cash points failed once we'd isolated a datacentre. A good learning experience all round! As it was a controlled isolation, we could recover quickly.
As for today's issue, given it affected such a wide variety of systems, and given the talk of internal BA staff not being able to access Outlook/email unless they were already logged in, I'd wager a big part of the problem was DNS and/or authentication (AD).
Both services would be widely distributed; however, if somebody pushed out a configuration change across the entire estate (or inadvertently deleted records in those systems!) it would be pretty hard and time-consuming to recover.
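To illustrate the point about estate-wide pushes, here is a hypothetical Python sketch (all names invented) of how a staged "canary" rollout contains a bad change that an everywhere-at-once push would not:

```python
# Hypothetical sketch (invented names) of why a staged "canary" push limits
# the blast radius, versus an everywhere-at-once configuration change.

replicas = {f"dns-{i}": {"ba.example": "192.0.2.1"} for i in range(6)}

def bad_change(zone):
    zone.clear()  # the "inadvertently deleted records" case

def healthy(zone):
    return "ba.example" in zone  # minimal post-change validation

def staged_push(change, replicas, canary="dns-0"):
    """Apply the change to one canary replica; continue only if it stays healthy."""
    change(replicas[canary])
    if not healthy(replicas[canary]):
        return f"halted: only {canary} affected"
    for name, zone in replicas.items():
        if name != canary:
            change(zone)
    return "pushed everywhere"

outcome = staged_push(bad_change, replicas)
# the bad change stops at the canary; five of six replicas keep their records
```

The post's point stands either way: a change pushed to every replica at once leaves no healthy copy to fall back on.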
However, BA are saying they had power issues today, which, to my humble mind at least, doesn't make sense given some of the symptoms.
#36
Join Date: Feb 2010
Location: London
Programs: BA GGL (for now) and Lifetime Gold, Marriott fan thanks to Bonvoy Moments
Posts: 5,115
Selfishly I'm wondering whether or not I'll have to claim for my LHR-HEL-TLL flights today on AY. Given everything else has been banjaxed I'm not hopeful...
#37
Join Date: Apr 2014
Location: London
Programs: Don't even mention it. Grrrrrrr.
Posts: 968
#38
Join Date: Sep 2010
Location: Las Vegas
Programs: BA Gold; Hilton Honors Diamond
Posts: 3,228
Dealing with a large-scale IT outage is never easy. On the one hand you need to restore services as quickly as possible, but equally you may need to troubleshoot what caused the issue in the first place. Activating your disaster recovery plan as the first step in this process is never a good idea: the response of last resort should not be the first step!
Unless you have multiple data centres running in an active-active setup, there will by necessity be some delay in bringing the backup systems online. This can be caused by multiple factors: recovering data from backups or snapshots, checking the integrity of the data, bringing systems online in the right order (database servers, then application servers, then web servers), validating that everything is running correctly and all dependencies are in place, updating DNS so that servers and services can talk to one another, and so on.
Usually you are dealing with an outage of one system at a time, but a total outage of all systems at once is massively more complicated to recover from. You would be looking at core network systems first (backbone switches, routers, firewalls, etc.), followed by network and authentication services (Active Directory and DNS), then your storage area networks and storage systems so you can access your data. Once this is up and running you can start on application services: database servers, then application-level servers, web servers, and so on.
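That bring-up sequence is really a dependency graph, and a topological sort gives a valid ordering. A minimal sketch using the layers above (illustrative only; a real estate has hundreds of nodes):

```python
# Sketch of the bring-up ordering described above as a dependency graph.
# The layers come from the post; the sort is Python's standard library.
from graphlib import TopologicalSorter

# each service maps to what must be online before it can start (illustrative)
deps = {
    "core network":  [],
    "AD/DNS":        ["core network"],
    "storage (SAN)": ["core network"],
    "database":      ["AD/DNS", "storage (SAN)"],
    "app servers":   ["database"],
    "web servers":   ["app servers"],
}

order = list(TopologicalSorter(deps).static_order())
# core network comes up first, web servers last
```

graphlib is in the Python standard library from 3.9; nodes with no constraint between them (here AD/DNS and the SAN) could in principle be brought up in parallel.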
You also have to ensure you have sufficient resources available: staff with the relevant expertise, and support contracts in place with your vendors so you can call on 24/7 support from them. You are also consulting your disaster recovery plan, recovery point objectives and recovery time objectives in order to prioritise which systems are brought back online first.
Clearly a power outage should not bring down your entire infrastructure. You would expect all equipment to be fed from at least two independent sources with UPS / battery backup and generator backup in place. These systems should in themselves be redundant so the failure of a single piece of infrastructure doesn't lead to an outage, and you'd expect to be regularly maintaining and testing this. In a co-location / data centre environment this would be undertaken by the data centre provider.
I have to say that while it's easy to be critical of BA and say that this outage should never have happened I would spare a thought for the IT staff who will have been sorting out this mess and getting systems back online. Having been on the front line in outages for my company I can say it is stressful beyond words and you really do feel the pressure!
Hopefully this little summary will give an indication of what's involved in dealing with a large-scale outage. I don't work for BA so have no idea of their IT infrastructure but my experience comes from data centre management for a global law firm with 30+ offices worldwide.
#39
Join Date: Oct 2005
Location: London
Posts: 726
From the symptoms I've read today, I can't reconcile a datacentre power issue with people being able to use their email only if they were already logged in. It seems like something cached is enabling access, but it's all speculation as we have limited visibility at the moment into what happened!
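A hypothetical sketch of that symptom: sessions holding a cached token keep working while the authentication service is unreachable, but fresh logins fail. Everything here is invented for illustration.

```python
# Hypothetical sketch of the "already logged in" symptom: cached sessions
# survive an authentication outage; new logins do not. All names invented.

auth_service_up = False                 # the outage
token_cache = {"alice": "tok-123"}      # alice logged in before the outage

def access_mailbox(user):
    if user in token_cache:
        return "OK (cached session)"    # existing session still valid
    if not auth_service_up:
        raise ConnectionError("cannot reach authentication service")
    return "OK (fresh login)"           # normal path would issue a new token

access_mailbox("alice")    # works: token cached before the outage
# access_mailbox("bob")    # would fail: no cached token and auth is down
```

This is roughly the behaviour you'd expect from cached credentials or an email client's offline cache: already-authenticated users are shielded from an AD outage for a while.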
#41
Join Date: Sep 2014
Location: Brexile in ADB
Programs: BA, TK, HHonours, Le Club, Best Western Rewards
Posts: 7,067
Aren't airports classed as critical national infrastructure by the government? Perhaps it should force this Spanish airline (BA) to meet higher standards of IT resilience if it holds so many airport slots.
#42
Join Date: Apr 2014
Location: London
Programs: Don't even mention it. Grrrrrrr.
Posts: 968
#43
Join Date: Dec 2016
Programs: BA Gold
Posts: 487
Never truer than when talking about enterprise IT!
For a moment I was tempted to suggest that in a modern design you'd plan to expect and contain failure... you architect it into the applications so they can work around the underlying infrastructure issues.
Then I came to my senses and realised this was a legacy airline we're talking about.
The fact is that humans have great difficulty comprehending an existential threat and formulating a plan to deal with its consequences. We tend to look at the odds and, if they are favourable, just roll the dice and ignore the possibility that improbable events will ever be realised.
#44
Join Date: Apr 2010
Posts: 11
Authentication
Addressing (e.g. DNS)
Routing
Any issue in the above has the potential to be catastrophic.
As for moving to a disaster-recovery environment, one established method of activating a failover is to push the new IP addresses (those of the backup services) out via DNS. If that doesn't work (e.g. the DNS servers themselves are unavailable), then you are in a world of hurt.
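That DNS-cutover method can be sketched as follows; the names and addresses are purely illustrative (198.51.100.x and 203.0.113.x are documentation ranges).

```python
# Sketch of the DNS-cutover failover described above: repoint service names
# at the backup site's addresses. All names and addresses are illustrative.

records = {
    "checkin.ba.example": ("198.51.100.10", 60),   # (primary IP, low TTL)
    "booking.ba.example": ("198.51.100.11", 60),
}
DR_SITE = {
    "checkin.ba.example": "203.0.113.10",          # backup-site addresses
    "booking.ba.example": "203.0.113.11",
}

def activate_failover(records, dr_site):
    """Push the backup IPs into DNS, keeping each record's TTL.
    If the DNS servers themselves are down, this step is impossible."""
    for name, ip in dr_site.items():
        ttl = records[name][1]
        records[name] = (ip, ttl)

activate_failover(records, DR_SITE)
# clients resolving checkin.ba.example now get the DR address
```

Low TTLs are the usual precondition: clients cache answers, so the cutover only takes full effect as those caches expire.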
#45
Join Date: Nov 2010
Location: Bristol
Programs: BA GGL, UA Plat, DL Plat, Hilton Diamond
Posts: 2,380