27 May BA IT outage miscellaneous discussions thread

Old May 27, 2017, 1:57 pm
  #31  
 
Join Date: Dec 2016
Programs: BA Gold
Posts: 486
Originally Posted by dakaix
The fact is very few companies would pass all of those criteria. At some point someone says "well is this really worth the cost?", because it's not tangible to them.
That's the tricky thing. No system can ever be 100% perfect. The law of diminishing returns kicks in quickly given the exorbitant cost of building in multiple layers of redundancy. One approach is to invest a huge amount up front as a sunk cost; the other is to invest very little (which is what I suspect BA have done) and hope everything just goes OK.
doctoravios is offline  
Old May 27, 2017, 2:02 pm
  #32  
 
Join Date: Dec 2014
Programs: BAEC (although I might just cut up the card)
Posts: 338
Originally Posted by Banana4321
DNS is distributed across the internet but not all names are in all DNS servers. BA won't have the addresses of their internal systems in Virgin's DNS servers. And these kinds of glib comments are all very fine in theory, but practical engineering in complex environments that are designed to serve millions of people, be secure and cope with edge-case, tail events (i.e. catastrophic events in the many forms they may come) is... complicated. Extremely complicated.
But that's not how it works at all, unless you're referring to the weird world of AD DNS, which is another matter. Normal real-world DNS doesn't work that way.

Say I have a domain, me.com, with a load of hosts: here.me.com, there.me.com, and so on. I delegate a whole subdomain, de.me.com, to my German operation, which has its primary server in Germany instead of the UK.

The main domain has its primary DNS in the UK, the German subdomain has its primary in Germany, we each act as secondary for the other, and I have extra secondaries running in, say, France and the USA. All secondaries pull the entire zone whenever there are changes. If the main server fails, the secondary servers can provide authoritative answers for any query with no problem.

Now, if we're talking internal DNS, the same system works fine. They put their main DNS in head office and have secondary servers in multiple locations, all on the internal network.

The same applies to AD. Even the modestly sized multinational I work for has AD replicas spread across the globe. If my local office exploded tomorrow, I could VPN into a server in, say, the USA and authenticate fine - albeit I wouldn't be able to access any local resources.
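
(To make that concrete, here's a minimal sketch using the dnspython library - the domain and nameserver addresses are invented placeholders, not anyone's real infrastructure. It queries each authoritative server directly, which is all a resolver needs once the primary has gone away: any surviving secondary can still return an authoritative answer.)

Code:
import dns.flags
import dns.message
import dns.query

# Hypothetical authoritative servers for me.com - addresses are placeholders.
NAMESERVERS = {
    "primary (UK)": "192.0.2.1",
    "secondary (DE)": "192.0.2.2",
    "secondary (FR)": "192.0.2.3",
    "secondary (US)": "192.0.2.4",
}

query = dns.message.make_query("here.me.com", "A")

for label, server in NAMESERVERS.items():
    try:
        response = dns.query.udp(query, server, timeout=2)
        # The AA flag shows the answer is authoritative, even from a secondary.
        print(label, "answered, authoritative =", bool(response.flags & dns.flags.AA))
    except Exception as exc:
        # A dead primary just means we move on to the next server.
        print(label, "no answer:", exc)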

Originally Posted by Banana4321
If you think that BA is unique in being caught out by this then you are making an ill-informed judgment, although theirs is a large and very public failure - which is most unfortunate for them. I have been closely involved in the annual failover "rehearsals" for organisations with a more critical service to the public than an airline. None of them were 100% successful. There was always a list of things to fix. The next year, because of the changes we had made over the course of that year, there was a list of different things to fix. And we didn't test firing a lightning bolt at the power supply (say); we did only what was necessary to send a letter to the regulatory body saying "we did that and passed". Sometimes, when we "flicked the switch", something unexpected and bad happened; in those cases we flicked the switch back and tried again a few weeks later.

And some things are just too dangerous to test, unless you want to stop flying planes, or stop people withdrawing cash, or stop paying salaries while you perform the testing. Failover - and testing it - in organisations with 24x7 operations is... hard.

That is the reality.
It's not that I'm saying it can't happen - clearly it has, and we've seen it happen elsewhere. It's just that this kind of redundancy failure isn't acceptable, any way you look at it.
r00ty is offline  
Old May 27, 2017, 2:04 pm
  #33  
 
Join Date: Dec 2014
Programs: BAEC (although I might just cut up the card)
Posts: 338
Originally Posted by dakaix
At some point someone says "well is this really worth the cost?", because it's not tangible to them.
My experience of the corporate world would put this front and centre as the root cause. Not that any company would ever admit that :P
r00ty is offline  
Old May 27, 2017, 2:04 pm
  #34  
 
Join Date: Aug 2015
Location: Somewhere around Europe...
Programs: BA Gold; MB Ti; HH Diamond; IHG Plat; RR Gold
Posts: 530
Originally Posted by doctoravios
That's the tricky thing. No system can ever be 100% perfect.
Never truer than when talking about enterprise IT!

For a moment I was tempted to suggest that in a modern design you'd plan to expect and contain failure... you architect it into the applications so they can work around the underlying infrastructure issues.

Then I came to my senses and realised this was a legacy airline we're talking about.
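
(For what it's worth, "expect and contain failure" at the application level can be as simple as the pattern sketched below - the endpoint URLs are invented, it's plain Python standard library, and it's obviously not how BA's systems actually work: the client tries each site in turn and degrades gracefully rather than dying with the first data centre.)

Code:
import urllib.error
import urllib.request

# Invented endpoints for the same service in two data centres.
ENDPOINTS = [
    "https://dc1.example.com/api/flights",
    "https://dc2.example.com/api/flights",
]

def fetch_flights():
    for url in ENDPOINTS:
        try:
            with urllib.request.urlopen(url, timeout=3) as resp:
                return resp.read()
        except (urllib.error.URLError, OSError):
            continue  # that site is unreachable - try the next one
    # All sites down: return a degraded response instead of crashing.
    return b'{"status": "degraded", "flights": []}'

print(fetch_flights())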
dakaix is offline  
Old May 27, 2017, 2:04 pm
  #35  
 
Join Date: Oct 2005
Location: London
Posts: 725
Originally Posted by dakaix
Glad I'm not the only one to have had that experience. The reality is that while many companies try (read: feign) to prepare for DR, few have a documented plan, and fewer still actually perform any form of testing (let alone regular testing).

Sure, you can test an application in isolation, maybe even one rack, but until you practically verify a complete data centre or geographical outage, there's simply no way to know for certain. Will routing work if the primary site is off? Are we sure all the fibres connecting the sites are diverse (so someone in a JCB doesn't knock out all our connectivity at a stroke)? If a site is entirely cratered, how do we go about recovering? Would we have any staff left to do it?

The fact is very few companies would pass all of those criteria. At some point someone says "well is this really worth the cost?", because it's not tangible to them.
Most large companies have DR plans, and test regularly. I'd wager internal and external audit teams in all FTSE 250 companies would be very interested in the DR plans and test results of their IT departments!

Datacenter isolation tests are common too; however, as Banana4321 says, there is a limit. I used to work at a UK high street bank, and preparations and planning were extensive, but even then we had a few occasions where cash points failed once we'd isolated a datacenter. A good learning experience all round! As it was a controlled isolation, we could recover quickly.

As for today's issue: given that it affected such a wide variety of systems, and with talk of internal BA staff not being able to access Outlook/email unless they were already logged in, I'd wager a big part of the problem was DNS and/or authentication (AD).

Both services would normally be widely distributed; however, if somebody pushed out a bad configuration change across the entire estate (or inadvertently deleted records in those systems!) then it would be pretty hard and time-consuming to recover.

However, BA are saying they had power issues today, which - to my humble mind at least! - doesn't make sense given some of the symptoms.
SW7London is offline  
Old May 27, 2017, 2:07 pm
  #36  
 
Join Date: Feb 2010
Location: London
Programs: BA GGL (for now) and Lifetime Gold, Marriott fan thanks to Bonvoy Moments
Posts: 5,114
Selfishly I'm wondering whether or not I'll have to claim for my LHR-HEL-TLL flights today on AY. Given everything else has been banjaxed I'm not hopeful...
lorcancoyle is offline  
Old May 27, 2017, 2:09 pm
  #37  
 
Join Date: Apr 2014
Location: London
Programs: Don't even mention it. Grrrrrrr.
Posts: 968
Originally Posted by drb1979

However, BA are saying they had power issues today, which - to my humble mind at least! - doesn't make sense given some of the symptoms.
Depends when the power issue happened. It may have happened as they were activating DR. In which case....ouch.
Banana4321 is offline  
Old May 27, 2017, 2:13 pm
  #38  
 
Join Date: Sep 2010
Location: Las Vegas
Programs: BA Gold; Hilton Honors Diamond
Posts: 3,224
Dealing with a large-scale IT outage is never easy. On the one hand you need to restore services as quickly as possible, but equally you may need to troubleshoot what has caused the issue in the first place. Activating your disaster recovery plan as a first step in this process is never a good idea - the response of last resort should not be the first step!

Unless you have multiple data centres running in an active-active setup, there will by necessity be some delay in bringing the backup systems online. This can be caused by multiple factors: recovering data from backups or snapshots, checking the integrity of the data, bringing systems online in the right order (database servers, then application servers, then web servers), validating that everything is running correctly, that all dependencies are in place, that DNS is updated so that servers and services can talk to one another, and so on.

Usually you are dealing with an outage of one system at a time, but when you have a total outage of all systems at once it becomes massively more complicated to bring everything back online. You would be looking at core network systems first (backbone switches, routers, firewalls etc.), followed by network and authentication services (Active Directory and DNS), then you'd bring up your storage area networks and storage systems so you can access your data. Once that is up and running you can start to look at application services - database servers, then application-level servers, web servers etc.
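
(In effect you end up doing a topological sort of your estate and bringing it up layer by layer. A small sketch of that ordering below - the dependency map is invented purely for illustration, not BA's actual systems.)

Code:
from graphlib import TopologicalSorter  # Python 3.9+

# Each system maps to the systems it depends on (invented example).
dependencies = {
    "core network": set(),
    "dns": {"core network"},
    "active directory": {"core network", "dns"},
    "storage / SAN": {"core network"},
    "database servers": {"storage / SAN", "dns", "active directory"},
    "application servers": {"database servers"},
    "web servers": {"application servers"},
}

# static_order() yields every system only after its dependencies.
for system in TopologicalSorter(dependencies).static_order():
    print("bring up:", system)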

You also have to ensure you have sufficient resources available - staff with the relevant expertise, support contracts in place with your vendors so you can call on 24/7 support from them, and so on. You are also looking at your disaster recovery plan, recovery point objectives and recovery time objectives in order to prioritise which systems are brought back online first.

Clearly a power outage should not bring down your entire infrastructure. You would expect all equipment to be fed from at least two independent sources with UPS / battery backup and generator backup in place. These systems should in themselves be redundant so the failure of a single piece of infrastructure doesn't lead to an outage, and you'd expect to be regularly maintaining and testing this. In a co-location / data centre environment this would be undertaken by the data centre provider.

I have to say that while it's easy to be critical of BA and say that this outage should never have happened I would spare a thought for the IT staff who will have been sorting out this mess and getting systems back online. Having been on the front line in outages for my company I can say it is stressful beyond words and you really do feel the pressure!

Hopefully this little summary will give an indication of what's involved in dealing with a large-scale outage. I don't work for BA so have no idea of their IT infrastructure but my experience comes from data centre management for a global law firm with 30+ offices worldwide.
Geordie405 is offline  
Old May 27, 2017, 2:15 pm
  #39  
 
Join Date: Oct 2005
Location: London
Posts: 725
Originally Posted by Banana4321
Depends when the power issue happened. It may have happened as they were activating DR. In which case....ouch.
True. I would have thought so from an application perspective especially; not so much from an infra/DNS/AD perspective, given the nature of those systems - i.e. you don't fail over AD. It's up, it's distributed, and pretty resilient.

From the symptoms I've read today, I can't reconcile a datacenter power issue with people being able to use their email only if they were already logged in. It seems like something cached is enabling access, but it's all speculation as we've got limited visibility at the moment about what happened!
SW7London is offline  
Old May 27, 2017, 2:22 pm
  #40  
 
Join Date: Apr 2010
Posts: 11
Originally Posted by BATLV
Sounds like if there was a cyber attack it was targeting power distribution-SCADA. Thoughts?
Stuxnet? Or a variant of it, perhaps?
funkyfreddy is offline  
Old May 27, 2017, 2:24 pm
  #41  
 
Join Date: Sep 2014
Location: Brexile in ADB
Programs: BA, TK, HHonours, Le Club, Best Western Rewards
Posts: 7,067
Aren't airports classed as critical infrastructure by the government? Perhaps the government should force this Spanish airline (BA) to meet higher standards of IT infrastructure if it's going to hold so many airport slots.
Worcester is offline  
Old May 27, 2017, 2:27 pm
  #42  
 
Join Date: Apr 2014
Location: London
Programs: Don't even mention it. Grrrrrrr.
Posts: 968
Originally Posted by drb1979
From the symptoms I've read today, I can't reconcile a datacenter power issue with people being able to use their email only if they were already logged in. It seems like something cached is enabling access, but it's all speculation as we've got limited visibility at the moment about what happened!
I assumed from the symptoms that (and I know little about O365, but it appears we do have a resident expert on FT) MS caches the credentials for a period of time, i.e. when you log in to O365, BA's authentication servers in BA's data centres are called, and if they give the OK the user can use O365 for a few hours, maybe even a day. However, once those authentication servers disappear, no one can log in to O365 for the first time 'today', as their previous credentials would have expired.
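
(Roughly what that cached-credential behaviour looks like, sketched below - the token lifetime and field names are invented for illustration and are not the real O365/AD token flow.)

Code:
import time

TOKEN_LIFETIME = 8 * 3600  # assume a cached token lasts 8 hours (invented figure)

def auth_servers_reachable():
    return False  # the data centre (and the authentication servers) is down

def can_use_email(cached_token):
    if cached_token and time.time() - cached_token["issued_at"] < TOKEN_LIFETIME:
        return True  # already logged in: the cached token is still valid
    return auth_servers_reachable()  # a fresh login needs the auth servers

existing_session = {"issued_at": time.time() - 2 * 3600}  # logged in two hours ago
print(can_use_email(existing_session))  # True: existing sessions keep working
print(can_use_email(None))              # False: first login of the day fails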
Banana4321 is offline  
Old May 27, 2017, 2:27 pm
  #43  
 
Join Date: Dec 2016
Programs: BA Gold
Posts: 486
Originally Posted by dakaix
Never truer than when talking about enterprise IT!

For a moment I was tempted to suggest that in a modern design you'd plan to expect and contain failure... you architect it into the applications so they can work around the underlying infrastructure issues.

Then I came to my senses and realised this was a legacy airline we're talking about.
Absolutely. To be fair, BA aren't the only ones. There are so many organisations running outdated software on outdated hardware and network infrastructure that have simply been lucky not to experience a serious failure in operation.

The fact is that humans have great difficulty grasping an existential threat and formulating a plan to deal with the consequences. We tend to look at the odds and, if they are favourable, just roll the dice and ignore the possibility that improbable events will ever be realised.
doctoravios is offline  
Old May 27, 2017, 2:27 pm
  #44  
 
Join Date: Apr 2010
Posts: 11
Originally Posted by Banana4321
Authentication
Addressing (e.g. DNS)
Routing

Any issue in the above has the potential to be catastrophic.

As for moving to a disaster-recovery environment, one established method of activating a failover is a push to DNS of new IP addresses (the IP addresses of the back-up services). If that doesn't work (e.g. no DNS server(s)), then you are in a world of hurt.
Yep - and perhaps exactly the kind of things you assume are distributed, but where you might not notice if something has gone wrong with the distribution/failover capability.
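
(As a sketch of the DNS failover push Banana4321 describes - dnspython again, with an invented zone, record name and addresses: you repoint the service record at the standby site, which of course only helps if the DNS servers themselves are still answering.)

Code:
import dns.query
import dns.update

ZONE = "me.com"               # invented zone
PRIMARY_DNS = "192.0.2.1"     # the server that accepts dynamic updates
STANDBY_SITE_IP = "203.0.113.10"

# Replace the A record so clients resolve the backup data centre instead.
# (A real server would normally require a TSIG key to authorise this.)
update = dns.update.Update(ZONE)
update.replace("booking", 60, "A", STANDBY_SITE_IP)

response = dns.query.tcp(update, PRIMARY_DNS, timeout=5)
print("update rcode:", response.rcode())  # 0 (NOERROR) means it was accepted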
funkyfreddy is offline  
Old May 27, 2017, 2:27 pm
  #45  
 
Join Date: Nov 2010
Location: Bristol
Programs: BA GGL, UA Plat, DL Plat, Hilton Diamond
Posts: 2,380
Originally Posted by lorcancoyle
Selfishly I'm wondering whether or not I'll have to claim for my LHR-HEL-TLL flights today on AY. Given everything else has been banjaxed I'm not hopeful...
What happened to your AY flights then? Were they affected as well?
Fitch is offline  
