27 May BA IT outage miscellaneous discussions thread
#31
Join Date: Dec 2016
Programs: BA Gold
Posts: 487
That's the tricky thing. No system can ever be 100% perfect, and the law of diminishing returns is reached quickly given the exorbitant cost of building in multiple layers of redundancy. One approach is to invest a huge amount up front as a sunk cost; the other is to invest very little (which is what I suspect BA have done) and hope everything just goes OK.
#32
Join Date: Dec 2014
Programs: BAEC (although I might just cut up the card)
Posts: 338
DNS is distributed across the internet, but not all names are in all DNS servers; BA won't have the addresses of their internal systems in Virgin's DNS servers. And these kinds of glib comments are all very fine in theory, but practical engineering in complex environments that are designed to serve millions of people, be secure, and cope with edge-case, tail events (i.e. catastrophic events in the many forms they may take) is... complicated. Extremely complicated.
Say I have a domain, me.com, with a load of hosts: here.me.com, there.me.com, and so on. I delegate a whole subdomain, de.me.com, to my German operation, which has its primary server in Germany instead of the UK.
The main domain has its primary DNS in the UK and the German subdomain has its primary in Germany; we each act as secondary for the other, and I have an extra secondary running in France and, say, one in the USA. All secondaries pull the entire zone whenever there are changes, so if the main server fails, the secondary servers can provide authoritative answers for any query with no problem.
Now, if we're talking internal DNS, the same system works fine. They put their main DNS in head office, and have secondary servers in multiple locations, all on the internal network space.
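The setup described above can be sketched as a toy simulation: any healthy secondary holding a full copy of the zone can answer authoritatively. All names and addresses are illustrative (me.com is just the example from the post; 192.0.2.x is documentation address space).

```python
# Toy simulation of the primary/secondary setup described in this post.
# All names and addresses are illustrative; 192.0.2.x is documentation space.

ZONE = {
    "here.me.com": "192.0.2.10",
    "there.me.com": "192.0.2.11",
}

class NameServer:
    def __init__(self, name, zone):
        self.name = name
        self.zone = dict(zone)  # every secondary pulls a full copy of the zone
        self.up = True

def resolve(qname, servers):
    """Ask each authoritative server in turn; any healthy one can answer."""
    for ns in servers:
        if ns.up:
            return ns.zone.get(qname)
    raise RuntimeError("no authoritative server reachable")

primary = NameServer("ns1 (UK)", ZONE)
secondaries = [NameServer("ns2 (DE)", ZONE), NameServer("ns3 (FR)", ZONE)]

primary.up = False  # the main server fails...
addr = resolve("here.me.com", [primary] + secondaries)
# ...and a secondary still returns an authoritative answer for here.me.com
```

In real DNS the zone copy would arrive via AXFR/IXFR zone transfers and resolvers would retry the other NS records automatically; the dictionary here just stands in for that.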
The same applies to AD. Even the modest multinational I work for has AD backups spread across the globe. If my local office exploded tomorrow, I could VPN into a server in, say, the USA and authenticate fine, albeit without access to any local resources.
If you think BA is unique in being caught out by this, you are making an ill-informed judgment; theirs is just large and public, which is most unfortunate for them. I have been closely involved in the annual failover "rehearsals" of organisations providing a more critical service to the public than an airline. None of them was 100% successful. There was always a list of things to fix, and the next year, thanks to the changes we had made over the course of that year, there was a list of different things to fix. We didn't test firing a lightning bolt at the power supply (say); we did only what was necessary to send a letter to the regulatory body saying "we did that and passed". And sometimes, when we "flicked the switch", something unexpected and bad happened; in those cases we flicked the switch back and tried again a few weeks later.
And some things are just too dangerous to test, unless you want to stop flying planes, stop people withdrawing cash, or stop paying salaries while you perform the testing. Failover, and testing it, in organisations with 24x7 operations is... hard.
That is the reality.
#33
Join Date: Dec 2014
Programs: BAEC (although I might just cut up the card)
Posts: 338
#34
Join Date: Aug 2015
Location: Somewhere around Europe...
Programs: BA Gold; MB Ti; HH Diamond; IHG Plat; RR Gold
Posts: 530
Never truer than when talking about enterprise IT!
For a moment I was tempted to suggest that in a modern design you'd plan to expect and contain failure... you architect it into the applications so they can work around the underlying infrastructure issues.
Then I came to my senses and realised this was a legacy airline we're talking about.
#35
Join Date: Oct 2005
Location: London
Posts: 726
Glad I'm not the only one to have had that experience. The reality is that while many companies try (read: feign) to prepare for DR, few have a documented plan, and fewer still actually perform any form of testing, let alone regularly.
Sure, you can test an application in isolation, maybe even one rack, but until you practically verify a complete data centre or geographical outage, there's simply no way to know for certain. Will routing work if the primary site is off? Are we sure all the fibres connecting the sites are diverse (so someone in a JCB doesn't knock out all our connectivity at a stroke)? If a site is entirely cratered, how do we go about recovering? Would we have any staff left to do it?
The fact is very few companies would pass all of those criteria. At some point someone asks "is this really worth the cost?", because the risk isn't tangible to them.
Datacentre isolation tests are common too; however, as Banana4321 says, there is a limit. I used to work at a UK high street bank, and preparation and planning were extensive, but even then we had a few occasions where cash points failed once we'd isolated a datacentre. A good learning experience all round! As it was a controlled isolation, we could recover quickly.
As for today's issue, given it affected such a wide variety of systems, and given the talk of internal BA staff not being able to access Outlook/email unless they were already logged in, I'd wager a big part of the problem was DNS and/or authentication (AD).
Both services would be widely distributed; however, if somebody pushed out a configuration change across the entire estate (or inadvertently deleted records in those systems!) it would be pretty hard and time-consuming to recover.
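To illustrate the point about estate-wide pushes, here is a hypothetical Python sketch (all names invented) of how a staged "canary" rollout contains a bad change that an everywhere-at-once push would not:

```python
# Hypothetical sketch (invented names) of why a staged "canary" push limits
# the blast radius, versus an everywhere-at-once configuration change.

replicas = {f"dns-{i}": {"ba.example": "192.0.2.1"} for i in range(6)}

def bad_change(zone):
    zone.clear()  # the "inadvertently deleted records" case

def healthy(zone):
    return "ba.example" in zone  # minimal post-change validation

def staged_push(change, replicas, canary="dns-0"):
    """Apply the change to one canary replica; continue only if it stays healthy."""
    change(replicas[canary])
    if not healthy(replicas[canary]):
        return f"halted: only {canary} affected"
    for name, zone in replicas.items():
        if name != canary:
            change(zone)
    return "pushed everywhere"

outcome = staged_push(bad_change, replicas)
# the bad change stops at the canary; five of six replicas keep their records
```

The post's point stands either way: a change pushed to every replica at once leaves no healthy copy to fall back on.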
However, BA are saying they had power issues today, which, to my humble mind at least, doesn't make sense given some of the symptoms.
#36
Join Date: Feb 2010
Location: London
Programs: BA GGL (for now) and Lifetime Gold, Marriott fan thanks to Bonvoy Moments
Posts: 5,115
Selfishly I'm wondering whether or not I'll have to claim for my LHR-HEL-TLL flights today on AY. Given everything else has been banjaxed I'm not hopeful...
#37
Join Date: Apr 2014
Location: London
Programs: Don't even mention it. Grrrrrrr.
Posts: 968
#38
Join Date: Sep 2010
Location: Las Vegas
Programs: BA Gold; Hilton Honors Diamond
Posts: 3,228
Dealing with a large-scale IT outage is never easy. On the one hand you need to restore services as quickly as possible, but equally you may need to troubleshoot what caused the issue in the first place. Activating your disaster recovery plan as the first step in this process is never a good idea: the response of last resort should not be the first step!
Unless you have multiple data centres running in an active-active setup, there will by necessity be some delay in bringing the backup systems online. This can be caused by multiple factors: recovering data from backups or snapshots, checking the integrity of the data, bringing systems online in the right order (database servers, then application servers, then web servers), validating that everything is running correctly and all dependencies are in place, updating DNS so that servers and services can talk to one another, and so on.
Usually you are dealing with an outage of one system at a time, but a total outage of all systems at once is massively more complicated to recover from. You would be looking at core network systems first (backbone switches, routers, firewalls, etc.), followed by network and authentication services (Active Directory and DNS), then your storage area networks and storage systems so you can access your data. Once this is up and running you can start on application services: database servers, then application-level servers, web servers, and so on.
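That bring-up sequence is really a dependency graph, and a topological sort gives a valid ordering. A minimal sketch using the layers above (illustrative only; a real estate has hundreds of nodes):

```python
# Sketch of the bring-up ordering described above as a dependency graph.
# The layers come from the post; the sort is Python's standard library.
from graphlib import TopologicalSorter

# each service maps to what must be online before it can start (illustrative)
deps = {
    "core network":  [],
    "AD/DNS":        ["core network"],
    "storage (SAN)": ["core network"],
    "database":      ["AD/DNS", "storage (SAN)"],
    "app servers":   ["database"],
    "web servers":   ["app servers"],
}

order = list(TopologicalSorter(deps).static_order())
# core network comes up first, web servers last
```

graphlib is in the Python standard library from 3.9; nodes with no constraint between them (here AD/DNS and the SAN) could in principle be brought up in parallel.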
You also have to ensure you have sufficient resources available: staff with the relevant expertise, and support contracts in place with your vendors so you can call on 24/7 support from them. You are also consulting your disaster recovery plan, recovery point objectives and recovery time objectives in order to prioritise which systems are brought back online first.
Clearly a power outage should not bring down your entire infrastructure. You would expect all equipment to be fed from at least two independent sources with UPS / battery backup and generator backup in place. These systems should in themselves be redundant so the failure of a single piece of infrastructure doesn't lead to an outage, and you'd expect to be regularly maintaining and testing this. In a co-location / data centre environment this would be undertaken by the data centre provider.
I have to say that while it's easy to be critical of BA and say that this outage should never have happened I would spare a thought for the IT staff who will have been sorting out this mess and getting systems back online. Having been on the front line in outages for my company I can say it is stressful beyond words and you really do feel the pressure!
Hopefully this little summary will give an indication of what's involved in dealing with a large-scale outage. I don't work for BA so have no idea of their IT infrastructure but my experience comes from data centre management for a global law firm with 30+ offices worldwide.
#39
Join Date: Oct 2005
Location: London
Posts: 726
From the symptoms I've read today, I can't reconcile a datacentre power issue with people being able to use their email only if they were already logged in. It seems like something cached is enabling access, but it's all speculation as we have limited visibility at the moment into what happened!
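A hypothetical sketch of that symptom: sessions holding a cached token keep working while the authentication service is unreachable, but fresh logins fail. Everything here is invented for illustration.

```python
# Hypothetical sketch of the "already logged in" symptom: cached sessions
# survive an authentication outage; new logins do not. All names invented.

auth_service_up = False                 # the outage
token_cache = {"alice": "tok-123"}      # alice logged in before the outage

def access_mailbox(user):
    if user in token_cache:
        return "OK (cached session)"    # existing session still valid
    if not auth_service_up:
        raise ConnectionError("cannot reach authentication service")
    return "OK (fresh login)"           # normal path would issue a new token

access_mailbox("alice")    # works: token cached before the outage
# access_mailbox("bob")    # would fail: no cached token and auth is down
```

This is roughly the behaviour you'd expect from cached credentials or an email client's offline cache: already-authenticated users are shielded from an AD outage for a while.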
#41
Join Date: Sep 2014
Location: Brexile in ADB
Programs: BA, TK, HHonours, Le Club, Best Western Rewards
Posts: 7,067
Aren't airports classed as critical national infrastructure by the government? Perhaps it should force this Spanish airline (BA) to meet higher standards of IT resilience if it holds so many airport slots.
#42
Join Date: Apr 2014
Location: London
Programs: Don't even mention it. Grrrrrrr.
Posts: 968
#43
Join Date: Dec 2016
Programs: BA Gold
Posts: 487
Never truer than when talking about enterprise IT!
For a moment I was tempted to suggest that in a modern design you'd plan to expect and contain failure... you architect it into the applications so they can work around the underlying infrastructure issues.
Then I came to my senses and realised this was a legacy airline we're talking about.
The fact is that humans have great difficulty comprehending an existential threat and formulating a plan to deal with its consequences. We tend to look at the odds and, if they are favourable, just roll the dice and ignore the possibility that improbable events will ever be realised.
#44
Join Date: Apr 2010
Posts: 11
Authentication
Addressing (e.g. DNS)
Routing
Any issue in the above has the potential to be catastrophic.
As for moving to a disaster-recovery environment, one established method of activating a failover is to push the new IP addresses (those of the backup services) out via DNS. If that doesn't work (e.g. the DNS servers themselves are unavailable), then you are in a world of hurt.
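That DNS-cutover method can be sketched as follows; the names and addresses are purely illustrative (198.51.100.x and 203.0.113.x are documentation ranges).

```python
# Sketch of the DNS-cutover failover described above: repoint service names
# at the backup site's addresses. All names and addresses are illustrative.

records = {
    "checkin.ba.example": ("198.51.100.10", 60),   # (primary IP, low TTL)
    "booking.ba.example": ("198.51.100.11", 60),
}
DR_SITE = {
    "checkin.ba.example": "203.0.113.10",          # backup-site addresses
    "booking.ba.example": "203.0.113.11",
}

def activate_failover(records, dr_site):
    """Push the backup IPs into DNS, keeping each record's TTL.
    If the DNS servers themselves are down, this step is impossible."""
    for name, ip in dr_site.items():
        ttl = records[name][1]
        records[name] = (ip, ttl)

activate_failover(records, DR_SITE)
# clients resolving checkin.ba.example now get the DR address
```

Low TTLs are the usual precondition: clients cache answers, so the cutover only takes full effect as those caches expire.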
#45
Join Date: Nov 2010
Location: Bristol
Programs: BA GGL, UA Plat, DL Plat, Hilton Diamond
Posts: 2,380