FlyerTalk Forums - View Single Post - 27 May BA IT outage miscellaneous discussions thread
May 27, 2017, 1:50 pm
  #29  
dakaix
 
Join Date: Aug 2015
Location: Somewhere around Europe...
Programs: BA Gold; MB Ti; HH Diamond; IHG Plat; RR Gold
Posts: 530
Originally Posted by Banana4321
If you think that BA is unique in being caught out by this then you are making an ill-informed judgment, although theirs is large, and public - which is most unfortunate for them. I have been closely involved in the annual failover "rehearsals" for organisations with a more critical service to the public than an airline. None of them were 100% successful. There was always a list of things to fix. Next year, due to the changes we had made over the course of that year, there was a list of different things to fix. And we didn't test firing a lightning bolt at the power supply (say); we did only what was necessary to send a letter to the regulatory body saying "we did that and passed". Also, sometimes when we "flicked the switch" something unexpected and bad happened; in those cases we flicked the switch back and tried again in a few weeks' time.

That is the reality.
Glad I'm not the only one to have had that experience. The reality is that while many companies try (read: feign) to prepare for DR, few have a documented plan, and fewer still actually perform any form of testing (let alone regular testing).

Sure, you can test an application in isolation, maybe even one rack, but until you practically verify a complete data centre or geographical outage, there's simply no way to know for certain. Will routing work if the primary site is off? Are we sure all the fibers connecting the sites are diverse (so someone in a JCB doesn't knock out all our connectivity at a stroke)? If a site is entirely cratered, how do we go about recovering? Would we have any staff left to do it?

The fact is very few companies would pass all of those criteria. At some point someone says "well, is this really worth the cost?", because it's not tangible to them.

The first company I worked for was a data centre operator, and they had a bi-weekly practice of performing "black building testing". As the name suggests, they would literally disconnect the power supply to prove that everything worked: generators, UPS, failover, etc. Theory is all fine, but it means nothing unless you test it. Regularly.

Last edited by dakaix; May 27, 2017 at 1:55 pm