27 May BA IT outage miscellaneous discussions thread
#286
FlyerTalk Evangelist
Join Date: Mar 2013
Location: London
Posts: 17,007
Undoubtedly there were other contributing factors. In fact we know there were, as the backup solution didn't come online as expected. However, I was replying to jerub, who contends that the root cause was something that happened later in the sequence of events than the power failure/surge.
#287
FlyerTalk Evangelist
Join Date: Mar 2010
Location: JER
Programs: BA Gold/OWE, several MUCCI, and assorted Pensions!
Posts: 32,144
My feeling, from several posts, is that something went horribly wrong during the recovery/reboot process. Which, in an ideal world, it shouldn't!
#288
FlyerTalk Evangelist
Join Date: Mar 2013
Location: London
Posts: 17,007
Yes, I was obviously being simplistic. I recall my sense of panic when my Mac backup started making protest noises, but at least I was able (as an old git) to carefully work my way around that.
#289
Join Date: Apr 2014
Location: London
Programs: Don't even mention it. Grrrrrrr.
Posts: 968
Not that I am authorized to speak for jerub, but I think she/he is contending that the root cause occurred earlier than the power surge, i.e., that a failure to arrange or maintain the backup systems properly is the root cause. Such a failure makes an outage an inevitability; it is then a simple matter of timing.
As for "inevitability", everything happens in the end.
#290
Join Date: Nov 2015
Posts: 158
Yeah, the root cause of the service failure will be something like a process failure to ensure backup systems were functioning, or a process failure to implement correct backup procedures.
You could try to trace those back further to negligence or something else such as incompetence, but going further will be difficult. It also depends on who is doing the investigation...
Root cause can't be the power failure as that's a trigger for a known failure mode.
#291
Join Date: Aug 2012
Posts: 2,676
Of course - it won't always protect you. For example - a company I know had an outage due to a timezone issue. They had tested the DR when the UK and US were both off DST (literally the week before a DR event). Failover failed when the US was on DST and the UK wasn't.
So you can't account for every variable, but the failover should have been tested regularly.
Assuming it was tested - then the question is why did it fail THIS time?
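To make the DST example concrete, here is a rough sketch of how the UK/US clock gap shifts for a couple of weeks each spring. The dates are hypothetical test dates picked for illustration (not from the incident described), and it assumes Python 3.9+ with the system timezone database available.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

LONDON = ZoneInfo("Europe/London")
NEW_YORK = ZoneInfo("America/New_York")


def offset_gap_hours(when_utc: datetime) -> float:
    """Hours by which London's UTC offset exceeds New York's at a given instant."""
    london = when_utc.astimezone(LONDON).utcoffset()
    new_york = when_utc.astimezone(NEW_YORK).utcoffset()
    return (london - new_york).total_seconds() / 3600


# Hypothetical dates: in 2017 the US switched to DST on 12 March,
# while the UK did not switch until 26 March.
aligned = datetime(2017, 3, 6, 12, tzinfo=timezone.utc)      # neither on DST
misaligned = datetime(2017, 3, 20, 12, tzinfo=timezone.utc)  # US on DST, UK not

print(offset_gap_hours(aligned))     # 5.0
print(offset_gap_hours(misaligned))  # 4.0
```

Anything in a failover path that hard-codes "the US site is five hours behind" passes a test on the first date and breaks on the second, which is why a single DR test proves less than regular ones.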
#292
FlyerTalk Evangelist
Join Date: Mar 2010
Location: JER
Programs: BA Gold/OWE, several MUCCI, and assorted Pensions!
Posts: 32,144
True. Since that hiccup some months ago, I now check (occasionally) to ensure the silent background backups are happening as they should. A while later, the Mac died, and restoration to the new machine was painless!!
#293
Join Date: Oct 2008
Location: Isle of Skye, Scotland
Programs: BA gold
Posts: 3,902
#294
FlyerTalk Evangelist
Join Date: Mar 2010
Location: JER
Programs: BA Gold/OWE, several MUCCI, and assorted Pensions!
Posts: 32,144
#295
FlyerTalk Evangelist
Join Date: Mar 2013
Location: London
Posts: 17,007
^ You know, I hear BA may have an opening in their IT department for a redundancy engineer, if you could be tempted out of retirement.
#296
FlyerTalk Evangelist
Join Date: Mar 2010
Location: JER
Programs: BA Gold/OWE, several MUCCI, and assorted Pensions!
Posts: 32,144
If that was the case, I could stop this iPad forever having to reload pages. I'm sure it's FT's endless adverts, for bloody Condor Ferries, combined with a 10 Mbps connection, that's causing it.
Every time I try to expand the text so that I can read it ... "Reload"!
#297
Join Date: Jul 2007
Posts: 66
The importance of chasing 5 nines seems to have been lost on BA until this situation. Roughly five minutes of downtime a year seems to be outside of their previous understanding; I hope the lesson of WHY funding 5 nines is an organizational imperative has now been fully learned.
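For reference, the downtime budget behind "N nines" is simple arithmetic. A quick sketch, assuming a 365.25-day year and counting all downtime, planned or not (the function name is just for illustration):

```python
MINUTES_PER_YEAR = 365.25 * 24 * 60  # 525,960 minutes


def downtime_minutes_per_year(nines: int) -> float:
    """Allowed downtime per year for an availability of N nines (e.g. 5 -> 99.999%)."""
    unavailability = 10 ** -nines  # 5 nines -> 0.00001
    return MINUTES_PER_YEAR * unavailability


for n in (3, 4, 5):
    print(f"{n} nines: {downtime_minutes_per_year(n):.2f} minutes/year")
```

Five nines works out to a little over five minutes a year, four nines to a bit under an hour, and three nines to most of a working day.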
#298
Join Date: Aug 2012
Location: Provincie Antwerpen, Vlaanderen, België
Programs: MUCCI Gold
Posts: 2,512
#299
Join Date: Jul 2007
Posts: 66
*laughing* We are agreed on your point. It usually takes a brutal financial slap to wake large companies up to how insane they've let their IT get via primary usage of duct tape and spit fixes.
#300
Join Date: Sep 2013
Programs: BAEC Gold, EK Skywards (enhanced Blue !), Oman Air Sindbad Gold
Posts: 6,398
BBC Radio 5 Live have devoted a lot of time to BA during the course of this afternoon, with their coverage having now shifted firmly to issues surrounding handling & communications, rather than the specifics of cancellations/ delays.
Also included clips from Alex Cruz interview earlier today.
There have been contributions from affected passengers, BBC presenters, guest journos, and consumer/aviation 'specialists'.
The criticism throughout has been unfailingly relentless, and all centred on the standard of communication - described in one piece as 'just appalling.'
I think it is this matter of their communication & overall response to affected passengers which will prove most damaging - rather than the outage itself (the latter being something that many - perhaps even most - people might have been willing to accept)