Originally Posted by
corporate-wage-slave
The February one was apparently due to a server hardware failure - some disks burnt out. The failure this week was quite short - 2 hours - which suggests software, middleware or database failure, speaking as an armchair CIO.
I am sorry cws, but...
some disks burning out should not cause the complete loss of IT service.
It points to a lack of proper investment in the infrastructure and too many single points of failure.
I'd also have thought an org like BA should have some contingency that would allow them to keep most services running if IT fails for a couple of hours. OK, maybe taking new bookings would be a problem, but they should have manual fallback procedures to get most flights away.
The current system I am building has 4 replicas, as we cannot risk losing our business for as long as BA apparently finds acceptable, judging by what they have built.
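The replica approach can be sketched as simple client-side failover: try each replica in order and use the first one that passes a health check. This is a minimal illustration, not BA's (or my) actual setup; the endpoint names and the health-check predicate are hypothetical.

```python
def first_healthy(replicas, is_healthy):
    """Return the first replica endpoint that passes the health check,
    or None if every replica is down."""
    for endpoint in replicas:
        if is_healthy(endpoint):
            return endpoint
    return None

# Hypothetical four-replica deployment; in practice is_healthy would be
# a real probe (TCP connect, HTTP /health, etc.) with a short timeout.
replicas = ["db-a", "db-b", "db-c", "db-d"]
```

With four replicas, a client only loses service when all four fail the check, which is the whole point of paying for the redundancy.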
If your backup takes "too long" to recover your systems, then your HA/DR strategy is wrong.
One massive failure I could maybe understand, but there has clearly been no review of the system and no steps taken to correct it, which suggests management believe the cost of these repeated extensive disruptions is more acceptable than the cost of fixing the problem.