FlyerTalk Forums - View Single Post - Travel Waiver: Crowdstrike Outage 18 July --UA/AA/DL Groundstops,operations resuming
Old Jul 19, 2024 | 10:21 am
  #59  
lincolnjkc
Join Date: Feb 2005
Location: CLE, DCA, and 30k feet
Programs: Honors LT Diamond; United 1K 1MM; Hertz PC
Posts: 5,620
Originally Posted by jsloan
While multi-cloud / multi-platform is recommended by people who have a vested interest in selling more engineering time (or more cloud hardware), there are few commonalities between the various cloud providers and various server OSes. So to avoid putting all of their eggs in Microsoft's basket, UA would have to increase their IT spending by at least 2x, but probably more like 5x, because you've created innumerable new failure points with all of the cross-cloud / cross-platform communication you need. And you pretty much have to restrict yourself to the least common denominator among your platforms -- so if you're aiming to run on both Google Cloud and Amazon Web Services, for example, you can't use any AWS functionality that doesn't have a direct Google Cloud analogue.

That's why this outage is so vast -- virtually nobody makes that investment, because the ROI is so amazingly low. Today, it would have been great, but up until now, it would have caused nothing but problems.
Not to mention that the most impactful part of the error appears to have been on end-user compute devices, where realistically there aren't a whole lot of options if you don't want to be locked into Apple hardware and don't want to DIY it with Linux (not to mention training, enterprise manageability, etc.)... It is certainly possible, but if the ROI paid out, the airlines would have already moved that way.

(Yes, Microsoft did have an issue with Azure impacting some services in close temporal proximity, but that appeared to be isolated to one DR region, and they were "encouraging" customers, my employer included, to fail over to other regions. We didn't, because nothing I manage in Azure was affected, but a failover, when properly implemented, should be virtually seamless.)

Of course that doesn't matter if you've pushed a faulty update to every end user compute device in the org...