FlyerTalk Forums - View Single Post - (Seriously) What should DL have done?

Jul 24, 2024 | 8:45 am

#16

third_wave

Join Date: Jan 2015

Location: USA

Programs: DL PM

Posts: 243

Quote:

Originally Posted by Bbcei

This is not helpful. Yes, telling the truth and empowering local employees to be flexible is great for dealing with a canceled flight at your local airport. But at Delta's scale, these measures are utterly insufficient for addressing a massive systemic issue like the CrowdStrike-induced outage.

What Delta needed was a substantial, long-term investment in its IT infrastructure, including (a) developing robust systems and (b) creating comprehensive contingency plans for major failures that are regularly tested with simulated failures. Companies like Google, Meta, and Amazon do not rely on "typewriters" or "carbon paper" or having their executive staff "jump into the trenches" to deal with system failures; instead, they continuously invest vast amounts in their IT services so their systems are highly resilient, and they regularly conduct dry runs of rigorous disaster recovery plans, so their systems can quickly and efficiently recover with minimal manual work.

Southwest saw the consequences of neglecting IT in the 2022 holiday meltdown. Delta is seeing it this year. The solution is pretty clear: IT must be treated not as just a cost center, but as a top-tier strategic priority, second only to safety. (Though the two should not be in tension, as a resilient IT system improves safety.) This means hiring and empowering the right people—experts from top-tier tech companies instead of MBAs—and ensuring they have enough resources to accomplish their goals. No airline IT system is going to be as resilient as Google's, but Delta can and should get a lot closer to Google-level resiliency from where it is right now.

I was going to write up a reply to this effect, but I could not have written it better myself. Once the sh*t hit the fan, there was very little that could be done to restore order immediately, hence the multi-day recovery process. The issue was a lack of redundancy and stress testing in their IT systems.
