(Seriously) What should DL have done?
Jul 24, 2024 | 11:15 am
  #30  
phongn
 
Join Date: Oct 2012
Posts: 17
Originally Posted by FlyingUnderTheRadar
As such, my magical butt wonders how many of these "mission critical" organizations have a test platform that is used to test updates?

FWIW I have auto-updates turned off on all of my personal computers for all software.
In general, yes: these organizations typically have staging environments for incoming updates, and for the EDR software in question you can pin sensor updates to the N-1 or N-2 release. However, CrowdStrike's rapid-response content updates were simply pushed out globally, bypassing both the N-1/N-2 policy and any staging environment. Given the nature of the bug, it's entirely possible it never surfaced in their woefully insufficient internal testing; they were also relying on a content validator which, in this case, missed the defect. They say they will now fix this process, including canary deployments to subsets of the fleet first, to avoid another catastrophic global outage.
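To make the distinction concrete, here's a rough sketch in Python of the difference between a version-pinned sensor policy, the kind of push-everywhere content update that caused the outage, and the canary gate they say they'll add. All names, version numbers, and the health check are invented for illustration; this is not CrowdStrike's actual mechanism or API.

Code:
# Hypothetical sketch of staged vs. global rollout; all names invented.
from dataclasses import dataclass

@dataclass
class Host:
    name: str
    update_policy: str  # "latest", "n-1", or "n-2"

SENSOR_RELEASES = ["7.14", "7.15", "7.16"]  # oldest -> newest (illustrative)

def sensor_version_for(host: Host) -> str:
    """Sensor updates honor the N-1/N-2 policy an admin selects."""
    offset = {"latest": 0, "n-1": 1, "n-2": 2}[host.update_policy]
    return SENSOR_RELEASES[-(offset + 1)]

def deploy_and_health_check(host: Host) -> bool:
    print(f"deploying content to {host.name}")
    return True  # stand-in for a real post-deploy health check

def push_content_update(fleet: list[Host], canary_fraction: float = 0.0):
    """Rapid-response content ignored the policy: every host got it at once.
    A canary gate deploys to a subset first and halts on failure."""
    if canary_fraction > 0:
        n = max(1, int(len(fleet) * canary_fraction))
        canary, fleet = fleet[:n], fleet[n:]
        if not all(deploy_and_health_check(h) for h in canary):
            print("canary failed; halting global rollout")
            return
    for host in fleet:
        deploy_and_health_check(host)

fleet = [Host("gate-01", "n-1"), Host("ops-02", "n-2"), Host("crew-03", "latest")]
for h in fleet:
    print(h.name, "runs sensor", sensor_version_for(h))
push_content_update(fleet, canary_fraction=0.34)

The point of the sketch: the N-1/N-2 knob only ever gated the first path (sensor binaries), while the content push took the second path with canary_fraction effectively zero.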

As pointed out in other threads here and elsewhere, for systems that would otherwise be a single point of failure (SPOF), enterprises may run multiple vendors so that one vendor's bad update can't take everything down. This is not CrowdStrike's first (or likely last) time breaking customer systems.

Originally Posted by 32LatT10
This disconnect between business continuity (the ability to sustain business processes and operations) and disaster recovery (the ability to recover IT systems) has happened before and will certainly occur again (broadly speaking, not just DL).
I've participated in and helped build out business continuity planning, and sat my butt down on weekends to observe actual failover testing in production, and it takes quite a bit of work. For our business, even the lightspeed delay between geographically separated facilities becomes a nontrivial issue.
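For a sense of scale, a quick back-of-the-envelope (my illustrative numbers, not our actual topology):

Code:
# Back-of-the-envelope propagation delay between two sites; figures illustrative.
SPEED_IN_FIBER_KM_S = 200_000  # light in glass travels at roughly 2/3 of c

def one_way_ms(distance_km: float) -> float:
    return distance_km / SPEED_IN_FIBER_KM_S * 1000

d = 1_500  # e.g. two datacenters ~1,500 km apart
print(f"one-way:    {one_way_ms(d):.1f} ms")     # ~7.5 ms
print(f"round trip: {2 * one_way_ms(d):.1f} ms") # ~15 ms
# A synchronous replication commit pays at least one round trip, so
# per-transaction overhead grows directly with how far apart the sites are.

That ~15 ms floor per synchronous commit is physics, not something you can tune away, which is why geographically separated DR sites force real tradeoffs between consistency and throughput.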