We received many inquiries about the CrowdStrike incident, how to deal with it, and what it could mean for the future. We thought it would be valuable to share our analysis.
Executive Summary
The CrowdStrike incident was caused by rare technical circumstances, combined with what seems to be a seriously deficient QA and release process at CrowdStrike. We detail some of the more technical aspects of what happened and how the process should have worked. Good incident response and disaster recovery plans can minimize damage even in such extreme cases, so we also briefly outline a best-practice approach to both.
Introduction
The CrowdStrike incident has raised many questions and concerns among users and industry professionals. Understanding what went wrong and how to prevent similar occurrences in the future is crucial. This article provides a detailed analysis of the CrowdStrike bug, the technical aspects involved, and recommendations for robust incident response and disaster recovery practices.
CrowdStrike Bug Analysis
What Do We Know About the Incident
The issue began with an update pushed automatically to CrowdStrike’s Falcon endpoint protection software, affecting approximately 8.5 million Windows systems. The update triggered a bug in a CrowdStrike component running in Windows Kernel Space, causing an immediate system crash and a Blue Screen of Death (BSOD) that recurred after every restart.
Why Did the Systems Crash
Kernel Space is where the core of the Windows operating system runs, managing the hardware and the resources that applications use. For security and stability reasons, it is segregated from User Space, the more restricted environment in which most applications run, so that a failure in one application does not affect other applications or the OS itself. A fault in Kernel Space has no such containment and brings down the whole system. Only certain kinds of software, such as antivirus products, some VPNs, and virtual machine platforms, run components in Kernel Space.
In this case, the faulty component was loaded during the initial phase of the boot process, so the system crashed again immediately on every restart.
Image by Bobbo – Own work, CC BY-SA 3.0
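The containment that User Space provides can be illustrated with a small sketch. It is only an analogy, assuming an ordinary Python interpreter: the helper name run_isolated and the crashing snippet are hypothetical and have nothing to do with CrowdStrike's code. A child process aborts abruptly, and the parent simply observes a non-zero exit code and keeps running. A component in Kernel Space has no equivalent safety net.

```python
import subprocess
import sys

# Hypothetical stand-in for a buggy user-space component: it aborts abruptly.
CRASHING_SNIPPET = "import os; os.abort()"


def run_isolated(code: str) -> int:
    """Run a Python snippet in a separate user-space process.

    Returns the child's exit code. A crash in the child does not
    propagate to the caller; it only shows up as a non-zero code.
    """
    result = subprocess.run(
        [sys.executable, "-c", code],
        stderr=subprocess.DEVNULL,  # hide the child's crash output
    )
    return result.returncode


def demo() -> str:
    """Crash a child process and report that the parent survived."""
    exit_code = run_isolated(CRASHING_SNIPPET)
    if exit_code != 0:
        return "child crashed, parent still running"
    return "child exited normally"
```

Calling demo() returns a normal string even though the child process died, which is the behavior the User Space boundary is designed to guarantee.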
How Did It Happen
The CrowdStrike incident is a rare case, caused by a combination of characteristics specific to the CrowdStrike software (an antivirus product running in Kernel Space, with the same software installed on a very large number of computers) and a failure to adhere to basic software development life cycle practices:
- QA Process: Basic QA appears not to have been performed, since the bug immediately affected nearly all users who received the update. Proper QA involves multiple stages and extensive testing in varied environments that simulate client environments.
- Release Process: Releasing updates to all clients simultaneously is risky because it impacts everyone if something goes wrong. Gradual rollouts are preferred, involving stages like internal alpha, beta, limited release, and general availability.
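The gradual rollout described above can be sketched as a loop that widens the audience stage by stage and halts as soon as a stage looks unhealthy. This is a minimal sketch under stated assumptions, not a description of any vendor's release tooling: the stage fractions, the failure budget, and the deploy and is_healthy callbacks are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Stage:
    name: str
    fraction: float  # cumulative share of the fleet reached by this stage


# Stage names follow the list above; the fractions are assumed values.
STAGES = [
    Stage("internal alpha", 0.001),
    Stage("beta", 0.01),
    Stage("limited release", 0.10),
    Stage("general availability", 1.0),
]


def staged_rollout(
    hosts: List[str],
    deploy: Callable[[str], None],
    is_healthy: Callable[[str], bool],
    max_failure_rate: float = 0.01,  # assumed failure budget per stage
) -> Tuple[List[str], Optional[str]]:
    """Deploy stage by stage and stop at the first unhealthy stage.

    Returns (names of completed stages, name of the halted stage or None).
    """
    done = 0  # number of hosts already updated
    completed: List[str] = []
    for stage in STAGES:
        # Each stage reaches at least one new host, never more than the fleet.
        target = min(len(hosts), max(done + 1, int(len(hosts) * stage.fraction)))
        batch = hosts[done:target]
        if not batch:
            continue
        for host in batch:
            deploy(host)
        failures = sum(1 for host in batch if not is_healthy(host))
        if failures / len(batch) > max_failure_rate:
            return completed, stage.name  # halt: later stages never get the update
        completed.append(stage.name)
        done = target
    return completed, None
```

With a fleet of 1,000 hosts and an update that breaks every machine it touches, this sketch stops after the first stage, with a single host affected instead of the whole fleet.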
Mitigation Strategies for Similar Future Incidents
Prepare an Incident Response Plan
An effective incident response plan helps quickly detect and mitigate incidents, minimizing damage and reducing stress caused by confusion about roles and responsibilities. This plan should outline the steps to take, who is responsible, and how to communicate effectively during an incident.
Install Disaster Recovery Measures
Disaster recovery measures are essential to minimize data loss and business disruption. This involves having backup plans for hardware, software, and data, and a clear plan of action in case of an incident. Regular backups, combined with regular testing of the recovery procedures, ensure that data can be restored quickly and accurately.
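Testing a recovery procedure can start as simply as restoring a backup to a scratch location and checking that it matches the original. The sketch below assumes file-level backups on a local disk. The function names backup_file and restore_test are hypothetical, and a real setup would typically use dedicated backup tooling and off-site storage.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()


def backup_file(source: Path, backup_dir: Path) -> Path:
    """Copy a file into the backup directory and return the backup's path."""
    backup_dir.mkdir(parents=True, exist_ok=True)
    target = backup_dir / source.name
    shutil.copy2(source, target)
    return target


def restore_test(source: Path, backup: Path) -> bool:
    """Restore the backup to a scratch directory and compare it with the source.

    Returns True only if the restored copy is byte-for-byte identical.
    """
    with tempfile.TemporaryDirectory() as scratch:
        restored = Path(scratch) / backup.name
        shutil.copy2(backup, restored)
        return sha256_of(restored) == sha256_of(source)
```

Running a check like restore_test on a schedule turns "we have backups" into "we have backups that are known to restore", which is the property that matters during an incident.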
Conduct Regular Drills, Simulations, and Tabletop Exercises
Regular incident response and disaster recovery drills help keep the team prepared. These exercises can be based on risk scenarios identified during risk assessments. The quicker the response, the less damage incurred. Practicing with drills and simulations ensures that everyone knows their role and the procedures to follow, reducing the time taken to resolve an incident.
Conclusion
The CrowdStrike incident serves as a critical reminder of the importance of stringent QA and release processes, alongside well-prepared incident response and disaster recovery plans. By adopting these best practices, organizations can significantly reduce the risk and impact of similar incidents, ensuring greater resilience and reliability in their operations.
We’re here for any questions or assistance in mitigating the situation or preparing for future resilience.