A buggy "security content configuration update" to CrowdStrike's Falcon sensor, intended to gather telemetry on novel threat techniques for Windows, has been confirmed as the root cause of the crash that took down computers around the world on July 19 and is still affecting global IT teams, the vendor says.
CrowdStrike — which has been thrust into the spotlight in the last week for all the wrong reasons — released a "preliminary Post Incident Review (PIR)" today identifying a defect in a Rapid Response Content configuration update as the reason for the global incident, which caused massive disruptions to business continuity and headaches for travelers, hospital patients, and business professionals alike.
These kinds of updates are one of the ways that CrowdStrike, which provides some 29,000 customers with cloud-based software for endpoint detection and response (EDR), delivers new security content to its software, and are "a regular part of the dynamic protection mechanisms of the Falcon platform," according to the PIR. Rapid Response Content specifically updates CrowdStrike's software with the latest threat intelligence, designed "to respond to the changing threat landscape at operational speed," according to the report.
"When received by the sensor and loaded into the Content Interpreter, problematic content in Channel File 291 resulted in an out-of-bounds memory read triggering an exception," according to CrowdStrike. "This unexpected exception could not be gracefully handled, resulting in a Windows operating system crash (BSOD)."
CrowdStrike also used the release of the report to take to social media to apologize yet again for the outage, which many organizations are still in the process of mitigating.
"We can’t repeat enough, we’re aware of the impact and deeply sorry this occurred," the company posted on social media platform X. "We want to thank our customers and industry partners for their support and assistance following the release of a faulty content update. We know what happened and how to make sure it doesn’t happen again."
The Update Heard Round the World
Indeed, the report details, step by step, the lead-up to Friday's incident and its immediate aftermath, as well as how the company is responding to the issue to prevent a repeat performance.
CrowdStrike uses Rapid Response Content to perform behavioral pattern-matching operations on the Falcon sensor; the content itself is a representation of fields and values, with associated filtering. The content is stored in a proprietary binary file that contains configuration data.
When updating the sensor, the content is delivered as what are called Template Instances, each of which maps to specific behaviors for the sensor to observe, detect, or prevent. The instances have a set of fields that can be configured to match the desired behavior.
"In other words, Template Types represent a sensor capability that enables new telemetry and detection, and their runtime behavior is configured dynamically by the Template Instance," CrowdStrike explained in the post.
The sensor involved in the incident, sensor 7.11, was put into customers' production environments on Feb. 28, introducing a new IPC Template Type to detect novel attack techniques that abuse Named Pipes, which in computing are a method of interprocess communication.
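As a rough illustration of the Template Type and Template Instance relationship described above, the following Python sketch models a type that names the configurable fields and an instance that supplies match criteria for them. It is not CrowdStrike's implementation: the class names, field names, wildcard convention, and pipe name are all hypothetical and chosen only to make the idea concrete.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TemplateType:
    """A sensor capability: names the fields an instance may configure."""
    name: str
    fields: tuple[str, ...]


@dataclass(frozen=True)
class TemplateInstance:
    """Content that configures a TemplateType with per-field match criteria."""
    template: TemplateType
    criteria: dict[str, str]  # field name -> required value; "*" matches anything
    action: str               # e.g. "observe", "detect", or "prevent"

    def matches(self, event: dict[str, str]) -> bool:
        # Every configured field must appear in the event and satisfy its criterion.
        return all(
            field in event and (want == "*" or event[field] == want)
            for field, want in self.criteria.items()
        )


# Hypothetical template type and instance; the field names are invented.
ipc = TemplateType(name="ipc", fields=("pipe_name", "process"))
rule = TemplateInstance(
    template=ipc,
    criteria={"pipe_name": r"\\.\pipe\example", "process": "*"},
    action="detect",
)

event = {"pipe_name": r"\\.\pipe\example", "process": "tool.exe"}
print(rule.action if rule.matches(event) else "no match")
```

The point of the split is that the type ships with the sensor code, while instances can be pushed later as data to change what the sensor looks for without a new sensor release.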
CrowdStrike stress-tested the template in its staging environment on March 5. The template passed, was validated for use, and was then released to production as part of a content configuration update. The company then deployed three additional IPC Template Instances between April 8 and April 24 that "performed as expected in production."
Assuming nothing was amiss, CrowdStrike deployed two more IPC Template Instances on Friday. However, because of a bug in the Content Validator, one of the two Template Instances passed validation "despite containing problematic content data," the company said. This content triggered the exceptional behavior that subsequently led to the global Windows crash.
CrowdStrike Response & Mitigation Continues
CrowdStrike remains in the hot seat — quite literally, as the company's CEO George Kurtz has been called on to testify before Congress about the incident — and has considerable work to do to salvage its reputation in the wake of the incident, notes David Ferbrache, managing director of Beyond Blue, a cybersecurity and resilience consulting firm.
"Until last Friday, few outside of the security and technology industries had heard of CrowdStrike; now, the company has been catapulted into the conversations of consumers and business leaders alike," he observes. "How could a company hired to protect the digital world bring down over 8.5 million machines with a single action?"
Clearly understanding that torching global IT infrastructure with a bad update is not a good look for a security firm, CrowdStrike outlined the measures it's taking to improve testing and deployment around its Rapid Response Content updates.
For one, it plans to add a variety of new testing to its menu before deploying updates in the future — including local developer testing; content update and rollback checks; stress tests and fuzzing; and fault injection, stability, and content interface testing.
Further, CrowdStrike will add new validation checks to the Content Validator for Rapid Response Content "to guard against this type of problematic content from being deployed in the future," the company said, as well as enhance existing error handling in the Content Interpreter.
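The two safeguards named here, stricter validation before deployment and more forgiving error handling at load time, can be sketched generically in Python. This is an assumption-laden illustration rather than a description of the Content Validator or Content Interpreter: the function names, the idea of content as a flat list of string values, and the specific checks are all invented for the example.

```python
class ContentError(Exception):
    """Raised when content does not fit what the interpreter expects."""


def validate(instance_values: list[str], expected_fields: int) -> list[str]:
    """Return a list of problems; an empty list means the content passed."""
    problems = []
    if len(instance_values) != expected_fields:
        problems.append(
            f"expected {expected_fields} values, got {len(instance_values)}"
        )
    for i, value in enumerate(instance_values):
        if not isinstance(value, str) or value == "":
            problems.append(f"value {i} is empty or not a string")
    return problems


def read_field(instance_values: list[str], index: int) -> str:
    """Bounds-checked read: raise a catchable error instead of reading past the end."""
    if not 0 <= index < len(instance_values):
        raise ContentError(f"field index {index} out of range")
    return instance_values[index]


def interpret(instance_values: list[str], expected_fields: int) -> str:
    """Skip bad content rather than letting the failure propagate."""
    try:
        fields = [read_field(instance_values, i) for i in range(expected_fields)]
    except ContentError as err:
        return f"skipped: {err}"
    return "loaded: " + ",".join(fields)


good = ["a", "b", "c"]
bad = ["a", "b"]  # one value short of what the interpreter will read
print(validate(good, 3), validate(bad, 3))
print(interpret(good, 3))
print(interpret(bad, 3))
```

The two layers are deliberately redundant: if a validator bug lets malformed content through, the bounds check in the reader still turns the failure into a skipped update rather than a crash.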
The company also will make changes to deployment of content updates, staggering the process so they are gradually deployed to larger portions of the sensor base, "starting with a canary deployment." It also will improve monitoring for both sensor and system performance to collect feedback during deployment "to guide a phased rollout."
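A staggered deployment of this kind can be sketched as a loop that pushes an update to progressively larger slices of the fleet and stops as soon as a slice looks unhealthy. The Python below is a minimal sketch under stated assumptions: the stage fractions, the zero-tolerance failure threshold, and the `deploy` callback (which stands in for pushing an update and collecting health feedback) are hypothetical, not values from the report.

```python
from typing import Callable


def staged_rollout(
    hosts: list[str],
    deploy: Callable[[str], bool],
    stages: tuple[float, ...] = (0.01, 0.10, 0.50, 1.0),
    max_failure_rate: float = 0.0,
) -> tuple[str, int]:
    """Deploy to growing fractions of hosts, halting if a stage is unhealthy.

    Returns (status, number of hosts the update was attempted on).
    """
    done = 0
    for fraction in stages:
        # Cumulative target for this stage; always at least one new host (the canary).
        target = min(len(hosts), max(done + 1, round(len(hosts) * fraction)))
        batch = hosts[done:target]
        if not batch:
            break
        failures = sum(1 for host in batch if not deploy(host))
        done = target
        if failures / len(batch) > max_failure_rate:
            return ("halted", done)
    return ("complete", done)


fleet = [f"host-{i}" for i in range(200)]
print(staged_rollout(fleet, deploy=lambda host: True))
print(staged_rollout(fleet, deploy=lambda host: False))
```

With a fleet of 200 and these stages, a bad update is stopped after the first two canary hosts instead of reaching all 200, which is the trade the phased approach makes: slower delivery in exchange for a much smaller blast radius.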
Further and perhaps most importantly, according to Ferbrache, CrowdStrike will provide customers with greater control over the delivery of future content updates by allowing granular selection of when and where these updates are deployed, as well as provide content update details via release notes through a subscription.
This may help mitigate the inherent risk that comes with rapid automated updates in live production environments, giving organizations the ability to control how those updates are applied. It also would allow them to balance the risk of deferring an update and leaving a potential security gap against the risk of immediate application, Ferbrache says.
He adds: "This is a fine balance, and sophisticated customers need to be able to strike that balance."