Just over a week ago, the tech world witnessed what could easily be dubbed the largest IT outage in history. While many are scrambling to sweep this monumental mess under the rug, we’re not about to let that happen. Sure, Agio dodged this bullet, but let’s not kid ourselves: this was a wake-up call, and it’s time we all sat up and paid attention.

Whether you’re breathing a sigh of relief or still nursing a migraine from that infamous blue screen of death that greeted you on what should have been just another Friday morning, there’s a boatload of lessons here that we can’t afford to ignore. 

What happened?  

On July 19, 2024, a global IT outage was triggered by a faulty update to the CrowdStrike Falcon sensor for Windows. The update involved a configuration file (Channel File 291) designed to enhance the sensor’s behavioral protection mechanisms. However, the update contained a logic error that caused systems running the Falcon sensor to crash, leading to widespread disruptions. This outage affected numerous sectors, including airlines, banks, media companies, and critical infrastructure. 

Microsoft estimates the update affected 8.5 million Windows devices. While Microsoft is framing this as “only” less than one percent of devices, millions of Windows machines going full-blown digital blackout is no hiccup. It’s important to recognize the severity of the outage, and the risks that arise from global outages such as this.  

About 10% of our client base felt the sting of this faulty update throughout the day, primarily through their vendors. But here’s where preparation met opportunity: Agio had already done its homework. We hit the ground running with remediation steps ready to go, guiding our clients through the issue. It wasn’t luck. It was readiness that made the difference between a full-blown crisis and a manageable hiccup for our clients.

Was this a cybersecurity incident? 

The CrowdStrike incident was not a cybersecurity incident in the traditional sense of a malicious attack or breach. Although it caused significant disruptions across sectors including airlines, banks, and media, it did not involve unauthorized access, data theft, or exploitation by cybercriminals. It was a technical failure within a routine software update process, and it should be understood as a critical IT management and operational issue rather than a cybersecurity breach.

It’s important to clarify that in situations like this, many organizations may think the best response is to uninstall the problematic platform. However, this is not best practice, especially when the platform is a crucial endpoint protection tool like CrowdStrike. The most effective approach is to resolve the issue while maintaining your security posture. 

Who is CrowdStrike?

CrowdStrike is a leading American cybersecurity company, based in Austin, Texas. Established in 2011, it specializes in cloud workload protection and endpoint security. CrowdStrike’s flagship product, Falcon, leverages artificial intelligence (AI) to detect and prevent breaches in real-time. The company is renowned for its advanced threat intelligence and comprehensive approach to cybersecurity, which includes threat hunting and incident response services. 

What is CrowdStrike Falcon?

CrowdStrike Falcon is a cloud-native endpoint protection platform. It is designed to protect against a broad range of cyber threats by leveraging AI, machine learning, and behavioral analytics. Falcon monitors and analyzes endpoint activity to identify suspicious patterns that could indicate an attack. The platform provides continuous, real-time monitoring to prevent breaches and minimize the impact of any security incidents.

Why was it reported as a Microsoft event? 

While the issue originated from a CrowdStrike update, it was widely perceived as a Microsoft event because the affected systems were primarily Windows-based. Moreover, Microsoft’s ecosystem and extensive user base amplified the visibility and impact of the outage. Despite this perception, the root cause was isolated to the CrowdStrike Falcon sensor update. 

As if the CrowdStrike outage wasn’t enough, a Microsoft Azure outage took out critical services around the world at the same time. These two digital disasters appeared entirely unrelated, creating a nightmare scenario where it was difficult to determine which outage was causing each problem, or if we were dealing with a problematic combination of both. 

Why didn’t this impact non-Microsoft devices? 

The outage did not impact macOS or Linux devices because the faulty update was specific to the Falcon sensor for Windows. The configuration file that caused the crash was only deployed to Windows systems running Falcon sensor version 7.11 and above. Consequently, devices running other operating systems were unaffected by the issue. 

How was it resolved?  

CrowdStrike quickly identified the faulty configuration file and provided a fix. For machines that had already crashed, remediation involved manually deleting the problematic channel file from each affected system. IT teams across the globe, including Agio’s, went into action to carry out these very manual steps and restore impacted systems. CrowdStrike provided detailed instructions and technical support to help teams reach a resolution.
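To make the manual step concrete, here is a minimal Python sketch of the file-matching logic. It is an illustration only, not Agio’s or CrowdStrike’s tooling. The pattern `C-00000291*.sys` comes from CrowdStrike’s public remediation guidance; the function names and the `dry_run` flag are hypothetical, and the directory is passed in rather than hard-coded.

```python
from pathlib import Path

# File pattern named in CrowdStrike's public guidance for Channel File 291.
CHANNEL_FILE_PATTERN = "C-00000291*.sys"


def find_faulty_channel_files(driver_dir: Path) -> list[Path]:
    """Return the files in driver_dir whose names match the faulty pattern."""
    return sorted(p for p in driver_dir.glob(CHANNEL_FILE_PATTERN) if p.is_file())


def remove_faulty_channel_files(driver_dir: Path, dry_run: bool = True) -> list[Path]:
    """Return the matching files, deleting them only when dry_run is False."""
    matches = find_faulty_channel_files(driver_dir)
    if not dry_run:
        for path in matches:
            path.unlink()
    return matches
```

On an affected machine the files typically lived under `C:\Windows\System32\drivers\CrowdStrike`, and the deletion was usually done by hand after booting into Safe Mode or the Windows Recovery Environment. The dry-run default reflects a cautious design choice: list what would be removed before removing anything.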

In case you missed it, here’s a list of Agio support articles accessible by clients in the AgioNow Portal:  

If you currently use CrowdStrike, should you replace CrowdStrike?  

Deciding whether to replace CrowdStrike requires careful consideration. While the incident was significant, CrowdStrike’s overall track record and the advanced capabilities of its Falcon platform are noteworthy. It is essential to weigh the benefits of CrowdStrike’s comprehensive threat protection against the risk of a similar incident recurring. If CrowdStrike doesn’t learn from this and it happens again, you may be subject to further stability issues. If they do learn from it, which is most likely, you are in good hands. Additionally, consider the following:

Pros of Retaining CrowdStrike: 

  • Continuing to use one of the most effective EDR solutions on the market.
  • Benefiting from the extra diligence CrowdStrike has committed to for future updates and capabilities.

Cons of Replacing CrowdStrike:  

  • Switching to another vendor doesn’t guarantee they won’t have the same issue 
  • Moving away from what many consider the top EDR solution on the market today.  
  • Replacing one vendor with another still leaves you dependent on a single vendor for critical cybersecurity functions.

What Should Companies Do as a Result of This Outage? 

1. Consider an N-1 or N-2 Software Deployment Strategy

Implementing an N-1 or N-2 deployment strategy means deliberately running critical software one or two versions behind the latest release, so that newer versions have time to prove themselves elsewhere before they reach your systems. Not familiar with N-1 and N-2? Put simply, they are your insurance policy against the “oops” moments that often come with bleeding-edge tech. “N” stands for the current (newest) version, -1 means you run one version behind it, and -2 means you run two versions behind it. In essence, you are always at least one version behind the latest release, which trades some freshness for stability.
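Reading N-1 as "one release behind the newest", the selection rule can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the function name `pick_release` and the `lag` parameter are hypothetical, and versions are assumed to be dot-separated integers.

```python
def pick_release(releases: list[str], lag: int = 1) -> str:
    """Return the release that is `lag` versions behind the newest (N - lag)."""
    if lag < 0:
        raise ValueError("lag must be zero or positive")
    # Compare versions numerically so that "7.9" sorts before "7.10".
    ordered = sorted(releases, key=lambda v: tuple(int(part) for part in v.split(".")))
    if lag >= len(ordered):
        raise ValueError("not enough releases to honour the requested lag")
    return ordered[-1 - lag]
```

With the made-up release list `["7.9", "7.10", "7.11"]`, `pick_release(releases, lag=1)` selects `"7.10"` (N-1) and `lag=2` selects `"7.9"` (N-2). Real deployment tools usually expose this as a policy setting rather than code; the sketch only shows the rule being applied.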

Pros: 

  • Enhances system reliability and uptime 
  • Reduces the risk of widespread outages due to faulty updates 
  • Allows for more controlled and phased rollouts of updates 

Cons: 

  • Delays in deploying security patches and updates, which increases exposure to known vulnerabilities
  • Increased complexity in managing multiple versions 

2. Consider Multiple Vendors for Cybersecurity 

Diversifying cybersecurity solutions by engaging multiple vendors can mitigate the risk of relying on a single point of failure. 

Pros: 

  • Reduces dependency on a single vendor 
  • Increases resilience against vendor-specific issues 
  • Enables leveraging different strengths and capabilities of various solutions 

Cons: 

  • Increased complexity in managing multiple security solutions 
  • Potential for integration challenges 

What We’ve Done So Far 

While this incident did not affect Agio, we’re learning and making changes as a result. Here’s what we’ve done so far: 

We built out our Account Configurations to make them more robust. This will allow us to quickly identify customers impacted by an incident with a product or vendor. Additionally, we will be able to quickly communicate with the subset of customers impacted (or potentially impacted). As a first step, we have recently increased the visibility of the Account Configurations by exposing them on the AgioNow Portal. 
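As a rough illustration of the idea, and not a description of Agio’s actual data model, the lookup could be sketched in Python as a record of the vendors each account depends on plus a query by vendor. The names `AccountConfiguration` and `impacted_accounts`, and the example accounts, are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class AccountConfiguration:
    """Hypothetical record of the products and vendors an account relies on."""
    account: str
    vendors: set[str] = field(default_factory=set)


def impacted_accounts(configs: list[AccountConfiguration], vendor: str) -> list[str]:
    """Return the names of accounts that list the given vendor, ignoring case."""
    wanted = vendor.casefold()
    return sorted(
        c.account for c in configs
        if wanted in {v.casefold() for v in c.vendors}
    )


# Made-up example data for illustration only.
configs = [
    AccountConfiguration("Fund A", {"CrowdStrike", "Microsoft 365"}),
    AccountConfiguration("Fund B", {"Microsoft 365"}),
    AccountConfiguration("Fund C", {"crowdstrike"}),
]
```

Calling `impacted_accounts(configs, "CrowdStrike")` on this sample data returns the two accounts that list the vendor, which is the subset that would need to be contacted first.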

Not an Agio client? Contact us today.