The aftermath of CrowdStrike
The questions IT and Business Continuity Teams should expect.
After every major IT incident, it’s reasonable to expect lots of questions from senior leadership about why it occurred, whether it could affect you, and how you can stop it from impacting your organisation.
“We’ve recently increased the budget for security, how has this happened?” will be a common response, as will the inevitable follow up – “I thought we had bought XYZ to stop things like this from happening.”
We saw a similar reaction to WannaCry and NotPetya. When an incident is so large it crosses from technology press to national news, it leads to questions (and engagement) from the wider business.
IT Teams will be questioned about continuity, recovery plans and practices. Likewise, Business Continuity professionals will be grilled about technology and other products that could have a similar impact on operations.
This document is a primer to help prepare for these discussions. In doing so, you can instil confidence in your operations and enhance your ability to respond to future incidents.
Why do we use CrowdStrike and how do we change to an alternative?
If you use CrowdStrike, there will be plenty of people now asking – “why?”.
Until Friday 19th July, few outside the tech team would even have heard of the company. So most stakeholders know only that it caused a global and widely criticised IT outage.
They don’t have the longer relationship IT teams have, who are familiar with CrowdStrike as a reputable and well-respected security product.
It was a bug in an update. It shouldn’t have happened. It should have been caught but it wasn’t. It has happened to other technologies before and will happen again.
None of which means that the product itself is bad. If there were a history of pushing bad updates, it would be reasonable to consider changing solutions, but a single issue is not necessarily a good reason to switch.
If you don’t use CrowdStrike, be clear that although you were fortunate to not have been impacted this time, you could face a similar situation if your antivirus provider made the same mistake.
What was the cause of the issue?
Many have confused two separate issues that happened at the same time. There was a Microsoft 365 and Azure outage in the Central US Azure Region on July 18th, which affected all Availability Zones. Microsoft resolved this issue by the early hours of July 19th.
The larger incident was caused by a bug in a ‘Channel File’ – a configuration file used by the sensor to identify potential threats – from CrowdStrike’s Falcon cyber security product. The two issues were unrelated, but both ultimately stemmed from the same kind of failure – a flawed update.
The CrowdStrike issue was so significant because antivirus and cyber security software have kernel-level access at the core of the operating system. A fault at this level can cause the entire system to fail, leading to the infamous ‘Blue Screen of Death’ (BSOD).
Why didn’t we test this update?
Internal IT Teams do not test these kinds of updates. This issue was caused by an update to a Channel File – updates which can occur several times per day and are tested by the vendor, not by customers.
Cyber security vendors are given a high level of trust and access to operate, so they are expected to have thorough testing processes. As this incident has proven, no testing and release process is infallible.
How could any software have so much control?
Executives who learned about this incident over the weekend will have followed the fallout and seen the debate on social media, with armchair experts weighing in on kernel-level access and drivers.
Be clear that this level of access is required for cyber-security software. However, it inevitably introduces risk, because a third party has the potential to unintentionally cause an outage.
This has been a debate for decades. Microsoft have tried to limit this access but have been unable to, partly because of commitments made to the European Commission over anti-competitive practices. So while there is a legitimate debate around the risk, it is not likely to change quickly – if at all.
Such a deep level of access isn’t uncommon for cyber-security or other similar products – it’s a prerequisite for them to do their job. They must be able to be updated to protect against vulnerabilities and threats as they emerge.
The alternative is to not use antivirus software, which would introduce significantly more risk. You could use multiple products to spread that risk, but that adds a big overhead to manage.
Practically speaking, antivirus software is a single point of failure – one you should be highly familiar with and have a mitigation plan for if (or when) something goes wrong.
What was the Disaster Recovery (DR) plan and was it executed correctly?
CrowdStrike quickly issued a workaround that detailed the manual steps to resolve the issue and reboot the device.
If no workaround had been provided, the method to recover would have been to invoke the IT DR Plan and restore from backup or a DR solution. Fortunately, in most (but not all) instances, the workaround worked and was the fastest option for recovery.
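For illustration, CrowdStrike’s published workaround boiled down to booting the affected device into Safe Mode or the Windows Recovery Environment, deleting the faulty Channel File and rebooting. The sketch below shows the shape of that fix in Python – illustrative only, since in practice the fix was applied by hand in a recovery environment where a script like this would not normally run; the directory and filename pattern are taken from the public guidance.

```python
# Illustrative only: the real fix was performed by hand from Safe Mode or the
# Windows Recovery Environment, where a script like this would not normally run.
from pathlib import Path

# Directory and filename pattern from CrowdStrike's published workaround.
CROWDSTRIKE_DIR = Path(r"C:\Windows\System32\drivers\CrowdStrike")
FAULTY_PATTERN = "C-00000291*.sys"

def remove_faulty_channel_files(directory: Path = CROWDSTRIKE_DIR) -> int:
    """Delete Channel Files matching the faulty update; return how many were removed."""
    removed = 0
    for channel_file in directory.glob(FAULTY_PATTERN):
        channel_file.unlink()
        removed += 1
    return removed

if __name__ == "__main__":
    count = remove_faulty_channel_files()
    print(f"Removed {count} faulty Channel File(s); reboot the device to complete recovery.")
```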
Unlike in a cyber-attack, there was no doubt that organisations would be able to recover; the challenge was how long it would take.
How long will it take to recover?
There are two answers here. A full recovery back to business as usual (BAU) is likely to take days, and that will often cause concern. But this isn’t the whole story, as it takes much less time to get your business back to an operational state.
It won’t operate at full capacity, but sufficiently well to meet your obligations to customers and wider business demands. This is your Minimum Viable Business (MVB) or Minimum Business Continuity Objective (MBCO).
By communicating your MVB (or MBCO) alongside the full recovery time, you demonstrate a business-centric view of your technology recovery, rather than a technology-centric one.
What other IT products could have a similar impact due to a failed update?
This question can be split into two halves: systems whose failure would have a wide-reaching impact on your own organisation because many or all of your systems depend on them, and systems whose failure would affect entire industries.
Systems that could cause similar issues to your organisation
- Single Sign-On (SSO)
- Multi-Factor Authentication (MFA)
- Microsoft 365/Google Workspace
- AV and other security tools
Systems that affect entire industries (concentration risk)
- Microsoft (both 365 and Azure)
- AWS and Google Cloud
- Payment card systems
- Content Delivery Networks and Domain Name Services
The rapid growth of complex hybrid IT estates means it’s incredibly difficult to have a full understanding of the risks and vulnerabilities inherent in their design. Even the most mature IT functions have blind spots, as this widespread example of concentration risk has shown.
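Even a simple dependency register can start to surface this kind of concentration risk. The sketch below uses hypothetical service and supplier names – in practice the data would come from your asset register or CMDB – and simply counts how many critical services rely on each supplier:

```python
from collections import Counter

# Hypothetical mapping of critical business services to the third-party
# products they depend on; real data would come from your asset register/CMDB.
service_dependencies = {
    "customer portal": ["Azure", "CrowdStrike Falcon", "Okta SSO"],
    "payments": ["AWS", "CrowdStrike Falcon", "payment gateway"],
    "email and collaboration": ["Microsoft 365", "CrowdStrike Falcon"],
    "call centre": ["Azure", "Okta SSO"],
}

# Count how many critical services each supplier underpins.
exposure = Counter(
    supplier
    for suppliers in service_dependencies.values()
    for supplier in suppliers
)

# Suppliers underpinning more than one critical service are concentration risks.
for supplier, count in exposure.most_common():
    if count > 1:
        print(f"{supplier}: single point of failure for {count} critical services")
```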
Were we aware of this risk?
This is a common risk – there have been many previous examples of bugs in software updates – so we would expect it to be present in risk registers. If it isn’t, it should be added.
UK Regulators have been writing about Third Party Risk Management and outsourcing risk for some time. The financial sector has been particularly focused on concentration risk due to the high degree of dependency between businesses and suppliers.
The PRA and FCA have expressed serious concerns about the impact on the UK financial sector of a massive disruption to Microsoft Azure or AWS. The PRA expects firms to think about resilience when adopting public cloud services and other new technologies. This includes data security (e.g. backups), business continuity and exit plans (i.e. how to move away from a failed service).
How do we mitigate this risk?
Ultimately, this is likely to be a risk that you are forced to accept because it is out of your control. Using multiple antivirus products from different vendors to diversify your supply chain is an option, but increased management overhead will make it less practical.
You may have the option to choose when updates are deployed, or even to stagger the roll-out. This is a balance of risk.
Delaying means you are less likely to be affected by bad updates, because problems can be found and fixed before they reach you. The downside is that updates are released to protect against newly discovered vulnerabilities, so delaying them extends your period without protection.
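To make that trade-off concrete, below is a minimal sketch of a ring-based (staggered) roll-out policy. The ring names, sizes and soak times are illustrative assumptions rather than any vendor’s settings – and note that some security content, such as the Channel Files at the centre of this incident, may not be subject to customer-controlled staging at all:

```python
from dataclasses import dataclass

# Illustrative ring-based roll-out: names, sizes and soak times are assumptions,
# not settings from any particular vendor's console.
@dataclass
class Ring:
    name: str
    share_of_estate: float  # fraction of devices in this ring
    soak_hours: int         # how long to wait before promoting to the next ring

ROLLOUT_PLAN = [
    Ring("canary (IT and test devices)", 0.02, 4),
    Ring("early adopters", 0.10, 12),
    Ring("general estate", 0.68, 24),
    Ring("critical/operational systems", 0.20, 0),
]

def describe_rollout(plan: list[Ring]) -> None:
    """Print the order in which an update would reach the estate."""
    elapsed = 0
    for ring in plan:
        print(f"+{elapsed:>3}h: deploy to {ring.name} ({ring.share_of_estate:.0%} of devices)")
        elapsed += ring.soak_hours

describe_rollout(ROLLOUT_PLAN)
```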
You may also review your supplier’s history of issues caused by updates and decide on thresholds that would prompt you to move to another supplier. You cannot affect the likelihood of another incident, but you can reduce its impact on your organisation through your Business Continuity Plan and processes.
If the vendor provides a workaround, as CrowdStrike did in this instance, your ‘recovery plan’ is how your IT team will perform the manual fix. If the vendor does not provide a workaround, the solution is to recover systems to a prior point in time from Backups or DR.
Recommendations for Business Continuity and IT Professionals
On one hand, this was a ‘pure’ Business Continuity issue, in that you need manual workarounds in place for when critical services are impacted. Think of the airports issuing hand-written boarding cards and using white boards to update departure information.
But it is also a technology supply-chain issue. The first step is to ensure that asset registers are up to date, and that you have a plan and process for how to revert to the version in place prior to any update.
Depending on the complexity of your IT estate, this may demand more thorough impact and dependency mapping to identify seemingly small suppliers that could have an outsized effect on operations.
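As a simple illustration of what that could look like, the sketch below uses hypothetical asset register entries that record, for each critical agent, the version currently deployed and the last version known to be good – so a rollback target exists before it is ever needed:

```python
from dataclasses import dataclass

# Hypothetical asset register entries: the point is simply to record, per critical
# agent, the version currently deployed and the last version known to be good,
# so a rollback target is identified before it is needed.
@dataclass
class AgentRecord:
    asset: str
    agent: str
    current_version: str
    last_known_good: str

register = [
    AgentRecord("till-001", "endpoint security sensor", "7.16.0", "7.15.2"),
    AgentRecord("laptop-042", "endpoint security sensor", "7.16.0", "7.15.2"),
    AgentRecord("server-db1", "backup agent", "12.4.1", "12.4.1"),
]

# Assets where the deployed version differs from the last known-good version
# are the ones a rollback plan needs to cover.
for record in register:
    if record.current_version != record.last_known_good:
        print(f"{record.asset}: roll back {record.agent} "
              f"{record.current_version} -> {record.last_known_good}")
```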
If you were affected directly by the incident – how did your response fare? Conduct a post incident review and identify areas for improvement. Were too many of the IT Admin team on holiday? Were communications to customers pre-prepared, and if they were, were they appropriate?
If you were not affected directly, but are concerned about how you might have responded, this is the ideal time to use the visibility and momentum of this story to organise an internal exercise.