Resilience in review – Looking back at the major outages of 2018
2018 has not been short of major incidents. Let’s take a quick look at some of the key causes of downtime in the headlines over the last year.
Firstly, how did it compare to 2017?
2017 was the year of ransomware, with the WannaCry and NotPetya outbreaks causing downtime across the globe. The attacks hit global brands like Maersk, WPP and Renault, but here in the UK, we remember the effect they had on the NHS.
Our personal experience of ransomware in 2017 was a little different. WannaCry hit the headlines because it infected so many well-known organisations – but it was only able to spread because of unpatched vulnerabilities. Organisations that kept up to date weren’t affected. Of our hundreds of customers, we had only one instance of WannaCry (and it was for the NHS).
During that weekend, we carried out multiple recoveries from ransomware – it’s just that they weren’t WannaCry. This is the worrying part of the story. General awareness of ransomware jumped in 2017 when the NHS was on the news, but there are still thousands of organisations suffering from ransomware infections.
In our 2018 Data Health Check survey, 25 per cent of respondents reported ransomware infections, and other surveys have reported even higher figures. The Data Health Check also showed significant improvement in security and continuity measures: testing of BC & DR plans is up, as is specific testing against cyber threats.
In 2018 we’ve seen a wider range of issues hit the headlines. It’s worth reviewing these problems to see if we are susceptible to the cause and assess how our plans and recovery methods would fare.
Financial services outages
Several prominent financial service firms got unwanted publicity this year. Perhaps the highest profile were TSB and Visa.
TSB
In April, TSB suffered a systems migration failure. TSB’s owner, Spanish banking group Sabadell, had attempted to migrate TSB customers’ records onto its own platform. The result was that nearly two million customers were locked out of their accounts, and the outage led to a surge in fraud cases targeting TSB customers. The failure cost TSB around £280 million and heavy reputational damage, and CEO Paul Pester was forced to resign five months later. According to Bacs, TSB lost 21,790 customers.
Nicky Morgan, chairwoman of the Treasury Select Committee, said: “Millions of customers have been affected by the uncertainty and disruption caused by failures of banking IT systems.”
“Measly apologies and hollow words from financial services institutions will not suffice when consumers aren’t able to access their own money and face delays in paying bills.”
Whilst outages can happen at any time to any business, the key lesson is in TSB’s response to the crisis. The effects of the outage meant real people couldn’t access their own cash, which led to real-world problems, and quickly. The rise in fraud attempts from opportunists looking to capitalise on the confusion was an added blow.
TSB’s failure to equip its customers with the tools to overcome these issues in the short term led to a spike in resentment and a huge amount of lost business. Not foreseeing the attempted scams also left customers feeling cheated.
It’s key to have a plan not only for minimising downtime, but also for managing customer expectations. Having a plan that is grounded in the same reality your customers occupy is important for minimising reputational damage.
Visa
Unlike TSB’s, Visa’s outage in June was caused by a hardware problem. When Visa tried to fail over to its secondary infrastructure, it didn’t work because there was a ‘partial failure’ of a switch, rather than a complete failure.
Visa Europe CEO Charlotte Hogg said:
“…it took far longer than it normally would to isolate the system at the primary data centre; in the interim, the malfunctioning system at the primary data centre continued to try to synchronise messages with the secondary site. This created a backlog of messages at the secondary data centre, which, in turn, slowed down that site’s ability to process incoming transactions.”
The result was that customers could not use chip and pin services on their debit and credit cards. Payment systems are critical, with a lot of time and money invested in keeping them up. Given the millions of transactions that happen every day, any drop-off in reliability has consequences. Around 5.2 million payments failed on the day of the outage.
This caused less chaos than the TSB outage, in no small part because it was resolved within the day. People tend to carry cards from other providers and/or cash, so dependency on Visa cards was, to an extent, mitigated by consumers themselves.
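The technical lesson from Visa’s account is that failover logic keyed only to a hard, total failure can be defeated by a component that is merely degraded. As a minimal, purely illustrative sketch (it bears no relation to Visa’s actual systems – the metrics, thresholds and site names are invented), a health check can treat high error rates or latency at the primary as grounds to fail over, rather than waiting for it to go completely dark:

```python
# Illustrative only: treat a degraded ("partially failed") primary as
# unhealthy, rather than waiting for a hard failure. All thresholds and
# site names are hypothetical.
from dataclasses import dataclass


@dataclass
class SiteHealth:
    error_rate: float      # fraction of failed transactions (0.0 to 1.0)
    p99_latency_ms: float  # 99th-percentile processing latency

MAX_ERROR_RATE = 0.01      # trip failover above 1% errors
MAX_P99_LATENCY_MS = 500   # or above 500 ms at the 99th percentile


def is_healthy(site: SiteHealth) -> bool:
    """A site is only healthy if it is both available *and* performing."""
    return (site.error_rate <= MAX_ERROR_RATE
            and site.p99_latency_ms <= MAX_P99_LATENCY_MS)


def choose_active_site(primary: SiteHealth, secondary: SiteHealth) -> str:
    # Fail over on degradation, not just on total loss of the primary.
    if is_healthy(primary):
        return "primary"
    if is_healthy(secondary):
        return "secondary"
    return "primary"  # both degraded: stay put and alert the operators


if __name__ == "__main__":
    # A "partial failure": the primary still answers, but slowly and with errors.
    primary = SiteHealth(error_rate=0.05, p99_latency_ms=2000)
    secondary = SiteHealth(error_rate=0.001, p99_latency_ms=120)
    print(choose_active_site(primary, secondary))  # -> "secondary"
```

The point is simply that ‘unhealthy’ should be defined by what customers experience – errors and delays – not by whether the hardware still answers at all.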
Amazon – Prime Day
What’s more powerful? The scramble for discounts from the world’s largest online retailer? Or the ability to scale up web servers from the world’s largest cloud computing company? In this case, the former. Amazon’s Prime Day surge of traffic outstripped AWS’s ability to scale up to meet the demand. Customers across the US were faced with looping web pages, shopping carts emptying themselves and offer buttons directing back to the homepage.
On the plus side, the ‘Dogs of Amazon’ error pages were a nice touch.
Extreme weather
From the end of February into early March, the UK was hit by a cold snap dubbed the Beast from the East, and by Storm Emma. This was followed by Storm Callum in October. This unusually intense run of bad weather was hard on businesses, particularly construction, with the UK reportedly losing £1bn a day as people were unable to get to work.
Expired Certificates
In early December, Ericsson came under fire for providing faulty software to O2, causing mass-scale disruption. An expired certificate related to Ericsson’s packet switches was found to be the root cause. Around 32 million people lost service, resulting in a strong backlash against O2, particularly on social media. Users threatened to switch networks as the telecoms provider’s reputation suffered. The total bill could reach as much as £100m.
In the aftermath, O2 quickly clarified compensation packages. This proved a cautionary tale in supply chain management, as O2 are now suing Ericsson for tens of millions in damages.
We saw something similar with Marketo in 2017. Marketo (marketing automation software used by numerous tech companies, including Databarracks and Microsoft) let an SSL certificate expire, which meant the emails and web pages sent out by all those marketers didn’t work. Marketo was recently purchased by Adobe for $4.75bn. CEO Steve Lucas sent an explanatory email to customers, but again, there was a backlash on social media.
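Expired certificates are also one of the easier failure modes to catch ahead of time. As a rough sketch (the hostnames are placeholders, and this only covers public TLS endpoints – internal and device certificates of the kind involved in the Ericsson fault need their own inventory and monitoring), Python’s standard library is enough to flag certificates approaching expiry:

```python
# Rough sketch: warn when a public TLS certificate is close to expiry.
# Hostnames below are placeholders, not real services to monitor.
import socket
import ssl
from datetime import datetime, timezone

WARN_DAYS = 30  # arbitrary threshold for this example


def days_until_expiry(hostname: str, port: int = 443) -> int:
    """Return the number of days until the host's certificate expires."""
    context = ssl.create_default_context()
    with socket.create_connection((hostname, port), timeout=10) as sock:
        with context.wrap_socket(sock, server_hostname=hostname) as tls:
            cert = tls.getpeercert()
    # 'notAfter' looks like 'Jun  1 12:00:00 2025 GMT'
    expires = datetime.fromtimestamp(
        ssl.cert_time_to_seconds(cert["notAfter"]), tz=timezone.utc
    )
    return (expires - datetime.now(timezone.utc)).days


if __name__ == "__main__":
    for host in ["www.example.com", "api.example.com"]:  # placeholder hosts
        remaining = days_until_expiry(host)
        status = "OK" if remaining > WARN_DAYS else "RENEW SOON"
        print(f"{host}: {remaining} days left ({status})")
```

Even a simple scheduled check like this, wired into whatever alerting you already run, removes the ‘nobody noticed the renewal date’ class of outage for public-facing services.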
Honourable mentions – Cloud outages
Unlike previous years, we’ve not seen major problems with the largest cloud providers – but it hasn’t been a year entirely without issues. In March, storms affected the AWS US East Region. On 6 April, users in the UK were locked out of Office 365 for a big chunk of the day. Microsoft Azure also had trouble in November with its multi-factor authentication (MFA) service.
Yes, it’s impossible to have a specific plan for every incident and every black swan event. But we absolutely can and should plan for the impacts of those incidents.
These incidents can impact our premises, our people, our technology or our suppliers.
That being said, we shouldn’t ignore specific incidents entirely. When we write our BC and DR plans, we plan for impacts – but we should then test those plans against specific incidents to see how they would fare against these real-world events.
Also, one of the best ways to improve your resilience is to carry out reviews immediately after you suffer an incident. There are always lessons to learn about how to improve your mitigation methods, or how to communicate with staff and customers. It is worth looking at each of these high-profile incidents. Consider how exactly you would handle them. Could you quickly identify the partial failure of a switch? Are your certificates managed and renewed? Is your emergency communication plan fit for purpose?
Here’s to another year of keeping businesses resilient, as this world keeps surprising us all.