12/06/2019

The Disaster Recovery metrics you need to know

For measuring any project or action, metrics are how we define success. Different metrics are needed for different actions – Mo Farah wouldn’t benchmark a marathon using the same factors Rafael Nadal would in a tennis match.

So in terms of Disaster Recovery, how do you gauge the difference between success and failure?

As part of your broader Business Continuity (BC) planning, you set times for recovering particular services and business areas. Your IT Disaster Recovery Plan defines these times in detail, with exact recovery times for individual servers and assets.

You should use these metrics:

Recovery Point Objective (RPO)

The RPO is the point in time to which you recover. It’s easy to work out: it is set by how often you back up or replicate your systems. A single daily backup means a maximum RPO of 24 hours. If you back up or replicate every 4 hours, you cut the RPO to just 4 hours. So if you invoke Disaster Recovery, you’ll lose at most the last 4 hours of work. It pays to have as low an RPO as possible, to minimise the work you have to recreate.
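The arithmetic can be sketched in a few lines of Python. This is a minimal illustration, not a monitoring tool: the function names, the backup schedule and the timestamps are all hypothetical.

```python
from datetime import datetime, timedelta


def worst_case_rpo(backup_interval: timedelta) -> timedelta:
    """Worst-case data loss is the full gap between two consecutive backups."""
    return backup_interval


def data_lost(last_backup: datetime, failure: datetime) -> timedelta:
    """Work lost if systems fail at `failure` and are restored to `last_backup`."""
    if failure < last_backup:
        raise ValueError("failure cannot precede the last backup")
    return failure - last_backup


# Hypothetical schedule: a backup every 4 hours, the most recent at 12:00.
interval = timedelta(hours=4)
last_backup = datetime(2019, 6, 12, 12, 0)
failure = datetime(2019, 6, 12, 15, 30)

lost = data_lost(last_backup, failure)
print(lost)                              # 3:30:00
print(lost <= worst_case_rpo(interval))  # True
```

In this made-up case three and a half hours of work are lost, which sits inside the 4-hour worst case the schedule allows.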

Recovery Time Objective (RTO)

This is the target for how long a recovery takes. If the recovery starts at 2pm and is due to finish by 4pm, the RTO is 2 hours. The RTO translates into downtime for a business. So a fast recovery in 15 minutes means your team aren’t missing a lot of work. But an RTO of 24 hours means your team are twiddling their thumbs for a day, waiting for systems to come back.

Recovery Time Actual (RTA)

RTA is the time a recovery actually takes when your plan is executed in the real world, with all the variables that entails. We all have that friend for whom we add an extra half hour onto the agreed meeting time.

The gap between an RTO and an RTA can be significant. For example, say your email goes down and you have an RTO of 3 hours for that server. You might need to wake up your IT Manager in the middle of the night; a public transport strike might slow your journey to the office; it might take extra time to log on remotely and actually start the recovery. In that case it will take longer than 3 hours before email is back online. RTA includes those real-world factors.

You don’t want to suddenly find a big disparity between your RTO and RTA, so the way to get these two metrics as close as possible is to test.
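To put rough numbers on the email example, here is a small Python sketch. The delay figures are invented for illustration, and it assumes the technical recovery itself takes exactly the 3-hour RTO; the `recovery_time_actual` helper is hypothetical.

```python
from datetime import timedelta


def recovery_time_actual(technical_recovery: timedelta,
                         delays: dict[str, timedelta]) -> timedelta:
    """RTA = time spent on the recovery itself plus every real-world delay."""
    return technical_recovery + sum(delays.values(), timedelta())


rto = timedelta(hours=3)

# Assumed delays, for illustration only.
delays = {
    "wake up the IT Manager": timedelta(minutes=45),
    "log on remotely": timedelta(minutes=30),
}

rta = recovery_time_actual(technical_recovery=rto, delays=delays)
overrun = rta - rto

print(rta)      # 4:15:00
print(overrun)  # 1:15:00
```

Recording the real delays during each test gives you the figures to plug in here, and shows which step contributes most to the overrun.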

Test Frequency

How often do you test? It won’t be much help if disaster strikes and your last test was in 2004.

The most important metric for testing and exercising is whether you actually do it. Tests are often missed, or put off, in favour of other priorities.

Regular testing is important to keep your processes up to date. A good test in the last six months is better than a great test two years ago.
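A simple check along these lines can flag when a test is overdue. It is only a sketch: the six-month threshold (taken here as 183 days) and the dates are assumptions, and `is_test_overdue` is a hypothetical helper.

```python
from datetime import date, timedelta

# Assumed policy: a DR test should be no more than about six months old.
MAX_TEST_AGE = timedelta(days=183)


def is_test_overdue(last_test: date, today: date,
                    max_age: timedelta = MAX_TEST_AGE) -> bool:
    """Return True when the most recent DR test is older than the allowed age."""
    return today - last_test > max_age


today = date(2019, 6, 12)  # example date

print(is_test_overdue(date(2004, 3, 1), today))  # True
print(is_test_overdue(date(2019, 2, 1), today))  # False
```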

A DR test, like any other, is one you pass or fail*. You test for specific elements of the recovery. Does a generator work? Does the emergency notification system work? These questions have binary yes/no answers.

IT Disaster Recovery fits into this category too. For example, if your IT can't be restored at a separate site, you fail your IT DR test.

*Always having a binary pass/fail element can be unhelpful.

There is no way to reach maturity in your business continuity and disaster recovery programmes without measuring your success. For businesses getting started with BC/DR, our first recommendation is to get the plans written down. Documented plans mean you have a repeatable process. The next step is to introduce metrics to see if your recoveries are actually meeting the objectives set by the business.

Knowing where your response strengths and weaknesses lie lets you iterate (or overhaul) the relevant processes and build a resilient business.