01/08/2024

Building resilient systems: High Availability vs. Disaster Recovery

Practically speaking, High Availability (HA) and Disaster Recovery (DR) are two distinct elements of resilience that make up a greater whole. The key differentiating factor is that one is pre-emptive, while the other is reactive.

While High Availability focuses on preventing downtime, Disaster Recovery is a reactive process intended to restore normal operations as quickly as possible following a significant failure. It presumes that disruptions will happen and prepares for swift recovery when needed.

Together, they form a comprehensive approach to IT Service Continuity Management.

High Availability – pre-empting disruption

High Availability refers to a system's ability to remain operational and always-on, leveraging built-in redundancy and fault tolerance. HA is pre-emptive, involving strategies and technologies that detect potential failures and take immediate action to prevent downtime.

Key components of High Availability:

Hardware redundancy: Redundant storage, error-correcting memory, backup power sources and other technologies ensure that a single hardware failure doesn’t bring down the entire system.
Software redundancy: Techniques such as clustering, load balancing and self-healing systems distribute workloads across multiple servers, enhancing the system's ability to withstand failures.
Environmental redundancy: Utilising data centres that are geographically or virtually dispersed ensures that localised issues don’t affect the entire system.

By incorporating these redundancies, HA systems provide the resilience needed to keep critical applications and services running without interruption.
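
As a rough illustration of why redundancy improves availability, the sketch below estimates the combined availability of several identical components when any one of them can carry the workload. It assumes that failures are independent, which real systems rarely satisfy in full (shared power, shared software bugs), and the function name and the 99% figure are illustrative rather than drawn from any particular product.

```python
def combined_availability(single: float, replicas: int) -> float:
    """Estimate availability of `replicas` redundant components.

    Assumption: failures are independent and any single healthy
    replica is enough to keep the service running.
    """
    if not 0.0 <= single <= 1.0:
        raise ValueError("single must be between 0 and 1")
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    # The service is down only when every replica is down at once.
    return 1.0 - (1.0 - single) ** replicas


if __name__ == "__main__":
    # Illustrative figure: each component is available 99% of the time.
    for n in (1, 2, 3):
        print(f"{n} replica(s): {combined_availability(0.99, n):.4%}")
```

Under that independence assumption, two components at 99% each give roughly 99.99% combined. Correlated failures, such as a system-wide attack, break the assumption, which is where Disaster Recovery comes in.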

Disaster Recovery – responding to disruption

Disaster Recovery encompasses the tools and procedures designed to enable the recovery or continuation of vital systems and infrastructure. As the last line of defence in a disaster, some form of DR should always be in place – however highly available the existing systems are.

DR strategies typically involve maintaining secondary systems, hosted in another data centre or in the cloud, that workloads can be failed over to when the primary systems are unavailable.

Core concepts of Disaster Recovery:

Recovery Time Objective (RTO): The maximum acceptable delay between the interruption of service and its restoration. It represents the targeted duration for recovery after a disruption (e.g. an RTO of 15 minutes means service should be restored within 15 minutes of the interruption).
Recovery Point Objective (RPO): The maximum acceptable amount of data loss measured in time. It determines how far back in time data recovery may go (e.g. an RPO of 4 hours means recovering to a point no more than 4 hours before the disruption).
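
To make the two objectives concrete, here is a minimal sketch that checks a single recovery against an RTO and an RPO. The class, function and field names are hypothetical, and the timestamps are invented for illustration. In practice these values would come from monitoring and backup tooling.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class RecoveryObjectives:
    rto: timedelta  # maximum acceptable time to restore service
    rpo: timedelta  # maximum acceptable window of lost data


def assess_recovery(
    objectives: RecoveryObjectives,
    outage_start: datetime,
    service_restored: datetime,
    last_recovery_point: datetime,
) -> dict:
    """Compare one recovery against the stated objectives."""
    # Downtime: how long the service was unavailable.
    downtime = service_restored - outage_start
    # Data loss window: changes made after the last recovery point are lost.
    data_loss_window = outage_start - last_recovery_point
    return {
        "downtime": downtime,
        "data_loss_window": data_loss_window,
        "rto_met": downtime <= objectives.rto,
        "rpo_met": data_loss_window <= objectives.rpo,
    }


if __name__ == "__main__":
    objectives = RecoveryObjectives(rto=timedelta(minutes=15), rpo=timedelta(hours=4))
    result = assess_recovery(
        objectives,
        outage_start=datetime(2024, 1, 8, 12, 0),
        service_restored=datetime(2024, 1, 8, 12, 10),
        last_recovery_point=datetime(2024, 1, 8, 9, 0),
    )
    print(result)
```

In this invented scenario the service is down for 10 minutes and three hours of data are lost, so both objectives are met. Restoring from a nightly backup taken 14 hours earlier would miss the 4-hour RPO, even if the service came back quickly.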

Two halves of resilience

To minimise interruption – be it from a minor fault or major disaster – organisations must first ensure that their infrastructure meets all of their needs for availability. This ensures that the impact of minor faults or disruptions is not felt by end users or customers.

But in the event of more significant disruption, procedures for Disaster Recovery should be in place that meet exact requirements for both RTO and RPO.

Changing priorities

Historically, architecting a system for High Availability was an expensive undertaking, but it solved most causes of IT downtime. Forward-thinking, uptime-focused organisations invested heavily in HA and, as a result, reduced their spend on DR.

In many instances, “DR” for these organisations simply meant recovering from backups, rather than failing over to replicated systems with recovery points prior to infection. But emerging cyber threats have highlighted the flaws in this approach.

Today, cyber attacks are the leading cause of downtime and data loss, and organisations that have invested disproportionately in HA are unprotected against system-wide attacks and breaches.

Security, availability and recoverability are not absolutes – there is a risk-cost balance to be struck. With even the best-designed HA systems subject to long recovery times in the event of an attack, the question is how best to allocate resources to maintain business continuity.

To this end, IT Service Continuity budgets must be either rebalanced or increased to accommodate the adoption or improvement of dedicated solutions for Disaster Recovery.