Is Regional Failover Required?

Not Just Restart
Sep 5, 2022

Single-region applications with multi-zone replication provide high availability within a region. A few business applications, however, have continuity requirements over large distances (e.g., a primary and a secondary separated by hundreds of miles). This is known as BCM (business continuity management) or DR (disaster recovery).

The main goal of business continuity (BCM) for a single-region application is fulfilled by having a second region that is used for failover events.

To satisfy these requirements, the primary and standby regions may need to be located in the same country or union of countries.

If there is no such location requirement, the failover region may be located anywhere, as long as the added latency after failover still gives a satisfactory response time.

In this model, application data is synchronously replicated within the primary region to cover in-region failures, and asynchronously replicated to a standby region that is distant from the primary region. Asynchronous replication means a non-zero RPO (recovery point objective), and therefore potential loss of recent updates on failover. The approach is used by enterprise applications with availability or regulatory needs that require a replica in another region.
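To make the non-zero RPO concrete, here is a minimal sketch of how the data lost on an unplanned failover could be estimated. It assumes every update carries a commit sequence number and that the standby reports the last sequence number it has applied; the `Update` type and both function names are hypothetical and not taken from any particular database.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Update:
    seq: int             # commit sequence number on the primary (hypothetical)
    committed_at: float  # commit time, in seconds


def lost_on_failover(primary_log: List[Update], standby_applied_seq: int) -> List[Update]:
    """Updates committed on the primary that the standby has not applied yet."""
    return [u for u in primary_log if u.seq > standby_applied_seq]


def data_loss_window(primary_log: List[Update], standby_applied_seq: int,
                     failure_time: float) -> float:
    """Seconds of recent updates that would be lost if the primary failed now."""
    lost = lost_on_failover(primary_log, standby_applied_seq)
    if not lost:
        return 0.0
    return failure_time - min(u.committed_at for u in lost)


# Three updates on the primary; the standby has only applied the first one.
log = [Update(1, 100.0), Update(2, 101.0), Update(3, 104.5)]
print([u.seq for u in lost_on_failover(log, standby_applied_seq=1)])
print(data_loss_window(log, standby_applied_seq=1, failure_time=105.0))
```

Comparing this window against the RPO the business has agreed to is one way to decide whether the replication lag is acceptable.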

Live traffic is always served from the primary region, and if the primary region becomes unhealthy either due to infrastructure or service problems, then the standby region is used.

Some application owners prefer manual failover to the standby region. In this scenario, when failover occurs, the VIP or IPs in the DNS entry for the primary region are manually replaced with those of the standby region.

Alternatively, DNS load balancing (DNS LB) is used for automatic failover. If DNS is not used at all, then clients have the list of IPs for both the primary and standby region, and they are configured to use the current primary region.
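For the case without DNS, the client-side configuration might look roughly like the sketch below. The `ClientConfig` class and its methods are hypothetical, and the addresses come from the IP ranges reserved for documentation, so they are placeholders only.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class ClientConfig:
    endpoints: Dict[str, List[str]]  # region name -> IPs for that region
    active_region: str               # region the client currently sends traffic to

    def targets(self) -> List[str]:
        """IPs the client should use right now."""
        return list(self.endpoints[self.active_region])

    def fail_over_to(self, region: str) -> None:
        """Point the client at another region (pushed by an operator or config system)."""
        if region not in self.endpoints:
            raise ValueError(f"unknown region: {region}")
        self.active_region = region


config = ClientConfig(
    endpoints={
        "primary": ["192.0.2.10", "192.0.2.11"],        # placeholder IPs
        "standby": ["198.51.100.10", "198.51.100.11"],  # placeholder IPs
    },
    active_region="primary",
)
print(config.targets())
config.fail_over_to("standby")
print(config.targets())
```

The trade-off is that every client has to receive the configuration change, whereas with DNS the change is made in one place.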

For this model, the deployment is regional aside from the DNS LB. The DNS LB assigns traffic to the primary region, but if there is an issue with the primary region and a failover needs to occur, then DNS LB assigns traffic to the standby region.

The DNS LB performs health checking by sending probes to the load balancer that represents each region.

If the health checks fail, then the application can fail over to the standby region for availability. DNS LB can also be used on an application owner’s Virtual Private Cloud (VPC) for service-to-service communication as a service discovery mechanism.
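The routing behaviour described above can be sketched as a small state machine. This is an illustration only, not the API of any real DNS load balancer: the class name, the `failure_threshold` of three consecutive failed probes, and the choice to stay on the standby once failed over (to avoid flapping) are all assumptions.

```python
class DnsLoadBalancer:
    """Toy model: answer with the primary VIP until its probes keep failing."""

    def __init__(self, primary_vip: str, standby_vip: str, failure_threshold: int = 3):
        self.primary_vip = primary_vip
        self.standby_vip = standby_vip
        self.failure_threshold = failure_threshold  # assumed value, tune per service
        self.consecutive_failures = 0
        self.failed_over = False

    def record_primary_probe(self, healthy: bool) -> None:
        """Record the result of one health probe sent to the primary region's load balancer."""
        if healthy:
            self.consecutive_failures = 0
            return
        self.consecutive_failures += 1
        if self.consecutive_failures >= self.failure_threshold:
            # Assumption: failover is sticky; failing back is a separate, deliberate step.
            self.failed_over = True

    def resolve(self) -> str:
        """Return the VIP that DNS queries should currently be answered with."""
        return self.standby_vip if self.failed_over else self.primary_vip


lb = DnsLoadBalancer("192.0.2.1", "198.51.100.1")  # placeholder VIPs
for probe_result in (False, False, False):
    lb.record_primary_probe(probe_result)
print(lb.resolve())
```

Requiring several consecutive failures before switching keeps a single dropped probe from moving all traffic to the standby region.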

For an application with more than one region to be operational, there needs to be an understanding of the full health of the service stack within a region. If there is a regional issue for just a single layer (e.g., Load Balancer 1, all Front-Ends, Load Balancer 2, all Back-Ends, or regional SQL), then the region would be unhealthy and a failover would need to occur to keep the service up and running.

The application owner needs to build up an understanding of the health of all layers and be able to automatically trigger failover if a given layer is having a regional issue.
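A minimal sketch of that per-layer view is shown below. The layer names mirror the example stack mentioned above; the function names are hypothetical, and treating a layer with no health report as unhealthy is an assumption made here for safety.

```python
from typing import Dict, List

# Layers of the regional service stack, in request order.
LAYERS = ("load_balancer_1", "front_ends", "load_balancer_2", "back_ends", "regional_sql")


def unhealthy_layers(layer_health: Dict[str, bool]) -> List[str]:
    """Layers that are down region-wide; a missing report counts as unhealthy (assumption)."""
    return [layer for layer in LAYERS if not layer_health.get(layer, False)]


def should_fail_over(layer_health: Dict[str, bool]) -> bool:
    """The region is unhealthy as soon as any single layer is unhealthy."""
    return bool(unhealthy_layers(layer_health))


report = {layer: True for layer in LAYERS}
report["back_ends"] = False  # every back-end in the region is down
print(unhealthy_layers(report))
print(should_fail_over(report))
```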

When a failover is triggered, the primary for the database is switched to the new primary region. If this is an unplanned failover, then recent updates to the database could be lost. For a planned failover, the failover can be coordinated to ensure all of the latest changes from the primary are made to the standby database before switching over.
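The difference between the two kinds of failover can be illustrated with a toy in-memory model. The `AsyncReplicatedStore` class, its region names, and its methods are hypothetical; a real database would do the same steps through its own replication and promotion tooling.

```python
from collections import deque
from typing import Dict, List


class AsyncReplicatedStore:
    """Toy model of a primary that ships updates asynchronously to one standby."""

    def __init__(self) -> None:
        self.logs: Dict[str, List[str]] = {"region-a": [], "region-b": []}
        self.primary = "region-a"
        self.standby = "region-b"
        self._in_flight: deque = deque()  # committed on primary, not yet on standby

    def write(self, update: str) -> None:
        self.logs[self.primary].append(update)
        self._in_flight.append(update)

    def replicate_one(self) -> None:
        """Ship the oldest pending update to the standby."""
        if self._in_flight:
            self.logs[self.standby].append(self._in_flight.popleft())

    def planned_failover(self) -> List[str]:
        """Drain all pending updates first, then switch; nothing is lost."""
        while self._in_flight:
            self.replicate_one()
        self._promote_standby()
        return []

    def unplanned_failover(self) -> List[str]:
        """Switch immediately; updates still in flight are lost and returned."""
        lost = list(self._in_flight)
        self._in_flight.clear()
        self._promote_standby()
        return lost

    def _promote_standby(self) -> None:
        self.primary, self.standby = self.standby, self.primary


store = AsyncReplicatedStore()
for update in ("a", "b", "c"):
    store.write(update)
store.replicate_one()              # only "a" reaches the standby
print(store.unplanned_failover())  # the updates that never reached the standby
```

In the planned case, writes to the old primary would also be paused while the backlog drains; that step is omitted here to keep the sketch short.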

In this model, the application owner needs to ensure that the standby region functions well even though it is idle most of the time. The best practice is to perform planned failovers on a schedule that makes sense for the business.

Health probers should be used to continuously check the health of not only the primary region but also the standby region.
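One way to act on probes of both regions is sketched below. The `probe` callable stands in for whatever health check is actually used, and the action names are hypothetical labels rather than commands of any real system.

```python
from typing import Callable, Dict, Iterable


def probe_regions(regions: Iterable[str], probe: Callable[[str], bool]) -> Dict[str, bool]:
    """Run the supplied health probe against every region, primary and standby alike."""
    return {region: probe(region) for region in regions}


def next_action(status: Dict[str, bool], primary: str, standby: str) -> str:
    """Decide what to do from the latest probe results (labels are illustrative)."""
    if status[primary] and status[standby]:
        return "serve-from-primary"
    if status[primary]:
        # The idle standby is broken: a later failover would not work, so alert now.
        return "alert-standby-unhealthy"
    if status[standby]:
        return "fail-over-to-standby"
    return "alert-both-regions-unhealthy"


def fake_probe(region: str) -> bool:
    """Stand-in probe for the example: pretend only the standby is down."""
    return region != "region-b"


status = probe_regions(["region-a", "region-b"], fake_probe)
print(next_action(status, primary="region-a", standby="region-b"))
```

Probing the standby continuously means a broken standby is found while the primary is still healthy, rather than in the middle of a failover.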

This model can also be designed with a single-zone primary region and a single-zone failover region. This targets applications that are limited by the number of licenses or by architectural limitations, and it is an improvement over the Primary Zone with Failover Zone model for applications that need cross-regional business continuity.
