Is it OK to run your apps in a single zone?

Not Just Restart
6 min read · Sep 2, 2022

Is it a good idea to run your stack in only a single zone? That depends entirely on the criticality of your business. If the application serves low traffic and can be shut down after 6 PM and restarted at 8:30 AM, then a single-zone design is acceptable. A critical application with high traffic should not run in only a single zone. In this blog we will discuss the Zonal (single-zone) deployment and the Primary Zone with Failover (Secondary) Zone deployment.

Zonal (Single Zone)

Running an application in a single zone is not considered high availability. A zone is a single failure domain, both for software issues and for other types of disasters (e.g., fire). So applications that need supercomputer-like connectivity, as well as applications that do not need high availability, can leverage single-zone deployments.

High Performance Computing (HPC) and Tensor Processing Unit (TPU) pods are examples of cloud applications that are deployed and run in a single zone. These applications typically require very low latency and high bandwidth between nodes, which is achievable within a single zone. They do not serve live traffic, can tolerate lower availability, and can restart from a checkpoint whenever the service goes into a faulty state. The data for these applications can be kept in a regional data store, with the primary or one of the copies of the data stored in the zone where the data is read and processed. Another advantage of keeping applications that communicate heavily across VMs within the same zone is cost: cloud providers typically charge extra for egress between VMs across zones.

When developers want to test their service, they can go ahead with a single zone. A single-zone deployment works well for developer testing workloads and enables developers to continuously build and test their applications in the cloud. It may also be suitable for use cases where downtime is acceptable or the application can be restarted elsewhere. A single-zone application should be considered sufficient for these use cases, but not for most production applications.

Primary Zone with Failover Zone

When developers, DevOps, or SREs bring their applications from on-premises to the cloud, a first step often taken is to choose a deployment model that runs the application in the cloud with minimal changes. Some of these may be commercial applications that the application owners acquired and cannot change. In addition, these applications sometimes come with per-instance licenses that make deploying redundant extra copies prohibitively expensive. As a result, single-zone architecture continues to be a valid option for these applications.

Single-zone apps still require availability and redundancy. To achieve this within a region, we designate a primary zone and a failover zone; the failover (recovery) zone can be considered a backup zone. If the primary zone has an issue, we restart the apps in the failover zone, ensure they are healthy, and then route traffic to the failover zone. Many enterprise applications are built to run in some form of primary/failover configuration; this topology is known as Highly Available (HA) and is an established pattern used in enterprise and on-premises deployments over the years.

Let’s take an example of a single-license application running in the cloud that wants to have failover support. Consider two VMs in two different zones (A and B), where one is the primary and the other is used as the failover/recovery. In this example the application owner has to pay for every running instance, so the application runs only in the primary zone, not the failover zone, to cut costs. In this case there are generally three options for how clients connect to the application VM for license renewal:

Static IP address (floating IP address)

This static IP is used for license renewal and can be either a private or a public IP. The static IP address initially points to the primary VM (for this example, assume it is in zone A), which runs the single application instance. When zone A goes down, either manual or script-based reconfiguration kicks off and the application is started in zone B with the same static IP address. Clients can then continue to connect to the same IP address, whether they use DNS resolution or connect directly to the IP.
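The repointing step above can be sketched as a small simulation. This is a minimal illustration, not a provider API: the VM names, zone names, and address are made up, and a real setup would call the cloud provider's API to move the reserved address.

```python
# Toy simulation of a floating (static) IP that follows the active VM.
# Names and the address are illustrative; a real script would call the
# cloud provider's API to reassign the reserved IP.

class VM:
    def __init__(self, name, zone):
        self.name, self.zone, self.running = name, zone, False

    def start(self):
        self.running = True

class FloatingIP:
    def __init__(self, address, primary_vm, failover_vm):
        self.address = address        # clients always dial this address
        self.primary_vm = primary_vm
        self.failover_vm = failover_vm
        self.target = primary_vm      # initially points at the primary

    def handle_zone_outage(self, failed_zone):
        """Repoint the static IP if the active VM's zone went down."""
        if self.target.zone == failed_zone:
            standby = (self.failover_vm
                       if self.target is self.primary_vm
                       else self.primary_vm)
            standby.start()           # the app runs only on the active VM
            self.target = standby

ip = FloatingIP("203.0.113.10",
                VM("app-primary", "zone-a"),
                VM("app-failover", "zone-b"))
ip.handle_zone_outage("zone-a")
print(ip.target.name)  # -> app-failover
```

Because clients keep dialing the same address, nothing changes on the client side when the target VM moves.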

List of Static IP addresses

The IP addresses in the list are tried in a round-robin fashion when a connection is lost. The exact logic for picking an address from the list depends on the application's client-side behavior.
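One common client-side behavior can be sketched as follows. The addresses and the `connect` callable are stand-ins for the real connection attempt; this is one plausible retry loop, not the only one.

```python
from itertools import cycle

# Sketch of client-side retry over a list of static addresses.
# `connect` is a stand-in for the real connection attempt and should
# return True on success.

def connect_round_robin(addresses, connect, max_attempts=None):
    """Try each address in turn, wrapping around, until one connects."""
    attempts = max_attempts or len(addresses)
    pool = cycle(addresses)
    for _ in range(attempts):
        addr = next(pool)
        if connect(addr):
            return addr
    raise ConnectionError("no address in the list was reachable")

# Usage: pretend the primary-zone address is down.
reachable = {"10.0.1.5": False, "10.0.2.5": True}
chosen = connect_round_robin(["10.0.1.5", "10.0.2.5"],
                             lambda addr: reachable[addr])
print(chosen)  # -> 10.0.2.5
```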

Dynamic IP Addresses with DNS

If the IP for license renewal is not static, then DNS is used for resolution. In this case, DNS is configured to point to the primary VM in zone A. When zone A goes down, the DNS configuration is updated to point to the VM in zone B. The tradeoffs around DNS and how it relates to failover deployments are discussed below.
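The DNS reconfiguration step can be illustrated with a toy record table. The record name, IPs, and TTL values here are made up; a real setup would update the record through the DNS provider's API, and the TTL governs how long clients may keep resolving to the dead primary.

```python
# Toy DNS table illustrating the failover reconfiguration step.
# Record name, IPs, and TTLs are illustrative only.

dns = {"app.example.com": {"ip": "10.0.1.5", "ttl": 60}}

def failover_dns(table, name, failover_ip, low_ttl=30):
    """Point the record at the failover VM. A low TTL limits how long
    cached answers keep sending clients to the failed primary."""
    table[name] = {"ip": failover_ip, "ttl": low_ttl}

failover_dns(dns, "app.example.com", "10.0.2.5")
print(dns["app.example.com"]["ip"])  # -> 10.0.2.5
```

The TTL is the key tradeoff: a long TTL reduces resolver load but delays failover from the client's point of view.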

Let’s take another example: a basic application deployed in a primary zone with a replica for failover purposes in a secondary zone. Here we have a load balancer (LB), which denotes not just one instance but a highly available replicated setup. The setup has a replicated compute workload, the Front End (FE), and a cloud-managed database that holds the application data replicated across zones. Most databases will work in this configuration; for this example it is a MySQL database.

Let us consider zone A of the region to be the primary zone and zone B of the same region the failover zone. The primary instance of the MySQL database is placed in zone A, and all reads and writes go to this instance. The MySQL database is configured with a standby in zone B, and the data is replicated from zone A to zone B by the managed database. The Front End in zone A and the Front End in zone B are configured identically, using the same virtual IP address (10.3.2.1) to access the MySQL database. This means the Front-End service does not need to change the IP address of the MySQL instance when failover occurs. The load balancer is configured with a primary set of compute instances (VMs or containers) in zone A and failover instances in zone B.

Let's inject a fault here: assume that zone A fails. Every second, the primary Front End and SQL instance in zone A respond to a heartbeat signal from the monitoring system. If multiple heartbeats are not detected by the monitoring system, an alarm is sent and failover is initiated by the SRE or by an automated script. With failover initiated, the Front End in zone B now serves user traffic, and the standby MySQL instance in zone B is reconfigured to act as the primary MySQL instance using the same virtual IP address (10.3.2.1). The load balancer reacts to the failure in zone A by moving traffic to zone B, because it was configured to fail over to the FE in the other zone based on health-check status.
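The missed-heartbeat logic above can be sketched as a small monitor. The threshold of three consecutive misses is an assumption for illustration; real monitoring systems make this configurable and usually debounce more carefully.

```python
# Sketch of the missed-heartbeat detection described above: the monitor
# expects a heartbeat every second and triggers failover after several
# consecutive misses. The threshold is illustrative.

class HeartbeatMonitor:
    def __init__(self, miss_threshold=3):
        self.miss_threshold = miss_threshold
        self.missed = 0
        self.failover_triggered = False

    def observe(self, heartbeat_received):
        """Called once per second with the heartbeat result."""
        if heartbeat_received:
            self.missed = 0
        else:
            self.missed += 1
            if self.missed >= self.miss_threshold and not self.failover_triggered:
                self.trigger_failover()

    def trigger_failover(self):
        # In practice: page the SRE or run the automated script that
        # promotes the standby DB and shifts LB traffic to zone B.
        self.failover_triggered = True

mon = HeartbeatMonitor()
for beat in [True, True, False, False, False]:  # zone A goes silent
    mon.observe(beat)
print(mon.failover_triggered)  # -> True
```

Resetting the miss counter on every successful heartbeat avoids firing on isolated dropped packets; only a sustained silence crosses the threshold.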

Once traffic is being served from zone B and the primary SQL instance is in zone B, availability is re-established for the single-zone application after failover. Health checking is an important part of the failover process. As part of the health-check status, the application's services need to decide whether they are healthy, and this greatly depends on the service.

Each instance of the service needs to determine its health based on error rates, overuse of resources such as CPU and memory, or other custom metrics, and declare itself unhealthy in its health-check response. When zone A comes back, the load balancer does not send traffic back to zone A unless the SRE initiates a failback. The deployment will now be in a steady state with zone B as primary and zone A as failover, until a failback is performed to make zone A the primary again.
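A per-instance health decision based on the signals named above might look like the sketch below. The thresholds are made-up examples; each service has to pick values that match its own behavior.

```python
# Sketch of a per-instance health decision using the signals mentioned
# above: error rate and CPU/memory utilization. Thresholds are
# illustrative assumptions, not recommendations.

def health_status(error_rate, cpu_util, mem_util,
                  max_errors=0.05, max_cpu=0.9, max_mem=0.9):
    """Return the string the instance reports in its health-check
    response: 'healthy' or 'unhealthy'."""
    ok = (error_rate <= max_errors
          and cpu_util <= max_cpu
          and mem_util <= max_mem)
    return "healthy" if ok else "unhealthy"

print(health_status(0.01, 0.50, 0.60))  # -> healthy
print(health_status(0.10, 0.50, 0.60))  # -> unhealthy (error rate too high)
```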

A good practice here is to reserve the capacity needed for failover in the failover zone, ready to go in case of a failure, and to routinely fail the application over between zones to ensure failover works when it is needed. This practice is known as DR fault testing or Business Continuity Management.

SREs should focus on automation here: whenever the primary zone has an outage, an automatic recovery script should kick off and update the load balancer pool to route traffic to the failover zone.
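That automation could be sketched as below. The pool and zone names are hypothetical, and the config dict stands in for the load balancer's API; a real script would call the cloud provider's load-balancer API instead of mutating a dict.

```python
# Hedged sketch of the recovery automation: on a primary-zone outage,
# repoint the load balancer's active backend pool to the failover zone.
# Pool/zone names are illustrative; a real script would call the
# provider's load-balancer API.

lb_config = {
    "active_pool": "fe-zone-a",
    "pools": {"fe-zone-a": "zone-a", "fe-zone-b": "zone-b"},
}

def auto_recover(config, failed_zone):
    """If the active pool lives in the failed zone, switch the load
    balancer to a pool in a surviving zone and return its name."""
    if config["pools"][config["active_pool"]] == failed_zone:
        for pool, zone in config["pools"].items():
            if zone != failed_zone:
                config["active_pool"] = pool
                return pool
    return config["active_pool"]  # active pool was unaffected

print(auto_recover(lb_config, "zone-a"))  # -> fe-zone-b
```

Keeping the script idempotent (it does nothing if the active pool is already outside the failed zone) makes it safe to trigger from multiple alerts.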
