What do you mean by availability?

Sep 2, 2022

The Internet is now so dominant that services are expected to run 24/7. Users might expect a transaction to complete within 20 milliseconds. That figure is only an example, and some companies keep improving the user experience by lowering whatever target they have defined. In the SRE world, engineers need to understand who the end user is, how frequently the service is used, and what the availability target should be: 99%, 99.99%, or 99.999%. Each extra nine means something different and is bound up with a tradeoff between clients and the service provider, for example around low latency. Some companies have migrated applications to the cloud while still running some modules on-premises. When running an application in the cloud, they should consider security, identity, data recovery, data and traffic management, cost optimization, and much more.

What do you mean by Availability?

The level of availability each part of the application needs depends entirely on its business purpose. Some applications only need three nines (99.9%) availability, which means the service can be unavailable for at most about 43 minutes a month. Other applications need four nines (99.99%) availability, which means the application can be unavailable for at most about 52 minutes a year. Then there are critical applications that need five nines (99.999%) availability, which can be unavailable for at most about 5 minutes a year. To achieve these levels of availability, it is important to understand what is needed for each part of the application and invest in closing the gap between current and desired availability for each part.
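The downtime figures above follow from simple arithmetic. Here is a minimal sketch of that calculation, assuming a 365-day year and an average month of one twelfth of a year; the function name `downtime_budget_minutes` is made up for illustration.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def downtime_budget_minutes(availability_percent: float,
                            period_minutes: float = MINUTES_PER_YEAR) -> float:
    """Return the maximum downtime (in minutes) allowed over the period."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be between 0 and 100")
    return period_minutes * (1 - availability_percent / 100)


# Print the yearly and average monthly budget for each target.
for target in (99.9, 99.99, 99.999):
    per_year = downtime_budget_minutes(target)
    per_month = per_year / 12
    print(f"{target}%: {per_year:.1f} min/year, {per_month:.1f} min/month")
```

With these assumptions, three nines works out to roughly 43 to 44 minutes a month, four nines to roughly 52 to 53 minutes a year, and five nines to a little over 5 minutes a year.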

Investing in availability comes at a cost, but it is often crucial to the long-term success of the business, since availability directly or indirectly influences the reputation of the business and the satisfaction of the application’s users.

In this post, the overall availability of the application groups together the following:

(a) Time to access the application.

(b) Time to get a response with valid results.

(c) Assurance that data is stored and maintained with integrity.

(d) Application’s ability to scale and handle peak traffic demands.

Availability is best designed into an application from the start. Adding availability as a feature later can require re-architecting the application, sometimes to the point of a full rewrite. A key part of the design is how the application addresses fault domains and how it provides redundancy and scales across those fault domains to maximize availability. A fault domain is a set of infrastructure parts that together represent a single point of failure. To increase availability, applications need to run and store their data across multiple fault domains (zones and regions) and have the ability to balance load or fail over in case of failure.
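To get a feel for why running across several fault domains helps, here is a rough sketch. It assumes fault domains fail independently of each other and that failover is instant and perfect, which real systems only approximate. The function name `redundant_availability` is made up for illustration.

```python
def redundant_availability(per_domain: float, domains: int) -> float:
    """Availability when any one of `domains` fault domains can serve traffic.

    Assumes independent failures: the application is down only when
    every fault domain is down at the same time.
    """
    if not 0.0 <= per_domain <= 1.0:
        raise ValueError("per_domain must be a fraction between 0 and 1")
    if domains < 1:
        raise ValueError("at least one fault domain is required")
    return 1.0 - (1.0 - per_domain) ** domains


# Each fault domain on its own is assumed to be 99% available.
for n in (1, 2, 3):
    print(f"{n} domain(s): {redundant_availability(0.99, n):.6f}")
```

Under these assumptions, two fault domains at 99% each give roughly 99.99% combined. Correlated failures, such as a bad configuration pushed everywhere at once, break the independence assumption and lower the real number.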

Data needs to be replicated and backed up so that it is not lost, and checks should be in place to make sure data is not corrupted. Applications need to be able to quickly load balance across multiple instances so that they can scale to the largest traffic the application will see. This includes minimizing startup and shutdown time, so that applications can be restarted and scaled up and out quickly.
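One common way to check that data is not corrupted is to store a checksum next to each record and verify it when a replica or backup is read. The sketch below uses SHA-256 from Python's standard library; the helper names `checksum` and `is_intact` and the sample record are made up for illustration.

```python
import hashlib


def checksum(data: bytes) -> str:
    """Return the SHA-256 digest of the data as a hex string."""
    return hashlib.sha256(data).hexdigest()


def is_intact(data: bytes, expected_digest: str) -> bool:
    """Return True if the data still matches the digest stored at write time."""
    return checksum(data) == expected_digest


# Hypothetical record: compute the digest at write time, verify on read.
record = b"order-42: 3 items"
stored_digest = checksum(record)

print(is_intact(record, stored_digest))                # unchanged copy
print(is_intact(b"order-42: 9 items", stored_digest))  # corrupted copy
```

A checksum only detects corruption. Recovering from it still depends on having a healthy replica or backup to restore from.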

Two important concepts for minimizing the impact of an outage:

1. Shard the application.

2. Make sure all application updates are done incrementally and can be rolled back.

Applications may shard their users or data so that they are served from different fault domains. In this way, an issue with one fault domain affects only a subset of the users or data, which contains the failure radius (often called the blast radius).
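As a small illustration of sharding, the sketch below assigns each user to a fault domain by hashing the user ID. The zone names and function names are hypothetical, and plain modulo hashing is only one option. It reshuffles most users when the number of domains changes, which is why real systems often use consistent hashing or a lookup table instead.

```python
import hashlib
from typing import Iterable, List, Sequence

FAULT_DOMAINS = ["zone-a", "zone-b", "zone-c"]  # hypothetical zone names


def shard_for(user_id: str, domains: Sequence[str] = FAULT_DOMAINS) -> str:
    """Deterministically map a user to one fault domain."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(domains)
    return domains[index]


def affected_users(user_ids: Iterable[str], failed_domain: str) -> List[str]:
    """Return the users served by the failed fault domain (the blast radius)."""
    return [u for u in user_ids if shard_for(u) == failed_domain]


users = [f"user-{i}" for i in range(300)]
hit = affected_users(users, "zone-a")
print(f"{len(hit)} of {len(users)} users affected if zone-a fails")
```

With three fault domains, an outage in one of them should touch roughly a third of the users rather than all of them.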

In parallel, code and configuration changes should be rolled out incrementally across the different fault domains. This gradually introduces a change into production, with the ability to quickly roll back to a healthy state if any production issues are discovered. It allows code and configuration issues to be discovered early and limits the impact to only those parts of the application running in the fault domains being updated. In addition to rolling back recent application changes, draining or shedding load from the affected fault domains is often used to quickly mitigate issues.
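The rollout pattern described above can be sketched as a simple loop. This is only an outline under stated assumptions: `deploy`, `is_healthy`, and `rollback` are hypothetical callbacks that stand in for whatever deployment tooling and health checks a team actually uses.

```python
from typing import Callable, List, Sequence


def rollout(domains: Sequence[str],
            deploy: Callable[[str], None],
            is_healthy: Callable[[str], bool],
            rollback: Callable[[str], None]) -> bool:
    """Update one fault domain at a time; undo everything on the first failure.

    Returns True if every domain was updated and stayed healthy.
    """
    updated: List[str] = []
    for domain in domains:
        deploy(domain)
        updated.append(domain)
        if not is_healthy(domain):
            # Roll back the most recently updated domain first.
            for done in reversed(updated):
                rollback(done)
            return False
    return True


# Hypothetical usage: zone-b fails its health check after the update.
events: List[str] = []
succeeded = rollout(
    ["zone-a", "zone-b", "zone-c"],
    deploy=lambda d: events.append(f"deploy {d}"),
    is_healthy=lambda d: d != "zone-b",
    rollback=lambda d: events.append(f"rollback {d}"),
)
print(succeeded, events)
```

In this run zone-c is never touched, so the bad change only ever reaches two of the three fault domains before it is reverted.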

Together, these practices determine how large the impact is on the application and its users when there is an outage. Ideally an issue has no impact on users at all. When it does, following the best design and deployment practices means that only a small set of users, in one or a few fault domains, is affected.

Application teams need to understand their dependencies, the availability and failure modes of those dependencies, and how those availabilities multiply together to affect the application’s design and availability.
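To make the multiplicative effect concrete, here is a rough sketch. It assumes the service needs every one of its dependencies to be up in order to answer a request, and that their failures are independent. The function name `serial_availability` and the sample numbers are made up for illustration.

```python
from math import prod
from typing import Iterable


def serial_availability(availabilities: Iterable[float]) -> float:
    """Availability of a request path that needs every component to be up."""
    return prod(availabilities)


# A service at 99.9% that hard-depends on three other 99.9% services.
chain = [0.999, 0.999, 0.999, 0.999]
print(f"{serial_availability(chain):.4%}")
```

Four components at three nines each give a combined availability of about 99.6%, which is lower than any single component on its own. Every additional hard dependency pulls the number down further.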

The fewer dependencies a service has, the better. Avoid linking in code or calling out to other services and APIs that bring in unknown dependencies. As part of overall manageability, it is important to separate the application into its vital and non-vital services, identify the availability targets of each, and continuously improve the vital parts. If a service is vital, then all the dependencies of its vital components (recursively down the call chain) should either be highly available, or the component should be able to function in their absence. Examine availability for each service in the application independently and for the application as a whole.

The next blog post will discuss different deployment archetypes and examine the availability applications can achieve with each archetype, with a focus on overall availability as described here.