Cut the Cost: SRE
Optimize the cost
In the SRE world, cost optimization is a core task. SREs have to ensure their resources are properly utilized. If utilization is below the defined threshold, they have to optimize: reduce the resource count, reduce cores or memory, or re-architect the service. In the cloud we get resources on demand, and resource cost is high, so SREs have to be very clear while procuring resources: they should know what the expected usage is and what type of machines are required to run the business. At the same time, SREs have to be careful while downsizing; the business should not be affected by under-provisioned resources. No one can predict the request count; sometimes there is a sudden surge due to a business event, or a DDoS attack may happen. SREs have to be ready 24x7 while scaling resources up or down.
Types of resources
Compute based — if your application mostly needs CPU (services/applications).
Memory based — if your application needs a lot of memory, for example in-memory caches.
IO based — if your application needs fast IO, for example databases.
General purpose — if your application has balanced needs.
GPU based — if your application is used for machine learning or gaming.
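As a concrete illustration, these categories map onto cloud instance families. The AWS EC2 family names below are real examples, but the mapping itself is a simplification and other providers use different names:

```python
# Illustrative mapping from workload type to an AWS EC2 instance family.
# The family names are real AWS examples; treat the mapping as a rough guide.
INSTANCE_FAMILY = {
    "compute": "c5",   # compute optimized: app/service tiers
    "memory":  "r5",   # memory optimized: in-memory caches like Redis
    "io":      "i3",   # storage/IO optimized: databases
    "general": "m5",   # general purpose: balanced workloads
    "gpu":     "p3",   # GPU instances: ML training, rendering
}

def pick_family(workload: str) -> str:
    return INSTANCE_FAMILY[workload]

print(pick_family("memory"))  # r5
```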
Where should I start optimizing?
Understand your service architecture. There are certain points you have to check before optimizing:
1. How many servers are deployed in each zone?
2. How many service load balancers are deployed?
3. What type of resources are deployed?
4. What are the trends for the last 30 days?
5. What is the request count per day or hour?
6. What is the maximum usage of the resources?
7. What is the load sharing rate?
8. When was the last optimization done?
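The checklist above can be captured as a small structured record. This Python sketch uses illustrative field names and thresholds, not a real monitoring API; populate it from your own metrics backend:

```python
# A minimal sketch of the pre-optimization checklist as a structured record.
# Field names and thresholds are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class ServiceSnapshot:
    servers_per_zone: dict       # e.g. {"AZ1": 2, "AZ3": 2}
    load_balancers: int
    resource_type: str           # "compute", "memory", "io", "general", "gpu"
    max_cpu_pct_30d: float       # peak CPU over the 30-day trend window
    requests_per_hour: float
    last_optimized_days_ago: int

def ready_to_downsize(s: ServiceSnapshot, cpu_threshold: float = 30.0) -> bool:
    """Consider downsizing only when the 30-day peak stays under the
    threshold and the last optimization is not too recent."""
    return s.max_cpu_pct_30d < cpu_threshold and s.last_optimized_days_ago > 30

snap = ServiceSnapshot({"AZ1": 2, "AZ3": 2}, 1, "compute", 1.0, 120.0, 365)
print(ready_to_downsize(snap))  # True
```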
How many servers are required?
Developers usually worry about the availability of their services, so they tend to deploy too many servers, above the required capacity. But once the service takes live traffic, we tend to see low CPU or memory usage and a low request rate. In a cloud environment everyone uses auto scaling: if certain conditions are met, the ASG (Auto Scaling group) will spin up new resources. The SRE has to make sure the service runs across at least two availability zones, so that if one AZ goes down the service is not impacted. If the request rate is too low, the SRE should optimize by downgrading the resource configuration or the number of hosts; likewise, if the request rate is high, the SRE should not hesitate to add configuration or hosts.
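A minimal sketch of that scale-in/scale-out decision, with assumed thresholds and a guard that keeps at least one host per AZ so an AZ outage cannot take the service down:

```python
def scaling_action(cpu_pct: float, low: float = 20.0, high: float = 70.0) -> str:
    """Return a scaling decision from average CPU utilization.
    The 20%/70% thresholds are illustrative; tune them per service."""
    if cpu_pct > high:
        return "scale_out"   # add hosts or upgrade the configuration
    if cpu_pct < low:
        return "scale_in"    # remove hosts or downgrade the configuration
    return "hold"

def hosts_after_scale_in(hosts_per_az: dict) -> dict:
    """Remove one host per AZ, but never drop below one host per AZ."""
    return {az: max(1, n - 1) for az, n in hosts_per_az.items()}

print(scaling_action(5.0))                          # scale_in
print(hosts_after_scale_in({"AZ1": 2, "AZ3": 2}))   # {'AZ1': 1, 'AZ3': 1}
```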
Sudden surge
Say, for example, you host an e-commerce website and announce that a huge sale will happen on a given date. Many people are notified and are ready to shop on your site. The sale day starts, and the SRE sees a very high request count on the dashboard. On the other end, CPU on the business machines, memory on the caching layer, and IO on the database layer are all too high. At some point your website starts loading slowly. To handle this chaos, the SRE should not hesitate to add resources to the load balancing pool. If your website runs in a cloud environment, a proper auto scaling group has to be in place: once its conditions are met, it automatically creates resources in the load balancing pool, and once the load drops, the ASG automatically shrinks the pool again. This auto healing approach is helpful during a sudden surge.
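Target-tracking style auto scaling works roughly like this calculation: capacity is scaled in proportion to how far the metric is from its target. This is the general idea behind cloud ASG target tracking; the min/max bounds here are assumptions for the sketch:

```python
import math

def desired_capacity(current: int, metric: float, target: float,
                     min_size: int = 2, max_size: int = 20) -> int:
    """Scale capacity proportionally so the metric returns to its target,
    clamped to assumed min/max pool sizes."""
    desired = math.ceil(current * metric / target)
    return max(min_size, min(max_size, desired))

# 4 hosts at 90% CPU with a 50% target -> grow to 8 hosts
print(desired_capacity(current=4, metric=90.0, target=50.0))  # 8
```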
Low load on compute machines
My service runs on Tomcat on compute based machines. It has 2 machines with a 4C32G configuration in each of the AZ1 and AZ3 availability zones. So AZ1 has 8C64G in total, AZ3 has the same, and overall it is 16C128G. Actual CPU usage is just 1% and the request rate is too low; the same pattern has held for the last year. Say a 4C32G machine costs $3 per month. Four machines cost $12 per month, so for 12 months it is 12 x 12 = $144 per year. The SRE should recognize this cost and first cut one machine in each AZ: AZ1 should have 4C32G instead of 8C64G, and the same for AZ3. Once this first level of optimization is done, look at CPU usage again. If there is no meaningful increase, reduce the configuration from 4C32G to 2C16G and make that the default. By cutting one machine per AZ the SRE halves the cost to $72 per year, and assuming the price scales with machine size, the further downgrade roughly halves it again.
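The arithmetic above in code, assuming $3 per month for a 4C32G machine and a price that scales linearly with machine size:

```python
# Cost math for the Tomcat example. Price assumption: $3/month per 4C32G host.
PRICE_4C32G_PER_MONTH = 3.0

machines = 4                                 # 2 per AZ across AZ1 and AZ3
yearly = machines * PRICE_4C32G_PER_MONTH * 12
print(yearly)                                # 144.0

after_cut = 2 * PRICE_4C32G_PER_MONTH * 12   # one machine removed per AZ
print(after_cut)                             # 72.0

# Downgrading the remaining hosts 4C32G -> 2C16G halves the price again,
# assuming price scales linearly with size.
after_downgrade = 2 * (PRICE_4C32G_PER_MONTH / 2) * 12
print(after_downgrade)                       # 36.0
```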
Low load on memory based machines
In-memory cache technologies require memory based (memory optimized) machines. For example, Redis is an in-memory cache solution, so the SRE has to focus on memory usage here. Say you have a 4C64G machine and Redis is consuming only 8G out of 64G. The SRE should act: first look at overall memory usage across all the services running on the machine. If usage is too low, with Redis consuming just 8G and the other services only 4G, the action plan should be to go from 64G to 32G first, i.e. from 4C64G to 4C32G. After this first level of optimization and a week of observation, if usage is still below the target, the SRE can cut further: 4C16G or 2C16G would be appropriate.
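A sketch of that right-sizing rule: keep some headroom over actual usage and round up to the next power-of-two size, which is the granularity cloud memory sizes typically come in. The 2x headroom factor is an assumption:

```python
def right_size_memory(total_gb: int, used_gb: float, headroom: float = 2.0) -> int:
    """Suggest a smaller memory size when utilization is low.
    Keeps `headroom` times the current usage and rounds up to a
    power of two; the headroom factor is an illustrative assumption."""
    needed = used_gb * headroom
    size = 1
    while size < needed:
        size *= 2
    return min(size, total_gb)

# 12G used (8G Redis + 4G other services) on a 64G host -> 32G suggested
print(right_size_memory(64, 12.0))  # 32
```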
Now consider that Redis memory usage is very high and the SRE is clueless about this weird behavior. Here the SRE should take a 360 degree view: check Redis health and the overall request count for the business. Confirm that the app tier running on the compute machines also has high CPU usage; once that is confirmed and the same behavior shows on the memory optimized machines, the SRE should either scale the configuration up or add nodes to the cluster to balance the load. In a cloud environment, the ASG will help.
Some SRE architects reserve memory for Redis. For example, on an 8G machine, 4G is reserved for Redis and the server uses the other 4G for other purposes.
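In Redis itself this reservation is expressed with the `maxmemory` directive in redis.conf; `maxmemory` and `maxmemory-policy` are real Redis configuration directives, and the sizes below match the 8G example:

```conf
# redis.conf: cap Redis at half of an 8G host, leaving the rest
# to the OS and other services.
maxmemory 4gb
maxmemory-policy allkeys-lru   # evict least-recently-used keys when full
```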
Low Load on IO based machines
These types of machines are mainly used for databases. A DB does many IO operations: heavy reads and writes, and in parallel it has to sync data to its replica (slave) nodes. That is why IO based machines are required to run databases.
IO is tightly coupled with CPU. If CPU usage on the machine is very high, the SRE has to check the MySQL process state: R (running) or D (blocked on IO) states make it clear that heavy read or write operations are in progress. The SRE should be capable of identifying which SQL is executing and whether any slow query is running, and in parallel how many established connections there are. The fix here is usually to add more replicas, either read or write replicas. In a cloud environment this can heal automatically by adding read or write replicas. Downgrading cores or memory on a database is risky, so the best optimization approach is adjusting the replica count.
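A hypothetical rule of thumb for the replica decision. The connection and slow-query thresholds below are assumptions chosen to illustrate the shape of the check, not recommended values; derive real ones from your database's measured capacity:

```python
def replica_action(established_conns: int, slow_queries_per_min: float,
                   read_replicas: int, conn_limit: int = 500) -> str:
    """Decide whether to add or remove a read replica.
    All thresholds here are illustrative assumptions."""
    if established_conns > 0.8 * conn_limit or slow_queries_per_min > 5:
        return "add_read_replica"
    if established_conns < 0.2 * conn_limit and read_replicas > 1:
        return "consider_removing_replica"
    return "hold"

# 450 connections against an assumed 500 limit -> add a replica
print(replica_action(450, 2.0, read_replicas=2))  # add_read_replica
```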
Low load on GPU based machines
Nowadays AI and ML (machine learning) solutions are very popular. They can't run well on normal compute, memory, or IO based machines: GPU machines are required, running on top of a GPU driver, mostly the NVIDIA driver. Machine learning code requires GPU based machines because it needs a lot of data for training and testing. Data is the heart of machine training, and training demands large, continuous data sets to expand and refine what an algorithm can do. The more data, the better algorithms can learn from it. This is particularly true with deep learning algorithms and neural networks, where parallel computing supports complex, multi-step processes.
Here the optimization technique is quite different. The SRE has to focus on GPU utilization, persistence mode, and memory usage. nvidia-smi is the command to check GPU resource usage. If usage is too low, the SRE has to downgrade to a suitable GPU flavor, aligned with actual usage. Here the SRE has to work along with the MLOps team so the business is not impacted. Sometimes the SRE also has to reinstall the driver and services after a server reboot due to a kernel patch.
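The real `nvidia-smi --query-gpu` flags produce machine-readable CSV that is easy to act on. This sketch parses a made-up sample of that output (the utilization numbers are invented); the 10% threshold is an assumption:

```python
import csv, io

# Sample output of the real command:
#   nvidia-smi --query-gpu=utilization.gpu,memory.used,memory.total \
#              --format=csv,noheader,nounits
# One line per GPU; the numbers below are invented for the sketch.
sample = "3, 512, 16160\n2, 498, 16160\n"

def underutilized(output: str, util_threshold: float = 10.0) -> bool:
    """Return True when average GPU utilization is below the threshold."""
    rows = [list(map(float, row)) for row in csv.reader(io.StringIO(output))]
    avg_util = sum(r[0] for r in rows) / len(rows)
    return avg_util < util_threshold

print(underutilized(sample))  # True
```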
Dashboard and Chatops rules
The SRE has to create dashboards to monitor service resources, business transaction failures, latency issues, and success rate. Proper dashboards have to be set up at the PaaS layer to monitor CPU, memory, disk, disk read/write IO, and NIC bandwidth. There are several monitoring technologies for this, for example Grafana (open source) and Dynatrace (commercial). Some SREs configure ChatOps based rules: they run API calls to pull the metrics for each resource, and if certain conditions are met, the SRE takes action to optimize.
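A minimal ChatOps-style rule in Python. `fetch_cpu` is a hypothetical stand-in for a real metrics API call (e.g. a Prometheus or Grafana HTTP query), the host names are invented, and the thresholds are assumptions:

```python
# A minimal ChatOps-style rule: pull a metric and emit an alert message.
# fetch_cpu() is a hypothetical stand-in for a real monitoring API call.

def fetch_cpu(host: str) -> float:
    # Replace with an HTTP call to your metrics backend; the values
    # here are hard-coded so the sketch runs on its own.
    return {"app-01": 4.0, "app-02": 85.0}.get(host, 50.0)

def rule(host: str, low: float = 20.0, high: float = 70.0) -> str:
    cpu = fetch_cpu(host)
    if cpu < low:
        return f"{host}: CPU {cpu}% -> candidate for downsizing"
    if cpu > high:
        return f"{host}: CPU {cpu}% -> consider scaling out"
    return f"{host}: CPU {cpu}% -> OK"

print(rule("app-01"))  # app-01: CPU 4.0% -> candidate for downsizing
```

A real deployment would post these messages to a chat channel so the on-call SRE can approve or trigger the optimization.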