Intentionally Break the System - SRE

NotJustRestart
7 min read · Sep 2, 2022


What exactly does it mean?

Break the system. In the SRE world, engineers deliberately break the system, and ideally while it is serving high traffic. This helps SREs identify where reliability is lacking and how to improve it, making the system more resilient and reliable. The practice is called chaos engineering, and most SREs use it widely. The main intent is to identify risks and gaps. Consider a service running in only a single availability zone: if that zone goes down, what would be the impact, and how can we cover it?

Most well-established companies implement and follow this approach. Breaking the system is not an easy task, and it can impact many things: the business, user experience, and availability. Many approvals are required to adopt and implement this methodology. When is the right time to run it? Some companies practice it every quarter, some yearly, and some suddenly in the middle of the night. We will talk more about this below.

BCP (Business Continuity Planning) is nothing but testing the environment: doing it lets SREs build a lot of automation and generates good ideas, and SREs can explore many open source solutions to close the gaps.

How do you break the system?

This is important to look at. Netflix's Chaos Monkey is one of the open source projects in this space and a pioneer of chaos engineering, offering many faults to inject into systems. Based on observation, the faults can be categorized into four layers, the PaaS (Platform as a Service) layer, the SaaS (Software as a Service) layer, the network layer, and the data layer, plus DDoS attacks. We will look at each layer and how exactly it helps improve reliability. SREs have to be very cautious while injecting and removing faults.

Faults at the SaaS Layer

Software as a Service: faults are injected at the service level. Examples include a sudden shutdown, restarting the service, making a particular Java process consume high CPU, driving up memory usage, making the service hang or crash, deleting the Tomcat code directory, or deleting the software binary directory.

SREs should focus on auto-healing. If the service is down or crashed, or CPU and memory are high, a watchdog should be in place to automatically bring the service back up. If the source code or process directory is deleted, an event-triggered restore has to kick off; for example, if your environment and configuration files are managed with Puppet or Ansible, they should be restored automatically. SREs should also put proper, genuine alarms in place. Recovery time is very important here: incidents should auto-resolve within the defined SLA.
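A minimal sketch of such a watchdog in Python, assuming the service is managed by systemd under the hypothetical unit name my-service and the script runs with enough privileges to call systemctl:

```python
#!/usr/bin/env python3
"""Minimal watchdog sketch: restart a service when it goes down."""
import subprocess
import time

SERVICE = "my-service"  # hypothetical systemd unit name
CHECK_INTERVAL = 10     # seconds between health checks

def is_active(service: str) -> bool:
    # `systemctl is-active --quiet` exits 0 only when the unit is running.
    result = subprocess.run(["systemctl", "is-active", "--quiet", service])
    return result.returncode == 0

def main() -> None:
    while True:
        if not is_active(SERVICE):
            # Auto-heal: bring the service back and leave a log trail
            # so the SRE can correlate it with the injected fault.
            print(f"{SERVICE} is down, restarting")
            subprocess.run(["systemctl", "restart", SERVICE])
        time.sleep(CHECK_INTERVAL)

if __name__ == "__main__":
    main()
```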

Faults at the PaaS Layer

Platform as a Service: faults are injected at the platform layer, such as shutting down a machine, suddenly rebooting machines, crashing the system through high memory or CPU, sudden spikes in CPU, memory, or I/O, exhausting inode capacity, filling a mount to capacity, or removing a mount.

SREs should have valid alarms in place for any crossed threshold, and an automatic healing process should kick off. When physical or virtual machines are shut down, there should be an auto-reboot, which can be achieved with an event-driven approach: the host goes down, an alarm triggers, and remediation starts; on cloud infrastructure, an API call restarts the server. SREs should also be capable of removing an impacted node from the load balancer for a short period of time. This can be automated by triggering an API call that removes the node from the ELB pool, as in the sketch below.
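A hedged sketch of that removal, assuming an AWS ALB/NLB with an IP-based target group reachable through boto3; the ARN and IP below are placeholders:

```python
"""Sketch: pull an impacted node out of an AWS target group."""
import boto3

elbv2 = boto3.client("elbv2")

def remove_from_pool(target_group_arn: str, node_ip: str) -> None:
    # Deregistering starts connection draining: the node stops
    # receiving new requests while in-flight ones finish.
    elbv2.deregister_targets(
        TargetGroupArn=target_group_arn,
        Targets=[{"Id": node_ip}],
    )

def restore_to_pool(target_group_arn: str, node_ip: str) -> None:
    # Re-register once the node passes health checks again.
    elbv2.register_targets(
        TargetGroupArn=target_group_arn,
        Targets=[{"Id": node_ip}],
    )

if __name__ == "__main__":
    ARN = "arn:aws:elasticloadbalancing:...:targetgroup/placeholder"
    remove_from_pool(ARN, "10.0.0.12")
```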

Consider high CPU or memory: most cloud environments have an auto-healing approach such as auto scaling. If CPU is high, a new VM is added based on the scaling condition, launched from the respective image. For high I/O, cloud infrastructure has many solutions, such as burstable credits that let the system recover within a short period. Automated troubleshooting should be in place here too.
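As a fallback an alarm handler might call when the scaling policy itself is under test, here is a sketch that bumps the desired capacity of an AWS Auto Scaling group via boto3; the group name is a placeholder:

```python
"""Sketch: event-driven scale-out fallback for sustained high CPU."""
import boto3

autoscaling = boto3.client("autoscaling")

def scale_out(group_name: str, extra_nodes: int = 1) -> None:
    group = autoscaling.describe_auto_scaling_groups(
        AutoScalingGroupNames=[group_name]
    )["AutoScalingGroups"][0]
    # Respect the group's configured ceiling.
    desired = min(group["DesiredCapacity"] + extra_nodes, group["MaxSize"])
    autoscaling.set_desired_capacity(
        AutoScalingGroupName=group_name,
        DesiredCapacity=desired,
        HonorCooldown=True,
    )

if __name__ == "__main__":
    scale_out("my-service-asg")  # hypothetical group name
```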

Consider a mount volume that is removed or 100% utilized. The SRE gets the alarm, and an automatic resource-creation request should kick off; this can be achieved with an alarm-driven, event-driven approach. Scripts should be ready to free space on a mount that hits 100%, or to resize the volume, and if the mount is removed, another script should remount it.
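A sketch of such a cleanup handler, assuming files under a hypothetical /data/logs directory are safe to prune once they age out; the volume-resize path is left to a cloud-specific workflow:

```python
"""Sketch: alarm-triggered handler for a mount near 100% utilization."""
import os
import shutil
import time

MOUNT = "/data"           # hypothetical mount point
THRESHOLD = 0.90          # act when the mount is >90% used
RETENTION_DAYS = 7
LOG_DIR = "/data/logs"    # hypothetical directory of reclaimable files

def usage_ratio(path: str) -> float:
    total, used, _free = shutil.disk_usage(path)
    return used / total

def prune_old_files(directory: str, days: int) -> None:
    cutoff = time.time() - days * 86400
    for root, _dirs, files in os.walk(directory):
        for name in files:
            full = os.path.join(root, name)
            if os.path.getmtime(full) < cutoff:
                os.remove(full)

if usage_ratio(MOUNT) > THRESHOLD:
    prune_old_files(LOG_DIR, RETENTION_DAYS)
    # If still above threshold, escalate to a volume-resize workflow.
```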

Once the PaaS-layer faults have been kicked off, SREs should be vigilant in monitoring traffic and business transactions. If, with all faults in place, the business transaction success rate stays at 100%, delay stays under the threshold, and all automated remediations run, then we can consider the environment stable and reliable.

Faults at the Network Layer

Here the faults can be quite different: un-plumbing an IP, network packet delay, TCP connection suspension, a NIC going down, an ELB outage, or shutting down the DNS service. SREs should be cautious and skilled enough to troubleshoot these issues. If the ELB is down, your entire service will not receive any requests to serve, so there should be a proper alarm when the request count drops below a sensible baseline. Consider a service that by default gets 100,000 requests per minute: if that drops to 10,000, first ensure all the nodes under your scope are healthy, then check the ELB logs, and if the ELB is at fault, restore it. To avoid this kind of interruption, most established companies deploy the ELB across a minimum of three availability zones.
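A sketch of that traffic-drop check, assuming a hypothetical helper feeds it the current requests-per-minute from your monitoring system; the thresholds mirror the numbers above:

```python
"""Sketch: alert when the request rate collapses below a baseline."""
BASELINE_RPM = 100_000
DROP_RATIO = 0.10  # alert if traffic falls to 10% of baseline

def page_oncall(message: str) -> None:
    print("ALERT:", message)  # placeholder for a real pager integration

def check_traffic(current_rpm: int) -> None:
    if current_rpm < BASELINE_RPM * DROP_RATIO:
        # Likely an ELB/DNS fault rather than an organic traffic dip:
        # verify node health first, then inspect the ELB access logs.
        page_oncall(f"request rate dropped to {current_rpm}/min")

check_traffic(10_000)  # hypothetical reading from monitoring
```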

In parallel there are other faults, such as un-plumbing an IP: when that alarm triggers, you can recover by rebooting the server through the event-driven approach. If a NIC is down, a script should be in place to bring the interface back up.
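A sketch of such a script, assuming a Linux host where the interface name (eth0 here) is known and the ip utility is available:

```python
"""Sketch: remediation for a downed network interface."""
import subprocess

def interface_is_up(iface: str) -> bool:
    # `ip link show` reports the operational state of the interface.
    state = subprocess.run(
        ["ip", "link", "show", iface],
        capture_output=True, text=True,
    ).stdout
    return "state UP" in state

def bring_up(iface: str) -> None:
    subprocess.run(["ip", "link", "set", "dev", iface, "up"], check=True)

if not interface_is_up("eth0"):
    bring_up("eth0")
```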

TCP connection suspension, packet delay, and packet loss are more crucial faults. If an important transaction is in flight when a TCP connection is suspended or delayed, session replication should be in place so that another node is ready to serve by reading the session data. This is something your code must be capable of handling; some fault improvements have to come from the development side.
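A client-side sketch of that failover, with placeholder node URLs, assuming session state is already replicated between the nodes:

```python
"""Sketch: retry a stalled request on a peer that shares session state."""
import requests

NODES = ["https://app-node-1.internal", "https://app-node-2.internal"]

def resilient_post(path: str, payload: dict) -> requests.Response:
    last_error = None
    for node in NODES:
        try:
            # Short timeout so a suspended TCP connection fails fast
            # instead of hanging the transaction.
            return requests.post(node + path, json=payload, timeout=2)
        except requests.exceptions.RequestException as exc:
            # Try the next node, which can read the replicated
            # session data and resume the work.
            last_error = exc
    raise last_error
```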

Faults at the Data Layer

Data is one of the most important layers: if data is not available at the right time, the entire code logic fails, and you will see many business transaction failures and too many alarms. SREs should be ready to handle this. Data is generated by many sources and processed for many purposes, such as streaming, caching, storage, and security. Products such as Redis, Kafka, object storage systems, etcd, relational databases, and NoSQL databases support saving data, whether as temporary or permanent storage.

We have to make the data tier a dual-zone solution, which means it should run in both zones, with one zone as master and the other as slave. Some products use a leader-and-follower strategy instead. Data should always be persistent and the environment reliable: if the master is down, the slave should become the master, and likewise, if the leader node is down, one of the followers should become the leader after an election.

That covers the data tier itself; your code also has to connect to these data systems, whether a database, Redis, or an MQ system like RabbitMQ or Kafka. The code should know exactly where it is writing. For example, by default your code writes to the master node; if the master node is down, the code should be good enough to change the endpoint, which means dynamically pointing at the slave node. The same applies to other products like Redis or Kafka.
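A sketch of that endpoint switching for a PostgreSQL-style master/slave pair via psycopg2, with placeholder host names and the failover promotion assumed to have already happened; Redis Sentinel and Kafka clients solve the same problem with their own discovery protocols:

```python
"""Sketch: dynamically repoint writes when the master is down."""
import psycopg2

ENDPOINTS = ["db-master.internal", "db-replica.internal"]  # placeholders

def connect_writable():
    for host in ENDPOINTS:
        try:
            conn = psycopg2.connect(host=host, dbname="app",
                                    connect_timeout=2)
            # Only accept the node if it can take writes (a standby
            # still in recovery would reject them).
            with conn.cursor() as cur:
                cur.execute("SELECT pg_is_in_recovery()")
                if not cur.fetchone()[0]:
                    return conn
            conn.close()
        except psycopg2.OperationalError:
            continue  # node unreachable, try the next endpoint
    raise RuntimeError("no writable database endpoint available")
```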

SREs should understand this gap and update the architecture by writing good-quality code and building automatic switching with open source technologies. At a minimum, SREs should test the data-layer systems by shutting them down or restarting them, and ensure there are no business transaction failures during that outage. If you look at a Kafka system, each node is bound to a mount that holds its partitions; during faults you can see a lot of consumer lag, and that lag causes transaction failures. The same goes for cache systems like Redis: if Redis is down, the caching layer is badly impacted, and customers may face latency and user experience issues.
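A sketch of a lag check using the kafka-python client; the broker, topic, consumer group, and threshold below are placeholders:

```python
"""Sketch: measure Kafka consumer lag during a broker fault."""
from kafka import KafkaConsumer, TopicPartition

consumer = KafkaConsumer(
    bootstrap_servers="kafka-1.internal:9092",  # placeholder broker
    group_id="payments",                        # hypothetical group
    enable_auto_commit=False,
)

tp = TopicPartition("transactions", 0)  # hypothetical topic/partition
consumer.assign([tp])

end = consumer.end_offsets([tp])[tp]     # latest offset on the broker
committed = consumer.committed(tp) or 0  # last offset the group committed
lag = end - committed
if lag > 10_000:
    print(f"lag {lag} on {tp}: transactions are at risk of failing")
```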

Most cloud infrastructure systems are reliable today, and especially at the data layer they have better solutions for dealing with these kinds of issues.

DDoS Attack

Sometimes SREs should run a DDoS or DoS attack themselves to see how the system behaves: it should automatically spin up new nodes, and load should automatically be shared. SREs can improve system start-up time while managing all KPIs in parallel, such as success-rate drops and delay. Once the attack is completed, the now-idle resources should be released automatically. Usually developers and testers do this task in the name of load testing, but doing it on the live production environment is the challenging part, and the SRE will come away with many suggestions to improve service and system quality.

Consider an attack during which system startup took a long time: you learn that the success rate dropped to 80%, which means 20% of customers lost their transactions, and delay increased from 10 milliseconds to 20 milliseconds, so some users saw user-experience issues. Here the SRE should identify the gap and improve it.
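A crude load-generation sketch that tracks the same two KPIs, success rate and delay; the target URL and request volume are placeholders, and a real exercise would be coordinated with approvals and monitoring:

```python
"""Sketch: load generator for a controlled DDoS-style test."""
import concurrent.futures
import time

import requests

TARGET = "https://service.internal/health"  # hypothetical endpoint

def one_request() -> tuple[bool, float]:
    start = time.monotonic()
    try:
        ok = requests.get(TARGET, timeout=5).status_code == 200
    except requests.exceptions.RequestException:
        ok = False
    return ok, time.monotonic() - start

with concurrent.futures.ThreadPoolExecutor(max_workers=200) as pool:
    results = list(pool.map(lambda _: one_request(), range(10_000)))

# Report the KPIs the exercise is meant to watch.
successes = [latency for ok, latency in results if ok]
print(f"success rate: {len(successes) / len(results):.1%}")
print(f"avg latency: {sum(successes) / len(successes) * 1000:.1f} ms")
```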
