Get Rid of Manual Work

I know shell and Python scripting
Scripting is one of the most important skill sets for an SRE. Most job descriptions say that an SRE should know a programming language. Why is that?
In the SRE world, too many things are still done manually. We have to automate them and make life simpler. Some tasks take a lot of time to carry out: running long series of Linux commands, opening log files by hand and checking them for errors, editing configuration files, restarting services, applying patches, and so on.
Event-Driven Approach
From an incident perspective, suppose an incident is triggered and we learn that some service went down. To remediate it, we have to log in to the server, check the process and the logs, and restart the service. That is fine during the day; we won't hesitate to do it. But consider the same scenario at 1 AM. Our reaction would be rather different, right? An SRE should focus on building auto-healing.
Auto-healing sometimes requires an event-driven approach: the incident itself should kick off remediation scripts that recover the service automatically. Well-established companies have started using this approach to remediate the issues behind their incidents, and many success stories and case studies have been shared in public forums. We should learn this approach and implement it.
I would like to share some technologies that help accomplish this approach: StackStorm, Resolve, PagerDuty, and vRealize Orchestrator. These technologies have built-in sensors that listen for events. When an event fires, that is enough to execute a task and call back to resolve the incident.
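The sensor, rule, and action flow described above can be sketched in a few lines of Python. This is only an illustration of the idea and not the API of StackStorm or any other product: the Event and EventBus classes, the "service_down" trigger name, and the restart_service action are all assumptions made up for this example, and the action only returns a message instead of touching a real server.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Event:
    """What a sensor would emit. The field names are hypothetical."""
    trigger: str      # e.g. "service_down" (made-up trigger name)
    host: str
    detail: str = ""


class EventBus:
    """Maps trigger names to the actions that should run for them."""

    def __init__(self) -> None:
        self._rules: Dict[str, List[Callable[[Event], str]]] = {}

    def on(self, trigger: str, action: Callable[[Event], str]) -> None:
        # Register a rule: when this trigger fires, run this action.
        self._rules.setdefault(trigger, []).append(action)

    def emit(self, event: Event) -> List[str]:
        # Run every action registered for the event's trigger.
        return [action(event) for action in self._rules.get(event.trigger, [])]


def restart_service(event: Event) -> str:
    # Placeholder action: a real one would run a script on the host
    # and then call back to the incident tool to resolve the incident.
    return f"restart requested on {event.host}"


bus = EventBus()
bus.on("service_down", restart_service)
results = bus.emit(Event(trigger="service_down", host="10.0.0.0"))
```

An event with no matching rule simply runs nothing, which is the behavior you usually want from unknown triggers.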
ChatOps with Event-Driven Approach
ChatOps is tightly coupled with the event-driven approach. Today we use chat boxes daily: chatting with friends, ordering groceries, querying for information, or asking someone to do some work for us. There are many use cases. We can apply the same approach in the SRE space, where a keyword is enough to execute an operation. Querying log details, resizing a mount, purging logs, restarting a service, closing an alarm, rebooting machines, removing a faulty node from the pool, pulling system CPU/memory stats, optimizing inodes, installing agents, and so on can all be executed with the help of ChatOps. The keyword is what matters here: your server IP and the type of operation (reboot/resize/restart).
As I said earlier, ChatOps is tightly coupled with the event-driven approach. Rules have to be configured in StackStorm, Resolve, PagerDuty, or vRealize Orchestrator. When a keyword is passed, it has to match the rule's regex for the task to execute. For example, the keyword 10.0.0.0 reboot in ChatOps will reboot that IP (10.0.0.0). The rule should be well defined so that it matches exactly the pattern of the keyword we pass in ChatOps.
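To show what a well-defined rule could look like, here is a minimal Python sketch of the matching step only. The regex, the parse_keyword function, and the fixed list of operations are assumptions for illustration; each tool has its own rule syntax, and this is not taken from any of them.

```python
import ipaddress
import re
from typing import Optional, Tuple

# Anchored at both ends so that only the exact "<ip> <operation>" form matches.
KEYWORD_RE = re.compile(
    r"^(?P<ip>\d{1,3}(?:\.\d{1,3}){3})\s+(?P<op>reboot|resize|restart)$"
)


def parse_keyword(message: str) -> Optional[Tuple[str, str]]:
    """Return (ip, operation) if the chat message is a valid keyword, else None."""
    match = KEYWORD_RE.match(message.strip())
    if match is None:
        return None
    try:
        # The regex only checks the shape; this rejects values like 999.0.0.0.
        ipaddress.IPv4Address(match.group("ip"))
    except ipaddress.AddressValueError:
        return None
    return match.group("ip"), match.group("op")
```

With this sketch, parse_keyword("10.0.0.0 reboot") gives ("10.0.0.0", "reboot"), while a loose message such as "please 10.0.0.0 reboot now" is rejected. Being strict here matters, because a rule that matches too much could reboot a machine nobody asked for.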
How will this approach help an SRE?
Consider a mount that has reached its 95% threshold. The alarm fires around 3 AM while you are in deep sleep. You get an automated call saying something like: the mount on IP 10.0.0.0 has reached 95%, please take a look. You have to wake up, log in to the server, clean up the mount, clear the incident, and go back to sleep. Time taken to resolve the incident: 20–30 minutes. If you had implemented the event-driven approach, the sensor would have listened for the event and remediated it, and you would not have had to wake up at all. Time taken to resolve the incident: 10–15 minutes at most.
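The check that such a sensor would perform can be sketched in Python as below. The function names and the injectable usage parameter are assumptions for illustration; only shutil.disk_usage is a real standard-library call. Note that the percentage may differ slightly from what df reports, because df accounts for blocks reserved for root.

```python
import shutil
from typing import Callable, Tuple

THRESHOLD_PERCENT = 95.0  # the threshold used in the example above


def used_percent(total: int, used: int) -> float:
    """Used space as a percentage of the total; 0.0 for an empty filesystem."""
    return 100.0 * used / total if total else 0.0


def check_mount(
    path: str = "/",
    threshold: float = THRESHOLD_PERCENT,
    usage: Callable = shutil.disk_usage,
) -> Tuple[bool, float]:
    """Return (needs_cleanup, percent_used) for the given mount point."""
    stats = usage(path)  # named tuple with total, used and free, in bytes
    percent = used_percent(stats.total, stats.used)
    return percent >= threshold, percent
```

When check_mount reports that cleanup is needed, the remediation action (purging old logs, for example) would run and then clear the incident. That part is left out here because it depends entirely on the environment.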
Consider that you are away from your work location. Your server is running low on memory, you get an alarm, and to remediate it you need to reboot the system. How do you approach this scenario? ChatOps with the event-driven approach helps here. You just give the keyword 10.0.0.0 reboot and the system reboots automatically. Once it is fully up, you can run one more keyword, 10.0.0.0 restartAppservice. The rules will recognize your keyword and execute the action.
How to deploy Code?
We can't simply say that this is DevOps space and not SRE space. Some organizations have their own DevOps team whose task is to create pipelines that automate pulling code from SVN, compiling, building, pushing to the testing environment, executing test cases, pushing to UAT, executing test cases again, and finally getting the code ready to push to the production environment. Some companies push the code manually, and some do it in an automated way without manual intervention. This is a clear-cut DevOps approach. The SRE has to check the dashboard and ensure there is no drop in success rate and no increase in delay. If there are any abnormalities caused by the code, the SRE may need to talk with the DevOps team to get the code rolled back.
Half-Boiled DevOps Approach
Some organizations have their own pipeline. By default it pulls the code from SVN, compiles, builds, and tests it, and pushes the result to a repository. Sometimes the SRE even has to run the Jenkins jobs manually.
An SRE should know SVN, configuration management tools (Puppet/Ansible), and infrastructure as code (IaC) tools (Terraform). The SRE may need to create their own playbooks or plans/tasks and execute them to push the code to the production environment. This creates a lot of manual effort: each time, you have to ensure the proper variables, repo address, and inventory configuration. Some smart people create a skeleton and reuse it for each release.
In these scenarios the best approach is event-driven with ChatOps: we just give keywords such as the code version ID, host, and variables. Once the rules match, the action is triggered to push the code to the IPs.
How can we leverage it more?
Consider that your environment is running in AZ1 and AZ2, and your Tomcat IPs are in a load-balancing pool. First you take the AZ1 IP out of the pool, deploy to it, ensure the service is stable, and put the IP back in the pool. Then you run the same steps on the AZ2 IPs.
Yes, of course you can do it with keywords. Keyword 1, 10.0.0.1 outofpool: an API call to the ELB should take the IP out of the pool. Keyword 2, Deploy App1 1.0.1 10.0.0.1 PROD: StackStorm should call Ansible/Terraform through an API call to push the code to the requested server. Keyword 3, 10.0.0.1 checkapp1: checks the application service health. The last one, keyword 4, 10.0.0.1 poolin: once again an API call to the ELB should add the IP back to the pool.
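The four keywords above follow a fixed order, and that order can be sketched as one small Python function. The four callables stand in for the ELB, Ansible/Terraform, and health-check calls; their names and signatures are assumptions for illustration, not a real API.

```python
from typing import Callable, Iterable, List


def rolling_deploy(
    hosts: Iterable[str],
    pool_out: Callable[[str], None],
    deploy: Callable[[str], None],
    healthy: Callable[[str], bool],
    pool_in: Callable[[str], None],
) -> List[str]:
    """Deploy to one host at a time and return the hosts that were updated."""
    done: List[str] = []
    for host in hosts:
        pool_out(host)            # keyword 1: take the IP out of the pool
        deploy(host)              # keyword 2: push the code to the server
        if not healthy(host):     # keyword 3: check the application health
            # Stop here so a bad release never reaches the next host.
            raise RuntimeError(f"{host} failed its health check; left out of pool")
        pool_in(host)             # keyword 4: put the IP back in the pool
        done.append(host)
    return done
```

Stopping at the first failed health check is a design choice in this sketch: the unhealthy IP stays out of the pool and the AZ2 IPs keep serving the old version until someone looks at it.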
A Simple, Less Time-Consuming Approach
For example, suppose your service can be tested at 127.0.0.1:8080/test. If we curl this URL, it responds with 200 (success). If it responds with a non-200 status such as 500, your service might be hung or down, and to remediate it we have to restart the service. Cron is a service that runs on almost every Linux machine: we set a time, and the task gets executed at that time. The same approach can be used here. Create a simple script that extracts the status code from the URL's response. If it is 200, the service is OK; otherwise, restart the service. We can place this script in the cron entries and have it executed every 5 minutes.
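A minimal Python sketch of such a script could look like this. The URL comes from the example above; the restart command and the service name myapp are assumptions, so replace them with whatever restarts your own service.

```python
import http.client
import subprocess
import urllib.error
import urllib.request
from typing import Callable

HEALTH_URL = "http://127.0.0.1:8080/test"
RESTART_CMD = ["systemctl", "restart", "myapp"]  # hypothetical service name


def get_status(url: str, timeout: float = 5.0) -> int:
    """Return the HTTP status code, or 0 if the service cannot be reached."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code   # the server answered with an error status, e.g. 500
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        return 0          # refused connection, timeout, broken response, ...


def restart_service() -> None:
    # check=False: a failed restart should not crash the cron job itself.
    subprocess.run(RESTART_CMD, check=False)


def check_and_heal(
    fetch: Callable[[str], int] = get_status,
    restart: Callable[[], None] = restart_service,
) -> bool:
    """Return True if the service is healthy, otherwise restart it and return False."""
    if fetch(HEALTH_URL) == 200:
        return True
    restart()
    return False
```

Assuming the file is saved as /opt/scripts/healthcheck.py (a made-up path), a cron entry along these lines would run it every 5 minutes:

```
*/5 * * * * cd /opt/scripts && /usr/bin/python3 -c "import healthcheck; healthcheck.check_and_heal()"
```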
Sometimes the event-driven or ChatOps-based approach is time-consuming, because it has to call multiple API systems before the script gets executed. Running a plain script from crontab does not cost much time. To some extent this is a good approach, but we have to ensure that whatever we execute every 5 minutes does not consume much CPU and does not crash the system.
Code it
Some companies have runbook automation, which reads the runbook content and automatically executes the steps based on it. Others don't want runbook automation and prefer that bash shell or Python scripts be executed. Either way, it is mandatory to learn a programming language.
StackStorm is one of the most popular automation tools. Its actions are nothing but scripts that get executed in the background. The event-driven approach, with or without ChatOps, is based on actions, and actions depend on scripts (typically written in shell or Python).
In a cloud infrastructure everything is based on APIs, from creating a resource to decommissioning it. Most SREs have knowledge of cloud infrastructure, but they should go one level deeper and try to understand how the infrastructure is built with code. IaC (infrastructure as code) is one of the important practices here. Technologies such as Terraform and CloudFormation templates really help to provision resources.
Event-driven and ChatOps with event-driven options will definitely bridge the gap for tasks like provisioning resources. For example, we can create a VPC through ChatOps: keyword 1, create VPC 10.0.0.0/16; keyword 2, get VPC.
Is everything automated?
No, we can't say that. In most organizations, SREs practice a business continuity process (BCP) and test it with Netflix Chaos Monkey. The aim is to make sure the system or service is reliable. For example, if your service goes down, it should come back up automatically. If a system goes down, event-driven remediation should kick off and bring the system back up. If AZ1 hardware fails, automation code should automatically take the faulty IP out of the load-balancing pool. If the DB goes down, the code should automatically connect to the slave DB, which later becomes the master. In the same way, an SRE should identify potential impacts and fill the gaps.
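The DB case can be sketched from the application side as below. This is a simplified illustration: connect_with_failover and the connector callables are assumptions, and a real setup would use the database driver's own connect function. It only shows falling back to the next endpoint; promoting the slave to master is the job of the database or its cluster manager and is not covered here.

```python
from typing import Callable, Optional, Sequence, TypeVar

T = TypeVar("T")


def connect_with_failover(connectors: Sequence[Callable[[], T]]) -> T:
    """Try each endpoint in order and return the first connection that succeeds."""
    last_error: Optional[Exception] = None
    for connect in connectors:
        try:
            return connect()
        except ConnectionError as err:
            last_error = err  # remember the failure and try the next endpoint
    raise ConnectionError("no database endpoint reachable") from last_error
```

The list would normally hold the master first and the slave second, so the slave is used only when the master cannot be reached. Running this kind of code path regularly in a chaos test is how you find out whether it really works before a real outage does it for you.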