How ChatOps saved me
I commute to the office by company cab; the ride to my work location normally takes about 45 minutes. One day, during the commute, I got an alarm message saying that a host in one of our clusters (call it cluster XXX) was down. At that moment I was the only person on call for that alarm, my dongle was not working, and I could not log in to the system. The host ran containers orchestrated by Kubernetes. The stateless workloads were handled well: Kubernetes restarted those containers on another working node in the cluster. The StatefulSet pods, however, were impacted. They could only run on the failed node, and I found that node affinity was in place. A MySQL database was running in one of those stateful pods. The impact: our success rate dropped by 2%, and some users observed connection failures to the DB.
Long ago we had created a ChatOps rule: sending a keyword such as "10.0.0.0 startup" (the IP followed by "startup") would call the server startup API. The servers were deployed in a cloud environment, so that API was reachable from the bot. I managed to use that keyword from chat, and the server came up fine.
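A minimal sketch of that keyword handler, assuming a hypothetical startup endpoint (the real cloud API path and auth are not shown in the post):

```python
import re

# Hypothetical startup endpoint; the real cloud API path is an assumption.
STARTUP_API = "https://cloud.example.internal/v1/servers/{ip}/startup"

# Keyword format from the post: "<ip> startup"
STARTUP_CMD = re.compile(r"^(\d{1,3}(?:\.\d{1,3}){3})\s+startup$")

def handle_message(text):
    """Parse a chat message like '10.0.0.0 startup' and return the
    startup API URL to call, or None if the keyword does not match."""
    m = STARTUP_CMD.match(text.strip())
    if not m:
        return None
    ip = m.group(1)
    # In the real bot this would be an authenticated POST, e.g.:
    #   requests.post(STARTUP_API.format(ip=ip), headers=auth_headers)
    return STARTUP_API.format(ip=ip)
```

The bot only needs to recognize the keyword and forward the IP; everything else is the cloud provider's API.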
About five minutes later I texted another keyword, "cluster-health XXX" (the cluster-health command followed by the cluster name). All nodes and pods in the cluster had returned to the Running state.
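The health summary behind that command boils down to checking node readiness and pod phases. In a real bot the data would come from kubectl or the Kubernetes API; here is the summarizing logic as a self-contained sketch:

```python
def cluster_health(nodes, pods):
    """Summarize cluster health. `nodes` maps node name -> Ready (bool),
    as derived from node conditions; `pods` maps pod name -> phase string
    as reported by the Kubernetes API (e.g. "Running", "Pending")."""
    bad_nodes = [n for n, ready in nodes.items() if not ready]
    bad_pods = [p for p, phase in pods.items() if phase != "Running"]
    if not bad_nodes and not bad_pods:
        return "OK: all nodes Ready and all pods Running"
    return "DEGRADED: nodes=%s pods=%s" % (bad_nodes, bad_pods)
```

The ChatOps reply is just this one-line summary, which is all you need from a cab.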
Once I reached the office, I started looking at automating this further. If a host goes down, it should come back up automatically, and once it is completely up, a health check should run. So I chose an event-trigger with auto-recovery approach. We tested it in the test environment before implementing it, by manually shutting down a node in the cluster. The auto recovery is set up on an alarm-rule matching pattern. Say, for example, your alarm text is "HOST IP down, IP: 10.0.0.0 Cluster"; I configured a regex for that pattern to kick off the auto-healing API call.
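The regex match for that example alarm can be sketched like this (the exact alarm text format depends on your monitoring system, so treat the pattern as illustrative):

```python
import re

# Example alarm text from the post: "HOST IP down, IP: 10.0.0.0 Cluster"
ALARM_PATTERN = re.compile(
    r"HOST IP down, IP:\s*(?P<ip>\d{1,3}(?:\.\d{1,3}){3})\s+(?P<cluster>\S+)"
)

def match_alarm(alarm_text):
    """Return (ip, cluster) when the alarm matches the auto-recovery
    pattern, or None so unrelated alarms are left alone."""
    m = ALARM_PATTERN.search(alarm_text)
    if not m:
        return None
    return m.group("ip"), m.group("cluster")
```

Only alarms that match the pattern trigger auto-healing; everything else still pages a human.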
So when the alarm fires, the rule engine checks the regex pattern, and on a match it takes the auto-recovery path: it calls the startup API to bring up the node, and once the node is fully up it runs a remediation script that validates that all cluster nodes and pods are active and in the Running state. Finally, it closes the alarm.
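The whole flow can be sketched as a small driver. The alarm matcher, cloud startup call, node check, remediation script, and alarm-close call are all injected here, since those pieces differ per cloud and monitoring stack; the names are illustrative, not our real tooling:

```python
import time

def auto_recover(alarm_text, match_alarm, start_server, node_is_up,
                 remediate, close_alarm, timeout=600, interval=30):
    """Auto-recovery flow: match the alarm, call the startup API, wait
    until the node is up, run the remediation check, close the alarm."""
    matched = match_alarm(alarm_text)
    if matched is None:
        return "ignored"            # not an auto-recovery alarm
    ip, cluster = matched
    start_server(ip)                # cloud startup API call
    deadline = time.time() + timeout
    while time.time() < deadline:
        if node_is_up(ip):
            break
        time.sleep(interval)
    else:
        return "timeout"            # give up and escalate to a human
    if not remediate(cluster):      # validate nodes/pods are Running
        return "remediation-failed"
    close_alarm(alarm_text)
    return "recovered"
```

The important property is that the alarm is only auto-closed after remediation confirms the cluster is healthy.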
Triaging the case further, we got to know it was a hardware failure on a physical machine, which had impacted many virtual machines on the cloud infrastructure; the cloud SRE team was working with the vendor to identify it. As SREs we had to take some steps to cover this kind of failure. I concluded a few things in the meeting: as soon as a node goes down, the auto-recovery script should cordon that node. This prevents the scheduler from trying to place new pods on the faulty node. Once the node is up and the auto-recovery remediation script has checked that everything is OK, and all the signals are green, the script should uncordon it to make pod scheduling normal again.
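The cordon/uncordon step maps directly onto `kubectl cordon` and `kubectl uncordon`. A small sketch, assuming kubectl is configured on the host running the remediation script:

```python
import subprocess

def cordon_cmd(node, schedulable=False):
    """Build the kubectl command to (un)cordon a node. Cordoning marks
    the node unschedulable so the scheduler places no new pods on it
    while it is faulty; uncordoning restores normal scheduling."""
    return ["kubectl", "uncordon" if schedulable else "cordon", node]

def run(cmd, runner=subprocess.run):
    """In the remediation script: run(cordon_cmd("node-1")) on failure,
    then run(cordon_cmd("node-1", schedulable=True)) after checks pass."""
    return runner(cmd, check=True)
```

Separating command construction from execution also makes the script easy to dry-run in the test environment.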
We also suggested to the DevOps team that StatefulSet workloads should run a minimum of three replicas spread across different nodes, with no node ever holding all the replicas, and that the node affinity configuration should be removed.
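In Kubernetes terms the suggestion is three replicas plus pod anti-affinity on the hostname topology key. A sketch of such a manifest, assuming a MySQL StatefulSet labeled `app: mysql` (the names and image are illustrative):

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: mysql
spec:
  serviceName: mysql
  replicas: 3                      # minimum three replicas
  selector:
    matchLabels:
      app: mysql
  template:
    metadata:
      labels:
        app: mysql
    spec:
      affinity:
        podAntiAffinity:           # replaces the old node affinity
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchLabels:
                app: mysql
            topologyKey: kubernetes.io/hostname   # one replica per node
      containers:
      - name: mysql
        image: mysql:8.0
```

With required anti-affinity on `kubernetes.io/hostname`, the scheduler refuses to place two MySQL replicas on the same node, so a single host failure can take out at most one replica.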
On the SRE side, to fill the gap, we added automatic traffic switching to the standby (replica) DB pods running in another availability zone. This can be achieved by storing a key-value entry in etcd and enabling monitoring: if connections to the master DB fail, or any health check on the master fails, the service code automatically switches to the replica DB. In parallel, a re-architecture effort was under way to move the entire DB module out of the cluster onto bare physical or virtual machines; some suggested running it in a separate cluster.
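A minimal sketch of that failover switch. In production the key would live in etcd (for example via an etcd client library); a plain dict stands in here so the logic is self-contained, and the endpoint names are illustrative:

```python
# Key under which the currently active DB endpoint is published.
ACTIVE_DB_KEY = "/db/mysql/active"

def choose_endpoint(kv, master, standby, master_healthy):
    """Point services at the master while its health check passes; on a
    failed check, flip the active-endpoint key to the standby replica
    in the other availability zone. Services read this key (or watch
    it in etcd) to pick up the switch without a redeploy."""
    kv[ACTIVE_DB_KEY] = master if master_healthy else standby
    return kv[ACTIVE_DB_KEY]
```

Keeping the decision in one shared key means every service flips at the same moment, instead of each instance timing out against the dead master on its own.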
Traffic and the dongle issue were against me that day, but ChatOps helped fix the problem. Because the cloud API calls were exposed, SREs could code against them and make recovery automatic.