Troubleshooting at the App Tier as an SRE
Troubleshooting is one of the core and most important skills for an SRE. As I said in earlier blogs, an SRE should have a solid understanding of the infrastructure and some advanced feature knowledge; they need to know exactly how a request enters your network and how it is processed. By default, an SRE should have end-to-end (E2E) knowledge of the system.
MTTR (Mean time to recovery)
If something breaks in your flow, the SRE has to handle it. In the SRE space there is a metric called MTTR (mean time to recover), which describes how quickly events are recovered or restored. Some companies have their own SLAs. For critical applications such as OTP, or customer-facing pages like the login page and the landing/home page, the MTTR SLA is very tight: it should be a maximum of 10–20 minutes. Within that window the SRE should fix the issue or at least route all requests to the failover region. MTTR measures how much time it takes to restore from the issue.
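As a rough illustration, here is a minimal Python sketch of how MTTR could be computed from incident records; the incident timestamps below are made up purely for illustration.

```python
# Minimal MTTR sketch: average recovery time across recorded incidents.
# The incident data is illustrative only.
from datetime import datetime

incidents = [
    {"start": datetime(2023, 5, 1, 10, 0), "restored": datetime(2023, 5, 1, 10, 18)},
    {"start": datetime(2023, 5, 7, 2, 30), "restored": datetime(2023, 5, 7, 2, 42)},
]

recovery_minutes = [
    (i["restored"] - i["start"]).total_seconds() / 60 for i in incidents
]
mttr = sum(recovery_minutes) / len(recovery_minutes)
print(f"MTTR: {mttr:.1f} minutes")  # compare against the 10-20 minute SLA
```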
Every SRE has their own strategy for handling live incidents. Those who know their infrastructure well usually have a clue about what the issue could be and what caused the outage. People who are new to SRE struggle a lot during these times: they don't know where exactly to start troubleshooting. Some are told to check the logs, the health of the application, the server health; there are too many things to follow.
Alarms/Incidents
Alarms and incidents notify the SRE to take action, and the SRE should not ignore them. Missing that bus can cause huge damage. Each alarm or incident has a severity: Critical/Major/P1 incidents must be handled immediately, while Minor/P2/P3 incidents allow a somewhat more relaxed response. Action still has to be taken, though, or a minor incident may turn into a P1 after 2–3 hours. For example, suppose disk mount usage reaches 70% and your alarm system triggers a P3 or P2 incident. If the SRE fails to take action and usage grows to 90%, another P2 incident fires. If there is still no action and usage climbs to 99%, a P1 incident is triggered, and if the SRE still fails to act, the business is impacted.
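To make that escalation concrete, here is a minimal Python sketch of a tiered disk-usage check; the mount point, thresholds, and notify() hook are placeholder assumptions, not a real alerting integration.

```python
# Tiered disk-usage alarm sketch: map usage percentage to a severity level.
import shutil

SEVERITY_THRESHOLDS = [(99, "P1"), (90, "P2"), (70, "P3")]  # percent used -> severity

def notify(severity: str, message: str) -> None:
    # Hook into your alerting tool here; printing is just a stand-in.
    print(f"[{severity}] {message}")

def check_disk(mount: str = "/") -> None:
    usage = shutil.disk_usage(mount)
    used_pct = usage.used / usage.total * 100
    for threshold, severity in SEVERITY_THRESHOLDS:
        if used_pct >= threshold:
            notify(severity, f"{mount} is {used_pct:.1f}% full")
            break

check_disk()
```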
The SRE should configure only a few genuine alarms, and each incident or alarm should be clear about what is abnormal. Some incident tools also attach a remediation plan, which a beginner-level SRE can follow to remediate the issue. If those plans are not there, it is the SRE's responsibility to put them in place.
The following alarms should be in place as a bare minimum (a small code sketch of such a catalog follows the lists below).
Platform layer
- 1. High CPU.
- 2. High Memory.
- 3. Inode.
- 4. Disk.
- 5. Network.
- 6. High IO.
- 7. Host abnormal.
- 8. Host down.
- 9. NIC card failure.
- 10. IP down.
SaaS layer
- 1. Service port down.
- 2. Service Hang.
- 3. High CPU and Memory due to service.
- 4. Success rate drop.
- 5. High failure rate.
- 6. High Latency failure.
- 7. DB connectivity issue.
- 8. Connectivity issue with interface systems.
- 9. High API call failure.
DB layer
- 1. DB CPU usage.
- 2. DB Disk usage.
- 3. DB Memory usage.
- 4. Process percentage.
- 5. QPS.
- 6. MySQL slave lagging behind master.
- 7. MySQL slave status.
- 8. Lock.
- 9. Slow and long query.
- 10. Master host status.
Network layer
- 1. ELB connection count drop.
- 2. TCP connection drop.
- 3. High Request count.
- 4. High connection drop.
- 5. Network I/O rate.
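As mentioned before the lists, one way to keep such a catalog manageable is to hold it as data and feed it into whatever monitoring tool you use. The sketch below is a minimal Python example; the names and thresholds are illustrative assumptions, not a complete list.

```python
# Minimal alarm catalog sketch: one entry per check, grouped by layer.
# Thresholds and metric names are assumptions to adapt to your tooling.
ALARM_CATALOG = {
    "platform": [
        {"name": "high_cpu", "metric": "cpu_percent", "warn": 80, "critical": 95},
        {"name": "disk_usage", "metric": "disk_used_percent", "warn": 70, "critical": 90},
    ],
    "service": [
        {"name": "service_port_down", "metric": "port_open", "critical": 0},
        {"name": "success_rate_drop", "metric": "success_rate_percent", "warn": 99, "critical": 95},
    ],
    "db": [
        {"name": "slave_lag", "metric": "seconds_behind_master", "warn": 30, "critical": 300},
    ],
    "network": [
        {"name": "tcp_connection_drop", "metric": "tcp_retransmits_per_sec", "warn": 50, "critical": 200},
    ],
}
```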
App-level troubleshooting
When the application fails to function or the app failure rate is high, the SRE should investigate two layers: the platform layer and the service layer. At the PaaS (platform) layer, the server-related indicators should be green; for example, the CPU, memory, disk, and network indicators should be in a green or healthy state. At the SaaS (service) layer, the service-level indicators should likewise be healthy or green; for example, the service port, the service API health check, connectivity with upstream/downstream and other dependent or interface systems, and the connections to the DB, Redis, or Kafka should all be green.
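Here is a minimal Python sketch of the service-layer checks described above (service port, API health endpoint, DB and Redis connectivity); the hostnames, ports, and health path are assumptions for illustration.

```python
# Minimal service-layer check sketch: port reachability plus an HTTP health probe.
import socket
import requests

def port_open(host: str, port: int) -> bool:
    try:
        with socket.create_connection((host, port), timeout=3):
            return True
    except OSError:
        return False

def api_healthy(url: str) -> bool:
    try:
        return requests.get(url, timeout=3).status_code == 200
    except requests.RequestException:
        return False

# Hostnames, ports, and the /healthz path are placeholders.
checks = {
    "service port": port_open("app01.internal", 8443),
    "API health": api_healthy("https://app01.internal:8443/healthz"),
    "DB port": port_open("db01.internal", 3306),
    "Redis port": port_open("cache01.internal", 6379),
}
for name, ok in checks.items():
    print(f"{name}: {'GREEN' if ok else 'RED'}")
```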
Automate or pull the golden indicator rules
The most important thing is that the SRE automates all manual checks. For example, the PaaS/SaaS layer indicators should be checked automatically by scripts. Most SREs use an APM or AIOps dashboard to monitor these golden indicators. If the SRE is sitting in a NOC room with a big screen, yes, they can monitor everything in a single shot. But when the SRE has to check during off hours, a ChatOps-based solution is required, either to kick off the automated checking scripts or to pull the indicator results from the AIOps or APM dashboard.
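Here is a minimal ChatOps-style sketch in Python: run the indicator checks and push the summary into a chat channel via an incoming webhook. The webhook URL is a placeholder, and run_golden_checks() stands in for whatever check script you already have.

```python
# ChatOps sketch: post a golden-indicator summary to a chat webhook.
import requests

WEBHOOK_URL = "https://hooks.example.com/services/XXX"  # hypothetical incoming webhook

def run_golden_checks() -> dict:
    # Stand-in for your real check script or dashboard pull.
    return {"CPU": "GREEN", "Disk": "GREEN", "Success rate": "RED"}

def post_summary() -> None:
    results = run_golden_checks()
    text = "\n".join(f"{k}: {v}" for k, v in results.items())
    try:
        requests.post(WEBHOOK_URL, json={"text": f"Golden indicators:\n{text}"}, timeout=5)
    except requests.RequestException as exc:
        print(f"Failed to post to chat: {exc}")

post_summary()
```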
When was the last upgrade?
Sometimes the application stops functioning properly a few hours after an upgrade. You can see this on the APM or AIOps dashboard: KPI indicators such as the success rate drop, failures go up, or latency increases. For these kinds of issues the SRE has to work closely with the developers and DevOps. If the KPIs are impacted, the only option is a code revert; the SRE should not allow faulty code to keep running in the production environment. If the new code consumes more memory or CPU, it is fully the SRE's responsibility to check it, and there should be a proper before-and-after comparison. If resource usage really is higher, the developer agrees, and the new code requires more resources than originally estimated, the SRE should expand the capacity.
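For the before-and-after comparison, here is a minimal sketch assuming the metrics live in Prometheus with cAdvisor-style container metrics; the endpoint, metric selector, and one-day offset are assumptions to adjust for your environment.

```python
# Compare average CPU usage for a service now vs. before the deploy,
# using the Prometheus HTTP query API.
import requests

PROM = "http://prometheus:9090/api/v1/query"  # hypothetical Prometheus endpoint
QUERY_NOW = 'avg(rate(container_cpu_usage_seconds_total{pod=~"checkout.*"}[30m]))'
QUERY_BEFORE = QUERY_NOW + " offset 1d"  # same window, one day earlier

def instant(query: str) -> float:
    resp = requests.get(PROM, params={"query": query}, timeout=10)
    resp.raise_for_status()
    result = resp.json()["data"]["result"]
    return float(result[0]["value"][1]) if result else 0.0

before, after = instant(QUERY_BEFORE), instant(QUERY_NOW)
change = (after - before) / before * 100 if before else float("nan")
print(f"CPU before deploy: {before:.3f}, after: {after:.3f}, change: {change:.1f}%")
```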
What happens when the application starts failing to process requests and this is traced back to code deployed last weekend? This is quite a rare scenario: the SRE would have monitored the service for 24 hours after the version upgrade, yet processing starts failing only after a week. Then the troubleshooting has to be stretched out together with the developers, and the dependent data-layer systems may need to be checked. Sometimes a user's entry in the caching layer has not expired when it should have, or the reverse; Kafka consumer lag can also cause this kind of issue. The SRE has to work along with the developers and the DevOps team.
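For the Kafka lag part, here is a minimal consumer-lag check sketched with the kafka-python client; the broker, topic, group, and threshold are placeholder assumptions.

```python
# Consumer-lag sketch: compare end offsets with the group's committed offsets.
from kafka import KafkaConsumer, TopicPartition

consumer = KafkaConsumer(
    bootstrap_servers="kafka:9092",   # hypothetical broker
    group_id="payment-consumers",     # hypothetical consumer group
    enable_auto_commit=False,
)

topic = "payment-events"              # hypothetical topic
partitions = [TopicPartition(topic, p) for p in consumer.partitions_for_topic(topic)]
end_offsets = consumer.end_offsets(partitions)

for tp in partitions:
    committed = consumer.committed(tp) or 0
    lag = end_offsets[tp] - committed
    if lag > 10000:                   # threshold is an assumption
        print(f"WARN: partition {tp.partition} lag={lag}")
```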
Dependent system failure
Consider that your service is a platform-based solution, which means other services may need it in order to process their requests. For example, Service A runs in a different business unit and Service B runs in the platform layer, and per the business logic Service A needs to communicate with Service B. Now suppose Service B at the platform layer is down: the dependent logic in Service A will fail to execute. The same applies to downstream dependent systems. In a service-mesh-based environment the SRE has to come up with a better architecture to fill this gap; as of now, a proper failover mechanism, as sketched below, should be followed by the SRE to cover this issue.
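Here is a minimal failover sketch in Python: call Service B in the primary region and fall back to the standby region if the call fails. The URLs and timeout are illustrative assumptions only.

```python
# Failover sketch: try the primary region first, then the standby region.
import requests

PRIMARY = "https://service-b.primary.example.com/api/v1/quote"   # hypothetical
FAILOVER = "https://service-b.failover.example.com/api/v1/quote"  # hypothetical

def call_service_b(payload: dict) -> dict:
    for url in (PRIMARY, FAILOVER):
        try:
            resp = requests.post(url, json=payload, timeout=2)
            resp.raise_for_status()
            return resp.json()
        except requests.RequestException:
            continue  # try the next region
    raise RuntimeError("Service B unreachable in both regions")
```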
Data tier issue
Sometimes the code is working fine and the SRE can't identify any issue at the service layer: the service and its PaaS-layer golden indicators are green, yet the service fails to load or is far too slow. This can happen when a huge number of transactions land at once or a long-running query is executing; mostly it turns out to be a slow-query issue. For this kind of issue the SRE should have an alarm in place, so that when it fires they can concentrate directly on the data tier instead of checking the service or network layer first. This improves MTTR. The solution for a long-running or slow query is to escalate to the developers to get it optimized in code; DDL or DML statements could be the cause.
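Here is a minimal sketch of such a slow-query check against MySQL using PyMySQL; the connection details and the 30-second threshold are assumptions for illustration.

```python
# Slow-query sketch: list active statements running longer than 30 seconds.
import pymysql

conn = pymysql.connect(host="db01.internal", user="monitor",
                       password="***", database="information_schema")
try:
    with conn.cursor() as cur:
        cur.execute(
            "SELECT id, user, time, LEFT(info, 120) "
            "FROM processlist "
            "WHERE command != 'Sleep' AND time > 30 "
            "ORDER BY time DESC"
        )
        for thread_id, user, seconds, query in cur.fetchall():
            print(f"[{seconds}s] thread={thread_id} user={user} query={query}")
finally:
    conn.close()
```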
Again, the SRE has to analyze from when these abnormalities started appearing. Were they caused by new code, or does the DB schema need to be updated along with the new code? This is the kind of analysis the SRE has to do when they move into the RCA space.
If the SRE finds the same issue recurring, quick optimization is required with the developer's help: either the SQL or the code needs to be reverted. Once again, the statement for the SRE should be clear: "The SRE should not allow faulty code to run on the production system."
How to improve MTTR?
Mean time to recovery is the important metric here: instead of analyzing the issue layer by layer, the fault should be identified very quickly. Rich, meaningful dashboards help in these situations; an APM or AIOps based dashboard is required, and the SRE has to build it. Dashboards built around the golden indicators help locate the issue quickly. On the other side, the SRE needs to check the logs when service transactions fail, and if you are running too many containers you may not be sure which containers to check. Splunk or Elasticsearch helps you search the logs by running queries. For example, take the flow A->B->C->D, and suppose you are not the SRE for the C and D services. If the C service is down, the flow is broken, and to see that you need to check the B service logs. B has 6 pods, and the relevant logs were recorded on the 3rd pod. Without Splunk or Elasticsearch it will take time to identify the issue; to analyze quickly, we need a log-management tool of this kind. A service mesh like Istio also helps cover this kind of issue. I will write separately about detecting the abnormality before users do.
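Here is a minimal sketch of searching Service B's logs across all of its pods for failed calls to Service C, assuming the logs are shipped to Elasticsearch; the index name and field names are assumptions that depend on your log pipeline.

```python
# Log-search sketch: recent ERROR lines from service-b that mention service-c,
# queried through the Elasticsearch _search REST API.
import requests

ES_URL = "http://elasticsearch:9200/app-logs-*/_search"  # hypothetical index pattern
query = {
    "size": 20,
    "sort": [{"@timestamp": "desc"}],
    "query": {
        "bool": {
            "filter": [
                {"term": {"service": "service-b"}},
                {"match": {"message": "service-c"}},
                {"match": {"level": "ERROR"}},
                {"range": {"@timestamp": {"gte": "now-15m"}}},
            ]
        }
    },
}

resp = requests.post(ES_URL, json=query, timeout=10)
resp.raise_for_status()
for hit in resp.json()["hits"]["hits"]:
    src = hit["_source"]
    # kubernetes.pod_name is a common shipper field, but also an assumption here.
    print(src.get("kubernetes", {}).get("pod_name"), src.get("message"))
```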