Detect abnormality before it impacts users - SRE
Detecting an abnormality before users do is a pretty challenging task. Why do I say this? Because it requires constant monitoring of your service and of the resources both inside and outside your scope. The flow of requests is what matters here: if there is a block in the flow, the SRE should jump in, identify the block and clear it. If the service is used only by internal team members the impact is limited, but if the service is used by your customers it will hurt your business.
An SRE should be capable of fixing an issue before it impacts users. That requires a good understanding of each feature and of how closely it is bound to the underlying resources. If a platform resource has an issue, the SRE can get it fixed. If a feature fails for some users because of a cache or wrong-data issue, it is crucial for the SRE to identify it; sometimes the SRE has to work with developers to confirm whether the issue is genuine.
Let's look at how to detect abnormality at a more advanced level. Alarm, KPI, Incident, APM (Application Performance Management), AIOps, Indicator, Splunk and Kibana are terms every SRE knows. There are certain well-known, high-impact scenarios. If a server is running with memory usage close to 95%, we know it may crash and the server will get rebooted. To avoid such situations the SRE configures an alarm against a threshold; if the threshold is breached, remediation steps are taken.
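A minimal sketch of such a threshold alarm, assuming a Linux host where memory can be read from /proc/meminfo; the send_alarm() hook is a hypothetical stand-in for whatever alerting tool you use:

```python
# memory_alarm.py - minimal sketch of a threshold-based memory alarm.
# Assumes a Linux host; send_alarm() is a hypothetical hook into your alerting tool.
MEM_THRESHOLD_PCT = 95.0  # the breach level discussed above


def memory_usage_pct() -> float:
    """Read /proc/meminfo and return used memory as a percentage."""
    info = {}
    with open("/proc/meminfo") as f:
        for line in f:
            key, value = line.split(":", 1)
            info[key] = int(value.strip().split()[0])  # values are in kB
    used = info["MemTotal"] - info["MemAvailable"]
    return used * 100.0 / info["MemTotal"]


def send_alarm(message: str) -> None:
    # Placeholder: wire this to PagerDuty, email, ChatOps, etc.
    print(f"ALARM: {message}")


if __name__ == "__main__":
    usage = memory_usage_pct()
    if usage >= MEM_THRESHOLD_PCT:
        send_alarm(f"Memory usage {usage:.1f}% breached the {MEM_THRESHOLD_PCT}% threshold")
```

Run it from cron or a monitoring agent at whatever interval your remediation window allows.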
Indicators
Indicators can be compared with catching a cold. There are certain symptoms: first sneezing, then throat pain and irritation, a running nose, and finally a mild fever. Symptoms are similar to indicators; if any abnormality appears we have to remediate it, otherwise it will cause bigger damage. When the sneezing starts we have to take precautions, or we end up with the fever. It is the same with SRE: an indicator detects that server memory usage is at 80%. Requests are still served and users still get responses right up to 99%, but if we fail to address it the server will crash, users will be impacted, requests will fail, and we may end up with an outage lasting several minutes.
Let's take another example: OpenAM, which holds users' usernames and passwords along with their authorization privileges. Suppose the OpenAM server has an issue: log purging is running and it zips old logs, around 10,000 files. The gzip process starts consuming too much IO, CPU usage goes very high, and the gzip process moves from the R (running) state to the D (uninterruptible sleep) state because IO is too high. Server health turns RED in APM/AIOps the moment the process moves from R to D. Up to this point the user login feature has no impact, but suddenly the success rate starts dropping and the failure rate climbs, which indicates users are facing abnormalities. The SRE had an alarm configured and addressed it by killing the gzip process.
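A minimal sketch of how such a check might look, assuming the psutil library is available; raise_alarm() is a hypothetical alerting hook. It flags any process stuck in the D (disk-sleep) state:

```python
# dstate_check.py - sketch: flag processes stuck in D (uninterruptible sleep) state.
# Assumes psutil is installed; raise_alarm() is a hypothetical alerting hook.
import psutil


def raise_alarm(message: str) -> None:
    print(f"ALARM: {message}")  # placeholder for your real alerting integration


def check_dstate_processes() -> None:
    for proc in psutil.process_iter(["pid", "name", "status"]):
        try:
            if proc.info["status"] == psutil.STATUS_DISK_SLEEP:
                raise_alarm(f"{proc.info['name']} (pid {proc.info['pid']}) is in D state")
        except psutil.NoSuchProcess:
            continue  # process exited between listing and inspection


if __name__ == "__main__":
    check_dstate_processes()
```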
So here is the gap: the SRE should not simply react to whatever got triggered; there should be something ahead of it. AIOps/APM tools help address this kind of issue. If we know a rope is too thin, anyone who tries to climb it may fall; the AIOps tool should mark it RED after seeing the thickness of the rope, before anyone climbs. There should be an end-to-end dashboard, along with golden indicators, configured for the service.
How your dashboard should look
When I say end-to-end dashboard, consider a request flowing from ELB → Service A (nginx server) → Service A (java server) → store and fetch cached data on a Redis server → a timer service that syncs data from Redis to MySQL → Service A's MySQL server. Service A also needs to get a key from the Key Management Service (KMS) for decryption.
Your dashboard should include at least the following:
ELB: active connections and new connections.
Service A nginx: success rate, latency, failure rate.
Service A java service: success rate, latency, failure rate.
Service A Redis: allocated memory, used memory, hit rate, DB size, number of keys without expiration, TPS, read TPS, write TPS.
Timer service sync: success rate, latency, failure rate.
DB: TPS, read TPS, write TPS, disk usage, process percentage, MySQL slave lag behind master, long-running queries, slave status.
Service A interface to Key Management Service: interface connection count, interface success rate, interface latency, interface failure rate.
This KPI dashboard should be in place. It is up to the SRE to configure it, and more KPI metrics can be added. But if we add too many metrics, our checks may go in the wrong direction and that will hurt MTTR. It is good to keep all these metrics on a single page, so that when an SRE searches for Service A in AIOps or any other APM tool, all KPI metrics are visible at once.
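Purely as an illustration, the panels above could be declared as data and fed to whatever dashboarding tool you use; the component and metric names below are assumptions, not a real APM schema:

```python
# dashboard_panels.py - sketch: the Service A end-to-end dashboard declared as data.
# Component and metric names are illustrative assumptions, not a real APM/AIOps schema.
SERVICE_A_DASHBOARD = {
    "elb": ["active_connections", "new_connections"],
    "nginx": ["success_rate", "latency", "failure_rate"],
    "java_service": ["success_rate", "latency", "failure_rate"],
    "redis": ["allocated_memory", "used_memory", "hit_rate", "db_size",
              "keys_without_expiration", "tps", "read_tps", "write_tps"],
    "timer_sync": ["success_rate", "latency", "failure_rate"],
    "mysql": ["tps", "read_tps", "write_tps", "disk_usage", "process_pct",
              "slave_behind_master", "long_queries", "slave_status"],
    "kms_interface": ["connection_count", "success_rate", "latency", "failure_rate"],
}

# Render every panel on a single page so one search for "Service A" shows everything.
for component, metrics in SERVICE_A_DASHBOARD.items():
    print(f"[{component}] " + ", ".join(metrics))
```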
Golden indicators
The dashboard shows metrics about how live requests are being processed. A golden indicator marks a specific object that should be monitored and have an alarm configured. For example, Service A needs to connect to the Key Management Service. KMS is under maintenance and it has 4 nodes in its load-balancing pool; due to the maintenance activity only 2 nodes are taking requests. There is a chance those 2 nodes get overloaded, so the indicator on our side shows Yellow; if one of the overloaded nodes crashes, our indicator should turn RED. How do we handle this kind of chaos? During such activities the KMS side can route traffic to a blue/grey region that still has 4 nodes to balance the load. Once that is done, transactions flow smoothly again and our indicator can turn back to Green.
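A minimal sketch of such a golden indicator, assuming we can query how many KMS pool nodes are healthy; the node counts passed in are hypothetical inputs from a health endpoint:

```python
# kms_pool_indicator.py - sketch: colour a golden indicator from downstream pool health.
# The healthy/total node counts are assumed to come from your KMS health endpoint.
def pool_indicator(healthy_nodes: int, total_nodes: int) -> str:
    """Return GREEN/YELLOW/RED based on how much of the pool is serving traffic."""
    if total_nodes == 0 or healthy_nodes == 0:
        return "RED"
    ratio = healthy_nodes / total_nodes
    if ratio >= 0.75:
        return "GREEN"   # e.g. 3 or 4 of 4 nodes taking requests
    if ratio >= 0.5:
        return "YELLOW"  # e.g. 2 of 4 during maintenance: risk of overload
    return "RED"         # e.g. 1 of 4 after a crash: user impact likely


print(pool_indicator(2, 4))  # YELLOW - the maintenance scenario above
print(pool_indicator(1, 4))  # RED    - one of the remaining nodes crashed
```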
Certain core golden indicators are required. Certificate expiry date: if a certificate is valid for only 90 more days, the Service A certificate indicator should turn RED/YELLOW. The resources section for Service A modules should cover all core system metrics such as FD limit, CPU, memory, zombie processes, inodes, disk, fstab, qdisc, OS patch level, service software patch level and service health; if any abnormality is found in the last 2 minutes, the indicator should turn RED/YELLOW. Thresholds must be configured to change the colour, and a matching alarm should be triggered so action is taken. Take our previous example: the moment the gzip process turned from R to D, the indicator should have turned from GREEN to YELLOW and an alarm should have fired. The same goes for security: abnormalities in patch versions, weird queries executed against your services, execution of safe versus dangerous commands, and DDoS attack symptoms should all have indicators configured.
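A minimal sketch of the certificate-expiry golden indicator, assuming the check runs from a host that can reach the service endpoint; the host name is a placeholder and the 30/90-day colour bands are assumptions you would tune:

```python
# cert_expiry_indicator.py - sketch: colour the certificate indicator from days left.
# "service-a.example.com" is a placeholder endpoint, not a real host.
import socket
import ssl
import time


def days_until_expiry(host: str, port: int = 443) -> int:
    """Connect over TLS and return the number of days until the cert expires."""
    ctx = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=5) as sock:
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            cert = tls.getpeercert()
    expiry_ts = ssl.cert_time_to_seconds(cert["notAfter"])
    return int((expiry_ts - time.time()) // 86400)


def cert_indicator(days_left: int) -> str:
    if days_left <= 30:
        return "RED"
    if days_left <= 90:   # the 90-day window mentioned above
        return "YELLOW"
    return "GREEN"


if __name__ == "__main__":
    days = days_until_expiry("service-a.example.com")
    print(f"{days} days left -> {cert_indicator(days)}")
```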
The SRE needs to create separate dashboards for golden indicators, system performance and business transactions. One or two dedicated monitors are required to watch these indicators; putting those monitors up on the wall works very well.
How to take care of these indicators during off-hours?
This can be achieved with ChatOps or an event-based approach. The SRE can pull these metrics by calling certain APIs or running scripts and posting the output over ChatOps. An inspection script can also be executed to confirm there are no abnormalities.
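A minimal sketch of that ChatOps flow, assuming the requests library, a hypothetical metrics API endpoint and a hypothetical chat webhook URL:

```python
# chatops_report.py - sketch: pull KPI metrics and post them to a chat channel.
# Both URLs below are hypothetical placeholders; adapt to your APM and chat tool.
import requests

METRICS_API = "https://apm.example.com/api/service-a/kpis"   # hypothetical endpoint
CHAT_WEBHOOK = "https://chat.example.com/hooks/sre-channel"  # hypothetical webhook


def fetch_kpis() -> dict:
    resp = requests.get(METRICS_API, timeout=10)
    resp.raise_for_status()
    return resp.json()


def post_to_chat(kpis: dict) -> None:
    lines = [f"{name}: {value}" for name, value in kpis.items()]
    payload = {"text": "Service A KPIs\n" + "\n".join(lines)}
    requests.post(CHAT_WEBHOOK, json=payload, timeout=10)


if __name__ == "__main__":
    post_to_chat(fetch_kpis())
```

The same script can be wired to a chat command so anyone on call can request the KPIs without logging in to the APM tool.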
AIOps with ML-based failure detection
There are good ML-based approaches on the market to detect failures. Let's take an example: a sudden surge in request count, whose direct impact is high resource usage. If we configure auto scaling, it will add another node to the pool, but the SRE does not know when the request count will spike. Say a new product is being launched; only the leadership team knows that many people will log in to Service A around 1 PM to purchase it. That prediction time, along with the system configuration data, can be fed into an AIOps or ML-based solution, which trains and tests on it to predict when to scale up and how many extra resources are required for the event. These indicators can be projected on the dashboard and alarms configured for when certain actions need to be executed. There are other methods as well: if the request rate increases by 20%, automatic resource scheduling can kick in; if the new resource is not utilized it will not scale further, and if it is utilized it will continue to scale. There are many use cases for advanced auto-detection using an ML approach.
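Purely as an illustration (a naive linear trend, not a production ML model), a forecast of request count from recent samples could decide whether to pre-scale; the sample data and node capacity below are made-up numbers:

```python
# scale_forecast.py - sketch: naive linear-trend forecast of request rate to pre-scale.
# The sample data and per-node capacity are made-up numbers for illustration only.
import numpy as np

requests_per_min = np.array([900, 950, 1020, 1100, 1180, 1300])  # last 6 minutes (example data)
minutes = np.arange(len(requests_per_min))

# Fit a straight line through the recent samples and project 15 minutes ahead.
slope, intercept = np.polyfit(minutes, requests_per_min, deg=1)
forecast = slope * (len(requests_per_min) + 15) + intercept

NODE_CAPACITY = 1000  # requests/min one node can handle (assumption)
current_nodes = 2
needed_nodes = int(np.ceil(forecast / NODE_CAPACITY))

if needed_nodes > current_nodes:
    print(f"Forecast {forecast:.0f} req/min -> scale from {current_nodes} to {needed_nodes} nodes")
else:
    print(f"Forecast {forecast:.0f} req/min -> no scaling needed")
```

A real AIOps solution would replace the straight-line fit with a trained model and feed the planned 1 PM launch window in as a feature, but the scale-up decision at the end stays the same shape.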
When an interfaced system goes down, we can forecast how it will impact the success, latency and delay rates. The same can be done for when Service A goes into maintenance mode: how it will impact the business, and how long the service will suffer latency issues if no failover region is configured.
ML-based auto-detection will also identify risks to your service: a new patch released but not applied, a vulnerable port left open, or a well-known port misconfigured. These will be auto-scanned at frequent intervals, and alarms can be configured to remediate them. The same KPI values can be added to the golden indicators, so that when the colour turns RED or YELLOW the SRE can take action.
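A minimal sketch of that kind of periodic scan, assuming a hypothetical allow-list of ports for Service A; anything else found open flips the indicator to RED:

```python
# port_scan_check.py - sketch: flag open ports that are not on the allow-list.
# The target host, allow-list and scan list are hypothetical; run this on a schedule.
import socket

TARGET_HOST = "service-a.example.com"             # placeholder host
ALLOWED_PORTS = {443, 8443}                       # ports Service A is expected to expose
PORTS_TO_CHECK = [22, 80, 443, 3306, 6379, 8443]  # small illustrative scan list


def is_open(host: str, port: int) -> bool:
    """Return True if a TCP connection to host:port succeeds within 2 seconds."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(2)
        return sock.connect_ex((host, port)) == 0


unexpected = [p for p in PORTS_TO_CHECK if is_open(TARGET_HOST, p) and p not in ALLOWED_PORTS]
if unexpected:
    print("RED: unexpected open ports:", unexpected)
else:
    print("GREEN: no unexpected ports open")
```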