Are you an SRE?

Who is an SRE?
SRE stands for Site Reliability Engineering. It means that the web services serving the business have to be up and running 24/7. If an issue comes from the network, system, DB or app layer, the SRE should be capable of handling it and fixing it within the agreed SLA. If the issue is in the PaaS (Platform as a Service) layer, the SRE has to fix it without affecting the business. An SRE should have deep knowledge of the network, app and data layers. Constant upskilling is mandatory, as is hands-on experience with new technologies and a good understanding of DevOps culture.
An SRE spends 50% of the time on maintenance work and the other 50% on automation work. The SRE has to identify repeated manual tasks and automate them. This can be either code delivery or auto-healing of incidents and alarms.
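As a rough illustration of alarm auto-healing, the sketch below maps alarm names to remediation functions. Everything in it is hypothetical: the alarm names, the `restart_service` and `clear_temp_files` helpers and the `handle_alarm` function are made up for this example, and the helpers only record what they would do instead of touching a real system.

```python
from typing import Callable, Dict, List

# Record of actions taken, so this sketch has no real side effects.
actions_log: List[str] = []


def restart_service(target: str) -> None:
    # Hypothetical remediation: a real version would call your service manager.
    actions_log.append(f"restart {target}")


def clear_temp_files(target: str) -> None:
    # Hypothetical remediation: a real version would free disk space on the host.
    actions_log.append(f"clean disk on {target}")


# Hypothetical alarm names mapped to the automated fix for each one.
REMEDIATIONS: Dict[str, Callable[[str], None]] = {
    "service_down": restart_service,
    "disk_full": clear_temp_files,
}


def handle_alarm(alarm: str, target: str) -> bool:
    """Run the automated fix for a known alarm; return False if a human is needed."""
    action = REMEDIATIONS.get(alarm)
    if action is None:
        return False
    action(target)
    return True
```

Alarms without a known fix return False, so they can still be escalated to a person rather than silently ignored.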
If the service has an issue and is failing to serve customer requests, the SRE should be capable of routing traffic from the faulty node to healthy nodes, checking the logs and analyzing them with the developers, and ensuring that stable code is deployed in the environment.
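Routing traffic away from a faulty node can be sketched as follows. This is a minimal example, assuming node health is already known as a simple name-to-boolean mapping; the node names and the `healthy_nodes` and `route_requests` functions are hypothetical, and a real load balancer would do this for you.

```python
from itertools import cycle
from typing import Dict, List


def healthy_nodes(health: Dict[str, bool]) -> List[str]:
    """Return the names of nodes currently marked healthy, in a stable order."""
    return sorted(node for node, ok in health.items() if ok)


def route_requests(health: Dict[str, bool], request_count: int) -> List[str]:
    """Assign each request to a healthy node in round-robin order."""
    nodes = healthy_nodes(health)
    if not nodes:
        raise RuntimeError("no healthy nodes available")
    rotation = cycle(nodes)
    return [next(rotation) for _ in range(request_count)]


# node-b is faulty, so it receives no traffic.
assignments = route_requests({"node-a": True, "node-b": False, "node-c": True}, 4)
```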
KPIs are one of the core components of SRE, and they have to be monitored 24/7. KPI stands for key performance indicator.
The SRE has to focus on KPI metrics such as request rate, success rate, delay, incident closure and code delivery delay. To monitor these metrics, the SRE has to develop proper dashboards. If any abnormality is found in these KPI metrics, the SRE has to take action and remediate it at the earliest.
For example, say the service success rate has to be 100% all the time. If some issue drops the success rate from 100% to 99.80%, the SRE first has to fix the issue and restore the service, then make sure the success rate is back to 100%.
Let's take another example: delay should not exceed 10 milliseconds. Due to some issue, users are facing latency on their side and your KPI value has jumped from 5 milliseconds to 15 milliseconds. The SRE has to identify the lag and restore the service.
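The two examples above can be turned into a small check. This is only a sketch: the thresholds come from the examples (100% success rate, 10 ms delay), while the function names, the KPI labels and the choice to treat "no traffic" as 100% success are assumptions made for illustration.

```python
from typing import List

SUCCESS_RATE_TARGET = 100.0  # percent, taken from the example above
MAX_DELAY_MS = 10.0          # milliseconds, taken from the example above


def success_rate(total_requests: int, successful_requests: int) -> float:
    """Success rate as a percentage of all requests."""
    if total_requests == 0:
        return 100.0  # assumption: no traffic means nothing has failed
    return 100.0 * successful_requests / total_requests


def breached_kpis(total_requests: int, successful_requests: int, delay_ms: float) -> List[str]:
    """Return the names of the KPIs that are outside their targets."""
    breaches = []
    if success_rate(total_requests, successful_requests) < SUCCESS_RATE_TARGET:
        breaches.append("success_rate")
    if delay_ms > MAX_DELAY_MS:
        breaches.append("delay")
    return breaches
```

With 9,980 successful requests out of 10,000 and a delay of 15 ms, both KPIs are reported as breached; with 10,000 out of 10,000 and 5 ms, the list is empty.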
Sometimes an SRE has to work at the architectural level. For example, your service is running in only a single Availability Zone (AZ1), so the service risk factor is 100%. If AZ1 goes down, the entire service goes down and there is an outage. The SRE has to turn it into a dual cloud solution. An SRE should be capable of identifying and fixing similar loopholes.
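One way to put a number on this kind of risk is to ask how much capacity disappears if the largest zone fails. The sketch below assumes capacity is simply the node count per zone; `worst_case_capacity_loss` and the zone names other than AZ1 are hypothetical.

```python
from typing import Dict


def worst_case_capacity_loss(nodes_per_zone: Dict[str, int]) -> float:
    """Fraction of all nodes lost if the zone holding the most nodes goes down."""
    total = sum(nodes_per_zone.values())
    if total == 0:
        raise ValueError("deployment has no nodes")
    return max(nodes_per_zone.values()) / total


single_zone = worst_case_capacity_loss({"AZ1": 4})          # everything is in AZ1
two_zones = worst_case_capacity_loss({"AZ1": 2, "AZ2": 2})  # split across two zones
```

A single-zone deployment loses all of its capacity in the worst case, while an even split across two zones loses half.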
Prior to SRE
Prior to SRE, many independent teams supported the business, and each individual was specialized in their own domain. They are listed below.
1. System Administration
2. Network Administration
3. Middleware Administration
4. Database Administration
System Administration
What they do:
SysAdmins mostly focus on troubleshooting core system issues, applying changes, upgrading the OS and applying patches.
What they don’t do:
They don't focus on checking application health or on database and network issues. They won't even care about the business services running on the server.
Network Administration
What they do:
Network Admins mostly focus on troubleshooting core network issues, upgrading network components and checking firewalls. They make sure the network firewall is closed and secure. They do take care of business service transactions.
What they don’t do:
They don't focus on checking application health or on database and OS issues.
Middleware Administration
What they do:
Middleware Admins focus on application health and performance. They have to ensure that connections to the database, MQ and other dependent services are stable. They deploy code and restart services, and sometimes they have to take care of the OS as well: if a service's CPU usage is very high, the Middleware Admin has to optimize it, and the same applies to high memory usage.
Sometimes the work falls outside the Middleware Admin's scope. For example, when the OS has to be upgraded to a higher version, the Middleware Admin won't do it. They submit a change request to the SysAdmin, who owns the change and implements it.
As another example, say an application service needs to connect to another component. The Middleware Admin won't open the firewall directly. They submit a change record to get that done.
One last example: say you need to create a DB or take a backup. The Middleware Admin definitely won't do it; the DBA will take care of it.
What they don’t do:
Middleware Admins don't touch the core dependent components. They work closely with the DBA, Network Engineer and SysAdmin.
Database Administration
What they do:
DBAs focus on creating databases and schemas, running DDL/DML statements and taking backups. They make sure the database is up and running fine. If there is any issue with the database service or host, the DBA will be the first person to handle it.
What they don’t do:
They won't check whether the application service is up or down, or whether the network and OS are OK.
How does SRE help?
Say, for example, a webpage is not loading. My Technical Duty Officer is responsible for providing feedback to higher management about the outage.
His steps to restore the service:
1. Open a call and pull the SysAdmin, network, middleware and DBA teams into it.
2. Ask the System Admin to check whether the system is OK.
3. Ask the Middleware Admin to check whether the application is OK.
4. Ask the Network Engineer to check whether the network is OK.
5. Ask the DBA to check whether the DB is OK.
Finally, the DBA finds that the mysql service is down and restarts it. The user refreshes the page and it now opens fine.
To troubleshoot one single issue, we needed four different people, one from each domain.
Now say, for example, that I'm an SRE. I know Linux, databases and middleware services, and I have some network knowledge. Here the TDO (Technical Duty Officer) doesn't need to pull four different people into a single call. Instead, he can call me directly and I can troubleshoot the issue. I can provide instant updates and we can restore the service as soon as possible.
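The layer-by-layer checks from the outage call are exactly the kind of thing a single SRE can script. The sketch below walks through the layers in the same order and reports the first one that fails. The `first_failing_layer` function and the stubbed checks are hypothetical; real checks would query the host, the application, the network and the database.

```python
from typing import Callable, List, Optional, Tuple

# A check is a layer name plus a function that returns True when the layer is OK.
Check = Tuple[str, Callable[[], bool]]


def first_failing_layer(checks: List[Check]) -> Optional[str]:
    """Run the checks in order and return the first layer that is not OK."""
    for layer, is_ok in checks:
        if not is_ok():
            return layer
    return None  # every layer reported OK


# Stubbed checks that mirror the outage above: only the database is down.
outage_checks: List[Check] = [
    ("system", lambda: True),
    ("middleware", lambda: True),
    ("network", lambda: True),
    ("database", lambda: False),
]
faulty_layer = first_failing_layer(outage_checks)
```

Here the result points at the database layer, which matches the mysql service being down in the story.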