Core Platform Services and SRE Pain Points
Platform services are core to the business, and we should think of them as shared service applications that are used globally by all other applications. This blog discusses how services are categorized as platform services and how they support the business, what the impact is if a platform service goes down, and the types of pain points a platform service brings. It also covers why platform SREs get so many pings and calls from other service SREs and developers who need work done.
How do you decide that a service goes in the platform bucket?
A service that helps other services and is commonly used by all of them is considered a platform service. Example: in the financial industry, customer data is saved in a common database. Suppose around 1000 applications are running in your environment and each application gets its data from that database. Too many I/O operations and SQL executions may hit the database, and because of that there is a chance of slow query execution or of the database crashing. Now suppose a developer builds an API-based Service A to get certain customer data from the DB (read) and another API-based Service B to put certain customer data into the DB (write). Business services no longer need to access the DB directly. Instead, they can call platform Service A or B.
If this is still complex, or to improve reliability, the business service can have Redis, where it keeps frequently accessed customer data with a certain expiry. The business service can call Service A to feed data from the DB into Redis during off-peak hours. Keeping Redis in sync with the DB needs some kind of timer service. The business service calls Service B only to write data to MySQL, while the Redis refresh and the timer job service call only platform Service A.
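The caching idea above can be sketched roughly as follows. This is a minimal illustration, not the actual setup: `ExpiringCache` is an in-memory stand-in for Redis, and `get_customer` and the `service_a_read` callable are hypothetical names for the business-side lookup and the call to platform Service A.

```python
import time
from typing import Callable, Dict, Optional, Tuple


class ExpiringCache:
    """In-memory stand-in for Redis (assumption): every key has an expiry."""

    def __init__(self, ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._store: Dict[str, Tuple[float, dict]] = {}

    def get(self, key: str) -> Optional[dict]:
        entry = self._store.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            # Entry is past its expiry: drop it and report a miss.
            del self._store[key]
            return None
        return value

    def set(self, key: str, value: dict) -> None:
        self._store[key] = (self._clock() + self._ttl, value)


def get_customer(customer_id: str, cache: ExpiringCache,
                 service_a_read: Callable[[str], dict]) -> dict:
    """Serve from the cache when possible, else fall back to Service A."""
    cached = cache.get(customer_id)
    if cached is not None:
        return cached
    # Cache miss: one read through platform Service A (hypothetical callable).
    record = service_a_read(customer_id)
    cache.set(customer_id, record)
    return record
```

With this shape, repeated reads for the same customer within the expiry window never reach Service A or the database, which is the load reduction the paragraph above is after.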
There are many services that help business services or other services to fulfil certain business logic. Let's take another example: a financial firm wants to attract its customers and sell credit cards to them, but only to certain specific customers. A developer can create an analysis app that is able to pull data from the standby databases. The business service can call the analysis app with a few API calls and certain conditions, and feed the results into credit card ads on the portal.
KMS is a widely used service that holds the private keys for certain services. If a business service needs to decrypt some data to execute certain business logic, it calls the KMS service, gets the key, and uses it to decrypt the data. KMS holds many private keys for all services. If you look at its logic, it just sends data.
Platform services don't have many major upgrades. Each has certain logic that is executed umpteen times and helps many business services. A routing service is intended to feed location-based news. If you open Chrome in London, it calls the London news database service, which feeds you the latest London news, and the same goes for India or any other country. It just routes based on the source IP. The agreement page is another core platform service. It just presents the terms and agreement, and once the user agrees, the user lands on the requested service. So all apps have to call the agreement service when a user tries to install a new application. Data Sync is a similar app: if a user buys a new phone, they don't need to manually sync data from the old phone to the new one. The old phone's data will already have been synced automatically to the cloud, and the cloud just restores it; the interface for this is the Data Sync service. To cut a long story short, it is very simple: a service that helps all other services in a repeated way is considered a platform service.
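The source-IP routing described above could look something like the sketch below. The network ranges are reserved documentation prefixes and the backend names (`london-news`, `india-news`, `global-news`) are made up for illustration; a real routing service would more likely use a GeoIP database.

```python
import ipaddress

# Hypothetical mapping from client networks to regional news backends.
REGION_BY_NETWORK = {
    ipaddress.ip_network("192.0.2.0/24"): "london-news",
    ipaddress.ip_network("198.51.100.0/24"): "india-news",
}
DEFAULT_BACKEND = "global-news"  # assumed fallback when no region matches


def route_by_source_ip(source_ip: str) -> str:
    """Return the backend name for the network containing source_ip."""
    address = ipaddress.ip_address(source_ip)
    for network, backend in REGION_BY_NETWORK.items():
        if address in network:
            return backend
    return DEFAULT_BACKEND
```

The logic is small and rarely changes, which matches the point that platform services run the same simple step a very large number of times.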
What would be the impact if a platform service is unavailable?
It is definitely going to be a disaster, and the business will be badly impacted. Many business SREs will see their failure rates going up. Task-based services that run at set intervals will be less impacted, and they will be OK on the next run. Business flows that depend on the platform service will be badly impacted. Since it is a common service, many services use it, so the platform SRE has to keep an eye on resource usage. If resource usage is climbing, an immediate or emergency expansion is usually required. Let's take an example: a new phone model is launched. Data Sync and Agreement are the most widely used services at that point, and their resource usage may be high for some time. The SRE has to optimize capacity during that time frame.
While doing a migration, the SRE has to be cautious and should not impact other services. Most configuration is based on a URL domain. If the IP is changed for that domain, the DNS cache may come into play and clients will keep calling the old resource. If the old resource has already been removed, this shows up directly as a high failure rate, and users will be impacted. The SRE should wait until the old DNS cache entries have expired and only then remove the old resource.
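As a rough illustration of the "wait for the DNS cache to expire" rule, the check below treats the old resource as removable only after the record's TTL plus a safety margin has passed since the cutover. The function name and the 300-second default margin are assumptions, and some resolvers and clients cache longer than the TTL, so this is a lower bound and not a guarantee.

```python
def safe_to_remove_old_resource(cutover_time: float, record_ttl: float,
                                now: float,
                                safety_margin: float = 300.0) -> bool:
    """True once every cache that honours the TTL should have refreshed.

    All times are in seconds on the same clock (for example, Unix epoch).
    """
    return now >= cutover_time + record_ttl + safety_margin
```

In practice it also helps to confirm that traffic on the old resource has actually dropped to zero before removing it.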
Some business services use the platform service's ELB IP directly in their code. If the ELB goes down or the SRE migrates to another ELB, the IP may change. The business service will keep hitting the old ELB, and it will be lucky if a resource is still available in that pool. In these kinds of scenarios the service SRE has no control. The platform SRE has to publish a domain-based URL for the service, and the business teams should adhere to it.
A platform service should have a backup availability zone and should run all of its services in an A-A (Active-Active) setup. When a service is being upgraded, traffic should be migrated properly to a working zone or node. If the SRE fails to do this, there will be intermittent failures in business transactions. In a distributed environment, request flow should hit only active nodes. The SRE has to keep this in mind. Formal notice should be given before any upgrade or migration of a platform service, because it is the base for all business.
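The rule that requests should hit only active nodes can be sketched with a small round-robin pool. `ActiveNodePool` and the zone names are hypothetical; in a real deployment this job is normally done by the load balancer's health checks and connection draining.

```python
from itertools import count
from typing import Iterable


class ActiveNodePool:
    """Round-robin over nodes, skipping any node marked as drained."""

    def __init__(self, nodes: Iterable[str]):
        self._active = {node: True for node in nodes}
        self._counter = count()

    def drain(self, node: str) -> None:
        # Take the node out of rotation before upgrading it.
        self._active[node] = False

    def restore(self, node: str) -> None:
        # Put the node back once the upgrade is verified.
        self._active[node] = True

    def pick(self) -> str:
        active = [n for n, is_up in self._active.items() if is_up]
        if not active:
            raise RuntimeError("no active nodes available")
        return active[next(self._counter) % len(active)]
```

During an upgrade, the SRE would call `drain("zone-a")`, upgrade that zone while `pick()` returns only the other zone, and then call `restore("zone-a")`.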
Pain points
Platform SREs have many pain points. Sometimes they have to differentiate between genuine and non-genuine issues. The logs may show many client-related failure error codes. Business teams will pull the platform SRE onto a call, and the SRE needs to justify and confirm that there is no issue on the platform side. Sometimes the platform SRE needs core functional knowledge and has to work like an L1 service desk engineer.
Business services use platform services very heavily, which leads to huge resource usage on the platform side, and sometimes to crashes at the node level and the data tier level. One particular service can create an issue that impacts all the other business services. As I said earlier, resource usage has to be monitored at the service and data tier levels. Expansion and reduction should happen on time. This is a job the platform SRE has to automate.
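One simple way to automate expansion and reduction is a target-tracking rule: size the pool so that average utilization lands near a target. The sketch below is only an illustration; the function name, the 60% target and the node limits are assumptions, not values from any real service.

```python
def desired_node_count(current_nodes: int, utilization_percent: int,
                       target_percent: int = 60,
                       min_nodes: int = 2, max_nodes: int = 20) -> int:
    """Scale the node count so utilization moves toward target_percent.

    Uses integer ceiling division to avoid floating-point rounding issues.
    """
    needed = -(-current_nodes * utilization_percent // target_percent)
    # Clamp so the pool never shrinks below, or grows beyond, agreed limits.
    return max(min_nodes, min(max_nodes, needed))
```

For example, four nodes running at 90% utilization with a 60% target would be expanded to six, while a nearly idle pool would shrink only as far as `min_nodes`.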
Platform services can be a common service or a tool. A common service is easy to handle to some extent. If it is a tool used by all the other SREs, developers, DevOps, SecOps and MLOps, then the real pain starts. Consider that you are hosting Jenkins behind a customized UI form and have named it the job service. The intent of this tool is automation, and it has only the shell execute plugin for executing shell scripts on 10,000 servers. The 10K servers are in different subnets. Take the first problem: a new patch has to be applied to all 10K servers. Only 3K servers are done and the deadline for the task is today, so the patch script is executed on 4K of the remaining servers at the same time. The load on the slave nodes is too high, and because of this other scripts start failing or hanging, and at some point the entire URL starts loading slowly. The platform SRE has to take proper action, either by limiting resources or by increasing the number of slave nodes to balance the load. This has to be worked out along with the developers.
The same scenario applies to a common tool that connects to a shared database. If the DB I/O rate is very high due to some particular transaction, it will create an outage for all connected services.
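Limiting the load in the patching scenario can be sketched as a cap on how many servers are worked on at once. This is not Jenkins configuration; `run_on_servers`, `run_script` and the default of 50 parallel runs are hypothetical, and the idea is only that 4K servers are queued and processed in bounded parallelism instead of all at the same moment.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable


def run_on_servers(servers: Iterable[str],
                   run_script: Callable[[str], str],
                   max_parallel: int = 50) -> Dict[str, object]:
    """Run run_script on each server, at most max_parallel at a time.

    Returns a dict mapping each server to its result, or to the exception
    it raised, so one failing server does not abort the whole batch.
    """
    results: Dict[str, object] = {}
    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        futures = {server: pool.submit(run_script, server)
                   for server in servers}
        for server, future in futures.items():
            try:
                results[server] = future.result()
            except Exception as exc:
                results[server] = exc
    return results
```

The cap protects the job service itself, so other teams' scripts keep running while the large patch batch works through its queue.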
Platform SREs have to be careful when migrating a tool-based solution. ACLs and security groups have to be handled properly. If the subnets of the 10K servers are not added to the inbound rules of the ACL attached to the new subnet, the impact is close to an outage. No server can connect to the job service, and jobs will fail to execute. SREs have to add these points to the migration checklist.
A similar notice has to be given to all business SREs when a tool-based platform service goes through an upgrade, a migration, a network change or any other minor change. It impacts their delivery, not the business. Even though a tool being down creates only an internal outage, the platform SRE can't take it lightly.
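A checklist item like this can be verified before the migration. The sketch below compares the server subnets against the inbound allow rules using Python's standard `ipaddress` module; the function name and the CIDR ranges in the usage example are illustrative assumptions.

```python
import ipaddress
from typing import Iterable, List


def uncovered_subnets(server_subnets: Iterable[str],
                      inbound_allowed: Iterable[str]) -> List[str]:
    """Return the server subnets that no inbound allow rule fully covers."""
    rules = [ipaddress.ip_network(cidr) for cidr in inbound_allowed]
    missing = []
    for cidr in server_subnets:
        subnet = ipaddress.ip_network(cidr)
        covered = any(
            # subnet_of() requires both networks to be the same IP version.
            subnet.version == rule.version and subnet.subnet_of(rule)
            for rule in rules
        )
        if not covered:
            missing.append(cidr)
    return missing
```

For example, `uncovered_subnets(["10.1.0.0/24", "10.2.0.0/24"], ["10.1.0.0/16"])` reports `10.2.0.0/24` as missing, which is exactly the kind of gap that would stop those servers from reaching the job service after the move.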