July 29, 2021

7 Min Read

Site Reliability Engineering: An Introduction

Prithvi Raj


Community Manager at ChaosNative

prithvi@chaosnative.com
Introduction

Reliability is a prerequisite for every system, not an option. Any team that treats it as a luxury should rethink its priorities and move reliability to the top of the list. The value placed on reliability, resiliency, security, observability, and scalability gave rise to the term Site Reliability Engineering (SRE), coined at Google in the early 2000s.

Infrastructure complexity keeps growing, and it demands practices that draw on operations as well as software development.

To build highly scalable and reliable systems, enterprises need practices that combine software development with IT operations. The outcome is Site Reliability Engineering.

What is an SRE?

Site Reliability Engineering is therefore the practice of developing, automating, shipping, and maintaining software in production, using strategies that mix what development, systems, and operations teams do. The discipline existed long before the term SRE was coined; previously, it was seen as a mixture of traditional IT engineering and DevOps.

It involves striking the right balance between shipping new features and meeting the requirements that the existing system already has to satisfy. As systems take on more functionality, it is vital to manage them through solutions that stay scalable and manageable across hundreds or thousands of complex systems.

All in all, SRE practices help achieve reliability goals while organizations adopt newer methodologies in the shift from traditional systems to cloud-native systems.

Site reliability engineering helps organizations decide which new features can be launched and when to launch them, using service-level agreements (SLAs) to define the required reliability of the system through service-level indicators (SLIs) and service-level objectives (SLOs).

Site Reliability Engineer

A Site Reliability Engineer is the person who practices the principles of Site Reliability Engineering and is responsible for ensuring that the code is correctly:

  • Deployed
  • Configured
  • Monitored
  • Observed
  • Managed
  • Safeguarded

These checks help achieve proper availability, latency, system management, change management, capacity management, and emergency response as systems become production-grade, or production-ready.

The following activities are carried out by Site Reliability Engineers as part of their overall goals:

Disaster Management:
Outages and downtime are a routine part of an SRE's work. The hard part is ensuring that outages are resolved in due time and that the team is prepared for potential disasters before they happen. SREs need to put a disaster recovery plan in place to limit the damage. Identifying disaster mitigation strategies, and automating as much of the mitigation process as possible, is crucial.

Incident response:
Production-level incidents are common and can occur in any application at any time. As new features and more modern methodologies are introduced into an application, the risk of incidents grows, which can become a concern when addressing client requirements. What is essential for an SRE is to minimize the impact of production-level incidents.

The following metrics are used to measure the speed and efficiency of incident response:

  • Mean time to detect (MTTD), which measures the average time needed to discover a problem
  • Mean time to resolve (MTTR), which measures the average time it takes to fix a failed system
  • Mean time to failure (MTTF), which is the average time a non-repairable system or component runs before it fails; this is similar to uptime and helps teams plan the replacement of system components before they stop working
  • Mean time between failures (MTBF), which measures the average time a repairable system or component works properly between one failure and the next
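
As a minimal sketch of how these metrics could be computed, the Python below averages time spans over a list of incident records. The `Incident` record, its field names, and the sample timestamps are hypothetical and only serve the illustration; teams also differ on whether MTTR is measured from the start of the fault or from its detection, and detection is assumed here.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    started: datetime    # when the fault actually began
    detected: datetime   # when monitoring or a person noticed it
    resolved: datetime   # when service was restored


def mean(deltas):
    """Average a non-empty list of timedeltas."""
    return sum(deltas, timedelta()) / len(deltas)


def mttd(incidents):
    """Mean time to detect: fault start until detection."""
    return mean([i.detected - i.started for i in incidents])


def mttr(incidents):
    """Mean time to resolve, measured here from detection (an assumption)."""
    return mean([i.resolved - i.detected for i in incidents])


def mtbf(incidents):
    """Mean healthy time between one incident ending and the next starting."""
    ordered = sorted(incidents, key=lambda i: i.started)
    gaps = [nxt.started - prev.resolved for prev, nxt in zip(ordered, ordered[1:])]
    return mean(gaps)


# Made-up incident history for illustration only.
incidents = [
    Incident(datetime(2021, 7, 1, 10, 0), datetime(2021, 7, 1, 10, 10), datetime(2021, 7, 1, 11, 0)),
    Incident(datetime(2021, 7, 11, 11, 0), datetime(2021, 7, 11, 11, 20), datetime(2021, 7, 11, 12, 0)),
]

print("MTTD:", mttd(incidents))
print("MTTR:", mttr(incidents))
print("MTBF:", mtbf(incidents))
```

MTTF is left out because it applies to components that are replaced rather than repaired, so it would need a list of component lifetimes instead of incident records.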

Maintaining SLAs, SLOs & SLIs:
To achieve the desired reliability and maintain high standards, defining and maintaining Service Level Agreements (SLAs), Service Level Objectives (SLOs), and Service Level Indicators (SLIs) is key. They help an enterprise understand a client's requirements better and create a systematic plan for the eventual deliverables. To understand how SLAs, SLOs, and SLIs drive SRE best practices, let us define them first.

A Service Level Agreement, or SLA, is the agreement or contract between the provider and the client. It lists the metrics that act as promises to the client and the consequences if the provider fails to meet them. If the client's expectations aren’t met, the potential consequences include penalties, extensions, service credits, etc.

A Service Level Objective, or SLO, is an individual objective or promise made to the client under the SLA. SLOs cover specific metrics such as uptime or response time, each with a target value or range of values that defines success from the client's point of view. They are driven mostly by customer requirements rather than by present performance.

A Service Level Indicator, or SLI, is the quantitative measurement behind an objective defined in an SLO. If an SLO sets a target for response time, then the measured response time is the SLI. An SLI is therefore measured directly on the service, ideally as users experience it, and pins down exactly what needs to be measured.

Thus, defining and maintaining SLAs, SLOs, and SLIs is fundamental to defining other major factors, such as error budgets and external factors, and to meeting Site Reliability best practices.
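
To make the relationship between an SLI and an SLO concrete, here is a small Python sketch. It assumes a latency-based SLI (the percentage of requests answered within a threshold) and compares it with a target. The function names, the 300 ms threshold, the 90% objective, and the sample latencies are all made up for the example.

```python
def latency_sli(latencies_ms, threshold_ms):
    """SLI: percentage of requests answered within threshold_ms."""
    if not latencies_ms:
        raise ValueError("no requests to measure")
    good = sum(1 for ms in latencies_ms if ms <= threshold_ms)
    return 100.0 * good / len(latencies_ms)


def meets_slo(sli_percent, slo_percent):
    """The SLO is met when the measured SLI reaches the target."""
    return sli_percent >= slo_percent


# Hypothetical response times in milliseconds.
latencies_ms = [120, 95, 310, 180, 250, 90, 450, 130, 110, 140]

sli = latency_sli(latencies_ms, threshold_ms=300)
print(f"SLI: {sli:.1f}%, meets a 90% SLO: {meets_slo(sli, 90.0)}")
```

The SLA does not appear in the code: it is the contract that states what happens when an objective like this one is missed.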

Formulating an error budget:
To leave the team room for agility, one needs to accept failure as inevitable. An error budget, derived from an SLO, not only supports innovation and risk mitigation but also identifies how much error your services can incur before clients become dissatisfied.

An error budget is defined as the maximum amount of time that infrastructure can incur failure without contractual consequences. To calculate it, we first need the SLI, which is given by the equation: SLI = [Good events / Valid events] x 100

The SLI is thus expressed as a percentage. The target you set for each SLI is its service-level objective (SLO), and the error budget is the remainder up to 100%. For instance, an SLO of 99.9% leaves an error budget of 0.1%.

For example, if your Service Level Agreement (SLA) specifies that the systems will function 99.99% of the time before the business has to compensate clients for an outage, the error budget (the time your systems can be down without consequences) is about 52 minutes and 35 seconds per year.
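
The arithmetic can be sketched in a few lines of Python. The function names and the event counts are hypothetical; the formulas are the SLI equation above and "100 minus the SLO" for the budget.

```python
def sli_percent(good_events, valid_events):
    """SLI = (good events / valid events) x 100."""
    if valid_events <= 0:
        raise ValueError("valid_events must be positive")
    return 100.0 * good_events / valid_events


def error_budget_percent(slo_percent):
    """The error budget is whatever the SLO leaves over, up to 100%."""
    return 100.0 - slo_percent


def budget_remaining(good_events, valid_events, slo_percent):
    """Fraction of the error budget still unspent (negative once overspent)."""
    if slo_percent >= 100.0:
        raise ValueError("an SLO of 100% leaves no error budget")
    allowed_bad = valid_events * error_budget_percent(slo_percent) / 100.0
    actual_bad = valid_events - good_events
    return 1.0 - actual_bad / allowed_bad


# Hypothetical month: 1,000,000 valid requests, 999,500 of them good, SLO of 99.9%.
print(f"SLI: {sli_percent(999_500, 1_000_000):.2f}%")
print(f"Error budget: {error_budget_percent(99.9):.2f}%")
print(f"Budget remaining: {budget_remaining(999_500, 1_000_000, 99.9):.0%}")
```

A team might slow down risky releases as the remaining fraction approaches zero, although the exact policy is a choice each organization makes for itself.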
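
The yearly figure can be checked with a short Python calculation. This is only a sketch: the function name is made up, and it assumes a year of 365.25 days, which is the assumption that reproduces the figure quoted above (a 365-day year gives a couple of seconds less).

```python
from datetime import timedelta


def allowed_downtime(availability_percent, period=timedelta(days=365.25)):
    """Downtime a given availability target permits over the period."""
    return period * (100.0 - availability_percent) / 100.0


yearly = allowed_downtime(99.99)
monthly = allowed_downtime(99.99, period=timedelta(days=30))

print(f"Per year: {yearly.total_seconds() / 60:.2f} minutes")
print(f"Per 30 days: {monthly.total_seconds() / 60:.2f} minutes")
```

Computing the budget over a shorter window, such as 30 days, is often more practical because it shows sooner whether the budget is being spent too quickly.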

Efficient Planning:
As a system goes through constant change and growing complexity, organizations need to assess the growth, spikes, and possible faults that can occur over time and prepare for them. An effective planning mechanism is needed to retire old and worn-out software, ensure quality and stability, conduct timely security and version checks, and cater to dependency needs. To prepare for these events, SREs need to forecast demand and plan time to react. Vital facets of capacity planning include regular load testing and accurate provisioning. Conducting load tests regularly allows teams to see how the system operates under the various possible usage scenarios. Adding capacity can also be expensive, so knowing where additional resources are needed is the key to planning.
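
As a rough illustration of demand forecasting and provisioning, the Python sketch below projects peak load under an assumed compound monthly growth rate and converts it into an instance count. Every name and number here is hypothetical: in practice the per-instance capacity would come from load tests, and the growth model would come from observed traffic rather than a fixed rate.

```python
import math


def instances_needed(peak_rps, monthly_growth, months_ahead,
                     rps_per_instance, headroom=0.3):
    """Instances required to serve the projected peak with spare headroom.

    peak_rps:         current peak load in requests per second
    monthly_growth:   assumed compound growth per month (0.10 means 10%)
    rps_per_instance: sustainable load per instance, e.g. from a load test
    headroom:         extra capacity kept free for spikes (0.3 means 30%)
    """
    projected_peak = peak_rps * (1 + monthly_growth) ** months_ahead
    return math.ceil(projected_peak * (1 + headroom) / rps_per_instance)


# Hypothetical service: 1000 rps peak today, growing 10% a month,
# with each instance handling 200 rps in load tests.
print("Today:", instances_needed(1000, 0.0, 0, 200))
print("In six months:", instances_needed(1000, 0.10, 6, 200))
```

Because adding capacity is expensive, a calculation like this is mainly useful for seeing how sensitive the instance count is to the growth and headroom assumptions.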

Addressing past issues:
Issues that affected the system in the past can reemerge and affect it again. It is therefore crucial to examine the why, how, and when of previous issues, especially those that caused major outages.

Conclusion

Site Reliability Engineering is a must for every organization looking to adopt reliability engineering practices and strengthen its DevOps outlook. People often confuse SRE with DevOps: SRE is a subset of DevOps, but the reverse is not true. SRE practices are about improving system performance while adhering to customer expectations and requirements, and about adopting new risk mitigation and disaster recovery strategies while keeping monitoring and measurement the top priority.

Chaos Engineering is an important practice that runs hand in hand with SRE to achieve the goals set by management.

Enterprises need to be patient and allow the process to deliver value. Site Reliability Engineering is much more than just reliability. It is about bridging the gap between the development and operations teams. It is about analyzing data and acting on it. It is about building a culture, and that culture eventually helps achieve customer success and satisfaction.
