The Definitive Guide to

Site Reliability Engineering (SRE)

Learn about site reliability engineering and its benefits, what site reliability engineers do, and the differences between SLOs, SLIs, and SLAs.

Introduction

While companies deal with an ever-growing IT infrastructure that supports cloud-native services, SRE or site reliability engineering has become increasingly important. One of the reasons is that the way engineering teams ship and operate software has drastically changed.

Ongoing update processes, quick releases, and fixes in a disparate environment have spurred the adoption of DevOps principles and a shift from centralized department silos to a new engineering culture. This culture supports and lives the “you build it, you run it” philosophy.

To stabilize their new IT architecture and strengthen their competitive advantage, enterprises hire site reliability engineers. SREs help product engineering teams optimize their workflows by applying different sets of engineering principles. Their main goal is to create highly reliable software systems by continuously analyzing the existing infrastructure and finding ways to optimize it with software solutions.

In this guide, you’ll learn more about the role and the benefits of site reliability engineering, the fundamental principles used in SRE, and the difference between a site reliability engineer and a platform engineer.

What is Site Reliability Engineering?

Site reliability engineering or SRE is a software engineering approach that helps manage large systems through code. It is a site reliability engineer’s task to create a resilient infrastructure and efficient engineering workflows by applying SRE best practices. This also involves the use of metrics and software tools for monitoring and improving operations.

Even though SRE seems to be a relatively new role in the world of cloud-native application engineering and management, it was born even before DevOps – the movement that successfully combined software development and IT operations.

In fact, it was Google who first tasked their software engineers to make large-scale sites more reliable, efficient, and scalable by applying automated solutions. The practices that Google’s engineers started developing in 2003 form the basis of the full-fledged IT domain that SRE is today.

In a way, site reliability engineering takes on the tasks that operations teams would handle in the past. However, operational problems are not solved manually but with an engineering mindset. With the availability of advanced software and tools, SREs can build a bridge between development and operations and create an IT infrastructure that is reliable and allows for quick implementation of new services and features. Thus, site reliability engineers are particularly important when a company moves from a traditional IT approach to a cloud-native approach.

Next, learn more about the specific tasks of a site reliability engineer and what kind of skillset this role typically requires.

What does a Site Reliability Engineer (SRE) do?

A site reliability engineer usually has a background in software development and substantial operations and business analytics experience. All are necessary to solve operational problems with the help of code. While DevOps is more concerned with automating IT operations, SRE teams focus more on the planning and design aspects.

They monitor systems in production and analyze their performance to detect areas of improvement. Their observations also help them calculate the potential cost of outages and plan for contingency.

SREs usually split their time between operations and the development of systems and software. Their on-call responsibilities include updating runbooks, tools, and documentation to prepare engineering teams for future incidents. When an incident does occur, they usually conduct thorough post-incident interviews to find out what’s working and what’s not.

This is also how they collect valuable “tribal knowledge”. Since they take part in software development, support, and IT development, this knowledge is no longer siloed but can be used to create more reliable systems.

Much of a site reliability engineer’s time is also spent building and deploying services that optimize the workflow for IT and support departments. This can also mean creating a tool from scratch that is able to level out the flaws in the existing software delivery or incident management.

Besides making sure that there are fewer incidents, SREs also determine which new features can be implemented, and when, with the help of service-level agreements (SLAs), service-level indicators (SLIs), and service-level objectives (SLOs).

In the next section, learn more about the SRE key metrics SLA, SLI, and SLO and how they are used in site reliability engineering.

The difference between SLOs, SLIs, and SLAs

As mentioned earlier, site reliability engineers use three metrics to monitor and measure the performance of IT systems and ultimately increase their reliability: service-level agreements (SLAs), service-level indicators (SLIs), and service-level objectives (SLOs). These related service-level metrics not only help companies create a more reliable system but also help them gain more trust with customers.

Learn what the different concepts mean in practice, how they depend on each other, and why they are so important for successful site reliability engineering.

What are SLIs?

SLI stands for service-level indicator. The Google SRE Handbook defines it as “a carefully defined quantitative measure of some aspect of the level of service that is provided.” This means that SLIs are used to measure the characteristics of a service to provide input for a service provider’s goal.

Product-centric SLIs gauge behaviors that would greatly impact the customer experience. In fact, there are four golden signals used as the most common SLIs: latency, traffic, error rate, and saturation.

When SRE teams set up SLIs to measure a service, they usually define them in two stages. 

  1. They determine the SLIs that directly impact the customers.
  2. They determine the SLIs that have a direct influence on the availability, latency, or performance of the service.

The formula used to calculate an SLI is SLI = Good Events * 100 / Valid Events. An SLI value of 100 is ideal, whereas a drop to 0 means that the system is broken.
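The SLI formula can be sketched in a few lines of Python. This is only an illustration: the function name `compute_sli` and the request counts are hypothetical, and what counts as a "good" or "valid" event depends on the service being measured.

```python
def compute_sli(good_events: int, valid_events: int) -> float:
    """Return the SLI as a percentage: good events * 100 / valid events."""
    if valid_events <= 0:
        raise ValueError("valid_events must be positive")
    if not 0 <= good_events <= valid_events:
        raise ValueError("good_events must be between 0 and valid_events")
    return good_events * 100 / valid_events


# Hypothetical example: 9,985 of 10,000 valid requests were served successfully.
availability_sli = compute_sli(good_events=9_985, valid_events=10_000)
print(f"Availability SLI: {availability_sli:.2f}")  # prints "Availability SLI: 99.85"
```

The same function works for any of the golden signals as long as each event can be classified as good or not, for example "request answered within the latency target" for a latency SLI.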

It is important to create SLIs that match the users’ journey. This means that one single SLI is not able to capture the entire customer experience as a typical user might care about more than one thing when using the service. At the same time, creating SLIs for every possible metric is not advisable as you would lose focus on what is really important.

As a rule of thumb, site reliability engineers try to find the most important pain points along the user’s journey that could then lead to a total redesign of a system.

Once the SLIs are set up, an SRE connects them to SLOs, which are threshold values set against each SLI to quantify the availability and quality of the service.

What are SLOs?

SLOs or service-level objectives are used as an objective measure for a service’s reliability or performance goals. The Google SRE Handbook says that they “specify a target level for the reliability of your service” and “because SLOs are key to making data-driven decisions about reliability, they’re at the core of SRE practices.”

While SLIs are product-centric, SLOs are customer-centric. SLOs are measured by SLIs, so the two depend closely on each other. Their relationship can be expressed as follows: lower-bound SLO ≤ SLI ≤ upper-bound SLO.
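The relationship between an SLI and the bounds of its SLO can be checked programmatically. The following is a minimal sketch, assuming SLIs are expressed as percentages; the `SLO` class, its field names, and the 99.5 target are hypothetical and not taken from any specific tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SLO:
    """A target range for one SLI, expressed in the same unit as the SLI."""
    name: str
    lower_bound: float
    upper_bound: float = 100.0  # assumption: percentage SLIs cannot exceed 100

    def is_met(self, sli: float) -> bool:
        # Implements: lower bound <= SLI <= upper bound
        return self.lower_bound <= sli <= self.upper_bound


# Hypothetical availability objective of at least 99.5 percent.
availability_slo = SLO(name="availability", lower_bound=99.5)

print(availability_slo.is_met(99.85))  # True
print(availability_slo.is_met(99.2))   # False
```

For many SLIs, such as availability, only the lower bound matters in practice, which is why the upper bound defaults to the maximum possible value here.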

However, choosing the right SLOs is a complex task. As a rule, targets should be based on historical rather than current system performance. SRE teams sometimes also make the mistake of aiming for perfection by choosing targets that are far too high. There is also no need to specify absolute values. However, it’s a good idea to build a safety margin into SLOs by orienting them toward a historical average. When adopting SLOs, it’s important to start slowly and then work your way up, as their adoption requires a cultural change.

SLOs should be seen as a unifying tool that creates a common language and shared goals across different teams. And you are much more likely to succeed if all key stakeholders are on board. However, many companies are focused on product innovation and don’t see the connection between business performance and reliability. Common roadblocks are siloed data and the wrong assumption that once SLOs are created, they don’t need to be regularly re-evaluated or adjusted.

What are SLAs?

SLAs or service-level agreements are contracts on service availability and performance. Just like SLOs, SLAs are a customer-centric metric. In the Google SRE Handbook, an SLA is defined as “an explicit or implicit contract with your users that includes consequences of meeting (or missing) the SLOs they contain.”

An SLA kicks in as soon as an SLO fails. Typically, you can expect penalties and financial consequences for not meeting the deliverables. If your company breaches a term agreed in the SLA, it usually needs to reimburse its customers.

SLAs help establish transparency and trust between the service and its users. In a way, they’re a lot like SLOs but for external, not internal use. SLA targets are looser than SLO targets, meaning that the reliability value promised in an SLA is set slightly lower than the historical average of the corresponding availability SLO. This can be regarded as a safety measure in case the average is too high due to very few incidents in the past.

In conclusion, site reliability engineers need to work with all three metrics to create a stable, reliable infrastructure. After collecting SLIs (the metrics in the monitoring system), they define thresholds of these metrics based on internal SLOs in order to prevent a breach in external SLAs.
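How the three metrics fit together can be sketched as a simple status check. This is an illustration under stated assumptions: all values are percentages, the SLA target is looser than the internal SLO target, and the function name, status labels, and numbers are hypothetical.

```python
def reliability_status(sli: float, slo_target: float, sla_target: float) -> str:
    """Classify a measured SLI against an internal SLO and an external SLA.

    Assumes higher values are better and the SLA target is not stricter
    than the SLO target.
    """
    if sla_target > slo_target:
        raise ValueError("the SLA target should not be stricter than the SLO target")
    if sli >= slo_target:
        return "ok"            # internal objective met
    if sli >= sla_target:
        return "slo_missed"    # internal warning, external contract still met
    return "sla_breached"      # contractual consequences may apply


# Hypothetical targets: internal SLO of 99.9, external SLA of 99.5.
for measured in (99.95, 99.7, 99.0):
    print(measured, reliability_status(measured, slo_target=99.9, sla_target=99.5))
```

Keeping the internal threshold stricter than the external one gives a team time to react to a missed SLO before an SLA is at risk.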

The difference between Site Reliability Engineering and Platform Engineering

Site reliability engineers and platform engineers have very similar roles and overlapping objectives. At smaller companies, it is even possible that the two operate interchangeably. However, as the number of developers increases, the lines between SRE and platform teams become a little less blurry.

A platform engineer focuses more on optimizing the workflow by deploying certain infrastructure components. That way, product engineers can build and ship applications faster. To avoid bottlenecks, platform engineers also need to re-calibrate existing workflows and make sure that the right people can access them.

A site reliability engineer, on the other hand, is more concerned with the overall health of a system, measuring its reliability and setting reliability goals (SLOs).

The close collaboration of both teams with development, operations, and support teams leads to better products, faster shipment, fewer incidents, and happier developers and customers.

Benefits of Site Reliability Engineering

There’s no doubt that adopting a DevOps culture helps engineering teams collaborate more productively and ship software much faster. However, it doesn’t necessarily increase site reliability and performance, which is why many companies are trying to fill SRE positions. But how exactly can your business benefit from site reliability engineering? Here are the six most compelling arguments for hiring an SRE team.

  • Improved metrics reporting: SREs provide more clarity by monitoring and measuring productivity, service health and the occurrence of bugs. They are able to translate metrics into tangible elements (like the average length of downtime) and their relation to lost revenue for the business. Once areas of improvement are exposed, it’s easier to address them with appropriate solutions.

  • Proactive troubleshooting: Many businesses focus on innovation and the deployment of new features to stay ahead of the curve. However, fast development and shipment also mean that there is a bigger chance of bugs and undetected vulnerabilities. Since SREs work proactively, they can find and fix issues before they reach the end-user and thus save the company trouble, time and money.

  • More time for creating value: Working with a more reliable system and not having to fix issues once they’ve reached the end-user frees up time for development teams. This means they can focus on creating new features. And the fact that SREs detect potential issues means that developers can resolve them in advance and improve their output.

  • Cultural improvement: Thanks to site reliability engineering, there is an ongoing awareness of the system health and its vulnerabilities. The process of continuously looking for the best solutions drives benefits across teams, departments and services and encourages collaboration. This shared sense of responsibility not only improves company culture but also the product itself.

  • Increased automation: A site reliability engineer will always look for the best way to modernize and automate workflows for product engineers. However, they’re also improving their own workflow for detecting system vulnerabilities by using modern tools and alert systems. This reduces the time it takes to find, highlight and repair bugs. So over time, the system is becoming increasingly reliable through automation.

  • Meeting customer expectations: While DevOps is more concerned with internal processes, SREs are focused on improving the customer and client experience. By using metrics like SLAs, SLOs and SLIs, a site reliability engineer sets clear targets for meeting customer expectations. This leads to more reliable products and significant improvements in terms of ROI.

 

Conclusion

There are numerous reasons why cloud-native businesses should consider hiring a site reliability engineer or a whole SRE team. They are a valuable addition to any existing DevOps culture as they bridge the gap between developers and IT infrastructure.

By continuously monitoring and analyzing application performance, they detect issues early on in the process and contribute to the overall product roadmap. Plus, development teams spend far less time on escalations and can dedicate more time to building new features and services.

Answers to frequently asked questions on Site Reliability Engineering

What is site reliability engineering (SRE)?

Site reliability engineering or SRE is a software engineering approach that helps manage large systems through code. It is a site reliability engineer’s task to create a resilient infrastructure and efficient engineering workflows by applying SRE best practices. This also involves the use of metrics and software tools for monitoring and improving operations.

What are the benefits of site reliability engineering?

The benefits are improved metrics reporting, proactive troubleshooting, more time for value creation, cultural improvement, increased automation, and meeting customer expectations.

What does a site reliability engineer do?

A site reliability engineer solves operational problems with the help of code, focusing on planning and design aspects. SREs monitor systems in production and analyze their performance to detect areas for improvement. Their observations also help calculate the potential cost of outages and plan for contingencies.

What metrics do site reliability engineers use?

Site reliability engineers use three metrics, SLIs, SLOs, and SLAs, to monitor and measure the performance of IT systems and ultimately increase their reliability.

What is the difference between site reliability engineering and platform engineering?

A site reliability engineer is more concerned with the overall health of a system, measuring its reliability and setting reliability goals (SLOs).

A platform engineer focuses more on optimizing the workflow by deploying certain infrastructure components. That way, product engineers can build and ship applications faster. To avoid bottlenecks, platform engineers also need to re-calibrate existing workflows and make sure that the right people can access them. Site reliability engineers and platform engineers have very similar roles and overlapping objectives.
