As companies manage an ever-growing IT infrastructure that supports cloud-native services, site reliability engineering (SRE) has become increasingly important. One reason is that the way engineering teams ship and operate software has drastically changed.
Ongoing update processes, quick releases, and fixes in disparate environments have spurred the adoption of DevOps principles and a shift from centralized department silos to a new engineering culture. This culture supports and lives the “you build it, you run it” philosophy.
To stabilize their new IT architecture and strengthen their competitive advantage, enterprises hire site reliability engineers. SREs help product engineering teams optimize their workflows by applying a range of engineering principles. Their main goal is to create highly reliable software systems by continuously analyzing the existing infrastructure and finding ways to improve it with software solutions.
In this guide, you’ll learn more about the role and the benefits of site reliability engineering, the fundamental principles used in SRE, and the difference between a site reliability engineer and a platform engineer.
Site reliability engineering (SRE) is a software engineering approach to managing large systems through code. A site reliability engineer’s task is to create a resilient infrastructure and efficient engineering workflows by applying SRE best practices. This also involves using metrics and software tools to monitor and improve operations.
Even though SRE seems to be a relatively new role in the world of cloud-native application engineering and management, it was born even before DevOps, the movement that successfully combined software development and IT operations.
In fact, it was Google that first tasked its software engineers with making large-scale sites more reliable, efficient, and scalable through automation. The practices Google’s engineers began developing in 2003 form the basis of the full-fledged IT domain that SRE is today.
In a way, site reliability engineering takes on the tasks that operations teams would handle in the past. However, operational problems are not solved manually but with an engineering mindset. With the availability of advanced software and tools, SREs can build a bridge between development and operations and create an IT infrastructure that is reliable and allows for quick implementation of new services and features. Thus, site reliability engineers are particularly important when a company moves from a traditional IT approach to a cloud-native approach.
Next, learn more about the specific tasks of a site reliability engineer and what kind of skillset this role typically requires.
A site reliability engineer usually has a background in software development and substantial operations and business analytics experience. All are necessary to solve operational problems with the help of code. While DevOps is more concerned with automating IT operations, SRE teams focus more on the planning and design aspects.
They monitor systems in production and analyze their performance to detect areas of improvement. Their observations also help them calculate the potential cost of outages and plan for contingency.
SREs usually split their time between operations and developing systems and software. Their on-call responsibilities include updating runbooks, tools, and documentation to prepare engineering teams for future incidents. When an incident occurs, they usually conduct thorough post-incident reviews to find out what’s working and what’s not.
This is also how they collect valuable “tribal knowledge”. Since they take part in software development, support, and IT development, this knowledge is no longer siloed but can be used to create more reliable systems.
Much of a site reliability engineer’s time is also spent building and deploying services that optimize the workflow for IT and support departments. This can also mean creating a tool from scratch that is able to level out the flaws in the existing software delivery or incident management.
In addition to reducing incidents, SREs also determine which new features can be implemented and when, with the help of service-level agreements (SLAs), service-level indicators (SLIs), and service-level objectives (SLOs).
In the next section, learn more about the SRE key metrics SLA, SLI, and SLO and how they are used in site reliability engineering.
As mentioned earlier, site reliability engineers use three metrics to monitor and measure the performance of IT systems and ultimately increase their reliability: service-level agreements (SLAs), service-level indicators (SLIs), and service-level objectives (SLOs). These related service-level metrics not only help companies build a more reliable system but also earn more trust from customers.
Learn what the different concepts mean in practice, how they depend on each other, and why they are so important for successful site reliability engineering.
SLI stands for service-level indicator. The Google SRE Handbook defines it as “a carefully defined quantitative measure of some aspect of the level of service that is provided.” In other words, SLIs measure specific characteristics of a service and provide the input for a service provider’s reliability goals.
Product-centric SLIs gauge behaviors that would greatly impact the customer experience. In fact, there are four golden signals used as the most common SLIs: latency, traffic, error rate, and saturation.
When SRE teams set up SLIs to measure a service, they usually define them in two stages: first the SLI specification (which metric matters to users), then the SLI implementation (how that metric is actually measured).
The formula used to calculate an SLI is SLI = (good events / valid events) × 100. An SLI value of 100 is ideal, whereas a drop to 0 means the system is broken.
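This calculation can be sketched in a few lines of Python; the event counts below are hypothetical, purely for illustration:

```python
def sli(good_events: int, valid_events: int) -> float:
    """Compute a service-level indicator as a percentage of good events."""
    if valid_events <= 0:
        raise ValueError("valid_events must be positive")
    return good_events * 100 / valid_events

# Example: 999,832 successful requests out of 1,000,000 valid requests
availability = sli(999_832, 1_000_000)
print(round(availability, 4))  # 99.9832
```

In practice these counts would come from a monitoring system rather than hard-coded numbers, but the arithmetic stays the same.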
It is important to create SLIs that match the users’ journey. This means that one single SLI is not able to capture the entire customer experience as a typical user might care about more than one thing when using the service. At the same time, creating SLIs for every possible metric is not advisable as you would lose focus on what is really important.
As a rule of thumb, site reliability engineers focus on the most important pain points along the user’s journey; the insights gained there can even lead to a complete redesign of a system.
Once the SLIs are set up, an SRE connects them to SLOs: target threshold values for each SLI that quantify the availability and quality of a service.
SLOs, or service-level objectives, are used as an objective measure of a service’s reliability or performance goals. The Google SRE Handbook says that SLOs “specify a target level for the reliability of your service” and that “because SLOs are key to making data-driven decisions about reliability, they’re at the core of SRE practices.”
While SLIs are product-centric, SLOs are customer-centric. SLOs are measured by SLIs, so the two are closely interdependent. The natural structure of their relationship is: lower-bound SLO ≤ SLI ≤ upper-bound SLO.
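The bounds relationship above can be sketched as a simple check in Python (the 99.9% target is a hypothetical example):

```python
def meets_slo(sli_value: float, lower: float, upper: float = 100.0) -> bool:
    """Check whether a measured SLI falls within its SLO bounds:
    lower-bound SLO <= SLI <= upper-bound SLO."""
    return lower <= sli_value <= upper

# Example: an availability SLO with a lower bound of 99.9%
print(meets_slo(99.95, lower=99.9))  # True  (within bounds)
print(meets_slo(99.50, lower=99.9))  # False (SLO missed)
```

For most availability SLOs, only the lower bound matters in practice, which is why the upper bound defaults to 100 here.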
However, choosing the right SLOs is a complex task. Targets should be based on historical rather than current system performance. SRE teams also sometimes make the mistake of aiming for perfection by choosing targets that are far too high; there is no need to specify absolute values. Instead, it’s a good idea to keep a safety margin by setting SLO targets slightly below the historical average. When adopting SLOs, start slowly and work your way up, as their adoption requires a cultural change.
SLOs should be seen as a unifying tool that creates a common language and shared goals across different teams. And you are much more likely to succeed if all key stakeholders are on board. However, many companies are focused on product innovation and don’t see the connection between business performance and reliability. Common roadblocks are siloed data and the wrong assumption that once SLOs are created, they don’t need to be regularly re-evaluated or adjusted.
SLAs or service-level agreements are contracts on service availability and performance. Just like SLOs, SLAs are a customer-centric metric. In the Google SRE Handbook, an SLA is defined as “an explicit or implicit contract with your users that includes consequences of meeting (or missing) the SLOs they contain.”
An SLA kicks in as soon as an SLO is missed. Typically, there are penalties and financial consequences for not meeting the agreed deliverables: if your company breaches a term of the SLA, it usually needs to reimburse its customers.
SLAs help establish transparency and trust between the service and its users. In a way, they’re a lot like SLOs, but for external rather than internal use. Compared to SLOs, SLAs are less conservative: the promised reliability is set slightly below the historical average used for the availability SLO. This serves as a safety measure in case that average is inflated by an unusually incident-free past.
In conclusion, site reliability engineers need to work with all three metrics to create a stable, reliable infrastructure. After collecting SLIs (the metrics in the monitoring system), they define thresholds for these metrics as internal SLOs in order to prevent breaches of external SLAs.
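Putting the three together, a minimal sketch in Python might look like this. The specific targets (an external SLA of 99.5% and a stricter internal SLO of 99.9%) are hypothetical, chosen only to illustrate the safety margin between SLO and SLA:

```python
# Hypothetical targets: the external SLA promises less than the internal SLO,
# so the SLO is always breached first and acts as an early warning.
SLA_TARGET = 99.5  # contractual availability promised to customers
SLO_TARGET = 99.9  # stricter internal reliability objective

def check_reliability(sli_value: float) -> str:
    """Classify a measured SLI against the internal SLO and external SLA."""
    if sli_value < SLA_TARGET:
        return "SLA breached: contractual consequences apply"
    if sli_value < SLO_TARGET:
        return "SLO missed: act now, before the SLA is breached"
    return "within SLO"

print(check_reliability(99.95))  # within SLO
print(check_reliability(99.70))  # SLO missed: act now, before the SLA is breached
print(check_reliability(99.20))  # SLA breached: contractual consequences apply
```

The gap between the two targets is what gives the SRE team room to detect and fix a reliability problem internally before customers are contractually affected.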
Site reliability engineers and platform engineers have very similar roles and overlapping objectives. At smaller companies, the two may even operate interchangeably. However, as the number of developers grows, the lines between SRE and platform teams become clearer.
In larger organizations, a platform engineer focuses more on optimizing the developer workflow by deploying and maintaining shared infrastructure components so that product engineers can build and ship applications faster. To avoid bottlenecks, platform engineers also re-calibrate existing workflows and make sure the right people can access them.
A site reliability engineer on the other hand is more concerned with the overall health of a system, measuring its reliability and setting reliability goals (SLOs).
The close collaboration of both teams with development, operations, and support teams leads to better products, faster shipment, fewer incidents, and happier developers and customers.
There’s no doubt that adopting a DevOps culture helps engineering teams collaborate more productively and ship software much faster. However, it doesn’t necessarily increase site reliability and performance, which is why many companies are trying to fill SRE positions. But how exactly can your business benefit from site reliability engineering? Here are the most compelling arguments for hiring an SRE team.
There are numerous reasons why cloud-native businesses should consider hiring a site reliability engineer or a whole SRE team. They are a valuable addition to any existing DevOps culture as they bridge the gap between developers and IT infrastructure.
Through continuous monitoring and analysis of application performance, they detect issues early in the process and contribute to the overall product roadmap. Plus, development teams spend far less time on escalations and can dedicate more time to building new features and services.