To continue our discussion on the topic of Site Reliability Engineering (SRE), I interviewed Bastian Spanneberg, Director of Site Reliability Engineering at Instana, a company that automates application performance monitoring for cloud architectures. Bastian is based in Munich, Germany, and has many years of experience as an SRE. He's also a frequent writer and presenter on the subject, as well as on various other topics related to cloud governance.
Our conversation touched on the main tasks of SRE, recommendations for companies on how to introduce the practice, and what comprises a typical working day.
Below are some of the highlights from our conversation. If you would like to discuss anything, please reach out to me at email@example.com.
LeanIX: You mentioned that Site Reliability Engineering (SRE) is about common sense. Could you elaborate briefly on this?
Bastian Spanneberg: There are a lot of great new ideas and approaches in the SRE community. In saying SRE is about common sense I don’t want to downplay this in any way. But throughout my entire career, I have said this about many topics, including Agile methodologies, continuous delivery, and DevOps. Let me clarify what I mean.
Take Agile and continuous delivery as examples. It is naturally a good idea to work in small increments and get fast feedback on what you do. Adopting this allows you to deploy smaller changes to production, and with smaller deployments, risks are minimized and problems can be avoided or found earlier. It's the same with DevOps. The people who design an application architecture are likely the ones who know most of its failure cases best, so letting them also operate the product sounds like a good idea. I would argue that if you look at these things without bias, you would consider these approaches common sense (at least nowadays).
The same can be said about SRE practices:
- Putting a heavy focus on automation to avoid manual, repetitive tasks
- Accepting failure as a given in a distributed system and designing your monitoring and alerting accordingly, using Service Level Objectives
- Taking problems and failures as an opportunity to learn and improve instead of blaming individuals
I believe SRE is a natural development for operating large-scale, distributed systems. It takes ideas from other movements and disciplines and applies them to the field of operations. As it came into existence at the likes of Google, Facebook, and Netflix, it also builds upon a solid foundation of practitioners' experience.
LeanIX: What does your standard day as an SRE look like? In particular, how do you go about collaborating with developers/DevOps?
BS: My SRE colleagues are distributed over three time zones to allow for more pleasant on-call schedules. This means we are on-call only during standard working hours (except when covering for colleagues on vacation). Mornings typically begin with a handover meeting with the team that is just ending its day, so we can discuss what happened in the past hours, which tasks need to be taken over, and anything else that’s pressing.
If you are on-call that day, your main focus is to respond to alerts, answer incoming tickets, and support engineering teams where needed (e.g., by watching for pings for the SRE team in Slack). The better we do our job, the less this will eat up our time.
The next priority is to improve our platform — and this can happen in a lot of different ways:
- Revisiting Service Level Objectives or alerting rules to improve the signal-to-noise ratio when on-call
- Automating recurring tasks
- Performing reviews of data stores and other infrastructure within the platform for capacity planning
- Participating in post-mortems to learn from incidents
- Working with product engineering teams to prepare the rollout of new features or components
It doesn’t tend to get boring, to say the least 😀.
LeanIX: How do you get a common understanding around SLAs, SLOs, error budgets, and other trade-offs with your stakeholders?
BS: We work closely with the product engineering teams to understand existing and new components, pick appropriate Service Level Indicators, and formulate reasonable Service Level Objectives from there. This is actually a great way to create a better understanding in both directions.
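To make these terms concrete, here is a minimal sketch of how a Service Level Indicator, a Service Level Objective, and an error budget relate to each other. It is not taken from the interview: the `SloReport` class, the `evaluate_slo` function, the request counts, and the 99.9% target are all hypothetical values chosen for illustration, and the SLI is assumed to be a simple ratio of good events to total events.

```python
from dataclasses import dataclass


@dataclass
class SloReport:
    sli: float              # measured ratio of good events to total events
    slo_met: bool           # True if the SLI is at or above the target
    budget_consumed: float  # fraction of the error budget used (1.0 = all of it)


def evaluate_slo(good_events: int, total_events: int, slo_target: float) -> SloReport:
    """Compare a ratio-based SLI against an SLO target given as a fraction (e.g. 0.999)."""
    if total_events <= 0:
        raise ValueError("total_events must be positive")
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be strictly between 0 and 1")

    sli = good_events / total_events
    # The error budget is the number of failures the SLO tolerates in this window.
    allowed_failures = total_events * (1 - slo_target)
    actual_failures = total_events - good_events
    return SloReport(
        sli=sli,
        slo_met=sli >= slo_target,
        budget_consumed=actual_failures / allowed_failures,
    )


# Hypothetical numbers: 1,000,000 requests, 500 of them failed, target 99.9%.
report = evaluate_slo(good_events=999_500, total_events=1_000_000, slo_target=0.999)
print(f"SLI: {report.sli:.4%}")
print(f"SLO met: {report.slo_met}")
print(f"Error budget consumed: {report.budget_consumed:.0%}")
```

In this sketch, a budget consumption above 1.0 would mean the objective was missed for the window, which is the kind of signal a team might use when weighing feature work against reliability work.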
Our company was founded by engineers and has a strong engineering focus, so luckily there isn't a lot of arguing needed to convince other stakeholders of this approach. Everyone understands that failure is a given in a complex and distributed system and that the way you approach operations needs to reflect that. But if you do need to have these discussions, I’ve always found it useful to have numbers ready (e.g., when explaining how much more another nine of reliability will cost). Numbers can help a lot in making trade-offs.
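As an illustration of the kind of numbers that help in such a discussion, the following sketch computes how much downtime each additional nine of availability leaves. The `allowed_downtime_minutes` helper and the 30-day window are assumptions made for this example, not figures from the interview.

```python
def allowed_downtime_minutes(availability_percent: float, window_days: int = 30) -> float:
    """Return the downtime, in minutes, that an availability target permits per window."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability_percent must be between 0 and 100")
    total_minutes = window_days * 24 * 60
    return total_minutes * (1 - availability_percent / 100)


# Each extra nine cuts the permitted downtime by a factor of ten.
for target in (99.0, 99.9, 99.99):
    minutes = allowed_downtime_minutes(target)
    print(f"{target}% availability allows about {minutes:.1f} minutes of downtime per 30 days")
```

Under these assumptions, moving from 99.9% to 99.99% shrinks the monthly allowance from roughly 43 minutes to roughly 4 minutes, which gives a sense of why each additional nine tends to be more expensive to achieve than the last.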
LeanIX: What would your advice be for a company that is just starting to establish SRE as a discipline?
BS: First of all, I’d say don’t lose faith in yourself. You usually already have certain responsibilities when getting started, meaning that adopting SRE will add even further responsibilities or cognitive load, at least for a transition period. So in the beginning it might feel like things move slowly or not at all, but you need to be persistent. Maybe you also need to work hard to move other topics to different teams so you can focus on your core topics.
These things can take several months and can be, at times, hard and frustrating. This process is similar to digital transformation projects. It’s less about a certain technology or toolchain than it is about culture and mindset. Changing people’s ideas takes time and perseverance.
SRE can also be a job with lots of context switches, especially if you are only a small team. Yes, this can be tiring, but you need to put in the effort to get rid of as many sources of context switching as possible.
The bottom line is that I don’t think it is possible to ever reach an end state. It’s a continuous journey, and you have to adapt to the changing landscape of your business and keep applying the SRE principles to your changing reality. This is also why this profession is so interesting. You always have to learn something new to stay on top of your game.