What is SRE (Site Reliability Engineering)?

Historically, development teams and operations teams have been at odds.

Development teams wanted to add slick new features to products. Operations teams wanted to make sure these features didn’t cause dysfunction.

This began to change in 2003 when software engineer Benjamin Treynor invented site reliability engineering (SRE) while working at Google. He made his software engineering team responsible for some ops tasks, creating the concept of SRE and helping to resolve issues between development and operations.

How does SRE work?

SRE is performed by site reliability engineers, also known as service reliability engineers. These professionals are typically software developers who have gained some operations experience. They can also be IT professionals who have development skills.

SRE teams set service-level agreements (SLAs) for each service in a system. The SLAs define the system’s required reliability, helping teams figure out which features they can implement. 

Within each SLA are service-level indicators (SLIs) and service-level objectives (SLOs).

SLIs are metrics that measure a specific aspect of a service level. Examples include availability, error rate, and system throughput.

An SLO is simply the target you want to hit for an SLI. For example, you might aim for 99.9% availability over the course of a year.
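To make the SLI/SLO relationship concrete, here is a minimal Python sketch. It assumes availability is measured as the share of successful requests in a window; the `RequestWindow` type, the function names, and the request counts are all hypothetical and only serve as illustration.

```python
from dataclasses import dataclass


@dataclass
class RequestWindow:
    """Request counts for one measurement window (hypothetical shape)."""
    total: int
    failed: int


def availability_sli(window: RequestWindow) -> float:
    """Fraction of requests that succeeded; an empty window counts as fully available."""
    if window.total == 0:
        return 1.0
    return (window.total - window.failed) / window.total


def meets_slo(sli: float, slo_target: float) -> bool:
    """The SLO is met when the measured SLI is at or above the target."""
    return sli >= slo_target


# Made-up numbers: 700 failures out of 1,000,000 requests.
window = RequestWindow(total=1_000_000, failed=700)
sli = availability_sli(window)
print(f"availability SLI: {sli:.4%}")
print("meets 99.9% SLO:", meets_slo(sli, 0.999))
```

In practice the counts would come from your monitoring system, and the window length would match the period the SLO is defined over.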

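The gap between the SLO and 100% is the permitted downtime. This is known as the “error budget,” the maximum amount of error allowed in the system. A 99.9% availability target over a year, for example, leaves a budget of about 8.76 hours of downtime.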
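The arithmetic behind an error budget can be sketched in a few lines of Python. This assumes the SLO is a time-based availability target expressed as a fraction, and the function name `error_budget` is a hypothetical one chosen for illustration.

```python
from datetime import timedelta


def error_budget(slo_target: float, period: timedelta) -> timedelta:
    """Allowed downtime: the share of the period not covered by the SLO."""
    if not 0.0 <= slo_target <= 1.0:
        raise ValueError("slo_target must be between 0 and 1")
    return period * (1.0 - slo_target)


# A 99.9% availability target measured over one (non-leap) year.
year = timedelta(days=365)
budget = error_budget(0.999, year)
print(f"error budget: {budget.total_seconds() / 3600:.2f} hours per year")
```

Tightening the target shrinks the budget quickly: each extra "nine" cuts the allowed downtime by a factor of ten.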

By acknowledging that errors are inevitable, you can plan for them, making it easier for the development team to release new features. The development team can release whatever features they want, whenever they want, as long as they stay within the error budget. As soon as they step outside of it, they must rein in errors before moving forward with new features.
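The release policy described above can be sketched as a simple gate. This is an illustration under stated assumptions, not a standard implementation: the yearly budget corresponds to a 99.9% availability target over 365 days, and the function names and downtime figures are hypothetical.

```python
from datetime import timedelta

# Hypothetical yearly budget: 99.9% availability over 365 days leaves
# 0.1% of the year, which is 8 hours, 45 minutes, and 36 seconds.
YEARLY_BUDGET = timedelta(hours=8, minutes=45, seconds=36)


def remaining_budget(budget: timedelta, downtime_so_far: timedelta) -> timedelta:
    """Downtime still available before the budget is exhausted (may be negative)."""
    return budget - downtime_so_far


def can_release(budget: timedelta, downtime_so_far: timedelta) -> bool:
    """Allow feature releases only while some error budget remains."""
    return remaining_budget(budget, downtime_so_far) > timedelta(0)


# Two made-up scenarios: one inside the budget, one past it.
for downtime in (timedelta(hours=3), timedelta(hours=9)):
    if can_release(YEARLY_BUDGET, downtime):
        decision = "release features"
    else:
        decision = "freeze and fix reliability"
    print(f"downtime so far {downtime}: {decision}")
```

Real teams often add softer thresholds, such as slowing releases once most of the budget is spent, but the core check is this comparison.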

A vital part of an SRE’s work is automation. SREs often have to automate away repetitive manual tasks, called “toil,” so they can focus on long-term, value-adding work. Tools such as Kubernetes can help here, for example by automatically restarting failed containers.

DevOps vs. SRE

DevOps is a philosophy and set of practices that combines software development and IT operations. It consists of five pillars. Google defines these pillars as:

  • Reducing organizational silos
  • Accepting failure as normal
  • Implementing gradual changes
  • Leveraging tooling and automation
  • Measuring everything

If DevOps is the “what,” SRE is the “how.” SRE is one implementation of the DevOps philosophy. In fact, it addresses all five pillars of DevOps:

  • Reducing organizational silos: SREs share ownership with the developers, and they use the same tools and techniques.
  • Accepting failure as normal: SREs quantify failure using SLIs and SLOs. They assume that errors will happen, but set a maximum allowable amount to balance failure against new releases.
  • Implementing gradual changes: SREs encourage smaller, more iterative deployments of new features in order to reduce the cost of failure.
  • Leveraging tooling and automation: SREs automate away manual tasks, as mentioned above.
  • Measuring everything: SREs use metrics (SLIs) to quantify service levels. With those measurements, they can see when errors are rising and act to keep them down.

Benefits of SRE

Here are a few of the ways SRE can benefit your organization.

Provides clear metrics

Clear metrics allow SRE teams to highlight areas of improvement, such as reducing security vulnerabilities.

SRE teams can also use metrics to calculate impact in other areas, such as revenue. For example, they could look at how much revenue they lose per minute of downtime.
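As a rough sketch of that calculation, the Python below estimates revenue lost to an outage. It assumes revenue is spread evenly across the year, which is a simplification since most services have peak hours, and the revenue and outage figures are made up for illustration.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a non-leap year


def revenue_per_minute(annual_revenue: float) -> float:
    """Average revenue per minute, assuming revenue is spread evenly over the year."""
    return annual_revenue / MINUTES_PER_YEAR


def downtime_cost(annual_revenue: float, downtime_minutes: float) -> float:
    """Estimated revenue lost during an outage of the given length."""
    return revenue_per_minute(annual_revenue) * downtime_minutes


# Made-up figures: 5,256,000 in annual revenue and a 30-minute outage.
annual_revenue = 5_256_000
print(f"revenue per minute: {revenue_per_minute(annual_revenue):.2f}")
print(f"cost of a 30-minute outage: {downtime_cost(annual_revenue, 30):.2f}")
```

A more careful estimate would weight each minute by the traffic or sales normally seen at that time of day.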

Improves code

The development and SRE teams share the same talent pool. If the development team writes poor code, more talent is allocated to the SREs to fix these issues—leaving fewer people available for the development team.

As a result, the development team is incentivized to write better code. When their code works well, they could gain more teammates, bringing them the resources they need to create better features.

Frees up time and resources to add value

Better code, fewer bugs, and more efficiency create more time to add value to the product.

The developers can create better and more exciting features that are less likely to cause problems. Operations, meanwhile, can spend more time testing and performing upkeep.

Put these together, and you have a better product for the customer.

The bottom line

SRE is quickly taking hold as an essential part of many organizations. It can help close the gap between operations and development. Consequently, you can deliver better applications faster—without sacrificing the reliability of those applications. 

Cortex can help your SRE team track service quality better so they can find more areas where service can be improved. Schedule your demo today to learn more about what we can do.