Site reliability engineering is an approach to engineering that merges aspects of IT operations, system management, application monitoring, and software development to ensure the reliability of an organization’s applications. Site reliability engineering, also known as service reliability engineering, has proven especially powerful for improving the reliability of scalable systems.
Before the dawn of site reliability engineering, development teams and operations teams were often at odds with each other. Benjamin Treynor introduced the concept of SRE in 2003 while leading an engineering team at Google. His team was tasked with keeping sites as reliable and serviceable as possible, so he directed team members to spend half their time on operations tasks and half their time in development. This way, he reasoned, they would have a better understanding of how their software worked in reality. Eventually, this team would evolve into what we understand as SRE, the folks who make sure a site or service is available.
In Treynor’s words, the incentive of a development team is to get features launched and to encourage users to adopt the product, while the incentive of an operations team is to make sure “the thing doesn’t blow up on their watch.”
SRE can bridge these teams and resolve the tension between them by weighing each side equally, but what does this look like in practice?
What is SRE, and how does it work?
Site reliability engineers are usually developers who have gained some experience in operations or IT professionals who have gained some software development skills. SREs define reliability standards for services, which teams can use to determine whether a feature is ready for implementation.
One of the core tenets of SRE is the error budget. Treynor conceptualized the error budget from a simple observation: “100% is the wrong reliability target for basically everything,” in part because a user cannot distinguish between 100% available and 99.99% available. The error budget, as defined by Treynor, is one minus the established availability target for a product. If the availability target is 99.99%, the error budget is the remaining 0.01%: unavailability that can be “spent” on anything.
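To make the arithmetic concrete, here is a minimal Python sketch of the "one minus the availability target" definition. The function names and the 30-day measurement window are assumptions chosen for illustration; they do not come from Treynor's description.

```python
def error_budget(availability_target: float) -> float:
    """Error budget as a fraction: one minus the availability target."""
    if not 0.0 < availability_target <= 1.0:
        raise ValueError("availability_target must be in (0, 1]")
    return 1.0 - availability_target


def allowed_downtime_minutes(availability_target: float, window_days: int = 30) -> float:
    """Minutes of complete unavailability the budget permits over the window.

    The 30-day default window is an assumption for this example.
    """
    window_minutes = window_days * 24 * 60
    return error_budget(availability_target) * window_minutes


if __name__ == "__main__":
    for target in (0.999, 0.9999):
        print(
            f"target={target:.2%} "
            f"budget={error_budget(target):.2%} "
            f"downtime={allowed_downtime_minutes(target):.2f} min per 30 days"
        )
```

Under these assumptions, a 99.99% target leaves roughly 4.32 minutes of downtime per 30 days, while a 99.9% target leaves roughly 43.2 minutes.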
Error budgets should be unique to each product: a budget depends on whether users pay for a product, what users are satisfied with, and what users’ alternatives are, according to Treynor. Ideally, the entire error budget would be spent taking risks with new features to get them launched quickly, a premise that captures Treynor’s whole model. SRE balances development and operations by favoring techniques such as phased rollouts and 1% experiments, which expose only a fraction of users to a change and so put less of the unavailability budget at risk.
In essence, planning for errors makes it easier for the development team to release new features, and the best way to plan for errors is to acknowledge that they are inevitable. As long as the development team stays within that margin of error (the error budget), they can release whatever features they want whenever they want.
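A rough way to see why a 1% experiment risks less of the budget is to weight an outage by the fraction of users exposed to it. The Python sketch below uses that simplified model. The weighting rule, the 30-day window, and all names are assumptions for illustration, not part of Treynor's definition.

```python
WINDOW_MINUTES = 30 * 24 * 60  # assumed 30-day measurement window


def budget_consumed(exposure: float, outage_minutes: float, availability_target: float) -> float:
    """Fraction of the error budget used by one bad release.

    exposure: fraction of users receiving the release (0.01 for a 1% experiment).
    Assumes unavailability scales linearly with the share of users affected.
    """
    if not 0.0 <= exposure <= 1.0:
        raise ValueError("exposure must be in [0, 1]")
    if not 0.0 < availability_target < 1.0:
        raise ValueError("availability_target must be in (0, 1)")
    unavailability = exposure * outage_minutes / WINDOW_MINUTES
    budget = 1.0 - availability_target
    return unavailability / budget


if __name__ == "__main__":
    # Hypothetical scenario: a broken release takes 30 minutes to roll back.
    for exposure in (1.0, 0.01):
        used = budget_consumed(exposure, outage_minutes=30, availability_target=0.9999)
        print(f"exposure={exposure:.0%} consumes {used:.1%} of the budget")
```

In this model, the same 30-minute failure costs one hundredth as much budget at 1% exposure as it does at full rollout, which is the difference between overspending the budget several times over and using a small slice of it.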
Wait, isn’t this the same as DevOps?
You may be wondering at this point: if SRE combines development and operations, how is it any different from DevOps? Although some responsibilities may overlap between these teams, they are distinct entities with unique roles.
Generally speaking, DevOps focuses on improving pre-production processes to make it easier to develop and release software, while SRE focuses on making the software easier to run in production.
DevOps is a set of practices that aims to enact better workflows within the development pipeline, enabling engineers to build more quickly and with less hassle. Its core principles include reducing organizational silos, accepting failure as normal, continuously implementing gradual changes, leveraging tooling and automation, and measuring everything.
SRE, on the other hand, is the practice of automating IT infrastructure tasks with software tools. If DevOps is the “what,” SRE is the “how.” SRE can be thought of as an implementation of the DevOps philosophy. SRE puts DevOps principles into practice by bringing distinct teams together, quantifying failure, encouraging smaller and more iterative deployments of new features, automating manual tasks, and using metrics to quantify service levels.
SLOs, SLAs, and SLIs
One of the most critical parts of an SRE’s job is defining service-level objectives, service-level agreements, and service-level indicators.
Service-level objectives, or SLOs, are closely tied to error budgets: the SLO is the reliability target, and the error budget is what remains. The SLO undergirds all discussions about the reliability of a system. Any changes that are considered should be framed in the context of meeting the SLO. Because service reliability is associated with increased costs, an SLO should be defined as the lowest level of reliability that is acceptable for each service. The SLO is critical not only for developers, but for managers and leaders, too: it helps stakeholders decide whether to invest more resources in improving a service’s reliability, or whether greater velocity of development can be allowed for a service.
Service-level agreements, or SLAs, are contracts between the service provider and its users. SLAs define the guaranteed availability of a product; in a sense, SLAs turn SLOs into a formal agreement. If the product is unreliable and fails to meet the agreed requirements, the provider pays some penalty; in some cases, this may be in the form of compensation to users.
Service-level indicators, or SLIs, are metrics that assess different aspects of a service level. For example, you may establish SLIs for availability, error rate, or system throughput.
You can think of an SLO as your goal, an SLI as your measurement of how well you meet that goal, and an SLA as a promise that you’ll achieve the goal you established.
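The goal-versus-measurement relationship can be sketched in a few lines of Python. This example assumes a request-based availability SLI (successful requests divided by total requests), which is one common choice rather than the only one. The class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SloReport:
    sli: float               # measured availability
    slo: float               # target availability
    meets_slo: bool          # True when the measurement reaches the goal
    budget_remaining: float  # share of the error budget left (negative if overspent)


def evaluate_availability(good_requests: int, total_requests: int, slo: float) -> SloReport:
    """Compare a request-based availability SLI against an SLO."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    if not 0 <= good_requests <= total_requests:
        raise ValueError("good_requests must be between 0 and total_requests")
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be in (0, 1)")
    sli = good_requests / total_requests
    allowed_failures = (1.0 - slo) * total_requests
    actual_failures = total_requests - good_requests
    budget_remaining = 1.0 - actual_failures / allowed_failures
    return SloReport(sli, slo, sli >= slo, budget_remaining)


if __name__ == "__main__":
    # Hypothetical traffic numbers for one measurement window.
    print(evaluate_availability(good_requests=999_500, total_requests=1_000_000, slo=0.999))
```

An SLA does not appear in the code because it is a contractual promise layered on top: it typically names an SLO-like threshold and the penalty that applies when the measured SLI falls below it.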
Benefits of SRE
SRE has become a crucial part of software engineering, and for good reason. SRE can close the gap between operations and development, enabling your organization to deliver better applications faster, and without sacrificing quality or reliability.
Observability into service health
Because an SRE team closely monitors an application, they have a holistic understanding of how everything in a system is connected. That in itself isn’t enough, though, which is why one of the core practices of SRE is establishing SLIs, metrics that evaluate performance. Clear metrics allow SRE teams to highlight areas of improvement, such as reducing security vulnerabilities. As a result, it becomes easier for SRE teams to drive initiatives across the organization and measure progress over time.
These metrics provide clear business value, too — with the data collected through SRE practices, teams can figure out how much revenue is lost per minute of downtime, for example.
Incentivizes code improvement
Because development and SRE teams share a talent pool, the development team becomes incentivized to write better code. If the code is poor, the SRE team needs more people to address the resulting issues and is likely to get a greater allocation of resources. As a result, there are fewer people and resources available for development.
On the other hand, when developers are generating improved code, they may gain more teammates — if new features are working well, the SRE team has less to do, and development can move ahead with innovating the product.
Frees up time to add value
If a product is buggy, developers don’t have time to devote to creating new features. Instead, their time goes to finding and fixing errors. With an SRE team, though, bugs are easier to find and address.
As a result of improved code, fewer bugs, and greater efficiency, time and resources can be freed to proactively add value to the product. Developers can devote time to creating better, more exciting features that are less likely to cause problems. Operations also has more time to test and perform upkeep, maintaining the application’s stability over time.
Everyone wins with SRE
Ultimately, SRE is beneficial for everyone involved. Users see the benefits directly when they engage with a more stable product that can support more rapid evolution. This leads to wider adoption and more business leads for your organization, allowing your company to grow and invest resources back into your product. Plus, SRE keeps development and operations in balance, maintaining the sweet spot between progress and reliability.
Cortex can help your SRE team gain insights into service quality and enable them to inspire progress through our Scorecards and reporting. Schedule a demo today to learn more about how Cortex can unlock service reliability at your organization.