Making SRE metrics work for your team

SRE metrics are standards that help you track important workflows and measure your organization's performance, efficiency, and reliability. Keeping a microservices architecture up and running requires constantly measuring its health using SRE metrics. To learn more about using SRE metrics, read on.

July 20, 2022

As software development grows more complex, it is becoming harder for teams to rapidly identify and resolve gaps and issues. Microservices architectures, in which a network of services works in tandem to deliver specific experiences to users, are especially prone to delayed or missed detection of such problems.

Monitoring mechanisms need to keep up with these complex infrastructures. Site reliability engineering (SRE) teams are built for this very purpose. In this article, we dive into an essential component of the modern software development workflow: SRE metrics. We explain how you can make them work for your teams.

What are SRE metrics?

Site reliability engineering teams are responsible for implementing the systems that will complement the developers’ contributions and ensure that your microservices architecture is reliable and scalable. These teams strive for the automation of manual IT operations to save time and energy and build long-term sustainable systems. They bridge the divide between product development and operations teams to rapidly deliver quality software in alignment with the operations teams' reliability criteria and error margins.

To this end, they rely on metrics to streamline their processes and keep track of important changes and workflows. SRE metrics define the standards against which to measure your operational performance and make it easier to understand what is happening in your systems. Over time, software engineering and development teams have come up with various approaches to SRE and the metrics they use to monitor it.

Why do SRE metrics matter?

SRE metrics are a vital component of the software development lifecycle for various reasons. At its core, SRE is an exercise in monitoring and visibility into your services’ health. The primary driver for building an SRE team is a commitment to reliability, as opposed to constantly releasing new features with little regard for quality. For a product to be consistent in its availability and usability for end-users, the team that builds it must be in the loop about the significant contributing factors. Asking teams to define and target certain relevant metrics is the first step in building a culture that prioritizes reliability and stability in the software development and delivery processes.

Using relevant metrics, the SRE team can gauge how the various workflows and services are connected in the development process. Your teams decide which metrics are important to your business targets and processes and hold themselves accountable to those metrics. In doing so, they can prioritize certain aspects and refine the vision for the product they are working to build. Monitoring a set of metrics not only provides insight into the work that needs to be done but also into the areas where targets are being met. In the second case, the SRE team is able to help other teams accomplish what they set out to do. The metrics reflect such nuances in the operational process of your teams, which can further inform your long-term business strategy.

SRE metrics also help developers and operations teams align on expectations and delivery. Well-defined metrics work as catalysts for communication between team members. Since everyone has access to them, they can together make appropriate decisions to resolve an incident or further improve services. Effective collaboration across the company also contributes to building a strong culture of steady and scalable software delivery.

What are the different approaches to SRE metrics?

There are several schools of thought when it comes to defining and using metrics to improve reliability and deliver quality applications. There are no hard-and-fast rules for choosing the metrics that will be most useful to your teams in achieving their targets.

Let us look at some common approaches to SRE metrics; the logic behind each one illustrates what its metrics bring to the table. The following are a few tried-and-tested approaches that can help you set and maintain effective metrics:

Golden Signals

Most commonly associated with SRE are the four golden signals. These are generally used when monitoring your services and identifying any issues in the application's performance. Inspecting your services in this way helps you prioritize concerns and effectively resolve them to maintain the health of your software. The following are the four golden signals:

  • Latency: This metric measures how long your services take to respond to a request. Request latency covers response times for requests between services as well as those made by an end-user. Track it for both successful and failed requests, ideally separately, since fast-failing errors can skew the picture. Defining a baseline response time helps identify failures. 
  • Traffic: This is the amount of activity or demand for the services from end-users. This metric is defined differently, depending upon the nature of your software. Monitoring traffic offers a valuable insight into the user experience of your application.
  • Errors: This is the rate of failed requests. Monitoring errors across the board, as well as for individual services, is essential to resolving them based on how critical their impact is on customer experience.
  • Saturation: This is a more global metric that indicates how fully your system’s resources are utilized at any given point. Because saturation is difficult to measure directly, rely on solid utilization metrics, such as CPU and memory usage, and stay flexible in how you interpret them.
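As an illustration, three of the four signals can be derived from a plain request log. The Python sketch below uses hypothetical sample data; the log format, observation window, and values are all assumptions for illustration, not a prescribed schema:

```python
from statistics import quantiles

# Hypothetical request log: (duration in ms, succeeded?)
requests = [
    (120, True), (95, True), (310, False), (88, True),
    (450, False), (102, True), (99, True), (130, True),
]
window_seconds = 60  # assumed observation window

# Latency: measured here on successful requests only, since
# fast-failing errors would otherwise skew the percentile.
ok_latencies = [d for d, ok in requests if ok]
p95 = quantiles(ok_latencies, n=20, method="inclusive")[-1]  # 95th percentile

# Traffic: requests per second over the window.
traffic = len(requests) / window_seconds

# Errors: fraction of failed requests.
error_rate = sum(1 for _, ok in requests if not ok) / len(requests)

print(f"p95 latency: {p95:.1f} ms")
print(f"traffic: {traffic:.2f} req/s")
print(f"error rate: {error_rate:.1%}")
```

Saturation is omitted because, as noted above, it is usually read from resource-level gauges (CPU, memory) rather than computed from request logs.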

DORA Metrics

Although DORA metrics were developed in a DevOps context rather than strictly under the purview of site reliability engineering, they are another robust set of metrics for monitoring and improving your software’s performance.

  • Lead Time for Changes: LTC measures the time between when a commit was made and when it was deployed. In addition to providing insight into the time it takes for changes to be pushed to production, this metric also informs you about your team’s responsiveness in the face of end-users’ demands. By observing this period, you can identify the gaps and roadblocks in your team members’ workflows.
  • Deployment Frequency: This measures the frequency at which you push changes into production and how consistent you are in this process. After setting a target frequency, you can gauge how close or far you are from it. If you are deploying less frequently than desired, you can identify the obstacles that are making it hard for your team to deliver in line with expectations.
  • Mean Time to Recovery: One of the most important metrics, MTTR measures the average amount of time it takes for your team to recover from a failure or downtime. For this, you must know when an incident occurred and the time at which it was resolved. Failing to monitor this can result in long recovery times, which can significantly worsen the users’ experience.
  • Change Failure Rate: Another aspect to observe when deploying changes is the percentage of these changes that result in failures in your software. This is a valuable indicator as it points to your team’s efficacy in making quality changes.
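To make the definitions concrete, here is a minimal Python sketch that computes all four DORA metrics from hypothetical deployment and incident records. The record format and the sample timestamps are assumptions for illustration:

```python
from datetime import datetime, timedelta

# Hypothetical deployment records: commit time, deploy time, caused a failure?
deployments = [
    {"committed": datetime(2022, 7, 1, 9), "deployed": datetime(2022, 7, 1, 15), "failed": False},
    {"committed": datetime(2022, 7, 2, 10), "deployed": datetime(2022, 7, 3, 11), "failed": True},
    {"committed": datetime(2022, 7, 4, 8), "deployed": datetime(2022, 7, 4, 12), "failed": False},
]
# Hypothetical incidents: when each started and when service was restored.
incidents = [(datetime(2022, 7, 3, 11), datetime(2022, 7, 3, 13))]

# Lead time for changes: average commit-to-deploy delay.
lead_times = [d["deployed"] - d["committed"] for d in deployments]
ltc = sum(lead_times, timedelta()) / len(lead_times)

# Deployment frequency: deploys per day over the observed period.
days = (deployments[-1]["deployed"] - deployments[0]["deployed"]).days or 1
frequency = len(deployments) / days

# Mean time to recovery: average incident duration.
mttr = sum((end - start for start, end in incidents), timedelta()) / len(incidents)

# Change failure rate: share of deployments that caused a failure.
cfr = sum(d["failed"] for d in deployments) / len(deployments)
```

In practice these timestamps would come from your CI/CD pipeline and incident tracker rather than hard-coded records, but the arithmetic is the same.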

How can you effectively run SRE metrics?

Deciding that you want to monitor some metrics and setting up dashboards does not suffice if you want to effectively integrate SRE into your team’s workflows. We recommend starting to build a culture of monitoring and visibility from day one that is both constructive and sustainable in the long term.

Steer clear of choosing certain metrics because other companies are going down that route. This is a common mistake when you are first starting out. Instead, choose the SRE metrics that are relevant to your vision and business objectives. If, for instance, you are in the social media field, then you likely will not gain as much from monitoring error rates as you will from keeping track of latency. Having response times that do not go beyond a few milliseconds may be more of a priority than letting a few requests slide from time to time. So, an efficient approach to SRE is to have clarity on what matters for your business and to track against those metrics consistently.

In a similar vein, do not feel compelled to stick to the four golden signals or the DORA metrics. While these metrics are packaged together for a good reason, ultimately, each business is different in its operations and workflows. Combining metrics across the SRE approaches is hardly a sin, but a smart way to ensure that your monitoring strategy is designed to serve your business needs. In a given year, you might be more concerned with incident management than with delivery speed. In that case, if you are already performing well on continuous delivery and other metrics that are important to your business, you may be interested in tracking the mean time to recovery from the DORA metrics and latency from the golden signals. If combining metrics across approaches brings you closer to your business targets, that is the right step forward. You need not take a limiting approach to the metrics that you are choosing to track.

Finally, make sure that you are giving flexibility to developers as well. It is your SRE team’s job to choose and track relevant metrics. However, they should not be the only ones keeping an eye on the metrics. Individual teams, including developers, should also be monitoring their own metrics. Based on their knowledge and estimates of the services, they should decide on certain metrics that are important for observability and follow through by tracking them independently. They will have ideas about what might result in high change failure rates, for instance, and that may be important to them in their software development role. This is a better approach than asking everyone on your team to look at the same metrics at all times.

Accountability is of utmost importance to this strategy. For instance, SRE teams in large companies often use Service Level Objectives (SLOs), Service Level Agreements (SLAs), and Service Level Indicators (SLIs) to measure how their operations are functioning. An SLO is a numerical target for system availability, from which you can calculate the error budget: the total amount of failure your customers are willing to tolerate over a period. Outages, whether due to maintenance or other reasons, burn through the error budget quickly, as consumers grow frustrated when services lack uptime. When you let your developers select the SLOs and metrics they are interested in tracking, be sure to hold them accountable. Make it clear that you will play no part in defining the SLOs for their service, but that you expect to be kept in the loop about their monitoring plans and mechanisms and whether the metrics stay within acceptable boundaries. To enforce the set objectives, companies make a promise to their consumers that they will meet their service availability targets; that promise is called a Service Level Agreement (SLA).

Similarly, Service Level Indicators show whether the application’s performance is better or worse than the target value. To hold teams accountable, SLIs function as Key Performance Indicators (KPIs): the gap between an SLI and its SLO reveals the state of the services and informs strategic steps for performance optimization.
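The relationship between SLI, SLO, and error budget reduces to simple arithmetic. Here is a minimal Python sketch using hypothetical monthly traffic figures (the numbers are assumptions, not recommendations):

```python
# Hypothetical monthly figures.
slo = 0.999            # target: 99.9% of requests succeed
total_requests = 1_000_000
failed_requests = 700

# SLI: the measured availability over the period.
sli = (total_requests - failed_requests) / total_requests

# Error budget: failures the SLO permits over the same period.
budget = round((1 - slo) * total_requests)   # 1,000 failed requests allowed
budget_remaining = budget - failed_requests  # 300 failures left this month

# Comparing SLI to SLO tells you whether the service is within target.
within_slo = sli >= slo
```

When `budget_remaining` approaches zero, teams typically shift effort from shipping features to reliability work until the budget recovers.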

The SRE team can assume this role. In addition to tracking their own metrics, they can keep an eye on whether and how other teams are integrating SRE ideas and tools into their workflows. They can stay in touch with these teams about their practices and, when difficulties arise, help them track and alert on their chosen metrics correctly. The SRE metrics that your team values must not be restricted to leadership or the SRE team. Instead, developers and other teams must be included in this exercise so that everyone involved in building your product has visibility into the metrics that will help them improve it. In short, SRE should focus on helping other teams and enabling them to track the metrics that are relevant to them as well.

Finally, monitoring SRE metrics, including for individual teams, is not a one-time affair. Like testing, it has to be introduced as a continuous process. This is so that you can see the impact of the actions you take after alerting on certain metrics. Priorities can shift with time, but choosing and consistently tracking metrics will be most impactful when done over the long term.

How can you get started?

Setting and tracking SRE metrics is crucial to ensuring that your services are healthy and that your teams have concrete and meaningful goals to work towards. Cortex’s service catalog offers the perfect opportunity to start building an SRE mindset in your team. It offers visibility into every service that comprises your infrastructure. You can, for instance, choose to observe how a service is performing against a specific SLO. With these insights at their fingertips, your team can define and start tracking their own metrics in no time.