SRE

Understanding Mean Time to Resolve (MTTR)

MTTR can provide insight into how effectively your team responds to incidents and how reliable your software is. Learn more about this valuable metric, how to interpret it, and how to leverage this information to minimize downtime.

By Cortex - February 23, 2022

In DevOps, every second counts. System failures mean your engineers lose valuable time that could be spent improving software and developing new features. Whether or not we like it, incidents and outages are inevitable. What really counts is how your software development team manages these issues when they do arise.

It’s difficult to know how your team can improve, or even how they’re performing at baseline, if you don’t have data on your incident management workflows. MTTR is a high-level metric that can indicate not only how effectively your DevOps team responds to challenges, but also how reliable your software is.

MTTR and how to interpret it

MTTR can stand for a lot of different things: Mean Time to Repair, Mean Time to Recovery, Mean Time to Restore, and Mean Time to Resolution are just a few interpretations of this versatile acronym. Each of these metrics examines a different aspect of addressing an incident. For example, Mean Time to Repair is the average time it takes engineers to fix a problem, calculated from the moment they begin working to the moment a change is pushed to production. It does not account for the time between when an alert comes in and when work begins.

Mean Time to Resolution is the most comprehensive of these metrics: it considers the average time from when an alert comes in until the incident is fully resolved, which includes postmortems and implementing process changes to avoid the issue down the road. While the standards for this metric vary by industry, high-performing DevOps teams generally see an MTTR of less than one day.
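To make the difference between these two interpretations concrete, here is a minimal Python sketch that computes both from the same incident records. The records and field names (`alerted`, `work_started`, `fix_deployed`, `resolved`) are hypothetical and do not come from any particular incident management tool.

```python
from datetime import datetime
from statistics import mean

# Hypothetical incident records; the field names are illustrative only.
incidents = [
    {
        "alerted": datetime(2022, 2, 1, 9, 0),
        "work_started": datetime(2022, 2, 1, 9, 20),
        "fix_deployed": datetime(2022, 2, 1, 10, 20),
        "resolved": datetime(2022, 2, 2, 9, 0),  # postmortem finished
    },
    {
        "alerted": datetime(2022, 2, 10, 14, 0),
        "work_started": datetime(2022, 2, 10, 14, 10),
        "fix_deployed": datetime(2022, 2, 10, 16, 10),
        "resolved": datetime(2022, 2, 11, 2, 0),
    },
]


def mean_hours(records, start_key, end_key):
    """Average elapsed hours between two timestamps across all records."""
    return mean(
        (r[end_key] - r[start_key]).total_seconds() / 3600 for r in records
    )


# Mean Time to Repair: from when work begins to when the fix reaches production.
mean_time_to_repair = mean_hours(incidents, "work_started", "fix_deployed")

# Mean Time to Resolution: from the alert until the incident is fully resolved.
mean_time_to_resolution = mean_hours(incidents, "alerted", "resolved")

print(f"Mean Time to Repair: {mean_time_to_repair:.1f} h")
print(f"Mean Time to Resolution: {mean_time_to_resolution:.1f} h")
```

With this sample data the two figures differ by more than an order of magnitude, which is why it helps to state which interpretation of MTTR a report is using.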

Because it is such a comprehensive metric, a high MTTR does not point to a single cause: it might indicate problems with alerting, or that your engineers are spending a lot of time on repairs. For this reason, it’s important not only to look at MTTR over time, but also to analyze each component of your incident management workflow: the time it takes to alert engineers, diagnose the issue, test fixes, ship to production, and review and learn from the incident.
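A per-stage breakdown along these lines might look like the following sketch. It assumes you already record how long each incident spent in each stage; the stage names and durations are made up for illustration.

```python
from statistics import mean

# Hypothetical workflow stages and per-incident durations in minutes.
STAGES = ["alert", "diagnose", "test_fix", "ship", "review"]
incidents = [
    {"alert": 5, "diagnose": 60, "test_fix": 30, "ship": 15, "review": 120},
    {"alert": 15, "diagnose": 100, "test_fix": 20, "ship": 25, "review": 60},
]

# Average time spent in each stage across all incidents.
mean_by_stage = {stage: mean(i[stage] for i in incidents) for stage in STAGES}

# The stage means add up to the overall mean time to resolution.
mttr_minutes = sum(mean_by_stage.values())

# The stage with the largest average is the first candidate for improvement.
slowest_stage = max(mean_by_stage, key=mean_by_stage.get)

for stage in STAGES:
    share = mean_by_stage[stage] / mttr_minutes
    print(f"{stage:>9}: {mean_by_stage[stage]:6.1f} min ({share:.0%})")
print(f"Slowest stage: {slowest_stage}")
```

Two teams with the same overall MTTR can have very different breakdowns, so the per-stage view is what tells you whether to work on alerting, diagnosis, or follow-up.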

It may also be useful to examine MTTR in conjunction with other metrics. To determine if your DevOps team is facing challenges in production, evaluate MTTR along with Change Failure Rate (CFR) to see how many releases result in a degraded service. Other DORA metrics, like Deployment Frequency and Lead Time for Changes, are also useful companions for MTTR.
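As a rough illustration, Change Failure Rate is the share of deployments that led to a degraded service. The sketch below assumes a hypothetical deployment log in which each entry is flagged with whether it caused an incident; how you decide that flag is up to your own process.

```python
# Hypothetical deployment log; "caused_incident" is an assumed field.
deployments = [
    {"id": "deploy-1", "caused_incident": False},
    {"id": "deploy-2", "caused_incident": True},
    {"id": "deploy-3", "caused_incident": False},
    {"id": "deploy-4", "caused_incident": False},
]


def change_failure_rate(deploys):
    """Fraction of deployments that resulted in a degraded service."""
    if not deploys:
        return 0.0  # no deployments, so nothing could have failed
    failures = sum(1 for d in deploys if d["caused_incident"])
    return failures / len(deploys)


cfr = change_failure_rate(deployments)
print(f"Change Failure Rate: {cfr:.0%}")
```

Read together with MTTR, this shows both how often releases cause trouble and how long that trouble lasts.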

To establish the reliability of your software, you can look at MTTR beside Mean Time Between Failures (MTBF), the average amount of time between incidents. If you’re updating your software often, compare MTTR with Mean Time to Failure (MTTF), the average time a system or component runs before it fails and has to be replaced or reworked rather than repaired. To gain insight into your alerting processes, examine Mean Time to Detect (MTTD), the average time it takes your team to recognize that an issue exists.
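The sketch below shows one way MTBF and MTTD could be derived from an incident history. It assumes MTBF is measured from the recovery of one incident to the start of the next, which is one common convention among several, and the timestamps and field names are hypothetical.

```python
from datetime import datetime
from statistics import mean

# Hypothetical incident history, sorted by start time.
incidents = [
    {
        "started": datetime(2022, 1, 1, 0, 0),
        "detected": datetime(2022, 1, 1, 0, 10),
        "recovered": datetime(2022, 1, 1, 2, 0),
    },
    {
        "started": datetime(2022, 1, 11, 2, 0),
        "detected": datetime(2022, 1, 11, 2, 30),
        "recovered": datetime(2022, 1, 11, 3, 0),
    },
    {
        "started": datetime(2022, 1, 31, 3, 0),
        "detected": datetime(2022, 1, 31, 3, 20),
        "recovered": datetime(2022, 1, 31, 5, 0),
    },
]

# MTBF: average uptime between the end of one incident and the start of the next.
uptime_hours = [
    (nxt["started"] - prev["recovered"]).total_seconds() / 3600
    for prev, nxt in zip(incidents, incidents[1:])
]
mtbf_hours = mean(uptime_hours)

# MTTD: average delay between an issue starting and the team noticing it.
mttd_minutes = mean(
    (i["detected"] - i["started"]).total_seconds() / 60 for i in incidents
)

print(f"MTBF: {mtbf_hours:.0f} h, MTTD: {mttd_minutes:.0f} min")
```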

Why MTTR matters

Whether you’re a legacy organization striving to modernize, a start-up looking to gain an edge, or somewhere in between, MTTR is a vital metric for your organization. It can indicate whether your processes are standardized and can highlight crucial areas for optimization. Most importantly, MTTR provides insight into how stable your software is and how responsive your team is over time.

This metric is as valuable to your CFO as it is to your DevOps team. Downtime can hurt your revenue stream if end users can’t access your service or have a poor experience. It also means that engineers are spending precious time fixing problems rather than improving the user experience and attracting more users. In an era of continuous delivery, it’s especially important that your team is able to dedicate time to customer satisfaction.

How to improve MTTR

Alerting is the first stage of responding to an incident, and should be one of the first areas to target when working to reduce MTTR. Make sure alerts are actionable and that DevOps team members have the tools they need to immediately respond. A clear escalation process is essential: define responsibilities for each member and train the team in one another’s roles so that the process never grinds to a halt if someone is unavailable.

Preemptive monitoring can help you get ahead of problems before they arise—by proactively checking for potential incidents, you can avoid unexpected downtime. This is an especially good idea if you frequently run into issues with a particular program.

The best way to improve MTTR is to standardize your operating procedures with runbooks. Without runbooks, DevOps teams have to respond without a clear direction and spend time messaging one another for information—they can’t act immediately. With runbooks, however, your organization’s knowledge base is centralized and accessible to all team members, enabling them to respond as soon as an issue comes up.

If you’re already using runbooks, consider automating responses. Automation not only improves your MTTR, but will give your DevOps team more time to devote to implementing long-term changes that improve the stability of your service.

Be careful not to incentivize the wrong behavior in a quest to minimize this metric. You could, for example, easily reduce MTTR by eliminating all alerting protocols, but that would also mean poor service for your end users, and likely poor team morale. Examine this metric over a period of time and aim for slow and steady improvement, rather than a quick fix.
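One way to watch the trend rather than a single number is to group resolution times by month, as in this sketch. The data is invented, and reporting the incident count next to each monthly average is a suggestion of ours: a falling MTTR paired with a sudden drop in recorded incidents may mean problems are going unreported rather than being fixed faster.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical (month, hours to resolve) pairs, one per incident.
resolved = [
    ("2022-01", 30.0),
    ("2022-01", 26.0),
    ("2022-02", 24.0),
    ("2022-02", 20.0),
    ("2022-03", 18.0),
]

# Collect resolution times per month.
by_month = defaultdict(list)
for month, hours in resolved:
    by_month[month].append(hours)

# Monthly MTTR and incident counts, in chronological order.
monthly_mttr = {m: mean(h) for m, h in sorted(by_month.items())}
incident_counts = {m: len(h) for m, h in sorted(by_month.items())}

# True when MTTR never rises from one month to the next.
months = list(monthly_mttr)
improving = all(
    monthly_mttr[a] >= monthly_mttr[b] for a, b in zip(months, months[1:])
)

for m in months:
    print(f"{m}: MTTR {monthly_mttr[m]:.1f} h over {incident_counts[m]} incident(s)")
print(f"Steadily improving: {improving}")
```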

Improve MTTR with Cortex

In order to make any of these improvements, you need to foster a team culture that embraces collaboration. Team members should be committed to growth, and it becomes easier to cultivate that team spirit when everyone can see how they’re performing as a unit. Scorecards provide this view, creating a culture of accountability.

Scorecards also come with a host of features that make it easy to see exactly where improvements can be made. You can track the number of your alerts that have an associated runbook, for example, or how many automated responses you have set up. No matter what your ultimate goal is, Cortex can help you and your team members visualize and achieve it.
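To give a sense of what a runbook coverage check measures, here is a small standalone sketch. It is not Cortex’s API or scorecard syntax; the alert definitions, the `runbook_url` field, and the example.com links are all hypothetical.

```python
# Hypothetical alert definitions; "runbook_url" is an assumed field name.
alerts = [
    {"name": "api-latency-high", "runbook_url": "https://example.com/runbooks/api-latency"},
    {"name": "db-disk-full", "runbook_url": "https://example.com/runbooks/db-disk"},
    {"name": "queue-backlog", "runbook_url": None},
    {"name": "cert-expiring", "runbook_url": "https://example.com/runbooks/certs"},
]

# Alerts with no runbook attached are the gaps to close first.
missing_runbook = [a["name"] for a in alerts if not a["runbook_url"]]

# Fraction of alerts that have an associated runbook.
runbook_coverage = 1 - len(missing_runbook) / len(alerts) if alerts else 0.0

print(f"Runbook coverage: {runbook_coverage:.0%}")
print(f"Alerts without a runbook: {missing_runbook}")
```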
