SRE
incident management

What's new in on-call best practice?

Which on-call practices are hot in 2024? Some you’ll expect—others bring us back to basics.

By
Cortex
-
April 16, 2024

Already a quarter of the way into 2024, we’re seeing a lot of shake-up in on-call best practices. We’re excited to see AI in the mix, but we’re also seeing a renewed focus on existing and neglected best practices. Some current topics in on-call best practices include:

  • Increased interactions with customers
  • Incorporating AI
  • A renewed focus on incident management metrics
  • An emphasis on generative culture
  • How IDPs are changing the playbook

In this article, we’ll review some best practices and explore the 2024 trends.

Best practices

Building a solid on-call process and meeting your service-level objectives (SLOs) require deliberate planning and a number of good practices. We won’t go into too much detail in this section, but below are some well-established on-call practices that continue to be popular:

Playbooks

Playbooks are documented procedures that provide step-by-step instructions for on-call engineers to respond to various incident scenarios, such as high CPU usage on a database, latency degradation of an API, or a website outage. Creating and following playbooks result in consistent and quicker responses, since even engineers unfamiliar with a given service can follow instructions to make it operational again, instead of guessing and wasting effort.

Clear service ownership

Each service should have a distinct owner, an explicit purpose, and clear boundaries. It should be defined, documented, and tracked in a centralized location like an internal developer portal (IDP), ensuring that there are no orphaned services. The team responsible for a service is also accountable for on-call rotations, support escalations, and meeting the defined SLOs, and as a result, other teams can more easily be on-call for a service they don’t own.
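As a rough illustration of the "no orphaned services" idea, here is a minimal sketch that scans a service catalog for entries without an owner. The catalog structure, the service names, and the find_orphaned_services helper are all hypothetical; a real IDP would expose this data through its own API or UI.

```python
def find_orphaned_services(catalog):
    """Return the names of services whose owner is missing or blank."""
    return sorted(
        name
        for name, entry in catalog.items()
        if not (entry.get("owner") or "").strip()
    )


# Hypothetical catalog: service name -> metadata tracked in a central place.
catalog = {
    "checkout-api": {"owner": "payments-team", "purpose": "Handles checkout"},
    "legacy-mailer": {"owner": "", "purpose": "Sends transactional email"},
    "search-indexer": {"purpose": "Builds the search index"},  # no owner key
}

orphans = find_orphaned_services(catalog)
print(orphans)
```

A check like this could run on a schedule so that a service losing its owner is flagged before it pages someone who has never seen it.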

Retrospectives

Continuous learning and feedback are essential parts of modern DevOps culture. A retrospective, also known as a post-incident review or post-mortem, is a meeting or process that examines what happened during an incident so that every failure becomes a learning opportunity. Retrospectives should include all stakeholders, be blameless, and focus on solutions.

Automated scheduling

Auto-generated schedules cut down on the time and effort needed to create and manage schedules. They keep scheduling fair, reliable, and transparent, with no gaps or confusion about who is on-call for which services and when. Learn more about automated scheduling in our article on best practices for on-call scheduling and management.
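To make the idea concrete, here is a minimal sketch of a round-robin schedule generator. It assumes weekly shifts and a single engineer per shift; the build_rotation function and the engineer names are illustrative, not taken from any particular scheduling tool.

```python
from datetime import date, timedelta
from itertools import cycle


def build_rotation(engineers, start, weeks):
    """Assign one engineer per week, round-robin, with back-to-back shifts."""
    if not engineers:
        raise ValueError("need at least one engineer")
    rotation = cycle(engineers)
    schedule = []
    for week in range(weeks):
        shift_start = start + timedelta(weeks=week)
        shift_end = shift_start + timedelta(weeks=1)  # exclusive end date
        schedule.append((shift_start, shift_end, next(rotation)))
    return schedule


schedule = build_rotation(["ana", "bo", "chen"], date(2024, 4, 15), weeks=6)
for shift_start, shift_end, engineer in schedule:
    print(shift_start, "to", shift_end, engineer)
```

Because each shift ends exactly where the next one starts, there are no coverage gaps, and cycling through the list gives every engineer the same number of shifts when the number of weeks is a multiple of the team size. Real tools also handle time zones, overrides, and holidays, which this sketch leaves out.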

What’s new in 2024?

This year brings some new features (did somebody say AI?) and a renewed focus on existing practices.

Customer reachability

It has become increasingly common, especially for B2B services, for customers to have a shared, always-on communication channel like Slack for escalations. As a result, customers expect a higher level of engagement from support resources. If handled well, this direct line of communication can result in a better experience for customers; the white-glove treatment makes them feel more valued and in control. We expect this trend to continue.

A tight feedback loop is important, especially if a service is still relatively new and you’re figuring out all of its potential failure cases. Ideally we don’t want our customers encountering outages and degradations without us knowing, and it’s certainly better in these cases that they can reach us so we can resolve incidents. Their feedback can also help us improve our processes and define better alerts going forward.

This arrangement can cause a lot of overhead, especially if your engineers are the ones actively managing the channels. Set realistic SLAs for your clients, and distinguish between inconveniences and full outages when you do. Consider creating a separate channel that’s only for emergencies and failures, so that the on-call engineer knows to be reachable there.

New tools to increase emphasis on metrics

In its 2023 State of DevOps Report, the DevOps Research and Assessment (DORA) team outlines the most important metrics for evaluating DevOps teams. One of them is mean time to recovery (MTTR), which DORA now calls “failed deployment recovery time.” MTTR is a key incident recovery metric and can capture a lot about the performance of an engineering team. We’re seeing more DevOps teams track and seek to improve these metrics.

We’re also seeing an increase in the use of internal developer portals that enable teams to track and set goals around metrics like change failure rate and MTTR. More focus on performance metrics means that teams are proactively addressing potential issues and streamlining their procedures to shorten incident response times, rather than just thinking about eventual recovery. The specific improvements needed will vary from team to team, whether it’s investing in more training for engineers, finding tools that best suit the team, or improving documentation. This drive for continuous improvement not only enhances the reliability and stability of services, but also encourages a growth mindset.
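For teams that aren’t tracking these numbers yet, the arithmetic is simple. The sketch below computes MTTR as the average time between an incident starting and being resolved, and a change failure rate as the share of deployments that fail. The function names and the sample data are made up for illustration, and your team may define incident boundaries differently.

```python
from datetime import datetime, timedelta


def mean_time_to_recovery(incidents):
    """Average (resolved - started) over a list of (started, resolved) pairs."""
    if not incidents:
        raise ValueError("need at least one incident")
    total = sum(
        (resolved - started for started, resolved in incidents), timedelta()
    )
    return total / len(incidents)


def change_failure_rate(total_deployments, failed_deployments):
    """Fraction of deployments that caused a failure in production."""
    if total_deployments <= 0:
        raise ValueError("need at least one deployment")
    return failed_deployments / total_deployments


# Illustrative data: three incidents lasting 30, 90, and 60 minutes.
incidents = [
    (datetime(2024, 4, 1, 10, 0), datetime(2024, 4, 1, 10, 30)),
    (datetime(2024, 4, 3, 2, 0), datetime(2024, 4, 3, 3, 30)),
    (datetime(2024, 4, 9, 14, 0), datetime(2024, 4, 9, 15, 0)),
]

mttr = mean_time_to_recovery(incidents)
cfr = change_failure_rate(total_deployments=40, failed_deployments=4)
print("MTTR:", mttr)
print("Change failure rate:", f"{cfr:.0%}")
```

Tracking these values per service over time, rather than as a single company-wide number, makes it easier to see which services need attention.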

[Even] more AI

AI, specifically generative AI (gen AI), is a particularly hot topic these days, not just in tech. It’s no surprise, then, that it is changing DevOps tools and practices as well. Gen AI is enabling new capabilities in incident management tools.

Generative models make it easier for on-call engineers to access the domain knowledge they need to solve a problem, and even to 'brainstorm' solutions together, leading to shorter recovery times. Gen AI can also help engineers draft the documentation that makes a service easier to manage, including playbooks. For example, Rootly, a growing incident management platform, has some of these AI-powered capabilities: the AI can write incident summaries, create retrospectives, and provide troubleshooting assistance, which you can access through chat. Other platforms aren’t far behind. PagerDuty is in the early stages of creating a generative AI copilot to integrate into its stack.

Gen AI is also unlocking improvements in predictive analysis, which can help teams anticipate issues in advance. Alerts that trigger on sustained averages for metrics can’t always be precise. If limits are too strict, you’ll miss incidents, but if limits are too loose, you’ll get noisy alerts that start to be ignored. AI models can identify common issues in a service and their strongest correlations, which can help you set better thresholds and criteria for alerts.
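The trade-off is easy to see with a small example. The sketch below fires an alert whenever the average of the last few samples exceeds a limit, then runs the same made-up CPU readings against three different limits. The function name, window size, and values are illustrative assumptions, not settings from any monitoring product.

```python
from collections import deque


def sustained_average_alerts(samples, window, threshold):
    """Return sample indexes where the mean of the last `window` values exceeds `threshold`."""
    recent = deque(maxlen=window)  # keeps only the most recent `window` samples
    fired = []
    for index, value in enumerate(samples):
        recent.append(value)
        if len(recent) == window and sum(recent) / window > threshold:
            fired.append(index)
    return fired


# Made-up CPU percentages: one brief spike (index 2), then a sustained problem (5-8).
cpu = [40, 42, 95, 41, 43, 88, 90, 92, 91, 45]

noisy = sustained_average_alerts(cpu, window=3, threshold=55)
balanced = sustained_average_alerts(cpu, window=3, threshold=85)
missed = sustained_average_alerts(cpu, window=3, threshold=95)

print("limit 55:", noisy)
print("limit 85:", balanced)
print("limit 95:", missed)
```

With the lowest limit, the single spike keeps the alert firing for most of the series; with the highest, the sustained problem never triggers at all. Only the middle limit fires solely during the sustained problem. Picking that middle value by hand is guesswork, which is where analysis of historical incidents can help.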

When customers run into issues, chatbots can help you manage your responses more efficiently. These chatbots, which have become more sophisticated with the addition of gen AI, can identify important customer issues and raise them to the team. We’re seeing lots of interesting development in intelligent triage based on recognition models. The models are getting better at recognizing customer intent, for example when a customer calls or writes in to an automated service. They can also provide immediate self-service options to customers, potentially removing the need for the on-call engineer to get involved at all.

The last two years alone saw over $40 billion in funding in AI worldwide, so we can expect more AI DevOps and on-call tools to hit the market as a result this year and next. As AI as a whole is still relatively new, its effects on on-call practices and DevOps are still developing. If you’re considering adding more AI-powered tools to your process, carefully evaluate what tangible benefits they will bring to avoid simply jumping on the AI bandwagon.

Emphasis on generative culture

As engineers work toward improving on-call metrics, they may experience increased stress or workload. Naturally these practices directly affect the overall performance evaluation of teams with an on-call expectation. Of course, we should ensure that metrics are used to guide efforts and improvement, rather than to scrutinize and pressure individuals. In fact, it’s more important than ever to have a supportive culture for your on-call teams.

In its 2023 State of DevOps Report, DORA’s top recommendation is to establish a healthy culture. Teams with a supportive culture, what DORA calls a generative culture, show 30% higher performance than those without it. On-call work can be particularly stressful; responding to alerts at odd hours under pressure to fix an outage is always going to be taxing, so engineering leaders need to pay extra attention to on-call shifts and how they affect mental health and retention. Increased compensation for off-hours on-call shifts shows your engineers that you recognize the extra burden of the work, but this isn’t sufficient. In fact, employees are ten times more likely to leave due to bad culture than low compensation, according to MIT Sloan.

DORA outlines some of the practices that lead to a positive culture, such as high cooperation between teams, sharing risks, and encouraging experimentation. They emphasize blameless post-mortems and training messengers instead of blaming them. Failure should be seen as a learning opportunity. When people feel safe, they’re faster to raise issues, which results in higher performance. This year, we expect leaders hoping to improve their on-call process to work on building more supportive and generative engineering cultures.

How to think about updating your on-call process

When considering how to update your on-call processes, don’t do everything at once. First, benchmark against industry standards. Find out where you’re falling behind and treat those as areas for focus. If you’re not already collecting metrics on your on-call process, start by choosing a few and collecting them. You can’t measure or compare your performance or progress without metrics!

Use your retrospectives to see which parts of your on-call process need better tooling or better procedures. When you make a change, wait and see how it goes before making additional changes, so that you can measure what has helped and avoid burning out engineers with too many changes at once. We want to minimize disruptions and ensure sustainable improvement over time.

Improve and clarify your on-call process with Cortex

An increasing number of companies are managing their software within internal developer portals (IDPs), like Cortex. Cortex connects all the data related to your software, including who owns each component, what the dependencies are, who's on-call, and what the runbook is. With this information centralized, users can answer questions about an issue more quickly and resolve it faster.

Learn more about how Cortex can be part of your 2024 incident management workflow and book a demo today.
