Development
Microservices

Best practices for on-call scheduling and management

Properly setting up on-call scheduling systems creates arrangements that ensure developers are available to maintain and improve the microservices infrastructure. In this article, we'll discuss some best practices for scheduling and managing employee on-call systems.

By Cortex - December 6, 2022

An on-call schedule forms the backbone of your incident response system in the event of an outage or when an issue is raised. It ensures that end-users are not kept waiting and helps maintain the reliability and availability of your software. However, on-call management practices often induce worry and anxiety in team members. In extreme cases, they can even contribute to employee burnout. While on-call work is essential for the smooth upkeep of your software, it should not come at the expense of your employees’ health or work-life balance.

How do you think about on-call scheduling at your company? In this article, we break down some considerations and best practices you can employ to the advantage of both your business and your team. On-call rotations need not be a source of tension amongst teams but merely a practice to continue improving the quality of your microservices architecture.

What is an on-call schedule?

An on-call work schedule is a series of arrangements for a team member to be available at all times to keep an eye on a service. This responsibility is usually shared on a rotating basis, ensuring that everyone in the team is held accountable for the availability and reliability of the software.

Team members need to be on call so that they can quickly fix any bugs getting in the way of your product’s smooth functioning. Downtimes can severely hamper user experience, so developers must be on the lookout for any issues and handle them swiftly and with precision.

Why should you have an on-call schedule?

Assigning employees to shifts on an on-call schedule may at first glance seem like a cumbersome task. After all, a monitoring and observability system with automation should suffice. However, should the time come for a call, the system can hardly fix the errors by itself. Its job begins and ends with alerting your team when something is amiss.

An effective and sustainable on-call schedule goes beyond monitoring risks and errors to ensuring somebody is present to handle them within the appropriate time frame. In the long term, this is likely to help you save considerably on the costs of outages. The following are a few benefits of taking the time to put in place a robust on-call schedule:

  • Software improvement and reliability: The moment you release software is not the end of the development life cycle, but a milestone. Only after your end-users begin interacting with and using the product can you glean insights that help you improve the software. Having a system in place that motivates your on-call engineers to actively monitor and respond to issues is one of the more effective ways of ensuring that the product is constantly being improved, and in a way that responds to the needs of its consumers. 
  • Rapid response: By virtue of somebody being on alert at all times, little time is wasted in figuring out who can handle a failure or bug in the software. This inevitably decreases downtime, as the team member on call can get started straight away when faced with an alert. As a result, customers directly impacted by the bug do not need to wait until the next business day to be able to flag an issue and even longer to resume their activities.
  • Team accountability: By having the very team that built the services stand guard, a spirit of accountability is fostered. Developers are encouraged to accept responsibility not only for what they have created but also for any unwanted consequences. Asking them to respond promptly to any failures in the software also has the added advantage of preventing them from repeating the same errors in the future.
  • Team transparency: On-call rotations are a joint effort. Even if only one or two people are on call at a time, there are ample situations when the team must work together. Some bugs may require the expertise of more than one developer or that of a developer not on call at the time. In such a situation, the team is motivated to collaborate to find the best solution in the least amount of time. Without a certain level of honesty and transparency, this is impossible to sustain in the long term.
  • High customer morale: An effective on-call schedule forms the basis of customer satisfaction with your product. Even though outages can occur, customers are able to get in touch with a team member or are assured that the issues will be taken care of at the earliest opportunity. Consistent rapid error resolution is made possible with a well-designed on-call schedule.
  • Employee well-being: A sustainable on-call schedule does not overlook the needs and bandwidth of the team members. Alert fatigue, for instance, is a common problem that affects employees who have no respite from constant alerts. They become indifferent and run the risk of failing to respond to an error. However, if you take into consideration your employees’ well-being when designing an on-call policy and schedule, they are likely to be less burdened and stressed. They may also take a more favorable stance on on-call rotations, resulting in higher job satisfaction and productivity overall.

There are many advantages of having an on-call schedule. To realize them, though, you need to go the extra mile and think through its design and its impact on both your team and your customers. That effort is worth the time and energy invested in it.

General considerations and best practices

Keep in mind that there is no universal approach to on-call rotations that meets the needs and practices of all companies. You must be willing to spend time carefully putting together an on-call strategy that best integrates with the culture you want to build at the company as well as the workflows of your team and employee preferences.

A particular service alerting constantly should not be the norm. If a service alerts, that is the number one priority. The first steps are figuring out why it is alerting and how to make sure it does not alert again. If it is alerting all the time, at what point can you turn it off? Despite the consequences for user experience and the quality of your software, dealing with the constant noise produced by the alerts should not become part of the job for the person on call. From a quality of life perspective, this can be disruptive and eventually lead to burnout.

The developers’ quality of life should feature high on the list of priorities when you are managing on-call rotations. For example, if a service alerts in the middle of the night, it is worth asking whether it needs to alert then. Could that have instead been part of an aggregated report that can be scheduled for 9 AM the following day?

If a service notification does alert during your time off or after work hours, in the middle of the night, for instance, that should only be because there was no other acceptable option. Treat it as something that must be fixed as soon as possible. It must not be the reason somebody is again forced to wake up in the middle of the night and look into it.
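One way to put this into practice is to decide, per alert, whether it pages someone right away or waits for the morning report. The following is a minimal sketch, not a prescribed implementation. The severity labels, the 9 AM to 6 PM working hours, and the function names are assumptions made for illustration, and weekends are ignored for simplicity.

```python
from datetime import datetime, time, timedelta

# Hypothetical routing outcomes; real alerting tools use their own names.
PAGE_NOW = "page_now"
DEFER_TO_DIGEST = "defer_to_digest"

BUSINESS_START = time(9, 0)   # the aggregated report goes out at 9 AM
BUSINESS_END = time(18, 0)    # assumed end of the working day


def route_alert(severity: str, fired_at: datetime) -> str:
    """Page only when waiting is not acceptable; otherwise queue for the digest."""
    in_hours = BUSINESS_START <= fired_at.time() < BUSINESS_END
    if severity == "critical" or in_hours:
        return PAGE_NOW
    return DEFER_TO_DIGEST


def next_digest(fired_at: datetime) -> datetime:
    """Return the next 9 AM at or after the moment the alert fired."""
    digest = datetime.combine(fired_at.date(), BUSINESS_START)
    if fired_at > digest:
        digest += timedelta(days=1)
    return digest
```

With these assumptions, a non-critical alert that fires at 3 AM is held for the 9 AM report on the same day, while a critical alert at the same hour still pages the person on call.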

Treat your developers with respect and acknowledge that unfavorable situations arise, but it is not the expectation that people must immediately drop anything they are doing outside of work all the time. If that is an expectation created over time, your team will be increasingly disinclined to be on call. To avoid such friction, seeing as on-call rotations are part and parcel of the software development process, set realistic and respectful expectations. This ultimately contributes to your company culture.

Other practices, like continuous testing and monitoring or thinking about security as an ongoing process, contribute to the overall quality of your software from day one. Inculcating these practices and setting standards for the entire team to meet will also reduce the chances of your services alerting frequently.

Let your team know that you want to avoid making a habit of employees being woken up at odd hours for the sake of software quality and better uptimes. All in all, having well-defined priorities and expectations will help maintain a balance when it comes to on-call rotation schedule management. The following are a few more factors to consider and best practices to include as part of your on-call rotation scheduling process.

Defining “on-call”

Part of the process of on-call employee scheduling is having a common understanding of what on-call means in your company and defining expectations around it. Usually, while you are on call, the expectation is that you are not engaged in any form of active development. You are going to be the point person for the team. So if other stakeholders have questions or if a situation arises that requires immediate attention, you are the person that must take care of it.

Does on-call mean you are accountable only for the alerts that come up during your shift? What is the expectation for tracking alerts? If something happens, how quickly do you have to be at your laptop or workspace, triaging the bug and messaging the relevant people? Some companies, for example, have a thirty-minute policy, while others require team members to have a handle on the situation in five. Other areas to consider include what triggers an alert. Is it only active incidents that are being alerted on? What about support tickets?

Thinking through the impact of an on-call rotation on your team members’ lives is also important, as that also affects the duration of the on-call period. How much effort is it to be on call? It can be fairly quiet, so while an alert might surface now and then, you can continue doing your own work for the most part. If it is, however, life-consuming, in that the person on-call has no room to breathe, it is probably time to reconsider the on-call schedule or policy.

Schedule durations

The biggest challenge is to define a schedule that does not interfere with your employees’ lives in a way that they have to schedule their own lives around it. How you decide on the duration of an on-call shift plays a significant role in the kind of work-life balance team members can envision having.

The duration depends significantly on the team size. For example, you might want to configure a one-week on-call rotation in a six-member team. That way, a team member will be on call for one week and then not again for another five weeks, as everybody has their turn. If you only have two people on the team, then you can consider setting up a two weeks on, two weeks off rotation. Or, both members can be on call on alternate days.
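The arithmetic behind these examples can be sketched in a few lines. This is only an illustration under simple assumptions: shifts are whole weeks, members rotate in a fixed order, and the function names are hypothetical.

```python
from datetime import date, timedelta


def weekly_rotation(members: list[str], start: date, weeks: int) -> list[tuple[date, str]]:
    """Assign one member per week, cycling through the team in order."""
    return [
        (start + timedelta(weeks=i), members[i % len(members)])
        for i in range(weeks)
    ]


def weeks_off_between_shifts(team_size: int, shift_weeks: int = 1) -> int:
    """Weeks a member is off call between the end of one shift and the start of the next."""
    return (team_size - 1) * shift_weeks
```

For a six-member team on one-week shifts, `weeks_off_between_shifts(6)` gives five weeks off. For a two-person team on two-week shifts, `weeks_off_between_shifts(2, 2)` gives two.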

To make sure that your plan is robust, lay out all the shifts in the schedule. For remote teams, time zones are an important consideration. Neglecting to keep in mind the different time zones when putting team members on shifts runs the risk of wide gaps in the schedule. It can also be advantageous to have a team spread across the globe, as you are more likely to be able to put people on daytime shifts. A follow-the-sun schedule, for instance, only assigns on-call shifts to employees when it is daytime for them. This allows for smoother 24/7 on-call scheduling than if the entire team is based out of the same physical location, which nearly always results in someone being stuck with a late night shift.
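Laying out the shifts makes gaps easy to spot programmatically. The sketch below assumes each region's daytime shift has already been converted to UTC hours and that shifts repeat daily. The region names and hours are made up for the example, and daylight saving time is ignored.

```python
def uncovered_hours(shifts: dict[str, tuple[int, int]]) -> list[int]:
    """Return the UTC hours (0-23) not covered by any shift.

    Each shift is (start_hour, end_hour) in UTC with the end exclusive.
    A shift may wrap past midnight, for example (22, 6).
    """
    covered = set()
    for start, end in shifts.values():
        # Length of the shift in hours; start == end is treated as a full day.
        length = (end - start) % 24 or 24
        covered.update((start + h) % 24 for h in range(length))
    return [h for h in range(24) if h not in covered]


# Hypothetical follow-the-sun layout with three regions, each on a daytime shift.
follow_the_sun = {"apac": (0, 8), "emea": (8, 16), "amer": (16, 24)}

# Hypothetical two-region layout that leaves part of the day unstaffed.
two_regions = {"emea": (8, 16), "amer": (14, 22)}
```

Calling `uncovered_hours(follow_the_sun)` returns an empty list, while `uncovered_hours(two_regions)` lists the hours that would need a night shift or an additional region.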

We recommend prioritizing your employees’ needs to the extent possible by defining a schedule that allows your teams the flexibility to live their lives.

Project planning

On-call scheduling also plays a role in more general project planning. If you know, for instance, that a team member is going to be on call for a week and that your current project spans three weeks, you need to account for that person not being available for part of the project. Without adequate knowledge of the schedule, this is nearly impossible to achieve and can result in otherwise avoidable confusion and delays.

Flexibility

To offer team members more flexibility without compromising on your on-call schedules, allow them to swap rotations if they want to. Giving people the freedom to, between themselves, reschedule and move commitments around gives them more control over their work-life balance. You can also go a step further and encourage employees to step in for fellow team members who may have a lot on their plate in a particular week or may be dealing with some personal emergencies.

Backups and escalation procedures

In a similar vein, having backups is good practice. Although this may seem like a luxury for smaller companies, it is best to cover all of your bases in case the person on call misses an alert or has to attend to an emergency. One way to go about this is to have junior engineers on on-call rotations so that they learn the ropes, while more senior team members can be on backup and step in only when necessary.

With an escalation policy in place, you clarify the requirements and procedure for escalating the issue in case the person on-call is not equipped to handle the situation. For instance, if a developer who helped build the service is the first to receive an alert but needs assistance with the hardware, they know which person from the operations team to call.

A call tree is a way to formalize the process of escalating a request. Based on certain rules, a call tree can notify and delegate tasks to team members based on the severity of the issue and the expertise required to fix it. The rules must be well-defined for the call tree to be effective.
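As a rough illustration of such rules, a call tree can be modeled as a list of rules keyed on severity and expertise, each with an ordered list of contacts. Everything in this sketch is an assumption for the sake of the example: the numeric severity scale, the `Rule` fields, and the role names are hypothetical and not taken from any particular tool.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Rule:
    min_severity: int      # rule applies to issues at or above this severity
    expertise: str         # area the rule covers, e.g. "hardware"
    contacts: list[str]    # ordered: primary first, then backups


def contacts_for(rules: list[Rule], severity: int, expertise: str) -> list[str]:
    """Return the ordered contacts from the matching rule with the highest threshold."""
    matches = [r for r in rules if r.expertise == expertise and severity >= r.min_severity]
    if not matches:
        return []
    return max(matches, key=lambda r: r.min_severity).contacts


def escalate(contacts: list[str], acknowledged_by: set[str]) -> Optional[str]:
    """Walk down the list and return the first contact who acknowledged, if any."""
    for person in contacts:
        if person in acknowledged_by:
            return person
    return None


# Hypothetical rules: juniors take routine service issues with seniors as backup.
rules = [
    Rule(min_severity=1, expertise="service", contacts=["junior-dev", "senior-dev"]),
    Rule(min_severity=3, expertise="service", contacts=["senior-dev", "eng-manager"]),
    Rule(min_severity=1, expertise="hardware", contacts=["ops-engineer"]),
]
```

Here, a severity 2 service issue goes to the junior developer first and falls back to the senior developer if the alert is not acknowledged, while a hardware issue goes straight to operations.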

Visibility and training

Ensuring that everyone has visibility into the on-call schedules will prevent avoidable scheduling conflicts. Other stakeholders will know who is on call and when, which encourages them to align expectations and schedules with team members’ availability in mind.

Providing visibility into the incidents and issues is also useful. Team members are more likely to be able to identify services that are alerting at an unacceptably high level.
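A simple count of alerts per service is often enough to surface the noisy ones. The sketch below assumes you can export a list of service names, one entry per alert, for a given period. The threshold and the service names are made-up values for illustration.

```python
from collections import Counter


def noisy_services(alerts: list[str], max_alerts: int) -> dict[str, int]:
    """Return services whose alert count exceeds max_alerts, with their counts."""
    counts = Counter(alerts)
    return {service: n for service, n in counts.items() if n > max_alerts}


# Hypothetical alert log for one week: one entry per alert, naming the service.
week_of_alerts = ["billing", "billing", "auth", "billing", "search", "auth"]
```

With a threshold of two alerts per week, `noisy_services(week_of_alerts, 2)` flags only the billing service, which alerted three times.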

You have a better chance of having successful on-call rotations if you train the team specifically for that purpose. In addition to communicating expectations, it is also a good idea to discuss common issues and how to rapidly fix them. Employees must also have a channel to raise any questions about the process. This can also include allowing newer employees to shadow more experienced team members during on-call shifts and giving everyone access to incident reports of the past so they can use those as a point of reference when faced with similar issues.

Transparency

Speaking with your team to set expectations and checking in regularly are both effective practices. It is easier to define realistic expectations when you have a solid understanding of the bandwidth that the team has, and your on-call shifts, incident management, and resolution mechanisms are more likely to be effective as a result. Regular check-ins help you take stock of what is and isn’t working for the team so that you can put your heads together to refine the on-call schedule such that it works for everyone.

Additionally, any changes in the schedule should be communicated to the relevant members without fail. Last-minute schedule changes are not desirable, but effectively communicating them and having backups as part of the regular strategy can help soften their impact on the team. With enough notice, not only can employees plan their lives, but it also becomes less likely that a shift will be missed. Keeping everybody in the loop at all times and valuing transparency in communication is essential for creating sustainable on-call routines.

Service ownership

On-call schedules, especially in the modern world, generally center around service ownership. This means that there is a strong preference for whoever wrote the microservice to be on call for it. This practice is said to inculcate a sense of ownership, which is ultimately beneficial for the quality of the work done during the on-call shift.

In the case of level one, two, or three support, someone in the team will get paged if a service goes down. If the issue does not concern the said person, they have little incentive to spend time troubleshooting and fixing that service. Similarly, a noisy service creating alerts at a rapid frequency will not faze them if they are not directly affected by it. They are not incentivized to ask if the service should be alerting and waking up somebody in the middle of the night.

However, if the person being paged after hours is the one who owns the service, they have a strong incentive both to resolve the situation and to make sure it does not disturb them outside of work again. Not only that, but because they know the service through and through, it will be easier for them to resolve the incident and prevent it from resurfacing in the future.

As a result, from a service ownership standpoint, the same people should be in charge of the entire process, from designing the service to building, deploying, and being on call for it. The team that builds a service should be accountable for the entire service lifecycle as they are likely to be the ones most invested in its performance.

Enforce standards with Cortex’s scorecards

Maintaining documentation in a spreadsheet or keeping logs of on-call rotating schedules can become a tedious process, especially when you are managing larger teams. Our Scorecard feature simplifies the process of establishing expectations and enforcing them as standards across teams. You can define ownership at each service level, so that team members are held accountable and more incentivized during on-call shifts. You can also keep an eye out for services that are alerting at unacceptable rates so that you may prioritize fixing those bugs to prevent employee burnout over the long term. On-call rotations do not have to be a dreaded part of an employee’s job. With clarity and consensus around acceptable expectations, it can resemble any other duty they must fulfill.
