What is a runbook and why is it important?
A runbook is a well-defined set of instructions to address a particular situation. For example, a runbook could be a list of commands to debug a service, or a series of escalation steps when a server goes down.
Runbooks are important because they make knowledge easily actionable for someone without domain expertise. This ensures, for example, that the engineer who created the service doesn’t need to be the first line of defense in the event of an outage. Instead, if they create a runbook, anyone else can pick it up and take the right steps to fix the problem. This is especially necessary in a crisis situation where time is of the essence and relying on the availability of a particular person simply doesn’t scale.
Given their usefulness, runbooks are ubiquitous and are the foundation for any team concerned with reliability. Looking beyond the lens of software, we see that our world is filled with “mini-runbooks” for what to do when the unexpected happens: “Stop, drop, and roll,” “Break glass in case of emergency,” “Put your mask on before helping others on the airplane.” These statements distill the key actions needed to resolve the situation. They encode the knowledge of experts and ensure that even without one at hand, you can avert a crisis.
What are the qualities of a good runbook?
Runbooks are most effective when they are readily available, easily actionable, and up-to-date and accurate.
In a crisis, the best runbook you can have is one that exists. Teams should proactively create runbooks for important services before the unexpected hits. The approach can vary: some teams have a process to create a new runbook for every service, while others architect their systems around common fundamental components that can be addressed by the same runbook. What matters most is that teams do not wait for something to go wrong before they create a runbook.
Building on the above, the second best runbook you can have is one that you can find. Runbooks should be organized so that anyone can find the right one for the problem at hand. Across different companies this can mean different things: powerful internal search tools, well-structured wikis, or even just an organized folder. The specific implementation can vary, but what is most important is that anyone can look up the right runbook quickly when needed.
Runbooks should be specific. Assuming that the user does not have the same knowledge as the creator, steps in a runbook should not rely on inferred or ambiguous processes. First, a good runbook should have obvious trigger criteria so that the user knows when it is applicable. This can be a well-defined start state, or even a series of diagnostic commands to confirm that the runbook will actually be useful.
Second, if the runbook is determined to be applicable, it should have a clear set of steps. Similar to a decision tree, this helps the user work through all common outcomes: if command X succeeds, do Y; otherwise, do Z; and so on. Without this level of specificity, users are likely to get stuck partway through the instructions and will not be able to use the runbook effectively.
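To make the decision-tree idea concrete, here is a minimal sketch in Python. It is an illustration only, not a prescribed format: the `Step` class, its field names, and the `walk` function are all hypothetical, and the `check` callables stand in for whatever command or observation tells the user whether a step succeeded.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class Step:
    """One runbook step: an instruction plus where to go next (hypothetical model)."""
    instruction: str
    check: Callable[[], bool]          # returns True if the step succeeded
    on_success: Optional[str] = None   # name of the next step, or None to stop
    on_failure: Optional[str] = None   # name of the fallback step, or None to stop


def walk(steps: Dict[str, Step], start: str) -> List[str]:
    """Follow the tree from `start` and return the instructions visited, in order."""
    visited: List[str] = []
    seen = set()
    current: Optional[str] = start
    while current is not None:
        if current in seen:
            # A runbook that loops forever is not actionable; fail loudly instead.
            raise ValueError(f"runbook loops back to step {current!r}")
        seen.add(current)
        step = steps[current]
        visited.append(step.instruction)
        current = step.on_success if step.check() else step.on_failure
    return visited


# Example: "if the health check fails, restart the service".
example_steps = {
    "health": Step("Run the health check", check=lambda: False, on_failure="restart"),
    "restart": Step("Restart the service", check=lambda: True),
}
example_path = walk(example_steps, "health")
```

Writing the branches down explicitly, even in a wiki page rather than in code, forces the author to decide what happens when a step fails, which is exactly where an unfamiliar user would otherwise get stuck.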
While runbooks are meant to be used by those without domain expertise, they should be regularly checked and updated by those with the necessary knowledge. This means not only updating when the underlying functionality of the service changes, but also incorporating learnings from outages. Stale runbooks not only prevent you from solving the problem quickly, but also erode trust in processes overall and make your team less likely to use runbooks in the first place.
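One lightweight way to keep runbooks from going stale is to record when each one was last reviewed and flag the ones that are overdue. The sketch below assumes a hypothetical mapping of runbook names to review dates and an arbitrary 90-day review window; both are placeholders to adapt, not recommendations from any particular tool.

```python
from datetime import date, timedelta
from typing import Dict, List


def is_stale(last_reviewed: date, today: date, max_age_days: int = 90) -> bool:
    """Return True if the runbook has gone longer than `max_age_days` without review."""
    return today - last_reviewed > timedelta(days=max_age_days)


def stale_runbooks(
    reviews: Dict[str, date], today: date, max_age_days: int = 90
) -> List[str]:
    """Return the names of overdue runbooks, sorted alphabetically."""
    return sorted(
        name
        for name, last_reviewed in reviews.items()
        if is_stale(last_reviewed, today, max_age_days)
    )


# Hypothetical review log: runbook name -> date of last review.
example_reviews = {
    "payments-outage": date(2024, 1, 1),
    "cache-eviction": date(2024, 5, 1),
}
example_overdue = stale_runbooks(example_reviews, today=date(2024, 6, 1))
```

A report like this only shows that a review is overdue. Someone with the necessary domain knowledge still has to confirm that the steps match how the service behaves today.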
While not in our official list of “guidelines that start with A,” runbook automation is an important tool to be aware of. Essentially, many of the steps performed in a runbook, such as running diagnostic commands or triggering alerts, can also be automated as a service. This can help teams easily do initial triaging before having an engineer take a look. In some cases it may even result in an automated fix.
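As a rough illustration of automated triage, the sketch below runs a set of diagnostic commands and reports which ones failed, so an engineer starts with that information already in hand. The function names and the example commands are assumptions made for this sketch; real diagnostics would be whatever commands your own runbook already lists.

```python
import subprocess
import sys
from typing import Dict, List


def run_diagnostics(
    commands: Dict[str, List[str]], timeout_s: float = 10.0
) -> Dict[str, bool]:
    """Run each diagnostic command and record whether it exited successfully."""
    results: Dict[str, bool] = {}
    for name, argv in commands.items():
        try:
            completed = subprocess.run(argv, capture_output=True, timeout=timeout_s)
            results[name] = completed.returncode == 0
        except (OSError, subprocess.TimeoutExpired):
            # A missing binary or a hung command counts as a failed check.
            results[name] = False
    return results


def needs_escalation(results: Dict[str, bool]) -> bool:
    """Escalate to a human if any diagnostic failed."""
    return not all(results.values())


# Placeholder diagnostics that simply exit 0 or 1; swap in real commands.
example_commands = {
    "always_passes": [sys.executable, "-c", "raise SystemExit(0)"],
    "always_fails": [sys.executable, "-c", "raise SystemExit(1)"],
}
example_results = run_diagnostics(example_commands)
```

Here `needs_escalation(example_results)` is true because one check fails. In a real setup, that result might page the on-call engineer or trigger a known-safe remediation step.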
That said, runbook automation can be complex and is not necessary for all teams. If your team only expects a few non-critical outages, a manual runbook could more than suffice. The value of automation therefore depends largely on the system’s complexity and the team’s bandwidth.
The principle of making knowledge accessible for teams is key to building scalable and reliable software. In addition to runbooks, there are many more things you can do: everything from knowing who the owners of a service are to ensuring there are objective, well-known standards for quality. These are all ways in which effective distribution of knowledge can help teams succeed. This principle is at the heart of Cortex’s mission to help teams gain visibility into their services and deliver high-quality software. To learn more, check out our demo here.