Building reliable services: A guide to setting SLOs
The software development life cycle starts and ends with your users. While your team can refine the application based on their understanding and expectations, it is only by taking the users’ feedback into consideration that you can improve your product in a meaningful way. Engineering teams must ensure that they stay on top of any indicators that will allow them to view how satisfied the users are with the product.
As we head into the second half of the year, it’s crucial to evaluate what you achieved in the first half and set service level objectives (SLOs) for Q3. These help your team visualize the reliability targets the product must meet. With appropriate SLOs for your services, your team can gauge user experience and modify the product to meet users’ expectations. In this article, we take a closer look at SLOs and explain how to set meaningful and effective ones.
What are SLOs?
A service level objective (SLO) is a target for how reliably your service should meet users’ expectations. In other words, an SLO is not the expectations themselves but the degree to which you want the product to meet those expectations. For example, users may expect the service to work 100% of the time, while your SLO is set at 98%, i.e., you intend for the service to do what it is supposed to do and meet users’ expectations 98 out of 100 times.
Service level agreements (SLAs) are contracts between the developer and a user that lay down the guarantees and promises made concerning the product's functionality. If the product is unreliable and fails to meet those requirements, the team has fallen short of the agreement, which can result in having to compensate the users. SLOs are commonly used in production environments to ensure that the released version of the application abides by the SLA.
Why are SLOs important for your SRE strategy?
Without SLOs, it is difficult to measure how your users are responding to the service and how well the service is performing. Support tickets do not provide a comprehensive understanding of user experience; they only reflect any specific issues that users themselves have identified as requiring attention. The feedback from SLOs gives you a more precise and complete idea of what needs to be improved and what is working well. They help teams make informed business-level decisions based on data. They can decide which features or issues to prioritize. SLOs provide visibility without which you run the risk of experiencing major downtime and not identifying issues until after they have severely impacted user experience.
Additionally, there is an inherent conflict between development velocity (how often you are improving and changing the service) and operational stability. Frequent changes to your services make them more likely to break and be unreliable. Although it is unrealistic to expect complete reliability at all times, you need a certain level of stability that you cannot compromise on. SLOs step in to give you the information you need to achieve and maintain that stability even as development continues. Both development and operations teams are given common goals to work towards. This makes it easier for the teams to cooperate despite the seeming contradictions that undergird their relationship.
Finally, when you have reliable and thoughtfully set SLOs, you can also automate the monitoring of SLIs. This speeds up parts of your process that you have confidence in so that your teams can work on other matters.
How can you set SLOs?
SLOs must be defined clearly to be able to make effective contributions to your development process. Doing so requires careful thought and deliberation, in addition to strong monitoring mechanisms in place. In this section, we guide you through the process of setting SLOs to enhance the quality of your services.
Step 1: Identify user journeys
A user journey is a series of interactions that a user has with your product, usually with an end goal in mind. It essentially summarizes a particular experience that a user has with the software. By outlining their exact trajectory, it attempts to reflect the expectations that the user had each step of the way. You can see to what extent the user was successful in achieving their goal with the help of your application, and if there were any complications during the journey. As a result, user journeys are valuable information for you to have.
To set SLOs, you can start by identifying the various journeys users take when interacting with your software. This includes which specific services they interact with and how, as well as the expectations they have of the software and whether these are being met. Once you have mapped these out, your team will have a trove of information that can be used to discern patterns or trends in user behavior. You must also start thinking about the degree of unreliability in these paths at which users will begin to be unhappy.
Step 2: Determine service level indicators (SLIs)
While SLOs are the targets that you hope to reach, service level indicators are quantitative measures or metrics that help determine the actual service level being delivered to users. You are probably already tracking all sorts of metrics, but we recommend selecting a few that are especially reflective of user experience to be marked as SLIs.
Many metrics, such as CPU usage, matter only to your teams. They do not directly impact the users’ experience of your product and will not make for helpful SLIs. Metrics that look healthy during outages, or unhealthy on otherwise normal days, are not strong SLIs either. We also recommend steering clear of measures that produce an excessive number of false positives or negatives. Only metrics that demonstrate a strong relationship with user satisfaction are worth considering when thinking about reliability and user confidence in the product.
To better choose and organize your SLIs, you can also begin by specifying categories in your process. These would be categories for which you would like to identify SLIs, such as storage or data pipeline.
Once you have chosen your SLIs, you must measure them with the data from your users. Be careful to extract the data that is most relevant, i.e., from the parts of your application’s infrastructure that users interact with closely or directly. Services often also include internal parts that have no direct relation to the user, so we recommend leaving data from those out of consideration.
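As a rough illustration of measuring SLIs from user-facing data only, the sketch below computes an availability SLI and a latency SLI from a list of request records. The `Request` record, its field names, the `user_facing` flag, and the 300 ms threshold are all assumptions made for this example and do not come from any particular monitoring tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    """Hypothetical request record; the field names are illustrative."""
    status: int         # HTTP status code returned to the caller
    latency_ms: float   # time taken to serve the request
    user_facing: bool   # False for internal traffic such as health checks


def _user_facing(requests: list[Request]) -> list[Request]:
    """Keep only the requests that users interact with directly."""
    relevant = [r for r in requests if r.user_facing]
    if not relevant:
        raise ValueError("no user-facing requests to measure")
    return relevant


def availability_sli(requests: list[Request]) -> float:
    """Fraction of user-facing requests that did not fail with a 5xx status."""
    relevant = _user_facing(requests)
    good = sum(1 for r in relevant if r.status < 500)
    return good / len(relevant)


def latency_sli(requests: list[Request], threshold_ms: float = 300.0) -> float:
    """Fraction of user-facing requests served within the latency threshold."""
    relevant = _user_facing(requests)
    good = sum(1 for r in relevant if r.latency_ms <= threshold_ms)
    return good / len(relevant)
```

Both functions express the SLI as good events over total events, which keeps it directly comparable to a percentage target later on.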
Step 3: Set the SLOs
An SLO consists of an SLI, a period over which you will measure the metric, and a target percentage. The target percentage is the share of good events out of total events that you want the metric to reach over that period. It can also be a range, going from the bare minimum to the best-case scenario. Furthermore, the SLO is dependent not only on the metric but also on other conditions that impact the metric, such as its value in the larger infrastructure. Note that it is imperative to keep the SLOs as simple as possible. Long-winded, complex SLOs are likely to cause confusion and delay any enhancements to the service.
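One way to picture the three parts of an SLO is as a small record. The sketch below is illustrative only: the `SLO` class, its field names, and the choice to treat a window with no events as compliant are assumptions, not a standard API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SLO:
    """Hypothetical SLO record: an SLI, a measurement window, and a target."""
    sli_name: str      # which indicator this objective is about
    window_days: int   # period over which the SLI is measured
    target: float      # desired ratio of good to total events, e.g. 0.98

    def is_met(self, good_events: int, total_events: int) -> bool:
        """Compare the measured good/total ratio against the target."""
        if total_events == 0:
            # Assumption: no traffic in the window means nothing was violated.
            return True
        return good_events / total_events >= self.target
```

For instance, `SLO("availability", 28, 0.98).is_met(9_850, 10_000)` evaluates a measured ratio of 98.5% against a 98% target.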
In addition to reflecting on the metrics, it is advisable to use the information on past trends to determine the duration and target for your SLOs. The SLOs are not set in stone. So if this information is not available at present, you can make changes to the SLO as you go. You will do so regardless as you attempt to align it with your business needs and user happiness as closely as possible. Setting an SLO is a trial and error process to a considerable extent, especially when your team is learning the ropes.
Finally, ensure that you have an error budget in place. An error budget is the number of negative events you can have on a metric before your SLO is considered to have been violated. With a 98% target, for example, the budget is the remaining 2% of events. Identifying the error budget can also take some experimentation as you figure out how to define the scope of reliability for your product. An error budget gives your targets some room for failure, so that a handful of bad events does not immediately count as a violation.
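The arithmetic behind an error budget is short enough to sketch. The function names below are hypothetical, and the example assumes an event-based SLI (good events out of total events) rather than a time-based one.

```python
def error_budget(target: float, total_events: int) -> float:
    """Number of bad events the SLO tolerates over the window."""
    if not 0.0 <= target <= 1.0:
        raise ValueError("target must be a ratio between 0 and 1")
    return (1.0 - target) * total_events


def budget_consumed(target: float, total_events: int, bad_events: int) -> float:
    """Fraction of the error budget already used (1.0 means fully spent)."""
    budget = error_budget(target, total_events)
    if budget == 0:
        raise ValueError("a 100% target or an empty window leaves no budget")
    return bad_events / budget
```

With a 98% target and 10,000 events, the budget works out to roughly 200 bad events, so 50 bad events would use about a quarter of it.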
Step 4: Set up alerts
You can set up automated alerts at certain thresholds within your error budget to monitor each SLO. If you set it at 20% for instance, you will be notified when your application has used up 20% of the error budget you had set for it as part of the SLO. These alerts can vary by metric and must be set to highlight significant events that require your attention. An alert indicates that some kind of action must be taken so that you do not use up the entirety of the error budget and significantly impact your users’ experience of the product.
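A threshold alert of this kind can be sketched as a check on how much of the budget has been consumed between two evaluations. The function name and the 20%, 50%, and 80% thresholds are assumptions chosen for illustration; a real setup would normally express this in the alerting rules of your monitoring tool.

```python
def newly_crossed(
    previous: float,
    current: float,
    thresholds: tuple[float, ...] = (0.2, 0.5, 0.8),
) -> list[float]:
    """Return thresholds crossed since the last check, in ascending order.

    `previous` and `current` are fractions of the error budget consumed,
    where 0.2 means 20% of the budget has been used.
    """
    return [t for t in sorted(thresholds) if previous < t <= current]
```

Reporting only newly crossed thresholds avoids sending the same notification on every evaluation while consumption stays above a level.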
There are multiple approaches to setting alerts, all of which need you to consider the following four parameters:
- Precision: the proportion of detected events that were significant.
- Recall: the proportion of significant events that were detected.
- Detection time: the time it takes to send an alert after a significant event has occurred.
- Reset time: the time for which an alert keeps firing after the issue has been resolved.
Ideally, the first two would be 100%, while the other two would be zero. These levels are nearly impossible to achieve, but the considerations remain important when setting up alerts.
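To make the percentage-based parameters concrete, the sketch below computes precision together with recall (the share of significant events that actually triggered an alert) from two sets of event identifiers. The function names and the idea of labelling events by ID after the fact are assumptions for illustration.

```python
def precision(alerted: set[str], significant: set[str]) -> float:
    """Share of alerted events that were actually significant."""
    if not alerted:
        raise ValueError("no alerts fired, precision is undefined")
    return len(alerted & significant) / len(alerted)


def recall(alerted: set[str], significant: set[str]) -> float:
    """Share of significant events that triggered an alert."""
    if not significant:
        raise ValueError("no significant events, recall is undefined")
    return len(alerted & significant) / len(significant)
```

Low precision means your team is being paged for noise, while low recall means real problems are slipping through unnoticed.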
Of the various ways you can decide to set up your alerts, alerting on the burn rate is one of the most popular options. The burn rate is a measure of the speed at which you are using up your error budget. Therefore, you can determine which issues are causing more damage than others at a certain point in time. This is especially useful because it can help you prioritize solving more urgent issues over others.
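Burn rate is commonly computed as the observed error rate divided by the error rate the SLO allows, so a value of 1 means the budget would last exactly the length of the SLO window. The sketch below follows that definition; the function names are hypothetical, and the projection assumes the current rate stays constant.

```python
def burn_rate(bad_events: int, total_events: int, target: float) -> float:
    """Observed error rate relative to the error rate the SLO allows."""
    allowed = 1.0 - target
    if allowed <= 0:
        raise ValueError("a 100% target leaves no error budget to burn")
    if total_events == 0:
        return 0.0  # assumption: no traffic burns no budget
    return (bad_events / total_events) / allowed


def days_until_exhausted(rate: float, window_days: int) -> float:
    """Days a full budget would last if the current burn rate persisted."""
    if rate <= 0:
        return float("inf")
    return window_days / rate
```

For example, 40 bad events out of 1,000 against a 98% target gives a burn rate of about 2, which would exhaust a 28-day budget in roughly 14 days. An issue with a higher burn rate is therefore the more urgent one to address.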
What are some best practices you can follow when setting SLOs?
SLOs for individual services
When starting the process, set SLOs for individual services. This will help you gain clarity at that level, which can then inform global SLOs that span different relationships within your application. Working at the level of individual services at the beginning will also help you identify where SLOs are required and where they are not. Some SLOs, although possible to set, may not contribute anything meaningful to the larger application and goals. Take stock of the expectations of individual services first to see how, if at all, they fit into the larger objective of the application. Once that is done, you can look into SLOs that are common across services and can be approached as such. Note that having too many SLOs can prevent you from focusing on the most important ones.
Individual teams set SLOs
In addition to the focus on specific services, SLOs also benefit from being defined by individual teams. This is because the teams know their services best, and this exercise reinforces their commitment to the service level and ensures that their contribution to the application only makes it more stable and reliable. Only when relevant actors understand the SLOs can you comply with your SLAs.
Make them visible
To realize their full potential, ensure visibility for SLOs. While individual teams will set these, leadership and SRE teams need to be kept in the loop regarding SLOs across all the services. Having visibility helps to make sure that all the SLOs are relevant to the application’s reliability or the business goals. It also adds a layer of accountability to the process by encouraging teams to continue committing to these objectives as they build the product.
Wait to alert
Although alerts are an essential part of your SLO setting process, we recommend waiting to set them if you are just starting. Instead, you can create warnings in the beginning, until you reach the point where you are confident that your SLOs are meaningful.
Standardize across teams
With multiple teams setting SLOs, it can get chaotic to manage them collectively. By putting standards into place and emphasizing consistency, you can streamline reporting and make the process easier for SREs and leadership. For example, you can choose and stick to one tool across teams.
Set SLOs judiciously
Aim for as much accuracy as possible when setting SLOs. Setting them too low might help you steer clear of violations, but it will also result in poorly informed product decisions, because your SLOs will not reflect the potential of your application. Setting unrealistically high SLOs, on the other hand, will prove costly and effort-intensive in exchange for minimal returns. You might, for instance, waste resources by overshooting the level of reliability that is sufficient to keep users happy.
Don’t limit SLO usage to production
While SLOs are vital to production environments, they are underutilized if limited to that stage of the software development lifecycle. Instead, set and use SLOs across your development process. This will help emphasize the goals for all actors involved in the process, such that every step made towards building the application is in line with the expectations laid out for it. It also provides ample opportunity for your teams and SREs to readjust or reevaluate the SLOs should the need arise.
Stay flexible
You must be willing, and plan, to modify your SLOs or set different ones in line with the user data that comes in. At the end of the day, the point of setting SLOs and having visibility is to capture information that will allow you to improve your product. Your team’s needs and capacities may also change with time, which also calls for flexibility on your end. It is important to remember that service level objectives emerge out of existing conditions and aspirations as much as they play a role in forming them. If you are consistently unable to meet the SLOs, consider letting go of them or focusing on stabilizing your software.
How can you get started?
Setting SLOs is a process that requires you to reflect and deliberate on your existing product, workflows, and user happiness. Before you can identify user journeys and SLIs, you need to take stock of your existing microservices. Only by knowing how each service operates and contributes to the larger application will you be aware of how the user interacts with them and whether their expectations are being met.
Cortex has got you covered. Our service catalog is a platform that brings together all of your microservices so that everyone in your team can benefit from a zoomed-out view of your application’s internal workings. With visibility into your services, you will be in a better position to map user journeys across services and features. Not only does your field of vision expand, but highlighting metrics and the interactions between them also becomes more straightforward, as does setting individual SLOs and identifying the categories represented in them. Furthermore, when the alerts go off, having this information at their fingertips will allow your team to respond more quickly.
By making the details of your services more readily available, we want to help you set up accurate SLOs in no time. The ability to make informed decisions will, after all, be your greatest superpower.