How SREs can drive team adoption of the 4 golden signals
Monitoring a distributed system with potentially hundreds of microservices can be a challenge. In a world where it’s easy to get distracted by trying to measure everything, the four golden signals can help you stay focused and make sure you’re covering and alerting on the kind of performance issues that actually impact your end-users.
The four golden signals are:
- Latency — The response time for serving requests. Make sure to track failed requests separately, as factoring them into your overall latency number can be misleading.
- Traffic — The measure of the demand on your system, usually in the form of requests per second.
- Errors — How often requests fail. Beyond HTTP 500s, you’ll need to carefully define what a failed request means for your use case, such as failing to meet your SLAs or returning broken results.
- Saturation — The fraction of your resources, such as CPU and memory, that are utilized vs. available.
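To make the definitions above concrete, here is a minimal sketch of how the four signals could be computed for one service over a fixed time window. The `Request` record, the `golden_signals` function, the nearest-rank percentile, and the capacity parameters are all illustrative assumptions, not part of any particular monitoring tool. Latency is computed from successful requests only, following the advice to track failed requests separately.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    """Hypothetical record of one served request."""
    duration_ms: float
    ok: bool  # False for anything your team has defined as a failed request


def percentile(sorted_values, pct):
    """Nearest-rank percentile of an already sorted list; None if it is empty."""
    if not sorted_values:
        return None
    rank = math.ceil(pct / 100 * len(sorted_values))
    return sorted_values[max(rank, 1) - 1]


def golden_signals(requests, window_seconds, used_capacity, total_capacity):
    """Summarize the four golden signals for one observation window."""
    # Latency: successful requests only, so fast failures do not skew the number.
    ok_latencies = sorted(r.duration_ms for r in requests if r.ok)
    failed = sum(1 for r in requests if not r.ok)
    total = len(requests)
    return {
        "latency_p50_ms": percentile(ok_latencies, 50),
        "latency_p95_ms": percentile(ok_latencies, 95),
        # Traffic: demand on the system, in requests per second.
        "traffic_rps": total / window_seconds,
        # Errors: share of requests that failed.
        "error_rate": failed / total if total else 0.0,
        # Saturation: fraction of a resource (e.g. CPU cores) in use.
        "saturation": used_capacity / total_capacity,
    }
```

In practice these numbers would come from your monitoring system rather than an in-memory list, but the same separation of successful and failed requests applies.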
If you want to learn more about the four golden signals and how to measure them, there’s plenty of information out there (Google’s Site Reliability Engineering book is a great place to start). But what most guides don’t talk about is how you go from education to implementation, and drive adoption of the four golden signals across your team.
Since the four golden signals are system-level metrics, all your microservices need to be set up to report into the overall number. As an SRE, this means helping your engineering team track the metrics you need to report on the four golden signals. And even more importantly, you need to close the loop by actively leveraging these metrics to drive progress and prioritization—instead of just filing away the data in a spreadsheet somewhere.
How do you get everyone on board with these key indicators of your system health? Here are a few tips we’ve found that work to drive adoption of the four golden signals across teams.
1. Go service-by-service and make sure that metrics are defined in a consistent way
Different service owners can easily interpret a metric differently, such as what qualifies as an error. It’s just as common to end up with different naming conventions, where one service tracks “latency” while another tracks “response time.” Without standardization, you’ll end up with an inaccurate view of the health of the entire system, and automated reporting becomes far harder to set up.
As the SRE, it’s your job to make sure that there’s one source of truth for how the four golden signals are measured, from what’s actually being tracked down to the specific words that your team uses to refer to your metrics. Doing this work ahead of time will set you up for success as your number of microservices grows—you’ll have a playbook to hand any new engineer explaining exactly what they need to report, and how.
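One way to keep a single source of truth is to encode the agreed names in a small, shared check that any service can be audited against. The sketch below is only an illustration: the canonical names, the alias table, and the `audit_service` helper are assumptions made for this example, and your own playbook would define its own vocabulary.

```python
# Canonical names every service is expected to report (assumed convention).
CANONICAL_SIGNALS = ("latency", "traffic", "errors", "saturation")

# Known alternative spellings mapped to the canonical name (illustrative only).
ALIASES = {
    "response_time": "latency",
    "requests_per_second": "traffic",
    "error_rate": "errors",
    "cpu_utilization": "saturation",
}


def canonical_name(reported_name):
    """Return the canonical signal for a reported metric name, or None."""
    key = reported_name.strip().lower().replace(" ", "_").replace("-", "_")
    if key in CANONICAL_SIGNALS:
        return key
    return ALIASES.get(key)


def audit_service(reported_names):
    """List missing signals, names to rename, and names the playbook does not know."""
    found = set()
    rename = {}
    unknown = []
    for name in reported_names:
        canon = canonical_name(name)
        if canon is None:
            unknown.append(name)
            continue
        found.add(canon)
        if canon != name:
            rename[name] = canon
    missing = [s for s in CANONICAL_SIGNALS if s not in found]
    return {"missing": missing, "rename": rename, "unknown": unknown}
```

For example, a service reporting `"Response Time"`, `"traffic"`, `"errors"`, and `"queue_depth"` would be told to rename the first metric to `latency`, that `saturation` is missing, and that `queue_depth` is not a recognized name.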
2. Automate, automate, automate
One of the key ways to exert influence as an SRE is to reduce the friction for engineers, and implementing the four golden signals is no different. By relying on your SRE toolkit, you can often find ways to automate the work and lighten the load. Many monitoring services (e.g. Datadog) support tagging, so you can label each monitor with the golden signal it tracks.
The end goal is a reporting system that’s self-service—where you don’t have to nag your engineering teammates to check whether they’re tracking the right things or ask where you can find their metrics.
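As one illustration of reducing friction, a shared decorator can record traffic, errors, and latency for any request handler, so service owners do not have to write instrumentation by hand. This is a minimal in-memory sketch: `METRICS`, `track_golden_signals`, and `handle_checkout` are hypothetical names, and a real setup would send these values to your monitoring service instead of a dictionary. Saturation is left out because it is usually read from the host or container rather than from individual requests.

```python
import time
from collections import defaultdict
from functools import wraps

# In-memory stand-in for a metrics backend (assumption for this sketch).
METRICS = defaultdict(lambda: {"requests": 0, "errors": 0, "ok_latency_s": []})


def track_golden_signals(service):
    """Decorator that records traffic, errors, and latency for a handler."""
    def decorator(handler):
        @wraps(handler)
        def wrapper(*args, **kwargs):
            stats = METRICS[service]
            stats["requests"] += 1  # traffic
            start = time.perf_counter()
            try:
                result = handler(*args, **kwargs)
            except Exception:
                stats["errors"] += 1  # errors; latency of failures is not mixed in
                raise
            stats["ok_latency_s"].append(time.perf_counter() - start)  # latency
            return result
        return wrapper
    return decorator


@track_golden_signals("checkout")
def handle_checkout(cart_total):
    """Hypothetical handler used only to demonstrate the decorator."""
    if cart_total < 0:
        raise ValueError("cart total cannot be negative")
    return {"charged": cart_total}
```

Because the decorator lives in one shared place, changing how a signal is measured means changing one function rather than every service.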
3. Track and celebrate your team’s progress
Measuring the four golden signals will give you a way to track your team’s progress towards your SLOs and performance goals. But it’s essential that you make that progress visible and share it with the whole organization. Not only will this ensure that everyone gets credit for their hard, sometimes overlooked work on performance, but you’ll also be able to point to the areas that need prioritization and make the case for taking time to invest in those improvements.
We designed Cortex’s Scorecards to help you automatically track progress towards your performance goals across all your services, in one central place. With Scorecards, you can report on the four golden signals and even roll up your results at the team level. Just set a Scorecard rule that tracks whether a service is passing its four golden signals, and then have your Scorecards reported by Teams. This can help you create a sense of shared accountability and celebrate your progress together.
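The idea of a pass/fail rule rolled up by team can be sketched generically. The code below is not Cortex's API or rule syntax; the data shape, `service_passes`, and `team_pass_rates` are hypothetical and only show the underlying logic: a service passes when all four signals meet their targets, and each team's score is the share of its services that pass.

```python
from collections import defaultdict

SIGNALS = ("latency", "traffic", "errors", "saturation")


def service_passes(checks):
    """A service passes only if every golden signal check is present and True."""
    return all(checks.get(signal, False) for signal in SIGNALS)


def team_pass_rates(services):
    """Return {team: fraction of that team's services passing all four checks}."""
    totals = defaultdict(lambda: [0, 0])  # team -> [passing, total]
    for service in services:
        counts = totals[service["team"]]
        counts[1] += 1
        if service_passes(service["checks"]):
            counts[0] += 1
    return {team: passed / total for team, (passed, total) in totals.items()}
```

A missing check counts as a failure here, which is a deliberate choice for this sketch: a service that does not report a signal should not appear healthy by default.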
The four golden signals are excellent metrics to track to monitor the health of your system. But it’s important to keep in mind that metrics are only as worthwhile as the results they help you achieve. At Cortex, we’re passionate about helping you turn metrics into actionable outcomes and making you more effective as an SRE. If you have any questions about how we can help, please don’t hesitate to get in touch at email@example.com.