A guide to the best SRE tools
SREs are responsible for evaluating tools that can help reduce toil on their teams and make their applications more reliable for end-users. There are a lot of SRE tools out there, and it can be hard to know which ones are the most important to consider. We’ve compiled this guide to highlight the key categories of SRE tools and help you find the right fit for your team.
Monitoring tools
Monitoring tools are used to generate valuable metrics and insights about an application and help SREs do everything from creating benchmarks to debugging outages. Luckily, there are monitoring tools for practically anything you might want to measure relating to your application. These tools usually cover one or more of the following areas:
Application performance monitoring (APM):
Models the response times perceived by end users of your application and compares this against performance benchmarks to detect latency and outages.
Network monitoring:
Examines all the incoming and outgoing traffic routed through your network to help with load balancing, debug client/server issues, and stop network attacks such as denial-of-service (DoS) attacks before they cause damage.
Resource / infrastructure monitoring:
Looks at the consumption rates and SLOs for the components of your application, like the CPU load for your Kubernetes clusters, to help with resource management.
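To make the APM idea concrete, here is a minimal sketch of comparing observed response times against a performance benchmark. The sample latencies, the 300 ms benchmark, and the helper name `percentile` are all assumptions for illustration; a real APM tool would collect and aggregate this data for you.

```python
import math
from typing import Sequence


def percentile(samples: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = math.ceil(pct / 100 * len(ordered))
    return ordered[max(rank, 1) - 1]


# Hypothetical response times (in milliseconds) and a hypothetical benchmark.
latencies_ms = [120, 95, 110, 480, 105, 98, 102, 115, 101, 99]
BENCHMARK_P95_MS = 300

p95 = percentile(latencies_ms, 95)
breached = p95 > BENCHMARK_P95_MS
print(f"p95={p95} ms, benchmark breached: {breached}")
```

A high percentile is used here instead of the average because a handful of slow requests can be hidden by a healthy-looking mean.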
What to look for in a monitoring tool
No matter what type of monitoring tool you’re looking at, you’ll want to investigate the following important features:
- Alerts: The tool should be able to generate alerts when metrics approach and exceed the thresholds that your SRE team identifies.
- Automated incident response: Ideally, your tools will also try to automatically fix any detected issues where possible, helping reduce your engineering overhead.
- Logging: Your monitoring tool should create detailed logs that contain all the information you would need to debug an issue, with configurable highlighting for easier consumption.
- Visualizations and dashboards: Monitoring tools offer various kinds of graphics and charts to help you interpret the data at a glance or communicate with your stakeholders. Depending on your use cases, you might want to look for tools that support certain types of data round-ups and visual representations.
- Supported integrations: Your tool should work seamlessly with your application components and services.
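As a rough illustration of the alerting feature, the sketch below separates a metric that is approaching a threshold from one that has exceeded it. The `Threshold` class, the metric name, and the numeric limits are hypothetical and not taken from any particular monitoring tool.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Threshold:
    """Hypothetical alert rule with two levels for one metric."""
    metric: str
    warning: float   # the metric is approaching the limit
    critical: float  # the metric has reached or exceeded the limit


def evaluate(threshold: Threshold, value: float) -> Optional[str]:
    """Return "critical", "warning", or None for a single metric sample."""
    if value >= threshold.critical:
        return "critical"
    if value >= threshold.warning:
        return "warning"
    return None


# Assumed limits chosen by the SRE team for this example.
cpu_rule = Threshold(metric="cpu_load_percent", warning=75.0, critical=90.0)
for sample in (42.0, 80.5, 97.2):
    print(cpu_rule.metric, sample, evaluate(cpu_rule, sample))
```

Keeping a separate warning level gives the team time to react before the critical threshold pages someone.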
Common monitoring tools
Here are some of the monitoring tools that we hear about often from SREs:
Open-source tools
- Prometheus: A popular monitoring tool with an active community and broad range of features.
- Grafana: Helps you make dashboards and visualizations by integrating with other monitoring tools (often combined with Prometheus).
Tools you can purchase
- AppDynamics: A platform focusing on APM (though it also offers network and resource monitoring).
- Datadog: Offers a suite of monitoring tools to cover many different use cases.
- Splunk: A generalized tool for managing large-scale data and deriving actionable insights.
On-call management tools
SREs and engineers are often on call during working and non-working hours, ready to react immediately to any issue that threatens the system’s health. Being on call can be stressful, and you want to be careful not to burn out your team. Fortunately, there are many tools SREs can use to reduce the burden and make being on call a bit less painful.
What to look for in an on-call management tool
While most on-call tools also help your team manage the incident itself, we’ll be covering incident management in the next section. Here are some features specific to on-call rotations to help you prioritize your assessment:
Scheduling on-call rotations:
Your tool should help you distribute the on-call duty equally and fairly across your team, with flexibility in case someone needs to trade on-call rotations at the last minute.
Sharing on-call calendars:
Sometimes an issue touches multiple components, and several on-call engineers need to be involved in the incident response. For that reason, it’s important that your tool provides a centralized way to view the on-call calendars across your organization.
Alerting on-call engineers:
This one is pretty obvious, but it’s absolutely essential. Your on-call tool needs to integrate with your monitoring tools to deliver alerts, and it must provide a payload with enough context so that the on-call engineer can address the issue. Your tool should also give you rules-based configurations for alert routing and controls to combat spammy alerts, which lead to burnout.
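The scheduling feature described above can be sketched as a simple round-robin rotation with a last-minute swap. This is only an illustration under assumed names: the engineers, the start date, and the functions `build_rotation` and `swap_shifts` are hypothetical, and real on-call tools handle time zones, overrides, and holidays as well.

```python
from datetime import date, timedelta
from itertools import cycle
from typing import Dict, List


def build_rotation(engineers: List[str], start: date, weeks: int) -> Dict[date, str]:
    """Assign one engineer per week, cycling through the team in order."""
    schedule: Dict[date, str] = {}
    for week, engineer in zip(range(weeks), cycle(engineers)):
        schedule[start + timedelta(weeks=week)] = engineer
    return schedule


def swap_shifts(schedule: Dict[date, str], first: date, second: date) -> None:
    """Trade two shifts in place, for example when someone needs cover at short notice."""
    schedule[first], schedule[second] = schedule[second], schedule[first]


team = ["ana", "bo", "chen"]  # hypothetical engineers
start = date(2024, 1, 1)      # hypothetical first shift
schedule = build_rotation(team, start, weeks=6)

# "ana" and "bo" trade their first shifts; everyone still covers two weeks.
swap_shifts(schedule, start, start + timedelta(weeks=1))
for shift_start, engineer in sorted(schedule.items()):
    print(shift_start.isoformat(), engineer)
```

Because a swap only exchanges two entries, the total load per engineer stays equal, which is the fairness property you want from a rotation.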
Common tools for on-call management
Incident management tools
Inevitably, failures will occur in your system, and the person on call is going to need powerful tools to help fix issues in the moment and make sure that they don’t happen again. According to Google’s SRE Book, managing an incident successfully comes down to three things: (1) clear escalation paths, (2) well-defined response procedures, and (3) a blameless post-mortem culture. SREs can find tools that codify these practices and help their teams communicate better and resolve incidents faster.
What to look for in an incident management tool
- Issue triage: Your tool should help you configure issue types and severity levels so that you know how to prioritize and when to escalate.
- Escalation pathways: Whether issues are manually or automatically escalated, you should be able to quickly get the right people involved and easily share all of the context they need to understand the issue.
- Runbooks / workflows: Depending on the issue classification, your tool should hand over the runbook of predefined steps the on-call engineers need to follow. Your tool should also be able to automatically execute workflows if certain conditions are met.
- Real-time communication: Your tool should integrate with the apps your team uses to communicate, like Slack, Jira, or even Zoom, to share information automatically and free you up to focus on the issue at hand. It should also create a historical document of the incident response that your team can refer to later on.
- Data-driven issue analysis: Ideally, your tool will give you all the data you need to understand what happened and focus on how it can be avoided next time, instead of placing blame on individuals.
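To show how triage and escalation pathways fit together, here is a small sketch that maps an issue to a severity level and then to the people who should be involved. The severity names, the 5% error-rate cutoff, and the responder roles are assumptions for this example, not the behavior of any specific incident management tool.

```python
from enum import IntEnum
from typing import Dict, List


class Severity(IntEnum):
    """Hypothetical severity levels; a lower number means a more urgent issue."""
    SEV1 = 1
    SEV2 = 2
    SEV3 = 3


# Assumed escalation path: who gets involved at each severity level.
ESCALATION_PATH: Dict[Severity, List[str]] = {
    Severity.SEV3: ["primary-on-call"],
    Severity.SEV2: ["primary-on-call", "secondary-on-call"],
    Severity.SEV1: ["primary-on-call", "secondary-on-call", "incident-commander"],
}


def triage(customer_facing: bool, error_rate: float) -> Severity:
    """Classify an issue using two illustrative signals."""
    high_error_rate = error_rate >= 0.05  # assumed cutoff
    if customer_facing and high_error_rate:
        return Severity.SEV1
    if customer_facing or high_error_rate:
        return Severity.SEV2
    return Severity.SEV3


def responders(severity: Severity) -> List[str]:
    """Look up who should be involved for a given severity."""
    return ESCALATION_PATH[severity]


severity = triage(customer_facing=True, error_rate=0.12)
print(severity.name, responders(severity))
```

Writing the rules down as data, instead of leaving them in people's heads, is what lets a tool escalate automatically and consistently.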
Common incident management tools
All of the on-call management tools previously mentioned also feature some form of incident management. Additional dedicated tools for this purpose include Blameless, ServiceNow, and Netflix’s open-source crisis management tool, Dispatch.
Configuration tools
A main function of the SRE role is to automate away the mindless busywork of software development. Automated configurations reduce toil and ensure that the same steps are always repeated when provisioning, managing, and destroying resources.
What to look for in a configuration tool
- Infrastructure as code: This paradigm means that your infrastructure specifications live in a file, which can be run to recreate your configuration in the event of an outage. Most automation tools will have their own language for declaring the desired configuration.
- Execution planning: Your tool should offer a preview of the effects of your changes before you run the automation, to make sure there are no unanticipated consequences.
- Automated workflows: The tool should allow you to enforce steps that are always followed when performing a task, like deploying an application, and let you run these workflows automatically.
- Supports container orchestration: If you use a container orchestration service like Kubernetes to deploy your services, your configuration tool should be compatible.
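Execution planning is easier to picture with a toy example. The sketch below compares a declared, desired configuration with the current state and lists the actions that would be taken, without applying any of them. The resource names and the `plan` function are hypothetical; real configuration tools use their own declaration languages and far richer diffing.

```python
from typing import Dict, List, Tuple

Resources = Dict[str, Dict[str, object]]


def plan(current: Resources, desired: Resources) -> List[Tuple[str, str]]:
    """Preview the changes needed to move from current to desired, without applying them."""
    actions: List[Tuple[str, str]] = []
    for name in sorted(desired.keys() - current.keys()):
        actions.append(("create", name))
    for name in sorted(desired.keys() & current.keys()):
        if desired[name] != current[name]:
            actions.append(("update", name))
    for name in sorted(current.keys() - desired.keys()):
        actions.append(("destroy", name))
    return actions


# Hypothetical state: what is running now versus what the config file declares.
current_state: Resources = {"web": {"replicas": 2}, "cache": {"size_gb": 1}}
desired_state: Resources = {"web": {"replicas": 3}, "queue": {"partitions": 4}}

for action, resource in plan(current_state, desired_state):
    print(f"{action}: {resource}")
```

Reviewing this list before running the automation is what surfaces surprises, such as a resource that is about to be destroyed.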
Common configuration tools
Microservice catalog tools
To make sure that incidents are avoided in the first place, SRE teams should look for tools that help them enforce high-level policies and best practices. A key tenet of microservice catalog tools should be governance over your service-oriented architecture. Governance refers to the rules and processes the team follows, as well as who is responsible for reliability across different services and applications. This helps avoid tribal knowledge, and is especially important when teams work remotely and manage many different microservices.
What to look for in a microservice catalog tool
- Supports a production readiness review (PRR): A key process for SREs is a PRR: stepping through a predefined checklist to make sure that the software is ready to deploy. This might include things like ensuring that you’ve set up load balancing and end-to-end testing.
- Transparent ownership: This tool should create a catalog of services with their owners clearly identified, so it’s easy to look up who is responsible for what.
- Integrates with your existing tools: All the best practices you’re trying to enforce should be visible within one tool. This means, for example, that your APM set-up and on-call rotations should appear alongside the catalog of your services, so you don’t have to refer to multiple tools to know where to find important data.
- Automatic updates: When your configurations or policies change, you shouldn’t need to make manual edits in the tool.
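A production readiness review can be thought of as a checklist evaluated against each service in the catalog. The sketch below is a simplified illustration: the check names, the service fields, and the `failed_checks` function are hypothetical and do not represent how Cortex or any other catalog tool works.

```python
from typing import Callable, Dict, List

Service = Dict[str, object]

# Assumed checklist; each check takes a service record and returns True when it passes.
CHECKLIST: Dict[str, Callable[[Service], bool]] = {
    "has an owner": lambda service: bool(service.get("owner")),
    "load balancing configured": lambda service: service.get("load_balanced") is True,
    "end-to-end tests exist": lambda service: service.get("e2e_tests") is True,
}


def failed_checks(service: Service) -> List[str]:
    """Return the names of the checklist items this service does not yet satisfy."""
    return [name for name, check in CHECKLIST.items() if not check(service)]


# Hypothetical catalog entry.
payments: Service = {
    "name": "payments",
    "owner": "team-billing",
    "load_balanced": True,
    "e2e_tests": False,
}

missing = failed_checks(payments)
ready = not missing
print(f"{payments['name']} ready: {ready}, missing: {missing}")
```

Storing the owner on the same record as the checks keeps ownership and readiness visible in one place.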
Common microservice catalog tools
The most common governance tool we’ve seen is unfortunately a giant spreadsheet (or, worse, several spreadsheets!). Since this is the opposite of the SRE philosophy, we’ve created Cortex to fill the gap. Our microservice catalog makes it easy to see your services and owners in one place, and we foster accountability and show you how to improve using a gamified scorecard. Other alternatives include Spotify Backstage and OpsLevel.
With the right tools, your SRE team will be able to spend their energies improving your product’s reliability and performance instead of dealing with toil and overhead. We hope this article has been a helpful overview of SRE areas of focus and popular tools. If you’re interested in giving Cortex a try, request a demo here.