You’re the engineer on call, and you get woken up at 3 AM by a PagerDuty alert that the app is down. Groggy and annoyed, you look at the message and have no idea what service it’s referring to, or where to even begin looking.
After nearly an hour, you can’t figure out who owns the service or how to get it up and running. So you have to call that senior engineer on your team—the person who gets called every time there’s a tricky question or problem. Thankfully, they answer you in the middle of the night and help you bring the app back up. But in a month, the same scenario happens again. Only this time, when you go to call that senior engineer, they’re no longer there. They quit to work at a company that cares about reliability.
Leading tech companies empower SREs to build a culture of reliability and ownership—and prevent problems like these from happening. Even if you’re still trying to hire your first SRE, it’s never too early to start evangelizing the SRE mindset. Here are some simple steps to get started:
1. Evangelize the narrative of why the systems you work on matter
At the end of the day, your systems are there to serve your customers and users. Leadership and teammates who embody these values are incredibly important. People should worry about letting their customers down, not about being yelled at by their manager or receiving a bad performance review.
2. Set explicit expectations about uptime
When you say that your app needs to have “99.99% uptime,” it doesn’t have the same impact as “52.6 minutes of downtime a year.” Translate uptime numbers so they’re tangible—and on people’s minds every time there’s an outage.
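To make that translation repeatable, here is a minimal sketch in Python that converts an uptime target into a yearly downtime budget. The function name and the list of targets are illustrative assumptions, and the year is taken as 365.25 days on average, which is what yields the 52.6-minute figure for 99.99%.

```python
# Average year length in minutes (365.25 days, to account for leap years).
MINUTES_PER_YEAR = 365.25 * 24 * 60


def allowed_downtime_minutes(uptime_percent: float,
                             period_minutes: float = MINUTES_PER_YEAR) -> float:
    """Return how many minutes of downtime an uptime target permits per period."""
    if not 0 <= uptime_percent <= 100:
        raise ValueError("uptime_percent must be between 0 and 100")
    return period_minutes * (1 - uptime_percent / 100)


if __name__ == "__main__":
    # Illustrative targets; swap in whatever your team has committed to.
    for target in (99.0, 99.9, 99.99, 99.999):
        minutes = allowed_downtime_minutes(target)
        print(f"{target}% uptime allows about {minutes:.1f} minutes of downtime a year")
```

Passing a different `period_minutes` (for example, the minutes in a 30-day month) gives the budget for a shorter window, which can be easier to keep in mind during an incident.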
3. Don’t brush problems under the rug
After the dust settles from an incident, it’s important to get together and talk about it. Be transparent about what happened, which person or team made the error, and what the impact on users was. The idea isn’t to point fingers or assign blame, but rather to bring the team together so that you can learn, evolve, and figure out how to prevent it from happening again.
Here are some questions you can ask:
Who wrote the code?
How was it tested?
Should our automated testing have caught this?
Who code reviewed this?
Were they the right people to code review it?
Why wasn’t it caught in the staging environment?
Who released it, and who went through the checklist that validated the release?
Having this process lets everyone on the team know that mistakes are part of the job, and there’s a standard way that problems are handled. This is a great way to help engineers feel free to do their best work without fear or avoidance.
4. Define individual owners for every piece of the system
If there’s any confusion about who owns what within your engineering org, it’s a massive problem that you need to address right away. Not only does every part of the system need an owner, but those owners must also be empowered to fix issues and know that they’re the ones on the hook.
5. Invest in documentation, runbooks, and logs
The owner of each piece of the application is also responsible for making sure that documentation is up to date, runbooks are clear and easy to follow, and logs are readily accessible. When you’re getting paged at 3 AM about a service you’ve never heard of before, these resources are key.
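One way to keep ownership and documentation from drifting is to check for gaps automatically. The sketch below is a hypothetical example in Python, not a description of any particular tool: the `Service` record, its field names, the example services, and the URLs are all made up for illustration. It flags every service that is missing an owner, documentation, or a runbook.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Service:
    """One entry in a hypothetical service catalog."""
    name: str
    owner: Optional[str] = None        # team or person on the hook
    docs_url: Optional[str] = None     # where the documentation lives
    runbook_url: Optional[str] = None  # step-by-step recovery instructions


# Fields every service is expected to fill in (an assumption for this sketch).
REQUIRED_FIELDS = ("owner", "docs_url", "runbook_url")


def missing_fields(service: Service) -> List[str]:
    """Return the required fields that are empty or blank for this service."""
    return [
        field for field in REQUIRED_FIELDS
        if not (getattr(service, field) or "").strip()
    ]


def audit(catalog: List[Service]) -> Dict[str, List[str]]:
    """Map each incomplete service name to the fields it is missing."""
    report: Dict[str, List[str]] = {}
    for service in catalog:
        gaps = missing_fields(service)
        if gaps:
            report[service.name] = gaps
    return report


# Made-up example data.
CATALOG = [
    Service(
        "checkout-api",
        owner="payments-team",
        docs_url="https://wiki.example.com/checkout-api",
        runbook_url="https://wiki.example.com/checkout-api/runbook",
    ),
    Service("billing-worker", owner="payments-team"),
    Service("legacy-mailer"),
]

if __name__ == "__main__":
    for name, gaps in audit(CATALOG).items():
        print(f"{name}: missing {', '.join(gaps)}")
```

Running a check like this on a schedule, or as part of code review, would surface the gaps before someone discovers them during a 3 AM page.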
6. Prioritize the delivery pipeline like it’s a key feature
You should always be thinking about how to build safeguards around your workflows, like automated testing or new tools. This may mean you need to go toe-to-toe with product managers to make sure the work required to build a robust delivery system isn’t getting kicked down to the backlog.
7. Get business buy-in by using terms they understand
Unreliable code is debt. And debt is something all leaders understand, regardless of their technical abilities. Stripe published a survey in 2018 that found that the average developer spends more than 17 hours a week dealing with maintenance issues, costing $300 billion per year globally. If you’re pushing to invest in SRE practices, those numbers might help you make your case.
Thankfully, having the right tools can make adopting the SRE mindset a lot easier. We built Cortex to be the go-to place for finding information about service owners, documentation, and runbooks. We also map out the dependencies across complex systems and provide templates for spinning up services. We encourage you to take a look at our website to learn more.
This article was based on a talk given at Conf42: SRE 2021 by Cristina Buenahora Bustamante, Founding Engineer at Cortex. Feel free to reach out with any questions to cristina@getcortexapp.com.