Microservice Catalog
SRE

How to drive ownership in microservices

Keeping track of service ownership across all microservices is extremely important, but how should teams self-organize to achieve it, and where's the best place to start?

By
Cortex
-
May 20, 2021

Microservices have become the de facto architecture standard among modern software development teams. Netflix is widely known for its microservices architecture, and technologies like Microsoft Azure and Confluent place microservices at the center of their products' value-add. For teams that haven't adopted the architecture, there is a wealth of information available on why transitioning from a monolithic application to one defined by independently deployable services is critical to building a reliable product at scale.

As much as it's doted upon in the industry, however, adopting microservices in and of itself isn't quite enough to claim success. To ensure that your application architecture truly enforces a separation of concerns and is resilient to failure in the long run, software development teams must build systems of ownership, accountability, and traceability around it.

In this post, we'll focus on service ownership. Why is service ownership important? How should teams self-organize to achieve it? Where's the best place to start?

Software as a Product, not Project

Product and engineering leaders at the world's most successful technology companies will agree that an application should be regarded as a product, not a project. Traditional teams that treat software as a project may have one development team responsible for writing some piece of code, and another operations team responsible for maintaining that code in production. If you're familiar with Amazon's "you build it, you run it" mentality, you know that this split between building and running is organizationally risky at best, and a disastrous barrier to success at worst.

Inspired by Amazon CTO Werner Vogels in 2006, engineering teams have since worked towards a world where the developers who build a piece of software are also accountable for operating it: responding to late-night outages, fixing bugs, enhancing testing, and assisting with customer support tickets throughout its lifetime. This makes for high-quality code, increased agility, and stronger empathy with users. More than a decade later, this model continues to prove successful.

The Service Ownership Dream

In the context of microservices, the developer accountability that comes with software as a product is what service ownership is all about. Let's say you're on a DevOps team that runs some combination of the following: a collection of APIs, front-end component libraries, a Grafana and Prometheus stack for monitoring, and a set of Docker images for users to deploy your application. Adopting a microservices architecture means decentralizing everything such that each service is independent and has an explicit boundary from the rest. But what does service ownership in this context actually mean, and what does it not mean?

Service ownership means that there is a clear person or group of people who are ultimately held accountable for the success of each service. Successful service ownership does NOT mean that the owners of a service are or should be the only humans modifying its code. Larger teams typically host more than 200 services (some of which will inevitably have interdependencies), and it's critical that the knowledge around each service doesn't live in a silo. Service owners are simply responsible for making sure that the rest of the team has access to the information they need to properly modify it.

At Cortex, we'd say that if service ownership is done well, the following is true:

  • Every single service is defined, documented, and tracked in a single place. If you're not using a tool like Cortex, this should at the very least be a spreadsheet.
  • There are no orphaned services. Every single service has a distinct owner, and an explicit purpose and boundary.
  • The team responsible for a service is also responsible for on-call rotations and support escalations in case of a bug, failure, or incident.
  • Each service has a defined SLO, and owners of that service are held accountable for hitting that target.
  • Information is shared and well-understood across teams. Engineering managers, SREs, and developers understand and are committed to this model.
  • GitHub repositories have clear code owners that correspond to the owners of that service, and CI/CD workflows reflect that ownership.
  • Your organization has a culture of empathy and accountability. Team members assume good intent and own their impact.

This list might sound intuitive, but it's much easier said than done. If your product covers a large surface area, you might be surprised to find a service running in production that hasn't been touched in 5 years. If you're a high-growth startup, you might find that constant role changes and pivots in strategy make it hard to define ownership that sticks. It's easy to not invest in this process, but the long-term risks are high.
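
A few of the conditions above (no orphaned services, a defined SLO for each service, nothing left untouched for years) lend themselves to an automated check. The following is a minimal sketch, not a description of how Cortex works: the `Service` fields, the `audit` function, the one-year staleness threshold, and the sample service names are all assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class Service:
    # All field names are hypothetical, chosen for illustration only.
    name: str
    owner: Optional[str]         # team or person accountable for the service
    slo_target: Optional[float]  # availability target, e.g. 0.999
    last_commit: date            # date of the most recent change


def audit(services, today, stale_after_days=365):
    """Return a dict mapping each problematic service name to its issues."""
    problems = {}
    for svc in services:
        issues = []
        if not svc.owner:
            issues.append("no owner")
        if svc.slo_target is None:
            issues.append("no SLO")
        if (today - svc.last_commit).days > stale_after_days:
            issues.append("stale")
        if issues:
            problems[svc.name] = issues
    return problems


# Made-up catalog entries for demonstration.
catalog = [
    Service("billing-api", "payments-team", 0.999, date(2021, 4, 30)),
    Service("legacy-exporter", None, None, date(2016, 2, 1)),
    Service("ui-components", "frontend-team", None, date(2021, 5, 1)),
]

report = audit(catalog, today=date(2021, 5, 20))
for name, issues in report.items():
    print(f"{name}: {', '.join(issues)}")
```

Passing `today` in explicitly, rather than reading the clock inside `audit`, keeps the check reproducible. In practice the catalog would come from wherever your team tracks its services, not from a hard-coded list.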

The Risk of Not Investing

Here at Cortex, we've seen teams struggle with two primary consequences of not investing in this process:

  1. Higher-frequency Incidents. If your services lack ownership, failures become more common and much more difficult to diagnose and address.
  2. Unintentional Impact on Users. If you have an orphaned service that at some point added value to users but is no longer owned or maintained, you may risk an active customer installing that piece of software or leveraging that feature in a way that degrades their experience with your product and goes beyond what you officially support.

These risks might sound obvious, but where's the best place to start in pushing for service ownership?

Where to Start

Here at Cortex, we work with teams of all sizes to make sense of and manage their microservices architecture. We encourage all organizations to make knowledge less tribal. If you've adopted microservices but haven't quite reached service ownership nirvana, there are some easy places to start.

  1. Create a spreadsheet that lists every single GitHub repository in your organization. You might want to track the following columns: Repository, Description, Owner(s), Last Updated, Keep (Yes/No), Link to Documentation
  2. Make note of all repositories that are no longer used or needed, and archive them or set a clear end-of-life date while your team evaluates.
  3. Identify one owner (a team or a person) for each repository, and make sure they're listed as official code owners. If you're not sure who should be the owner, reach out to recent contributors.
  4. Lock that spreadsheet and share it widely. Pin it to the appropriate Slack channel, send an email to your team, etc.
  5. Create a dedicated Slack channel for each service so that the rest of your team knows where to go with questions or issues. You can also integrate tools like PagerDuty to get alerted in case of a failure, or CircleCI for event-based notifications on builds.
  6. Write dedicated documentation for each service, whether that be via READMEs, Google Docs, or Notion pages. If a new engineer joins your team, they should be able to quickly identify all services and learn about each of their functions. Make sure to cover things like: What does this service do? What languages/technologies are used? How is it deployed? How is it versioned? What are its availability SLOs? What is the QA process for new changes? Are there documented performance benchmarks?
  7. Create a production-readiness checklist for new services. What's the process for creating a new service? What criteria need to be met?
  8. Keep track of notable incidents, outages or failures for each service — and hold the service owners accountable for writing a blame-free post-mortem that involves other stakeholders as needed (e.g. Customer Support, Product, etc.).

Anything on this list will go a long way in creating a healthy, agile development team and a reliable product. If these steps are helpful and you find yourself maintaining spreadsheets for more than a few months, consider a tool like Cortex to help you stay organized in an automated way. Spreadsheets are great in the short-term, but they require manual upkeep and can quickly become so out-of-date that they defeat their own purpose of keeping your team informed and accountable.
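
If you'd rather generate the step 1 inventory than type it by hand, it can be sketched in a few lines. This is an illustrative sketch under stated assumptions: the column names come from step 1, but the input record keys (`name`, `owners`, `archived`, `docs_url`, and so on), the `UNASSIGNED` placeholder, and the sample repositories are hypothetical, and fetching the records from GitHub is left out.

```python
import csv
import io

# Column names taken from step 1 of the list above.
COLUMNS = [
    "Repository",
    "Description",
    "Owner(s)",
    "Last Updated",
    "Keep (Yes/No)",
    "Link to Documentation",
]


def build_inventory(repos):
    """Render repository records as CSV text using the step 1 columns."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS, lineterminator="\n")
    writer.writeheader()
    for repo in repos:
        writer.writerow({
            "Repository": repo["name"],
            "Description": repo.get("description", ""),
            # Flag repositories without an owner so they stand out (step 3).
            "Owner(s)": "; ".join(repo.get("owners", [])) or "UNASSIGNED",
            "Last Updated": repo.get("last_updated", ""),
            # Assumption: archived repositories are the ones not worth keeping.
            "Keep (Yes/No)": "No" if repo.get("archived") else "Yes",
            "Link to Documentation": repo.get("docs_url", ""),
        })
    return buffer.getvalue()


# Made-up repository records for demonstration.
repos = [
    {
        "name": "billing-api",
        "description": "Invoices and payments",
        "owners": ["payments-team"],
        "last_updated": "2021-04-30",
        "docs_url": "https://example.com/docs/billing-api",
    },
    {
        "name": "legacy-exporter",
        "description": "Old metrics exporter",
        "owners": [],
        "last_updated": "2016-02-01",
        "archived": True,
    },
]

inventory_csv = build_inventory(repos)
print(inventory_csv)
```

The resulting text can be imported into any spreadsheet tool. Regenerating it on a schedule addresses some of the manual-upkeep problem, though it still depends on the ownership data in the source records being accurate.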

Beyond Service Ownership

There are a myriad of boxes to check for microservices architecture to be deeply impactful and successful. If you're thinking about transitioning to a service-oriented architecture, send us a note at team@getcortexapp.com — we'd love to chat. If Cortex isn't the right fit, we'd still love to learn from your experience.
