Best Practice

How do you measure operational maturity?

An organization with high operational maturity can build reliable products and services faster. Continuing our series on top software maturity scorecards, we're taking a closer look at how to measure and improve operational maturity using Cortex scorecards, initiatives, and out-of-the-box integrations.

By
Jeff Schnitter
-
October 26, 2023

All of the most reliable software is driven by great operations. Your organization’s operational maturity is a measure of how consistently you apply best practices for building reliable software.

Without tracking your operational maturity, it’s extremely difficult to know where and how to improve before an incident costs you a customer. With an operational maturity scorecard for each service in your catalog, you can measure how your software stacks up both in detail and in aggregate.

Operational maturity is one of the top three scorecards we recommend organizations create. In this article, we’ll show you how to define and measure operational maturity for your organization using Cortex.

What is operational maturity?

Operational maturity is how well-developed your organization’s practices are when it comes to operating reliable software products and services. Operational maturity is a long-term view of your team’s performance, not a snapshot.

Startups tend to be earlier in their operational maturity journey than more established organizations. Even within the same company, different services, teams, or domains usually exhibit different levels of operational maturity.

The benefits of operational maturity include:

  • Fewer user-facing incidents: Reducing the number of incidents that your end users see increases the trust they have in your product or service, reducing churn. 
  • Higher iteration velocity: Operationally mature organizations thoroughly test their code, and are more likely to catch bugs and issues before they make it to production. This gives your teams higher confidence when making changes and increases your velocity.
  • Faster incident resolution: A high level of operational maturity means that when an incident happens, you have a better chance of resolving it quickly. For an example, see how a customer used Cortex to slash their mean time to resolve (MTTR).
  • Reputational advantage: Implementing best practices and minimizing incidents builds your reputation as an organization that can be relied on. Reputation can be a deciding factor when comparing software vendors, and can help you win more customers.

How do you measure operational maturity?

To quantify operational maturity, identify the best practices that your organization should follow when operating software, and track compliance at a per-service level. Not all best practices apply universally, but below are a few that we believe almost all organizations should start with:

  • Presence of a README: A README is a file that gives basic information on what a repository contains and how to work with its code. A README is important for onboarding, managing incidents, and building microservice architectures.
  • Defined owners: Knowing who to contact in an emergency is critical. The owners of a service should be defined in a central, widely-known location so that issues can be escalated and resolved promptly.
  • Automated testing: A good test suite allows you to be confident that changes you make to your service are unlikely to introduce bugs. For maximum impact, this test suite should be run automatically as part of your build/deployment system.
  • Code coverage: This is a measure of how much of your application’s code (lines or branches, for example) is exercised by automated tests. High code coverage helps ensure that issues don’t slip through the cracks.
  • Defined on-call rotation: When an incident happens, the escalation path needs to be clearly defined and include people with detailed knowledge of the affected services. Without a defined on-call rotation for a service, incident resolution can be delayed.
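
As a rough illustration of tracking compliance at a per-service level, the checks above can be expressed as pass/fail rules. This is a minimal sketch, not how Cortex implements scorecards: the `Service` record, its field names, and the 80% coverage threshold are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Service:
    """Hypothetical per-service record; field names are illustrative only."""
    name: str
    has_readme: bool
    owners: list
    tests_run_in_ci: bool
    code_coverage: float  # fraction of code exercised by tests, 0.0 to 1.0
    has_on_call_rotation: bool


# Each rule maps one best practice to a pass/fail check.
# The 0.8 coverage threshold is an arbitrary example value.
RULES = {
    "README present": lambda s: s.has_readme,
    "Owners defined": lambda s: len(s.owners) > 0,
    "Automated testing": lambda s: s.tests_run_in_ci,
    "Code coverage >= 80%": lambda s: s.code_coverage >= 0.8,
    "On-call rotation defined": lambda s: s.has_on_call_rotation,
}


def check_compliance(service):
    """Return the pass/fail result of every rule for one service."""
    return {rule: bool(check(service)) for rule, check in RULES.items()}


def compliance_rate(service):
    """Fraction of rules that the service passes."""
    results = check_compliance(service)
    return sum(results.values()) / len(results)


# Example data for a made-up service.
payments = Service(
    name="payments",
    has_readme=True,
    owners=["team-payments"],
    tests_run_in_ci=True,
    code_coverage=0.65,
    has_on_call_rotation=False,
)

failing = [rule for rule, ok in check_compliance(payments).items() if not ok]
print(failing)
print(compliance_rate(payments))
```

Running the same rules against every service in a catalog gives both the detailed view (which rules a given service fails) and the aggregate view (the pass rate across services).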

In addition to tracking compliance with best practices, you can also measure operational maturity by looking at outcomes and outputs. Here are some examples: 

  • Number of known vulnerabilities: Security tools can scan your dependencies for known vulnerabilities. A service with many known vulnerabilities is likely to be operationally immature. 
  • Achievement of SLOs: Service Level Objectives (SLOs) define the target uptime and performance of your service. Services with high operational maturity should miss SLOs infrequently, if at all.
  • Mean Time To Resolve (MTTR): MTTR measures how long it takes, on average, for issues reported during on-call rotations to be resolved. Operationally mature organizations are well-prepared for issues and are likely to resolve them faster.
  • Versions of critical libraries: Some libraries are security critical, and their updates need to be applied quickly. Measuring how far behind the versions of these libraries are in production can give a view on the maturity of your operations.
  • Number of user-facing incidents: The number of currently open user-facing incidents is an important indicator of operational maturity, or lack thereof. If you’re applying best practices consistently, then incidents should be rare and resolved quickly.
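
Two of these outcome metrics reduce to simple arithmetic. The sketch below assumes that incidents are available as (opened, resolved) timestamp pairs and that an SLO is expressed as a target success ratio; the function names and sample figures are hypothetical, and real data would come from your incident and monitoring tools.

```python
from datetime import datetime, timedelta


def mean_time_to_resolve(incidents):
    """Average of (resolved - opened) across a list of resolved incidents."""
    if not incidents:
        raise ValueError("need at least one resolved incident")
    total = sum((resolved - opened for opened, resolved in incidents), timedelta())
    return total / len(incidents)


def slo_met(successful_requests, total_requests, target):
    """True when the measured success ratio meets or exceeds the SLO target."""
    if total_requests == 0:
        return True  # no traffic, so nothing violated the objective
    return successful_requests / total_requests >= target


# Made-up sample data: one incident took 60 minutes, the other 30.
incidents = [
    (datetime(2023, 10, 1, 9, 0), datetime(2023, 10, 1, 10, 0)),
    (datetime(2023, 10, 5, 14, 0), datetime(2023, 10, 5, 14, 30)),
]

mttr = mean_time_to_resolve(incidents)
print(mttr)  # 0:45:00

# 99,950 successes out of 100,000 requests is 99.95%, above a 99.9% target.
print(slo_met(99_950, 100_000, target=0.999))  # True
```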

Service maturity scorecard

Tracking the above metrics in one place can be difficult. That’s where Cortex comes in: no matter where your data lives, you can ingest it into Cortex through our 40+ integrations and our plugins feature. Once the data is in Cortex, you can use scorecards to automatically quantify your services’ operational maturity.

Scorecards are sets of rules that you define that run against your services and tell you how well they’re doing. An example operational maturity scorecard might use data from Git (GitHub, GitLab, or Bitbucket), SonarQube, PagerDuty, Jira, and others. It rates services as bronze, silver, or gold (these are tiers that you create) based on metrics the organization has defined. This allows you to see at a glance how operationally mature your services are, and gives your teams achievable targets to aim for.
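
To make the tier idea concrete, here is a minimal sketch of how rule results could roll up into bronze, silver, or gold. It is an illustration under stated assumptions rather than Cortex’s implementation: the rule names, the contents of each tier, and the convention that a service must pass every lower tier before earning a higher one are all hypothetical.

```python
# Tiers in ascending order; each tier lists the rules it adds.
# Rule names and tier contents are hypothetical examples.
TIERS = [
    ("bronze", ["README present", "Owners defined"]),
    ("silver", ["On-call rotation defined", "Automated testing"]),
    ("gold", ["Code coverage >= 80%", "No open critical vulnerabilities"]),
]


def assign_tier(results):
    """Return the highest tier whose rules, and all lower tiers' rules, pass.

    `results` maps rule names to True/False. A missing rule counts as failing.
    Returns None when even the lowest tier is not met.
    """
    achieved = None
    for tier, rules in TIERS:
        if not all(results.get(rule, False) for rule in rules):
            break
        achieved = tier
    return achieved


# Made-up rule results for one service.
checkout_results = {
    "README present": True,
    "Owners defined": True,
    "On-call rotation defined": True,
    "Automated testing": True,
    "Code coverage >= 80%": False,
    "No open critical vulnerabilities": True,
}

print(assign_tier(checkout_results))  # silver
```

In this sketch, the failing coverage rule is exactly what stands between the service and gold, which is the kind of achievable next target a tiered scorecard is meant to surface.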

Scorecards pair with initiatives, a Cortex feature that helps drive organizational change. Initiatives allow you to set due dates for one or more rules from a scorecard and automatically remind service owners, through Slack and Jira issues, of any rules that aren’t passing.

Operational maturity is a superpower that can help you acquire and retain more customers. But to improve your operational maturity, you first need to measure it. Cortex’s scorecards give your organization a way to combine data from disparate sources to see where you need to focus your efforts. Initiatives then help you act on those insights.

If you’d like to measure and improve your organization’s operational maturity, request a custom demo with Cortex today.
