What is operational excellence?

Ganesh Datta

CTO & Co-founder

March 18, 2026

Engineering teams are great at innovating and delivering products, but the work that's required to maintain them over time and keep them running well tends to get deprioritized.

Planning processes are designed to move features forward, not to catch whether those features are generating too many alerts, degrading in performance, or creating compliance exposure over time. As a result, that class of work accumulates quietly.

Without a structured review process, these issues rarely surface until they become acute. A latency problem affecting three services looks like three separate problems rather than one systemic one. An SLA that's slowly eroding doesn't appear until someone in the business starts asking questions. A leader flags a problem and the team focuses on it for a day, but it drifts again once the next sprint starts. It's like the kitchen drawer you always swear you'll clean out, until something more important comes up.

Left unchecked, that compounds into incidents, missed SLAs, and a team that's always reacting. Operational excellence is the practice that breaks the cycle. It’s the ability to drive down the costs of managing your system while knowing whether it's healthy at any given moment. When engineers aren't constantly reacting to the same categories of problems, they have more time for the work that actually moves things forward.

Here's a closer look at what operational excellence is and why it's time to make it a priority.

What operational excellence actually is in practice

At its core, operational excellence is built around a recurring meeting (often called an “operational excellence review”), where a cross-functional group sits down and works through a set of health metrics for the organization's services. Many teams find that a weekly cadence is the right rhythm. Teams of all sizes can run these reviews, but the process becomes essential once your organization has SLAs to meet, enough services to lose track of, or customers who notice when something goes wrong.

Two things tend to determine whether a review like this actually drives change. The first is who's in the room. Effective reviews bring together senior leadership, including the VP of engineering, the CTO, directors, and staff engineers, alongside the engineers who are on call. Leadership being visibly present and participating sends a clear signal about what the organization actually values. A leader who shows up every week and asks hard questions changes behavior. A leader who says latency matters on a Monday morning and never shows up to the review does not.

The second is whether the data is automated. If someone has to manually pull together a report each week, the process will eventually collapse under that burden. What you want is a dashboard or document that generates itself, flags what's red, and lets the team focus the conversation on what actually needs attention. Many innovative companies leverage Cortex in their operational excellence reviews to track their health, ownership, and service maturity. H&R Block, for instance, struggled to maintain spreadsheets and disconnected tools just to get a basic picture of what was running and who owned it ahead of their busiest season. After implementing Cortex, they got seven program managers' worth of time back for more innovative projects.

“We’re here to get rid of the repetitive manual work so you can focus on the innovative feature development that drives real value.” — Hariprasad Babu, Manager of Technology, H&R Block
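What might “a report that generates itself” look like? Here's a rough, hypothetical sketch in Python. The fetch_metrics function, thresholds, and service names are placeholders rather than a real monitoring API; the point is that red items surface themselves, sorted to the top, before anyone walks into the room.

```python
# Hypothetical sketch: a weekly review report that builds itself.
# fetch_metrics() and its numbers are placeholders; in practice it
# would query your monitoring stack.

THRESHOLDS = {"availability": 0.999, "p99_latency_ms": 200}

def fetch_metrics(service: str) -> dict:
    # Placeholder data standing in for a real metrics query.
    sample = {
        "checkout": {"availability": 0.9982, "p99_latency_ms": 240},
        "search": {"availability": 0.9996, "p99_latency_ms": 150},
    }
    return sample[service]

def build_report(services: list[str]) -> list[dict]:
    rows = []
    for svc in services:
        metrics = fetch_metrics(svc)
        red = []
        if metrics["availability"] < THRESHOLDS["availability"]:
            red.append("availability")
        if metrics["p99_latency_ms"] > THRESHOLDS["p99_latency_ms"]:
            red.append("latency")
        rows.append({"service": svc, "red": red, **metrics})
    # Worst services first, so the meeting starts with what matters most.
    return sorted(rows, key=lambda row: len(row["red"]), reverse=True)

for row in build_report(["checkout", "search"]):
    status = "RED: " + ", ".join(row["red"]) if row["red"] else "green"
    print(f"{row['service']}: {status}")
```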

Running an effective review with metrics that actually matter

The first step in running an effective review is defining what healthy actually looks like. The standard framework uses SLOs, or service level objectives, which set the expected behavior for a service. A good event might be a request that completes in under 200 milliseconds. Anything slower, or any error, counts as a bad event. Teams then set a target, typically something like 99.9% good events, and monitor against it.
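To make that concrete, here's a minimal sketch in Python of classifying events against an SLO. The 200 ms threshold and 99.9% target mirror the example above; the request records are made up.

```python
# Minimal sketch: counting good vs. bad events against an SLO.
# The threshold and target follow the example above; the requests
# are illustrative.

LATENCY_SLO_MS = 200
TARGET = 0.999  # 99.9% of events should be good

requests = [
    {"latency_ms": 120, "error": False},
    {"latency_ms": 340, "error": False},  # too slow: bad event
    {"latency_ms": 95, "error": True},    # errored: bad event
    {"latency_ms": 180, "error": False},
]

good = sum(
    1 for r in requests
    if not r["error"] and r["latency_ms"] < LATENCY_SLO_MS
)
compliance = good / len(requests)

print(f"SLO compliance: {compliance:.2%} (target {TARGET:.1%})")
print("within SLO" if compliance >= TARGET else "SLO breached")
```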

The foundational metrics most teams track are sometimes called the golden signals. These signals are latency, availability, and error rate, and together they give a reliable picture of whether a service is working from the customer's point of view. Teams often choose to layer in additional context-specific metrics on top. Security posture and vulnerability status are common additions. For teams running mature OpEx programs, Cortex Scorecards and the Bird’s eye report become part of every review.

Error budgets add another useful dimension to the picture. If 99.9% of requests should succeed, teams have a defined budget (0.1%) for failures. When a team burns through that budget quickly, it's a signal to slow down and invest in reliability. When they're running comfortably within it, they have room to ship aggressively. The budget makes the trade-off visible rather than leaving it to intuition.
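The arithmetic is simple enough to sketch. Using the 99.9% target above (the traffic numbers here are hypothetical):

```python
# Sketch of error-budget accounting for a 99.9% target.
# Request counts are hypothetical.

TARGET = 0.999
BUDGET = 1 - TARGET  # 0.1% of requests may be bad events

total_requests = 10_000_000  # requests in this window
bad_requests = 4_200         # failed or too-slow requests

allowed_bad = total_requests * BUDGET  # 10,000 bad events allowed
burn = bad_requests / allowed_bad      # fraction of budget consumed

print(f"Error budget consumed: {burn:.0%}")
if burn >= 1.0:
    print("Budget exhausted: slow down and invest in reliability.")
elif burn >= 0.5:
    print("Burning fast: watch closely.")
else:
    print("Comfortable headroom: room to ship aggressively.")
```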

How to structure the meeting so it doesn't become theater

One of the most common failure modes for OpEx reviews is that the conversation is good but nothing happens. The meeting recurs, the same issues come up, but six months later the numbers still haven't moved.

The most effective approach is having someone, ideally a TPM or a designated staff engineer, take responsibility for turning the outcomes of each meeting into tracked tickets with owners and due dates. Nubank's engineering teams, for example, treat this as a hard rule. Every action item that leaves the weekly review has an owner and a deadline, or it doesn't get treated as an action item. Structuring the report so the highest-priority items appear at the top of each section also helps. The team works through what matters most while everyone is fresh and picks up wherever they left off the following week if time runs out.

In terms of scope, the most manageable reviews cover ten or fewer top-level business units, with seven or fewer sub-areas within each. The goal is to complete the full loop in about an hour. When the report becomes the agenda, the meeting tends to stay focused.

What AI changes and what it doesn't

AI-assisted development is raising a real question for teams running OpEx reviews. Does more AI-generated code mean more operational risk?

The short answer: maybe. When AI tools generate large volumes of code quickly, including unit tests that may or may not reflect actual system behavior, there's a real risk that engineers approve that code without deeply understanding it.

In this type of environment, SLOs and golden signals remain the right framework for OpEx reviews, with additional focus placed on tracking code review quality.

The other side of the coin is far more optimistic. AI agents can simplify the work of generating and interpreting OpEx reports. The script once written by hand to pull together a weekly report could be rebuilt by an AI agent in an afternoon. Pushed further, an AI agent can act as a chief of staff for the OpEx review, surfacing the why behind the patterns in the data rather than just presenting the numbers. If flaky tests are spiking, the agent doesn't just flag it. It can tell you whether the cause is hardware failures, timeouts, or race conditions, and point to what in the process is likely contributing.

This is where the trajectory gets interesting. Code is no longer the primary bottleneck for most engineering teams. The bottleneck has shifted to everything around the code, including reliability, security, compliance, and the operational health of the systems it runs on. Teams that build a rigorous process for tracking and improving those things now will be better positioned to absorb the increased velocity that AI brings without breaking what they've built.

How Cortex drives operational excellence

Cortex is the Engineering Operations Platform that helps teams improve their operational maturity and reduce developer friction. Service maturity templates help teams track the properties that show up most often in operational excellence reviews, like service ownership, code coverage, on-call coverage, vulnerability status, incident awareness, runbooks, alerting, and dependencies.

Scorecards make those properties visible at an organizational level, not just the individual service level. When a team sits down for their weekly OpEx review, the prep work is automated and the prioritization is clear. Conversations can focus on what's red, why it's red, and who owns the work to make it green by next week, rather than arguing about ownership or piecing together health status from disconnected systems.

That's the goal of operational excellence. Not a perfect system, but a process that keeps improving it.

Want to see how Cortex fits into your team's operational excellence practice? Schedule a demo.
