DRIVE Deep Dive: Delivery

This is the first post in our DRIVE Deep Dive series. Over the coming month we'll examine each pillar of the DRIVE framework in turn, and map DRIVE against the frameworks engineering leaders already run on, including DORA and SPACE. For the complete model, download the full DRIVE framework. Up next: Reliability.

---

AI has all but eliminated the bottleneck in writing code. The part of the job that defined software engineering for fifty years, the work of turning an idea into functioning software by hand, is being automated faster and more cheaply than ever. Agents can now write, review, and merge code with minimal human involvement, and throughput across the industry is climbing accordingly. When writing code was the constraint, output was a fair proxy for value, and producing it faster meant delivering more, but the data now complicates that assumption.

In Cortex's 2026 Engineering Benchmark Report, incident volume rose in parallel with PR throughput: as organizations shipped more code, more of that code failed in production. The rate of generation is outpacing our collective ability to understand and govern what is being generated, and the path from requirement to production is becoming more opaque.

The measurement gap compounds the problem. In the same report, 91% of engineering leaders said AI had improved their developers' velocity and quality, while only 25% had the data to substantiate that belief. Confidence is running well ahead of evidence, which means most organizations are accelerating along a road they cannot fully see.

DORA's 2025 research characterizes the dynamic: AI behaves as an amplifier, making the strong parts of a software delivery process stronger and the weak parts weaker. Code review queues, brittle CI, manual release steps, and on-call load have accumulated for years, and the strongest engineering organizations have run operational reviews to manage them for just as long. What changed is the slack: when writing code was slow, that pace acted as its own backpressure, giving teams room to defer the constraints they had not yet fixed. AI removes that slack, relocating pressure onto those bottlenecks and forcing the issue. The velocity gains become durable value only when an organization attacks those bottlenecks deliberately. Absent that effort, the new speed is borrowed against future reliability, and the debt is repaid in incidents and burnout.

This is where the DRIVE framework begins. DRIVE measures organizational effectiveness in the AI era across five pillars, each interrogating a different dimension of whether an organization can sustainably turn customer needs into software: Delivery, Reliability, Initiatives, Vigilance, and Efficiency. The Delivery pillar asks: are we shipping fast, and is it sustainable?

The two halves of measuring Delivery health

Delivery comprises two parts. The first is the structural movement of code through the system, from commit to production. The second is the human and operational capacity to sustain that movement: whether the people and systems orchestrating the AI software factory can continue to absorb, validate, and operate the output without degrading over time.

Many organizations watch the first closely, but overlook the second. The DRIVE Delivery pillar measures both, because only together do they show whether the current pace can hold: the difference between sustainable delivery and a single fast week that cannot be repeated.

Critical Delivery metrics

DRIVE recommends the following three metrics as critical for the Delivery pillar. The first two are DORA key signals.

Deploy frequency

Counted per team, per week. Each team should be read against its own four-week trailing average, because the signal that matters is the trend anomaly. Deploy frequency is preferable to a simpler proxy such as PRs per engineer because, as the software factory becomes more autonomous, deployments to production capture incremental value delivered to users far more faithfully than PR counts, which increasingly measure activity rather than outcomes.
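
As a rough illustration of reading one team against its own baseline, the sketch below compares the latest week's deploy count with the average of the four preceding weeks. The function name, the choice to exclude the current week from the average, and the 30% tolerance are assumptions made for this example, not DRIVE thresholds.

```python
from statistics import mean

def deploy_anomaly(weekly_deploys, window=4, tolerance=0.3):
    """Compare the latest week's deploys with a trailing average.

    weekly_deploys: counts for ONE team, oldest first, latest week last.
    window: prior weeks in the trailing average (four, as in the text).
    tolerance: hypothetical relative deviation treated as an anomaly.
    """
    if len(weekly_deploys) < window + 1:
        raise ValueError("need at least window + 1 weeks of data")
    *history, latest = weekly_deploys
    baseline = mean(history[-window:])
    if baseline == 0:
        # No deploys in the baseline period: any deploy is a full deviation.
        deviation = 0.0 if latest == 0 else float("inf")
    else:
        deviation = (latest - baseline) / baseline
    return {
        "baseline": baseline,
        "latest": latest,
        "deviation": deviation,
        "anomaly": abs(deviation) > tolerance,
    }

# Hypothetical data: four steady weeks, then a sharp drop.
report = deploy_anomaly([10, 12, 11, 11, 5])
print(report["anomaly"])
```

The comparison is deliberately relative to the team's own history, so a team that normally ships five times a week is not flagged merely for shipping less often than a team that ships fifty.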

Lead time for changes

Measured from commit to production at p50 and p95, and reported as p95 per team. Lead time earns its place precisely because it captures the entire change-to-production lifecycle, which is where waste tends to accumulate unnoticed, whether in the review that sat for two days, the manual release gate, or the environment that had to be reprovisioned by hand. It also carries a second-order consequence that grows more significant as change volume rises: the ability to ship a fix is throttled by the same pipeline that ships a feature.
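
A minimal sketch of the p50 and p95 calculation, assuming each change is available as a pair of commit and deploy timestamps. The nearest-rank percentile method and the function names are choices made for this example; other percentile definitions would give slightly different values on small samples.

```python
import math
from datetime import datetime

def percentile(values, pct):
    """Nearest-rank percentile: smallest value with pct% of data at or below it."""
    if not values:
        raise ValueError("no values")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

def lead_times_hours(changes):
    """changes: iterable of (committed_at, deployed_at) datetime pairs."""
    return [
        (deployed - committed).total_seconds() / 3600
        for committed, deployed in changes
    ]

# Hypothetical changes: four quick ones and one that waited days in review.
changes = [
    (datetime(2026, 1, 5, 9), datetime(2026, 1, 5, 10)),
    (datetime(2026, 1, 5, 9), datetime(2026, 1, 5, 11)),
    (datetime(2026, 1, 5, 9), datetime(2026, 1, 5, 12)),
    (datetime(2026, 1, 5, 9), datetime(2026, 1, 5, 13)),
    (datetime(2026, 1, 5, 9), datetime(2026, 1, 9, 13)),
]
hours = lead_times_hours(changes)
print(percentile(hours, 50), percentile(hours, 95))
```

In this made-up sample the median stays at a few hours while p95 is set by the single stalled change, which is the reason to report the tail and not only the middle.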

On-call volume

Summed per shift, per team. This is the human-capacity signal, and the one most likely to be missed when the structural numbers look healthy. A high on-call burden is an immediate "pull the andon cord" moment; it can lead to irreversible burnout, and indicates that the systems are not carrying the load they were meant to.
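
The per-shift, per-team roll-up could be sketched as follows. This assumes the unit being summed is pages received, and that the paging tool can export one record per page with a team name and a shift identifier; both the record shape and the identifiers are hypothetical.

```python
from collections import Counter

def on_call_volume(pages):
    """Count pages per (team, shift).

    pages: iterable of (team, shift_id) tuples, one tuple per page received.
    Returns a Counter keyed by (team, shift_id).
    """
    return Counter((team, shift) for team, shift in pages)

# Hypothetical export from a paging tool.
pages = [
    ("payments", "mon-night"),
    ("payments", "mon-night"),
    ("payments", "mon-night"),
    ("payments", "tue-day"),
    ("search", "mon-night"),
]
volume = on_call_volume(pages)
print(volume[("payments", "mon-night")])
```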

Considered together, deploy frequency and lead time describe the structural side of delivery, while on-call volume describes the human side.

To view recommended red/green thresholds for these metrics, download the full framework.

Secondary Delivery metrics

Beyond the three critical metrics, a few additional metrics are worth considering.

1. PR cycle time, the span from a pull request opening to its merge. This is the first place to look when delivery slows, since a stalled review or a long-running PR is one of the most common sources of waste in the path to production.
2. Change lead time broken down by stage, like PR cycle-time components and CI time. Look here when overall lead time degrades and the offending stage needs to be isolated.
3. Flaky build and test rates, the "drag metrics." These capture the friction that silently taxes every change passing through the pipeline, and they offer a useful fallback when end-to-end lead time proves difficult to measure cleanly.
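
One way to put a number on the flaky-test drag metric is sketched below. It assumes a common working definition, that a test is flaky on a commit if it both passed and failed against that same commit, and a hypothetical record shape of one tuple per test run. Organizations that define flakiness differently would need to adjust the rule.

```python
from collections import defaultdict

def flaky_rate(runs):
    """Share of (test, commit) pairs with both a pass and a fail.

    runs: iterable of (test_name, commit_sha, passed) tuples, one per run.
    Returns a float between 0.0 and 1.0.
    """
    outcomes = defaultdict(set)
    for test, sha, passed in runs:
        outcomes[(test, sha)].add(bool(passed))
    if not outcomes:
        return 0.0
    # A pair is flaky when both True and False were observed for it.
    flaky = sum(1 for seen in outcomes.values() if len(seen) == 2)
    return flaky / len(outcomes)

# Hypothetical CI history for a single commit.
runs = [
    ("test_checkout", "abc123", True),
    ("test_checkout", "abc123", False),  # same commit, different result
    ("test_login", "abc123", True),
    ("test_search", "abc123", False),
    ("test_search", "abc123", False),    # consistently failing, not flaky
    ("test_profile", "abc123", True),
]
print(flaky_rate(runs))
```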

Not every metric will matter for every organization. The appropriate move is to adapt this set to the bottleneck the organization actually faces. The three critical metrics are the place to begin; the rest can be added as specific areas reveal themselves as pain points.

Where DORA fits

DRIVE incorporates DORA’s four key signals and expands upon them. Two of DORA's signals, deploy frequency and lead time for changes, sit within the Delivery pillar, and the other two reside in Reliability. DORA itself describes these as "delivery health" metrics, so the alignment is a natural one.

The distinction lies in what each is for. DORA's signals are measurements that carry meaning in their own right. DRIVE is the governance layer and operational cadence, anchored in the Operational Excellence review, that converts those measurements into resource reallocation and action.

Measuring Delivery in an OpEx review

The Operational Excellence review is the recurring meeting in which leaders read the state of the organization against DRIVE and decide where to direct resources.

In this setting, Delivery should be read as trend and anomaly against each team's own baseline, rather than as a cross-team scoreboard. When a signal turns red, whether lead time creeping upward or on-call volume spiking, the review is where time, people, and attention are redirected toward the bottleneck producing it. This is the backpressure mechanism operating as intended: forward pressure met by a structured, human-in-the-loop counterforce that keeps the pace sustainable.

Any single one of these metrics can be gamed, especially deploy frequency, since splitting deploys to inflate a count is trivial. That is precisely why Delivery should be read as a set, inside a review, with judgment applied. A single automated number on a dashboard is too easy for someone to move. The cadence and the human interpretation are paramount.

Where the Delivery signals come together

Delivery tells an engineering organization whether its current pace is genuine or borrowed against tomorrow's reliability and the capacity of its engineers to keep going. Assembling the full picture is harder than it appears, because the signal is distributed across the tools that each own a fragment of it: deploy frequency in CI/CD, lead time in version control, on-call burden in the pager, and DORA in whichever dashboard happens to track it.

Cortex consolidates those sources into org-level Engineering Intelligence and powers the Operational Excellence review that turns them into action.

Skyscanner offers a useful illustration of what this looks like in practice. One of their engineering teams used the DORA dashboard in Cortex to drive a 50% reduction in cycle time on a critical, traveler-facing system from one quarter to the next. Read the full case study here.

For organizations that want to understand where they stand on Delivery and the other four DRIVE pillars, the DRIVE maturity assessment is a practical starting point.

Next in the series: Reliability, the pillar that asks whether we are delivering on the promises we have made to customers.