Your AI coding tools boosted individual velocity. Developers are completing 21% more tasks and merging 98% more pull requests than a year ago. And yet, your DORA metrics barely moved. Deployment frequency is up marginally, but lead time is stuck — or worse, the change failure rate climbed. The problem is not your tooling choice. It is that you are measuring the wrong stages of the pipeline.
This guide walks through a concrete diagnostic process: how to use DORA’s four core metrics to locate where AI is helping, where it is creating downstream congestion, and what to fix first.
What You’ll Need Before You Start
Before diving in, make sure you have access to the following:
- A CI/CD data source — GitHub, GitLab, or a developer intelligence platform (LinearB, Faros, DX, Jellyfish) that can export commit-to-deploy timestamps at the stage level
- A 90-day rolling baseline — ideally one that predates your AI tooling rollout. Without this, you cannot tell whether changes are caused by AI adoption or other factors.
- Incident and rollback data — from PagerDuty, OpsGenie, or your on-call system, linked to specific deploys
- Roughly 30 minutes to pull data and map it against the framework below
If you do not have pre-AI baselines, start collecting now and treat this run as T=0. You can still do the diagnostic with current data — you just cannot attribute findings to AI adoption specifically until you have a second snapshot.
Why DORA Breaks When AI Writes 30–70% of Your Code
DORA’s four metrics — deployment frequency (DF), lead time for changes (LT), change failure rate (CFR), and mean time to restore (MTTR) — were designed to measure delivery as a system. They capture the output of that system well. What they do not capture is which part of the system produced a given change, or whether quality trade-offs are being deferred into future cycles.
In 2026, telemetry across 22,000 developers shows the consequences of this blind spot. Median time in PR review is up 441%, pull request size is up 51.3%, bugs per developer are up 54%, and incidents per PR are up 242.7%, according to DORA Metrics 2026 analysis from Byteiota. Deployment frequency can rise on the back of more PRs while change failure rate quietly climbs — and DORA alone will not tell you why.
The 2025 DORA Report found that the greatest returns from AI tooling come not from the tools themselves, but from platform quality, workflow clarity, and team alignment. DORA tells you whether those gains arrived. It does not tell you where to look when they did not.
Step 1 — Pull Your Four DORA Baselines
Start with the canonical DORA definitions and pull a snapshot of each metric for the current 90-day window and the same window 12 months ago. For each metric, note the raw value and where it sits on the DORA elite/high/medium/low scale.
- Deployment frequency: Elite = on demand (multiple deploys per day); Low = monthly or slower
- Lead time for changes: Elite = under 1 hour; Low = over 1 month. Only 9.4% of teams achieve sub-1-hour lead times in 2026.
- Change failure rate: Elite = 0–2%; Low = over 16%. Only 8.5% of teams reach elite CFR benchmarks in 2026.
- Mean time to restore: Elite = under 1 hour; Low = over 1 week
What to look for: If DF rose but LT and CFR also rose, AI is producing more work than your pipeline can absorb at current quality thresholds. If DF rose while CFR held flat, the tooling is integrating without stability regression. If LT barely moved despite more PRs, the bottleneck is not code generation — it is somewhere downstream in the pipeline.
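To make the banding step repeatable across snapshots, here is a minimal sketch in Python that maps raw values onto the elite/low cutoffs listed above. The data class, field names, and exact cutoff values are illustrative, not tied to any particular export format.

```python
from dataclasses import dataclass

@dataclass
class DoraSnapshot:
    deploys_per_day: float        # deployment frequency
    lead_time_hours: float        # median commit-to-deploy lead time
    change_failure_rate: float    # fraction of deploys causing failures, 0.0 to 1.0
    mttr_hours: float             # mean time to restore

def band(s: DoraSnapshot) -> dict:
    """Map raw values onto the elite/low thresholds above; everything in between is 'mid'."""
    return {
        "deployment_frequency": "elite" if s.deploys_per_day >= 1 else
                                "low" if s.deploys_per_day < 1 / 30 else "mid",
        "lead_time": "elite" if s.lead_time_hours < 1 else
                     "low" if s.lead_time_hours > 24 * 30 else "mid",
        "change_failure_rate": "elite" if s.change_failure_rate <= 0.02 else
                               "low" if s.change_failure_rate > 0.16 else "mid",
        "mttr": "elite" if s.mttr_hours < 1 else
                "low" if s.mttr_hours > 24 * 7 else "mid",
    }

# Compare the current 90-day window against the same window 12 months ago.
year_ago = DoraSnapshot(deploys_per_day=1.4, lead_time_hours=88, change_failure_rate=0.05, mttr_hours=4)
current  = DoraSnapshot(deploys_per_day=2.1, lead_time_hours=96, change_failure_rate=0.09, mttr_hours=5)
print(band(year_ago))
print(band(current))
```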
Step 2 — Decompose Lead Time by Stage
Lead time is a composite number. A 4-day average could mean 4 days of coding, or 4 hours of coding followed by 3.8 days waiting in review. Those two call for very different fixes, and DORA will not distinguish them without a stage breakdown.
Break lead time into four stages and measure each separately (a calculation sketch follows the list):
- Coding time — first commit to PR open
- PR open to first review — how long PRs sit before a reviewer engages
- Time in review — active review period including back-and-forth iterations
- Merge to deploy — CI pipeline execution, approval gates, deploy time
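A minimal sketch of this decomposition, assuming you can export five timestamps per PR (first commit, PR opened, first review, merge, deploy). The key names are placeholders for whatever your CI/CD data source actually provides; the stage with the largest median is your bottleneck.

```python
from datetime import datetime
from statistics import median

def stage_durations(pr: dict) -> dict:
    """Split one PR's lead time into the four stages above.
    Expects ISO-8601 timestamps; key names are illustrative."""
    t = {k: datetime.fromisoformat(v) for k, v in pr.items()}

    def hours(start: str, end: str) -> float:
        return (t[end] - t[start]).total_seconds() / 3600

    return {
        "coding":          hours("first_commit", "pr_opened"),
        "pickup":          hours("pr_opened", "first_review"),
        "review":          hours("first_review", "merged"),
        "merge_to_deploy": hours("merged", "deployed"),
    }

def stage_medians(prs: list[dict]) -> dict:
    """Median hours per stage across the window."""
    per_pr = [stage_durations(pr) for pr in prs]
    return {stage: round(median(d[stage] for d in per_pr), 1) for stage in per_pr[0]}

sample = [{
    "first_commit": "2026-01-05T09:00", "pr_opened": "2026-01-05T13:00",
    "first_review": "2026-01-07T10:00", "merged": "2026-01-08T16:00",
    "deployed": "2026-01-08T18:00",
}]
print(stage_medians(sample))  # {'coding': 4.0, 'pickup': 45.0, 'review': 30.0, 'merge_to_deploy': 2.0}
```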
In most teams with significant AI adoption, the bottleneck shows up in time in review, not in coding time. AI-assisted developers generate larger, more complex PRs. Reviewers have not scaled to match. Research from DX notes that a 51.3% increase in PR size means reviewers face roughly 50% more surface area per review cycle — with no additional review time or tooling.
If your review stage dominates lead time, the fix is not a better AI coding tool. It is PR size limits, review assignment policies, and structured review checklists that force smaller, reviewable units of work.
Step 3 — Cross-Reference Change Failure Rate with Deployment Frequency
This is the most diagnostic pairing in the DORA framework. Plot both metrics on a timeline alongside your AI tooling rollout date and look for one of three patterns:
- DF up, CFR flat or down: AI adoption is working. Throughput increased without stability regression.
- DF up, CFR also up: Quality is slipping. AI is generating code that clears CI but breaks in production — common when test coverage does not scale with code generation speed.
- DF flat, CFR up: You are deploying the same volume with worse quality. Often caused by larger PRs that passed automated checks but introduced subtle regressions harder to catch in review.
CFR problems from AI adoption typically surface 30–90 days after initial rollout, once the early wins are exhausted and compounding quality debt starts hitting production. The AI velocity paradox is structural: generating more code faster does not increase delivery throughput if downstream quality gates are not ready to handle the volume.
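To make the pattern check repeatable, here is a minimal sketch that compares a pre-rollout window against a post-rollout window and names the pattern. The 5% tolerance for calling a metric "flat" is an assumption; tune it to the normal variance in your own data.

```python
def classify_df_cfr(before: dict, after: dict, tolerance: float = 0.05) -> str:
    """Compare pre- and post-rollout windows of deployment frequency (DF) and
    change failure rate (CFR). Each dict holds {'df': deploys_per_day, 'cfr': 0.0-1.0}."""
    df_up = after["df"] > before["df"] * (1 + tolerance)
    cfr_up = after["cfr"] > before["cfr"] * (1 + tolerance)

    if df_up and not cfr_up:
        return "DF up, CFR flat or down: throughput gained without stability regression"
    if df_up and cfr_up:
        return "DF up, CFR up: quality is slipping behind generation speed"
    if not df_up and cfr_up:
        return "DF flat, CFR up: same volume, worse quality; inspect PR size and review depth"
    return "No significant movement: re-check after another 90-day window"

print(classify_df_cfr({"df": 1.4, "cfr": 0.05}, {"df": 2.1, "cfr": 0.09}))
```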
Step 4 — Layer in AI-Specific Signals
Once you have stage-level lead time data and the DF/CFR correlation, add three supplementary signals to sharpen the diagnosis.
Rework rate measures the percentage of code modified within three weeks of merge. Rising rework indicates AI-generated code is landing in production but failing to hold — either because requirements were underspecified or because the code was not reviewed at sufficient depth. DORA now tracks this as a fifth metric alongside the original four.
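A minimal sketch of the rework calculation at file granularity, which is a rougher approximation than line-level blame but usually enough to spot a trend: it reports the fraction of merged PRs whose files were touched again inside the window. The three-week window comes from the definition above; the record shapes are assumptions.

```python
from datetime import datetime, timedelta

REWORK_WINDOW = timedelta(days=21)  # "within three weeks of merge"

def rework_rate(merged_prs: list[dict], later_commits: list[dict]) -> float:
    """Fraction of merged PRs whose files were modified again within the rework window.
    Each PR: {'merged_at': ISO string, 'files': set}; each commit: {'at': ISO string, 'files': set}."""
    reworked = 0
    for pr in merged_prs:
        merged_at = datetime.fromisoformat(pr["merged_at"])
        touched_later = any(
            pr["files"] & c["files"]
            and merged_at < datetime.fromisoformat(c["at"]) <= merged_at + REWORK_WINDOW
            for c in later_commits
        )
        if touched_later:
            reworked += 1
    return reworked / len(merged_prs)

prs = [{"merged_at": "2026-01-10T12:00", "files": {"billing/invoice.py"}}]
commits = [{"at": "2026-01-18T09:00", "files": {"billing/invoice.py", "billing/tax.py"}}]
print(rework_rate(prs, commits))  # 1.0: every merged PR in this sample was reworked
```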
PR size trend tracks average lines changed per PR over time. If this metric rose sharply after your AI rollout, your review bottleneck will worsen before it improves. Some teams solve this with automated PR splitting rules; others set hard limits (e.g., 400-line caps) and require AI-generated PRs to be broken into logical units before submission.
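One way to enforce a hard cap is a small CI gate that fails oversized PRs before they reach a reviewer's queue. The 400-line limit mirrors the example above; the sketch assumes a plain `git diff --numstat` against the base branch and is not tied to any CI vendor.

```python
import subprocess
import sys

MAX_CHANGED_LINES = 400  # hard cap discussed above; pick a limit your reviewers can absorb

def changed_lines(base_ref: str = "origin/main") -> int:
    """Sum added + deleted lines against the base branch using `git diff --numstat`."""
    out = subprocess.run(
        ["git", "diff", "--numstat", f"{base_ref}...HEAD"],
        capture_output=True, text=True, check=True,
    ).stdout
    total = 0
    for line in out.splitlines():
        added, deleted, _path = line.split("\t", 2)
        if added != "-":  # binary files report "-"; skip them
            total += int(added) + int(deleted)
    return total

if __name__ == "__main__":
    size = changed_lines()
    if size > MAX_CHANGED_LINES:
        print(f"PR changes {size} lines; cap is {MAX_CHANGED_LINES}. Split into smaller units.")
        sys.exit(1)
    print(f"PR size OK: {size} lines changed.")
```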
Incidents per PR correlates your incident log against deploy frequency to identify whether specific teams, services, or PR types are generating a disproportionate share of production incidents. This surfaces accountability at a more granular level than aggregate CFR allows.
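A minimal sketch of the attribution, assuming each incident record carries a timestamp and a service name, and each deploy record carries the same plus the PR that shipped it. Each incident is blamed on the most recent prior deploy of its service; all field names are illustrative, and your on-call tooling may already link incidents to deploys more precisely.

```python
from collections import Counter
from datetime import datetime

def incidents_per_deploy(deploys: list[dict], incidents: list[dict]) -> dict:
    """Attribute each incident to the most recent prior deploy of the same service,
    then report incidents per deploy by service."""
    hits = Counter()
    totals = Counter(d["service"] for d in deploys)
    for inc in incidents:
        inc_at = datetime.fromisoformat(inc["at"])
        candidates = [
            d for d in deploys
            if d["service"] == inc["service"] and datetime.fromisoformat(d["at"]) <= inc_at
        ]
        if candidates:
            blamed = max(candidates, key=lambda d: d["at"])  # ISO strings sort chronologically
            hits[blamed["service"]] += 1
    return {svc: hits[svc] / totals[svc] for svc in totals}

deploys = [
    {"service": "checkout", "at": "2026-02-01T10:00", "pr": 4812},
    {"service": "search",   "at": "2026-02-01T11:00", "pr": 4813},
]
incidents = [{"service": "checkout", "at": "2026-02-01T14:30"}]
print(incidents_per_deploy(deploys, incidents))  # {'checkout': 1.0, 'search': 0.0}
```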
These signals annotate DORA rather than replace it. The 2025 DORA Report, which surveyed nearly 5,000 technology professionals, found that 90% use AI at work and over 80% believe it improved personal productivity. But individual productivity gains do not automatically translate into system-level delivery performance. The supplementary signals above bridge that gap.
What to Act On First
After running this diagnostic, the bottleneck is almost always one of three things: review capacity has not scaled with code volume; test coverage has not scaled with AI-generated code; or deployment pipelines have not been updated to handle higher PR frequency with smaller blast radius. Adding more AI coding tooling into a constrained system does not increase delivery — it increases queue depth.
Start with whichever of the four lead time stages is consuming the most time. Fix that stage before investing further in generation-side tools. The teams seeing real DORA improvement from AI in 2026 are the ones that instrumented the full pipeline first, identified their actual constraint, and addressed it directly — not the ones that added more tools to the top of an already-clogged funnel.
Further Reading
- DORA’s official metrics guide — canonical definitions updated for 2026, including the fifth metric (Rework Rate) and elite benchmarks
- DORA Metrics 2026: AI Expansion Meets Visibility Crisis — the most thorough current analysis of where DORA breaks under high AI adoption, with telemetry data from 22,000 developers
- DORA metrics tools in 2026: What to measure, and what’s missing — practical review of tooling options for capturing stage-level lead time and AI-specific signals alongside the four core metrics

