The Numbers Don’t Add Up
By every individual productivity metric, AI has transformed software development. Developers complete 21% more tasks per week. Pull request volume is up 98% at shops running Cursor or Claude Code. Code generation that used to take a junior engineer a full morning now takes an agent under three minutes. If you read the press releases, we are living through the most productive period in the history of software engineering.
The delivery metrics tell a different story. Despite the productivity surge, median time in PR review is up 441%. Bugs per developer have risen 54%. Production incidents per pull request have climbed 242.7%—meaning the probability that a given code change causes a production failure has more than tripled since 2023. And the DORA metrics that engineering leaders have used for years to assess team health—lead time, deployment frequency, change failure rate, MTTR—have not meaningfully improved.
Something is wrong with the math. Developers are building more, but organizations are delivering less reliably. Understanding the gap is not a philosophical exercise. It has practical consequences for every team currently investing in agentic tooling.
Coding Was Never the Bottleneck
In March 2026, Leonardo Stern published an analysis based on Agoda’s engineering experience that cut to the heart of the problem: AI coding assistants haven’t sped up delivery because coding was never the bottleneck. It’s a provocation, but the underlying argument is structurally sound.
Fred Brooks made a version of this claim in 1986 in “No Silver Bullet.” The essential thesis: accidental complexity—the friction of translating a specification into working code—is only a fraction of the total cost of software. The essential complexity, figuring out what to build and verifying that you built it correctly, is irreducible. Improve only the accidental part and you improve only a small slice of total delivery time.
AI coding assistants have, so far, primarily attacked the accidental complexity. They are extraordinary at translating clear specifications into syntactically correct, logically coherent code. They are far less capable at generating the specification itself, or at verifying that the output meets the intent. Those two activities—specification and verification—are exactly where human judgment is still irreplaceable, and exactly where organizations are now stacking up.
The 2025 DORA report, State of AI-Assisted Software Development, frames this as an amplification dynamic: AI strengthens high-performing teams and exposes weaknesses in fragmented ones. What looks like an AI problem is usually an organizational problem that AI has made visible.
The Three Failure Modes
The gap between individual productivity and organizational delivery has three distinct causes. Most teams are experiencing all three simultaneously.
1. Review as an Afterthought
When PR volume nearly doubles without a proportional increase in review capacity, something has to give. The data shows what gives: reviews get shorter, shallower, and increasingly skipped. Merging with no review has risen by nearly a third (31%) across teams that have fully adopted AI coding agents.
The irony is structural. AI speeds up the code writing phase so dramatically that the bottleneck shifts immediately upstream to the review queue. But the review queue was staffed assuming the old production rate. No one hired more senior engineers to review the AI-generated output. The result is that AI agents effectively create more work for the humans who are already the scarcest resource.
We wrote about this dynamic in AI Velocity Paradox: More Code, More Bottlenecks. The pattern is not new—it was predictable from queuing theory—but it’s now showing up clearly in production incident data.
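The queuing-theory prediction is simple enough to sketch in a few lines. A minimal simulation, with hypothetical rates (not figures from the data above), shows why the backlog grows linearly the moment agent output outpaces a review queue staffed for the old production rate:

```python
# Minimal sketch of the review-queue dynamic. The rates are hypothetical
# illustrations, not measurements.

def review_backlog(days, prs_per_day, reviews_per_day, backlog=0):
    """Track an unreviewed-PR backlog day by day."""
    history = []
    for _ in range(days):
        backlog += prs_per_day                    # agents add work at the new rate
        backlog -= min(backlog, reviews_per_day)  # reviewers drain at the old rate
        history.append(backlog)
    return history

# Before agents: 10 PRs/day in, 10 reviewed/day -> the queue stays flat.
print(review_backlog(5, prs_per_day=10, reviews_per_day=10))  # [0, 0, 0, 0, 0]

# Agents double PR volume, no extra reviewers -> the queue grows without bound.
print(review_backlog(5, prs_per_day=20, reviews_per_day=10))  # [10, 20, 30, 40, 50]
```

Once arrivals exceed service capacity, no amount of reviewer heroics stabilizes the queue; only cutting the input rate or raising review capacity does.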
2. The Specification Gap
Agents are only as good as their inputs. A vague ticket becomes vague code. A well-specified requirement, with clear acceptance criteria and edge cases enumerated, produces dramatically better output—both in correctness and in the quality of the verification surface it creates.
Most engineering organizations have not invested in specification quality. Tickets written for human developers, who could infer intent from context and ask follow-up questions, are being handed directly to agents that execute literally and miss the implicit. The output is technically functional but wrong in ways that unit tests don’t catch and that only surface in production under real user behavior.
Stern’s preferred model at Agoda is what he calls “grey box” development: humans remain accountable at two explicit checkpoints. First, writing the specification with enough precision that the agent can execute it without ambiguity. Second, verifying the result against behavioral evidence—not by reading the generated code line by line, but by defining upfront what a correct output looks like and checking against that.
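A minimal sketch of what that second checkpoint can look like in practice. The function and spec cases below are hypothetical stand-ins: the point is that the correctness criteria exist before generation, and verification runs against behavior rather than against the generated source:

```python
# Sketch of the second "grey box" checkpoint: verify agent output against
# behavioral evidence defined up front. Function name and cases are hypothetical.

# Checkpoint 1: the human writes the expected behavior before generation.
SPEC = [
    # (input, expected output) for a hypothetical coupon-discount function
    ({"total": 100, "coupon": "SAVE10"}, 90.0),
    ({"total": 100, "coupon": None},     100.0),
    ({"total": 0,   "coupon": "SAVE10"}, 0.0),   # edge case: empty cart
]

def apply_coupon(order):
    """Stand-in for agent-generated code; the human never reads it line by line."""
    discount = 0.10 if order["coupon"] == "SAVE10" else 0.0
    return order["total"] * (1 - discount)

# Checkpoint 2: check behavior, not implementation.
failures = [(case, apply_coupon(case))
            for case, want in SPEC if apply_coupon(case) != want]
assert not failures, f"behavioral spec violated: {failures}"
print("all behavioral checks passed")
```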
3. The Organizational Speed Limit
Even teams that get specification and review right hit a ceiling imposed by the rest of the organization. The 100x Agent Illusion captures this precisely: a 10x software factory embedded in a 1x decision-making process delivers at 1x. Approval cycles, security reviews, change advisory boards, and deployment windows are not bottlenecks that AI can dissolve. They are organizational artifacts, and they are now visibly constraining teams that have maxed out their technical throughput.
The 2025 DORA report found that DORA alone is no longer a sufficient signal. Teams with strong DORA scores still report high friction, because the measurement framework was calibrated for a world where engineering was the constraint. In 2026, the constraint has shifted in many organizations to governance and verification—neither of which DORA was designed to measure.
What Separates High-Performing Teams
The teams that have successfully translated individual AI productivity into organizational delivery improvements share several characteristics that are worth studying.
They Invested in Specification Before Agents
High performers treat specification as a first-class engineering artifact. Requirements are written with explicit acceptance criteria, edge cases, and failure modes enumerated. This isn’t a new practice—behavior-driven development has advocated for it for over a decade—but AI has sharply raised the ROI of doing it well. A precise specification is now both a contract for the agent and the validation checklist for the output.
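One way to make that concrete is to give the specification a structure of its own, so the same artifact serves as the agent's contract and the reviewer's checklist. The shape below is an illustrative sketch, not a standard format:

```python
# Illustrative sketch: a specification as a first-class, structured artifact.
# Field names and the readiness rule are assumptions, not an established schema.

from dataclasses import dataclass, field

@dataclass
class Spec:
    story: str
    acceptance_criteria: list = field(default_factory=list)
    edge_cases: list = field(default_factory=list)
    failure_modes: list = field(default_factory=list)

    def is_executable(self):
        """A thin spec (no criteria or edge cases) is not ready for an agent."""
        return bool(self.acceptance_criteria) and bool(self.edge_cases)

spec = Spec(
    story="Users can reset their password via an emailed link",
    acceptance_criteria=["link expires after 30 minutes",
                         "old password stops working immediately"],
    edge_cases=["reset requested twice in a row", "email address not registered"],
    failure_modes=["email service is down"],
)

assert spec.is_executable()
```

The same object that gates whether a ticket is handed to an agent becomes the validation checklist when the output comes back.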
They Matched Review Capacity to Agent Output Rate
The teams with the best incident rates did not simply deploy agents and hope for the best. They explicitly modeled the downstream impact on the review queue and staffed accordingly. In some cases, this meant reducing the number of parallel agent sessions. In others, it meant assigning dedicated senior engineers to AI-generated PR review, treating that as a specialized skill rather than generic bandwidth.
They Automated Verification, Not Just Generation
The most mature teams have applied the same agentic approach to testing and verification that they applied to code generation. They run automated behavioral test suites against every PR, maintain contract tests for interfaces, and use AI-assisted security scanning that flags patterns before human review. The goal is to shrink the surface area that requires human attention, not to eliminate human judgment from the loop.
This is consistent with findings from the AI Productivity Paradox research report by Faros, which found that teams with strong automated verification frameworks saw delivery stability improve even as PR volume doubled.
They Updated Their Governance Model
Change advisory boards designed for quarterly release cycles are incompatible with teams merging PRs every few minutes. The highest-performing teams renegotiated their governance contracts with security, compliance, and operations stakeholders. They moved from event-based approvals to policy-based continuous compliance: automated checks that enforce the same rules, faster, without requiring a human in the critical path of every deployment.
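The shift from event-based approval to policy-based compliance can be sketched as rules a change advisory board would have enforced by hand, encoded as checks that run on every change. The policy names and PR fields below are hypothetical:

```python
# Sketch of policy-based continuous compliance: board rules as automated
# checks on every change. Policies and PR fields are hypothetical examples.

POLICIES = {
    "has_review_or_automated_verification":
        lambda pr: pr["approved_reviews"] >= 1 or pr["behavioral_tests_pass"],
    "no_secrets_flagged":
        lambda pr: not pr["secret_scan_findings"],
    "prod_changes_are_low_risk":
        lambda pr: pr["target_env"] != "prod" or pr["change_risk"] == "low",
}

def evaluate(pr):
    """Return the policies the change violates; an empty list means ship it."""
    return [name for name, rule in POLICIES.items() if not rule(pr)]

pr = {"approved_reviews": 0, "behavioral_tests_pass": True,
      "secret_scan_findings": [], "target_env": "prod", "change_risk": "low"}
print(evaluate(pr))  # [] -> no human in the critical path for this change
```

The rules are the same ones the board enforced; the difference is that they run in seconds on every deployment instead of weekly in a meeting.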
The Real Benchmark Question for 2026
The industry’s obsession with benchmark performance—SWE-bench scores, coding challenge rankings, token throughput—has produced a distorted picture of what matters for software delivery. A model that solves 72% of SWE-bench Verified problems is impressive. It says nothing about whether integrating that model into your delivery pipeline will improve your change failure rate.
The useful benchmarks in 2026 are organizational, not technical. How much of your specification writing is precise enough for agent execution? What is your review-to-generation ratio? What percentage of PRs have behavioral test coverage before merge? How long does your governance process add to each deployment cycle?
These are harder questions than “which model scores highest on coding benchmarks,” and they produce more actionable answers. As we noted in our earlier analysis, AI Coding Agents in 2026: 90% Adoption, Zero DORA Gain, the adoption decision is largely settled. The implementation quality decision is not.
Who Should Slow Down to Speed Up
Not every team should be running at maximum agent throughput. The Agoda analysis identifies what Stern calls the “Unhappy Middle”: organizations with significant legacy debt, bespoke internal frameworks, and no slack capacity for verification. For these teams, increasing agent output rate makes delivery worse, not better. The backlog of unreviewed PRs grows faster than the team can verify them, incidents accumulate, and the humans in the loop spend their time fighting fires rather than improving the system.
For teams in the Unhappy Middle, the correct move is counterintuitive: reduce agent concurrency, invest in specification quality and automated verification, and only increase throughput once the downstream processes can absorb it. This is not an argument against agentic development. It’s an argument for sequencing the investment correctly.
High performers have already done this work. They built the process infrastructure before they opened the velocity throttle. For everyone else, the process debt is now the bottleneck, and no amount of faster code generation will clear it.
The Audit Every Engineering Lead Should Run
Before deciding whether to increase or decrease agentic throughput, run a quick diagnostic on your pipeline. Four questions surface the real constraint.
What is your specification-to-execution ratio? For every ticket that an agent executes, how much time did a human spend writing the specification? If the answer is “less than the agent spends generating code,” your specs are likely too thin. Elite teams spend 2–3x longer on specification than on generation—not because agents are slow, but because precise specs compound. A well-specified requirement produces testable output the first time, skipping revision cycles that cost far more than the upfront investment.
What is your PR-to-reviewer ratio? Calculate how many PRs each senior engineer is reviewing per week. If that number has grown by more than 50% since you deployed AI agents, you are running a review deficit. Every unreviewed PR is a latent production incident. The teams with the best incident data cap their per-reviewer review load and throttle agent output before allowing the queue to grow.
What percentage of your PRs have behavioral test coverage before merge? Not unit test coverage—behavioral coverage. Can you describe in a sentence what the correct user-visible behavior is, and is that sentence encoded as an automated check? This is the verification gap that accounts for most of the bug-rate increase. AI-generated code passes syntax checks and unit tests by construction. It fails on unanticipated user paths that were not explicitly specified.
How long does your governance process take, measured in hours? If your change advisory board meets weekly, your effective deployment frequency ceiling is once per week regardless of how fast your agents ship code. Many teams have optimized the technical side of delivery to hours or minutes while leaving a governance process in place that was designed for quarterly releases. Map the non-technical steps in your pipeline; they often account for more total latency than the code itself.
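The four questions above can be run as a quick diagnostic over numbers most teams already track. The thresholds below mirror the ones discussed in the text (less spec time than generation time, per-reviewer load up more than 50%, a weekly board imposing a 168-hour ceiling); the sample inputs are hypothetical:

```python
# The four-question audit as a sketch. Thresholds follow the text; the
# sample inputs are hypothetical, not data from any real team.

def audit(spec_hours, gen_hours, prs_per_reviewer_now, prs_per_reviewer_before,
          behavioral_coverage_pct, governance_hours_per_deploy):
    findings = []
    if spec_hours < gen_hours:
        findings.append("specs too thin: less time on spec than on generation")
    if prs_per_reviewer_now > 1.5 * prs_per_reviewer_before:
        findings.append("review deficit: per-reviewer load up more than 50%")
    if behavioral_coverage_pct < 100:
        findings.append(f"verification gap: only {behavioral_coverage_pct}% of "
                        "PRs have behavioral coverage before merge")
    if governance_hours_per_deploy >= 168:  # weekly board = 168-hour ceiling
        findings.append("governance ceiling: effectively one deploy per week")
    return findings

# A team in the Unhappy Middle trips all four checks.
for finding in audit(spec_hours=1, gen_hours=2,
                     prs_per_reviewer_now=30, prs_per_reviewer_before=15,
                     behavioral_coverage_pct=40,
                     governance_hours_per_deploy=168):
    print(finding)
```

An empty result does not prove the pipeline is healthy, but any finding points at the constraint to fix before raising agent throughput.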
What Comes Next
The next 12 months will likely separate teams that understand this into two camps: those that invest in specification tooling, verification automation, and governance reform, and those that add more agents and wait for delivery metrics to improve. The former will scale. The latter will accumulate technical debt at an unprecedented rate.
The 2026 DORA report, expected later this year, will likely formalize what the current data already suggests: individual productivity is no longer a meaningful proxy for organizational delivery health. New metrics, focused on specification quality, verification coverage, and governance velocity, are already being discussed in the DORA community as necessary additions to the standard framework.
The teams paying attention to this now will have a compounding advantage. Delivery infrastructure is slow to build and slow to copy. Getting it right in 2026 is a moat.
Further Reading
- AI Coding Assistants Haven’t Sped up Delivery Because Coding Was Never the Bottleneck — The original Agoda analysis that sparked this conversation; required reading for engineering leads.
- DORA State of AI-Assisted Software Development 2025 — The most comprehensive dataset on how AI adoption is affecting engineering performance across thousands of teams.
- The AI Productivity Paradox Research Report (Faros) — Detailed analysis of why individual productivity metrics and delivery metrics are diverging, with concrete recommendations.

