
Why AI Makes Experienced Developers 19% Slower


Introduction

In July 2025, a research team at METR published what is probably the most carefully designed study on AI coding tools to date — and the headline result was not what anyone expected. Experienced developers using AI tools took 19% longer to complete coding tasks compared to working without them. The same developers had predicted they would finish 24% faster. Even after living through the slowdown, they believed they had been 20% more productive. All three numbers cannot be right simultaneously, and that gap is worth understanding.
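To see why the three numbers cannot coexist, here is the arithmetic on a nominal one-hour task. The 60-minute baseline is illustrative, not a figure from the study, and "X% faster" is read here as X% less time:

```python
# Illustrative arithmetic only; the 60-minute baseline is assumed, not METR's data.
baseline_min = 60

measured  = baseline_min * (1 + 0.19)  # actually took 19% longer    -> 71.4 min
predicted = baseline_min * (1 - 0.24)  # forecast: 24% less time     -> 45.6 min
perceived = baseline_min * (1 - 0.20)  # felt afterward: 20% less    -> 48.0 min

print(f"measured:  {measured:.1f} min")
print(f"predicted: {predicted:.1f} min")
print(f"perceived: {perceived:.1f} min")
# The felt experience (~48 min) and the stopwatch (~71 min) differ by more
# than 20 minutes on a single one-hour task.
```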

The METR study (“Measuring the Impact of Early-2025 AI on Experienced Open-Source Developer Productivity”) ran a proper randomized controlled trial — the kind of experiment that most “AI boosts productivity by X%” claims conspicuously lack. Seven months later, in February 2026, METR updated their experimental design for reasons that are equally revealing. This is not a story about AI being useless. It is a story about what productivity actually means, and why the right question is not “does AI help?” but “for what, for whom, and how do you measure it?”

What the Study Actually Measured

The setup was methodologically rigorous. METR recruited 16 experienced open-source developers — contributors to large, mature repositories averaging 22,000 GitHub stars and over one million lines of code, who had each worked on their project for multiple years. Tasks were drawn from real issue trackers, averaged two hours each, and participants were paid $150 per hour. The AI condition used Cursor Pro with Claude 3.5 and 3.7 Sonnet — the tools considered best-in-class at the time. Across 246 tasks randomly assigned to either “AI allowed” or “no AI,” developers in the AI condition took 19% longer.

METR is careful not to overgeneralize. They explicitly state the findings do not necessarily apply to less-experienced developers, greenfield projects, or different domains. They also ruled out several confounds: code quality in the AI condition was not worse, developers complied with their assigned conditions, and the models were not poorly chosen. The slowdown was real, not an artifact of bad experimental design.

Five likely explanations emerged from the analysis: developers used overly simple prompts, interfaces like Cursor were still unfamiliar to some, the high quality standards of mature open-source projects conflicted with AI-generated suggestions, models struggled with the complexity of these specific codebases, and — perhaps most interesting — there was a cognitive cost to context-switching between natural thinking and AI-assisted generation. None of these factors alone explains a 19% slowdown, but together they paint a coherent picture.

The Perception Gap

The more provocative number from the study is not 19% but the contrast between that and the 20% speedup developers believed they had experienced. This is not a minor rounding error; the sign itself is wrong. Developers were measurably slower and felt measurably faster. How does that happen?

Quentin Anthony, one of the study participants, gave a clear-eyed answer in a DX interview: “We don’t focus on all the time we actually spent — we just focus on how it was more enjoyable.” AI coding tools make the work feel faster by reducing friction on the parts you notice — the blank-page moments, the boilerplate, the syntax you can’t quite recall. The overhead they introduce — prompt iteration, reviewing generated code that is almost right, debugging AI-introduced regressions — accumulates quietly and does not register the same way.

This is a well-documented feature of human cognition under new workflows: we are poor at tracking time spent on tasks we enjoy or find novel. The implication for teams measuring AI productivity by asking developers whether they feel more productive is blunt: that data is unreliable.

Why Task Type Changes Everything

The METR findings seem to contradict a pile of other studies. A widely cited 2023 GitHub Copilot experiment found developers completed an HTTP server implementation task 55% faster with AI assistance. GitHub’s broader research reported up to 51% faster completion for certain tasks. Google’s CEO Sundar Pichai credits AI with roughly a 10% improvement in engineering velocity across the company. How do you square these with a 19% slowdown?

The answer is task-model fit — a framing that Quentin Anthony put precisely: “The bigger issue is that some tasks just don’t map well to what models can do today. The study is really about task-model fit, not developer aptitude.” The GitHub experiments tested developers on well-scoped greenfield tasks (write a working HTTP server) in codebases they had never seen. Models are genuinely fast at this. The METR study tested experts maintaining complex, decade-old codebases they knew intimately, fixing real bugs with real context requirements. Models are not good at this — yet.

The divide is consistent across every study that looks closely: AI tools add the most value for boilerplate generation, documentation, writing tests for existing functions, and code that looks like its training data. They struggle with tasks requiring deep understanding of a specific codebase, novel architecture decisions, and edge cases that require reasoning about system-level behavior. Experienced developers working in familiar codebases spend most of their time in the second category. Junior developers and anyone working in a new domain spend more time in the first.

METR’s February 2026 Update

In February 2026, METR published a significant update to their experimental program — and the reasons are instructive. They had attempted to run follow-up studies in late 2025 and discovered a new problem: selection bias so severe it threatened to invalidate their methodology.

Between 30 and 50 percent of developers were avoiding submitting tasks they believed would benefit from AI assistance — preferring instead to submit the hard, long tasks they expected would not benefit. One participant described the dynamic explicitly: “I found I am actually heavily biased sampling — I avoid issues like AI can finish in just 2 hours, but I have to spend 20 hours.” Additionally, many developers refused to participate in any study that required them to work without AI, even at $50 per hour. The developers most committed to AI tools were opting out of the very experiments designed to measure those tools.

The late-2025 raw data from returning developers suggested an 18% speedup — a dramatic reversal from the July 2025 results — but METR believes this figure is inflated by selection effects, even granting genuine improvements in both the tools and developers’ facility with them. Their response is to redesign the studies: shorter, more intensive experiments, observational analyses, and developer-level rather than task-level randomization. The honest interpretation is that the tools have improved, developers have learned to use them better, and the original study’s results no longer fully apply — but nobody has yet measured the current state cleanly.
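To make that last design change concrete, here is a toy sketch of the two randomization units. This is an illustration of the statistical idea, not METR's code, and every name in it is invented:

```python
# Toy illustration of randomization units; developers, tasks, and structure are invented.
import random

developers = ["dev_a", "dev_b", "dev_c", "dev_d"]
tasks = {d: [f"{d}_task_{i}" for i in range(3)] for d in developers}

# Task-level: each submitted task is independently assigned a condition, so a
# developer who withholds AI-friendly tasks skews the pool both arms draw from.
task_level = {t: random.choice(["ai", "no_ai"])
              for d in developers for t in tasks[d]}

# Developer-level: the person is the unit of assignment, so which condition a
# task lands in no longer depends on per-task submission decisions.
developer_level = {d: random.choice(["ai", "no_ai"]) for d in developers}

print(task_level)
print(developer_level)
```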

What This Means for Your Workflow

Three practical takeaways hold up across all of this data. First, time-box your AI attempts. If a model is not producing useful output on a complex task within about ten minutes of prompt iteration, stop. The cognitive overhead of continued prompting on poorly scoped tasks eats into the time savings you get elsewhere. This is not a reason to abandon AI tools; it is a reason to use them selectively.
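One low-tech way to enforce the cutoff is a literal timer. The sketch below is a hypothetical helper, not anything from the study; run it when you start prompting, and treat its expiry as the signal to switch to writing the code yourself:

```python
# timebox.py: a minimal sketch of a prompt-iteration time-box (hypothetical helper).
import sys
import time

BUDGET_MIN = 10  # the ~10-minute budget suggested above

def main() -> None:
    deadline = time.monotonic() + BUDGET_MIN * 60
    print(f"Time-box started: {BUDGET_MIN} minutes. Ctrl-C if the model delivers early.")
    try:
        while time.monotonic() < deadline:
            time.sleep(5)
        # \a rings the terminal bell where supported
        print("\aTime-box expired: stop prompting and implement it manually.")
    except KeyboardInterrupt:
        print("Time-box cancelled: the AI attempt paid off.")
        sys.exit(0)

if __name__ == "__main__":
    main()
```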

Second, apply AI most aggressively to tasks with high boilerplate-to-logic ratios: writing tests for known functions, generating documentation, drafting configuration files, converting between data formats. These are the tasks where models have strong priors and the failure modes are cheap to catch. Apply it most cautiously to tasks requiring deep familiarity with your specific system’s behavior and history.
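As a concrete illustration of that sweet spot, consider tests for a small, well-specified function. The function and cases below are hypothetical examples, but they show the shape of the task: lots of repetitive scaffolding, very little novel logic, and failures that surface immediately when the suite runs:

```python
# A sketch of the "high boilerplate-to-logic" sweet spot: tests for a small,
# well-specified pure function. The function and cases are hypothetical examples.
import re

import pytest

def slugify(title: str) -> str:
    """Lowercase the title and replace runs of non-alphanumerics with hyphens."""
    return re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")

@pytest.mark.parametrize(
    ("title", "expected"),
    [
        ("Hello, World!", "hello-world"),
        ("  spaces   everywhere  ", "spaces-everywhere"),
        ("already-a-slug", "already-a-slug"),
        ("", ""),
    ],
)
def test_slugify(title: str, expected: str) -> None:
    assert slugify(title) == expected
```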

Third, do not trust your intuition about whether AI made you faster. If your team wants to measure productivity impact, use objective metrics — cycle time, merge frequency, issue close rates — not developer surveys. The METR study’s perception gap is not a quirk of their sample; it reflects something real about how humans experience assisted versus unassisted work.
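As a starting point for the objective-metrics route, here is a minimal sketch that counts merges per week from local git history. It assumes a local clone, and "merge frequency" is simplified to merge commits; adapt the definition if your team squash-merges or commits straight to trunk:

```python
# A sketch of one objective metric: merges per week from local git history.
import subprocess
from collections import Counter
from datetime import datetime

def merges_per_week(repo: str, since: str = "90 days ago") -> Counter:
    # %cI is the committer date in strict ISO 8601, which fromisoformat can parse.
    out = subprocess.run(
        ["git", "-C", repo, "log", "--merges", f"--since={since}",
         "--pretty=format:%cI"],
        capture_output=True, text=True, check=True,
    ).stdout
    weeks = Counter()
    for line in filter(None, out.splitlines()):
        year, week, _ = datetime.fromisoformat(line).isocalendar()
        weeks[f"{year}-W{week:02d}"] += 1
    return weeks

if __name__ == "__main__":
    for week, count in sorted(merges_per_week(".").items()):
        print(week, count)
```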

Conclusion

The METR study’s 19% slowdown result has been cited widely, often without the important context that surrounds it: a specific type of developer (expert), a specific class of task (complex, familiar codebase), and a specific moment in time (early 2025 tools). METR’s own February 2026 update suggests the landscape has shifted, and re-measurement is ongoing. What the study reliably established — and what no follow-on experiment has refuted — is the perception gap. Developers feel productive with AI tools in ways that outrun what the data supports. Closing that gap, through honest measurement and better task selection, is the actual work of making these tools pay off.

