The METR Finding: A New Kind of Moore’s Law
In March 2025, researchers at METR (Model Evaluation and Threat Research) published something that deserves more attention than it got: a rigorous measurement showing that the length of coding tasks frontier AI agents can complete with 50% reliability has been doubling approximately every seven months since 2019. An updated version of the study, released in January 2026, revised that figure sharply downward: restricted to models released in 2024 or later, the estimated doubling time is closer to 89 days, roughly three months.
This is not a press release metric. METR evaluated 13 frontier models across 228 tasks drawn from real software engineering work, with human domain experts timing themselves on the same tasks to set a baseline. The R² of the exponential fit across six years of data is 0.83, a strong fit for a trend this noisy. The study is available at arXiv:2503.14499.
The metric they track is called the “50% time horizon”: the duration of a task (as measured by how long it takes a skilled human) that an AI agent can complete autonomously with a 50% success rate. That is a deliberately permissive threshold. At 80% reliability, the time horizon is roughly one-fifth as long, a reminder that “can sometimes do this” and “will reliably do this in production” are very different things.
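The fivefold gap between the 50% and 80% horizons falls out of the shape of the success curve METR fits: success probability declines logistically as task length (on a log scale) grows. A minimal sketch of that relationship, where the slope value is illustrative, chosen only to reproduce the roughly fivefold gap the text cites, not a parameter taken from the paper:

```python
import math

def horizon_at(success_level: float, h50: float, slope: float) -> float:
    """Task length (minutes) an agent completes at `success_level`,
    assuming a logistic success curve in log2(task length) that
    crosses 50% at h50. Steeper slopes shrink the gap between horizons."""
    logit = math.log(success_level / (1 - success_level))
    return h50 * 2 ** (-logit / slope)

# Claude Opus 4.5's 50% horizon from the text; slope chosen for illustration.
h50 = 320
h80 = horizon_at(0.80, h50, slope=0.6)
print(f"80% horizon: ~{h80:.0f} min ({h50 / h80:.1f}x shorter)")
```

The takeaway: the flatter the success curve, the wider the spread between "works half the time" and "works reliably," which is why the single headline horizon number can overstate what you would actually ship on.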
The Numbers in Practice
The growth from 2023 to early 2026 is striking when you line up the benchmarks side by side. GPT-4, released in March 2023, had a 50% time horizon of approximately 6 minutes. Claude 3.7 Sonnet in February 2025 reached 55 minutes. By late 2025, GPT-5.2 was at 352 minutes (~6 hours) and Claude Opus 4.5 at 320 minutes. The Time Horizon 1.1 update from January 2026 placed Claude Opus 4.6 at approximately 719 minutes—roughly 12 hours of human-equivalent work.
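The doubling times quoted earlier can be recovered from any two of these data points, assuming exponential growth between them. A quick back-of-the-envelope check, with release dates approximated to month granularity and the horizon figures taken from the text:

```python
import math
from datetime import date

def doubling_time_days(d1: date, h1: float, d2: date, h2: float) -> float:
    """Days per doubling implied by two (release date, horizon-minutes)
    points, assuming exponential growth between them."""
    return (d2 - d1).days / math.log2(h2 / h1)

# Figures cited in the text.
full_era = doubling_time_days(date(2023, 3, 1), 6, date(2025, 2, 1), 55)    # GPT-4 -> Claude 3.7 Sonnet
recent   = doubling_time_days(date(2025, 2, 1), 55, date(2026, 1, 1), 719)  # Claude 3.7 -> Opus 4.6
print(f"2023-2025: one doubling every ~{full_era:.0f} days")
print(f"2025-2026: one doubling every ~{recent:.0f} days")
```

Both endpoints land close to the fitted estimates (~220 days is about seven months; the recent pair gives ~90 days), which is reassuring: the headline doubling times are visible in the raw numbers, not just in the regression.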
That is a roughly 120x increase in under three years. If the trend holds at the recently observed pace of doubling every three to four months, METR projects that AI agents will be capable of completing tasks that take a human a full work week by 2028, and tasks requiring a month of work by 2029.
It is worth being precise about what this does and does not mean. The “time horizon” measures the elapsed time a human would spend on a task—not the calendar time an agent needs to run. A 12-hour time horizon task might be something like implementing a non-trivial API integration, writing comprehensive tests, and fixing edge cases across multiple files. It does not mean the agent completes it in 12 human hours; it means the scope and complexity of the problem matches that human effort level.
Why Complexity Still Humbles the Benchmarks
Here is where the data gets humbling for anyone tempted to extrapolate linearly. SWE-Bench Verified—the widely cited benchmark for coding agent performance—has been criticized for including too many trivial tasks: 161 of its original 500 problems required only one or two lines of code to fix. Scale AI’s SWE-Bench Pro addressed this by excluding trivial edits and keeping only problems requiring meaningful multi-file changes. The reference solutions average 107 lines across 4.1 files.
The performance drop is dramatic. Claude Opus 4.5 scores 80.9% on SWE-Bench Verified and 45.9% on SWE-Bench Pro—the same model, nearly half the score, on tasks that better reflect real engineering work. Model performance degrades consistently as the number of files touched and lines changed increases. The headline benchmark numbers are real, but they are measured on the easier half of the distribution.
This creates a meaningful gap between “what AI agents can do on a good benchmark day” and “what you can reliably assign to an agent on a production codebase.” The 50% time horizon is a median, not a floor. Half of tasks at that complexity level will fail—and in a real engineering context, a failed agentic run that partially modifies code and then stops is often worse than not starting at all.
There is another inconvenient data point from METR itself: their separate study on how AI tools affect experienced developers found that professionals using the tools available in early 2025 were 19% slower than those working without them. The capability of the underlying models has improved since then, but the finding is a reminder that raw task-completion ability and practical productivity are different measurements entirely.
What This Means for Development Teams
The honest reading of the METR data is not “AI will replace developers.” It is “the kind of work you can delegate to an agent is expanding at an unusually predictable rate, and you should be planning for that now rather than when it arrives.”
A few concrete implications:
Task decomposition becomes a high-value skill. As agents handle longer autonomous runs, the bottleneck shifts to how well a human can frame a task, provide sufficient context, and set clear acceptance criteria. Writing a crisp ticket that an agent can execute without interruption is harder than it looks, and it is increasingly a distinct skill from writing code.
Review load will grow, not shrink. More autonomous agent output means more code arriving in pull request queues that no human wrote, and that a human must still understand before merging. As vortx.ch noted in its 2026 agentic coding trends data, 90% adoption of AI coding tools has not translated into DORA metric gains, largely because the review and verification overhead is absorbing the throughput gains.
The complexity cliff matters more than the headline number. A model that can complete 12-hour tasks at 50% reliability is genuinely useful for bounded, well-specified work. It is less useful for the ambiguous, multi-stakeholder, performance-sensitive problems that define the hard part of software engineering. Routing the right tasks to agents—and keeping humans on the rest—is where the actual leverage is.
Projections should inform hiring and tooling strategy now. If the doubling trend continues at anything close to its current pace, the tasks agents reliably handle will look very different in 18 months than they do today. Organizations that treat agentic AI as a static capability they evaluated once in 2025 will be behind teams that reassess quarterly.
The Acceleration Risk
There is one scenario the METR researchers flag that deserves mention: if AI systems become capable enough to meaningfully accelerate AI research itself, the doubling rate could shift from empirical observation to a self-reinforcing loop. At that point, extrapolation from historical trend lines becomes much less reliable in either direction.
We are not there yet. The current 89-day doubling time for 2024-onward models is still based on human-evaluated tasks that humans designed. But the transition from “AI assists AI research” to “AI drives AI research” is not a hypothetical—it is something Anthropic, OpenAI, and DeepMind have all explicitly named as a near-term goal. The METR data is valuable precisely because it gives us a baseline to measure that transition against, if and when it happens.
For now: AI agents can complete, with even odds, tasks that take a skilled human about half a work day. By late 2028, if nothing breaks the trend, that number will be a full week. Plan accordingly.
Further Reading
- METR: Measuring AI Ability to Complete Long Tasks (arXiv) — the primary source; includes full methodology, model evaluations, and the updated Time Horizon 1.1 data
- SWE-Bench Pro: Can AI Agents Solve Long-Horizon Software Engineering Tasks? — why standard benchmarks understate the difficulty of real engineering work, and what happens to model performance when you raise the bar
- A New Moore’s Law for AI Agents (AI Digest) — accessible breakdown of the METR time horizon findings and their projected milestones through 2029

