METR has turned one of the fuzziest questions in AI into a measurable curve: how long a task a model can complete autonomously, where task length is defined by how long the task takes a qualified human expert. The latest public figures, from METR's live dashboard as last updated on March 3, 2026, keep the central pattern intact. Frontier AI task-completion horizons have been rising exponentially for years.
The exact doubling figure is worth stating carefully. METR's March 19, 2025 paper and its January 29, 2026 Time Horizon 1.1 release described the frontier trend as roughly 196 days, or about 7 months. The live dashboard correction published on March 3, 2026 now shows a stitched all-time estimate of 187.8 days. That is slightly faster, not slower, and close enough that "about every 7 months" is still a fair shorthand.
The useful part is the measurement framework
METR is not asking whether a model can answer a hard question or ace a benchmark prompt. It measures whether an agent can complete entire tasks end to end, then fits a logistic curve to predict success as task duration increases. The durations come from human expert completion times, not model wall-clock time. That ties the result to the amount of coherent work being delegated, not to how fast the model types.
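The shape of that estimate can be sketched in a few lines. This is my own illustration of the general approach, not METR's code: fit a logistic curve of success probability against log task duration, then invert it to read off the duration at a chosen success rate. The task data below is hypothetical.

```python
# Sketch of a METR-style horizon estimate (illustrative, not METR's actual code).
# Fit p(success) = 1 / (1 + exp(-(a - b * log2(minutes)))) to per-task results,
# then solve for the task length where predicted success equals a target rate.
import numpy as np
from scipy.optimize import curve_fit

def logistic(log2_minutes, a, b):
    return 1.0 / (1.0 + np.exp(-(a - b * log2_minutes)))

# Hypothetical data: human expert completion time (minutes) and agent success rate.
minutes = np.array([2, 8, 30, 60, 240, 480, 960, 1800], dtype=float)
success = np.array([1.0, 1.0, 0.9, 0.8, 0.55, 0.4, 0.2, 0.1])

(a, b), _ = curve_fit(logistic, np.log2(minutes), success, p0=(5.0, 0.5))

def horizon(p):
    # Invert the logistic: the log2 duration where predicted success equals p.
    return 2.0 ** ((a - np.log(p / (1.0 - p))) / b)

print(f"50% horizon: {horizon(0.5):.0f} min, 80% horizon: {horizon(0.8):.0f} min")
```

Because the curve slopes downward in duration, the 80% horizon always sits well below the 50% horizon, which is exactly the gap visible in the Opus 4.6 numbers below.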
Time Horizon 1.1 also tightened the methodology. METR expanded the suite from 170 to 228 tasks, increased the count of 8-hour-plus tasks from 14 to 31, moved its evaluation infrastructure to Inspect, and continued running multiple independent trials per task. That is why this dataset is more useful than the usual "model X solved problem Y" anecdote. It is trying to estimate a capability frontier, not win a demo war.
Opus 4.6 pushed the frontier past the half-day mark
The current standout on METR's live page is Claude Opus 4.6, added on February 20, 2026. Its fitted 50% time horizon is 718.8 minutes, or just under 12 hours. Its fitted 80% time horizon is 69.9 minutes, or a little over 1 hour. Those numbers are a reminder that reliability falls off sharply as tasks get longer, even for the best current systems.
The raw task data shows why the trend still matters. On one task with an estimated human duration of 1,800 minutes (30 hours, roughly 1.25 days), Opus 4.6 went 6 for 6. That does not mean AI can now reliably complete any 30-hour task. It means the frontier has advanced far enough that fully successful day-scale runs are no longer theoretical edge cases on the public record.
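A quick statistical check shows why 6-for-6 warrants the hedging. Using a standard Clopper-Pearson lower bound (my own illustration, not a calculation METR publishes): with six successes and zero failures, the one-sided 95% lower bound on the true success rate is only about 0.61.

```python
# How much does 6-for-6 actually prove? With n successes and 0 failures, the
# one-sided (1 - alpha) Clopper-Pearson lower bound on the true success rate
# is alpha ** (1/n). This is a textbook bound, applied here for illustration.
def lower_bound_success(n_successes: int, alpha: float = 0.05) -> float:
    return alpha ** (1.0 / n_successes)

print(f"{lower_bound_success(6):.2f}")  # ~0.61: consistent with a true rate well below "reliable"
```

Six clean runs are striking, but they are statistically consistent with a true per-attempt success rate in the low 60s, which is why the fitted curve, not any single task, is the number to plan against.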
That distinction is important for anyone making planning decisions. The fitted curve says "expect around 50% success at 12-hour tasks." The raw task win says "some 30-hour tasks are already crossing into dependable territory." Both are true, and they describe a market that is moving faster than most roadmap decks admit.
The business signal is about sequencing, not surrender
METR's own FAQ is explicit about the limits. The tasks are mostly software engineering, machine learning, and cybersecurity. They are cleaner than real jobs. They resemble what a capable outsider with little prior context could accomplish, not what a deeply embedded employee can do inside a messy organization. That makes the data narrower than the grandest automation claims, and more useful for exactly that reason.
If you run an agency, internal product team, or operations-heavy business, the planning implication is straightforward: stop treating "AI agents" as one category. Sort workflows by how much context they require, how well success can be scored, and how many hours of coherent effort they demand. The first tasks that move are not "entire jobs." They are bounded pieces of work that already look like METR tasks: clear inputs, tool access, explicit checks, and a limited need for social coordination.
This is also why the seven-month doubling figure matters more than the absolute number. A 12-hour 50% horizon today does not stay a 12-hour horizon for long if the curve keeps compounding. A capability that feels marginal for a quarterly planning cycle can look operational by the next budgeting cycle.
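The compounding arithmetic is simple enough to run yourself. This back-of-envelope extrapolation assumes the 187.8-day doubling from METR's dashboard simply continues, which is an assumption, not a forecast METR endorses.

```python
# Back-of-envelope extrapolation of the ~188-day doubling trend (illustrative
# only; assumes the trend continues unchanged, which no one guarantees).
DOUBLING_DAYS = 187.8   # stitched all-time estimate from METR's live dashboard
H0_HOURS = 12.0         # roughly Opus 4.6's fitted 50% time horizon

def projected_horizon_hours(days_ahead: float) -> float:
    # Exponential growth: the horizon doubles every DOUBLING_DAYS days.
    return H0_HOURS * 2.0 ** (days_ahead / DOUBLING_DAYS)

for days in (90, 188, 365, 730):
    print(f"+{days:4d} days: ~{projected_horizon_hours(days):6.1f} h at 50% success")
```

On those assumptions, a 12-hour horizon passes a full 24-hour workday in about six months and a standard 40-hour workweek within roughly a year, which is the concrete sense in which "marginal this quarter" can become "operational next budget cycle."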
Planning against the curve beats arguing with it
The right verdict is that METR has given businesses a capability roadmap, not a hype slogan. As of March 2026, the best public frontier agents still are not reliable substitutes for broad, open-ended knowledge work. They are, however, advancing on a measured curve that keeps pulling multi-hour and now occasional day-scale tasks into scope. Teams that model that curve realistically will make better decisions than teams still waiting for either full automation or full disappointment.
