A common pitch cycle for AI agents goes like this: vendor shows benchmark scores, scores look impressive, team buys in, deployment starts, and then something weird happens. The agent works fine Monday morning. Tuesday afternoon it halves its output quality on the same task. Friday it confidently gives the wrong answer.
Benchmarks never predicted this. That is by design --- and new research from Princeton finally explains the mechanism.
Accuracy Is the Wrong Measurement
The paper, Towards a Science of AI Agent Reliability by Stephan Rabanser, Sayash Kapoor, and Arvind Narayanan, evaluated 14 frontier models across two benchmarks, testing 500 runs per configuration over nearly two years of model releases. The finding that matters:
- Accuracy improved 21% per year. Models are genuinely getting smarter at answering questions correctly.
- Reliability improved just 3% per year. Models are barely getting better at doing so consistently.
That gap --- 21% vs 3% --- is not a rounding error. It is a fundamental mismatch between what the industry measures and what production systems require.
The researchers borrow a framing from aviation and nuclear safety: you do not evaluate a control system by whether it can produce the right output under ideal conditions. You evaluate it by whether it will produce the right output under varying conditions, repeated across time, with partial failures and edge inputs in the mix.
AI benchmarks almost exclusively test the first criterion. Production environments demand the second.
Four Things That Actually Break
The Princeton framework decomposes reliability into four dimensions, each with specific sub-metrics. Here is what each one looks like when it fails in practice:
Consistency (30% to 75% outcome consistency across models) is the most visible failure mode. Run the same task twice, get different results. Not just slightly different --- sometimes meaningfully different. A summarization agent that produces a 4-point summary at 9am and a 7-point summary at 2pm on identical input is not broken in a way any benchmark would catch.
Robustness breaks when prompts are reformulated. The paper found that most models remain susceptible to surface-level prompt changes: rewording a request slightly, changing capitalization, or adjusting punctuation can shift outputs in ways that have nothing to do with the underlying task. If your users do not phrase questions identically every time (they do not), your accuracy numbers are not representative.
Predictability is about whether a model can signal its own uncertainty. Can it tell you when it is likely to be wrong? The answer, across the 14 models tested, is: inconsistently. Some tasks produce overconfident wrong answers at rates that make the model actively harmful as a decision aid.
Safety in this context is not about harmful content. It is about whether the agent behaves safely under failure conditions --- server crashes, API timeouts, ambiguous inputs. Most models handle hard failures gracefully. The soft failures (ambiguous states, partial information, conflicting context) are where things deteriorate.
The 21% vs 3% Gap, In Practice
Here is the operational translation: a model that improved from 60% to 73% accuracy over the past year is genuinely more capable. But if its consistency score stayed at 45%, a team deploying it as an agent will notice the 45% long before they notice the 73%.
This explains something that has puzzled executives who watch benchmark releases and then ask why their AI pilots have stalled. The models are improving. The improvement just is not landing in the dimension that determines whether an agent can be trusted with a real workflow.
The Princeton team puts it plainly: "We hope [this] can help explain the puzzlement among many in the industry as to why the economic impacts of AI agents have been gradual, even though they are crushing capability benchmarks."
What Deployers Should Audit Before They Ship
The researchers suggest several interventions for teams actually deploying agents rather than evaluating them:
Run your own consistency tests. Take 20 representative tasks. Run each one 10 times with identical inputs. If you see more than 20% variance in key output fields, the model is not ready for that workflow without a validation layer.
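A consistency check like this is simple to script. The sketch below is a minimal, hypothetical version: it assumes you have already collected the key output field from 10 identical runs of one task (here simulated as strings), and it scores consistency as the fraction of runs that agree with the most common answer.

```python
from collections import Counter

def consistency_score(outputs: list[str]) -> float:
    """Fraction of runs that agree with the most common output."""
    counts = Counter(outputs)
    return counts.most_common(1)[0][1] / len(outputs)

# Simulated key-field outputs from 10 identical runs of one task
runs = ["approve", "approve", "reject", "approve", "approve",
        "approve", "reject", "approve", "approve", "approve"]

score = consistency_score(runs)
print(f"consistency: {score:.0%}")  # prints "consistency: 80%"
```

Under the 20%-variance rule of thumb above, this task would fail the check: 2 of 10 runs disagree with the modal answer, so the workflow needs a validation layer before this model touches it.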
Test prompt robustness explicitly. Give the same task to 3-4 people on your team and collect the natural phrasings they use. Run all variants. If you get materially different outputs from semantically identical inputs, build a canonicalization step or switch models for that task.
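One way to make that comparison concrete is to group the collected phrasings by the output they produced; more than one group means semantically identical prompts diverged. The prompts and outputs below are invented placeholders for whatever your team collects.

```python
def robustness_report(prompt_to_output: dict[str, str]) -> dict[str, list[str]]:
    """Group prompt variants by the output they produced."""
    groups: dict[str, list[str]] = {}
    for prompt, output in prompt_to_output.items():
        groups.setdefault(output, []).append(prompt)
    return groups

# Hypothetical outputs for four natural phrasings of the same request
observed = {
    "Summarize this ticket":             "summary-A",
    "Give me a summary of this ticket":  "summary-A",
    "summarize ticket pls":              "summary-B",
    "Can you summarize the ticket?":     "summary-A",
}

report = robustness_report(observed)
print(f"{len(report)} distinct outputs from {len(observed)} phrasings")
```

Here the lowercase, punctuation-free variant lands in its own group, which is exactly the surface-level sensitivity the paper describes and the trigger for adding a canonicalization step.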
Build failure-mode logging from day one. The predictability dimension is nearly impossible to assess pre-deployment. Log every agent decision alongside a confidence proxy (where available) and human-reviewed outcomes. You will find the failure patterns faster than any benchmark.
Split the scorecard. If you are evaluating a vendor or making a build-vs-buy decision, ask for consistency numbers alongside accuracy numbers. Any vendor who cannot provide them has not measured what matters.
The interactive reliability dashboard from the Princeton team is publicly available and covers 14 models across their full 12-metric framework. It is a more useful starting point for model selection than any leaderboard that only shows accuracy.
Capability scaling and reliability scaling are separate curves, and right now they are diverging. Every team shipping agents in 2026 is navigating that gap, whether they have measured it or not.
