Before you trust the score
Agent evaluation budget card
Fill this out before an agent benchmark or pilot result becomes a purchase decision, production gate, or safety claim.
- 01 Task
  Pins down: The exact workflow or benchmark item the agent was asked to complete.
  Why it matters: Capability claims only matter when the task is specific enough to repeat.
- 02 Human time horizon
  Pins down: How long a competent person would take on the same task.
  Why it matters: Longer human tasks usually need more agent budget before the result is informative.
- 03 Feedback type
  Pins down: Whether the agent can check itself with tests, tools, logs, or other immediate signals.
  Why it matters: Extra compute helps most where the agent can learn from work-in-progress feedback.
- 04 Baseline budget
  Pins down: The token, time, attempt, or step ceiling behind the headline score.
  Why it matters: The budget is part of the measurement, not an implementation detail.
- 05 High-budget sweep
  Pins down: The larger budgets tested to see whether performance keeps climbing or plateaus.
  Why it matters: A still-rising curve means the baseline score is a lower bound, not the ceiling.
- 06 Serial vs. parallel allocation
  Pins down: Whether budget went into one long run or several shorter attempts.
  Why it matters: Those two shapes can produce different reliability and cost profiles.
- 07 Stop rule
  Pins down: The condition that ended the run: token cap, wall-clock limit, step count, failed check, or human stop.
  Why it matters: The stop rule explains what the score actually measures.
- 08 Success receipt
  Pins down: The artifact proving completion, such as passing tests, a working migration, or a verified report.
  Why it matters: A self-reported done is not enough for agent work.
- 09 Reliability by budget
  Pins down: Pass rate at each tested budget level.
  Why it matters: One lucky completion and repeatable capability are different findings.
- 10 Plateau evidence
  Pins down: Whether performance flattened before the cap or was still improving when the run stopped.
  Why it matters: This is the difference between a capability ceiling and a budget ceiling.
- 11 Cost ceiling
  Pins down: The production cost of the budget that actually made the task reliable.
  Why it matters: A capability that exists only at an unaffordable budget is still not production-ready.
- 12 Production implication
  Pins down: Reject, cap, route, retest, require review, or proceed with a named operating boundary.
  Why it matters: The card should change the decision, not decorate the report.
If you cannot name the stop rule, you cannot trust the score.
The pilot team gives every model the same runway: same token ceiling, same wall-clock limit, same number of retries. It's the fair thing to do, and it looks like good methodology. Model A finishes the task. Model B stalls out at step eleven of twenty and gets marked "not there yet." The team writes up the finding, picks Model A, and moves on.
What they've actually measured is where their stopwatch ran out.
That's the practical implication of a finding the UK AI Security Institute (AISI) published on July 2: agent evaluations that cap compute at a fixed budget can make a genuinely capable model look incapable, simply because the eval stopped watching before the model finished working. AISI's own framing: "Standard evaluations cap how much compute AI agents can use. We show that raising those caps changes measured capability, the difficulty of tasks agents can solve, and how fast the frontier appears to move."
A score is a point. Capability is a curve.
Most agent benchmarks report one number. Pass rate. Pass/fail. Longest task completed. AISI's team ran frontier models against agentic benchmarks in cybersecurity, software engineering, math, academic reasoning, and healthcare. Instead of running each model once at one budget, they swept the token budget from low to high and tracked what changed. Their starting complaint: "Most agent evaluations still reduce capability to a single number: a benchmark score, a pass/fail, or the length of task an agent can finish." Their finding, after the sweep: "Model capability is not a single score but a curve over test-time compute."
Test-time compute, in plain terms, is everything an agent gets to spend while it's actually doing the work, as opposed to what went into training it. That includes tokens (how much the model reads, reasons over, and generates before answering); tool calls (how many times it can run a script, query a database, or hit an API); attempts (whether it gets one shot or can retry after failure); wall-clock time (how long the whole run gets to take); and how that budget gets spent: one long uninterrupted trajectory, or several shorter attempts run side by side with the best one kept.
A fixed-budget benchmark treats all of this as background noise. AISI's finding is that it isn't noise. It's the independent variable. Change the budget and you get a different capability, sometimes a substantially different one, sometimes the same one, depending entirely on what kind of task you're testing.
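To make "the budget is the independent variable" concrete, here is a minimal sketch of turning a run log into a pass-rate curve instead of a single score. The `runs` data and the `pass_rate_by_budget` helper are hypothetical, invented for illustration; they are not AISI's data or tooling.

```python
from collections import defaultdict

# Hypothetical run log for ONE task: (token_budget, passed) per attempt.
# These numbers are invented for illustration only.
runs = [
    (1_000_000, False), (1_000_000, False), (1_000_000, True), (1_000_000, False),
    (5_000_000, True), (5_000_000, False), (5_000_000, True), (5_000_000, True),
    (10_000_000, True), (10_000_000, True), (10_000_000, True), (10_000_000, False),
]


def pass_rate_by_budget(runs):
    """Return {budget: pass_rate}, ordered by budget, from (budget, passed) pairs."""
    totals = defaultdict(int)
    passes = defaultdict(int)
    for budget, passed in runs:
        totals[budget] += 1
        passes[budget] += int(passed)
    return {b: passes[b] / totals[b] for b in sorted(totals)}


curve = pass_rate_by_budget(runs)
for budget, rate in curve.items():
    print(f"{budget:>12,} tokens: {rate:.0%} pass rate")
```

Reporting the whole dictionary, rather than only the entry for the baseline budget, is the difference between a point and a curve.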
Where more budget changes the answer, and where it doesn't
The useful part of AISI's report isn't "throw more compute at it." It's the shape of where that extra compute pays off, because it isn't uniform.
On software engineering tasks (TerminalBench 2.0 and SWE-Bench Pro), raising the total token budget from 1 million to 10 million tokens lifted performance by roughly 25%. On math and academic reasoning (Humanity's Last Exam), pushing the budget up to 5 million tokens produced about a 22% gain. In cybersecurity tasks, around 8% of the tasks in AISI's suite were only solved once the budget crossed 10 million tokens, and some needed as much as 50 million. Those are wide, real gaps between what a capped-budget eval reports and what the model can actually reach given room to work. The Decoder's independent summary of the same report lands on the same two headline figures: roughly 25% on the coding tasks, roughly 22% on HLE.
Then there's HealthBench, where models plateaued well inside the standard budget. More tokens didn't move the number.
AISI's explanation for the split is the most operationally useful sentence in the report: extra compute helps most where an agent can check its own work as it goes, such as running code and reading the test output, or testing whether an exploit actually landed, and helps least where the feedback the agent gets while working is weak or absent. Coding and cyber tasks hand an agent a built-in verifier: did the tests pass, did the shell return what you expected. Healthcare tasks in AISI's suite didn't offer that same kind of loop back to the agent, so extra attempts didn't compound into extra correctness. They just repeated the same uncertainty.
That distinction matters more than the raw percentages. If you're evaluating an agent for a task with a tight, checkable feedback loop, a narrow budget is likely to understate the ceiling. If you're evaluating one for a task where the agent can't easily tell whether it's right, a bigger budget probably won't rescue a bad eval.
A related paper on multi-step cyber attack scenarios (arXiv:2603.11214) illustrates the extreme end of that first case. Testing frontier models against a 32-step corporate network intrusion and a 7-step industrial control system attack, the researchers found performance scaling log-linearly with inference-time compute, with no plateau in sight. Going from 10 million to 100 million tokens produced gains up to 59%. The best single run got through 22 of the 32 steps, in about 6 of the estimated 14 hours a human expert would need for the full attack. That's a narrow, specific scenario, not a general claim about every long-horizon workflow, but it's a clean demonstration of what "the curve keeps climbing" looks like when nobody's checking your homework except the outcome itself.
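"Scaling log-linearly" means each tenfold increase in tokens buys roughly the same absolute gain. A rough sketch of checking for that shape in your own sweep is below. The data points are invented for illustration and are not taken from the paper, and `fit_log_linear` and `predict` are hypothetical helpers: an ordinary least-squares fit on log10 of the budget, not the authors' method.

```python
import math


def fit_log_linear(budgets, scores):
    """Least-squares fit of: score = intercept + slope * log10(budget)."""
    xs = [math.log10(b) for b in budgets]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(scores) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, scores))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def predict(intercept, slope, budget):
    """Extrapolate the fitted line to another budget (an assumption, not a guarantee)."""
    return intercept + slope * math.log10(budget)


# Hypothetical sweep: average score observed at each token budget.
budgets = [1_000_000, 10_000_000, 100_000_000]
avg_scores = [6.0, 10.0, 14.0]

intercept, slope = fit_log_linear(budgets, avg_scores)
print(f"roughly {slope:.1f} points gained per 10x increase in tokens")
```

A positive slope that holds across the largest budgets you tested suggests the score at your cap is a lower bound. Extrapolating beyond the tested range assumes the trend continues, which only a further run can confirm.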
What this means the next time you cap a pilot
None of this argues for infinite budgets. Nobody's running production agents on a 50-million-token allowance for an unbounded amount of wall-clock time, and AISI isn't recommending that either. What they describe instead is an ongoing program: evaluate frontier models across multiple budgets, report reliability and reach against budget, define the minimum budget that's actually informative for a given task type, and build methods to forecast high-budget performance from cheaper runs.
The practical translation for a team running its own agent pilot: your evaluation needs to report where it stopped, not just what it found. A model that failed at your fixed budget might fail at any budget, or it might be sitting on a curve that keeps rising well past the point you cut it off. Those are different findings, and a single pass/fail number can't tell you which one you're looking at.
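One way to separate those two findings is to look at what the curve was doing when the run stopped. The sketch below is a deliberately simple heuristic, not a statistical test: `plateau_evidence` and its `min_gain` threshold are hypothetical, and with only a handful of runs per budget the difference between the last two points may be noise.

```python
def plateau_evidence(curve, min_gain=0.05):
    """Classify a budget sweep from its two largest budgets.

    curve: list of (budget, pass_rate) pairs.
    min_gain: hypothetical threshold; a smaller gain between the two
              largest budgets is treated as flat.
    """
    points = sorted(curve)
    if len(points) < 2:
        # A single budget cannot distinguish a capability limit from a budget limit.
        return "insufficient data"
    (_, prev_rate), (_, last_rate) = points[-2], points[-1]
    if last_rate - prev_rate >= min_gain:
        return "still climbing"
    return "plateaued"


print(plateau_evidence([(1_000_000, 0.2), (5_000_000, 0.4), (10_000_000, 0.6)]))
print(plateau_evidence([(1_000_000, 0.5), (5_000_000, 0.7), (10_000_000, 0.71)]))
print(plateau_evidence([(1_000_000, 0.3)]))
```

The "insufficient data" branch is the fixed-budget pilot from the opening: one budget, one number, and no way to tell which kind of failure it was.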
The Agent Evaluation Budget Card
Instead of trying to remember every caveat above mid-pilot, it helps to have a fixed artifact you fill out for every agent evaluation before you trust its result. We call it the Agent Evaluation Budget Card. Twelve fields, one page:
- Task — what the agent is actually being asked to do.
- Human time horizon — how long a competent person would take on the same task, as a reference point.
- Feedback type — does the agent get a checkable signal while it works (tests, exploit confirmation, compiler output), or is it working mostly blind?
- Baseline budget — the token/time/attempt ceiling you used for the headline number.
- High-budget sweep — the higher ceilings you also tested, and by how much you raised each one.
- Serial vs. parallel allocation — was the budget spent as one long attempt, or as several shorter attempts with the best one kept?
- Stop rule — the exact condition that ended the run: token limit, time limit, step count, or a human calling it.
- Success receipt — the concrete artifact that proves success (a passing test suite, a working migration, a completed report), not a self-reported "done."
- Reliability at each budget — pass rate at baseline vs. pass rate at the high-budget sweep.
- Hardest task reached — the most difficult item in your suite the agent actually solved, and at what budget.
- Plateau evidence — did performance flatten before the budget ran out, or was it still climbing when you stopped?
- Cost ceiling and production implication — what the high-budget run would cost to run in production, and whether that cost is one your workflow can actually absorb.
Fields 9 and 11, reliability at each budget and plateau evidence, are the ones a fixed-budget pass/fail score simply can't answer. If you can't fill them in, you don't have a capability finding yet. You have a budget finding.
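If you'd rather keep the card next to your eval results than in a document, the twelve fields could be captured as a small record. This is one possible shape, assuming free-text fields. The `BudgetCard` class, its attribute names, and the `finding_type` rule are all hypothetical, and the example values are invented.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class BudgetCard:
    # Fields follow the order of the twelve-item list above.
    task: str
    human_time_horizon: str
    feedback_type: str
    baseline_budget: str
    high_budget_sweep: List[str] = field(default_factory=list)
    allocation: str = "serial"  # "serial" or "parallel"
    stop_rule: Optional[str] = None
    success_receipt: Optional[str] = None
    reliability_by_budget: Dict[str, float] = field(default_factory=dict)
    hardest_task_reached: Optional[str] = None
    plateau_evidence: Optional[str] = None
    cost_and_production_implication: Optional[str] = None

    def finding_type(self) -> str:
        """Hypothetical rule: pass rates at two or more budgets plus plateau
        evidence are needed before calling the result a capability finding."""
        has_curve = len(self.reliability_by_budget) >= 2
        has_plateau = self.plateau_evidence is not None
        return "capability finding" if has_curve and has_plateau else "budget finding"


card = BudgetCard(
    task="Migrate a service's test suite",
    human_time_horizon="about 4 hours",
    feedback_type="unit tests",
    baseline_budget="1M tokens",
    stop_rule="token limit",
)
print(card.finding_type())

card.reliability_by_budget = {"1M tokens": 0.25, "10M tokens": 0.75}
card.plateau_evidence = "still climbing at 10M tokens"
print(card.finding_type())
```

The first print reports a budget finding and the second a capability finding, because only the second card has a curve and a statement about where that curve was heading.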

Where this fits with the rest of your agent stack
This card is a measurement discipline, not a replacement for the acceptance work you should already be doing once an agent's output ships. If your agents are producing PRs, our harness-before-hero-reviewer piece covers what should catch problems after the eval is done. If they're migrating code, the acceptance-receipt approach in our Java migration piece is the production-side complement to reliability-at-budget here. And if your team is already tuning how much a coding agent gets to reason before answering, that's a runtime decision downstream of this one. See our piece on agent effort policy for that layer, and our token-budget visibility piece for making that spend legible day to day.
If you're about to run a pilot, or you're sitting on a vendor demo you're not sure how to read, it's worth mapping the budget before you trust the score. We help teams build that evaluation plan as part of our process automation and responsible AI workflow controls work. Reach out if you'd like a second set of eyes on how your current agent evaluation is scoped, and where its stop rule is actually sitting.
