AI Development

OpenAI's eval playbook makes the harness part of the result

A 92% success rate is not enough to approve an AI agent pilot. Teams need to know what tools, retries, prompts, budgets, safeguards, and receipts produced the score.

Sean McLellan

Lead Architect & Founder

May 30, 20268 min read

A manager gets the pilot report on Friday afternoon.

The slide says the support agent completed 92% of assigned cases. That sounds strong enough to move from test queue to production, especially after six weeks of demos, vendor calls, and internal pressure to show progress.

Then someone asks a basic question: what counted as success?

Did the agent have access to the same ticketing tools it would use in production? Could it retry a failed action ten times? Did it read customer history, or only the pasted prompt? Were approvals simulated or real? Did it create clean audit notes, or did a human clean up the queue after the test? Were the failures included in the score, or quietly filtered out as setup problems?

A 92% score without those details is not an evaluation. It is a screenshot.

That is the practical value in OpenAI's May 29 post, "A shared playbook for trustworthy third party evaluations". The headline is about frontier model evaluations, but the lesson travels directly into business agent pilots: when an AI system can use tools, keep state, retry work, and act inside a workflow, the setup around the model becomes part of the result.

OpenAI calls that setup the "harness."

The harness is where the work happens

OpenAI's post argues that evaluation reports used to treat models more like chatbots. A model received a prompt. It returned an answer. The evaluator scored the answer.

AI agent evaluation harness cutaway showing tools, retries, budgets, safeguards, scoring, and receipt evidence.

That is too thin for agentic systems.

OpenAI defines the harness as the surrounding setup: prompts, tools, interfaces, control logic, memory, retries, validators, and other pieces that shape what the model can do. The post says useful reports should specify what claim the setup was designed to test, then share evidence that the result is valid.

For frontier model evaluations, OpenAI groups claims into three buckets: capability elicitation, safeguard performance, and comparison. In plain business language, those map to questions like:

Can this agent actually do the work if we give it the right setup?
Do the guardrails hold when the work gets messy?
Is this model or vendor better than the alternative under the same conditions?

The harness matters because the same model can look very different in two pilots. Give it a narrow tool set, no memory, one attempt, and a strict validator, and it may struggle. Give it broad tool access, long context, retries, custom prompting, and a forgiving success definition, and the same system may look production ready.

Neither result is automatically wrong. The problem is pretending the score means anything without the setup.

We made a similar argument in our earlier post on why agent evals should test workflow receipts. If an agent updates a CRM record, closes a support ticket, opens a Jira issue, or drafts an HR response, the final text is only one artifact. The evaluation also needs the tool calls, approvals, state changes, sources, recovery steps, and exception path.

The answer is not the whole work product. The trace is part of the work product.

A business pilot can hide a lot inside the score

Imagine a finance team testing an agent that reviews invoice exceptions.

In the pilot, the agent gets a clean spreadsheet, a short policy excerpt, and a preapproved list of vendors. It marks 92% of exceptions correctly. The team celebrates.

Production will not look like that.

Production may include duplicate invoices, vendor name variations, partial purchase orders, missing approvals, stale policy documents, and employees asking for exceptions in Slack. The agent may need to check the ERP, inspect a contract folder, ask for approval, write a disposition note, and leave enough evidence for audit.

If the pilot report only says "92% successful," procurement and operations leaders should slow down. They need to know what the agent was allowed to touch, how long it had to work, what it did when a tool failed, and whether the evaluator checked the evidence trail.

OpenAI gives frontier-evaluation examples where harness and budget materially change results. The post cites compaction for long cyber tasks, a UK AISI cyber evaluation where increasing the budget from 10 million to 100 million tokens improved performance by up to 59%, and a METR reward-hacking adjustment that reduced a GPT-5.4 time-horizon estimate from roughly 13 hours to about 6 hours.

Those are research examples, not invoice-processing examples. But the operating lesson is the same: budget, scaffolding, and validation can move the result.

For enterprise teams, "budget" does not only mean tokens. It can mean number of tool calls, retry limits, time allowed per case, human review availability, retrieval scope, escalation rights, and whether the agent can ask follow-up questions.

A pilot report should say those things out loud.

What to ask before trusting an agent evaluation

A good AI agent testing report does not need to be 80 pages. It does need to make the test reproducible enough that a buyer, operator, or engineering lead can understand what the score means.

Start with the claim. Was the evaluation testing whether the agent can complete the workflow end to end? Whether it can draft recommendations for a human? Whether safeguards prevent risky actions? Whether one model performs better than another inside the same tool setup?

A vague claim produces a vague score.

Then inspect the harness. What prompts were used? Which tools were available? Could the agent write to live systems, or only a sandbox? Did it have memory from prior cases? Were retries allowed? What validators checked the work? Which safeguards blocked actions or forced escalation?

Look closely at the budget. A support agent that gets 30 seconds, one tool call, and no retry is being tested for something different than an agent that gets 20 minutes, 40 tool calls, long context, and permission to recover from mistakes. Both may be useful tests. They should not be compared as if they measured the same thing.

Ask for the failures. Not just the failure rate. The failure types.

Did the agent make the wrong decision? Did it refuse work it should have completed? Did it take a shortcut? Did it call the wrong tool? Did it succeed only because the test data was cleaner than production? Did a human intervene? Were partial completions counted as wins?

Finally, ask for receipts. Logs, traces, policy excerpts, tool outputs, approval records, before-and-after state, and exception notes. In our post on ITBench-AA and enterprise IT agent benchmarks, we made the same point about enterprise agents: logs, traces, metrics, events, alerts, and topology snapshots are not nice-to-have evidence. They are how you tell whether the agent actually understood the environment or just landed on the right answer once.

For business teams, the evaluation artifact should look less like a leaderboard and more like a case file.

Validity hazards in normal language

OpenAI says evaluation reports should check hazards including reward hacking, refusals, contamination, broken problems, and sandbagging.

Those can sound like research lab terms. They are also ordinary business risks.

Reward hacking means the agent finds a way to score well without doing the intended work. In a sales-ops pilot, an agent might "resolve" duplicate lead records by merging aggressively because the test rewards fewer duplicates. The metric improves. The CRM gets worse.

Refusals matter in the other direction. An agent may decline tasks it should handle because the prompt or safeguard is too broad. A procurement assistant that refuses to summarize a vendor contract because it sees legal language is safer than a reckless agent, but it may also be unusable if every real contract triggers refusal.

Contamination means the test is not clean. The agent may have seen the benchmark, examples, answers, or near-duplicates during training, tuning, demos, or prompt development. In a business pilot, contamination can be more mundane: the team may accidentally test on polished sample tickets the vendor has already optimized around.

Broken problems are test cases that do not measure the intended thing. A customer-service eval might mark the agent wrong for not issuing a refund when the policy excerpt says the order is ineligible. In that case, the test is broken, not the agent.

Sandbagging means the model behaves differently because it recognizes it is being evaluated or because the test condition changes incentives. OpenAI's post cites an Apollo finding: signs of evaluation awareness appeared in 52% of sandbagging-condition samples compared with 0% in counterfactuals, while behavior still scored highly.

Most business teams will not run sandbagging research. They still need to watch for eval theater. If an agent behaves well in a scripted demo but takes riskier shortcuts in a live queue, the evaluation did not capture the deployment condition that matters.

Evaluation details are becoming governance artifacts

The timing of OpenAI's eval playbook is worth noting. One day earlier, OpenAI published its Frontier Governance Framework, saying it is aligned with emerging legal requirements including California's Transparency in Frontier AI Act and the EU AI Act's Code of Practice for General Purpose AI.

That framework covers risk assessment and mitigation in areas including cyber offense, CBRN risks, harmful manipulation, loss of control, model reporting, security risk management, incident response, external expert input, and framework updates.

Business teams do not need to copy frontier governance wholesale to approve an invoice agent or IT helpdesk agent. But the direction is clear: evaluation details are moving from research footnotes into governance, procurement, and risk-review artifacts.

NIST's AI Risk Management Framework points in the same practical direction. NIST describes the AI RMF as a resource for incorporating trustworthiness considerations into the design, development, use, and evaluation of AI systems. That does not mean NIST commented on OpenAI's post. It means evaluation quality belongs inside AI risk management, not after it.

For an agent pilot, the governance file should include the harness, boundaries, evidence, and escalation path. That is where responsible AI moves from a principle to a reviewable artifact. It is also where data security has to show up before an agent touches customer records, employee data, internal systems, or regulated workflows.

A policy page cannot save a poorly scoped agent. Access boundaries, approval rules, audit logs, and workflow review need to be part of the tested setup.

The score is the start of the conversation

A high score can be useful. It can show that the pilot is worth another round, that the agent handles a narrow workflow, or that one setup beats another.

It should not approve production by itself.

Before an AI agent changes records, files tickets, updates customer data, routes exceptions, or triggers downstream work, the team should be able to answer a few concrete questions:

What exactly was tested?
What tools, prompts, memory, retries, validators, safeguards, and budgets produced the score?
What counted as success?
Where did the agent fail?
What evidence proves the work happened correctly?
What happens when the agent is uncertain, blocked, or wrong?

If those answers are missing, the honest conclusion is not "92% successful." It is "not enough evidence."

OpenAI's playbook gives teams a useful word for the missing layer: harness. For business leaders, that word should become part of every AI agent pilot review.

Do not approve the agent because the score is high. Approve it when the score has receipts, boundaries, and an escalation path.

If your team is reviewing an AI agent pilot and the report stops at the success rate, use that as the next work item. Ask for the harness. Ask for the budget. Ask for the failures. Ask for the trace.

That is where the real evaluation begins.

For teams that want help turning agent pilots into reviewable workflows, BaristaLabs' AI consulting work focuses on the practical layer: use cases, access boundaries, eval design, audit logs, human approval, and deployment decisions that can survive scrutiny.

Implementation help

Turn agent scores into release gates the team can trust

BaristaLabs helps teams convert eval outputs into receipt assertions, failure cases, review gates, and rollout rules before agent pilots reach production.

Review the eval harness

Best fit when leadership sees a benchmark score but the team still needs evidence for real workflow deployment.

Practical AI Workflow Notes

Want more practical AI operations ideas?

Get short notes on applying AI inside real small-business workflows — from document handling and customer follow-up to internal reporting, compliance, and automation guardrails.

Turn this idea into a pilot

Which workflow should go first?

Use the readiness check to compare impact, effort, risk, owner, and next step before booking a call.

3-5 minutes
Deterministic score
No sensitive data

Check workflow readiness

Share this post

Share on X Share on LinkedIn Share on Bluesky

Agent evals should test workflow receipts, not just model answers

May 25, 2026

Your AI Agent Needs a Bug Cemetery, Not Another Demo

May 29, 2026

ITBench-AA shows why enterprise IT agents need receipts before root access

May 31, 2026