The Slack notification dropped at 12:08 PM. A backend lead glanced at it — "new PR: auth refactor + coverage" — and went to lunch.
By 2:15, the PR was ready for review. Eight hundred lines: updated token-handling middleware, new test files for three edge cases, a revised CI workflow that added a security scan step, and — tucked at the bottom — a Terraform change to an RDS parameter group. The diff was plausible. The agent had been running for two hours.
The backend lead who picked it up had forty minutes before standup and no context on why the database config had moved. In that window, there was no safe way to approve it.
Not because the code is bad. It isn't, as far as anyone can tell. But the PR description says "Refactor auth token handling and add coverage," and nothing attached to it answers the questions that actually matter before anyone can safely say yes: What was the agent solving when it touched the database config? Which tests ran against the auth changes and what did they return? If this deploys and the app breaks, what does recovery look like?
That reviewer is now the bottleneck, the risk boundary, and — if something breaks in production — the person who approved the change.
This is where AI coding adoption is actually stuck. Not at code generation quality. At review legibility.
Speed creates review debt
In February, OpenAI published an account of what it called harness engineering: a product built over five months in which Codex wrote every line, including application logic, tests, CI configuration, documentation, observability tooling, and internal developer utilities. That essay can sit behind a Cloudflare challenge for automated readers, so OpenAI's crawler-friendly Codex cloud documentation is the durable companion source for the same operating model: delegate a task, run it in a configured cloud environment, and turn the result into reviewable repository work. The throughput number got the attention. The process detail was quieter and more important: before any PR reaches a human reviewer, the agent reviews its own work and requests additional agent reviews. The harness runs first. Human review comes last, operating on evidence the harness already collected.
Most teams adopting coding agents have not built that loop. The gap between what a capable agent can produce in two hours and what a reviewer can safely approve in forty minutes is where review debt accumulates.
The failure mode is not a rogue agent doing something obviously wrong. It is a plausible-looking change that nobody can safely approve because the necessary evidence is not there — and the team, under throughput pressure, approves it anyway.
The merge receipt
The pattern worth building is the merge receipt: a structured card attached to every agent-generated change that makes the right evidence unavoidable before merge or apply. Not a long document. Not a separate wiki page. A card that answers the questions a reviewer needs before they can reasonably say yes.
The merge receipt is the new code comment. In the era of hand-written code, a good comment explained a non-obvious decision for the next human reader. In the era of agent-generated code, the receipt explains intent, risk surface, and evidence for the next human approver. Both exist for the same reason: to make the code reviewable by someone who was not in the room when the decision was made.
The shift is that the receipt is not optional or stylistic. When an AI agent can open a PR against your infrastructure in two hours, "I'll figure it out from the diff" is not a safe review strategy. The receipt makes evidence present — or it makes the absence of evidence visible before merge, not after.
This connects to the broader pattern of agent receipts for reconstructing AI work, applied specifically to the moment before a change is approved and applied.

A reviewer with the receipt in hand
Take the two-hour PR from the opening. App code, CI, and Terraform — three distinct systems. Here is what changes when the merge receipt is present.
The reviewer picks it up before opening the diff. The first thing they read is not the PR title. It is the original task specification the agent was given: "Implement short-lived refresh token rotation in auth middleware, add coverage for expiration edge cases, non-production environments only." That sentence reframes the review. Instead of reverse-engineering intent from 800 lines, the reviewer is checking whether the diff achieves a declared goal — and the "non-production only" scope tells them immediately that a production infrastructure change was not on the brief.
The receipt also declares touched systems explicitly: authentication middleware, CI pipeline, RDS configuration. Not inferred from the diff. Declared by the agent or the harness before the reviewer opens a single file. The Terraform change is visible before they reach it.
Test results come next — and this is where a receipt is either evidence or a red flag. Not a checkbox that says "tests pass." The actual run output, linked, with which suite, which environment, and what the edge cases returned. If no tests ran against the Terraform change, that field is empty on the receipt. The absence is legible before merge, not after.
For the infrastructure piece specifically, a valid-looking Terraform plan still needs a record of what the plan showed before apply. HCP Terraform run tasks support this directly: integrations that validate or scan at specific lifecycle stages and return pass/fail before a run proceeds. The receipt records the outcome. If the run task did not complete, the card says so.
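One way to keep a record of what the plan showed is to summarize the JSON that terraform show -json produces for a saved plan file. The sketch below is an illustration only, not the HCP Terraform run task API: the function name `summarize_plan` and the sample resource addresses are hypothetical, and the only parts of the plan JSON it relies on are `resource_changes`, `address`, and `change.actions`.

```python
import json


def summarize_plan(plan_json: str) -> dict[str, list[str]]:
    """Group resource addresses by planned action, skipping no-ops."""
    plan = json.loads(plan_json)
    summary: dict[str, list[str]] = {}
    for resource in plan.get("resource_changes", []):
        actions = resource["change"]["actions"]
        if actions == ["no-op"]:
            continue
        # A replacement shows up as two actions, e.g. ["delete", "create"].
        key = "+".join(actions)
        summary.setdefault(key, []).append(resource["address"])
    return summary


# Hypothetical plan excerpt, standing in for real `terraform show -json` output.
example_plan = json.dumps({
    "resource_changes": [
        {"address": "aws_db_parameter_group.auth",
         "change": {"actions": ["update"]}},
        {"address": "aws_iam_role.ci",
         "change": {"actions": ["no-op"]}},
    ]
})

plan_record = summarize_plan(example_plan)
```

Stored on the receipt, a summary like `plan_record` gives the reviewer the planned actions without opening the Terraform diff. If the plan never ran, the field stays empty.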
Two more fields follow: the reviewer lane and the rollback path. App and CI changes may go to engineering review; the RDS change needs a separate infrastructure approval before merge. Separate lanes let teams apply different thresholds to different risk surfaces without blocking throughput on low-risk changes or underscrutinizing high-risk ones.
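Lane assignment can be derived from the declared touched systems. This is a minimal sketch under stated assumptions: the system names, the lane names, and the choice to send unknown systems to the stricter lane are all hypothetical, not something the article prescribes.

```python
# Hypothetical mapping from a declared system to the lane that must approve it.
LANE_BY_SYSTEM = {
    "auth-middleware": "engineering",
    "ci-pipeline": "engineering",
    "rds-config": "infrastructure",
}

# Assumption: anything not in the mapping goes to the stricter lane.
DEFAULT_LANE = "infrastructure"


def required_lanes(touched_systems: list[str]) -> set[str]:
    """Return every reviewer lane that must sign off before merge."""
    return {LANE_BY_SYSTEM.get(system, DEFAULT_LANE) for system in touched_systems}


lanes = required_lanes(["auth-middleware", "ci-pipeline", "rds-config"])
```

For the PR in this example, `lanes` contains both lanes, so an engineering approval alone would not be enough to merge.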
The rollback path for this PR is not simply "revert the commit." It is closer to: restore the prior RDS parameter value, confirm whether active connections are affected, and redeploy safely. That procedure should exist before the change ships, not during the incident call. A structured rollback checklist is worth having before an agent is authorized to touch infrastructure.
The final field is scope: what the agent was explicitly not permitted to do in this run. If it was scoped to non-production environments, was not authorized to modify secrets, or was operating in read-only planning mode, that appears on the receipt. A reviewer who can read the scope boundary can trust it. A reviewer who has to assume it is guessing.
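A declared scope can also be checked mechanically against the changes the agent actually made. The sketch below assumes a hypothetical `Scope` and `Change` shape, and assumes for illustration that the RDS change targeted production; the article does not say which environment it touched.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scope:
    """What the agent was permitted to do in this run (hypothetical shape)."""
    allowed_environments: frozenset
    may_modify_secrets: bool = False


@dataclass(frozen=True)
class Change:
    """One declared change from the receipt (hypothetical shape)."""
    system: str
    environment: str
    touches_secrets: bool = False


def scope_violations(scope: Scope, changes: list) -> list:
    """List every change that falls outside the declared scope."""
    violations = []
    for change in changes:
        if change.environment not in scope.allowed_environments:
            violations.append(
                f"{change.system}: environment '{change.environment}' is out of scope"
            )
        if change.touches_secrets and not scope.may_modify_secrets:
            violations.append(f"{change.system}: secrets modification not authorized")
    return violations


scope = Scope(allowed_environments=frozenset({"dev", "staging"}))
changes = [
    Change("auth-middleware", "staging"),
    Change("rds-parameter-group", "production"),  # illustrative assumption
]
violations = scope_violations(scope, changes)
```

A non-empty `violations` list is something the harness can surface on the receipt before a reviewer opens the diff.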
Seven data points. None of them require the reviewer to stare harder at the diff. They require the harness to have run before the PR arrived.
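As a data structure, the receipt might look something like the following. This is a sketch, not a standard schema: the class name `MergeReceipt`, the field names, and the field types are assumptions chosen to mirror the seven fields described above, and the placeholder URL is not a real link.

```python
from dataclasses import dataclass, field, fields
from typing import Optional


@dataclass
class MergeReceipt:
    """Evidence attached to one agent-generated change (hypothetical schema)."""
    intent: str = ""                                            # original task specification
    touched_systems: list[str] = field(default_factory=list)    # declared, not inferred
    test_evidence: list[str] = field(default_factory=list)      # links to actual run output
    environment_record: Optional[str] = None                    # e.g. plan or run task outcome
    reviewer_lanes: list[str] = field(default_factory=list)     # who must approve
    rollback_path: list[str] = field(default_factory=list)      # ordered recovery steps
    forbidden_actions: list[str] = field(default_factory=list)  # scope boundary for the run


receipt = MergeReceipt(
    intent=(
        "Implement short-lived refresh token rotation in auth middleware, "
        "add coverage for expiration edge cases, non-production environments only."
    ),
    touched_systems=["authentication middleware", "CI pipeline", "RDS configuration"],
    test_evidence=["https://ci.example.invalid/runs/placeholder"],  # placeholder link
    reviewer_lanes=["engineering", "infrastructure"],
    rollback_path=[
        "Restore the prior RDS parameter value",
        "Confirm whether active connections are affected",
        "Redeploy safely",
    ],
    forbidden_actions=["production changes", "modifying secrets"],
)
```

Every field defaults to empty, so anything the harness did not collect stays visibly blank. Here `environment_record` is still `None` because no plan record was attached.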
What the harness actually does
The merge receipt does not replace judgment. It changes what judgment operates on.
A reviewer looking at the receipt for the two-hour PR can tell immediately whether the Terraform change has an environment record attached, whether tests ran against the auth middleware in a staging environment that matches production, and whether the rollback path is named. If any field is empty, the PR is not merge-ready — not because the code is wrong, but because the evidence is incomplete.
That is the failure the harness prevents. Not bad code. Missing evidence.
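The gate itself can be very small. In this sketch a receipt is modeled as a plain mapping, and the field names and both function names are hypothetical. The only rule it encodes is the one above: an empty field means the PR is not merge-ready.

```python
# Hypothetical field names, one per piece of evidence the receipt should carry.
REQUIRED_FIELDS = (
    "intent",
    "touched_systems",
    "test_evidence",
    "environment_record",
    "reviewer_lane",
    "rollback_path",
    "forbidden_actions",
)


def missing_evidence(receipt: dict) -> list[str]:
    """Return the names of required fields that are absent or empty."""
    return [name for name in REQUIRED_FIELDS if not receipt.get(name)]


def is_merge_ready(receipt: dict) -> bool:
    """A change is merge-ready only when no evidence field is empty."""
    return not missing_evidence(receipt)


# Illustrative receipt for the two-hour PR, with two fields never filled in.
two_hour_pr = {
    "intent": "Refresh token rotation, non-production environments only",
    "touched_systems": ["auth middleware", "CI pipeline", "RDS configuration"],
    "test_evidence": ["auth suite run output (linked)"],
    "environment_record": None,
    "reviewer_lane": ["engineering", "infrastructure"],
    "rollback_path": [],
    "forbidden_actions": ["production changes"],
}

gaps = missing_evidence(two_hour_pr)
```

Wired into CI as a required check, a non-empty `gaps` list would block the merge and name exactly which evidence is missing.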
Anthropic's framing of workflows versus agents draws the line here: workflows orchestrate through predefined code paths; agents direct their own process and tool use. The more autonomy an agent has, the more the harness matters: the human reviewer is the last gate before production, and should not be the only gate.
A team that builds this harness does not need a hero reviewer who can hold an entire application model in their head at diff-review time. It needs a system where the right evidence is unavoidable before a PR can proceed, and where the absence of evidence is legible before something ships.
Before the next agent PR lands
Pick one AI-generated PR or automation change your team has already reviewed — or the next one that comes in. Map the merge receipt fields against it: intent, touched systems, test evidence, environment record, reviewer lane, rollback path, forbidden actions. Which fields would have been empty?
That gap is where to start. Not with more task wording tweaks or longer PR descriptions. With the harness that makes evidence present before it is needed.
If you want to map those fields against a real AI-generated PR or infrastructure change before increasing your team's agent autonomy, bring it to us.
