The first win looks clean from the outside.
The inbox drops to zero. The pull request has five automated notes before a human opens it. The support queue has draft replies waiting by the time the manager gets coffee. Everyone can see the time savings.
Then the new pile appears.
Some notes are useful. Some are noise. Some are technically right but not worth changing. Some look small until a customer, security reviewer, or product owner sees the consequence. The team is no longer waiting for people to type the first pass. It is waiting for people to decide what the first pass means.
That is the quieter part of AI automation: it creates observations faster than most teams can absorb them.
The inbox-zero signal
Dan Shipper's May 2026 essay, "After Automation: AI progress creates more work for humans, not less," is useful because it comes from an aggressive adopter, not a skeptic.
His company, Every, has pushed AI into coding, writing, design, and customer service. Shipper says the company still has almost 30 people and still hires humans. The sharpest detail is email: AI responded to 95 percent of his work emails for several weeks, leaving him almost always at inbox zero. But he still reviews email.
That is the line operators should sit with.
If AI handles the first pass, the human role does not shrink to the remaining 5 percent. It changes into a review job: checking judgment calls, spotting exceptions, deciding whether the workflow learned the right lesson, and noticing when a clean draft used the wrong premise.
The work moved. It did not evaporate.
The public artifact is already here
A recent PostHog pull request shows the developer version of the same pattern.
In PostHog PR #61984, a feature-flagged change tested new web analytics metric cards. The thread gathered a Greptile bot review, a bundle-size report, a comment labeled "Automated comment by QA Swarm - not written by a human," a Playwright flake note, visual review approval, and deployment status.
That is not a failure story. It is a public artifact of modern work.
A single change now attracts multiple machine observations before and around human review. The reviewer is not just reading code. They are deciding which automated findings matter, which warnings are unrelated, which notes should become future tests, and which ones can be safely ignored.
A second PostHog example, PR #60833, has the same shape, this time around a narrow boundary between internal debugging and user-facing behavior. Size reports, bot findings, QA Swarm approval, and deployment status all collect around a small operational fix.
The useful lesson is not "use these tools" or "avoid these tools." It is that automated review inputs become an artifact the team has to manage.

GitHub makes the same point in its guide to reviewing agent pull requests. Agent output still needs review habits. The fact that work arrived faster does not make it lower-risk.
Call it the observation budget
Most teams budget the wrong part of the pilot.
They ask: how many replies can the agent draft? How many tickets can it classify? How many PR comments can it generate? How many documents can it summarize?
Those are production metrics. They tell you how fast the machine can make the first pile.
The missing metric is the observation budget: how much machine-generated evidence, warning, suggestion, draft, diff, summary, or exception can the team review without skimming past the important part?
That budget is finite.
A support lead can only review so many drafted replies before judgment gets shallow. A developer can only process so many bot comments before the thread becomes wallpaper. An ops manager can only look at so many exception summaries before every amber item starts to feel the same.
When teams ignore that budget, they get a strange result: more automation, but not less pressure.
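One rough way to put a number on the observation budget is to compare careful review capacity against the volume the automation produces. The sketch below is only an illustration: the `ReviewCapacity` fields, the function names, and every figure in it are assumptions, not measurements from any team mentioned here.

```python
from dataclasses import dataclass


@dataclass
class ReviewCapacity:
    """Hypothetical inputs; replace with figures your own team observes."""
    reviewers: int               # people who review agent output
    minutes_per_reviewer: float  # daily minutes each can give to careful review
    minutes_per_item: float      # time one careful (not skimmed) review takes


def observation_budget(capacity: ReviewCapacity) -> int:
    """Observations the team can review carefully per day."""
    total_minutes = capacity.reviewers * capacity.minutes_per_reviewer
    return int(total_minutes // capacity.minutes_per_item)


def overflow(generated_per_day: int, capacity: ReviewCapacity) -> int:
    """Observations likely to be skimmed or skipped at the current volume."""
    return max(0, generated_per_day - observation_budget(capacity))


# Illustrative numbers only.
team = ReviewCapacity(reviewers=2, minutes_per_reviewer=90, minutes_per_item=3)
print(observation_budget(team))  # 60
print(overflow(200, team))       # 140
```

If the overflow is large, the honest options are to generate fewer observations, filter them before a person sees them, or add review time. Skimming is what happens by default.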
Sort observations before you sort permissions
The practical move is to classify the machine observations before expanding what the agent can do.
A first-pass taxonomy can be simple:
| Observation type | What the reviewer should decide |
|---|---|
| Blocker | Does this stop the action from shipping? |
| Policy exception | Which rule or owner decides the next step? |
| Missing evidence | What source must be shown before approval? |
| Quality edit | Is this a wording/style fix or a substantive correction? |
| Unrelated noise | Should this be hidden, sampled, or tuned out next time? |
| Reusable learning | Should this become a test, rule, checklist item, or template change? |
That last row matters. If every rejected draft only disappears, the system gets busy without getting smarter. If every useful correction becomes a rule, test, instruction change, source change, or reviewer hint, the queue becomes a learning surface.
This is where a lightweight agent receipt helps. The receipt does not need to store everything forever. It should preserve enough context to answer a few basic questions: what did the system see, what did it propose, what did a person change, what final action happened, and what should change next time?
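A receipt along those lines could be as small as one record per reviewed action. This is a minimal sketch, assuming a Python workflow; `AgentReceipt`, `ObservationType`, and all field names are hypothetical and simply mirror the taxonomy and the five questions above.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum
from typing import Optional


class ObservationType(Enum):
    """One value per row of the first-pass taxonomy."""
    BLOCKER = "blocker"
    POLICY_EXCEPTION = "policy_exception"
    MISSING_EVIDENCE = "missing_evidence"
    QUALITY_EDIT = "quality_edit"
    UNRELATED_NOISE = "unrelated_noise"
    REUSABLE_LEARNING = "reusable_learning"


@dataclass
class AgentReceipt:
    """Hypothetical record shape, not a standard format."""
    inputs_seen: str                    # what did the system see?
    proposal: str                       # what did it propose?
    final_action: str                   # what final action happened?
    observation_type: ObservationType   # how the reviewer classified it
    human_change: Optional[str] = None  # what did a person change, if anything?
    next_time: Optional[str] = None     # what should change next time?
    recorded_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )

    def is_learning(self) -> bool:
        """True when the review produced something to feed back into the workflow."""
        return self.next_time is not None


# Invented example values for illustration.
receipt = AgentReceipt(
    inputs_seen="Refund request, order history, refund policy",
    proposal="Draft reply approving a full refund",
    human_change="Changed to a partial refund",
    final_action="Reply sent by support lead",
    observation_type=ObservationType.POLICY_EXCEPTION,
    next_time="Add the refund window to the drafting instructions",
)
print(receipt.is_learning())  # True
```

Keeping `next_time` as its own field is the design choice that matters: it makes "this correction should become a rule" something the team can count, not something a reviewer has to remember.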
The dangerous metric is completion rate
A pilot that says "AI drafted 500 replies" is not wrong. It is incomplete.
A better pilot report would say:
AI drafted 500 replies.
340 shipped with no edit.
96 needed tone edits.
41 needed policy corrections.
17 lacked enough source evidence.
6 would have reached the wrong customer or account.
12 recurring corrections became workflow rules.
Now the team can make a real decision.
Maybe the first lane can move faster. Maybe angry-customer replies need human review every time. Maybe the retrieval step is weak. Maybe the interface is hiding the evidence reviewers need. Maybe the model is fine, but the workflow asks it to decide something the business has never written down.
That is a better conversation than "the AI saved time" versus "the AI created cleanup work." Both can be true.
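The example report can be produced from reviewer outcomes with very little code. The sketch below uses the counts from the report above; the outcome labels and the `pilot_report` function are illustrative names, not part of any particular tool.

```python
from collections import Counter

# Counts come from the example report; the labels are made up for this sketch.
outcomes = Counter({
    "shipped_no_edit": 340,
    "tone_edit": 96,
    "policy_correction": 41,
    "missing_evidence": 17,
    "wrong_recipient": 6,
})


def pilot_report(outcomes: Counter) -> dict[str, float]:
    """Return each outcome as a percentage of all drafted replies."""
    total = sum(outcomes.values())
    return {
        label: round(100 * count / total, 1)
        for label, count in outcomes.items()
    }


report = pilot_report(outcomes)
for label, pct in report.items():
    print(f"{label}: {pct}%")

# Share of drafts that needed any human intervention.
touched = round(100 - report["shipped_no_edit"], 1)
print(f"needed review work: {touched}%")  # needed review work: 32.0%
```

Read this way, the headline is not "500 drafts" but "68 percent shipped untouched, and about 1 in 80 would have gone to the wrong customer or account."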
The NIST AI Risk Management Framework uses bigger language: map, measure, manage, and govern. For a small team, this is the same habit in plain clothes. Map where AI creates observations. Measure which observations humans change. Manage the recurring exceptions. Keep enough evidence to explain the outcome.
Do not make reviewers read everything forever
A review queue is a bridge, not a permanent tax.
Once a pattern is boring and reversible, it can move toward sampling or auto-approval. Once a pattern is high-risk, it should keep a stronger human gate. Once a machine note is repeatedly useless, tune it out or stop generating it.
That is the operating work AI creates. Not glamorous. Very real.
The goal is not to make people review every agent action forever. The goal is to learn which observations deserve attention.
For engineering teams, that might mean converting recurring bot comments into tests or lint rules. For support teams, it might mean turning repeated edits into response policy. For operations teams, it might mean separating routine drafts from exceptions that need an owner. For customer-facing workflows, it might mean using an approval queue only where the action can affect money, records, relationships, or public claims.
If the queue keeps growing and nothing graduates out of it, automation has become another inbox.
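The graduation idea in this section can be written down as a small routing rule. This is a sketch under stated assumptions: the `Lane` names, the function names, and every threshold are hypothetical starting points that a team would set from its own review data.

```python
from enum import Enum


class Lane(Enum):
    HUMAN_GATE = "human_gate"      # every item reviewed before it ships
    SAMPLE = "sample"              # spot-check a fraction of items
    AUTO_APPROVE = "auto_approve"  # ships without review


def route_action_pattern(
    reviewed: int,
    edited: int,
    high_risk: bool,
    reversible: bool,
    min_reviews: int = 50,       # assumed minimum evidence before graduating
    sample_below: float = 0.05,  # assumed edit rate that allows sampling
    auto_below: float = 0.01,    # assumed edit rate that allows auto-approval
) -> Lane:
    """Pick a review lane for a recurring kind of agent action."""
    # High-risk or irreversible patterns keep the stronger human gate,
    # as do patterns without enough review history to judge.
    if high_risk or not reversible or reviewed < min_reviews:
        return Lane.HUMAN_GATE
    edit_rate = edited / reviewed
    if edit_rate <= auto_below:
        return Lane.AUTO_APPROVE
    if edit_rate <= sample_below:
        return Lane.SAMPLE
    return Lane.HUMAN_GATE


def should_tune_out(times_shown: int, times_acted_on: int, min_shown: int = 50) -> bool:
    """Flag a machine note that reviewers have repeatedly seen and never used."""
    return times_shown >= min_shown and times_acted_on == 0


print(route_action_pattern(200, 1, high_risk=False, reversible=True))  # Lane.AUTO_APPROVE
print(route_action_pattern(200, 8, high_risk=False, reversible=True))  # Lane.SAMPLE
print(route_action_pattern(200, 0, high_risk=True, reversible=True))   # Lane.HUMAN_GATE
print(should_tune_out(times_shown=80, times_acted_on=0))               # True
```

The exact thresholds matter less than having them written down, so that something can actually leave the queue.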
Plan for the second pile
AI automation can absolutely reduce manual work. It can clear inboxes, draft code, classify tickets, summarize calls, and keep routine work moving.
But the first measurable win often creates a second pile: machine-generated work that needs human judgment.
Plan for that pile before the pilot expands. Decide which observations are blockers, which are exceptions, which are learning opportunities, and which are noise. Track reviewer edits. Turn repeat corrections into rules. Keep receipts for customer-visible or system-of-record actions.
BaristaLabs helps teams design those practical automation patterns: review surfaces, receipt logs, exception lanes, and the metrics that show whether the work is actually getting lighter. If your first AI workflow is creating a second pile, we can help you make it manageable through process automation, or simply talk through the workflow with you.
