
OpenAI's tax agents show why AI automation needs a feedback loop

OpenAI's Tax AI pilot with Codex is less a story about automated tax prep and more a lesson in production AI: agents improve when practitioner corrections become structured evidence, evals, and guarded releases.

Sean McLellan

Lead Architect & Founder

May 27, 2026 · 8 min read

The most useful lesson from OpenAI's new tax-agent writeup is not that AI can help prepare tax returns.

It is that production AI agents get better when the surrounding workflow is designed to learn.

In its May 27, 2026 engineering post, "Building self-improving tax agents with Codex", OpenAI describes Tax AI, a system co-developed with Thrive Holdings for Crete, a network of more than 30 accounting firms. The pilot covered 7,000 tax returns during tax season and reportedly saved practitioners roughly one-third of prep time, drafted returns with up to 97% accuracy, and increased throughput by about 50%.

Those numbers are attention-grabbing. But the more important part is the operating model behind them.

Tax AI improved because practitioner corrections did not disappear into Slack threads, support tickets, or vague "the model got this wrong" notes. The workflow preserved evidence. Engineers could turn that evidence into evals. Codex could help convert production signals into scoped product and engineering improvements. Changes could be tested before they reached the next batch of work.

That pattern matters far beyond accounting. If your business handles documents, forms, approvals, exceptions, compliance checks, reporting, or customer records, the same lesson applies: AI workflow automation needs a feedback loop before it needs a bigger demo.

What OpenAI built

OpenAI and Thrive Holdings built Tax AI for Crete accountants working across a wide range of filings and source documents.

The target workflows included 1040 and 1041 returns, along with common forms and schedules such as W-2s, 1099s, K-1s, Schedule E rental income, Schedule C, and Schedule A. OpenAI notes that for medium- to large-complexity filings, data entry alone can take up to eight hours per return.

That is exactly the kind of work where AI agents look promising on paper:

many documents
repeated extraction and mapping tasks
messy inputs
prior-year records
emails, spreadsheets, and notes
high seasonal pressure
expensive human review time
real consequences when details are wrong

Tax AI was not presented as a fully autonomous replacement for accountants. It drafted returns, handled parts of the preparation workflow, and learned from practitioner review.

That distinction matters. In serious business process automation, the goal is rarely "remove the expert." The better goal is to give the expert a cleaner draft, better evidence, fewer repetitive steps, and a reliable way to teach the system when it misses.

Why the accuracy curve matters

The most revealing part of OpenAI's post is the improvement curve.

OpenAI measured accuracy by the share of scored returns that reached different levels of correct field completion. At launch:

25.13% of scored returns reached at least 75% field completion
1.26% reached at least 90%
0.01% were 100% field complete

By week 6:

86.24% reached at least 75% field completion
59.77% reached at least 90%
9.50% were 100% field complete

OpenAI also listed a mid-year target or projection of:

98% reaching at least 75% field completion
90% reaching at least 90%
50% reaching 100%

That curve is the point.

A brittle AI pilot often looks impressive in a controlled demo, then stalls in production. The team finds edge cases, argues about whether they are model problems or workflow problems, patches a few prompts, and slowly loses confidence.

OpenAI's Tax AI story shows a different path. The first version was incomplete. Real users found the gaps. The product captured enough context to understand those gaps. The engineering system turned recurring issues into eval-backed changes. Accuracy improved over weeks.

That is what self-improving AI agents should mean in a business setting. The agent is not magically teaching itself in a vacuum. The business process is producing structured learning material.
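The threshold metric described above is simple to reproduce on your own pilot data. The sketch below is a minimal illustration, not OpenAI's scoring code: it assumes a field counts as correct only when the draft value exactly matches the approved value, and the function names `field_completion` and `threshold_shares` are hypothetical.

```python
from typing import Iterable


def field_completion(draft: dict, approved: dict) -> float:
    """Share of approved fields that the draft filled with the same value.

    Assumption: exact-match scoring. A real scorer may normalize formats
    or weight fields differently.
    """
    if not approved:
        raise ValueError("approved return has no scored fields")
    correct = sum(1 for name, value in approved.items() if draft.get(name) == value)
    return correct / len(approved)


def threshold_shares(completion_rates: Iterable[float],
                     thresholds=(0.75, 0.90, 1.00)) -> dict:
    """Share of scored returns whose completion rate meets each threshold."""
    rates = list(completion_rates)
    if not rates:
        raise ValueError("no scored returns")
    return {t: sum(r >= t for r in rates) / len(rates) for t in thresholds}


# Four hypothetical scored returns with their field completion rates.
rates = [0.60, 0.80, 0.92, 1.00]
shares = threshold_shares(rates)
print(shares)
```

Tracking these shares week over week, rather than a single average, shows whether more returns are crossing the bar at which a reviewer can trust the draft.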

Production failures are messy

OpenAI is direct about one of the hardest parts: production failures are not clean benchmark failures.

A mismatched tax field might mean the system failed to extract a value from a document. It might mean the value was extracted correctly but mapped to the wrong destination. It might mean the product did not yet support that schedule. It might reflect a practitioner preference, a prior-year carry-forward, workflow noise, tax judgment, or a downstream change elsewhere in the filing workflow.

That messiness appears in every document-heavy operation.

In finance, a variance may come from a missing invoice, a coding rule, a timing issue, or a human override.

In compliance, a flagged clause may be a real risk, a template artifact, a jurisdiction-specific exception, or a policy that changed last week.

In customer operations, a bad recommendation may come from stale CRM data, a missing email, an unusual account history, or a support agent making a reasonable exception.

If an AI system only records "wrong answer," the team has very little to improve. If it records the source document, extracted field, evidence, workflow step, human correction, reviewer rationale, and final outcome, the team can start separating model errors from product gaps and process ambiguity.

That is where AI agent evals become practical. They stop being abstract benchmark scores and start testing whether the workflow produced the right result from the right evidence.

This is also why we often recommend building an AI approval queue before adding agents. Review screens are not only safety checkpoints. Done well, they become the place where human judgment turns into training signal, product feedback, and audit history.

What business teams can copy from the pattern

Most small and midsized businesses do not need a tax agent. They do need a better way to automate document-heavy work without losing control.

Here is the pattern worth copying.

1. Start with a narrow workflow

Tax AI focused on specific filing workflows and document types. It did not begin as a general "do accounting" agent.

That is the right instinct. Pick a workflow with clear inputs, known reviewers, measurable outcomes, and enough repetition to learn from. Examples:

invoice intake and GL coding
contract review for standard terms
customer onboarding packet review
monthly reporting package preparation
insurance or compliance document checks
support ticket triage with evidence summaries

For many teams, this starts with process automation: mapping the handoffs, documents, decision points, and review steps before adding an agent.

2. Capture evidence, not only outcomes

A useful agent workflow should preserve a receipt of what happened.

For document work, that usually means:

source file or message
extracted field
location or citation
confidence or uncertainty
transformation or mapping rule
human correction
reviewer note
final approved output

Without that chain, you cannot reliably answer why the agent failed. With it, you can build evals that test the actual job.

We have written about this in more detail in "Agent evals should test workflow receipts." The short version: if the business process requires evidence, the eval should check evidence too.
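One way to make the receipt concrete is a small record type with one attribute per item in the list above. This is a sketch under assumptions: the class name `EvidenceReceipt`, its field names, and the sample file name are all hypothetical, and a real system would likely store richer location and mapping data.

```python
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass(frozen=True)
class EvidenceReceipt:
    """Hypothetical receipt for one extracted field in a document workflow."""
    source_ref: str            # source file or message
    field_name: str            # extracted field
    extracted_value: str
    location: str              # location or citation in the source
    confidence: float          # confidence or uncertainty reported by the agent
    mapping_rule: str          # transformation or mapping rule applied
    corrected_value: Optional[str] = None   # human correction, if any
    reviewer_note: str = ""
    approved_value: Optional[str] = None    # final approved output

    @property
    def was_corrected(self) -> bool:
        """True when a reviewer changed the value the agent extracted."""
        return (self.corrected_value is not None
                and self.corrected_value != self.extracted_value)


# Illustrative data only; the file name and values are made up.
receipt = EvidenceReceipt(
    source_ref="w2_sample.pdf",
    field_name="wages",
    extracted_value="52,100",
    location="page 1, box 1",
    confidence=0.82,
    mapping_rule="W-2 box 1 -> wages",
    corrected_value="52,700",
    reviewer_note="Digit misread in scan",
    approved_value="52,700",
)
print(receipt.was_corrected)
print(asdict(receipt)["location"])
```

Because the record is a plain dataclass, `asdict` gives a serializable form that can be written to whatever trace store or database the workflow already uses.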

3. Turn corrections into structured evals

Practitioner feedback is valuable only if the system can reuse it.

A correction buried in a comment is better than nothing. A correction linked to the source document, field name, expected value, reviewer rationale, and workflow state is far more useful.

That structure lets teams ask:

Is this a recurring extraction problem?
Is the mapping wrong?
Is the product missing a supported case?
Does the workflow need another approval step?
Is this actually a policy or judgment question?
Should the agent abstain next time?

Once those questions are answerable, improvements can be tested before release.
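The questions above become answerable once each correction carries a failure-mode tag. The following sketch assumes reviewers or engineers assign that tag by hand; the `FailureMode` categories, the `Correction` record, and both helper functions are hypothetical names chosen for illustration.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum


class FailureMode(Enum):
    EXTRACTION = "extraction"              # value read incorrectly from the source
    MAPPING = "mapping"                    # right value, wrong destination
    UNSUPPORTED_CASE = "unsupported_case"  # product does not handle this case yet
    JUDGMENT = "judgment"                  # policy or preference, no single right answer


@dataclass(frozen=True)
class Correction:
    source_ref: str
    field_name: str
    expected_value: str
    mode: FailureMode
    rationale: str


def recurring_modes(corrections, min_count=2):
    """Count (field, failure mode) pairs and keep those seen at least min_count times."""
    counts = Counter((c.field_name, c.mode) for c in corrections)
    return {key: n for key, n in counts.items() if n >= min_count}


def to_eval_cases(corrections):
    """Turn corrections into eval cases.

    Judgment calls are skipped because they have no single expected value to assert.
    """
    return [
        {"input": c.source_ref, "field": c.field_name, "expected": c.expected_value}
        for c in corrections
        if c.mode is not FailureMode.JUDGMENT
    ]


# Illustrative corrections; document ids and values are made up.
corrections = [
    Correction("doc-1", "rental_income", "1200", FailureMode.EXTRACTION, "Misread total"),
    Correction("doc-2", "rental_income", "900", FailureMode.EXTRACTION, "Picked subtotal"),
    Correction("doc-3", "home_office", "0", FailureMode.JUDGMENT, "Client preference"),
]
print(recurring_modes(corrections))
print(len(to_eval_cases(corrections)))
```

Keeping judgment calls out of the eval set is a design choice: they are better handled by an approval step or an abstain rule than by a test that asserts one answer.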

4. Keep approval gates where risk is real

OpenAI's other recent Codex announcements point in the same direction: agents are becoming more capable, but the work still needs steering, review, and approval. In its May 14 post, "Work with Codex from anywhere," OpenAI describes users monitoring, guiding, and approving long-running work while files, credentials, permissions, and local setup stay on the machine where Codex operates.

That is a useful model for business automation too.

The agent can draft, extract, reconcile, summarize, and propose. The human should approve where the decision carries financial, legal, customer, or reputational risk.

Approval gates are not a sign that the system failed. They are how you make the system safe enough to use. They are part of responsible AI: clear accountability, human review, and a record of what changed.
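An approval gate can start as a small routing rule. The sketch below is one possible shape, not a prescribed design: the risk categories mirror the four named above, while `ProposedAction`, `route`, and the 0.9 confidence cutoff are assumptions made for the example.

```python
from dataclasses import dataclass

# Risk categories taken from the text above; a real policy would define its own.
HIGH_RISK_KINDS = frozenset({"financial", "legal", "customer", "reputational"})


@dataclass(frozen=True)
class ProposedAction:
    description: str
    risk_kinds: frozenset   # which kinds of risk this action carries, if any
    confidence: float       # agent-reported confidence between 0 and 1


def route(action: ProposedAction, min_confidence: float = 0.9) -> str:
    """Send risky or uncertain actions to a human; let the rest proceed."""
    if action.risk_kinds & HIGH_RISK_KINDS:
        return "human_approval"
    if action.confidence < min_confidence:
        return "human_approval"
    return "auto_apply"


print(route(ProposedAction("post journal entry", frozenset({"financial"}), 0.99)))
print(route(ProposedAction("tag document type", frozenset(), 0.95)))
```

Note that high confidence never overrides a risk category here: the financial action goes to a human even at 0.99.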

5. Define data boundaries early

The Dell and OpenAI partnership announcement from May 18 frames Codex as moving closer to enterprise data and workflows, including reports, feedback routing, lead qualification, follow-ups, and coordination across business systems. That trend is important: useful agents need context from the places work already happens.

But more context also means more responsibility.

Before automating document-heavy work, decide:

Which systems can the agent access?
Which documents are out of scope?
Which credentials or secrets must never be exposed?
What data can be stored in traces and evals?
Who can review sensitive examples?
How long should evidence be retained?

Those are not cleanup questions. They belong at the start of the project. If your agent needs access to financial records, customer data, employee documents, or regulated information, your data security model is part of the product design.
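Writing the answers down as data makes them enforceable rather than aspirational. This is a minimal sketch, assuming a single policy object per pilot; `DataBoundary`, its attributes, and the sample system names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataBoundary:
    """Hypothetical per-pilot data policy."""
    allowed_systems: frozenset          # systems the agent may access
    blocked_document_types: frozenset   # documents that are out of scope
    redacted_trace_fields: frozenset    # fields never stored in traces or evals
    retention_days: int                 # how long evidence is kept

    def can_access(self, system: str) -> bool:
        return system in self.allowed_systems

    def in_scope(self, document_type: str) -> bool:
        return document_type not in self.blocked_document_types

    def scrub_trace(self, trace: dict) -> dict:
        """Return a copy of the trace with sensitive fields masked."""
        return {
            key: "[REDACTED]" if key in self.redacted_trace_fields else value
            for key, value in trace.items()
        }


# Illustrative policy; system and field names are made up.
boundary = DataBoundary(
    allowed_systems=frozenset({"document_store", "review_queue"}),
    blocked_document_types=frozenset({"medical_record"}),
    redacted_trace_fields=frozenset({"ssn", "api_key"}),
    retention_days=90,
)
print(boundary.can_access("payroll"))
print(boundary.scrub_trace({"field": "wages", "ssn": "000-00-0000"}))
```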

6. Ship improvements on a cadence

The Tax AI accuracy curve improved over weeks because the team had a loop.

A practical SMB version might look like this:

daily review of failed or uncertain cases during the pilot
weekly grouping of recurring failure modes
weekly eval updates for the highest-value issues
guarded releases for prompt, workflow, or code changes
monthly review of time saved, review burden, error types, and user trust

The cadence matters because AI automation is rarely "set it and forget it." The workflow changes. Source documents change. Staff develop preferences. Customers send strange files. Policies evolve.

The system needs a way to keep up.
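A guarded release can be as simple as refusing to ship a change that breaks an eval case the current version passes. The sketch below assumes eval results are stored as a mapping from case id to pass or fail; the function name `release_allowed` and the case ids are hypothetical, and this is one possible gate rather than the one OpenAI describes.

```python
def release_allowed(baseline: dict, candidate: dict):
    """Compare eval results for the current version and a proposed change.

    Returns (allowed, regressions). A regression is a case the baseline passed
    that the candidate fails or did not run.
    """
    regressions = sorted(
        case for case, passed in baseline.items()
        if passed and not candidate.get(case, False)
    )
    return (not regressions, regressions)


# Illustrative results keyed by eval case id.
baseline = {"case-1": True, "case-2": True, "case-3": False}
candidate = {"case-1": True, "case-2": False, "case-3": True}

allowed, regressions = release_allowed(baseline, candidate)
print(allowed, regressions)
```

In this example the candidate fixes `case-3` but breaks `case-2`, so the gate blocks it. Treating a missing case as a failure keeps a change from passing just because part of the eval suite was skipped.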

Where this applies outside tax

The tax example is specific, but the pattern is broad.

Self-improving AI agents are especially relevant when the work has:

messy documents
repeated review steps
domain experts correcting drafts
high-volume seasonal or operational pressure
structured outputs
clear downstream systems
risk that requires approval

That includes:

finance teams preparing monthly reporting packages
operations teams reviewing vendor documents
compliance teams checking forms, policies, or attestations
customer success teams summarizing account histories
sales teams qualifying leads and routing follow-ups
HR teams reviewing onboarding paperwork
legal teams screening standard contract terms
healthcare or insurance teams reconciling forms and records

The common thread is not the industry. It is the workflow.

If expert review already happens, the question is whether the review creates reusable signal. If the answer is no, the first automation opportunity may be the review process itself.

A practical first-step checklist

If you are considering AI agents for accounting, finance, reporting, compliance, document review, or customer operations, start smaller than the demo in your head.

Use this checklist before you build.

1. Pick one workflow

Choose a workflow where the same kind of document or request appears often. Avoid starting with the weirdest edge cases.

Good first candidates have:

clear inputs
a known reviewer
a clear definition of "done"
measurable time spent today
enough volume to learn from

2. Write down the current human process

Map the actual workflow, not the ideal one.

Include:

where documents arrive
who opens them
what they look for
what systems they update
when they ask for help
what they double-check
what causes rework

This is often where the biggest automation opportunities appear.

3. Define the evidence receipt

Before choosing a model or tool, decide what evidence the agent must show alongside its work.

For each proposed output, ask:

What source supports this?
Where did the value come from?
What transformation happened?
What uncertainty should be visible?
What would a reviewer need to approve it quickly?

4. Build the review queue

Do not send agent output straight into a system of record unless the risk is low and the behavior is already proven.

Create a review step where people can:

approve
edit
reject
add a reason
flag a missing capability
escalate uncertain cases

That queue becomes the foundation for safer automation and better evals.
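The reviewer actions listed above map naturally onto a small decision record. This sketch assumes that every action other than a plain approval must carry a reason, since the reason is what later feeds evals; `ReviewAction` and `ReviewDecision` are hypothetical names.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class ReviewAction(Enum):
    APPROVE = "approve"
    EDIT = "edit"
    REJECT = "reject"
    FLAG_MISSING_CAPABILITY = "flag_missing_capability"
    ESCALATE = "escalate"


@dataclass(frozen=True)
class ReviewDecision:
    item_id: str
    action: ReviewAction
    reason: str = ""
    edited_value: Optional[str] = None

    def __post_init__(self):
        # Assumption: anything other than approval needs a reason to be reusable later.
        if self.action is not ReviewAction.APPROVE and not self.reason.strip():
            raise ValueError(f"{self.action.value} requires a reason")
        # An edit without the new value cannot be turned into an eval case.
        if self.action is ReviewAction.EDIT and self.edited_value is None:
            raise ValueError("edit requires edited_value")


# Illustrative decisions; the item ids are made up.
approved = ReviewDecision("item-1", ReviewAction.APPROVE)
edited = ReviewDecision("item-2", ReviewAction.EDIT,
                        reason="Wrong box used", edited_value="52,700")
print(approved.action.value, edited.edited_value)
```

Rejecting a decision at creation time is stricter than most review tools; a softer version could accept it and prompt the reviewer for the missing reason afterward.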

5. Create evals from real corrections

After the first batch, do not only ask, "Was it accurate?"

Ask:

Which errors repeated?
Which corrections were judgment calls?
Which fields lacked evidence?
Which documents confused the workflow?
Which failures would have caused real business risk?
Which changes can be tested before release?

That is the beginning of an improvement loop.

6. Set boundaries for data and permissions

Decide what the agent can read, write, store, and suggest. Keep the first version boring on purpose.

A good pilot should make people more confident each week. If trust drops, slow down and inspect the workflow before expanding scope.

The real lesson from OpenAI Tax AI

OpenAI's Tax AI pilot is useful because it shows the unglamorous side of production AI.

The agent improved because accountants corrected it. The product preserved those corrections as structured evidence. Engineers could diagnose failure modes. Evals could check whether changes helped. Codex could assist with the improvement work. Humans stayed in the loop where judgment and accountability mattered.

That is the blueprint business teams should pay attention to.

If you are exploring AI automation for document-heavy operations, do not start by asking, "Which agent can do this whole job?"

Start with a better question:

"What would our workflow need to capture so the agent can improve safely?"

That question leads to better pilots, clearer governance, and systems your team can actually trust.

BaristaLabs helps teams design AI pilots around real workflows, review gates, evals, and data boundaries. If you are sorting through where to start, our AI consulting work can help you choose the right first use case, and the AI readiness assessment is a low-pressure way to identify the workflows worth evaluating first.

Sources

OpenAI Engineering: "Building self-improving tax agents with Codex", May 27, 2026
OpenAI: "Work with Codex from anywhere", May 14, 2026
OpenAI: "OpenAI and Dell Technologies partner to bring Codex to hybrid and on-premises enterprise environments", May 18, 2026
OpenAI: "OpenAI named a Leader in enterprise coding agents by Gartner", May 22, 2026
