The most useful lesson from OpenAI's new tax-agent writeup is not that AI can help prepare tax returns.
It is that production AI agents get better when the surrounding workflow is designed to learn.
In its May 27, 2026 engineering post, "Building self-improving tax agents with Codex", OpenAI describes Tax AI, a system co-developed with Thrive Holdings for Crete, a network of more than 30 accounting firms. The pilot covered 7,000 tax returns during tax season and reportedly saved practitioners roughly one-third of prep time, drafted returns with up to 97% accuracy, and increased throughput by about 50%.
Those numbers are attention-grabbing. But the more important part is the operating model behind them.
Tax AI improved because practitioner corrections did not disappear into Slack threads, support tickets, or vague "the model got this wrong" notes. The workflow preserved evidence. Engineers could turn that evidence into evals. Codex could help convert production signals into scoped product and engineering improvements. Changes could be tested before they reached the next batch of work.
That pattern matters far beyond accounting. If your business handles documents, forms, approvals, exceptions, compliance checks, reporting, or customer records, the same lesson applies: AI workflow automation needs a feedback loop before it needs a bigger demo.
What OpenAI built
OpenAI and Thrive Holdings built Tax AI for Crete accountants working across a wide range of filings and source documents.
The target workflows included 1040 and 1041 returns, along with common forms and schedules such as W-2s, 1099s, K-1s, Schedule E rental income, Schedule C, and Schedule A. OpenAI notes that for medium- to large-complexity filings, data entry alone can take up to eight hours per return.
That is exactly the kind of work where AI agents look promising on paper:
- many documents
- repeated extraction and mapping tasks
- messy inputs
- prior-year records
- emails, spreadsheets, and notes
- high seasonal pressure
- expensive human review time
- real consequences when details are wrong
Tax AI was not presented as a fully autonomous replacement for accountants. It drafted returns, handled parts of the preparation workflow, and learned from practitioner review.
That distinction matters. In serious business process automation, the goal is rarely "remove the expert." The better goal is to give the expert a cleaner draft, better evidence, fewer repetitive steps, and a reliable way to teach the system when it misses.
Why the accuracy curve matters
The most revealing part of OpenAI's post is the improvement curve.
OpenAI measured accuracy by the share of scored returns that reached different levels of correct field completion. At launch:
- 25.13% of scored returns reached at least 75% field completion
- 1.26% reached at least 90%
- 0.01% were 100% field complete
By week 6:
- 86.24% reached at least 75% field completion
- 59.77% reached at least 90%
- 9.50% were 100% field complete
OpenAI also listed a mid-year target or projection of:
- 98% reaching at least 75% field completion
- 90% reaching at least 90%
- 50% reaching 100%
That curve is the point.
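The metric behind that curve can be sketched in a few lines. This is a minimal illustration, assuming each scored return has a single field-completion rate (correct fields divided by total fields). OpenAI's post does not spell out the exact computation, so the function name `threshold_shares` and the sample values are hypothetical.

```python
from typing import Dict, Iterable, Sequence


def threshold_shares(
    completions: Iterable[float],
    thresholds: Sequence[float] = (0.75, 0.90, 1.0),
) -> Dict[float, float]:
    """For each threshold, return the share of returns at or above it."""
    scores = list(completions)
    if not scores:
        raise ValueError("no scored returns")
    return {t: sum(s >= t for s in scores) / len(scores) for t in thresholds}


# Hypothetical per-return completion rates, not data from the pilot.
sample = [0.60, 0.80, 0.92, 1.00]
shares = threshold_shares(sample)
```

Tracking the same three thresholds week over week is what turns individual corrections into a curve a team can manage against.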
A brittle AI pilot often looks impressive in a controlled demo, then stalls in production. The team finds edge cases, argues about whether they are model problems or workflow problems, patches a few prompts, and slowly loses confidence.
OpenAI's Tax AI story shows a different path. The first version was incomplete. Real users found the gaps. The product captured enough context to understand those gaps. The engineering system turned recurring issues into eval-backed changes. Accuracy improved over weeks.
That is what self-improving AI agents should mean in a business setting. The agent is not magically teaching itself in a vacuum. The business process is producing structured learning material.
Production failures are messy
OpenAI is direct about one of the hardest parts: production failures are not clean benchmark failures.
A mismatched tax field might mean the system failed to extract a value from a document. It might mean the value was extracted correctly but mapped to the wrong destination. It might mean the product did not yet support that schedule. It might reflect a practitioner preference, a prior-year carry-forward, workflow noise, tax judgment, or a downstream change elsewhere in the filing workflow.
That messiness appears in every document-heavy operation.
In finance, a variance may come from a missing invoice, a coding rule, a timing issue, or a human override.
In compliance, a flagged clause may be a real risk, a template artifact, a jurisdiction-specific exception, or a policy that changed last week.
In customer operations, a bad recommendation may come from stale CRM data, a missing email, an unusual account history, or a support agent making a reasonable exception.
If an AI system only records "wrong answer," the team has very little to improve. If it records the source document, extracted field, evidence, workflow step, human correction, reviewer rationale, and final outcome, the team can start separating model errors from product gaps and process ambiguity.
That is where AI agent evals become practical. They stop being abstract benchmark scores and start testing whether the workflow produced the right result from the right evidence.
This is also why we often recommend building an AI approval queue before adding agents. Review screens are not only safety checkpoints. Done well, they become the place where human judgment turns into training signal, product feedback, and audit history.
What business teams can copy from the pattern
Most small and midsized businesses do not need a tax agent. They do need a better way to automate document-heavy work without losing control.
Here is the pattern worth copying.
1. Start with a narrow workflow
Tax AI focused on specific filing workflows and document types. It did not begin as a general "do accounting" agent.
That is the right instinct. Pick a workflow with clear inputs, known reviewers, measurable outcomes, and enough repetition to learn from. Examples:
- invoice intake and GL coding
- contract review for standard terms
- customer onboarding packet review
- monthly reporting package preparation
- insurance or compliance document checks
- support ticket triage with evidence summaries
For many teams, this starts with process automation: mapping the handoffs, documents, decision points, and review steps before adding an agent.
2. Capture evidence, not only outcomes
A useful agent workflow should preserve a receipt of what happened.
For document work, that usually means:
- source file or message
- extracted field
- location or citation
- confidence or uncertainty
- transformation or mapping rule
- human correction
- reviewer note
- final approved output
Without that chain, you cannot reliably answer why the agent failed. With it, you can build evals that test the actual job.
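One way to make that chain concrete is a small record type. The sketch below is an assumption about how such a receipt might be shaped, not a description of how Tax AI stores evidence; every field name and sample value is hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class EvidenceReceipt:
    source_file: str                        # source file or message
    field_name: str                         # extracted field
    extracted_value: str
    location: str                           # location or citation in the source
    confidence: float                       # confidence or uncertainty
    mapping_rule: str                       # transformation or mapping rule
    human_correction: Optional[str] = None  # set only if a reviewer changed it
    reviewer_note: Optional[str] = None

    @property
    def final_value(self) -> str:
        """The approved output: the correction if present, else the draft."""
        if self.human_correction is not None:
            return self.human_correction
        return self.extracted_value

    @property
    def was_corrected(self) -> bool:
        return (
            self.human_correction is not None
            and self.human_correction != self.extracted_value
        )


# Hypothetical example values for illustration only.
receipt = EvidenceReceipt(
    source_file="w2_employer_a.pdf",
    field_name="wages",
    extracted_value="52000.00",
    location="page 1, box 1",
    confidence=0.71,
    mapping_rule="W-2 box 1 -> wages field",
    human_correction="52500.00",
    reviewer_note="Digit misread on a low-quality scan",
)
```

Keeping the draft value and the correction side by side is the design choice that matters: it lets a later eval compare what the agent produced against what the reviewer approved.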
We have written about this in more detail in "Agent evals should test workflow receipts". The short version: if the business process requires evidence, the eval should check evidence too.
3. Turn corrections into structured evals
Practitioner feedback is valuable only if the system can reuse it.
A correction buried in a comment is better than nothing. A correction linked to the source document, field name, expected value, reviewer rationale, and workflow state is far more useful.
That structure lets teams ask:
- Is this a recurring extraction problem?
- Is the mapping wrong?
- Is the product missing a supported case?
- Does the workflow need another approval step?
- Is this actually a policy or judgment question?
- Should the agent abstain next time?
Once those questions are answerable, improvements can be tested before release.
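As a rough sketch of that step, the code below groups reviewer-labeled corrections to find recurring failure modes, then converts the testable ones into eval cases. The `Correction` type, the category labels, and the decision to skip judgment calls are all assumptions made for illustration.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, List, Sequence


@dataclass(frozen=True)
class Correction:
    source_file: str
    field_name: str
    agent_value: str
    expected_value: str
    category: str  # reviewer label: "extraction", "mapping", "unsupported", "judgment"


def recurring_categories(corrections: Sequence[Correction], min_count: int = 2) -> Dict[str, int]:
    """Count corrections per category and keep those seen at least min_count times."""
    counts = Counter(c.category for c in corrections)
    return {cat: n for cat, n in counts.items() if n >= min_count}


def to_eval_cases(
    corrections: Sequence[Correction],
    skip_categories: Sequence[str] = ("judgment",),
) -> List[dict]:
    """Turn corrections into eval cases, leaving out judgment calls
    that have no single expected value."""
    return [
        {"input": c.source_file, "field": c.field_name, "expected": c.expected_value}
        for c in corrections
        if c.category not in skip_categories
    ]


# Hypothetical corrections from a first review batch.
corrections = [
    Correction("a.pdf", "wages", "5200", "52000", "extraction"),
    Correction("b.pdf", "wages", "100", "1000", "extraction"),
    Correction("c.pdf", "rent_income", "0", "1200", "mapping"),
    Correction("d.pdf", "deduction", "500", "0", "judgment"),
]
recurring = recurring_categories(corrections)
cases = to_eval_cases(corrections)
```

Here the two extraction errors surface as a recurring problem, while the judgment call is held out of the eval set for a human policy discussion instead.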
4. Keep approval gates where risk is real
OpenAI's surrounding Codex announcements point in the same direction: agents are becoming more capable, but the work still needs steering, review, and approval. In its May 14 post on working with Codex from anywhere, OpenAI describes users monitoring, guiding, and approving long-running work while files, credentials, permissions, and local setup stay on the machine where Codex operates.
That is a useful model for business automation too.
The agent can draft, extract, reconcile, summarize, and propose. The human should approve where the decision carries financial, legal, customer, or reputational risk.
Approval gates are not a sign that the system failed. They are how you make the system safe enough to use. They are part of responsible AI: clear accountability, human review, and a record of what changed.
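A gate like this can start as a very small routing rule. The following is a minimal sketch, assuming each proposed action carries a risk label and a confidence score; the names, labels, and the 0.9 threshold are hypothetical and would need tuning for a real workflow.

```python
from dataclasses import dataclass

# Risk kinds named in the text; anything else is treated as routine here.
HIGH_RISK_KINDS = {"financial", "legal", "customer", "reputational"}


@dataclass(frozen=True)
class ProposedAction:
    description: str
    risk_kind: str     # one of HIGH_RISK_KINDS, or "routine"
    confidence: float  # agent's self-reported confidence, 0.0 to 1.0


def route(action: ProposedAction, min_confidence: float = 0.9) -> str:
    """Send risky or uncertain actions to a human; let the rest through."""
    if action.risk_kind in HIGH_RISK_KINDS:
        return "needs_approval"
    if action.confidence < min_confidence:
        return "needs_approval"
    return "auto_apply"


risky = route(ProposedAction("post journal entry", "financial", 0.99))
routine = route(ProposedAction("rename attachment", "routine", 0.95))
unsure = route(ProposedAction("tag document type", "routine", 0.40))
```

Note that high confidence never bypasses the gate for a risky action: the risk check runs first on purpose.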
5. Define data boundaries early
The Dell and OpenAI partnership announcement from May 18 frames Codex as moving closer to enterprise data and workflows, including reports, feedback routing, lead qualification, follow-ups, and coordination across business systems. That trend is important: useful agents need context from the places work already happens.
But more context also means more responsibility.
Before automating document-heavy work, decide:
- Which systems can the agent access?
- Which documents are out of scope?
- Which credentials or secrets must never be exposed?
- What data can be stored in traces and evals?
- Who can review sensitive examples?
- How long should evidence be retained?
Those are not cleanup questions. They belong at the start of the project. If your agent needs access to financial records, customer data, employee documents, or regulated information, your data security model is part of the product design.
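Some of these decisions can be written down as code early, even in a pilot. The sketch below assumes a hypothetical allowlist of systems and a denylist of keys that must never reach stored traces. The single regular expression for US Social Security numbers is only an illustration and is nowhere near complete PII detection.

```python
import re

# Hypothetical boundaries; a real project would load these from reviewed config.
ALLOWED_SYSTEMS = {"document_store", "review_queue"}
TRACE_DENYLIST = {"ssn", "password", "api_key"}

# Matches strings shaped like 123-45-6789.
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def can_access(system: str) -> bool:
    """Deny by default: only explicitly listed systems are reachable."""
    return system in ALLOWED_SYSTEMS


def redact_for_trace(record: dict) -> dict:
    """Return a copy of record that is safer to store in traces and evals."""
    cleaned = {}
    for key, value in record.items():
        if key.lower() in TRACE_DENYLIST:
            cleaned[key] = "[REDACTED]"
        elif isinstance(value, str):
            cleaned[key] = SSN_PATTERN.sub("[REDACTED]", value)
        else:
            cleaned[key] = value
    return cleaned


trace = redact_for_trace(
    {"api_key": "abc123", "note": "SSN 123-45-6789 on file", "amount": 10}
)
```

The deny-by-default shape is the point of the sketch: a new system or field stays out of scope until someone deliberately adds it.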
6. Ship improvements on a cadence
The Tax AI accuracy curve improved over weeks because the team had a loop.
A practical SMB version might look like this:
- daily review of failed or uncertain cases during the pilot
- weekly grouping of recurring failure modes
- weekly eval updates for the highest-value issues
- guarded releases for prompt, workflow, or code changes
- monthly review of time saved, review burden, error types, and user trust
The cadence matters because AI automation is rarely "set it and forget it." The workflow changes. Source documents change. Staff develop preferences. Customers send strange files. Policies evolve.
The system needs a way to keep up.
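The "guarded releases" item in that cadence can be sketched as a simple check: run the same eval cases against the current and candidate versions, and ship only if the pass rate improves without breaking cases that used to pass. The function names and the zero-regression default are assumptions for illustration.

```python
from typing import Dict, List, Sequence, Tuple


def pass_rate(results: Sequence[bool]) -> float:
    if not results:
        raise ValueError("empty eval set")
    return sum(results) / len(results)


def should_release(
    baseline: Dict[str, bool],
    candidate: Dict[str, bool],
    max_regressions: int = 0,
) -> Tuple[bool, List[str]]:
    """Compare per-case results (case id -> passed) for two versions.

    Returns (release?, ids of cases that passed before and fail now).
    """
    if baseline.keys() != candidate.keys():
        raise ValueError("both runs must cover the same eval cases")
    regressions = [cid for cid in baseline if baseline[cid] and not candidate[cid]]
    improved = pass_rate(list(candidate.values())) > pass_rate(list(baseline.values()))
    return improved and len(regressions) <= max_regressions, regressions


# Hypothetical eval runs.
baseline = {"c1": True, "c2": False, "c3": False}
safe_change = {"c1": True, "c2": True, "c3": False}
risky_change = {"c1": False, "c2": True, "c3": True}

ok_decision = should_release(baseline, safe_change)
blocked_decision = should_release(baseline, risky_change)
```

The second candidate has a higher overall pass rate but is still blocked, because it breaks a case that previously worked. That is the behavior reviewers tend to notice first.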
Where this applies outside tax
The tax example is specific, but the pattern is broad.
Self-improving AI agents are especially relevant when the work has:
- messy documents
- repeated review steps
- domain experts correcting drafts
- high-volume seasonal or operational pressure
- structured outputs
- clear downstream systems
- risk that requires approval
That includes:
- finance teams preparing monthly reporting packages
- operations teams reviewing vendor documents
- compliance teams checking forms, policies, or attestations
- customer success teams summarizing account histories
- sales teams qualifying leads and routing follow-ups
- HR teams reviewing onboarding paperwork
- legal teams screening standard contract terms
- healthcare or insurance teams reconciling forms and records
The common thread is not the industry. It is the workflow.
If expert review already happens, the question is whether the review creates reusable signal. If the answer is no, the first automation opportunity may be the review process itself.
A practical first-step checklist
If you are considering AI agents for accounting, finance, reporting, compliance, document review, or customer operations, start smaller than the demo in your head.
Use this checklist before you build.
1. Pick one workflow
Choose a workflow where the same kind of document or request appears often. Avoid starting with the weirdest edge cases.
Good first candidates have:
- clear inputs
- a known reviewer
- a clear definition of "done"
- measurable time spent today
- enough volume to learn from
2. Write down the current human process
Map the actual workflow, not the ideal one.
Include:
- where documents arrive
- who opens them
- what they look for
- what systems they update
- when they ask for help
- what they double-check
- what causes rework
This is often where the biggest automation opportunities appear.
3. Define the evidence receipt
Before choosing a model or tool, decide what evidence the agent must show alongside each output.
For each proposed output, ask:
- What source supports this?
- Where did the value come from?
- What transformation happened?
- What uncertainty should be visible?
- What would a reviewer need to approve it quickly?
4. Build the review queue
Do not send agent output straight into a system of record unless the risk is low and the behavior is already proven.
Create a review step where people can:
- approve
- edit
- reject
- add a reason
- flag a missing capability
- escalate uncertain cases
That queue becomes the foundation for safer automation and better evals.
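The reviewer actions above map naturally onto a small data model. This is a minimal sketch under stated assumptions: the `Decision` values mirror the list, a reason is treated as required for anything other than approval, and all names are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Decision(Enum):
    APPROVE = "approve"
    EDIT = "edit"
    REJECT = "reject"
    FLAG_MISSING_CAPABILITY = "flag_missing_capability"
    ESCALATE = "escalate"


@dataclass
class ReviewItem:
    item_id: str
    proposed_value: str
    decision: Optional[Decision] = None
    final_value: Optional[str] = None
    reason: Optional[str] = None


def review(
    item: ReviewItem,
    decision: Decision,
    edited_value: Optional[str] = None,
    reason: Optional[str] = None,
) -> ReviewItem:
    """Record a reviewer decision so it can later feed evals and audits."""
    if decision is Decision.EDIT and edited_value is None:
        raise ValueError("an edit needs the corrected value")
    if decision is not Decision.APPROVE and not reason:
        raise ValueError("a reason is required unless approving")
    item.decision = decision
    item.reason = reason
    if decision is Decision.APPROVE:
        item.final_value = item.proposed_value
    elif decision is Decision.EDIT:
        item.final_value = edited_value
    else:
        item.final_value = None  # nothing is written downstream
    return item


approved = review(ReviewItem("r1", "1200.00"), Decision.APPROVE)
edited = review(
    ReviewItem("r2", "1200.00"),
    Decision.EDIT,
    edited_value="1250.00",
    reason="Second rental unit was omitted",
)
```

Requiring a reason on every non-approval is a deliberate trade: it costs reviewers a few seconds and gives the team a labeled record of why drafts were changed.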
5. Create evals from real corrections
After the first batch, do not only ask, "Was it accurate?"
Ask:
- Which errors repeated?
- Which corrections were judgment calls?
- Which fields lacked evidence?
- Which documents confused the workflow?
- Which failures would have caused real business risk?
- Which changes can be tested before release?
That is the beginning of an improvement loop.
6. Set boundaries for data and permissions
Decide what the agent can read, write, store, and suggest. Keep the first version boring on purpose.
A good pilot should make people more confident each week. If trust drops, slow down and inspect the workflow before expanding scope.
The real lesson from OpenAI Tax AI
OpenAI's Tax AI pilot is useful because it shows the unglamorous side of production AI.
The agent improved because accountants corrected it. The product preserved those corrections as structured evidence. Engineers could diagnose failure modes. Evals could check whether changes helped. Codex could assist with the improvement work. Humans stayed in the loop where judgment and accountability mattered.
That is the blueprint business teams should pay attention to.
If you are exploring AI automation for document-heavy operations, do not start by asking, "Which agent can do this whole job?"
Start with a better question:
"What would our workflow need to capture so the agent can improve safely?"
That question leads to better pilots, clearer governance, and systems your team can actually trust.
BaristaLabs helps teams design AI pilots around real workflows, review gates, evals, and data boundaries. If you are sorting through where to start, our AI consulting work can help you choose the right first use case, and the AI readiness assessment is a low-pressure way to identify the workflows worth evaluating first.
Sources
- OpenAI Engineering: "Building self-improving tax agents with Codex", May 27, 2026
- OpenAI: "Work with Codex from anywhere", May 14, 2026
- OpenAI: "OpenAI and Dell Technologies partner to bring Codex to hybrid and on-premises enterprise environments", May 18, 2026
- OpenAI: "OpenAI named a Leader in enterprise coding agents by Gartner", May 22, 2026