The revenue meeting starts the way these meetings always start now: someone types a question into the analytics agent instead of pinging an analyst.
"How many active customers do we have this quarter?"
The agent answers fast. It cites a table. The number looks clean, formatted, confident. Marketing has one on a slide already, pulled the same way an hour earlier. The two numbers do not match.
Nobody in the room thinks the AI made up data. Nobody's SQL was wrong. Both answers came from a real query against real tables, executed correctly, with no hallucinated SQL anywhere in the chain. The disagreement is not a bug in the model. It is that "active customer" means something different depending on which table you start from, and nobody had ever forced the business to pick one.
That gap, not model quality, is the subject of a June 3 Claude blog post from Anthropic describing how the company uses Claude for its own internal analytics. The headline numbers are strong. The more useful part of the post is everything underneath them: what it actually took to make an analytics agent trustworthy enough to run unsupervised.
The number that gets quoted, and the number that explains it
Anthropic's claim, stated plainly: "95% of business analytics queries are automated via Claude, with ~95% accuracy in aggregate." Read alone, that sounds like a case for pointing any LLM at a warehouse and letting it run.
Anthropic's own post argues against reading it that way. It warns that doing exactly this, giving an agent warehouse access and letting it execute, can create what the post calls a "false sense of precision." A wrong number that shows up formatted like a right one is more dangerous than a system that visibly fails, because nobody double-checks a well-dressed answer.
The post names three specific failure modes behind inaccurate responses:
- Concept and entity ambiguity — the business has more than one plausible definition of a term like "active," "conversion," or "customer," and the agent has no way to know which one the asker meant.
- Data staleness — the underlying table hasn't refreshed, so a technically correct query returns a confidently wrong answer.
- Retrieval failure — the agent finds the wrong table, view, or historical pattern to route the question through.
None of these are code-generation problems. Claude can write the SQL. The failures happen earlier, in the step where a business question gets mapped onto the right, current, unambiguous piece of the data model. That is a governance problem wearing an engineering costume.
The gap between 21% and 95% is not a bigger model
The most useful number in Anthropic's post is not the 95% headline. It's the number right before it: without what Anthropic calls "skills," Claude's accuracy on its internal analytics evals never topped 21%. Adding skills, curated reference material that tells the agent which datasets are canonical, how to handle known edge cases, and which query patterns are approved, pushed aggregate accuracy consistently above 95%, and close to 99% in some domains.
Same model, same weights. What changed was the maintained layer of business context sitting between Claude and the bare warehouse. Anthropic also tested raw retrieval over thousands of historical queries as a shortcut, and reports it moved accuracy by less than a single point. Distilling that history into structured, curated reference docs did the actual work.
The implication for anyone buying "chat with your data" as a product: the vendor's benchmark accuracy is a property of their maintained context, not a property of the model you're both using. Swap in your own warehouse, your own definitions, your own drift, and you inherit the 21% case until someone builds the layer that gets you to the other number.
Maintenance is the whole job, not a footnote
Anthropic includes a detail that should worry anyone planning to ship an analytics agent and walk away: offline accuracy on their own evals drifted from roughly 95% at launch down to about 65% over the course of a month, before they treated skill maintenance as an engineering problem with real process behind it, not a one-time setup task.
Their fix was structural, not vigilance-based. They colocate the skill markdown files that describe a metric with the transformation models that produce it, so a change to one sits next to a change to the other in the same codebase. They added a code-review hook that flags data-model pull requests which touch a reporting model without a corresponding skill-file update. The result, per the post: roughly 90% of Anthropic's data-model PRs now include a skill change in the same diff.
That's the load-bearing idea in the whole piece, and it's the one BaristaLabs readers should walk away with. Anthropic didn't solve analytics accuracy by writing a better prompt. They solved it by making metric-definition maintenance a mandatory side effect of changing the data model, enforced in the same review gate that already existed for code. Skip that discipline and the 95% number decays on its own schedule, whether or not anyone's watching. Reddit's r/BusinessIntelligence picked up on exactly this point: the thread's framing wasn't whether Claude could answer questions, it was how teams without Anthropic's engineering headcount are supposed to keep that context fresh once it ships.
It also lines up with a related complaint surfacing in r/analytics: a lot of analytics-team bottlenecks were never really about data access. They were about context, the tribal knowledge of what a metric actually means, that never made it into anything a machine, or a new hire, could read.
The artifact: a source-of-truth register, not a bigger warehouse grant
Here's the part most self-service analytics pitches skip. The lesson from Anthropic's own internal build is not "give the agent warehouse access." It's "decide, in writing, exactly which questions the agent is allowed to answer, and who is accountable for keeping each answer current."
That's small enough to start this week. Before an analytics agent gets production access, write down the handful of business questions people actually ask on repeat, and give each one a row.
The Analytics Agent Source-of-Truth Register
Metric-truth field packet
Analytics Agent Source-of-Truth Register
Every business question the agent is allowed to answer gets one row before it goes live.
- 01
Business question category
Required
Pins down: "Active customers," "conversion rate," "pipeline value"
Why it matters:Two teams asking the same words can mean two different queries. The category is what the register actually governs.
- 02
Canonical metric and dataset
Required
Pins down: The one query path the agent is allowed to use
Why it matters:If three tables could answer the question, the agent needs to be told which one is correct, not left to guess the plausible one.
- 03
Owner
Required
Pins down: A named person, not a team alias
Why it matters:A metric owned by "data@" has no one to page when the number looks wrong on a Friday.
- 04
Grain
Required
Pins down: Per day, per account, per session
Why it matters:"Active customers" summed at the wrong grain is a different, wrong number with the same label.
- 05
Required filters and exclusions
Required
Pins down: Test accounts, churned rows, internal traffic
Why it matters:The filters live in someone's head until they are written down, and the agent cannot read a head.
- 06
Freshness SLA
Required
Pins down: How current the source table has to be before the agent can answer
Why it matters:A confident answer from three-week-old data is worse than a refusal.
- 07
Approved semantic-layer or query path
Required
Pins down: The blessed view, metric, or skill the agent should route through
Why it matters:This is the difference between a 21% eval score and a 95% one, according to Anthropic's own numbers below.
- 08
Known gotchas
Sample
Pins down: The edge case that has burned someone before
Why it matters:Institutional memory that never got written down is exactly what an agent cannot infer.
- 09
Last changed date
Required
Pins down: When the metric definition or its underlying model last moved
Why it matters:A stale date is the fastest signal that a row needs a re-check before anyone trusts its answer.
- 10
Agent reference doc touched
Required
Pins down: Yes or no, and a link to the diff
Why it matters:If the data model changed and the reference doc did not, the agent is now answering from a memory of a table that no longer exists.
- 11
Validation sample
Required
Pins down: A known-correct answer the agent's output gets checked against
Why it matters:Without a ground truth to compare against, drift is invisible until someone in a meeting disputes the number.
- 12
Human review threshold
Required
Pins down: Which answers this metric produces need a person to sign off before they leave the chat
Why it matters:Exploratory curiosity and a number going into a board deck are not the same risk, and should not get the same review.
- 13
Decision-grade or exploratory
Required
Pins down: Whether this metric is safe to act on unreviewed
Why it matters:The register's whole job is to keep exploratory answers from quietly becoming decision-grade ones.
No row, no query path. The agent only answers questions that made it onto the register.

The register does two things a dashboard never will. It forces someone to notice when two teams have quietly been using different definitions of the same word. And it gives the agent a hard boundary: if a question isn't on the register, it doesn't get a canonical answer, it gets flagged for a human, the same way an agent receipt forces a written trail before an action goes out the door.
The operating loop that keeps the register honest
A register that's built once and never touched again decays exactly the way Anthropic's did, 95% down to 65% in a month. The fix that worked for them generalizes past their own stack: tie the register to the same change process that already governs the data model.
In practice, that means one rule. If a pull request changes a metric's underlying table, filter logic, or grain, it is not done until the matching register row and agent reference doc are updated in the same change. Not filed as a follow-up ticket. Not left for the next sprint. Same diff.
This is the same discipline BaristaLabs has argued for around evaluating the workflow receipt instead of just the final answer: checking whether an agent's output looks right is not the same as checking whether the process that produced it is still sound. An analytics answer that cites a real table and a real number can still be wrong, because the table it cites moved out from under the definition six weeks ago and nobody updated the row that says so.
It's also a sharper version of a pattern already visible in how adoption dashboards get judged: a metric without an owner and a freshness rule is a vanity number wearing a chart. The fix there was the same shape, tie the measurement to something a person is accountable for, not just something a tool reports.
Where this differs from "give the agent a data engineer"
It's worth distinguishing this from a related but different trend: autonomous data agents that write pipelines, fix broken transforms, and ship dashboards on their own, the kind covered in Databricks' Genie Code launch. That question is about who builds the pipeline, not who's allowed to ask it a question.
Anthropic's post is a trust-boundary story. It assumes the pipelines already exist and asks a narrower, arguably harder question: once the data is there, which business questions is an agent allowed to answer on its own, and which ones need a human before the number leaves the room? A team could deploy the most capable autonomous data agent on the market and still get the active-customers problem, because that failure lives in the metric layer, not the pipeline layer.
The buyer's question that actually matters
If you're evaluating a "chat with your data" product, or building one internally, Anthropic's own numbers hand you the diligence question: not "how accurate is your model," but "how accurate is your model on a metric layer nobody has maintained in a month?"
Ask the vendor what their accuracy looks like without their curated skills and reference docs. Ask who is responsible for updating your metric definitions after your data model changes, and whether that's a process or a hope. Ask what happens when someone asks a question that isn't on the register: does the agent guess, or does it say it doesn't know and route to a person.
None of that requires distrust of the technology. Anthropic's numbers say the technology works, when the metric layer underneath it is actively maintained. The register is what makes "actively maintained" checkable instead of assumed.
Bring one recurring dashboard question, the one that gets a different answer depending on who runs it, and BaristaLabs will help you map the source-of-truth register, the freshness rule, the answer receipt, and the review threshold before you hand the whole analytics queue to an agent. Or start by reading how we think about drawing the line on what an assistant may read, surface, and act on before production access expands.
Before you automate the analytics queue
Bring one recurring dashboard question
BaristaLabs will map the source-of-truth register, freshness rule, answer receipt, and review threshold before you let an agent field it unsupervised.
Best fit when the same question gets a different answer depending on who runs the query.
Practical AI Workflow Notes
Want more practical AI operations ideas?
Get short notes on applying AI inside real small-business workflows — from document handling and customer follow-up to internal reporting, compliance, and automation guardrails.
Turn this idea into a pilot
Which workflow should go first?
Use the readiness check to compare impact, effort, risk, owner, and next step before booking a call.
- 3-5 minutes
- Deterministic score
- No sensitive data
Share this post
