The support lead opens the morning review sample. The dashboard is busy. Session count is up. Tokens are up. The trace log is full of model calls, retrieval steps, tool responses, retries, and timestamps.
Nobody in the room can answer the simple question.
Did customers actually get helped?
One customer asked about a refund. Another tried to finish an onboarding checklist. A sales rep asked the internal agent for an account-health review before a renewal call. The support team says ticket volume is down, but repeat contacts are creeping up. Engineering can show a trace waterfall. Product can show a session chart. Operations still wants to know which work got completed and which work quietly fell apart.
That is the measurement gap customer-facing agents create.
A chatbot experiment can survive on anecdotes. A production agent cannot. Once an AI agent talks to customers, handles service work, drafts responses, pulls account data, or routes cases, the dashboard has to move past activity. The useful question is not "how much did the model run?" It is "what happened to the job?"
Agent analytics is becoming its own lane
The recent Launch HN for Voker, titled "Analytics for AI Agents," is a useful signal because the discussion went beyond launch-copy reactions. The item had 59 points and 22 comments when captured during research covering the previous 30 days, and the comments circled the measurement problem operators are running into now.
How is this different from tracing tools like Langfuse? How does it compare to Amplitude's agent analytics? What data model lets teams compare agents that use different tools, policies, and flows? Should teams normalize around raw token and turn metrics, or around whether the user accomplished the thing they came to do?
One commenter put the split cleanly: raw token and turn metrics and accomplishment metrics can paint completely different pictures of "is this agent working."
That is the right argument.
Voker describes itself as "analytics for AI agents" and asks, "Do you really know what your agents are saying to your users?" Its site makes the same operational complaint many teams are feeling: scanning traces does not tell you whether agents are helpful, accurate, or hitting walls. The Voker docs position it as an instrumentation product, with Python and TypeScript SDKs and provider support for OpenAI, Anthropic, Gemini, and AI SDK.
The launch is not the point. The category shift is.
Teams are moving from "we shipped an agent" to "prove the agent helped."
Traces, evaluations, and analytics answer different questions
A production AI agent needs three kinds of visibility. They overlap, but they should not be collapsed into one dashboard.
Tracing shows what the model and tool chain did.
A trace can tell you which model ran, which prompt was used, which retrieval source was called, which tool failed, how long the step took, and what the intermediate output looked like. This matters. If the refund lookup tool times out or the agent sends the wrong argument to an account API, traces are how engineering debugs the failure.
The infrastructure side is getting more standardized. The OpenTelemetry generative AI semantic conventions define common signals for GenAI events, metrics, model spans, and agent spans. That helps teams describe what happened inside the system in a more portable way.
It still does not answer whether the customer got the refund answer they needed.
Evaluation asks whether an output or run passed a test.
Evaluation work often includes offline test sets, online evaluation, human annotation, evaluator feedback, and trace review. The LangSmith evaluation concepts docs separate offline and online evaluations and cover datasets, annotations, traces, and evaluator scores.
This is the quality lane. It helps you ask, "Did this response follow the rubric?" or "Did the new prompt do better against known examples?" We wrote about a related version of this problem in "AI observability needs a quality lane, not just GPU charts."
Agent analytics asks whether the user's job reached the intended outcome.
This is where normal product analytics also comes up short. Clicks, pageviews, sessions, funnels, and conversions are useful, but an agent does not behave like a static product surface. The work happens inside conversation turns, tool calls, retrieved knowledge, handoffs, corrections, and policy boundaries.
A refund agent can have a short session and fail. It can have a long session and succeed. It can answer accurately but still miss the customer's real problem. It can deflect a ticket in the dashboard while creating a second contact tomorrow.
The agent analytics view starts from the job, then connects traces and product events around it.
Start the dashboard with the job
If you are instrumenting your first customer-facing AI agent, do not start with tokens, latency, or a giant trace waterfall. Those belong in the system. They should not be the first thing the support lead sees.
Start with a plain sentence:
"This agent is working when..."
Then finish the sentence with a job a customer or employee would recognize.
"This agent is working when a customer gets a refund eligibility answer and either receives the right next step or gets routed to a human with the required context."
"This internal account agent is working when a rep completes an account-health review before a renewal call, with current usage, open issues, renewal risk, and recommended follow-up."
"This onboarding agent is working when the user finishes the required setup checklist without opening a support ticket or repeating the same question."
Once the job is clear, instrumentation gets less abstract. You can map the lifecycle of the work instead of collecting every available metric and hoping the dashboard explains itself.
A first agent analytics pass should capture six events.
A small measurement framework for the first pass
Outcome event
The outcome event records the work that counts as completed.
For a support agent, this might be refund eligibility explained, password reset completed, appointment rescheduled, order status found, or escalation created with all required fields.
For an internal agent, it might be account review completed, invoice variance explained, onboarding checklist finished, or exception routed to the right queue.
The outcome event should be specific enough that the team can argue about it. "Session resolved" is often too vague. "Refund policy answer delivered with order status and next action" is better.
Good outcome events also separate agent confidence from business completion. The agent saying "I helped" is not enough. Look for the thing that happened after the conversation: no repeat contact, form submitted, case closed, order updated, meeting note accepted, or handoff completed.
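One way to keep agent confidence and business completion apart is to store them as separate fields on the outcome event. The sketch below is illustrative only: the `OutcomeEvent` class, its field names, and the label strings are hypothetical and do not come from any particular analytics SDK.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class OutcomeEvent:
    """Hypothetical outcome record for one agent session."""
    session_id: str
    workflow: str                             # e.g. "refund_eligibility"
    outcome: str                              # specific label the team can argue about
    agent_claimed_done: bool                  # what the agent itself reported
    downstream_signal: Optional[str] = None   # what happened after the conversation

    @property
    def confirmed(self) -> bool:
        # The job counts as completed only when a downstream signal
        # backs up the agent's own claim.
        return self.agent_claimed_done and self.downstream_signal is not None


events = [
    OutcomeEvent("s1", "refund_eligibility",
                 "refund_policy_answer_with_order_status_and_next_action",
                 agent_claimed_done=True, downstream_signal="no_repeat_contact"),
    OutcomeEvent("s2", "refund_eligibility",
                 "refund_policy_answer_with_order_status_and_next_action",
                 agent_claimed_done=True),  # agent said "I helped"; nothing confirms it
]

claimed = sum(e.agent_claimed_done for e in events)
confirmed = sum(e.confirmed for e in events)
print(f"claimed={claimed} confirmed={confirmed}")
```

Reporting both counts side by side makes the gap between "the agent said it helped" and "the work got done" visible instead of hiding it inside a single resolution rate.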
Wall event
The wall event records where the agent got stuck or gave up.
This is not just a model failure. The wall might be a missing tool, an unavailable source, a user request outside scope, an ambiguous policy, a permissions issue, or a loop where the agent keeps asking for information the user already provided.
A wall event gives the morning review sample teeth. Instead of "the agent had 19 failed sessions," the team can see that 11 were caused by missing refund exceptions for subscriptions, 4 by identity verification failures, and 4 by the shipping carrier API returning partial data.
That changes the fix.
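The breakdown above is a simple count by wall type. A minimal sketch, assuming each failed session has already been tagged with one wall label (the label strings here are made up for illustration):

```python
from collections import Counter

# Hypothetical wall labels for the 19 failed sessions in the example above.
wall_events = (
    ["missing_refund_exception_for_subscriptions"] * 11
    + ["identity_verification_failure"] * 4
    + ["carrier_api_partial_data"] * 4
)


def wall_breakdown(events):
    """Return (wall_type, count) pairs, most frequent first."""
    return Counter(events).most_common()


breakdown = wall_breakdown(wall_events)
for wall_type, count in breakdown:
    print(f"{count:>3}  {wall_type}")
```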
Rescue event
The rescue event records when a human had to step in.
A rescue is not automatically bad. Some work should go to a person. The question is whether the handoff happened at the right time, with the right context, and without making the customer repeat themselves.
If the agent escalates billing disputes early with a clean summary, that may be the right behavior. If it spends six turns guessing before a human rescues the case, the analytics should show the waste.
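To show that waste, a rescue event needs to carry more than "a human took over." The following sketch assumes a hypothetical `RescueEvent` shape that records how many turns passed before the handoff and whether the human received context; none of these names come from a real SDK.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class RescueEvent:
    """Hypothetical record of a human stepping in."""
    session_id: str
    reason: str
    turns_before_rescue: int
    summary_attached: bool        # did the human receive the agent's context?
    customer_repeated_info: bool  # did the customer have to start over?


def rescue_report(rescues):
    """Summarize handoff timing and quality for a non-empty list of rescues."""
    clean = [r for r in rescues
             if r.summary_attached and not r.customer_repeated_info]
    return {
        "count": len(rescues),
        "avg_turns_before_rescue": mean(r.turns_before_rescue for r in rescues),
        "clean_handoff_rate": len(clean) / len(rescues),
    }


rescues = [
    # Early escalation with a clean summary.
    RescueEvent("s1", "billing_dispute", 1, True, False),
    # Six turns of guessing, then a handoff with no context.
    RescueEvent("s2", "billing_dispute", 6, False, True),
]
report = rescue_report(rescues)
print(report)
```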
This is where agent analytics connects directly to approval queues. If the agent is about to refund money, change account status, send a legally sensitive response, or touch a high-value customer, you need a handoff design. We covered the operating pattern in "Build an AI approval queue before you build another agent."
Policy event
The policy event records when the agent approached a boundary.
Boundaries include refund authority, regulated advice, account access, privacy, discounts, security actions, cancellations, medical or legal language, and anything that should require human review.
The policy event should capture both near misses and approved actions. An agent that never crosses a boundary might be safe because it is well designed. It might also be useless because it refuses everything.
A policy event lets the team see whether the policy is doing real work. If 30 percent of sessions hit the same approval boundary, the issue may not be the model. The issue may be that the business has not decided what the agent is allowed to do.
That decision should be written before the tool choice. See "Write the AI approval policy before choosing the agent."
Knowledge-gap event
The knowledge-gap event records what source, answer, or context was missing.
Most agent failures are not dramatic hallucinations. They are smaller and more expensive: the agent cannot find the updated warranty policy, does not know the edge case for prepaid accounts, retrieves stale setup instructions, or cannot see the customer tier.
A knowledge-gap event should name the missing artifact. Not "RAG failed." Something closer to "missing refund policy for annual plans canceled inside first 14 days" or "help center article conflicts with billing-system status."
This gives the content, support, and operations teams a repair queue. It also keeps engineering from treating every bad answer as a prompt problem.
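A repair queue can be as simple as grouping knowledge-gap events by the named artifact. This is a sketch under assumptions: the event dictionaries, the `owner` field, and the team names are hypothetical.

```python
from collections import defaultdict

# Hypothetical knowledge-gap events; each one names the missing artifact.
gap_events = [
    {"session_id": "s1", "owner": "support_content",
     "missing_artifact": "refund policy for annual plans canceled inside first 14 days"},
    {"session_id": "s2", "owner": "operations",
     "missing_artifact": "help center article conflicts with billing-system status"},
    {"session_id": "s3", "owner": "support_content",
     "missing_artifact": "refund policy for annual plans canceled inside first 14 days"},
]


def repair_queue(events):
    """Group sessions by (owner, missing artifact), most affected first."""
    grouped = defaultdict(list)
    for event in events:
        key = (event["owner"], event["missing_artifact"])
        grouped[key].append(event["session_id"])
    return sorted(grouped.items(), key=lambda item: len(item[1]), reverse=True)


queue = repair_queue(gap_events)
for (owner, artifact), sessions in queue:
    print(f"{len(sessions)} session(s) | {owner} | {artifact}")
```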
Business event
The business event connects the agent interaction to the result the company cares about.
This might be ticket deflection, faster resolution, fewer repeat contacts, conversion, retention, onboarding completion, lower backlog, fewer manual reviews, or cleaner escalations.
Be careful here. Do not pretend every saved ticket is saved money. A deflected ticket that becomes two contacts tomorrow is not a win. A faster response that sends the wrong answer is not a win. A high resolution rate with rising refunds may hide a policy problem.
Business events need enough downstream data to catch the second-order effects.
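The deflection example can be made concrete by reporting a raw rate next to a rate that excludes sessions followed by a repeat contact. The field names below are hypothetical, and the window used to decide what counts as a repeat contact is left to the team.

```python
# Hypothetical per-session business data joined from the ticketing system.
sessions = [
    {"session_id": "s1", "ticket_opened": False, "repeat_contact": False},
    {"session_id": "s2", "ticket_opened": False, "repeat_contact": True},
    {"session_id": "s3", "ticket_opened": False, "repeat_contact": False},
    {"session_id": "s4", "ticket_opened": True,  "repeat_contact": False},
]


def deflection_rates(rows):
    """Return (raw, net) deflection rates for a non-empty list of sessions."""
    deflected = [r for r in rows if not r["ticket_opened"]]
    # A deflection that comes back as a repeat contact is not counted as a win.
    held = [r for r in deflected if not r["repeat_contact"]]
    return len(deflected) / len(rows), len(held) / len(rows)


raw_rate, net_rate = deflection_rates(sessions)
print(f"raw deflection: {raw_rate:.0%}, net of repeat contacts: {net_rate:.0%}")
```

If the two numbers drift apart over time, the agent may be moving work to tomorrow rather than completing it.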
What to instrument in the first week
A small team does not need a perfect analytics program before it ships an agent. It does need a first week of measurement that can survive contact with customers.
Pick one workflow. Not the whole support function. Not "customer service." One workflow.
Refund eligibility. Onboarding checklist completion. Appointment rescheduling. Account-health review. Order-status lookup. Internal IT access request.
For that workflow, define:
- the outcome event that means the job got done
- the wall events that mean the agent got stuck
- the rescue event that means a human took over
- the policy events that require approval or review
- the knowledge-gap event format the team will use to repair sources
- the business event that shows whether the workflow improved
Then tag 25 to 50 real sessions by hand.
This is not busywork. It is how you learn whether the labels make sense. The first pass will expose bad assumptions. The team may discover that "resolved" is too broad, that the agent is escalating too late, or that the most common wall is a policy nobody has written down.
After that, automate what is stable. Keep human review for the ambiguous cases.
A practical first dashboard might show:
- outcome rate by workflow
- wall rate by wall type
- rescue rate and average turns before rescue
- policy-boundary hits by policy type
- top knowledge gaps by missing source or answer
- repeat contact rate after agent sessions
- trace links for the sessions behind each metric
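As a rough illustration of how a few of these rows could be computed from hand-tagged sessions, the sketch below keeps the trace IDs behind every rate so a reviewer can open the underlying sessions. The session fields and the `metric` helper are assumptions, not a specific product's schema.

```python
# Hypothetical hand-tagged sessions for one workflow.
tagged_sessions = [
    {"trace_id": "t-001", "outcome": True,  "wall": None,
     "rescued": False, "repeat_contact": False},
    {"trace_id": "t-002", "outcome": False, "wall": "identity_verification_failure",
     "rescued": True,  "repeat_contact": False},
    {"trace_id": "t-003", "outcome": True,  "wall": None,
     "rescued": False, "repeat_contact": True},
    {"trace_id": "t-004", "outcome": False, "wall": "carrier_api_partial_data",
     "rescued": False, "repeat_contact": True},
]


def metric(sessions, predicate):
    """Return the share of sessions matching predicate, plus their trace IDs."""
    hits = [s for s in sessions if predicate(s)]
    return {
        "rate": len(hits) / len(sessions),
        "trace_ids": [s["trace_id"] for s in hits],
    }


dashboard = {
    "outcome_rate": metric(tagged_sessions, lambda s: s["outcome"]),
    "wall_rate": metric(tagged_sessions, lambda s: s["wall"] is not None),
    "rescue_rate": metric(tagged_sessions, lambda s: s["rescued"]),
    "repeat_contact_rate": metric(tagged_sessions, lambda s: s["repeat_contact"]),
}

for name, row in dashboard.items():
    print(f"{name}: {row['rate']:.0%} -> {row['trace_ids']}")
```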
The trace is still there. It just sits behind the outcome, not in front of it.
Approval policies make the analytics cleaner
Agent analytics gets messy when the business has not decided what the agent is allowed to do.
Support teams feel this immediately. Can the agent offer a refund? Can it apply a discount? Can it tell a customer their account may be canceled? Can it summarize a disputed charge? Can it promise a callback time? Can it change a shipping address?
If every boundary is vague, the analytics will be vague too. The system will record a pile of "needs human review" events without explaining whether the agent behaved correctly.
A written approval policy gives the dashboard better categories.
For example:
- auto-answer allowed: explain published refund policy
- approval required: issue refund above $100
- human-only: dispute fraud-related charge
- blocked: request for another customer's account data
- review sample: frustrated customer with two prior contacts
Now the policy event can tell you whether the agent followed the rule, where the rule created friction, and which work should move into an approval queue.
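A written policy like the one above can be expressed as a small lookup table, which lets each policy event record whether the agent followed the rule. This is a sketch only: the action keys, the agent behavior labels, and the mapping of tiers to acceptable behaviors are assumptions a team would replace with its own decisions.

```python
from enum import Enum


class Tier(Enum):
    AUTO_ANSWER = "auto-answer allowed"
    APPROVAL_REQUIRED = "approval required"
    HUMAN_ONLY = "human-only"
    BLOCKED = "blocked"
    REVIEW_SAMPLE = "review sample"


# The example policy from the list above, keyed by hypothetical action names.
POLICY = {
    "explain_published_refund_policy": Tier.AUTO_ANSWER,
    "issue_refund_above_100": Tier.APPROVAL_REQUIRED,
    "dispute_fraud_related_charge": Tier.HUMAN_ONLY,
    "request_other_customer_account_data": Tier.BLOCKED,
    "frustrated_customer_two_prior_contacts": Tier.REVIEW_SAMPLE,
}

# Assumed mapping from tier to the agent behaviors that comply with it.
COMPLIANT_BEHAVIOR = {
    Tier.AUTO_ANSWER: {"answered"},
    Tier.APPROVAL_REQUIRED: {"queued_for_approval"},
    Tier.HUMAN_ONLY: {"handed_off"},
    Tier.BLOCKED: {"refused"},
    Tier.REVIEW_SAMPLE: {"answered", "handed_off"},  # reviewed afterwards either way
}


def policy_event(action, agent_behavior):
    """Build a policy event that records the tier and whether the rule was followed."""
    tier = POLICY[action]
    return {
        "action": action,
        "tier": tier.value,
        "agent_behavior": agent_behavior,
        "followed_policy": agent_behavior in COMPLIANT_BEHAVIOR[tier],
    }


ok = policy_event("issue_refund_above_100", "queued_for_approval")
violation = policy_event("issue_refund_above_100", "answered")
print(ok)
print(violation)
```

Counting events by `tier` shows where the policy creates friction, while `followed_policy` separates correct handoffs from genuine violations.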
This matters for process automation because the agent is rarely the whole workflow. It is one worker inside a queue. The queue needs states, owners, handoffs, exceptions, and audit trails.
The useful dashboard is boring in the right places
There is a temptation to make agent analytics look futuristic. Conversation maps. Sentiment arcs. Intent clouds. Token curves. Animated trace diagrams.
Some of that can help. Most of it should come later.
The first useful dashboard is usually more boring:
- how many users came with this job
- how many left with the job done
- where the agent got stuck
- how often a human rescued the work
- which policies caused review
- which missing sources caused bad answers
- what happened to the customer after the session
That dashboard is useful because a support lead, product manager, and engineer can sit in the same review and decide what to change.
Engineering can fix the tool call. Support can rewrite the knowledge article. Product can change the flow. Operations can move a boundary into an approval queue. Leadership can see whether the agent helped the business or just shifted work out of sight.
Before adding another agent, make the first one measurable
The next AI agent will always sound more exciting than the measurement work around the current one.
Build a sales agent. Build an onboarding agent. Build an internal help desk agent. Build an account-review agent. Build a back-office workflow agent.
Maybe you should.
But if the first agent is already talking to customers or touching a work queue, pause long enough to instrument the job. Do not let tokens, latency, or trace volume become a substitute for outcome visibility.
Traces tell you what happened inside the machine. Evaluations tell you whether an output passed a test. Agent analytics tells you whether the work got done.
If you want a practical starting point, take one production agent and review one workflow end to end. Define the outcome, wall, rescue, policy, knowledge-gap, and business events. Then sample the real sessions.
If the dashboard cannot answer whether the customer got helped, it is not an agent analytics dashboard yet.
For teams moving from AI experiments to customer-facing workflows, BaristaLabs can help review one agent's outcome metrics, approval handoffs, and measurement gaps. The goal is not another tool recommendation. It is a cleaner answer to the morning question: did the work get done?
