A support triage agent routes a customer to the wrong account team.
Nobody panics. The ticket gets corrected. Someone drops a note in Slack: "Looks like memory crossed accounts again." A developer tweaks the prompt, maybe adds a warning to the tool description, and the next demo works fine.
That is where a lot of agent programs quietly lose the plot.
The problem was not that the agent made a mistake. Production systems make mistakes. The problem was that the mistake became a conversation instead of a fixture.
If the same memory leak, stale lookup, skipped approval, or wrong tool order can happen again next month, the team did not learn from the incident. They just talked about it.
That is the useful angle in AWS's May 28 post on dataset management in Amazon Bedrock AgentCore. The interesting part is not "AWS shipped a feature." It is the operating pattern underneath it: production failures should become permanent regression tests.
For teams building agents in support, finance ops, sales ops, reporting, document workflows, or internal automation, that pattern matters whether or not they use AWS.
A production agent is closer to a junior teammate with permissions than to a chatbot. It reads context, calls tools, remembers things, drafts answers, updates systems, and sometimes takes action. If that teammate makes the same mistake twice because nobody wrote it down as a test, the process is broken.
What AWS shipped, in plain English
Amazon Bedrock AgentCore is AWS's platform for building and operating agents with different frameworks and foundation models. The broader AgentCore docs describe services for runtime, memory, observability, evaluations, policy, registry, browser use, code execution, identity, and tool gateways.
The dataset management post focuses on one slice: how teams create and manage evaluation datasets for agents.
In plain English, AgentCore gives teams a managed way to store agent test cases as datasets. Each scenario can include:
- the input or user situation
- the expected output
- assertions the agent must satisfy
- the expected tool sequence or trajectory
- metadata such as tags and identifiers
Teams work in a mutable draft while they are adding or editing scenarios. When the dataset is ready, they publish an immutable numbered version. Future evaluation runs can pin to that version, so the inputs and ground truth stay fixed.
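To make the draft-and-publish idea concrete, here is a minimal sketch in plain Python. It is not AgentCore's API. The `Scenario` and `Dataset` classes, their fields, and the sample scenarios are all hypothetical, chosen only to mirror the scenario contents listed above.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Scenario:
    """One test case. Field names are illustrative, not an AgentCore schema."""
    scenario_id: str
    user_input: str
    expected_output: str
    assertions: tuple[str, ...] = ()
    expected_tools: tuple[str, ...] = ()
    tags: tuple[str, ...] = ()


class Dataset:
    """One mutable draft plus immutable, numbered published versions."""

    def __init__(self, name: str) -> None:
        self.name = name
        self._draft: dict[str, Scenario] = {}
        self._versions: list[tuple[Scenario, ...]] = []

    def upsert(self, scenario: Scenario) -> None:
        # Edits only ever touch the draft.
        self._draft[scenario.scenario_id] = scenario

    def publish(self) -> int:
        # Snapshot the draft as a tuple of frozen scenarios and return its number.
        if not self._draft:
            raise ValueError("cannot publish an empty draft")
        self._versions.append(tuple(self._draft.values()))
        return len(self._versions)

    def version(self, number: int) -> tuple[Scenario, ...]:
        if not 1 <= number <= len(self._versions):
            raise KeyError(f"{self.name} has no version {number}")
        return self._versions[number - 1]


dataset = Dataset("support-triage")
dataset.upsert(Scenario(
    "cross-account-memory",
    "Where is my Account A invoice?",
    "Routes to the Account A team",
    assertions=("no Account B data in context",),
))
v1 = dataset.publish()

# Later edits land in the draft; version 1 stays exactly as published.
dataset.upsert(Scenario(
    "stale-quote",
    "What is ACME trading at?",
    "Quotes a fresh price",
    expected_tools=("market_data",),
))
v2 = dataset.publish()
```

An evaluation run that pins `dataset.version(1)` keeps seeing one scenario even after version 2 adds a second.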
That last part matters.
Agent outputs are non-deterministic. The same agent can answer the same question slightly differently across runs. A model swap, prompt edit, tool description change, or memory tweak can move scores around. If the test inputs are also changing, nobody can tell whether quality improved or the benchmark moved.
Stable datasets give the team something solid to compare against.
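One way to enforce that comparison rule is to refuse to diff two runs unless they used the same dataset version. The sketch below assumes a hypothetical `EvalRun` record; the labels and numbers are made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalRun:
    """Summary of one evaluation run. Assumes total > 0."""
    label: str
    dataset_version: int
    passed: int
    total: int

    @property
    def pass_rate(self) -> float:
        return self.passed / self.total


def pass_rate_delta(baseline: EvalRun, candidate: EvalRun) -> float:
    """Return candidate minus baseline pass rate, refusing mismatched versions."""
    if baseline.dataset_version != candidate.dataset_version:
        raise ValueError(
            f"runs used dataset versions {baseline.dataset_version} and "
            f"{candidate.dataset_version}; the scores are not comparable"
        )
    return candidate.pass_rate - baseline.pass_rate


before = EvalRun("prompt-v7", dataset_version=3, passed=18, total=20)
after = EvalRun("prompt-v8", dataset_version=3, passed=16, total=20)
delta = pass_rate_delta(before, after)  # about -0.10: the change made things worse
```

Because outputs vary between runs, a single delta is a signal rather than a verdict. Repeating the run on the same pinned version shows how much of the movement is noise.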
AWS also distinguishes between predefined scenarios and simulated scenarios. Predefined scenarios are the regression suite: known inputs, known expected behavior, known assertions. They fit outer-loop checks such as CI gates before deployment.
Simulated scenarios are better for discovery. They help teams find failure modes they have not scripted yet. Once a simulated run exposes a real weakness, the team can curate it into a predefined scenario and publish it in the next dataset version.
That is the loop: discover messy behavior, turn the important failures into fixed test cases, then keep running them.
Why versioned datasets change agent work
Most agent pilots start with a handful of happy-path demos.
The agent answers a support question. It summarizes a PDF. It drafts a sales follow-up. It queries a spreadsheet. The room nods. The demo proves the agent can work.
It does not prove the agent will keep working when a tool changes, a user asks an awkward follow-up, a customer has duplicate records, a finance value is stale, or a previous session leaks into the next one.
For production AI agents, the useful question is not "Can it do the task once?" The useful question is "Can we prove this change did not reintroduce the failures we already found?"
Versioned datasets make that question answerable.
A good agent evaluation dataset is not just a pile of prompts. It is a set of work examples with ground truth attached. For example:
- The finance research agent must call the market data tool again if the cached quote is older than the policy allows.
- The support triage agent must not use memory from another customer account.
- The sales ops agent must create the CRM task only after matching the contact by verified email.
- The document agent must cite the source file and page number for any policy claim.
- The purchasing agent must request approval before it creates or changes an order.
Some of those checks belong in the final answer. Some belong in the tool trajectory. Some belong in state. Some belong in the approval log.
That is why agent evaluation is different from chatbot evaluation. AWS's related post on evaluating deep agents using LangSmith on AWS makes the same point: agent behavior is multi-step, and early tool-call errors can cascade. You may need to evaluate the final response, the trajectory of tool calls, and the resulting state.
A polished answer can still be wrong if the agent skipped the required broker workflow. A helpful summary can still be unsafe if it pulled context from the wrong account. A confident finance answer can still be useless if the tool call returned stale data.
LLM judges can help with tone, helpfulness, and completeness. They cannot reliably verify every hard fact, enforce every required tool order, or prove that PII did not leak between sessions. Those checks need ground truth, deterministic assertions, traces, and reviewers.
The practical loop: trace, curate, version, run
The best regression cases usually come from production. Not because production is clean, but because production shows where the agent actually breaks.
A useful workflow looks like this:
- A failure shows up in a trace, transcript, approval log, user report, or observability alert.
- Someone curates the failure into a clean scenario.
- The scenario goes into a dataset draft with expected behavior and assertions.
- The team publishes a locked dataset version.
- CI, a scheduled evaluation, or a release gate runs the agent against that version.
- A human reviews ambiguous cases and updates the ground truth when needed.
The artifact is the important part. The team should be able to point to a file, dataset version, trace ID, or evaluation run and say, "This exact failure cannot silently come back."
AgentCore's managed versioning is one way to do that. According to AWS, datasets have ARNs, IAM authorization, tags, one mutable draft, and immutable numbered versions. Scenarios are schema-validated at ingest. Evaluations can run through on-demand or batch runners and feed AgentCore Observability and CloudWatch.
That is useful infrastructure if you are already in the AWS ecosystem.
But the operating habit is the part smaller teams should copy first. Do not wait for a platform migration to start writing down failures as fixtures.
A team can begin with a YAML file, a spreadsheet, or a folder of JSON cases. The format matters less than the discipline.
Each case should answer:
- What user input or situation triggered the failure?
- What should the agent have answered or done?
- Which tools should it have called?
- Which tools or actions were forbidden?
- What evidence should appear in the answer?
- What data must not cross accounts, sessions, or roles?
- Who owns review when the result is ambiguous?
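As a rough illustration, one case stored as JSON might look like the sketch below, with a loader that rejects any case that leaves one of those seven questions unanswered. The key names, tool names, and case ID are assumptions, not a standard schema.

```python
import json

# One key per question in the checklist above (names are illustrative).
REQUIRED_FIELDS = {
    "trigger",
    "expected_behavior",
    "required_tools",
    "forbidden_actions",
    "required_evidence",
    "data_boundaries",
    "review_owner",
}

CASE_JSON = """
{
  "id": "support-cross-account-001",
  "trigger": "User asks about Account A; a same-named contact exists on Account B.",
  "expected_behavior": "Route to the Account A team using only Account A records.",
  "required_tools": ["lookup_account"],
  "forbidden_actions": ["read_memory:account_b"],
  "required_evidence": ["account_id in the routing note"],
  "data_boundaries": ["no notes, contacts, or history from other accounts"],
  "review_owner": "support-ops"
}
"""


def load_case(raw: str) -> dict:
    """Parse one case and reject it if any required key is absent."""
    case = json.loads(raw)
    missing = sorted(REQUIRED_FIELDS - case.keys())
    if missing:
        raise ValueError(f"case {case.get('id', '?')} is missing: {', '.join(missing)}")
    return case


case = load_case(CASE_JSON)
```

The loader only checks that each key is present, so a case with no forbidden actions can still say so explicitly with an empty list.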
That small artifact changes the conversation. The next prompt edit is no longer judged by "does the demo look good?" It is judged by "did we pass the cases that matter?"
What to code-grade, model-grade, and human-review
Agent evaluation gets expensive and noisy when every case goes to an LLM judge. It also gets brittle when every judgment is forced into code.
Use the cheapest reliable grader for the job.
Code-based graders are best when the rule is objective. The AWS LangSmith post calls out why: they are fast, cheap, reproducible, and easier to debug when success criteria can be expressed as code.
Use code graders for checks like:
- Did the agent call the required tool?
- Did it avoid a forbidden tool?
- Did it cite a source document ID?
- Did it request approval before a side effect?
- Did it keep customer IDs from crossing sessions?
- Did the output include the required structured field?
- Did the SQL agent run a query checker before execution?
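Most of the trajectory checks in that list fit in a few lines of code. This sketch assumes the trace has already been reduced to an ordered list of tool names; `check_trajectory` and the tool names are hypothetical.

```python
from __future__ import annotations

from typing import Sequence


def check_trajectory(
    calls: Sequence[str],
    required: Sequence[str] = (),
    forbidden: Sequence[str] = (),
    must_precede: Sequence[tuple[str, str]] = (),
) -> list[str]:
    """Return failure messages; an empty list means the run passed."""
    failures = []
    for tool in required:
        if tool not in calls:
            failures.append(f"required tool not called: {tool}")
    for tool in forbidden:
        if tool in calls:
            failures.append(f"forbidden tool called: {tool}")
    for first, later in must_precede:
        # Every call to `later` needs at least one earlier call to `first`.
        for index, tool in enumerate(calls):
            if tool == later and first not in calls[:index]:
                failures.append(f"{later} ran before {first}")
                break
    return failures


rules = {
    "required": ["check_query"],
    "forbidden": ["send_email"],
    "must_precede": [("check_query", "run_query")],
}
good_run = check_trajectory(["check_query", "run_query"], **rules)
bad_run = check_trajectory(["run_query", "check_query", "send_email"], **rules)
```

Here `good_run` is empty, while `bad_run` reports the forbidden `send_email` call and the query that ran before its checker. The same `must_precede` rule covers "request approval before a side effect".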
Model-based graders are useful when the judgment is semantic. They can evaluate whether an answer is helpful, complete, clear, or aligned with a rubric. But they need calibration. AWS's LangSmith post recommends calibrating model graders against human judgment and allowing "Unknown" when evidence is insufficient.
That "Unknown" option is more important than it sounds. A judge that always guesses will make the dashboard look cleaner than reality.
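A small sketch of why this matters when scores are aggregated. The `Verdict` enum and `summarize` function are hypothetical; the point is that "Unknown" is counted on its own instead of being folded into pass or fail.

```python
from collections import Counter
from enum import Enum


class Verdict(Enum):
    PASS = "pass"
    FAIL = "fail"
    UNKNOWN = "unknown"


def summarize(verdicts):
    """Report the pass rate over decided cases and surface unknowns separately."""
    counts = Counter(verdicts)
    decided = counts[Verdict.PASS] + counts[Verdict.FAIL]
    return {
        "pass_rate_decided": counts[Verdict.PASS] / decided if decided else None,
        "unknown_rate": counts[Verdict.UNKNOWN] / len(verdicts) if verdicts else None,
        "needs_human_review": counts[Verdict.UNKNOWN],
    }


verdicts = [Verdict.PASS] * 6 + [Verdict.FAIL] * 2 + [Verdict.UNKNOWN] * 2
summary = summarize(verdicts)
```

With six passes, two fails, and two unknowns, the decided pass rate is 0.75 and two cases go to a reviewer. A judge forced to guess "pass" on both unknowns would have reported 0.8 with nothing flagged.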
Human review belongs where judgment, risk, or policy interpretation is still unsettled. Humans should review ambiguous failures, calibrate LLM judge rubrics, spot drift, and decide whether the expected behavior itself needs to change.
For small and midsize teams, this usually means a mixed evaluation set:
- deterministic tests for tool order, permissions, citations, and forbidden behavior
- model-graded tests for answer quality and user usefulness
- human-reviewed samples for edge cases, policy calls, and high-risk workflows
This is also where governance becomes practical instead of ceremonial. A policy document is useful only if it turns into checks. We have written before about why teams should write the AI approval policy before choosing agent tooling and build an approval queue before giving agents side effects. Regression tests are the next piece of that same operating system.
The approval policy says what must happen. The queue records what did happen. The regression suite proves the agent still follows the rule after the next change.
What smaller teams should do this week
You do not need a managed evaluation platform to start AI agent regression testing.
You need 20 real failures or near-failures.
Pull them from Slack threads, support escalations, QA notes, sales ops corrections, finance review comments, trace logs, and approval-queue rejections. Avoid synthetic examples at first. Synthetic tests are useful later, but real failures have sharper edges.
Create a simple file or sheet with these columns:
- scenario name
- source trace, ticket, or note
- user input
- expected answer or action
- required tool calls
- forbidden tool calls or behaviors
- required evidence or citation
- security or privacy constraint
- grading method
- reviewer owner
- current status
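If the sheet is exported as CSV, a short loader can check the header and group cases by grading method so each one reaches the right grader. The column keys, the three method names, and the placeholder source IDs below are assumptions for illustration.

```python
import csv
import io
from collections import defaultdict

# Snake-case keys for the eleven columns listed above.
COLUMNS = [
    "scenario_name", "source", "user_input", "expected", "required_tools",
    "forbidden", "evidence", "privacy_constraint", "grading_method",
    "reviewer", "status",
]
GRADING_METHODS = {"code", "model", "human"}

# Source IDs are placeholders, not real tickets.
SHEET = """\
scenario_name,source,user_input,expected,required_tools,forbidden,evidence,privacy_constraint,grading_method,reviewer,status
cross-account-memory,TICKET-PLACEHOLDER-1,Where is my invoice?,Route to Account A team,lookup_account,read other accounts,account id in note,no cross-account data,code,support-ops,active
policy-citation,NOTE-PLACEHOLDER-2,What is the refund policy?,Answer with citation,search_docs,uncited claims,document title and page,none,model,docs-team,active
stage-change-approval,TICKET-PLACEHOLDER-3,Move the deal to closed-won,Draft only and request approval,request_approval,update_stage without approval,approval log entry,none,human,sales-ops,draft
"""


def load_sheet(text):
    """Validate the header, then map each grading method to its scenario names."""
    reader = csv.DictReader(io.StringIO(text))
    if reader.fieldnames != COLUMNS:
        raise ValueError(f"unexpected columns: {reader.fieldnames}")
    by_method = defaultdict(list)
    for row in reader:
        method = row["grading_method"]
        if method not in GRADING_METHODS:
            raise ValueError(f"{row['scenario_name']}: unknown grading method {method!r}")
        by_method[method].append(row["scenario_name"])
    return dict(by_method)


cases_by_method = load_sheet(SHEET)
```

Rejecting an unexpected header or grading method at load time keeps a mistyped sheet from silently skipping cases.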
Keep it boring. Boring is good.
A support case might say: "When a user asks about Account A, the agent must not use notes, contacts, or history from Account B, even if the contact names match."
A finance case might say: "When giving a stock price or market movement, the agent must call the live market data tool during the run. Cached values older than the allowed window fail."
A document workflow case might say: "When answering a policy question, the agent must cite the document title and page. If the source is missing, it must say it cannot verify the answer."
A sales ops case might say: "The agent may draft the follow-up and CRM task, but it must not send the email or update opportunity stage without approval."
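The support, finance, and document cases above reduce to small deterministic checks. The sketch below is one possible shape, assuming the trace exposes which account each context item came from, when the quote was fetched, and the final answer text. The function names, the `[Title, p. N]` citation format, and the refusal phrase are assumptions.

```python
import re
from datetime import datetime, timedelta, timezone


def leaked_accounts(context_items, active_account):
    """Return account IDs other than the active one found in the agent's context."""
    return sorted({item["account_id"] for item in context_items} - {active_account})


def quote_is_fresh(fetched_at, answered_at, max_age):
    """True when the quote was fetched no more than max_age before the answer."""
    return timedelta(0) <= answered_at - fetched_at <= max_age


# Assumed citation format: [Document Title, p. 12]
CITATION = re.compile(r"\[(?P<title>[^\]]+?), p\. ?(?P<page>\d+)\]")
REFUSAL = "cannot verify"


def cites_or_declines(answer):
    """True when the answer carries a citation or says it cannot verify."""
    return bool(CITATION.search(answer)) or REFUSAL in answer.lower()


context = [
    {"account_id": "A", "note": "renewal due"},
    {"account_id": "B", "note": "same contact name"},
]
leaks = leaked_accounts(context, active_account="A")

answered = datetime(2025, 1, 1, 12, 0, tzinfo=timezone.utc)
stale = quote_is_fresh(answered - timedelta(minutes=20), answered, timedelta(minutes=15))
fresh = quote_is_fresh(answered - timedelta(minutes=10), answered, timedelta(minutes=15))

cited = cites_or_declines("Refunds take 14 days [Refund Policy, p. 3].")
uncited = cites_or_declines("Refunds take 14 days.")
declined = cites_or_declines("I cannot verify that without the source document.")
```

In this example `leaks` is `["B"]`, the 20-minute-old quote fails a 15-minute window, and the uncited claim fails while both the cited answer and the explicit refusal pass.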
Once those cases exist, run them before every meaningful change. A meaningful change includes prompt edits, model swaps, tool updates, memory changes, retrieval changes, policy updates, and permission changes.
If the team already has CI, wire the cases into CI. If not, run them on a schedule and review failures weekly. If the agent can take action, connect the tests to approval logs and permission boundaries, not just final answer quality.
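A CI gate can be as plain as a script that runs every case and returns a non-zero exit code on any failure. This is a sketch under stated assumptions: `run_gate`, the case keys, and `stub_agent` are hypothetical, and the stub stands in for whatever call actually invokes the agent and returns its tool calls.

```python
from __future__ import annotations

import sys
from typing import Callable


def run_gate(agent: Callable[[str], dict], cases: list[dict]) -> int:
    """Run every case and return an exit code: 0 if all pass, 1 otherwise."""
    failed = []
    for case in cases:
        run = agent(case["user_input"])
        missing = [t for t in case["required_tools"] if t not in run["tool_calls"]]
        banned = [t for t in case["forbidden_tools"] if t in run["tool_calls"]]
        if missing or banned:
            failed.append((case["name"], missing, banned))
    for name, missing, banned in failed:
        print(f"FAIL {name}: missing={missing} forbidden={banned}", file=sys.stderr)
    print(f"{len(cases) - len(failed)}/{len(cases)} cases passed")
    return 1 if failed else 0


def stub_agent(user_input: str) -> dict:
    # Stand-in for a real agent call: looks up the account and drafts, never sends.
    return {"answer": "drafted", "tool_calls": ["lookup_account", "draft_email"]}


CASES = [
    {"name": "draft-only-follow-up", "user_input": "Follow up with the lead",
     "required_tools": ["lookup_account"], "forbidden_tools": ["send_email"]},
    {"name": "stale-quote", "user_input": "What is ACME trading at?",
     "required_tools": ["market_data"], "forbidden_tools": []},
]

exit_code = run_gate(stub_agent, CASES)
# In a CI job, finish with: sys.exit(exit_code)
```

The stub never calls `market_data`, so the second case fails and `exit_code` is 1, which is what blocks the deploy. The same function can run on a schedule for teams without CI.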
This is where responsible AI becomes operational. Boundaries, observability, and human oversight are not separate from evaluation. They are the material you evaluate against.
The same applies to data security. If the agent can see customer, employee, financial, or regulated data, the regression suite should include data-access failures. Do not wait for a security review to discover that the test set only checks helpfulness.
The AWS pattern is bigger than AWS
AgentCore dataset management is worth watching because it reflects where production agent work is headed.
Agent evaluation datasets are becoming deployment infrastructure. AWS's AgentCore overview says Evaluations measure how agents and tools execute tasks, handle edge cases, and maintain output reliability across inputs and contexts. It also notes that evaluations can use sessions, traces, and spans from frameworks such as Strands Agents or LangGraph when instrumented with OpenTelemetry or OpenInference, with results integrated into AgentCore Observability and CloudWatch.
That is the right neighborhood: traces, spans, tool calls, datasets, policies, observability, review.
This matches the broader shift we covered in our piece on agent governance and evaluations as deployment infrastructure. Agent governance is not a committee that blesses a demo. It is the machinery that catches regressions after the demo is over.
For a small business, that machinery does not have to be heavy. It does have to be real.
A spreadsheet with 20 production failures is better than a dashboard full of vague helpfulness scores. A YAML fixture with required tool calls is better than a prompt review meeting. A weekly review of failed traces is better than assuming the new model is smarter because the sample answer sounded cleaner.
When the team is ready for more structure, managed tools like AgentCore can help with versioning, schema validation, runners, IAM, observability, and repeatable evaluation runs. Until then, the habit matters more than the platform.
The teams that win will have boring fixtures
Agent demos are useful. They show what is possible, help teams align, and make abstract workflows visible.
But demos are not evidence that an agent is ready for production.
Production readiness looks more boring. It looks like locked test inputs, expected outputs, tool trajectories, approval records, reviewer notes, and CI gates. It looks like a failure from March still being tested in June. It looks like the team refusing to ship a prompt change because it made the agent look more helpful in a demo but broke the stale-data case.
That is the bar for agents with access to business systems.
If your agent is only drafting harmless text, you can tolerate more fuzziness. If it can route customers, query finance systems, update CRM records, touch documents, or trigger workflows, every serious failure should become a regression case.
The teams that win with agents will not be the ones with the flashiest demos. They will be the ones with the most boring test fixtures, the clearest approval paths, and the discipline to turn production mistakes into permanent regression tests.
If you want a second set of eyes on whether your agent pilot has enough evaluation coverage before it gets more permissions, BaristaLabs can help with an agent evaluation review or broader process automation planning.
