AI Development

Your AI Agent Needs a Bug Cemetery, Not Another Demo

AWS Bedrock AgentCore datasets point to a practical habit for reliable agents: turn production failures into versioned regression tests with locked inputs, expected tool calls, assertions, and CI gates.

Sean McLellan

Lead Architect & Founder

May 29, 2026 · 7 min read

A broker opens Slack at 8:47 a.m. and pastes the agent's answer into the client channel.

"Client holds 12,400 shares of VRTX. Current price $412.40. Recent news looks clean. Ready to discuss position sizing."

The number is wrong. The quote is from the previous close. The agent never called the live market-data endpoint. It pulled the broker's saved profile, matched the ticker, and generated fluent text around stale data.

By 9:15 the same broker has two more questions about the same client. The agent now gives a different answer because the prompt was lightly edited overnight.

No one opens a ticket. The thread scrolls away.

The next time the agent touches a live quote, the same failure is possible again.

The failure needs an address

AWS's May 28, 2026 post on dataset management in Amazon Bedrock AgentCore is useful because it treats every production trace that went wrong as a future test fixture.

When the stale-price run finished, the trace already contained the input, the missing tool call, the broker profile lookup, and the final text. That trace becomes the seed for a test case. The case records the exact input, the required market-data call, the freshness rule on the timestamp, and the assertion that the answer must reference live data.

Once the case is published as an immutable dataset version, every subsequent agent change runs against it.

The failure now has an address and a gate.

AgentCore datasets can include inputs, expected outputs, assertions, and tool sequences. Draft datasets stay mutable while the team curates cases. Published dataset versions are immutable, so the same test set can be reused in developer loops, CI/CD gates, scheduled regression checks, and AgentCore Optimization.
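The draft-versus-published split can be sketched in a few lines. This is a minimal illustration and not the AgentCore API: `DraftDataset`, `PublishedDataset`, and the SHA-256 content hash are assumptions made for the example. The point it shows is that a published version is a frozen snapshot, so later edits to the draft cannot change what the gate runs against.

```python
import hashlib
import json
from dataclasses import dataclass


class DraftDataset:
    """Hypothetical mutable working copy used while the team curates cases."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.cases: list[dict] = []

    def add_case(self, case: dict) -> None:
        self.cases.append(case)

    def publish(self, version: str) -> "PublishedDataset":
        # Serialize with sorted keys so identical cases always hash the same.
        payload = json.dumps(self.cases, sort_keys=True)
        digest = hashlib.sha256(payload.encode("utf-8")).hexdigest()
        return PublishedDataset(self.name, version, payload, digest)


@dataclass(frozen=True)
class PublishedDataset:
    """Hypothetical immutable snapshot; fields cannot be reassigned."""

    name: str
    version: str
    payload: str  # frozen JSON snapshot of the cases
    digest: str   # content hash recorded at publish time

    def cases(self) -> list[dict]:
        # Re-parse on every call so callers cannot mutate the snapshot.
        return json.loads(self.payload)

    def verify(self) -> bool:
        # True when the stored payload still matches the publish-time hash.
        current = hashlib.sha256(self.payload.encode("utf-8")).hexdigest()
        return current == self.digest
```

Storing the digest next to the version gives CI something cheap to compare: if two runs report the same digest, they were scored against the same cases.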

The useful idea is simple: the test suite becomes a permanent record of what the agent has already gotten wrong.

Bug cemetery checklist: a practical agent regression case records the failed input, the required tool call or workflow step, the freshness or approval rule, the expected trace evidence, the assertion that must pass, the owner who reviews the miss, and the CI gate that blocks release. Use that checklist when choosing an AI consulting pilot: if the team cannot name the gate, the agent is not ready to touch the workflow.

Record this	Why it matters
Failed user input	Replays the exact production miss instead of a generic prompt.
Required tool call or workflow step	Proves the agent used the right system, source, or approval path.
Freshness, privacy, or approval rule	Turns business risk into an explicit pass/fail condition.
Expected trace evidence and assertion	Checks the run receipt, not just the polished final paragraph.
Owner and CI gate	Makes the case durable enough to block the next unsafe release.
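One way to keep the checklist honest is to make it a record with no optional fields. The sketch below is an assumption about how a team might store a case, not a schema from AWS; the field names and the sample values (including the owner and gate names) are invented for illustration.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class RegressionCase:
    """Hypothetical record mirroring the bug cemetery checklist."""

    failed_input: str       # the exact production input that went wrong
    required_step: str      # tool call or workflow step that must happen
    rule: str               # freshness, privacy, or approval rule
    expected_evidence: str  # what the trace must show
    assertion: str          # the check that must pass
    owner: str              # who reviews the miss
    ci_gate: str            # which gate blocks release


def missing_fields(case: RegressionCase) -> list[str]:
    """Return the names of checklist fields that were left blank."""
    return [f.name for f in fields(case) if not getattr(case, f.name).strip()]


# Illustrative values only, based on the stale-quote story above.
stale_quote = RegressionCase(
    failed_input="Summarize the client's VRTX position before the call",
    required_step="call the live market-data tool",
    rule="quote timestamp within the freshness window",
    expected_evidence="trace contains the market-data call and its timestamp",
    assertion="answer references the live quote",
    owner="markets-desk-lead",
    ci_gate="",  # not named yet, so the case is not ready
)
```

Calling `missing_fields(stale_quote)` returns the names of any blank entries, which is the same test the checklist applies to a pilot: a case without a named gate is not finished.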

One-off eval scores do not create reliability

Agents are non-deterministic. The same input can produce different traces across runs.

That makes a single score slippery. If the score improves, did the agent improve? Did the model sample differently? Did the test questions change? Did the judge get more generous?

A locked dataset gives the team a fixed comparison point. The input stays the same. The ground truth stays the same. The assertions stay the same.

Now a before-and-after run can show whether a prompt change, tool update, memory change, or retrieval fix actually helped.
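A before-and-after comparison on a locked dataset might look like the sketch below. It assumes a hypothetical `run_case` callable that executes the agent on one case and returns whether the assertions passed. Because the same input can produce different traces, each case runs several times and the comparison uses pass rates, not a single pass or fail.

```python
from collections.abc import Callable, Iterable


def pass_rates(
    run_case: Callable[[str], bool],
    case_ids: Iterable[str],
    runs: int = 5,
) -> dict[str, float]:
    """Run every case `runs` times and return the fraction that passed."""
    rates: dict[str, float] = {}
    for case_id in case_ids:
        passed = sum(1 for _ in range(runs) if run_case(case_id))
        rates[case_id] = passed / runs
    return rates


def regressions(before: dict[str, float], after: dict[str, float]) -> list[str]:
    """Return case ids whose pass rate dropped between the two runs."""
    # A case missing from `after` counts as a pass rate of 0.0.
    return sorted(c for c in before if after.get(c, 0.0) < before[c])
```

Since both runs score the same case ids from the same published version, a drop in a case's pass rate can be attributed to the change under test and not to a shifting question set.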

This matters because an LLM judge can only see part of the problem.

A judge may decide that an answer is helpful or well written. It cannot know whether the stock price is current unless the eval has ground truth. It cannot know whether the agent used the required pricing API unless the eval inspects tool calls. It cannot know whether the workflow ran in the right order unless the trace includes the expected sequence.
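The tool-call and ordering checks are deterministic and do not need a judge at all. A minimal sketch, assuming the trace has already been reduced to a list of tool names in call order (the tool names here are hypothetical):

```python
def called_in_order(trace_tools: list[str], expected: list[str]) -> bool:
    """True when `expected` appears in `trace_tools` as an ordered subsequence.

    Other tool calls may occur in between; only relative order is checked.
    """
    remaining = iter(trace_tools)
    # `tool in remaining` advances the iterator, so each expected tool
    # must be found after the previous one.
    return all(tool in remaining for tool in expected)


# Hypothetical traces for the broker example.
good_trace = ["get_client_profile", "get_live_quote", "draft_summary"]
stale_trace = ["get_client_profile", "draft_summary"]
required = ["get_live_quote", "draft_summary"]
```

Here `called_in_order(good_trace, required)` holds and `called_in_order(stale_trace, required)` does not, no matter how fluent the final summary was.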

For workflow agents, the final answer is only one artifact. The path matters.

AWS's related post on evaluating deep agents using LangSmith on AWS makes the same point from another angle: agent evaluations have to inspect outputs, tool calls, arguments, traces, and final environment state. A bad early tool call can cascade through the whole workflow.

That is the gap most demos hide.

The practical pattern: trace, curate, lock, gate

A useful agent test suite grows in a simple loop.

Figure: The AI agent bug cemetery loop, turning workflow misses into locked regression cases and deployment gates.

A production run fails. The team reviews the trace. Someone turns the failure into a curated test case. The case enters a draft dataset. When the dataset is ready, the team publishes a version. That version becomes a deployment gate.

Now the failure has a permanent address.

For a support agent, the case might be:

Customer asks for a refund above policy.
Agent must retrieve the current refund policy.
Agent must not issue the refund.
Agent must create a manager approval request.
Agent must redact sensitive account notes from the customer-facing draft.

For a finance workflow, the case might be:

Broker asks for a market summary before a client call.
Agent must call the live market-data tool.
Quote timestamp must be within the allowed freshness window.
Answer must reference the correct client profile.
Fabricated SSNs or account numbers must not appear in the response.
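The freshness line in the finance case reduces to simple timestamp arithmetic. The sketch below assumes the trace exposes the quote's timestamp and the time the answer was produced as timezone-aware datetimes; the function name and the sample times are illustrative.

```python
from datetime import datetime, timedelta, timezone


def quote_is_fresh(
    quote_time: datetime,
    answered_at: datetime,
    max_age: timedelta,
) -> bool:
    """True when the quote is no older than `max_age` at answer time.

    A quote stamped after the answer is rejected too, since that
    usually signals a clock or data problem.
    """
    age = answered_at - quote_time
    return timedelta(0) <= age <= max_age


# Illustrative timestamps: an answer sent at 8:47, scored against
# a quote from the previous close and a quote fetched seconds earlier.
answered_at = datetime(2026, 1, 6, 8, 47, 0, tzinfo=timezone.utc)
previous_close = datetime(2026, 1, 5, 21, 0, 0, tzinfo=timezone.utc)
live_quote = datetime(2026, 1, 6, 8, 46, 30, tzinfo=timezone.utc)
window = timedelta(seconds=60)
```

With these values the previous-close quote fails and the live quote passes. The window itself is a business decision, which is why it is a parameter and not a constant.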

For a sales-ops agent, the case might be:

Lead asks to change contract terms.
Agent may draft a summary.
Agent must not update the CRM stage without approval.
Legal-sensitive language must route to a reviewer.
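"Must not act without approval" rules like the ones in these cases can be checked against the trace directly. This sketch assumes a simplified, hypothetical event format in which each trace event is a dict with a `type` of `"action"` or `"approval"` and an `action` name; a real trace will need an adapter.

```python
def unapproved_actions(events: list[dict], gated: set[str]) -> list[str]:
    """Return gated actions that ran before any approval for them was recorded."""
    approved: set[str] = set()
    violations: list[str] = []
    for event in events:
        if event["type"] == "approval":
            approved.add(event["action"])
        elif event["type"] == "action" and event["action"] in gated:
            if event["action"] not in approved:
                violations.append(event["action"])
    return violations


# Hypothetical sales-ops trace: the summary is allowed, the stage update is gated.
trace = [
    {"type": "action", "action": "draft_summary"},
    {"type": "action", "action": "update_crm_stage"},
]
gated_actions = {"update_crm_stage", "issue_refund"}
```

For this trace the function returns `["update_crm_stage"]`. An empty list is the passing condition, and the same function covers the support case by treating `issue_refund` as gated.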

This is where AI consulting and process automation should meet. Picking the workflow is only half the job. The team also needs to define what correct behavior looks like before the agent starts touching real systems.

A good test case is not a vibe. It is a locked input, a required path, and a set of checks.

The receipt matters more than the paragraph

The final answer still matters. It should be accurate, useful, and appropriate for the person receiving it.

But workflow agents usually fail before the final paragraph appears.

A support agent might write a perfectly polite refund response after skipping the current policy lookup. The visible answer looks fine. The trace shows the real problem: it never checked the rule that decides whether a manager has to approve the refund.

A finance agent might summarize a client account correctly but call the wrong market-data endpoint first. The paragraph is readable. The run receipt shows a stale quote, a missing freshness check, and no evidence that the answer was safe to send.

A sales-ops agent might draft a clean contract-summary email and quietly update the CRM stage without approval. Nobody catches the risk by scoring the email for tone. They catch it by inspecting the action the agent took.

That is why agent evaluation has to look at the run, not just the response. Tool calls, arguments, order, data freshness, approval gates, PII handling, stop conditions, and trace artifacts are part of the output.

For teams working with finance, support, customer data, or regulated workflows, human review, approval queues, and trace retention belong in the design. BaristaLabs covers that operating discipline in Responsible AI and data security.

Predefined cases catch regressions. Simulated cases find new misses.

AWS separates predefined scenarios from simulated scenarios.

Predefined scenarios are backward-looking. You already know the input. You know the expected output. You know which tool calls should happen. You know which assertions must pass.

These are regression cases.

They belong in the deployment gate because pass and fail are explicit. If the stale-price bug comes back, the test should catch it before the new agent version ships.
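Explicit pass and fail is what makes the gate easy to wire into CI. A minimal sketch, assuming the evaluation step has already produced a mapping of case id to pass or fail (how those results are produced is outside this example):

```python
import sys


def gate(results: dict[str, bool]) -> int:
    """Return a process exit code: 0 when every case passed, 1 otherwise."""
    failed = sorted(case_id for case_id, passed in results.items() if not passed)
    for case_id in failed:
        # Name each failing case so the build log points at the regression.
        print(f"FAIL {case_id}", file=sys.stderr)
    return 1 if failed else 0


# In a CI script, the last line would typically be:
#     sys.exit(gate(results))
```

Most CI systems treat a non-zero exit code as a failed step, so returning 1 is enough to block the release without any system-specific integration.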

Simulated scenarios do a different job. They use personas and multi-turn conversations to discover new failure modes. A simulated broker might ask follow-up questions, change constraints midstream, or introduce a messy client request the team did not script by hand.
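To make the idea concrete, here is a deliberately simple stand-in for a simulated persona. It is not how AgentCore implements simulation: a real simulated scenario would likely generate turns dynamically, while this sketch only samples from a hand-written pool. The `Persona` type and the sample messages are assumptions for illustration. The seed keeps a given conversation reproducible, so an interesting miss can be replayed.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Persona:
    """Hypothetical scripted persona: one opening message plus follow-ups."""

    name: str
    opening: str
    follow_ups: tuple[str, ...]


def scripted_turns(persona: Persona, turns: int, seed: int) -> list[str]:
    """Return up to `turns` user messages, reproducible for a given seed."""
    if turns < 1:
        return []
    rng = random.Random(seed)
    count = min(turns - 1, len(persona.follow_ups))
    # Sample without replacement so no follow-up repeats in a conversation.
    picks = rng.sample(persona.follow_ups, k=count)
    return [persona.opening, *picks]


urgent_broker = Persona(
    name="urgent-broker",
    opening="I need a market summary for my 9:30 client call.",
    follow_ups=(
        "Actually, make it the last five trading days only.",
        "The client wants to cut the position by half. Does that change anything?",
        "Is that price live or from yesterday's close?",
    ),
)
```

When a simulated conversation exposes a new miss, the turns that triggered it become the locked input for a new predefined case.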

Simulations explore. Regression tests protect.

Small teams need both, but they do not need to start big.

What SMB teams should do before deploying agents

A small business does not need a giant eval lab to use this pattern.

Start with the failures the team has already seen.

Take the broker case above. Strip or mask any PII. Lock the input, the required tool call, the 60-second freshness window, and the rule that the answer must not be sent without a live quote. Add it to the dataset as a regression test. The next prompt tweak or model swap cannot ship until that case passes.
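The masking step can start small. The sketch below uses two illustrative regular expressions, one for US-style SSNs and one for email addresses. These patterns are an assumption about what the team's traces contain and are nowhere near complete PII detection; names, account numbers, and free-text identifiers need their own handling and a human review before the case is locked.

```python
import re

# Illustrative patterns only; real traces will need more than these two.
SSN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
EMAIL = re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")


def mask_pii(text: str) -> str:
    """Replace matched SSNs and email addresses with fixed placeholders."""
    text = SSN.sub("[SSN]", text)
    return EMAIL.sub("[EMAIL]", text)
```

Fixed placeholders keep the locked input stable across re-runs, which matters because the case is only useful as a comparison point if its text never changes.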

Do the same for the first three support tickets where the agent issued a refund without approval, the first sales handoff where it updated the CRM stage on its own, and the first analytics run that surfaced an internal employee ID in the client-facing summary.

Ten real failures, turned into ten locked cases, give a team more protection than a hundred generic helpfulness scores.

Add one simulated persona for the workflow. If the agent supports brokers, create a broker persona who asks urgent follow-up questions and mixes market context with client-specific instructions. If the agent supports customer service, create a frustrated customer persona who provides partial information and asks for an exception.

Then make one rule non-negotiable: every serious production miss becomes a regression test.

This is especially important in financial services, where a polished answer with the wrong source can create audit, trust, and compliance problems. The broker example in AWS's post works because it is not a toy chatbot. It is a workflow where data freshness, source selection, memory, and reviewability all matter.

The same pattern applies to less regulated work too. A support-routing agent, invoice exception agent, sales-research agent, or reporting assistant can all regress after a prompt update. If nobody locks the old failure as a test, the team finds out later, usually from a customer or a frustrated employee.

The bug cemetery is the asset

Most agent demos are built to show the happy path.

Production reliability is built from the embarrassing paths.

The stale quote. The skipped approval. The wrong customer record. The tool call with the bad argument. The answer that looked correct until someone checked the trace. The PII that should never have appeared in the run summary.

Those failures are uncomfortable, but they are also the raw material for a better system.

A versioned test suite turns them into institutional memory. It gives engineering a gate. It gives operations a review trail. It gives leadership a way to ask whether the agent is getting safer, not just more impressive.

The demo asks, "Can this agent do the work once?"

The test suite asks, "Can it keep doing the work after we change it?"

That second question is where agent projects become real.

