Your AI Workflow Tests Should Try to Falsify the Promise

Stateful AI workflows fail around queues, retries, locks, ledgers, and approvals. Test the promise before production falsifies it for you.

Sean McLellan

Lead Architect & Founder

June 10, 2026 · 7 min read

An invoice approval workflow looks safe in the demo.

The agent reads the invoice, checks the purchase order, writes a reviewer note, and submits the approval. The queue moves from "pending" to "approved." Finance sees the receipt. The vendor gets paid.

Then production adds the boring part: a timeout.

The agent submits the approval, but the response never comes back. It retries. The approval endpoint accepts the second request too. Now the same invoice has two approvals, two audit entries, and a support ticket that starts with "why did this happen?"

The workflow did not fail because the model wrote bad prose. It failed because the system broke a promise:

"An invoice can be approved at most once."

That is the promise the test should have tried to falsify.

Happy-path tests do not exercise the state

Most AI workflow tests still look like form checks.

Given this email, did the agent classify it correctly? Given this invoice, did it extract the amount? Given this support ticket, did it route to the right queue?

Those checks matter, but they do not cover the failure modes that break stateful work.

A stateful workflow remembers something. It claims a job, writes a ledger entry, takes a lock, updates a record, schedules a retry, marks a document reviewed, or sends a notification that should not be sent twice.

That memory is where the dangerous bugs live.

A retry after a timeout is not the same as a clean first attempt. A queue worker crash after writing to the database is not the same as a failed validation rule. A human override during an agent run is not the same as a single user clicking a button.

Ordinary integration tests usually prove that the happy path can happen. They rarely prove that the bad path cannot corrupt the business record.

Distributed systems engineers have been living with this problem for years. Jepsen's analyses have found replica divergence, data loss, stale reads, read skew, lock conflicts, and other failure modes across databases, coordination services, and queues. The lesson for AI workflows is not that every SMB approval queue needs a Jepsen cluster. The lesson is simpler: systems that coordinate state fail in ways a normal test script will not notice.

AI automation pushes more business work into coordinated state. A model may be the visible part, but the risk often sits in the queue, lock, retry policy, approval boundary, or ledger write around it.

Test the claim, not the setup

A new open source project called Distributed Systems Testing Skills packages this discipline into two Markdown skills for AI coding agents.

One skill designs a structured test plan. The other executes it and writes findings. The project says it works with Claude Code, Codex, Copilot CLI, Cursor, Gemini, or any agent that can read Markdown and run shell commands.

The useful pattern is claim-driven testing.

Instead of naming tests after implementation details, the plan starts from product claims. Each scenario is named after the claim it tries to falsify.

That sounds like a small wording choice. It is not.

A test named retry_timeout_invoice_approval can drift into checking whether a mock timeout fires. A test named invoice_is_approved_at_most_once_under_retry keeps the reviewer focused on the business promise.
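A minimal sketch of what the second name could look like as a test. Everything here is hypothetical: ApprovalService is an in-memory stand-in for an approval endpoint, and submit_with_lost_response plays the fault that commits the approval and then drops the response. It assumes the agent retries with the same idempotency key, which is one common way to keep this promise but not the only one.

```python
import uuid


class ApprovalService:
    """Hypothetical in-memory stand-in for an approval endpoint."""

    def __init__(self):
        self.approvals = []   # committed approvals (the business record)
        self.seen_keys = {}   # idempotency key -> stored approval

    def approve(self, invoice_id, idempotency_key):
        # A repeated key returns the stored result instead of committing again.
        if idempotency_key in self.seen_keys:
            return self.seen_keys[idempotency_key]
        approval = {"invoice_id": invoice_id, "approval_id": str(uuid.uuid4())}
        self.approvals.append(approval)
        self.seen_keys[idempotency_key] = approval
        return approval


def submit_with_lost_response(service, invoice_id, key):
    """Fault: the approval commits, but the caller never sees the response."""
    service.approve(invoice_id, key)
    raise TimeoutError("response dropped after commit")


def test_invoice_is_approved_at_most_once_under_retry():
    service = ApprovalService()
    key = "invoice-1001-approve"
    try:
        submit_with_lost_response(service, "invoice-1001", key)
    except TimeoutError:
        # The outcome is unknown, so the agent retries with the SAME key.
        service.approve("invoice-1001", key)
    count = sum(1 for a in service.approvals if a["invoice_id"] == "invoice-1001")
    assert count == 1, f"claim falsified: {count} approvals recorded"
    return count
```

The assertion counts rows in the business record. It never checks whether the timeout fired, which is what keeps the test tied to the promise rather than the setup.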

The project goes further for consistency-critical scenarios. The plan binds each scenario to an abstract model such as a register, queue, log, lock, lease, or ledger. It defines an operation-history schema, a named checker, a nemesis that injects the fault, and observable evidence that the fault actually happened.
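To make those pieces concrete, here is a small sketch of an operation history and a named checker for a ledger. The Op fields, the outcome labels, and the function name are assumptions made for illustration. They are not the project's actual schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Op:
    actor: str     # which worker or agent issued the operation
    kind: str      # "append" for ledger writes, "fault" for injected faults
    value: str     # entry ID, or a description of the injected fault
    outcome: str   # "ok" (acknowledged), "fail", or "unknown" (timed out)


def check_no_lost_acknowledged_writes(history, final_ledger):
    """Return a verdict for the claim 'an acknowledged entry is never lost'."""
    # Without evidence that the fault fired, a pass proves nothing.
    if not any(op.kind == "fault" for op in history):
        return "unverified", []
    acked = [op.value for op in history if op.kind == "append" and op.outcome == "ok"]
    lost = [v for v in acked if v not in final_ledger]
    return ("fail" if lost else "pass"), lost


history = [
    Op("worker-a", "append", "credit-1", "ok"),
    Op("nemesis", "fault", "kill ledger writer", "ok"),
    Op("worker-a", "append", "credit-2", "ok"),
    Op("worker-b", "append", "credit-3", "unknown"),
]
verdict, lost = check_no_lost_acknowledged_writes(history, {"credit-1", "credit-3"})
```

Here credit-2 was acknowledged but is missing from the final ledger, so the verdict is "fail". The write with an unknown outcome is allowed to land or not, because the caller was never told it succeeded.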

The findings report then records verdicts and classifies blame across the system under test, harness, checker, or environment.

That last part matters in real teams. When a workflow test fails, the next question is not "is AI bad?" It is "who owns the fix?"

Was the approval service accepting duplicate idempotency keys? Did the harness fail to record the first attempt? Did the checker misunderstand the business rule? Did the staging environment drop webhooks?

Blame classification keeps the failure from becoming a vague concern.

Translate the distributed systems words into workflow words

The terminology can sound academic until you map it to everyday business automation.

A queue is the set of invoices, tickets, onboarding tasks, warranty claims, or data-sync jobs waiting for work.

A ledger is any record where append-only history matters: payments, credits, inventory adjustments, compliance events, subscription changes.

A lock is the claim that only one actor may work on something at a time. In workflow terms, that might mean one agent owns the support ticket, one reviewer owns the exception, or one sync job owns the account record.

A lease is a lock with an expiration time. It says, "This worker owns the job for the next 90 seconds unless it renews." Leases are common in queue workers and background jobs.
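A lease can be sketched in a few lines. LeaseTable is a hypothetical in-memory class, and the clock is passed in as a plain number so a test can move time forward on purpose. A real implementation would keep this state in a shared store and would have to account for clock skew.

```python
class LeaseTable:
    """Hypothetical lease table; the caller supplies the current time."""

    def __init__(self, ttl_seconds=90):
        self.ttl = ttl_seconds
        self.leases = {}  # job_id -> (owner, expires_at)

    def acquire(self, job_id, owner, now):
        holder = self.leases.get(job_id)
        # Refuse only if someone else holds a lease that has not expired.
        if holder is not None and holder[1] > now and holder[0] != owner:
            return False
        self.leases[job_id] = (owner, now + self.ttl)
        return True

    def renew(self, job_id, owner, now):
        holder = self.leases.get(job_id)
        # Only the current, unexpired holder may renew.
        if holder is None or holder[0] != owner or holder[1] <= now:
            return False
        self.leases[job_id] = (owner, now + self.ttl)
        return True


table = LeaseTable(ttl_seconds=90)
first_claim = table.acquire("job-7", "worker-a", now=0)     # worker-a owns the job
early_steal = table.acquire("job-7", "worker-b", now=30)    # refused, lease still live
late_claim = table.acquire("job-7", "worker-b", now=91)     # lease expired, worker-b takes over
stale_renew = table.renew("job-7", "worker-a", now=92)      # worker-a has lost ownership
```

The failed renewal is the signal a slow or paused worker needs in order to stop. A falsification test would also check that worker-a produced no side effects after losing the lease.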

A register is a single current value: approval status, account tier, assigned owner, next follow-up date.

An approval queue is a controlled handoff between automation and a person. It should define what the agent may propose, what the reviewer must see, and what action gets committed after approval. We wrote more about that boundary in "Build the approval queue before you build the agent."

Support routing is a coordination problem too. If an agent classifies a ticket, assigns it, escalates it, and notifies the customer, the system now has ordering promises. The customer should not get a resolution note before escalation. A ticket should not bounce between two teams because two workers race on stale state.
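An ordering promise like the first one can be checked against an event log. This sketch assumes a hypothetical log of (ticket_id, event) pairs in observed order, with made-up event names.

```python
def resolution_before_escalation(events):
    """Return ticket IDs where a resolution note was sent before an escalation."""
    resolved = set()
    violations = []
    for ticket_id, event in events:  # events are in observed order
        if event == "resolution_note":
            resolved.add(ticket_id)
        elif event == "escalated" and ticket_id in resolved:
            violations.append(ticket_id)
    return violations


events = [
    ("T-1", "assigned"),
    ("T-1", "escalated"),
    ("T-1", "resolution_note"),
    ("T-2", "assigned"),
    ("T-2", "resolution_note"),
    ("T-2", "escalated"),
]
violations = resolution_before_escalation(events)
```

Ticket T-2 violates the promise because the customer was told the issue was resolved before the ticket was escalated.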

Sync jobs have their own promises. If a CRM record syncs to an accounting system, replaying yesterday's job should not duplicate credits, reopen closed invoices, or overwrite a newer human edit.
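One way to protect a newer human edit is a version check at write time. The sketch below assumes each record carries a version number and each job remembers the version it was built from. The field names are hypothetical.

```python
def apply_sync(target, job):
    """Apply a sync job only if the target has not changed since the job read it."""
    if target["version"] != job["base_version"]:
        # A newer edit exists: record a conflict instead of overwriting.
        return {"applied": False, "conflict": (job["base_version"], target["version"])}
    target.update(job["fields"])
    target["version"] += 1
    return {"applied": True, "conflict": None}


record = {"version": 3, "billing_email": "old@example.com"}
stale_job = {"base_version": 3, "fields": {"billing_email": "sync@example.com"}}

# A person edits the record after the job was created.
record["billing_email"] = "human@example.com"
record["version"] = 4

# Replaying the old job must leave the human edit in place.
result = apply_sync(record, stale_job)
```

The conflict tuple is the evidence: it shows which version the job expected and which version it found.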

These are not exotic distributed systems problems. They are business workflow problems with distributed systems failure modes.

Promises worth trying to break

A useful claim-driven test has four parts:

The product promise.
The fault you will inject.
The evidence you will collect.
The owner who can fix the failure.

Product promise	Fault to inject	Evidence to collect	Likely owner
An invoice is approved at most once.	Timeout after submit, followed by agent retry.	Approval count, idempotency key log, audit trail, payment trigger count.	Workflow service owner
A support ticket is assigned to one team at a time.	Two workers claim the same ticket concurrently.	Claim history, lock records, assignment changes, customer notifications.	Queue or routing owner
A ledger entry is never lost after acknowledgement.	Crash after acknowledgement but before replica or downstream write completes.	Operation history, ledger rows, acknowledgement log, reconciliation report.	Ledger or persistence owner
A retry never sends a duplicate customer email.	Network failure after email provider accepts the send.	Provider message ID, workflow attempt log, customer communication timeline.	Notification owner
A human rejection stops the agent from acting.	Agent resumes from stale state after reviewer rejects.	Review receipt, state version, action log, blocked action evidence.	Approval controls owner
A sync job does not overwrite newer human edits.	Replay old job after a user updates the target record.	Source version, target version, conflict record, final field values.	Sync or data integration owner
A timeout is classified as unknown, not failed.	Drop the response after the business action may have committed.	Attempt status, reconciliation query, retry decision, final action count.	Orchestration owner
A queue worker can recover after a crash.	Kill worker after job claim and before completion marker.	Lease expiration, requeue event, duplicate side effects, final job status.	Job runner owner

Figure: state tokens moving through a workflow test lane. Claim-driven tests attack the workflow promise, then collect the evidence needed to assign ownership.

Notice what is missing from the table: "Did the model give a good answer?"

For these workflows, answer quality is only one layer. The deeper question is whether the system can keep its business promise when the surrounding machinery behaves badly.

This is adjacent to agent evals, but it is not the same work. Agent evals often measure output quality, tool choice, or task completion. Claim-driven workflow tests ask whether the state survived.

The two should meet.

AWS made a related point in its AgentCore dataset management post: agent evaluation needs stable offline baselines alongside fast online signals. Its dataset approach can include inputs, expected outputs, assertions, and tool sequences, with published dataset versions that are immutable. Production failures can become permanent test cases.

That fits the same testing culture. Keep a bug cemetery. Turn production failures into locked fixtures. Re-run them after every change. We covered that pattern in our post on AgentCore test suites.

For stateful workflows, the permanent test case should include the receipt and the state transition, not just the prompt and response. We have written about that in "Agent evals should test workflow receipts, not just model answers."

A five-step way to turn a promise into a falsification test

Start with one workflow. Pick the one where a duplicate, lost, stale, or out-of-order action would create real cleanup work.

Then write the test like a skeptic.

1. Write the promise in business language

Avoid implementation phrasing at first.

Use:

"An approved invoice is paid once."

"Rejected refunds do not issue store credit."

"A customer sees one escalation notice per ticket."

"Inventory cannot go negative after concurrent reservations."

If the promise sounds vague, the workflow is not ready for automation. Tighten the business rule before writing the test.

2. Name the state object

Decide what kind of thing you are protecting.

Is it a queue item, ledger entry, lock, lease, register, log, or approval record?

This forces the team to say what must remain true after the workflow runs. For example, "The invoice status is approved" is a register claim. "The payment event appears once" is a ledger or log claim. "Only one worker owns this job" is a lock or lease claim.

You can test each one differently.
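A short sketch of why the distinction matters. The two helper functions and the data shapes are hypothetical. The point is that the same run can satisfy the register claim while violating the ledger claim.

```python
def register_holds(state, key, expected):
    """Register claim: the current value is what we expect."""
    return state.get(key) == expected


def appears_exactly_once(log, event, subject):
    """Ledger or log claim: the event was recorded once for this subject."""
    return sum(1 for entry in log if entry == (event, subject)) == 1


state = {"invoice-1001.status": "approved"}
log = [
    ("payment_triggered", "invoice-1001"),
    ("payment_triggered", "invoice-1001"),
]

register_ok = register_holds(state, "invoice-1001.status", "approved")
ledger_ok = appears_exactly_once(log, "payment_triggered", "invoice-1001")
```

A test that only reads the final status would pass here, even though the payment was triggered twice.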

3. Choose the fault that would embarrass the promise

Do not start with the fault that is easiest to simulate. Start with the fault that would make the promise false.

If the promise is "at most once," inject retry after an ambiguous timeout.

If the promise is "no lost acknowledgement," crash after the system tells the caller the action succeeded.

If the promise is "newer human edits win," replay an old sync after a manual update.

If the promise is "rejected means stopped," resume the agent from stale state after rejection.

The fault should feel unfair. Production is unfair.
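The ambiguous timeout deserves its own outcome in the orchestrator. This is a sketch under simplifying assumptions: only TimeoutError is treated as ambiguous, and already_committed is a hypothetical callable that asks the system of record whether the action landed. Which errors are truly ambiguous depends on the client and the transport.

```python
from enum import Enum


class Outcome(Enum):
    COMMITTED = "committed"
    FAILED = "failed"
    UNKNOWN = "unknown"


def classify(exc):
    """Map a call result to an outcome; a timeout is never treated as a failure."""
    if exc is None:
        return Outcome.COMMITTED
    if isinstance(exc, TimeoutError):
        return Outcome.UNKNOWN
    return Outcome.FAILED


def next_step(outcome, already_committed):
    """Decide what the orchestrator does next.

    already_committed is a callable that queries the system of record.
    """
    if outcome is Outcome.COMMITTED:
        return "done"
    if outcome is Outcome.UNKNOWN:
        # Reconcile first; retry only if the action did not land.
        return "done" if already_committed() else "retry"
    return "retry"
```

For example, next_step(classify(TimeoutError()), lambda: True) returns "done", because the reconciliation query shows the approval already committed.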

4. Collect evidence from the system, not the agent's explanation

A model can explain what it thinks happened. That is not evidence.

Collect the operation history, state versions, database rows, queue events, idempotency keys, provider message IDs, reviewer receipts, and audit entries.

For AI workflows, the receipt matters. The reviewer should be able to see what the agent saw, what it proposed, what tool it called, what changed, and why the system allowed it. If that receipt does not exist, the test should fail or at least return an "unverified" verdict.
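That rule can be written down as a small verdict function. The receipt field names below are assumptions that mirror the list in the paragraph above. They are not a standard schema.

```python
REQUIRED_RECEIPT_FIELDS = (
    "inputs_seen",      # what the agent saw
    "proposal",         # what it proposed
    "tool_call",        # what tool it called
    "state_change",     # what changed
    "policy_decision",  # why the system allowed it
)


def verdict_with_receipt(claim_held, receipt):
    """Combine the checker result with receipt completeness."""
    missing = [f for f in REQUIRED_RECEIPT_FIELDS if not (receipt or {}).get(f)]
    if not claim_held:
        return "fail", missing
    if missing:
        # The claim looks fine, but the evidence is incomplete.
        return "unverified", missing
    return "pass", missing


partial_receipt = {
    "inputs_seen": "invoice-1001.pdf",
    "proposal": "approve",
    "tool_call": "approve_invoice",
}
verdict, missing = verdict_with_receipt(True, partial_receipt)
```

With two fields missing, the verdict is "unverified" rather than "pass", even though the checker found no violation.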

5. Assign blame before assigning urgency

A failed test should tell the team where to look.

SUT means the workflow itself violated the claim.

Harness means the test setup was wrong or incomplete.

Checker means the oracle or assertion misunderstood the rule.

Environment means the staging system, credentials, sandbox, clock, or dependency made the result unreliable.

This classification is not bureaucracy. It prevents every failed workflow test from becoming a meeting about whether the model is trustworthy.
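A findings record can carry the blame class alongside the verdict. The Finding fields and the route helper are hypothetical, and the sample findings are invented to show the shape of the data.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Blame(Enum):
    SUT = "system under test"
    HARNESS = "harness"
    CHECKER = "checker"
    ENVIRONMENT = "environment"


@dataclass
class Finding:
    claim: str
    verdict: str              # "pass", "fail", or "unverified"
    blame: Optional[Blame]    # None when nothing failed
    evidence: list


def route(findings):
    """Group failed claims by blame class so each failure has one place to land."""
    routed = {}
    for finding in findings:
        if finding.verdict == "fail":
            routed.setdefault(finding.blame, []).append(finding.claim)
    return routed


findings = [
    Finding("invoice is approved at most once", "fail", Blame.SUT,
            ["2 approval rows for invoice-1001"]),
    Finding("retry never sends a duplicate email", "fail", Blame.HARNESS,
            ["first attempt missing from attempt log"]),
    Finding("acknowledged ledger entry is never lost", "pass", None, []),
]
routed = route(findings)
```

Only the first failure points at the workflow. The second sends the team to fix the test setup before anyone debates the workflow itself.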

What this changes for AI workflow teams

Claim-driven testing changes the shape of the pre-production conversation.

Instead of asking, "Did the agent pass the test set?" the team asks:

"What promise did this workflow make?"

"What fault did we use to attack it?"

"What evidence proves the fault actually happened?"

"What stayed unverified?"

"Who owns the fix if it fails?"

Those questions are useful for software teams, but they are just as useful for operations leads and business owners. A finance lead may not care whether the checker is called linearizability or idempotency. They do care whether an invoice can be approved twice when a retry fires after a timeout.

That is the right level for an AI workflow control.

BaristaLabs' AI workflow controls work usually starts there: define the approval boundary, the receipt, the fallback path, and the state that must not be corrupted. Testing should match those controls. A gate that cannot be falsified is mostly decoration.

The same applies when building production automation. If you are planning agentic workflows around approvals, invoices, support routing, sync jobs, or ledgers, the implementation plan should include the falsification tests before launch. Our process automation work treats those tests as part of the workflow design, not as a cleanup phase after the agent is wired in.

Stronger models do not remove the need for adversarial tests

A better model may classify the invoice more accurately. It may write a better reviewer note. It may choose the right tool more often.

It still cannot make a duplicate approval safe if the workflow accepts duplicate commits.

It cannot recover a lost acknowledgement if the system has no durable record.

It cannot protect a human override if the next agent run resumes from stale state.

Stateful workflows need tests named after the promises they can break.

No lost acknowledgements. Idempotency under replay. Safe retries. Ordering. Crash recovery. Timeout behavior. Correct blame when something fails.

Those are not edge cases. They are the contract.

The point of claim-driven testing is to make the contract attackable before production does it for you.
