Your research agent leaks through the questions it asks

Companies watch what their agents read and write. A new benchmark says watch what they ask, too. The search trail is a data surface.

Sean McLellan

Lead Architect & Founder

June 24, 2026 · 7 min read

Picture the log a deep research agent leaves behind after one run over your internal documents. Not the report. The search trail: the list of queries it fired at the open web while it worked.

"healthcare data center lease expirations 2024"
"HIPAA-eligible cloud regions added January 2025"
"hospital network cloud migration case studies"
"MediConn vendor partnerships announcement"
"largest US hospital networks completing cloud migration 2025"

What the benchmark actually measured

Why telling it not to leak does not hold

That is the move you want to engineer into a workflow whether or not you can train a model: more questions, fewer fingerprints.

The query-spill ledger

Ten fields. Copy them into a sheet and fill one row per private fact class your agent could reach.

Private fact class: the category of internal fact in play, such as client name, migration metric, contract date, deal stage, patient cohort, or incident detail.
External query that carried a fragment: the actual or expected search string that exported part of it.
Necessary or unnecessary: did resolving the task genuinely require that fragment in the query, or was it convenience?
Exposure level: intent, answer, or full information, using the MosaicLeaks ladder.
Safer rewrite: the same query with specifics generalized, like category instead of company, range instead of exact metric, or quarter instead of date.
Allowed source or tool: which external endpoint this query may go to, if any.
Stop-and-ask trigger: the condition under which the agent must halt and route to a human instead of searching.
Evidence required before allow: what must be true for the query to proceed, such as de-identified bridge entity, approved source, or no client name present.
Post-run log review owner: the named person who reads the query trail after the run.
Eval case created: the test you add to your harness so this exact spill is caught next time.
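
If a spreadsheet is not where your team keeps this kind of record, the ten fields above also fit a small record type. What follows is a minimal sketch, assuming Python and a CSV export you can paste into a sheet. The class name `LedgerRow`, the field names, and the `to_csv` helper are illustrative choices made here, not part of the MosaicLeaks benchmark or any existing tool. The example values come from the MediConn-style illustration.

```python
import csv
import io
from dataclasses import asdict, dataclass, fields

# Exposure ladder as named in the ledger description.
EXPOSURE_LEVELS = ("intent", "answer", "full information")


@dataclass
class LedgerRow:
    """Hypothetical record: one row per private fact class."""
    private_fact_class: str
    external_query: str
    necessary: bool
    exposure_level: str
    safer_rewrite: str
    allowed_source: str
    stop_and_ask_trigger: str
    evidence_required: str
    log_review_owner: str
    eval_case: str


def to_csv(rows):
    """Render ledger rows as CSV text, header line first."""
    for row in rows:
        if row.exposure_level not in EXPOSURE_LEVELS:
            raise ValueError(f"unknown exposure level: {row.exposure_level!r}")
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=[f.name for f in fields(LedgerRow)])
    writer.writeheader()
    for row in rows:
        writer.writerow(asdict(row))
    return buf.getvalue()


example_row = LedgerRow(
    private_fact_class="Cloud-migration completion metric",
    external_query="MediConn 70% cloud migration January 2025",
    necessary=False,
    exposure_level="full information",
    safer_rewrite="typical cloud migration timelines for mid-size healthcare networks",
    allowed_source="Approved industry-report endpoint only",
    stop_and_ask_trigger="Any query naming a client and a quantified internal metric",
    evidence_required="Bridge entity de-identified before it reaches the web tool",
    log_review_owner="Data security lead",
    eval_case="Chain that tempts the agent to name client plus metric in one search",
)

print(to_csv([example_row]))
```

Validating the exposure level on export is a small design choice: it keeps free-text variants out of a column you will probably want to filter on later.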

Here is one row, filled, using the MediConn-style illustration:

Field	Value
Fact class	Cloud-migration completion metric
Query that carried it	`"MediConn 70% cloud migration January 2025"`
Necessary?	No. The agent needed a public migration benchmark, not the client's name and figure.
Exposure	Full information
Safer rewrite	`"typical cloud migration timelines for mid-size healthcare networks"`
Allowed source	Approved industry-report endpoint only
Stop-and-ask trigger	Any query naming a client and a quantified internal metric
Evidence to allow	Bridge entity de-identified before it reaches the web tool
Log review owner	Data security lead
Eval case	Chain that tempts the agent to name client plus metric in one search
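
The stop-and-ask trigger in that row is concrete enough to check mechanically before a query leaves the building. Here is a rough sketch under stated assumptions: `CLIENT_NAMES` is a placeholder list you would maintain yourself, the regular expression treats only percentages and dollar figures as a "quantified metric", and the function names are hypothetical. It implements this one trigger only and does not replace the human review the ledger calls for.

```python
import re

# Assumption: a list of client names you maintain yourself. Placeholder value.
CLIENT_NAMES = ["MediConn"]

# Rough heuristic for a quantified metric: a percentage or a dollar figure.
# Bare years such as "2025" deliberately do not match.
METRIC_PATTERN = re.compile(r"\d+(?:\.\d+)?\s*%|\$\s*\d")


def names_client(query, client_names=CLIENT_NAMES):
    """True when any known client name appears in the query (case-insensitive)."""
    lowered = query.lower()
    return any(name.lower() in lowered for name in client_names)


def has_quantified_metric(query):
    """True when the query contains something that looks like a metric."""
    return METRIC_PATTERN.search(query) is not None


def should_stop_and_ask(query, client_names=CLIENT_NAMES):
    """True when the query names a client and carries a quantified metric."""
    return names_client(query, client_names) and has_quantified_metric(query)


for q in [
    "MediConn 70% cloud migration January 2025",
    "MediConn vendor partnerships announcement",
    "typical cloud migration timelines for mid-size healthcare networks",
]:
    print(should_stop_and_ask(q), "|", q)
```

Only the first query trips this trigger. The second still names the client, so it would need its own ledger row with its own trigger.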

Running it on one workflow

List the private fact classes the workflow handles: five to ten is plenty. These are your ledger rows.
Dry-run the agent and capture the query log: before it touches anything sensitive, watch what it searches on a benign version of the task. The trail tells you which fact classes leak as fragments.
Write the safer rewrite and the stop-and-ask trigger for each row: generalize the specifics, and name the conditions that should route to a human instead of the open web.
Assign the post-run reviewer: someone reads the real query log after each run, not just the report. This is where "the report may be clean while the search trail is dirty" stops being a slogan and becomes a check.
Turn each real spill into an eval case: so the next model, prompt, or vendor swap gets tested against the leak you already found.
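
The dry-run and eval steps above can be sketched as a scan over a captured query log. This is a minimal illustration, assuming the log is a plain list of query strings and that each fact class can be represented by a few literal fragments. `find_spills`, `FRAGMENTS`, and the sample log are hypothetical, and plain substring matching will miss paraphrases, so treat it as a first filter and not as proof that a trail is clean.

```python
def find_spills(query_log, fragments_by_fact_class):
    """Return (query, fact_class, fragment) for each private fragment found."""
    spills = []
    for query in query_log:
        lowered = query.lower()
        for fact_class, fragments in fragments_by_fact_class.items():
            for fragment in fragments:
                if fragment.lower() in lowered:
                    spills.append((query, fact_class, fragment))
    return spills


# Hypothetical fragments, one entry per ledger row.
FRAGMENTS = {
    "client name": ["MediConn"],
    "migration metric": ["70%"],
}

# Hypothetical query log captured from a dry run.
dry_run_log = [
    "hospital network cloud migration case studies",
    "MediConn vendor partnerships announcement",
    "MediConn 70% cloud migration January 2025",
]

spills = find_spills(dry_run_log, FRAGMENTS)
for query, fact_class, fragment in spills:
    print(f"{fact_class}: {fragment!r} in {query!r}")


def test_safer_rewrite_is_clean():
    """Eval case: the generalized query must carry no private fragment."""
    rewrite = "typical cloud migration timelines for mid-size healthcare networks"
    assert find_spills([rewrite], FRAGMENTS) == []
```

Each spill found in a real run can become another function shaped like `test_safer_rewrite_is_clean`, so the check outlives the model, prompt, or vendor that produced the original leak.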
