Picture a support lead running one small ritual before an escalation review. She pulls a ticket the AI kicked to a human overnight, hands it to an agent who has never seen it, and starts a two-minute timer. No backchannel on Slack. No "let me re-read the thread." Just the ticket, exactly as the bot left it.
Ninety seconds in, the agent looks up. The subject line says escalated. The body is a transcript with no summary. Somewhere in the middle the AI told the customer a refund was "being processed." Nobody can tell whether that is true, who would process it, or when. The clock runs out with three of four questions unanswered.
The bot hadn't refused to help. It answered plenty. It just left behind something no human could pick up and run with — and the deflection dashboard graded that as a win.
The test, not the dashboard
Most teams measure their support AI the way the standup measures it: deflection rate, CSAT, tickets opened. Those numbers describe what the AI closed. They say nothing about what it dropped on the way out, and the dropped tickets are where the cost quietly moves.
So measure the drop directly. Take one escalated ticket exactly as the AI handed it over, give it to a human cold, and time whether they can answer four questions in two minutes:
- What did the customer ask?
- What did the AI already try?
- What was the customer promised?
- Who owns the next move?
That's the cold-read handoff test. It is deliberately unfair in the way real life is unfair: the next agent gets neither the AI's internal reasoning nor the context the conversation started with. They get the residue. If four clean answers don't come out in two minutes, the handoff isn't done. It has just been moved to a slower, more expensive desk.

Cold-read test card
- Pick the escalation trigger that fires most often, not the gnarliest one. Common beats dramatic.
- Grab one real ticket it produced, untouched — no re-reading the thread, no asking the bot, no tribal knowledge.
- Hand it to an agent who's never seen it. Start a two-minute timer.
- Score the four reads: the ask, the attempts, the promise, the next owner.
- Pass means four answers inside the window. A miss isn't a vibe — it points at the exact field the AI dropped.
Run it on five tickets from one queue and the failures cluster. Usually it's the same one or two questions going blank every time, which is the opposite of a vague "the AI handoffs feel bad" — it's a punch list.
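The scoring on the test card can be sketched in a few lines of Python. This is a minimal illustration, not a tool from any platform: the record shape, the short question keys, and the five sample reads are all invented for the example.

```python
from collections import Counter
from dataclasses import dataclass, field

# The four reads from the test card. The short keys are illustrative labels.
QUESTIONS = ("ask", "attempts", "promise", "next_owner")
TIME_LIMIT_SECONDS = 120  # the two-minute window


@dataclass
class ColdRead:
    """One cold read of one escalated ticket (hypothetical record shape)."""
    ticket_id: str
    seconds_used: float
    answered: set = field(default_factory=set)  # questions answered in time

    def missed(self) -> list:
        # Any question not answered counts as a miss.
        return [q for q in QUESTIONS if q not in self.answered]

    def passed(self) -> bool:
        # Pass means four answers inside the window.
        return not self.missed() and self.seconds_used <= TIME_LIMIT_SECONDS


def punch_list(reads) -> Counter:
    """Count how often each question went blank across a batch of reads."""
    return Counter(q for r in reads for q in r.missed())


# Invented sample: five tickets from one queue.
reads = [
    ColdRead("T-101", 120, {"ask"}),
    ColdRead("T-102", 95, {"ask", "attempts", "promise", "next_owner"}),
    ColdRead("T-103", 120, {"ask", "attempts"}),
    ColdRead("T-104", 120, {"ask", "attempts", "next_owner"}),
    ColdRead("T-105", 110, {"ask", "attempts", "promise"}),
]

pass_count = sum(r.passed() for r in reads)
blanks = punch_list(reads)
```

In this made-up batch one ticket passes, and the promise and next-owner questions each go blank three times. That clustering is the point: the counter names the fields to fix first.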
Reading a failed test
The four questions aren't trivia. Each maps to a specific thing the AI either carried forward or threw away, and the blank tells you which.
- "What did the customer ask" comes back empty: the AI shipped a raw transcript with no summary, and the human has to reconstruct intent from a forty-message thread before they can even start.
- "What did the AI already try" is blank: the human is about to repeat steps the customer has already done twice, the fastest way to turn one annoyed customer into a furious one.
- "What was the customer promised" can't be answered: you have a liability sitting in the queue. The AI committed to a refund or a callback or a fix window, and now nobody knows whether to honor it or walk it back.
- "Who owns the next move" is unclear: the ticket enters the worst state a support queue has. It is visible to everyone, owned by no one, and aging against an SLA clock that may or may not have started.
That last one is the quiet killer. A cold ticket with no owner doesn't get worked; it gets glanced at, deprioritized, and surfaced three days later when the customer writes back angry. The deflection metric never flinches.
The platforms already let you pass this test
Here's the part that should change the conversation in your next review: almost every blank the cold-read test exposes is a configuration you already own, not a model limitation you have to wait out.
Intercom's documentation for Fin handoffs describes three things teams can set before a conversation leaves the AI layer: sending the customer a message to set expectations, collecting missing information from the customer before the conversation moves, and triggering a data connector to make a web request to an external system via API. The framing is blunt — when Fin can't resolve a conversation, "teams have full control over what happens next." That's the promise question and the what-the-AI-tried question, addressable on purpose. None of it is on by default.
Intercom's workflow documentation goes further, describing how teams can shape different escalation experiences by customer segment — paying customers route to a live teammate, non-paying customers route to the Help Center — with office hours respected and conflict checks available to preview before going live. That's the who-owns-the-next-move question, answered by routing rule instead of by luck.
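As a rough sketch of what "answered by routing rule" means, the segment logic above can be written as a small function. This is plain Python, not Intercom's workflow builder or API. The office hours, the destination names, and the after-hours branch for paying customers are assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import datetime, time


@dataclass
class Escalation:
    customer_id: str
    is_paying: bool
    created_at: datetime  # assumed to already be in the support team's local time


# Hypothetical office hours: weekdays, 09:00 to 17:00.
OPEN, CLOSE = time(9, 0), time(17, 0)


def in_office_hours(moment: datetime) -> bool:
    # weekday() returns 0 for Monday through 6 for Sunday.
    return moment.weekday() < 5 and OPEN <= moment.time() < CLOSE


def route(esc: Escalation) -> dict:
    """Return a named owner plus the expectation to set with the customer."""
    if not esc.is_paying:
        return {"owner": "help_center", "expectation": "Self-serve articles linked"}
    if in_office_hours(esc.created_at):
        return {"owner": "live_teammate", "expectation": "A teammate will join this chat"}
    # Assumption: after hours, paying customers wait in a named queue.
    return {"owner": "next_business_day_queue", "expectation": "Reply when the team is back online"}


monday_morning = route(Escalation("c-1", True, datetime(2024, 1, 8, 10, 0)))
saturday = route(Escalation("c-2", True, datetime(2024, 1, 6, 10, 0)))
free_tier = route(Escalation("c-3", False, datetime(2024, 1, 8, 10, 0)))
```

Every branch returns a named owner, so no path leaves the fourth question blank.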
Salesforce Agentforce treats the handoff as an architecture principle rather than a fallback — "Humans with Agents drive customer success together" — with low-code guardrails that admins and business users can configure. The handoff sits next to answering and resolving as a first-class job, not the sad path the product takes when it gives up.
At the developer layer, the OpenAI Agents SDK defines handoffs as an orchestration primitive with inputs, filters, recommended instructions, results, and human-in-the-loop support built in — the machinery for deciding exactly what travels with a ticket and what gets left out. And its guardrails documentation makes a distinction that bites teams in practice: input guardrails run for the first agent, output guardrails run for the agent producing final output, and tool guardrails run on every function-tool call. If you need a check around each step in a workflow with managers, handoffs, or specialists, you need tool-level guardrails — the input/output checks sail right past the middle, which is exactly where a promise gets made and lost.
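To see why the middle goes unchecked, here is a toy model of the coverage rule as described above. It is not SDK code and calls nothing from the OpenAI Agents SDK. The step kinds, agent names, and the sample run are invented to illustrate the rule.

```python
from dataclasses import dataclass


@dataclass
class Step:
    agent: str
    kind: str  # "agent_input", "tool_call", or "agent_output"


def covering_guardrail(step: Step, first_agent: str, final_agent: str):
    """Return which guardrail type would see this step, or None if none would."""
    if step.kind == "tool_call":
        return "tool"    # tool guardrails run on every function-tool call
    if step.kind == "agent_input" and step.agent == first_agent:
        return "input"   # input guardrails run for the first agent
    if step.kind == "agent_output" and step.agent == final_agent:
        return "output"  # output guardrails run for the agent producing final output
    return None


# Invented run: triage hands off to billing, which hands off to a refunds specialist.
run = [
    Step("triage", "agent_input"),
    Step("billing", "agent_input"),
    Step("billing", "tool_call"),  # e.g. the call where a refund gets promised
    Step("refunds", "agent_input"),
    Step("refunds", "agent_output"),
]

coverage = [(s.agent, s.kind, covering_guardrail(s, "triage", "refunds")) for s in run]
unchecked = [(agent, kind) for agent, kind, guard in coverage if guard is None]
```

In this model the only check that touches the billing agent is the tool-level one. The inputs to the two middle agents pass with no guardrail at all.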
So the cold-read test usually fails for an unglamorous reason. The configuration that would fill in the blanks exists and is sitting at its default. Nobody decided what the handoff should carry, so it carries nothing.
Somebody already named this
A thread on the CustomerSuccess subreddit this week put it plainly: "The best AI use case in CS might be escalation hygiene, not customer replies." Read that as a useful label for the pattern, not as proof by itself. The expensive problem is one layer over from the reply: what gets summarized, what gets routed, who catches it, and whether the next human can act without reconstructing the whole conversation.
That's a definition problem wearing a technology costume. The model can write a clean summary, set an expectation, and tag an owner all day. It just won't do any of it until you say so.
If you want the AI to leave a usable trail behind every escalation, that's the same instinct behind an agent receipt — a structured record of what the AI did, tried, decided, and handed off, attached to the ticket so the cold read has something to read.
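One possible shape for such a receipt is sketched below, with one field per cold-read question. The class name, field names, and sample values are hypothetical, not a vendor schema. The sketch also makes one design assumption: an empty promises list counts as a blank, so the AI has to record "none" explicitly when it made no commitment.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class AgentReceipt:
    """Hypothetical handoff record. Field names are illustrative, not a vendor schema."""
    ticket_id: str
    customer_ask: str = ""                           # one-line summary of the request
    steps_tried: list = field(default_factory=list)  # what the AI already attempted
    promises: list = field(default_factory=list)     # commitments made; ["none"] if none
    next_owner: str = ""                             # a named person or queue

    def blanks(self) -> list:
        """Return the fields a cold reader would find empty."""
        values = {
            "customer_ask": self.customer_ask.strip(),
            "steps_tried": self.steps_tried,
            "promises": self.promises,
            "next_owner": self.next_owner.strip(),
        }
        return [name for name, value in values.items() if not value]

    def to_note(self) -> str:
        """Serialize the receipt so it can be attached to the ticket as a note."""
        return json.dumps(asdict(self), indent=2)


# Invented example: the refund promise was never recorded and no owner was set.
receipt = AgentReceipt(
    ticket_id="T-101",
    customer_ask="Refund for a duplicate charge",
    steps_tried=["Confirmed the duplicate charge on the account"],
)
missing = receipt.blanks()
```

A check like `blanks()` could run before the ticket leaves the AI layer, so an incomplete receipt is caught at handoff time rather than by the next agent's stopwatch.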
What this isn't
This is not the longer-horizon question of whether you could move your support AI to a different vendor without losing the rules. That planning matters, but it concerns a future you might not face this quarter. The cold-read test is about the queue that is live this morning.
It's also narrower than monitoring. AI observability tooling watches your pipeline in aggregate. The cold-read test watches one human try to act on one ticket. Both are useful; only one tells you whether the agent who picks up the next escalation can actually do their job.
Run it tomorrow
You don't need a project for this. Tomorrow morning, before the standup, pull one ticket your AI escalated overnight and hand it to whoever sits closest. Two-minute timer. Four questions. Watch where they stall.
Most teams have never done this, which is why the deflection chart and the human queue tell two different stories. The chart counts what the AI closed. The cold-read test counts what a person can pick up. When those numbers disagree, the test is the one telling the truth.
Have a queue where deflection is up but the human side feels worse? Bring one ticket your AI handed off badly. We'll run the cold-read test on it and rebuild the handoff so the next one lands.
Handoff review
Cold-read one AI handoff with us
BaristaLabs helps support teams run the cold-read test on one real AI handoff — the trigger, the missing context, the unclaimed next move — and rebuild it so a human can act in two minutes.
Best fit when deflection numbers look good but your human agents say the AI is making their queue worse.