Monday morning, a small team running a support-ticket triage agent swaps one string in a config file: claude-sonnet-4-6 becomes claude-sonnet-5. It's the kind of change that usually takes ten minutes and a changelog skim. Anthropic announced Sonnet 5 on June 30, calling it "the most agentic Sonnet model yet," and the pricing looked friendly enough to justify skipping the usual caution: $2 per million input tokens and $10 per million output through August 31, stepping up to $3 and $15 after.
By Tuesday the agent is doing something it never used to do. It opens a browser tab to check a customer's order history before answering a question nobody asked it to verify. It retries a flaky internal API four times instead of failing fast. It's not wrong, exactly. The answers are often better. But the bill for the triage queue jumps, and nobody on the team changed a prompt. Anthropic's own framing explains why: Sonnet 5 "can make plans, use tools like browsers and terminals, and run autonomously at a level that, just a few months ago, required larger and more expensive models." That capability didn't arrive as an option. It arrived switched on.
The question the team actually needed to answer before Monday wasn't which model. It was how hard is this workflow allowed to try.
The dial nobody read the manual for
That question has an answer now, buried a few clicks past the pricing page in Anthropic's docs on the effort parameter. Effort controls how eagerly the model works before it stops: how many tool calls it makes, how long it reasons, how thorough a pass it takes at a task. The docs are careful about what it isn't ("effort is a behavioral signal, not a strict token budget") and specific about what it changes ("lower effort would mean Claude makes fewer tool calls"). Five levels are available: max, xhigh, high, medium, low. Sonnet 5 ships defaulting to high.
High-by-default is the detail worth sitting with. It means every workflow that adopts Sonnet 5 without touching the effort setting inherits Anthropic's judgment about how hard the model should try, applied uniformly across a support bot, a coding agent, and a batch classification job that runs ten thousand times a day. Those three have almost nothing in common in terms of what "trying harder" should cost you.
This is the same lesson we've written about from the other direction: when better benchmarks produce worse production outcomes, the gap is usually a setting nobody thought to check. Here it's not accuracy that moves without your permission. It's spend and autonomy, together, on a dial you didn't know existed until the invoice moved.

What the discourse got right, and where it stopped
The Hacker News thread on the launch (over a thousand points and climbing past 600 comments within a day) spent a lot of its energy on cost-per-task charts, and one exchange gets at the real shape of the problem better than the announcement did. Commenter doctoboggan looked at the numbers and drew a boundary: "The cost per task chart is telling me that I should never use Sonnet 5 above medium effort level - Opus always performs better for a given cost. So I guess the takeaway is that if Sonnet 5 medium isn't good enough for you, switch models, not effort levels." Treat that as one engineer's read of one chart, not a rule. It's still the right kind of question: routing logic, not a verdict.
A reply from pmarreck reframed what effort actually is: "In a skilled senior's hands, this is like an expert power tool… It's a magnifier." And then the line that matters most for anyone about to flip a model string in production: "It's being reviewed, measured, and controlled against. Because... you WILL need more controls to take full advantage." A magnifier held by someone without a policy for it doesn't magnify judgment. It magnifies whatever was already in the workflow, mistakes included.
The uncomfortable part is that Anthropic isn't going to write that policy for you. It ships a default. The routing, deciding which workflows deserve high effort and which ones should never see above medium, is now your job, the same way choosing which requests get a bigger, slower model has always been your job. Effort just gives you a second axis to route on, cheaper and finer-grained than swapping models, and easy to forget precisely because it's cheap.
The agent effort budget card
The fix isn't a new dashboard. It's one page you fill in per workflow before the model flag changes in production: a card that routes each agent by task class and effort level, with the guardrails that keep "run autonomously" from becoming "run unsupervised."
Scroll sideways to see all 4 columns.
| Field | Question it answers | Example: coding agent (PR drafts) | Example: support triage bot |
|---|---|---|---|
| Task class | What kind of work is this, and how much does a wrong answer cost? | Multi-file code change, reviewed before merge | Customer-facing reply, sent directly |
| Effort level | How hard should the model try before stopping? | xhigh for the hardest refactors, high default | medium: good enough, cost-bounded |
| Tool-call / retry limit | When does trying harder become thrashing? | Up to 12 tool calls, 2 retries per failed test | 3 tool calls, no silent retries |
| Latency target | What's the slowest this can run before it blocks someone? | Under 4 minutes, async | Under 15 seconds, customer waiting |
| Review gate | Who or what checks the output before it acts on the world? | Human review before merge, always | Confidence threshold; below it, escalate to human |
| Tokenizer adjustment | Did this workflow's cost baseline just move? | Re-baseline: same diffs, ~1.1-1.35x tokens | Re-baseline: short tickets, closer to 1.0x |
| Fallback model | What do you drop to when effort isn't the lever? | Opus 4.8, if Sonnet 5 at xhigh still misses | Sonnet 4.6, if regression appears |
| Stop rule | What condition ends the run regardless of confidence? | Test suite fails twice in a row | Escalation triggered or budget per ticket hit |
Fill this in for the one workflow closest to a customer or closest to your production repo, not for everything at once. The point isn't paperwork. It's that "effort" stops being a hidden default and becomes a decision your team can point to: one row that reads back as this is how hard we let it try, and this is where it stops.
Two rows deserve a second look before you trust the card.
Tokenizer adjustment exists because Sonnet 5 doesn't tokenize input the same way its predecessor did. Anthropic's own footnote: "the same input can map to more tokens: roughly 1.0-1.35x depending on the content type," and the intro pricing is designed to make the transition roughly cost-neutral through August. Simon Willison's independent read, working through the numbers on real prompts, put it more bluntly: "the same input text produces approximately 30% more tokens than on Claude Sonnet 4.6." Call it his estimate, not Anthropic's claim. Either way, a workflow's cost baseline moved the same week its capability did, and the price per token rises again after August 31. If you measured your context cost once, before Sonnet 5, that calorie label is stale now. Re-baseline before you compare invoices.
Stop rule is the row teams skip, and it's the one doctoboggan and pmarreck were circling from opposite directions. A model that tries harder needs a condition that ends the trying, not "until it seems confident" but a concrete trigger: two failed test runs, a fixed tool-call ceiling, a per-ticket cost cap. Without it, "run autonomously" quietly becomes "run until something else notices," and the something else is usually a customer or an invoice.
Treat the flag change like a migration, not a toggle
None of this means Sonnet 5 is risky to adopt. The announcement's safety claim, that it behaves better than Sonnet 4.6 in agentic contexts, is a real reason to move. It means the flag change deserves the same discipline as any other model migration: a window where old and new behavior run side by side, a rollback path, and a defined moment where you declare the migration done rather than letting it drift. An effort policy is what makes that window legible: you're not just watching whether outputs changed, you're watching whether the model's willingness to act changed, on a workflow that used to be predictable because nobody had given it permission to try this hard.
The team that swapped the model string on Monday didn't do anything wrong. They just answered a question (which model) when the actual question, the one the release notes buried under a price cut, was how hard they wanted it to try. Write the card before the next workflow makes that upgrade. It's shorter than the outage report.
If you want a second set of eyes on one workflow's effort policy before it goes to production, that's exactly the kind of narrow, concrete session BaristaLabs' process automation work is built for: bring one agent, one task class, and we'll help you fill in the card before the model flag ships.
Before the model flag changes
Set one workflow's effort policy before you upgrade
Bring one agent workflow that's about to move to Claude Sonnet 5: coding, browser, or internal-tool automation. We'll map its task class, effort level, tool-call and retry limits, review gate, and fallback model before anyone flips the switch in production.
Best fit when a team runs agents against a pay-per-token API and is about to change the model string in production.
Practical AI Workflow Notes
Want more practical AI operations ideas?
Get short notes on applying AI inside real small-business workflows — from document handling and customer follow-up to internal reporting, compliance, and automation guardrails.
Turn this idea into a pilot
Which workflow should go first?
Use the readiness check to compare impact, effort, risk, owner, and next step before booking a call.
- 3-5 minutes
- Deterministic score
- No sensitive data
Share this post
