AI Development

Claude Sonnet 5 didn't just get cheaper. Your agents now need an effort policy.

The upgrade note said Sonnet 5 was the most agentic version yet, and everyone read it as a price cut. The operator question buried in the release is different: how hard should this workflow be allowed to try?

Sean McLellan

Lead Architect & Founder

July 1, 20268 min read

Monday morning, a small team running a support-ticket triage agent swaps one string in a config file: claude-sonnet-4-6 becomes claude-sonnet-5. It's the kind of change that usually takes ten minutes and a changelog skim. Anthropic announced Sonnet 5 on June 30, calling it "the most agentic Sonnet model yet," and the pricing looked friendly enough to justify skipping the usual caution: $2 per million input tokens and $10 per million output through August 31, stepping up to $3 and $15 after.

By Tuesday the agent is doing something it never used to do. It opens a browser tab to check a customer's order history before answering a question nobody asked it to verify. It retries a flaky internal API four times instead of failing fast. It's not wrong, exactly. The answers are often better. But the bill for the triage queue jumps, and nobody on the team changed a prompt. Anthropic's own framing explains why: Sonnet 5 "can make plans, use tools like browsers and terminals, and run autonomously at a level that, just a few months ago, required larger and more expensive models." That capability didn't arrive as an option. It arrived switched on.

The question the team actually needed to answer before Monday wasn't which model. It was how hard is this workflow allowed to try.

The dial nobody read the manual for

That question has an answer now, buried a few clicks past the pricing page in Anthropic's docs on the effort parameter. Effort controls how eagerly the model works before it stops: how many tool calls it makes, how long it reasons, how thorough a pass it takes at a task. The docs are careful about what it isn't ("effort is a behavioral signal, not a strict token budget") and specific about what it changes ("lower effort would mean Claude makes fewer tool calls"). Five levels are available: max, xhigh, high, medium, low. Sonnet 5 ships defaulting to high.

High-by-default is the detail worth sitting with. It means every workflow that adopts Sonnet 5 without touching the effort setting inherits Anthropic's judgment about how hard the model should try, applied uniformly across a support bot, a coding agent, and a batch classification job that runs ten thousand times a day. Those three have almost nothing in common in terms of what "trying harder" should cost you.

This is the same lesson we've written about from the other direction: when better benchmarks produce worse production outcomes, the gap is usually a setting nobody thought to check. Here it's not accuracy that moves without your permission. It's spend and autonomy, together, on a dial you didn't know existed until the invoice moved.

Frosted glass effort tokens arranged around a translucent control cube on a dark studio surface. — Effort is a behavioral signal, not a token budget, which is exactly why it needs a policy, not a default.

What the discourse got right, and where it stopped

The Hacker News thread on the launch (over a thousand points and climbing past 600 comments within a day) spent a lot of its energy on cost-per-task charts, and one exchange gets at the real shape of the problem better than the announcement did. Commenter doctoboggan looked at the numbers and drew a boundary: "The cost per task chart is telling me that I should never use Sonnet 5 above medium effort level - Opus always performs better for a given cost. So I guess the takeaway is that if Sonnet 5 medium isn't good enough for you, switch models, not effort levels." Treat that as one engineer's read of one chart, not a rule. It's still the right kind of question: routing logic, not a verdict.

A reply from pmarreck reframed what effort actually is: "In a skilled senior's hands, this is like an expert power tool… It's a magnifier." And then the line that matters most for anyone about to flip a model string in production: "It's being reviewed, measured, and controlled against. Because... you WILL need more controls to take full advantage." A magnifier held by someone without a policy for it doesn't magnify judgment. It magnifies whatever was already in the workflow, mistakes included.

The uncomfortable part is that Anthropic isn't going to write that policy for you. It ships a default. The routing, deciding which workflows deserve high effort and which ones should never see above medium, is now your job, the same way choosing which requests get a bigger, slower model has always been your job. Effort just gives you a second axis to route on, cheaper and finer-grained than swapping models, and easy to forget precisely because it's cheap.

The agent effort budget card

The fix isn't a new dashboard. It's one page you fill in per workflow before the model flag changes in production: a card that routes each agent by task class and effort level, with the guardrails that keep "run autonomously" from becoming "run unsupervised."

Scroll sideways to see all 4 columns.

Field	Question it answers	Example: coding agent (PR drafts)	Example: support triage bot
Task class	What kind of work is this, and how much does a wrong answer cost?	Multi-file code change, reviewed before merge	Customer-facing reply, sent directly
Effort level	How hard should the model try before stopping?	`xhigh` for the hardest refactors, `high` default	`medium`: good enough, cost-bounded
Tool-call / retry limit	When does trying harder become thrashing?	Up to 12 tool calls, 2 retries per failed test	3 tool calls, no silent retries
Latency target	What's the slowest this can run before it blocks someone?	Under 4 minutes, async	Under 15 seconds, customer waiting
Review gate	Who or what checks the output before it acts on the world?	Human review before merge, always	Confidence threshold; below it, escalate to human
Tokenizer adjustment	Did this workflow's cost baseline just move?	Re-baseline: same diffs, ~1.1-1.35x tokens	Re-baseline: short tickets, closer to 1.0x
Fallback model	What do you drop to when effort isn't the lever?	Opus 4.8, if Sonnet 5 at `xhigh` still misses	Sonnet 4.6, if regression appears
Stop rule	What condition ends the run regardless of confidence?	Test suite fails twice in a row	Escalation triggered or budget per ticket hit

Fill this in for the one workflow closest to a customer or closest to your production repo, not for everything at once. The point isn't paperwork. It's that "effort" stops being a hidden default and becomes a decision your team can point to: one row that reads back as this is how hard we let it try, and this is where it stops.

Two rows deserve a second look before you trust the card.

Tokenizer adjustment exists because Sonnet 5 doesn't tokenize input the same way its predecessor did. Anthropic's own footnote: "the same input can map to more tokens: roughly 1.0-1.35x depending on the content type," and the intro pricing is designed to make the transition roughly cost-neutral through August. Simon Willison's independent read, working through the numbers on real prompts, put it more bluntly: "the same input text produces approximately 30% more tokens than on Claude Sonnet 4.6." Call it his estimate, not Anthropic's claim. Either way, a workflow's cost baseline moved the same week its capability did, and the price per token rises again after August 31. If you measured your context cost once, before Sonnet 5, that calorie label is stale now. Re-baseline before you compare invoices.

Stop rule is the row teams skip, and it's the one doctoboggan and pmarreck were circling from opposite directions. A model that tries harder needs a condition that ends the trying, not "until it seems confident" but a concrete trigger: two failed test runs, a fixed tool-call ceiling, a per-ticket cost cap. Without it, "run autonomously" quietly becomes "run until something else notices," and the something else is usually a customer or an invoice.

Treat the flag change like a migration, not a toggle

None of this means Sonnet 5 is risky to adopt. The announcement's safety claim, that it behaves better than Sonnet 4.6 in agentic contexts, is a real reason to move. It means the flag change deserves the same discipline as any other model migration: a window where old and new behavior run side by side, a rollback path, and a defined moment where you declare the migration done rather than letting it drift. An effort policy is what makes that window legible: you're not just watching whether outputs changed, you're watching whether the model's willingness to act changed, on a workflow that used to be predictable because nobody had given it permission to try this hard.

The team that swapped the model string on Monday didn't do anything wrong. They just answered a question (which model) when the actual question, the one the release notes buried under a price cut, was how hard they wanted it to try. Write the card before the next workflow makes that upgrade. It's shorter than the outage report.

If you want a second set of eyes on one workflow's effort policy before it goes to production, that's exactly the kind of narrow, concrete session BaristaLabs' process automation work is built for: bring one agent, one task class, and we'll help you fill in the card before the model flag ships.

Before the model flag changes

Set one workflow's effort policy before you upgrade

Bring one agent workflow that's about to move to Claude Sonnet 5: coding, browser, or internal-tool automation. We'll map its task class, effort level, tool-call and retry limits, review gate, and fallback model before anyone flips the switch in production.

Map the effort policy Explore process automation

Best fit when a team runs agents against a pay-per-token API and is about to change the model string in production.

Practical AI Workflow Notes

Want more practical AI operations ideas?

Get short notes on applying AI inside real small-business workflows — from document handling and customer follow-up to internal reporting, compliance, and automation guardrails.

Turn this idea into a pilot

Which workflow should go first?

Use the readiness check to compare impact, effort, risk, owner, and next step before booking a call.

3-5 minutes
Deterministic score
No sensitive data

Check workflow readiness

Share this post

Share on X Share on LinkedIn Share on Bluesky

Your coding agent needs a calorie label for context

June 28, 2026

Why Better Benchmarks Can Produce Worse Production Outcomes

February 27, 2026

The New AI Stack Is Being Won in Migration Windows, Not Demos

February 27, 2026

Keep Reading

AI Development

Claude Sonnet 5 didn't just get cheaper. Your agents now need an effort policy.

Sean McLellan

Lead Architect & Founder

July 1, 20268 min read

The question the team actually needed to answer before Monday wasn't which model. It was how hard is this workflow allowed to try.

The dial nobody read the manual for

What the discourse got right, and where it stopped

The agent effort budget card

Scroll sideways to see all 4 columns.

Field	Question it answers	Example: coding agent (PR drafts)	Example: support triage bot
Task class	What kind of work is this, and how much does a wrong answer cost?	Multi-file code change, reviewed before merge	Customer-facing reply, sent directly
Effort level	How hard should the model try before stopping?	`xhigh` for the hardest refactors, `high` default	`medium`: good enough, cost-bounded
Tool-call / retry limit	When does trying harder become thrashing?	Up to 12 tool calls, 2 retries per failed test	3 tool calls, no silent retries
Latency target	What's the slowest this can run before it blocks someone?	Under 4 minutes, async	Under 15 seconds, customer waiting
Review gate	Who or what checks the output before it acts on the world?	Human review before merge, always	Confidence threshold; below it, escalate to human
Tokenizer adjustment	Did this workflow's cost baseline just move?	Re-baseline: same diffs, ~1.1-1.35x tokens	Re-baseline: short tickets, closer to 1.0x
Fallback model	What do you drop to when effort isn't the lever?	Opus 4.8, if Sonnet 5 at `xhigh` still misses	Sonnet 4.6, if regression appears
Stop rule	What condition ends the run regardless of confidence?	Test suite fails twice in a row	Escalation triggered or budget per ticket hit

Two rows deserve a second look before you trust the card.

Treat the flag change like a migration, not a toggle

Before the model flag changes

Set one workflow's effort policy before you upgrade

Map the effort policy Explore process automation

Best fit when a team runs agents against a pay-per-token API and is about to change the model string in production.

Practical AI Workflow Notes

Want more practical AI operations ideas?

Get short notes on applying AI inside real small-business workflows — from document handling and customer follow-up to internal reporting, compliance, and automation guardrails.

Turn this idea into a pilot

Which workflow should go first?

Use the readiness check to compare impact, effort, risk, owner, and next step before booking a call.

3-5 minutes
Deterministic score
No sensitive data

Check workflow readiness

Share this post

Share on X Share on LinkedIn Share on Bluesky

Your coding agent needs a calorie label for context

June 28, 2026

Why Better Benchmarks Can Produce Worse Production Outcomes

February 27, 2026

The New AI Stack Is Being Won in Migration Windows, Not Demos

February 27, 2026