A coding agent picks up a failing test. To fix it, it reads the test file, the module under test, three imports, and the config that wires them together. You skim the diff and ask for one small change. The agent reads the test file, the module under test, the same three imports, and the same config: all of it, again, fresh through the model. On a pay-per-token API, that second read is billed as input all over again, at roughly what the first one cost.
None of it looks like waste. The agent is moving, the diff lands, the tests go green. The cost hides because it arrives a few thousand tokens at a time, smeared across every turn, and only surfaces a month later as an invoice line you can't quite account for. One new open-source tool prints the problem across its front page: "Your AI agent is paying to send the same file dump five times."
It's worth taking apart — not because it's the answer (it's two weeks old, and says so itself) but because the idea inside it is the one most teams are missing. You can't make a coding agent cheaper until you can read what it spends. It needs a calorie label.
What tokdiet is, and what it's careful not to claim
Tokdiet is a local streaming reverse proxy. In plain terms, it sits on your own machine between your coding agent (Claude Code, Cursor, or Codex) and the model API it talks to, whether that's Anthropic, OpenAI, Gemini, or MiniMax. Every request flows through it on the way out. It's MIT licensed, written in TypeScript, and runs on Node 20 and up. You start it with `npx tokdiet start`. It went up on June 16 and had collected 63 stars by the time I looked.
Sitting in that position lets it do three things, and the order matters.
First, it meters. Every token in and out, every dollar, in the style of the ccusage tools developers already use to read their Claude Code spend, except live and on a local dashboard. This is the part nobody should skip. Before any cutting, you get a number.
Second, it compacts. The proxy treats your context the way an operating system treats memory. The author's own framing is the clearest version: hot context stays resident, cold context gets paged out to a local SQLite store and is recoverable by id. The file dump you sent five times doesn't get sent five times; the duplicate is recognized and dropped. Other bloat — stale tool output, logs, dead branches of the conversation — gets paged out rather than deleted, so the model can pull it back if it turns out to matter.
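The paging idea is easier to see in code. What follows is a minimal sketch, not tokdiet's implementation: the `ContextStore` class, its one-table schema, the content-hash ids, and the stub text are all assumptions made for illustration. It uses Python's built-in `sqlite3` module to show the two behaviors described above: a repeated payload is replaced by a short stub, and anything stored can be recovered by id.

```python
import hashlib
import sqlite3


class ContextStore:
    """Hypothetical cold store: payloads are keyed by a hash of their content."""

    def __init__(self, path=":memory:"):
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS cold (id TEXT PRIMARY KEY, body TEXT)"
        )

    def page_out(self, body):
        """Store a payload (once) and return the id that recovers it."""
        block_id = hashlib.sha256(body.encode("utf-8")).hexdigest()[:16]
        self.db.execute(
            "INSERT OR IGNORE INTO cold (id, body) VALUES (?, ?)",
            (block_id, body),
        )
        self.db.commit()
        return block_id

    def recover(self, block_id):
        """Return the original payload, or None if the id is unknown."""
        row = self.db.execute(
            "SELECT body FROM cold WHERE id = ?", (block_id,)
        ).fetchone()
        return row[0] if row else None


def compact(messages, store):
    """Keep the first copy of each payload; replace repeats with a stub.

    Every payload is written to the store, so any stub can be resolved later.
    """
    seen = set()
    out = []
    for body in messages:
        block_id = store.page_out(body)
        if block_id in seen:
            out.append(f"[duplicate of context {block_id}]")
        else:
            seen.add(block_id)
            out.append(body)
    return out
```

The point of the design is the second method, not the first. Dropping a duplicate is safe because the first copy is still in the context; everything else is only safe because `recover` exists.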
Third, and this is the part that earns the attention, it checks its own work. Tokdiet runs what it calls a shadow-eval: a parallel comparison meant to prove the slimmer context didn't quietly make the answer worse. The project's phrase for the whole arrangement is "savings stop before quality does." There's a brake wired to the diet.
The launch write-up puts numbers on it. Across a 66-task run, a coding agent's input tokens fell from 5.07 million to 1.46 million — a 71% cut. Quality: 63 of 66 tasks solved with tokdiet, against 64 of 66 at baseline. Across 198 paired runs, an LLM judge scored 92% similarity. A second model came in around 72% reduction. The author is unusually honest about the gap: one task swung the wrong way, and that's described as within model noise, not as lossless. The dedup is genuinely loss-free. The rest is recoverable, not magic.
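The headline figures are easy to check by hand. This short sketch recomputes them from the rounded numbers quoted above; the function name is hypothetical, and because the inputs are rounded to the nearest ten thousand tokens, the result is only good to about a decimal place.

```python
def reduction_pct(baseline_tokens, governed_tokens):
    """Percentage of input tokens removed relative to the baseline."""
    return 100.0 * (baseline_tokens - governed_tokens) / baseline_tokens


baseline, governed = 5_070_000, 1_460_000  # rounded figures from the write-up
print(f"input-token reduction:   {reduction_pct(baseline, governed):.1f}%")  # 71.2%
print(f"solve rate with tokdiet: {63 / 66:.1%}")  # 95.5%
print(f"solve rate at baseline:  {64 / 66:.1%}")  # 97.0%
```

One task out of 66 is about 1.5 percentage points of solve rate, which is the gap the author describes as within model noise.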
Hold onto that honesty, because it's the whole reason the tool is interesting. A context cutter with no quality check is a tool that saves you money by sometimes removing the one clue the model needed. The dangerous version of this idea is easy to build. The useful version is the one that can tell you when it went too far.
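A brake of that kind can be as simple as two thresholds. The sketch below is an assumption about how such a check might look, not how tokdiet's shadow-eval works: the function name, the parameters, and both default thresholds are invented for illustration, and you would set your own.

```python
def brake_engaged(baseline_solved, governed_solved, total_tasks,
                  judge_similarity, max_pass_rate_drop=0.03,
                  min_similarity=0.90):
    """Return True when compaction should be rolled back.

    The default thresholds are illustrative assumptions, not tokdiet defaults.
    """
    pass_rate_drop = (baseline_solved - governed_solved) / total_tasks
    too_many_failures = pass_rate_drop > max_pass_rate_drop
    answers_drifted = judge_similarity < min_similarity
    return too_many_failures or answers_drifted
```

Fed the published run (64 solved at baseline, 63 with compaction, 66 tasks, 0.92 judge similarity), this returns `False` under the assumed thresholds: the diet continues. Tighten either threshold and the same run could trip it, which is the reason to choose the numbers before you look at the savings.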
The context calorie label
Here's the trap, and it has nothing to do with tokdiet specifically. The instinct, when someone shows you a 71% number, is to install the thing and turn it on. Skip straight to the diet. But you can't tell whether a diet worked if you never weighed yourself first, and most teams running coding agents every day have never measured a single workflow's context cost in isolation. They have a monthly invoice and a vague unease.
So before any optimizer, build the label. It's a one-page artifact you fill in for one coding-agent workflow — your noisiest or most expensive one — and it answers the only questions that decide whether cutting context will actually save you anything.

The rule that goes with it: do not put the agent on a diet until you can read the calorie label. And the corollary that catches most of the mistakes: a smaller context is not automatically a cheaper workflow. It's cheaper only when the meter, the quality result, and the billing model all agree.
Reading the fields
| Field | The question it answers | Where tokdiet's facts land it | What breaks if it's blank |
|---|---|---|---|
| Baseline input tokens | What does this workflow actually cost per run today? | 5.07M across a 66-task run, before any change | You're optimizing a number you've never seen |
| Repeated-context sources | Which payloads get re-sent every turn? | Same file dump, logs, stale tool output, sent again and again | You cut blindly and can't say what was redundant |
| Governed input tokens | What does it cost after compaction? | 1.46M — a 71% reduction; ~72% on a second model | A savings claim with no after-number to check |
| Quality result | Did the answer get worse? | 63/66 vs 64/66 baseline; 92% judge similarity; within noise, not lossless | You bought cheaper wrong answers and called it a win |
| Recoverability class | Is cut context gone, or paged out? | Dedup is loss-free; the rest is cold storage in SQLite, recoverable by id | You confuse "deleted" with "recoverable" and lose the clue the model needed |
| Billing mode | Does saving tokens save you money here? | Direct savings only on pay-per-token APIs; flat plans don't get per-token relief | You celebrate a token cut that changes your bill by zero |
| Quality brake / rollback | When does cutting context stop, and who pulls it back? | The shadow-eval is the brake; you set the threshold the score can't drop below | Nobody notices when the diet starts costing answers |
Read it top to bottom and the shape of an honest decision falls out. The first three rows are the meter. The middle two are the safety check — quality and recoverability, the two ways a cut can quietly hurt you. The last two are the part that has nothing to do with the tool and everything to do with your situation: how you're billed, and where the quality brake is set before anyone trusts the savings.
That billing row is where most of the excitement should cool by exactly the right amount. If your team is on a flat-rate subscription plan, a 71% token cut is an interesting engineering result and a roughly zero-dollar event. The per-token savings are real only when you're paying per token. Fill in the billing row first and you'll know in one line whether the rest of the exercise changes your invoice or just your dashboard.
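The billing row reduces to one branch. Here is a minimal sketch under stated assumptions: the function name and the two mode strings are hypothetical, and the price per million input tokens is a placeholder you would replace with your provider's actual rate.

```python
def input_savings_usd(baseline_tokens, governed_tokens,
                      usd_per_million_input, billing_mode):
    """Estimated input-cost savings for one measured run.

    billing_mode is "per_token" or "flat". On a flat plan the bill does
    not change with token volume, so the dollar savings are zero.
    """
    if billing_mode == "flat":
        return 0.0
    if billing_mode != "per_token":
        raise ValueError(f"unknown billing mode: {billing_mode!r}")
    saved_tokens = baseline_tokens - governed_tokens
    return saved_tokens / 1_000_000 * usd_per_million_input


# Placeholder rate of $3.00 per million input tokens (an assumption).
print(input_savings_usd(5_070_000, 1_460_000, 3.00, "per_token"))
print(input_savings_usd(5_070_000, 1_460_000, 3.00, "flat"))
```

The token counts are identical in both calls. Only the billing mode decides whether the cut shows up as dollars.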
Where the idea stops
A good tool tells you its own limits. So here are tokdiet's.
Subscription billing breaks the premise. Said once more because it's the most common way to get this wrong: on a flat plan, you can compact context all day and your bill doesn't move. The token meter still has value — it's a capacity and latency signal — but the dollar savings are a pay-per-token phenomenon. Know which one you are before you invest a sprint in it.
Compaction is not deletion, and that's the point, but it has a cost too. Paging cold context to local storage and pulling it back by id is recoverable by design. That's the right design. It also means the magic is bounded: the loss-free part is the deduplication, and the rest is a bet that what got paged out won't be needed, with a recovery path for when the bet is wrong. A 71% cut is not a 71% free lunch. It's mostly the same information moved off the expensive path, plus a small, measured amount of judgment about what was safe to set aside.
A benchmark is a starting line, not a guarantee. Sixty-six tasks with a quality result inside model noise is a strong, well-reported signal — and it's still someone else's tasks on someone else's repo. The number that matters is the one you measure on your workflow, which is the entire reason the label has a baseline row. We've written before about why a better benchmark doesn't automatically mean a better production outcome; this is that lesson pointed at cost instead of capability. Treat 71% as the reason to run your own measurement, not as your expected result.
Locality is the feature — don't trade it away. A proxy that sees every request between your agent and the model is sitting on the most sensitive stream you have. Running it on your own machine is the privacy-friendly version: your code and prompts don't detour through anyone else's service. That's only worth something if it stays true, so hold anything in this category to one bar — does it keep the data local, and can you see what it stores. A meter you can't audit is just a second place for the data to leak.
This isn't the first time the meter has been the real story. When MCP servers started quietly inflating token counts, the fix wasn't a clever model — it was being able to see the tax. And visible budget units, like the credit meter on agentic browsing tools, do more to change behavior than any after-the-fact invoice. A number you can watch in real time is a number you manage.
Run it on one workflow
You don't need to adopt anything to get the lesson. You need one workflow and one careful measurement.
Pick the coding-agent task you run most, or the one you suspect is the most wasteful — the daily refactor loop, the test-fixing grind, the agent that reads the whole repo before every small change. Fill in the label for that one, in this order. Billing mode first, because if you're on a flat plan the rest is curiosity, not savings. Then the baseline token count, which you can read straight off a metering tool before you change a single setting. Then name the repeated-context sources by hand — you usually already know what they are, because they're the files the agent keeps re-reading. Only now do you turn on something like tokdiet and read the governed number and, more importantly, the quality result. Recoverability class and a stop rule close it out: know whether cut context can come back, and decide in advance who rolls the change back and at what quality threshold.
Seven lines. If you can't fill in the quality result or the billing mode, that's not a paperwork gap — that's the part of the decision you were about to make blind.
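If it helps to keep the label next to the code it describes, the seven lines fit in a small record. This is one possible shape, not a standard: the class name, field names, and the "blank means `None`" convention are all assumptions made for illustration.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class ContextCalorieLabel:
    """One label per workflow; None means the line has not been filled in."""
    billing_mode: Optional[str] = None              # "per_token" or "flat"
    baseline_input_tokens: Optional[int] = None
    repeated_context_sources: Optional[list] = None
    governed_input_tokens: Optional[int] = None
    quality_result: Optional[str] = None
    recoverability_class: Optional[str] = None
    quality_brake: Optional[str] = None             # threshold and who rolls back

    def blanks(self):
        """Names of the lines that are still unfilled."""
        return [f.name for f in fields(self) if getattr(self, f.name) is None]

    def ready_to_cut(self):
        """True only when all seven lines are filled in."""
        return not self.blanks()


label = ContextCalorieLabel(billing_mode="per_token",
                            baseline_input_tokens=5_070_000)
print(label.blanks())        # the five lines still to measure
print(label.ready_to_cut())  # False
```

The fields are declared in the order the walkthrough fills them in, so `blanks()` doubles as a to-do list.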
Bring one workflow
The cost sneaks up on teams because it's so small per turn. A coding agent that re-sends the same context a few thousand tokens at a time looks productive the whole way, and the bill only comes into focus a month later, as a number with no story attached. The fix isn't a bigger context window or a cheaper model plan. It's the willingness to weigh the workflow before you cut it.
If you've got one coding-agent workflow that's getting expensive or noisy, bring us that one workflow and we'll map its context calorie label with you: baseline tokens, repeated payloads, the governed number, a quality guard, recoverability, billing mode, and a stop rule. You'll leave able to read what the agent actually spends — and to tell, before you change anything, whether cutting it will help. If you'd rather wire the measurement into a repeatable path, that's process automation; if you're sketching the broader guardrails around agents that act on their own, our AI workflow controls cover the same ground at the policy level.
Don't put the agent on a diet until you can read the calorie label. A smaller context is only a cheaper one when the meter, the quality result, and the bill all agree.
