Scalekit ran 75 head-to-head tasks — same model (Claude Sonnet 4), same prompts, same operations — comparing MCP servers against CLI. The simplest task: check a repo's primary language. CLI cost: 1,365 tokens. MCP cost: 44,026 tokens. The entire overhead came from injecting all 43 tool definitions into the context window before the agent could do anything at all.
That ratio echoes through everything else that happened in AI today. NVIDIA launched a CPU. Mistral shipped a formal-proof model. Meta's newly acquired agent social network updated its terms. Each story is different, but all of them point at the same friction: agent infrastructure that gets expensive fast once it leaves the demo and touches real work.
The benchmark the AI press skipped
Apideck published the Scalekit numbers alongside their own tool-call audit. Across the 75 tasks, MCP consumed 4 to 32× more tokens than the CLI for identical operations. One team cited in the post connected three MCP servers (GitHub, Slack, Sentry) and burned 55,000 tokens on tool definitions before a single user message hit the model: more than a quarter of Claude's 200k-token context limit spent on schema declarations.
David Zhang at Duet described the trilemma clearly: load all tools up front and lose working memory; limit integrations and lose coverage; or build dynamic tool loading and add latency and middleware. Zhang ripped MCP out of the Duet stack, even after getting OAuth working, because no amount of tuning changed the underlying math.
The Apideck post proposes a CLI alternative that compresses tool schemas into about 2,000 tokens total regardless of API surface. Whether that specific approach becomes standard or not, the problem it is naming is real: the current MCP default is a context-burn pattern that scales badly once you connect more than a couple of services.
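To make the compression idea concrete, here is a toy illustration (not Apideck's actual wire format, and the tool name and schema are invented for this sketch): the same capability expressed as a verbose JSON tool schema versus a terse CLI-style signature line.

```python
import json

# Hypothetical tool definition in the verbose JSON-schema style that
# MCP servers typically inject into the context window.
json_style = json.dumps({
    "name": "slack_post_message",
    "description": "Post a message to a Slack channel.",
    "inputSchema": {
        "type": "object",
        "properties": {
            "channel": {"type": "string", "description": "Channel ID"},
            "text": {"type": "string", "description": "Message body"},
        },
        "required": ["channel", "text"],
    },
})

# The same capability as a one-line CLI-style signature.
cli_style = "slack post-message <channel> <text>   # post to a Slack channel"

ratio = len(json_style) / len(cli_style)
print(f"JSON schema: {len(json_style)} chars; CLI line: {len(cli_style)} chars "
      f"(~{ratio:.1f}x smaller)")
```

Multiply that per-tool savings across dozens of tools and the schema footprint drops from tens of thousands of tokens to the roughly 2,000 the post describes.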
The practical implication for anyone building agent workflows is auditing tool definition weight before production. If your agent is tool-rich and context-poor before the first user message, the orchestration overhead is already eating the budget that was supposed to go to reasoning.
NVIDIA's Vera CPU is the hardware answer to the same problem
Announced at GTC today, the NVIDIA Vera CPU is the first processor NVIDIA has positioned explicitly for agentic inference rather than model training. The key numbers, all from NVIDIA's own announcement: 88 custom Olympus cores, a claimed 2× the efficiency of traditional rack-scale CPUs at 50% faster performance, and a new rack configuration sustaining 22,500 concurrent CPU environments, each running independently at full performance.
The Vera CPU pairs with NVIDIA GPUs via NVLink-C2C at 1.8 TB/s of coherent bandwidth — 7× PCIe Gen 6. Customers confirmed for Vera deployment include Alibaba, ByteDance, CoreWeave, Meta, and Oracle Cloud Infrastructure.
The subtext matters here. NVIDIA is not building Vera for general-purpose computing. It is building it because the workload that matters for AI is increasingly not training a large model but orchestrating thousands of simultaneous agent tasks — tool calls, context lookups, plan validation, code execution. That is exactly the workload the MCP context-burn problem affects most. More concurrent agent environments per rack means lower cost per agent-task, which changes the unit economics of products built on agentic loops.
For anyone evaluating cloud provider infrastructure choices over the next 12 months, Vera's adoption list (including Oracle Cloud Infrastructure alongside the obvious hyperscalers) is the signal worth tracking. Inference capacity that costs less per agent call is what makes context-heavy orchestration viable at production scale.
Mistral Leanstral: a proof agent for $36 versus $549
Mistral released Leanstral, a 6B active parameter formal proof agent built for Lean 4 under Apache 2.0. On FLTEval — a benchmark grounded in completing proofs in actual pull requests to the Fermat's Last Theorem formalization project, not isolated competition math — Leanstral at pass@2 scored 26.3, beating Claude Sonnet 4.6 (23.7) by 2.6 points. Cost to achieve that: $36. Sonnet's cost for the same run: $549.
This is not a story about replacing frontier models across the board. Opus 4.6 still leads FLTEval at 39.6. But the 15× price gap for a task where Leanstral outperforms is a real procurement argument for specialized narrow agents.
The structural point is the same as the MCP token tax: the cost structure of AI is being repriced by narrow, task-optimized systems. A 6B model that formally verifies code against specs is more useful for one workflow than a 120B+ model that does everything adequately. As agent pipelines mature, the question is not "which model is best overall" but "which model costs the least to complete this specific loop reliably." Leanstral is Mistral's first public answer to that question in a formal verification context.
Leanstral is available via Mistral's API and as open weights, with MCP support through lean-lsp-mcp.
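For readers unfamiliar with the workload, a Lean 4 proof is a machine-checked artifact, not free-form text. The toy lemma below is vastly simpler than the FLT-formalization goals Leanstral is benchmarked on, but it is the same kind of object: a stated proposition plus a proof term the compiler verifies.

```lean
-- A trivial Lean 4 lemma: commutativity of natural-number addition,
-- closed by the standard-library lemma Nat.add_comm.
theorem add_comm_example (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b
```

FLTEval's pull-request-grounded goals are far harder, but the pass/fail criterion is the same: the proof either type-checks or it does not, which is what makes a cheap specialized model's output trustworthy.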
Moltbook's new ToS: the agent liability question, in writing
Meta acquired Moltbook — a platform positioning itself as a social network for AI agents — just days before Moltbook updated its terms of service to state that users are "solely responsible" for actions taken by their AI agents, "whether they act autonomously or otherwise, and irrespective of whether such actions or omissions were intended."
Age requirement: 13+. Liability for agent behavior: 100% on the user.
The liability framing is worth noting because it is the same direction the rest of the agent ecosystem is heading. Platforms are not willing to own what agents do on their behalf. That means every operator running agents against external systems — web platforms, APIs, tools — is accumulating liability that does not yet have standard coverage. The ToS clause is not unique to Moltbook; it is the clause that will appear in virtually every platform agreement as agents get normalized.
The through-line today was not benchmark theater or keynote polish. It was infrastructure overhead: context burn from MCP schemas, hardware cost per concurrent agent, per-token price for formal verification, liability exposure per autonomous action. The vendors who answer those four questions precisely, and cheaply, are the ones worth watching in the second half of 2026.
