Sebastian Aaltonen, a veteran game engine developer, posted a thread this weekend that deserves more attention than it got. In a week full of model announcements, he quietly dropped something more useful: a systematic breakdown of how he made his custom LLM runner over 10x faster in a single week -- without switching models or upgrading hardware.
The lever he pulled was not compute. It was architecture.
Specifically: the roundtrip problem.
What a Roundtrip Actually Costs You
Here is what the naive agentic loop looks like in most implementations:
- Send full prompt + history to the model
- Model responds with a tool call (JSON)
- Execute the tool
- Send the entire history again -- prompt + all prior tool calls + responses -- plus the new result
- Model responds with the next tool call
- Repeat
Every iteration ships the entire accumulated context back to the API. On iteration one that is a few hundred tokens. By iteration ten, you might be sending 20,000+ tokens per call just to communicate what already happened. And each of those calls carries network latency on top of inference latency.
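The loop above can be sketched with stubs standing in for the real model client and tool executor. All token counts here are illustrative; the point is that the history only grows, and every call re-ships it:

```python
# Sketch of the naive loop: the full history is re-sent on every call.
# call_model and run_tool are stand-ins, not a real API client.

def call_model(history_tokens):
    """Pretend the model always responds with one tool call (~50 tokens)."""
    return {"tool": "read_file", "tokens": 50}

def run_tool(call):
    """Pretend every tool returns a ~500-token response."""
    return {"result": "stub output", "tokens": 500}

def naive_loop(prompt_tokens, iterations):
    history_tokens = prompt_tokens
    total_sent = 0
    for _ in range(iterations):
        total_sent += history_tokens  # entire context re-sent each call
        call = call_model(history_tokens)
        result = run_tool(call)
        # History only grows: the call and its response are appended verbatim.
        history_tokens += call["tokens"] + result["tokens"]
    return total_sent

# Ten iterations starting from a 300-token prompt: each roundtrip re-ships
# everything accumulated so far, so total traffic grows quadratically.
print(naive_loop(300, 10))  # 27750
```

With these (made-up) numbers, ten iterations ship 27,750 tokens to communicate a task whose actual prompt was 300.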
Aaltonen frames the two optimization targets clearly:
"We have two optimization targets: minimize the amount of tokens we dump to the LLM and minimize the amount of roundtrips, since each roundtrip sends all tokens again. Roundtrips also add latency as you need to send data to server and wait for the LLM again. Thus roundtrips are the most important thing to optimize. But since each roundtrip sends all the tokens again, optimizing the number of tokens each tool call adds is massively important too. Both must be optimized."
This is the core tension. Tokens and roundtrips are multiplicative, not additive. A 50% reduction in tokens per call combined with a 50% reduction in roundtrip count does not save you 50% -- it saves you 75%.
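The arithmetic is worth spelling out, since cost scales with roundtrips times tokens per call:

```python
# The two reductions multiply: cost scales with roundtrips x tokens per call.
baseline = 10 * 5000                 # 10 roundtrips, 5,000 tokens each
optimized = (10 // 2) * (5000 // 2)  # halve both
savings = 1 - optimized / baseline
print(f"{savings:.0%}")              # 75%, not 50%
```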
Before: The Naive Pattern
Most tutorial-level agent implementations follow a single-tool-call-per-response pattern:
- LLM decides it needs to read a file: one roundtrip
- LLM decides it needs to check a directory: another roundtrip
- LLM decides it needs to search for a symbol: another roundtrip
- LLM has enough context to act: one final roundtrip
Four roundtrips. Each one carries the full growing context. If the first call added 500 tokens to history and the second added another 500, the third call is already shipping 1,000 tokens of accumulated overhead on top of your actual prompt.
At GPT-4o pricing, this is not just slow -- it is expensive. At o3 pricing, it is punishing.
After: What High-Performance Runners Do Differently
Batch tool calls per response. Modern model APIs -- including the OpenAI function calling spec -- support returning multiple tool calls in a single response. If the model can predict that it will need to read three files and query one database, it can issue all four calls simultaneously and get all four results in a single roundtrip. Most agent frameworks do not exploit this by default.
Compress the history. Instead of appending raw tool responses verbatim, summarize completed sub-tasks into compact state objects. "Read config.yaml: found DB connection string" is 8 tokens. The full YAML file is potentially thousands. The LLM only needs the result, not the transcript.
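One way this can look in practice, assuming each tool message carries a hand-written (or cheaply generated) summary alongside its raw output:

```python
# History compression sketch: once a sub-task is done, collapse the verbatim
# tool transcript into its one-line summary. The messages here are illustrative.

def compress_history(history):
    """Replace raw tool outputs with their compact summaries where available."""
    compact = []
    for msg in history:
        if msg["role"] == "tool" and "summary" in msg:
            compact.append({"role": "tool", "content": msg["summary"]})
        else:
            compact.append(msg)
    return compact

full_yaml = "db:\n  host: prod-db.internal\n  port: 5432\n" + "key: val\n" * 500
history = [
    {"role": "user", "content": "Set up the DB connection."},
    {"role": "tool", "content": full_yaml,
     "summary": "Read config.yaml: found DB connection string"},
]
compact = compress_history(history)
print(len(history[1]["content"]), "->", len(compact[1]["content"]))
```

The raw transcript shrinks from thousands of characters to one line; the model still gets the result it needs for the next decision.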
Structure tool outputs for minimal token footprint. Design your tool response schemas to return exactly what the next decision step needs -- no more. If your file-reading tool returns the entire file but the LLM only needs line counts and imports, you are paying for tokens it will never use.
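Using the file-reading example, the difference between the two schemas might look like this (both tool implementations are hypothetical):

```python
import ast

# If the next decision only needs line counts and top-level imports, return
# only those. read_file_full vs read_file_summary are hypothetical tools.

SOURCE = "import os\nimport json\n\ndef main():\n    pass\n"

def read_file_full(source=SOURCE):
    return {"contents": source}  # every token here is billed on the next call

def read_file_summary(source=SOURCE):
    tree = ast.parse(source)
    imports = [n.names[0].name for n in ast.walk(tree)
               if isinstance(n, ast.Import)]  # plain `import x` statements only
    return {"lines": len(source.splitlines()), "imports": imports}

print(read_file_summary())  # {'lines': 5, 'imports': ['os', 'json']}
```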
Use state machines instead of open-ended history. For bounded workflows (form processing, data extraction, code review), structured state objects can replace the full message history entirely. The model receives current state, not a transcript of how it got there. This caps context growth regardless of task complexity.
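A toy version of the idea, with a made-up code-review workflow: the model is prompted from a fixed-size state object, so context stays bounded no matter how many steps preceded the current one.

```python
from dataclasses import dataclass, field

# State-machine sketch: the model sees current state, not the transcript.
# The stages and fields are invented for a hypothetical code-review flow.

@dataclass
class ReviewState:
    stage: str = "collect_diff"  # collect_diff -> analyze -> summarize
    files_remaining: list = field(default_factory=list)
    findings: list = field(default_factory=list)

def render_prompt(state):
    """What the model receives: bounded size regardless of how we got here."""
    return (f"stage={state.stage} "
            f"files_left={len(state.files_remaining)} "
            f"findings={len(state.findings)}")

state = ReviewState(files_remaining=["a.py", "b.py"], findings=["unused import"])
print(render_prompt(state))  # stage=collect_diff files_left=2 findings=1
```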
The Real-World Impact
Aaltonen reports gains that compound: fewer roundtrips mean less time waiting, and less time waiting means faster feedback on whether your tool design is even working. Debugging a 10-roundtrip workflow is an order of magnitude harder than debugging a 2-roundtrip one.
For production deployments, the implications extend beyond speed:
- Cost ceiling: A naive 10-roundtrip workflow at 5,000 tokens per call burns 50,000 tokens per task. An optimized 3-roundtrip workflow at 1,500 tokens per call burns 4,500 tokens. Same outcome, 11x cheaper.
- Reliability: Every roundtrip is a failure point. Network errors, rate limits, and context window overflows all become less likely when you send fewer, smaller requests.
- Predictability: Compact history means more consistent behavior. Long histories tend to cause the model to fixate on earlier decisions or lose track of the current goal.
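The cost-ceiling numbers above check out with a back-of-envelope calculator (the per-token price below is a placeholder, not any provider's actual rate):

```python
# Back-of-envelope task cost: roundtrips * tokens_per_call * price_per_token.
PRICE_PER_1K_INPUT = 0.005  # hypothetical rate, dollars per 1K input tokens

def task_cost(roundtrips, tokens_per_call):
    return roundtrips * tokens_per_call / 1000 * PRICE_PER_1K_INPUT

naive = task_cost(10, 5000)     # 50,000 tokens per task
optimized = task_cost(3, 1500)  # 4,500 tokens per task
print(f"{naive / optimized:.1f}x cheaper")  # 11.1x cheaper
```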
A 30-Minute Audit You Can Run Today
If you have an agentic pipeline in production or staging, here is a fast way to quantify your roundtrip overhead:
- Add logging to capture (call_number, prompt_token_count, completion_token_count) for every model call in a sample task
- Chart prompt token count by call number -- a linear increase signals naive history appending
- Count how many tool calls appear in each completion -- an average below 1.5 means you are leaving batching gains on the table
- Identify your three highest-token tool responses and ask: does the model actually use all of that in the next step, or just a summary?
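Once the log rows exist, the growth check is a few lines. A sketch, with made-up log data in the (call_number, prompt_tokens, completion_tokens) shape described above:

```python
# Audit sketch: a steadily rising prompt count across calls signals naive
# history appending. The rows below are invented sample data.

def audit(rows):
    """rows: (call_number, prompt_tokens, completion_tokens) per model call."""
    deltas = [b[1] - a[1] for a, b in zip(rows, rows[1:])]
    return {"prompt_deltas": deltas,
            "linear_growth": all(d > 0 for d in deltas)}

rows = [(1, 300, 60), (2, 860, 55), (3, 1415, 70), (4, 1985, 40)]
result = audit(rows)
print(result["linear_growth"])  # True: every call ships more than the last
```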
The pattern that surfaces in almost every audit: the first two calls are efficient, and everything after call three is paying compound interest on earlier verbosity.
Framework Support Varies Widely
It is worth noting that roundtrip optimization is not uniformly supported across popular frameworks:
- LangChain / LangGraph: Multi-tool-call batching requires explicit configuration; default behavior is sequential
- AutoGen: Supports parallel tool calls in group chat patterns but not single-agent flows by default
- OpenAI Assistants API: Tool batching is supported in the function calling spec but depends on model behavior
- Custom runners (like Aaltonen's): Full control, full responsibility
If you are using a managed framework, check whether it is requesting parallel tool calls in the API payload (parallel_tool_calls: true in the OpenAI spec). If it is not, you are almost certainly leaving performance on the table.
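Concretely, this is what to look for in the outgoing request. The sketch below builds a Chat Completions payload dict without sending it; the tool definition is a placeholder:

```python
# The OpenAI Chat Completions payload accepts a parallel_tool_calls flag.
# This builds the request dict for inspection; nothing is sent anywhere.

def build_payload(model, messages, tools):
    return {
        "model": model,
        "messages": messages,
        "tools": tools,
        "parallel_tool_calls": True,  # allow several tool calls per response
    }

payload = build_payload(
    "gpt-4o",
    [{"role": "user", "content": "Summarize the repo."}],
    [{"type": "function",
      "function": {"name": "read_file",
                   "parameters": {"type": "object",
                                  "properties": {"path": {"type": "string"}}}}}],
)
print(payload["parallel_tool_calls"])  # True
```

If your framework's request logs show this key missing or set to false, the model cannot batch even when it wants to.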
The Stack Question
None of this requires a new model. The gains Aaltonen achieved came from rethinking the communication layer between the model and the tools it calls -- the part most developers treat as boilerplate.
For teams evaluating whether to build custom runners versus using managed frameworks: this is one of the strongest arguments for the custom path. Frameworks optimize for developer experience. Custom runners can optimize for token efficiency and roundtrip minimization. The gap is measurable and, at scale, financially significant.
Related: How AI Agents Are Reshaping Workflow Orchestration for Small Business -- and The 3 Endpoint Decisions That Change Agent Rollouts for more on structuring agentic deployment decisions.
