Sebastian Aaltonen, a veteran game engine developer, posted a thread this weekend that deserves more attention than it got. In a week full of model announcements, he quietly dropped something more useful: a systematic breakdown of how he made his custom LLM runner over 10x faster in a single week -- without switching models or upgrading hardware.
The lever he pulled was not compute. It was architecture.
Specifically: the roundtrip problem.
What a Roundtrip Actually Costs You
Here is what the naive agentic loop looks like in most implementations:
- Send full prompt + history to the model
- Model responds with a tool call (JSON)
- Execute the tool
- Send the entire history again -- prompt + all prior tool calls + responses -- plus the new result
- Model responds with the next tool call
- Repeat
Every iteration ships the entire accumulated context back to the API. On iteration one that is a few hundred tokens. By iteration ten, you might be sending 20,000+ tokens per call just to communicate what already happened. And each of those calls carries network latency on top of inference latency.
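The loop above can be sketched with stubs standing in for the real model client and tool executor. All token counts here are illustrative; the point is that the history only grows, and every call re-ships it:

```python
# Sketch of the naive loop: the full history is re-sent on every call.
# call_model and run_tool are stand-ins, not a real API client.

def call_model(history_tokens):
    """Pretend the model always responds with one tool call (~50 tokens)."""
    return {"tool": "read_file", "tokens": 50}

def run_tool(call):
    """Pretend every tool returns a ~500-token response."""
    return {"result": "stub output", "tokens": 500}

def naive_loop(prompt_tokens, iterations):
    history_tokens = prompt_tokens
    total_sent = 0
    for _ in range(iterations):
        total_sent += history_tokens  # entire context re-sent each call
        call = call_model(history_tokens)
        result = run_tool(call)
        # History only grows: the call and its response are appended verbatim.
        history_tokens += call["tokens"] + result["tokens"]
    return total_sent

# Ten iterations starting from a 300-token prompt: each roundtrip re-ships
# everything accumulated so far, so total traffic grows quadratically.
print(naive_loop(300, 10))  # 27750
```

With these (made-up) numbers, ten iterations ship 27,750 tokens to communicate a task whose actual prompt was 300.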
Aaltonen frames the two optimization targets clearly:
"We have two optimization targets: minimize the amount of tokens we dump to the LLM and minimize the amount of roundtrips, since each roundtrip sends all tokens again. Roundtrips also add latency as you need to send data to server and wait for the LLM again. Thus roundtrips are the most important thing to optimize. But since each roundtrip sends all the tokens again, optimizing the number of tokens each tool call adds is massively important too. Both must be optimized."
This is the core tension. Tokens and roundtrips are multiplicative, not additive. A 50% reduction in tokens per call combined with a 50% reduction in roundtrip count does not save you 50% -- it saves you 75%.
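The arithmetic is worth spelling out, since cost scales with roundtrips times tokens per call:

```python
# The two reductions multiply: cost scales with roundtrips x tokens per call.
baseline = 10 * 5000                 # 10 roundtrips, 5,000 tokens each
optimized = (10 // 2) * (5000 // 2)  # halve both
savings = 1 - optimized / baseline
print(f"{savings:.0%}")              # 75%, not 50%
```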
Before: The Naive Pattern
Most tutorial-level agent implementations follow a single-tool-call-per-response pattern:
- LLM decides it needs to read a file: one roundtrip
- LLM decides it needs to check a directory: another roundtrip
- LLM decides it needs to search for a symbol: another roundtrip
- LLM has enough context to act: one final roundtrip
Four roundtrips. Each one carries the full growing context. If the first call added 500 tokens to history and the second added another 500, the third call is already shipping 1,000 tokens of accumulated overhead on top of your actual prompt.
At GPT-4o pricing, this is not just slow -- it is expensive. At o3 pricing, it is punishing.
After: What High-Performance Runners Do Differently
Batch tool calls per response. Modern model APIs -- including the OpenAI function calling spec -- support returning multiple tool calls in a single response. If the model can predict that it will need to read three files and query one database, it can issue all four calls simultaneously and get all four results in a single roundtrip. Most agent frameworks do not exploit this by default.
Compress the history. Instead of appending raw tool responses verbatim, summarize completed sub-tasks into compact state objects. "Read config.yaml: found DB connection string" is 8 tokens. The full YAML file is potentially thousands. The LLM only needs the result, not the transcript.
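One way this can look in practice, assuming each tool message carries a hand-written (or cheaply generated) summary alongside its raw output:

```python
# History compression sketch: once a sub-task is done, collapse the verbatim
# tool transcript into its one-line summary. The messages here are illustrative.

def compress_history(history):
    """Replace raw tool outputs with their compact summaries where available."""
    compact = []
    for msg in history:
        if msg["role"] == "tool" and "summary" in msg:
            compact.append({"role": "tool", "content": msg["summary"]})
        else:
            compact.append(msg)
    return compact

full_yaml = "db:\n  host: prod-db.internal\n  port: 5432\n" + "key: val\n" * 500
history = [
    {"role": "user", "content": "Set up the DB connection."},
    {"role": "tool", "content": full_yaml,
     "summary": "Read config.yaml: found DB connection string"},
]
compact = compress_history(history)
print(len(history[1]["content"]), "->", len(compact[1]["content"]))
```

The raw transcript shrinks from thousands of characters to one line; the model still gets the result it needs for the next decision.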
Structure tool outputs for minimal token footprint. Design your tool response schemas to return exactly what the next decision step needs -- no more. If your file-reading tool returns the entire file but the LLM only needs line counts and imports, you are paying for tokens it will never use.
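Using the file-reading example, the difference between the two schemas might look like this (both tool implementations are hypothetical):

```python
import ast

# If the next decision only needs line counts and top-level imports, return
# only those. read_file_full vs read_file_summary are hypothetical tools.

SOURCE = "import os\nimport json\n\ndef main():\n    pass\n"

def read_file_full(source=SOURCE):
    return {"contents": source}  # every token here is billed on the next call

def read_file_summary(source=SOURCE):
    tree = ast.parse(source)
    imports = [n.names[0].name for n in ast.walk(tree)
               if isinstance(n, ast.Import)]  # plain `import x` statements only
    return {"lines": len(source.splitlines()), "imports": imports}

print(read_file_summary())  # {'lines': 5, 'imports': ['os', 'json']}
```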
Use state machines instead of open-ended history. For bounded workflows (form processing, data extraction, code review), structured state objects can replace the full message history entirely. The model receives current state, not a transcript of how it got there. This caps context growth regardless of task complexity.
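A toy version of the idea, with a made-up code-review workflow: the model is prompted from a fixed-size state object, so context stays bounded no matter how many steps preceded the current one.

```python
from dataclasses import dataclass, field

# State-machine sketch: the model sees current state, not the transcript.
# The stages and fields are invented for a hypothetical code-review flow.

@dataclass
class ReviewState:
    stage: str = "collect_diff"  # collect_diff -> analyze -> summarize
    files_remaining: list = field(default_factory=list)
    findings: list = field(default_factory=list)

def render_prompt(state):
    """What the model receives: bounded size regardless of how we got here."""
    return (f"stage={state.stage} "
            f"files_left={len(state.files_remaining)} "
            f"findings={len(state.findings)}")

state = ReviewState(files_remaining=["a.py", "b.py"], findings=["unused import"])
print(render_prompt(state))  # stage=collect_diff files_left=2 findings=1
```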
The Real-World Impact
Aaltonen reports gains that compound: fewer roundtrips mean less time waiting, and less time waiting means faster feedback on whether your tool design is even working. Debugging a 10-roundtrip workflow is an order of magnitude harder than debugging a 2-roundtrip one.
For production deployments, the implications extend beyond speed:
- Cost ceiling: A naive 10-roundtrip workflow at 5,000 tokens per call burns 50,000 tokens per task. An optimized 3-roundtrip workflow at 1,500 tokens per call burns 4,500 tokens. Same outcome, 11x cheaper.
- Reliability: Every roundtrip is a failure point. Network errors, rate limits, and context window overflows all become less likely when you send fewer, smaller requests.
- Predictability: Compact history means more consistent behavior. Long histories tend to cause the model to fixate on earlier decisions or lose track of the current goal.
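The cost-ceiling numbers above check out with a back-of-envelope calculator (the per-token price below is a placeholder, not any provider's actual rate):

```python
# Back-of-envelope task cost: roundtrips * tokens_per_call * price_per_token.
PRICE_PER_1K_INPUT = 0.005  # hypothetical rate, dollars per 1K input tokens

def task_cost(roundtrips, tokens_per_call):
    return roundtrips * tokens_per_call / 1000 * PRICE_PER_1K_INPUT

naive = task_cost(10, 5000)     # 50,000 tokens per task
optimized = task_cost(3, 1500)  # 4,500 tokens per task
print(f"{naive / optimized:.1f}x cheaper")  # 11.1x cheaper
```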
A 30-Minute Audit You Can Run Today
If you have an agentic pipeline in production or staging, here is a fast way to quantify your roundtrip overhead:
- Add logging to capture (call_number, prompt_token_count, completion_token_count) for every model call in a sample task
- Chart prompt token count by call number -- a linear increase signals naive history appending
- Count how many tool calls appear in each completion -- an average below 1.5 means you are leaving batching gains on the table
- Identify your three highest-token tool responses and ask: does the model actually use all of that in the next step, or just a summary?
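Once the log rows exist, the growth check is a few lines. A sketch, with made-up log data in the (call_number, prompt_tokens, completion_tokens) shape described above:

```python
# Audit sketch: a steadily rising prompt count across calls signals naive
# history appending. The rows below are invented sample data.

def audit(rows):
    """rows: (call_number, prompt_tokens, completion_tokens) per model call."""
    deltas = [b[1] - a[1] for a, b in zip(rows, rows[1:])]
    return {"prompt_deltas": deltas,
            "linear_growth": all(d > 0 for d in deltas)}

rows = [(1, 300, 60), (2, 860, 55), (3, 1415, 70), (4, 1985, 40)]
result = audit(rows)
print(result["linear_growth"])  # True: every call ships more than the last
```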
The pattern that surfaces in almost every audit: the first two calls are efficient, and everything after call three is paying compound interest on earlier verbosity.
Framework Support Varies Widely
It is worth noting that roundtrip optimization is not uniformly supported across popular frameworks:
- LangChain / LangGraph: Multi-tool-call batching requires explicit configuration; default behavior is sequential
- AutoGen: Supports parallel tool calls in group chat patterns but not single-agent flows by default
- OpenAI Assistants API: Tool batching is supported in the function calling spec but depends on model behavior
- Custom runners (like Aaltonen's): Full control, full responsibility
If you are using a managed framework, check whether it is requesting parallel tool calls in the API payload (parallel_tool_calls: true in the OpenAI spec). If it is not, you are almost certainly leaving performance on the table.
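Concretely, this is what to look for in the outgoing request. The sketch below builds a Chat Completions payload dict without sending it; the tool definition is a placeholder:

```python
# The OpenAI Chat Completions payload accepts a parallel_tool_calls flag.
# This builds the request dict for inspection; nothing is sent anywhere.

def build_payload(model, messages, tools):
    return {
        "model": model,
        "messages": messages,
        "tools": tools,
        "parallel_tool_calls": True,  # allow several tool calls per response
    }

payload = build_payload(
    "gpt-4o",
    [{"role": "user", "content": "Summarize the repo."}],
    [{"type": "function",
      "function": {"name": "read_file",
                   "parameters": {"type": "object",
                                  "properties": {"path": {"type": "string"}}}}}],
)
print(payload["parallel_tool_calls"])  # True
```

If your framework's request logs show this key missing or set to false, the model cannot batch even when it wants to.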
The Stack Question
None of this requires a new model. The gains Aaltonen achieved came from rethinking the communication layer between the model and the tools it calls -- the part most developers treat as boilerplate.
For teams evaluating whether to build custom runners versus using managed frameworks: this is one of the strongest arguments for the custom path. Frameworks optimize for developer experience. Custom runners can optimize for token efficiency and roundtrip minimization. The gap is measurable and, at scale, financially significant.
Related: How AI Agents Are Reshaping Workflow Orchestration for Small Business -- and The 3 Endpoint Decisions That Change Agent Rollouts for more on structuring agentic deployment decisions.
