
JetBrains Mellum2 shows where agent costs are really going

JetBrains' Mellum2 release is a useful signal for teams building AI workflows: stop treating model choice as one default setting and start routing each step to the smallest model that can pass its receipt.

Sean McLellan

Lead Architect & Founder

June 5, 2026 · 6 min read

A support agent has five jobs before a customer ever sees a reply.

Classify the ticket. Fetch the right policy. Summarize the account history. Draft the response. Decide whether the refund needs approval.

Most teams make the same expensive mistake at the start: they send all five steps to the same large model.

That feels simpler. One model. One instruction style. One vendor bill.

It also means the workflow pays frontier-model prices for chores that do not need frontier-model judgment. The classifier does not need literary nuance. The policy lookup does not need deep reasoning. The context summary needs to be accurate, short, and fast. The refund decision might need a stronger model, a human approval step, or both.

That is the cost problem hiding inside agentic workflows. The bill does not come from one dramatic request. It comes from the repeated middle steps that run hundreds or thousands of times.

JetBrains' Mellum2 release is interesting for that reason. The headline is not "JetBrains launched another model." The useful signal is that production AI is moving toward routed systems: smaller, cheaper, specialized models handling frequent steps, with larger models reserved for the parts that actually need them.

What JetBrains actually released

On June 2, 2026, JetBrains announced that Mellum2 is open source. It is released under Apache 2.0 and built for software engineering workflows.

The numbers matter. Mellum2 is a Mixture-of-Experts (MoE) model with 12B total parameters, but only 2.5B are active per token. JetBrains says the MoE design reduces compute cost while supporting high-throughput, low-latency inference.

JetBrains describes Mellum2 as useful for routing, Q&A, sub-agents, private AI use, summarization, intermediate reasoning, low-latency RAG, and local or self-hosted deployment.

The Mellum2 technical report, submitted to arXiv on May 29, gives the same core shape: an open-weight 12B MoE model with 2.5B active parameters per token, specialized for software engineering.

The report lists the tasks it targets: code generation and editing, debugging, multi-step reasoning, tool use and function calling, agentic coding, and conversational programming assistance.

The Hugging Face Mellum2 collection shows a model family rather than a single artifact, including Thinking, Instruct, SFT, and Base variants.

That family shape matters. Production teams rarely need one universal model. They need components that can be matched to the job.

JetBrains puts the point plainly in the announcement: the future belongs to coordinated systems, not single models.

For operators, that is the line worth keeping.

Why focal models matter

A focal model has a narrow job.

It does not need to win every leaderboard. It needs to perform one part of your workflow reliably, quickly, and cheaply enough that you can afford to run it often.

That changes the design conversation.

If your workflow handles 5,000 inbound requests a month, and each request triggers six AI calls, you are not buying "an AI model." You are operating 30,000 model decisions a month.

Some are cheap classification calls. Some are retrieval checks. Some are summaries. Some are drafts. A few are risky decisions.

Treating all of those as the same kind of intelligence is lazy architecture.

A frontier model may still be the right choice for high-ambiguity work: messy negotiation, strategic synthesis, sensitive customer communication, novel code review, or a decision that could cost real money.

But the middle of the workflow is different. It is full of repeatable tasks with clear inputs and checkable outputs.

That is where a model like Mellum2 points. It is aimed at places where latency, throughput, privacy, and cost become operational constraints.

Agent tooling is moving the same way. The OpenAI Agents SDK describes agents alongside handoffs, guardrails, tracing, sessions, and human-in-the-loop patterns. The Model Context Protocol intro describes MCP as an open-source standard for connecting AI applications to external systems.

Those are not just developer conveniences. They are signs that AI work is becoming orchestration work.

The model is one part of the system. The route, tool contract, receipt, and approval boundary matter just as much.

Start with the workflow, then pick the model

The wrong question is: "Which model should our business use?"

A better question is: "What does each step need to prove before the next step runs?"

A ticket classifier needs a label and confidence score. A retrieval step needs source IDs. A summarizer needs to preserve specific facts and omit anything unsupported. A draft writer needs tone, policy compliance, and a citation trail. A refund decision needs a threshold, an approval rule, or a human reviewer.

Each step should have a receipt.

We wrote about this in agent evals should test workflow receipts: the final answer is not enough. If an agent gets to a polished response through a broken chain, the workflow is still unsafe.
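A receipt can be as small as a typed record plus a check. The sketch below is one hypothetical shape for the classifier's receipt. The field names, the label set, and the 0.8 confidence floor are illustrative assumptions, not values from any library or from the Mellum2 release.

```python
from dataclasses import dataclass

# Hypothetical label set and confidence floor, chosen only for illustration.
ALLOWED_LABELS = {"billing", "refund", "technical", "other"}
MIN_CONFIDENCE = 0.8


@dataclass(frozen=True)
class ClassificationReceipt:
    label: str
    confidence: float


def passes(receipt: ClassificationReceipt) -> bool:
    """Return True only if the label is known and the confidence clears the floor."""
    return receipt.label in ALLOWED_LABELS and receipt.confidence >= MIN_CONFIDENCE


def next_action(receipt: ClassificationReceipt) -> str:
    """Continue the workflow on a passing receipt, otherwise take the fallback path."""
    return "continue" if passes(receipt) else "fallback"
```

The point of the shape is that the next step never has to trust the classifier's prose. It only reads a label, a number, and a pass or fail.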

Receipts make routing possible.

Once you know what a step must prove, you can choose the smallest model that can pass that receipt. If a lightweight local model can classify 98% of tickets correctly under your eval, that step should not be routed to a premium frontier model. If a summarizer preserves key facts and stays inside the retrieved context, it does not need to be creative.

This is where benchmark thinking can mislead teams. A model can lead a public benchmark and still be the wrong fit for a production step if it is too slow, too expensive, weak at tool contracts, or hard to roll out safely. That is the argument in why better benchmarks can produce worse production outcomes.

Production value comes from fit, not rank.
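Choosing the smallest model that passes a receipt can be sketched as a simple search from cheapest to most expensive. Everything here is an assumption for illustration: the candidate names, the relative costs, and the `pass_rate` callable, which stands in for whatever eval your team already runs.

```python
from typing import Callable, Optional, Sequence

# Each candidate is (name, relative_cost). Names and costs are placeholders.
Candidate = tuple[str, float]


def smallest_passing(
    candidates: Sequence[Candidate],
    pass_rate: Callable[[str], float],
    required: float,
) -> Optional[str]:
    """Return the cheapest candidate whose eval pass rate meets the bar, else None."""
    for name, _cost in sorted(candidates, key=lambda c: c[1]):
        if pass_rate(name) >= required:
            return name
    return None
```

A `None` result is useful information too. It says no candidate passes the receipt, so the step design or the bar needs another look before any model is chosen.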

A routed workflow example

Here is the support workflow again, but with model choice treated as an operational decision instead of a default setting.

Workflow step	What it needs	Model class	Receipt to check
Classify ticket	Fast intent and urgency label	Small classifier or specialized model	Label, confidence, fallback if confidence is low
Fetch policy	Correct source retrieval	Retrieval system plus small reranker	Policy IDs, timestamps, source links
Summarize context	Short account history with no invented facts	Small or mid-sized summarization model	Cited facts, omitted unsupported claims
Draft response	Clear customer-facing language	Mid-sized instruction model, with frontier escalation for sensitive cases	Policy match, tone check, source trace
Decide approval	Risk threshold and business rule	Rule, human, or frontier model when ambiguity is high	Approval flag, reason, reviewer handoff if needed

Figure: a routed AI workflow with small specialized model loops feeding a guarded frontier-model decision point. Agent workflows need routing decisions, not one default model for every step.

The table is intentionally plain. Good agent architecture often looks less like magic and more like operations design.

Inputs. Outputs. Receipts. Budgets. Escalation paths.

The payoff is not just lower spend. It is more control.

A routed system can say: this low-risk step runs locally. This step can use a cheap hosted model. This step needs a larger model because the customer is upset and the refund amount is high. This step cannot proceed without approval.

That is a better business system than one oversized instruction trying to do everything.
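Those routing statements can be written down as plain rules. This is a minimal sketch, not a reference design: the step names, tier names, and the 200-dollar cutoff for a "high" refund are all hypothetical.

```python
from dataclasses import dataclass

# Hypothetical cutoff for a "high" refund; a real value would come from policy.
HIGH_REFUND_USD = 200.0


@dataclass(frozen=True)
class Ticket:
    customer_upset: bool
    refund_usd: float


@dataclass(frozen=True)
class Route:
    tier: str             # "local", "cheap-hosted", "mid-sized", "frontier", or "rule"
    needs_approval: bool  # True means the step cannot proceed without a reviewer


def route(step: str, ticket: Ticket) -> Route:
    """Map a workflow step and a ticket to a model tier and an approval flag."""
    high_refund = ticket.refund_usd > HIGH_REFUND_USD
    if step == "classify":
        return Route("local", False)
    if step == "summarize":
        return Route("cheap-hosted", False)
    if step == "draft":
        tier = "frontier" if ticket.customer_upset and high_refund else "mid-sized"
        return Route(tier, False)
    if step == "decide_refund":
        return Route("rule", high_refund)
    raise ValueError(f"unknown step: {step}")
```

Keeping the router as ordinary code means the escalation rule can be reviewed, tested, and changed without touching a prompt.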

Where operators should tighten the system

Start by writing down every AI call the process might make.

Include the hidden ones: classification, routing, query rewriting, summarization, source ranking, draft revision, policy checks, and final review.

Then look at frequency.

The call that runs on every ticket deserves a different budget than the call that runs only on escalations. A classifier might need to return almost instantly. A final draft can tolerate more latency. A high-stakes approval decision can cost more because it runs less often.

Next, separate the steps by failure mode.

A bad classification wastes time. A bad summary can mislead the draft. A bad policy decision can give away money, violate a customer promise, or create compliance exposure.

Those are not the same problem. They should not have the same model, eval, or approval path.

Receipts make the work inspectable. A summarizer passes if every claim can be traced to retrieved context. A code-editing sub-agent passes if it produces a diff, test result, and explanation of touched files. A policy decision passes if it cites the policy section and flags uncertainty.

This connects directly to responsible AI governance. Receipts are not paperwork for their own sake. They give teams a way to inspect decisions, set approval boundaries, and avoid pretending the final answer tells the whole story.
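The summarizer receipt is the easiest one to make mechanical. The sketch below assumes the summarizer returns each claim with the ID of the passage it cites. That output shape is an assumption, and the check only confirms the citation points at retrieved context, not that the passage actually supports the claim.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Claim:
    text: str
    source_id: Optional[str]  # ID of the retrieved passage the claim cites, if any


def untraceable(claims: list[Claim], retrieved_ids: set[str]) -> list[Claim]:
    """Return claims that cite nothing or cite a source that was not retrieved."""
    return [c for c in claims if c.source_id not in retrieved_ids]


def summary_passes(claims: list[Claim], retrieved_ids: set[str]) -> bool:
    """The receipt passes only when every claim traces to retrieved context."""
    return not untraceable(claims, retrieved_ids)
```

Returning the failing claims, not just a boolean, gives a reviewer something concrete to inspect when the receipt fails.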

If your team needs a starting point, BaristaLabs has a Responsible AI resource hub and an AI approval policy template.

Privacy belongs in the same conversation.

Mellum2 is notable because JetBrains explicitly talks about private AI use, local deployment, and self-hosting. That matters for software teams handling proprietary code, internal policies, customer data, or regulated workflows.

Open-source models do not remove governance work. They do give teams more deployment options.

That connects to the broader open-source cost argument in Hugging Face training costs and open source AI in 2026. Open models are becoming practical economic infrastructure, not just research artifacts.

The cost curve is in the middle

Most AI cost conversations focus on the obvious premium moment: the large model writing the final answer.

That is often the wrong place to look first.

In agent workflows, the repeated middle steps can dominate the bill. The classifier runs every time. The summarizer runs every time. The reranker runs every time. The tool planner may run more than once. The sub-agent may loop.

If those steps all use the most expensive default, cost grows quietly until the workflow becomes too expensive to trust at scale.
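A rough calculation makes the point concrete. The sketch below reuses the article's 5,000 requests and six calls per request. The per-call prices, the step list, and the tier assignments are made-up placeholders, so treat the output as an illustration of the shape, not a forecast.

```python
REQUESTS_PER_MONTH = 5_000  # figure used earlier in the article

# Hypothetical prices in micro-dollars per call (integers keep the arithmetic exact).
PRICE = {"small": 1_000, "mid": 6_000, "frontier": 30_000}

# step -> (runs per request, tier in a routed design). Illustrative values only.
# Runs add up to six calls per request, or 30,000 calls a month.
STEPS = {
    "classify": (1, "small"),
    "rerank": (1, "small"),
    "summarize": (1, "small"),
    "plan_tools": (2, "mid"),
    "draft": (1, "frontier"),
}


def monthly_cost_usd(routed: bool) -> dict[str, float]:
    """Monthly cost per step, either routed by tier or all on the frontier tier."""
    costs = {}
    for step, (runs, tier) in STEPS.items():
        price = PRICE[tier] if routed else PRICE["frontier"]
        costs[step] = REQUESTS_PER_MONTH * runs * price / 1_000_000
    return costs


default = monthly_cost_usd(routed=False)
routed = monthly_cost_usd(routed=True)
print(sum(default.values()))  # 900.0
print(sum(routed.values()))   # 225.0
print(sum(v for k, v in default.items() if k != "draft"))  # 750.0
```

Under these assumed prices, the all-frontier default spends 750 of its 900 dollars on steps other than the final draft. Routing those steps down is where the difference comes from.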

The same thing happens with latency.

One slow call is annoying. Six slow calls chained together can make the product feel broken. A support workflow that takes 45 seconds to draft a response will not feel operationally useful, even if the response is good.
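Sequential calls add their latencies, which is why a chain feels slower than any single step. The per-step numbers and the 15-second budget below are hypothetical and only show the arithmetic.

```python
def chain_latency_ms(step_latencies_ms: list[int]) -> int:
    """Steps that run one after another add up to the total wait."""
    return sum(step_latencies_ms)


def within_budget(step_latencies_ms: list[int], budget_ms: int) -> bool:
    """Check a whole chain against a single end-to-end latency budget."""
    return chain_latency_ms(step_latencies_ms) <= budget_ms


# Hypothetical: six calls that each take 7.5 seconds on one large model.
all_large = [7_500] * 6
# Hypothetical: the same six calls with faster models on the frequent steps.
routed = [300, 400, 800, 1_200, 1_200, 6_000]

print(chain_latency_ms(all_large))  # 45000
print(chain_latency_ms(routed))     # 9900
print(within_budget(routed, 15_000))  # True
```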

This is why Mellum2 is a useful signal. Its MoE design, 2.5B active parameters per token, software engineering focus, and local deployment story are all aimed at the high-frequency parts of the system.

The model itself may or may not fit your stack. The architecture lesson travels.

What to do next

If your team is already using agents, audit the workflow this week.

Do not start with vendor comparison sheets. Start with the route.

List each step. Count how often it runs. Write the receipt. Measure current latency and cost. Mark which steps touch sensitive data. Mark which decisions require approval.

Then test smaller models against the receipt.

If the smaller model passes, route that step down. If it fails, either improve the step design or keep the stronger model. If the step is risky, add a guardrail or human review instead of hoping the model behaves.
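The audit outcome for each step reduces to a small decision rule. This sketch assumes a hypothetical `AuditRow` record that your team fills in by hand. The field names and action strings are illustrative, not part of any tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AuditRow:
    step: str
    runs_per_month: int
    smaller_model_passes: bool  # result of testing a smaller model against the receipt
    risky: bool                 # step can cost money or create compliance exposure


def decide(row: AuditRow) -> list[str]:
    """Turn one audit row into the follow-up actions described above."""
    if row.smaller_model_passes:
        actions = ["route to smaller model"]
    else:
        actions = ["improve step design or keep stronger model"]
    if row.risky:
        actions.append("add guardrail or human review")
    return actions
```

Sorting the rows by `runs_per_month` before acting on them puts the highest-frequency steps, and usually the biggest savings, at the top of the list.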

The winners in production AI will not be the teams with the fanciest default model.

They will be the teams that know which steps deserve intelligence, which steps need speed, and which steps require approval.

That is where agent costs are going.


