A support agent has five jobs before a customer ever sees a reply.
Classify the ticket. Fetch the right policy. Summarize the account history. Draft the response. Decide whether the refund needs approval.
Most teams make the same expensive mistake at the start: they send all five steps to the same large model.
That feels simpler. One model. One instruction style. One vendor bill.
It also means the workflow pays frontier-model prices for chores that do not need frontier-model judgment. The classifier does not need literary nuance. The policy lookup does not need deep reasoning. The context summary needs to be accurate, short, and fast. The refund decision might need a stronger model, a human approval step, or both.
That is the cost problem hiding inside agentic workflows. The bill does not come from one dramatic request. It comes from the repeated middle steps that run hundreds or thousands of times.
JetBrains' Mellum2 release is interesting for that reason. The headline is not "JetBrains launched another model." The useful signal is that production AI is moving toward routed systems: smaller, cheaper, specialized models handling frequent steps, with larger models reserved for the parts that actually need them.
What JetBrains actually released
On June 2, 2026, JetBrains announced that Mellum2 is open source. It is released under Apache 2.0 and built for software engineering workflows.
The numbers matter. Mellum2 is a Mixture-of-Experts (MoE) model with 12B total parameters, of which only 2.5B are active per token. JetBrains says the MoE design reduces compute cost while supporting high-throughput, low-latency inference.
JetBrains describes Mellum2 as useful for routing, Q&A, sub-agents, private AI use, summarization, intermediate reasoning, low-latency RAG, and local or self-hosted deployment.
The Mellum2 technical report, submitted to arXiv on May 29, gives the same core shape: an open-weight 12B MoE model with 2.5B active parameters per token, specialized for software engineering.
The report's list of target tasks covers code generation and editing, debugging, multi-step reasoning, tool use and function calling, agentic coding, and conversational programming assistance.
The Hugging Face Mellum2 collection shows a model family rather than a single artifact, including Thinking, Instruct, SFT, and Base variants.
That family shape matters. Production teams rarely need one universal model. They need components that can be matched to the job.
JetBrains puts the point plainly in the announcement: the future belongs to coordinated systems, not single models.
For operators, that is the line worth keeping.
Why focal models matter
A focal model has a narrow job.
It does not need to win every leaderboard. It needs to perform one part of your workflow reliably, quickly, and cheaply enough that you can afford to run it often.
That changes the design conversation.
If your workflow handles 5,000 inbound requests a month, and each request triggers six AI calls, you are not buying "an AI model." You are operating 30,000 model decisions.
Some are cheap classification calls. Some are retrieval checks. Some are summaries. Some are drafts. A few are risky decisions.
Treating all of those as the same kind of intelligence is lazy architecture.
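The arithmetic behind that 5,000-request example is easy to extend into a rough budget. The sketch below multiplies out the call volume and compares sending every call to one premium tier against a routed mix. The per-call prices and the split of the six calls are made-up placeholders, not vendor pricing, so substitute your own measurements before drawing conclusions.

```python
# Rough monthly volume and cost for the 5,000-request example.
# All per-call prices are hypothetical placeholders, not vendor pricing.

REQUESTS_PER_MONTH = 5_000
CALLS_PER_REQUEST = 6


def monthly_calls(requests: int, calls_per_request: int) -> int:
    """Total model calls the workflow makes in a month."""
    return requests * calls_per_request


def monthly_cost(calls_by_tier: dict[str, int], price_per_call: dict[str, float]) -> float:
    """Sum of (calls x price) across model tiers."""
    return sum(calls * price_per_call[tier] for tier, calls in calls_by_tier.items())


# Hypothetical prices in dollars per call.
PRICE = {"small": 0.001, "mid": 0.005, "frontier": 0.03}

total = monthly_calls(REQUESTS_PER_MONTH, CALLS_PER_REQUEST)  # 30,000 calls

# Every call on the frontier tier, versus an assumed routed split of the
# six calls per request: four small, one mid, one frontier.
all_frontier = monthly_cost({"frontier": total}, PRICE)
routed = monthly_cost(
    {
        "small": 4 * REQUESTS_PER_MONTH,
        "mid": 1 * REQUESTS_PER_MONTH,
        "frontier": 1 * REQUESTS_PER_MONTH,
    },
    PRICE,
)

print(f"{total} calls: all-frontier ${all_frontier:,.2f}, routed ${routed:,.2f}")
```

The absolute figures mean nothing outside this sketch. The useful part is the shape: the volume term is fixed by the workflow, so the tier assigned to the high-frequency calls drives most of the total.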
A frontier model may still be the right choice for high-ambiguity work: messy negotiation, strategic synthesis, sensitive customer communication, novel code review, or a decision that could cost real money.
But the middle of the workflow is different. It is full of repeatable tasks with clear inputs and checkable outputs.
That is where a model like Mellum2 points. It is aimed at places where latency, throughput, privacy, and cost become operational constraints.
Agent tooling is moving the same way. The OpenAI Agents SDK describes agents alongside handoffs, guardrails, tracing, sessions, and human-in-the-loop patterns. The Model Context Protocol intro describes MCP as an open-source standard for connecting AI applications to external systems.
Those are not just developer conveniences. They are signs that AI work is becoming orchestration work.
The model is one part of the system. The route, tool contract, receipt, and approval boundary matter just as much.
Start with the workflow, then pick the model
The wrong question is: "Which model should our business use?"
A better question is: "What does each step need to prove before the next step runs?"
A ticket classifier needs a label and confidence score. A retrieval step needs source IDs. A summarizer needs to preserve specific facts and omit anything unsupported. A draft writer needs tone, policy compliance, and a citation trail. A refund decision needs a threshold, an approval rule, or a human reviewer.
Each step should have a receipt.
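One way to make that concrete is a small contract listing what each step must return before the next one runs. This is a minimal sketch: the step names and field names are illustrative assumptions, not a standard schema.

```python
# Hypothetical receipt contract: each step must return these fields
# before the next step is allowed to run. Field names are illustrative.
REQUIRED_RECEIPT_FIELDS = {
    "classify": {"label", "confidence"},
    "retrieve": {"source_ids"},
    "summarize": {"summary", "cited_facts"},
    "draft": {"text", "policy_ids", "citations"},
    "decide_refund": {"approval_required", "reason"},
}


def missing_receipt_fields(step: str, output: dict) -> set[str]:
    """Return required fields that are absent or empty in a step's output."""
    required = REQUIRED_RECEIPT_FIELDS[step]
    # False and 0.0 count as present; only None and empty containers/strings fail.
    return {f for f in required if output.get(f) in (None, "", [], {})}


def can_proceed(step: str, output: dict) -> bool:
    """A step passes its receipt only when nothing required is missing."""
    return not missing_receipt_fields(step, output)


print(can_proceed("classify", {"label": "refund", "confidence": 0.93}))
print(missing_receipt_fields("retrieve", {"source_ids": []}))
```

A presence check like this is only the first gate. It confirms the step produced evidence, not that the evidence is correct.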
We wrote about this in "Agent evals should test workflow receipts": the final answer is not enough. If an agent gets to a polished response through a broken chain, the workflow is still unsafe.
Receipts make routing possible.
Once you know what a step must prove, you can choose the smallest model that can pass that receipt. If a lightweight local model can classify 98% of tickets correctly under your eval, that step should not be routed to a premium frontier model. If a summarizer preserves key facts and stays inside the retrieved context, it does not need to be creative.
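The selection rule in that paragraph can be sketched as a filter followed by a minimum. The candidate names, costs, and pass rates below are hypothetical; in practice the pass rate would come from running each model against your own receipt eval.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str             # hypothetical model label
    cost_per_call: float  # your measured or quoted cost
    pass_rate: float      # fraction of eval cases where the receipt passed


def cheapest_passing(candidates: list[Candidate], min_pass_rate: float) -> Candidate | None:
    """Pick the cheapest candidate whose eval pass rate meets the bar."""
    passing = [c for c in candidates if c.pass_rate >= min_pass_rate]
    return min(passing, key=lambda c: c.cost_per_call, default=None)


# Illustrative numbers only.
candidates = [
    Candidate("local-small", cost_per_call=0.001, pass_rate=0.98),
    Candidate("hosted-mid", cost_per_call=0.005, pass_rate=0.99),
    Candidate("frontier", cost_per_call=0.03, pass_rate=0.995),
]

choice = cheapest_passing(candidates, min_pass_rate=0.97)
print(choice.name if choice else "no candidate passes; revisit the step design")
```

Returning `None` when nothing clears the bar is deliberate: it forces a decision about the step itself rather than silently falling back to the most expensive option.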
This is where benchmark thinking can mislead teams. A model can lead a public benchmark and still be the wrong fit for a production step if it is too slow, too expensive, weak at tool contracts, or hard to roll out safely. That is the argument in "Why better benchmarks can produce worse production outcomes".
Production value comes from fit, not rank.
A routed workflow example
Here is the support workflow again, but with model choice treated as an operational decision instead of a default setting.
| Workflow step | What it needs | Model class | Receipt to check |
|---|---|---|---|
| Classify ticket | Fast intent and urgency label | Small classifier or specialized model | Label, confidence, fallback if confidence is low |
| Fetch policy | Correct source retrieval | Retrieval system plus small reranker | Policy IDs, timestamps, source links |
| Summarize context | Short account history with no invented facts | Small or mid-sized summarization model | Cited facts, no unsupported claims |
| Draft response | Clear customer-facing language | Mid-sized instruction model, with frontier escalation for sensitive cases | Policy match, tone check, source trace |
| Decide approval | Risk threshold and business rule | Rule, human, or frontier model when ambiguity is high | Approval flag, reason, reviewer handoff if needed |

The table is intentionally plain. Good agent architecture often looks less like magic and more like operations design.
Inputs. Outputs. Receipts. Budgets. Escalation paths.
The payoff is not just lower spend. It is more control.
A routed system can say: this low-risk step runs locally. This step can use a cheap hosted model. This step needs a larger model because the customer is upset and the refund amount is high. This step cannot proceed without approval.
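Those sentences translate fairly directly into a routing function. The sketch below assumes hypothetical tier names, a made-up refund threshold, and a simple `customer_upset` flag; a real system would derive these from its own policy and classifier output.

```python
from dataclasses import dataclass

# Hypothetical tiers and threshold; tune them to your own policy.
DEFAULT_TIER = {
    "classify": "local-small",
    "retrieve": "local-small",
    "summarize": "hosted-small",
    "draft": "hosted-mid",
}
REFUND_APPROVAL_THRESHOLD = 100.00  # assumed dollar limit


@dataclass
class Ticket:
    refund_amount: float = 0.0
    customer_upset: bool = False


def route(step: str, ticket: Ticket) -> str:
    """Return the model tier, 'rule', or 'human-approval' for one step."""
    high_refund = ticket.refund_amount > REFUND_APPROVAL_THRESHOLD
    if step == "decide_refund":
        # Business rule first: large refunds never proceed without a reviewer.
        return "human-approval" if high_refund else "rule"
    if step == "draft" and ticket.customer_upset and high_refund:
        # Escalate only the sensitive drafts to the larger model.
        return "frontier"
    return DEFAULT_TIER[step]


ticket = Ticket(refund_amount=250.0, customer_upset=True)
for step in ["classify", "retrieve", "summarize", "draft", "decide_refund"]:
    print(step, "->", route(step, ticket))
```

Keeping the escalation conditions in plain code, rather than inside a prompt, means the approval boundary can be reviewed and tested like any other business rule.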
That is a better business system than one oversized instruction trying to do everything.
Where operators should tighten the system
Start by writing down every AI call the process might make.
Include the hidden ones: classification, routing, query rewriting, summarization, source ranking, draft revision, policy checks, and final review.
Then look at frequency.
The call that runs on every ticket deserves a different budget than the call that runs only on escalations. A classifier might need to return almost instantly. A final draft can tolerate more latency. A high-stakes approval decision can cost more because it runs less often.
Next, separate the steps by failure mode.
A bad classification wastes time. A bad summary can mislead the draft. A bad policy decision can give away money, violate a customer promise, or create compliance exposure.
Those are not the same problem. They should not have the same model, eval, or approval path.
Receipts make the work inspectable. A summarizer passes if every claim can be traced to retrieved context. A code-editing sub-agent passes if it produces a diff, test result, and explanation of touched files. A policy decision passes if it cites the policy section and flags uncertainty.
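The summarizer rule is the easiest of the three to automate. A minimal sketch, assuming the summarizer emits each claim with the ID of the source it relied on (the `text` and `source_id` field names are hypothetical):

```python
def untraced_claims(claims: list[dict], retrieved_ids: set[str]) -> list[str]:
    """Return the text of every claim whose source is not in the retrieved context.

    Each claim is assumed to look like {"text": ..., "source_id": ...}.
    A claim with no source_id at all is treated as untraced.
    """
    return [c["text"] for c in claims if c.get("source_id") not in retrieved_ids]


def summary_passes(claims: list[dict], retrieved_ids: set[str]) -> bool:
    """The receipt passes only when every claim points at a retrieved source."""
    return not untraced_claims(claims, retrieved_ids)


retrieved = {"acct-note-12", "order-881"}
claims = [
    {"text": "Customer ordered on the 3rd.", "source_id": "order-881"},
    {"text": "Customer was promised a refund.", "source_id": "chat-404"},
    {"text": "Customer is a long-time subscriber."},
]
print(untraced_claims(claims, retrieved))
```

This checks that a cited source was actually retrieved, not that the source supports the claim. Verifying support needs a second check, such as a spot review or an entailment-style eval.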
This connects directly to responsible AI governance. Receipts are not paperwork for their own sake. They give teams a way to inspect decisions, set approval boundaries, and avoid pretending the final answer tells the whole story.
If your team needs a starting point, BaristaLabs has a Responsible AI resource hub and an AI approval policy template.
Privacy belongs in the same conversation.
Mellum2 is notable because JetBrains explicitly talks about private AI use, local deployment, and self-hosting. That matters for software teams handling proprietary code, internal policies, customer data, or regulated workflows.
Open-source models do not remove governance work. They do give teams more deployment options.
That connects to the broader open-source cost argument in "Hugging Face training costs and open source AI in 2026". Open models are becoming practical economic infrastructure, not just research artifacts.
The cost curve is in the middle
Most AI cost conversations focus on the obvious premium moment: the large model writing the final answer.
That is often the wrong place to look first.
In agent workflows, the repeated middle steps can dominate the bill. The classifier runs every time. The summarizer runs every time. The reranker runs every time. The tool planner may run more than once. The sub-agent may loop.
If those steps all use the most expensive default, cost grows quietly until the workflow becomes too expensive to trust at scale.
The same thing happens with latency.
One slow call is annoying. Six slow calls chained together can make the product feel broken. A support workflow that takes 45 seconds to draft a response will not feel operationally useful, even if the response is good.
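Sequential latency is additive, which is why the chain feels so much worse than any single call. The sketch below uses two hypothetical latency profiles for a six-call chain: one where every step takes 7.5 seconds, which reproduces the 45-second figure, and one with assumed faster models on the frequent steps. None of these numbers are measurements.

```python
# Per-step latencies in seconds for a sequential six-call chain.
# Both profiles are hypothetical numbers chosen for illustration.
ALL_LARGE = {
    "classify": 7.5, "retrieve": 7.5, "rerank": 7.5,
    "summarize": 7.5, "draft": 7.5, "review": 7.5,
}
ROUTED = {
    "classify": 0.5, "retrieve": 0.5, "rerank": 1.0,
    "summarize": 1.0, "draft": 4.0, "review": 7.5,
}


def chain_latency(latencies: dict[str, float]) -> float:
    """Sequential steps add up: total wait is the sum of every call."""
    return sum(latencies.values())


def steps_over_budget(latencies: dict[str, float], budget_s: float) -> list[str]:
    """Steps whose own latency exceeds a per-step budget, in chain order."""
    return [step for step, seconds in latencies.items() if seconds > budget_s]


print(chain_latency(ALL_LARGE), chain_latency(ROUTED))
print(steps_over_budget(ROUTED, budget_s=2.0))
```

Listing the steps that exceed a per-step budget shows where the remaining wait comes from, which is usually a better starting point than trying to speed up everything at once.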
This is why Mellum2 is a useful signal. Its MoE design, 2.5B active parameters per token, software engineering focus, and local deployment story are all aimed at the high-frequency parts of the system.
The model itself may or may not fit your stack. The architecture lesson travels.
What to do next
If your team is already using agents, audit the workflow this week.
Do not start with vendor comparison sheets. Start with the route.
List each step. Count how often it runs. Write the receipt. Measure current latency and cost. Mark which steps touch sensitive data. Mark which decisions require approval.
Then test smaller models against the receipt.
If the smaller model passes, route that step down. If it fails, either improve the step design or keep the stronger model. If the step is risky, add a guardrail or human review instead of hoping the model behaves.
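The audit and the decision rule above fit in one small record per step. This is a sketch under assumptions: the field names are illustrative, and "risky" is approximated here by the `requires_approval` flag.

```python
from dataclasses import dataclass


@dataclass
class StepAudit:
    step: str
    runs_per_month: int
    receipt: str                   # what the step must prove
    latency_s: float               # measured current latency
    cost_per_call: float           # measured current cost
    touches_sensitive_data: bool
    requires_approval: bool        # stand-in for "this step is risky"


def next_actions(row: StepAudit, smaller_model_passes: bool) -> list[str]:
    """Apply the decision rule: route down on a pass, otherwise fix or keep;
    risky steps also get a guardrail or human review either way."""
    actions = []
    if smaller_model_passes:
        actions.append("route down to the smaller model")
    else:
        actions.append("improve the step design or keep the stronger model")
    if row.requires_approval:
        actions.append("add a guardrail or human review")
    return actions


row = StepAudit(
    step="decide_refund",
    runs_per_month=400,            # hypothetical count
    receipt="approval flag, reason, reviewer handoff",
    latency_s=3.0,
    cost_per_call=0.03,
    touches_sensitive_data=True,
    requires_approval=True,
)
print(next_actions(row, smaller_model_passes=False))
```

Returning a list rather than a single label reflects the point that review is added on top of the model choice, not instead of it.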
The winners in production AI will not be the teams with the fanciest default model.
They will be the teams that know which steps deserve intelligence, which steps need speed, and which steps require approval.
That is where agent costs are going.
