Mistral AI released Mistral Small 4 on March 16, 2026: an Apache 2.0 Mixture-of-Experts model with 119B total parameters (128 experts, 4 active per token, for 6.5B activated parameters per token) and a 256K context window. The model also folds multimodal input, configurable reasoning, and agentic tooling into one stack instead of splitting those jobs across separate model families.
That is the real release.
The headline parameter count will travel farther on social, but the activated path is the number that changes the operating math. Mistral is selling a much larger surface area of capability than a dense 6B or 8B class model can offer, while the per-token active footprint stays far closer to a mid-sized deployment budget than a 119B dense model would. If you care about inference cost, routing efficiency matters more than the brag number.
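That operating math is easy to check with back-of-envelope arithmetic. The figures below come from the release; the dense comparison is a simplification that ignores memory-bandwidth and batching effects:

```python
# Headline vs activated figures from the release.
total_params = 119e9      # all experts plus shared layers
active_params = 6.5e9     # parameters actually run per token
experts_total = 128
experts_active = 4

# Share of the model a single token's forward pass touches.
active_fraction = active_params / total_params

# Expert routing alone would suggest 4/128; the activated share is
# higher because attention, embeddings, and other shared layers run
# on every token regardless of routing.
expert_fraction = experts_active / experts_total

print(f"active fraction: {active_fraction:.1%}")
print(f"expert fraction: {expert_fraction:.1%}")
```

Per-token compute lands at a small fraction of the headline size, which is the whole point of the sparse design.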
The interesting number is not 119B
Mistral's own release page frames Small 4 as one model replacing three: a fast instruct model, a reasoning model, and a multimodal assistant. The Hugging Face card makes the engineering tradeoff more explicit. This is a 128-expert MoE model with only four experts active per token.
That sparse path is why Small 4 is more commercially interesting than another giant open checkpoint. A dense 119B model announces capability and then sends the hardware team looking for a budget. A sparse 119B model says something else: you may be able to keep the broader knowledge surface, specialist routing, and long context window without paying the dense-model tax on every token.
Mistral claims a 40% reduction in end-to-end completion time and 3x more requests per second versus Mistral Small 3. Those performance gains are not side notes. They are the proof that the architecture decision was aimed at serving economics, not only benchmark screenshots.
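The routing decision behind that tradeoff can be sketched in a few lines. This is generic top-k MoE gating, not Mistral's actual router: a gate scores all 128 experts, but only the top 4 run for each token.

```python
import math
import random

def route_token(gate_logits, k=4):
    """Generic top-k MoE gating: keep the k highest-scoring experts
    and renormalize their weights with a softmax over just those k."""
    top = sorted(range(len(gate_logits)),
                 key=lambda i: gate_logits[i], reverse=True)[:k]
    exps = [math.exp(gate_logits[i]) for i in top]
    total = sum(exps)
    return {i: e / total for i, e in zip(top, exps)}

random.seed(0)
gate_logits = [random.gauss(0.0, 1.0) for _ in range(128)]  # one score per expert
active = route_token(gate_logits, k=4)

# Only these 4 experts' FFN weights are read and executed for this token;
# the other 124 sit idle, which is where the serving savings come from.
print(sorted(active))
```

Real routers add load-balancing losses and capacity limits on top of this, but the serving economics follow from the same top-k selection.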
Thinking becomes a switch, not a separate product
The most practical feature in this release may be reasoning_effort.
Mistral Small 4 lets teams turn reasoning on or off per request. That sounds like a product-detail footnote until you look at where inference budgets usually leak. Most teams either overpay by sending simple requests to a heavyweight reasoning model or underperform by forcing everything through a cheaper instruct model and patching the misses with retries, prompts, or human cleanup.
Small 4 changes that tradeoff. Routine classification, extraction, and tool-calling tasks can run with reasoning_effort="none". Harder code, math, and multi-step planning tasks can opt into a higher-effort path only when the task justifies it. Thinking stops being a separate SKU and starts acting like a controllable runtime setting.
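In practice that turns effort selection into a routing rule at the request layer. A minimal sketch: the reasoning_effort field name comes from the release, but the payload shape, model identifier, and the task-to-effort mapping here are illustrative assumptions, not Mistral's documented API.

```python
# Hypothetical per-request effort routing. Only the reasoning_effort
# field name comes from the release; everything else is illustrative.
LOW_EFFORT_TASKS = {"classification", "extraction", "tool_call"}
HIGH_EFFORT_TASKS = {"code", "math", "planning"}

def build_request(task_kind, messages):
    if task_kind in HIGH_EFFORT_TASKS:
        effort = "high"   # opt into the thinking pass only when it pays
    else:
        effort = "none"   # routine tasks skip test-time reasoning
    return {
        "model": "mistral-small-4",   # hypothetical identifier
        "messages": messages,
        "reasoning_effort": effort,
    }

cheap = build_request("classification",
                      [{"role": "user", "content": "Spam or not: 'WIN NOW'"}])
costly = build_request("planning",
                       [{"role": "user", "content": "Plan a 3-step migration."}])
```

The point is that the branch lives in a dozen lines of client code instead of a second model deployment.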
That matters because test-time compute is easiest to justify when it is selective. Paying for reasoning only on the requests that need it is a cleaner business model than standing up separate model chains and hoping the router guesses right.
The model consolidation story is bigger than the benchmark story
Mistral describes Small 4 as the first model in its lineup to combine capabilities previously spread across Magistral, Pixtral, and Devstral. That matters operationally even if you ignore every benchmark chart in the announcement.
One multimodal model with instruction following, reasoning, image input, function calling, and JSON output removes a lot of orchestration glue. Fewer model hops means fewer fallback rules, fewer prompt variants, fewer output-shape mismatches, and fewer support tickets caused by one stage of the pipeline behaving differently from the others.
For teams self-hosting under Apache 2.0, that simplification is part of the value. Open weights matter, but open deployment with a smaller orchestration surface matters just as much. If one model can cover document parsing, coding help, support triage, and deeper reasoning without crossing providers or licenses, the deployment story gets dramatically cleaner.
Fewer model handoffs, fewer line items
Mistral Small 4 is not interesting because it is another 119B model. It is interesting because Mistral used sparsity, configurable reasoning, and Apache 2.0 licensing to make one open model do the work that often gets spread across three specialized systems. The winning number here is not the top-line parameter count. It is the activated path you actually serve, the reasoning budget you can control per request, and the amount of model-switching complexity you no longer have to carry.
