I audited the prior weekly roundups before writing this. Only two exist so far, and each used a different shape: one capability-lane board and one migration-risk map. This week, the useful frame is narrower: which buried detail in each release actually changes deployment behavior?
If you missed the prior two, start with 11 Model Releases That Changed Deployment Plans This Week and When to Switch Models This Week vs Stay Put: The Migration Risk Map for March 6. For the broader production argument, pair this with Why Better Benchmarks Keep Producing Worse Production Outcomes.
Buried detail board: the releases that actually moved something
1) GPT-5.4 (OpenAI) — adopt if your bottleneck is software interaction, not pure coding
- Release window: March 7
- Capability: One endpoint for reasoning, coding, and native computer use.
- Eval delta: OSWorld-Verified jumped from 47.3% on GPT-5.2 to 75.0% on GPT-5.4, while SWE-Bench Pro only moved from 56.8% to 57.7%.
- Pricing/licensing: Closed model under OpenAI API terms.
- Availability: Publicly released and positioned as the new general-purpose flagship.
- Migration friction: Medium. API integration is easy; prompt and workflow assumptions are not.
This was the biggest change in operating assumptions this week.
Most coverage treated GPT-5.4 like a standard intelligence upgrade. That misses the real shift. The useful number is not the 0.9-point coding gain. It is the 27.7-point jump in computer use. That means the handoff layer between “the model drafted it” and “a human still has to click through the software” just got materially thinner.
If you run browser workflows, back-office form entry, spreadsheet updates, or software QA loops, read GPT-5.4 Barely Moved the Coding Needle. The Computer-Use Score Is a Different Story. If you only use models for code generation, the migration is much less urgent.
2) Gemini Embedding 2 (Google) — adopt when retrieval complexity is costing you more than model cost
- Release window: March 10
- Capability: First natively multimodal embedding model in the Gemini line.
- Eval delta: Google positions it as state of the art across text, image, video, and speech retrieval tasks, though the announcement emphasized modality coverage more than a single leaderboard number.
- Pricing/licensing: Closed API model via Gemini API and Vertex AI.
- Availability: Public preview now.
- Migration friction: Medium-low if you already use Google tooling; medium-high if your current stack depends on separate modality pipelines.
The undercovered fact from Google's release notes is not just “multimodal.” It is the hard limits: 8192 text tokens, up to 6 images per request, 120 seconds of video, PDFs up to 6 pages, and output dimensions that can scale down from 3072 to 1536 or 768. That last detail matters more than the headline.
Why? Because moving from 3072 dimensions to 768 cuts vector width by 75%. For a large retrieval index, that changes RAM pressure, index size, and latency before you touch your model bill. Read Do You Still Need a Separate RAG Stack If Gemini Embedding 2 Maps Audio, Video, and Docs Together? if your retrieval pipeline has turned into a pile of transcription glue, OCR glue, and image-search glue.
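The arithmetic behind that claim is worth making concrete. A minimal sketch of raw index storage at different output dimensions; the 10M-asset corpus size is a hypothetical for illustration, not a figure from the release notes:

```python
def index_bytes(num_vectors: int, dims: int, bytes_per_float: int = 4) -> int:
    """Raw storage for a flat float32 vector index, ignoring metadata and index overhead."""
    return num_vectors * dims * bytes_per_float

corpus = 10_000_000  # hypothetical 10M-asset index

full = index_bytes(corpus, 3072)
small = index_bytes(corpus, 768)

print(f"3072 dims: {full / 2**30:.1f} GiB")   # ~114.4 GiB
print(f" 768 dims: {small / 2**30:.1f} GiB")  # ~28.6 GiB
print(f"reduction: {1 - small / full:.0%}")   # 75%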
Migration warning: if your business logic is wrapped around provider-neutral embeddings, this is not a drop-in switch. You will want a side-by-side relevance test over 2,000 to 10,000 real assets before reindexing production content.
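One way to run that side-by-side test is to score both embedding pipelines with the same metric over the same labeled queries. A minimal recall@k sketch; the rankings and ground-truth ids below are hypothetical placeholders for your own eval set:

```python
def recall_at_k(retrieved: list[list[str]], relevant: list[set[str]], k: int = 10) -> float:
    """Fraction of queries whose top-k retrieved ids contain at least one relevant id."""
    hits = sum(1 for ids, rel in zip(retrieved, relevant) if rel & set(ids[:k]))
    return hits / len(relevant)

# Hypothetical rankings: current pipeline vs candidate embeddings, same two queries
old_runs = [["a", "b", "c"], ["x", "y", "z"]]
new_runs = [["b", "a", "c"], ["q", "z", "x"]]
truth = [{"a"}, {"x"}]

print(recall_at_k(old_runs, truth, k=2))  # 1.0
print(recall_at_k(new_runs, truth, k=2))  # 0.5
```

The point of the exercise is the delta between the two runs on your real assets, not the absolute number.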
3) Granite 4.0 1B Speech (IBM) — adopt when multilingual voice is useful and hardware is scarce
- Release window: March 9
- Capability: Compact multilingual speech model for ASR and bidirectional speech translation.
- Eval delta: IBM says it delivers higher English transcription accuracy than its 2B predecessor while cutting the parameter count in half, and it took the #1 spot on the OpenASR leaderboard.
- Pricing/licensing: Apache 2.0.
- Availability: Released with native support in Transformers and vLLM.
- Migration friction: Low to medium. Easier than most speech-model swaps if you already run open tooling.
This is the most undercovered practical release of the week. Granite 4.0 1B Speech added Japanese ASR, keyword-list biasing for names and acronyms, and support across 6 languages: English, French, German, Spanish, Portuguese, and Japanese. If you handle support calls, field transcripts, or multilingual voice notes, those are not cosmetic upgrades.
The migration case is straightforward: smaller model, open license, normal open-model serving paths, and a clear enterprise use case. The only real friction is evaluation of your own audio quality and jargon. A speech model that wins general benchmarks and still butchers product names is not ready.
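A cheap first pass on the jargon question is a term-level miss rate over your own transcripts: of every time a watched product name appears in a reference, how often does the hypothesis drop it? This is a crude substring check, not word error rate, and the "AcmeSync" product name and transcripts are hypothetical:

```python
import re

def term_miss_rate(references: list[str], hypotheses: list[str], terms: list[str]) -> float:
    """Of all occurrences of watched terms in the reference transcripts,
    the fraction missing from the corresponding hypothesis transcripts."""
    expected = missed = 0
    for ref, hyp in zip(references, hypotheses):
        ref_l, hyp_l = ref.lower(), hyp.lower()
        for term in terms:
            t = re.escape(term.lower())
            n = len(re.findall(t, ref_l))
            expected += n
            missed += max(0, n - len(re.findall(t, hyp_l)))
    return missed / expected if expected else 0.0

refs = ["please reset the AcmeSync gateway", "AcmeSync logs show a timeout"]
hyps = ["please reset the acme sink gateway", "acmesync logs show a timeout"]
print(term_miss_rate(refs, hyps, ["AcmeSync"]))  # 0.5
```

Run it once with keyword-list biasing enabled and once without; the gap tells you whether the biasing feature earns its place in your pipeline.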
4) Nemotron 3 Super (NVIDIA) — adopt only if throughput is the real constraint
- Release window: March 10
- Capability: Open long-context systems model aimed at Blackwell-class inference.
- Eval delta: NVIDIA claims up to 2.2x higher throughput than GPT-OSS-120B and 7.5x higher throughput than Qwen3.5-122B on an 8K input / 16K output setup. On quality, it is mixed: 60.47 on SWE-Bench OpenHands versus Qwen3.5-122B at 66.40, and 64.36 on TauBench V2 Telecom versus 95.00 for Qwen3.5-122B.
- Pricing/licensing: Open model, but economically optimized around NVIDIA's own hardware stack.
- Availability: Released now.
- Migration friction: High. Hardware assumptions are the whole story.
The buried fact in the technical report is brutal and useful: during NVFP4 pretraining, about 7% of parameters reached zero-valued weight gradients. NVIDIA says the effect was reversible if training switched back to BF16, but it chose not to switch for the shipped model. That tells you exactly how this model should be bought: as a capacity play, not as a universal assistant.
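If you want to sanity-check how much of a shipped checkpoint actually went silent, the measurement itself is trivial. A minimal sketch using plain Python lists (in practice you would flatten each tensor in the checkpoint and aggregate); the 7-zeros-in-100 example is an illustration of the reported ratio, not real model data:

```python
def zero_fraction(values: list[float], tol: float = 0.0) -> float:
    """Fraction of entries whose magnitude is <= tol (exactly zero by default)."""
    if not values:
        return 0.0
    return sum(1 for v in values if abs(v) <= tol) / len(values)

# Hypothetical flattened weight tensor: 7 zeros out of 100 entries
weights = [0.0] * 7 + [0.01 * (i + 1) for i in range(93)]
print(f"{zero_fraction(weights):.0%}")  # 7%
```

Per-layer numbers matter more than the global figure: a uniform 7% is a different risk than 7% concentrated in a handful of attention blocks.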
If your workload is long-context ticket triage, internal support routing, or retrieval-heavy agent traffic, read 7% of Nemotron 3 Super’s Parameters Went Silent During Training. That’s the Detail to Test First. If your workload is customer-facing dialogue or code repair, the model card itself gives you reasons to be skeptical.
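placeholder-removed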
5) BitNet b1.58 2B4T (Microsoft) — adopt when privacy and fixed cost matter more than frontier quality
- Release window: March 11
- Capability: Native 1.58-bit open model designed for CPU inference.
- Eval delta: The headline here is not one benchmark; it is the runtime claim. Microsoft reports 2.37x to 6.17x speedups on x86 CPUs and 1.37x to 5.07x on ARM CPUs with the optimized bitnet.cpp path.
- Pricing/licensing: Open-source release.
- Availability: Public now, but real performance depends on the dedicated runtime and GGUF path.
- Migration friction: Medium. Easy conceptually, annoying operationally if you expect standard Transformers behavior.
This release matters because it shifts the floor under private AI. A 2B model trained on 4 trillion tokens that runs on CPUs changes the economics for internal summarization, policy search, and private document work. The buried caveat is equally important: you do not get the headline efficiency gains unless you use the dedicated bitnet.cpp runtime instead of a default Transformers stack.
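The "pricing signal" framing can be made concrete with a breakeven calculation: at what monthly token volume does a flat CPU bill beat per-token API pricing? Both figures below are hypothetical placeholders, not vendor prices:

```python
def breakeven_tokens_per_month(api_cost_per_m_tokens: float, cpu_cost_per_month: float) -> float:
    """Monthly token volume at which a flat CPU bill equals per-token API spend."""
    return cpu_cost_per_month / api_cost_per_m_tokens * 1_000_000

# Hypothetical figures: $0.50 per million tokens vs a $40/month CPU box
print(f"{breakeven_tokens_per_month(0.50, 40.0):,.0f} tokens/month")  # 80,000,000
```

Below the breakeven volume the API is cheaper; above it, the fixed-cost CPU path wins, before you even count the privacy benefit of data never leaving your environment.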
Read Microsoft BitNet b1.58 Makes Private AI Cheap Enough for Any Small Business if your current objection to AI is recurring token cost or data leaving your environment.
Ignore this for now
Ignore Granite 4.0 1B Speech for now if you do not process audio in production. It is a good release, but good speech infrastructure that solves no actual workflow is still dead weight. Text, retrieval, and software-operation models changed more operating math this week than voice did for the median team.
The migration order I would actually use
- GPT-5.4 first if you already automate work inside software interfaces.
- Gemini Embedding 2 second if your retrieval stack is bloated by modality-specific preprocessing.
- BitNet third if privacy, budget predictability, or offline operation are blocking deployment.
- Granite 4.0 1B Speech fourth if multilingual audio is a real revenue or operations surface.
- Nemotron 3 Super last unless you already own the hardware profile it was designed for.
That ordering is why “best model” is still the wrong question. The better question is which release deletes the most infrastructure, labor, or review burden in your current system.
My decisive verdict: GPT-5.4 changed operating assumptions most because it turned software interaction into something you can plan around rather than merely demo. Gemini Embedding 2 was the cleanest infrastructure simplifier. BitNet was the most important pricing signal. Nemotron 3 Super was the easiest release to overread. Granite 4.0 1B Speech was the quietest good launch. If you make one mistake this week, it will be treating all five as peers. They are not peers. Three change deployment choices now; two mostly change what belongs on your test bench.
