NVIDIA released Nemotron 3 Super on March 10 as a 120B-total, 12B-active open model built for Blackwell-class systems, with official claims of up to 2.2x higher inference throughput than GPT-OSS-120B and 7.5x higher throughput than Qwen3.5-122B on an 8K-input, 16K-output setup. The more useful number in the technical report, though, is 7%: by the end of NVFP4 pretraining, that share of the model’s parameters had zero-valued weight gradients.
That is not launch-day trivia. It is the clearest clue in the release about where Nemotron 3 Super is strong, where it is brittle, and what kind of evaluation an operator should run before swapping it into production.
The buried number in the report
Most coverage will stop at the headline metrics: 1M-token context window, LatentMoE, native speculative decoding, Blackwell optimization, open weights. All of that is real. The report backs it up with a fairly unusual design:
- 88 total layers, with only 8 attention layers
- 512 experts with top-22 routing in the latent expert path
- GQA with just 2 KV heads
- a latent expert dimension of 1024 against a 4096 hidden dimension
- two shared-weight MTP heads to improve speculative decoding stability
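The reported numbers above can be collected into a small sanity-check sketch. The field names here are illustrative, not NVIDIA's actual config schema; only the values come from the report.

```python
# Hypothetical config sketch built from the report's published numbers.
# Field names are illustrative, not NVIDIA's config schema.
nemotron3_super = {
    "total_params_b": 120,
    "active_params_b": 12,
    "num_layers": 88,
    "num_attention_layers": 8,   # the rest are Mamba-2 blocks
    "num_experts": 512,
    "top_k": 22,
    "kv_heads": 2,               # GQA
    "hidden_dim": 4096,
    "latent_expert_dim": 1024,
    "mtp_heads": 2,
}

# The hidden-to-latent ratio bounds how much expert communication and
# memory traffic shrink in the latent expert path.
compression = nemotron3_super["hidden_dim"] / nemotron3_super["latent_expert_dim"]
print(f"latent compression: {compression:.0f}x")  # → 4x
```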
That recipe is the reason the model can post aggressive throughput numbers without looking like a toy release. NVIDIA is clearly trading classic transformer habits for a system built around bandwidth pressure, KV-cache limits, and long-sequence economics.
But the 7% figure matters because it shows what that efficiency push costs. According to the technical report, low-magnitude expert channels underflowed under NVFP4 quantization, and zero-valued gradients accumulated about three times faster than in BF16. NVIDIA says the effect was reversible if training switched back to BF16 mid-run, but chose not to do that for the final model.
That does not mean Nemotron 3 Super is broken. It means the model is highly engineered around a hardware and numeric regime, not just around benchmark prestige.
Why the speed claim is credible
On paper, Nemotron 3 Super has one of the more grounded performance stories in open-model land right now.
The architecture does three smart things at once.
First, it minimizes attention overhead. With only a handful of attention layers and heavy use of Mamba-2 blocks, it cuts the KV-cache burden that drags down long generations.
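The KV-cache win is easy to put rough numbers on. The sketch below compares the published attention config (8 attention layers, 2 KV heads) against a hypothetical conventional all-attention baseline; the head dimension, baseline KV-head count, and 2-byte dtype are assumed values, not figures from the report.

```python
# Rough KV-cache sizing sketch. head_dim, the baseline's 8 KV heads,
# and bytes_per (fp16/bf16) are assumptions, not report figures.
def kv_cache_gb(attn_layers, kv_heads, head_dim, seq_len, batch, bytes_per=2):
    # Factor of 2 covers the separate K and V tensors.
    return 2 * attn_layers * kv_heads * head_dim * seq_len * batch * bytes_per / 1e9

# Nemotron 3 Super's published attention config vs. a hypothetical
# all-attention 88-layer model, both at the 1M-token context limit.
sparse = kv_cache_gb(8, 2, 128, 1_000_000, 1)
dense = kv_cache_gb(88, 8, 128, 1_000_000, 1)
print(f"{sparse:.1f} GB vs {dense:.1f} GB ({dense / sparse:.0f}x smaller)")
```

Whatever the exact head dimension, the ratio depends only on layers times KV heads, so the cache shrinks by roughly 44x against that baseline.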
Second, LatentMoE moves routed computation into a smaller latent space, which lowers communication and memory traffic by roughly the same ratio as the hidden-to-latent compression. In the white paper, NVIDIA describes this as a way to increase expert count and top-K activity without blowing up inference cost.
Third, MTP is doing real work here. The Super report says the model reaches an average acceptance length of 3.45 tokens per verification step on SPEED-Bench with draft length 7. That is the kind of detail generic model announcements usually hide, and it matters because speculative decoding only helps if the draft predictions are accepted often enough to offset coordination overhead.
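A simplified payoff model shows why the 3.45 acceptance figure is meaningful. This is an illustration, not NVIDIA's methodology, and the 5% per-token draft cost is an assumed value.

```python
# Simplified speculative-decoding payoff model (illustration only).
# Each target verification pass accepts `accept_len` tokens on average;
# the draft spends `draft_cost` (as a fraction of one target forward
# pass) per drafted token.
def spec_decode_speedup(accept_len, draft_len, draft_cost):
    cost_per_cycle = 1.0 + draft_len * draft_cost  # one verify + drafts
    return accept_len / cost_per_cycle             # tokens per unit cost

# Report's figures: acceptance 3.45 at draft length 7. The 0.05 draft
# cost is an assumption for illustration.
print(f"{spec_decode_speedup(3.45, 7, 0.05):.2f}x")  # → 2.56x
```

The point of the model: if acceptance were much lower, the seven drafted tokens per cycle would eat the gain, which is why the acceptance length is the number worth checking.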
In plain English: Nemotron 3 Super looks built for teams that care about tokens-per-second under ugly real workloads, not for leaderboard screenshots alone.
Where the model is weaker than the announcement suggests
The press story is “fast, open, long-context, competitive.” The fuller read is narrower.
Nemotron 3 Super looks strongest for high-volume internal agent workloads with long documents, retrieval-heavy context, and repeated structured actions. NVIDIA explicitly calls out IT ticket automation, and that seems directionally right.
The weaker spots are just as important:
- On SWE-Bench OpenHands, the report shows 60.47, behind Qwen3.5-122B at 66.40.
- On TauBench V2 Telecom, it posts 64.36 versus Qwen3.5-122B at 95.00.
- On Arena-Hard-V2, it trails GPT-OSS-120B by a wide margin.
That combination tells me Nemotron 3 Super is not the obvious default for every coding agent or every customer-facing assistant. It looks more like a systems model than a universal model: excellent when the workload matches the hardware thesis, less impressive when judged as a broad chat-and-agency brain.
That distinction matters for an ops lead at a 20-to-50-person software firm. If your near-term project is internal support deflection, runbook lookup, or ticket triage over a large corpus, this release deserves a test. If your near-term project is code repair, telecom-style workflow navigation, or polished user-facing conversation, the report itself gives you reasons to keep alternatives in the bake-off.
The test worth running this week
A practical evaluation for that ops lead is not “which model wins the most benchmarks.” It is whether Nemotron 3 Super lets you reduce hardware pressure without blowing up the volume of failures you then have to review.
A clean test stack would be:
- TRT-LLM or vLLM for serving
- LlamaIndex or Haystack for retrieval over your support docs and ticket history
- a narrow tool layer for Jira, Zendesk, or Linear actions
- a logged replay set of 200 to 500 real internal support or ops requests
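A minimal replay loop over those logged requests can be sketched as below. The request schema, function names, and the stub endpoint are assumptions; `generate` would wrap whatever serving stack you choose (TRT-LLM, vLLM), not a specific API.

```python
import time

# Minimal replay-harness sketch: feed logged requests through a model
# endpoint, record latency and output size. `generate` is any callable
# wrapping your serving stack; the stub below stands in for it.
def replay(requests, generate):
    results = []
    for req in requests:
        start = time.perf_counter()
        output = generate(req["prompt"])
        elapsed = time.perf_counter() - start
        results.append({
            "id": req["id"],
            "output": output,
            "latency_s": elapsed,
            "output_tokens": len(output.split()),  # crude token proxy
        })
    return results

# Stub endpoint, for illustration only.
def fake_generate(prompt):
    return "ACK: " + prompt[:32]

logged = [{"id": i, "prompt": f"ticket {i}: vpn fails on login"} for i in range(3)]
for row in replay(logged, fake_generate):
    print(row["id"], row["output_tokens"])
```

Swapping `fake_generate` for a thin client against your serving endpoint turns this into the real test.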
Measure four things:
- output tokens per second per GPU
- action accuracy on tool calls
- citation quality against your knowledge base
- verbosity drift across long sessions
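The four metrics above can be computed directly from replay logs. The record fields (`output_tokens`, `action`, `citations`, and so on) are assumed names for what your harness logs, not a standard schema, and the tiny log at the bottom is fabricated for illustration.

```python
# Sketches of the four acceptance metrics over a replay log.
def throughput(records):
    """Output tokens per second across the replay."""
    toks = sum(r["output_tokens"] for r in records)
    return toks / sum(r["latency_s"] for r in records)

def action_accuracy(records):
    """Fraction of tool calls matching the logged expected action."""
    hits = sum(r["action"] == r["expected_action"] for r in records)
    return hits / len(records)

def citation_hit_rate(records, kb_ids):
    """Fraction of cited doc IDs that actually exist in the KB."""
    cited = [c for r in records for c in r["citations"]]
    return sum(c in kb_ids for c in cited) / len(cited)

def verbosity_drift(records):
    """Least-squares slope of output length vs. turn index
    (tokens per turn; positive means answers grow over the session)."""
    n = len(records)
    mx = (n - 1) / 2
    my = sum(r["output_tokens"] for r in records) / n
    num = sum((i - mx) * (r["output_tokens"] - my) for i, r in enumerate(records))
    den = sum((i - mx) ** 2 for i in range(n))
    return num / den

# Fabricated three-turn log, for illustration only.
log = [
    {"output_tokens": 100, "latency_s": 1.0, "action": "close_ticket",
     "expected_action": "close_ticket", "citations": ["kb-1"]},
    {"output_tokens": 110, "latency_s": 1.0, "action": "escalate",
     "expected_action": "close_ticket", "citations": ["kb-9"]},
    {"output_tokens": 120, "latency_s": 1.0, "action": "close_ticket",
     "expected_action": "close_ticket", "citations": ["kb-2"]},
]
kb = {"kb-1", "kb-2"}
print(throughput(log), action_accuracy(log),
      citation_hit_rate(log, kb), verbosity_drift(log))
```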
That last one is not academic. The report notes that naive NVFP4 handling of the Mamba SSM cache could increase verbosity by as much as 40% unless casting behavior was fixed with stochastic rounding. If you are evaluating this model on Blackwell hardware, verbosity drift under long runs should be a first-class acceptance criterion, not an afterthought.
A realistic payoff estimate: if your current internal ticket assistant needs four GPUs to hold p95 response time under peak load, a true 2.2x throughput gain would imply something closer to two GPUs for the same traffic envelope, or the same four GPUs handling roughly double the concurrent queue. That estimate will move around with sequence length, batching, and serving stack, but the operating decision is clear: this model is potentially a capacity play before it is a quality play.
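The capacity arithmetic behind that estimate fits in a few lines. This just restates the back-of-envelope math above; it is not a measured result, and real numbers will shift with sequence length and batching.

```python
import math

# Back-of-envelope capacity math from the 2.2x headline claim.
current_gpus = 4
claimed_speedup = 2.2

same_traffic_gpus = math.ceil(current_gpus / claimed_speedup)  # shrink the fleet
same_fleet_queue = claimed_speedup                              # or grow the queue
print(same_traffic_gpus, "GPUs for today's traffic, or",
      f"{same_fleet_queue:.1f}x the concurrent queue on the current four")
```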
Nemotron 3 Super is worth attention because NVIDIA exposed more of the engineering tradeoffs than most vendors do. The launch story is speed. The real story is numeric discipline: if your workload matches the long-context, high-throughput thesis, test it hard; if not, do not mistake a Blackwell-optimized systems model for an all-purpose winner.
