Nvidia announced the Groq 3 LPX rack at GTC 2026. The headline number is 35x inference throughput over Blackwell. That number showed up in keynote slides, live tweets, and analyst summaries within minutes.
The unit is per megawatt.
Not 35x faster. Not 35x more tokens per second per chip. 35x more inference throughput per megawatt. The press release buries this in a single compound phrase: "up to 35x higher inference throughput per megawatt and up to 10x more revenue opportunity for trillion-parameter models." Two claims, one sentence, and the audience remembers the bigger number stripped of its denominator.
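To see why the denominator matters, here is a back-of-envelope sketch. Every figure in it is hypothetical and invented for illustration; none come from Nvidia's release. The point is that a per-megawatt multiplier tells you nothing about absolute throughput until you know each rack's power draw.

```python
# Illustrative arithmetic only: every figure here is hypothetical,
# not from Nvidia's release.

blackwell_tokens_per_sec = 1_000_000   # hypothetical rack throughput
blackwell_power_mw = 0.12              # hypothetical rack power draw

blackwell_per_mw = blackwell_tokens_per_sec / blackwell_power_mw

# The headline claim: 35x the inference throughput *per megawatt*.
lpx_per_mw = 35 * blackwell_per_mw

# Absolute throughput of an LPX rack depends on its own power draw,
# which the per-megawatt figure does not tell you.
for lpx_power_mw in (0.05, 0.12, 0.25):
    absolute = lpx_per_mw * lpx_power_mw
    print(f"LPX rack at {lpx_power_mw:.2f} MW -> {absolute:,.0f} tokens/s "
          f"({absolute / blackwell_tokens_per_sec:.1f}x the hypothetical Blackwell rack)")
```

Run with different assumed power draws and the "35x" collapses or inflates accordingly. That is the claim doing its work.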
128GB of SRAM and zero HBM
The LPX rack ships with 256 LPUs holding a combined 128GB of on-chip SRAM and 640 TB/s of scale-up bandwidth. No HBM. At all.
That is not a minor packaging decision. HBM is the single tightest bottleneck in GPU supply chains. SK Hynix, Samsung, and Micron collectively cannot produce enough HBM4 to satisfy demand from Nvidia, AMD, and every other accelerator vendor through 2027. Allocation fights have been public. Pricing has been volatile. Every NVL72 rack carries HBM stacks on each of its 72 Rubin GPUs, and sourcing that memory constrains how many racks Nvidia can actually ship.
The Groq LPX rack sidesteps the entire problem. SRAM is fabricated on the same die as the processor. No separate memory packaging, no CoWoS interposer dependency, no allocation negotiation with memory vendors. Nvidia can scale LPX production on a completely different supply chain track than its GPU racks.
How the architecture actually pairs
The LPX rack does not replace Vera Rubin NVL72. It extends it. In Nvidia's architecture, Rubin GPUs handle prefill — the compute-heavy phase where the model processes the input prompt. LPUs handle decode — the sequential, latency-sensitive phase where the model generates tokens one at a time.
This split matters because prefill and decode have fundamentally different hardware profiles. Prefill is parallel and compute-bound. Decode is serial, memory-bandwidth-bound, and latency-sensitive. GPUs are overbuilt for decode. LPUs, with deterministic execution and massive on-chip bandwidth, are purpose-built for it.
The connection between them runs over Direct C2C, a chip-to-chip interconnect that avoids network hops. Rubin GPUs and LPUs jointly compute every layer of the model for every output token. The practical effect is that a Vera Rubin + LPX deployment can serve trillion-parameter models at interactive latency without the GPU utilization waste that decode normally causes. A minimal sketch of the split, from the serving layer's point of view, follows.
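The class names, method names, and the explicit KV-cache handoff below are invented for illustration. They are not Nvidia's or Groq's software stack, and they simplify away the joint per-layer execution the announcement describes; the sketch only shows why the two phases want different hardware.

```python
from dataclasses import dataclass, field

# Hypothetical sketch of a prefill/decode split at the serving layer.
# No names here correspond to a real Nvidia or Groq API.

@dataclass
class Request:
    prompt_tokens: list[int]
    max_new_tokens: int
    kv_cache: str = ""                               # handle produced by prefill
    output_tokens: list[int] = field(default_factory=list)

class GpuPrefillPool:
    """Parallel, compute-bound: processes the whole prompt in one pass."""
    def prefill(self, req: Request) -> Request:
        req.kv_cache = f"kv-cache({len(req.prompt_tokens)} prompt tokens)"
        return req

class LpuDecodePool:
    """Serial, latency-sensitive: one token per step, touching the cache every step."""
    def decode(self, req: Request) -> Request:
        for step in range(req.max_new_tokens):
            req.output_tokens.append(step)           # stand-in for a sampled token
        return req

def serve(req: Request) -> Request:
    # The handoff between pools is where chip-to-chip bandwidth matters:
    # whatever state prefill produces has to reach the decode side cheaply.
    return LpuDecodePool().decode(GpuPrefillPool().prefill(req))

print(serve(Request(prompt_tokens=list(range(512)), max_new_tokens=8)).output_tokens)
```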
The supply chain geometry nobody is discussing
Ryan Shrout, who covers silicon for a living, flagged the right question in his GTC notes: "Lots of questions still on cost, capacity tradeoffs, and real-world deployment." Max Weinbach connected a different dot — the Groq partnership "gets away from CoWoS/TSMC and HBM bottlenecks."
Both are circling the same point. Nvidia just built a rack-scale inference product that does not compete with its own GPU products for the two scarcest resources in AI hardware: HBM capacity and advanced packaging slots at TSMC.
For an IT buyer at a 30-person AI startup or a mid-size cloud provider building out inference capacity, this changes the procurement calendar. GPU racks have lead times driven by memory supply. LPX racks have lead times driven by wafer supply, which is a different queue with different constraints. If Nvidia can deliver LPX racks on a faster timeline than additional NVL72 racks, inference-heavy customers get capacity sooner — at the cost of needing both rack types instead of one.
The capital planning question becomes: do you buy one NVL72 rack and use GPUs for both prefill and decode (simpler, more expensive per token), or do you buy a smaller GPU allocation plus LPX racks (more complex, potentially cheaper per token at scale, and potentially available sooner)?
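One way to work that question is a straight cost-per-token comparison. The sketch below uses placeholder prices, power draws, and throughputs; the structure of the calculation is the useful part, not the numbers.

```python
# Hypothetical capital-planning comparison. All capex, power, and throughput
# figures are placeholders, not vendor data; substitute real quotes.

def cost_per_million_tokens(capex_usd, lifetime_years, power_mw,
                            power_cost_per_mwh, tokens_per_sec):
    hours = lifetime_years * 365 * 24
    total_cost = capex_usd + power_mw * hours * power_cost_per_mwh
    total_tokens = tokens_per_sec * hours * 3600
    return total_cost / (total_tokens / 1e6)

# Option A: one NVL72 rack handling both prefill and decode.
option_a = cost_per_million_tokens(capex_usd=3_500_000, lifetime_years=4,
                                   power_mw=0.12, power_cost_per_mwh=80,
                                   tokens_per_sec=400_000)

# Option B: a smaller GPU allocation for prefill plus an LPX rack for decode.
option_b = cost_per_million_tokens(capex_usd=2_000_000 + 1_200_000, lifetime_years=4,
                                   power_mw=0.10, power_cost_per_mwh=80,
                                   tokens_per_sec=900_000)

print(f"Option A: ${option_a:.3f} per million tokens")
print(f"Option B: ${option_b:.3f} per million tokens")
```

With these made-up inputs Option B wins on cost per token; with your actual quotes it may not. The exercise is worth doing before the rep does it for you.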
The "10x revenue opportunity" claim deserves the same scrutiny
The second number in Nvidia's press release — "up to 10x more revenue opportunity for trillion-parameter models" — is not a performance metric. It is a business projection. Nvidia is arguing that LPX makes trillion-parameter inference economically viable where it previously was not, and that the resulting market expansion represents 10x more addressable revenue.
That framing benefits Nvidia regardless of whether individual customers save money. If LPX enables inference workloads that were previously too expensive to serve, the total spend on Nvidia hardware goes up even if the per-token cost goes down. The customer gets cheaper tokens. Nvidia gets a larger market. Both can be true simultaneously.
Skip the 35x, price the SRAM
The useful number from this announcement is not 35x. It is 128GB of SRAM across 256 processors in a single rack, connected at 640 TB/s, with no HBM dependency. That set of specifications determines what models fit, what latency is achievable, and what the supply chain looks like.
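A back-of-envelope sizing check shows why those specs are the ones to interrogate. The arithmetic below is generic, parameters times bytes per weight against 128GB of rack-level SRAM; how the Vera Rubin + LPX pairing actually partitions or streams weights is exactly the deployment detail to press the vendor on.

```python
# Generic weight-footprint arithmetic against 128GB of rack-level SRAM.
# This does not model how the Vera Rubin + LPX pairing partitions or
# streams weights; it only shows what "fully resident in SRAM" would imply.

SRAM_PER_RACK_GB = 128

def weight_footprint_gb(params_billion, bits_per_weight):
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

for params_b in (70, 400, 1000):
    for bits in (8, 4):
        gb = weight_footprint_gb(params_b, bits)
        print(f"{params_b:>5}B params @ {bits}-bit: {gb:>6.0f} GB "
              f"(~{gb / SRAM_PER_RACK_GB:.1f} racks of SRAM if fully resident)")
```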
A managed services provider planning inference capacity for 2027 should be asking their Nvidia rep one question: what is the lead time delta between an NVL72 rack and an LPX rack? If the answer is measured in months, the LPX's value is not just performance per megawatt — it is time to revenue.
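If you want to put a number on time to revenue, the conversion is simple: each month of lead-time delta is a month of serving capacity you either sell or do not. The figures below are placeholders.

```python
# Placeholder figures for a time-to-revenue comparison; substitute your own
# lead-time quotes and the monthly revenue one rack's capacity could serve.

lead_time_nvl72_months = 9            # assumed GPU-rack lead time
lead_time_lpx_months = 4              # assumed LPX lead time
monthly_revenue_per_rack = 250_000    # assumed inference revenue per rack per month

delta_months = lead_time_nvl72_months - lead_time_lpx_months
print(f"Revenue pulled forward by earlier delivery: "
      f"${delta_months * monthly_revenue_per_rack:,.0f}")
```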
