Alibaba's Qwen team dropped four new dense models today -- Qwen3.5-0.8B, 2B, 4B, and 9B -- and the headline benchmark claim deserves a close read: the 9B outperforms "large Qwen3 and previous closed models" across math, reasoning, and long video understanding tasks, while running in roughly 6 GB of VRAM at NVFP4 quantization.
That's a single RTX 5090 using less than a fifth of its 32 GB of memory. Or, put differently, hardware that was already sitting in a mid-range workstation.
Before you rebuild your inference stack, here's what the numbers actually show -- and where the caveats live.
The Four Models Released Today
Qwen 3.5 ships as four dense models (not mixture-of-experts). Dense matters here: no routing overhead, more predictable latency, simpler deployment on commodity hardware.
| Model | Parameters | Estimated VRAM (NVFP4) | Primary Use Case |
|---|---|---|---|
| Qwen3.5-0.8B | 0.8B | ~1 GB | Edge, embedded, real-time classification |
| Qwen3.5-2B | 2B | ~2.5 GB | Mobile inference, lightweight agents |
| Qwen3.5-4B | 4B | ~4 GB | Mid-tier tasks, document processing |
| Qwen3.5-9B | 9B | ~6 GB | General reasoning, code, long context |
All four models support a native 262k context window, extensible to 1M tokens. That last detail is not a footnote -- long context at this parameter count was not available at this performance tier six months ago.
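The VRAM estimates in the table above are roughly what you get from a back-of-envelope calculation: NVFP4 stores 4 bits per weight plus one FP8 (1-byte) scale per block of 16 weights, about 4.5 bits per weight in total. A sketch, covering weights only (KV cache, activations, and runtime overhead are excluded, which is why real-world figures land somewhat higher):

```python
def nvfp4_weight_gib(params_b: float, block: int = 16) -> float:
    """Back-of-envelope weight memory at NVFP4: 4 bits per weight
    plus one FP8 (1-byte) scale shared by each `block` weights."""
    bits_per_weight = 4 + 8 / block  # 4.5 bits with 16-weight blocks
    return params_b * 1e9 * bits_per_weight / 8 / 2**30

for p in (0.8, 2, 4, 9):
    print(f"{p}B -> {nvfp4_weight_gib(p):.2f} GiB (weights only)")
```

For the 9B this comes out around 4.7 GiB of weights, consistent with the ~6 GB figure once KV cache and framework overhead are added -- and it also shows why a long-context session at 262k tokens can grow the footprint well past the table's numbers.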
What the 9B Benchmark Claims Show
Alibaba's benchmarks put the 9B above several models that were, until recently, considered top-tier: GPT-4o-mini-class API models, and earlier Qwen3 variants that run at considerably higher parameter counts.
The official Qwen 3.5 model cards show improvements on MATH-500, AIME, and a suite of multimodal benchmarks including long video understanding -- a genuinely hard task that typically required much larger models.
Independent early testing echoes this, with at least one hardware analysis noting the NVFP4 9B variant beating "120B GPT-OSS class models" on specific reasoning tasks while occupying a fraction of the GPU budget.
Two things worth noting before you take that claim at face value:
- "Beating 120B-class models" depends entirely on which benchmark and which 120B model. Qwen3.5-9B likely wins on tasks that play to its training strengths; it does not win universally.
- NVFP4 quantization introduces quality tradeoffs on nuanced language tasks compared to full BF16. If you need precise document generation or contract analysis, test at the task level, not the benchmark level.
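To see where the NVFP4 tradeoff comes from, here is a minimal sketch of block-scaled FP4 quantization: each block of weights shares one scale, and every value snaps to the nearest of the eight FP4 E2M1 magnitudes. The weight values are made up for illustration; real NVFP4 kernels also use a second per-tensor scale, omitted here for brevity.

```python
# Representable magnitudes of the FP4 E2M1 format NVFP4 is built on
FP4_LEVELS = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def quantize_block(xs):
    """Quantize one block of weights to FP4 with a shared scale,
    then dequantize -- returns the reconstructed values."""
    scale = max(abs(x) for x in xs) / 6.0 or 1.0  # map block max to 6.0
    out = []
    for x in xs:
        mag = min(FP4_LEVELS, key=lambda lv: abs(abs(x) / scale - lv))
        out.append(mag * scale * (1 if x >= 0 else -1))
    return out

block = [0.02, -0.15, 0.31, 0.07, -0.44, 0.09, 0.26, -0.03]
recon = quantize_block(block)
print(max(abs(a - b) for a, b in zip(block, recon)))  # worst error in block
```

The largest weight in each block is reconstructed almost exactly, while small-magnitude weights absorb most of the rounding error -- fine for benchmarks dominated by strong signals, less fine for tasks where subtle distinctions matter. Hence the advice: test at the task level.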
Running the VRAM Math
The practical reason this release matters: it changes the cost floor for private inference.
A workstation with a single RTX 5090 (32 GB VRAM, ~$2,000 at current street pricing) can run Qwen3.5-9B at NVFP4 in under 6 GB -- which means you can also run several instances in parallel on the same GPU, or run the 9B alongside smaller models for a routing setup without dedicated hardware per model.
Compare that to the equivalent capability from an API provider:
- OpenAI GPT-4o-mini: ~$0.15 per 1M input tokens
- At 10M input tokens/day, that's only about $45/month -- but multi-agent workflows that re-send long contexts on every call can easily reach 100M+ tokens/day, pushing costs past $450/month before output tokens (~$0.60 per 1M) are counted
- At that higher volume, a mid-range inference server with the 9B running locally pays back in months, not years
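The break-even arithmetic above is easy to rerun for your own workload. A sketch, using assumed figures (a ~$30/month power cost, input-token pricing only, and no accounting for engineering time):

```python
def breakeven_months(gpu_cost_usd: float,
                     tokens_per_day_m: float,
                     price_per_m_tokens: float,
                     power_usd_month: float = 30.0) -> float:
    """Months until a local GPU beats API spend at a given volume.
    Assumes input-token pricing only and a flat power cost."""
    api_monthly = tokens_per_day_m * price_per_m_tokens * 30
    saved = api_monthly - power_usd_month
    return float("inf") if saved <= 0 else gpu_cost_usd / saved

# 10M input tokens/day at GPT-4o-mini input pricing (~$0.15 / 1M)
print(round(breakeven_months(2000, 10, 0.15), 1))    # -> 133.3 months
# 100M tokens/day -- closer to a context-heavy multi-agent workflow
print(round(breakeven_months(2000, 100, 0.15), 1))   # -> 4.8 months
```

The sensitivity to volume is the whole story: at light usage the API wins outright, while context-heavy agent workflows cross over within a quarter.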
The break-even math shifts dramatically once you factor in data residency requirements or workflows where latency matters more than raw quality. This model hits a sweet spot for structured extraction, classification, routing agents, and retrieval-augmented generation where a GPT-4 class model is overkill.
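The routing setup mentioned above can be as simple as a lookup keyed on task type and error cost. A minimal sketch -- the model tags, task names, and the `error_cost` knob are all illustrative, not a real API:

```python
# Hypothetical router: names and categories are illustrative only.
LOCAL_TASKS = {"classify", "extract", "route", "rag_answer"}

def pick_backend(task: str, error_cost: str = "low") -> str:
    """Send cheap, structured tasks to the local 9B; escalate
    open-ended or expensive-to-get-wrong work to a frontier API."""
    if task in LOCAL_TASKS and error_cost == "low":
        return "local/qwen3.5-9b-nvfp4"   # assumed local model tag
    return "api/frontier-model"           # e.g. GPT-4o or Claude

print(pick_backend("classify"))                            # stays local
print(pick_backend("draft_contract", error_cost="high"))   # goes to API
```

The point is that the local 9B only needs to absorb the high-volume, low-stakes traffic for the economics to work; everything else can still go to an API.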
Where the Floor Still Sits
This is not an argument to rip out your API calls. Three places where local Qwen 3.5 underdelivers:
Instruction following on novel tasks. GPT-4o and Claude Sonnet still outperform small open-source models when the prompt requires nuanced instruction adherence across a long chain of reasoning. For creative work, edge case handling, or anything where errors are expensive, API models remain the safer default.
Tooling maturity. The ecosystem around Qwen 3.5 is a day old. Quantized GGUF variants, Ollama support, and LM Studio integration will arrive, but if you need production-ready deployment today, you're early.
Multimodal tasks. The Qwen team notes video and image understanding improvements, but independent validation on business-relevant multimodal tasks (document parsing, diagram analysis) is sparse. Treat it as promising, not proven.
The Qwen 3.5 release is the clearest signal yet that the small model tier is no longer a compromise. Teams that dismissed on-prem inference because the quality gap was too wide now have a narrower case for that position -- especially at 262k context where cloud API costs compound fast. Run the numbers for your specific workload before assuming the API is still the obvious choice.
