You've set temperature=0. You've read the docs. Greedy sampling should always pick the highest-probability token. Same prompt, same output -- every time.
Then you run it again and the answer is different.
This isn't a bug you introduced. It's been a known issue across OpenAI's API, vLLM, SGLang, and most production inference stacks for years. The documentation pages for both vLLM and SGLang openly acknowledge it. But the explanation usually given -- "GPUs have floating-point nondeterminism due to parallelism" -- is only partly true, and it obscures the actual fix.
Understanding the real cause matters, because three distinct engineering teams have now attacked the problem from different angles. Their approaches have tradeoffs worth knowing before you decide which one fits your stack.
What's Actually Happening
The common explanation goes: GPUs run operations in parallel, floating-point addition is non-associative (meaning (a + b) + c can differ from a + (b + c) at finite precision), so when the order of parallel reductions varies run-to-run, the results vary with it.
That's true as far as it goes. But it doesn't explain why a plain matrix multiplication on the same data produces bitwise-identical results every time you run it on the same GPU, even though it uses floating-point and massive parallelism. That operation is deterministic. Transformer inference isn't.
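The non-associativity itself is easy to see without any GPU at all -- plain float64 arithmetic is enough. Summing the same three numbers in two different orders, chosen so one term falls below the rounding step of another:

```python
a, b, c = 1e16, 1.0, -1e16

s1 = (a + b) + c  # the 1.0 is absorbed into 1e16 (below its rounding step), then cancelled -> 0.0
s2 = (a + c) + b  # the big terms cancel first, so the 1.0 survives -> 1.0

print(s1, s2)  # 0.0 1.0
```

This is exactly the kind of reordering a parallel reduction performs, just scaled up to thousands of terms.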
The difference is batch size. When your inference server handles multiple requests at once, it batches them together for efficiency. The batch composition changes depending on what other requests are in flight when yours arrives. Different batch sizes mean different GPU kernel dispatch strategies, which change the order of floating-point reductions across tokens in the attention layers -- and that changes the result.
This is the finding from Thinking Machines Lab's research: the root cause isn't concurrency in the abstract, it's specifically that existing GPU kernels aren't batch-invariant. The same token processed in a batch of 1 gets different intermediate rounding than the same token processed in a batch of 32.
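To see how a batch-dependent split changes the result, here is a toy reduction in plain Python (a hypothetical `split_reduce` helper, not any real kernel) that mimics a kernel choosing a different partial-sum chunk size depending on batch shape:

```python
def split_reduce(xs, chunk):
    """Sum xs by reducing fixed-size chunks first, then combining the
    partial sums -- mimicking a kernel whose tiling varies with batch size."""
    partials = []
    for i in range(0, len(xs), chunk):
        s = 0.0
        for v in xs[i:i + chunk]:
            s += v
        partials.append(s)
    total = 0.0
    for p in partials:
        total += p
    return total

x = [1e16, 1.0, -1e16, 1.0]
print(split_reduce(x, chunk=1))  # 1.0
print(split_reduce(x, chunk=2))  # 0.0 -- same data, different split
```

Same inputs, same arithmetic, different grouping -- and the answer changes. Scale this up to attention reductions over long sequences and the divergence propagates into different sampled tokens.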
For most applications this doesn't matter. For audit logs, reproducible research workflows, regression testing, and any system where you need to diff model outputs across deploys -- it matters a lot.
Three Fixes, Three Levels
Kernel layer: Batch-invariant ops (Thinking Machines Lab + SGLang)
Thinking Machines Lab open-sourced a library of batch-invariant kernels that replace the nondeterministic reduction strategies in standard PyTorch operators. The key constraint: every kernel uses a single, universal reduction strategy regardless of batch size. You lose some throughput -- the "smart" kernel strategies that pick different execution paths based on batch composition are part of what makes modern inference fast -- but you get bitwise reproducibility.
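The batch-invariance property itself is simple to state in code. Here is a toy sketch (not the library's actual kernels, which live at the Triton/PyTorch level): every output element is reduced in a fixed left-to-right order that never consults the batch size, so a row computed alone is bitwise-identical to the same row computed inside a larger batch.

```python
import numpy as np

def matvec_batch_invariant(W, X):
    """Toy batch-invariant matmul: each output element is reduced
    left-to-right over k, regardless of how many rows X has."""
    out = np.zeros((X.shape[0], W.shape[0]), dtype=X.dtype)
    for b in range(X.shape[0]):
        for o in range(W.shape[0]):
            acc = X.dtype.type(0.0)
            for k in range(W.shape[1]):
                acc += W[o, k] * X[b, k]
            out[b, o] = acc
    return out

rng = np.random.default_rng(0)
W = rng.standard_normal((8, 64), dtype=np.float32)
x = rng.standard_normal((1, 64), dtype=np.float32)
batch = np.concatenate([x, rng.standard_normal((31, 64), dtype=np.float32)])

alone = matvec_batch_invariant(W, x)[0]
in_batch = matvec_batch_invariant(W, batch)[0]
assert (alone == in_batch).all()  # bitwise identical by construction
```

A real kernel achieves the same property with tiled reductions rather than a scalar loop, but the invariant is the same: the reduction tree for any one row must not depend on its neighbors.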
SGLang adopted this approach and now supports fully deterministic inference while keeping chunked prefill, CUDA graphs, radix cache, and non-greedy sampling functional. That's a meaningful list of features to keep -- determinism in SGLang isn't a stripped-down mode.
The tradeoff: you're working at the kernel level, which means you need to be running your own inference server (not using hosted APIs), and you accept some throughput overhead for the reproducibility guarantee.
Infrastructure layer: Inference as a pure function (Eigen AI)
Eigen AI's framing treats the inference call as a pure function: output = f(model, architecture, prompt, seed, decode policy). If you hold all inputs constant, you get a constant output -- but the guarantee lives at the infrastructure layer, not the kernel level. Eigen tracks model weights, quantization, hardware, and scheduling as part of the function signature.
This is a different bet than batch-invariant kernels. Instead of fixing the nondeterminism at the GPU level, you're managing it by making the full compute environment reproducible and version-pinned. Practically this means tighter constraints on deployment: you can't transparently swap in a faster kernel or update your inference runtime without potentially changing outputs.
The benefit is that you get reproducibility across hardware generations if you control the full stack. The cost is operational rigidity.
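One way to picture the pure-function framing (a hypothetical sketch, not Eigen's actual API) is to serialize everything that can move the output into a canonical signature, and treat two calls as comparable only when their signatures match:

```python
import hashlib
import json

def inference_signature(**inputs):
    """Hypothetical sketch: derive a key from every input that can affect
    the output -- weights, quantization, hardware, runtime, decode policy."""
    canonical = json.dumps(inputs, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

sig_a = inference_signature(
    model="example-8b", weights_sha256="abc123", quantization="fp8",
    hardware="H100", runtime="v0.4.2", prompt="2+2?", seed=0,
    decode_policy={"temperature": 0.0, "top_p": 1.0},
)
sig_b = inference_signature(
    model="example-8b", weights_sha256="abc123", quantization="fp8",
    hardware="H100", runtime="v0.4.3",  # runtime bumped -> new signature
    prompt="2+2?", seed=0,
    decode_policy={"temperature": 0.0, "top_p": 1.0},
)
assert sig_a != sig_b
```

The operational rigidity falls directly out of this framing: a runtime upgrade changes the signature, so outputs under the new signature can't be assumed to match the old ones.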
Hardware layer: Remove dynamic scheduling entirely (Groq)
Groq's approach is the most radical: the hardware is designed to eliminate dynamic execution paths by construction. Where NVIDIA GPUs dispatch work dynamically (and variably) across CUDA cores based on workload, Groq's Language Processing Units use a deterministic schedule baked in at compile time. Same inputs, same hardware -- bitwise identical output, every run.
The tradeoff is flexibility. You're on Groq's hardware, running Groq-compiled models, with whatever optimization choices Groq's compiler made. For latency-sensitive applications where you want both speed and reproducibility, this is attractive. For teams that need to run custom kernels or modify model architectures frequently, it's constraining.
Which Approach for Which Team
Running vLLM or SGLang on your own GPU cluster: The batch-invariant kernel approach from Thinking Machines Lab is the most tractable option. SGLang's implementation is the most production-tested path today.
Using a hosted API (OpenAI, Anthropic, Google): You have no access to the kernel layer. Reproducibility guarantees from these providers are limited to session-level seeding when offered at all -- and that doesn't prevent batch-composition nondeterminism. If you genuinely need determinism, a hosted API isn't the right substrate.
Building audit-grade workflows where every response needs to be reconstructable: Groq's hardware approach gives you the strongest guarantee, but you're committing to their ecosystem. Eigen's approach gives you infrastructure-level reproducibility if you control the full deployment stack.
Running regression tests across model versions: This is the underappreciated use case. If you're trying to detect whether a model update changed behavior on a fixed eval set, nondeterminism at the inference layer adds false positives to your diffs. Batch-invariant kernels are the minimum viable fix.
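A minimal harness for that regression-testing case (a hypothetical helper; `model_fn` is any callable that returns a response string): run the same fixed eval set repeatedly against the same deployment and flag prompts whose outputs differ. Before diffing across model versions, this measures how much noise the inference layer alone contributes.

```python
def inference_noise(model_fn, prompts, runs=2):
    """Return prompts whose outputs differ across repeated identical calls.
    A nonempty result means diffs against a new model version will contain
    false positives from the inference stack itself."""
    outputs = [[model_fn(p) for p in prompts] for _ in range(runs)]
    return [
        p for i, p in enumerate(prompts)
        if any(run[i] != outputs[0][i] for run in outputs[1:])
    ]

# A deterministic stub flags nothing:
assert inference_noise(lambda p: p.upper(), ["a", "b"]) == []
```

Only after this returns an empty list does a nonempty cross-version diff actually mean the model changed.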
The Benchmark Gap
LLM evals typically don't test for inference reproducibility -- they test accuracy, latency, cost, and reasoning quality. That means teams often ship to production without knowing whether their model will return the same answer twice on the same input.
For classification tasks, document extraction, or structured output generation that feeds downstream systems, this gap has real consequences. A response that varies across retries isn't just an aesthetic problem -- it's a failure mode in any system that assumes idempotent model calls.
The Thinking Machines Lab research is open-sourced on GitHub. The SGLang deterministic inference docs explain how to enable it in an existing deployment. Neither requires abandoning your current stack -- just adding a constraint to how kernels dispatch work.
The harder question is whether your architecture currently assumes determinism without testing for it. Most production deployments do.
