Why temperature-zero LLM inference produces different outputs every run, why the standard explanation is wrong, and how batch-invariant GPU kernels fix it — at the level of floating-point bits, CUDA tiles, and FlashDecoding splits.
Every concept in this lesson and how they connect — the territory before the map.
This lesson unpacks a single insight: LLM nondeterminism at temperature zero comes from batch size variance, not atomic operations. That one sentence requires understanding 16 interconnected concepts across four clusters. The constellation below shows how they relate.
Every concept you’ll encounter, sorted by cluster.
Why (a+b)+c != a+(b+c) in floating-point arithmetic. Ch 02.
Different batch sizes produce different reduction orders. Ch 03.
The sequence in which partial sums are accumulated on GPU. Ch 03.
Serving systems group requests into variable-size batches. Ch 01.
Normalization layer — trivially data-parallel. Ch 04.
General matrix multiply — tiled with split-K. Ch 04.
Split KV across SMs for parallel decode. Ch 05.
Splitting the K dimension across multiple thread blocks. Ch 04.
Matrix-vector multiply — batch=1 special path. Ch 04.
One batch element per core, fixed reduction. Ch 04.
Always split at fixed token boundaries in attention. Ch 05.
Assign one batch element per SM to avoid cross-talk. Ch 04.
Each core reduces its own elements independently. Ch 04.
KL divergence exactly 0, no importance weighting. Ch 06.
Reproducible outputs enable bisection debugging. Ch 06.
~1.6x slowdown for full determinism. Ch 06.
This lesson is structured in layers. You can read it linearly or skip around.
If you already understand FP non-associativity, skip to Chapter 03. If you only care about the solution, jump to Chapter 04.
Temperature zero. Same seed. Same GPU. Same prompt. Run it 1000 times. How many unique outputs?
You would expect the answer to be one. Temperature zero means greedy decoding — always pick the most probable token. There is no randomness in the sampling step. The model weights are fixed. The input is fixed. This is a deterministic function.
And yet: 80 unique completions from 1000 runs of Qwen 3-235B at temperature zero.
This is not a rare edge case. It happens on every modern serving system. vLLM, TensorRT-LLM, SGLang — they all exhibit it. The nondeterminism is not in the sampling. It is in the forward pass itself.
The histogram above simulates what happens when you run the same prompt through an LLM serving system 1000 times. Each bar represents a distinct output. The heights show how often each output appeared. Notice: no single output dominates. The distribution is surprisingly flat — the model is essentially rolling dice at every step.
Ask any ML engineer why LLM inference is nondeterministic, and you’ll hear: “atomic adds.” The story goes like this:
This explanation is mechanically correct about FP non-associativity but wrong about where it applies. Here is the critical fact:
Atomic adds appear in backward passes (gradient accumulation) and in some older pointwise kernels. But the inference path — the path that determines what token comes next — does not use them. If atomics were the cause, you could fix it with CUBLAS_WORKSPACE_CONFIG=:4096:8 and torch.use_deterministic_algorithms(True). Those flags do nothing for inference nondeterminism.
The nondeterminism comes from batch size variance. In a serving system:
From your perspective as a user, other concurrent users are nondeterministic noise injected into your computation. Their presence changes the batch size, which changes the tiling, which changes the reduction order, which changes your logits, which changes your output.
This is the thesis we will unpack over the next four chapters: exactly how batch size changes reduction order in each kernel, and exactly how to fix it.
Real numbers are infinite. Floating-point numbers are finite. Something has to give.
In real arithmetic, addition is associative: $(a + b) + c = a + (b + c)$. Always. No exceptions. This is so fundamental that we rarely think about it.
In floating-point arithmetic, this property does not hold. Here is why:
A 32-bit float (FP32) stores a number as: $(-1)^s \times 1.m \times 2^e$ where $s$ is the sign bit, $m$ is a 23-bit mantissa, and $e$ is an 8-bit exponent. The mantissa gives you about 7 decimal digits of precision. When you add two numbers of very different magnitudes, the smaller number must be aligned — shifted right until the exponents match. Bits that shift off the end of the 23-bit mantissa are lost forever.
Now consider the triple $(10^{-8}, 10^{-8}, 1.0)$. Added left-to-right: $(10^{-8} + 10^{-8}) + 1.0 = 2 \times 10^{-8} + 1.0 = 1.0$ (the small part is still lost). Added right-to-left: $10^{-8} + (10^{-8} + 1.0) = 10^{-8} + 1.0 = 1.0$ (same). But consider $(-1.0, 1.0, 10^{-8})$: left-to-right gives $0 + 10^{-8} = 10^{-8}$, while $(1.0 + 10^{-8}) - 1.0 = 1.0 - 1.0 = 0$ in FP32. Different answers depending on order.
Let’s look at what happens at the bit level. When adding $a + b$ where $|a| \gg |b|$:
The mantissa of the smaller number is shifted right until both exponents match. For FP32 with 23 mantissa bits, if the exponents differ by more than 23, all bits of the smaller number shift into oblivion.
This is where precision loss occursThe aligned mantissas are added in a fixed-width adder. Any carry propagates normally.
No loss here — exact addition of the aligned valuesThe result is normalized (shift until leading 1) and rounded to 23 bits. This rounding introduces an additional error of at most 0.5 ULP.
Second source of error — compounds with alignment lossThe key insight: the error depends on what you are adding to what. If you sum a million small numbers first and then add a big number, you lose less than if you interleave big and small. The order of summation determines the final answer.
Take 8 numbers. Shuffle them. Sum left-to-right. How many distinct answers can you get?
Consider the array [1e-10, 1e-5, 1e-2, 1, -1e-10, -1e-5, -1e-2, -1]. The true sum is exactly zero. But if you shuffle the array and sum left-to-right in FP32, you get different answers depending on the permutation. Out of all $8! = 40320$ possible orderings, the blog post reports 102 distinct sums.
The interactive visualization below lets you explore this. Shuffle the array and watch how different accumulation orders produce different final sums — all from the same set of numbers.
Each time you click “Shuffle & Sum,” the visualization shows:
“Run 1000 Shuffles” reveals how many distinct results are possible. This is exactly the mechanism behind GPU nondeterminism — when the accumulation order changes, the answer changes.
In a real LLM, we are not summing 8 numbers. We are computing dot products of dimension 4096 or higher. Each dot product is a sum of 4096 products. The error compounds.
And that analysis is for a single dot product. A single forward pass of a large LLM involves millions of dot products chained together. Errors accumulate across layers. By the time you reach the output logits, a tiny change in accumulation order early in the network can shift probabilities enough to change the argmax — picking a completely different token.
One different token early in generation cascades: the KV cache now contains a different prefix, every subsequent token is conditioned on different context, and the entire completion diverges.
The forward pass of a transformer is three operations: matmul, attention, normalization. None use atomic adds.
MatMul (GEMM) decomposes the output matrix into tiles. Each tile is computed by one thread block using a structured reduction — load tile of A, load tile of B, accumulate partial products in registers, write the final result. No atomics.
Attention (FlashAttention, FlashDecoding) processes chunks of the KV cache in registers, accumulating softmax numerators and denominators in a structured order. No atomics.
RMSNorm computes a mean-square and normalizes. The reduction across the hidden dimension uses warp shuffles and shared memory — not atomics.
So where does the order change? It changes because the tiling strategy depends on the batch size.
When cuBLAS computes $C = A \times B$ where $A$ is $(M, K)$ and $B$ is $(K, N)$, it must decide how to decompose the work. The $(M, N)$ output is divided into tiles, and each tile is assigned to a streaming multiprocessor (SM).
For a given tile size (say 128x128), the number of tiles is $\lceil M/128 \rceil \times \lceil N/128 \rceil$. When $M$ (the batch dimension) changes, the number of tiles changes, and cuBLAS may choose a completely different kernel configuration:
| Batch Size | Kernel | K-Splits | Tile Size |
|---|---|---|---|
| 1 | GEMV (special path) | 1 | Full row |
| 2–16 | Skinny GEMM | 1–4 | 16×128 |
| 17–512 | Standard GEMM | 1–8 | 64×128 |
| 513+ | Large GEMM + Split-K | 4–16 | 128×128 |
The critical column is K-Splits. When the K dimension is split across multiple thread blocks, each block computes a partial sum. These partial sums are then reduced (summed) to produce the final output tile. The order of this final reduction depends on which blocks finish first — and even if the order is deterministic for a given configuration, a different number of splits means a different summation tree.
Drag the slider and watch how the tiling pattern changes. At batch=1, there is one row of tiles with no K-splits (GEMV). At batch=4, cuBLAS might use 2 K-splits. At batch=32, it might use 4. Each configuration sums the K dimension in a different order, producing a different floating-point result for the same mathematical operation.
The blog post includes a devastating one-liner that proves this effect:
import torch torch.set_default_device('cuda') B, D = 2048, 4096 a = torch.linspace(-1000, 1000, B*D).reshape(B, D) b = torch.linspace(-1000, 1000, D*D).reshape(D, D) out1 = torch.mm(a[:1], b) # batch=1: uses GEMV kernel out2 = torch.mm(a, b)[:1] # batch=2048: uses GEMM kernel print((out1 - out2).abs().max()) # tensor(1669.2500)
Let that sink in. The same mathematical operation — multiplying the same row of $A$ by the same matrix $B$ — produces answers that differ by 1669.25. Not 1669.25 ULPs. Not a relative error of $10^{-4}$. An absolute difference of 1669.25 in the output.
Why so large? The input values range from -1000 to 1000. The dot product dimension is 4096. With values spanning 6 orders of magnitude, catastrophic cancellation is rampant. The GEMV kernel (batch=1) accumulates left-to-right. The GEMM kernel (batch=2048) accumulates in a tiled reduction tree. Same inputs, different answers.
Now connect this back to LLM serving. In vLLM (or any continuous batching system):
From your perspective: same prompt, same model, same GPU, different output. The “nondeterminism” is deterministic given the batch composition — but since you cannot control who else is using the system, it appears random.
This explains why running the same prompt 1000 times on a loaded server produces 80 unique completions. Each run experiences a different sequence of batch sizes across its generation steps, producing a different trajectory through token space.
One batch element per core. Fixed reduction order. Same answer regardless of who else is in the batch.
The fix is conceptually simple: ensure that the computation for row $i$ follows the exact same accumulation order regardless of the total batch size. This means:
The engineering challenge: modern GPU kernels are highly optimized precisely because they adapt their tiling to the problem shape. Constraining the tiling sacrifices performance. The question is: how much?
RMSNorm is the easiest case. It computes, for each row $x$ of dimension $D$:
The reduction is over the hidden dimension (columns), and each row is independent. This is trivially data-parallel: assign one warp (or one thread block) per row. The reduction order within a row is determined only by the warp shuffle pattern, which is fixed by the kernel code.
As long as you don’t use a “split-reduction” variant that splits long rows across multiple blocks (unnecessary for typical hidden dims like 4096–8192), RMSNorm is already batch-invariant. Each row gets the same warp, does the same reduction, produces the same answer.
MatMul is harder. For $C = A \times B$ where $A$ is $(M, K)$ and $B$ is $(K, N)$:
Each output element $C_{ij} = \sum_{k=0}^{K-1} A_{ik} \cdot B_{kj}$ is a dot product of dimension $K$. In a tiled GEMM, this dot product is accumulated in chunks: load a tile of $A$ (size $M_{\text{tile}} \times K_{\text{tile}}$), load a tile of $B$ (size $K_{\text{tile}} \times N_{\text{tile}}$), accumulate the partial product in registers. After $\lceil K / K_{\text{tile}} \rceil$ iterations, the accumulator holds the final result.
For batch-invariance, we need:
The solution: use a fixed $M_{\text{tile}} = 1$ (or at least ensure each row gets its own tile row). This means row $i$ is always processed by the same set of thread blocks in the same order, regardless of how many other rows exist.
The left panel shows standard split-K: the K dimension is split across two thread blocks, each computing a partial sum. These partial sums are then reduced. The right panel shows the batch-invariant approach: one thread block handles the entire K dimension for each row, accumulating in a single pass with a fixed loop order.
Split-K is cuBLAS’s strategy for keeping all SMs busy when the M and N dimensions are small but K is large. Instead of having some SMs idle, it splits K across multiple blocks and reduces afterward.
The problem: the number of splits depends on the problem shape. For batch=1, cuBLAS might use 1 split (no reduction needed). For batch=512, it might use 4 splits (tree reduction). The tree reduction sums partial products in a different order than the sequential accumulation.
# With split-K = 4, each block computes partial[k] # Standard reduction (order depends on block scheduling): result = partial[0] + partial[1] + partial[2] + partial[3] # With split-K = 1 (batch-invariant): # Single block accumulates sequentially acc = 0 for k in range(K): acc += A[i, k] * B[k, j] # always same order
The batch-invariant matmul simply never uses split-K. One thread block processes the entire K dimension for each output tile. This is slower for large batches (because SMs are underutilized) but guarantees identical accumulation order.
mma.sync instruction computes a small tile (e.g., 16x8x16) and the accumulation order within that tile is fixed by the hardware. But the order of tiles within a block is determined by the compiler. The batch-invariant kernel must ensure this inner-loop order is also fixed — which it is, as long as the kernel code is the same.
What do we sacrifice by avoiding split-K?
| Operation | cuBLAS (Optimized) | Batch-Invariant | Slowdown |
|---|---|---|---|
| GEMM (large batch) | 100% | ~80% | ~1.2x |
| GEMM (small batch) | 100% | ~95% | ~1.05x |
| RMSNorm | 100% | ~100% | ~1.0x |
| Attention | 100% | ~60% | ~1.6x |
The matmul overhead is modest (~20% worst case) because modern GPUs have enough SMs that even without split-K, utilization stays high for typical LLM shapes. The real cost is in attention, which we’ll tackle next chapter.
During decode, the query is a single token but the KV cache can be millions of tokens. FlashDecoding parallelizes across the KV sequence.
Standard FlashAttention processes the KV cache sequentially in chunks within a single thread block. This is fine during prefill (where the query sequence is long, giving parallelism over Q). But during decode, the query is a single token — there is no Q-dimension parallelism.
FlashDecoding solves this by splitting the KV cache across multiple SMs. Each SM processes a chunk of the KV cache, computing a partial softmax-weighted sum. These partial results are then reduced:
The reduction is mathematically associative in exact arithmetic (the log-sum-exp trick preserves this). But in floating-point, the order of the final sum over splits produces different results.
The standard FlashDecoding approach uses a fixed number of splits (e.g., always split into 128 chunks). The problem: the KV cache grows during generation. At step 100, the cache has 100 tokens. At step 10000, it has 10000 tokens. With a fixed number of splits (say 128):
The split boundaries change as the cache grows. Token 50 might be in split 64 at step 100, but in split 1 at step 10000. This means the accumulation order for the same set of KV pairs changes as generation proceeds. And crucially: during prefill vs decode, the splitting is completely different, so the first decode step produces a different result than continuing decode from a cached prefill.
The solution is elegant: instead of a fixed number of splits, use a fixed split size. Always split at token boundaries that are multiples of a constant (e.g., every 4096 tokens).
| Approach | Splits at step 100 | Splits at step 10000 | Deterministic? |
|---|---|---|---|
| Fixed # splits (128) | 128 splits of ~1 token | 128 splits of ~78 tokens | No |
| Fixed split-size (4096) | 1 split of 100 tokens | 3 splits: [4096, 4096, 1808] | Yes |
With fixed split-size:
The “left-aligned” insight: by anchoring split boundaries to absolute token positions (not relative to the current cache size), you guarantee that the computation for any prefix is a sub-computation of any longer sequence. The partial sums for the first $N$ splits are identical regardless of what comes after.
SPLIT_SIZE = 4096 num_splits = ceil(seq_len / SPLIT_SIZE) # Each split processes a fixed range of KV positions for s in range(num_splits): start = s * SPLIT_SIZE end = min((s + 1) * SPLIT_SIZE, seq_len) # FlashAttention-style inner loop over [start, end) partial_out[s], partial_lse[s] = flash_attn_chunk(Q, K[start:end], V[start:end]) # Fixed-order reduction (always left-to-right over splits) output = reduce_partials(partial_out, partial_lse) # deterministic order
The key guarantee: given the same Q, K, V content and the same split boundaries, the output is bit-identical regardless of batch size, cache strategy, or other concurrent requests.
Visualize how FlashDecoding splits the KV cache and why fixed split-size guarantees determinism.
Toggle between the two strategies and drag the sequence length slider. Notice:
This is why fixed split-size is the correct approach. It turns an $O(S)$ nondeterminism problem (where every split is affected) into an $O(1)$ stability guarantee (only the boundary split changes).
The cost of fixed split-size: fewer splits means less parallelism during short-sequence decode. With fixed #splits=128, you always have 128 SMs working. With fixed split-size=4096, a 1000-token cache only has 1 split — only 1 SM is doing the attention work.
The mitigation: use a smaller split-size (e.g., 256 or 512) for better parallelism, at the cost of more reduction overhead. The Thinking Machines implementation uses a tuned split-size that balances parallelism and determinism, achieving ~1.6x overhead vs fully-optimized vLLM attention at large batch sizes.
1000 runs. Same prompt. Same model. Same result. Every single time.
With batch-invariant kernels for matmul, RMSNorm, and attention, the Thinking Machines team ran Qwen 3-235B (a Mixture-of-Experts model with 235 billion parameters) at temperature zero, 1000 times, with varying concurrent load.
Result: all 1000 completions are bit-identical.
Compare this to the baseline: 80 unique completions from the same 1000 runs with standard vLLM kernels. The nondeterminism is completely eliminated.
Determinism is not free. The end-to-end generation time comparison:
| Configuration | Time (s) | Relative | Notes |
|---|---|---|---|
| vLLM default | 26 | 1.0x | Nondeterministic, fully optimized |
| Batch-invariant (naive attn) | 72 | 2.8x | Single-split attention (no parallelism) |
| Batch-invariant (optimized) | 42 | 1.6x | Fixed split-size attention |
The 1.6x overhead breaks down as:
Attention dominates the cost because it loses the most parallelism. The “optimized” version recovers significant performance over the naive (single-split) version by using a smaller split-size that still provides parallelism while maintaining determinism.
The most impactful application is in reinforcement learning from human feedback (RLHF) and related training schemes. Here is why:
In on-policy RL (like GRPO, PPO applied to LLMs), the training loop is:
The policy gradient requires knowing $\pi_\theta(a|s)$ — the probability of the action (token) under the current policy. If the model is nondeterministic, you have a problem: the logits you computed during sampling (step 1) are different from the logits you’d compute if you re-ran the forward pass (during step 3). This means:
Even though $\theta$ has not changed! The nondeterminism creates a phantom “distribution shift” that introduces noise into the gradient estimate. To compensate, practitioners use importance weighting: $\frac{\pi_{\text{train}}(a|s)}{\pi_{\text{sample}}(a|s)}$. But this ratio can have high variance, destabilizing training.
With batch-invariant inference:
The logits during sampling and training are bit-identical (same model, same input, same output). No importance weighting needed. No phantom distribution shift. The gradient estimate is pure signal.
This is what Thinking Machines calls “true on-policy RL” — the training process sees exactly the same model behavior that the sampling process produced. The KL divergence is not approximately zero. It is exactly zero, by construction.
Beyond RL, deterministic inference enables practices that are standard in software engineering but impossible with nondeterministic models:
None of these are possible when the same prompt can produce 80 different outputs.
The batch-invariant kernel implementations are open-sourced at:
thinking-machines-lab/batch-invariant-opsThe implementation supports: