A language model doesn’t output words — it outputs a probability distribution over the next token. The decoding strategy is how you turn that distribution into actual text. Get it wrong and you get either robotic loops or incoherent gibberish. This is the dial between boring and unhinged.
Here is a fact that surprises people: a language model never decides what word to say next. It only produces a probability distribution — a number for every single token in its vocabulary, saying how likely each one is to come next. For the prompt “The capital of France is”, it might assign 0.90 to “Paris”, 0.03 to “a”, 0.01 to “the”, and tiny slivers to fifty thousand other tokens.
Something else has to take that distribution and pick an actual token. That something is the decoding strategy, and it is a separate algorithm sitting on top of the model. The exact same model, with the exact same weights, will write crisp factual answers or wild creative prose depending entirely on how you decode. The model proposes; the decoder disposes.
Your first instinct is obvious: just take the most likely token every time. It’s called greedy decoding, and for “The capital of France is → Paris” it’s perfect. But run it for a few paragraphs of open-ended writing and something disturbing happens: the text collapses into loops. “I think that I think that I think that...” The single most probable path is, paradoxically, often a dull, repetitive dead end.
Look at a real next-token distribution below. Sometimes it’s sharply peaked (one obvious answer); sometimes it’s flat (many plausible continuations). A good decoder has to handle both — commit when the model is confident, explore when it isn’t.
The model’s output for one step: a probability for every candidate token. Toggle between a confident context (peaked) and an open one (flat). Greedy always grabs the tallest bar — fine on the left, fatal on the right.
So decoding is a trade-off between quality and diversity. Too greedy and you get safe, repetitive, robotic text. Too random and you get creative but incoherent nonsense. Every method in this lesson — temperature, top-k, top-p, beam search — is a different way to navigate that trade-off, deciding how much of the distribution to trust and how much to gamble.
Before the model gives us probabilities, it gives us logits — raw, unnormalized scores, one per token. A logit can be any real number: +8.2 for “Paris”, −1.5 for “banana”, 0.3 for “the”. They’re the model’s gut feelings before being turned into honest probabilities. To decode, we first need to convert these scores into a proper distribution that’s non-negative and sums to one.
The conversion is the softmax function. It does two things: it exponentiates each logit (making everything positive and amplifying differences), then divides by the total (so they sum to one). The probability of token i is e-to-the-logit-i, divided by the sum of e-to-the-logit over all tokens.
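Written as a formula, the sentence above reads as follows. The symbols are notation assumed here rather than taken from the lesson: $z_i$ is the logit of token $i$, $p_i$ its probability, and $V$ the vocabulary size.

$$
p_i = \frac{e^{z_i}}{\sum_{j=1}^{V} e^{z_j}}
$$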
The exponential is the important part. It means a logit that’s a little higher becomes a probability that’s a lot higher — softmax exaggerates the leader. This is why a model that’s only slightly more confident in “Paris” can end up assigning it overwhelming probability.
Take three tokens with logits 2.0, 1.0, and 0.1. Exponentiate: e^2.0 ≈ 7.39, e^1.0 ≈ 2.72, e^0.1 ≈ 1.11. The sum is 7.39 + 2.72 + 1.11 = 11.22. Now divide each: 7.39/11.22 = 0.659, 2.72/11.22 = 0.242, 1.11/11.22 = 0.099. Three positive numbers summing to 1.0. Notice the logit gap of 1.0 between the top two became a probability ratio of nearly 3-to-1: that’s the exponential at work.
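The arithmetic above can be checked with a few lines of Python. This is a minimal sketch using only the standard library; the function name `softmax` and the max-subtraction trick (a common way to avoid overflow, which does not change the result) are choices made here, not something the lesson prescribes.

```python
import math

def softmax(logits):
    """Turn raw logits into probabilities that are positive and sum to 1."""
    # Subtracting the largest logit keeps exp() from overflowing;
    # the shift cancels out in the division, so the result is unchanged.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
print([round(p, 3) for p in probs])  # [0.659, 0.242, 0.099]
```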
Drag a logit slider and watch softmax respond. The top row is raw logits; the bottom is the resulting probability distribution. See how raising one logit steals mass from all the others.
The simplest decoder takes the token with the highest probability, every step, with zero randomness. This is greedy decoding: the argmax of the distribution. For short, factual completions it’s often exactly right: “2 + 2 =” should deterministically produce “4”, and as long as “4” is the model’s top token, greedy returns it every single time.
But greedy has two deep problems. The first is repetition collapse. Once a model emits a phrase, that phrase becomes part of the context, which makes the model more likely to predict it again, which makes greedy pick it again — a feedback loop into “the best the best the best the best.” This isn’t a bug in a specific model; it’s a known failure mode of always taking the locally most probable token.
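A toy illustration of how argmax decoding can get stuck in a cycle. The table `TOY_MODEL` is entirely hypothetical: it conditions only on the previous token, so it shows the deterministic loop but not the context feedback described above, where a real model grows more confident in a phrase each time it repeats.

```python
# Hypothetical bigram "model": next-token probabilities given the previous token.
TOY_MODEL = {
    "<s>":   {"I": 0.6, "the": 0.4},
    "I":     {"think": 0.7, "know": 0.3},
    "think": {"that": 0.8, "so": 0.2},
    "that":  {"I": 0.6, "the": 0.4},
    "know":  {"that": 1.0},
    "so":    {"I": 1.0},
    "the":   {"best": 1.0},
    "best":  {"the": 0.6, "I": 0.4},
}

def greedy_decode(model, start, steps):
    """Always append the single most probable next token (argmax)."""
    tokens = [start]
    for _ in range(steps):
        dist = model[tokens[-1]]
        tokens.append(max(dist, key=dist.get))
    return tokens[1:]  # drop the start symbol

print(" ".join(greedy_decode(TOY_MODEL, "<s>", 9)))
# I think that I think that I think that
```

No single step is unreasonable here; the loop comes from never taking the second-best option.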
The second problem is subtler and important: greedy is myopic. Picking the single best token now can paint you into a corner where every continuation is bad. The globally most probable sentence might start with a token that wasn’t the most probable first choice. Greedy can’t see that — it commits to each token before knowing what comes after.
Step a greedy decoder. Each pick feeds back into the context, raising that token’s probability next time. Watch it spiral into a loop — the distribution sharpens onto the repeated token.
Here’s a beautiful empirical fact (from Holtzman and colleagues, 2019): natural human text is not the highest-probability text. If you plot the per-token probability of real human writing, it’s full of surprises — humans constantly pick less-likely words to stay interesting. Always choosing the most probable token produces text that is, statistically, too predictable to be human. Good decoding deliberately injects the right amount of surprise. That injection is randomness, and the knob that controls it is temperature.
Temperature is the master volume knob for randomness. It works by dividing every logit by a number T before the softmax. That one division reshapes the entire distribution.
Think about what dividing does. With T < 1 (say 0.5), you’re dividing by a small number, so all the logits get bigger and their gaps widen — softmax then sharpens the distribution, concentrating mass on the top tokens. The model becomes more confident, more deterministic. With T > 1 (say 1.5), logits shrink toward each other, gaps narrow, and softmax flattens — rare tokens get more chance, output gets more random and creative. T = 1 is the model’s honest distribution, untouched. And T → 0 makes the top logit dominate completely — that’s greedy decoding as a limiting case.
Recall our logits 2.0, 1.0, 0.1, which gave probabilities 0.659, 0.242, 0.099 at T = 1. Now set T = 0.5: divide the logits to get 4.0, 2.0, 0.2. Exponentiate: 54.6, 7.39, 1.22, summing to 63.2. Probabilities: 0.864, 0.117, 0.019. The leader jumped from 0.66 to 0.86, so the distribution is sharper. Now T = 2: logits become 1.0, 0.5, 0.05, exponentiate to 2.72, 1.65, 1.05, sum 5.42, probabilities 0.502, 0.304, 0.194. That is much flatter: the underdog roughly doubled its chances, from 0.099 to 0.194. Same logits, three different personalities.
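The three cases above can be reproduced with a short sketch. The function name `softmax_with_temperature` is made up for this example; the only operation it adds to a plain softmax is dividing each logit by T first.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Divide every logit by T, then apply softmax."""
    if temperature <= 0:
        raise ValueError("temperature must be positive; T -> 0 is the greedy limit")
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # shift for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]
for t in (0.5, 1.0, 2.0):
    print(t, [round(p, 3) for p in softmax_with_temperature(logits, t)])
# 0.5 [0.864, 0.117, 0.019]
# 1.0 [0.659, 0.242, 0.099]
# 2.0 [0.502, 0.304, 0.194]
```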
Slide temperature and watch the same logits become a sharp spike (low T, near-greedy) or a flat spread (high T, wild). The dashed line is the original T=1 distribution for comparison.
Temperature reshapes the whole distribution, but it never fully removes the long tail of terrible tokens: at any T > 0 there’s a nonzero chance of sampling something absurd. With fifty thousand tokens, even if each junk token has a probability of only 0.000002, their combined mass is about 0.1, so roughly one draw in ten lands in the tail and can derail the whole generation. We need to truncate the tail.
Top-k sampling is the bluntest truncation: keep only the k highest-probability tokens, throw away everything else, renormalize the survivors back to sum to one, and sample from those. With k = 50, no matter how flat the distribution, you only ever sample from the 50 best candidates. The garbage tail is simply gone.
Suppose after softmax the probabilities are 0.5, 0.25, 0.15, 0.07, 0.02, and then a tail of tiny values. Apply top-k with k = 3: keep 0.5, 0.25, 0.15 and discard the rest. They sum to 0.90, so renormalize by dividing each by 0.90: the new distribution is 0.556, 0.278, 0.167. We now sample from just those three, with no chance of ever picking the discarded tokens.
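A minimal sketch of that filter, using the same numbers. The name `top_k_filter` is chosen here for illustration, and the final 0.01 entry stands in for the whole tail of tiny values.

```python
def top_k_filter(probs, k):
    """Keep the k largest probabilities, zero out the rest, renormalize."""
    if k < 1:
        raise ValueError("k must be at least 1")
    # Indices of the k most probable tokens.
    keep = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    kept_mass = sum(probs[i] for i in keep)
    filtered = [0.0] * len(probs)
    for i in keep:
        filtered[i] = probs[i] / kept_mass
    return filtered

probs = [0.5, 0.25, 0.15, 0.07, 0.02, 0.01]  # last entry stands in for the tail
print([round(p, 3) for p in top_k_filter(probs, 3)])
# [0.556, 0.278, 0.167, 0.0, 0.0, 0.0]
```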
Slide k. Tokens beyond the cutoff (greyed) are discarded; the survivors are renormalized (shown brighter, taller). Notice top-k keeps a fixed count regardless of how peaked or flat the distribution is.
Top-k works well and is simple, but it has one real weakness: k is fixed, but distributions aren’t. When the model is very confident (one token at 0.95), k = 50 still keeps 49 junk tokens that should have been cut. When the model is genuinely uncertain (say 200 tokens each around 0.005), k = 50 cuts off 150 perfectly good options. A fixed count can’t adapt to the distribution’s shape, which is exactly the problem top-p solves.
Top-p sampling, introduced by Holtzman and colleagues in the same “nucleus sampling” paper, fixes top-k’s rigidity with a beautiful idea: instead of keeping a fixed number of tokens, keep a fixed amount of probability mass.
The recipe: sort tokens by probability, then add them up from the top until the cumulative probability first reaches p (say 0.9). That smallest set, the nucleus, is what you keep. Renormalize and sample from it. Everything outside the nucleus, which is at most the bottom 10% of mass, is discarded.
The magic is that the nucleus resizes itself automatically. When the model is confident, with one token at 0.92, the nucleus for p = 0.9 is just that one token, so you behave almost greedily and stay accurate. When the model is uncertain, with many tokens around 0.05, the nucleus needs about eighteen of them to reach 0.9, so you explore widely. The truncation adapts to the model’s confidence, step by step, which is exactly what a fixed k cannot do.
Same probabilities: 0.5, 0.25, 0.15, 0.07, 0.02, tail. Apply top-p with p = 0.9. Accumulate from the top: 0.5 (running total 0.5), +0.25 (0.75), +0.15 (0.90) — we’ve reached 0.90, stop. The nucleus is the first three tokens. Renormalize by 0.90: 0.556, 0.278, 0.167. Now contrast a peaked case, 0.92, 0.05, 0.02: just the first token already hits 0.92 ≥ 0.9, so the nucleus is a single token — effectively greedy. Same p, completely different nucleus size, decided by the distribution itself.
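Both cases above can be run through one small function. This is a sketch, not a reference implementation: the name `top_p_filter` is invented here, and the `1e-12` tolerance is an assumption added so that floating-point rounding (0.5 + 0.25 + 0.15 may come out a hair under 0.9) does not pull in an extra token.

```python
def top_p_filter(probs, p):
    """Keep the smallest set of top tokens whose mass reaches p, renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    nucleus, cumulative = [], 0.0
    for i in order:
        nucleus.append(i)
        cumulative += probs[i]
        if cumulative >= p - 1e-12:  # tolerance for float rounding
            break
    filtered = [0.0] * len(probs)
    for i in nucleus:
        filtered[i] = probs[i] / cumulative
    return filtered

flat = top_p_filter([0.5, 0.25, 0.15, 0.07, 0.02, 0.01], 0.9)
peaked = top_p_filter([0.92, 0.05, 0.02, 0.01], 0.9)
print([round(x, 3) for x in flat])    # [0.556, 0.278, 0.167, 0.0, 0.0, 0.0]
print([round(x, 3) for x in peaked])  # [1.0, 0.0, 0.0, 0.0]
```

The same `p` yields a three-token nucleus in the first call and a one-token nucleus in the second.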
Slide p and toggle the distribution. The shaded bars are the nucleus (smallest set summing to ≥ p). Watch the nucleus shrink to one token on a peaked distribution and grow on a flat one — the adaptivity top-k lacks.
Everything so far decides one token at a time. Beam search takes a different tack: it tries to find a high-probability whole sequence by exploring several candidate continuations in parallel. It directly attacks greedy’s myopia.
Here’s the mechanism. Keep the B best partial sequences alive at once (B is the “beam width”). At each step, for every one of the B beams, consider its possible next tokens, score all the resulting longer sequences by their total probability, and keep only the top B overall. So with B = 4, you always carry the four most promising sentences-so-far, pruning the rest. At the end, you return the highest-scoring complete sequence.
Why bother? Because the best sequence may start with a token that wasn’t the best first choice. Greedy (which is beam search with B = 1) would miss it. Beam search keeps enough options open to recover — a slightly worse first token can lead to a much better overall sentence, and beam search can find it where greedy cannot.
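A toy example makes the difference concrete. Everything here is hypothetical: `TOY_MODEL` is a hand-written table of next-token probabilities keyed by prefix, and the search is simplified to fixed-length sequences with no end-of-sequence token or length normalization. Scores are summed log-probabilities, which ranks sequences the same way as multiplying probabilities.

```python
import math

# Hypothetical model: next-token probabilities for each prefix (a tuple of tokens).
TOY_MODEL = {
    (): {"The": 0.6, "A": 0.4},
    ("The",): {"cat": 0.4, "dog": 0.35, "fox": 0.25},
    ("A",): {"bird": 0.9, "fish": 0.1},
}

def beam_search(next_probs, beam_width, length):
    """Return (tokens, log_prob) of the best sequence found with the given beam width."""
    beams = [((), 0.0)]  # each beam: (tokens so far, cumulative log-probability)
    for _ in range(length):
        candidates = []
        for tokens, score in beams:
            for tok, p in next_probs[tokens].items():
                candidates.append((tokens + (tok,), score + math.log(p)))
        # Keep only the top beam_width candidates overall.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]
    return beams[0]

for width in (1, 2):
    tokens, score = beam_search(TOY_MODEL, width, 2)
    print(width, " ".join(tokens), round(math.exp(score), 2))
# 1 The cat 0.24
# 2 A bird 0.36
```

With a width of 1 the search is greedy and commits to “The”; with a width of 2 it keeps “A” alive long enough to find the higher-probability sequence.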
Step the search. At each level, every surviving beam branches into candidates; only the top B by cumulative probability survive (highlighted), the rest are pruned (faded). The winning path is the best total-probability sequence, not the greedy one.
Beam search is excellent for closed-ended tasks: machine translation, summarization, speech recognition — anywhere there’s essentially one correct, high-probability answer and you want to find it. Maximizing sequence probability is the right goal there.
But for open-ended generation (storytelling, chat, brainstorming) beam search is often worse than sampling. Remember Chapter 2: high-probability text is bland and repetitive. Beam search, by design, hunts for the highest-probability sequence, so it produces safe, generic, often repetitive output, and increasing the beam width tends to make this worse, not better. For creative text you want surprise, which means sampling (temperature + top-p), not probability maximization.
Real systems don’t pick one method — they apply a pipeline of filters to the logits in sequence, then sample. Understanding the order makes everything click.
To fight the loops from Chapter 2 directly, decoders adjust the logits of tokens that already appeared. A repetition penalty shrinks the logit of any token seen recently, making it less likely to recur; the multiplicative form divides a positive logit by the penalty and multiplies a negative one, since dividing a negative logit would raise it. A frequency penalty subtracts an amount that scales with how many times a token appeared (the more you’ve said it, the harder it’s pushed down). A presence penalty subtracts a flat amount from any token that appeared at all, nudging the model toward new vocabulary. These penalties are one reason modern chat models rarely loop the way raw greedy would.
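The three penalties can be sketched in one function. Exact formulas differ between libraries, so treat this as one common convention under stated assumptions rather than any particular library’s API: tokens are integer ids indexing into `logits`, the repetition penalty divides positive logits and multiplies negative ones, and the frequency and presence penalties are subtracted. The function name and parameter names are chosen for this example.

```python
from collections import Counter

def apply_penalties(logits, history, repetition_penalty=1.0,
                    frequency_penalty=0.0, presence_penalty=0.0):
    """Lower the logits of token ids that already appear in history."""
    out = list(logits)
    for tok, count in Counter(history).items():
        z = out[tok]
        # Repetition penalty: always push the logit down, whatever its sign.
        z = z / repetition_penalty if z > 0 else z * repetition_penalty
        # Frequency penalty: grows with how often the token appeared.
        z -= frequency_penalty * count
        # Presence penalty: flat cost for having appeared at all.
        z -= presence_penalty
        out[tok] = z
    return out

logits = [2.0, 1.0, -1.0, 0.5]
history = [0, 0, 2]  # token 0 appeared twice, token 2 once
print(apply_penalties(logits, history, repetition_penalty=2.0))
# [1.0, 1.0, -2.0, 0.5]
```

Tokens 1 and 3 never appeared, so their logits pass through unchanged.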
Min-p sampling keeps tokens whose probability is at least some fraction of the top token’s probability — a cleaner adaptive cut that scales with the peak. Typical sampling keeps tokens whose surprise is close to the distribution’s average, aiming to match the natural “surprise level” of human text rather than just the high-probability head. These are refinements on the same idea: trim the distribution intelligently before sampling.
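Min-p is short enough to sketch directly from the description above. The function name `min_p_filter` and the example threshold of 0.2 are assumptions made for illustration.

```python
def min_p_filter(probs, min_p):
    """Keep tokens with probability >= min_p * (top probability), renormalize."""
    threshold = min_p * max(probs)
    kept = [p if p >= threshold else 0.0 for p in probs]
    total = sum(kept)  # never zero: the top token always survives
    return [p / total for p in kept]

flat = min_p_filter([0.5, 0.25, 0.15, 0.07, 0.02, 0.01], 0.2)
peaked = min_p_filter([0.92, 0.05, 0.02, 0.01], 0.2)
print([round(x, 3) for x in flat])    # [0.556, 0.278, 0.167, 0.0, 0.0, 0.0]
print([round(x, 3) for x in peaked])  # [1.0, 0.0, 0.0, 0.0]
```

Because the threshold is a fraction of the peak, the cut tightens automatically when the model is confident (0.184 in the second call versus 0.1 in the first).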
| Goal | Typical settings |
|---|---|
| Factual Q&A, code, math | T = 0 (greedy) or T = 0.2, top-p = 1 |
| Balanced assistant chat | T = 0.7, top-p = 0.9 |
| Creative writing, brainstorming | T = 0.9–1.1, top-p = 0.95 |
| Translation / summarization | beam search (B = 4–8) |
| Anti-repetition (long output) | add frequency/presence penalty |
Now drive all the knobs at once. Below is a fixed set of candidate next-tokens with realistic logits. Adjust temperature, top-k, and top-p, and watch the surviving distribution reshape in real time. Then hit “sample 100×” to draw repeatedly and see which tokens actually come out and how often.
The bars are the post-filter probabilities (greyed = cut by top-k/top-p). Temperature shapes, then top-k and top-p truncate. “Sample” draws from what survives; the tally shows the realized frequencies.
Experiments worth running:
Set temperature to 0.1. The distribution collapses onto the top token — near-greedy. Sample 100 times and you’ll get the same token almost every draw. Deterministic, focused, boring.
Set temperature to 1.8 with top-p = 1. The distribution flattens and the tally spreads across many tokens, including unlikely ones. Creative, but you’ll see junk tokens appear — the long-tail risk.
Now keep temperature 1.8 but set top-p = 0.9. The flat distribution is truncated to its nucleus — the junk tokens vanish from the tally while diversity among good tokens remains. This is the combination that gives you creativity without incoherence, and it’s why the modern default pairs a moderate temperature with nucleus sampling.
No quiz here — the sampler is the test.
| Method | What it does | Best for |
|---|---|---|
| Greedy | always the top token (T = 0) | short factual answers; repeats on long text |
| Temperature | divide logits by T before softmax | the master randomness knob; ~0.7 balanced |
| Top-k | keep k highest tokens, renormalize | simple tail-cutting; fixed count |
| Top-p (nucleus) | keep smallest set with mass ≥ p | adaptive tail-cutting; the common default |
| Beam search | keep B best sequences, maximize total prob | translation, summarization (closed-ended) |
| Penalties | lower logits of repeated tokens | anti-repetition on long output |
logits → penalties → ÷ temperature → top-k → top-p → renormalize → sample
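The pipeline above can be sketched end to end in plain Python. This is an illustrative sketch under several assumptions, not how any specific library implements it: token ids are list indices, only a presence penalty is shown, `top_k = 0` means no limit, temperature 0 is treated as greedy, and top-p is applied to the renormalized top-k survivors. The function name `sample_next` and its parameters are invented for this example.

```python
import math
import random

def sample_next(logits, history=(), temperature=1.0, top_k=0, top_p=1.0,
                presence_penalty=0.0, rng=random):
    """Pick one token id: penalties, temperature, top-k, top-p, renormalize, sample."""
    if temperature < 0:
        raise ValueError("temperature must be non-negative")
    # 1. Penalty: subtract a flat amount from every token id already in history.
    z = list(logits)
    for tok in set(history):
        z[tok] -= presence_penalty

    # 2. Temperature. T = 0 is treated as greedy (argmax) to avoid dividing by zero.
    if temperature == 0:
        return max(range(len(z)), key=z.__getitem__)
    z = [v / temperature for v in z]

    # Softmax, shifted by the max logit for numerical stability.
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    probs = [e / total for e in exps]

    # 3. Top-k: keep the k most probable ids.
    order = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)
    if top_k > 0:
        order = order[:top_k]
    kept_mass = sum(probs[i] for i in order)

    # 4. Top-p over the renormalized top-k survivors.
    nucleus, cumulative = [], 0.0
    for i in order:
        nucleus.append(i)
        cumulative += probs[i] / kept_mass
        if cumulative >= top_p - 1e-12:  # small tolerance for float rounding
            break

    # 5. Sample; random.choices normalizes the weights itself.
    return rng.choices(nucleus, weights=[probs[i] for i in nucleus], k=1)[0]

rng = random.Random(0)  # seeded generator so runs are repeatable
draws = [sample_next([2.0, 1.0, 0.1, -1.0], temperature=0.7, top_k=3,
                     top_p=0.9, rng=rng) for _ in range(1000)]
print({tok: draws.count(tok) for tok in sorted(set(draws))})
```

With these settings the nucleus contains only token ids 0 and 1, so the two tail tokens never appear in the tally.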
1. The model outputs a distribution, not a word. Decoding is the separate algorithm that turns that distribution into text. Same weights + different decoder = different personality.
2. It’s a quality–diversity dial. Greedy/low-T/beam = focused but bland and loopy. High-T/wide-p = creative but risky. Temperature spreads, truncation (top-k/top-p) cuts the junk tail.
3. Match the decoder to the task. Facts and code → near-greedy. Chat → T~0.7 + top-p~0.9. Creative → higher T + top-p. Translation → beam search. Never one-size-fits-all.