LLM Inference & Adaptation

Sampling & Decoding

A language model doesn’t output words — it outputs a probability distribution over the next token. The decoding strategy is how you turn that distribution into actual text. Get it wrong and you get either robotic loops or incoherent gibberish. This is the dial between boring and unhinged.

Prerequisites: the model gives a score to every possible next token, and probabilities sum to 1. That’s it.
10 chapters · 8+ simulations · 0 assumed knowledge

Chapter 0: The Choice Problem

Here is a fact that surprises people: a language model never decides what word to say next. It only produces a probability distribution — a number for every single token in its vocabulary, saying how likely each one is to come next. For the prompt “The capital of France is”, it might assign 0.90 to “Paris”, 0.03 to “a”, 0.01 to “the”, and tiny slivers to fifty thousand other tokens.

Something else has to take that distribution and pick an actual token. That something is the decoding strategy, and it is a separate algorithm sitting on top of the model. The exact same model, with the exact same weights, will write crisp factual answers or wild creative prose depending entirely on how you decode. The model proposes; the decoder disposes.

Your first instinct is obvious: just take the most likely token every time. It’s called greedy decoding, and for “The capital of France is → Paris” it’s perfect. But run it for a few paragraphs of open-ended writing and something disturbing happens: the text collapses into loops. “I think that I think that I think that...” The single most probable path is, paradoxically, often a dull, repetitive dead end.

Look at a real next-token distribution below. Sometimes it’s sharply peaked (one obvious answer); sometimes it’s flat (many plausible continuations). A good decoder has to handle both — commit when the model is confident, explore when it isn’t.

A next-token distribution

The model’s output for one step: a probability for every candidate token. Toggle between a confident context (peaked) and an open one (flat). Greedy always grabs the tallest bar — fine on the left, fatal on the right.

So decoding is a trade-off between quality and diversity. Too greedy and you get safe, repetitive, robotic text. Too random and you get creative but incoherent nonsense. Every method in this lesson — temperature, top-k, top-p, beam search — is a different way to navigate that trade-off, deciding how much of the distribution to trust and how much to gamble.

Misconception: “The model generates text.” Strictly, it generates a distribution, once per token. Generation is a loop: get distribution → decode one token → append it → feed back in → repeat. The model is consulted at every single step; the decoder makes every actual choice. Change the decoder and you change the writer’s entire personality without touching a single weight.
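The loop described above can be sketched in a few lines. This is a toy under stated assumptions: `toy_model` is a hypothetical stand-in that returns made-up logits for a four-word vocabulary, and `greedy` and `sample` are illustrative decoder names, not any library's API. The point is only the shape of the loop: the model is consulted every step, and the decoder makes every choice.

```python
import math
import random

VOCAB = ["the", "cat", "sat", "."]

def toy_model(context):
    """Hypothetical stand-in for a real model: one logit per vocabulary token."""
    last = context[-1] if context else "."
    # Favor whichever token follows the last one in VOCAB order (purely illustrative).
    favored = VOCAB[(VOCAB.index(last) + 1) % len(VOCAB)]
    return {tok: (2.0 if tok == favored else 0.0) for tok in VOCAB}

def softmax(logits):
    m = max(logits.values())
    exps = {t: math.exp(z - m) for t, z in logits.items()}
    total = sum(exps.values())
    return {t: e / total for t, e in exps.items()}

def greedy(probs, rng):
    return max(probs, key=probs.get)

def sample(probs, rng):
    tokens = list(probs)
    return rng.choices(tokens, weights=[probs[t] for t in tokens], k=1)[0]

def generate(decode, steps=8, seed=0):
    rng = random.Random(seed)
    context = []
    for _ in range(steps):
        probs = softmax(toy_model(context))  # the model proposes a distribution
        context.append(decode(probs, rng))   # the decoder picks one token
    return context

print(generate(greedy))
print(generate(sample))
```

Swapping `greedy` for `sample` changes the output without touching `toy_model` at all, which is the whole claim of this chapter in miniature.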

What does a language model actually output at each generation step?

Chapter 1: Logits & Softmax

Before the model gives us probabilities, it gives us logits — raw, unnormalized scores, one per token. A logit can be any real number: +8.2 for “Paris”, −1.5 for “banana”, 0.3 for “the”. They’re the model’s gut feelings before being turned into honest probabilities. To decode, we first need to convert these scores into a proper distribution that’s non-negative and sums to one.

The conversion is the softmax function. It does two things: it exponentiates each logit (making everything positive and amplifying differences), then divides by the total (so they sum to one). The probability of token i is e-to-the-logit-i, divided by the sum of e-to-the-logit over all tokens.

$$p_i = \frac{e^{z_i}}{\sum_j e^{z_j}}$$

The exponential is the important part. It means a logit that’s a little higher becomes a probability that’s a lot higher — softmax exaggerates the leader. This is why a model that’s only slightly more confident in “Paris” can end up assigning it overwhelming probability.

By hand

Take three tokens with logits 2.0, 1.0, and 0.1. Exponentiate: e^2.0 ≈ 7.39, e^1.0 ≈ 2.72, e^0.1 ≈ 1.11. The sum is 7.39 + 2.72 + 1.11 = 11.22. Now divide each: 7.39/11.22 = 0.659, 2.72/11.22 = 0.242, 1.11/11.22 = 0.099. Three positive numbers summing to 1.0. Notice the logit gap of 1.0 between the top two became a probability ratio of nearly 3-to-1 (e^1.0 ≈ 2.72): that’s the exponential at work.
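The hand calculation above is easy to check in code. A minimal sketch using only the standard library; the function name `softmax` is just a label chosen here.

```python
import math

def softmax(logits):
    """Exponentiate each logit, then divide by the total."""
    exps = [math.exp(z) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
print([round(p, 3) for p in probs])  # [0.659, 0.242, 0.099]
```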

Logits become probabilities

Drag a logit slider and watch softmax respond. The top row is raw logits; the bottom is the resulting probability distribution. See how raising one logit steals mass from all the others.

Slider defaults: logit of token A = 2.0; logit of token B = 1.0

Key insight: Softmax is where the model’s confidence lives. A wide gap between the top logit and the rest produces a peaked, confident distribution; logits all bunched together produce a flat, uncertain one. Every decoding method we’re about to meet works by reshaping this distribution before sampling — and temperature, the next chapter, reaches right inside the softmax to do it.

Misconception: “Logits are probabilities.” They’re not — they can be negative, they don’t sum to anything in particular, and only their relative values matter (adding a constant to every logit changes nothing after softmax). Probabilities only exist after softmax. This distinction matters because temperature operates on logits, not probabilities.
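The claim that adding a constant to every logit changes nothing can be verified directly. The sketch below also uses that fact for a practical purpose: subtracting the largest logit before exponentiating, a common way to avoid overflow. The shift of 100 is an arbitrary value picked for illustration.

```python
import math

def softmax(logits):
    m = max(logits)  # shifting by a constant is safe: it cancels in the ratio
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]
shifted = [z + 100.0 for z in logits]

same = all(math.isclose(a, b) for a, b in zip(softmax(logits), softmax(shifted)))
print(same)

# Large logits no longer overflow, because the biggest exponent is always 0.
big = softmax([1000.0, 999.0])
print([round(p, 3) for p in big])
```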

Why does softmax use the exponential function rather than just dividing each logit by the sum of logits?

Chapter 2: Greedy Decoding & Its Trap

The simplest decoder takes the token with the highest probability, every step, with zero randomness. This is greedy decoding — argmax of the distribution. For short, factual completions it’s often exactly right: “2 + 2 =” should deterministically produce “4”, and greedy guarantees it.

But greedy has two deep problems. The first is repetition collapse. Once a model emits a phrase, that phrase becomes part of the context, which makes the model more likely to predict it again, which makes greedy pick it again — a feedback loop into “the best the best the best the best.” This isn’t a bug in a specific model; it’s a known failure mode of always taking the locally most probable token.

The second problem is subtler and important: greedy is myopic. Picking the single best token now can paint you into a corner where every continuation is bad. The globally most probable sentence might start with a token that wasn’t the most probable first choice. Greedy can’t see that — it commits to each token before knowing what comes after.
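A tiny two-step example makes the myopia concrete. The probabilities below are invented for illustration: token A is the better first choice, but B leads to a far more confident continuation, so the best two-token sequence starts with the runner-up.

```python
# Hypothetical two-step toy distribution; all numbers are made up.
first = {"A": 0.6, "B": 0.4}
second = {
    "A": {"x": 0.55, "y": 0.45},  # after A, the model is nearly split
    "B": {"x": 0.90, "y": 0.10},  # after B, one continuation dominates
}

def greedy_path():
    """Commit to the best token at each step, never looking ahead."""
    t1 = max(first, key=first.get)
    t2 = max(second[t1], key=second[t1].get)
    return (t1, t2), first[t1] * second[t1][t2]

def best_path():
    """Score every two-token sequence and return the most probable one."""
    paths = {(t1, t2): first[t1] * p
             for t1 in first for t2, p in second[t1].items()}
    best = max(paths, key=paths.get)
    return best, paths[best]

print(greedy_path())
print(best_path())
```

Greedy lands on (A, x) with probability about 0.33, while the best sequence is (B, x) at about 0.36. Exhaustive scoring only works here because the toy has four paths; with a real vocabulary it is hopeless.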

The greedy repetition loop

Step a greedy decoder. Each pick feeds back into the context, raising that token’s probability next time. Watch it spiral into a loop — the distribution sharpens onto the repeated token.

Why humans don’t talk like greedy decoders

Here’s a beautiful empirical fact (from Holtzman and colleagues, 2019): natural human text is not the highest-probability text. If you plot the per-token probability of real human writing, it’s full of surprises — humans constantly pick less-likely words to stay interesting. Always choosing the most probable token produces text that is, statistically, too predictable to be human. Good decoding deliberately injects the right amount of surprise. That injection is randomness, and the knob that controls it is temperature.

Misconception: “The highest-probability sequence is the best output.” For open-ended generation, no — it’s often bland and repetitive precisely because it’s so probable. Real language has a characteristic level of surprise; maximizing probability undershoots it. (For closed-ended tasks like translation, high-probability is closer to right — which is exactly when beam search shines, as we’ll see.)

Why does greedy decoding often produce repetitive loops in open-ended generation?

Chapter 3: Temperature

Temperature is the master volume knob for randomness. It works by dividing every logit by a number T before the softmax. That one division reshapes the entire distribution.

$$p_i = \frac{e^{z_i/T}}{\sum_j e^{z_j/T}}$$

Think about what dividing does. With T < 1 (say 0.5), you’re dividing by a number smaller than one, so every logit grows in magnitude and the gaps between logits widen; softmax then sharpens the distribution, concentrating mass on the top tokens. The model becomes more confident, more deterministic. With T > 1 (say 1.5), logits shrink toward each other, gaps narrow, and softmax flattens: rare tokens get more chance, and output gets more random and creative. T = 1 is the model’s honest distribution, untouched. And as T → 0 the top logit dominates completely, which is greedy decoding as a limiting case.

By hand

Recall our logits 2.0, 1.0, 0.1, which gave probabilities 0.659, 0.242, 0.099 at T = 1. Now set T = 0.5: divide the logits to get 4.0, 2.0, 0.2. Exponentiate: 54.6, 7.39, 1.22, summing to 63.2. Probabilities: 0.864, 0.117, 0.019. The leader jumped from 0.66 to 0.86: sharper. Now T = 2: logits become 1.0, 0.5, 0.05, exponentiate to 2.72, 1.65, 1.05, sum 5.42, probabilities 0.502, 0.304, 0.194. Much flatter: the underdog roughly doubled its chances (0.099 to 0.194). Same logits, three different personalities.
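The three cases worked above can be reproduced with one small function. A sketch, assuming T > 0; the T = 0 limit is handled separately in practice because dividing by zero is undefined.

```python
import math

def softmax_with_temperature(logits, T):
    """Divide every logit by T, then apply softmax."""
    if T <= 0:
        raise ValueError("T must be positive; treat T = 0 as greedy argmax instead")
    scaled = [z / T for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]
for T in (0.5, 1.0, 2.0):
    print(T, [round(p, 3) for p in softmax_with_temperature(logits, T)])
# 0.5 [0.864, 0.117, 0.019]
# 1.0 [0.659, 0.242, 0.099]
# 2.0 [0.502, 0.304, 0.194]
```

Note that the ranking never changes across the three rows; only the spread does.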

Temperature reshapes the distribution

Slide temperature and watch the same logits become a sharp spike (low T, near-greedy) or a flat spread (high T, wild). The dashed line is the original T=1 distribution for comparison.

Slider default: temperature T = 1.00

Key insight: Temperature doesn’t change which token is most likely — it changes how much the distribution favors it. Low T trusts the model’s top picks (factual, focused); high T spreads the bets (creative, risky). It’s the single most important generation parameter, and 0.7–0.8 is a common sweet spot for balanced text.

Misconception: “Higher temperature adds more information or knowledge.” It adds randomness, not knowledge. The model knows exactly the same things; temperature only changes how willing the decoder is to pick lower-ranked tokens. Crank it too high (T > 1.5) and you sample from the long tail of nonsense tokens — creativity curdles into gibberish.

What does setting temperature T below 1 do to the probability distribution?

Chapter 4: Top-k Sampling

Temperature reshapes the whole distribution, but it never fully removes the long tail of terrible tokens: at any T > 0 there’s a nonzero chance of sampling something absurd. With fifty thousand tokens, each junk token can have a probability as small as 0.000002 and the tail still holds about 10% of the mass in total (50,000 × 0.000002 = 0.1), so occasionally you’ll draw one and derail the whole generation. We need to truncate the tail.

Top-k sampling is the bluntest truncation: keep only the k highest-probability tokens, throw away everything else, renormalize the survivors back to sum to one, and sample from those. With k = 50, no matter how flat the distribution, you only ever sample from the 50 best candidates. The garbage tail is simply gone.

By hand

Suppose after softmax the probabilities are 0.5, 0.25, 0.15, 0.07, 0.02, and then a tail of tiny values. Apply top-k with k = 3: keep 0.5, 0.25, 0.15 and discard the rest. They sum to 0.90, so renormalize by dividing each by 0.90: the new distribution is 0.556, 0.278, 0.167. We now sample from just those three, with no chance of ever picking the discarded tokens.
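Here is the same calculation as a short function. One assumption: the "tail of tiny values" is represented by a single leftover token at 0.01 so that the list sums to 1.

```python
def top_k_filter(probs, k):
    """Keep the k largest probabilities, zero out the rest, renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep = set(order[:k])
    total = sum(probs[i] for i in keep)
    return [probs[i] / total if i in keep else 0.0 for i in range(len(probs))]

probs = [0.5, 0.25, 0.15, 0.07, 0.02, 0.01]  # last entry stands in for the tail
filtered = top_k_filter(probs, 3)
print([round(p, 3) for p in filtered])  # [0.556, 0.278, 0.167, 0.0, 0.0, 0.0]
```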

Top-k truncation

Slide k. Tokens beyond the cutoff (greyed) are discarded; the survivors are renormalized (shown brighter, taller). Notice top-k keeps a fixed count regardless of how peaked or flat the distribution is.

Slider default: k = 3

Top-k works well and is simple, but it has one real weakness: k is fixed, but distributions aren’t. When the model is very confident (one token at 0.95), k = 50 still keeps 49 junk tokens that should have been cut. When the model is genuinely uncertain (say 200 tokens all around 0.005), k = 50 cuts off 150 perfectly good options. A fixed count can’t adapt to the distribution’s shape, which is exactly the problem top-p solves.

Misconception: “Bigger k is always safer.” Bigger k keeps more of the tail, which increases the chance of sampling junk, not decreases it. Smaller k is more conservative (closer to greedy at k = 1). The right k depends on the task and is usually paired with temperature, not used alone.

What is the main limitation of top-k sampling?

Chapter 5: Top-p (Nucleus) Sampling

Top-p sampling, introduced by Holtzman and colleagues in the same “nucleus sampling” paper, fixes top-k’s rigidity with a beautiful idea: instead of keeping a fixed number of tokens, keep a fixed amount of probability mass.

The recipe: sort tokens by probability, then add them up from the top until the cumulative probability first reaches p (say 0.9). That smallest set, the nucleus, is what you keep. Renormalize and sample from it. Everything outside the nucleus, at most the bottom 10% of mass when p = 0.9, is discarded.

The magic is that the nucleus resizes itself automatically. When the model is confident, with one token at 0.92, the nucleus for p = 0.9 is just that one token, so you behave almost greedily and stay accurate. When the model is uncertain, with many tokens around 0.03 to 0.05 each, the nucleus might include twenty or thirty tokens, so you explore widely. The truncation adapts to the model’s confidence, step by step, which is exactly what a fixed k cannot do.

By hand

Same probabilities: 0.5, 0.25, 0.15, 0.07, 0.02, tail. Apply top-p with p = 0.9. Accumulate from the top: 0.5 (running total 0.5), +0.25 (0.75), +0.15 (0.90) — we’ve reached 0.90, stop. The nucleus is the first three tokens. Renormalize by 0.90: 0.556, 0.278, 0.167. Now contrast a peaked case, 0.92, 0.05, 0.02: just the first token already hits 0.92 ≥ 0.9, so the nucleus is a single token — effectively greedy. Same p, completely different nucleus size, decided by the distribution itself.
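Both cases above, the open distribution and the peaked one, run through the same function in the sketch below. Two assumptions: each list is padded with a 0.01 tail token so it sums to 1, and a tiny tolerance `eps` is added so that floating-point rounding in the running total cannot push the cutoff one token late.

```python
def top_p_filter(probs, p, eps=1e-12):
    """Keep the smallest top-ranked set whose mass reaches p, then renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    nucleus, cumulative = set(), 0.0
    for i in order:
        nucleus.add(i)
        cumulative += probs[i]
        if cumulative >= p - eps:
            break
    return [probs[i] / cumulative if i in nucleus else 0.0
            for i in range(len(probs))]

open_case = top_p_filter([0.5, 0.25, 0.15, 0.07, 0.02, 0.01], 0.9)
peaked_case = top_p_filter([0.92, 0.05, 0.02, 0.01], 0.9)
print([round(x, 3) for x in open_case])    # [0.556, 0.278, 0.167, 0.0, 0.0, 0.0]
print([round(x, 3) for x in peaked_case])  # [1.0, 0.0, 0.0, 0.0]
```

Same p, three survivors in one case and one in the other: the nucleus size comes from the distribution, not from the setting.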

The nucleus adapts

Slide p and toggle the distribution. The shaded bars are the nucleus (smallest set summing to ≥ p). Watch the nucleus shrink to one token on a peaked distribution and grow on a flat one — the adaptivity top-k lacks.

Slider default: p = 0.90

Key insight: Top-k fixes the count; top-p fixes the mass. Because confidence varies token-to-token, fixing the mass lets the candidate pool breathe — tight when the model is sure, wide when it’s not. This is why top-p (typically p = 0.9–0.95, paired with temperature ~0.7) is the most common default in production text generation.

Misconception: “Top-p and top-k are rivals; pick one.” They’re routinely combined: apply temperature to shape the distribution, then top-k to cap the absolute number of candidates, then top-p to trim by mass. Each guards a different failure mode. Many modern APIs and libraries expose all three at once.

How does top-p (nucleus) sampling improve on top-k?

Chapter 6: Beam Search

Everything so far decides one token at a time. Beam search takes a different tack: it tries to find a high-probability whole sequence by exploring several candidate continuations in parallel. It directly attacks greedy’s myopia.

Here’s the mechanism. Keep the B best partial sequences alive at once (B is the “beam width”). At each step, for every one of the B beams, consider its possible next tokens, score all the resulting longer sequences by their total probability, and keep only the top B overall. So with B = 4, you always carry the four most promising sentences-so-far, pruning the rest. At the end, you return the highest-scoring complete sequence.

Why bother? Because the best sequence may start with a token that wasn’t the best first choice. Greedy (which is beam search with B = 1) would miss it. Beam search keeps enough options open to recover — a slightly worse first token can lead to a much better overall sentence, and beam search can find it where greedy cannot.
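The mechanism fits in about a dozen lines. This sketch runs over a hypothetical toy "model" (the `NEXT` table, with invented probabilities keyed only on the last token) and scores sequences by summed log-probability, which ranks them the same way as the product of probabilities but avoids underflow on long sequences.

```python
import math

# Hypothetical toy model: next-token probabilities given the last token.
NEXT = {
    "<s>": {"A": 0.6, "B": 0.4},
    "A": {"x": 0.55, "y": 0.45},
    "B": {"x": 0.90, "y": 0.10},
}

def beam_search(beam_width, steps=2):
    beams = [(0.0, ["<s>"])]  # (cumulative log-probability, tokens so far)
    for _ in range(steps):
        candidates = []
        for score, tokens in beams:
            for tok, p in NEXT[tokens[-1]].items():
                candidates.append((score + math.log(p), tokens + [tok]))
        # Keep only the top `beam_width` sequences overall, prune the rest.
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = candidates[:beam_width]
    return beams[0]

for width in (1, 2):
    score, tokens = beam_search(width)
    print(width, tokens, round(math.exp(score), 2))
```

With a width of 1 this is greedy and returns A then x (probability about 0.33). With a width of 2 the B branch survives the first step and wins with B then x (about 0.36).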

Beam search tree (B = 2)

Step the search. At each level, every surviving beam branches into candidates; only the top B by cumulative probability survive (highlighted), the rest are pruned (faded). The winning path is the best total-probability sequence, not the greedy one.

When beam search helps — and when it hurts

Beam search is excellent for closed-ended tasks: machine translation, summarization, speech recognition — anywhere there’s essentially one correct, high-probability answer and you want to find it. Maximizing sequence probability is the right goal there.

But for open-ended generation — storytelling, chat, brainstorming — beam search is often worse than sampling. Remember Chapter 2: high-probability text is bland and repetitive. Beam search, by design, hunts for the highest-probability sequence, so it produces safe, generic, often repetitive output — and increasing the beam width makes this worse, not better. For creative text you want surprise, which means sampling (temperature + top-p), not probability maximization.

Misconception: “Beam search is strictly better than greedy because it searches more.” Only for closed-ended tasks. For open-ended generation, beam search’s probability-maximizing goal is the wrong objective — it amplifies the blandness problem. The decoder must match the task: beam for “find the one right answer,” sampling for “generate something interesting.”

For which kind of task is beam search the right choice?

Chapter 7: The Modern Stack

Real systems don’t pick one method — they apply a pipeline of filters to the logits in sequence, then sample. Understanding the order makes everything click.

1. Logits
raw scores from the model
2. Penalties
repetition / frequency / presence adjust logits
3. Temperature
divide logits by T
4. Top-k / Top-p
truncate the tail
5. Sample
draw one token from what survives

Repetition penalties

To fight the loops from Chapter 2 directly, decoders adjust the logits of tokens that already appeared. A repetition penalty rescales the logit of any token seen recently so that it becomes less likely to recur; a common convention divides a positive logit by a factor greater than one and multiplies a negative logit by it, since dividing a negative logit would raise it. Some variants subtract a fixed amount instead. A frequency penalty scales with how many times a token appeared (the more you’ve said it, the harder it’s pushed down). A presence penalty is a flat penalty for any token that appeared at all, nudging the model toward new vocabulary. These are why modern chat models rarely loop the way raw greedy would.
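The three penalties can be sketched as plain functions over a dict of logits. The function names and default values here are hypothetical, and the exact formulas vary between libraries: the additive form for frequency and presence penalties and the divide-or-multiply form for the repetition penalty are common conventions, assumed here rather than taken from any one implementation.

```python
from collections import Counter

def apply_penalties(logits, generated, frequency_penalty=0.0, presence_penalty=0.0):
    """Subtract count * frequency_penalty, plus a flat presence_penalty if seen."""
    counts = Counter(generated)
    return {
        tok: z
        - counts[tok] * frequency_penalty
        - (presence_penalty if counts[tok] > 0 else 0.0)
        for tok, z in logits.items()
    }

def apply_repetition_penalty(logits, generated, penalty=1.2):
    """Rescale logits of seen tokens so they always move down (penalty > 1)."""
    seen = set(generated)
    out = {}
    for tok, z in logits.items():
        if tok in seen:
            z = z / penalty if z > 0 else z * penalty
        out[tok] = z
    return out

logits = {"best": 3.0, "the": 2.0, "new": 1.0}
history = ["the", "best", "the"]
print(apply_penalties(logits, history, frequency_penalty=0.5, presence_penalty=0.2))
print(apply_repetition_penalty({"a": 2.0, "b": -1.0}, ["a", "b"], penalty=2.0))
```

In the first call, "the" appeared twice and drops furthest, "best" appeared once and drops less, and "new" is untouched.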

Newer truncation methods

Min-p sampling keeps tokens whose probability is at least some fraction of the top token’s probability — a cleaner adaptive cut that scales with the peak. Typical sampling keeps tokens whose surprise is close to the distribution’s average, aiming to match the natural “surprise level” of human text rather than just the high-probability head. These are refinements on the same idea: trim the distribution intelligently before sampling.
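Min-p is simple enough to show in full. A sketch following the description above: the threshold is a fraction of the top token's probability, so it scales with the peak. The example distributions and the value 0.2 are picked for illustration, not recommended settings.

```python
def min_p_filter(probs, min_p):
    """Keep tokens with probability >= min_p * (top probability), renormalize."""
    threshold = min_p * max(probs)
    kept = [p if p >= threshold else 0.0 for p in probs]
    total = sum(kept)
    return [p / total for p in kept]

open_case = min_p_filter([0.5, 0.25, 0.15, 0.07, 0.02, 0.01], 0.2)
peaked_case = min_p_filter([0.92, 0.05, 0.02, 0.01], 0.2)
print([round(p, 3) for p in open_case])    # [0.556, 0.278, 0.167, 0.0, 0.0, 0.0]
print([round(p, 3) for p in peaked_case])  # [1.0, 0.0, 0.0, 0.0]
```

When the top token sits at 0.5 the cutoff is 0.1; when it sits at 0.92 the cutoff rises to 0.184 and everything else is dropped.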

Goal | Typical settings
Factual Q&A, code, math | T = 0 (greedy) or T = 0.2, top-p = 1
Balanced assistant chat | T = 0.7, top-p = 0.9
Creative writing, brainstorming | T = 0.9–1.1, top-p = 0.95
Translation / summarization | beam search (B = 4–8)
Anti-repetition (long output) | add frequency/presence penalty

Key insight: Order matters. Penalties and temperature reshape logits; truncation (top-k/top-p/min-p) then prunes; sampling draws last. Temperature before truncation means you shape confidence first, then decide how much tail to allow. Knowing the pipeline turns a wall of API parameters into a sensible sequence of decisions.
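Putting the stages in order gives a single sampling function. This is a sketch of the pipeline as described here, not any particular library's implementation: `sample_next` and its parameter names are hypothetical, penalties are left out for brevity (they would adjust `logits` before the first step), and the choice to renormalize after top-k before measuring top-p mass is an assumption, since implementations differ on that detail.

```python
import math
import random

def sample_next(logits, temperature=1.0, top_k=0, top_p=1.0, rng=None):
    """Pick one token from a dict of token -> logit."""
    if rng is None:
        rng = random.Random()
    if temperature == 0:
        return max(logits, key=logits.get)  # greedy special case
    # Temperature: divide logits by T, then exponentiate (max subtracted for stability).
    m = max(logits.values())
    exps = {t: math.exp((z - m) / temperature) for t, z in logits.items()}
    ranked = sorted(exps.items(), key=lambda kv: kv[1], reverse=True)
    # Top-k: keep a fixed count (0 means off).
    if top_k > 0:
        ranked = ranked[:top_k]
    total = sum(e for _, e in ranked)
    # Top-p: keep the smallest prefix whose renormalized mass reaches top_p.
    kept, cumulative = [], 0.0
    for tok, e in ranked:
        kept.append((tok, e))
        cumulative += e / total
        if top_p < 1.0 and cumulative >= top_p:
            break
    tokens = [t for t, _ in kept]
    weights = [e for _, e in kept]  # choices() treats these as relative weights
    return rng.choices(tokens, weights=weights, k=1)[0]

logits = {"Paris": 5.0, "a": 1.5, "the": 1.0, "banana": -2.0}
print(sample_next(logits, temperature=0.7, top_p=0.9, rng=random.Random(0)))
```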

Misconception: “Temperature = 0 and top-p = 0 do the same thing.” T = 0 is greedy: always the top token. Because dividing by zero is undefined, implementations typically treat T = 0 as a special case meaning argmax. top-p = 0 is ill-defined; by convention it usually collapses to the single top token. Both end up deterministic, but by different routes. To get reproducible factual output, set temperature to 0; that’s the clean way to disable sampling randomness.

In the modern decoding pipeline, what is the role of a frequency penalty?

Chapter 8: Live Sampler

Now drive all the knobs at once. Below is a fixed set of candidate next-tokens with realistic logits. Adjust temperature, top-k, and top-p, and watch the surviving distribution reshape in real time. Then hit “sample 100×” to draw repeatedly and see which tokens actually come out and how often.

Decode it yourself

The bars are the post-filter probabilities (greyed = cut by top-k/top-p). Temperature shapes, then top-k and top-p truncate. “Sample” draws from what survives; the tally shows the realized frequencies.

Slider defaults: temperature = 0.80; top-k = 0 (off); top-p = 1.00

Experiments worth running:

Set temperature to 0.1. The distribution collapses onto the top token — near-greedy. Sample 100 times and you’ll get the same token almost every draw. Deterministic, focused, boring.

Set temperature to 1.8 with top-p = 1. The distribution flattens and the tally spreads across many tokens, including unlikely ones. Creative, but you’ll see junk tokens appear — the long-tail risk.

Now keep temperature 1.8 but set top-p = 0.9. The flat distribution is truncated to its nucleus — the junk tokens vanish from the tally while diversity among good tokens remains. This is the combination that gives you creativity without incoherence, and it’s why the modern default pairs a moderate temperature with nucleus sampling.
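If the widget isn't available, the three experiments can be approximated offline. The candidate tokens and logits below are invented for this sketch, and the exact tallies depend on the random seed, so the useful thing to look at is the pattern: which tokens show up at all under each setting.

```python
import math
import random
from collections import Counter

# Hypothetical candidate tokens and logits, made up for illustration.
LOGITS = {"Paris": 3.0, "Lyon": 2.0, "Nice": 1.5, "a": 0.5, "banana": -1.0, "zxq": -2.0}

def filtered_probs(logits, temperature, top_p=1.0):
    """Temperature-scaled softmax, then nucleus truncation, then renormalization."""
    m = max(logits.values())
    exps = {t: math.exp((z - m) / temperature) for t, z in logits.items()}
    total = sum(exps.values())
    ranked = sorted(exps.items(), key=lambda kv: kv[1], reverse=True)
    kept, cumulative = {}, 0.0
    for tok, e in ranked:
        kept[tok] = e / total
        cumulative += e / total
        if top_p < 1.0 and cumulative >= top_p:
            break
    return {t: p / cumulative for t, p in kept.items()}

def tally(probs, n=100, seed=0):
    """Draw n tokens and count how often each one comes out."""
    rng = random.Random(seed)
    tokens = list(probs)
    draws = rng.choices(tokens, weights=[probs[t] for t in tokens], k=n)
    return Counter(draws)

for T, p in [(0.1, 1.0), (1.8, 1.0), (1.8, 0.9)]:
    print(f"T={T}, top-p={p}:", dict(tally(filtered_probs(LOGITS, T, p))))
```

With these logits, the T = 1.8, top-p = 0.9 nucleus holds four tokens ("Paris", "Lyon", "Nice", "a"), so "banana" and "zxq" can never be drawn in the third experiment.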

The whole lesson in one widget: temperature is the spread (Ch 3), top-k caps the count (Ch 4), top-p caps the mass (Ch 5), and sampling realizes the choice (Ch 0–2). If you can predict how the tally shifts as you turn each knob, you understand decoding.

No quiz here — the sampler is the test.

Chapter 9: Cheat Sheet & Connections

The methods at a glance

Method | What it does | Best for
Greedy | always the top token (T = 0) | short factual answers; repeats on long text
Temperature | divide logits by T before softmax | the master randomness knob; ~0.7 balanced
Top-k | keep k highest tokens, renormalize | simple tail-cutting; fixed count
Top-p (nucleus) | keep smallest set with mass ≥ p | adaptive tail-cutting; the common default
Beam search | keep B best sequences, maximize total prob | translation, summarization (closed-ended)
Penalties | lower logits of repeated tokens | anti-repetition on long output

The pipeline order

logits → penalties → ÷ temperature → top-k → top-p → renormalize → sample

The three things to remember

1. The model outputs a distribution, not a word. Decoding is the separate algorithm that turns that distribution into text. Same weights + different decoder = different personality.

2. It’s a quality–diversity dial. Greedy/low-T/beam = focused but bland and loopy. High-T/wide-p = creative but risky. Temperature spreads, truncation (top-k/top-p) cuts the junk tail.

3. Match the decoder to the task. Facts and code → near-greedy. Chat → T~0.7 + top-p~0.9. Creative → higher T + top-p. Translation → beam search. Never one-size-fits-all.

Where to go next

  • Tokenization — the other end: how text becomes the token IDs whose distribution we’re now sampling from.
  • The Transformer — the model that produces the logits at every step.
  • Test-Time Compute — self-consistency and best-of-N rely on sampling: draw many diverse outputs (high-T) and vote/verify.
  • KV-Cache & Inference — how the generation loop is made fast enough to decode token-by-token in real time.

Closing thought: The model is a fixed oracle of probabilities; the decoder is the voice that reads them aloud. The same oracle can sound like a precise lawyer or a wandering poet, and you choose, with three or four numbers, which one shows up. Decoding is the cheapest, most underrated lever in all of LLM behavior.

You want reproducible, factual answers from a model for a math task. What decoding setting is most appropriate?