LLM Inference & Adaptation

Tokenization

Before a language model sees a single number, your text must become tokens. Too coarse and you can’t spell rare words; too fine and sequences explode. Byte-pair encoding threads the needle — and it’s why “strawberry” has three r’s the model can’t count.

Prerequisites: Text is a string of characters + A model takes a list of integers as input. That’s it.
10
Chapters
9+
Simulations
0
Assumed Knowledge

Chapter 0: The Vocabulary Problem

A language model is, at its core, a machine that maps a list of integers to a probability distribution over the next integer. It never sees text. So before anything else, we face a deceptively hard question: how do we turn a string like “unbelievable” into integers?

Your first instinct is a dictionary: assign every word an ID. “cat” is 5, “dog” is 6, “unbelievable” is 90,210. Simple — until someone types “unbelievableee” or “covfefe” or a brand-new word coined yesterday. It’s not in your dictionary. This is the out-of-vocabulary problem, or OOV, and it is fatal: the model literally has no number for that word.

Your second instinct fixes OOV: use characters. There are only about a hundred of them, every word is spellable, nothing is ever out-of-vocabulary. But now “unbelievable” is twelve tokens instead of one. A paragraph becomes thousands of tokens, and since a transformer’s cost grows with the square of sequence length, you’ve made the model wildly expensive and forced it to learn spelling from scratch before it can even learn meaning.

So we’re trapped between two bad extremes. Words: short sequences, but a giant vocabulary and OOV disaster. Characters: tiny vocabulary, no OOV, but punishingly long sequences. Drag the slider below between these poles and watch both costs move in opposite directions — there’s no setting that wins on both.

The granularity trade-off

Slide from character-level (left) to word-level (right). Watch the sequence length drop while the vocabulary size explodes. Subword tokenization (middle) is the sweet spot.

granularity50

The resolution is subword tokenization: a vocabulary of word-pieces. Common words stay whole (“the”, “cat” are single tokens), while rare words break into reusable chunks (“unbelievable” might become “un” + “believ” + “able”). You get short sequences for ordinary text, a bounded vocabulary, and — crucially — zero OOV, because in the worst case any word falls back to its individual characters. The dominant algorithm for building this vocabulary is byte-pair encoding, and that’s where we’re headed.

Misconception: “Tokenization is just a boring preprocessing detail.” It is load-bearing. The reason GPT-style models struggle to count the r’s in “strawberry” or do arithmetic on long numbers is that the tokenizer chopped those strings into chunks that hide the individual characters and digits. Tokenization decides what the model can even perceive.

Why is word-level tokenization (one ID per word) impractical for a general language model?

Chapter 1: Characters versus Words, Concretely

Let’s make the trade-off quantitative, because the numbers are what drive every design decision. Take one sentence: “The unbelievable cat.”

Word-level. Four tokens: “The”, “unbelievable”, “cat”, “.”. Beautifully short. But to cover real English you need a vocabulary of hundreds of thousands of words — and that’s before plurals, tenses, typos, names, code, and other languages. Worse, the model treats “cat” and “cats” as completely unrelated IDs, learning nothing shared between them.

Character-level. The same sentence is twenty-one tokens (counting spaces and the period). The vocabulary is tiny — maybe 100 symbols — and OOV is impossible. But the model must now learn that “c-a-t” in sequence means a furry animal, reconstructing every word from letters every single time. And those twenty-one tokens, versus four, mean roughly five times the compute and memory for the same sentence.

Why sequence length hurts so much

A transformer’s self-attention compares every token to every other token. For a sequence of length n, that’s n-squared comparisons. Go from 4 tokens to 21 and attention cost rises from 16 units to 441 — a 27-fold jump for the same sentence. This is why sequence length is not a minor efficiency knob; it’s the dominant cost, and shortening sequences is the whole economic point of using bigger tokens.

Same sentence, three tokenizations

Toggle the scheme and watch how “The unbelievable cat.” gets chopped. Note the token count and the implied attention cost (count squared).

The subword scheme, shown in the middle option, gives five tokens here: “The”, “un”, “believable” (or further split), “cat”, “.”. Almost as short as words, but the rare prefix “un” is now a reusable piece that also appears in “undo”, “unhappy”, and thousands of other words. That sharing is the magic: the model learns “un” once and applies it everywhere.

Key insight: The goal of tokenization is to minimize sequence length subject to a fixed vocabulary budget, with no out-of-vocabulary failures. Words minimize length but blow the vocab budget and fail on OOV. Characters satisfy the budget and OOV but maximize length. Subwords optimize all three at once — which is exactly what a learned algorithm like BPE finds.

Misconception: “A bigger vocabulary is always better because sequences get shorter.” Up to a point. A huge vocabulary means a huge embedding table and output layer (vocabulary times hidden-size parameters each), most tokens are seen rarely so their embeddings train poorly, and the gains in sequence length shrink. Real models settle around 30,000 to 130,000 tokens — a sweet spot, not a maximum.

Going from 4 tokens to 21 tokens for the same sentence, roughly how much does self-attention’s pairwise cost grow?

Chapter 2: The Subword Idea

Here’s the principle that makes subwords work, and it comes from a simple observation about language: word frequency is extraordinarily lopsided. A few hundred words (“the”, “of”, “and”) make up most of all text, while millions of rare words each appear almost never. This is Zipf’s law, and it’s the lever subword tokenization pulls.

The strategy writes itself: spend whole tokens on frequent words, and spend pieces on rare ones. “the” deserves its own token because you’ll use it constantly. “antidisestablishmentarianism” does not — it can be assembled from common pieces like “anti”, “dis”, “establish”, “ment”. The vocabulary becomes a set of building blocks of mixed sizes, sized to frequency.

And notice what this buys us for the OOV problem. Suppose the model has never seen “tokenizational”. Word-level chokes. But subword can fall back: maybe it splits as “token” + “ization” + “al”, all known pieces. In the absolute worst case — a string of pure gibberish — it falls all the way back to single characters, which are always in the vocabulary. OOV is mathematically impossible as long as every character is a token. That guarantee is why every modern LLM uses subwords.

Frequency decides granularity

Frequent words (left, tall bars) earn whole tokens. Rare words (right) get split into pieces. Hover the idea: the vocabulary spends its budget where it’s used most.

But this raises the real question, the one the rest of the lesson answers: which pieces should be in the vocabulary? “un” and “ization” are obviously useful, but who decides? We can’t hand-write a list of good subwords for every language. We need an algorithm that learns the best pieces from a corpus — one that looks at real text and discovers that “ing”, “tion”, and “the” are worth their own tokens. That algorithm is byte-pair encoding.

Think of it this way: Subword vocabularies are like a typist’s shorthand. The most common phrases get a single quick stroke; everything else is spelled out. You’d never invent a shorthand symbol for a word you use once a year, but “the” absolutely earns one. BPE is the procedure that discovers the optimal shorthand by watching what you actually write.

Misconception: “Subword pieces correspond to meaningful units like prefixes and suffixes.” Sometimes they coincide (“un”, “ing”), but BPE pieces are statistical, not linguistic. They’re whatever byte-pairs happen to be frequent — which is why a token can span a word boundary or split a word in a linguistically weird place. The pieces are optimized for compression, not grammar.

Why can subword tokenization guarantee zero out-of-vocabulary failures?

Chapter 3: Byte-Pair Encoding, From Scratch

BPE was originally a 1994 data-compression trick, repurposed for tokenization by Sennrich and colleagues in 2016. The idea is almost embarrassingly simple, and you can run it in your head: repeatedly find the most frequent adjacent pair of symbols and merge it into a new symbol. Do that a few thousand times, and the merges you accumulate are your subword vocabulary.

Let’s train it by hand on a tiny corpus. Suppose our entire training text, with word frequencies, is:

“low” ×5,   “lower” ×2,   “newest” ×6,   “widest” ×3

Step 0 — start from characters. Every word is split into characters, with a special end-of-word marker (we’ll write it as a bullet) so the algorithm knows where words end:

l o w • (×5)   l o w e r • (×2)   n e w e s t • (×6)   w i d e s t • (×3)

Step 1 — count all adjacent pairs. Across the whole corpus, weighted by word frequency. The pair “e s” appears in “newest” (6) and “widest” (3), so its count is 9. The pair “s t” is likewise 9. The pair “l o” appears in “low” (5) and “lower” (2), count 7. We pick the most frequent — say “e s” at 9 — and merge it.

Step 2 — apply the merge. Every “e s” becomes the single symbol “es”:

n e w es t • (×6)   w i d es t • (×3)

Step 3 — repeat. Now “es t” is the most frequent pair at 9, so we merge it into “est”. Then “est •” merges into “est•” (a whole word-ending). Each merge is recorded in order. After a handful of merges our vocabulary contains characters plus learned pieces: “es”, “est”, “est•”, “lo”, “low”, and so on. Step through it live:

BPE training, one merge at a time

Click to perform the next merge on our toy corpus. The most frequent adjacent pair is highlighted and fused; the merge list grows below. Watch “newest” and “widest” converge on the shared piece “est”.

From scratch

python
from collections import Counter

def get_pairs(words):                  # words: {tuple_of_symbols: freq}
    pairs = Counter()
    for sym, freq in words.items():
        for i in range(len(sym)-1):
            pairs[(sym[i], sym[i+1])] += freq   # weight by word frequency
    return pairs

def merge(words, pair):
    out = {}
    for sym, freq in words.items():
        new, i = [], 0
        while i < len(sym):
            if i < len(sym)-1 and (sym[i],sym[i+1])==pair:
                new.append(sym[i]+sym[i+1]); i += 2      # fuse the pair
            else:
                new.append(sym[i]); i += 1
        out[tuple(new)] = freq
    return out

words = {tuple(w)+("•",): f for w,f in
         [("low",5),("lower",2),("newest",6),("widest",3)]}
merges = []
for _ in range(10):                         # vocab budget = 10 merges
    pairs = get_pairs(words)
    if not pairs: break
    best = max(pairs, key=pairs.get)         # most frequent adjacent pair
    words = merge(words, best); merges.append(best)
# merges = [('e','s'), ('es','t'), ('l','o'), ('lo','w'), ...] in order

That’s the entire training algorithm — under twenty lines. The output is just the ordered merge list. The number of merges you allow is your one hyperparameter: it directly sets the vocabulary size. More merges means bigger pieces and shorter sequences, up to your budget.

Misconception: “BPE picks the merge that helps the model most.” No — it’s purely greedy on raw frequency. It merges the most common adjacent pair, with no notion of meaning or downstream loss. It’s a compression heuristic, and remarkably, that simple greedy frequency rule produces excellent tokenizers. (WordPiece, next chapter, swaps frequency for a likelihood criterion.)

In BPE training, what determines which pair gets merged at each step?

Chapter 4: Encoding & Decoding New Text

Training gives us an ordered merge list. Now: how do we tokenize a brand-new word the model has never seen? We replay the merges, in the exact order they were learned.

Take the new word “newer”. Start from characters: n e w e r •. Now walk down the merge list and apply each merge wherever it fits, in order. Suppose our learned merges (in order) were: 1) e+s, 2) es+t, 3) l+o, 4) lo+w, 5) n+e, 6) ne+w, 7) w+•... Applying to “newer”: merge 5 (n+e) gives ne w e r •; merge 6 (ne+w) gives new e r •. No other merges apply (there’s no “e r” merge), so “newer” tokenizes as “new”, “e”, “r”, • — four tokens, where the frequent prefix “new” collapsed to one piece and the rare ending stayed as characters.

The order matters enormously. Merges learned earlier are applied earlier, so they take priority — this is what makes encoding deterministic and consistent with training. Watch it run on words you pick:

Encoding by replaying merges

Pick a word; watch BPE start from characters and apply the learned merges in order until none fit. The final pieces are the tokens.

Decoding is trivial

Going from tokens back to text is just concatenation. The token IDs map back to their string pieces, you glue them together, and convert the end-of-word markers back into spaces. There’s no ambiguity and no computation — decoding is the easy direction. (This is why a model’s output, a list of token IDs, becomes readable text instantly.)

The cost of a token

Here’s a practical consequence you pay for in dollars. API pricing is per-token, and context windows are measured in tokens. Common English words are usually one token, so a rule of thumb is ~0.75 words per token, or ~4 characters per token. But unusual text is far more expensive: a long hexadecimal hash, a rare chemical name, or text in an under-represented language can be one token per character, multiplying your cost and eating your context window. The tokenizer is, quite literally, your billing meter.

Key insight: Encoding is greedy replay of an ordered merge list; decoding is concatenation. This asymmetry — hard-ish to encode, trivial to decode — is fine, because encoding happens once on the input while the model generates token-by-token, and each generated token decodes instantly to show the user.

Misconception: “The same word always becomes the same number of tokens everywhere.” It depends entirely on the tokenizer’s training corpus. A word common in the training data is one token; the identical word is several tokens under a tokenizer trained on different text. There is no universal token count — it’s a property of the specific vocabulary.

When encoding a new word with a trained BPE tokenizer, why is the order of merges critical?

Chapter 5: WordPiece & Unigram

BPE isn’t the only way to build a subword vocabulary. Two important alternatives change the criterion for what makes a good merge or piece, and you’ll meet both in real models.

WordPiece (BERT)

WordPiece, used by BERT, looks almost identical to BPE — it builds a vocabulary by merging pieces — but it chooses merges by likelihood, not raw frequency. Instead of merging the most common pair, it merges the pair that most increases the likelihood of the training data under a unigram language model. Concretely, it favors the pair whose merged frequency is high relative to the product of its parts’ frequencies.

Why does that matter? Consider the pair “t h”. Both “t” and “h” are individually super common, so they bump into each other often by chance — BPE might merge them just on raw count. WordPiece asks a sharper question: does seeing “t” and “h” together tell us more than seeing them separately? It merges pairs that are genuinely informative as a unit, not just frequently adjacent. WordPiece also marks word-interior pieces with a “##” prefix, so “playing” becomes “play” + “##ing” — the “##” says “this attaches to the previous piece.”

Unigram (SentencePiece)

Unigram, by Kudo, flips the whole procedure upside down. BPE and WordPiece are bottom-up: start small, merge upward. Unigram is top-down: start with a huge candidate vocabulary (all frequent substrings), then iteratively prune the pieces that hurt the corpus likelihood least, until you hit the target size.

The payoff is that Unigram is probabilistic. Each piece carries a probability, and a word can be segmented many ways — “hello” could be “hel”+“lo” or “he”+“llo” or “h”+“ello”. Unigram picks the segmentation with the highest total probability (via a Viterbi-style search), and it can even sample different segmentations during training — a regularizer called subword sampling that makes models more robust. BPE, by contrast, always produces exactly one segmentation.

Three criteria, compared

Toggle the method and see how each scores a candidate merge or segmentation. BPE counts; WordPiece weighs likelihood gain; Unigram prunes from the top down.

MethodDirectionCriterionUsed by
BPEbottom-up mergemost frequent pairGPT-2/3/4, RoBERTa, LLaMA
WordPiecebottom-up mergemax likelihood gainBERT, DistilBERT
Unigramtop-down prunemin likelihood lossT5, ALBERT, many multilingual
Misconception: “SentencePiece is a fourth algorithm.” No — SentencePiece is a library/framework (next chapter) that implements both BPE and Unigram. People say “SentencePiece” loosely to mean “Unigram via SentencePiece,” but the two ideas are separate: the algorithm (Unigram or BPE) versus the tool that runs it.

How does Unigram tokenization differ fundamentally from BPE?

Chapter 6: Byte-Level BPE

We claimed subwords give zero OOV because they fall back to characters. But there’s a hole in that promise: what if a character isn’t in the vocabulary? Emoji, rare Chinese characters, mathematical symbols, accented letters — there are over 150,000 Unicode code points, and you can’t put them all in a base vocabulary without bloating it. So character-level fallback can still hit OOV on exotic input.

GPT-2 introduced the elegant fix: byte-level BPE. Don’t tokenize characters — tokenize bytes. Every piece of text, in any language, with any emoji, is ultimately a sequence of UTF-8 bytes, and there are exactly 256 possible byte values. Put all 256 in the base vocabulary, and now nothing is ever out-of-vocabulary, ever, in any language or script. The 150,000-character problem collapses to a 256-byte certainty.

Here’s the mechanism. The emoji 🍓 isn’t one character to the tokenizer — in UTF-8 it’s four bytes. A rare Chinese character is three bytes. BPE then runs on this byte stream exactly as before, learning to merge common byte sequences into larger tokens. Common English text still tokenizes efficiently (those bytes are ASCII and merge into familiar word-pieces), while exotic input safely decomposes into its raw bytes.

Everything is bytes

See how characters become UTF-8 bytes. ASCII letters are 1 byte; accented and CJK characters are 2–3; emoji are 4. Byte-level BPE runs on this stream, so the 256 byte values guarantee no OOV.

Why this matters for what models can do

The byte-level trick is why GPT models never crash on weird input — paste any garbage and it tokenizes. But it also explains famous failures. Because a number like “1234567” gets chunked into byte-pair tokens like “123” + “4567” (whatever was frequent in training), the model doesn’t see individual digits and struggles with arithmetic. And “strawberry” might be “str” + “aw” + “berry” — the three r’s are buried inside tokens, so “how many r’s?” is genuinely hard for the model to answer. These aren’t reasoning failures; they’re perception failures caused by tokenization.

Key insight: Byte-level BPE turns the open-ended “which characters do we support?” question into a closed 256-symbol answer. Universal coverage, no OOV, any language — at the small cost that non-English and symbol-heavy text uses more tokens (more bytes per character), which is part of why LLMs are pricier and weaker in low-resource languages.

Misconception: “Byte-level means everything is one byte per token.” Only in the worst case. After BPE merges, common byte sequences become single tokens just like before — “the” is still one token. Byte-level only changes the base alphabet (256 bytes instead of ~100 characters); the merging on top is identical, so efficiency on common text is preserved.

What problem does byte-level BPE solve that character-level BPE does not?

Chapter 7: SentencePiece & Whitespace

One last practical wrinkle, and it’s subtler than it looks: what about spaces? Most early tokenizers pre-split text on whitespace, then tokenized each word. But that throws away information — was there one space or three? A tab? And it’s language-specific: Chinese and Japanese don’t use spaces between words at all, so “split on spaces first” simply doesn’t work.

SentencePiece (Kudo & Richardson, the library, not an algorithm) solves this by treating the raw text as a pure stream of characters — including the spaces — with no pre-tokenization at all. It replaces each space with a visible marker, a special underscore-like symbol (▁), and then runs BPE or Unigram on the whole thing. The space literally becomes part of the tokens.

This has two lovely consequences. First, tokenization is fully reversible: because spaces are encoded as ▁ inside tokens, you can always reconstruct the exact original text, spaces and all, by concatenating tokens and turning ▁ back into spaces. No information is lost. Second, it’s language-agnostic: it makes no assumption that words are space-separated, so it works identically for English, Chinese, code, or anything else.

You’ll notice this in tokenizer outputs: the word “hello” at the start of a sentence and “ hello” mid-sentence are different tokens — one has the leading ▁ and one doesn’t. The leading space is part of the token. This is why, in GPT tokenizers, “hello” and “ hello” have different IDs, a detail that trips up everyone writing prompts the first time.

Spaces become tokens (the ▁ marker)

Watch how SentencePiece attaches the space marker to the following word. “the cat” tokenizes with ▁ glued to each word-start, making the split fully reversible.

Special tokens

Beyond word-pieces, every real vocabulary reserves a handful of special tokens that carry structural meaning: a beginning-of-sequence token, an end-of-sequence token (so the model can signal “I’m done generating”), a padding token to fill batches to equal length, and an unknown token as a last resort. Modern chat models add many more — tokens that mark the start of a user turn, an assistant turn, a system prompt, or tool calls. These aren’t learned by BPE; they’re inserted by hand and given dedicated IDs, and the model learns their meaning during training.

Why reversibility matters: If tokenization weren’t losslessly reversible, a model couldn’t reproduce exact whitespace, indentation, or formatting — fatal for generating code, where a misplaced space breaks everything. SentencePiece’s ▁ scheme guarantees the bytes you put in are exactly the bytes you can get back out.

Misconception: “Leading spaces are stripped before tokenizing.” The opposite — the leading space is preserved and bound to the next token. This is why prompt tricks like ending your prompt with a trailing space can subtly change the model’s next-token distribution: you’ve changed which token it’s likely to produce.

Why does SentencePiece replace spaces with a visible ▁ marker instead of splitting on whitespace?

Chapter 8: Live Tokenizer

Time to put it all together. Below is a real BPE tokenizer: it trains on a small corpus with a vocabulary budget you control, then tokenizes whatever you type. Watch your text shatter into pieces, count the tokens, and see how the vocabulary size changes everything.

Type and tokenize

Adjust the merge budget (vocabulary size), type any text, and watch BPE tokenize it live. More merges → bigger pieces → fewer tokens. Try rare words and gibberish to see the character fallback.

merge budget30

Things to try, and what they reveal:

Set the merge budget to zero. Every character is its own token — pure character-level. Your text is as long as it can be. This is BPE before any training.

Slide the budget up. Watch common pieces fuse: first frequent letter-pairs, then whole common words from the corpus collapse into single tokens. The token count drops. This is the sequence-length payoff in real time.

Type a word that’s not in the corpus — say “zyxwv”. It can’t use any learned merges, so it falls back to characters. That’s the OOV guarantee working: unfamiliar text is still tokenizable, just less efficiently. Familiar text is cheap; unfamiliar text is expensive — exactly the economics we discussed.

The whole lesson in one widget: the merge budget is the vocabulary/sequence-length trade-off from Chapter 0. The fusing of pieces is the BPE training of Chapter 3. The character fallback on unknown words is the OOV guarantee of Chapter 2. Tokenization is just this — learned shorthand, applied greedily, with characters as the safety net.

No quiz here — the tokenizer is the test. If you can predict how the token count changes as you move the budget and type rare versus common words, you understand BPE.

Chapter 9: Cheat Sheet & Connections

The whole pipeline

Raw text
“unbelievable”
Bytes
UTF-8 byte stream (256-symbol base, no OOV)
Merge replay
apply learned merges in order
Tokens
“un”,“believ”,“able”
IDs
[un→291, believ→8420, able→712]

Algorithm comparison

AlgorithmHow it builds vocabSegmentationModels
BPEgreedily merge most frequent pairsingle, deterministicGPT-2/3/4, LLaMA, RoBERTa
Byte-level BPEBPE on UTF-8 bytes (256 base)single; universal coverageGPT-2+, most modern LLMs
WordPiecemerge by likelihood gain; ## prefixsingle, greedy longest-matchBERT, DistilBERT
Unigramprune from large vocab by likelihoodprobabilistic, multipleT5, ALBERT, multilingual

The numbers worth memorizing

QuantityTypical value
Vocabulary size30k (BERT) · 50k (GPT-2) · 100k–130k (GPT-4, LLaMA-3)
Bytes per token (English)~4 characters · ~0.75 words
Base alphabet (byte-level)exactly 256
Cost of non-English / symbolsoften 2–4× more tokens per word

The three things to remember

1. Subwords thread the needle. Whole tokens for common words, pieces for rare ones, characters/bytes as the OOV-proof safety net. Short sequences, bounded vocabulary, no failures.

2. BPE is greedy frequency merging. Train by merging the most frequent pair repeatedly; encode by replaying merges in order; decode by concatenating. Byte-level makes it universal.

3. Tokenization is perception. It decides what the model can see. Buried digits and letters explain arithmetic and spelling failures; per-token billing and context limits make it your cost meter.

Where to go next

  • Embedding Layers — what happens to a token ID next: it indexes a learned vector. Tokenization’s output is the embedding table’s input.
  • The Transformer — the model that consumes these tokens; why sequence length (set by tokenization) drives its cost.
  • Sampling & Decoding — the reverse direction: how the model picks the next token ID to emit.
  • Vector Embeddings — the semantic vectors that tokens map into.
Closing thought: Tokenization is the unglamorous seam between human text and machine math — and like most seams, it’s invisible until it tears. Every “why can’t the model count letters” story starts here. Master it, and a whole class of LLM mysteries stops being mysterious.
Why does a GPT model struggle to count the number of r’s in “strawberry”?