Before a language model sees a single number, your text must become tokens. Too coarse and you can’t spell rare words; too fine and sequences explode. Byte-pair encoding threads the needle — and it’s why “strawberry” has three r’s the model can’t count.
A language model is, at its core, a machine that maps a list of integers to a probability distribution over the next integer. It never sees text. So before anything else, we face a deceptively hard question: how do we turn a string like “unbelievable” into integers?
Your first instinct is a dictionary: assign every word an ID. “cat” is 5, “dog” is 6, “unbelievable” is 90,210. Simple — until someone types “unbelievableee” or “covfefe” or a brand-new word coined yesterday. It’s not in your dictionary. This is the out-of-vocabulary problem, or OOV, and it is fatal: the model literally has no number for that word.
Your second instinct fixes OOV: use characters. There are only about a hundred of them, every word is spellable, nothing is ever out-of-vocabulary. But now “unbelievable” is twelve tokens instead of one. A paragraph becomes thousands of tokens, and since a transformer’s cost grows with the square of sequence length, you’ve made the model wildly expensive and forced it to learn spelling from scratch before it can even learn meaning.
So we’re trapped between two bad extremes. Words: short sequences, but a giant vocabulary and OOV disaster. Characters: tiny vocabulary, no OOV, but punishingly long sequences. Drag the slider below between these poles and watch both costs move in opposite directions — there’s no setting that wins on both.
Slide from character-level (left) to word-level (right). Watch the sequence length drop while the vocabulary size explodes. Subword tokenization (middle) is the sweet spot.
The resolution is subword tokenization: a vocabulary of word-pieces. Common words stay whole (“the”, “cat” are single tokens), while rare words break into reusable chunks (“unbelievable” might become “un” + “believ” + “able”). You get short sequences for ordinary text, a bounded vocabulary, and — crucially — zero OOV, because in the worst case any word falls back to its individual characters. The dominant algorithm for building this vocabulary is byte-pair encoding, and that’s where we’re headed.
Let’s make the trade-off quantitative, because the numbers are what drive every design decision. Take one sentence: “The unbelievable cat.”
Word-level. Four tokens: “The”, “unbelievable”, “cat”, “.”. Beautifully short. But to cover real English you need a vocabulary of hundreds of thousands of words — and that’s before plurals, tenses, typos, names, code, and other languages. Worse, the model treats “cat” and “cats” as completely unrelated IDs, learning nothing shared between them.
Character-level. The same sentence is twenty-one tokens (counting spaces and the period). The vocabulary is tiny — maybe 100 symbols — and OOV is impossible. But the model must now learn that “c-a-t” in sequence means a furry animal, reconstructing every word from letters every single time. And those twenty-one tokens, versus four, mean roughly five times the compute and memory for the same sentence.
A transformer’s self-attention compares every token to every other token. For a sequence of length n, that’s n-squared comparisons. Go from 4 tokens to 21 and attention cost rises from 16 units to 441 — a 27-fold jump for the same sentence. This is why sequence length is not a minor efficiency knob; it’s the dominant cost, and shortening sequences is the whole economic point of using bigger tokens.
Toggle the scheme and watch how “The unbelievable cat.” gets chopped. Note the token count and the implied attention cost (count squared).
The subword scheme, shown in the middle option, gives five tokens here: “The”, “un”, “believable” (or further split), “cat”, “.”. Almost as short as words, but the rare prefix “un” is now a reusable piece that also appears in “undo”, “unhappy”, and thousands of other words. That sharing is the magic: the model learns “un” once and applies it everywhere.
Here’s the principle that makes subwords work, and it comes from a simple observation about language: word frequency is extraordinarily lopsided. A few hundred words (“the”, “of”, “and”) make up most of all text, while millions of rare words each appear almost never. This is Zipf’s law, and it’s the lever subword tokenization pulls.
The strategy writes itself: spend whole tokens on frequent words, and spend pieces on rare ones. “the” deserves its own token because you’ll use it constantly. “antidisestablishmentarianism” does not — it can be assembled from common pieces like “anti”, “dis”, “establish”, “ment”. The vocabulary becomes a set of building blocks of mixed sizes, sized to frequency.
And notice what this buys us for the OOV problem. Suppose the model has never seen “tokenizational”. Word-level chokes. But subword can fall back: maybe it splits as “token” + “ization” + “al”, all known pieces. In the absolute worst case — a string of pure gibberish — it falls all the way back to single characters, which are always in the vocabulary. OOV is mathematically impossible as long as every character is a token. That guarantee is why every modern LLM uses subwords.
Frequent words (left, tall bars) earn whole tokens. Rare words (right) get split into pieces. Hover the idea: the vocabulary spends its budget where it’s used most.
But this raises the real question, the one the rest of the lesson answers: which pieces should be in the vocabulary? “un” and “ization” are obviously useful, but who decides? We can’t hand-write a list of good subwords for every language. We need an algorithm that learns the best pieces from a corpus — one that looks at real text and discovers that “ing”, “tion”, and “the” are worth their own tokens. That algorithm is byte-pair encoding.
BPE was originally a 1994 data-compression trick, repurposed for tokenization by Sennrich and colleagues in 2016. The idea is almost embarrassingly simple, and you can run it in your head: repeatedly find the most frequent adjacent pair of symbols and merge it into a new symbol. Do that a few thousand times, and the merges you accumulate are your subword vocabulary.
Let’s train it by hand on a tiny corpus. Suppose our entire training text, with word frequencies, is:
Step 0 — start from characters. Every word is split into characters, with a special end-of-word marker (we’ll write it as a bullet) so the algorithm knows where words end:
l o w • (×5) l o w e r • (×2) n e w e s t • (×6) w i d e s t • (×3)
Step 1 — count all adjacent pairs. Across the whole corpus, weighted by word frequency. The pair “e s” appears in “newest” (6) and “widest” (3), so its count is 9. The pair “s t” is likewise 9. The pair “l o” appears in “low” (5) and “lower” (2), count 7. We pick the most frequent — say “e s” at 9 — and merge it.
Step 2 — apply the merge. Every “e s” becomes the single symbol “es”:
n e w es t • (×6) w i d es t • (×3)
Step 3 — repeat. Now “es t” is the most frequent pair at 9, so we merge it into “est”. Then “est •” merges into “est•” (a whole word-ending). Each merge is recorded in order. After a handful of merges our vocabulary contains characters plus learned pieces: “es”, “est”, “est•”, “lo”, “low”, and so on. Step through it live:
Click to perform the next merge on our toy corpus. The most frequent adjacent pair is highlighted and fused; the merge list grows below. Watch “newest” and “widest” converge on the shared piece “est”.
python from collections import Counter def get_pairs(words): # words: {tuple_of_symbols: freq} pairs = Counter() for sym, freq in words.items(): for i in range(len(sym)-1): pairs[(sym[i], sym[i+1])] += freq # weight by word frequency return pairs def merge(words, pair): out = {} for sym, freq in words.items(): new, i = [], 0 while i < len(sym): if i < len(sym)-1 and (sym[i],sym[i+1])==pair: new.append(sym[i]+sym[i+1]); i += 2 # fuse the pair else: new.append(sym[i]); i += 1 out[tuple(new)] = freq return out words = {tuple(w)+("•",): f for w,f in [("low",5),("lower",2),("newest",6),("widest",3)]} merges = [] for _ in range(10): # vocab budget = 10 merges pairs = get_pairs(words) if not pairs: break best = max(pairs, key=pairs.get) # most frequent adjacent pair words = merge(words, best); merges.append(best) # merges = [('e','s'), ('es','t'), ('l','o'), ('lo','w'), ...] in order
That’s the entire training algorithm — under twenty lines. The output is just the ordered merge list. The number of merges you allow is your one hyperparameter: it directly sets the vocabulary size. More merges means bigger pieces and shorter sequences, up to your budget.
Training gives us an ordered merge list. Now: how do we tokenize a brand-new word the model has never seen? We replay the merges, in the exact order they were learned.
Take the new word “newer”. Start from characters: n e w e r •. Now walk down the merge list and apply each merge wherever it fits, in order. Suppose our learned merges (in order) were: 1) e+s, 2) es+t, 3) l+o, 4) lo+w, 5) n+e, 6) ne+w, 7) w+•... Applying to “newer”: merge 5 (n+e) gives ne w e r •; merge 6 (ne+w) gives new e r •. No other merges apply (there’s no “e r” merge), so “newer” tokenizes as “new”, “e”, “r”, • — four tokens, where the frequent prefix “new” collapsed to one piece and the rare ending stayed as characters.
The order matters enormously. Merges learned earlier are applied earlier, so they take priority — this is what makes encoding deterministic and consistent with training. Watch it run on words you pick:
Pick a word; watch BPE start from characters and apply the learned merges in order until none fit. The final pieces are the tokens.
Going from tokens back to text is just concatenation. The token IDs map back to their string pieces, you glue them together, and convert the end-of-word markers back into spaces. There’s no ambiguity and no computation — decoding is the easy direction. (This is why a model’s output, a list of token IDs, becomes readable text instantly.)
Here’s a practical consequence you pay for in dollars. API pricing is per-token, and context windows are measured in tokens. Common English words are usually one token, so a rule of thumb is ~0.75 words per token, or ~4 characters per token. But unusual text is far more expensive: a long hexadecimal hash, a rare chemical name, or text in an under-represented language can be one token per character, multiplying your cost and eating your context window. The tokenizer is, quite literally, your billing meter.
BPE isn’t the only way to build a subword vocabulary. Two important alternatives change the criterion for what makes a good merge or piece, and you’ll meet both in real models.
WordPiece, used by BERT, looks almost identical to BPE — it builds a vocabulary by merging pieces — but it chooses merges by likelihood, not raw frequency. Instead of merging the most common pair, it merges the pair that most increases the likelihood of the training data under a unigram language model. Concretely, it favors the pair whose merged frequency is high relative to the product of its parts’ frequencies.
Why does that matter? Consider the pair “t h”. Both “t” and “h” are individually super common, so they bump into each other often by chance — BPE might merge them just on raw count. WordPiece asks a sharper question: does seeing “t” and “h” together tell us more than seeing them separately? It merges pairs that are genuinely informative as a unit, not just frequently adjacent. WordPiece also marks word-interior pieces with a “##” prefix, so “playing” becomes “play” + “##ing” — the “##” says “this attaches to the previous piece.”
Unigram, by Kudo, flips the whole procedure upside down. BPE and WordPiece are bottom-up: start small, merge upward. Unigram is top-down: start with a huge candidate vocabulary (all frequent substrings), then iteratively prune the pieces that hurt the corpus likelihood least, until you hit the target size.
The payoff is that Unigram is probabilistic. Each piece carries a probability, and a word can be segmented many ways — “hello” could be “hel”+“lo” or “he”+“llo” or “h”+“ello”. Unigram picks the segmentation with the highest total probability (via a Viterbi-style search), and it can even sample different segmentations during training — a regularizer called subword sampling that makes models more robust. BPE, by contrast, always produces exactly one segmentation.
Toggle the method and see how each scores a candidate merge or segmentation. BPE counts; WordPiece weighs likelihood gain; Unigram prunes from the top down.
| Method | Direction | Criterion | Used by |
|---|---|---|---|
| BPE | bottom-up merge | most frequent pair | GPT-2/3/4, RoBERTa, LLaMA |
| WordPiece | bottom-up merge | max likelihood gain | BERT, DistilBERT |
| Unigram | top-down prune | min likelihood loss | T5, ALBERT, many multilingual |
We claimed subwords give zero OOV because they fall back to characters. But there’s a hole in that promise: what if a character isn’t in the vocabulary? Emoji, rare Chinese characters, mathematical symbols, accented letters — there are over 150,000 Unicode code points, and you can’t put them all in a base vocabulary without bloating it. So character-level fallback can still hit OOV on exotic input.
GPT-2 introduced the elegant fix: byte-level BPE. Don’t tokenize characters — tokenize bytes. Every piece of text, in any language, with any emoji, is ultimately a sequence of UTF-8 bytes, and there are exactly 256 possible byte values. Put all 256 in the base vocabulary, and now nothing is ever out-of-vocabulary, ever, in any language or script. The 150,000-character problem collapses to a 256-byte certainty.
Here’s the mechanism. The emoji 🍓 isn’t one character to the tokenizer — in UTF-8 it’s four bytes. A rare Chinese character is three bytes. BPE then runs on this byte stream exactly as before, learning to merge common byte sequences into larger tokens. Common English text still tokenizes efficiently (those bytes are ASCII and merge into familiar word-pieces), while exotic input safely decomposes into its raw bytes.
See how characters become UTF-8 bytes. ASCII letters are 1 byte; accented and CJK characters are 2–3; emoji are 4. Byte-level BPE runs on this stream, so the 256 byte values guarantee no OOV.
The byte-level trick is why GPT models never crash on weird input — paste any garbage and it tokenizes. But it also explains famous failures. Because a number like “1234567” gets chunked into byte-pair tokens like “123” + “4567” (whatever was frequent in training), the model doesn’t see individual digits and struggles with arithmetic. And “strawberry” might be “str” + “aw” + “berry” — the three r’s are buried inside tokens, so “how many r’s?” is genuinely hard for the model to answer. These aren’t reasoning failures; they’re perception failures caused by tokenization.
One last practical wrinkle, and it’s subtler than it looks: what about spaces? Most early tokenizers pre-split text on whitespace, then tokenized each word. But that throws away information — was there one space or three? A tab? And it’s language-specific: Chinese and Japanese don’t use spaces between words at all, so “split on spaces first” simply doesn’t work.
SentencePiece (Kudo & Richardson, the library, not an algorithm) solves this by treating the raw text as a pure stream of characters — including the spaces — with no pre-tokenization at all. It replaces each space with a visible marker, a special underscore-like symbol (▁), and then runs BPE or Unigram on the whole thing. The space literally becomes part of the tokens.
This has two lovely consequences. First, tokenization is fully reversible: because spaces are encoded as ▁ inside tokens, you can always reconstruct the exact original text, spaces and all, by concatenating tokens and turning ▁ back into spaces. No information is lost. Second, it’s language-agnostic: it makes no assumption that words are space-separated, so it works identically for English, Chinese, code, or anything else.
You’ll notice this in tokenizer outputs: the word “hello” at the start of a sentence and “ hello” mid-sentence are different tokens — one has the leading ▁ and one doesn’t. The leading space is part of the token. This is why, in GPT tokenizers, “hello” and “ hello” have different IDs, a detail that trips up everyone writing prompts the first time.
Watch how SentencePiece attaches the space marker to the following word. “the cat” tokenizes with ▁ glued to each word-start, making the split fully reversible.
Beyond word-pieces, every real vocabulary reserves a handful of special tokens that carry structural meaning: a beginning-of-sequence token, an end-of-sequence token (so the model can signal “I’m done generating”), a padding token to fill batches to equal length, and an unknown token as a last resort. Modern chat models add many more — tokens that mark the start of a user turn, an assistant turn, a system prompt, or tool calls. These aren’t learned by BPE; they’re inserted by hand and given dedicated IDs, and the model learns their meaning during training.
Time to put it all together. Below is a real BPE tokenizer: it trains on a small corpus with a vocabulary budget you control, then tokenizes whatever you type. Watch your text shatter into pieces, count the tokens, and see how the vocabulary size changes everything.
Adjust the merge budget (vocabulary size), type any text, and watch BPE tokenize it live. More merges → bigger pieces → fewer tokens. Try rare words and gibberish to see the character fallback.
Things to try, and what they reveal:
Set the merge budget to zero. Every character is its own token — pure character-level. Your text is as long as it can be. This is BPE before any training.
Slide the budget up. Watch common pieces fuse: first frequent letter-pairs, then whole common words from the corpus collapse into single tokens. The token count drops. This is the sequence-length payoff in real time.
Type a word that’s not in the corpus — say “zyxwv”. It can’t use any learned merges, so it falls back to characters. That’s the OOV guarantee working: unfamiliar text is still tokenizable, just less efficiently. Familiar text is cheap; unfamiliar text is expensive — exactly the economics we discussed.
No quiz here — the tokenizer is the test. If you can predict how the token count changes as you move the budget and type rare versus common words, you understand BPE.
| Algorithm | How it builds vocab | Segmentation | Models |
|---|---|---|---|
| BPE | greedily merge most frequent pair | single, deterministic | GPT-2/3/4, LLaMA, RoBERTa |
| Byte-level BPE | BPE on UTF-8 bytes (256 base) | single; universal coverage | GPT-2+, most modern LLMs |
| WordPiece | merge by likelihood gain; ## prefix | single, greedy longest-match | BERT, DistilBERT |
| Unigram | prune from large vocab by likelihood | probabilistic, multiple | T5, ALBERT, multilingual |
| Quantity | Typical value |
|---|---|
| Vocabulary size | 30k (BERT) · 50k (GPT-2) · 100k–130k (GPT-4, LLaMA-3) |
| Bytes per token (English) | ~4 characters · ~0.75 words |
| Base alphabet (byte-level) | exactly 256 |
| Cost of non-English / symbols | often 2–4× more tokens per word |
1. Subwords thread the needle. Whole tokens for common words, pieces for rare ones, characters/bytes as the OOV-proof safety net. Short sequences, bounded vocabulary, no failures.
2. BPE is greedy frequency merging. Train by merging the most frequent pair repeatedly; encode by replaying merges in order; decode by concatenating. Byte-level makes it universal.
3. Tokenization is perception. It decides what the model can see. Buried digits and letters explain arithmetic and spelling failures; per-token billing and context limits make it your cost meter.