Neural Machine Translation of Rare Words with Subword Units — iteratively merge the most frequent character pairs to build a vocabulary that handles any word, even ones never seen in training.
You're building a neural machine translation system to translate English to German. You train it on millions of sentence pairs. The model learns "house" → "Haus", "cat" → "Katze", and thousands of other words. Then a user inputs: "Rindfleischetikettierungsüberwachungsaufgabenübertragungsgesetz."
Your model has never seen this word. In German, this 63-letter compound word means "law for the delegation of monitoring beef labelling." It appeared zero times in your training data. Your model outputs: <UNK>. The translation fails.
This is the out-of-vocabulary (OOV) problem. Neural machine translation (NMT) models work with a fixed vocabulary — typically the 30,000 to 50,000 most common words in the training data. Every word outside that set is mapped to a single unknown token. This causes three cascading failures:
| Failure | Why It Happens | Example |
|---|---|---|
| Rare compounds | Agglutinative languages (German, Turkish, Finnish) create new words by combining existing ones | "Handschuhschneeballwerfer" = glove + snowball + thrower |
| Named entities | Names are essentially infinite — you can't put every person, city, or product in a vocabulary | "Ouagadougou", "Schwarzenegger", "AstraZeneca" |
| Morphological variants | Languages with rich morphology have many word forms per lemma | English "run/runs/running/ran" — Turkish has ~40 forms of each verb |
Before BPE, the standard "fix" was a back-off dictionary: when the model output <UNK>, you'd look up the aligned source word in a bilingual dictionary and paste the translation in. This was a brittle hack — it only worked for 1-to-1 word correspondences and couldn't handle morphological changes.
Think of it like LEGO bricks. Instead of having a separate mold for every possible shape (word), you have a set of reusable bricks (subwords) that can snap together to form any shape. You need enough distinct bricks to be efficient, but few enough that your model can learn them all.
Type a word to see how word-level tokenization handles it vs. BPE subword tokenization. Long or rare words become <UNK> with word-level but are decomposed with BPE.
Byte Pair Encoding was invented by Philip Gage in 1994 — not for NLP, but for data compression. The idea is beautifully simple: find the most frequent pair of bytes in your data, replace every occurrence with a single new byte, and repeat. Each replacement shrinks the data.
Here's the original compression algorithm on a tiny example. Start with the string "aaabdaaabac":
Each step finds the most frequent adjacent pair and merges it into a new symbol. The sequence of merge rules — (a,a)→Z, (Z,a)→Y, (Y,b)→X — is the compression dictionary. To decompress, apply the rules in reverse.
Sennrich et al. adapted this compression algorithm for NLP with two changes:
| Original BPE (compression) | Sennrich BPE (NLP) |
|---|---|
| Operates on raw bytes | Operates on characters within words |
| Goal: minimize data size | Goal: build a fixed-size vocabulary of subword units |
| Number of merges = until no more compression | Number of merges = hyperparameter (e.g., 32,000) |
| Merges happen across word boundaries | Merges only happen within word boundaries (preserves word structure) |
The word boundary constraint is critical. Without it, BPE would merge characters across words, producing subword units like "e_t" (end of "the" + start of "the") that span words. By treating each word independently, every merge builds a meaningful subword: "un", "break", "able", "ing".
python # The vocabulary size spectrum # Character-level: ~100 symbols, but sequences are very long # BPE subwords: ~32,000 symbols, balanced sequence length # Word-level: ~50,000+ symbols, but many OOV words sentence = "unbreakable" # Character: ['u','n','b','r','e','a','k','a','b','l','e'] — 11 tokens # BPE: ['un', 'break', 'able'] — 3 tokens # Word: ['unbreakable'] or ['<UNK>'] — 1 token (if known)
Drag the slider to change vocabulary size. Watch how the same sentence is tokenized at different granularities — from individual characters (left) to full words (right). BPE lives in the sweet spot.
Let's walk through BPE step by step on a real vocabulary. This is the training phase — where we learn which character pairs to merge. The input is a corpus of text; the output is an ordered list of merge rules.
Start by splitting every word in the training corpus into individual characters, with a special end-of-word symbol (we'll use ·). Count word frequencies. Suppose our corpus contains these words with their counts:
training corpus
low: 5 → l o w ·
lower: 2 → l o w e r ·
newest: 6 → n e w e s t ·
widest: 3 → w i d e s t ·
The initial vocabulary is just the set of characters: {l, o, w, e, r, n, s, t, i, d, ·}. Eleven symbols.
Count every adjacent character pair across the corpus, weighted by word frequency:
pair frequencies
(e, s): 6 (newest) + 3 (widest) = 9
(s, t): 6 (newest) + 3 (widest) = 9
(t, ·): 6 (newest) + 3 (widest) = 9
(l, o): 5 (low) + 2 (lower) = 7
(o, w): 5 (low) + 2 (lower) = 7
(n, e): 6
(e, w): 6
(w, e): 6
(w, ·): 5
...
The most frequent pair is (e, s) with count 9. Merge it into a new symbol "es":
after merge #1: (e, s) → es
low: 5 → l o w ·
lower: 2 → l o w e r ·
newest: 6 → n e w es t ·
widest: 3 → w i d es t ·
Vocabulary: {l, o, w, e, r, n, s, t, i, d, ·, es} — 12 symbols
Recount pairs. Now (es, t) has frequency 9. Merge it into "est":
after merge #2: (es, t) → est
low: 5 → l o w ·
lower: 2 → l o w e r ·
newest: 6 → n e w est ·
widest: 3 → w i d est ·
Vocabulary: {l, o, w, e, r, n, s, t, i, d, ·, es, est} — 13 symbols
Continue: (est, ·) freq 9 → merge into "est·". Then (l, o) freq 7 → merge into "lo". Then (lo, w) freq 7 → merge into "low". And so on, until we've done the desired number of merges.
Watch BPE learn merge rules step by step. Click "Merge" to perform the next most frequent pair merge. The vocabulary grows and the representation gets more compact.
python import collections def learn_bpe(corpus, num_merges): # Step 1: Initialize — split words into characters vocab = {} for word, freq in corpus.items(): # Add end-of-word marker, split into chars symbols = list(word) + ['</w>'] vocab[tuple(symbols)] = freq merges = [] for i in range(num_merges): # Step 2: Count all adjacent pairs pairs = collections.Counter() for symbols, freq in vocab.items(): for j in range(len(symbols) - 1): pairs[(symbols[j], symbols[j+1])] += freq if not pairs: break # Step 3: Find most frequent pair best = max(pairs, key=pairs.get) merges.append(best) # Step 4: Apply merge to vocabulary new_vocab = {} for symbols, freq in vocab.items(): new_symbols = [] j = 0 while j < len(symbols): if j < len(symbols)-1 and (symbols[j], symbols[j+1]) == best: new_symbols.append(symbols[j] + symbols[j+1]) j += 2 else: new_symbols.append(symbols[j]) j += 1 new_vocab[tuple(new_symbols)] = freq vocab = new_vocab return merges, vocab
The number of merge operations is the one hyperparameter that controls BPE. It determines your vocabulary size and the granularity of your subword units. Sennrich et al. explored merge counts from 10,000 to 90,000. The choice creates a direct tradeoff.
Few merges (small vocabulary, ~10K): Words are split into many small pieces. "unfortunately" becomes ["un", "for", "tun", "ate", "ly"]. Sequences are longer, but every piece is frequent and well-learned. Good for morphologically rich languages where you need to decompose aggressively.
Many merges (large vocabulary, ~90K): Common words stay intact. "unfortunately" stays as one token. Sequences are shorter, but rare subwords may not have enough training examples. Approaches word-level behavior with the same OOV risks.
For machine translation, you have two languages. Should you learn BPE separately for each, or jointly on the combined corpus?
| Strategy | How | Pros | Cons |
|---|---|---|---|
| Separate BPE | Learn merge rules on source and target independently | Each vocabulary is optimized for its language | Shared words (names, cognates) get different segmentations |
| Joint BPE | Concatenate source + target text, learn one set of merge rules | Shared subwords across languages; "Obama" is the same token everywhere | Dominant language may take more vocabulary slots |
Sennrich et al. found that joint BPE is better for related language pairs (English-German, English-French) because they share many cognates and named entities. For distant language pairs (English-Chinese), separate vocabularies can be better since they share very few subword units.
BPE uses a special end-of-word symbol (written as · or </w>) appended to the last character of each word before training. This is crucial for reversibility. Without it, you can't tell where words end when reconstructing text from subword tokens.
python # Without end-of-word marker: ambiguous! # "low" → ["lo", "w"] # "lower" → ["lo", "w", "er"] # How do you know "w" ends "low" vs continues to "er"? # With end-of-word marker: unambiguous # "low" → ["lo", "w·"] (· marks word boundary) # "lower" → ["lo", "w", "er·"] # "w·" is a DIFFERENT token from "w" — the · tells you the word ends here # Reconstruction: join tokens, replace · with space tokens = ["lo", "w·", "lo", "w", "er·"] # → "low lower"
Drag the slider to change the number of BPE merges. Watch how the same sentence is segmented at different vocabulary sizes — from character-level (few merges) to near word-level (many merges).
Training gives us an ordered list of merge rules. Now we need to apply those rules to new text at inference time. This is the encoding phase — turning an arbitrary string into a sequence of subword tokens.
Given a word and the learned merge rules:
The order matters. You apply merge rules in the exact sequence they were learned during training. Early merges capture frequent patterns (suffixes, prefixes). Later merges build up to common words.
This is BPE's superpower. Consider "lowest" — even if this exact word never appeared in training, its subword pieces "low" and "est" probably did. Now consider a word from a language you've never trained on, like the Czech word "nejneobvyklejší" (most unusual). BPE handles it by falling back to characters:
python # If no merge rules apply, characters remain as individual tokens # "nejneobvyklejší" might tokenize as: # ['ne', 'j', 'ne', 'ob', 'vy', 'k', 'le', 'j', 'ší', '·'] # Some pieces got merged (ne, ob, le) because they're common # Others stayed as characters (j, k) because they're rare in context # The key: NOTHING is ever truly OOV # Worst case: every character is its own token # This is equivalent to character-level processing # for that specific word, while common words stay efficient
python def apply_bpe(word, merges): """Encode a word using learned BPE merge rules.""" # Start with characters symbols = list(word) + ['</w>'] # Apply each merge rule in order for (left, right) in merges: i = 0 while i < len(symbols) - 1: if symbols[i] == left and symbols[i+1] == right: symbols[i:i+2] = [left + right] # Don't increment i — check if new symbol merges with next else: i += 1 return symbols # Example usage merges = [('e','s'), ('es','t'), ('est','</w>'), ('l','o'), ('lo','w')] print(apply_bpe("lowest", merges)) # ['low', 'est</w>'] — 2 tokens instead of 6 characters print(apply_bpe("newer", merges)) # ['n', 'e', 'w', 'e', 'r', '</w>'] — falls back to chars where no merges apply
Click "Step" to apply one merge rule at a time to the word "unbreakable". Watch characters progressively merge into subwords following the learned merge order.
Now let's see how Sennrich et al. actually integrated BPE into a neural machine translation pipeline. The NMT system they used was an attention-based encoder-decoder model — the standard architecture in 2016 (before Transformers).
The @@ notation is a practical detail: during BPE segmentation, every subword except the last in a word gets @@ appended. To reconstruct the original word, you just remove @@ and concatenate.
BPE solves three concrete translation problems:
Compound words: German "Abwasserbehandlungsanlage" (wastewater treatment plant) is split into "Ab@@ wasser@@ behandlungs@@ anlage". The model can translate each morpheme because it has seen "wasser" (water) and "Anlage" (facility) in many contexts.
Named entities: "Schwarzenegger" might split into "Sch@@ war@@ zen@@ egger". Even if the model hasn't seen this name, it has learned to copy subword pieces from source to target — names typically stay the same across languages.
Morphological agreement: Russian adjectives agree with nouns in gender, number, and case — producing dozens of forms. BPE learns the common stems and suffixes separately, allowing the model to generalize.
python # BPE in a translation pipeline from subword_nmt import learn_bpe, apply_bpe # 1. Learn BPE on training data learn_bpe( input="train.en-de", # concatenated EN + DE training text output="bpe.codes", # merge rules file num_operations=32000 # vocabulary size ) # 2. Apply to training data bpe = apply_bpe.BPE(codes=open("bpe.codes")) segmented = bpe.process_line("The unbreakable code was discovered") # → "The un@@ break@@ able code was dis@@ cover@@ ed" # 3. Train NMT model on segmented data (standard encoder-decoder) # Model vocabulary: ~32K BPE tokens instead of ~500K words # 4. At inference: segment input, translate, de-segment output src = bpe.process_line("Schwarzenegger attended the ceremony") tgt_bpe = model.translate(src) # "Sch@@ war@@ zen@@ egger besuchte die Zere@@ monie" tgt = tgt_bpe.replace("@@ ", "") # "Schwarzenegger besuchte die Zeremonie"
Watch how BPE handles a translation pipeline. Click "Step" to advance through segmentation, encoding, decoding, and de-segmentation. See how rare words get decomposed for translation and reassembled after.
Sennrich et al. tested BPE on four language pairs from the WMT 2015 translation task. The results demonstrated that BPE consistently improves over word-level baselines, especially for morphologically rich target languages.
| System | EN→DE | EN→RU | DE→EN | RU→EN |
|---|---|---|---|---|
| Word-level (50K vocab) | 20.5 | 22.1 | 24.0 | 24.0 |
| Word-level + back-off dict | 22.8 | 23.5 | 26.3 | 25.1 |
| BPE (joint, 60K) | 23.7 | 24.2 | 27.1 | 25.8 |
| BPE (joint, 90K) | 24.0 | 23.7 | 27.6 | 25.9 |
Key findings from the results:
BPE beats the dictionary back-off. The word-level + back-off approach was the previous best method for handling OOV words. BPE surpasses it on all language pairs except one, without needing any external dictionary.
Compound-heavy languages benefit most. EN→DE shows the largest improvement (+1.2 BLEU over back-off) because German has many compound words that BPE decomposes effectively. EN→RU also benefits due to Russian's rich morphology.
The optimal merge count varies. 60K merges works best for EN→RU (more decomposition helps with Russian morphology), while 90K merges works best for EN→DE and DE→EN. This aligns with the tradeoff: more morphologically complex languages benefit from smaller subword units.
The paper includes a qualitative analysis showing that BPE discovers linguistically meaningful subword units without any linguistic knowledge:
BPE-discovered morphemes # Prefixes: un- , re- , pre- , über- , aus- # Suffixes: -ing , -tion , -ness , -ung , -lich , -keit # Stems: break , work , hand , schuh (shoe) # German compound decomposition: # Handschuhschneeballwerfer → Hand@@ schuh@@ schnee@@ ball@@ werfer # (glove + snow + ball + thrower) # Each piece is a real German morpheme!
The paper shows specific examples where BPE enables correct translation of rare words:
| Source | Word-level output | BPE output |
|---|---|---|
| "bushy-haired" | <UNK> | "struppig" (correct German) |
| "mistreatment" | <UNK> | "Misshandlung" (correct) |
| "Charlemagne" | <UNK> | "Karl der Große" (correct) |
The "Charlemagne" example is particularly interesting. BPE didn't just copy the name — it translated it to the German equivalent "Karl der Große" (Charles the Great). This happened because BPE segmented "Charlemagne" into subwords that overlapped with training examples containing this historical figure.
Click a language pair to see how BPE compares to word-level baselines. The improvement is largest for morphologically complex target languages.
Sennrich et al.'s 2016 paper had an impact far beyond machine translation. BPE became the de facto tokenization method for large language models, and its direct descendants power every major model today.
| Method | Year | Key Difference from BPE | Used By |
|---|---|---|---|
| BPE (this paper) | 2016 | Original frequency-based merges on characters | Early NMT systems, subword-nmt |
| WordPiece | 2016 | Merge maximizes likelihood instead of frequency | BERT, DistilBERT |
| Byte-level BPE | 2019 | Operates on bytes (UTF-8), not characters — handles any language | GPT-2, GPT-3, GPT-4 |
| SentencePiece | 2018 | Language-agnostic, treats input as raw stream (no pre-tokenization) | T5, LLaMA, Mistral |
| Unigram LM | 2018 | Probabilistic: starts large, prunes based on likelihood impact | XLNet, ALBERT (via SentencePiece) |
Byte-level BPE (used in GPT-2+) is the most important descendant. Instead of operating on Unicode characters, it operates on raw bytes (values 0-255). This means the base vocabulary is just 256 symbols, and BPE can tokenize any byte sequence — any language, any encoding, even binary data. Radford et al. showed this works as well as character-level BPE while being more universal.
SentencePiece (Kudo, 2018) removes BPE's assumption that input is pre-tokenized into words by whitespace. This is crucial for languages like Japanese and Chinese that don't use spaces between words. SentencePiece treats the raw input as a character stream and can use either BPE or Unigram LM as the merging algorithm.
Every major language model uses a BPE variant. The vocabulary sizes have grown:
tokenizer vocabulary sizes
GPT-2: 50,257 tokens (byte-level BPE)
GPT-3: 50,257 tokens (same tokenizer as GPT-2)
GPT-4: 100,256 tokens (cl100k_base, byte-level BPE)
LLaMA: 32,000 tokens (SentencePiece BPE)
LLaMA-3: 128,256 tokens (tiktoken byte-level BPE)
Mistral: 32,000 tokens (SentencePiece BPE)
Looking back from 2026, BPE's key insight — that the right vocabulary emerges from data statistics rather than linguistic rules — proved fundamental. Every subsequent tokenization method builds on this principle. The specific merge heuristic (frequency vs. likelihood) matters less than the architecture: start with atoms, iteratively combine the most useful units, stop at a fixed vocabulary size.
"The most frequent pair captures the most predictable structure. Compression and understanding are the same thing." — adapted from the spirit of Gage (1994)
Drag the slider to explore how BPE evolved from a compression algorithm to the tokenization backbone of modern LLMs.