Rico Sennrich, Barry Haddow, Alexandra Birch (University of Edinburgh) — ACL 2016

BPE: Subword Units for Neural MT

Neural Machine Translation of Rare Words with Subword Units — iteratively merge the most frequent character pairs to build a vocabulary that handles any word, even ones never seen in training.

Prerequisites: Basic NLP tokenization concepts + Character encodings. That's it.
8
Chapters
8+
Simulations
0
Assumed Knowledge

Chapter 0: The OOV Problem

You're building a neural machine translation system to translate English to German. You train it on millions of sentence pairs. The model learns "house" → "Haus", "cat" → "Katze", and thousands of other words. Then a user inputs: "Rindfleischetikettierungsüberwachungsaufgabenübertragungsgesetz."

Your model has never seen this word. In German, this 63-letter compound word means "law for the delegation of monitoring beef labelling." It appeared zero times in your training data. Your model outputs: <UNK>. The translation fails.

This is the out-of-vocabulary (OOV) problem. Neural machine translation (NMT) models work with a fixed vocabulary — typically the 30,000 to 50,000 most common words in the training data. Every word outside that set is mapped to a single unknown token. This causes three cascading failures:

FailureWhy It HappensExample
Rare compoundsAgglutinative languages (German, Turkish, Finnish) create new words by combining existing ones"Handschuhschneeballwerfer" = glove + snowball + thrower
Named entitiesNames are essentially infinite — you can't put every person, city, or product in a vocabulary"Ouagadougou", "Schwarzenegger", "AstraZeneca"
Morphological variantsLanguages with rich morphology have many word forms per lemmaEnglish "run/runs/running/ran" — Turkish has ~40 forms of each verb

Before BPE, the standard "fix" was a back-off dictionary: when the model output <UNK>, you'd look up the aligned source word in a bilingual dictionary and paste the translation in. This was a brittle hack — it only worked for 1-to-1 word correspondences and couldn't handle morphological changes.

The core insight of this paper: Don't treat words as atomic units. Decompose rare words into frequent subword pieces — "un" + "break" + "able" — so that nothing is ever truly OOV. The key is choosing which pieces to use. Sennrich et al. borrowed an algorithm from 1990s data compression: Byte Pair Encoding.

Think of it like LEGO bricks. Instead of having a separate mold for every possible shape (word), you have a set of reusable bricks (subwords) that can snap together to form any shape. You need enough distinct bricks to be efficient, but few enough that your model can learn them all.

Word vs Subword Tokenization

Type a word to see how word-level tokenization handles it vs. BPE subword tokenization. Long or rare words become <UNK> with word-level but are decomposed with BPE.

What is the fundamental problem with fixed word-level vocabularies in NMT?

Chapter 1: From Compression to NLP

Byte Pair Encoding was invented by Philip Gage in 1994 — not for NLP, but for data compression. The idea is beautifully simple: find the most frequent pair of bytes in your data, replace every occurrence with a single new byte, and repeat. Each replacement shrinks the data.

Here's the original compression algorithm on a tiny example. Start with the string "aaabdaaabac":

Step 0: Original
a a a b d a a a b a c — 11 bytes
↓ most frequent pair: (a, a) → replace with Z
Step 1: Replace (a,a)
Z a b d Z a b a c — 9 bytes (Z = aa)
↓ most frequent pair: (Z, a) → replace with Y
Step 2: Replace (Z,a)
Y b d Y b a c — 7 bytes (Y = Za = aaa)
↓ most frequent pair: (Y, b) → replace with X
Step 3: Replace (Y,b)
X d X a c — 5 bytes (X = Yb = aaab)

Each step finds the most frequent adjacent pair and merges it into a new symbol. The sequence of merge rules — (a,a)→Z, (Z,a)→Y, (Y,b)→X — is the compression dictionary. To decompress, apply the rules in reverse.

Sennrich's key adaptation

Sennrich et al. adapted this compression algorithm for NLP with two changes:

Original BPE (compression)Sennrich BPE (NLP)
Operates on raw bytesOperates on characters within words
Goal: minimize data sizeGoal: build a fixed-size vocabulary of subword units
Number of merges = until no more compressionNumber of merges = hyperparameter (e.g., 32,000)
Merges happen across word boundariesMerges only happen within word boundaries (preserves word structure)

The word boundary constraint is critical. Without it, BPE would merge characters across words, producing subword units like "e_t" (end of "the" + start of "the") that span words. By treating each word independently, every merge builds a meaningful subword: "un", "break", "able", "ing".

Why not just use characters? You could avoid OOV entirely by tokenizing every word into individual characters: "cat" → ["c", "a", "t"]. But character-level models have two serious problems. First, sequences become 4-5x longer, making training much slower (attention is O(n²)). Second, the model must learn to compose characters into meaning — a much harder task than learning subword semantics. BPE is the sweet spot: shorter sequences than characters, no OOV like words.

python
# The vocabulary size spectrum
# Character-level:  ~100 symbols, but sequences are very long
# BPE subwords:     ~32,000 symbols, balanced sequence length
# Word-level:       ~50,000+ symbols, but many OOV words

sentence = "unbreakable"

# Character: ['u','n','b','r','e','a','k','a','b','l','e'] — 11 tokens
# BPE:       ['un', 'break', 'able']                       — 3 tokens
# Word:      ['unbreakable'] or ['<UNK>']                  — 1 token (if known)
Tokenization Spectrum

Drag the slider to change vocabulary size. Watch how the same sentence is tokenized at different granularities — from individual characters (left) to full words (right). BPE lives in the sweet spot.

Vocab size 32K
What key change did Sennrich make when adapting BPE from compression to NLP?

Chapter 2: The BPE Algorithm

Let's walk through BPE step by step on a real vocabulary. This is the training phase — where we learn which character pairs to merge. The input is a corpus of text; the output is an ordered list of merge rules.

Step 1: Initialize

Start by splitting every word in the training corpus into individual characters, with a special end-of-word symbol (we'll use ·). Count word frequencies. Suppose our corpus contains these words with their counts:

training corpus
low:    5   → l o w ·
lower:  2   → l o w e r ·
newest: 6   → n e w e s t ·
widest: 3   → w i d e s t ·

The initial vocabulary is just the set of characters: {l, o, w, e, r, n, s, t, i, d, ·}. Eleven symbols.

Step 2: Count pairs

Count every adjacent character pair across the corpus, weighted by word frequency:

pair frequencies
(e, s): 6 (newest) + 3 (widest) = 9
(s, t): 6 (newest) + 3 (widest) = 9
(t, ·): 6 (newest) + 3 (widest) = 9
(l, o): 5 (low) + 2 (lower) = 7
(o, w): 5 (low) + 2 (lower) = 7
(n, e): 6
(e, w): 6
(w, e): 6
(w, ·): 5
...

Step 3: Merge the most frequent pair

The most frequent pair is (e, s) with count 9. Merge it into a new symbol "es":

after merge #1: (e, s) → es
low:    5   → l o w ·
lower:  2   → l o w e r ·
newest: 6   → n e w es t ·
widest: 3   → w i d es t ·

Vocabulary: {l, o, w, e, r, n, s, t, i, d, ·, es}  — 12 symbols

Step 4: Repeat

Recount pairs. Now (es, t) has frequency 9. Merge it into "est":

after merge #2: (es, t) → est
low:    5   → l o w ·
lower:  2   → l o w e r ·
newest: 6   → n e w est ·
widest: 3   → w i d est ·

Vocabulary: {l, o, w, e, r, n, s, t, i, d, ·, es, est}  — 13 symbols

Continue: (est, ·) freq 9 → merge into "est·". Then (l, o) freq 7 → merge into "lo". Then (lo, w) freq 7 → merge into "low". And so on, until we've done the desired number of merges.

The merge order encodes frequency. The first merges capture the most common patterns in the language. Merge #1 might be (e, s) — the English suffix "-es". Merge #100 might be (t, ion) — the suffix "-tion". Merge #10,000 might be (break, able) — the word "breakable". The order is a frequency-driven hierarchy from characters to common subwords to full words.
BPE Merge Simulator

Watch BPE learn merge rules step by step. Click "Merge" to perform the next most frequent pair merge. The vocabulary grows and the representation gets more compact.

python
import collections

def learn_bpe(corpus, num_merges):
    # Step 1: Initialize — split words into characters
    vocab = {}
    for word, freq in corpus.items():
        # Add end-of-word marker, split into chars
        symbols = list(word) + ['</w>']
        vocab[tuple(symbols)] = freq

    merges = []
    for i in range(num_merges):
        # Step 2: Count all adjacent pairs
        pairs = collections.Counter()
        for symbols, freq in vocab.items():
            for j in range(len(symbols) - 1):
                pairs[(symbols[j], symbols[j+1])] += freq

        if not pairs:
            break

        # Step 3: Find most frequent pair
        best = max(pairs, key=pairs.get)
        merges.append(best)

        # Step 4: Apply merge to vocabulary
        new_vocab = {}
        for symbols, freq in vocab.items():
            new_symbols = []
            j = 0
            while j < len(symbols):
                if j < len(symbols)-1 and (symbols[j], symbols[j+1]) == best:
                    new_symbols.append(symbols[j] + symbols[j+1])
                    j += 2
                else:
                    new_symbols.append(symbols[j])
                    j += 1
            new_vocab[tuple(new_symbols)] = freq
        vocab = new_vocab

    return merges, vocab
In BPE training, what determines which pair gets merged next?

Chapter 3: Learning Merge Rules

The number of merge operations is the one hyperparameter that controls BPE. It determines your vocabulary size and the granularity of your subword units. Sennrich et al. explored merge counts from 10,000 to 90,000. The choice creates a direct tradeoff.

The vocabulary size tradeoff

Few merges (small vocabulary, ~10K): Words are split into many small pieces. "unfortunately" becomes ["un", "for", "tun", "ate", "ly"]. Sequences are longer, but every piece is frequent and well-learned. Good for morphologically rich languages where you need to decompose aggressively.

Many merges (large vocabulary, ~90K): Common words stay intact. "unfortunately" stays as one token. Sequences are shorter, but rare subwords may not have enough training examples. Approaches word-level behavior with the same OOV risks.

The sweet spot. Sennrich found that 30,000-40,000 merges worked best for most language pairs. This produces subwords that correspond to morphemes — meaningful word parts like "un-", "-break-", "-able". The algorithm doesn't know about morphology; it rediscovers it from frequency statistics alone. English "-ing" gets its own token not because BPE knows it's a suffix, but because it appears millions of times.

Joint vs separate vocabularies

For machine translation, you have two languages. Should you learn BPE separately for each, or jointly on the combined corpus?

StrategyHowProsCons
Separate BPELearn merge rules on source and target independentlyEach vocabulary is optimized for its languageShared words (names, cognates) get different segmentations
Joint BPEConcatenate source + target text, learn one set of merge rulesShared subwords across languages; "Obama" is the same token everywhereDominant language may take more vocabulary slots

Sennrich et al. found that joint BPE is better for related language pairs (English-German, English-French) because they share many cognates and named entities. For distant language pairs (English-Chinese), separate vocabularies can be better since they share very few subword units.

The end-of-word marker

BPE uses a special end-of-word symbol (written as · or </w>) appended to the last character of each word before training. This is crucial for reversibility. Without it, you can't tell where words end when reconstructing text from subword tokens.

python
# Without end-of-word marker: ambiguous!
# "low" → ["lo", "w"]
# "lower" → ["lo", "w", "er"]
# How do you know "w" ends "low" vs continues to "er"?

# With end-of-word marker: unambiguous
# "low" → ["lo", "w·"]  (· marks word boundary)
# "lower" → ["lo", "w", "er·"]
# "w·" is a DIFFERENT token from "w" — the · tells you the word ends here

# Reconstruction: join tokens, replace · with space
tokens = ["lo", "w·", "lo", "w", "er·"]
# → "low lower"
Merge Count Explorer

Drag the slider to change the number of BPE merges. Watch how the same sentence is segmented at different vocabulary sizes — from character-level (few merges) to near word-level (many merges).

Merges 30K
Why does joint BPE (training on source + target text together) work better than separate BPE for related language pairs?

Chapter 4: Encoding New Text

Training gives us an ordered list of merge rules. Now we need to apply those rules to new text at inference time. This is the encoding phase — turning an arbitrary string into a sequence of subword tokens.

The encoding algorithm

Given a word and the learned merge rules:

Step 1: Split into characters
"lowest" → ['l', 'o', 'w', 'e', 's', 't', '·']
Step 2: Apply merge rules in order
Apply merge #1: (e, s) → 'es'. Apply merge #2: (es, t) → 'est'. Apply merge #3: (est, ·) → 'est·'. Apply merge #4: (l, o) → 'lo'. Apply merge #5: (lo, w) → 'low'.
Step 3: Final segmentation
['low', 'est·'] — 2 subword tokens

The order matters. You apply merge rules in the exact sequence they were learned during training. Early merges capture frequent patterns (suffixes, prefixes). Later merges build up to common words.

What about truly unknown words?

This is BPE's superpower. Consider "lowest" — even if this exact word never appeared in training, its subword pieces "low" and "est" probably did. Now consider a word from a language you've never trained on, like the Czech word "nejneobvyklejší" (most unusual). BPE handles it by falling back to characters:

python
# If no merge rules apply, characters remain as individual tokens
# "nejneobvyklejší" might tokenize as:
# ['ne', 'j', 'ne', 'ob', 'vy', 'k', 'le', 'j', 'ší', '·']
# Some pieces got merged (ne, ob, le) because they're common
# Others stayed as characters (j, k) because they're rare in context

# The key: NOTHING is ever truly OOV
# Worst case: every character is its own token
# This is equivalent to character-level processing
# for that specific word, while common words stay efficient
Graceful degradation. BPE creates a continuous spectrum from word-level to character-level representation, automatically adjusting per word based on its familiarity. Common words like "the" and "is" become single tokens. Rare words decompose into known subwords. Completely novel strings fall back to characters. There's no cliff — no point where the system suddenly fails.

Encoding implementation

python
def apply_bpe(word, merges):
    """Encode a word using learned BPE merge rules."""
    # Start with characters
    symbols = list(word) + ['</w>']

    # Apply each merge rule in order
    for (left, right) in merges:
        i = 0
        while i < len(symbols) - 1:
            if symbols[i] == left and symbols[i+1] == right:
                symbols[i:i+2] = [left + right]
                # Don't increment i — check if new symbol merges with next
            else:
                i += 1

    return symbols

# Example usage
merges = [('e','s'), ('es','t'), ('est','</w>'), ('l','o'), ('lo','w')]
print(apply_bpe("lowest", merges))
# ['low', 'est</w>'] — 2 tokens instead of 6 characters

print(apply_bpe("newer", merges))
# ['n', 'e', 'w', 'e', 'r', '</w>'] — falls back to chars where no merges apply
BPE Encoding Step-by-Step

Click "Step" to apply one merge rule at a time to the word "unbreakable". Watch characters progressively merge into subwords following the learned merge order.

How does BPE handle a completely novel word it has never seen during training?

Chapter 5: BPE for Translation

Now let's see how Sennrich et al. actually integrated BPE into a neural machine translation pipeline. The NMT system they used was an attention-based encoder-decoder model — the standard architecture in 2016 (before Transformers).

The pipeline

1. Learn BPE
Train BPE merge rules on source + target training text (joint BPE). Typical: 32,000-90,000 merges.
2. Segment Training Data
Apply BPE to all source and target sentences. "unbreakable" → "un@@ break@@ able". The @@ marker indicates this subword continues to the next.
3. Train NMT
Train the encoder-decoder model on BPE-segmented text. The model sees subword tokens, not words. Vocabulary is now manageable (32K-90K tokens).
4. Translate
BPE-segment source input, translate with the model (which outputs BPE segments), then reverse BPE to get words: "un@@ break@@ able" → "unbreakable".

The @@ notation is a practical detail: during BPE segmentation, every subword except the last in a word gets @@ appended. To reconstruct the original word, you just remove @@ and concatenate.

Why this works for translation

BPE solves three concrete translation problems:

Compound words: German "Abwasserbehandlungsanlage" (wastewater treatment plant) is split into "Ab@@ wasser@@ behandlungs@@ anlage". The model can translate each morpheme because it has seen "wasser" (water) and "Anlage" (facility) in many contexts.

Named entities: "Schwarzenegger" might split into "Sch@@ war@@ zen@@ egger". Even if the model hasn't seen this name, it has learned to copy subword pieces from source to target — names typically stay the same across languages.

Morphological agreement: Russian adjectives agree with nouns in gender, number, and case — producing dozens of forms. BPE learns the common stems and suffixes separately, allowing the model to generalize.

python
# BPE in a translation pipeline
from subword_nmt import learn_bpe, apply_bpe

# 1. Learn BPE on training data
learn_bpe(
    input="train.en-de",    # concatenated EN + DE training text
    output="bpe.codes",     # merge rules file
    num_operations=32000   # vocabulary size
)

# 2. Apply to training data
bpe = apply_bpe.BPE(codes=open("bpe.codes"))
segmented = bpe.process_line("The unbreakable code was discovered")
# → "The un@@ break@@ able code was dis@@ cover@@ ed"

# 3. Train NMT model on segmented data (standard encoder-decoder)
# Model vocabulary: ~32K BPE tokens instead of ~500K words

# 4. At inference: segment input, translate, de-segment output
src = bpe.process_line("Schwarzenegger attended the ceremony")
tgt_bpe = model.translate(src)  # "Sch@@ war@@ zen@@ egger besuchte die Zere@@ monie"
tgt = tgt_bpe.replace("@@ ", "")   # "Schwarzenegger besuchte die Zeremonie"
A key design decision: BPE operates outside the model. The NMT model itself doesn't know about BPE — it just sees a sequence of tokens. BPE is a preprocessing step (segmenting text before training) and a postprocessing step (joining subwords after generation). This means BPE can be used with any sequence-to-sequence model without changing the model architecture.
Translation Pipeline

Watch how BPE handles a translation pipeline. Click "Step" to advance through segmentation, encoding, decoding, and de-segmentation. See how rare words get decomposed for translation and reassembled after.

Why can BPE handle compound words like German "Abwasserbehandlungsanlage" that a word-level model cannot?

Chapter 6: Results & Analysis

Sennrich et al. tested BPE on four language pairs from the WMT 2015 translation task. The results demonstrated that BPE consistently improves over word-level baselines, especially for morphologically rich target languages.

BLEU scores

SystemEN→DEEN→RUDE→ENRU→EN
Word-level (50K vocab)20.522.124.024.0
Word-level + back-off dict22.823.526.325.1
BPE (joint, 60K)23.724.227.125.8
BPE (joint, 90K)24.023.727.625.9

Key findings from the results:

BPE beats the dictionary back-off. The word-level + back-off approach was the previous best method for handling OOV words. BPE surpasses it on all language pairs except one, without needing any external dictionary.

Compound-heavy languages benefit most. EN→DE shows the largest improvement (+1.2 BLEU over back-off) because German has many compound words that BPE decomposes effectively. EN→RU also benefits due to Russian's rich morphology.

The optimal merge count varies. 60K merges works best for EN→RU (more decomposition helps with Russian morphology), while 90K merges works best for EN→DE and DE→EN. This aligns with the tradeoff: more morphologically complex languages benefit from smaller subword units.

Analysis: what BPE learns

The paper includes a qualitative analysis showing that BPE discovers linguistically meaningful subword units without any linguistic knowledge:

BPE-discovered morphemes
# Prefixes:  un- , re- , pre- , über- , aus-
# Suffixes:  -ing , -tion , -ness , -ung , -lich , -keit
# Stems:    break , work , hand , schuh (shoe)

# German compound decomposition:
# Handschuhschneeballwerfer → Hand@@ schuh@@ schnee@@ ball@@ werfer
#                            (glove + snow + ball + thrower)
# Each piece is a real German morpheme!
BPE rediscovers morphology from frequency alone. This is remarkable: BPE knows nothing about linguistics. It doesn't know what a prefix, suffix, or stem is. It just counts character pairs. Yet the merge rules it learns correspond closely to the morphological structure of the language. The reason: morphemes are, by definition, the meaningful units that recur across words — and frequency is a proxy for meaningfulness.

Rare word translation examples

The paper shows specific examples where BPE enables correct translation of rare words:

SourceWord-level outputBPE output
"bushy-haired"<UNK>"struppig" (correct German)
"mistreatment"<UNK>"Misshandlung" (correct)
"Charlemagne"<UNK>"Karl der Große" (correct)

The "Charlemagne" example is particularly interesting. BPE didn't just copy the name — it translated it to the German equivalent "Karl der Große" (Charles the Great). This happened because BPE segmented "Charlemagne" into subwords that overlapped with training examples containing this historical figure.

Benchmark Explorer

Click a language pair to see how BPE compares to word-level baselines. The improvement is largest for morphologically complex target languages.

Why does BPE benefit morphologically rich languages (like German and Russian) more than simpler ones?

Chapter 7: Connections

Sennrich et al.'s 2016 paper had an impact far beyond machine translation. BPE became the de facto tokenization method for large language models, and its direct descendants power every major model today.

The BPE family tree

MethodYearKey Difference from BPEUsed By
BPE (this paper)2016Original frequency-based merges on charactersEarly NMT systems, subword-nmt
WordPiece2016Merge maximizes likelihood instead of frequencyBERT, DistilBERT
Byte-level BPE2019Operates on bytes (UTF-8), not characters — handles any languageGPT-2, GPT-3, GPT-4
SentencePiece2018Language-agnostic, treats input as raw stream (no pre-tokenization)T5, LLaMA, Mistral
Unigram LM2018Probabilistic: starts large, prunes based on likelihood impactXLNet, ALBERT (via SentencePiece)

Byte-level BPE (used in GPT-2+) is the most important descendant. Instead of operating on Unicode characters, it operates on raw bytes (values 0-255). This means the base vocabulary is just 256 symbols, and BPE can tokenize any byte sequence — any language, any encoding, even binary data. Radford et al. showed this works as well as character-level BPE while being more universal.

SentencePiece (Kudo, 2018) removes BPE's assumption that input is pre-tokenized into words by whitespace. This is crucial for languages like Japanese and Chinese that don't use spaces between words. SentencePiece treats the raw input as a character stream and can use either BPE or Unigram LM as the merging algorithm.

Impact on modern LLMs

Every major language model uses a BPE variant. The vocabulary sizes have grown:

tokenizer vocabulary sizes
GPT-2:      50,257 tokens  (byte-level BPE)
GPT-3:      50,257 tokens  (same tokenizer as GPT-2)
GPT-4:     100,256 tokens  (cl100k_base, byte-level BPE)
LLaMA:      32,000 tokens  (SentencePiece BPE)
LLaMA-3:   128,256 tokens  (tiktoken byte-level BPE)
Mistral:     32,000 tokens  (SentencePiece BPE)
From BPE to tokenization equity. One unintended consequence of BPE: when trained on English-dominant data, the tokenizer produces far more tokens for non-English text. "Hello" is 1 token, but the Hindi equivalent "नमस्ते" might be 3-6 tokens. Since LLM API pricing is per-token, this creates a de facto language tax. The Ahia et al. (EMNLP 2023) paper quantifies this problem in detail.

What BPE got right

Looking back from 2026, BPE's key insight — that the right vocabulary emerges from data statistics rather than linguistic rules — proved fundamental. Every subsequent tokenization method builds on this principle. The specific merge heuristic (frequency vs. likelihood) matters less than the architecture: start with atoms, iteratively combine the most useful units, stop at a fixed vocabulary size.

"The most frequent pair captures the most predictable structure. Compression and understanding are the same thing." — adapted from the spirit of Gage (1994)

BPE Family Timeline

Drag the slider to explore how BPE evolved from a compression algorithm to the tokenization backbone of modern LLMs.

Era 2016 — BPE NMT
What is the key difference between original BPE and byte-level BPE (used in GPT-2+)?