How models carve text into pieces — and why your Korean prompt costs 3x more than your English one.
You type "Hello world" into ChatGPT. Before the model sees a single neuron fire, something has to happen: those 11 characters need to become numbers. Neural networks don't read text — they process vectors of floats. The bridge between raw text and those vectors is tokenization.
Tokenization is the first decision in any NLP pipeline, and it's one of the most consequential. Get it wrong and the model wastes capacity learning to spell. Get it right and common words become single tokens while rare words decompose into meaningful pieces.
Here's the problem that makes tokenization hard: there are roughly 150,000 common English words. If we make each word a token, our vocabulary is 150K — the output layer of the model is a 150K-way classifier, which is expensive. And what happens when the model sees "ChatGPT" for the first time? It's not in our vocabulary. We have an out-of-vocabulary (OOV) problem.
We could go the other direction: make each character a token. There are only 256 possible byte values (or, if we work with Unicode characters instead, on the order of 150K assigned code points). No OOV problem. But now the word "understanding" is 13 tokens, and the model must learn that u-n-d-e-r-s-t-a-n-d-i-n-g forms a single concept. That asks the model to do character-level pattern recognition that a simple lookup table could handle.
The sweet spot is subword tokenization: break text into pieces that balance vocabulary size against sequence length. Common words like "the" stay whole. Rare words like "tokenization" might split into "token" + "ization". The model gets the benefits of both approaches.
Type any text and see how three strategies — character, word, and subword — carve it differently. Watch the token count change.
The simulation above shows the core tradeoff: characters produce many tokens (long sequences, expensive attention), words produce few tokens but can't handle unseen words, and subwords find the middle ground — manageable sequence lengths with no OOV problem.
This lesson teaches the three dominant subword algorithms (BPE, WordPiece, SentencePiece), shows why they create inequality across languages, and explores multilingual models that try to close the gap.
Let's make the three approaches concrete. Take the sentence: "unhappiness is unlearnable". How does each strategy see it?
Character-level: ['u','n','h','a','p','p','i','n','e','s','s',' ','i','s',' ','u','n','l','e','a','r','n','a','b','l','e']. That's 26 tokens for 3 words. The model must learn that 'u','n','h','a','p','p','y' spells "unhappy" and that the 'un-' prefix means negation. Every morphological pattern must be learned from scratch through the character sequences.
Word-level: ['unhappiness', 'is', 'unlearnable']. Only 3 tokens — beautifully compact. But "unlearnable" might not be in our 50K-word vocabulary. It becomes [UNK] — the model literally has no idea what it means. And it can't even guess, because unlike humans who can decompose "un-learn-able", the word tokenizer treats every word as an atomic unit.
Subword: ['un', 'happiness', 'is', 'un', 'learn', 'able']. Six tokens. The model sees that both words start with 'un' (negation). It sees 'happiness' and 'learn' as familiar concepts. It sees 'able' as a suffix meaning "capable of." All the morphological structure is preserved, the sequence is compact, and nothing is [UNK].
The same sentence split three ways. Notice how subword captures morphological structure that word-level loses and character-level obscures.
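The three splits above can be reproduced with a few lines of Python. This is a minimal sketch, not how any production tokenizer works: the word vocabulary and the subword piece set are hand-picked assumptions, and the subword splitter uses a simple greedy longest-match rule chosen only for illustration.

```python
def char_tokenize(text):
    """Character-level: every character, including spaces, is a token."""
    return list(text)

def word_tokenize(text, vocab):
    """Word-level: whole words, with [UNK] for anything outside the vocabulary."""
    return [w if w in vocab else "[UNK]" for w in text.split()]

def subword_tokenize(text, pieces):
    """Subword-level: greedy longest-match split of each word into known pieces."""
    tokens = []
    for word in text.split():
        start = 0
        while start < len(word):
            # Try the longest candidate first, then shrink.
            for end in range(len(word), start, -1):
                if word[start:end] in pieces:
                    tokens.append(word[start:end])
                    start = end
                    break
            else:
                # No known piece starts here: fall back to a single character.
                tokens.append(word[start])
                start += 1
    return tokens

sentence = "unhappiness is unlearnable"
word_vocab = {"unhappiness", "is"}                    # hypothetical: "unlearnable" is missing
pieces = {"un", "happiness", "is", "learn", "able"}   # hypothetical subword inventory

chars = char_tokenize(sentence)
words = word_tokenize(sentence, word_vocab)
subwords = subword_tokenize(sentence, pieces)

print(len(chars), chars[:5])
print(len(words), words)
print(len(subwords), subwords)
```

The character-level fallback inside `subword_tokenize` is what keeps the subword strategy free of [UNK]: a string with no matching piece degrades into characters instead of disappearing.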
Every tokenizer navigates a fundamental tradeoff between vocabulary size and sequence length:
| Strategy | Vocab Size | Avg Tokens/Word | OOV? | Morphology? |
|---|---|---|---|---|
| Character | ~256 | ~5 | Never | Must learn |
| Word | 50K-150K | 1 | Frequent | Lost |
| Subword | 30K-100K | ~1.3 | Never | Preserved |
Why does vocabulary size matter? The embedding table has shape [vocab_size, d_model]. For GPT-2 with vocab 50,257 and d_model 768, that's 38.6M parameters just for the embedding layer — about 31% of the total 124M parameter model. A character vocabulary (256) would shrink this to 197K parameters, but the sequences would be 4-5x longer, making attention O(n²) far more expensive.
The output layer has the mirrored shape: [d_model, vocab_size]. Unless the two matrices share weights (GPT-2 ties them, which is why its 124M total counts the table once), the vocabulary cost is paid twice: once at the input embedding, once at the output projection. Even at moderate vocabulary sizes (Llama 2's 32K, GPT-4's ~100K), these layers can reach hundreds of millions of parameters when d_model is large.
$$\text{embedding parameters} = 2 \times V \times d_{\text{model}}$$
Where V is vocabulary size and d_model is the hidden width. For Llama 2 (V=32K, d=4096): 2 × 32,000 × 4,096 = 262M parameters. For a hypothetical character model (V=256) at the same width: only 2M parameters for embeddings, but sequences 4x longer mean attention costs 16x more.
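The arithmetic above is easy to check directly. The helper below is an illustrative sketch; the function name and the `tied` flag are assumptions introduced here, with `tied=True` modeling a model that reuses one matrix for both input embedding and output projection.

```python
def embedding_params(vocab_size, d_model, tied=False):
    """Parameters spent on the vocabulary: one [V, d] matrix if tied, two otherwise."""
    per_matrix = vocab_size * d_model
    return per_matrix if tied else 2 * per_matrix

gpt2 = embedding_params(50_257, 768, tied=True)   # single shared table
llama2 = embedding_params(32_000, 4_096)          # input + output matrices
char_model = embedding_params(256, 4_096)         # hypothetical character vocabulary

print(f"GPT-2 table:      {gpt2:,} ({gpt2 / 124e6:.0%} of 124M)")
print(f"Llama 2 in + out: {llama2:,}")
print(f"Character model:  {char_model:,}")

# Self-attention cost grows with the square of sequence length,
# so sequences 4x longer cost 4**2 = 16x more.
length_ratio = 4
attention_ratio = length_ratio ** 2
print(f"Attention cost ratio: {attention_ratio}x")
```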
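Subword tokenization dominates because it hits the sweet spot on every axis in the table above: a 30K-100K vocabulary, roughly 1.3 tokens per word, no OOV, and morphology preserved.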
Every major LLM uses subword tokenization. GPT-2/3/4 use Byte-Pair Encoding (BPE). BERT uses WordPiece. T5, Llama, and many multilingual models use SentencePiece (which implements BPE or Unigram underneath). The next three chapters explain how each works.
Byte-Pair Encoding started as a data compression algorithm in 1994 (Gage). Sennrich et al. (2016) adapted it for NLP in their paper "Neural Machine Translation of Rare Words with Subword Units," and it became the foundation of modern tokenization. GPT-2, GPT-3, GPT-4, and many other models use BPE.
The idea is beautifully simple: start with individual characters, then iteratively merge the most frequent pair of adjacent tokens into a single new token. After K merges, you have a vocabulary of size B + K, where B is the number of base symbols (256 + K for the byte-level variant, whose base symbols are the 256 byte values).
Let's walk through BPE on a tiny corpus: "low low low low low lowest lowest newer newer newer wider wider".
Step 0: Initialize. Split every word into characters, with a special end-of-word marker "</w>":
corpus
```text
low:     l o w </w>          (frequency: 5)
lowest:  l o w e s t </w>    (frequency: 2)
newer:   n e w e r </w>      (frequency: 3)
wider:   w i d e r </w>      (frequency: 2)
```
Step 1: Count all adjacent pairs.
pair counts
```text
(l, o): 5+2 = 7     (o, w): 5+2 = 7      (w, </w>): 5
(w, e): 2+3 = 5     (e, s): 2            (s, t): 2
(t, </w>): 2        (n, e): 3            (e, w): 3
(e, r): 3+2 = 5     (r, </w>): 3+2 = 5   (w, i): 2
(i, d): 2           (d, e): 2
```
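The pair counts for this toy corpus can be verified mechanically. The sketch below assumes the corpus is stored as a word-to-frequency dictionary; `count_pairs` is a name introduced here for illustration. Note that "newer" contributes an (e, w) pair as well as (w, e).

```python
from collections import Counter

corpus = {"low": 5, "lowest": 2, "newer": 3, "wider": 2}

def count_pairs(corpus):
    """Count adjacent symbol pairs, weighting each word by its frequency."""
    counts = Counter()
    for word, freq in corpus.items():
        symbols = list(word) + ["</w>"]   # characters plus end-of-word marker
        for left, right in zip(symbols, symbols[1:]):
            counts[(left, right)] += freq
    return counts

pair_counts = count_pairs(corpus)
for pair, count in pair_counts.most_common():
    print(pair, count)
```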
Step 2: Merge the most frequent pair. Tie between (l,o) and (o,w), both at 7. Pick (l,o). Create new token "lo".
after merge 1: l+o → lo
```text
low:     lo w </w>          (5)
lowest:  lo w e s t </w>    (2)
newer:   n e w e r </w>     (3)
wider:   w i d e r </w>     (2)
```
Step 3: Re-count pairs. Now (lo, w) has frequency 7. Merge it into "low".
after merge 2: lo+w → low
```text
low:     low </w>           (5)
lowest:  low e s t </w>     (2)
newer:   n e w e r </w>     (3)
wider:   w i d e r </w>     (2)
```
After 2 merges, "low" is now a single token. The algorithm continues: three pairs are now tied at frequency 5, namely (e,r), (r,</w>), and (low,</w>). If the next merge is (e,r), creating "er", a natural follow-up is (er,</w>) or (low,</w>). Each merge creates a new token and reduces total sequence length.
Click "Next Merge" to watch BPE build a vocabulary from individual characters. The most frequent pair is highlighted before each merge.
Training BPE produces a merge table: an ordered list of (pair → merged_token) rules. At inference time, we apply these rules in order to any new text.
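For the text "lowest", the rules fire in the order they were learned: l+o gives "lo", then lo+w gives "low", and (given enough later merges over the suffix characters) e, s, and t combine into "est".
The word "lowest" becomes ["low", "est"]. The model learns that "est" is a superlative suffix and "low" is a base word. This morphological decomposition emerges naturally from frequency statistics: no linguistic knowledge was programmed in.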
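Applying a merge table to unseen text can be sketched as follows. The first two merges come from the walkthrough above; the remaining three are hypothetical later merges, assumed here only so that the suffix collapses. `apply_merges` is an illustrative name, not a library function.

```python
def apply_merges(word, merges):
    """Encode one word by replaying the merge rules in the order they were learned."""
    symbols = list(word) + ["</w>"]
    for left, right in merges:
        merged, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and symbols[i] == left and symbols[i + 1] == right:
                merged.append(left + right)   # fuse the matching pair
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols

merges = [
    ("l", "o"), ("lo", "w"),                        # from the walkthrough
    ("e", "s"), ("es", "t"), ("est", "</w>"),       # hypothetical later merges
]

print(apply_merges("lowest", merges))
print(apply_merges("low", merges))
print(apply_merges("slow", merges))   # unseen word: reuses "low", keeps "s" separate
```

In this sketch the end-of-word marker stays attached to the final piece, so the suffix appears as "est</w>" rather than a bare "est".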
GPT-2 introduced byte-level BPE: instead of starting from Unicode characters (well over 100K possible base symbols), start from raw bytes (256 base symbols). This guarantees that any text in any language can be tokenized, even binary data. The base vocabulary is exactly 256, and all learned tokens are sequences of bytes.
GPT-2 uses 50,000 merges, giving a total vocabulary of 50,257 (256 bytes + 50,000 merges + 1 special token). GPT-4's cl100k_base tokenizer uses ~100,000 merges for a vocabulary of ~100,277.
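The byte-level starting point is just UTF-8 encoding, which can be sketched with the standard library alone. The helper names are assumptions introduced for illustration; the sample strings are arbitrary.

```python
def to_byte_symbols(text):
    """Map text to base symbols: integers in the range 0..255."""
    return list(text.encode("utf-8"))

def from_byte_symbols(symbols):
    """Invert the mapping; lossless for any valid UTF-8 byte sequence."""
    return bytes(symbols).decode("utf-8")

for sample in ["Hello", "안녕", "猫", "🚀"]:
    symbols = to_byte_symbols(sample)
    # Every symbol fits in the 256-entry base vocabulary, whatever the script.
    assert all(0 <= s < 256 for s in symbols)
    assert from_byte_symbols(symbols) == sample
    print(f"{sample!r}: {len(sample)} characters -> {len(symbols)} base symbols")

# Vocabulary arithmetic for GPT-2: base bytes + merges + one special token.
gpt2_vocab = 256 + 50_000 + 1
print(gpt2_vocab)
```

Before any merges are learned, non-Latin scripts already start from more base symbols per character, which is part of why they depend so heavily on the merge table.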
```python
import tiktoken

# GPT-2 tokenizer (50,257 vocab)
enc_gpt2 = tiktoken.get_encoding("gpt2")
tokens = enc_gpt2.encode("Hello world")   # [15496, 995]: 2 tokens

# GPT-4 tokenizer (100,277 vocab)
enc_gpt4 = tiktoken.get_encoding("cl100k_base")
tokens = enc_gpt4.encode("Hello world")   # [9906, 1917]: 2 tokens (different IDs, same count)

# Rare words can differ: a larger vocabulary has more merges available,
# so the same word may need fewer tokens. Print both counts to compare.
word = "antidisestablishmentarianism"
print(len(enc_gpt2.encode(word)), len(enc_gpt4.encode(word)))
```
```python
def train_bpe(corpus, num_merges):
    """Train a BPE merge table from a {word: frequency} corpus."""
    # Step 0: Initialize with character-level tokenization
    vocab = {word: list(word) + ['</w>'] for word in corpus}  # word -> symbols
    merge_table = []  # ordered list of merges

    for _ in range(num_merges):
        # Step 1: Count all adjacent pairs, weighted by word frequency
        pair_counts = {}
        for word, freq in corpus.items():
            tokens = vocab[word]
            for j in range(len(tokens) - 1):
                pair = (tokens[j], tokens[j + 1])
                pair_counts[pair] = pair_counts.get(pair, 0) + freq

        # Nothing left to merge: every word is already a single token
        if not pair_counts:
            break

        # Step 2: Find the most frequent pair (ties go to the first one seen)
        best_pair = max(pair_counts, key=pair_counts.get)

        # Step 3: Merge that pair everywhere
        new_token = best_pair[0] + best_pair[1]
        for word, tokens in vocab.items():
            new_tokens = []
            j = 0
            while j < len(tokens):
                if j < len(tokens) - 1 and (tokens[j], tokens[j + 1]) == best_pair:
                    new_tokens.append(new_token)
                    j += 2
                else:
                    new_tokens.append(tokens[j])
                    j += 1
            vocab[word] = new_tokens

        merge_table.append(best_pair)

    return merge_table
```
The time complexity of this naive BPE training loop is O(K × N), where K is the number of merges and N is the corpus size: each merge rescans the entire corpus to count pairs and apply the merge. At GPT-2 scale (50K merges over a very large corpus) that can take hours, which is why production trainers update pair counts incrementally instead of rescanning. Either way it's a one-time cost: once trained, the merge table is fixed.
BPE isn't the only subword algorithm. Two important variants — WordPiece and SentencePiece — address different design choices. Understanding the differences explains why BERT tokens look different from GPT tokens.
WordPiece (Schuster & Nakajima, 2012) was developed at Google for Japanese and Korean voice search and later adopted for BERT. It looks similar to BPE but uses a different merge criterion.
BPE merges the most frequent pair. WordPiece merges the pair that maximizes the likelihood of the training corpus. The difference is subtle but meaningful. Consider two pairs:
| Pair | Frequency | Individual Frequencies | Likelihood Score |
|---|---|---|---|
| (t, h) | 1000 | t: 5000, h: 3000 | 1000 / (5000 × 3000) = 6.7 × 10⁻⁵ |
| (q, u) | 200 | q: 210, u: 2000 | 200 / (210 × 2000) = 4.8 × 10⁻⁴ |
BPE picks (t,h) because frequency 1000 > 200. WordPiece picks (q,u) because the likelihood score is higher — whenever you see "q", it's almost always followed by "u", making "qu" a highly informative merge. The pair (t,h) is frequent but both "t" and "h" appear in many other contexts, so merging them captures less information.
This is essentially pointwise mutual information (PMI): how much more likely is the pair than we'd expect if the two symbols were independent? High PMI means the symbols are strongly associated — like "q" and "u".
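The two merge criteria can be compared on the numbers from the table. This is a sketch of the scoring rule only, using the pair-frequency-over-product-of-frequencies form shown above; the dictionary layout and function name are assumptions made for illustration.

```python
def wordpiece_score(pair_freq, left_freq, right_freq):
    """Likelihood-style score: pair frequency normalized by the parts' frequencies."""
    return pair_freq / (left_freq * right_freq)

candidates = {
    ("t", "h"): {"pair": 1000, "left": 5000, "right": 3000},
    ("q", "u"): {"pair": 200, "left": 210, "right": 2000},
}

# BPE criterion: raw pair frequency.
bpe_choice = max(candidates, key=lambda p: candidates[p]["pair"])

# WordPiece criterion: normalized score.
scores = {
    pair: wordpiece_score(c["pair"], c["left"], c["right"])
    for pair, c in candidates.items()
}
wordpiece_choice = max(scores, key=scores.get)

print("BPE merges:      ", bpe_choice)
print("WordPiece merges:", wordpiece_choice)
for pair, score in scores.items():
    print(pair, f"{score:.2e}")
```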
BERT uses WordPiece with a vocabulary of 30,522 tokens. Subwords that continue a word are prefixed with "##":
```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

tokenizer.tokenize("unhappiness is unlearnable")
# ['un', '##ha', '##pp', '##iness', 'is', 'un', '##learn', '##able']

tokenizer.tokenize("tokenization")
# ['token', '##ization']

tokenizer.tokenize("the cat sat")
# ['the', 'cat', 'sat'] (common words stay whole)
```
The "##" prefix is critical: it tells the model that this piece continues the previous word. Without it, the model couldn't distinguish a word-initial "un" (as in "unhappy") from the same letters appearing mid-word (which would be "##un"). The prefix is part of the token's identity, so "un" and "##un" are separate vocabulary entries with separate embeddings.
Both BPE and WordPiece, in their classic form, assume pre-tokenized text: words are already separated by spaces. This works for English but fails for Chinese, Japanese, and Thai, where words aren't space-separated. SentencePiece (Kudo & Richardson, 2018) treats the input as a raw stream of Unicode characters, including spaces.
The key insight: SentencePiece uses a special character "▁" (U+2581, lower one eighth block) to represent the space. Instead of splitting on spaces first, it treats spaces as ordinary characters. This means it can learn tokens that include the space and, when its whitespace-splitting option is turned off, tokens that span word boundaries.
SentencePiece vs BPE tokenization
```text
Input: "New York City"

BPE (pre-split): ["New", "York", "City"]
  → tokenize each word independently

SentencePiece:   "▁New▁York▁City"
  → can learn "▁New▁York" as a single token!
```
This is a significant advantage for multilingual models. In Chinese, "我喜欢自然语言处理" (I like NLP) has no spaces. BPE would need a separate word segmentation step. SentencePiece processes the raw string directly, learning character-level and phrase-level tokens from the corpus statistics.
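The whitespace convention can be illustrated without the SentencePiece library itself. The sketch below only mimics the escaping and the lossless round trip; the function names are hypothetical, and the leading "▁" reflects the assumption that a dummy prefix is added at the start of the text.

```python
WORD_BOUNDARY = "\u2581"  # "▁", the visible stand-in for a space

def escape_spaces(text):
    """Mark every space (plus a leading dummy prefix) with the boundary symbol."""
    return WORD_BOUNDARY + text.replace(" ", WORD_BOUNDARY)

def detokenize(pieces):
    """Concatenate pieces and turn boundary symbols back into spaces."""
    text = "".join(pieces).replace(WORD_BOUNDARY, " ")
    return text[1:] if text.startswith(" ") else text

escaped = escape_spaces("New York City")
print(escaped)

# Any segmentation of the escaped string decodes back to the original text.
pieces = ["\u2581New", "\u2581York", "\u2581City"]
print(detokenize(pieces))

# Text without spaces needs no special handling: the same code path applies.
chinese = "我喜欢自然语言处理"
char_pieces = list(escape_spaces(chinese))
print(detokenize(char_pieces) == chinese)
```

Because the space is an ordinary symbol, detokenization is plain string concatenation; no language-specific rules about where to reinsert spaces are needed.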
SentencePiece supports two algorithms: BPE (merge-based) and Unigram (probabilistic). The Unigram method is the opposite of BPE:
| Property | BPE | Unigram |
|---|---|---|
| Direction | Bottom-up (merge pairs) | Top-down (prune large vocab) |
| Start | Characters (small vocab) | All substrings (huge vocab) |
| Each step | Add best merge | Remove least-useful token |
| Tokenization | Deterministic | Probabilistic (multiple valid segmentations) |
| Used by | GPT, Llama, Mistral | T5, ALBERT, mBART |
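Unigram starts with a large candidate vocabulary (all frequent substrings up to some length) and iteratively removes the token whose removal least increases the total corpus loss. It models each token as drawn independently, so the probability of a segmentation x = (x_1, ..., x_M) is the product of its token probabilities:
$$P(\mathbf{x}) = \prod_{i=1}^{M} p(x_i), \qquad \sum_{v \in \mathcal{V}} p(v) = 1$$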
The Unigram method's killer feature is subword regularization: because it can score any segmentation probabilistically, you can sample different tokenizations of the same word during training. "tokenization" might be ["token", "ization"] in one training step and ["to", "ken", "ization"] in another. This acts as data augmentation and makes the model more robust.
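Scoring segmentations under a unigram model can be sketched with a small dynamic program. The token probabilities below are made-up numbers for illustration, not values from any trained tokenizer, and `viterbi_segment` is a name introduced here. It returns the single most probable segmentation; subword regularization would sample from the same distribution instead of taking the maximum.

```python
import math

def viterbi_segment(word, log_probs):
    """Return the most probable segmentation of `word` and its log-probability."""
    n = len(word)
    best = [-math.inf] * (n + 1)   # best[i]: best score for word[:i]
    back = [0] * (n + 1)           # back[i]: start index of the last piece
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(end):
            piece = word[start:end]
            if piece in log_probs:
                score = best[start] + log_probs[piece]
                if score > best[end]:
                    best[end], back[end] = score, start
    if best[n] == -math.inf:
        raise ValueError("word cannot be segmented with this vocabulary")
    # Walk the back-pointers from the end to recover the pieces.
    pieces, end = [], n
    while end > 0:
        pieces.append(word[back[end]:end])
        end = back[end]
    return pieces[::-1], best[n]

# Hypothetical unigram probabilities for a handful of pieces.
probs = {"token": 0.05, "ization": 0.02, "to": 0.08,
         "ken": 0.01, "iz": 0.005, "ation": 0.03}
log_probs = {piece: math.log(p) for piece, p in probs.items()}

pieces, score = viterbi_segment("tokenization", log_probs)
print(pieces, math.exp(score))
```

With these numbers, ["token", "ization"] scores 0.05 × 0.02 = 0.001, while ["to", "ken", "ization"] scores 0.08 × 0.01 × 0.02 = 0.000016, so the two-piece split wins.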
Toggle between BPE (frequency-based) and WordPiece (likelihood-based) merge criteria. Watch how the merge order differs.
Here's an uncomfortable fact: the sentence "I love machine learning" is 4 tokens in GPT-4. The Korean equivalent "나는 기계 학습을 좋아합니다" is 13 tokens. The same meaning costs 3.25x more to process in Korean than in English. This isn't a bug — it's a direct consequence of how tokenizers are trained.
BPE learns merges from corpus statistics. English dominates most training corpora (often 50-90% of the data). So English pairs get merged aggressively: "the" is one token, "machine" is one token, "learning" is one token. Korean characters appear less frequently, so fewer merges happen. Each Korean syllable block might stay as 2-3 byte tokens.
This creates what researchers call the fertility problem: the number of tokens needed to represent the same meaning varies dramatically across languages.
Fertility is the average number of subword tokens per word (or per character, depending on definition). For a given tokenizer and a sample of text:
$$\text{fertility} = \frac{\text{total subword tokens}}{\text{total words}}$$
| Language | Script | Fertility (GPT-2) | Fertility (GPT-4) |
|---|---|---|---|
| English | Latin | 1.3 | 1.2 |
| German | Latin | 1.8 | 1.4 |
| Chinese | CJK | 2.5 | 1.6 |
| Japanese | Mixed | 2.8 | 1.7 |
| Korean | Hangul | 3.8 | 2.2 |
| Thai | Thai | 4.2 | 2.8 |
| Burmese | Myanmar | 8.5 | 5.1 |
Notice two things. First, fertility correlates with how much of each language appears in the training corpus. English dominates, so it has the lowest fertility. Second, GPT-4 is better than GPT-2 for all languages because it has a larger vocabulary (100K vs 50K) and more multilingual training data.
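Fertility as defined above is a one-line ratio once a tokenizer is available. The sketch below uses a stand-in tokenizer that chops words into fixed-size chunks, purely so the example runs without external dependencies. Counting words by splitting on whitespace is also an assumption that breaks down for scripts written without spaces.

```python
def fertility(tokenize, sentences):
    """Average number of tokens per whitespace-delimited word."""
    total_tokens = sum(len(tokenize(s)) for s in sentences)
    total_words = sum(len(s.split()) for s in sentences)
    return total_tokens / total_words

def chunk_tokenizer(text, size=4):
    """Hypothetical stand-in tokenizer: split each word into chunks of up to `size` chars."""
    return [word[i:i + size] for word in text.split() for i in range(0, len(word), size)]

sentences = [
    "I love machine learning",
    "tokenizers carve text into pieces",
]

score = fertility(chunk_tokenizer, sentences)
print(f"{score:.2f} tokens per word")
```

Swapping `chunk_tokenizer` for a real encoder (any callable that maps a string to a list of tokens) gives the measured fertility for that tokenizer on the chosen sample.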
See how many tokens each language needs to express the same sentence ("I love machine learning"). Longer bars = more tokens = more expensive.
High fertility has three concrete costs:
1. Financial cost. API pricing is per-token. If Korean uses 3x more tokens than English for the same content, Korean users pay 3x more. This is a real equity issue for commercial LLM services.
2. Context window waste. A 4K context window holds ~3000 English words but only ~1000 Thai words. Multilingual users get less "working memory" for the same context length.
3. Model capacity. More tokens means more attention computations, more positional encodings, more memory. The model spends capacity on character-level reconstruction instead of semantic understanding.
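The first two costs follow directly from the fertility numbers. This sketch plugs in the GPT-2 column of the table above and assumes "4K" means 4,096 tokens; the helper names are introduced for illustration.

```python
def words_per_context(context_tokens, fertility):
    """How many words fit in a context window at a given tokens-per-word rate."""
    return int(context_tokens / fertility)

def relative_cost(fertility, baseline_fertility):
    """Per-word token cost relative to a baseline language."""
    return fertility / baseline_fertility

CONTEXT = 4096  # assumed size of a "4K" window
gpt2_fertility = {"English": 1.3, "Korean": 3.8, "Thai": 4.2}  # from the table above

capacity = {lang: words_per_context(CONTEXT, f) for lang, f in gpt2_fertility.items()}
cost = {lang: relative_cost(f, gpt2_fertility["English"]) for lang, f in gpt2_fertility.items()}

for lang in gpt2_fertility:
    print(f"{lang}: ~{capacity[lang]} words per window, {cost[lang]:.1f}x English cost")
```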
Several approaches attack the fertility problem:
Larger vocabularies. GPT-4's 100K vocabulary has lower fertility than GPT-2's 50K vocabulary for all languages. More merges = more multilingual tokens. But vocabulary can't grow indefinitely: the embedding table grows linearly, and the output softmax becomes more expensive.
Balanced training data. If the corpus has equal representation of all target languages, the tokenizer learns merges for each language proportionally. XLM-R uses exponential smoothing to up-sample low-resource languages during tokenizer training.
Language-specific tokenizers. Instead of one tokenizer for all languages, train separate tokenizers per language and combine them. This guarantees good fertility for each language but breaks cross-lingual transfer — "cat" and "猫" have no token-level relationship.
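Byte-level models. Models like ByT5 skip tokenization entirely and process raw bytes. Fertility is trivially constant when measured per byte (one token per byte), although scripts whose characters take 2-4 UTF-8 bytes still yield longer sequences than Latin text. Sequences are also 3-5x longer than with subword tokens, making training and inference more expensive.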
Can a single model understand 100 languages? XLM-R (Cross-lingual Language Model — RoBERTa, Conneau et al. 2020) proved it can. Trained on 2.5TB of text in 100 languages with a shared SentencePiece tokenizer, XLM-R demonstrated that multilingual pretraining creates cross-lingual transfer: fine-tune on English, deploy in any of the 100 languages.
This is remarkable. You label training data only in English (cheap, abundant), fine-tune XLM-R on it, and the model works in Korean, Arabic, and Swahili — without ever seeing labeled data in those languages. How?
The key is shared subword tokens. Many languages share script (Latin alphabet covers English, French, German, Spanish, etc.). Even across scripts, shared concepts create overlapping embedding spaces through co-occurrence patterns.
During pretraining, XLM-R learns that "cat" (English), "chat" (French), and "gato" (Spanish) appear in similar contexts: they follow "the"/"le"/"el" and precede "is"/"est"/"es". The masked language modeling objective forces the model to develop representations where semantically equivalent words in different languages occupy nearby points in the embedding space.
Words with similar meanings cluster together regardless of language. Click to highlight a concept cluster and see how translations from different languages map to the same region.
XLM-R is architecturally identical to RoBERTa — a transformer encoder with masked language modeling. The only differences are scale and data:
| Property | RoBERTa | XLM-R Base | XLM-R Large |
|---|---|---|---|
| Languages | 1 (English) | 100 | 100 |
| Training data | 160GB | 2.5TB | 2.5TB |
| Vocab size | 50K BPE | 250K SP | 250K SP |
| Parameters | 125M / 355M | 270M | 550M |
| d_model | 768 / 1024 | 768 | 1024 |
| Layers | 12 / 24 | 12 | 24 |
The 250K vocabulary is massive compared to RoBERTa's 50K. This is necessary: 100 languages means 100 sets of common words that need dedicated tokens. Even with 250K tokens, low-resource languages still have higher fertility than English. But the increased vocabulary provides dramatically better coverage than using a 50K English-centric vocabulary for all languages.
Cross-lingual transfer isn't free. Conneau et al. identified the curse of multilinguality: for a fixed model capacity, adding more languages eventually hurts per-language performance. A model trained on 100 languages is worse at English than one trained only on English, given the same parameter count.
The intuition is capacity dilution. A 270M parameter model has finite representational capacity. If it must represent 100 languages, then in the crudest picture each language gets about 1/100th of that capacity. In practice some capacity is shared (universal syntactic patterns, shared vocabulary), but language-specific phenomena (word order, morphology, semantics) compete for what remains.
The solution: scale up. XLM-R Large (550M) closes most of the gap with monolingual RoBERTa. Larger models have enough capacity for both shared cross-lingual structure and language-specific knowledge. This is why modern multilingual models (Llama, GPT-4, PaLM) are very large — multilinguality demands it.
A naive approach trains on data proportional to each language's size in Common Crawl. But English has 300x more data than Urdu. If we sample proportionally, the model barely sees Urdu.
XLM-R uses exponential smoothing: adjust the sampling probability with a temperature parameter α:
$$q_i = \frac{p_i^{\alpha}}{\sum_j p_j^{\alpha}}, \qquad p_i = \frac{n_i}{\sum_k n_k}$$
Where n_i is the number of sentences in language i, p_i is that language's share of the corpus, and q_i is the probability of sampling from it. With α=1 (proportional sampling), English dominates. With α=0 (uniform sampling), each language gets equal time. XLM-R uses α=0.3, which up-samples low-resource languages significantly while still giving high-resource languages more data (because they have more useful patterns to learn).
```python
# Exponential smoothing example
sizes = {'en': 300, 'fr': 50, 'ur': 1}  # relative corpus sizes
alpha = 0.3

# Raise each size to the power alpha, then renormalize
smoothed = {k: v ** alpha for k, v in sizes.items()}
total = sum(smoothed.values())
probs = {k: v / total for k, v in smoothed.items()}
# en: 0.57, fr: 0.33, ur: 0.10

# Compare to proportional sampling: en: 0.85, fr: 0.14, ur: 0.003
# Urdu goes from 0.3% to about 10% of training samples!
```
Time to put everything together. This interactive explorer lets you type any text and see how different tokenization strategies handle it. Watch how fertility changes across simulated tokenizer configurations, and explore why some scripts are expensive while others are cheap.
Type text in any language to see character, BPE, and word tokenizations side by side. Toggle between small (32K) and large (100K) vocabularies. Try mixing languages!
Try these experiments to build intuition:
Experiment 1: Language comparison. Type "I love programming" and note the token count. Then try "나는 프로그래밍을 좋아합니다" (same meaning in Korean). The Korean version uses 2-3x more tokens despite expressing the same idea.
Experiment 2: Code. Try pasting a Python one-liner like `def hello(): return "world"`. Programming keywords are common in training data, so `def` and `return` are single tokens. But unusual variable names fragment into subwords.
Experiment 3: Emoji. Type "😀🎉🚀". Each of these emoji is 4 bytes in UTF-8. A byte-level BPE tokenizer without emoji merges will produce up to 4 tokens per emoji. With a large vocabulary that includes common emoji merges, each emoji becomes 1 token.
Experiment 4: Mixed scripts. Type "Hello こんにちは Bonjour" and see how the tokenizer handles script switches. The spaces between scripts act as natural token boundaries for BPE, but SentencePiece might learn cross-script tokens from bilingual data.
A well-designed tokenizer achieves several properties simultaneously:
| Property | What It Means | How to Measure |
|---|---|---|
| Low fertility | Few tokens per word | Average tokens/word across languages |
| No OOV | Every input can be tokenized | Byte-level fallback |
| Morphological alignment | Token boundaries match meaning boundaries | Manual inspection of splits |
| Fertility equity | Similar fertility across languages | Max/min fertility ratio |
| Compression efficiency | Fewer total tokens in the training corpus | Total tokens after tokenization |
No existing tokenizer achieves perfect equity. The best current approach is a combination: large vocabulary (100K+), balanced training corpus, and byte-level fallback. GPT-4's tokenizer is better than GPT-2's on all metrics, but significant gaps remain.
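The "fertility equity" row of the table can be made concrete with the max/min ratio it names. The sketch below reuses the fertility figures quoted earlier in this lesson; the function name is an assumption, and a ratio of 1.0 would mean perfect parity.

```python
# (GPT-2, GPT-4) fertility figures from the table earlier in this lesson
fertility_table = {
    "English":  (1.3, 1.2),
    "German":   (1.8, 1.4),
    "Chinese":  (2.5, 1.6),
    "Japanese": (2.8, 1.7),
    "Korean":   (3.8, 2.2),
    "Thai":     (4.2, 2.8),
    "Burmese":  (8.5, 5.1),
}

def equity_ratio(fertilities):
    """Max/min fertility across languages; 1.0 would be perfect parity."""
    values = list(fertilities)
    return max(values) / min(values)

gpt2_ratio = equity_ratio(f[0] for f in fertility_table.values())
gpt4_ratio = equity_ratio(f[1] for f in fertility_table.values())

print(f"GPT-2 max/min fertility ratio: {gpt2_ratio:.2f}")
print(f"GPT-4 max/min fertility ratio: {gpt4_ratio:.2f}")
```

On these figures the ratio drops between the two tokenizers but stays well above 1, which matches the claim that the gap has narrowed without closing.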
Tokenization failures are surprisingly common and have real consequences. Here are three failure modes you can observe in the explorer:
Glued tokens. GPT-2's tokenizer treats " cannot" (with a leading space) as a single token, but "cannot" (no leading space) as two tokens: "can" + "not". This means the model processes the same word differently depending on whether it starts a sentence. Sentence-initial "Cannot" gets different tokens than mid-sentence " cannot", creating an artificial asymmetry in the model's representations.
Number fragmentation. The number "123456789" tokenizes as multiple pieces, and the splitting point is arbitrary. "12345" might be ["123", "45"] while "12346" is ["123", "46"]. This means the model can't easily learn that 12345 < 12346, because the token representations change discontinuously with small changes in the number. This is one reason LLMs struggle with arithmetic.
URL and code tokenization. A URL like "https://example.com/path/to/page" might become 8+ tokens, with the split points falling in the middle of meaningful components. The model has to reconstruct the URL structure from fragments that don't align with semantic boundaries.
```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

# Number fragmentation
print(enc.encode("12345"))   # [4513, 1774] = 2 tokens
print(enc.encode("12346"))   # [4513, 2790] = 2 tokens (same first token, different second token)
print(enc.encode("123456"))  # [4513, 11738] = 2 tokens

# Space sensitivity
print(enc.encode("cannot"))   # [34762] = 1 token
print(enc.encode(" cannot"))  # [4250] = 1 token (DIFFERENT token!)

# Code tokenization
print(enc.encode("def __init__(self):"))
# Multiple tokens; underscores split awkwardly
```
Several recent works push beyond the standard BPE/WordPiece/Unigram triad:
ByT5 (Xue et al. 2022): Skips tokenization entirely. Processes raw UTF-8 bytes. Sequences are 3-5x longer, but the model never encounters OOV and has constant fertility across languages. The cost is quadratic attention on longer sequences.
MegaByte (Yu et al. 2023): Groups bytes into fixed-length patches (like image patches in ViT), processes patches with a large model, then generates bytes within each patch with a small model. This recovers much of the efficiency lost by byte-level processing.
Tokenizer-free models: An emerging research direction that replaces the embedding lookup table with a character-level encoder (like a small CNN over bytes). This eliminates the fixed vocabulary entirely, but is still in early research stages.
Tokenization sits at the foundation of every language model. The choices made here — BPE vs WordPiece vs Unigram, vocabulary size, training corpus composition — propagate through the entire system, affecting everything from embedding quality to multilingual fairness.
| Paper | Contribution | Connection |
|---|---|---|
| Neural Machine Translation of Rare Words with Subword Units (Sennrich 2016) | Adapted BPE from compression to NLP tokenization | Foundation of GPT tokenizers (Ch 2) |
| XLM-R (Conneau 2020) | 100-language model with cross-lingual transfer | Core of Ch 5 — shared multilingual embeddings |
| Language Model Tokenizers Introduce Unfairness Between Languages (Petrov 2023) | Systematic measurement of fertility inequality across languages | The equity argument in Ch 4 |
| Lecture | Relationship |
|---|---|
| L02: Word Vectors | Word2Vec assumes a word vocabulary. Subword tokenization extends this: each subword token gets its own embedding vector. BPE tokens become the atoms of the embedding space. |
| L05: Transformers | Vocabulary size directly determines the input embedding table and output projection layer. Attention cost is O(n²) where n is sequence length in tokens — so tokenizer efficiency directly impacts model speed. |
| L07: Pretraining | The pretraining corpus determines the tokenizer's merge statistics. Models pretrained primarily on English develop English-biased tokenizers. Multilingual pretraining (XLM-R) requires balanced tokenizer training to avoid fertility inequality. |
| Era | Tokenization | Limitation |
|---|---|---|
| Pre-2016 | Word-level vocabulary (50K-150K words) | OOV problem; can't handle new words |
| 2016 (Sennrich) | BPE for NMT; subword tokenization begins | English-centric; no multilingual concern |
| 2018 (BERT) | WordPiece with 30K vocab | Small vocab; high fertility for non-English |
| 2019 (GPT-2) | Byte-level BPE with 50K merges | Better coverage but still English-biased |
| 2020 (XLM-R) | SentencePiece with 250K vocab for 100 languages | Large vocab; curse of multilinguality |
| 2023 (GPT-4) | 100K vocab; better multilingual coverage | Still 2-5x fertility gap for many languages |
| Future? | Byte-level or tokenizer-free architectures | Computational cost of long sequences |
How the ideas in this lecture connect: from raw text through tokenization algorithms to multilingual models.
| Concept | Core Idea | Practical Impact |
|---|---|---|
| Subword Tokenization | Split text into pieces between chars and words | No OOV, manageable sequence length, morphology preserved |
| BPE | Iteratively merge most frequent adjacent pair | Used by GPT-2/3/4, deterministic, byte-level variant eliminates OOV |
| WordPiece | Merge by maximum likelihood (PMI) | Used by BERT, captures strong co-occurrence patterns |
| SentencePiece | Language-agnostic: treats spaces as characters | Essential for CJK and other languages written without spaces between words |
| Fertility | Tokens per word varies across languages | Cost, context, and quality inequality for non-English users |
| XLM-R | Shared multilingual representations enable cross-lingual transfer | Fine-tune in English, deploy in 100 languages |
Tokenization research is far from settled. Several open questions define the next frontier:
1. Can we close the fertility gap? Current tokenizers give English a 3-4x advantage over many languages. Language-balanced training, larger vocabularies, and byte-level fallbacks all help, but no existing approach achieves true parity. New architectures like MegaByte (processing bytes in patches) may fundamentally change the calculus.
2. Should tokenization be learned end-to-end? Currently, the tokenizer is trained separately from the model. What if the model could learn its own tokenization as part of pretraining? This would allow the model to discover the optimal granularity for each concept. Early work on character-aware models and byte-level transformers points in this direction.
3. Do we need tokenization at all? ByT5 showed that byte-level models can work. MegaByte showed they can be efficient. If inference hardware continues to get faster, the sequence length penalty of byte-level processing may become tolerable, eliminating the tokenizer entirely and all its biases.
4. Multimodal tokenization. As models process images, audio, and video alongside text, the tokenizer must handle all modalities. Chameleon's VQ-VAE tokenizes images into discrete tokens. SpeechTokenizer quantizes audio. Can we build a universal tokenizer that handles all modalities in a single vocabulary?