CS224N Lecture 14 — Tokenization and Multilinguality

Chapter 0: Why Tokenization?

You type "Hello world" into ChatGPT. Before a single neuron fires, something has to happen: those 11 characters need to become numbers. Neural networks don't read text; they process vectors of floats. The bridge between raw text and those vectors is tokenization.

Tokenization is the first decision in any NLP pipeline, and it's one of the most consequential. Get it wrong and the model wastes capacity learning to spell. Get it right and common words become single tokens while rare words decompose into meaningful pieces.

Here's the problem that makes tokenization hard: there are roughly 150,000 common English words. If we make each word a token, our vocabulary is 150K — the output layer of the model is a 150K-way classifier, which is expensive. And what happens when the model sees "ChatGPT" for the first time? It's not in our vocabulary. We have an out-of-vocabulary (OOV) problem.

We could go the other direction: make each character a token. There are only 256 possible byte values (or, if we work with Unicode characters instead, well over 100K assigned code points). No OOV problem. But now the word "understanding" is 13 tokens, and the model must learn that u-n-d-e-r-s-t-a-n-d-i-n-g forms a single concept. That's asking the model to do a lot of character-level pattern recognition that a simple lookup table could handle.

The sweet spot is subword tokenization: break text into pieces that balance vocabulary size against sequence length. Common words like "the" stay whole. Rare words like "tokenization" might split into "token" + "ization". The model gets the benefits of both approaches.

Characters to Tokens

Type any text and see how three strategies — character, word, and subword — carve it differently. Watch the token count change.

The simulation above shows the core tradeoff: characters produce many tokens (long sequences, expensive attention), words produce few tokens but can't handle unseen words, and subwords find the middle ground — manageable sequence lengths with no OOV problem.

Tokenization is not preprocessing — it's architecture. The choice of tokenizer determines vocabulary size (how big the embedding table is), sequence length (how much attention costs), and which languages get efficient representation. Every token the model wastes on spelling is a token it can't use for reasoning.

This lesson teaches the dominant subword methods (BPE, WordPiece, and SentencePiece with its Unigram model), shows why they create inequality across languages, and explores multilingual models that try to close the gap.

Why is character-level tokenization problematic for transformer models?

Characters can't represent all languages
It produces very long sequences, making attention O(n²) expensive, and forces the model to learn spelling patterns instead of using that capacity for semantics
Characters don't have meaningful embeddings

Chapter 1: Char vs Word vs Subword

Let's make the three approaches concrete. Take the sentence: "unhappiness is unlearnable". How does each strategy see it?

Character-level: ['u','n','h','a','p','p','i','n','e','s','s',' ','i','s',' ','u','n','l','e','a','r','n','a','b','l','e']. That's 26 tokens for 3 words. The model must learn that 'u','n','h','a','p','p','y' spells "unhappy" and that the 'un-' prefix means negation. Every morphological pattern must be learned from scratch through the character sequences.

Word-level: ['unhappiness', 'is', 'unlearnable']. Only 3 tokens — beautifully compact. But "unlearnable" might not be in our 50K-word vocabulary. It becomes [UNK] — the model literally has no idea what it means. And it can't even guess, because unlike humans who can decompose "un-learn-able", the word tokenizer treats every word as an atomic unit.

Subword: ['un', 'happiness', 'is', 'un', 'learn', 'able']. Six tokens. The model sees that both words start with 'un' (negation). It sees 'happiness' and 'learn' as familiar concepts. It sees 'able' as a suffix meaning "capable of." All the morphological structure is preserved, the sequence is compact, and nothing is [UNK].
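
The three splits above can be reproduced with a minimal sketch. The word vocabulary and the subword piece set below are hypothetical, chosen only to match this example, and the greedy longest-match splitter is a simple stand-in for a trained subword tokenizer, not BPE itself.

```python
def char_tokenize(text):
    """One token per character, spaces included."""
    return list(text)

def word_tokenize(text, vocab):
    """One token per whitespace-separated word; unknown words become [UNK]."""
    return [w if w in vocab else "[UNK]" for w in text.split()]

def subword_tokenize(text, pieces):
    """Greedy longest-match split of each word into known pieces."""
    tokens = []
    for word in text.split():
        start = 0
        while start < len(word):
            # Try the longest candidate first, shrinking until a piece matches.
            for end in range(len(word), start, -1):
                if word[start:end] in pieces:
                    tokens.append(word[start:end])
                    start = end
                    break
            else:
                # No known piece: fall back to a single character.
                tokens.append(word[start])
                start += 1
    return tokens

sentence = "unhappiness is unlearnable"
word_vocab = {"unhappiness", "is"}                    # hypothetical: "unlearnable" is missing
pieces = {"un", "happiness", "is", "learn", "able"}   # hypothetical subword vocabulary

chars = char_tokenize(sentence)
words = word_tokenize(sentence, word_vocab)
subwords = subword_tokenize(sentence, pieces)
print(len(chars), words, subwords)
```

The single-character fallback is what keeps the subword path free of [UNK] in this sketch; byte-level tokenizers achieve the same guarantee with a byte fallback.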

Three Tokenization Strategies

The same sentence split three ways. Notice how subword captures morphological structure that word-level loses and character-level obscures.

The Vocabulary-Sequence Tradeoff

Every tokenizer navigates a fundamental tradeoff between vocabulary size and sequence length:

Strategy	Vocab Size	Avg Tokens/Word	OOV?	Morphology?
Character	~256	~5	Never	Must learn
Word	50K-150K	1	Frequent	Lost
Subword	30K-100K	~1.3	Never	Preserved

Why does vocabulary size matter? The embedding table has shape [vocab_size, d_model]. For GPT-2 with vocab 50,257 and d_model 768, that's 38.6M parameters just for the embedding layer — about 31% of the total 124M parameter model. A character vocabulary (256) would shrink this to 197K parameters, but the sequences would be 4-5x longer, making attention O(n²) far more expensive.

The output layer has the mirrored shape: [d_model, vocab_size]. Unless the two matrices share weights (GPT-2 ties them, which is why its embedding is counted once above), the vocabulary cost is paid twice: once at the input embedding, once at the output projection. Even with Llama 2's 32K vocabulary the two matrices total hundreds of millions of parameters, and a ~100K vocabulary like GPT-4's costs proportionally more at the same d_model.

Cost_embedding = 2 × V × d_model

Where V is vocabulary size. For Llama 2 (V=32K, d=4096): 2 × 32,000 × 4,096 = 262M parameters. For a hypothetical character model (V=256): only 2M parameters for embeddings, but sequences 4x longer means attention costs 16x more.
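
The arithmetic above can be checked with a short sketch. The function names and the `tied` flag are illustrative additions; the sizes are the ones quoted in the text.

```python
def embedding_params(vocab_size, d_model, tied=False):
    """Parameters in the input embedding plus output projection.

    With tied weights the two matrices are shared, so the cost is paid once.
    """
    copies = 1 if tied else 2
    return copies * vocab_size * d_model

def attention_cost_ratio(length_multiplier):
    """Relative self-attention cost when sequences get this many times longer."""
    return length_multiplier ** 2

gpt2_one_table = embedding_params(50_257, 768, tied=True)
llama2_both = embedding_params(32_000, 4_096)
char_both = embedding_params(256, 4_096)

print(f"GPT-2 embedding table: {gpt2_one_table:,}")
print(f"Llama 2 input + output: {llama2_both:,}")
print(f"Character model input + output: {char_both:,}")
print(f"Attention cost at 4x length: {attention_cost_ratio(4)}x")
```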

Why Subword Won

Subword tokenization dominates because it hits the sweet spot on every axis:

Subword tokenization often captures morphology naturally. A frequent prefix like "un-" tends to surface as its own piece in words such as "unhappy", "unlearn", or "undo", although real tokenizers follow corpus statistics and do not guarantee linguistically clean splits. The model learns that this prefix signals negation from seeing it repeatedly, just as a child learns that "un-" reverses meaning. The tokenizer does the segmentation; the model learns the semantics.

Every major LLM uses subword tokenization. GPT-2/3/4 use Byte-Pair Encoding (BPE). BERT uses WordPiece. T5, Llama, and many multilingual models use SentencePiece (which implements BPE or Unigram underneath). The next three chapters explain how each works.

The word "unbelievable" appears in a sentence. A word-level tokenizer with 50K vocabulary has never seen this word. A subword tokenizer has learned pieces like "un", "believ", "able". What happens?

Both tokenizers handle it the same way
The word tokenizer maps it to [UNK], the subword tokenizer maps it to [UNK] too
The word tokenizer maps it to [UNK] (total information loss), while the subword tokenizer splits it into "un" + "believ" + "able", preserving all morphological structure

Chapter 2: BPE Algorithm

Byte-Pair Encoding started as a data compression algorithm in 1994 (Gage). Sennrich et al. (2016) adapted it for NLP in their paper "Neural Machine Translation of Rare Words with Subword Units," and it became the foundation of modern tokenization. GPT-2, GPT-3, GPT-4, and many other models use BPE.

The idea is beautifully simple: start with individual base symbols (characters, or bytes in byte-level BPE), then iteratively merge the most frequent pair of adjacent tokens into a single new token. After K merges, the vocabulary holds the base symbols plus K learned tokens: 256 + K in the byte-level case.

BPE Training: Step by Step

Let's walk through BPE on a tiny corpus: "low low low low low lowest lowest newer newer newer wider wider".

Step 0: Initialize. Split every word into characters, with a special end-of-word marker "</w>":

corpus
low: l o w </w>      (frequency: 5)
lowest: l o w e s t </w>  (frequency: 2)
newer: n e w e r </w>    (frequency: 3)
wider: w i d e r </w>    (frequency: 2)

Step 1: Count all adjacent pairs.

pair counts
(l, o): 5+2 = 7    (o, w): 5+2 = 7    (w, </w>): 5
(w, e): 2+3 = 5    (e, s): 2          (s, t): 2
(t, </w>): 2       (n, e): 3          (e, w): 3
(e, r): 3+2 = 5    (r, </w>): 3+2 = 5 (w, i): 2
(i, d): 2          (d, e): 2
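
The pair counts can be reproduced mechanically. This is a minimal sketch of the counting step only, using the toy corpus from the walkthrough; the function name is illustrative.

```python
from collections import Counter

# Toy corpus from the walkthrough: word -> frequency.
corpus = {"low": 5, "lowest": 2, "newer": 3, "wider": 2}

def count_pairs(corpus):
    """Count adjacent symbol pairs, weighting each word by its frequency."""
    counts = Counter()
    for word, freq in corpus.items():
        symbols = list(word) + ["</w>"]
        for left, right in zip(symbols, symbols[1:]):
            counts[(left, right)] += freq
    return counts

pair_counts = count_pairs(corpus)
for pair, count in pair_counts.most_common(5):
    print(pair, count)
```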

Step 2: Merge the most frequent pair. Tie between (l,o) and (o,w), both at 7. Pick (l,o). Create new token "lo".

after merge 1: l+o → lo
low: lo w </w>          (5)
lowest: lo w e s t </w>  (2)
newer: n e w e r </w>    (3)
wider: w i d e r </w>    (2)

Step 3: Re-count pairs. Now (lo, w) has frequency 7. Merge it into "low".

after merge 2: lo+w → low
low: low </w>          (5)
lowest: low e s t </w>  (2)
newer: n e w e r </w>   (3)
wider: w i d e r </w>   (2)

After 2 merges, "low" is now a single token. The algorithm continues: three pairs are now tied at frequency 5, namely (e,r), (r,</w>), and (low,</w>). If the next merge is (e,r), creating "er", the following candidates are (er,</w>) and (low,</w>). Each merge creates a new token and reduces total sequence length.

BPE Merge Step-Through

Click "Next Merge" to watch BPE build a vocabulary from individual characters. The most frequent pair is highlighted before each merge.


BPE at Inference Time

Training BPE produces a merge table: an ordered list of (pair → merged_token) rules. At inference time, we apply these rules in order to any new text.

For the text "lowest", assuming the merge table also contains the later merges e+s and es+t:

Init:    l o w e s t
  ↓ merge l+o
Merge 1: lo w e s t
  ↓ merge lo+w
Merge 2: low e s t
  ↓ merge e+s
Merge 3: low es t
  ↓ merge es+t
Final:   low est

The word "lowest" becomes ["low", "est"]. The model learns that "est" is a superlative suffix and "low" is a base word. This morphological decomposition emerges naturally from frequency statistics — no linguistic knowledge was programmed in.
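
Replaying a merge table at inference time can be sketched as follows. The four-entry merge table is hypothetical, chosen to match the "lowest" walkthrough, and the end-of-word marker is omitted to keep the output identical to the diagram.

```python
def apply_merge(symbols, pair):
    """Replace every adjacent occurrence of `pair` with the merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return merged

def bpe_encode(word, merge_table):
    """Tokenize one word by replaying the merge table in training order."""
    symbols = list(word)
    for pair in merge_table:
        symbols = apply_merge(symbols, pair)
    return symbols

# Hypothetical merge table matching the walkthrough.
merge_table = [("l", "o"), ("lo", "w"), ("e", "s"), ("es", "t")]
print(bpe_encode("lowest", merge_table))
print(bpe_encode("slowest", merge_table))
```

The second call shows the point of the design: a word never seen in training still decomposes into known pieces plus leftover characters.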

GPT-2's BPE: Byte-Level

GPT-2 introduced byte-level BPE: instead of starting from Unicode characters (a base alphabet that can run to well over 100K symbols), start from raw bytes (256 base symbols). This guarantees that any text in any language can be tokenized, even binary data. The base vocabulary is exactly 256, and all learned tokens are sequences of bytes.
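
The claim that a 256-symbol base covers any text can be checked directly. This sketch shows only the base layer before any merges; the helper names are illustrative, and real byte-level tokenizers add a printable remapping of bytes that is skipped here.

```python
def to_byte_tokens(text):
    """Map text to base token ids in the range 0..255 via UTF-8."""
    return list(text.encode("utf-8"))

def from_byte_tokens(ids):
    """Invert the mapping; lossless for any valid UTF-8 byte sequence."""
    return bytes(ids).decode("utf-8")

for sample in ["low", "na\u00efve", "猫", "🚀"]:
    ids = to_byte_tokens(sample)
    assert all(0 <= i < 256 for i in ids)     # never out of vocabulary
    assert from_byte_tokens(ids) == sample    # round trip is exact
    print(sample, len(sample), "chars ->", len(ids), "byte tokens")
```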

GPT-2 uses 50,000 merges, giving a total vocabulary of 50,257 (256 bytes + 50,000 merges + 1 special token). GPT-4's cl100k_base tokenizer uses ~100,000 merges for a vocabulary of ~100,277.

python
import tiktoken

# GPT-2 tokenizer (50,257 vocab)
enc_gpt2 = tiktoken.get_encoding("gpt2")
tokens = enc_gpt2.encode("Hello world")
# [15496, 995] — 2 tokens

# GPT-4 tokenizer (100,277 vocab)
enc_gpt4 = tiktoken.get_encoding("cl100k_base")
tokens = enc_gpt4.encode("Hello world")
# [9906, 1917] — 2 tokens (different IDs, same count)

# But rare words differ:
enc_gpt2.encode("antidisestablishmentarianism")
# 5 tokens with GPT-2
enc_gpt4.encode("antidisestablishmentarianism")
# 3 tokens with GPT-4 (larger vocab = more merges)

The Full BPE Training Algorithm

python
def train_bpe(corpus, num_merges):
    """Train a BPE merge table from a {word: frequency} dictionary."""
    # Step 0: Initialize with character-level tokenization
    vocab = {}  # word -> current list of symbols
    for word in corpus:
        vocab[word] = list(word) + ['</w>']

    merge_table = []  # ordered list of merges

    for i in range(num_merges):
        # Step 1: Count all adjacent pairs
        pair_counts = {}
        for word, freq in corpus.items():
            tokens = vocab[word]
            for j in range(len(tokens) - 1):
                pair = (tokens[j], tokens[j+1])
                pair_counts[pair] = pair_counts.get(pair, 0) + freq

        # Step 2: Find most frequent pair (ties go to the pair seen first)
        if not pair_counts:
            break  # every word is already a single token
        best_pair = max(pair_counts, key=pair_counts.get)

        # Step 3: Merge that pair everywhere
        new_token = best_pair[0] + best_pair[1]
        for word in vocab:
            tokens = vocab[word]
            new_tokens = []
            j = 0
            while j < len(tokens):
                if j < len(tokens)-1 and \
                   (tokens[j], tokens[j+1]) == best_pair:
                    new_tokens.append(new_token)
                    j += 2
                else:
                    new_tokens.append(tokens[j])
                    j += 1
            vocab[word] = new_tokens

        merge_table.append(best_pair)

    return merge_table

A naive implementation like the one above costs O(K × N) time, where K is the number of merges and N is the corpus size, because each merge rescans the entire corpus to count pairs and apply the merge. For GPT-2's 50K merges on a large corpus this is expensive, so production trainers update pair counts incrementally instead of recounting from scratch. Either way it's a one-time cost: the merge table is fixed forever after.

BPE produces a deterministic tokenization for any input text. Given the same merge table, the same input always produces the same tokens, with no randomness and no ambiguity. Different model families (GPT vs Llama vs Claude) produce different tokenizations of the same text because they learned different merge tables on different corpora.

BPE has performed 3 merges: (l,o)→lo, (lo,w)→low, (e,r)→er. How would it tokenize the new word "lower"?

['low', 'er'], applying merges in order: l+o→lo, lo+w→low, then e+r→er
['l', 'o', 'w', 'e', 'r'], since no merges apply
['lower'], because it's a common word

Chapter 3: WordPiece & SentencePiece

BPE isn't the only subword algorithm. Two important variants — WordPiece and SentencePiece — address different design choices. Understanding the differences explains why BERT tokens look different from GPT tokens.

WordPiece: Maximum Likelihood Merging

WordPiece (Schuster & Nakajima, 2012) was developed at Google for Japanese and Korean voice search and later adopted for BERT. It looks similar to BPE but uses a different merge criterion.

BPE merges the most frequent pair. WordPiece merges the pair that maximizes the likelihood of the training corpus. The difference is subtle but meaningful. Consider two pairs:

Pair	Frequency	Individual Frequencies	Likelihood Score
(t, h)	1000	t: 5000, h: 3000	1000 / (5000 × 3000) = 6.7×10^-5
(q, u)	200	q: 210, u: 2000	200 / (210 × 2000) = 4.8×10^-4

BPE picks (t,h) because frequency 1000 > 200. WordPiece picks (q,u) because the likelihood score is higher — whenever you see "q", it's almost always followed by "u", making "qu" a highly informative merge. The pair (t,h) is frequent but both "t" and "h" appear in many other contexts, so merging them captures less information.

score(a, b) = freq(ab) / (freq(a) × freq(b))

This is essentially pointwise mutual information (PMI), up to the logarithm and a constant factor: how much more likely is the pair than we'd expect if the two symbols were independent? A high score means the symbols are strongly associated, like "q" and "u".
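
The two merge criteria can be compared on the counts from the table. This is a sketch of the scoring rule only, not of a full WordPiece trainer; the function names are illustrative.

```python
def bpe_score(pair_freq, left_freq, right_freq):
    """BPE ranks candidate merges by raw pair frequency."""
    return pair_freq

def wordpiece_score(pair_freq, left_freq, right_freq):
    """WordPiece-style score: pair frequency normalized by the parts' frequencies."""
    return pair_freq / (left_freq * right_freq)

# Counts from the table: pair -> (freq(ab), freq(a), freq(b)).
candidates = {
    ("t", "h"): (1000, 5000, 3000),
    ("q", "u"): (200, 210, 2000),
}

bpe_choice = max(candidates, key=lambda p: bpe_score(*candidates[p]))
wp_choice = max(candidates, key=lambda p: wordpiece_score(*candidates[p]))
print("BPE merges:", bpe_choice)
print("WordPiece merges:", wp_choice)
```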

BERT's WordPiece in Practice

BERT uses WordPiece with a vocabulary of 30,522 tokens. Subwords that continue a word are prefixed with "##":

python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

tokenizer.tokenize("unhappiness is unlearnable")
# ['un', '##ha', '##pp', '##iness', 'is', 'un', '##learn', '##able']

tokenizer.tokenize("tokenization")
# ['token', '##ization']

tokenizer.tokenize("the cat sat")
# ['the', 'cat', 'sat'] — common words stay whole

The "##" prefix is critical: it tells the model that this piece continues the previous word. Without it, the model couldn't distinguish "able" as a standalone word from "able" as the suffix of "unlearnable". The prefix is part of the token's identity: "able" and "##able" are separate vocabulary entries with separate embeddings. Word-initial pieces such as "un" carry no prefix.
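
At inference time, WordPiece tokenizers of the BERT kind split each word greedily, longest match first. The sketch below illustrates that procedure with a hypothetical six-entry vocabulary; it is not BERT's actual vocabulary or code.

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split; non-initial pieces carry a '##' prefix."""
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation of the current word
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece matched: the whole word becomes unknown
        tokens.append(piece)
        start = end
    return tokens

# Hypothetical toy vocabulary; "able" and "##able" are separate entries.
vocab = {"un", "able", "token", "##learn", "##able", "##ization"}
print(wordpiece_tokenize("unlearnable", vocab))
print(wordpiece_tokenize("able", vocab))
print(wordpiece_tokenize("tokenization", vocab))
print(wordpiece_tokenize("xyz", vocab))
```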

SentencePiece: Language-Agnostic Tokenization

Both BPE and WordPiece, as originally described, assume pre-tokenized text in which words are already separated by spaces. This works for English but fails for Chinese, Japanese, and Thai, where words aren't space-separated. SentencePiece (Kudo & Richardson, 2018) instead treats the input as a raw stream of Unicode characters, spaces included.

The key insight: SentencePiece uses a special character "▁" (U+2581, lower one eighth block) to represent the space. Instead of splitting on spaces first, it treats spaces as ordinary characters. This makes tokenization reversible without language-specific rules, and it lets pieces include the space (or, if whitespace splitting is turned off, span word boundaries).

SentencePiece vs BPE tokenization
Input: "New York City"

BPE (pre-split): ["New", "York", "City"]
  → tokenize each word independently

SentencePiece: "▁New▁York▁City"
  → can learn "▁New▁York" as a single token if pieces are allowed to cross spaces

This is a significant advantage for multilingual models. In Chinese, "我喜欢自然语言处理" (I like NLP) has no spaces. BPE would need a separate word segmentation step. SentencePiece processes the raw string directly, learning character-level and phrase-level tokens from the corpus statistics.
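
The whitespace handling can be sketched without the SentencePiece library. The helper names and the example segmentations are hypothetical; the leading marker mirrors the library's default dummy prefix, and the sketch assumes the input has no leading spaces of its own.

```python
SPACE_MARK = "\u2581"  # "▁", the visible stand-in for a space

def escape_whitespace(text):
    """Prefix a marker and replace spaces so they become ordinary symbols."""
    return SPACE_MARK + text.replace(" ", SPACE_MARK)

def detokenize(pieces):
    """Concatenate pieces, turn markers back into spaces, drop the dummy prefix."""
    text = "".join(pieces).replace(SPACE_MARK, " ")
    return text[1:] if text.startswith(" ") else text

escaped = escape_whitespace("New York City")
english_pieces = [SPACE_MARK + "New", SPACE_MARK + "York", SPACE_MARK + "City"]
chinese_pieces = [SPACE_MARK + "我", "喜欢", "自然语言", "处理"]  # hypothetical segmentation

print(escaped)
print(detokenize(english_pieces))
print(detokenize(chinese_pieces))
```

Because the space lives inside the pieces, detokenization is plain concatenation for both the English and the Chinese input.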

Unigram Language Model

SentencePiece supports two algorithms: BPE (merge-based) and Unigram (probabilistic). The Unigram method is the opposite of BPE:

Property	BPE	Unigram
Direction	Bottom-up (merge pairs)	Top-down (prune large vocab)
Start	Characters (small vocab)	All substrings (huge vocab)
Each step	Add best merge	Remove least-useful token
Tokenization	Deterministic	Probabilistic (multiple valid segmentations)
Used by	GPT, Llama, Mistral	T5, ALBERT, mBART

Unigram starts with a large candidate vocabulary (all frequent substrings up to some length) and iteratively removes the token whose removal least increases the total corpus loss. It models the probability of each token appearing independently:

P(x₁, x₂, ..., x_n) = ∏_i P(x_i)

The Unigram method's killer feature is subword regularization: because it can score any segmentation probabilistically, you can sample different tokenizations of the same word during training. "tokenization" might be ["token", "ization"] in one training step and ["to", "ken", "ization"] in another. This acts as data augmentation and makes the model more robust.
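
A minimal sketch of Unigram scoring and sampling follows. The piece probabilities are invented for illustration, and the brute-force enumeration stands in for the lattice and Viterbi search that real implementations use.

```python
import math
import random

# Hypothetical unigram probabilities, invented for illustration.
probs = {"token": 0.04, "ization": 0.03, "to": 0.05,
         "ken": 0.01, "iz": 0.005, "ation": 0.02}

def segmentations(word, vocab):
    """Yield every way to split `word` into pieces from `vocab`."""
    if not word:
        yield []
        return
    for end in range(1, len(word) + 1):
        head = word[:end]
        if head in vocab:
            for rest in segmentations(word[end:], vocab):
                yield [head] + rest

def score(pieces):
    """Unigram probability: the product of independent piece probabilities."""
    return math.prod(probs[p] for p in pieces)

candidates = list(segmentations("tokenization", probs))
best = max(candidates, key=score)

# Subword regularization: sample a segmentation in proportion to its score.
rng = random.Random(0)
sampled = rng.choices(candidates, weights=[score(c) for c in candidates], k=1)[0]
print(best, sampled)
```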

BPE vs WordPiece Comparison

Toggle between BPE (frequency-based) and WordPiece (likelihood-based) merge criteria. Watch how the merge order differs.

BPE merges the most frequent pair. WordPiece merges the pair with the highest mutual information. In practice, BPE is simpler and dominates in GPT-family models. WordPiece is used by BERT. SentencePiece wraps either algorithm with language-agnostic preprocessing that treats spaces as regular characters — essential for Chinese, Japanese, Thai, and other languages without word boundaries.

SentencePiece treats spaces as regular characters and represents them with "▁". Why is this important for multilingual models?

It makes tokenization faster
Languages like Chinese and Japanese don't use spaces to separate words, so a tokenizer that requires pre-splitting on spaces can't process them without a separate word segmentation tool
Spaces contain semantic information that should be preserved

Chapter 4: Multilingual Tokenization

Here's an uncomfortable fact: the sentence "I love machine learning" is 4 tokens in GPT-4. The Korean equivalent "나는 기계 학습을 좋아합니다" is 13 tokens. The same meaning costs 3.25x more to process in Korean than in English. This isn't a bug — it's a direct consequence of how tokenizers are trained.

BPE learns merges from corpus statistics. English dominates most training corpora (often 50-90% of the data). So English pairs get merged aggressively: "the" is one token, "machine" is one token, "learning" is one token. Korean characters appear less frequently, so fewer merges happen. Each Korean syllable block might stay as 2-3 byte tokens.

This creates what researchers call the fertility problem: the number of tokens needed to represent the same meaning varies dramatically across languages.

Fertility Across Languages

Fertility is the average number of subword tokens per word (or per character, depending on definition). For a given tokenizer:

fertility(lang) = tokens(text) / words(text)
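
The definition translates directly into code. In this sketch the tokenizer is a stand-in with no learned merges (one token per UTF-8 byte), so the numbers show the worst case, not the behaviour of any real model's tokenizer. Counting words by whitespace is also an assumption that does not hold for languages such as Chinese or Thai.

```python
def fertility(text, tokenize):
    """Average tokens per whitespace-separated word for a given tokenizer."""
    words = text.split()
    return len(tokenize(text)) / len(words)

def byte_tokenize(text):
    """Stand-in tokenizer with no merges: one token per UTF-8 byte, spaces dropped."""
    return list(text.replace(" ", "").encode("utf-8"))

english = "I love machine learning"
korean = "나는 기계 학습을 좋아합니다"

print(f"English: {fertility(english, byte_tokenize):.2f} tokens/word")
print(f"Korean:  {fertility(korean, byte_tokenize):.2f} tokens/word")
```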

Language	Script	Fertility (GPT-2)	Fertility (GPT-4)
English	Latin	1.3	1.2
German	Latin	1.8	1.4
Chinese	CJK	2.5	1.6
Japanese	Mixed	2.8	1.7
Korean	Hangul	3.8	2.2
Thai	Thai	4.2	2.8
Burmese	Myanmar	8.5	5.1

Notice two things. First, fertility correlates with how much of each language appears in the training corpus. English dominates, so it has the lowest fertility. Second, GPT-4 is better than GPT-2 for all languages because it has a larger vocabulary (100K vs 50K) and more multilingual training data.

Fertility Calculator

See how many tokens each language needs to express the same sentence ("I love machine learning"). Longer bars = more tokens = more expensive.


Why Fertility Matters: The Cost Multiplier

High fertility has three concrete costs:

1. Financial cost. API pricing is per-token. If Korean uses 3x more tokens than English for the same content, Korean users pay 3x more. This is a real equity issue for commercial LLM services.

2. Context window waste. A 4K context window holds ~3000 English words but only ~1000 Thai words. Multilingual users get less "working memory" for the same context length.

3. Model capacity. More tokens means more attention computations, more positional encodings, more memory. The model spends capacity on character-level reconstruction instead of semantic understanding.
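
The first two costs are simple arithmetic on fertility. The sketch below uses the GPT-2 column of the fertility table and a 4,096-token window; the function names are illustrative, and real prices and context limits vary by provider.

```python
def words_in_context(context_tokens, fertility):
    """Approximate number of words that fit, given tokens per word."""
    return int(context_tokens / fertility)

def relative_cost(fertility, baseline_fertility):
    """Per-token price multiplier relative to a baseline language."""
    return fertility / baseline_fertility

# GPT-2 column of the fertility table.
gpt2_fertility = {"English": 1.3, "Korean": 3.8, "Thai": 4.2}

for lang, f in gpt2_fertility.items():
    words = words_in_context(4096, f)
    cost = relative_cost(f, gpt2_fertility["English"])
    print(f"{lang}: ~{words} words in a 4K context, {cost:.1f}x the English cost")
```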

Tokenization creates inequality. Users of low-resource languages pay more money, get shorter effective context, and receive worse model performance — all because the tokenizer was trained primarily on English text. Addressing this is an active research area. Petrov et al. (2023) in "Language Model Tokenizers Introduce Unfairness Between Languages" measured this cost gap systematically across 17 languages.

Approaches to Reducing Fertility Inequality

Several approaches attack the fertility problem:

Larger vocabularies. GPT-4's 100K vocabulary has lower fertility than GPT-2's 50K vocabulary for all languages. More merges = more multilingual tokens. But vocabulary can't grow indefinitely: the embedding table grows linearly, and the output softmax becomes more expensive.

Balanced training data. If the corpus has equal representation of all target languages, the tokenizer learns merges for each language proportionally. XLM-R uses exponential smoothing to up-sample low-resource languages during tokenizer training.

Language-specific tokenizers. Instead of one tokenizer for all languages, train separate tokenizers per language and combine them. This guarantees good fertility for each language but breaks cross-lingual transfer — "cat" and "猫" have no token-level relationship.

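Byte-level models. Models like ByT5 skip tokenization entirely and process raw bytes. No learned merges favor one language over another, but cost per character is still uneven, because UTF-8 spends 1 byte on a basic Latin letter and 3 bytes on a Thai or Hangul character. Sequences are also 3-5x longer, making training and inference more expensive.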
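
How byte-level sequence length depends on script can be measured directly. The sample words are arbitrary illustrations (all mean roughly "cat", except the German "Käse"), and the helper name is hypothetical.

```python
samples = {
    "English": "cat",
    "German": "K\u00e4se",
    "Chinese": "猫",
    "Thai": "แมว",
}

def bytes_per_char(text):
    """Average UTF-8 bytes per Unicode code point."""
    return len(text.encode("utf-8")) / len(text)

for lang, text in samples.items():
    print(f"{lang}: {bytes_per_char(text):.2f} bytes per character")
```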

A Thai user and an English user both send the same meaning to a pay-per-token LLM API. The Thai user's message uses 4.2x more tokens. What are the consequences?

Only financial: the Thai user pays more
Only performance: the model is worse in Thai
Financial (pays 4.2x more), context (4.2x less working memory in the same context window), and quality (model spends capacity on character reconstruction instead of semantics)

Chapter 5: XLM-R

Can a single model understand 100 languages? XLM-R (Cross-lingual Language Model — RoBERTa, Conneau et al. 2020) proved it can. Trained on 2.5TB of text in 100 languages with a shared SentencePiece tokenizer, XLM-R demonstrated that multilingual pretraining creates cross-lingual transfer: fine-tune on English, deploy in any of the 100 languages.

This is remarkable. You label training data only in English (cheap, abundant), fine-tune XLM-R on it, and the model works in Korean, Arabic, and Swahili — without ever seeing labeled data in those languages. How?

Cross-Lingual Transfer: The Mechanism

The key is shared subword tokens. Many languages share script (Latin alphabet covers English, French, German, Spanish, etc.). Even across scripts, shared concepts create overlapping embedding spaces through co-occurrence patterns.

During pretraining, XLM-R learns that "cat" (English), "chat" (French), and "gato" (Spanish) appear in similar contexts: they follow "the"/"le"/"el" and precede "is"/"est"/"es". The masked language modeling objective forces the model to develop representations where semantically equivalent words in different languages occupy nearby points in the embedding space.

Cross-Lingual Embedding Space

Words with similar meanings cluster together regardless of language. Click to highlight a concept cluster and see how translations from different languages map to the same region.

XLM-R Architecture

XLM-R is architecturally identical to RoBERTa — a transformer encoder with masked language modeling. The only differences are scale and data:

Property	RoBERTa	XLM-R Base	XLM-R Large
Languages	1 (English)	100	100
Training data	160GB	2.5TB	2.5TB
Vocab size	50K BPE	250K SP	250K SP
Parameters	125M / 355M	270M	550M
d_model	768 / 1024	768	1024
Layers	12 / 24	12	24

The 250K vocabulary is massive compared to RoBERTa's 50K. This is necessary: 100 languages means 100 sets of common words that need dedicated tokens. Even with 250K tokens, low-resource languages still have higher fertility than English. But the increased vocabulary provides dramatically better coverage than using a 50K English-centric vocabulary for all languages.

The Curse of Multilinguality

Cross-lingual transfer isn't free. Conneau et al. identified the curse of multilinguality: for a fixed model capacity, adding more languages eventually hurts per-language performance. A model trained on 100 languages is worse at English than one trained only on English, given the same parameter count.

The intuition is capacity dilution. A 270M parameter model has finite representational capacity. If it must represent 100 languages, each language gets roughly 1/100th of the capacity. Some capacity is shared (universal syntactic patterns, shared vocabulary), but language-specific phenomena (word order, morphology, semantics) compete for the remaining capacity.

The solution: scale up. XLM-R Large (550M) closes most of the gap with monolingual RoBERTa. Larger models have enough capacity for both shared cross-lingual structure and language-specific knowledge. This is why modern multilingual models (Llama, GPT-4, PaLM) are very large — multilinguality demands it.

Sampling Strategy: Exponential Smoothing

A naive approach trains on data proportional to each language's size in Common Crawl. But English has 300x more data than Urdu. If we sample proportionally, the model barely sees Urdu.

XLM-R uses exponential smoothing: adjust the sampling probability with a temperature parameter α:

P(L_i) = (n_i)^α / ∑_j (n_j)^α

Where n_i is the number of sentences in language i. With α=1 (proportional sampling), English dominates. With α=0 (uniform sampling), each language gets equal time. XLM-R uses α=0.3, which up-samples low-resource languages significantly while still giving high-resource languages more data (because they have more useful patterns to learn).

python
# Exponential smoothing example
sizes = {'en': 300, 'fr': 50, 'ur': 1}  # relative sizes
alpha = 0.3

smoothed = {k: v**alpha for k, v in sizes.items()}
total = sum(smoothed.values())
probs = {k: v/total for k, v in smoothed.items()}
# en: 0.57, fr: 0.33, ur: 0.10
# Compare to proportional: en: 0.85, fr: 0.14, ur: 0.003
# Urdu goes from 0.3% to about 10% of training samples!

Cross-lingual transfer works because multilingual pretraining aligns semantically equivalent words across languages in a shared embedding space. Fine-tune on English labeled data, and the model transfers to 100 languages because "cat", "chat", and "gato" live near each other in representation space. The cost is the curse of multilinguality: fixed capacity divided among more languages means less per-language performance, solved by scaling up.

XLM-R is fine-tuned for sentiment analysis using only English labeled data. It then achieves 80% accuracy on Chinese sentiment data it has never been fine-tuned on. How is this possible?

The model memorized Chinese sentiment patterns during pretraining
Multilingual pretraining aligned English and Chinese representations so that semantically equivalent words occupy similar regions of the embedding space, and the English fine-tuning signal transfers through these shared representations
Chinese and English share enough vocabulary for direct transfer

Chapter 6: Tokenizer Explorer

Time to put everything together. This interactive explorer lets you type any text and see how different tokenization strategies handle it. Watch how fertility changes across simulated tokenizer configurations, and explore why some scripts are expensive while others are cheap.

Tokenizer Explorer

Type text in any language to see character, BPE, and word tokenizations side by side. Toggle between small (32K) and large (100K) vocabularies. Try mixing languages!


Try these experiments to build intuition:

Experiment 1: Language comparison. Type "I love programming" and note the token count. Then try "나는 프로그래밍을 좋아합니다" (same meaning in Korean). The Korean version uses 2-3x more tokens despite expressing the same idea.

Experiment 2: Code. Try pasting a Python one-liner like def hello(): return "world". Programming keywords are common in training data, so "def", "return" are single tokens. But unusual variable names fragment into subwords.

Experiment 3: Emoji. Type "😀🎉🚀". Each of these emoji is 4 bytes in UTF-8. A byte-level BPE tokenizer without emoji merges will produce up to 4 tokens per emoji. With a large vocabulary that includes common emoji merges, each emoji becomes 1 token.

Experiment 4: Mixed scripts. Type "Hello こんにちは Bonjour" and see how the tokenizer handles script switches. The spaces between scripts act as natural token boundaries for BPE, but SentencePiece might learn cross-script tokens from bilingual data.

The tokenizer explorer reveals the hidden biases in any NLP system. Languages with Latin scripts and high representation in training data get compact tokenizations. Languages with unique scripts and less training data are fragmented into many tokens. This isn't just an academic concern — it directly affects cost, speed, and quality for billions of non-English speakers.

What Good Tokenization Looks Like

A well-designed tokenizer achieves several properties simultaneously:

Property	What It Means	How to Measure
Low fertility	Few tokens per word	Average tokens/word across languages
No OOV	Every input can be tokenized	Byte-level fallback
Morphological alignment	Token boundaries match meaning boundaries	Manual inspection of splits
Fertility equity	Similar fertility across languages	Max/min fertility ratio
Compression efficiency	Fewer total tokens in the training corpus	Total tokens after tokenization

No existing tokenizer achieves perfect equity. The best current approach is a combination: large vocabulary (100K+), balanced training corpus, and byte-level fallback. GPT-4's tokenizer has lower fertility than GPT-2's for every language in the Chapter 4 fertility table, but significant gaps remain.
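
The "max/min fertility ratio" from the table can be computed from the Chapter 4 numbers. This sketch only restates those values; the dictionary layout and function name are illustrative.

```python
# (GPT-2, GPT-4) fertility values copied from the table in Chapter 4.
fertility = {
    "English": (1.3, 1.2),
    "German": (1.8, 1.4),
    "Chinese": (2.5, 1.6),
    "Japanese": (2.8, 1.7),
    "Korean": (3.8, 2.2),
    "Thai": (4.2, 2.8),
    "Burmese": (8.5, 5.1),
}

def equity_ratio(values):
    """Max/min fertility ratio; 1.0 would mean perfect equity."""
    return max(values) / min(values)

gpt2_ratio = equity_ratio([gpt2 for gpt2, _ in fertility.values()])
gpt4_ratio = equity_ratio([gpt4 for _, gpt4 in fertility.values()])
print(f"GPT-2 max/min fertility: {gpt2_ratio:.2f}")
print(f"GPT-4 max/min fertility: {gpt4_ratio:.2f}")
```

The ratio shrinks with the larger vocabulary but stays well above 1, which is the gap the paragraph above refers to.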

The Cost of Getting Tokenization Wrong

Tokenization failures are surprisingly common and have real consequences. Here are three failure modes you can observe in the explorer:

Glued tokens. GPT-2's tokenizer treats " cannot" (with a leading space) as a single token, but "cannot" (no leading space) as two tokens: "can" + "not". This means the model processes the same word differently depending on whether it starts a sentence. Sentence-initial "Cannot" gets different tokens than mid-sentence " cannot", creating an artificial asymmetry in the model's representations.

Number fragmentation. The number "123456789" tokenizes as multiple pieces, and the split points follow the tokenizer's rules, not place value. "12345" might be ["123", "45"] while "12346" is ["123", "46"]: the two numbers differ by one, yet their final tokens "45" and "46" are unrelated vocabulary entries. This means the model can't easily learn that 12345 < 12346, because the token representations change discontinuously with small changes in the number. This is one reason LLMs struggle with arithmetic.

URL and code tokenization. A URL like "https://example.com/path/to/page" might become 8+ tokens, with the split points falling in the middle of meaningful components. The model has to reconstruct the URL structure from fragments that don't align with semantic boundaries.

python
import tiktoken
enc = tiktoken.get_encoding("cl100k_base")

# Number fragmentation
print(enc.encode("12345"))   # [4513, 1774] = 2 tokens
print(enc.encode("12346"))   # [4513, 2790] = 2 tokens (same split, different second token)
print(enc.encode("123456"))  # [4513, 11738] = 2 tokens

# Space sensitivity (cl100k_base here; the "can" + "not" split described above is GPT-2's tokenizer)
print(enc.encode("cannot"))   # [34762] = 1 token
print(enc.encode(" cannot"))  # [4250] = 1 token (DIFFERENT token!)

# Code tokenization
print(enc.encode("def __init__(self):"))
# Multiple tokens — underscores split awkwardly
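
The digit grouping in the number examples above can be imitated without a real tokenizer. This sketch assumes digits are chunked left to right into groups of at most three, which matches the splits shown above. `chunk_digits` and `chunk_magnitudes` are hypothetical helpers, and a real tokenizer may group digits differently.

```python
import re
from typing import List, Tuple

def chunk_digits(number: str, width: int = 3) -> List[str]:
    """Split a digit string left to right into groups of at most `width` digits."""
    return re.findall(r"\d{1,%d}" % width, number)

def chunk_magnitudes(number: str, width: int = 3) -> List[Tuple[str, int]]:
    """Pair each chunk with the power of ten it is implicitly multiplied by."""
    chunks = chunk_digits(number, width)
    result = []
    remaining = len(number)
    for chunk in chunks:
        remaining -= len(chunk)
        result.append((chunk, 10 ** remaining))
    return result

for n in ["1234", "12345", "12346", "123456"]:
    print(n, chunk_magnitudes(n))
```

The chunk "123" is worth 123 x 10 in "1234", 123 x 100 in "12345", and 123 x 1000 in "123456". The token itself carries no hint of which case applies, so the model has to infer magnitude from how many digits follow.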

Recent Approaches: Beyond Classical BPE

Several recent works push beyond the standard BPE/WordPiece/Unigram triad:

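ByT5 (Xue et al. 2022): Skips subword tokenization entirely and processes raw UTF-8 bytes. Sequences are 3-5x longer, but the model never encounters OOV and no language is penalized by a learned vocabulary. Fertility is not fully equalized, though: UTF-8 spends one byte on each ASCII character but two to four on characters from most other scripts. The cost is quadratic attention over the longer sequences.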
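
Byte-level input removes the vocabulary, but sequence length per character still depends on the script because UTF-8 is a variable-width encoding. A short check using only the standard library (the sample words are illustrative choices, not from the lecture):

```python
def bytes_per_char(text: str) -> float:
    """Average number of UTF-8 bytes needed per Unicode code point."""
    return len(text.encode("utf-8")) / len(text)

samples = {
    "English": "hello",
    "Russian": "привет",
    "Chinese": "你好",
}

for language, word in samples.items():
    print(language, len(word.encode("utf-8")), "bytes,", bytes_per_char(word), "per character")
```

English comes out at 1 byte per character, Russian at 2, and Chinese at 3. A byte-level model therefore sees longer sequences for non-Latin scripts even though nothing is ever out of vocabulary.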

MegaByte (Yu et al. 2023): Groups bytes into fixed-length patches (like image patches in ViT), processes patches with a large model, then generates bytes within each patch with a small model. This recovers much of the efficiency lost by byte-level processing.
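
The patching step alone can be sketched as follows. This shows only how a byte sequence is cut into fixed-length patches, not the large and small models that consume them. The function name `to_patches` and the zero pad byte are assumptions made for illustration.

```python
from typing import List

def to_patches(text: str, patch_size: int, pad_byte: int = 0) -> List[List[int]]:
    """Encode text as UTF-8 and cut it into fixed-length patches, padding the last one."""
    data = list(text.encode("utf-8"))
    remainder = len(data) % patch_size
    if remainder:
        data.extend([pad_byte] * (patch_size - remainder))
    return [data[i:i + patch_size] for i in range(0, len(data), patch_size)]

patches = to_patches("tokenizer", patch_size=4)
print(len(patches), patches)
```

The 9 bytes of "tokenizer" become 3 patches of 4 bytes each. The large model attends over 3 positions instead of 9, which is where the efficiency gain comes from.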

Tokenizer-free models: An emerging research direction that replaces the embedding lookup table with a character-level encoder (like a small CNN over bytes). This eliminates the fixed vocabulary entirely, but is still in early research stages.

A user types "12345" and "12346" into a language model. The two 5-digit numbers are numerically adjacent but map to different token sequences. Why does this make arithmetic harder for the model?

The model can't represent numbers at all
The numbers are too large for the vocabulary
The token representations change discontinuously: 12345 and 12346 map to unrelated token IDs despite being numerically adjacent. The model can't learn a smooth "number line" because the tokenizer fragments numbers at boundaries that ignore place value

Chapter 7: Connections

Tokenization sits at the foundation of every language model. The choices made here — BPE vs WordPiece vs Unigram, vocabulary size, training corpus composition — propagate through the entire system, affecting everything from embedding quality to multilingual fairness.

Key Papers

Paper	Contribution	Connection
Neural Machine Translation of Rare Words with Subword Units (Sennrich et al. 2016)	Adapted BPE from compression to NLP tokenization	Foundation of GPT tokenizers (Ch 2)
XLM-R (Conneau et al. 2020)	100-language model with cross-lingual transfer	Core of Ch 5: shared multilingual embeddings
Language Model Tokenizers Introduce Unfairness Between Languages (Petrov et al. 2023)	Systematic measurement of fertility inequality across languages	The equity argument in Ch 4

Lecture Connections

Lecture	Relationship
L02: Word Vectors	Word2Vec assumes a word vocabulary. Subword tokenization extends this: each subword token gets its own embedding vector. BPE tokens become the atoms of the embedding space.
L05: Transformers	Vocabulary size directly determines the input embedding table and output projection layer. Attention cost is O(n²) where n is sequence length in tokens — so tokenizer efficiency directly impacts model speed.
L07: Pretraining	The pretraining corpus determines the tokenizer's merge statistics. Models pretrained primarily on English develop English-biased tokenizers. Multilingual pretraining (XLM-R) requires balanced tokenizer training to avoid fertility inequality.
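
The two costs named in the L05 row can be put into numbers. This is back-of-the-envelope arithmetic, not a profile of any real model: the hidden size of 768 and the round vocabulary sizes are assumed for illustration.

```python
def embedding_params(vocab_size: int, d_model: int) -> int:
    """Parameters in one vocab_size x d_model embedding (or output projection) matrix."""
    return vocab_size * d_model

def relative_attention_cost(length_ratio: float) -> float:
    """Multiplier on the O(n^2) attention term when sequence length grows by length_ratio."""
    return length_ratio ** 2

D_MODEL = 768  # assumed hidden size
for vocab_size in (30_000, 50_000, 100_000, 250_000):
    print(vocab_size, embedding_params(vocab_size, D_MODEL))

# A language whose text needs 3x as many tokens pays 9x on the attention term.
print(relative_attention_cost(3.0))
```

Growing the vocabulary from 30K to 250K raises the embedding table from about 23M to 192M parameters at this hidden size. A 3x fertility gap multiplies the quadratic attention term by 9. That is the tradeoff a tokenizer designer faces between vocabulary size and sequence length.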

Historical Arc

Era	Tokenization	Limitation
Pre-2016	Word-level vocabulary (50K-150K words)	OOV problem; can't handle new words
2016 (Sennrich)	BPE for NMT; subword tokenization begins	English-centric; no multilingual concern
2018 (BERT)	WordPiece with 30K vocab	Small vocab; high fertility for non-English
2019 (GPT-2)	Byte-level BPE with 50K merges	Better coverage but still English-biased
2020 (XLM-R)	SentencePiece with 250K vocab for 100 languages	Large vocab; curse of multilinguality
2023 (GPT-4)	100K vocab; better multilingual coverage	Still 2-5x fertility gap for many languages
Future?	Byte-level or tokenizer-free architectures	Computational cost of long sequences

The Big Picture

Tokenization Concept Map

How the ideas in this lecture connect: from raw text through tokenization algorithms to multilingual models.

What We Covered

Concept	Core Idea	Practical Impact
Subword Tokenization	Split text into pieces between chars and words	No OOV, manageable sequence length, morphology preserved
BPE	Iteratively merge most frequent adjacent pair	Used by GPT-2/3/4, deterministic, byte-level variant eliminates OOV
WordPiece	Merge the pair that most increases corpus likelihood (a PMI-like score)	Used by BERT, captures strong co-occurrence patterns
SentencePiece	Language-agnostic: treats spaces as characters	Essential for CJK and other languages written without spaces between words
Fertility	Tokens per word vary across languages	Cost, context, and quality inequality for non-English users
XLM-R	Shared multilingual representations enable cross-lingual transfer	Fine-tune in English, deploy in 100 languages
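
As a compact reminder of the BPE row, the loop "iteratively merge the most frequent adjacent pair" can be sketched as below. This is a character-level toy on a whitespace-split corpus, with hypothetical function names. Production tokenizers add byte-level fallback, pre-tokenization rules, and explicit tie-breaking.

```python
from collections import Counter
from typing import Dict, List, Optional, Tuple

Word = Tuple[str, ...]
Pair = Tuple[str, str]

def most_frequent_pair(words: Dict[Word, int]) -> Optional[Pair]:
    """Count adjacent symbol pairs, weighted by how often each word occurs."""
    pairs: Counter = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get) if pairs else None

def merge_pair(words: Dict[Word, int], pair: Pair) -> Dict[Word, int]:
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged: Dict[Word, int] = {}
    for symbols, freq in words.items():
        out: List[str] = []
        i = 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

def train_bpe(corpus: str, num_merges: int) -> Tuple[List[Pair], Dict[Word, int]]:
    """Learn up to `num_merges` merges, starting from single characters."""
    words: Dict[Word, int] = {tuple(w): c for w, c in Counter(corpus.split()).items()}
    merges: List[Pair] = []
    for _ in range(num_merges):
        pair = most_frequent_pair(words)
        if pair is None:
            break
        merges.append(pair)
        words = merge_pair(words, pair)
    return merges, words

merges, words = train_bpe("low low low lower lowest", num_merges=2)
print(merges)
print(words)
```

After two merges the frequent word "low" is a single symbol, while "lower" and "lowest" are "low" followed by individual characters. This is the behavior the table describes: common strings become whole tokens and rare ones decompose.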

Looking Ahead

Tokenization research is far from settled. Several open questions define the next frontier:

1. Can we close the fertility gap? Current tokenizers give English a 3-4x advantage over many languages. Language-balanced training, larger vocabularies, and byte-level fallbacks all help, but no existing approach achieves true parity. New architectures like MegaByte (processing bytes in patches) may fundamentally change the calculus.

2. Should tokenization be learned end-to-end? Currently, the tokenizer is trained separately from the model. What if the model could learn its own tokenization as part of pretraining? This would allow the model to discover the optimal granularity for each concept. Early work on character-aware models and byte-level transformers points in this direction.

3. Do we need tokenization at all? ByT5 showed that byte-level models can work. MegaByte showed they can be efficient. If inference hardware continues to get faster, the sequence length penalty of byte-level processing may become tolerable, eliminating the tokenizer entirely and all its biases.

4. Multimodal tokenization. As models process images, audio, and video alongside text, the tokenizer must handle all modalities. Chameleon's VQ-VAE tokenizes images into discrete tokens. SpeechTokenizer quantizes audio. Can we build a universal tokenizer that handles all modalities in a single vocabulary?
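
The idea behind the vector-quantized tokenizers mentioned in question 4 can be illustrated with a nearest-codebook lookup. This is a sketch of the quantization step only, with a hand-written hypothetical codebook and 2-dimensional features. It is not the tokenizer of Chameleon or any other system, which learn both the encoder and the codebook.

```python
from typing import List, Sequence

def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))

def quantize(vectors: Sequence[Sequence[float]], codebook: Sequence[Sequence[float]]) -> List[int]:
    """Map each continuous vector to the index of its nearest codebook entry."""
    return [
        min(range(len(codebook)), key=lambda k: squared_distance(v, codebook[k]))
        for v in vectors
    ]

# Hypothetical 3-entry codebook and three image-patch feature vectors.
codebook = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
patch_features = [(0.9, 0.1), (0.1, 0.8), (0.05, 0.0)]
tokens = quantize(patch_features, codebook)
print(tokens)
```

The resulting indices play the same role as text token IDs: they index an embedding table, which is what makes a shared vocabulary across modalities conceivable.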

"The beginning of wisdom is the definition of terms." — Socrates. In NLP, the beginning of understanding is the definition of tokens. Every architectural decision downstream — embedding size, context length, attention pattern — is shaped by how the tokenizer carves raw text into the atoms of meaning.