Tokenization Cost (Ahia 2023)

Chapter 0: The Hidden Tax

You and your colleague both use GPT-4's API. You write your prompt in English: "Summarize this article about climate change." Your colleague writes the same prompt in Yoruba, a language spoken by 50 million people in Nigeria. You get charged for 7 tokens. Your colleague gets charged for 45 tokens. Same question. Same model. 6.4x the cost.

This isn't a hypothetical. It's a real, measurable consequence of how modern language models tokenize text. And it affects billions of people.

Commercial LLM APIs (OpenAI, Anthropic, Google) price by the token. The number of tokens depends on the tokenizer — the algorithm that breaks text into the atomic units the model processes. Tokenizers are trained on data, and that data is overwhelmingly English.

Language	Speakers	Tokens for "Hello, how are you?"	Cost relative to English
English	1.5B	6	1.0x
Spanish	550M	8	1.3x
Hindi	600M	18	3.0x
Yoruba	50M	32	5.3x
Myanmar (Burmese)	33M	42	7.0x

The core finding: Tokenizers trained on English-dominated data fragment non-English text into many more tokens. The same semantic content requires 2-15x more tokens in underrepresented languages. Since APIs charge per token, this creates a de facto language tax — users of non-English languages pay more for the same service. This isn't a bug. It's a direct consequence of training data composition and tokenizer design.

Think of it like a highway toll system where the toll is per axle. English drives a sedan (few tokens). Yoruba drives an 18-wheeler carrying the same cargo (many tokens). Same distance, same cargo, 6x the toll.

[Interactive figure: The Language Tax. The same sentence tokenized in each language; more tokens means a higher API cost for the same semantic content.]

Why do non-English speakers pay more to use LLM APIs?

A) Tokenizers trained on English-heavy data produce many more tokens for non-English text. Since APIs charge per token, the same semantic content costs 2-15x more in underrepresented languages. This is a direct consequence of training data composition, not intentional pricing discrimination.
B) API providers charge different rates for different languages
C) Non-English text requires more compute to process

Chapter 1: Tokenization Disparity

To understand the problem, we need to understand what tokenizers actually do to non-English text. Let's trace how GPT-4's tokenizer (cl100k_base, byte-level BPE with 100,256 tokens) handles the same sentence in different languages.

The fragmentation problem

BPE builds its vocabulary by merging the most frequent character pairs in the training data. Since the training data is ~90% English, English character sequences get merged into large tokens. Non-English sequences — especially those using non-Latin scripts — remain fragmented.

python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # GPT-4 tokenizer

# English: efficient — common words are single tokens
len(enc.encode("The weather is beautiful today"))
# 5 tokens: ['The', ' weather', ' is', ' beautiful', ' today']

# Hindi: fragmented — Devanagari characters rarely merged
len(enc.encode("आज मौसम बहुत सुंदर है"))  # same meaning
# 22 tokens: each syllable or even byte is a separate token

# Yoruba: heavily fragmented — diacritics break merges
len(enc.encode("Ojú ọjọ́ dára lónìí"))  # same meaning
# 18 tokens: diacritical marks prevent efficient merging

# Myanmar: extremely fragmented — unique script
len(enc.encode("ဒီနေ့ ရာသီဥတု လှပါတယ်"))  # same meaning
# 35 tokens: almost byte-level decomposition

The Hindi version says exactly the same thing as the English version. But it uses 4.4x as many tokens. Why? Because the BPE tokenizer saw relatively little Devanagari text during training, so it never learned to merge Devanagari character pairs into efficient subword tokens.

Fertility: the metric

Ahia et al. define fertility as the ratio of the number of tokens produced by the tokenizer to the number of words in the original text. High fertility means more fragmentation:

Fertility = # tokens / # words
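
The ratio above can be computed directly. The sketch below is a minimal illustration, assuming words are whitespace-separated (a simplification that breaks down for scripts written without spaces) and using raw UTF-8 bytes as a stand-in tokenizer. The names `fertility` and `byte_tokenizer` are illustrative and do not come from the paper.

```python
from typing import Callable, Sequence


def fertility(text: str, encode: Callable[[str], Sequence]) -> float:
    """Tokens per whitespace-separated word for one piece of text."""
    words = text.split()
    if not words:
        raise ValueError("text contains no words")
    return len(encode(text)) / len(words)


def byte_tokenizer(text: str) -> list:
    """Stand-in tokenizer (an assumption): one token per UTF-8 byte."""
    return list(text.encode("utf-8"))


english = "The weather is beautiful today"
hindi = "आज मौसम बहुत सुंदर है"

# With the byte stand-in, Devanagari words cost more tokens than Latin ones
# simply because each character occupies more bytes.
print(f"English: {fertility(english, byte_tokenizer):.1f} tokens/word")
print(f"Hindi:   {fertility(hindi, byte_tokenizer):.1f} tokens/word")
```

To measure a real tokenizer instead, pass its encode function as the second argument, for example `tiktoken.get_encoding("cl100k_base").encode`.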

Language	Script	Fertility (GPT-4)	Interpretation
English	Latin	1.0	Baseline — roughly 1 token per word
Spanish	Latin	1.2	Close to English, shared Latin script
Chinese	Hanzi	2.3	Each character tends to be a separate token
Hindi	Devanagari	4.1	Heavy fragmentation of syllables
Yoruba	Latin+diacritics	6.5	Diacritics prevent merges
Myanmar	Myanmar	11.2	Near byte-level decomposition

Fertility reveals the bias. A perfectly equitable tokenizer would have similar fertility across languages. In practice, fertility ranges from 1.0 (English) to 15+ (some African and Southeast Asian languages). At the top of that range, a speaker's "context window" is effectively 15x smaller than an English speaker's: the same 4,096-token window holds roughly one-fifteenth of the semantic content. For Myanmar under GPT-4's tokenizer (fertility 11.2), the reduction is about 11x.

[Interactive figure: Fertility Comparison. Bar chart of tokens needed per word in each language; higher bars mean lower efficiency and higher cost.]

Generation quality degrades too

Tokenization disparity doesn't just affect input costs — it directly degrades the quality of generated text. When an LLM generates text token-by-token, high-fertility languages require more sequential decisions per word. Each decision is a chance for error, and errors compound:

python
# Error compounding in generation
#
# English: "understanding" = 1 token
# → 1 decision, accuracy: 99%
#
# Myanmar equivalent = 8 byte tokens
# → 8 sequential decisions
# → If each byte is 99% accurate and errors are independent: 0.99^8 ≈ 92% word accuracy
# → For a 100-word sentence: 0.99^800 ≈ 0.03% chance of ALL correct
#
# English 100-word sentence: 0.99^100 = 36.6% chance of ALL correct
# The gap is enormous — and it's a DIRECT consequence of tokenization

This helps explain why LLMs produce lower-quality output in underrepresented languages — the tokenizer forces the model to make more low-level byte-by-byte decisions, each compounding potential errors. It's not just about seeing less training data — the representation itself is fundamentally less efficient.
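
The compounding argument can be checked numerically. This is a back-of-the-envelope sketch, assuming every token is predicted correctly with the same probability and that errors are independent; real models violate both assumptions, so treat the output as an illustration of the trend rather than a measurement. The function names are illustrative.

```python
def word_accuracy(p_token: float, tokens_per_word: int) -> float:
    """Probability that every token of one word is correct."""
    return p_token ** tokens_per_word


def sentence_accuracy(p_token: float, tokens_per_word: int, n_words: int) -> float:
    """Probability that every token of an n-word sentence is correct."""
    return word_accuracy(p_token, tokens_per_word) ** n_words


for label, tokens_per_word in [("1 token per word", 1), ("8 tokens per word", 8)]:
    w = word_accuracy(0.99, tokens_per_word)
    s = sentence_accuracy(0.99, tokens_per_word, n_words=100)
    print(f"{label}: word {w:.1%}, 100-word sentence {s:.3%}")
```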

What does "fertility" measure in the context of tokenization?

A) The ratio of tokens to words. A fertility of 6.5 means the tokenizer produces 6.5 tokens per word on average, indicating heavy fragmentation. High-resource languages like English have fertility near 1.0 while underrepresented languages can reach 10+
B) How quickly the tokenizer processes text
C) The number of languages the tokenizer supports

Chapter 2: Measuring the Gap

Ahia et al. systematically measured the tokenization disparity across 17 languages and 4 commercial tokenizers. The results are damning — and consistent across all tokenizers.

Experimental setup

They used FLORES-200, a parallel translation benchmark where the same sentences exist in 200 languages. This is critical: because the sentences are parallel translations, any difference in token count is purely a tokenizer artifact — the semantic content is identical.

python
# The experimental methodology
# 1. Take FLORES-200 parallel sentences (same meaning in all languages)
# 2. Tokenize each sentence with each commercial tokenizer
# 3. Compare token counts across languages
# 4. Compute cost multiplier = tokens(language) / tokens(English)

# Tokenizers tested:
# - GPT-4 (cl100k_base): 100,256 tokens, byte-level BPE
# - GPT-3.5 (r50k_base): 50,257 tokens, byte-level BPE
# - LLaMA (SentencePiece): 32,000 tokens
# - BLOOM (BPE): 250,680 tokens

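# `tokenizer` is any object with an encode(str) method; `flores` maps a
# language code to the same sentence in that language.
def cost_multipliers(tokenizer, flores, langs):
    n_en = len(tokenizer.encode(flores['en']))  # English baseline, computed once
    return {lang: len(tokenizer.encode(flores[lang])) / n_en for lang in langs}

# Usage, given a tokenizer and a FLORES-200 sentence dict:
#   for lang, mult in cost_multipliers(tokenizer, flores, ['en', 'hi', 'yo', 'my', 'ta', ...]).items():
#       print(f"{lang}: {mult:.1f}x")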

Results across tokenizers

Language	GPT-4 (100K)	GPT-3.5 (50K)	LLaMA (32K)	BLOOM (250K)
English	1.0x	1.0x	1.0x	1.0x
Spanish	1.2x	1.4x	1.2x	1.1x
Chinese	2.3x	2.7x	1.9x	1.5x
Arabic	3.2x	4.1x	3.5x	1.8x
Hindi	4.1x	6.8x	4.9x	2.1x
Tamil	5.8x	9.2x	6.5x	2.4x
Yoruba	6.5x	10.1x	7.2x	3.8x
Myanmar	11.2x	15.3x	12.1x	4.2x

Two clear patterns emerge. First: larger vocabularies help but don't eliminate the gap. GPT-4's 100K vocabulary is better than GPT-3.5's 50K, but Myanmar is still 11x more expensive. Second: BLOOM, which was specifically trained on diverse languages, has the smallest gaps — but even it shows 4x for Myanmar. The problem is structural, not just a matter of vocabulary size.

Context window disparity

The cost multiplier affects more than just API pricing. It directly reduces the effective context window for non-English languages. If GPT-4's context window is 8,192 tokens:

effective context (in words)
English:  8,192 tokens ÷ 1.0 fertility ≈ 8,192 words
Hindi:    8,192 tokens ÷ 4.1 fertility ≈ 1,998 words  (4x less content)
Yoruba:   8,192 tokens ÷ 6.5 fertility ≈ 1,260 words  (6.5x less content)
Myanmar:  8,192 tokens ÷ 11.2 fertility ≈  731 words  (11x less content)

A Myanmar speaker can fit less than one-tenth the semantic content in the same context window. For tasks like document summarization or RAG, this is devastating — you can retrieve and process far fewer documents.
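
The arithmetic above is simple enough to script. The sketch below reuses the GPT-4 fertility values from the table in Chapter 1 and assumes, as the text does, that fertility approximates tokens per word; `effective_context_words` is an illustrative name.

```python
def effective_context_words(window_tokens: int, fertility: float) -> int:
    """Approximate number of words that fit in a token-limited context window."""
    return round(window_tokens / fertility)


# Fertility values copied from the GPT-4 column of the fertility table.
FERTILITY = {"English": 1.0, "Hindi": 4.1, "Yoruba": 6.5, "Myanmar": 11.2}

for lang, f in FERTILITY.items():
    words = effective_context_words(8192, f)
    print(f"{lang:8s} {words:5d} words in an 8,192-token window")
```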

[Interactive figure: Tokenizer Comparison. Cost multipliers across languages for each tokenizer. BLOOM (trained on diverse data) has the smallest gaps; GPT-3.5 (English-heavy) has the largest.]

Why does the tokenization gap also reduce the effective context window?

A) Context windows are measured in tokens, not words or meaning. If Myanmar text needs 11x more tokens per word, the 8,192-token window can only hold ~731 Myanmar words vs ~8,192 English words. The same semantic capacity shrinks by 11x, which hurts document summarization, RAG, and long-form tasks
B) Because the model processes non-English text more slowly
C) Because non-English text contains more information per word

Chapter 3: Why Tokenizers Are Biased

The tokenization disparity isn't random — it's a direct consequence of three compounding factors in how BPE tokenizers are built.

Factor 1: Training data composition

BPE merges character pairs based on frequency in the training data. The training data for major tokenizers is overwhelmingly English:

Tokenizer	English %	Next language	African languages
GPT-2/3 (r50k)	~93%	French ~1%	~0.001%
GPT-4 (cl100k)	~85%	Code ~5%	~0.01%
LLaMA	~89%	Mixed European ~5%	~0.01%
BLOOM	~30%	French ~12%	~5%

When English text dominates, BPE merges prioritize English character sequences. "th" gets merged early because "the" appears billions of times. The Hindi sequence "म" + "ह" never gets merged because it appears too rarely in the training data.
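
To make the frequency argument concrete, here is a toy BPE trainer. It is a simplified sketch that merges characters rather than bytes and skips pre-tokenization, so it is not the procedure used by any production tokenizer. The corpus counts are invented for illustration, and `train_toy_bpe` is a hypothetical helper.

```python
from collections import Counter


def count_pairs(words):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs


def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def train_toy_bpe(word_freqs, num_merges):
    """Greedily merge the most frequent pair, num_merges times."""
    words = {tuple(word): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = count_pairs(words)
        if not pairs:
            break
        best, _count = pairs.most_common(1)[0]
        merges.append(best)
        words = merge_pair(words, best)
    return merges, words


# Invented, English-dominated counts: the Hindi pair appears once.
toy_corpus = {"the": 900, "then": 50, "this": 100, "मह": 1}
merges, words = train_toy_bpe(toy_corpus, num_merges=2)
print(merges)  # both merges go to English pairs
print(words)   # the Hindi word is still split into two symbols
```

With a limited merge budget, every merge is spent on the dominant language, which is the mechanism the paragraph above describes.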

Factor 2: Script diversity

Languages using the Latin script benefit from shared subword tokens with English. "universal" in English, Spanish ("universal"), French ("universel"), and German ("universell") all share the subword "univers". But Hindi (Devanagari), Arabic, Chinese, and other non-Latin scripts share nothing — every token must be learned from scratch for those scripts.

Factor 3: Byte-level fallback

Modern tokenizers fall back to bytes: GPT-4 uses byte-level BPE, and LLaMA's SentencePiece tokenizer has a byte fallback for characters outside its vocabulary. Characters that weren't frequent enough to earn dedicated tokens get decomposed into raw UTF-8 bytes. A single Hindi character like "ह" is 3 UTF-8 bytes (E0 A4 B9). If the tokenizer hasn't learned "ह" as a token, it becomes up to 3 separate byte tokens.

python
# Why non-Latin scripts fragment more

# English "a" = 1 UTF-8 byte = 1 token (a single byte is already a base token)
"a".encode('utf-8')  # b'a' (0x61), 1 byte

# Hindi "ह" = 3 UTF-8 bytes = potentially 3 tokens
"ह".encode('utf-8')  # b'\xe0\xa4\xb9', 3 bytes

# Myanmar "ပ" = 3 UTF-8 bytes = potentially 3 tokens
"ပ".encode('utf-8')  # b'\xe1\x80\x95' — 3 bytes

# Chinese "你" = 3 UTF-8 bytes, but usually merged (enough data)
"你".encode('utf-8')  # b'\xe4\xbd\xa0' — 3 bytes, but 1 token (frequent)

# The key: frequency in training data determines whether
# multi-byte characters get merged into single tokens or stay as bytes

The compounding effect. These three factors multiply each other. Non-Latin scripts (Factor 2) have multi-byte characters (Factor 3) that appear rarely in English-dominated training data (Factor 1). The result: a single Burmese syllable might become 3-6 byte tokens, while the equivalent English syllable is part of a larger merged token. The tokenizer isn't broken — it's optimizing for the distribution it was trained on, which just happens to be English.
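
One way to see the worst case: in a pure byte-level BPE, a text can never need more tokens than it has UTF-8 bytes, and it hits that ceiling when no merges apply. The sketch below computes that ceiling; it is an upper bound under that assumption, not the count any specific tokenizer produces, and `worst_case_tokens` is an illustrative name.

```python
def worst_case_tokens(text: str) -> int:
    """Upper bound on tokens for pure byte-level BPE: one token per UTF-8 byte."""
    return len(text.encode("utf-8"))


# "weather" in English, Hindi, and Myanmar (words taken from the examples above).
for sample in ["weather", "मौसम", "ရာသီဥတု"]:
    code_points = len(sample)
    ceiling = worst_case_tokens(sample)
    print(f"{sample}: {code_points} code points, up to {ceiling} byte tokens")
```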

[Interactive figure: Byte-Level Decomposition. How byte-level BPE handles characters from different writing systems. Basic Latin characters are 1 byte; characters in other scripts are 2-4 bytes and may not get merged.]

What three factors compound to cause tokenization disparity for non-English languages?

A) (1) Training data is ~85-93% English, so BPE merges favor English patterns. (2) Non-Latin scripts share no subwords with English, requiring all tokens to be learned from scratch. (3) Non-Latin characters are 2-4 UTF-8 bytes, fragmenting into multiple byte tokens when not merged. These factors multiply: rare multi-byte characters in English-dominated data stay fragmented.
B) Non-English languages have more complex grammar
C) API providers intentionally make non-English tokenization worse

Chapter 4: Cost Multipliers

Ahia et al. translate the tokenization disparity into concrete dollar amounts. The analysis is sobering — the "language tax" can make LLM usage prohibitively expensive for speakers of underrepresented languages.

API pricing math

OpenAI's GPT-4 pricing (as of the paper): $0.03 per 1K input tokens, $0.06 per 1K output tokens. Let's compute the input cost of a simple task, processing a 1,000-word document:

python
# Cost to process 1,000 words with GPT-4
price_per_1k_tokens = 0.03  # USD, input pricing

costs = {
    'English':   1000 * 1.0  * price_per_1k_tokens / 1000,  # $0.030
    'Spanish':   1000 * 1.2  * price_per_1k_tokens / 1000,  # $0.036
    'Hindi':     1000 * 4.1  * price_per_1k_tokens / 1000,  # $0.123
    'Yoruba':    1000 * 6.5  * price_per_1k_tokens / 1000,  # $0.195
    'Myanmar':   1000 * 11.2 * price_per_1k_tokens / 1000,  # $0.336
}
# Myanmar costs 11.2x more than English for the SAME content!

At scale: the cost is devastating

These multipliers compound with usage. Consider a chatbot serving 100K daily queries, each averaging 500 words of input + output:

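Language	Daily cost	Monthly cost	Annual cost
English	$1,500	$45,000	$540,000
Hindi	$6,150	$184,500	$2,214,000
Yoruba	$9,750	$292,500	$3,510,000
Myanmar	$16,800	$504,000	$6,048,000

These figures use the $0.03 per 1K token input price and 30-day months: 100K queries × 500 words × $0.03/1K tokens is $1,500 per day at a fertility of 1.0. A Myanmar chatbot costs about $6.05M/year vs $540,000/year for the same service in English. That's a language tax of roughly $5.5M a year.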

Real-world case study: education chatbots

Consider an educational chatbot deployed across Africa. The World Bank has funded AI-powered tutoring programs in Sub-Saharan Africa. A student in Lagos asking questions in Yoruba generates 6.5x more tokens than a student in London asking the same questions in English. If the program has a fixed budget, the Yoruba students get fewer interactions — or the program serves fewer students. The tokenizer bias directly translates to educational inequality.

python
# Case study: Educational chatbot budget allocation
budget = 100000  # $100K annual budget
cost_per_query_en = 0.015  # $0.015 per query (500 words)
cost_per_query_yo = 0.015 * 6.5  # $0.0975 per query

# English students get: 100000/0.015 = 6.7M queries/year
# Yoruba students get:  100000/0.0975 = 1.0M queries/year
# Same budget, same model, 6.5x fewer learning interactions

# To equalize: you'd need 6.5x the budget for Yoruba deployment
# Or: fix the tokenizer

The equity dimension. The languages that cost the most are typically spoken in the world's poorest countries. Myanmar's GDP per capita is $1,210. Nigeria's (Yoruba) is $2,180. The people who can least afford the language tax are the ones who pay the most. This creates a feedback loop: high costs → less usage → less data → worse tokenization → higher costs.

Hidden costs beyond pricing

The cost multiplier affects more than just API bills:

Impact	How It Manifests
Inference latency	More tokens = more autoregressive decoding steps = slower responses
Context window	Same token budget holds less semantic content (11x less for Myanmar)
Fine-tuning cost	More tokens per example = more training compute per epoch
RAG effectiveness	Fewer retrieved documents fit in context → worse retrieval quality

[Interactive figure: Annual Cost Calculator. Annual API cost per language for the same service as a function of daily query volume (default 100K).]

Beyond API pricing, what other costs does tokenization disparity impose?

A) Higher inference latency (more decoding steps), reduced effective context windows (11x less content for Myanmar), higher fine-tuning costs (more tokens per training example), and worse RAG quality (fewer documents fit in context). The language tax compounds across every aspect of LLM usage
B) Non-English models require more GPU memory
C) The only cost is the API pricing difference

Chapter 5: Downstream Performance

The tokenization disparity doesn't just affect cost — it degrades model performance. Ahia et al. show that languages with higher fertility systematically perform worse on downstream NLP tasks.

The fertility-performance correlation

Across multiple benchmarks, there is a strong negative correlation between fertility and task performance. Higher fertility (more fragmentation) leads to lower accuracy. This makes sense mechanistically:

Information bottleneck. Each token position in a Transformer has a fixed-size hidden state (e.g., 4096 dims in LLaMA). If a single English word is one token, that word gets a full 4096-dim representation. If the Hindi equivalent is 4 tokens, the same semantic meaning is spread across 4 representations that must be composed by the Transformer — a harder learning problem.

Sequence length penalties. Longer sequences (more tokens) are harder for Transformers. Attention is O(n²) in sequence length. More tokens mean more attention computation and more opportunities for the model to "lose track" of long-range dependencies.
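
The quadratic term is worth spelling out. Under full self-attention, the number of pairwise attention scores grows with the square of the sequence length, so a text that needs f times more tokens needs f² times more scores. The sketch below assumes fertility is measured relative to an English baseline of about 1.0 and ignores the parts of the model whose cost grows only linearly with length.

```python
def relative_attention_cost(fertility: float) -> float:
    """Pairwise attention scores relative to a fertility-1.0 baseline (n^2 scaling)."""
    return fertility ** 2


# Fertility values copied from the GPT-4 column of the fertility table.
for lang, f in {"English": 1.0, "Hindi": 4.1, "Yoruba": 6.5, "Myanmar": 11.2}.items():
    print(f"{lang:8s} {f:5.1f}x tokens -> {relative_attention_cost(f):6.1f}x attention scores")
```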

python
# The fragmentation → performance pipeline
#
# English: "understanding" = 1 token
#   → Full 4096-dim representation for the concept
#   → Model easily learns to process it
#
# Hindi: "समझना" = 4 tokens ["सम", "झ", "ना", "·"]
#   → Meaning spread across 4 positions
#   → Model must learn to compose them via attention
#   → Harder learning problem → worse performance
#   → Also: 4x more tokens → 4x the cost → 4x slower

# Measured correlation (Ahia et al.):
# Pearson r = -0.82 between fertility and NLI accuracy
# This is a STRONG negative correlation
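
For readers who want to reproduce a correlation like this on their own measurements, Pearson's r takes only a few lines. The values below are made-up toy numbers used purely to exercise the function; they are not data from the paper, and `pearson_r` is an illustrative helper.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


# TOY VALUES ONLY (not from the paper): fertility per language vs. task accuracy.
toy_fertility = [1.0, 2.0, 3.0, 4.0]
toy_accuracy = [0.80, 0.70, 0.65, 0.50]
print(f"r = {pearson_r(toy_fertility, toy_accuracy):.2f}")
```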

Tokenization is an upstream bottleneck. No amount of model scaling, fine-tuning, or prompt engineering can fully compensate for bad tokenization. If the input representation is fragmented and inefficient, the model starts from a disadvantage. Fixing tokenization is arguably more impactful per dollar than scaling model size — but it requires rebuilding the tokenizer and retraining the model from scratch.

The causal chain: data → tokenizer → performance

Ahia et al. trace the causal chain precisely. It's not just correlation — there's a clear mechanism:

1. Training Data Imbalance

Tokenizer training corpus is ~90% English. Non-English character patterns are rare.

↓

2. Tokenizer Bias

BPE merges English pairs early (efficient tokens). Non-English pairs stay fragmented.

↓

3. Higher Fertility

Same meaning requires 2-15x more tokens in underrepresented languages.

↓

4. Performance Degradation

Fragmented representations → harder learning → lower accuracy.

↓

5. Cost Multiplier

More tokens per query → higher API cost → less usage → less data → worse tokenizer (feedback loop).

python
# Demonstrating the information bottleneck
# Each token position has a fixed hidden dimension (e.g., 4096 in LLaMA)

# English: "understanding" = 1 token = 4096 dims for this concept
# Information density: CONCEPT / 4096 = high

# Hindi: "समझना" = 4 tokens = 4 × 4096 dims, but fragmented
# The model must learn to compose 4 sub-word representations
# via attention to reconstruct the concept
# This requires more layers of processing → worse for shallow layers

# Experiment: probe accuracy by layer
# English sentiment probe at layer 1:  72%  (already linearly decodable)
# Hindi sentiment probe at layer 1:    54%  (still fragmented)
# Hindi catches up by layer 12:        71%  (after attention composition)
# But it never fully closes the gap to English: 71% vs 78%

Potential mitigations

Mitigation	How	Tradeoff
Balanced training data	Train tokenizer on data with equal representation per language	Reduces English efficiency; may require larger vocab
Larger vocabulary	More tokens (250K+) to cover more scripts	Larger embedding matrix → more memory
Language-specific tokenizers	Different tokenizer per language family	Complexity; breaks shared vocabulary benefits
Character-level models	No tokenizer at all — operate on characters/bytes	Much longer sequences → higher compute
Per-language pricing	Charge by semantic content, not tokens	Business model change; hard to define "semantic unit"

[Figure: Fertility vs Performance. Scatter plot of token fertility (x-axis) against NLI accuracy (y-axis) across languages, showing a strong negative correlation (r = -0.82) between fragmentation and model performance.]

Why does higher token fertility lead to worse downstream performance?

A) Higher fertility means word meanings are fragmented across multiple token positions. The model must compose these fragments via attention (a harder learning problem), sequences become longer (more O(n²) computation, more room for errors), and each concept gets less representational capacity per information unit
B) Because non-English text is inherently harder to classify
C) Because the model was only trained on English data

Chapter 6: Cost Calculator

[Interactive tool: Language Cost Simulator. Sliders set daily queries (default 50K), average words per query (default 500), and price per 1K tokens (default $0.030); the tool reports the annual cost for each language based on tokenizer fertility.]

Try this: Set daily queries to 500,000 (a mid-sized chatbot), average length to 800 words (a typical conversation), and price to $0.03/1K tokens (GPT-4 input). English costs about $4.4M/year. Myanmar costs $49M/year. Same service, same model, 11x the cost — solely because of how the tokenizer segments text.
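
The same calculation can be run offline. This sketch reproduces the scenario above, assuming a 365-day year, input-token pricing only, and the GPT-4 fertility values quoted earlier as tokens per word; `annual_cost` is an illustrative name.

```python
DAYS_PER_YEAR = 365

# Fertility values copied from the GPT-4 column of the fertility table.
FERTILITY = {"English": 1.0, "Spanish": 1.2, "Hindi": 4.1, "Yoruba": 6.5, "Myanmar": 11.2}


def annual_cost(daily_queries: int, words_per_query: int,
                price_per_1k_tokens: float, fertility: float) -> float:
    """Annual API cost in USD, treating fertility as tokens per word."""
    tokens_per_day = daily_queries * words_per_query * fertility
    return tokens_per_day / 1000 * price_per_1k_tokens * DAYS_PER_YEAR


for lang, f in FERTILITY.items():
    cost = annual_cost(500_000, 800, 0.03, f)
    print(f"{lang:8s} ${cost:>12,.0f}/year")
```

Changing the three scenario arguments plays the role of the sliders; the ratio between any two languages depends only on their fertility values.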

A company wants to deploy a GPT-4-based chatbot serving 100K daily queries in both English and Hindi. Based on tokenizer fertility data, how much more will the Hindi deployment cost?

A) About the same cost, since the model is the same
B) About 4x more. Hindi has a fertility of ~4.1x on GPT-4's tokenizer, meaning the same semantic content produces 4.1x more tokens, making the API cost roughly 4x higher. This is a purely tokenizer-driven cost that has nothing to do with the model or the task.
C) About 2x more

Chapter 7: Connections

This paper is a wake-up call for the LLM community. It quantifies a problem that was informally known but never rigorously measured, and it positions tokenization as a first-class equity concern.

Related work and context

Paper	Contribution	Relationship
Sennrich et al. (2016)	Introduced BPE for NLP	The origin — BPE's frequency-based merging creates the bias
Conneau et al. (2020)	XLM-R: multilingual at scale	Used balanced sampling (α=0.3) but still English-dominated tokenizer
Rust et al. (2021)	"How Good is Your Tokenizer?"	First systematic evaluation of tokenizer quality across languages
Petrov et al. (2023)	"Language Model Tokenizers Introduce Unfairness Between Languages"	Extended analysis to generation quality, not just cost

Progress since the paper

The paper has already driven concrete changes:

LLaMA-3 (2024) increased its vocabulary from 32K to 128K tokens, with deliberate inclusion of non-Latin scripts. This reduced the fertility gap for many languages — Hindi dropped from ~4.9x to ~2.5x.

Gemma (2024) uses a 256K vocabulary with explicit multilingual coverage. Google invested in diverse training data for the tokenizer specifically to address this problem.

Per-character pricing has been discussed by some API providers as an alternative to per-token pricing that would eliminate the language tax entirely.

The technical path forward

Ahia et al. suggest several concrete technical improvements:

python
# Technical improvement 1: Balanced tokenizer training
# Instead of training BPE on the raw data distribution
# (90% English), use smoothed sampling (like XLM-R's α=0.3)

def train_equitable_tokenizer(data_by_lang, vocab_size, alpha=0.3):
    # NOTE: sample_by_probs and train_bpe are placeholders for a corpus
    # sampler and a BPE trainer; they are not defined in this sketch.

    # Compute smoothed sampling weights (size**alpha, then normalize)
    sizes = {lang: len(data) for lang, data in data_by_lang.items()}
    probs = {lang: size**alpha for lang, size in sizes.items()}
    total = sum(probs.values())
    probs = {lang: p/total for lang, p in probs.items()}

    # Sample training data according to smoothed distribution
    balanced_data = sample_by_probs(data_by_lang, probs)

    # Train BPE on balanced data
    return train_bpe(balanced_data, vocab_size)

# Technical improvement 2: Script-aware vocabulary allocation
# Reserve minimum vocabulary slots per script:
# Latin:       ~40K tokens (shared across many languages)
# CJK:         ~40K tokens (Chinese, Japanese, Korean)
# Devanagari:  ~15K tokens (Hindi, Marathi, Sanskrit)
# Arabic:      ~15K tokens (Arabic, Urdu, Persian)
# Cyrillic:    ~10K tokens (Russian, Ukrainian, Bulgarian)
# Other:       ~30K tokens (remaining scripts)
# Shared:      ~50K tokens (numbers, punctuation, common subwords)
# Total:       200K tokens
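
To see what the smoothing exponent does in isolation, the following runnable sketch computes sampling probabilities for a small set of hypothetical corpus sizes. The sizes are invented for illustration, and `smoothed_sampling_probs` is an assumed helper name rather than an API from any library.

```python
def smoothed_sampling_probs(sizes: dict, alpha: float = 0.3) -> dict:
    """Exponentially smoothed sampling distribution: p_i proportional to share_i ** alpha."""
    total = sum(sizes.values())
    weights = {lang: (n / total) ** alpha for lang, n in sizes.items()}
    norm = sum(weights.values())
    return {lang: w / norm for lang, w in weights.items()}


# Hypothetical corpus sizes (number of sentences), chosen only for illustration.
sizes = {"en": 900_000, "hi": 90_000, "yo": 10_000}
raw = {lang: n / sum(sizes.values()) for lang, n in sizes.items()}
smooth = smoothed_sampling_probs(sizes, alpha=0.3)

for lang in sizes:
    print(f"{lang}: raw share {raw[lang]:.1%} -> sampled share {smooth[lang]:.1%}")
```

With alpha = 1 the sampler reproduces the raw distribution, and as alpha approaches 0 it approaches uniform sampling across languages, so alpha sets how much English efficiency is traded for coverage.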

BLOOM showed this works. The BLOOM tokenizer (250K vocabulary) was trained on data balanced across 46 languages. Its Myanmar fertility is 4.2x vs GPT-4's 11.2x — a 63% reduction in the language tax. The cost: a larger embedding matrix (250K×4096 vs 100K×4096, adding ~600M parameters). This is a small price for a more equitable tokenizer.
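
The parameter estimate above follows from the shape of the embedding matrix. This sketch uses the rounded vocabulary sizes and the 4096 hidden size quoted in the paragraph, and assumes a single embedding matrix stored in 16-bit precision; a model with a separate, untied output projection would pay the cost twice.

```python
def embedding_params(vocab_size: int, d_model: int) -> int:
    """Number of parameters in one vocab_size x d_model embedding matrix."""
    return vocab_size * d_model


def embedding_bytes(vocab_size: int, d_model: int, bytes_per_param: int = 2) -> int:
    """Memory for that matrix, assuming 2 bytes per parameter (fp16/bf16)."""
    return embedding_params(vocab_size, d_model) * bytes_per_param


extra_params = embedding_params(250_000, 4096) - embedding_params(100_000, 4096)
extra_bytes = embedding_bytes(250_000, 4096) - embedding_bytes(100_000, 4096)
print(f"{extra_params:,} extra parameters, about {extra_bytes / 1e9:.2f} GB in 16-bit")
```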

The deeper lesson. Tokenization seems like a boring preprocessing step — a plumbing detail. But Ahia et al. show that it determines who can afford to use LLMs, how well they work for different populations, and which languages benefit from the AI revolution. The plumbing matters more than the fixtures. This paper turns tokenization from an engineering afterthought into an equity imperative.

"The most consequential design decision in a language model may not be the architecture, the training objective, or the scale — it may be the tokenizer trained on the first day and never questioned again."

[Interactive figure: Vocabulary Size Evolution. Tokenizer vocabulary sizes by model, starting from GPT-2 (2019). Larger vocabularies reduce the fertility gap but don't eliminate it.]

What is the most effective solution to tokenization disparity according to the paper?

A) Training tokenizers on balanced multilingual data with larger vocabularies (250K+) to ensure all scripts get efficient representation. This addresses the root cause (training data bias) and has been adopted by newer models like LLaMA-3 (128K vocab) and Gemma (256K vocab), though the gap hasn't been fully eliminated
B) Making non-English API calls cheaper
C) Using character-level models for all languages

Do All Languages Cost the Same?