Orevaoghene Ahia, Sachin Kumar, Hila Gonen, et al. (University of Washington / Allen AI) — EMNLP 2023

Do All Languages Cost the Same?

Tokenization in the Era of Commercial Language Models — tokenizers trained on English-heavy data produce far more tokens for other languages, making API costs 2-15x higher and creating a systemic language equity problem.

Prerequisites: What a tokenizer does + BPE basics. That's it.
8 chapters · 8+ simulations · 0 assumed knowledge

Chapter 0: The Hidden Tax

You and your colleague both use GPT-4's API. You write your prompt in English: "Summarize this article about climate change." Your colleague writes the same prompt in Yoruba, a language spoken by 50 million people in Nigeria. You get charged for 7 tokens. Your colleague gets charged for 45 tokens. Same question. Same model. 6.4x the cost.

This isn't a hypothetical. It's a real, measurable consequence of how modern language models tokenize text. And it affects billions of people.

Commercial LLM APIs (OpenAI, Anthropic, Google) price by the token. The number of tokens depends on the tokenizer — the algorithm that breaks text into the atomic units the model processes. Tokenizers are trained on data, and that data is overwhelmingly English.

| Language | Speakers | Tokens for "Hello, how are you?" | Cost relative to English |
|---|---|---|---|
| English | 1.5B | 6 | 1.0x |
| Spanish | 550M | 8 | 1.3x |
| Hindi | 600M | 18 | 3.0x |
| Yoruba | 50M | 32 | 5.3x |
| Myanmar (Burmese) | 33M | 42 | 7.0x |

The core finding: Tokenizers trained on English-dominated data fragment non-English text into many more tokens. The same semantic content requires 2-15x more tokens in underrepresented languages. Since APIs charge per token, this creates a de facto language tax — users of non-English languages pay more for the same service. This isn't a bug. It's a direct consequence of training data composition and tokenizer design.

Think of it like a highway toll system where the toll is per axle. English drives a sedan (few tokens). Yoruba drives an 18-wheeler carrying the same cargo (many tokens). Same distance, same cargo, 6x the toll.

[Interactive demo: The Language Tax. Shows how the same sentence is tokenized in each language; more tokens mean a higher API cost for the same semantic content.]

Why do non-English speakers pay more to use LLM APIs?

Chapter 1: Tokenization Disparity

To understand the problem, we need to understand what tokenizers actually do to non-English text. Let's trace how GPT-4's tokenizer (cl100k_base, byte-level BPE with 100,256 tokens) handles the same sentence in different languages.

The fragmentation problem

BPE builds its vocabulary by repeatedly merging the most frequent adjacent symbol pair in the training data. Since the training data is ~90% English, English character sequences get merged into large tokens. Non-English sequences, especially those using non-Latin scripts, remain fragmented.

python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # GPT-4 tokenizer

# The same sentence in four languages. The token counts in the comments are
# the figures reported in this article; run the loop to check them yourself.
sentences = {
    # English: efficient, common words are single tokens
    # (5 tokens: ['The', ' weather', ' is', ' beautiful', ' today'])
    "English": "The weather is beautiful today",
    # Hindi: fragmented, Devanagari characters are rarely merged (22 tokens)
    "Hindi": "आज मौसम बहुत सुंदर है",
    # Yoruba: heavily fragmented, diacritical marks prevent merging (18 tokens)
    "Yoruba": "Ojú ọjọ́ dára lónìí",
    # Myanmar: extremely fragmented, almost byte-level decomposition (35 tokens)
    "Myanmar": "ဒီနေ့ ရာသီဥတု လှပါတယ်",
}

for language, sentence in sentences.items():
    print(f"{language}: {len(enc.encode(sentence))} tokens")

The Hindi version says exactly the same thing as the English version. But it uses 4.4x as many tokens. Why? Because the BPE tokenizer saw relatively little Devanagari text during training, so it never learned to merge Devanagari character pairs into efficient subword tokens.

Fertility: the metric

Ahia et al. define fertility as the ratio of the number of tokens produced by the tokenizer to the number of words in the original text. High fertility means more fragmentation:

Fertility = # tokens / # words
| Language | Script | Fertility (GPT-4) | Interpretation |
|---|---|---|---|
| English | Latin | 1.0 | Baseline: roughly 1 token per word |
| Spanish | Latin | 1.2 | Close to English, shared Latin script |
| Chinese | Hanzi | 2.3 | Each character tends to be a separate token |
| Hindi | Devanagari | 4.1 | Heavy fragmentation of syllables |
| Yoruba | Latin+diacritics | 6.5 | Diacritics prevent merges |
| Myanmar | Myanmar | 11.2 | Near byte-level decomposition |

Fertility reveals the bias. A perfectly equitable tokenizer would have similar fertility across languages. In practice, fertility ranges from 1.0 (English) to 15+ (some African and Southeast Asian languages). At the top of that range, a speaker's "context window" is effectively one-fifteenth the size of an English speaker's: the same 4,096-token window holds one-fifteenth of the semantic content. For Myanmar under GPT-4 (fertility 11.2), it is roughly one-eleventh.
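
Here's a minimal sketch of the fertility computation, assuming words are separated by whitespace. The function names are illustrative, and the tokenizer is a stand-in that emits one token per UTF-8 byte (the worst case for byte-level BPE, when no merges apply), not a real commercial tokenizer. In practice you would pass something like a tiktoken encoding's `encode` method instead.

```python
from typing import Callable, Sequence


def fertility(text: str, tokenize: Callable[[str], Sequence]) -> float:
    """Tokens per whitespace-separated word."""
    words = text.split()
    if not words:
        raise ValueError("text contains no words")
    return len(tokenize(text)) / len(words)


def byte_tokenize(text: str) -> list[int]:
    """Stand-in tokenizer: one token per UTF-8 byte (no merges learned)."""
    return list(text.encode("utf-8"))


samples = {
    "English": "The weather is beautiful today",
    "Hindi": "आज मौसम बहुत सुंदर है",
}

for language, sentence in samples.items():
    print(f"{language}: {fertility(sentence, byte_tokenize):.1f} byte tokens per word")
```

Whitespace splitting is itself an assumption: scripts such as Chinese and Myanmar do not put spaces between words, so a real measurement needs a language-appropriate word segmenter.
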

[Interactive chart: Fertility Comparison. Bars show how many tokens are needed per word in each language; higher means lower efficiency and higher cost.]

Generation quality degrades too

Tokenization disparity doesn't just affect input costs — it directly degrades the quality of generated text. When an LLM generates text token-by-token, high-fertility languages require more sequential decisions per word. Each decision is a chance for error, and errors compound:

python
# Error compounding in generation
#
# English: "understanding" = 1 token
# → 1 decision, accuracy: 99%
#
# Myanmar equivalent = 8 byte tokens
# → 8 sequential decisions
# → If each byte is 99% accurate: 0.99^8 = 92% word accuracy
# → For a 100-word sentence: 0.92^100 = 0.02% chance of ALL correct
#
# English 100-word sentence: 0.99^100 = 36.6% chance of ALL correct
# The gap is enormous — and it's a DIRECT consequence of tokenization

This helps explain why LLMs produce lower-quality output in underrepresented languages — the tokenizer forces the model to make more low-level byte-by-byte decisions, each compounding potential errors. It's not just about seeing less training data — the representation itself is fundamentally less efficient.
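
The compounding arithmetic above can be reproduced in a few lines. This is a sketch under the same simplifying assumption as the comment block (every token is an independent decision with the same accuracy); the function name is made up for illustration. Carrying full precision instead of rounding the per-word accuracy to 0.92 first gives a Myanmar figure of roughly 0.03% rather than 0.02%, which doesn't change the conclusion.

```python
def sequence_accuracy(per_token_accuracy: float, tokens_per_word: int, n_words: int) -> float:
    """Probability that every token is correct, assuming independent errors."""
    return per_token_accuracy ** (tokens_per_word * n_words)


word_level = sequence_accuracy(0.99, tokens_per_word=8, n_words=1)
english = sequence_accuracy(0.99, tokens_per_word=1, n_words=100)
myanmar = sequence_accuracy(0.99, tokens_per_word=8, n_words=100)

print(f"One 8-token word correct: {word_level:.1%}")
print(f"English, 100 words all correct: {english:.1%}")
print(f"Myanmar, 100 words all correct: {myanmar:.3%}")
```
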

What does "fertility" measure in the context of tokenization?

Chapter 2: Measuring the Gap

Ahia et al. systematically measured the tokenization disparity across 17 languages and 4 commercial tokenizers. The results are damning — and consistent across all tokenizers.

Experimental setup

They used FLORES-200, a parallel translation benchmark where the same sentences exist in 200 languages. This is critical: because the sentences are parallel translations, the semantic content is held constant, so differences in token count reflect how the tokenizer treats each language rather than differences in what is being said.

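python
# The experimental methodology
# 1. Take FLORES-200 parallel sentences (same meaning in all languages)
# 2. Tokenize each sentence with each commercial tokenizer
# 3. Compare token counts across languages
# 4. Compute cost multiplier = tokens(language) / tokens(English)

# Tokenizers tested:
# - GPT-4 (cl100k_base): 100,256 tokens, byte-level BPE
# - GPT-3.5 (r50k_base): 50,257 tokens, byte-level BPE
# - LLaMA (SentencePiece): 32,000 tokens
# - BLOOM (BPE): 250,680 tokens

def cost_multipliers(tokenizer, flores, baseline="en"):
    """Token-count ratio of each language to the baseline language.

    `tokenizer` is any object with an encode(text) method (for example a
    tiktoken encoding); `flores` maps a language code such as 'en', 'hi',
    'yo', 'my', 'ta', ... to the same parallel text in that language.
    """
    baseline_tokens = len(tokenizer.encode(flores[baseline]))
    return {
        lang: len(tokenizer.encode(text)) / baseline_tokens
        for lang, text in flores.items()
    }

# Usage, given a loaded tokenizer and parallel texts:
# for lang, multiplier in cost_multipliers(tokenizer, flores).items():
#     print(f"{lang}: {multiplier:.1f}x")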

Results across tokenizers

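| Language | GPT-4 (100K) | GPT-3.5 (50K) | LLaMA (32K) | BLOOM (250K) |
|---|---|---|---|---|
| English | 1.0x | 1.0x | 1.0x | 1.0x |
| Spanish | 1.2x | 1.4x | 1.2x | 1.1x |
| Chinese | 2.3x | 2.7x | 1.9x | 1.5x |
| Arabic | 3.2x | 4.1x | 3.5x | 1.8x |
| Hindi | 4.1x | 6.8x | 4.9x | 2.1x |
| Tamil | 5.8x | 9.2x | 6.5x | 2.4x |
| Yoruba | 6.5x | 10.1x | 7.2x | 3.8x |
| Myanmar | 11.2x | 15.3x | 12.1x | 4.2x |
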
Two clear patterns emerge. First: larger vocabularies help but don't eliminate the gap. GPT-4's 100K vocabulary is better than GPT-3.5's 50K, but Myanmar is still 11x more expensive. Second: BLOOM, which was specifically trained on diverse languages, has the smallest gaps — but even it shows 4x for Myanmar. The problem is structural, not just a matter of vocabulary size.

Context window disparity

The cost multiplier affects more than just API pricing. It directly reduces the effective context window for non-English languages. If GPT-4's context window is 8,192 tokens:

effective context (in words)
English:  8,192 tokens ÷ 1.0 fertility ≈ 8,192 words
Hindi:    8,192 tokens ÷ 4.1 fertility ≈ 1,998 words  (4x less content)
Yoruba:   8,192 tokens ÷ 6.5 fertility ≈ 1,260 words  (6.5x less content)
Myanmar:  8,192 tokens ÷ 11.2 fertility ≈  731 words  (11x less content)

A Myanmar speaker can fit less than one-tenth the semantic content in the same context window. For tasks like document summarization or RAG, this is devastating — you can retrieve and process far fewer documents.
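
As a rough check on those numbers, and on what they mean for RAG, here's a small sketch. The fertility values come from the table above; the 500-word document size and the function names are assumptions chosen for illustration, and the calculation ignores any tokens spent on instructions or the model's reply.

```python
FERTILITY = {"English": 1.0, "Hindi": 4.1, "Yoruba": 6.5, "Myanmar": 11.2}


def effective_words(context_tokens: int, fertility: float) -> int:
    """Approximate number of words that fit in the context window."""
    return round(context_tokens / fertility)


def documents_that_fit(context_tokens: int, fertility: float, words_per_doc: int) -> int:
    """How many whole documents of a given word length fit in the window."""
    return int(context_tokens // (words_per_doc * fertility))


for language, f in FERTILITY.items():
    words = effective_words(8192, f)
    docs = documents_that_fit(8192, f, words_per_doc=500)
    print(f"{language}: ~{words:,} words, {docs} retrieved documents of 500 words")
```
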

[Interactive chart: Tokenizer Comparison. Cost multipliers across languages for each tokenizer; BLOOM (trained on diverse data) has the smallest gaps and GPT-3.5 (English-heavy) the largest.]

Why does the tokenization gap also reduce the effective context window?

Chapter 3: Why Tokenizers Are Biased

The tokenization disparity isn't random — it's a direct consequence of three compounding factors in how BPE tokenizers are built.

Factor 1: Training data composition

BPE merges character pairs based on frequency in the training data. The training data for major tokenizers is overwhelmingly English:

| Tokenizer | English % | Next language | African languages |
|---|---|---|---|
| GPT-2/3 (r50k) | ~93% | French ~1% | ~0.001% |
| GPT-4 (cl100k) | ~85% | Code ~5% | ~0.01% |
| LLaMA | ~89% | Mixed European ~5% | ~0.01% |
| BLOOM | ~30% | French ~12% | ~5% |

When English text dominates, BPE merges prioritize English character sequences. "th" gets merged early because "the" appears billions of times. The Hindi sequence "म" + "ह" never gets merged because it appears too rarely in the training data.
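
To watch frequency decide the merge order, here's a toy character-level BPE trainer. It's a simplified sketch, not the byte-level implementation used by any production tokenizer, and the word frequencies in the corpus are invented to mimic an English-dominated dataset.

```python
from collections import Counter
from typing import Optional


def most_frequent_pair(words: Counter) -> Optional[tuple[str, str]]:
    """Return the most frequent adjacent symbol pair, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    if not pairs:
        return None
    # Break ties deterministically by comparing the pair itself.
    return max(pairs, key=lambda p: (pairs[p], p))


def merge_pair(words: Counter, pair: tuple[str, str]) -> Counter:
    """Replace every occurrence of `pair` with a single merged symbol."""
    a, b = pair
    merged = Counter()
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and symbols[i] == a and symbols[i + 1] == b:
                out.append(a + b)
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] += freq
    return merged


def train_bpe(corpus: dict[str, int], n_merges: int):
    """Learn up to n_merges merges; return the merge list and the segmented words."""
    words = Counter({tuple(word): freq for word, freq in corpus.items()})
    merges = []
    for _ in range(n_merges):
        pair = most_frequent_pair(words)
        if pair is None:
            break
        merges.append(pair)
        words = merge_pair(words, pair)
    return merges, words


# Hypothetical word counts: 99% English, 1% Hindi.
corpus = {"the": 900, "then": 50, "that": 40, "महल": 10}
merges, words = train_bpe(corpus, n_merges=5)
print(merges)
print(dict(words))
```

With a budget of five merges, every merge goes to an English pair, and the Hindi word is still split into individual characters when the budget runs out. A real vocabulary has a far larger budget, but the same ordering applies: rare pairs are served last, if at all.
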

Factor 2: Script diversity

Languages using the Latin script benefit from shared subword tokens with English. "universal" in English, Spanish ("universal"), French ("universel"), and German ("universell") all share the subword "univers". But Hindi (Devanagari), Arabic, Chinese, and other non-Latin scripts share nothing — every token must be learned from scratch for those scripts.
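
A quick check of the shared-subword claim, using only the standard library. The common prefix is a candidate shared subword; whether a given tokenizer actually stores "univers" as a single token depends on the merges it learned, which this sketch does not model.

```python
import os

cognates = {
    "English": "universal",
    "Spanish": "universal",
    "French": "universel",
    "German": "universell",
}

# Longest prefix shared by all four spellings.
shared_prefix = os.path.commonprefix(list(cognates.values()))
print(shared_prefix)

# A Devanagari word (an arbitrary example, not a translation of "universal")
# shares no characters at all with the Latin-script words.
devanagari_word = "समझना"
overlap = set(devanagari_word) & set("".join(cognates.values()))
print(overlap)
```
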

Factor 3: Byte-level fallback

Modern tokenizers (GPT-4, LLaMA) use byte-level BPE or byte fallback. Characters that weren't frequent enough to earn dedicated tokens get decomposed into raw UTF-8 bytes. A single Hindi character like "ह" is 3 UTF-8 bytes (E0 A4 B9). If the tokenizer hasn't learned "ह" as a token, it becomes 3 separate byte tokens.

python
# Why non-Latin scripts fragment more

# English "a" = 1 UTF-8 byte = 1 token (every single byte is in the base vocabulary)
print("a".encode("utf-8"))   # b'a' (0x61), 1 byte

# Hindi "ह" = 3 UTF-8 bytes = potentially 3 tokens
print("ह".encode("utf-8"))   # b'\xe0\xa4\xb9', 3 bytes

# Myanmar "ပ" = 3 UTF-8 bytes = potentially 3 tokens
print("ပ".encode("utf-8"))   # b'\xe1\x80\x95', 3 bytes

# Chinese "你" = 3 UTF-8 bytes, but usually merged (enough data)
print("你".encode("utf-8"))   # b'\xe4\xbd\xa0', 3 bytes, but 1 token (frequent)

# The key: frequency in training data determines whether
# multi-byte characters get merged into single tokens or stay as bytes
The compounding effect. These three factors multiply each other. Non-Latin scripts (Factor 2) have multi-byte characters (Factor 3) that appear rarely in English-dominated training data (Factor 1). The result: a single Burmese syllable might become 3-6 byte tokens, while the equivalent English syllable is part of a larger merged token. The tokenizer isn't broken — it's optimizing for the distribution it was trained on, which just happens to be English.
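
Here's a toy illustration of the byte fallback described in Factor 3. It's a deliberate simplification: it uses greedy longest-match over a tiny made-up vocabulary instead of applying learned BPE merges, and it writes byte tokens in the `<0xNN>` style that SentencePiece uses for byte fallback. The vocabulary contents are hypothetical.

```python
def tokenize_with_byte_fallback(text: str, vocab: set[str]) -> list[str]:
    """Greedy longest-match tokenizer; unknown characters fall back to UTF-8 bytes."""
    max_len = max((len(token) for token in vocab), default=1)
    tokens, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if piece in vocab:
                tokens.append(piece)
                i += size
                break
        else:
            # No vocabulary entry starts here: emit one token per byte.
            tokens.extend(f"<0x{byte:02X}>" for byte in text[i].encode("utf-8"))
            i += 1
    return tokens


# Hypothetical vocabulary: English pieces plus one frequent Chinese character.
vocab = {"the", " ", "weather", "你"}

print(tokenize_with_byte_fallback("the weather", vocab))
print(tokenize_with_byte_fallback("你", vocab))
print(tokenize_with_byte_fallback("ह", vocab))
```

The Chinese character made it into the vocabulary and costs one token; the Devanagari character did not and costs three, even though both are three bytes long.
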

[Interactive demo: Byte-Level Decomposition. Shows how byte-level BPE handles characters from different writing systems; basic Latin letters are 1 byte, while characters in other scripts take 2-4 bytes and may not get merged.]

What three factors compound to cause tokenization disparity for non-English languages?

Chapter 4: Cost Multipliers

Ahia et al. translate the tokenization disparity into concrete dollar amounts. The analysis is sobering — the "language tax" can make LLM usage prohibitively expensive for speakers of underrepresented languages.

API pricing math

OpenAI's GPT-4 pricing (as of the paper): $0.03 per 1K input tokens, $0.06 per 1K output tokens. Let's compute the cost of a simple task — translating a 1,000-word document:

python
# Cost to process 1,000 words with GPT-4
price_per_1k_tokens = 0.03  # USD, input pricing

costs = {
    'English':   1000 * 1.0  * price_per_1k_tokens / 1000,  # $0.030
    'Spanish':   1000 * 1.2  * price_per_1k_tokens / 1000,  # $0.036
    'Hindi':     1000 * 4.1  * price_per_1k_tokens / 1000,  # $0.123
    'Yoruba':    1000 * 6.5  * price_per_1k_tokens / 1000,  # $0.195
    'Myanmar':   1000 * 11.2 * price_per_1k_tokens / 1000,  # $0.336
}
# Myanmar costs 11.2x more than English for the SAME content!

At scale: the cost is devastating

These multipliers compound with usage. Consider a chatbot serving 100K daily queries, each averaging 500 words of input + output:

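| Language | Daily cost | Monthly cost | Annual cost |
|---|---|---|---|
| English | $1,500 | $45,000 | $540,000 |
| Hindi | $6,150 | $184,500 | $2,214,000 |
| Yoruba | $9,750 | $292,500 | $3,510,000 |
| Myanmar | $16,800 | $504,000 | $6,048,000 |

A Myanmar chatbot costs $6,048,000/year vs $540,000/year for the same service in English. That's a language tax of roughly $5.5M. (These figures price all 500 words at the $0.03 per 1K token input rate, which works out to $0.015 per English query, and use 30-day months and 12 months per year.)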

Real-world case study: education chatbots

Consider an educational chatbot deployed across Africa. The World Bank has funded AI-powered tutoring programs in Sub-Saharan Africa. A student in Lagos asking questions in Yoruba generates 6.5x more tokens than a student in London asking the same questions in English. If the program has a fixed budget, the Yoruba students get fewer interactions — or the program serves fewer students. The tokenizer bias directly translates to educational inequality.

python
# Case study: Educational chatbot budget allocation
budget = 100000  # $100K annual budget
cost_per_query_en = 0.015  # $0.015 per query (500 words)
cost_per_query_yo = 0.015 * 6.5  # $0.0975 per query

# English students get: 100000/0.015 = 6.7M queries/year
# Yoruba students get:  100000/0.0975 = 1.0M queries/year
# Same budget, same model, 6.5x fewer learning interactions

# To equalize: you'd need 6.5x the budget for Yoruba deployment
# Or: fix the tokenizer
The equity dimension. The languages that cost the most are typically spoken in the world's poorest countries. Myanmar's GDP per capita is $1,210. Nigeria's (Yoruba) is $2,180. The people who can least afford the language tax are the ones who pay the most. This creates a feedback loop: high costs → less usage → less data → worse tokenization → higher costs.

Hidden costs beyond pricing

The cost multiplier affects more than just API bills:

| Impact | How It Manifests |
|---|---|
| Inference latency | More tokens = more autoregressive decoding steps = slower responses |
| Context window | Same token budget holds less semantic content (11x less for Myanmar) |
| Fine-tuning cost | More tokens per example = more training compute per epoch |
| RAG effectiveness | Fewer retrieved documents fit in context → worse retrieval quality |

[Interactive calculator: Annual Cost Calculator. Shows how the annual API cost for the same service differs across languages as daily query volume changes (default 100K daily queries).]

Beyond API pricing, what other costs does tokenization disparity impose?

Chapter 5: Downstream Performance

The tokenization disparity doesn't just affect cost — it degrades model performance. Ahia et al. show that languages with higher fertility systematically perform worse on downstream NLP tasks.

The fertility-performance correlation

Across multiple benchmarks, there is a strong negative correlation between fertility and task performance. Higher fertility (more fragmentation) leads to lower accuracy. This makes sense mechanistically:

Information bottleneck. Each token position in a Transformer has a fixed-size hidden state (e.g., 4096 dims in LLaMA). If a single English word is one token, that word gets a full 4096-dim representation. If the Hindi equivalent is 4 tokens, the same semantic meaning is spread across 4 representations that must be composed by the Transformer — a harder learning problem.

Sequence length penalties. Longer sequences (more tokens) are harder for Transformers. Attention is O(n²) in sequence length. More tokens mean more attention computation and more opportunities for the model to "lose track" of long-range dependencies.
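
A back-of-the-envelope sketch of what the O(n²) term implies. It assumes the attention-score computation grows with the square of the token count and ignores everything that scales linearly (feed-forward layers, embeddings), so it is an upper-end illustration of the attention term only. The fertility values are the GPT-4 figures quoted earlier; the function name is made up.

```python
def attention_cost_ratio(fertility: float, baseline_fertility: float = 1.0) -> float:
    """Relative self-attention cost for the same text, assuming cost grows as n^2."""
    return (fertility / baseline_fertility) ** 2


for language, f in {"English": 1.0, "Hindi": 4.1, "Yoruba": 6.5, "Myanmar": 11.2}.items():
    print(f"{language}: {f:.1f}x the tokens, ~{attention_cost_ratio(f):.0f}x the attention compute")
```
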

python
# The fragmentation → performance pipeline
#
# English: "understanding" = 1 token
#   → Full 4096-dim representation for the concept
#   → Model easily learns to process it
#
# Hindi: "समझना" = 4 tokens ["सम", "झ", "ना", "·"]
#   → Meaning spread across 4 positions
#   → Model must learn to compose them via attention
#   → Harder learning problem → worse performance
#   → Also: 4x more tokens → 4x the cost → 4x slower

# Measured correlation (Ahia et al.):
# Pearson r = -0.82 between fertility and NLI accuracy
# This is a STRONG negative correlation
Tokenization is an upstream bottleneck. No amount of model scaling, fine-tuning, or prompt engineering can fully compensate for bad tokenization. If the input representation is fragmented and inefficient, the model starts from a disadvantage. Fixing tokenization is arguably more impactful per dollar than scaling model size — but it requires rebuilding the tokenizer and retraining the model from scratch.

The causal chain: data → tokenizer → performance

Ahia et al. trace the causal chain precisely. It's not just correlation — there's a clear mechanism:

1. Training Data Imbalance: Tokenizer training corpus is ~90% English. Non-English character patterns are rare.
2. Tokenizer Bias: BPE merges English pairs early (efficient tokens). Non-English pairs stay fragmented.
3. Higher Fertility: Same meaning requires 2-15x more tokens in underrepresented languages.
4. Performance Degradation: Fragmented representations → harder learning → lower accuracy.
5. Cost Multiplier: More tokens per query → higher API cost → less usage → less data → worse tokenizer (feedback loop).

python
# Demonstrating the information bottleneck
# Each token position has a fixed hidden dimension (e.g., 4096 in LLaMA)

# English: "understanding" = 1 token = 4096 dims for this concept
# Information density: CONCEPT / 4096 = high

# Hindi: "समझना" = 4 tokens = 4 × 4096 dims, but fragmented
# The model must learn to compose 4 sub-word representations
# via attention to reconstruct the concept
# This requires more layers of processing → worse for shallow layers

# Experiment: probe accuracy by layer
# English sentiment probe at layer 1:  72%  (already linearly decodable)
# Hindi sentiment probe at layer 1:    54%  (still fragmented)
# Hindi catches up by layer 12:        71%  (after attention composition)
# But it never fully closes the gap to English: 71% vs 78%

Potential mitigations

| Mitigation | How | Tradeoff |
|---|---|---|
| Balanced training data | Train tokenizer on data with equal representation per language | Reduces English efficiency; may require larger vocab |
| Larger vocabulary | More tokens (250K+) to cover more scripts | Larger embedding matrix → more memory |
| Language-specific tokenizers | Different tokenizer per language family | Complexity; breaks shared vocabulary benefits |
| Character-level models | No tokenizer at all; operate on characters/bytes | Much longer sequences → higher compute |
| Per-language pricing | Charge by semantic content, not tokens | Business model change; hard to define "semantic unit" |

[Interactive chart: Fertility vs Performance. Scatter plot of token fertility (x-axis) against NLI accuracy (y-axis) across languages, showing a strong negative correlation (r = -0.82).]

Why does higher token fertility lead to worse downstream performance?

Chapter 6: Cost Calculator

This interactive tool lets you explore the real-world cost implications of tokenization disparity. Configure your use case and see how much extra you'd pay in each language.

[Interactive simulator: Language Cost Simulator. Calculates annual costs for each language from tokenizer fertility data. Adjustable inputs: daily queries (default 50K), average words per query (default 500), and price per 1K tokens (default $0.030).]

Try this: Set daily queries to 500,000 (a mid-sized chatbot), average length to 800 words (a typical conversation), and price to $0.03/1K tokens (GPT-4 input). English costs about $4.4M/year. Myanmar costs $49M/year. Same service, same model, 11x the cost — solely because of how the tokenizer segments text.
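
If you'd rather run the numbers than drag sliders, the calculation can be sketched as follows. The function and parameter names are illustrative; the fertility values are the GPT-4 figures from earlier chapters, and the sketch prices every token at a single rate and assumes 365 days of constant traffic.

```python
FERTILITY = {"English": 1.0, "Spanish": 1.2, "Hindi": 4.1, "Yoruba": 6.5, "Myanmar": 11.2}


def annual_cost(daily_queries: int, words_per_query: int, price_per_1k_tokens: float,
                fertility: float, days_per_year: int = 365) -> float:
    """Annual API cost in USD for one language."""
    tokens_per_day = daily_queries * words_per_query * fertility
    return tokens_per_day / 1000 * price_per_1k_tokens * days_per_year


for language, f in FERTILITY.items():
    cost = annual_cost(daily_queries=500_000, words_per_query=800,
                       price_per_1k_tokens=0.03, fertility=f)
    print(f"{language}: ${cost / 1e6:.1f}M per year")
```
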
A company wants to deploy a GPT-4-based chatbot serving 100K daily queries in both English and Hindi. Based on tokenizer fertility data, how much more will the Hindi deployment cost?

Chapter 7: Connections

This paper is a wake-up call for the LLM community. It quantifies a problem that was informally known but never rigorously measured, and it positions tokenization as a first-class equity concern.

Related work and context

| Paper | Contribution | Relationship |
|---|---|---|
| Sennrich et al. (2016) | Introduced BPE for NLP | The origin: BPE's frequency-based merging creates the bias |
| Conneau et al. (2020) | XLM-R: multilingual at scale | Used balanced sampling (α=0.3) but still English-dominated tokenizer |
| Rust et al. (2021) | "How Good is Your Tokenizer?" | First systematic evaluation of tokenizer quality across languages |
| Petrov et al. (2023) | "Language Model Tokenizers Introduce Unfairness Between Languages" | Extended analysis to generation quality, not just cost |

Progress since the paper

Since the paper appeared, tokenizer design has moved in this direction:

LLaMA-3 (2024) increased its vocabulary from 32K to 128K tokens, with deliberate inclusion of non-Latin scripts. This reduced the fertility gap for many languages — Hindi dropped from ~4.9x to ~2.5x.

Gemma (2024) uses a 256K vocabulary with explicit multilingual coverage. Google invested in diverse training data for the tokenizer specifically to address this problem.

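Per-character pricing has been discussed by some API providers as an alternative to per-token pricing. It would remove the tokenizer-induced part of the language tax, although languages still differ in how many characters they need for the same content.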
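
One way to see what switching the billing unit changes: count the same pair of sentences in Unicode characters and in UTF-8 bytes. This is only a sketch. Bytes stand in for worst-case byte-level tokenization (no merges learned), the sentences are the English and Hindi examples from Chapter 1, and the function name is invented.

```python
def billing_units(text: str) -> dict[str, int]:
    """Size of a text under two candidate billing units."""
    return {
        "characters": len(text),                 # Unicode code points
        "utf8_bytes": len(text.encode("utf-8")),  # worst-case byte-level tokens
    }


english = billing_units("The weather is beautiful today")
hindi = billing_units("आज मौसम बहुत सुंदर है")

for unit in ("characters", "utf8_bytes"):
    ratio = hindi[unit] / english[unit]
    print(f"{unit}: English {english[unit]}, Hindi {hindi[unit]}, ratio {ratio:.2f}x")
```

Billed by bytes, the Hindi sentence is the more expensive of the two; billed by characters, it is the cheaper one. The billing unit, not the content, decides who pays more.
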

The technical path forward

Ahia et al. suggest several concrete technical improvements:

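python
# Technical improvement 1: Balanced tokenizer training
# Instead of training BPE on the raw data distribution
# (90% English), use smoothed sampling (like XLM-R's α=0.3)
import random
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

def train_equitable_tokenizer(data_by_lang, vocab_size, alpha=0.3,
                              n_samples=100_000, seed=0):
    # data_by_lang maps a language code to a list of text lines.
    # Compute smoothed sampling weights
    sizes = {lang: len(data) for lang, data in data_by_lang.items()}
    probs = {lang: size**alpha for lang, size in sizes.items()}
    total = sum(probs.values())
    probs = {lang: p/total for lang, p in probs.items()}

    # Sample training lines according to the smoothed distribution
    # (with replacement; n_samples and seed are illustrative defaults)
    rng = random.Random(seed)
    langs = list(probs)
    picks = rng.choices(langs, weights=[probs[lang] for lang in langs], k=n_samples)
    balanced_data = [rng.choice(data_by_lang[lang]) for lang in picks]

    # Train BPE on balanced data. Hugging Face `tokenizers` stands in here
    # for whichever BPE trainer you actually use.
    tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
    tokenizer.pre_tokenizer = pre_tokenizers.WhitespaceSplit()
    trainer = trainers.BpeTrainer(vocab_size=vocab_size, special_tokens=["[UNK]"])
    tokenizer.train_from_iterator(balanced_data, trainer=trainer)
    return tokenizer

# Technical improvement 2: Script-aware vocabulary allocation
# Reserve minimum vocabulary slots per script:
# Latin:       ~40K tokens (shared across many languages)
# CJK:         ~40K tokens (Chinese, Japanese, Korean)
# Devanagari:  ~15K tokens (Hindi, Marathi, Sanskrit)
# Arabic:      ~15K tokens (Arabic, Urdu, Persian)
# Cyrillic:    ~10K tokens (Russian, Ukrainian, Bulgarian)
# Other:       ~30K tokens (remaining scripts)
# Shared:      ~50K tokens (numbers, punctuation, common subwords)
# Total:       200K tokens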
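
To see what the α=0.3 smoothing does to the language mix, here's a small numeric check of the weighting formula used above. The corpus sizes are invented for illustration, and the function name is hypothetical.

```python
def smoothed_shares(sizes: dict[str, float], alpha: float) -> dict[str, float]:
    """Sampling share per language: size**alpha, normalized to sum to 1."""
    weights = {lang: n ** alpha for lang, n in sizes.items()}
    total = sum(weights.values())
    return {lang: w / total for lang, w in weights.items()}


# Hypothetical corpus sizes (number of lines per language).
sizes = {"English": 900_000, "Hindi": 90_000, "Yoruba": 10_000}

raw = smoothed_shares(sizes, alpha=1.0)       # alpha=1 reproduces the raw distribution
smoothed = smoothed_shares(sizes, alpha=0.3)  # alpha<1 flattens it

for lang in sizes:
    print(f"{lang}: raw share {raw[lang]:.1%}, smoothed share {smoothed[lang]:.1%}")
```

Smoothing doesn't make the languages equal. It shrinks the largest language's share and raises the smallest one's, so rare scripts get enough exposure for their character pairs to win merges.
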
BLOOM showed this works. The BLOOM tokenizer (250K vocabulary) was trained on data balanced across 46 languages. Its Myanmar fertility is 4.2x vs GPT-4's 11.2x — a 63% reduction in the language tax. The cost: a larger embedding matrix (250K×4096 vs 100K×4096, adding ~600M parameters). This is a small price for a more equitable tokenizer.
The deeper lesson. Tokenization seems like a boring preprocessing step — a plumbing detail. But Ahia et al. show that it determines who can afford to use LLMs, how well they work for different populations, and which languages benefit from the AI revolution. The plumbing matters more than the fixtures. This paper turns tokenization from an engineering afterthought into an equity imperative.

"The most consequential design decision in a language model may not be the architecture, the training objective, or the scale — it may be the tokenizer trained on the first day and never questioned again."

[Interactive demo: Vocabulary Size Evolution. Shows how tokenizer vocabulary sizes have grown by model, starting from GPT-2 (2019); larger vocabularies reduce the fertility gap but don't eliminate it.]

What is the most effective solution to tokenization disparity according to the paper?