Tokenization in the Era of Commercial Language Models — tokenizers trained on English-heavy data produce far more tokens for other languages, making API costs 2-15x higher and creating a systemic language equity problem.
You and your colleague both use GPT-4's API. You write your prompt in English: "Summarize this article about climate change." Your colleague writes the same prompt in Yoruba, a language spoken by 50 million people in Nigeria. You get charged for 7 tokens. Your colleague gets charged for 45 tokens. Same question. Same model. 6.4x the cost.
This isn't a hypothetical. It's a real, measurable consequence of how modern language models tokenize text. And it affects billions of people.
Commercial LLM APIs (OpenAI, Anthropic, Google) price by the token. The number of tokens depends on the tokenizer — the algorithm that breaks text into the atomic units the model processes. Tokenizers are trained on data, and that data is overwhelmingly English.
| Language | Speakers | Tokens for "Hello, how are you?" | Cost relative to English |
|---|---|---|---|
| English | 1.5B | 6 | 1.0x |
| Spanish | 550M | 8 | 1.3x |
| Hindi | 600M | 18 | 3.0x |
| Yoruba | 50M | 32 | 5.3x |
| Myanmar (Burmese) | 33M | 42 | 7.0x |
Think of it like a highway toll system where the toll is per axle. English drives a sedan (few tokens). Yoruba drives an 18-wheeler carrying the same cargo (many tokens). Same distance, same cargo, 6x the toll.
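The "cost relative to English" column is just each token count divided by the English count. A minimal sketch of that arithmetic, using the token counts from the table above (the function name `relative_cost` is illustrative, not from any library):

```python
# Token counts for "Hello, how are you?" as listed in the table above.
token_counts = {
    "English": 6,
    "Spanish": 8,
    "Hindi": 18,
    "Yoruba": 32,
    "Myanmar (Burmese)": 42,
}

def relative_cost(counts, baseline="English"):
    """Return each language's token count divided by the baseline's count."""
    base = counts[baseline]
    return {lang: n / base for lang, n in counts.items()}

multipliers = relative_cost(token_counts)
for lang, m in multipliers.items():
    print(f"{lang:<18} {m:.1f}x")
```

Because APIs bill per token, the multiplier carries over directly to the bill for the same sentence.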
To understand the problem, we need to understand what tokenizers actually do to non-English text. Let's trace how GPT-4's tokenizer (cl100k_base, byte-level BPE with 100,256 tokens) handles the same word in different languages.
BPE builds its vocabulary by merging the most frequent character pairs in the training data. Since the training data is ~90% English, English character sequences get merged into large tokens. Non-English sequences — especially those using non-Latin scripts — remain fragmented.
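To make the merge dynamics concrete, here is a toy character-level BPE trainer. It is a simplified sketch, not the byte-level implementation used by any production tokenizer, and the corpus frequencies are made up purely to mimic an English-heavy distribution.

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs, weighted by word frequency; return the top pair."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0] if pairs else None

def merge_pair(words, pair):
    """Replace every adjacent occurrence of `pair` with one merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

def train_toy_bpe(corpus, num_merges):
    """Run `num_merges` greedy merges over a {word: frequency} corpus."""
    words = {tuple(word): freq for word, freq in corpus.items()}
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(words)
        if pair is None:
            break
        words = merge_pair(words, pair)
        merges.append(pair)
    return merges, words

# Hypothetical frequencies: English words dominate, Hindi words are rare.
corpus = {"the": 90, "then": 40, "this": 30, "है": 3, "हम": 2}
merges, words = train_toy_bpe(corpus, num_merges=3)
print(merges)  # all three merges go to English pairs
print(words)   # the Hindi words are still split into single characters
```

With only three merges available, every one is spent on English pairs. The Hindi pairs would need far more merge budget, or far more data, before they were ever selected.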
```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # GPT-4 tokenizer

# English: efficient, common words are single tokens
len(enc.encode("The weather is beautiful today"))
# 5 tokens: ['The', ' weather', ' is', ' beautiful', ' today']

# Hindi: fragmented, Devanagari characters rarely merged
len(enc.encode("आज मौसम बहुत सुंदर है"))  # same meaning
# 22 tokens: each syllable or even byte is a separate token

# Yoruba: heavily fragmented, diacritics break merges
len(enc.encode("Ojú ọjọ́ dára lónìí"))  # same meaning
# 18 tokens: diacritical marks prevent efficient merging

# Myanmar: extremely fragmented, unique script
len(enc.encode("ဒီနေ့ ရာသီဥတု လှပါတယ်"))  # same meaning
# 35 tokens: almost byte-level decomposition
```
The Hindi version says exactly the same thing as the English version. But it uses 4.4x as many tokens. Why? Because the BPE tokenizer saw relatively little Devanagari text during training, so it never learned to merge Devanagari character pairs into efficient subword tokens.
Ahia et al. define fertility as the ratio of the number of tokens produced by the tokenizer to the number of words in the original text. High fertility means more fragmentation:
| Language | Script | Fertility (GPT-4) | Interpretation |
|---|---|---|---|
| English | Latin | 1.0 | Baseline — roughly 1 token per word |
| Spanish | Latin | 1.2 | Close to English, shared Latin script |
| Chinese | Hanzi | 2.3 | Each character tends to be a separate token |
| Hindi | Devanagari | 4.1 | Heavy fragmentation of syllables |
| Yoruba | Latin+diacritics | 6.5 | Diacritics prevent merges |
| Myanmar | Myanmar | 11.2 | Near byte-level decomposition |
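Fertility as defined above can be sketched in a few lines. This version assumes words are whitespace-separated, which does not hold for scripts such as Chinese or Myanmar that need a proper word segmenter. It also uses a stand-in tokenizer that emits one token per UTF-8 byte (the worst case, with no merges at all) so the example runs without downloading a real vocabulary. Both the function names and the stand-in tokenizer are illustrative assumptions.

```python
def fertility(text, tokenize):
    """Tokens per whitespace-separated word for a given tokenize(text) callable."""
    words = text.split()
    if not words:
        raise ValueError("text contains no words")
    return len(tokenize(text)) / len(words)

def utf8_byte_tokenizer(text):
    """Stand-in tokenizer: one token per UTF-8 byte, i.e. no learned merges."""
    return list(text.encode("utf-8"))

english = "The weather is beautiful today"
hindi = "आज मौसम बहुत सुंदर है"

print(f"English: {fertility(english, utf8_byte_tokenizer):.1f} tokens/word")
print(f"Hindi:   {fertility(hindi, utf8_byte_tokenizer):.1f} tokens/word")
```

Passing a real tokenizer's encode function in place of `utf8_byte_tokenizer` would measure that tokenizer instead. The byte-level numbers only show the upper bound that merges are supposed to bring down.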
Tokenization disparity doesn't just affect input costs — it directly degrades the quality of generated text. When an LLM generates text token-by-token, high-fertility languages require more sequential decisions per word. Each decision is a chance for error, and errors compound:
```python
# Error compounding in generation (assumes each token is predicted independently)
#
# English: "understanding" = 1 token
#   → 1 decision, accuracy: 99%
#
# Myanmar equivalent = 8 byte tokens
#   → 8 sequential decisions
#   → If each byte is 99% accurate: 0.99^8 ≈ 92% word accuracy
#   → For a 100-word passage: 0.99^800 ≈ 0.03% chance of ALL correct
#
# English 100-word passage: 0.99^100 ≈ 36.6% chance of ALL correct
# The gap is enormous, and it's a DIRECT consequence of tokenization
```
This helps explain why LLMs produce lower-quality output in underrepresented languages — the tokenizer forces the model to make more low-level byte-by-byte decisions, each compounding potential errors. It's not just about seeing less training data — the representation itself is fundamentally less efficient.
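The compounding arithmetic can be checked directly. The sketch below assumes every token is predicted independently with the same 99% accuracy, which is a simplification rather than a measured property of any model. Without rounding the intermediate per-word accuracy, the all-correct probability for the fragmented case comes out near 0.03%.

```python
def sequence_accuracy(per_token_accuracy, tokens_per_word, num_words):
    """Probability that every token is correct, assuming independent errors."""
    return per_token_accuracy ** (tokens_per_word * num_words)

english = sequence_accuracy(0.99, tokens_per_word=1, num_words=100)
myanmar = sequence_accuracy(0.99, tokens_per_word=8, num_words=100)

print(f"English, 100 words: {english:.1%}")   # 36.6%
print(f"Myanmar, 100 words: {myanmar:.3%}")   # 0.032%
print(f"Ratio: {english / myanmar:,.0f}x")
```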
Ahia et al. systematically measured the tokenization disparity across 17 languages and 4 commercial tokenizers. The results are damning — and consistent across all tokenizers.
They used FLORES-200, a parallel translation benchmark where the same sentences exist in 200 languages. This is critical: because the sentences are parallel translations, any difference in token count is purely a tokenizer artifact — the semantic content is identical.
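```python
# The experimental methodology
# 1. Take FLORES-200 parallel sentences (same meaning in all languages)
# 2. Tokenize each sentence with each commercial tokenizer
# 3. Compare token counts across languages
# 4. Compute cost multiplier = tokens(language) / tokens(English)

# Tokenizers tested:
# - GPT-4 (cl100k_base): 100,256 tokens, byte-level BPE
# - GPT-3.5 (r50k_base): 50,257 tokens, byte-level BPE
# - LLaMA (SentencePiece): 32,000 tokens
# - BLOOM (BPE): 250,680 tokens

def cost_multipliers(flores, tokenizer, langs, baseline="en"):
    """Token-count ratio of each language to the baseline.

    flores: dict mapping language code -> the same parallel text in that language
    tokenizer: any object with an .encode(text) method returning a token list
    (both are supplied by the caller; nothing is loaded here)
    """
    tokens_en = len(tokenizer.encode(flores[baseline]))
    return {lang: len(tokenizer.encode(flores[lang])) / tokens_en for lang in langs}

# Usage, given a loaded `flores` dict and `tokenizer`:
# for lang, m in cost_multipliers(flores, tokenizer, ['en', 'hi', 'yo', 'my', 'ta', ...]).items():
#     print(f"{lang}: {m:.1f}x")
```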
| Language | GPT-4 (100K) | GPT-3.5 (50K) | LLaMA (32K) | BLOOM (250K) |
|---|---|---|---|---|
| English | 1.0x | 1.0x | 1.0x | 1.0x |
| Spanish | 1.2x | 1.4x | 1.2x | 1.1x |
| Chinese | 2.3x | 2.7x | 1.9x | 1.5x |
| Arabic | 3.2x | 4.1x | 3.5x | 1.8x |
| Hindi | 4.1x | 6.8x | 4.9x | 2.1x |
| Tamil | 5.8x | 9.2x | 6.5x | 2.4x |
| Yoruba | 6.5x | 10.1x | 7.2x | 3.8x |
| Myanmar | 11.2x | 15.3x | 12.1x | 4.2x |
The cost multiplier affects more than just API pricing. It directly reduces the effective context window for non-English languages. If GPT-4's context window is 8,192 tokens:
| Language | Context window | Fertility | Effective context (words) | Relative to English |
|---|---|---|---|---|
| English | 8,192 tokens | 1.0 | ≈ 8,192 | 1x |
| Hindi | 8,192 tokens | 4.1 | ≈ 1,998 | about 4x less content |
| Yoruba | 8,192 tokens | 6.5 | ≈ 1,260 | 6.5x less content |
| Myanmar | 8,192 tokens | 11.2 | ≈ 731 | 11x less content |
A Myanmar speaker can fit less than one-tenth the semantic content in the same context window. For tasks like document summarization or RAG, this is devastating — you can retrieve and process far fewer documents.
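The effective-context figures follow from dividing the token budget by fertility. A short sketch using the fertility values quoted above (the helper name `effective_words` is illustrative):

```python
CONTEXT_TOKENS = 8192  # the context window assumed in the text

# Tokens per word, as quoted in the fertility table above.
fertility = {"English": 1.0, "Hindi": 4.1, "Yoruba": 6.5, "Myanmar": 11.2}

def effective_words(context_tokens, tokens_per_word):
    """Whole words that fit in the window at a given fertility."""
    return int(context_tokens / tokens_per_word)

capacity = {lang: effective_words(CONTEXT_TOKENS, f) for lang, f in fertility.items()}
for lang, words in capacity.items():
    share = words / capacity["English"]
    print(f"{lang:<8} {words:>5} words  ({share:.0%} of English capacity)")
```

The same division applies to any window size, so a larger context window raises every row but leaves the ratios between languages unchanged.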
Across the four tokenizers, BLOOM (trained on more diverse data) has the smallest gaps; GPT-3.5 (English-heavy) has the largest.
The tokenization disparity isn't random — it's a direct consequence of three compounding factors in how BPE tokenizers are built.
BPE merges character pairs based on frequency in the training data. The training data for major tokenizers is overwhelmingly English:
| Tokenizer | English % | Next language | African languages |
|---|---|---|---|
| GPT-2/3 (r50k) | ~93% | French ~1% | ~0.001% |
| GPT-4 (cl100k) | ~85% | Code ~5% | ~0.01% |
| LLaMA | ~89% | Mixed European ~5% | ~0.01% |
| BLOOM | ~30% | French ~12% | ~5% |
When English text dominates, BPE merges prioritize English character sequences. "th" gets merged early because "the" appears billions of times. The Hindi sequence "म" + "ह" never gets merged because it appears too rarely in the training data.
Languages using the Latin script benefit from shared subword tokens with English. "universal" in English, Spanish ("universal"), French ("universel"), and German ("universell") all share the subword "univers". But Hindi (Devanagari), Arabic, Chinese, and other non-Latin scripts share nothing — every token must be learned from scratch for those scripts.
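The shared-script effect can be illustrated with a plain string comparison. This is only a proxy: it shows that the character sequence "univers" is common to the four spellings, not that any particular tokenizer stores it as a single token. The Devanagari word is taken from the Hindi example sentence earlier and stands in for any non-Latin string.

```python
import os.path

cognates = {
    "English": "universal",
    "Spanish": "universal",
    "French": "universel",
    "German": "universell",
}

# os.path.commonprefix compares strings character by character.
shared = os.path.commonprefix(list(cognates.values()))
print(shared)  # univers

# Adding a word in another script leaves nothing in common to reuse.
with_devanagari = list(cognates.values()) + ["मौसम"]
shared_across_scripts = os.path.commonprefix(with_devanagari)
print(repr(shared_across_scripts))  # ''
```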
Modern tokenizers use byte-level BPE (GPT-4) or BPE with byte fallback (LLaMA). Characters that weren't frequent enough to earn dedicated tokens get decomposed into raw UTF-8 bytes. A single Hindi character like "ह" is 3 UTF-8 bytes (E0 A4 B9). If the tokenizer hasn't learned "ह" as a token, it becomes 3 separate byte tokens.

```python
# Why non-Latin scripts fragment more

# English "a" = 1 UTF-8 byte = 1 token (a single-byte base token)
"a".encode('utf-8')   # b'a' (0x61), 1 byte

# Hindi "ह" = 3 UTF-8 bytes = potentially 3 tokens
"ह".encode('utf-8')   # b'\xe0\xa4\xb9', 3 bytes

# Myanmar "ပ" = 3 UTF-8 bytes = potentially 3 tokens
"ပ".encode('utf-8')   # b'\xe1\x80\x95', 3 bytes

# Chinese "你" = 3 UTF-8 bytes, but usually merged (enough data)
"你".encode('utf-8')   # b'\xe4\xbd\xa0', 3 bytes, but 1 token (frequent)

# The key: frequency in training data determines whether
# multi-byte characters get merged into single tokens or stay as bytes
```
Unaccented Latin letters take 1 byte in UTF-8; characters from other scripts take 2-4 bytes and may never get merged into a single token.
Ahia et al. translate the tokenization disparity into concrete dollar amounts. The analysis is sobering — the "language tax" can make LLM usage prohibitively expensive for speakers of underrepresented languages.
OpenAI's GPT-4 pricing (as of the paper): $0.03 per 1K input tokens, $0.06 per 1K output tokens. Let's compute the cost of a simple task: sending a 1,000-word document as input.

```python
# Cost to process 1,000 words of input with GPT-4
price_per_1k_tokens = 0.03  # USD, input pricing
words = 1000

# Tokens per word, from the fertility table
fertility = {'English': 1.0, 'Spanish': 1.2, 'Hindi': 4.1, 'Yoruba': 6.5, 'Myanmar': 11.2}

costs = {lang: words * f * price_per_1k_tokens / 1000 for lang, f in fertility.items()}
# English $0.030, Spanish $0.036, Hindi $0.123, Yoruba $0.195, Myanmar $0.336

# Myanmar costs 11.2x more than English for the SAME content!
```
These multipliers compound with usage. Consider a chatbot serving 10K daily queries, each averaging 500 words of input + output, priced at the $0.03 per 1K input rate for simplicity and using 30-day months:
| Language | Daily cost | Monthly cost | Annual cost |
|---|---|---|---|
| English | $150 | $4,500 | $54,000 |
| Hindi | $615 | $18,450 | $221,400 |
| Yoruba | $975 | $29,250 | $351,000 |
| Myanmar | $1,680 | $50,400 | $604,800 |
A Myanmar chatbot costs $604,800/year vs $54,000/year for the same service in English. That's a language tax of roughly $550,000 per year.
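The table can be reproduced with a few multiplications. The sketch below assumes 10,000 queries per day, which is the volume that yields the $150 English daily figure at $0.03 per 1K tokens and 500 words per query. It also assumes 30-day months and a single flat per-token price. The function name and parameters are illustrative.

```python
PRICE_PER_1K_TOKENS = 0.03  # USD, flat rate assumed for input and output alike

def chatbot_costs(daily_queries, words_per_query, tokens_per_word,
                  price_per_1k=PRICE_PER_1K_TOKENS, days_per_month=30):
    """Return daily, monthly and annual API cost in USD."""
    daily = daily_queries * words_per_query * tokens_per_word * price_per_1k / 1000
    monthly = daily * days_per_month
    return {"daily": daily, "monthly": monthly, "annual": monthly * 12}

fertility = {"English": 1.0, "Hindi": 4.1, "Yoruba": 6.5, "Myanmar": 11.2}
costs = {lang: chatbot_costs(10_000, 500, f) for lang, f in fertility.items()}

for lang, c in costs.items():
    print(f"{lang:<8} ${c['daily']:>8,.0f}/day  ${c['monthly']:>9,.0f}/month  ${c['annual']:>10,.0f}/year")

language_tax = costs["Myanmar"]["annual"] - costs["English"]["annual"]
print(f"Myanmar vs English: ${language_tax:,.0f} extra per year")
```

Scaling `daily_queries` scales every row linearly, so the gap in dollars grows with traffic while the ratio between languages stays fixed.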
Consider an educational chatbot deployed across Africa. The World Bank has funded AI-powered tutoring programs in Sub-Saharan Africa. A student in Lagos asking questions in Yoruba generates 6.5x more tokens than a student in London asking the same questions in English. If the program has a fixed budget, the Yoruba students get fewer interactions — or the program serves fewer students. The tokenizer bias directly translates to educational inequality.
```python
# Case study: Educational chatbot budget allocation
budget = 100_000                              # $100K annual budget
cost_per_query_en = 0.015                     # USD per 500-word query at $0.03 per 1K tokens
cost_per_query_yo = cost_per_query_en * 6.5   # $0.0975 per query

queries_en = budget / cost_per_query_en       # ≈ 6.7M queries/year for English students
queries_yo = budget / cost_per_query_yo       # ≈ 1.0M queries/year for Yoruba students

# Same budget, same model, 6.5x fewer learning interactions
# To equalize: you'd need 6.5x the budget for Yoruba deployment
# Or: fix the tokenizer
```
The cost multiplier affects more than just API bills:
| Impact | How It Manifests |
|---|---|
| Inference latency | More tokens = more autoregressive decoding steps = slower responses |
| Context window | Same token budget holds less semantic content (11x less for Myanmar) |
| Fine-tuning cost | More tokens per example = more training compute per epoch |
| RAG effectiveness | Fewer retrieved documents fit in context → worse retrieval quality |
The tokenization disparity doesn't just affect cost — it degrades model performance. Ahia et al. show that languages with higher fertility systematically perform worse on downstream NLP tasks.
Across multiple benchmarks, there is a strong negative correlation between fertility and task performance. Higher fertility (more fragmentation) leads to lower accuracy. This makes sense mechanistically:
Information bottleneck. Each token position in a Transformer has a fixed-size hidden state (e.g., 4096 dims in LLaMA). If a single English word is one token, that word gets a full 4096-dim representation. If the Hindi equivalent is 4 tokens, the same semantic meaning is spread across 4 representations that must be composed by the Transformer — a harder learning problem.
Sequence length penalties. Longer sequences (more tokens) are harder for Transformers. Attention is O(n²) in sequence length. More tokens mean more attention computation and more opportunities for the model to "lose track" of long-range dependencies.
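As a rough sketch of the quadratic term only: if the same passage needs k times as many tokens, the pairwise attention-score computation grows by about k². This ignores the parts of a Transformer that scale linearly with length and any attention optimizations, so treat it as an upper-level intuition rather than a latency prediction. The fertility values are the ones quoted earlier; the function name is illustrative.

```python
def relative_attention_cost(tokens_per_word, baseline_tokens_per_word=1.0):
    """Ratio of pairwise attention comparisons for the same number of words."""
    return (tokens_per_word / baseline_tokens_per_word) ** 2

fertility = {"English": 1.0, "Hindi": 4.1, "Yoruba": 6.5, "Myanmar": 11.2}
attention_cost = {lang: relative_attention_cost(f) for lang, f in fertility.items()}

for lang, cost in attention_cost.items():
    print(f"{lang:<8} {cost:6.1f}x the attention comparisons of English")
```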
```python
# The fragmentation → performance pipeline
#
# English: "understanding" = 1 token
#   → Full 4096-dim representation for the concept
#   → Model easily learns to process it
#
# Hindi: "समझना" = 4 tokens ["सम", "झ", "ना", "·"]
#   → Meaning spread across 4 positions
#   → Model must learn to compose them via attention
#   → Harder learning problem → worse performance
#   → Also: 4x more tokens → 4x the cost → 4x slower

# Measured correlation (Ahia et al.):
# Pearson r = -0.82 between fertility and NLI accuracy
# This is a STRONG negative correlation
```
Ahia et al. trace the causal chain precisely. It's not just correlation — there's a clear mechanism:
```python
# Demonstrating the information bottleneck
# Each token position has a fixed hidden dimension (e.g., 4096 in LLaMA)

# English: "understanding" = 1 token = 4096 dims for this concept
# Information density: CONCEPT / 4096 = high

# Hindi: "समझना" = 4 tokens = 4 × 4096 dims, but fragmented
# The model must learn to compose 4 sub-word representations
# via attention to reconstruct the concept
# This requires more layers of processing → worse for shallow layers

# Experiment: probe accuracy by layer
# English sentiment probe at layer 1: 72% (already linearly decodable)
# Hindi sentiment probe at layer 1: 54% (still fragmented)
# Hindi catches up by layer 12: 71% (after attention composition)
# But it never fully closes the gap to English: 71% vs 78%
```
| Mitigation | How | Tradeoff |
|---|---|---|
| Balanced training data | Train tokenizer on data with equal representation per language | Reduces English efficiency; may require larger vocab |
| Larger vocabulary | More tokens (250K+) to cover more scripts | Larger embedding matrix → more memory |
| Language-specific tokenizers | Different tokenizer per language family | Complexity; breaks shared vocabulary benefits |
| Character-level models | No tokenizer at all — operate on characters/bytes | Much longer sequences → higher compute |
| Per-language pricing | Charge by semantic content, not tokens | Business model change; hard to define "semantic unit" |
Plotting token fertility against NLI accuracy across languages shows a strong negative correlation (r = -0.82): the more fragmented a language's tokenization, the worse the model tends to perform on it.
This paper is a wake-up call for the LLM community. It quantifies a problem that was informally known but never rigorously measured, and it positions tokenization as a first-class equity concern.
| Paper | Contribution | Relationship |
|---|---|---|
| Sennrich et al. (2016) | Introduced BPE for NLP | The origin — BPE's frequency-based merging creates the bias |
| Conneau et al. (2020) | XLM-R: multilingual at scale | Used balanced sampling (α=0.3) but still English-dominated tokenizer |
| Rust et al. (2021) | "How Good is Your Tokenizer?" | First systematic evaluation of tokenizer quality across languages |
| Petrov et al. (2023) | "Language Model Tokenizers Introduce Unfairness Between Languages" | Extended analysis to generation quality, not just cost |
The paper has already driven concrete changes:
LLaMA-3 (2024) increased its vocabulary from 32K to 128K tokens, with deliberate inclusion of non-Latin scripts. This reduced the fertility gap for many languages — Hindi dropped from ~4.9x to ~2.5x.
Gemma (2024) uses a 256K vocabulary with explicit multilingual coverage. Google invested in diverse training data for the tokenizer specifically to address this problem.
Per-character pricing has been discussed by some API providers as an alternative to per-token pricing. It would decouple the price from tokenizer fertility, although character counts for the same content still differ across languages.
Ahia et al. suggest several concrete technical improvements:
```python
# Technical improvement 1: Balanced tokenizer training
# Instead of training BPE on the raw data distribution
# (90% English), use smoothed sampling (like XLM-R's α=0.3)

def smoothed_sampling_probs(sizes, alpha=0.3):
    """Map per-language corpus sizes to sampling probabilities ∝ size**alpha."""
    weights = {lang: size ** alpha for lang, size in sizes.items()}
    total = sum(weights.values())
    return {lang: w / total for lang, w in weights.items()}

def train_equitable_tokenizer(data_by_lang, vocab_size,
                              sample_by_probs, train_bpe, alpha=0.3):
    """sample_by_probs and train_bpe are caller-supplied (hypothetical) helpers:
    one resamples the corpus per language, the other trains BPE on the result."""
    sizes = {lang: len(data) for lang, data in data_by_lang.items()}
    probs = smoothed_sampling_probs(sizes, alpha)
    # Sample training data according to the smoothed distribution
    balanced_data = sample_by_probs(data_by_lang, probs)
    # Train BPE on the balanced data
    return train_bpe(balanced_data, vocab_size)

# Technical improvement 2: Script-aware vocabulary allocation
# Reserve minimum vocabulary slots per script:
#   Latin:      ~40K tokens (shared across many languages)
#   CJK:        ~40K tokens (Chinese, Japanese, Korean)
#   Devanagari: ~15K tokens (Hindi, Marathi, Sanskrit)
#   Arabic:     ~15K tokens (Arabic, Urdu, Persian)
#   Cyrillic:   ~10K tokens (Russian, Ukrainian, Bulgarian)
#   Other:      ~30K tokens (remaining scripts)
#   Shared:     ~50K tokens (numbers, punctuation, common subwords)
#   Total:      200K tokens
```
"The most consequential design decision in a language model may not be the architecture, the training objective, or the scale — it may be the tokenizer trained on the first day and never questioned again."
Tokenizer vocabulary sizes have grown in response to the equity problem. Larger vocabularies reduce the fertility gap but don't eliminate it.