Unsupervised Cross-lingual Representation Learning at Scale

Train one masked language model on 100 languages simultaneously, enabling zero-shot cross-lingual transfer without parallel data.
You've built a sentiment classifier for English product reviews. It works great — 94% accuracy. Now your company expands to Japan, Brazil, and Turkey. You need the same classifier for Japanese, Portuguese, and Turkish. The problem: you have zero labeled training data in those languages.
This is the language barrier in NLP. The vast majority of labeled datasets exist only in English. Building separate models for each of the world's 7,000+ languages would require labeled data for each one — data that simply doesn't exist for most languages.
| Approach | What It Requires | Scales? |
|---|---|---|
| Train per language | Labeled data in EACH target language | No — most languages have no labeled data |
| Translate and train | Machine translation system for each language pair | Partially — but translation introduces noise |
| Cross-lingual transfer | Multilingual representations + labels in ONE language | Yes — one model, many languages |
Cross-lingual transfer is the dream: train a classifier on English data, then apply it directly to Japanese, Portuguese, Turkish — any language — without any additional training. The classifier works because the representations are shared across languages.
Think of it like a polyglot who reads extensively in many languages. After enough reading, they develop abstract concepts of "positive sentiment" and "named entity" that transcend any specific language. XLM-R does the same — but computationally.
How can a model learn representations that work across languages? The key insight is that if you train a masked language model on text from multiple languages simultaneously, the model discovers shared structure that transcends any single language.
Consider sentiment. In English, "This movie is amazing!" is positive. In French, "Ce film est incroyable!" is positive. The surface tokens are completely different, but the syntactic structures and semantic patterns are similar. A multilingual model that processes both languages learns to map these parallel structures to nearby points in a shared vector space.
More concretely, multilingual models exploit three types of cross-lingual signal:
| Signal Type | What It Is | Example |
|---|---|---|
| Shared subwords | Languages written in the same script share subword tokens: numbers, borrowed words, named entities | "Obama", "2024", "COVID" map to the same tokens across Latin-script languages |
| Structural similarity | Many languages have SVO order, use articles, have similar syntactic patterns | "The cat sat" vs "Le chat s'est assis" — similar structure |
| Parameter sharing | One Transformer processes all languages — forced to find shared representations | The same attention heads learn "subject" and "object" across languages |
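The effect of a shared representation space can be illustrated with purely synthetic vectors. The sketch below is a toy, not XLM-R: the "sentiment axis", the language offsets, the noise level, and the nearest-centroid classifier are all assumptions made up for illustration. It only shows why a classifier fitted on one language's vectors can work on another's when both live in roughly the same space.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 16

# Hypothetical shared "sentiment" direction (an assumption of this toy setup)
sentiment_axis = rng.normal(size=dim)
sentiment_axis /= np.linalg.norm(sentiment_axis)

def embed(labels, lang_offset, noise=0.3):
    """Synthetic sentence vectors: +/- sentiment axis, a language offset, and noise."""
    signs = np.where(np.asarray(labels) == 1, 1.0, -1.0)[:, None]
    return signs * sentiment_axis + lang_offset + noise * rng.normal(size=(len(labels), dim))

# Labeled "English" data and unlabeled-at-training-time "Japanese" data
en_labels = rng.integers(0, 2, size=200)
ja_labels = rng.integers(0, 2, size=200)
en_vecs = embed(en_labels, lang_offset=0.1 * rng.normal(size=dim))
ja_vecs = embed(ja_labels, lang_offset=0.1 * rng.normal(size=dim))

# "Train" on English only: one centroid per class
centroids = np.stack([en_vecs[en_labels == c].mean(axis=0) for c in (0, 1)])

def predict(vecs):
    """Assign each vector to the nearest English-derived centroid."""
    dists = np.linalg.norm(vecs[:, None, :] - centroids[None, :, :], axis=-1)
    return dists.argmin(axis=1)

# Zero-shot accuracy on the other language
ja_accuracy = float((predict(ja_vecs) == ja_labels).mean())
print(f"zero-shot accuracy on synthetic 'ja' vectors: {ja_accuracy:.2f}")
```

Increasing the size of the language offsets relative to the sentiment signal makes the transfer degrade, which is the toy analogue of representations that are not well aligned across languages.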
Multilingual BERT (mBERT) by Google (2019) was the first demonstration. It was simply BERT trained on Wikipedia text from 104 languages. Remarkably, despite no cross-lingual objective, it learned cross-lingual representations. But it had limitations: Wikipedia is small for many languages, and the model capacity (110M parameters) was spread too thin across 104 languages.
XLM (Conneau & Lample, 2019) improved on mBERT by adding a Translation Language Modeling (TLM) objective: mask tokens in parallel sentences and let the model attend to both languages. This gave an explicit cross-lingual signal but required parallel data — which is scarce for most language pairs.
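To make the TLM setup concrete, here is a minimal sketch of how one training example could be assembled from a parallel sentence pair. The whitespace tokenization, the `</s>` separator, the `[MASK]` string, and the function name `build_tlm_example` are illustrative assumptions rather than XLM's actual preprocessing code. The position reset for the target sentence and the per-token language IDs follow the description in the XLM paper.

```python
import random

MASK = "[MASK]"
SEP = "</s>"

def build_tlm_example(src_tokens, tgt_tokens, src_lang, tgt_lang,
                      mask_prob=0.15, seed=0):
    """Concatenate a parallel pair and mask tokens on both sides."""
    rng = random.Random(seed)
    tokens = src_tokens + [SEP] + tgt_tokens
    # Positions restart at 0 for the target sentence
    positions = list(range(len(src_tokens) + 1)) + list(range(len(tgt_tokens)))
    # One language ID per token, so the model knows which side it is reading
    langs = [src_lang] * (len(src_tokens) + 1) + [tgt_lang] * len(tgt_tokens)

    inputs, labels = [], []
    for tok in tokens:
        if tok != SEP and rng.random() < mask_prob:
            inputs.append(MASK)   # hide the token
            labels.append(tok)    # and remember what to predict
        else:
            inputs.append(tok)
            labels.append(None)   # not a prediction target
    return inputs, labels, positions, langs

en = "the cat sat on the mat".split()
fr = "le chat s'est assis sur le tapis".split()
inputs, labels, positions, langs = build_tlm_example(en, fr, "en", "fr", mask_prob=0.3)
print(inputs)
```

Because both sentences sit in one sequence, a masked English token can be recovered by attending to its French translation, which is where the explicit cross-lingual signal comes from.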
```python
# The evolution of multilingual models

# 1. mBERT (2019)
#    - BERT trained on Wikipedia in 104 languages
#    - Objective: MLM (masked language modeling) only
#    - Data: ~2.5B words total across all languages
#    - Problem: Wikipedia is tiny for low-resource languages

# 2. XLM (2019)
#    - Added TLM (translation language modeling)
#    - Objective: MLM + TLM (requires parallel data)
#    - Better cross-lingual, but parallel data is limited

# 3. XLM-R (2020): THIS PAPER
#    - Just MLM, no TLM needed
#    - Massive data: 2.5 TB of CommonCrawl text
#    - Key insight: scale the DATA, not the objective
#    - Result: beats XLM-TLM despite using LESS supervision
```
The foundation of XLM-R is data. Where mBERT used Wikipedia (2.5 billion words across 104 languages), XLM-R uses CommonCrawl (over 2 terabytes across 100 languages). This isn't just "more data" — it's a qualitative shift in what low-resource languages can access.
Raw CommonCrawl data is messy because it is scraped from the entire web. Conneau et al. built the CC-100 dataset by applying to 100 languages the same cleaning pipeline, including deduplication, that was used for English in RoBERTa.
The scale difference between Wikipedia and CommonCrawl is enormous for low-resource languages:
| Language | Wikipedia (mBERT) | CC-100 (XLM-R) | Increase |
|---|---|---|---|
| English | 2.5 GB | 55 GB | 22x |
| French | 1.1 GB | 57 GB | 52x |
| Swahili | 11 MB | 332 MB | 30x |
| Burmese | 8 MB | 214 MB | 27x |
| Urdu | 30 MB | 730 MB | 24x |
| Yoruba | 1.2 MB | 28 MB | 23x |
For Yoruba, the data goes from 1.2 MB (barely enough to learn anything) to 28 MB. This is still small compared to English's 55 GB, but it's enough for the model to learn basic Yoruba linguistic structure. The exponential smoothing in sampling further amplifies low-resource languages' training signal.
```python
# Data quality matters as much as quantity

# Wikipedia: curated, factual, encyclopedic style
# CommonCrawl: diverse but noisy; includes:
#   - News articles (formal)
#   - Blog posts (informal)
#   - Forum discussions (colloquial)
#   - Product descriptions (commercial)
#   - Boilerplate HTML (noise)
#
# The deduplication step is crucial:
# Without it, the model memorizes repeated boilerplate
# (cookie notices, navigation menus, copyright footers)
# instead of learning language structure
```
Even after cleaning, the data is wildly imbalanced. English has 300 billion tokens. Swahili has 275 million. Urdu has 730 million. If you train on the natural distribution, the model overwhelmingly learns English features and barely touches low-resource languages.
XLM-R uses exponential smoothing to rebalance sampling. Let $n_i$ be the number of tokens in language $i$. The sampling probability for language $i$ is:

$$p_i = \frac{n_i^{\alpha}}{\sum_{j} n_j^{\alpha}}$$
Where α is a smoothing parameter. When α = 1, you sample proportionally (English dominates). When α = 0, every language is equally sampled (low-resource gets a huge boost but English suffers). Conneau et al. found α = 0.3 works best — it upsamples low-resource languages significantly while not starving high-resource ones.
```python
import numpy as np

# Example: 5 languages with different data sizes (in millions of tokens)
data_sizes = {
    'English': 300_000,  # 300B
    'French': 56_000,    # 56B
    'Hindi': 2_000,      # 2B
    'Swahili': 275,      # 275M
    'Urdu': 730,         # 730M
}

def compute_sampling(sizes, alpha):
    """Return smoothed sampling probabilities n_i^alpha / sum_j n_j^alpha."""
    values = np.array(list(sizes.values()), dtype=np.float64)
    probs = values ** alpha
    return probs / probs.sum()

# α = 1.0 (proportional): English gets about 83.6% of training
# α = 0.3 (smoothed): English gets about 47.3%, Swahili about 5.8%
# α = 0.0 (uniform): each language gets 20%
for alpha in (1.0, 0.3, 0.0):
    print(alpha, dict(zip(data_sizes, compute_sampling(data_sizes, alpha).round(3))))
```
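As a rough sketch of how the smoothed distribution would be used during training, the block below draws the language for each of 10,000 batches and compares the empirical shares with the target probabilities. The language codes, the batch count, the fixed seed, and the function name `smoothed_probs` are assumptions for illustration; the token counts are the ones from the example above.

```python
import numpy as np

def smoothed_probs(token_counts, alpha):
    """Sampling probabilities proportional to n_i ** alpha."""
    counts = np.asarray(token_counts, dtype=np.float64)
    weights = counts ** alpha
    return weights / weights.sum()

langs = ["en", "fr", "hi", "sw", "ur"]
counts = [300_000, 56_000, 2_000, 275, 730]  # millions of tokens

rng = np.random.default_rng(0)
probs = smoothed_probs(counts, alpha=0.3)

# Draw the language of each training batch from the smoothed distribution
draws = rng.choice(langs, size=10_000, p=probs)
share = {lang: float(np.mean(draws == lang)) for lang in langs}

for lang, p in zip(langs, probs):
    print(f"{lang}: target={p:.3f} sampled={share[lang]:.3f}")
```

With α = 0.3, Swahili is drawn far more often than its raw share of the data (275M out of roughly 359B tokens) would suggest, while English still receives the largest single share.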
Here's the central tension of multilingual models: every language you add dilutes the model's capacity for every other language. This is the curse of multilinguality.
Imagine a Transformer with 110M parameters (like BERT-base). If it learns English only, all 110M parameters are dedicated to English. If it learns 100 languages, those same 110M parameters must represent the vocabulary, syntax, semantics, and factual knowledge of ALL 100 languages. Each language gets roughly 1.1M parameters' worth of capacity — a 100x reduction.
Conneau et al. ran a controlled experiment. They trained multiple XLM-R models, each on a different number of languages (1, 7, 15, 30, 100), and measured performance on a shared evaluation set. The results are striking:
| # Languages | EN accuracy | Low-resource avg | Overall avg |
|---|---|---|---|
| 1 (English only) | 92.3 | — (no transfer) | — |
| 7 | 91.1 | 78.5 | 85.4 |
| 15 | 89.8 | 79.2 | 84.9 |
| 30 | 88.5 | 79.8 | 84.5 |
| 100 | 87.1 | 80.1 | 84.2 |
Adding more languages hurts high-resource languages (English drops from 92.3 to 87.1) but helps low-resource languages (which benefit from cross-lingual transfer). This is a genuine tradeoff — you can't get both without increasing capacity.
mBERT used a 110K shared vocabulary (WordPiece). XLM-R uses a 250K SentencePiece vocabulary. This larger vocabulary is crucial for multilingual coverage. With only 110K tokens, many scripts (Chinese, Japanese, Korean, Arabic) get fragmented into tiny pieces, making sequences very long and learning harder.
```python
# Vocabulary allocation across scripts (approximate, illustrative figures)

# mBERT (110K WordPiece):
#   Latin scripts:  ~40K tokens (dominates)
#   CJK characters: ~20K tokens (under-represented)
#   Other scripts:  ~50K tokens (heavily fragmented)

# XLM-R (250K SentencePiece):
#   Latin scripts:  ~60K tokens
#   CJK characters: ~50K tokens (much better coverage)
#   Other scripts:  ~140K tokens (decent coverage)

# Result: "नमस्ते" (Hindi for "hello")
#   mBERT: 5 tokens (heavily fragmented)
#   XLM-R: 2 tokens (better coverage of Devanagari)
```
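The link between vocabulary coverage and fragmentation can be sketched with a toy greedy longest-match tokenizer. This is not WordPiece or SentencePiece, and the two vocabularies below are invented for the example (they are not the real mBERT or XLM-R vocabularies). The sketch only shows the mechanism: when a script has few multi-character pieces in the vocabulary, the tokenizer falls back to single characters and sequences get longer.

```python
def greedy_tokenize(text, vocab):
    """Greedy longest-match segmentation; falls back to single characters."""
    pieces, i = [], 0
    while i < len(text):
        # Try the longest candidate first; a single character always succeeds
        for j in range(len(text), i, -1):
            if text[i:j] in vocab or j == i + 1:
                pieces.append(text[i:j])
                i = j
                break
    return pieces

word = "नमस्ते"  # six Unicode code points

# Toy vocabularies (assumptions): the larger one has one extra Devanagari piece
small_vocab = {word[:2]}
large_vocab = {word[:2], word[2:]}

small_pieces = greedy_tokenize(word, small_vocab)
large_pieces = greedy_tokenize(word, large_vocab)
print(len(small_pieces), small_pieces)
print(len(large_pieces), large_pieces)
```

A single additional multi-character piece shortens the sequence here, which is the effect a larger shared vocabulary is meant to have for under-covered scripts.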
XLM-R is architecturally identical to RoBERTa — a Transformer encoder trained with masked language modeling. The novelty isn't in the architecture but in the scale of multilingual data and the training recipe. Let's trace the exact setup.
| Config | XLM-R Base | XLM-R Large |
|---|---|---|
| Layers | 12 | 24 |
| Hidden dim | 768 | 1024 |
| Attention heads | 12 | 16 |
| Parameters | 270M | 550M |
| Vocabulary | 250K SentencePiece (shared) | 250K SentencePiece (shared) |
| Max sequence length | 512 tokens | 512 tokens |
| Training data | CC-100: 2.5 TB, 100 languages | CC-100: 2.5 TB, 100 languages |
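A back-of-the-envelope count shows where the parameters in the table go. The sketch below is an approximation under stated assumptions: a feed-forward width of 4x the hidden size (the usual BERT/RoBERTa setting), a vocabulary of exactly 250,000 entries, and no accounting for position embeddings or the output head. The helper `approx_params` is hypothetical, not part of any library.

```python
def approx_params(vocab_size, hidden, layers, ffn_mult=4):
    """Rough parameter count: token embeddings + Transformer encoder layers."""
    embedding = vocab_size * hidden
    # Q, K, V, and output projections, each hidden x hidden plus a bias
    attention = 4 * hidden * hidden + 4 * hidden
    # Two linear layers (hidden -> ffn and ffn -> hidden) plus their biases
    ffn = 2 * ffn_mult * hidden * hidden + (ffn_mult + 1) * hidden
    # Two LayerNorms per layer, each with a scale and a shift vector
    layer_norms = 2 * 2 * hidden
    return embedding, layers * (attention + ffn + layer_norms)

for name, hidden, layers in [("base", 768, 12), ("large", 1024, 24)]:
    emb, enc = approx_params(250_000, hidden, layers)
    share = emb / (emb + enc)
    print(f"{name}: embeddings={emb / 1e6:.0f}M "
          f"encoder={enc / 1e6:.0f}M embedding share={share:.0%}")
```

Under this approximation the totals land within a few percent of the rounded 270M and 550M figures, and the 250K-entry embedding matrix accounts for roughly 70% of the Base model and a little under half of the Large model. The large shared vocabulary is therefore a major part of the parameter budget, not a free addition.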
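XLM-R uses only Masked Language Modeling (MLM): randomly mask 15% of the input tokens and predict them. There is no next-sentence prediction (removed following RoBERTa's finding that NSP does not help) and no translation language modeling, which is the key finding: you don't need parallel data. The objective is:

$$\mathcal{L}_{\text{MLM}} = -\sum_{i \in \mathcal{M}} \log P\left(x_i \mid x_{\setminus i}\right)$$

where $\mathcal{M}$ is the set of masked positions, $x_i$ is the masked token, and $x_{\setminus i}$ is all other tokens. The model must predict each masked word from its multilingual context.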
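The masked-token objective can be checked numerically with a few lines of NumPy. This is a minimal sketch with random logits and made-up target IDs; the function name `mlm_loss` and the choice to average (rather than sum) over masked positions are assumptions for illustration.

```python
import numpy as np

def mlm_loss(logits, targets, masked):
    """Mean negative log-likelihood of the true tokens at masked positions only."""
    logits = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    token_ll = log_probs[np.arange(len(targets)), targets]  # log P(x_i | context)
    return -token_ll[masked].mean()

rng = np.random.default_rng(0)
seq_len, vocab = 6, 10
logits = rng.normal(size=(seq_len, vocab))        # stand-in for model outputs
targets = np.array([3, 1, 4, 1, 5, 9])            # original token IDs
masked = np.array([False, True, False, False, True, False])

loss = mlm_loss(logits, targets, masked)
print(f"MLM loss over {masked.sum()} masked positions: {loss:.3f}")
```

A model that knows nothing assigns uniform probability to every token, which gives a loss of log(vocabulary size). Training drives the loss below that baseline.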
XLM-R adopts the optimized training recipe from RoBERTa:
```python
# XLM-R training hyperparameters
config = {
    'optimizer': 'Adam',
    'lr': 6e-4,                # peak learning rate
    'warmup_steps': 10000,
    'batch_size': 8192,        # very large batch
    'total_steps': 1_500_000,  # 1.5M steps
    'masking': 'dynamic',      # re-mask each epoch (not static)
    'nsp': False,              # no next-sentence prediction
    'full_sentences': True,    # pack full sentences (not pairs)
    'sampling_alpha': 0.3,     # language smoothing factor
    'fp16': True,              # mixed precision training
}
# Total compute: ~500 V100 GPUs for ~3 weeks
```
XLM-R uses SentencePiece with a 250K vocabulary trained on the CC-100 data. SentencePiece is critical for multilingual models because it doesn't assume whitespace-separated words — it can tokenize Japanese, Chinese, and Thai directly from the raw character stream.
```python
# SentencePiece tokenization examples
# (outputs shown are illustrative; exact pieces may differ)
from transformers import XLMRobertaTokenizer

tok = XLMRobertaTokenizer.from_pretrained("xlm-roberta-base")

# English
tok.tokenize("The cat sat on the mat")
# ['▁The', '▁cat', '▁sat', '▁on', '▁the', '▁mat']

# Japanese (no spaces between words!)
tok.tokenize("猫がマットの上に座った")
# ['▁', '猫', 'が', 'マ', 'ット', 'の', '上', 'に', '座', 'った']

# Arabic (right-to-left)
tok.tokenize("القطة جلست على الحصيرة")
# ['▁', 'القط', 'ة', '▁', 'جل', 'ست', '▁على', '▁ال', 'حص', 'ير', 'ة']
```
Let's trace the full data flow for a single training batch:
```python
# XLM-R training step: detailed data flow (pseudocode)
# sample_language, get_sentences, dynamic_mask, model, and optimizer are
# placeholders for components defined elsewhere.
import torch.nn.functional as F

# 1. Sample a language according to the smoothed distribution
lang = sample_language(alpha=0.3)  # might be "sw" (Swahili)

# 2. Sample a batch of sentences from that language
batch = get_sentences(lang, batch_size=8192)
# Each sequence holds up to 512 SentencePiece tokens
# Sentences are packed: if one is 200 tokens, the next starts at 201

# 3. Apply dynamic masking (a fresh mask each epoch)
masked_ids, labels = dynamic_mask(batch, mask_prob=0.15)
# 15% of tokens are selected for prediction:
#   80% → replaced with the [MASK] token
#   10% → replaced with a random token from the vocab
#   10% → left unchanged (but still predicted)
mask_positions = labels != -100  # assumes -100 marks unselected positions

# 4. Forward pass through the Transformer
logits = model(masked_ids)  # [8192, 512, 250000]
# Output: unnormalized scores over the 250K vocab for each position

# 5. Compute loss only on masked positions
loss = F.cross_entropy(logits[mask_positions], labels[mask_positions])
# The model must predict the original token from context
# "The [MASK] sat on the mat" → predict "cat"

# 6. Backward + optimizer step
optimizer.zero_grad()
loss.backward()
optimizer.step()  # Adam with warmup + linear decay
```
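The masking step in the data flow above can be sketched as a standalone NumPy function. This is a minimal illustration of the 80/10/10 rule, not the actual XLM-R implementation: the function signature, the use of -100 as the "not a target" label (a common PyTorch convention), and the example mask ID are assumptions, and special tokens are not excluded from masking here.

```python
import numpy as np

def dynamic_mask(token_ids, mask_id, vocab_size, rng, mask_prob=0.15):
    """Select ~mask_prob of tokens, then apply the 80/10/10 replacement rule."""
    token_ids = np.asarray(token_ids)
    labels = np.full_like(token_ids, -100)           # -100 = not a prediction target
    selected = rng.random(token_ids.shape) < mask_prob
    labels[selected] = token_ids[selected]           # remember the original tokens

    masked = token_ids.copy()
    roll = rng.random(token_ids.shape)
    to_mask = selected & (roll < 0.8)                   # 80%: mask token
    to_random = selected & (roll >= 0.8) & (roll < 0.9)  # 10%: random token
    # The remaining 10% of selected tokens are left unchanged
    masked[to_mask] = mask_id
    masked[to_random] = rng.integers(0, vocab_size, size=int(to_random.sum()))
    return masked, labels

rng = np.random.default_rng(0)
ids = rng.integers(5, 1000, size=(4, 512))           # stand-in token IDs
masked, labels = dynamic_mask(ids, mask_id=4, vocab_size=1000, rng=rng)
masked2, labels2 = dynamic_mask(ids, mask_id=4, vocab_size=1000, rng=rng)

print("fraction selected:", float((labels != -100).mean()))
print("same mask on second pass:", bool(np.array_equal(labels, labels2)))
```

Calling the function again on the same sequence produces a different mask, which is what makes the masking "dynamic": the mask is drawn at training time instead of being fixed once during preprocessing.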
XLM-R was evaluated on several cross-lingual benchmarks. The headline result: it outperforms mBERT on every benchmark and matches or beats XLM (which uses parallel data) while using no parallel data at all.
XNLI is the primary benchmark for cross-lingual transfer. The task: given two sentences (premise and hypothesis), classify their relationship as entailment, contradiction, or neutral. Training is done in English only; evaluation is done in 15 languages.
| Model | EN | FR | DE | AR | ZH | HI | SW | Avg (15 langs) |
|---|---|---|---|---|---|---|---|---|
| mBERT | 82.1 | 76.6 | 74.2 | 67.7 | 69.6 | 62.1 | 58.8 | 70.4 |
| XLM (MLM+TLM) | 85.0 | 80.2 | 78.7 | 73.1 | 76.7 | 69.6 | 64.6 | 75.0 |
| XLM-R Base | 85.8 | 79.7 | 78.7 | 73.8 | 76.5 | 72.4 | 66.5 | 76.2 |
| XLM-R Large | 89.1 | 84.1 | 83.9 | 77.3 | 80.9 | 75.6 | 71.2 | 80.9 |
Key observations:
XLM-R Large dominates. It beats mBERT by 10.5 points on average and XLM by 5.9 points — a massive improvement. And it achieves this without any parallel data (unlike XLM which requires translation pairs).
```python
# Analyzing where XLM-R's gains come from
# Compare mBERT → XLM-R Large improvements by language family:
#
# Indo-European (Latin script): avg +7.9 points
#   French:  76.6 → 84.1 (+7.5)
#   German:  74.2 → 83.9 (+9.7)
#   Spanish: 77.3 → 83.7 (+6.4)
#
# Non-Indo-European / Non-Latin script: avg +11.7 points
#   Arabic:  67.7 → 77.3 (+9.6)
#   Hindi:   62.1 → 75.6 (+13.5)
#   Swahili: 58.8 → 71.2 (+12.4)
#   Chinese: 69.6 → 80.9 (+11.3)
#
# The pattern: non-Latin languages gain MORE
# Why? Three factors:
#   1. CC-100 has much more data than Wikipedia for these languages
#   2. 250K SentencePiece vocab gives better coverage of non-Latin scripts
#   3. Exponential smoothing (α=0.3) gives them more training signal
```
Low-resource languages gain the most. Swahili goes from 58.8 (mBERT) to 71.2 (XLM-R Large) — a 12.4-point improvement. Hindi goes from 62.1 to 75.6 (+13.5). The exponential smoothing and larger data pay off most where they're needed most.
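The per-language gains can be recomputed directly from the XNLI table. The short sketch below copies the mBERT and XLM-R Large rows from that table and ranks languages by improvement; the dictionary layout and variable names are only for illustration.

```python
# Scores copied from the XNLI table (accuracy, English-only fine-tuning)
xnli = {
    "mBERT":       {"en": 82.1, "fr": 76.6, "de": 74.2, "ar": 67.7,
                    "zh": 69.6, "hi": 62.1, "sw": 58.8},
    "XLM-R Large": {"en": 89.1, "fr": 84.1, "de": 83.9, "ar": 77.3,
                    "zh": 80.9, "hi": 75.6, "sw": 71.2},
}

# Improvement of XLM-R Large over mBERT, per language
gains = {
    lang: round(xnli["XLM-R Large"][lang] - xnli["mBERT"][lang], 1)
    for lang in xnli["mBERT"]
}

# Largest gains first
ranked = sorted(gains.items(), key=lambda kv: kv[1], reverse=True)
for lang, gain in ranked:
    print(f"{lang}: +{gain}")
```

Sorting this way puts Hindi and Swahili at the top and English and French at the bottom, matching the observation that the languages with the least Wikipedia data benefit most.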
On WikiANN NER (recognize person, location, organization names in 40 languages), XLM-R Large achieves 65.4 F1 averaged across all languages, compared to 62.2 for mBERT. The improvement is particularly large for languages with non-Latin scripts.
Let's put cross-lingual transfer to the test: fine-tune a classifier on English, then apply it to any of the 100 pretraining languages without additional training.
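```python
import torch
from transformers import XLMRobertaForSequenceClassification, XLMRobertaTokenizer

# 1. Load pre-trained XLM-R
model = XLMRobertaForSequenceClassification.from_pretrained(
    "xlm-roberta-large", num_labels=3  # entailment / neutral / contradiction
)
tok = XLMRobertaTokenizer.from_pretrained("xlm-roberta-large")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)  # lr is an assumed value

# 2. Fine-tune on English XNLI training data
# english_xnli_train is a placeholder: an iterable of batches with
# 'premise' and 'hypothesis' (lists of str) and 'label' (list of int)
model.train()
for epoch in range(3):
    for batch in english_xnli_train:
        inputs = tok(batch['premise'], batch['hypothesis'],
                     return_tensors="pt", padding=True, truncation=True)
        loss = model(**inputs, labels=torch.tensor(batch['label'])).loss
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

# 3. Zero-shot evaluation on OTHER languages (no training!)
premise_ja = "男がギターを弾いている。"
hypothesis_ja = "男は音楽家だ。"
model.eval()
with torch.no_grad():
    inputs = tok(premise_ja, hypothesis_ja, return_tensors="pt")
    pred = model(**inputs).logits.argmax(dim=-1).item()
# pred indexes the label set used during fine-tuning (0 = entailment in XNLI),
# even though the model never saw a single labeled Japanese example.
```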
XLM-R stands at the crossroads of two trends: scaling language models and making them multilingual. Its impact continues to shape how we build models for the world's languages.
| Model | Year | Languages | Key Innovation |
|---|---|---|---|
| mBERT | 2019 | 104 | First: BERT on Wikipedia in 104 languages |
| XLM | 2019 | 100 | TLM: use parallel data for cross-lingual signal |
| XLM-R (this paper) | 2020 | 100 | Scale: 2.5 TB of CommonCrawl, no parallel data needed |
| mT5 | 2021 | 101 | Encoder-decoder architecture on mC4 (6.3T tokens) |
| BLOOM | 2022 | 46 | Open science: community-built multilingual GPT |
| PaLM | 2022 | ~100 | 540B params, few-shot multilingual capabilities |
| LLaMA-3 | 2024 | ~30 | 128K vocab, strong multilingual despite English focus |
The tokenization tax. XLM-R uses a shared SentencePiece vocabulary, but it's still biased toward high-resource languages. Bengali text requires 2-3x more tokens than English for the same semantic content. This means Bengali "costs" more compute and context window — a problem quantified in Ahia et al. (EMNLP 2023).
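A rough calculation shows what a 2-3x token overhead means in practice. The sketch below assumes a 512-token context window (XLM-R's maximum sequence length) and uses the standard fact that self-attention cost grows quadratically with sequence length; the function names are hypothetical and the ratios are the illustrative 2x and 3x from the text, not measured values.

```python
def effective_context(max_tokens, tokens_per_english_token):
    """English-equivalent content that fits in a fixed token budget."""
    return max_tokens / tokens_per_english_token

def attention_cost_ratio(tokens_per_english_token):
    """Relative self-attention cost for the same content (quadratic in length)."""
    return tokens_per_english_token ** 2

for ratio in (1.0, 2.0, 3.0):
    ctx = effective_context(512, ratio)
    cost = attention_cost_ratio(ratio)
    print(f"overhead {ratio:.0f}x: ~{ctx:.0f} English-equivalent tokens fit, "
          f"attention cost x{cost:.0f}")
```

At a 3x overhead, a 512-token window holds only about a third of the content it would hold for English, and the attention computation for the same content is about nine times more expensive.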
Transfer gap remains. Even XLM-R Large has an 8-point gap between English and the average language on XNLI. For safety-critical applications, zero-shot transfer may not be sufficient — some in-language data is still needed.
Encoder only. XLM-R is an encoder model — it produces representations, not text. For generative multilingual tasks, you need models like mT5 or multilingual GPT variants.
XLM-R is typically used as a feature extractor or fine-tuning base for downstream tasks. Here's a complete example of zero-shot cross-lingual sentiment classification:
```python
# Complete XLM-R fine-tuning + cross-lingual deployment
import torch
from transformers import (
    XLMRobertaForSequenceClassification,
    XLMRobertaTokenizer,
    Trainer,
    TrainingArguments,
)

LABELS = ["negative", "neutral", "positive"]  # list index = class id

# 1. Load model and tokenizer
model = XLMRobertaForSequenceClassification.from_pretrained(
    "xlm-roberta-large", num_labels=len(LABELS)
)
tokenizer = XLMRobertaTokenizer.from_pretrained("xlm-roberta-large")

# 2. Fine-tune on English data ONLY
# english_sentiment_dataset / english_eval_dataset are placeholders for
# datasets already tokenized with `tokenizer` and carrying a "labels" column.
training_args = TrainingArguments(
    output_dir="./xlmr-sentiment",
    num_train_epochs=3,
    per_device_train_batch_size=16,
    learning_rate=2e-5,
    warmup_ratio=0.1,
)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=english_sentiment_dataset,  # English only!
    eval_dataset=english_eval_dataset,
)
trainer.train()

# 3. Deploy on ANY pretraining language (zero-shot)
def classify(text):
    """Return the predicted sentiment label for a single text."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    inputs = {k: v.to(model.device) for k, v in inputs.items()}
    model.eval()
    with torch.no_grad():
        logits = model(**inputs).logits
    return LABELS[logits.argmax(dim=-1).item()]

# Applies to languages never seen during fine-tuning
classify("I love this product!")   # expected: "positive" (English)
classify("この製品が大好きです!")     # expected: "positive" (Japanese)
classify("Ninapenda bidhaa hii!")  # expected: "positive" (Swahili)
```
"One model, one hundred languages, no parallel data. The tower of Babel was a problem of capacity, not of principle."