XLM-RoBERTa (Conneau et al., 2020)

Chapter 0: The Language Barrier

You've built a sentiment classifier for English product reviews. It works great — 94% accuracy. Now your company expands to Japan, Brazil, and Turkey. You need the same classifier for Japanese, Portuguese, and Turkish. The problem: you have zero labeled training data in those languages.

This is the language barrier in NLP. The vast majority of labeled datasets exist only in English. Building separate models for each of the world's 7,000+ languages would require labeled data for each one — data that simply doesn't exist for most languages.

Approach	What It Requires	Scales?
Train per language	Labeled data in EACH target language	No — most languages have no labeled data
Translate and train	Machine translation system for each language pair	Partially — but translation introduces noise
Cross-lingual transfer	Multilingual representations + labels in ONE language	Yes — one model, many languages

Cross-lingual transfer is the dream: train a classifier on English data, then apply it directly to Japanese, Portuguese, Turkish — any language — without any additional training. The classifier works because the representations are shared across languages.

XLM-RoBERTa's promise: Train a single masked language model on 2.5 terabytes of text in 100 languages. The resulting representations are so deeply multilingual that a classifier trained on English features can understand Japanese, Arabic, and Swahili — despite never seeing a single labeled example in those languages. No parallel data. No translation. Just one giant model that learns language-agnostic features by reading the internet in 100 languages.

Think of it like a polyglot who reads extensively in many languages. After enough reading, they develop abstract concepts of "positive sentiment" and "named entity" that transcend any specific language. XLM-R does the same — but computationally.
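To make the idea of a shared representation space concrete, here is a minimal toy sketch. It assumes a hypothetical `embed` function standing in for XLM-R features: every "sentence" is a vector made of a shared sentiment direction, a small language-specific offset, and noise. The function, dimensions, and offsets are all invented for illustration. A nearest-centroid classifier is fitted on "English" vectors only and then evaluated on "Japanese" vectors.

```python
import random

random.seed(0)
DIM = 8

def embed(sentiment, lang_offset, noise=0.3):
    """Hypothetical encoder: shared sentiment direction (first 4 dims)
    plus a language-specific offset and Gaussian noise."""
    base = [float(sentiment) if i < 4 else 0.0 for i in range(DIM)]
    return [b + o + random.gauss(0, noise) for b, o in zip(base, lang_offset)]

def centroid(vectors):
    """Component-wise mean of a list of vectors."""
    return [sum(col) / len(col) for col in zip(*vectors)]

def predict(x, pos_c, neg_c):
    """Assign the label of the closer centroid (squared Euclidean distance)."""
    d_pos = sum((a - b) ** 2 for a, b in zip(x, pos_c))
    d_neg = sum((a - b) ** 2 for a, b in zip(x, neg_c))
    return 1 if d_pos < d_neg else -1

en_offset = [0.0] * DIM
ja_offset = [0.2 if i >= 4 else 0.0 for i in range(DIM)]  # differs off the sentiment axes

# "Fine-tune" on English examples only
en_train = [(embed(s, en_offset), s) for s in [1, -1] * 50]
pos_c = centroid([x for x, s in en_train if s == 1])
neg_c = centroid([x for x, s in en_train if s == -1])

# Zero-shot evaluation on the toy "Japanese" set
ja_test = [(embed(s, ja_offset), s) for s in [1, -1] * 50]
accuracy = sum(predict(x, pos_c, neg_c) == s for x, s in ja_test) / len(ja_test)
print(f"zero-shot accuracy on the toy 'Japanese' set: {accuracy:.2f}")
```

The classifier transfers only because both languages place sentiment along the same axes. If the toy "Japanese" vectors encoded sentiment in different dimensions, the English centroids would be useless, which is why the quality of the shared space matters so much.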

[Interactive demo: Cross-lingual Transfer Demo. Shows how XLM-R transfers English sentiment training to other languages with zero additional labeled data, using the shared representation space.]

What problem does cross-lingual transfer learning solve?

A. Most languages lack labeled training data. Cross-lingual transfer lets you train on labeled data in one language (typically English) and apply the model to other languages, using shared multilingual representations as the bridge
B. Machine translation is too expensive for every language pair
C. English models are too large to deploy

Chapter 1: Cross-lingual Transfer

How can a model learn representations that work across languages? The key insight is that if you train a masked language model on text from multiple languages simultaneously, the model discovers shared structure that transcends any single language.

Why does this work at all?

Consider sentiment. In English, "This movie is amazing!" is positive. In French, "Ce film est incroyable!" is positive. The surface tokens are completely different, but the syntactic structures and semantic patterns are similar. A multilingual model that processes both languages learns to map these parallel structures to nearby points in a shared vector space.

More concretely, multilingual models exploit three types of cross-lingual signal:

Signal Type	What It Is	Example
Shared subwords	Many languages share subword tokens: numbers, borrowed words, named entities	"Obama", "2024", "COVID" are the same tokens across languages that share a script
Structural similarity	Many languages have SVO order, use articles, have similar syntactic patterns	"The cat sat" vs "Le chat s'est assis" — similar structure
Parameter sharing	One Transformer processes all languages — forced to find shared representations	The same attention heads learn "subject" and "object" across languages

The anchor hypothesis. Shared subwords (names, numbers, loanwords) act as "anchors" that pull different languages into the same region of representation space. If "Obama" appears in English, French, and German sentences with similar contexts, the Transformer learns to give all those sentences similar representations, and other words in those contexts get pulled along. Under this hypothesis, anchors bootstrap cross-lingual alignment without any explicit parallel data.

Prior work: mBERT and XLM

Multilingual BERT (mBERT) by Google (2019) was the first demonstration. It was simply BERT trained on Wikipedia text from 104 languages. Remarkably, despite no cross-lingual objective, it learned cross-lingual representations. But it had limitations: Wikipedia is small for many languages, and the model capacity (110M parameters) was spread too thin across 104 languages.

XLM (Conneau & Lample, 2019) improved on mBERT by adding a Translation Language Modeling (TLM) objective: mask tokens in parallel sentences and let the model attend to both languages. This gave an explicit cross-lingual signal but required parallel data — which is scarce for most language pairs.
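The difference between the two objectives is easiest to see in how a training example is built. The sketch below is a simplification under stated assumptions: whitespace tokenization, a `<mask>` string instead of token ids, and none of XLM's language embeddings or position resets. The helper names are invented for illustration.

```python
import random

MASK = "<mask>"
SEP = "</s>"

def mask_tokens(tokens, mask_prob, rng):
    """Replace each token with MASK with probability mask_prob."""
    return [MASK if rng.random() < mask_prob else tok for tok in tokens]

def mlm_example(sentence, mask_prob, rng):
    """MLM: one monolingual sentence, so a masked word can only be
    recovered from context in the same language."""
    return mask_tokens(sentence.split(), mask_prob, rng)

def tlm_example(source, target, mask_prob, rng):
    """TLM: a translation pair is concatenated and masked on both sides,
    so a masked word can also be recovered from its translation."""
    return (mask_tokens(source.split(), mask_prob, rng)
            + [SEP]
            + mask_tokens(target.split(), mask_prob, rng))

rng = random.Random(0)
en = "the cat sat on the mat"
fr = "le chat s'est assis sur le tapis"

mlm = mlm_example(en, 0.3, rng)
tlm = tlm_example(en, fr, 0.3, rng)
print("MLM:", " ".join(mlm))
print("TLM:", " ".join(tlm))
```

The TLM example needs `fr` to be a translation of `en`, which is exactly the parallel data requirement that limits the approach.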

python
# The evolution of multilingual models
# 1. mBERT (2019)
#    - BERT trained on Wikipedia in 104 languages
#    - Objective: MLM (masked language modeling) only
#    - Data: ~2.5B words total across all languages
#    - Problem: Wikipedia is tiny for low-resource languages

# 2. XLM (2019)
#    - Added TLM (translation language modeling)
#    - Objective: MLM + TLM (requires parallel data)
#    - Better cross-lingual, but parallel data is limited

# 3. XLM-R (2020) — THIS PAPER
#    - Just MLM, no TLM needed
#    - Massive data: 2.5 TB of CommonCrawl text
#    - Key insight: scale the DATA, not the objective
#    - Result: beats XLM-TLM despite using LESS supervision

XLM-R's key insight: data > objectives. The XLM paper showed that TLM (parallel data) helps cross-lingual transfer. XLM-R showed something surprising: with ENOUGH monolingual data, you don't need parallel data at all. Pure MLM on massive multilingual text works better than MLM+TLM on smaller data. More data in each language gives the model more context to discover cross-lingual patterns on its own.

[Interactive demo: Multilingual Model Evolution. Steps from mBERT to XLM-R, showing each model's data, objective, and performance on cross-lingual benchmarks.]

What was XLM-R's key insight compared to XLM?

A. XLM-R used a bigger Transformer architecture
B. With enough monolingual data (2.5 TB from CommonCrawl), plain masked language modeling works better than adding a translation objective (TLM) on smaller data. Scaling data matters more than adding cross-lingual supervision
C. XLM-R added a new type of attention mechanism

Chapter 2: Scaling the Data

The foundation of XLM-R is data. Where mBERT used Wikipedia (2.5 billion words across 104 languages), XLM-R uses CommonCrawl (2.5 terabytes of filtered text across 100 languages). This isn't just "more data": it changes how much text the model sees for low-resource languages, which Wikipedia covers only thinly.

CC-100: Cleaning the CommonCrawl

Raw CommonCrawl data is messy: it's scraped from the entire web. Conneau et al. built the CC-100 dataset by applying the CCNet pipeline (Wenzek et al., 2019) to 100 languages:

1. CommonCrawl Dumps

Raw web crawl snapshots. Hundreds of TB of text in all languages mixed together.

↓

2. Language ID (fastText)

Classify each page into one of 176 languages using a fastText classifier. Keep pages with confidence > 0.5.

↓

3. Deduplication

Remove near-duplicate paragraphs within each language. Crucial for data quality.

↓

4. CC-100 Dataset

2.5 TB total. 100 languages. From 55GB (English) down to 1MB (lowest). An order of magnitude or more beyond Wikipedia for many languages.
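The pipeline steps above can be sketched in a few lines. This is a toy version under loud assumptions: `detect_language` is a hypothetical stand-in for a fastText language-ID model (here just a couple of hand-written rules), and deduplication is exact matching on normalized paragraphs rather than true near-duplicate detection.

```python
import hashlib

def detect_language(text):
    """Hypothetical stand-in for a fastText language-ID model.
    Returns (language_code, confidence)."""
    if any("\u3040" <= ch <= "\u30ff" for ch in text):  # hiragana / katakana
        return "ja", 0.99
    if any(w in text.lower().split() for w in ("the", "and", "is")):
        return "en", 0.95
    return "und", 0.30

def clean(pages, min_confidence=0.5):
    """Keep confidently identified pages, then drop repeated paragraphs per language."""
    corpus = {}   # language -> list of unique paragraphs
    seen = set()  # (language, hash) pairs already kept
    for page in pages:
        lang, conf = detect_language(page)
        if conf <= min_confidence:
            continue  # language ID too uncertain
        for para in page.split("\n"):
            norm = " ".join(para.lower().split())
            if not norm:
                continue
            key = (lang, hashlib.sha1(norm.encode("utf-8")).hexdigest())
            if key in seen:
                continue  # duplicate paragraph, e.g. boilerplate
            seen.add(key)
            corpus.setdefault(lang, []).append(para.strip())
    return corpus

pages = [
    "The cat is here.\nAccept all cookies",
    "The dog is there.\nAccept all cookies",
    "猫がいる。",
    "zzz qqq",
]
corpus = clean(pages)
print(corpus)
```

The repeated cookie notice is kept once and the unidentifiable page is dropped, which is the behaviour the real pipeline aims for at web scale.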

Data size comparison

The scale difference between Wikipedia and CommonCrawl is enormous for low-resource languages:

Language	Wikipedia (mBERT)	CC-100 (XLM-R)	Increase
English	2.5 GB	55 GB	22x
French	1.1 GB	57 GB	52x
Swahili	11 MB	332 MB	30x
Burmese	8 MB	214 MB	27x
Urdu	30 MB	730 MB	24x
Yoruba	1.2 MB	28 MB	23x

For Yoruba, the data goes from 1.2 MB (barely enough to learn anything) to 28 MB. This is still small compared to English's 55 GB, but it's enough for the model to learn basic Yoruba linguistic structure. The exponential smoothing in sampling further amplifies low-resource languages' training signal.

python
# Data quality matters as much as quantity
# Wikipedia: curated, factual, encyclopedic style
# CommonCrawl: diverse but noisy — includes:
#   - News articles (formal)
#   - Blog posts (informal)
#   - Forum discussions (colloquial)
#   - Product descriptions (commercial)
#   - Boilerplate HTML (noise)
#
# The deduplication step is crucial:
# Without it, the model memorizes repeated boilerplate
# (cookie notices, navigation menus, copyright footers)
# instead of learning language structure

The data imbalance problem

Even after cleaning, the data is wildly imbalanced. English has 300 billion tokens. Swahili has 275 million. Urdu has 730 million. If you train on the natural distribution, the model overwhelmingly learns English features and barely touches low-resource languages.

XLM-R uses exponential smoothing to rebalance sampling. Let n_i be the number of tokens in language i. The sampling probability for language i is:

p_i = n_i^α / ∑_j n_j^α

Where α is a smoothing parameter. When α = 1, you sample proportionally (English dominates). When α = 0, every language is equally sampled (low-resource gets a huge boost but English suffers). Conneau et al. found α = 0.3 works best — it upsamples low-resource languages significantly while not starving high-resource ones.

python
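import numpy as np

# Example: 5 languages with different data sizes (in millions of tokens)
data_sizes = {
    'English':   300_000,  # 300B tokens
    'French':     56_000,  # 56B
    'Hindi':       2_000,  # 2B
    'Swahili':       275,  # 275M
    'Urdu':          730,  # 730M
}

def compute_sampling(sizes, alpha):
    """Return p_i = n_i^alpha / sum_j n_j^alpha for each language."""
    values = np.array(list(sizes.values()), dtype=np.float64)
    probs = values ** alpha
    return probs / probs.sum()

for alpha in (1.0, 0.3, 0.0):
    probs = compute_sampling(data_sizes, alpha)
    summary = ", ".join(f"{lang} {p:.2%}" for lang, p in zip(data_sizes, probs))
    print(f"alpha={alpha}: {summary}")

# α = 1.0 (proportional): English gets 83.6% of training, Swahili 0.08%
# α = 0.3 (smoothed):     English gets 47.3%, Swahili gets 5.8%
# α = 0.0 (uniform):      Each language gets 20%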

α = 0.3 is the Goldilocks value. Too high, and the model ignores low-resource languages. Too low, and high-resource languages are starved of training signal. In the five-language example above, moving from α = 1 to α = 0.3 raises Swahili's sampling probability from 0.08% to 5.8%, roughly a 75x boost. The model sees Swahili text about 75 times more often than it would under proportional sampling, giving it a fighting chance to learn useful Swahili representations.
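To see what the smoothed distribution means during training, the sketch below draws a language for each of 10,000 hypothetical batches and recomputes Swahili's boost factor. It reuses the five toy token counts from the example above. Drawing one language per batch is an assumption made for illustration, not a description of the actual XLM-R data loader.

```python
import random

# Toy token counts in millions, matching the five-language example
tokens_millions = {
    "English": 300_000,
    "French": 56_000,
    "Hindi": 2_000,
    "Swahili": 275,
    "Urdu": 730,
}

def smoothed_probs(sizes, alpha):
    """p_i = n_i^alpha / sum_j n_j^alpha"""
    weights = {lang: n ** alpha for lang, n in sizes.items()}
    total = sum(weights.values())
    return {lang: w / total for lang, w in weights.items()}

proportional = smoothed_probs(tokens_millions, 1.0)
smoothed = smoothed_probs(tokens_millions, 0.3)
boost = smoothed["Swahili"] / proportional["Swahili"]

# Draw the language of each batch from the smoothed distribution
rng = random.Random(0)
langs = list(smoothed)
draws = rng.choices(langs, weights=[smoothed[l] for l in langs], k=10_000)
counts = {lang: draws.count(lang) for lang in langs}

print(f"Swahili boost factor: {boost:.1f}x")
print("batches per language:", counts)
```

With only five languages the probabilities are far larger than they would be across all 100, so the absolute numbers are illustrative. The relative effect (small languages up, large languages down) is the point.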

[Interactive demo: Sampling Distribution Explorer. Shows how the exponent α rebalances language sampling: at α=1 (proportional) English dominates, at α=0 (uniform) all languages are equal, and XLM-R uses α=0.3.]

Why does XLM-R use exponential smoothing (α = 0.3) instead of proportional sampling?

A. Proportional sampling would let English (300B tokens) dominate training, and low-resource languages like Swahili (275M tokens) would barely be seen. Smoothing with α=0.3 raises Swahili's sampling probability many times over, so the model learns useful representations for all 100 languages
B. Because proportional sampling is computationally more expensive
C. Because all languages have equal amounts of training data

Chapter 3: The Curse of Multilinguality

Here's the central tension of multilingual models: every language you add dilutes the model's capacity for every other language. This is the curse of multilinguality.

Imagine a Transformer with 110M parameters (like BERT-base). If it learns English only, all 110M parameters are dedicated to English. If it learns 100 languages, those same 110M parameters must represent the vocabulary, syntax, semantics, and factual knowledge of ALL 100 languages. Split evenly, each language would get roughly 1.1M parameters' worth of capacity, a 100x reduction. Shared parameters soften this in practice, but the total budget is still fixed.

The evidence

Conneau et al. ran a controlled experiment. They trained multiple XLM-R models, each on a different number of languages (1, 7, 15, 30, 100), and measured performance on a shared evaluation set. The results are striking:

# Languages	EN accuracy	Low-resource avg	Overall avg
1 (English only)	92.3	— (no transfer)	—
7	91.1	78.5	85.4
15	89.8	79.2	84.9
30	88.5	79.8	84.5
100	87.1	80.1	84.2

Adding more languages hurts high-resource languages (English drops from 92.3 to 87.1) but helps low-resource languages (which benefit from cross-lingual transfer). This is a genuine tradeoff — you can't get both without increasing capacity.

The solution: make the model bigger. If the curse is limited capacity, the cure is more capacity. XLM-R uses a 550M-parameter model (XLM-R Large) — 5x bigger than mBERT. With sufficient capacity, the curse is largely overcome: English performance approaches monolingual levels while cross-lingual transfer remains strong. This is the paper's key practical finding: multilingual models need to be large to avoid the dilution penalty.

Vocabulary size matters too

mBERT used a 110K shared vocabulary (WordPiece). XLM-R uses a 250K SentencePiece vocabulary. This larger vocabulary is crucial for multilingual coverage. With only 110K tokens, many scripts (Chinese, Japanese, Korean, Arabic) get fragmented into tiny pieces, making sequences very long and learning harder.

python
# Vocabulary allocation across scripts
# mBERT (110K WordPiece):
#   Latin scripts:  ~40K tokens (dominates)
#   CJK characters: ~20K tokens (under-represented)
#   Other scripts:  ~50K tokens (heavily fragmented)

# XLM-R (250K SentencePiece):
#   Latin scripts:  ~60K tokens
#   CJK characters: ~50K tokens (much better coverage)
#   Other scripts:  ~140K tokens (decent coverage)

# Result: "नमस्ते" (Hindi for "hello")
# mBERT:  5 tokens (heavily fragmented)
# XLM-R:  2 tokens (better coverage of Devanagari)
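The link between vocabulary coverage and sequence length can be shown with a toy segmenter. This is a minimal sketch, assuming a greedy longest-match rule with a single-character fallback. It is not the WordPiece or SentencePiece algorithm, and both vocabularies are invented.

```python
def greedy_tokenize(word, vocab):
    """Greedy longest-match segmentation; falls back to single characters."""
    pieces, i = [], 0
    while i < len(word):
        # Try the longest candidate first, shrinking to one character
        for j in range(len(word), i, -1):
            if word[i:j] in vocab or j == i + 1:
                pieces.append(word[i:j])
                i = j
                break
    return pieces

small_vocab = {"un", "able"}
large_vocab = small_vocab | {"break", "unbreak", "breakable"}

word = "unbreakable"
small_pieces = greedy_tokenize(word, small_vocab)
large_pieces = greedy_tokenize(word, large_vocab)

print(len(small_pieces), small_pieces)
print(len(large_pieces), large_pieces)
```

A script with few dedicated vocabulary entries behaves like the small vocabulary here: most of the word falls back to single characters, so the same text costs many more positions in the 512-token window.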

[Interactive demo: Curse of Multilinguality. Adding languages lowers English performance while cross-lingual transfer to low-resource languages improves; a larger model recovers much of the lost capacity.]

What is the "curse of multilinguality" and how does XLM-R address it?

A. Adding more languages dilutes model capacity: each language gets fewer parameters. High-resource languages (English) lose accuracy while low-resource languages gain from cross-lingual transfer. XLM-R addresses this by using a much larger model (550M params vs 110M for mBERT), giving enough capacity for all 100 languages without severe dilution
B. The model becomes too slow when processing 100 languages
C. Some languages don't have Unicode support

Chapter 4: Architecture & Training

XLM-R is architecturally identical to RoBERTa — a Transformer encoder trained with masked language modeling. The novelty isn't in the architecture but in the scale of multilingual data and the training recipe. Let's trace the exact setup.

Model specifications

Config	XLM-R Base	XLM-R Large
Layers	12	24
Hidden dim	768	1024
Attention heads	12	16
Parameters	270M	550M
Vocabulary	250K SentencePiece (shared)
Max sequence length	512 tokens
Training data	CC-100: 2.5 TB, 100 languages
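A back-of-the-envelope count shows where those parameters live. The sketch below uses the layer counts, hidden sizes, and 250K vocabulary from the table. The 4x feed-forward width is an assumption (the usual BERT-style ratio, not stated above), and position embeddings and task heads are ignored.

```python
def estimate_encoder_params(vocab_size, hidden, layers, ffn_mult=4):
    """Rough parameter count for a BERT-style encoder.
    Returns (embedding_params, transformer_layer_params)."""
    embedding = vocab_size * hidden
    attention = 4 * (hidden * hidden + hidden)  # Q, K, V and output projections
    ffn = 2 * hidden * (ffn_mult * hidden) + ffn_mult * hidden + hidden
    layer_norms = 2 * 2 * hidden                # two LayerNorms, scale + bias each
    per_layer = attention + ffn + layer_norms
    return embedding, layers * per_layer

estimates = {}
for name, hidden, layers in [("XLM-R Base", 768, 12), ("XLM-R Large", 1024, 24)]:
    emb, body = estimate_encoder_params(250_000, hidden, layers)
    estimates[name] = {"embedding": emb, "layers": body, "total": emb + body}
    share = emb / (emb + body)
    print(f"{name}: embeddings {emb / 1e6:.0f}M + layers {body / 1e6:.0f}M "
          f"= {(emb + body) / 1e6:.0f}M ({share:.0%} in embeddings)")
```

Under these assumptions the totals land within a few percent of the 270M and 550M in the table. For the Base model, most of the parameters sit in the embedding table rather than in the Transformer layers, which is the price of a 250K vocabulary.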

Training objective: Just MLM

XLM-R uses only Masked Language Modeling (MLM). Randomly mask 15% of the input tokens and predict them. No next-sentence prediction (removed following RoBERTa's finding that NSP hurts). No translation language modeling (the key finding — you don't need parallel data).

L_MLM = − ∑_{i ∈ masked} log p(x_i | x_\i)

Where x_i is the masked token and x_\i is all other tokens. The model must predict each masked word from its multilingual context.
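A tiny worked example makes the formula concrete. The logits, targets, and masked positions below are invented toy values; the loss is the sum of negative log-probabilities over masked positions only, as in the formula (frameworks usually report the mean instead).

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over one position's vocabulary scores."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_z for v in logits]

def mlm_loss(log_probs, targets, masked_positions):
    """Negative log-likelihood summed over masked positions only."""
    return -sum(log_probs[i][targets[i]] for i in masked_positions)

# Toy setup: 4 positions, vocabulary of 3 token ids
logits = [
    [2.0, 0.0, 0.0],
    [0.0, 3.0, 0.0],  # masked, confident and correct
    [0.0, 0.0, 1.0],
    [1.0, 1.0, 1.0],  # masked, model has no idea
]
targets = [0, 1, 2, 0]
masked_positions = [1, 3]

log_probs = [log_softmax(row) for row in logits]
loss = mlm_loss(log_probs, targets, masked_positions)
print(f"MLM loss over {len(masked_positions)} masked positions: {loss:.4f}")
```

Positions 0 and 2 contribute nothing even though the model produces scores for them. The uniform guess at position 3 costs log 3, while the confident correct prediction at position 1 costs almost nothing.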

Training recipe from RoBERTa

XLM-R adopts the optimized training recipe from RoBERTa:

python
# XLM-R training hyperparameters
config = {
    'optimizer': 'Adam',
    'lr': 6e-4,           # peak learning rate
    'warmup_steps': 10000,
    'batch_size': 8192,    # very large batch
    'total_steps': 1_500_000,  # 1.5M steps
    'masking': 'dynamic',  # re-mask each epoch (not static)
    'nsp': False,          # no next-sentence prediction
    'full_sentences': True,  # pack full sentences (not pairs)
    'sampling_alpha': 0.3,  # language smoothing factor
    'fp16': True,          # mixed precision training
}
# Total compute: ~500 V100 GPUs for ~3 weeks

Dynamic masking matters. BERT used static masking: masks are generated once during preprocessing, so the same tokens are masked every time a sentence is seen. RoBERTa (and XLM-R) use dynamic masking: a fresh mask is sampled each time the model sees a sentence, so the same text yields a new prediction problem on every pass. For a model seeing 2.5 TB of text over 1.5M steps, this avoids repeatedly training on identical masked examples.
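Dynamic masking is simple to sketch. The version below follows the standard BERT-style rule (select 15% of positions; of those, 80% become the mask token, 10% a random token, 10% stay unchanged). `MASK_ID`, `VOCAB_SIZE`, and the `-100` ignore label are assumed toy values, not read from the XLM-R tokenizer.

```python
import random

MASK_ID = 0        # hypothetical id of the mask token
VOCAB_SIZE = 1000  # toy vocabulary size
IGNORE = -100      # label for positions that are not predicted

def dynamic_mask(token_ids, rng, mask_prob=0.15):
    """Sample a fresh mask; returns (model_inputs, labels)."""
    inputs = list(token_ids)
    labels = [IGNORE] * len(token_ids)
    for i, tok in enumerate(token_ids):
        if rng.random() >= mask_prob:
            continue
        labels[i] = tok                  # this position must be predicted
        roll = rng.random()
        if roll < 0.8:
            inputs[i] = MASK_ID          # 80%: replace with the mask token
        elif roll < 0.9:
            inputs[i] = rng.randrange(1, VOCAB_SIZE)  # 10%: random token
        # remaining 10%: keep the original token
    return inputs, labels

rng = random.Random(0)
sentence = list(range(1, 101))  # a toy 100-token "sentence"

first_inputs, first_labels = dynamic_mask(sentence, rng)
second_inputs, second_labels = dynamic_mask(sentence, rng)

first_masked = [i for i, l in enumerate(first_labels) if l != IGNORE]
second_masked = [i for i, l in enumerate(second_labels) if l != IGNORE]
print("pass 1 predicts positions:", first_masked)
print("pass 2 predicts positions:", second_masked)
```

Calling `dynamic_mask` inside the data loader, rather than once at preprocessing time, is all that separates dynamic from static masking.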

SentencePiece tokenization

XLM-R uses SentencePiece with a 250K vocabulary trained on the CC-100 data. SentencePiece is critical for multilingual models because it doesn't assume whitespace-separated words — it can tokenize Japanese, Chinese, and Thai directly from the raw character stream.

python
# SentencePiece tokenization examples
# Outputs shown in comments are illustrative; the exact pieces may differ.
from transformers import XLMRobertaTokenizer

tok = XLMRobertaTokenizer.from_pretrained("xlm-roberta-base")

# English
tok.tokenize("The cat sat on the mat")
# ['▁The', '▁cat', '▁sat', '▁on', '▁the', '▁mat']

# Japanese (no spaces between words!)
tok.tokenize("猫がマットの上に座った")
# ['▁', '猫', 'が', 'マ', 'ット', 'の', '上', 'に', '座', 'った']

# Arabic (right-to-left)
tok.tokenize("القطة جلست على الحصيرة")
# ['▁', 'القط', 'ة', '▁', 'جل', 'ست', '▁على', '▁ال', 'حص', 'ير', 'ة']

What each component does

Let's trace the full data flow for a single training batch:

python
# XLM-R training step: detailed data flow (pseudocode)
# sample_language, get_sentences, dynamic_mask, model and optimizer are
# placeholders for the components described in the text, not library functions.
import torch
import torch.nn.functional as F

# 1. Sample a language according to smoothed distribution
lang = sample_language(alpha=0.3)  # might be "sw" (Swahili)

# 2. Sample a batch of sentences from that language
batch = get_sentences(lang, batch_size=8192)
# Each sequence is up to 512 SentencePiece tokens
# Sentences are packed: if one is 200 tokens, the next starts at 201

# 3. Apply dynamic masking (a fresh mask every time!)
masked_ids, labels = dynamic_mask(batch, mask_prob=0.15)
# 15% of tokens are selected for prediction:
#   80% → <mask> token
#   10% → random token from vocab
#   10% → unchanged (but still predicted)
# labels holds the original ids at selected positions and -100 elsewhere

# 4. Forward pass through Transformer
logits = model(masked_ids)  # [8192, 512, 250000]
# Output: unnormalized scores over the 250K vocab for each position

# 5. Compute loss only on masked positions
mask_positions = labels != -100
loss = F.cross_entropy(logits[mask_positions], labels[mask_positions])
# The model must predict the original token from context
# "The <mask> sat on the mat" → predict "cat"

# 6. Backward + optimizer step
optimizer.zero_grad()
loss.backward()
optimizer.step()  # Adam with warmup + linear decay

Full sentence packing. BERT builds each input from a pair of segments separated by [SEP]. XLM-R instead packs as many full sentences as fit in 512 tokens, separated only by the end-of-sentence token. This means more text per training example and very little padding. RoBERTa found that this full-sentence format, trained without NSP, performs at least as well as BERT's segment-pair format.
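Packing can be sketched as a greedy loop over tokenized sentences. This is a simplified version under assumptions: sentences are never split across blocks (a long one is truncated instead), and `sep_id=2` is used as the end-of-sentence id for illustration.

```python
def pack_sentences(sentences, max_len=512, sep_id=2):
    """Greedily pack tokenized sentences into blocks of at most max_len ids,
    appending sep_id after every sentence."""
    blocks, current = [], []
    for sent in sentences:
        piece = list(sent[: max_len - 1]) + [sep_id]  # truncate overly long sentences
        if current and len(current) + len(piece) > max_len:
            blocks.append(current)  # block is full, start a new one
            current = []
        current.extend(piece)
    if current:
        blocks.append(current)
    return blocks

# Four toy "sentences" of 200, 200, 200 and 50 token ids
sentences = [[10] * 200, [11] * 200, [12] * 200, [13] * 50]
blocks = pack_sentences(sentences)
block_lengths = [len(b) for b in blocks]
print(block_lengths)
```

The first two sentences share a block, and the third starts a new one because it would overflow the 512-token limit. Padding is only needed for the leftover space at the end of each block.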

[Interactive demo: MLM Training Visualizer. Randomly masks tokens in multilingual text and shows the model's predictions; the same model handles all languages.]

What training recipe changes from original BERT does XLM-R inherit from RoBERTa?

A. Dynamic masking (different masks each epoch for more training signal), removal of NSP objective (it hurts performance), larger batch sizes, and full-sentence packing. These optimizations make pure MLM sufficient for strong cross-lingual transfer without needing parallel data
B. XLM-R uses a different attention mechanism from RoBERTa
C. XLM-R uses a smaller learning rate than RoBERTa

Chapter 5: Results & Benchmarks

XLM-R was evaluated on several cross-lingual benchmarks. The headline result: it outperforms mBERT on every benchmark and matches or beats XLM (which uses parallel data) while using no parallel data at all.

XNLI: Cross-lingual Natural Language Inference

XNLI is the primary benchmark for cross-lingual transfer. The task: given two sentences (premise and hypothesis), classify their relationship as entailment, contradiction, or neutral. Training is done in English only; evaluation is done in 15 languages.

Model	EN	FR	DE	AR	ZH	HI	SW	Avg (15 langs)
mBERT	82.1	76.6	74.2	67.7	69.6	62.1	58.8	70.4
XLM (MLM+TLM)	85.0	80.2	78.7	73.1	76.7	69.6	64.6	75.0
XLM-R Base	85.8	79.7	78.7	73.8	76.5	72.4	66.5	76.2
XLM-R Large	89.1	84.1	83.9	77.3	80.9	75.6	71.2	80.9

Key observations:

XLM-R Large dominates. It beats mBERT by 10.5 points on average and XLM by 5.9 points — a massive improvement. And it achieves this without any parallel data (unlike XLM which requires translation pairs).

Breaking down the results

python
# Analyzing where XLM-R's gains come from
# Compare mBERT → XLM-R Large improvements by language group:
#
# Indo-European, Latin script: avg +7.9 points
#   French: 76.6 → 84.1 (+7.5)
#   German: 74.2 → 83.9 (+9.7)
#   Spanish: 77.3 → 83.7 (+6.4)
#
# Non-Latin script or non-Indo-European: avg +11.7 points
#   Arabic: 67.7 → 77.3 (+9.6)
#   Hindi: 62.1 → 75.6 (+13.5)
#   Swahili: 58.8 → 71.2 (+12.4)
#   Chinese: 69.6 → 80.9 (+11.3)
#
# The pattern: languages in the second group gain MORE
# Why? Three factors:
# 1. CC-100 has much more data than Wikipedia for these languages
# 2. 250K SentencePiece vocab gives better coverage of non-Latin scripts
# 3. Exponential smoothing (α=0.3) gives them more training signal

Low-resource languages gain the most. Swahili goes from 58.8 (mBERT) to 71.2 (XLM-R Large) — a 12.4-point improvement. Hindi goes from 62.1 to 75.6 (+13.5). The exponential smoothing and larger data pay off most where they're needed most.
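The per-language gains can be recomputed directly from the XNLI table above. The snippet below copies the mBERT and XLM-R Large rows and ranks languages by improvement. It only does arithmetic on numbers already shown.

```python
# XNLI accuracy per language, copied from the results table
xnli = {
    "mBERT":       {"en": 82.1, "fr": 76.6, "de": 74.2, "ar": 67.7,
                    "zh": 69.6, "hi": 62.1, "sw": 58.8},
    "XLM-R Large": {"en": 89.1, "fr": 84.1, "de": 83.9, "ar": 77.3,
                    "zh": 80.9, "hi": 75.6, "sw": 71.2},
}

# Improvement of XLM-R Large over mBERT, rounded to one decimal
gains = {
    lang: round(xnli["XLM-R Large"][lang] - xnli["mBERT"][lang], 1)
    for lang in xnli["mBERT"]
}

# Languages sorted from largest to smallest gain
ranked = sorted(gains.items(), key=lambda kv: kv[1], reverse=True)
for lang, gain in ranked:
    print(f"{lang}: +{gain}")
```

Hindi, Swahili, and Chinese top the ranking, while English gains the least, consistent with the claim that the benefits concentrate where mBERT was weakest.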

Named Entity Recognition

On WikiANN NER (recognize person, location, organization names in 40 languages, as evaluated in the XTREME benchmark), XLM-R Large achieves 65.4 F1 averaged across all languages, compared to 62.2 for mBERT. The improvement is particularly large for languages with non-Latin scripts.

The GLUE sanity check. XLM-R also comes close to RoBERTa (English-only) on the English GLUE benchmark: 91.8 vs 92.8 on average. This suggests that being multilingual costs little English performance when the model is large enough. The curse of multilinguality is largely cured by scale.

[Interactive demo: XNLI Results Explorer. Per-language XNLI scores for each model; the gap between mBERT and XLM-R is largest for low-resource languages.]

What is the most significant finding from XLM-R's XNLI results?

A. XLM-R Large outperforms XLM (which uses parallel data) by 5.9 points on average using only monolingual data, showing that massive scale of monolingual data can replace the need for parallel data. The biggest gains are for low-resource languages (Swahili +12.4 points over mBERT)
B. XLM-R is faster than mBERT at inference
C. XLM-R uses less memory than mBERT

Chapter 6: Transfer Showcase

Let's put cross-lingual transfer to the test. The showcase simulation lets you experience what XLM-R does: train a classifier on English, then deploy it on any of 100 languages without additional training.

How zero-shot transfer works in practice

1. Load XLM-R

Pre-trained on 100 languages. Representations already cross-lingual.

↓

2. Fine-tune on English

Add a classification head. Train on English labeled data (e.g., XNLI). Update all weights.

↓

3. Deploy on any language

Input text in Japanese, Arabic, Swahili: the model can often classify it correctly because the representations are largely language-agnostic.

python
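import torch
from transformers import XLMRobertaForSequenceClassification, XLMRobertaTokenizer

# 1. Load pre-trained XLM-R
model = XLMRobertaForSequenceClassification.from_pretrained(
    "xlm-roberta-large", num_labels=3  # entailment / neutral / contradiction
)
tok = XLMRobertaTokenizer.from_pretrained("xlm-roberta-large")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)  # assumed fine-tuning LR

# 2. Fine-tune on English XNLI training data
# english_xnli_train is a placeholder: an iterable of batches, each a dict with
# lists under 'premise', 'hypothesis' and 'label'
model.train()
for epoch in range(3):
    for batch in english_xnli_train:
        inputs = tok(batch['premise'], batch['hypothesis'],
                    return_tensors="pt", padding=True, truncation=True)
        labels = torch.tensor(batch['label'])
        loss = model(**inputs, labels=labels).loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

# 3. Zero-shot evaluation on OTHER languages (no training!)
premise_ja = "男がギターを弾いている。"       # "A man is playing a guitar."
hypothesis_ja = "男は楽器を演奏している。"    # "A man is playing an instrument."
model.eval()
with torch.no_grad():
    inputs = tok(premise_ja, hypothesis_ja, return_tensors="pt")
    pred = model(**inputs).logits.argmax(dim=-1).item()
# XNLI label ids: 0=entailment, 1=neutral, 2=contradiction
# A well-tuned model should predict 0 here, despite never seeing
# a single Japanese labeled example.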

This actually works. On the XNLI test set, XLM-R Large achieves 89.1% on English (where it was fine-tuned) and 80.9% averaged across all 15 languages (14 of which contributed no labeled data). The ~8 point gap between English and the average is the "transfer gap": it's remarkably small given that the model saw zero labeled data in those languages.

[Interactive demo: Cross-lingual NLI Simulator. XLM-R classifies sentence pairs in different languages using a classifier trained only on English data, with a control for the number of fine-tuning examples.]

In zero-shot cross-lingual transfer with XLM-R, what enables a classifier trained on English to work on Japanese?

A. XLM-R's pre-training on 100 languages creates representations where the same concepts (sentiment, entailment) are encoded similarly regardless of language, so a linear classifier trained on English features automatically works on Japanese features because they occupy the same regions of representation space
B. XLM-R internally translates Japanese to English before classifying
C. The classification head has separate parameters for each language

Chapter 7: Connections

XLM-R stands at the crossroads of two trends: scaling language models and making them multilingual. Its impact continues to shape how we build models for the world's languages.

The multilingual model landscape

Model	Year	Languages	Key Innovation
mBERT	2019	104	First: BERT on Wikipedia in 104 languages
XLM	2019	100	TLM: use parallel data for cross-lingual signal
XLM-R (this paper)	2020	100	Scale: 2.5 TB of CommonCrawl, no parallel data needed
mT5	2021	101	Encoder-decoder architecture on mC4 (6.3T tokens)
BLOOM	2022	46	Open science: community-built multilingual GPT
PaLM	2022	~100	540B params, few-shot multilingual capabilities
LLaMA-3	2024	~30	128K vocab, strong multilingual despite English focus

Limitations acknowledged

The tokenization tax. XLM-R uses a shared SentencePiece vocabulary, but it's still biased toward high-resource languages. Bengali text requires 2-3x more tokens than English for the same semantic content. This means Bengali "costs" more compute and context window — a problem quantified in Ahia et al. (EMNLP 2023).

Transfer gap remains. Even XLM-R Large has an 8-point gap between English and the average language on XNLI. For safety-critical applications, zero-shot transfer may not be sufficient — some in-language data is still needed.

Encoder only. XLM-R is an encoder model — it produces representations, not text. For generative multilingual tasks, you need models like mT5 or multilingual GPT variants.

How to use XLM-R in practice

XLM-R is typically used as a feature extractor or fine-tuning base for downstream tasks. Here's a sketch of zero-shot cross-lingual sentiment classification (the English datasets are assumed to be tokenized already):

python
# XLM-R fine-tuning + cross-lingual deployment (sketch)
import torch
from transformers import (
    XLMRobertaForSequenceClassification,
    XLMRobertaTokenizer,
    Trainer,
    TrainingArguments
)

LABELS = ["negative", "neutral", "positive"]  # list index = class id

# 1. Load model and tokenizer
model = XLMRobertaForSequenceClassification.from_pretrained(
    "xlm-roberta-large",
    num_labels=len(LABELS)
)
tokenizer = XLMRobertaTokenizer.from_pretrained("xlm-roberta-large")

# 2. Fine-tune on English data ONLY
# english_sentiment_dataset / english_eval_dataset are placeholders for
# datasets already tokenized with `tokenizer` and labeled with ids 0..2
training_args = TrainingArguments(
    output_dir="./xlmr-sentiment",
    num_train_epochs=3,
    per_device_train_batch_size=16,
    learning_rate=2e-5,
    warmup_ratio=0.1,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=english_sentiment_dataset,  # English only!
    eval_dataset=english_eval_dataset,
)
trainer.train()

# 3. Deploy on ANY language (zero-shot)
def classify(text):
    """Return the predicted sentiment label for text in any language."""
    model.eval()
    inputs = tokenizer(text, return_tensors="pt",
                       truncation=True, max_length=512).to(model.device)
    with torch.no_grad():
        logits = model(**inputs).logits
    return LABELS[logits.argmax(dim=-1).item()]

# Works on languages never seen during fine-tuning!
# (expected labels; actual predictions depend on the fine-tuned model)
classify("I love this product!")      # expect "positive" (English)
classify("この製品が大好きです！")      # expect "positive" (Japanese)
classify("Ninapenda bidhaa hii!")      # expect "positive" (Swahili)

XLM-R's lasting contribution is showing that multilingual pre-training scales: more data and more parameters largely overcome the curse of multilinguality. Many subsequent multilingual models, from mT5 to PaLM to LLaMA, build on this insight. The recipe is now common: train on massive multilingual corpora with smoothed sampling, use a large subword vocabulary, and make the model big enough to handle all languages without unacceptable dilution.

"One model, one hundred languages, no parallel data. The tower of Babel was a problem of capacity, not of principle."

[Interactive demo: Multilingual Model Timeline. The evolution of multilingual models from mBERT to modern multilingual LLMs.]

What is XLM-R's most lasting contribution to the field?

A. Showing that multilingual pre-training scales: with enough data (2.5 TB) and model capacity (550M params), a single model can serve 100 languages with strong cross-lingual transfer, no parallel data needed. This recipe (massive multilingual corpora + smoothed sampling + large model) became a standard starting point for later multilingual models
B. Inventing the Transformer architecture
C. Creating the first translation model

XLM-R: Cross-lingual Representations at Scale