Alexis Conneau, Kartikay Khandelwal, Naman Goyal, et al. (Facebook AI) — ACL 2020

XLM-R: Cross-lingual Representations at Scale

Unsupervised Cross-lingual Representation Learning at Scale — train one masked language model on 100 languages simultaneously, enabling zero-shot cross-lingual transfer without parallel data.

Prerequisites: BERT masked LM basics + Tokenization (BPE/SentencePiece). That's it.

Chapter 0: The Language Barrier

You've built a sentiment classifier for English product reviews. It works great — 94% accuracy. Now your company expands to Japan, Brazil, and Turkey. You need the same classifier for Japanese, Portuguese, and Turkish. The problem: you have zero labeled training data in those languages.

This is the language barrier in NLP. The vast majority of labeled datasets exist only in English. Building separate models for each of the world's 7,000+ languages would require labeled data for each one — data that simply doesn't exist for most languages.

Approach | What It Requires | Scales?
Train per language | Labeled data in EACH target language | No: most languages have no labeled data
Translate and train | Machine translation system for each language pair | Partially, but translation introduces noise
Cross-lingual transfer | Multilingual representations + labels in ONE language | Yes: one model, many languages

Cross-lingual transfer is the dream: train a classifier on English data, then apply it directly to Japanese, Portuguese, Turkish — any language — without any additional training. The classifier works because the representations are shared across languages.

XLM-RoBERTa's promise: Train a single masked language model on 2.5 terabytes of text in 100 languages. The resulting representations are so deeply multilingual that a classifier trained on English features can understand Japanese, Arabic, and Swahili — despite never seeing a single labeled example in those languages. No parallel data. No translation. Just one giant model that learns language-agnostic features by reading the internet in 100 languages.

Think of it like a polyglot who reads extensively in many languages. After enough reading, they develop abstract concepts of "positive sentiment" and "named entity" that transcend any specific language. XLM-R does the same — but computationally.


What problem does cross-lingual transfer learning solve?

Chapter 1: Cross-lingual Transfer

How can a model learn representations that work across languages? The key insight is that if you train a masked language model on text from multiple languages simultaneously, the model discovers shared structure that transcends any single language.

Why does this work at all?

Consider sentiment. In English, "This movie is amazing!" is positive. In French, "Ce film est incroyable!" is positive. The surface tokens are completely different, but the syntactic structures and semantic patterns are similar. A multilingual model that processes both languages learns to map these parallel structures into similar vector spaces.

More concretely, multilingual models exploit three types of cross-lingual signal:

Signal Type | What It Is | Example
Shared subwords | Many languages share subword tokens: numbers, borrowed words, named entities | "2024", "COVID", and (across Latin-script languages) "Obama" map to the same tokens
Structural similarity | Many languages have SVO order, use articles, have similar syntactic patterns | "The cat sat" vs "Le chat s'est assis": similar structure
Parameter sharing | One Transformer processes all languages and is forced to find shared representations | The same attention heads learn "subject" and "object" across languages
The anchor hypothesis. Shared subwords (names, numbers, loanwords) act as "anchors" that pull different languages into the same region of representation space. If "Obama" appears in English, French, and Japanese sentences with similar contexts, the Transformer learns to give all those sentences similar representations — and other words in those contexts get pulled along. These anchors bootstrap cross-lingual alignment without any explicit parallel data.

Prior work: mBERT and XLM

Multilingual BERT (mBERT) by Google (2019) was the first demonstration. It was simply BERT trained on Wikipedia text from 104 languages. Remarkably, despite no cross-lingual objective, it learned cross-lingual representations. But it had limitations: Wikipedia is small for many languages, and the model capacity (110M parameters) was spread too thin across 104 languages.

XLM (Lample & Conneau, 2019) improved on mBERT by adding a Translation Language Modeling (TLM) objective: concatenate a sentence with its translation, mask tokens in both, and let the model attend across the two languages to fill them in. This gave an explicit cross-lingual signal but required parallel data, which is scarce for most language pairs.
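
To make the TLM input format concrete, here is a minimal sketch of how a parallel sentence pair might be turned into a single training example. The function name `build_tlm_example`, the `</s>` separator string, and the whitespace tokenization are assumptions for illustration only; XLM operates on BPE tokens, and masking would then be applied to positions in both halves.

```python
def build_tlm_example(src_tokens, tgt_tokens, src_lang, tgt_lang, sep="</s>"):
    """Concatenate a parallel pair into one TLM input (illustrative sketch).

    Positions restart at 0 for the target sentence and every token carries a
    language ID, so the model can use either language as context.
    """
    src = [sep] + list(src_tokens) + [sep]
    tgt = [sep] + list(tgt_tokens) + [sep]
    tokens = src + tgt
    positions = list(range(len(src))) + list(range(len(tgt)))
    langs = [src_lang] * len(src) + [tgt_lang] * len(tgt)
    return tokens, positions, langs


tokens, positions, langs = build_tlm_example(
    "the cat sat".split(), "le chat s'est assis".split(), "en", "fr"
)
```

Because both halves sit in one sequence, a masked English word can be predicted from its French translation. That is the explicit cross-lingual signal, and also the reason TLM cannot be used without parallel text.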

python
# The evolution of multilingual models
# 1. mBERT (2019)
#    - BERT trained on Wikipedia in 104 languages
#    - Objective: MLM (masked language modeling) only
#    - Data: ~2.5B words total across all languages
#    - Problem: Wikipedia is tiny for low-resource languages

# 2. XLM (2019)
#    - Added TLM (translation language modeling)
#    - Objective: MLM + TLM (requires parallel data)
#    - Better cross-lingual, but parallel data is limited

# 3. XLM-R (2020) — THIS PAPER
#    - Just MLM, no TLM needed
#    - Massive data: 2.5 TB of CommonCrawl text
#    - Key insight: scale the DATA, not the objective
#    - Result: beats XLM-TLM despite using LESS supervision
XLM-R's key insight: data > objectives. The XLM paper showed that TLM (parallel data) helps cross-lingual transfer. XLM-R showed something surprising: with ENOUGH monolingual data, you don't need parallel data at all. Pure MLM on massive multilingual text works better than MLM+TLM on smaller data. More data in each language gives the model more context to discover cross-lingual patterns on its own.
What was XLM-R's key insight compared to XLM?

Chapter 2: Scaling the Data

The foundation of XLM-R is data. Where mBERT used Wikipedia (2.5 billion words across 104 languages), XLM-R uses CommonCrawl (2.5 terabytes across 100 languages). This isn't just "more data": it's a qualitative shift in what low-resource languages can access.

CC-100: Cleaning the CommonCrawl

Raw CommonCrawl data is messy: it's scraped from the entire web. Conneau et al. built the CC-100 dataset by running a cleaning pipeline over CommonCrawl dumps for 100 languages, following the CCNet approach of Wenzek et al. (2019):

1. CommonCrawl Dumps
Raw web crawl snapshots. Hundreds of TB of text in all languages mixed together.
2. Language ID (fastText)
Classify each page into one of 176 languages using a fastText classifier. Keep pages with confidence > 0.5.
3. Deduplication
Remove near-duplicate paragraphs within each language. Crucial for data quality.
4. CC-100 Dataset
2.5 TB total. 100 languages. From 55GB (English) down to 1MB (lowest). Roughly 20x to 50x more data than Wikipedia for the languages compared in the table below.
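
The filtering and deduplication steps above can be sketched in a few lines. This is a simplified illustration, not the authors' pipeline: the name `clean_corpus` and the `(text, lang, confidence)` record layout are assumptions, the confidence is taken as already computed by a language-ID classifier, and exact-match hashing of normalized paragraphs stands in for near-duplicate detection.

```python
import hashlib


def clean_corpus(records, min_confidence=0.5):
    """Filter by language-ID confidence, then drop repeated paragraphs per language.

    `records` is an iterable of (text, lang, confidence) tuples.
    """
    seen = set()
    kept = {}
    for text, lang, confidence in records:
        if confidence <= min_confidence:
            continue  # language too uncertain: discard the whole page
        for paragraph in text.split("\n"):
            normalized = " ".join(paragraph.lower().split())
            if not normalized:
                continue
            digest = hashlib.sha1(f"{lang}:{normalized}".encode("utf-8")).hexdigest()
            if digest in seen:
                continue  # an identical paragraph was already kept for this language
            seen.add(digest)
            kept.setdefault(lang, []).append(paragraph.strip())
    return kept


pages = [
    ("Accept cookies\nHabari ya leo", "sw", 0.93),
    ("Accept cookies\nHabari za asubuhi", "sw", 0.88),
    ("???", "sw", 0.21),
]
corpus = clean_corpus(pages)
```

In this toy input the repeated "Accept cookies" line survives only once and the low-confidence page is dropped entirely, which is the behavior the pipeline relies on to keep boilerplate from dominating small languages.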

Data size comparison

The scale difference between Wikipedia and CommonCrawl is enormous for low-resource languages:

Language | Wikipedia (mBERT) | CC-100 (XLM-R) | Increase
English | 2.5 GB | 55 GB | 22x
French | 1.1 GB | 57 GB | 52x
Swahili | 11 MB | 332 MB | 30x
Burmese | 8 MB | 214 MB | 27x
Urdu | 30 MB | 730 MB | 24x
Yoruba | 1.2 MB | 28 MB | 23x

For Yoruba, the data goes from 1.2 MB (barely enough to learn anything) to 28 MB. This is still small compared to English's 55 GB, but it's enough for the model to learn basic Yoruba linguistic structure. The exponential smoothing in sampling further amplifies low-resource languages' training signal.

python
# Data quality matters as much as quantity
# Wikipedia: curated, factual, encyclopedic style
# CommonCrawl: diverse but noisy — includes:
#   - News articles (formal)
#   - Blog posts (informal)
#   - Forum discussions (colloquial)
#   - Product descriptions (commercial)
#   - Boilerplate HTML (noise)
#
# The deduplication step is crucial:
# Without it, the model memorizes repeated boilerplate
# (cookie notices, navigation menus, copyright footers)
# instead of learning language structure

The data imbalance problem

Even after cleaning, the data is wildly imbalanced. English has 300 billion tokens. Swahili has 275 million. Urdu has 730 million. If you train on the natural distribution, the model overwhelmingly learns English features and barely touches low-resource languages.

XLM-R uses exponential smoothing to rebalance sampling. Let $n_i$ be the number of tokens in language $i$. The sampling probability for language $i$ is:

$$p_i = \frac{n_i^{\alpha}}{\sum_j n_j^{\alpha}}$$

Where α is a smoothing parameter. When α = 1, you sample proportionally (English dominates). When α = 0, every language is equally sampled (low-resource gets a huge boost but English suffers). Conneau et al. found α = 0.3 works best — it upsamples low-resource languages significantly while not starving high-resource ones.

python
import numpy as np

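# Example: 5 languages with different data sizes (token counts, in millions)
data_sizes = {
    'English':   300_000,  # 300B tokens
    'French':     56_000,  # 56B
    'Hindi':       2_000,  # 2B
    'Swahili':       275,  # 275M
    'Urdu':          730,  # 730M
}

def compute_sampling(sizes, alpha):
    """Return p_i = n_i^alpha / sum_j n_j^alpha for each language."""
    values = np.array(list(sizes.values()), dtype=np.float64)
    probs = values ** alpha
    return probs / probs.sum()

# Print the sampling share (in %) of each language for three values of alpha
for alpha in (1.0, 0.3, 0.0):
    shares = 100 * compute_sampling(data_sizes, alpha)
    print(alpha, dict(zip(data_sizes, shares.round(2))))

# α = 1.0 (proportional): English gets ~83.6% of training, Swahili ~0.08%
# α = 0.3 (smoothed):     English gets ~47.3%, Swahili gets ~5.8%
# α = 0.0 (uniform):      Each language gets 20%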
α = 0.3 is the Goldilocks value. Too high, and the model ignores low-resource languages. Too low, and it forgets high-resource patterns. In the five-language example above, Swahili's sampling probability at α = 0.3 rises from about 0.08% to about 5.8%, roughly a 75x boost. The model therefore sees Swahili text about 75 times more often than it would under proportional sampling, giving it a fighting chance to learn useful Swahili representations.
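
As a rough illustration of how the smoothed distribution might be used during training, the sketch below draws one language per batch. The helper names `smoothed_probs` and `sample_languages` are hypothetical, the sizes are the same illustrative counts as in the example above, and drawing a single language per batch is an assumption made for simplicity.

```python
import random

token_counts = {  # illustrative sizes, in millions of tokens
    "English": 300_000, "French": 56_000, "Hindi": 2_000,
    "Swahili": 275, "Urdu": 730,
}


def smoothed_probs(counts, alpha):
    """p_i = n_i^alpha / sum_j n_j^alpha, returned as a dict."""
    weights = {lang: n ** alpha for lang, n in counts.items()}
    total = sum(weights.values())
    return {lang: w / total for lang, w in weights.items()}


def sample_languages(counts, alpha, num_batches, seed=0):
    """Draw the language of each batch from the smoothed distribution."""
    rng = random.Random(seed)
    probs = smoothed_probs(counts, alpha)
    return rng.choices(list(probs), weights=list(probs.values()), k=num_batches)


probs = smoothed_probs(token_counts, alpha=0.3)
draws = sample_languages(token_counts, alpha=0.3, num_batches=10_000)
swahili_share = draws.count("Swahili") / len(draws)
```

Over many batches the empirical share of each language converges to its smoothed probability, so `swahili_share` ends up close to `probs["Swahili"]` rather than to Swahili's raw share of the data.
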
Why does XLM-R use exponential smoothing (α = 0.3) instead of proportional sampling?

Chapter 3: The Curse of Multilinguality

Here's the central tension of multilingual models: every language you add dilutes the model's capacity for every other language. This is the curse of multilinguality.

Imagine a Transformer with 110M parameters (like BERT-base). If it learns English only, all 110M parameters are dedicated to English. If it learns 100 languages, those same 110M parameters must represent the vocabulary, syntax, semantics, and factual knowledge of ALL 100 languages. On a naive even split, each language gets roughly 1.1M parameters' worth of capacity, a 100x reduction. In practice parameters are shared rather than partitioned, which is what makes transfer possible, but the total budget is still fixed.

The evidence

Conneau et al. ran a controlled experiment. They trained multiple XLM-R models, each on a different number of languages (1, 7, 15, 30, 100), and measured performance on a shared evaluation set. The results are striking:

# Languages | EN accuracy | Low-resource avg | Overall avg
1 (English only) | 92.3 | n/a (no transfer) | n/a
7 | 91.1 | 78.5 | 85.4
15 | 89.8 | 79.2 | 84.9
30 | 88.5 | 79.8 | 84.5
100 | 87.1 | 80.1 | 84.2

Adding more languages hurts high-resource languages (English drops from 92.3 to 87.1) but helps low-resource languages (which benefit from cross-lingual transfer). This is a genuine tradeoff — you can't get both without increasing capacity.

The solution: make the model bigger. If the curse is limited capacity, the cure is more capacity. XLM-R uses a 550M-parameter model (XLM-R Large) — 5x bigger than mBERT. With sufficient capacity, the curse is largely overcome: English performance approaches monolingual levels while cross-lingual transfer remains strong. This is the paper's key practical finding: multilingual models need to be large to avoid the dilution penalty.

Vocabulary size matters too

mBERT used a 110K shared vocabulary (WordPiece). XLM-R uses a 250K SentencePiece vocabulary. This larger vocabulary is crucial for multilingual coverage. With only 110K tokens, many scripts (Chinese, Japanese, Korean, Arabic) get fragmented into tiny pieces, making sequences very long and learning harder.

python
# Vocabulary allocation across scripts
# mBERT (110K WordPiece):
#   Latin scripts:  ~40K tokens (dominates)
#   CJK characters: ~20K tokens (under-represented)
#   Other scripts:  ~50K tokens (heavily fragmented)

# XLM-R (250K SentencePiece):
#   Latin scripts:  ~60K tokens
#   CJK characters: ~50K tokens (much better coverage)
#   Other scripts:  ~140K tokens (decent coverage)

# Result: "नमस्ते" (Hindi for "hello")
# mBERT:  5 tokens (heavily fragmented)
# XLM-R:  2 tokens (better coverage of Devanagari)
What is the "curse of multilinguality" and how does XLM-R address it?

Chapter 4: Architecture & Training

XLM-R is architecturally identical to RoBERTa — a Transformer encoder trained with masked language modeling. The novelty isn't in the architecture but in the scale of multilingual data and the training recipe. Let's trace the exact setup.

Model specifications

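Config | XLM-R Base | XLM-R Large
Layers | 12 | 24
Hidden dim | 768 | 1024
Attention heads | 12 | 16
Parameters | 270M | 550M
Vocabulary | 250K SentencePiece (shared) | 250K SentencePiece (shared)
Max sequence length | 512 tokens | 512 tokens
Training data | CC-100: 2.5 TB, 100 languages | CC-100: 2.5 TB, 100 languages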
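
A back-of-the-envelope calculation helps show where these parameters go. The sketch below assumes a vocabulary of exactly 250,000 entries and the standard Transformer layout with a feed-forward layer four times the hidden size; it ignores biases, layer norms, and position embeddings, so it approximates the published totals rather than reproducing them. The function name `rough_param_count` is hypothetical.

```python
def rough_param_count(vocab_size, hidden, layers):
    """Return (embedding weights, Transformer layer weights), both approximate."""
    embedding = vocab_size * hidden
    # Per layer: 4*h^2 for attention (Q, K, V, output) + 8*h^2 for the 4x-wide FFN.
    transformer = layers * 12 * hidden * hidden
    return embedding, transformer


for name, hidden, layers in [("base", 768, 12), ("large", 1024, 24)]:
    emb, body = rough_param_count(250_000, hidden, layers)
    print(f"{name}: embeddings={emb / 1e6:.0f}M, layers={body / 1e6:.0f}M, "
          f"embedding share={emb / (emb + body):.0%}")
```

Under these assumptions the embedding matrix accounts for roughly two thirds of XLM-R Base and a bit under half of XLM-R Large. The large vocabulary is a significant part of the model's size, and Base has much less Transformer capacity than its headline parameter count suggests.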

Training objective: Just MLM

XLM-R uses only Masked Language Modeling (MLM). Randomly mask 15% of the input tokens and predict them. No next-sentence prediction (removed following RoBERTa's finding that NSP hurts). No translation language modeling (the key finding — you don't need parallel data).

$$\mathcal{L}_{\text{MLM}} = -\sum_{i \in \text{masked}} \log p(x_i \mid x_{\setminus i})$$

Where $x_i$ is the masked token and $x_{\setminus i}$ is all other tokens. The model must predict each masked word from its multilingual context.
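
The loss above is simply a sum of negative log-probabilities over the masked positions. A minimal sketch with hand-written probabilities follows; the `probs` values are made up for illustration, where a real model would produce them with a softmax over the whole vocabulary.

```python
import math


def mlm_loss(predicted_probs, original_tokens, masked_positions):
    """Sum of -log p(original token) over the masked positions only.

    `predicted_probs[i]` maps candidate tokens to the model's probability at
    position i; unmasked positions do not contribute to the loss.
    """
    return -sum(
        math.log(predicted_probs[i][original_tokens[i]]) for i in masked_positions
    )


tokens = ["the", "cat", "sat", "on", "the", "mat"]
# Hypothetical model outputs for the two masked positions (1 and 5).
probs = {1: {"cat": 0.5, "dog": 0.5}, 5: {"mat": 0.25, "sofa": 0.75}}
loss = mlm_loss(probs, tokens, masked_positions=[1, 5])
```

Here the loss is ln 2 + ln 4, about 2.08. A model that assigned probability 1 to both original tokens would reach a loss of 0.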

Training recipe from RoBERTa

XLM-R adopts the optimized training recipe from RoBERTa:

python
# XLM-R training hyperparameters
config = {
    'optimizer': 'Adam',
    'lr': 6e-4,           # peak learning rate
    'warmup_steps': 10000,
    'batch_size': 8192,    # very large batch
    'total_steps': 1_500_000,  # 1.5M steps
    'masking': 'dynamic',  # re-mask each epoch (not static)
    'nsp': False,          # no next-sentence prediction
    'full_sentences': True,  # pack full sentences (not pairs)
    'sampling_alpha': 0.3,  # language smoothing factor
    'fp16': True,          # mixed precision training
}
# Total compute: ~500 V100 GPUs for ~3 weeks
Dynamic masking. The original BERT implementation generated masks once during preprocessing (static masking), so a sentence was seen with the same mask, or one of a small fixed set of masks, across epochs. RoBERTa (and XLM-R) use dynamic masking: a new mask is drawn each time the model sees a sequence. The model therefore gets different prediction targets from the same text on every pass, which helps a 1.5M-step run make full use of its data.
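
A minimal sketch of dynamic masking, using the standard BERT corruption rule (of the selected tokens, 80% become the mask token, 10% a random token, 10% stay unchanged). The function name, the `mask_id=3` value, and the integer stand-in token IDs are assumptions for illustration, not XLM-R's actual IDs.

```python
import random

IGNORE = -100  # label for positions that do not contribute to the loss


def dynamic_mask(token_ids, vocab_size, mask_id, rng, mask_prob=0.15):
    """Return (inputs, labels) with a fresh random mask drawn on every call."""
    inputs, labels = list(token_ids), [IGNORE] * len(token_ids)
    for i, token in enumerate(token_ids):
        if rng.random() >= mask_prob:
            continue  # position not selected for prediction
        labels[i] = token  # the model must recover the original token
        roll = rng.random()
        if roll < 0.8:
            inputs[i] = mask_id  # 80%: replace with the mask token
        elif roll < 0.9:
            inputs[i] = rng.randrange(vocab_size)  # 10%: random token
        # remaining 10%: keep the original token unchanged
    return inputs, labels


rng = random.Random(0)
sentence = list(range(1000, 1020))  # stand-in token IDs
first_inputs, first_labels = dynamic_mask(sentence, vocab_size=250_000, mask_id=3, rng=rng)
second_inputs, second_labels = dynamic_mask(sentence, vocab_size=250_000, mask_id=3, rng=rng)
```

Because the mask is drawn inside the function, two calls on the same sentence will generally select different positions. With static masking, the equivalent call would run once during preprocessing and its output would be reused.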

SentencePiece tokenization

XLM-R uses SentencePiece with a 250K vocabulary trained on the CC-100 data. SentencePiece is critical for multilingual models because it doesn't assume whitespace-separated words — it can tokenize Japanese, Chinese, and Thai directly from the raw character stream.

python
# SentencePiece tokenization examples
from transformers import XLMRobertaTokenizer

tok = XLMRobertaTokenizer.from_pretrained("xlm-roberta-base")

# English
tok.tokenize("The cat sat on the mat")
# ['▁The', '▁cat', '▁sat', '▁on', '▁the', '▁mat']

# Japanese (no spaces between words!)
tok.tokenize("猫がマットの上に座った")
# ['▁', '猫', 'が', 'マ', 'ット', 'の', '上', 'に', '座', 'った']

# Arabic (right-to-left)
tok.tokenize("القطة جلست على الحصيرة")
# ['▁', 'القط', 'ة', '▁', 'جل', 'ست', '▁على', '▁ال', 'حص', 'ير', 'ة']

What each component does

Let's trace the full data flow for a single training batch:

python
# XLM-R training step: detailed data flow (annotated pseudocode)
# sample_language, get_sentences, dynamic_mask, model and optimizer are
# placeholders for components defined elsewhere in a real training setup.
import torch
import torch.nn.functional as F

# 1. Sample a language according to the smoothed distribution
lang = sample_language(alpha=0.3)  # might be "sw" (Swahili)

# 2. Sample a batch of packed sequences from that language
batch = get_sentences(lang, batch_size=8192)
# Each sequence holds up to 512 SentencePiece tokens; full sentences are
# packed back to back until the limit is reached.
# (In practice the global batch of 8192 is split across many GPUs.)

# 3. Apply dynamic masking (a fresh mask every time a sequence is seen)
masked_ids, labels = dynamic_mask(batch, mask_prob=0.15)
# 15% of tokens are selected for prediction:
#   80% → replaced with the mask token
#   10% → random token from vocab
#   10% → unchanged (but still predicted)
# labels holds the original ID at selected positions and -100 elsewhere

# 4. Forward pass through the Transformer
logits = model(masked_ids)  # [batch, 512, 250000]
# Output: unnormalized scores (logits) over the 250K vocab for each position

# 5. Compute loss only on masked positions
mask_positions = labels != -100
loss = F.cross_entropy(logits[mask_positions], labels[mask_positions])
# The model must predict the original token from context
# "The <mask> sat on the mat" → predict "cat"

# 6. Backward + optimizer step
optimizer.zero_grad()  # clear gradients left over from the previous step
loss.backward()
optimizer.step()  # Adam with warmup + linear decay
Full sentence packing. Unlike BERT, which packs two sentences and separates them with [SEP], XLM-R packs as many full sentences as fit in 512 tokens, separated only by the end-of-sentence token. This means more text per training example and no wasted padding tokens. RoBERTa showed this improves efficiency by ~15% with no quality loss.
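
Full-sentence packing can be sketched as a simple greedy loop. This is an illustrative simplification: the function name `pack_sentences` and the `eos_id=2` default are assumptions, and sentences are represented as plain lists of integer token IDs.

```python
def pack_sentences(sentences, max_len=512, eos_id=2):
    """Greedily pack tokenized sentences into sequences of at most max_len tokens.

    Each sentence is followed by an end-of-sentence ID; a sentence longer than
    max_len - 1 tokens is truncated so it still fits in one sequence.
    """
    sequences, current = [], []
    for sentence in sentences:
        piece = list(sentence[: max_len - 1]) + [eos_id]
        if current and len(current) + len(piece) > max_len:
            sequences.append(current)  # next sentence would overflow: start anew
            current = []
        current.extend(piece)
    if current:
        sequences.append(current)
    return sequences


toy = [[10] * 200, [11] * 250, [12] * 100, [13] * 30]
packed = pack_sentences(toy)
lengths = [len(seq) for seq in packed]
```

The four toy sentences end up in two sequences of 452 and 132 tokens. Sentences are never split across sequences, so some slack remains at the end of each sequence, but far less than when every sentence is padded to 512 on its own.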

What training recipe changes from original BERT does XLM-R inherit from RoBERTa?

Chapter 5: Results & Benchmarks

XLM-R was evaluated on several cross-lingual benchmarks. The headline result: it outperforms mBERT on every benchmark and matches or beats XLM (which uses parallel data) while using no parallel data at all.

XNLI: Cross-lingual Natural Language Inference

XNLI is the primary benchmark for cross-lingual transfer. The task: given two sentences (premise and hypothesis), classify their relationship as entailment, contradiction, or neutral. Training is done in English only; evaluation is done in 15 languages.

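Model | EN | FR | DE | AR | ZH | HI | SW | Avg (15 langs)
mBERT | 82.1 | 76.6 | 74.2 | 67.7 | 69.6 | 62.1 | 58.8 | 70.4
XLM (MLM+TLM) | 85.0 | 80.2 | 78.7 | 73.1 | 76.7 | 69.6 | 64.6 | 75.0
XLM-R Base | 85.8 | 79.7 | 78.7 | 73.8 | 76.5 | 72.4 | 66.5 | 76.2
XLM-R Large | 89.1 | 84.1 | 83.9 | 77.3 | 80.9 | 75.6 | 71.2 | 80.9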

Key observations:

XLM-R Large dominates. It beats mBERT by 10.5 points on average and XLM by 5.9 points — a massive improvement. And it achieves this without any parallel data (unlike XLM which requires translation pairs).

Breaking down the results

python
# Analyzing where XLM-R's gains come from
# Compare mBERT → XLM-R Large improvements by language family:
#
# Indo-European (Latin script): avg +7.9 points
#   French: 76.6 → 84.1 (+7.5)
#   German: 74.2 → 83.9 (+9.7)
#   Spanish: 77.3 → 83.7 (+6.4)
#
# Non-Latin script or lower-resource: avg +11.7 points
#   Arabic: 67.7 → 77.3 (+9.6)
#   Hindi: 62.1 → 75.6 (+13.5)
#   Swahili: 58.8 → 71.2 (+12.4)
#   Chinese: 69.6 → 80.9 (+11.3)
#
# The pattern: non-Latin-script and lower-resource languages gain MORE
# Why? Three factors:
# 1. CC-100 has much more data than Wikipedia for these languages
# 2. 250K SentencePiece vocab gives better coverage of non-Latin scripts
# 3. Exponential smoothing (α=0.3) gives them more training signal

Low-resource languages gain the most. Swahili goes from 58.8 (mBERT) to 71.2 (XLM-R Large) — a 12.4-point improvement. Hindi goes from 62.1 to 75.6 (+13.5). The exponential smoothing and larger data pay off most where they're needed most.
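
The per-language gains can be recomputed directly from the XNLI table above. The sketch below uses only the six non-English columns shown there; splitting them into two groups (French and German versus the rest) is an illustrative choice, not a grouping taken from the paper.

```python
mbert = {"FR": 76.6, "DE": 74.2, "AR": 67.7, "ZH": 69.6, "HI": 62.1, "SW": 58.8}
xlmr_large = {"FR": 84.1, "DE": 83.9, "AR": 77.3, "ZH": 80.9, "HI": 75.6, "SW": 71.2}

# Accuracy gain of XLM-R Large over mBERT, per language (rounded to 0.1)
gains = {lang: round(xlmr_large[lang] - mbert[lang], 1) for lang in mbert}


def mean_gain(langs):
    """Average gain over a group of languages, rounded to 0.1."""
    return round(sum(gains[lang] for lang in langs) / len(langs), 1)


european = mean_gain(["FR", "DE"])
other = mean_gain(["AR", "ZH", "HI", "SW"])
```

With these table values the average gain is 8.6 points for French and German and 11.7 points for Arabic, Chinese, Hindi, and Swahili, which matches the pattern described above.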

Named Entity Recognition

On WikiANN NER (recognize person, location, organization names in 40 languages), XLM-R Large achieves 65.4 F1 averaged across all languages, compared to 62.2 for mBERT. The improvement is particularly large for languages with non-Latin scripts.

The GLUE sanity check. XLM-R also stays close to RoBERTa (English-only) on the English GLUE benchmark: 91.8 vs 92.8 average on the dev sets. This suggests that being multilingual costs little English performance when the model is large enough. The curse of multilinguality is largely mitigated by scale.

What is the most significant finding from XLM-R's XNLI results?

Chapter 6: Transfer Showcase

Let's put cross-lingual transfer to the test. The showcase simulation lets you experience what XLM-R does: train a classifier on English, then deploy it on any of 100 languages without additional training.

How zero-shot transfer works in practice

1. Load XLM-R
Pre-trained on 100 languages. Representations already cross-lingual.
2. Fine-tune on English
Add a classification head. Train on English labeled data (e.g., XNLI). Update all weights.
3. Deploy on any language
Input text in Japanese, Arabic, Swahili — the model classifies it correctly because the representations are language-agnostic.
python
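import torch
from transformers import XLMRobertaForSequenceClassification, XLMRobertaTokenizer

# 1. Load pre-trained XLM-R
model = XLMRobertaForSequenceClassification.from_pretrained(
    "xlm-roberta-large", num_labels=3  # entail/contradict/neutral
)
tok = XLMRobertaTokenizer.from_pretrained("xlm-roberta-large")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)  # lr is an illustrative choice

# 2. Fine-tune on English XNLI training data
# english_xnli_train is a placeholder: an iterable of batches holding lists of
# premise/hypothesis strings and a LongTensor of labels
model.train()
for epoch in range(3):
    for batch in english_xnli_train:
        inputs = tok(batch['premise'], batch['hypothesis'],
                    return_tensors="pt", padding=True, truncation=True)
        loss = model(**inputs, labels=batch['label']).loss
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

# 3. Zero-shot evaluation on OTHER languages (no training!)
premise_ja = "男がギターを弾いている。"      # "A man is playing the guitar."
hypothesis_ja = "男は楽器を演奏している。"    # "A man is playing an instrument."
model.eval()
with torch.no_grad():
    inputs = tok(premise_ja, hypothesis_ja, return_tensors="pt")
    pred = model(**inputs).logits.argmax(dim=-1).item()
# With label 0 = entailment, a well-trained model is expected to output 0 here,
# despite never seeing a single Japanese labeled example.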
This actually works. On the XNLI test set, XLM-R Large achieves 89.1% on English (the only language with labeled training data) and 80.9% averaged across all 15 XNLI languages. That average includes English, so the zero-shot languages alone average slightly lower. The roughly 8-point gap between English and the average is the "transfer gap": remarkably small given that the model saw zero labeled data in the other 14 languages.
In zero-shot cross-lingual transfer with XLM-R, what enables a classifier trained on English to work on Japanese?

Chapter 7: Connections

XLM-R stands at the crossroads of two trends: scaling language models and making them multilingual. Its impact continues to shape how we build models for the world's languages.

The multilingual model landscape

Model | Year | Languages | Key Innovation
mBERT | 2019 | 104 | First: BERT on Wikipedia in 104 languages
XLM | 2019 | 100 | TLM: use parallel data for cross-lingual signal
XLM-R (this paper) | 2020 | 100 | Scale: 2.5 TB of CommonCrawl, no parallel data needed
mT5 | 2021 | 101 | Encoder-decoder architecture on mC4 (6.3T tokens)
BLOOM | 2022 | 46 | Open science: community-built multilingual GPT
PaLM | 2022 | ~100 | 540B params, few-shot multilingual capabilities
LLaMA-3 | 2024 | ~30 | 128K vocab, strong multilingual despite English focus

Limitations acknowledged

The tokenization tax. XLM-R uses a shared SentencePiece vocabulary, but it's still biased toward high-resource languages. Bengali text requires 2-3x more tokens than English for the same semantic content. This means Bengali "costs" more compute and context window — a problem quantified in Ahia et al. (EMNLP 2023).
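
One simple way to quantify this tax is to tokenize a sentence and its English translation and compare the token counts. The sketch below is illustrative: `token_ratio` is a hypothetical helper, and `toy_tokenize` is a crude stand-in tokenizer (ASCII words kept whole, other words split per character) used only so the example runs without downloading a model.

```python
def token_ratio(tokenize, text, reference_text):
    """How many times more tokens `text` needs than `reference_text`."""
    return len(tokenize(text)) / len(tokenize(reference_text))


def toy_tokenize(text):
    # Stand-in tokenizer: keeps ASCII words whole, splits other words per character.
    pieces = []
    for word in text.split():
        pieces.extend([word] if word.isascii() else list(word))
    return pieces


# Bengali sentence and its English translation ("I eat rice")
ratio = token_ratio(toy_tokenize, "আমি ভাত খাই", "I eat rice")
```

To measure the effect with a real model, pass the `tokenize` method of a loaded tokenizer (for example an `XLMRobertaTokenizer` instance) in place of `toy_tokenize`, and average the ratio over a set of parallel sentences rather than a single pair.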

Transfer gap remains. Even XLM-R Large has an 8-point gap between English and the average language on XNLI. For safety-critical applications, zero-shot transfer may not be sufficient — some in-language data is still needed.

Encoder only. XLM-R is an encoder model — it produces representations, not text. For generative multilingual tasks, you need models like mT5 or multilingual GPT variants.

How to use XLM-R in practice

XLM-R is typically used as a feature extractor or fine-tuning base for downstream tasks. Here's a complete example of zero-shot cross-lingual sentiment classification:

python
# Complete XLM-R fine-tuning + cross-lingual deployment
import torch
from transformers import (
    XLMRobertaForSequenceClassification,
    XLMRobertaTokenizer,
    Trainer,
    TrainingArguments
)

# Class names, indexed by the label IDs used in the training data
LABELS = ["negative", "neutral", "positive"]

# 1. Load model and tokenizer
model = XLMRobertaForSequenceClassification.from_pretrained(
    "xlm-roberta-large",
    num_labels=len(LABELS)
)
tokenizer = XLMRobertaTokenizer.from_pretrained("xlm-roberta-large")

# 2. Fine-tune on English data ONLY
# english_sentiment_dataset and english_eval_dataset are placeholders for
# datasets already tokenized with `tokenizer` (input_ids, attention_mask, labels)
training_args = TrainingArguments(
    output_dir="./xlmr-sentiment",
    num_train_epochs=3,
    per_device_train_batch_size=16,
    learning_rate=2e-5,
    warmup_ratio=0.1,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=english_sentiment_dataset,  # English only!
    eval_dataset=english_eval_dataset,
)
trainer.train()

# 3. Deploy on ANY language (zero-shot)
def classify(text):
    model.eval()
    inputs = tokenizer(text, return_tensors="pt",
                       truncation=True, max_length=512)
    # Keep inputs on the same device as the model (the Trainer may have moved it to GPU)
    inputs = {name: tensor.to(model.device) for name, tensor in inputs.items()}
    with torch.no_grad():
        logits = model(**inputs).logits
    return LABELS[logits.argmax(dim=-1).item()]

# Expected behavior after successful fine-tuning, on languages never seen
# during fine-tuning:
classify("I love this product!")      # "positive" (English)
classify("この製品が大好きです!")      # "positive" (Japanese)
classify("Ninapenda bidhaa hii!")      # "positive" (Swahili)
XLM-R's lasting contribution is showing that multilingual pre-training scales: more data and more parameters largely overcome the curse of multilinguality. Later multilingual models, from mT5 to the multilingual capabilities of PaLM and LLaMA, follow the same broad direction. The recipe is now standard: train on massive multilingual corpora with smoothed sampling, use a large SentencePiece vocabulary, and make the model big enough to handle all languages without unacceptable dilution.

"One model, one hundred languages, no parallel data. The tower of Babel was a problem of capacity, not of principle."

What is XLM-R's most lasting contribution to the field?