MMLU (Hendrycks 2021)

Chapter 0: The Evaluation Gap

It's late 2020 and GPT-3 has just arrived. It can write essays, answer questions, translate languages. But how do you actually measure what it knows? The existing benchmarks are woefully inadequate.

Consider SuperGLUE, the premier NLP benchmark of 2019. It tests things like reading comprehension, textual entailment, and word sense disambiguation. By early 2021, models have already surpassed human performance on SuperGLUE. Does that mean AI "understands" language? Not remotely. SuperGLUE tests linguistic competence — can the model parse sentences? — but never asks whether it actually knows any facts about the world.

A model can ace SuperGLUE without knowing that mitochondria produce ATP, that the Treaty of Westphalia was signed in 1648, or that the Miranda warning is required before custodial interrogation. These are the kinds of knowledge that separate a well-educated human from a very good autocomplete system.

The core problem: Existing benchmarks measure linguistic ability (can the model understand language?) but not world knowledge (does the model actually know anything?). A model could get perfect scores on every NLU benchmark and still be useless as a doctor, lawyer, or engineer because it has never been tested on domain-specific knowledge. MMLU was created to close this gap.

Hendrycks et al. had a bold idea: test language models the same way we test humans. Give them actual exam questions — from high school biology to professional law, from college chemistry to abstract algebra. 57 subjects, 15,908 questions. If a model can pass these exams, it genuinely understands the material. If it can't, we know exactly where its knowledge gaps are.

This was radical. Previous benchmarks treated NLP as a single skill. MMLU treats it as 57 different skills. A model might be excellent at history but terrible at medicine. A model might understand elementary math but fail college-level physics. MMLU reveals these patterns.

Benchmark Gap Visualizer

Compare what traditional NLP benchmarks test (language skills) vs what MMLU tests (domain knowledge). Click each benchmark to see its scope.

Why was MMLU needed when benchmarks like SuperGLUE already existed?

SuperGLUE was too hard for existing models
Existing benchmarks tested linguistic competence (understanding sentences) but not world knowledge (knowing facts across 57 domains); MMLU tests what a model actually knows, like real exams test humans
SuperGLUE didn't support few-shot evaluation

Chapter 1: 57 Subjects

MMLU organizes its 15,908 questions into 57 subjects spanning four broad categories. This isn't arbitrary — it mirrors how human education is organized, from elementary school through professional certification.

The four categories

STEM (18 subjects): Abstract algebra, astronomy, college biology, college chemistry, college computer science, college mathematics, college physics, computer security, conceptual physics, electrical engineering, elementary mathematics, high school biology, high school chemistry, high school computer science, high school mathematics, high school physics, high school statistics, machine learning.

Humanities (13 subjects): Formal logic, high school European history, high school US history, high school world history, international law, jurisprudence, logical fallacies, moral disputes, moral scenarios, philosophy, prehistory, professional law, world religions.

Social Sciences (12 subjects): Econometrics, high school geography, high school government and politics, high school macroeconomics, high school microeconomics, high school psychology, human sexuality, professional psychology, public relations, security studies, sociology, US foreign policy.

Other (14 subjects): Anatomy, business ethics, clinical knowledge, college medicine, global facts, human aging, management, marketing, medical genetics, miscellaneous, nutrition, professional accounting, professional medicine, virology.

Why 57 subjects matters: A single accuracy number hides everything. A model scoring 70% on MMLU might get 95% on elementary math and 30% on abstract algebra. The subject-level breakdown reveals what kind of knowledge the model has — and more importantly, what it's missing. This granularity is what makes MMLU diagnostic, not just evaluative.

Difficulty levels

The subjects span a natural difficulty gradient. Elementary mathematics asks "What is 3 + 7 * 2?" — basic arithmetic. College mathematics asks about group theory and real analysis. Professional law asks questions from the actual bar exam. This means MMLU can track progress along the entire difficulty curve.

Level	Example Subject	Typical Question	Human Expert Acc
Elementary	Elementary Math	Basic arithmetic and algebra	~95%
High School	HS Physics	Projectile motion, circuits, waves	~85%
College	College Chemistry	Organic reactions, thermodynamics	~75%
Professional	Professional Law	Bar exam questions	~65%

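Notice that even human experts don't score 100% on the hardest subjects, which calibrates our expectations for models. One caution: the bar exam's pass rate of about 60-70% is the share of test-takers who pass, not the share of questions answered correctly, so a model scoring 60% on professional law is not automatically at the human pass level.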

Data sources

Every question comes from real educational materials: practice exams, textbooks, online quizzes. This isn't generated data or crowdsourced annotations — it's the same material humans study. The questions are designed to test genuine understanding, not pattern matching.

Each subject has three splits: a few-shot development set (5 examples used as in-context demonstrations), a validation set (for hyperparameter tuning), and a test set (for final evaluation). The test set ranges from ~100 to ~1,500 questions per subject.

MMLU Subject Map

Explore all 57 subjects organized by category. Click a category to highlight its subjects. Each circle size shows the number of test questions.

Why does MMLU use 57 separate subjects instead of pooling all questions into one big test?

Because subject-level scores reveal which domains a model is strong or weak in; a single accuracy number hides whether the model knows medicine but not law, or math but not history
Because it's easier to collect questions from separate sources
Because 57 subjects create more total questions

Chapter 2: Question Format

Every MMLU question follows the same structure: a question stem followed by exactly four answer choices (A, B, C, D). This is the standard four-way multiple-choice format used in standardized exams like the SAT, GRE, and MCAT.

Anatomy of a question

text
Subject: college_physics

Question:
A particle of mass m moves in a central force field.
If the particle's orbit is circular with radius r,
what is the angular momentum in terms of m, r, and
the central force F(r)?

(A) L = mr√(F(r)/m)
(B) L = mr√(rF(r))
(C) L = m√(r³F(r)/m)
(D) L = m√(r³F(r))

Answer: (C)

This question requires real understanding. You need to know that for a circular orbit the central force supplies the centripetal force, so F(r) = mv²/r and v = √(rF(r)/m), and that L = mvr for circular motion. Substituting gives L = mr√(rF(r)/m) = m√(r³F(r)/m), which is choice (C). Pure pattern matching won't help; you need to derive the answer.

Why four-way multiple choice?

The format is carefully chosen for evaluation. With four choices, random guessing gives 25% accuracy in expectation. This is the baseline that any model must beat to demonstrate knowledge. It also makes scoring completely objective: there's no ambiguity about whether an answer is correct.

The 25% baseline. Random chance on a 4-way multiple-choice test is 25%. This is the "knows nothing" baseline: a model scoring 25% on a subject has shown no knowledge of it. Above that floor, accuracy is not a direct percentage of material known, because some correct answers are still lucky guesses. Under a simple know-or-guess model, 50% accuracy means the model knows about a third of the answers, since (0.50 - 0.25) / 0.75 ≈ 0.33. 100% is perfect.
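To make the 25% floor concrete, here is a minimal sketch of a chance-corrected score. It assumes a simple model in which the system either knows an answer or guesses uniformly among the choices; the function name `chance_corrected` is illustrative and not part of any MMLU tooling.

```python
def chance_corrected(accuracy: float, num_choices: int = 4) -> float:
    """Estimate the fraction of questions actually known.

    Assumes each question is either known (always answered correctly)
    or guessed uniformly at random among num_choices options.
    """
    chance = 1.0 / num_choices
    return (accuracy - chance) / (1.0 - chance)


if __name__ == "__main__":
    for acc in (0.25, 0.50, 0.75, 1.00):
        print(f"raw accuracy {acc:.2f} -> estimated known fraction "
              f"{chance_corrected(acc):.2f}")
```

Under this assumption a raw score of 25% maps to zero knowledge and 100% maps to full knowledge, with everything in between compressed toward the floor. Real models guess non-uniformly, so treat the corrected value as a rough reading aid only.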

How LLMs answer multiple-choice questions

Language models don't pick from a dropdown menu. They generate text. So how do you evaluate them on multiple choice? There are two approaches:

Completion scoring: Feed the question as a prompt and compare the log-probabilities of generating "A", "B", "C", or "D" as the next token. Pick the one with the highest probability. This is fast (one forward pass) and doesn't require the model to follow instructions.

Generation scoring: Ask the model to generate its answer as text. Parse out whether it said "A", "B", "C", or "D". This tests whether the model can follow the instruction format, but is slower and can fail if the model generates something unexpected.

python
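# Method 1: Completion scoring (MMLU standard)
import torch

LETTERS = ["A", "B", "C", "D"]

def format_question(question, choices):
    # Format: "<question>\nA. choice1\nB. choice2\n...\nAnswer:"
    lines = [question] + [f"{l}. {c}" for l, c in zip(LETTERS, choices)]
    return "\n".join(lines) + "\nAnswer:"

def score_mmlu_question(model, tokenizer, question, choices):
    prompt = format_question(question, choices)
    input_ids = tokenizer.encode(prompt, return_tensors="pt")

    # Get logits for the next token
    with torch.no_grad():
        logits = model(input_ids).logits[0, -1, :]  # [vocab_size]

    # Token ids for " A".." D": after "Answer:" the letter usually follows a space.
    # Taking the last id skips any special prefix token (tokenizer-dependent).
    choice_ids = [tokenizer.encode(" " + l, add_special_tokens=False)[-1]
                  for l in LETTERS]
    choice_logits = logits[choice_ids]           # [4]
    probs = torch.softmax(choice_logits, dim=0)  # [4]

    return probs.argmax().item()  # 0=A, 1=B, 2=C, 3=D

# Method 2: Generation scoring (chat models)
def score_mmlu_chat(model, tokenizer, question, choices):
    prompt = format_question(question, choices)  # already ends with "Answer:"
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        output = model.generate(**inputs, max_new_tokens=1)
    # Decode only the newly generated token, not the echoed prompt
    response = tokenizer.decode(output[0, inputs["input_ids"].shape[1]:])
    return {"A": 0, "B": 1, "C": 2, "D": 3}.get(response.strip(), -1)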

Completion scoring is the standard. The original MMLU paper uses completion scoring — comparing log-probabilities of the answer tokens. This is cleaner because it doesn't depend on the model's instruction-following ability. A base model (not fine-tuned for chat) can be evaluated this way. Later benchmarks like MMLU-Pro use 10 choices (A-J) to reduce the chance of guessing correctly.
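Generation scoring depends on parsing free text, which is where it tends to fail. The following is a small sketch of a more forgiving parser; the function name `extract_choice` and the two patterns are illustrative assumptions, not a standard used by any particular harness.

```python
import re

# Pattern 1: phrases like "The answer is (D)" or "Answer: B".
# The words are case-insensitive, the letter itself must be uppercase.
_ANSWER_PHRASE = re.compile(
    r"(?i:answer)\s*(?:(?i:is)|:)?\s*\(?([A-D])\)?(?![A-Za-z])"
)
# Pattern 2: a response that starts with the letter, e.g. "C", "(B)", "C. Because".
_LEADING_LETTER = re.compile(r"^\s*\(?([A-D])(?:[.):]|\s*$)")


def extract_choice(response: str) -> int:
    """Return 0-3 for A-D, or -1 if no answer letter can be found."""
    for pattern in (_ANSWER_PHRASE, _LEADING_LETTER):
        match = pattern.search(response)
        if match:
            return "ABCD".index(match.group(1))
    return -1


if __name__ == "__main__":
    for text in ["B", "The answer is (D).", "C. Because energy is conserved",
                 "A particle moves in a circle", "I am not sure"]:
        print(repr(text), "->", extract_choice(text))
```

Returning -1 for unparseable output keeps format failures visible as a separate error class, so they are not silently counted as wrong answers about the subject.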

The prompt format

MMLU uses a specific prompt template. Each question is presented with its subject name and few-shot examples:

text
The following are multiple choice questions (with answers) about college physics.

[5 example questions, each ending with "Answer: <letter>"]

A particle of mass m moves...
A. ...
B. ...
C. ...
D. ...
Answer:

The subject name in the prompt matters. It primes the model to use domain-specific knowledge: without it, the model cannot tell whether "What is a field?" is asking about physics, agriculture, or computer science. Removing it can lower accuracy.
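The template above can be assembled with a few lines of string handling. This is a sketch under stated assumptions: the helper names `format_example` and `build_prompt` are hypothetical, answers are stored as indices 0-3, and the exact separators (a period after each letter, a blank line between examples) are one plausible choice that may differ from a given harness.

```python
CHOICE_LETTERS = "ABCD"


def format_example(question, choices, answer=None):
    """Render one question; include the answer letter only for demonstrations."""
    lines = [question]
    lines += [f"{letter}. {choice}"
              for letter, choice in zip(CHOICE_LETTERS, choices)]
    lines.append("Answer:" if answer is None
                 else f"Answer: {CHOICE_LETTERS[answer]}")
    return "\n".join(lines)


def build_prompt(subject, dev_examples, question, choices):
    """Header with the subject name, the demonstrations, then the open question."""
    header = ("The following are multiple choice questions (with answers) "
              f"about {subject.replace('_', ' ')}.")
    shots = [format_example(q, c, a) for q, c, a in dev_examples]
    return "\n\n".join([header, *shots, format_example(question, choices)])


if __name__ == "__main__":
    dev = [("What is 2 + 2?", ["3", "4", "5", "6"], 1)]
    print(build_prompt("elementary_mathematics", dev,
                       "What is 3 + 7 * 2?", ["20", "17", "13", "12"]))
```

The prompt deliberately ends at "Answer:" so that the next token the model produces is the answer letter itself.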

MMLU Question Simulator

See real MMLU-style questions from different subjects. Click "Next Question" to cycle through examples. Watch how the same model might score differently on different subjects.

In MMLU's standard evaluation, how does a language model "choose" an answer for a multiple-choice question?

The model compares the log-probabilities of generating the tokens "A", "B", "C", and "D" as the next token after the question prompt, and picks the one with the highest probability
The model generates a full paragraph explaining its reasoning
The model uses a separate classification head trained for multiple choice

Chapter 3: Few-Shot Evaluation

MMLU was designed with a very specific evaluation protocol: 5-shot evaluation. Before each test question, the model sees five example questions with their correct answers from the same subject. No fine-tuning. No gradient updates. Just in-context learning.

Why few-shot?

The paper's goal is to measure a model's pre-existing knowledge, not its ability to learn from a training set. Fine-tuning on MMLU questions would test how well the model can memorize a study guide. Few-shot evaluation tests what the model already knows.

The five examples serve a specific purpose: they teach the model the format, not the content. They show: "Here's a question. Here are four choices. The answer is the letter." This is crucial because base language models aren't trained to answer multiple-choice questions — they're trained to predict the next token. The few-shot examples bridge that gap.

0-shot

Just the question. No examples. Model must guess the format. Lowest accuracy.

↓

5-shot (standard)

5 examples from the same subject. Model learns the format. Standard MMLU protocol.

↓

Fine-tuned

Trained on MMLU-like data. Can raise accuracy, but no longer tests pre-existing knowledge.

The few-shot gap

The difference between 0-shot and 5-shot performance reveals something important about language models. In the paper, GPT-3 175B reaches 43.9% with five in-context examples and scores several points lower with none. The smaller GPT-3 models stay near the 25% chance level even with five examples. Demonstrations help a model express knowledge it already has; they cannot supply knowledge that is missing.

Few-shot examples teach format, not facts. A common misconception is that the 5 example questions "teach" the model the subject. They don't — 5 questions can't teach college chemistry. What they teach is: "Given this format, output a single letter." The model's chemistry knowledge comes entirely from pre-training. The examples just help it express that knowledge in the expected format.

The dev/val/test split

Each subject has exactly 5 development examples (used as few-shot demonstrations), a validation set (for selecting among model variants), and a test set (for final numbers). The separation is strict: no test question ever appears in the few-shot prompt.

python
# Standard MMLU 5-shot evaluation
from statistics import mean

def evaluate_mmlu(model, tokenizer, subject, test_data, dev_data):
    # Build few-shot prefix from dev set (always 5 examples);
    # build_few_shot_prompt is a helper assumed to be defined elsewhere
    prefix = build_few_shot_prompt(subject, dev_data[:5])

    correct = 0
    for q, choices, answer in test_data:
        # Prepend the 5 examples to the test question, then score it.
        # score_mmlu_question takes (question, choices) and returns 0-3,
        # so answer must use the same 0=A..3=D encoding
        pred = score_mmlu_question(model, tokenizer, prefix + q, choices)
        if pred == answer:
            correct += 1

    return correct / len(test_data)  # accuracy for this subject

# Evaluate all 57 subjects (ALL_SUBJECTS, test, dev are assumed to be loaded)
results = {}
for subject in ALL_SUBJECTS:
    results[subject] = evaluate_mmlu(model, tokenizer, subject,
                                     test[subject], dev[subject])

# Unweighted (macro) average over subjects
overall = sum(results.values()) / len(results)
# Category averages (STEM_SUBJECTS is an assumed list of subject names)
stem_avg = mean(results[s] for s in STEM_SUBJECTS)

Macro vs micro averaging

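There are two ways to collapse 57 subject scores into one number. The macro average computes accuracy per subject and then averages those 57 numbers, which gives each subject equal weight regardless of size: a subject with 100 questions counts the same as one with 1,000. The micro average pools all test questions together, which lets large subjects dominate the score. Reported MMLU numbers are not all computed the same way (the original evaluation script averages over pooled questions, while some evaluation harnesses report the unweighted subject mean), so check which convention a score uses before comparing.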
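The difference between the two averages is easiest to see on a toy example. The sketch below uses made-up counts for two hypothetical subjects; `macro_micro` is an illustrative helper, not part of any evaluation harness.

```python
def macro_micro(per_subject):
    """per_subject maps subject name -> (num_correct, num_questions)."""
    # Macro: average the per-subject accuracies, one vote per subject
    macro = sum(c / n for c, n in per_subject.values()) / len(per_subject)
    # Micro: pool every question, one vote per question
    total_correct = sum(c for c, _ in per_subject.values())
    total_questions = sum(n for _, n in per_subject.values())
    return macro, total_correct / total_questions


if __name__ == "__main__":
    # Toy numbers: a small subject the model is good at, a large one it is not
    toy = {"small_subject": (90, 100), "large_subject": (500, 1000)}
    macro, micro = macro_micro(toy)
    print(f"macro = {macro:.3f}, micro = {micro:.3f}")
```

With these toy counts the macro average is 0.700 while the micro average is about 0.536, because the large subject contributes ten times as many questions to the pooled total.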

Few-Shot Effect Visualizer

Adjust the number of few-shot examples (0 to 5) and see how accuracy changes for different model sizes.

What do the 5 few-shot examples in MMLU's standard evaluation protocol teach the model?

The subject matter (chemistry, law, etc.)
The answer format ("here's a question with four choices, respond with a single letter"), not the actual content, which comes entirely from pre-training
How to reason step by step about questions

Chapter 4: Scoring LLMs

With the benchmark defined, it's time to measure actual models. The 2021 paper evaluated models ranging from small (1.5B parameters) to the largest available at the time (175B). The results revealed a clear scaling pattern — and some surprising weaknesses.

Models evaluated

Model	Parameters	MMLU accuracy	vs Random (25%)
GPT-2 (fine-tuned)	1.5B	32.4%	+7.4%
GPT-3 Small (5-shot)	2.7B	25.9%	+0.9%
GPT-3 Medium (5-shot)	6.7B	24.9%	-0.1%
GPT-3 Large (5-shot)	13B	26.0%	+1.0%
GPT-3 X-Large (5-shot)	175B	43.9%	+18.9%
UnifiedQA (fine-tuned on QA data)	11B	48.9%	+23.9%
Human expert (estimated)	—	89.8%	+64.8%

Several things jump out. First, even the best model (UnifiedQA, which was fine-tuned on QA tasks) scores below 50%, and on many subjects it is barely better than random. Meanwhile, estimated human expert accuracy is nearly 90%. The gap of roughly 41 points between the best model and human experts is enormous.

Second, scale helps, but not smoothly. The few-shot GPT-3 models at 2.7B, 6.7B, and 13B parameters all sit within about a point of chance; only the 175B model breaks away, reaching 43.9%. Going from 13B to 175B, roughly a 13x increase, buys about 18 percentage points, while everything below that buys essentially nothing.

The 2021 verdict: Language models in early 2021 were far from expert-level knowledge. Even the largest models barely exceeded chance on subjects like abstract algebra (26%), professional law (33%), and clinical knowledge (35%). The gap to human performance was vast. MMLU was designed to be hard enough to last — and in 2021, it was.

Subject-level variation

The most revealing finding is the variation across subjects. GPT-3 175B scored:

Subject	GPT-3 Score	Interpretation
US Foreign Policy	69%	Genuinely knows this topic
Marketing	68%	Strong knowledge
High School Psychology	64%	Good recall of factual content
College Physics	32%	Only slightly above chance
Abstract Algebra	26%	At chance
College Chemistry	28%	Near chance

The pattern is clear: GPT-3 excels at factual recall subjects (history, politics, psychology) but fails at reasoning-heavy subjects (math, physics, chemistry). It remembers facts from its training data but can't solve multi-step problems. This distinction — recall vs. reasoning — became a major theme in AI research.

The role of fine-tuning

UnifiedQA outperforms GPT-3 despite being 16x smaller (11B vs 175B). Why? Because it was fine-tuned on question-answering datasets. This suggests that targeted fine-tuning on QA tasks transfers to MMLU, even though UnifiedQA never saw MMLU questions during training. The fine-tuning teaches the model to extract and apply knowledge more effectively.

python
# Analyzing MMLU results by category
import numpy as np

# Illustrative per-subject scores (%), a few subjects per category,
# not the full per-subject results
gpt3_results = {
    "stem":       [26, 32, 28, 33, 38],   # algebra, physics, chem, ...
    "humanities": [45, 52, 48, 41, 55],   # history, law, philosophy, ...
    "social":     [64, 58, 68, 55, 72],   # psych, econ, foreign policy, ...
}

# Gap analysis: where is the model weakest?
for cat, scores in gpt3_results.items():
    avg = np.mean(scores)
    gap = 89.8 - avg  # gap to human expert
    print(f"{cat}: avg={avg:.1f}%, gap to human={gap:.1f}%")
# stem:       avg=31.4%, gap to human=58.4%
# humanities: avg=48.2%, gap to human=41.6%
# social:     avg=63.4%, gap to human=26.4%

Model Performance by Subject

Compare model scores across subjects. Each bar shows accuracy (25% = random). Select a model to see its per-subject breakdown.

In the original 2021 evaluation, what pattern did subject-level scores reveal about GPT-3?

GPT-3 scored uniformly across all subjects
GPT-3 was best at STEM subjects
GPT-3 excelled at factual recall subjects (history, psychology, marketing ~60-70%) but failed at reasoning-heavy subjects (math, physics, chemistry ~25-30%), revealing a recall-vs-reasoning gap

Chapter 5: Results & Scaling

MMLU didn't just benchmark 2021 models; it became the tracking metric for the next several years of progress. Every major model release reported its MMLU score. The trajectory from 2021 to 2024 tells the story of modern AI.

The MMLU leaderboard progression

Year	Model	MMLU Score	Jump from previous
2021	GPT-3 175B	43.9%	—
2022	Chinchilla 70B	67.6%	+23.7%
2022	PaLM 540B	69.3%	+1.7%
2023	GPT-4	86.4%	+17.1%
2023	Gemini Ultra	90.0%	+3.6%
2024	Claude 3.5 Sonnet	88.7%	—
2024	GPT-4o	88.7%	—
Human expert	—	89.8%	—

The progression is dramatic: from 44% to 90% in under three years. The curve isn't smooth; there are two big jumps. The first is Chinchilla in 2022 (+24 points), which showed that the amount of training data matters as much as model size. The second is GPT-4 in 2023 (+17 points), which crossed 85% and approached human expert performance. One caveat on comparability: Gemini Ultra's 90.0% was reported with chain-of-thought prompting and 32 samples per question rather than the standard 5-shot protocol.

MMLU is now "solved." By late 2023, frontier models matched or exceeded human expert performance on MMLU. The benchmark Hendrycks designed to "last" was saturated in under three years. This is both a testament to the pace of progress and a warning: even benchmarks designed to be hard get conquered. The community moved to MMLU-Pro (10 choices, harder questions) and GPQA (PhD-level) as successors.

What drove the improvements?

Three factors drove the 44% to 90% jump:

1. Better pre-training data. Chinchilla showed that training on more tokens (not just bigger models) dramatically improves knowledge. The "compute-optimal" insight: a 70B model trained on 1.4T tokens beats a 280B model trained on 300B tokens.

2. Instruction tuning. RLHF and instruction-following training didn't just make models more helpful — it made them better at expressing knowledge in the expected format. ChatGPT-style models reliably pick "A"/"B"/"C"/"D" without needing the few-shot format crutch.

3. Chain-of-thought reasoning. Prompting or training models to work through a problem step by step helps on multi-step questions. This narrowed the "reasoning gap" that plagued GPT-3 on STEM subjects: GPT-4-era models score far above chance on college physics and abstract algebra, subjects where GPT-3 scored near random.

The scaling law for knowledge

A common shorthand for the size trend is a log-linear relationship between parameter count and MMLU accuracy. Treat it as a rough summary, not a law: drawing a line through the smallest and largest models in the 2021 table (32.4% at 1.5B, 43.9% at 175B) gives only about 5 to 6 percentage points per 10x increase in parameters, the model sizes in between need not fall on that line, and the relationship breaks down for instruction-tuned models, where training methodology matters more than raw size.

MMLU(N) ≈ α · log₁₀(N) + β

Where N is the parameter count in billions. For the two 2021 endpoints above, α ≈ 5.6 and β ≈ 31. Both values are approximate, and β depends on the unit chosen for N.
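As a worked check of the formula, the sketch below fits α and β to two points from the 2021 table (32.4% at 1.5B and 43.9% at 175B). It assumes N is measured in billions of parameters, so the intercept β will differ from values quoted under other units, and a two-point fit is only a description of those two points. The function names are illustrative.

```python
import math


def fit_log_linear(point_a, point_b):
    """Fit accuracy = alpha * log10(N) + beta through two (N, accuracy) points."""
    (n_a, acc_a), (n_b, acc_b) = point_a, point_b
    alpha = (acc_b - acc_a) / (math.log10(n_b) - math.log10(n_a))
    beta = acc_a - alpha * math.log10(n_a)
    return alpha, beta


def predict(alpha, beta, n_billion):
    """Accuracy (in %) that the fitted line gives for a model of this size."""
    return alpha * math.log10(n_billion) + beta


if __name__ == "__main__":
    # N in billions of parameters, accuracy in percent
    alpha, beta = fit_log_linear((1.5, 32.4), (175.0, 43.9))
    print(f"alpha = {alpha:.2f} points per 10x, beta = {beta:.2f}")
    print(f"line value at 13B: {predict(alpha, beta, 13.0):.1f}%")
```

The slope comes out near 5.6 points per tenfold increase in parameters. Comparing the line's value at an intermediate size against the measured score for that size is a quick way to see how well, or how poorly, a straight line describes the data.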

MMLU Progress Timeline

Drag the slider to advance through time and see how model scores on MMLU improved from 2021 to 2024. The dashed line marks human expert performance (89.8%).

What three factors drove MMLU scores from 44% (GPT-3, 2021) to 90% (Gemini Ultra, 2023)?

Better pre-training data (more tokens), instruction tuning (RLHF for format compliance), and chain-of-thought reasoning (closing the STEM reasoning gap)
Bigger models, faster GPUs, and longer context windows
Training directly on MMLU questions

Chapter 6: MMLU Explorer

Let's bring it all together. This interactive explorer lets you simulate an MMLU evaluation across subjects, model sizes, and shot counts. You'll see exactly how the benchmark works — question by question, subject by subject.

MMLU Evaluation Simulator

Run a simulated MMLU evaluation. Select a model era, pick a subject category, then click "Evaluate" to see the model answer questions one by one. Watch accuracy build up across questions and compare to the 25% random baseline.

What the simulator shows

Each "evaluation" runs 20 simulated questions. For each question, the model produces a probability distribution over A/B/C/D. The highest probability is the model's answer. Green = correct, red = incorrect. The running accuracy updates in real time.

Notice the key patterns: GPT-3-era models hover near 30-40% on STEM but do better on social sciences. GPT-4-era models score 80%+ across most subjects. Frontier models approach 90% everywhere except the hardest reasoning tasks.

Where MMLU fails. MMLU has known weaknesses: some questions have errors in the answer key, the 4-choice format allows educated guessing (25% floor), and contamination is rampant — newer models may have seen MMLU questions in their training data. MMLU-Pro (2024) addresses these with 10 choices, harder questions, and decontamination. But MMLU remains the historical standard everyone compares to.

Contamination: the elephant in the room

By 2023, MMLU contamination became a serious concern. MMLU questions are publicly available online. Any web-crawled training dataset likely includes them. A model that memorized the answers isn't demonstrating knowledge — it's demonstrating memorization.

Researchers use several heuristics to detect contamination: checking if the model can complete question stems verbatim, comparing performance on verbatim vs. paraphrased questions, and looking for suspiciously high scores on obscure subjects. The problem has no clean solution — once a benchmark is public, contamination is inevitable.

python
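# Simple contamination check (heuristic sketch)
from difflib import SequenceMatcher

def check_contamination(generate, question_stem, prefix_chars=50, threshold=0.8):
    # generate: any callable mapping a prompt string to the model's continuation.
    # If the model can reproduce the rest of the question from its opening
    # characters, it has likely seen the question during training
    prompt = question_stem[:prefix_chars]
    reference = question_stem[prefix_chars:]
    completion = generate(prompt)[:len(reference)]

    # Character-level similarity ratio, used here as a stand-in for ROUGE-L
    overlap = SequenceMatcher(None, completion, reference).ratio()
    return overlap > threshold  # high overlap = likely contaminated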
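
The second heuristic mentioned above, comparing verbatim and paraphrased questions, can be sketched in the same spirit. The code assumes you already have per-question correctness flags for the original wording and for a reworded version of the same questions; the function names and the 10-point threshold are illustrative assumptions, not an established standard.

```python
def paraphrase_gap(verbatim_correct, paraphrased_correct):
    """Accuracy on original wording minus accuracy on reworded questions.

    Both arguments are equal-length lists of booleans, one entry per question.
    """
    if not verbatim_correct or len(verbatim_correct) != len(paraphrased_correct):
        raise ValueError("need two non-empty lists of equal length")
    acc_verbatim = sum(verbatim_correct) / len(verbatim_correct)
    acc_paraphrased = sum(paraphrased_correct) / len(paraphrased_correct)
    return acc_verbatim - acc_paraphrased


def looks_contaminated(gap, threshold=0.10):
    """Flag a subject when rewording costs more than the chosen threshold."""
    return gap > threshold


if __name__ == "__main__":
    # Toy flags: 9 of 10 right on the original text, 6 of 10 after rewording
    verbatim = [True] * 9 + [False]
    paraphrased = [True] * 6 + [False] * 4
    gap = paraphrase_gap(verbatim, paraphrased)
    print(f"gap = {gap:.2f}, flagged = {looks_contaminated(gap)}")
```

A large gap is suggestive, not conclusive: paraphrasing can also make a question harder or change its meaning, so flagged subjects deserve a manual look.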

Why is benchmark contamination a growing problem for MMLU?

MMLU questions are publicly available online, so web-crawled training data likely contains them; a model that memorized the answers demonstrates memorization, not genuine knowledge
Models are getting too good at reasoning
The question format is too easy

Chapter 7: Connections

MMLU's impact goes far beyond being a benchmark. It established the paradigm for how we evaluate large language models — and its limitations spawned an entire ecosystem of successor benchmarks.

MMLU's legacy

Benchmark	Year	How It Relates to MMLU
MMLU-Pro	2024	10 choices (A-J), harder questions, decontaminated
GPQA	2023	PhD-level questions only, expert-verified, "Google-proof"
ARC	2018	Science questions graded by difficulty (predates MMLU)
HellaSwag	2019	Commonsense reasoning via sentence completion (predates MMLU)
HELM	2022	Multi-dimensional evaluation (not just accuracy)
BIG-Bench	2022	200+ tasks crowd-sourced from researchers
TruthfulQA	2022	Tests for hallucination and truthfulness

What MMLU got right

Subject-level granularity. Breaking evaluation into 57 subjects revealed patterns that aggregate scores hide. This influenced every benchmark that followed.

Real human exams. Using actual educational materials grounded the benchmark in human-meaningful knowledge. A score on MMLU has a real-world interpretation: "this model could pass a college chemistry exam."

Clean format. Four-way multiple choice with a 25% floor is simple and unambiguous to grade. No subjective grading and no human judges are needed. Scores do still depend on the prompt template and the scoring method, so reproducing a number requires matching both.

What MMLU got wrong

Four choices is too few. With A/B/C/D, a model can sometimes guess correctly by eliminating one or two obviously wrong answers. MMLU-Pro's 10 choices make this much harder.

No reasoning chain required. MMLU only checks the final answer, not the reasoning. A model might get the right answer for the wrong reason (lucky guess) or the wrong answer despite sound reasoning (computational error). Process-level evaluation would be more informative.

Static and contaminated. Once published, the questions can't be changed. And they leaked into training data within months.
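The "four choices is too few" point can be made concrete with a little arithmetic. The sketch below assumes a test-taker who rules out some wrong options with certainty and then guesses uniformly among the rest; `guess_accuracy` is an illustrative name.

```python
from fractions import Fraction


def guess_accuracy(num_choices: int, num_eliminated: int = 0) -> Fraction:
    """Chance of a correct guess after ruling out some wrong options.

    Assumes only wrong options are eliminated and the guess is uniform
    over the options that remain.
    """
    remaining = num_choices - num_eliminated
    if remaining < 1:
        raise ValueError("cannot eliminate every option")
    return Fraction(1, remaining)


if __name__ == "__main__":
    for total in (4, 10):
        for eliminated in (0, 1, 2):
            p = guess_accuracy(total, eliminated)
            print(f"{total} choices, {eliminated} eliminated: {float(p):.3f}")
```

Ruling out two options turns a 4-way question into a coin flip, while the same effort on a 10-way question still leaves a 1-in-8 chance.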

The benchmark lifecycle. MMLU exemplifies a pattern: a hard benchmark is proposed, it drives progress for 2-3 years, models saturate it, and the community moves on. SuperGLUE lasted ~2 years before saturation. MMLU lasted ~3. GPQA (PhD-level) may last longer. The arms race between benchmarks and models is fundamental to the field.

Connected papers

HELM (Holistic Evaluation) — Goes beyond accuracy to measure calibration, robustness, fairness, and efficiency. Read the HELM lesson →

Chain of Thought — The technique that closed MMLU's STEM gap by enabling step-by-step reasoning. Read the CoT lesson →

Self-Consistency — Sampling multiple reasoning paths and voting on the answer, improving MMLU accuracy significantly. Read the Self-Consistency lesson →
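The voting step in self-consistency is simple enough to sketch directly. The code assumes the answer letters have already been extracted from several sampled responses; `majority_vote` is an illustrative helper, and the tie-breaking rule (first letter to reach the top count wins) is an arbitrary choice.

```python
from collections import Counter

VALID_LETTERS = {"A", "B", "C", "D"}


def majority_vote(sampled_answers):
    """Return the most frequent valid answer letter, or None if there is none."""
    votes = Counter(a for a in sampled_answers if a in VALID_LETTERS)
    if not votes:
        return None
    # most_common orders equal counts by first appearance in the input
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    samples = ["B", "C", "B", "A", "B"]  # letters parsed from five sampled responses
    print(majority_vote(samples))
```

Invalid or missing answers are dropped before counting, so a single malformed sample does not affect the vote.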

MMLU Impact Map

See how MMLU connects to the broader evaluation landscape. Click nodes to highlight relationships.

What is the main limitation that led to MMLU-Pro as a successor benchmark?

MMLU questions were too easy
With only 4 choices, models can guess by elimination; MMLU questions leaked into training data causing contamination; and models saturated the benchmark at ~90%. MMLU-Pro addresses all three with 10 choices, harder questions, and decontamination
MMLU didn't cover enough subjects

MMLU: Measuring Massive Multitask Language Understanding