A 57-subject, 15,908-question multiple-choice benchmark spanning STEM, humanities, social sciences, and professional domains, built to measure what a language model actually knows rather than how well it parses language.
It's late 2020 and GPT-3 has just arrived. It can write essays, answer questions, translate languages. But how do you actually measure what it knows? The existing benchmarks are woefully inadequate.
Consider SuperGLUE, the premier NLP benchmark of 2019. It tests things like reading comprehension, textual entailment, and word sense disambiguation. By early 2021, models have already surpassed human performance on SuperGLUE. Does that mean AI "understands" language? Not remotely. SuperGLUE tests linguistic competence — can the model parse sentences? — but never asks whether it actually knows any facts about the world.
A model can ace SuperGLUE without knowing that mitochondria produce ATP, that the Treaty of Westphalia was signed in 1648, or that the Miranda warning is required before custodial interrogation. These are the kinds of knowledge that separate a well-educated human from a very good autocomplete system.
Hendrycks et al. had a bold idea: test language models the same way we test humans. Give them actual exam questions, from high school biology to professional law, from college chemistry to abstract algebra. 57 subjects, 15,908 questions. If a model can pass these exams, that is strong evidence it has absorbed the material. If it can't, we know exactly where its knowledge gaps are.
This was radical. Previous benchmarks treated NLP as a single skill. MMLU treats it as 57 different skills. A model might be excellent at history but terrible at medicine. A model might understand elementary math but fail college-level physics. MMLU reveals these patterns.
MMLU organizes its 15,908 questions into 57 subjects spanning four broad categories. This isn't arbitrary — it mirrors how human education is organized, from elementary school through professional certification.
STEM (18 subjects): Abstract algebra, astronomy, college biology, college chemistry, college computer science, college mathematics, college physics, computer security, conceptual physics, electrical engineering, elementary mathematics, high school biology, high school chemistry, high school computer science, high school mathematics, high school physics, high school statistics, machine learning.
Humanities (13 subjects): Formal logic, high school European history, high school US history, high school world history, international law, jurisprudence, logical fallacies, moral disputes, moral scenarios, philosophy, prehistory, professional law, world religions.
Social Sciences (12 subjects): Econometrics, high school geography, high school government and politics, high school macroeconomics, high school microeconomics, high school psychology, human sexuality, professional psychology, public relations, security studies, sociology, US foreign policy.
Other (14 subjects): Anatomy, business ethics, clinical knowledge, college medicine, global facts, human aging, management, marketing, medical genetics, miscellaneous, nutrition, professional accounting, professional medicine, virology.
The subjects span a natural difficulty gradient. Elementary mathematics asks "What is 3 + 7 * 2?" — basic arithmetic. College mathematics asks about group theory and real analysis. Professional law asks questions from the actual bar exam. This means MMLU can track progress along the entire difficulty curve.
| Level | Example subject | Typical question | Approx. human expert accuracy |
|---|---|---|---|
| Elementary | Elementary Math | Basic arithmetic and algebra | ~95% |
| High School | HS Physics | Projectile motion, circuits, waves | ~85% |
| College | College Chemistry | Organic reactions, thermodynamics | ~75% |
| Professional | Professional Law | Bar exam questions | ~65% |
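Notice that even human experts don't score 100% on the hardest subjects. One comparison needs care, though: the bar exam's pass rate of about 60-70% is the share of candidates who pass, not the share of questions a passing candidate answers correctly. A model scoring 60% on professional law is therefore close to the expert estimate in the table above, but that alone doesn't show it would pass the bar.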
Every question comes from real educational materials: practice exams, textbooks, online quizzes. This isn't generated data or crowdsourced annotations — it's the same material humans study. The questions are designed to test genuine understanding, not pattern matching.
Each subject has three splits: a few-shot development set (5 examples used as in-context demonstrations), a validation set (for hyperparameter tuning), and a test set (for final evaluation). The test set ranges from ~100 to ~1,500 questions per subject.
Every MMLU question follows the same structure: a question stem followed by exactly four answer choices (A, B, C, D). This is the same kind of multiple-choice format used in standardized exams such as the SAT and MCAT.
```text
Subject: college_physics
Question: A particle of mass m moves in a central force field. If the particle's orbit is circular with radius r, what is the angular momentum in terms of m, r, and the magnitude of the central force F(r)?
(A) L = m√(rF(r)/m)
(B) L = r√(rF(r)/m)
(C) L = m√(r³F(r)/m)
(D) L = m√(r³F(r))
Answer: (C)
```
This question requires real understanding. You need to know that for circular orbits, the centripetal force equals F(r), that L = mvr for circular motion, and that v = √(rF(r)/m). Pure pattern matching won't help — you need to derive the answer.
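The three facts listed above chain together as follows. This is a worked sketch that assumes F(r) denotes the magnitude of an attractive central force:

```latex
\begin{aligned}
F(r) &= \frac{m v^{2}}{r}
  && \text{(the central force supplies the centripetal force)} \\
v &= \sqrt{\frac{r\,F(r)}{m}} \\
L &= m v r = m r \sqrt{\frac{r\,F(r)}{m}} = m \sqrt{\frac{r^{3} F(r)}{m}}
\end{aligned}
```

The last form is the expression in choice (C).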
The format is carefully chosen for evaluation. With four choices, random guessing gives 25% accuracy in expectation. This is the baseline that any model must beat to demonstrate knowledge. It also makes scoring objective: an answer either matches the key or it doesn't.
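The 25% floor is an average, not a guarantee: on a small test set, a pure guesser can land several points above or below it. The sketch below estimates how wide that band is using a normal approximation to the binomial. The function name `chance_band` is ours for illustration and is not part of MMLU.

```python
import math

def chance_band(n_questions, n_choices=4, z=1.96):
    """Return the (low, high) accuracy range a random guesser hits ~95% of the time.

    Uses the normal approximation to the binomial, which is reasonable
    for test sets of roughly 100 questions or more.
    """
    p = 1.0 / n_choices
    stderr = math.sqrt(p * (1.0 - p) / n_questions)
    return p - z * stderr, p + z * stderr

# MMLU subjects range from about 100 to about 1,500 test questions.
for n in (100, 1500):
    low, high = chance_band(n)
    print(f"n={n}: chance band = {low:.1%} to {high:.1%}")
```

On a 100-question subject the band runs from about 16.5% to 33.5%, so a score of 30% on such a subject is still compatible with guessing. On a 1,500-question subject the band narrows to roughly 23% to 27%.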
Language models don't pick from a dropdown menu. They generate text. So how do you evaluate them on multiple choice? There are two approaches:
Completion scoring: Feed the question as a prompt and compare the log-probabilities of generating "A", "B", "C", or "D" as the next token. Pick the one with the highest probability. This is fast (one forward pass) and doesn't require the model to follow instructions.
Generation scoring: Ask the model to generate its answer as text. Parse out whether it said "A", "B", "C", or "D". This tests whether the model can follow the instruction format, but is slower and can fail if the model generates something unexpected.
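```python
# Method 1: Completion scoring (MMLU standard)
import torch

LETTERS = ["A", "B", "C", "D"]

def format_question(question, choices):
    # Format: "<question>\nA. choice1\nB. choice2\n...\nAnswer:"
    options = "\n".join(f"{letter}. {choice}" for letter, choice in zip(LETTERS, choices))
    return f"{question}\n{options}\nAnswer:"

def score_mmlu_question(model, tokenizer, prompt):
    # `prompt` is the full text ending in "Answer:" (few-shot prefix included, if any).
    # Assumes a Hugging Face causal LM and its tokenizer.
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids

    # Get logits for the next token
    with torch.no_grad():
        logits = model(input_ids).logits[0, -1, :]  # [vocab_size]

    # Token ids for " A".." D" as they would follow "Answer:". Skipping special
    # tokens and taking the last id avoids picking up a BOS token by mistake.
    # (Whether the letter carries a leading space is tokenizer-dependent.)
    choice_ids = [tokenizer.encode(" " + letter, add_special_tokens=False)[-1]
                  for letter in LETTERS]
    choice_logits = logits[choice_ids]           # [4]
    probs = torch.softmax(choice_logits, dim=0)  # [4]
    return probs.argmax().item()                 # 0=A, 1=B, 2=C, 3=D

# Method 2: Generation scoring (chat models)
def score_mmlu_chat(model, tokenizer, question, choices):
    prompt = format_question(question, choices)  # already ends with "Answer:"
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        output = model.generate(**inputs, max_new_tokens=1, do_sample=False)
    # Decode only the newly generated token, not the echoed prompt
    new_tokens = output[0, inputs.input_ids.shape[1]:]
    response = tokenizer.decode(new_tokens, skip_special_tokens=True)
    return {"A": 0, "B": 1, "C": 2, "D": 3}.get(response.strip(), -1)
```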
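Generation scoring stands or falls on the parsing step, because chat models rarely answer with a bare letter. Below is a minimal sketch of a more forgiving parser. The regular expressions and the name `extract_choice` are illustrative assumptions, not the parsing rules of any particular evaluation harness.

```python
import re

LETTER_TO_INDEX = {"A": 0, "B": 1, "C": 2, "D": 3}

# Tried in order:
#   1. an explicit "answer is X" / "Answer: X" phrase (the phrase is
#      case-insensitive, the letter itself must be upper case);
#   2. a bare or parenthesised letter at the very start of the response,
#      followed by punctuation or the end of the text.
_PATTERNS = [
    re.compile(r"(?i:answer\s*(?:is|:)?\s*)\(?([ABCD])\)?(?![A-Za-z])"),
    re.compile(r"^\s*\(?([ABCD])\)?\s*(?:[.:,]|$)"),
]

def extract_choice(response):
    """Map a free-text model response to 0-3, or -1 if no letter is found."""
    for pattern in _PATTERNS:
        match = pattern.search(response)
        if match:
            return LETTER_TO_INDEX[match.group(1)]
    return -1

print(extract_choice("The answer is D."))  # explicit phrase
print(extract_choice(" (C)"))              # parenthesised letter
print(extract_choice("I am not sure"))     # no letter found
```

Returning -1 for unparseable output, as the original snippet does, keeps format failures visible instead of silently counting them as wrong guesses for one particular letter.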
MMLU uses a specific prompt template. Each question is presented with its subject name and few-shot examples:
```text
The following are multiple choice questions (with answers) about college physics.

[5 example questions, each ending in "Answer: <letter>"]

A particle of mass m moves...
A. ...
B. ...
C. ...
D. ...
Answer:
```
The subject name in the prompt is important. It primes the model to use domain-specific knowledge. Without it, accuracy can drop, because the model doesn't know whether "What is a field?" is asking about physics, agriculture, or computer science.
MMLU was designed with a very specific evaluation protocol: 5-shot evaluation. Before each test question, the model sees five example questions with their correct answers from the same subject. No fine-tuning. No gradient updates. Just in-context learning.
The paper's goal is to measure a model's pre-existing knowledge, not its ability to learn from a training set. Fine-tuning on MMLU questions would test how well the model can memorize a study guide. Few-shot evaluation tests what the model already knows.
The five examples serve a specific purpose: they teach the model the format, not the content. They show: "Here's a question. Here are four choices. The answer is the letter." This is crucial because base language models aren't trained to answer multiple-choice questions — they're trained to predict the next token. The few-shot examples bridge that gap.
The difference between 0-shot and 5-shot performance reveals something important about language models. In the paper, GPT-3 175B goes from 37.7% accuracy (0-shot) to 43.9% accuracy (5-shot), a gain of about six points with no change to the model's weights. The examples add no physics or law to what the model knows. What improves is how reliably the model maps that knowledge onto an answer letter.
Each subject has exactly 5 development examples (used as few-shot demonstrations), a validation set (for selecting among model variants), and a test set (for final numbers). The separation is strict: no test question ever appears in the few-shot prompt.
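```python
# Standard MMLU 5-shot evaluation
from statistics import mean

LETTERS = "ABCD"

def format_single_question(q, choices, answer=None):
    # One question block; pass `answer` (0-3) to append the gold letter (few-shot examples)
    text = q + "".join(f"\n{l}. {c}" for l, c in zip(LETTERS, choices)) + "\nAnswer:"
    return text if answer is None else f"{text} {LETTERS[answer]}\n\n"

def build_few_shot_prompt(subject, dev_examples):
    name = subject.replace("_", " ")
    header = f"The following are multiple choice questions (with answers) about {name}.\n\n"
    return header + "".join(format_single_question(q, c, a) for q, c, a in dev_examples)

def evaluate_mmlu(model, tokenizer, subject, test_data, dev_data):
    # Build few-shot prefix from dev set (always 5 examples)
    prefix = build_few_shot_prompt(subject, dev_data[:5])
    correct = 0
    for q, choices, answer in test_data:
        # Append test question after the 5 examples
        prompt = prefix + format_single_question(q, choices)
        # score_mmlu_question is assumed to take the full prompt and return 0-3
        pred = score_mmlu_question(model, tokenizer, prompt)
        correct += int(pred == answer)
    return correct / len(test_data)  # accuracy for this subject

# Evaluate all 57 subjects.
# model, tokenizer, ALL_SUBJECTS, STEM_SUBJECTS, test and dev are assumed to be loaded elsewhere.
results = {s: evaluate_mmlu(model, tokenizer, s, test[s], dev[s]) for s in ALL_SUBJECTS}

overall = mean(results.values())                     # macro average over subjects
stem_avg = mean(results[s] for s in STEM_SUBJECTS)   # category average
```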
The code above computes the macro average across subjects: compute accuracy per subject, then average those 57 numbers. This gives equal weight to each subject regardless of how many questions it has, so a subject with 100 questions counts the same as one with 1,000. The alternative, micro averaging (pooling all questions together), lets large subjects dominate the score. Implementations differ on which one they report, so check the convention before comparing published numbers.
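The two averaging schemes can disagree sharply when subject sizes and accuracies both vary. A small sketch with hypothetical counts, chosen only to make the gap visible, shows the difference:

```python
def macro_average(per_subject):
    """Mean of per-subject accuracies; every subject counts equally."""
    accuracies = [correct / total for correct, total in per_subject.values()]
    return sum(accuracies) / len(accuracies)

def micro_average(per_subject):
    """Accuracy over the pooled questions; large subjects count more."""
    correct = sum(c for c, _ in per_subject.values())
    total = sum(t for _, t in per_subject.values())
    return correct / total

# Hypothetical (correct, total) counts, not real MMLU results.
toy = {
    "small_subject": (90, 100),    # 90% on 100 questions
    "large_subject": (450, 1500),  # 30% on 1,500 questions
}

print(f"macro: {macro_average(toy):.3f}")  # (0.90 + 0.30) / 2
print(f"micro: {micro_average(toy):.3f}")  # 540 / 1600
```

Here the macro average is 0.60 while the micro average is about 0.34, because the large subject contributes 1,500 of the 1,600 pooled questions.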
With the benchmark defined, it's time to measure actual models. The 2021 paper evaluated models ranging from small (1.5B parameters) to the largest available at the time (175B). The results revealed a clear scaling pattern — and some surprising weaknesses.
| Model | Parameters | Setting | MMLU accuracy | vs. random (25%) |
|---|---|---|---|---|
| GPT-2 | 1.5B | Fine-tuned | 32.4% | +7.4 pts |
| GPT-3 Small | 2.7B | 5-shot | 25.9% | +0.9 pts |
| GPT-3 Medium | 6.7B | 5-shot | 24.9% | -0.1 pts |
| GPT-3 Large | 13B | 5-shot | 26.0% | +1.0 pts |
| GPT-3 X-Large | 175B | 5-shot | 43.9% | +18.9 pts |
| UnifiedQA | 11B | Fine-tuned on QA data | 48.9% | +23.9 pts |
| Human expert (estimate) | n/a | n/a | 89.8% | +64.8 pts |
Several things jump out. First, even the best model (UnifiedQA, which was fine-tuned on QA tasks) scores below 50%, and on many subjects it is barely better than random. Meanwhile, the paper estimates expert-level accuracy at nearly 90%. The 41-point gap between the best model and human experts is enormous.
Second, scale helps, but not smoothly. The GPT-3 variants up to 13B parameters sit at chance, and only the 175B model pulls clearly away, reaching 43.9%. Even then, going from GPT-2 at 1.5B to GPT-3 at 175B, a 117x increase in parameters, buys only about 12 percentage points, and the two were not evaluated the same way (GPT-2 was fine-tuned, GPT-3 was prompted few-shot). At this rate, you'd need trillions of parameters to approach human performance.
The most revealing finding is the variation across subjects. GPT-3 175B scored:
| Subject | GPT-3 score | Interpretation |
|---|---|---|
| US Foreign Policy | 69% | Best subject: solid factual knowledge |
| Marketing | 68% | Strong knowledge |
| High School Psychology | 64% | Good recall of factual content |
| College Physics | 32% | Only slightly above the 25% chance level |
| Abstract Algebra | 26% | At chance |
| College Chemistry | 26% | At chance |
The pattern is clear: GPT-3 excels at factual recall subjects (history, politics, psychology) but fails at reasoning-heavy subjects (math, physics, chemistry). It remembers facts from its training data but can't solve multi-step problems. This distinction — recall vs. reasoning — became a major theme in AI research.
UnifiedQA outperforms GPT-3 despite being 16x smaller (11B vs 175B). Why? Because it was fine-tuned on question-answering datasets. This suggests that targeted fine-tuning on QA tasks transfers to MMLU, even though UnifiedQA never saw MMLU questions during training. The fine-tuning teaches the model to extract and apply knowledge more effectively.
```python
# Analyzing MMLU results by category
import numpy as np

# Illustrative GPT-3 per-subject scores (%), a handful per category
gpt3_results = {
    "stem":       [26, 32, 26, 33, 38],  # algebra, physics, chem, ...
    "humanities": [45, 52, 48, 41, 55],  # history, law, philosophy, ...
    "social":     [64, 58, 68, 55, 69],  # psych, econ, ..., US foreign policy
}

HUMAN_EXPERT = 89.8  # estimated expert accuracy (%)

# Gap analysis: where is the model weakest?
for cat, scores in gpt3_results.items():
    avg = np.mean(scores)
    gap = HUMAN_EXPERT - avg  # gap to human expert, in percentage points
    print(f"{cat}: avg={avg:.1f}%, gap to human={gap:.1f} pts")

# stem: avg=31.0%, gap to human=58.8 pts
# humanities: avg=48.2%, gap to human=41.6 pts
# social: avg=62.8%, gap to human=27.0 pts
```
MMLU didn't just benchmark 2021 models. It became the standard tracking metric for the next three years of progress, and nearly every major model release reported its MMLU score. The trajectory from 2021 to 2024 tells the story of modern AI.
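| Year | Model | MMLU score | Change vs. previous row |
|---|---|---|---|
| 2021 | GPT-3 175B | 43.9% | n/a |
| 2022 | Chinchilla 70B | 67.6% | +23.7 pts |
| 2022 | PaLM 540B | 69.3% | +1.7 pts |
| 2023 | GPT-4 | 86.4% | +17.1 pts |
| 2023 | Gemini Ultra | 90.0% | +3.6 pts |
| 2024 | Claude 3.5 Sonnet | 88.7% | n/a |
| 2024 | GPT-4o | 88.7% | n/a |
| n/a | Human expert (estimate) | 89.8% | n/a |
Scores are as reported by each developer and do not all use the same protocol. Gemini Ultra's 90.0%, for example, comes from chain-of-thought prompting with 32 samples rather than plain 5-shot.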
The progression is dramatic. From 44% to 90% in just three years. And the curve isn't smooth — there are two big jumps. The first is Chinchilla in 2022 (+24 points), which showed that training data quantity matters as much as model size. The second is GPT-4 in 2023 (+17 points), which crossed the 85% threshold approaching human expert performance.
Three factors drove the 44% to 90% jump:
1. Better pre-training data. Chinchilla showed that training on more tokens (not just bigger models) dramatically improves knowledge. The "compute-optimal" insight: a 70B model trained on 1.4T tokens beats a 280B model trained on 300B tokens.
2. Instruction tuning. RLHF and instruction-following training didn't just make models more helpful — it made them better at expressing knowledge in the expected format. ChatGPT-style models reliably pick "A"/"B"/"C"/"D" without needing the few-shot format crutch.
3. Chain-of-thought reasoning. Newer models can work through a problem step by step before committing to an answer, either because they are prompted to or because they were trained to. This narrowed the "reasoning gap" that plagued GPT-3 on STEM subjects. GPT-4 scores 80%+ on college physics and abstract algebra, subjects where GPT-3 scored at random.
The early results suggested a rough log-linear relationship between model size and MMLU accuracy: each 10x increase in parameters added a roughly constant number of percentage points. In the 2021 numbers, going from GPT-2 (1.5B) to GPT-3 (175B) is about two factors of ten for about 12 points, or roughly 5 to 7 points per 10x. This held from GPT-2 to GPT-3 but broke down with instruction-tuned models, where training methodology matters more than raw size.
$$\text{accuracy (\%)} \approx \alpha \log_{10} N + \beta$$
Where N is parameter count. For base models, α ≈ 5-7 and β ≈ 20. The fit is approximate and does not carry over to instruction-tuned and RLHF models.
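To see what such a log-linear fit looks like in practice, here is a minimal least-squares sketch. It assumes the form accuracy = α·log10(N) + β. The base-10 logarithm is our assumption, chosen so that α reads directly as percentage points gained per 10x parameters. The function name `fit_log_linear` is ours.

```python
import math

def fit_log_linear(param_counts, accuracies):
    """Least-squares fit of accuracy = alpha * log10(N) + beta."""
    xs = [math.log10(n) for n in param_counts]
    count = len(xs)
    mean_x = sum(xs) / count
    mean_y = sum(accuracies) / count
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, accuracies))
    alpha = sxy / sxx
    beta = mean_y - alpha * mean_x
    return alpha, beta

# Two points quoted in this lesson: GPT-2 (1.5B, 32.4%) and GPT-3 (175B, 43.9%).
alpha, beta = fit_log_linear([1.5e9, 175e9], [32.4, 43.9])
print(f"alpha = {alpha:.2f} points per 10x parameters")
```

With only these two points the slope comes out at about 5.6 points per 10x. The intercept β shifts with the units chosen for N (raw parameter count versus billions), so only α is directly comparable across write-ups.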
The key patterns across model eras: GPT-3-era models hover near 30-40% on STEM but do better on social sciences. GPT-4-era models score 80%+ across most subjects. Frontier models approach 90% everywhere except the hardest reasoning tasks.
By 2023, MMLU contamination became a serious concern. MMLU questions are publicly available online. Any web-crawled training dataset likely includes them. A model that memorized the answers isn't demonstrating knowledge — it's demonstrating memorization.
Researchers use several heuristics to detect contamination: checking if the model can complete question stems verbatim, comparing performance on verbatim vs. paraphrased questions, and looking for suspiciously high scores on obscure subjects. The problem has no clean solution — once a benchmark is public, contamination is inevitable.
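```python
# Simple contamination check
def compute_rouge_l(candidate, reference):
    # ROUGE-L recall: longest common subsequence (in words) / reference length
    cand, ref = candidate.split(), reference.split()
    prev = [0] * (len(ref) + 1)
    for x in cand:
        curr = [0]
        for j, y in enumerate(ref, 1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[-1]))
        prev = curr
    return prev[-1] / len(ref) if ref else 0.0

def check_contamination(model, question_stem):
    # If the model can continue the exact question text, it likely saw it in training.
    # Assumes model.generate(prompt, max_tokens=...) returns the completion as a string.
    prompt = question_stem[:50]  # first 50 chars
    completion = model.generate(prompt, max_tokens=100)
    overlap = compute_rouge_l(completion, question_stem[50:])
    return overlap > 0.8  # high overlap = likely contaminated
```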
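The second heuristic, comparing verbatim and paraphrased questions, can be sketched the same way. The version below treats the two result lists as independent samples and computes a two-proportion z-score. That is a simplification, since each paraphrase is paired with an original, and the function name `paraphrase_gap` is ours.

```python
import math

def paraphrase_gap(verbatim_correct, paraphrased_correct):
    """Compare accuracy on original questions vs. paraphrases of them.

    Both arguments are lists of booleans (True = answered correctly).
    Returns (gap, z): the accuracy drop and a two-proportion z-score.
    A large positive z hints that the model relies on memorised wording.
    """
    n1, n2 = len(verbatim_correct), len(paraphrased_correct)
    hits1, hits2 = sum(verbatim_correct), sum(paraphrased_correct)
    p1, p2 = hits1 / n1, hits2 / n2
    pooled = (hits1 + hits2) / (n1 + n2)
    stderr = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / stderr if stderr > 0 else 0.0
    return p1 - p2, z

# Hypothetical outcome: 90/100 correct verbatim, 70/100 correct when paraphrased.
gap, z = paraphrase_gap([True] * 90 + [False] * 10, [True] * 70 + [False] * 30)
print(f"accuracy drop = {gap:.2f}, z = {z:.2f}")
```

A drop alone is not proof of contamination, since paraphrases can also make a question harder or ambiguous. It is a signal to investigate.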
MMLU's impact goes far beyond being a benchmark. It established the paradigm for how we evaluate large language models, and its limitations spawned an ecosystem of successors. The table also lists older benchmarks that are routinely reported alongside it.
| Benchmark | Year | Relation to MMLU |
|---|---|---|
| MMLU-Pro | 2024 | Successor: 10 choices (A-J), harder questions, decontaminated |
| GPQA | 2023 | Successor: PhD-level questions only, expert-verified, "Google-proof" |
| ARC | 2018 | Predates MMLU: science questions graded by difficulty |
| HellaSwag | 2019 | Predates MMLU: commonsense reasoning via sentence completion |
| HELM | 2022 | Multi-dimensional evaluation (not just accuracy) |
| BIG-Bench | 2022 | 200+ tasks crowd-sourced from researchers |
| TruthfulQA | 2022 | Tests for hallucination and truthfulness |
Subject-level granularity. Breaking evaluation into 57 subjects revealed patterns that aggregate scores hide, an approach many later benchmarks adopted.
Real human exams. Using actual educational materials grounded the benchmark in human-meaningful knowledge. A score on MMLU has a real-world interpretation: "this model could pass a college chemistry exam."
Clean format. Four-way multiple choice with a 25% floor is simple and unambiguous to grade. No subjective grading, no human judges. Scores do still depend on the prompt template, as the subject-name example shows, so reproducing a number means fixing the prompt.
Four choices is too few. With A/B/C/D, a model can sometimes guess correctly by eliminating one or two obviously wrong answers. MMLU-Pro's 10 choices make this much harder.
No reasoning chain required. MMLU only checks the final answer, not the reasoning. A model might get the right answer for the wrong reason (lucky guess) or the wrong answer despite sound reasoning (computational error). Process-level evaluation would be more informative.
Static and contaminated. Once published, the questions can't be changed. And they leaked into training data within months.
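The first limitation above is easy to quantify. The sketch below computes the chance of a correct pick once some wrong options have been ruled out. The function name `guess_accuracy` is ours, and it assumes the remaining options are chosen uniformly at random.

```python
def guess_accuracy(n_choices, n_eliminated=0):
    """Chance of a correct pick after ruling out `n_eliminated` wrong options."""
    remaining = n_choices - n_eliminated
    if n_eliminated < 0 or remaining < 1:
        raise ValueError("must leave at least one option")
    return 1.0 / remaining

# Compare MMLU's four choices with MMLU-Pro's ten.
for n_choices in (4, 10):
    for eliminated in (0, 2):
        acc = guess_accuracy(n_choices, eliminated)
        print(f"{n_choices} choices, {eliminated} eliminated: {acc:.1%}")
```

Ruling out two distractors lifts a guesser from 25% to 50% with four choices, but only from 10% to 12.5% with ten, which is why the larger option set makes lucky guesses much less valuable.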
HELM (Holistic Evaluation): goes beyond accuracy to measure calibration, robustness, fairness, and efficiency.
Chain of Thought: the technique that closed MMLU's STEM gap by enabling step-by-step reasoning.
Self-Consistency: sampling multiple reasoning paths and voting on the answer, improving MMLU accuracy significantly.