CS224N Lecture 11 — Benchmarking and Evaluation

Chapter 0: Why Evaluation?

You train two language models. Both cost millions of dollars. Both produce fluent English. Both can write poems, answer questions, and summarize documents. Which one is better?

This question seems simple until you try to answer it. "Better" at what? Better at answering trivia? Better at writing code? Better at following instructions without producing harmful content? Better at reasoning about physics? The moment you try to pin down "better," you realize you need a measuring stick — and the measuring stick you choose determines what you see.

Here's the uncomfortable truth: evaluation is the bottleneck of progress in NLP. We can train bigger models, invent cleverer architectures, and collect more data. But if our benchmarks don't capture what matters, we optimize for the wrong thing. Goodhart's Law — "when a measure becomes a target, it ceases to be a good measure" — haunts every leaderboard in AI.

The Narrow Benchmark Trap

The simulation below shows the core problem. A model is trained on a narrow benchmark — it learns to ace the test. But when you deploy it on real-world tasks outside the benchmark's distribution, it falls apart. The benchmark score told you the model was excellent. The real world told you it wasn't.

Watch a model ace a narrow benchmark but fail on out-of-distribution tasks. Click each evaluation mode to compare.

This is not hypothetical. GPT-3 scored 43.9% on MMLU when it launched in 2020. By 2023, GPT-4 hit 86.4%. Impressive! But users still found GPT-4 making basic mistakes in real conversations — hallucinating facts, misunderstanding context, failing at multi-step reasoning. The benchmark didn't lie, but it didn't tell the whole story.

The gap between "benchmark score" and "useful in practice" is the central problem of evaluation. This lesson teaches you the tools, frameworks, and pitfalls of measuring LLM quality — so you can evaluate models like a scientist, not a marketer.

A benchmark score is not intelligence — it's a measurement of performance on a specific task distribution. Confusing the two is the single most common mistake in LLM evaluation. A model that scores 90% on MMLU might still fail at your specific use case.

What We'll Cover

We'll build up from simple to complex evaluation strategies:

Task-Specific Benchmarks

GLUE, SuperGLUE, SQuAD — narrow tests for specific capabilities.

↓

Broad Knowledge Tests

MMLU — 57-subject exam spanning all of human knowledge.

↓

Holistic Evaluation

HELM — measure accuracy, calibration, fairness, robustness, toxicity all at once.

↓

LLM-as-Judge

Use one LLM to evaluate another — scalable but biased.

↓

Limitations

Contamination, Goodhart's Law, and what benchmarks can't measure.

Why is a high score on a single benchmark not sufficient evidence that an LLM is "good"?

(A) Because benchmark tests are poorly designed
(B) Because a single benchmark measures performance on a specific task distribution, which may not generalize to real-world use or capture important qualities like safety and fairness
(C) Because benchmarks only test mathematical reasoning

Chapter 1: Task-Specific Benchmarks

Before LLMs could do everything, NLP models were trained for one task at a time: sentiment analysis, textual entailment, question answering. Each task had its own dataset, its own metric, and its own leaderboard. Evaluating a model meant running it on that specific test set and computing the score.

This era gave us some of the most important benchmarks in NLP history. Understanding them matters because (a) they're still used as components of larger evaluations, and (b) their rise and fall teaches us how benchmarks saturate — and what happens when they do.

The GLUE Era (2018)

GLUE (General Language Understanding Evaluation) was created by Wang et al. in 2018 to provide a single number summarizing a model's language understanding. It bundled 9 tasks:

Task	Type	What It Tests
CoLA	Acceptability	Is this sentence grammatical?
SST-2	Sentiment	Is this movie review positive or negative?
MRPC	Paraphrase	Do these two sentences mean the same thing?
STS-B	Similarity	How similar are these two sentences? (0-5 scale)
QQP	Paraphrase	Are these two Quora questions duplicates?
MNLI	Entailment	Does the premise entail, contradict, or stay neutral toward the hypothesis?
QNLI	QA/Entailment	Does the sentence answer the question?
RTE	Entailment	Does the premise entail the hypothesis?
WNLI	Coreference	Does the pronoun refer to the correct entity?

The GLUE score was the average across all 9 tasks. When BERT was published in October 2018, it smashed the GLUE leaderboard, scoring 80.5, a 7.7-point absolute improvement over the previous best of 72.8. Within a year, models surpassed the human baseline (87.1). The benchmark was effectively solved.

SuperGLUE and the Saturation Problem

SuperGLUE (2019) was the harder sequel. It replaced the easier tasks with genuinely difficult ones: reading comprehension (ReCoRD), causal reasoning (COPA), word sense disambiguation (WiC), and multi-sentence reasoning (MultiRC). Human performance was 89.8. Models hit 90+ within two years.

This pattern — create benchmark, models catch up, benchmark saturates, create harder benchmark — repeats throughout NLP history. The simulation below shows this saturation timeline.

Benchmark Saturation Timeline

Watch how quickly models catch up to and surpass human performance on each benchmark. The dashed line is human-level. Click a benchmark to see its saturation curve.

SQuAD (Stanford Question Answering Dataset) followed the same arc. Version 1.1 (2016) asked models to find answer spans in paragraphs. Version 2.0 (2018) added unanswerable questions, forcing models to know what they don't know. On both versions, models surpassed human performance within a couple of years.

What Saturation Teaches Us

Benchmark saturation is not a failure — it's a signal. It means the specific capability that benchmark measures has been largely solved. The failure is continuing to report scores on a saturated benchmark as if they mean something. When every model scores 90+ on GLUE, GLUE can't distinguish between them.

The deeper issue: these benchmarks measure narrow, well-defined tasks. Real language understanding is broad, fuzzy, and context-dependent. A model that can classify sentiment perfectly might still fail at following a complex multi-step instruction. We need something broader.

When a benchmark saturates, it doesn't mean we've solved the problem — it means we've outgrown the measurement. GLUE measured basic language understanding. Models solved it. But "understanding language" is vastly larger than what GLUE tested.

Metrics: How We Score

Each benchmark uses specific metrics. Understanding them is crucial because metric choice changes what you optimize:

Metric	Formula	Used When
Accuracy	Correct / Total	Classification (MMLU, GLUE sentiment)
F1	2 · (P · R) / (P + R)	Span extraction (SQuAD), balances precision/recall
Exact Match	Predicted == Gold	QA where partial credit is meaningless
BLEU	n-gram overlap with reference	Translation, summarization (being phased out)
ROUGE	Recall-based n-gram overlap	Summarization
Perplexity	exp(-avg log prob)	Language modeling (lower = better)

Accuracy is the simplest: how many questions did you get right? For multiple-choice benchmarks like MMLU, this is the standard. F1 balances precision (of your answers, how many are correct?) and recall (of the correct answers, how many did you find?). Exact Match gives zero partial credit — "Barack Obama" is wrong if the gold answer is "Obama."
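
To make the difference between Exact Match and F1 concrete, here is a minimal sketch. It assumes whitespace tokenization and lowercasing only; real SQuAD scoring also strips punctuation and articles, so treat the function names and the normalization as illustrative rather than as the official scorer.

```python
from collections import Counter


def exact_match(predicted: str, gold: str) -> bool:
    """Strict string equality after trimming whitespace and lowercasing."""
    return predicted.strip().lower() == gold.strip().lower()


def token_f1(predicted: str, gold: str) -> float:
    """Token-overlap F1 between a predicted answer and a gold answer."""
    pred_tokens = predicted.lower().split()
    gold_tokens = gold.lower().split()
    # Multiset intersection: a shared token counts at most as often
    # as it appears in both strings.
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)  # share of predicted tokens that are right
    recall = overlap / len(gold_tokens)     # share of gold tokens that were found
    return 2 * precision * recall / (precision + recall)


print(exact_match("Barack Obama", "Obama"))          # no partial credit
print(round(token_f1("Barack Obama", "Obama"), 3))   # partial credit
```

On the "Barack Obama" versus "Obama" pair, Exact Match gives no credit while F1 gives partial credit (precision 1/2, recall 1), which is why span-extraction benchmarks usually report both.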

BLEU and ROUGE measure n-gram overlap between generated text and reference text. BLEU emphasizes precision (are the n-grams in your output actually in the reference?), while ROUGE emphasizes recall (are the n-grams in the reference present in your output?). Both are being replaced by model-based evaluation for open-ended generation.
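
The precision-versus-recall contrast can be seen with a stripped-down n-gram overlap function. This is a sketch of the shared idea only: real BLEU combines several n-gram orders and adds a brevity penalty, and real ROUGE has several variants, none of which are reproduced here. The function names are hypothetical.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return a multiset (Counter) of the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def ngram_precision_recall(candidate: str, reference: str, n: int = 1):
    """Clipped n-gram overlap, reported as (precision, recall)."""
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    overlap = sum((cand & ref).values())
    precision = overlap / max(sum(cand.values()), 1)  # BLEU-style view
    recall = overlap / max(sum(ref.values()), 1)      # ROUGE-style view
    return precision, recall


reference = "the cat sat on the mat"
short_output = "the cat"
precision, recall = ngram_precision_recall(short_output, reference)
print(precision, recall)
```

A very short output can have perfect precision and poor recall, as in this example. That asymmetry is the reason BLEU needs a brevity penalty and the reason summarization work leans on recall-oriented ROUGE.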

Why did the NLP community create SuperGLUE after GLUE?

(A) GLUE had bugs in the test data
(B) Models quickly surpassed human performance on GLUE, so it could no longer distinguish between models; harder tasks were needed to measure continuing progress
(C) SuperGLUE tested more languages

Chapter 2: MMLU

GLUE and SuperGLUE test whether a model understands language. But do models actually know things? Can they pass a college exam in chemistry? An AP History test? A medical licensing exam? MMLU (Massive Multitask Language Understanding), introduced by Hendrycks et al. in 2020, tests exactly this.

MMLU is a 57-subject multiple-choice exam spanning the entire breadth of human academic knowledge. Each question has 4 options, one correct answer, and comes from one of 57 subjects grouped into 4 broad categories: STEM, Humanities, Social Sciences, and Other (professional subjects like law, medicine, accounting).

What Makes MMLU Special

Previous benchmarks tested narrow NLP tasks. MMLU tests knowledge, the kind that requires years of study. Consider these representative examples:

MMLU examples
# Abstract Algebra
Q: Find the degree of the extension Q(√2, √3, √18) over Q.
(A) 0  (B) 4  (C) 2  (D) 6
Answer: (B) 4

# Clinical Knowledge
Q: Which of the following is the body's most abundant electrolyte?
(A) Potassium  (B) Sodium  (C) Calcium  (D) Magnesium
Answer: (B) Sodium

# Moral Scenarios
Q: Is it morally permissible to break a promise to attend a
friend's party in order to help a stranger in an emergency?
(A) Yes  (B) No  (C) It depends  (D) Promises must never be broken
Answer: (A) Yes

Notice the range: abstract algebra requires mathematical reasoning, clinical knowledge requires memorized facts, and moral scenarios require ethical reasoning. A model that does well across all 57 subjects demonstrates broad knowledge, not narrow specialization.

The 57-Subject Landscape

The visualization below shows MMLU performance as a heatmap across all 57 subjects. Switch between models to see how their knowledge profiles differ. Some models are strong in STEM but weak in humanities; others show the reverse pattern. Hover over any cell to see an example question from that subject.

MMLU 57-Subject Heatmap

Each cell is one subject. Color intensity = accuracy. Switch models to compare knowledge profiles. Hover for example questions.

How MMLU Is Evaluated

MMLU uses a few-shot prompting format. The model is shown 5 example questions with answers (the "5-shot" setting), then asked to answer a new question. No fine-tuning, no task-specific training. This tests what the model learned during pretraining alone.

python
# MMLU evaluation: 5-shot prompting
prompt = """The following are multiple choice questions about abstract algebra.

Q: Find all c in Z_3 such that Z_3[x]/(x^2 + c) is a field.
(A) 0 (B) 1 (C) 2 (D) 3
A: (B)

Q: Statement 1: Every group of order p^2 is abelian. Statement 2: ...
(A) True, True (B) False, False ...
A: (A)

[3 more examples...]

Q: [NEW QUESTION]
(A) ... (B) ... (C) ... (D) ...
A:"""

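# Extract the model's answer from the first few generated tokens.
# `model` and `gold_answer` (e.g. "(B)") are placeholders for your own
# inference client and answer key. "(A)" usually spans more than one
# token, so allow a few tokens and parse the letter out.
import re

output = model.generate(prompt, max_tokens=5)
match = re.search(r"\(([ABCD])\)", output)
predicted = f"({match.group(1)})" if match else None
correct = predicted == gold_answer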

The metric is simple: accuracy — percentage of questions answered correctly. Random guessing gives 25% (4 options). The overall MMLU score is the average accuracy across all 57 subjects, weighted equally.
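
As a sketch of the aggregation described above, the snippet below computes per-subject accuracy and then averages the subjects with equal weight. The record format and the sample predictions are hypothetical. Note that some evaluation harnesses instead average over all questions, which gives large subjects more weight, so it is worth checking which convention a reported score uses.

```python
from collections import defaultdict


def mmlu_scores(records):
    """records: iterable of (subject, predicted_letter, gold_letter)."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for subject, predicted, gold in records:
        total[subject] += 1
        correct[subject] += int(predicted == gold)
    per_subject = {s: correct[s] / total[s] for s in total}
    # Macro average: every subject counts equally, regardless of size.
    macro = sum(per_subject.values()) / len(per_subject)
    return per_subject, macro


# Hypothetical predictions, for illustration only.
records = [
    ("abstract_algebra", "B", "B"),
    ("abstract_algebra", "A", "C"),
    ("clinical_knowledge", "B", "B"),
    ("clinical_knowledge", "D", "D"),
    ("clinical_knowledge", "A", "A"),
    ("clinical_knowledge", "C", "C"),
]
per_subject, macro = mmlu_scores(records)
print(per_subject, macro)
```

Here the subject-weighted average is 0.75, while averaging over all six questions would give 5/6. The two conventions diverge whenever subjects differ in size and difficulty.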

The Leaderboard Race

Model	Year	MMLU Score	Jump
GPT-3 (175B)	2020	43.9%	—
Chinchilla (70B)	2022	67.6%	+23.7
GPT-4	2023	86.4%	+18.8
Claude 3 Opus	2024	86.8%	+0.4
GPT-4o	2024	88.7%	+1.9

Notice the pattern: early gains were massive (GPT-3 to Chinchilla: +24 points). Recent gains are tiny (Claude 3 to GPT-4o: +2 points). MMLU is approaching saturation — just like GLUE did. When every frontier model scores 85-90%, the benchmark can barely distinguish them.

This is why the community has moved to harder variants: MMLU-Pro (10 choices instead of 4, harder questions) and GPQA (graduate-level questions that even domain experts find hard).

MMLU's power is its breadth: 57 subjects means you can't game it by being good at one thing. Its weakness is that all questions are multiple-choice. A model might "know" the answer without truly understanding it — process of elimination and surface-level pattern matching can inflate scores. Open-ended evaluation addresses this.

What does MMLU primarily measure that earlier benchmarks like GLUE did not?

(A) Speed of text generation
(B) Ability to follow instructions
(C) Breadth of factual and reasoning knowledge across 57 academic subjects, tested via few-shot prompting without task-specific fine-tuning

Chapter 3: HELM

MMLU tells you whether a model knows facts. But is the model fair? Is it calibrated (when it says 80% sure, is it right 80% of the time)? Is it robust to rephrased questions? Does it generate toxic content?

HELM (Holistic Evaluation of Language Models), introduced by Liang et al. at Stanford in 2022, argues that accuracy alone is insufficient. A model that scores 90% on MMLU but produces racist outputs 5% of the time is not a "good" model. HELM evaluates models across seven dimensions simultaneously, giving you a complete picture rather than a single number.

The Seven Metrics

HELM measures every model on every scenario across these seven axes:

Metric	What It Measures	Why It Matters
Accuracy	Correct answers	The baseline: does the model know the right answer?
Calibration	Confidence matches correctness	A model that says "90% sure" should be right 90% of the time
Robustness	Stable under perturbation	Rephrasing a question shouldn't change the answer
Fairness	Equal performance across groups	The model shouldn't be worse at answering questions about minorities
Bias	Stereotypical associations	The model shouldn't associate certain jobs with certain genders/races
Toxicity	Harmful content generation	The model should refuse to generate slurs, threats, or hate speech
Efficiency	Compute cost per query	A 10% accuracy gain isn't worth 100x more compute for most applications

The Radar Chart: Seeing the Full Picture

HELM's key innovation is presenting results as a multi-dimensional profile rather than a single score. The radar chart below shows how different models trade off between these metrics. A model that excels in accuracy may be poorly calibrated. A model with low toxicity may sacrifice some accuracy.

HELM Radar Chart

Compare models across all 7 HELM metrics. Each axis goes from 0 (center) to 100 (edge). The ideal model fills the entire polygon.

The radar chart reveals something that single-number benchmarks hide: no model dominates on every axis. GPT-4 may lead in accuracy but trail in efficiency. An open-source model may have higher toxicity but better calibration. This is why HELM refuses to compute a single "HELM score" — it would destroy the very information the benchmark is designed to reveal.

Calibration: The Underrated Metric

Calibration deserves special attention because it's almost never reported on leaderboards, yet it's critical for real applications. A calibrated model is one whose confidence scores match reality. If it assigns 70% probability to answer A, then A should be correct about 70% of the time across many such predictions.

Why does this matter? Because downstream systems rely on model confidence to make decisions. A medical assistant that says "I'm 95% sure this is benign" had better be right 95% of the time. An uncalibrated model that says "95% sure" when it's actually right 60% of the time is dangerous.

HELM measures calibration using Expected Calibration Error (ECE):

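$$\mathrm{ECE} = \sum_{b=1}^{B} \frac{n_b}{N} \, \left| \mathrm{acc}(b) - \mathrm{conf}(b) \right|$$

Divide the N predictions into B bins by confidence. For bin b, n_b is the number of predictions it contains, conf(b) is their average confidence, and acc(b) is their actual accuracy. ECE is the average of the gaps |acc(b) − conf(b)|, weighted by bin size. Perfect calibration gives ECE = 0.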
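
The ECE computation can be sketched in a few lines. This version assumes equal-width bins over [0, 1] and ten bins by default; both are common choices but are assumptions here, and the function name is illustrative rather than HELM's own code.

```python
def expected_calibration_error(confidences, correct, num_bins=10):
    """ECE with equal-width confidence bins over [0, 1]."""
    n = len(confidences)
    bins = [[] for _ in range(num_bins)]
    for conf, is_correct in zip(confidences, correct):
        # A confidence of exactly 1.0 goes into the last bin.
        index = min(int(conf * num_bins), num_bins - 1)
        bins[index].append((conf, is_correct))

    ece = 0.0
    for members in bins:
        if not members:
            continue  # empty bins contribute nothing
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(int(ok) for _, ok in members) / len(members)
        ece += (len(members) / n) * abs(accuracy - avg_conf)
    return ece


# Hypothetical model that always says "95% sure" but is right 60% of the time.
confidences = [0.95] * 10
correct = [True] * 6 + [False] * 4
print(round(expected_calibration_error(confidences, correct), 2))
```

For the overconfident toy model, all predictions fall in one bin whose confidence (0.95) exceeds its accuracy (0.60), so the ECE is the full 0.35 gap.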

Scenarios, Not Just Tasks

HELM tests models on 42 scenarios — combinations of datasets and tasks. Examples include: question answering (NaturalQuestions, TriviaQA), summarization (XSUM, CNN/DailyMail), sentiment (IMDB), toxicity detection, and code generation. Each scenario is evaluated on all 7 metrics.

The total evaluation matrix is enormous: 42 scenarios × 7 metrics × N models. This is HELM's strength and its weakness. Strength: it provides the most comprehensive picture of model capabilities ever assembled. Weakness: it's expensive to run (each evaluation costs significant compute) and hard to summarize (who reads a 42 × 7 table?).

Data Flow: Running a HELM Evaluation

Let's trace how a single HELM evaluation works end to end. You choose a model and a scenario (e.g., NaturalQuestions); HELM then generates prompts, collects outputs, and scores them on all 7 metrics.

python
# Simplified HELM evaluation pipeline. This is pseudocode: `model`,
# `scenarios`, and the scoring helpers are placeholders, not the HELM API.
from collections import defaultdict

all_results = {}
for scenario in scenarios:
    prompts = scenario.generate_prompts(num_examples=1000)
    results = defaultdict(float)  # fresh running sums for each scenario

    for prompt in prompts:
        # Get model output
        output = model.generate(prompt, max_tokens=256)

        # Score on ALL 7 metrics simultaneously
        results['accuracy']    += exact_match(output.text, prompt.gold)
        results['calibration'] += ece_score(output.logprobs, prompt.gold)
        results['robustness']  += perturb_and_compare(model, prompt)
        results['fairness']    += group_disparity(output, prompt.demographics)
        results['bias']        += stereotype_score(output.text)
        results['toxicity']    += perspective_api(output.text)
        results['efficiency']  += measure_cost(prompt, output)

    # Average across all prompts for this scenario
    all_results[scenario.name] = {
        metric: total / len(prompts) for metric, total in results.items()
    }

Notice: robustness is measured by perturbing the input (adding typos, rephrasing, changing names) and checking if the answer changes. Fairness is measured by checking if accuracy differs across demographic groups (e.g., questions about different racial groups). Toxicity uses Google's Perspective API to score generated text for harmful content.
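
The perturb-and-compare idea behind the robustness metric can be sketched as follows. Everything here is a hypothetical stand-in: `add_typo` is one simple perturbation, `answer_fn` is any callable that maps a prompt to an answer, and the toy model exists only to show a brittle case. HELM's actual perturbations are more varied.

```python
import random


def add_typo(text: str, rng: random.Random) -> str:
    """Swap two adjacent characters at a random position."""
    if len(text) < 2:
        return text
    i = rng.randrange(len(text) - 1)
    return text[:i] + text[i + 1] + text[i] + text[i + 2:]


def robustness(answer_fn, prompts, perturb, seed=0):
    """Fraction of prompts whose answer is unchanged after perturbation."""
    rng = random.Random(seed)
    unchanged = sum(
        answer_fn(p) == answer_fn(perturb(p, rng)) for p in prompts
    )
    return unchanged / len(prompts)


# Toy stand-in for a model: it keys on the exact lowercase word "liver".
def brittle_model(prompt: str) -> str:
    return "yes" if "liver" in prompt else "no"


def uppercase(text: str, rng: random.Random) -> str:
    """A deterministic perturbation: change only the casing."""
    return text.upper()


print(robustness(brittle_model, ["the liver", "the heart"], uppercase))
```

The toy model changes its answer on one of the two prompts when only the casing changes, so its robustness under this perturbation is 0.5. A robust model would score close to 1.0 across many perturbation types.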

The cost of a full HELM evaluation is substantial: running one model across all 42 scenarios requires tens of thousands of API calls. This is why HELM is typically run by well-funded research labs, not individual practitioners.

HELM's contribution: evaluation is multi-dimensional. A model's "quality" cannot be captured by a single number. Accuracy, calibration, fairness, robustness, bias, toxicity, and efficiency are independent axes. You must measure all of them.

Why does HELM refuse to compute a single aggregate score?

(A) Because collapsing 7 independent metrics (accuracy, calibration, fairness, robustness, bias, toxicity, efficiency) into one number hides the tradeoffs between them: a model could score high overall while being dangerously uncalibrated or biased
(B) Because some models refused to participate
(C) Because the math for combining scores is too complex

Chapter 4: LLM-as-Judge

Multiple-choice benchmarks have a fundamental limitation: they can only test tasks with clearly defined correct answers. But most of what we want LLMs to do — write essays, explain concepts, have conversations, give advice — has no single correct answer. How do you evaluate open-ended generation?

The gold standard is human evaluation: pay human annotators to rate model outputs on criteria like helpfulness, accuracy, and harmlessness. This is what Anthropic, OpenAI, and Google do internally. But human evaluation is expensive ($10-50 per comparison), slow, noisy (annotators disagree), and doesn't scale (you can't re-run it every time you change a hyperparameter).

LLM-as-Judge is the pragmatic alternative: use a powerful language model (the "judge") to evaluate the outputs of other models. Instead of paying a human to read two answers and pick the better one, you prompt GPT-4 to do it.

How It Works

The basic setup is a pairwise comparison. Given a question and two model responses (A and B), the judge is prompted to evaluate which response is better and explain why:

python
# LLM-as-Judge: pairwise comparison
judge_prompt = """[System] You are a helpful assistant evaluating responses.

[Question] {question}

[Response A] {response_a}

[Response B] {response_b}

Please evaluate which response is better in terms of helpfulness,
accuracy, and clarity. Explain your reasoning, then state your
verdict as: [[A]], [[B]], or [[Tie]].
"""

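import re

# `judge_model`, `question`, `response_a`, and `response_b` are placeholders.
# Fill in the template before sending it to the judge.
judgment = judge_model.generate(
    judge_prompt.format(question=question,
                        response_a=response_a,
                        response_b=response_b)
)
# Parse the verdict: take the last [[A]], [[B]], or [[Tie]] in the judgment,
# since the reasoning that precedes it may mention the labels too.
matches = re.findall(r"\[\[(A|B|Tie)\]\]", judgment)
verdict = matches[-1] if matches else None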

The judge's prompt is carefully designed. Zheng et al. (2023) in the MT-Bench/Chatbot Arena paper showed that prompt design matters enormously: asking for a reasoning chain before the verdict improves agreement with human evaluators from ~65% to ~80%.

Simulated Judge Panel

The simulation below lets you compare human judgments with LLM judge verdicts across different question types. Notice where they agree (factual questions) and where they diverge (subjective or nuanced questions).

Human vs. LLM Judge Comparison

Click a question type to see a sample comparison. Green = agreement between human and LLM judges. Red = disagreement.

Biases of LLM Judges

LLM judges are not neutral. Research has identified several systematic biases:

Position bias: LLM judges tend to prefer whichever response is presented first (position A). Mitigation: run each comparison twice with A/B swapped and require consistent verdicts.

Verbosity bias: Longer responses are rated higher, even when the extra length adds nothing. A 500-word response that rambles is often preferred over a concise 100-word answer that nails the point.

Self-enhancement bias: Models rate their own outputs higher than other models' outputs. GPT-4 as judge tends to prefer GPT-4 responses. Claude as judge tends to prefer Claude responses. This is a serious problem when a company evaluates its own model with its own model.

Style over substance: LLM judges are swayed by formatting (bullet points, headers, bold text) and confident tone, even when the content is wrong. A beautifully formatted wrong answer often beats a plainly written correct one.

Bias	Effect	Mitigation
Position bias	Prefers Response A	Swap A/B and require consistency
Verbosity bias	Prefers longer answers	Normalize for length, or instruct judge to penalize padding
Self-enhancement	Prefers own outputs	Use a different model family as judge
Style bias	Prefers formatting over accuracy	Separate accuracy and presentation scores
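
The swap-and-require-consistency mitigation for position bias can be sketched like this. The `judge` argument is a hypothetical callable that returns "A", "B", or "Tie" for the responses shown in slots A and B; the two toy judges exist only to exercise the logic.

```python
def debiased_verdict(judge, question, response_1, response_2):
    """Judge twice with the responses swapped; accept only consistent wins.

    Returns 1 or 2 for the winning response, or None when the two runs
    disagree or either run is a tie.
    """
    first = judge(question, response_1, response_2)
    second = judge(question, response_2, response_1)
    # Map slot labels back to the underlying responses.
    first_winner = {"A": 1, "B": 2}.get(first)
    second_winner = {"A": 2, "B": 1}.get(second)
    if first_winner is not None and first_winner == second_winner:
        return first_winner  # the same response won in both orders
    return None              # inconsistent or tie: treat as a tie


# Toy judges, for illustration only.
def always_first(question, a, b):
    return "A"  # pure position bias


def prefers_longer(question, a, b):
    return "A" if len(a) > len(b) else "B"


print(debiased_verdict(always_first, "q", "x", "y"))
print(debiased_verdict(prefers_longer, "q", "short", "much longer"))
```

A purely position-biased judge never produces a consistent winner, so all of its verdicts collapse to ties. Note that the swap does nothing against verbosity bias: the length-preferring judge is perfectly consistent, which is why each bias needs its own mitigation.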

MT-Bench: Structured Multi-Turn Evaluation

MT-Bench (Zheng et al., 2023) is a curated set of 80 multi-turn questions across 8 categories: writing, roleplay, extraction, reasoning, math, coding, STEM, and humanities. Each question has two turns — a follow-up that tests the model's ability to maintain context and build on its previous answer.

The judge evaluates each response on a 1-10 scale. The scores are averaged across all 80 questions to produce a single MT-Bench score. This structure makes MT-Bench more reliable than open-ended pairwise comparison because the questions are fixed and the scoring is consistent.

Model	MT-Bench Score	Category Strength
GPT-4	8.99	Coding, Reasoning
Claude 3 Opus	8.95	Writing, Humanities
GPT-3.5-turbo	7.94	Extraction
Llama 3 70B	8.22	STEM, Math

Chatbot Arena: Crowdsourced LLM Evaluation

Chatbot Arena (LMSYS) takes a different approach: let the crowd decide. Users chat with two anonymous models side by side, then vote for the better one. The models are identified only after voting. The results are aggregated using an Elo rating system (the same system used in chess), producing a ranking that reflects real user preferences.

As of 2024, Chatbot Arena has collected over 1 million votes. Its ranking is considered the most reliable indicator of real-world LLM quality because it reflects genuine user preferences on genuine tasks — not synthetic benchmarks. The catch: it's biased toward conversational ability and can't measure specialized capabilities (medical QA, legal reasoning) without specialized users.
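
For intuition about how pairwise votes become a ranking, here is a minimal sketch of the standard Elo update. The K-factor of 32, the starting rating of 1000, and the model names are assumptions chosen for illustration; they are not Chatbot Arena's actual settings.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))


def update_elo(rating_a, rating_b, outcome_a, k=32):
    """outcome_a: 1.0 if A won, 0.0 if B won, 0.5 for a tie."""
    expected_a = expected_score(rating_a, rating_b)
    delta = k * (outcome_a - expected_a)
    # Whatever A gains, B loses, so total rating is conserved.
    return rating_a + delta, rating_b - delta


# Hypothetical votes: (model shown as A, model shown as B, outcome for A).
ratings = {"model_x": 1000.0, "model_y": 1000.0}
votes = [
    ("model_x", "model_y", 1.0),
    ("model_x", "model_y", 1.0),
    ("model_x", "model_y", 0.0),
]
for a, b, outcome in votes:
    ratings[a], ratings[b] = update_elo(ratings[a], ratings[b], outcome)
print(ratings)
```

An upset (a low-rated model beating a high-rated one) moves ratings more than an expected result does. One known property of this online update is that the final ratings depend on the order in which votes arrive.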

LLM-as-Judge trades reliability for scale. Human evaluation is the gold standard but costs $10-50 per comparison. LLM evaluation costs $0.01-0.10 per comparison and agrees with humans ~80% of the time. For most research decisions, 80% agreement at a cost at least 100x lower is a good trade. For safety-critical deployments, it isn't.

What is the most concerning bias in LLM-as-Judge evaluation?

(A) The judge model is slow
(B) The judge requires internet access
(C) Self-enhancement bias: models systematically rate their own outputs higher, so a company evaluating its model with its own model gets inflated scores

Chapter 5: Benchmark Explorer

Now that you understand the individual evaluation frameworks, let's put them all together. The dashboard below lets you compare models across multiple benchmarks, switch between visualization modes, and adjust metric weights to see how rankings change.

This is the core insight of evaluation: rankings are not objective facts — they depend on what you measure and how you weight it. A company that cares about safety will rank models differently than one that cares about raw accuracy. A startup optimizing for cost will favor efficient models even if they're slightly less accurate.

Interactive Dashboard

Use the controls below to explore the evaluation landscape. Switch between bar chart (compare one metric across models), radar chart (compare all metrics for selected models), and scatter plot (see accuracy vs. efficiency tradeoffs). Adjust the metric weight sliders to create your own composite ranking.

Model Evaluation Dashboard

Full evaluation dashboard. Switch views, select models, adjust metric weights to see rankings change in real time.

Metric weight sliders (each defaults to 50): Accuracy, Safety, Efficiency, Calibration.

Play with the weight sliders. Notice how dramatically the ranking changes. When you max out accuracy and zero out everything else, the largest models dominate. When you crank up efficiency, smaller models climb. When you prioritize safety, models from labs with strong RLHF (Anthropic, OpenAI) pull ahead. There is no "best model" — only the best model for your priorities.

Why This Matters in Practice

When a company evaluates which LLM to deploy, they make implicit weighting decisions. A hospital deploying a medical assistant should weight calibration and accuracy far above efficiency — a wrong answer with high confidence could lead to misdiagnosis. A social media platform building a content moderation tool should weight toxicity detection and fairness above raw accuracy — biased moderation is worse than imperfect moderation.

Most LLM leaderboards implicitly weight accuracy at 100% and everything else at 0%. This is why Chatbot Arena rankings, MMLU scores, and HumanEval scores dominate AI discourse — they're easy to understand and easy to rank. But they paint a dangerously incomplete picture.

The scatter plot view (accuracy vs. efficiency) reveals another critical tradeoff: the Pareto frontier. Models on the frontier offer the best accuracy-efficiency tradeoff: you can't improve one without sacrificing the other. Models below the frontier are dominated: another model exists that is at least as accurate and at least as efficient, and strictly better on one of the two. In practice, most deployments should target the Pareto frontier, not the accuracy leader.
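
Finding the Pareto frontier is a short computation. The sketch below assumes each model is a dict with `accuracy` and `efficiency` scores where higher is better; the model names and numbers are hypothetical.

```python
def pareto_frontier(models):
    """Return names of models not dominated on (accuracy, efficiency)."""
    frontier = []
    for m in models:
        dominated = any(
            other["accuracy"] >= m["accuracy"]
            and other["efficiency"] >= m["efficiency"]
            and (other["accuracy"] > m["accuracy"]
                 or other["efficiency"] > m["efficiency"])
            for other in models
        )
        if not dominated:
            frontier.append(m["name"])
    return frontier


# Hypothetical scores on a 0-100 scale, for illustration only.
models = [
    {"name": "large", "accuracy": 90, "efficiency": 30},
    {"name": "medium", "accuracy": 80, "efficiency": 60},
    {"name": "small", "accuracy": 65, "efficiency": 90},
    {"name": "bloated", "accuracy": 75, "efficiency": 40},
]
print(pareto_frontier(models))  # ['large', 'medium', 'small']
```

The "bloated" model drops out because "medium" beats it on both axes. The three survivors are all defensible choices, and picking among them is exactly the weighting decision discussed above.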

python
# Computing composite score with custom weights
def rank_models(models, weights):
    """
    weights: dict with keys 'accuracy', 'safety', 'efficiency', 'calibration'
    Each weight 0-100. Normalized internally.
    """
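    total = sum(weights.values())
    if total == 0:
        # No preference: keep the input order, but return the same
        # (score, name) shape as the normal path below.
        return [(0.0, m['name']) for m in models]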

    normed = {k: v / total for k, v in weights.items()}
    scored = []
    for m in models:
        composite = sum(m[k] * normed[k] for k in normed)
        scored.append((composite, m['name']))

    return sorted(scored, reverse=True)

# Hospital use case: accuracy + calibration dominate
rank_models(models, {'accuracy': 80, 'safety': 60,
                      'efficiency': 10, 'calibration': 90})

The ranking depends on the weights. The weights depend on your use case. A medical chatbot should weight accuracy and calibration highest. A creative writing tool should weight fluency and diversity. A cost-sensitive API should weight efficiency. Anyone who claims "Model X is the best LLM" without specifying criteria is selling something.

Chapter 6: Limitations

Every evaluation method we've discussed has a fundamental flaw. Benchmarks can be gamed. LLM judges are biased. Human evaluation doesn't scale. But the deepest problem is one that threatens the entire benchmarking enterprise: data contamination.

The Contamination Problem

Data contamination occurs when a model's training data includes questions (or answers) from the benchmark it's being evaluated on. The model isn't demonstrating knowledge — it's demonstrating memorization. It's like a student who memorized the answer key to last year's exam: they'll score well, but they haven't learned the material.

This is not hypothetical. LLMs are trained on massive web crawls that include blog posts, forum discussions, and academic papers that contain benchmark questions and answers. When MMLU questions appear in a Reddit discussion in the training data, the model has effectively "seen the test."

The simulation below shows how contamination inflates benchmark scores. Drag the contamination rate slider to see the gap between a model's true ability and its measured score.

Contamination Detector

Adjust the contamination rate to see how benchmark scores diverge from true ability. The gap between the two bars is the "contamination illusion."

Notice: contamination hurts weak models more in relative terms. A weak model with 40% true accuracy and 30% contamination appears to score 58%, a 45% relative inflation. A strong model with 85% true accuracy and 30% contamination appears to score about 90%, a relative inflation of only about 5%. This means contamination can narrow the apparent gap between a mediocre model and a genuinely strong one.
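
One simple model that reproduces these figures is sketched below. It is an assumption, not something the text states: contaminated questions are always answered correctly, and the model's true accuracy applies to the rest. The function names are illustrative.

```python
def measured_accuracy(true_accuracy: float, contamination_rate: float) -> float:
    """Score when contaminated questions are always answered correctly."""
    return contamination_rate + (1 - contamination_rate) * true_accuracy


def relative_inflation(true_accuracy: float, contamination_rate: float) -> float:
    """How much the measured score exceeds the true one, as a fraction."""
    measured = measured_accuracy(true_accuracy, contamination_rate)
    return measured / true_accuracy - 1


for true_acc in (0.40, 0.85):
    measured = measured_accuracy(true_acc, 0.30)
    inflation = relative_inflation(true_acc, 0.30)
    print(f"true={true_acc:.2f} measured={measured:.3f} inflation={inflation:.1%}")
```

Under this model the absolute boost is c · (1 − true accuracy) for contamination rate c, so it shrinks as true accuracy grows. That is why the weak model gains 18 points while the strong model gains only about 4.5.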

Detecting Contamination

Researchers have developed several techniques to detect whether a model's training data is contaminated:

Membership inference: Show the model a benchmark question and measure its perplexity (how surprised it is). If perplexity is abnormally low, the model may have seen the question before. Compare against control questions of similar difficulty that definitely aren't in the training data.

Canary strings: Insert unique, nonsensical strings into benchmark questions before publishing. If a model can complete these strings, it has seen the benchmark data.

Temporal splits: Evaluate on data created after the model's training cutoff. If a model trained on data through December 2023 scores well on questions from March 2024, it genuinely knows the material (unless the questions leaked into fine-tuning data).

Rephrasing: Rephrase benchmark questions in novel ways. If a model scores 90% on the original wording but 60% on semantically identical rephrasings, it memorized the surface form rather than understanding the concept.
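
The perplexity-based membership check can be sketched as a comparison against control questions. The sketch assumes you already have per-token natural-log probabilities from the model for each question; how to obtain them depends on your inference stack and is not shown. The z-score framing and the function names are illustrative choices, not an established standard.

```python
import math
from statistics import mean, stdev


def perplexity(token_logprobs):
    """exp of the negative mean log-probability (natural log)."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def perplexity_z_score(benchmark_logprobs, control_logprob_sets):
    """Control standard deviations by which the benchmark question's
    perplexity falls below the control mean (higher = more suspicious).

    Needs at least two control questions so that stdev is defined.
    """
    control_ppl = [perplexity(lp) for lp in control_logprob_sets]
    gap = mean(control_ppl) - perplexity(benchmark_logprobs)
    return gap / stdev(control_ppl)


# Hypothetical log-probabilities, for illustration only.
controls = [
    [math.log(0.1)] * 3,     # perplexity 10
    [math.log(0.125)] * 3,   # perplexity 8
    [math.log(1 / 12)] * 3,  # perplexity 12
]
suspect = [math.log(0.5)] * 3  # perplexity 2: unusually unsurprised
print(round(perplexity_z_score(suspect, controls), 2))
```

A large positive score says the model finds the benchmark question far less surprising than comparable unseen questions. That is evidence worth investigating, not proof, because easy or formulaic questions also have low perplexity.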

Goodhart's Law in Action

"When a measure becomes a target, it ceases to be a good measure." This is Goodhart's Law, and it's the philosophical core of the evaluation problem. Once MMLU scores are used to market models, labs have incentives to optimize for MMLU specifically — even if it means overfitting to the benchmark distribution or inadvertently contaminating training data.

Real-world examples:

Failure Mode	Example	Consequence
Teaching to the test	Training on MMLU-style questions	High MMLU, poor real-world performance
Contamination	Benchmark Q&A pairs in web crawl	Inflated scores, undetectable without auditing
Metric hacking	Optimizing BLEU instead of fluency	High BLEU, poor translation quality
Distribution shift	Fine-tuning on benchmark distribution	Scores don't transfer to other distributions

A Worked Example: Detecting Contamination

Here's how a researcher might detect contamination in practice. Suppose you suspect a model has seen MMLU questions during training. You take 100 MMLU questions and rephrase each one — same concept, different wording:

python
# Original MMLU question
original = "Which of the following is NOT a function of the liver?"
options = ["Bile production", "Insulin production",
           "Detoxification", "Glycogen storage"]

# Rephrased version (same concept, different surface form)
rephrased = "Which process below is NOT performed by the liver?"
options_r = ["Producing bile", "Synthesizing insulin",
             "Breaking down toxins", "Storing glycogen"]

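# Hypothetical accuracies (in percent) over the 100 original questions
# and their 100 rephrasings.
original_score = 95.0
rephrased_score = 70.0

# A 25-point gap suggests memorization of surface form, not understanding.
# The 15-point threshold is an illustrative choice, not a standard value.
gap = original_score - rephrased_score
if gap > 15:
    print(f"Warning: {gap:.0f}-point gap, possible contamination")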

A genuinely knowledgeable model should score approximately the same on both versions. A contaminated model memorized the exact wording and stumbles when the surface form changes. The gap between original and rephrased scores is a contamination signal.
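With only 100 questions, it is worth asking whether a gap is larger than chance would produce. The sketch below uses a two-proportion z-statistic as a rough check; the function name is an assumption, and treating the two sets as independent samples is a simplification.

```python
import math

def gap_z_score(correct_orig, correct_reph, n):
    """Two-proportion z-statistic for original vs. rephrased accuracy.

    Treats the two sets as independent samples of size n. The questions are
    actually paired, so a paired test (e.g. McNemar's) would be more
    appropriate when per-question results are available.
    Undefined (division by zero) if both accuracies are 0% or both are 100%.
    """
    p1, p2 = correct_orig / n, correct_reph / n
    pooled = (correct_orig + correct_reph) / (2 * n)
    se = math.sqrt(pooled * (1 - pooled) * 2 / n)
    return (p1 - p2) / se

# 95 of 100 originals correct vs. 70 of 100 rephrasings correct.
z = gap_z_score(95, 70, 100)
print(f"z = {z:.2f}")
```

A z-statistic well above 2 indicates a gap that sampling noise alone is unlikely to explain. It does not say why the gap exists: rephrasings that are accidentally harder would produce the same signal.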

What Benchmarks Can't Measure

Some of the most important qualities of a language model are inherently hard to benchmark:

Common sense: The ability to understand that the "it" in "the trophy doesn't fit in the suitcase because it's too big" refers to the trophy, not the suitcase. We have benchmarks for this (Winograd Schema), but they're narrow and gameable.

Creativity: Can the model write a genuinely surprising poem? Generate a novel solution to a problem? No benchmark captures this because creativity is defined by unpredictability.

Consistency: Does the model give the same answer to the same question asked differently? Benchmarks test individual questions, not consistency across them.

Long-horizon reasoning: Can the model plan 20 steps ahead? Most benchmarks test single-turn or short-context abilities.
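Consistency, at least, can be approximated if you collect answers to several paraphrases of each question. The sketch below assumes records of the form (concept_id, answer); both the record format and the normalization are illustrative.

```python
from collections import defaultdict

def consistency_rate(records):
    """Fraction of concepts answered identically across all paraphrases.

    records: iterable of (concept_id, answer) pairs, one per paraphrase.
    Answers are compared after stripping whitespace and lowercasing.
    """
    answers = defaultdict(set)
    for concept_id, answer in records:
        answers[concept_id].add(answer.strip().lower())
    consistent = sum(1 for seen in answers.values() if len(seen) == 1)
    return consistent / len(answers)

# Hypothetical model answers to paraphrased questions.
records = [
    ("liver", "B"), ("liver", "b"), ("liver", "B"),
    ("trophy", "trophy"), ("trophy", "suitcase"),
]
rate = consistency_rate(records)
print(f"consistency = {rate:.2f}")
```

This measures agreement only, not correctness: a model that gives the same wrong answer every time scores perfectly, so the number is best reported alongside accuracy.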

The fundamental tension: benchmarks measure what's measurable, not what matters. The most important qualities of a language model — reliability, common sense, creativity, consistency — are precisely the ones that resist quantification. This is why human evaluation, despite its flaws, remains the final arbiter.

A model scores 92% on MMLU but only 65% on rephrased versions of the same questions. What's the most likely explanation?

The rephrased questions are harder
The model memorized the surface form of benchmark questions from contaminated training data, rather than genuinely understanding the underlying concepts
The model's tokenizer handles rephrasings poorly

Chapter 7: Connections

Evaluation is the connective tissue of NLP research. Every paper needs it, every product depends on it, and every claim about model quality is only as strong as the evaluation behind it. This lesson covered the full evaluation stack — from narrow task benchmarks to holistic multi-metric frameworks to scalable LLM judges.

Key Papers

Paper	Year	Contribution
MMLU (Hendrycks et al.)	2020	57-subject knowledge benchmark. Measures breadth of world knowledge and problem solving at scale.
HELM (Liang et al.)	2022	Holistic 7-metric evaluation. Introduced multi-dimensional profiling of LLM capabilities.
GLUE (Wang et al.)	2018	9-task NLU benchmark. Unified evaluation for language understanding.
SuperGLUE (Wang et al.)	2019	Harder successor to GLUE. Added reading comprehension, causal reasoning, and WSD.
MT-Bench (Zheng et al.)	2023	LLM-as-Judge evaluation framework. Showed LLM judges agree with humans ~80% of the time.
Chatbot Arena (LMSYS)	2023	Crowdsourced pairwise evaluation with Elo ratings. 1M+ votes. Widely treated as the most trusted real-world ranking.
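An agreement figure like the one in the MT-Bench row is, at its simplest, the share of pairwise comparisons where the LLM judge and the human pick the same winner. The sketch below shows that calculation on made-up votes; the label scheme ("A", "B", "tie") is an assumption.

```python
def agreement_rate(judge_votes, human_votes):
    """Share of pairwise comparisons where judge and human pick the same winner."""
    if len(judge_votes) != len(human_votes):
        raise ValueError("vote lists must be the same length")
    matches = sum(j == h for j, h in zip(judge_votes, human_votes))
    return matches / len(judge_votes)

# Hypothetical votes on five response pairs.
judge = ["A", "B", "A", "tie", "B"]
human = ["A", "B", "B", "tie", "B"]

rate = agreement_rate(judge, human)
print(f"agreement = {rate:.0%}")
```

Raw agreement is easy to read but ignores chance: if one model wins almost every comparison, two raters can agree often without either being discerning.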

Connections to Other Lessons

Lesson	Connection
L08: Post-training	RLHF uses human evaluation to train reward models. The reward model IS an evaluation model — it predicts human preferences.
L10: Agents	Agent evaluation is uniquely hard: it must assess tool selection, reasoning quality, and final answer correctness across multi-step trajectories.
L12: ACL Trends	Evaluation methodology is a recurring theme at ACL: what counts as progress, what metrics to report, how to do responsible evaluation.

The Evaluation Maturity Ladder

Level 0: Vibes

"I tried it and it seemed good." No measurement, no comparison. Most blog posts.

↓

Level 1: Single Benchmark

Report MMLU or HumanEval. Easy to game. Most papers.

↓

Level 2: Multi-Benchmark

Report MMLU + GSM8K + HumanEval + TruthfulQA. Harder to game. Good papers.

↓

Level 3: Holistic (HELM)

Accuracy + calibration + fairness + robustness + toxicity. Hard to fake. Great papers.

↓

Level 4: Human + Arena

Real users, blind comparisons, Elo ratings. The gold standard. Frontier labs.
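For readers unfamiliar with Elo ratings, here is the textbook update rule applied to a single blind comparison. The scale factor of 400 and K = 32 are the conventional chess defaults, used here as assumptions; this is not necessarily how any particular arena computes its published ratings.

```python
def expected_score(r_a, r_b):
    """Probability that A beats B under the Elo model."""
    return 1 / (1 + 10 ** ((r_b - r_a) / 400))

def elo_update(r_a, r_b, score_a, k=32):
    """Return updated (r_a, r_b).

    score_a is 1 for an A win, 0 for an A loss, and 0.5 for a tie.
    """
    e_a = expected_score(r_a, r_b)
    delta = k * (score_a - e_a)
    return r_a + delta, r_b - delta

# Two equally rated models; a user prefers model A's response.
new_a, new_b = elo_update(1000, 1000, score_a=1)
print(new_a, new_b)
```

An upset moves ratings more than an expected result, because the update is proportional to the gap between the actual and predicted outcome.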

The Frontier

Evaluation research is evolving rapidly. Key directions include:

Dynamic benchmarks: Benchmarks that are continuously updated with new questions to prevent contamination. Dynabench (Kiela et al., 2021) pioneered this approach: humans adversarially write questions that fool the current best model.

Process evaluation: Instead of just checking if the final answer is right, evaluate the reasoning process. Does the model show its work? Are intermediate steps correct? This matters for math, code, and multi-step reasoning.
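As a toy illustration of process evaluation, the sketch below checks each intermediate step of an arithmetic chain rather than only the final answer. It assumes steps are written in the rigid form "a op b = c"; real reasoning traces are free-form text and need a far more capable verifier.

```python
import operator
import re

OPS = {"+": operator.add, "-": operator.sub,
       "*": operator.mul, "/": operator.truediv}
NUM = r"(-?\d+(?:\.\d+)?)"
STEP = re.compile(rf"^\s*{NUM}\s*([+\-*/])\s*{NUM}\s*=\s*{NUM}\s*$")

def check_steps(steps, tol=1e-9):
    """Return one bool per step: True if 'a op b = c' is arithmetically correct.

    Steps that do not match the pattern, or that divide by zero, are False.
    """
    results = []
    for step in steps:
        m = STEP.match(step)
        if not m or (m[2] == "/" and float(m[3]) == 0):
            results.append(False)
            continue
        a, op, b, c = float(m[1]), m[2], float(m[3]), float(m[4])
        results.append(abs(OPS[op](a, b) - c) <= tol)
    return results

# Hypothetical reasoning trace with an error in the last step.
steps = ["12 * 4 = 48", "48 + 10 = 58", "58 / 2 = 28"]
results = check_steps(steps)
process_score = sum(results) / len(results)
print(results, f"process score = {process_score:.2f}")
```

Step-level checks locate where reasoning went wrong, which a final-answer check cannot do. Note that this sketch does not verify that each step follows from the previous one, only that each is internally correct.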

Capability-specific evals: Targeted benchmarks for specific risks, such as WMDP (hazardous knowledge in biosecurity, cybersecurity, and chemical security), CyberSecEval (cybersecurity), and TruthfulQA (truthfulness). As models get more capable, evaluating specific risks becomes more important than overall benchmarking.

Multi-turn evaluation: Most benchmarks are single-turn (one question, one answer). Real usage is multi-turn (conversations, follow-ups, clarifications). Evaluating multi-turn interactions requires new methodologies.

The Evaluation Decision Tree

When you need to evaluate a model, the right method depends on what you're measuring:

What You're Measuring	Best Method	Cost
Factual knowledge	MMLU (5-shot)	Low (automated)
Reasoning ability	GSM8K, ARC, GPQA	Low (automated)
Code generation	HumanEval, SWE-Bench	Medium (execution needed)
Open-ended quality	LLM-as-Judge / MT-Bench	Medium (API calls)
Real user preference	Chatbot Arena / A/B test	High (human votes)
Safety profile	HELM, red-teaming	High (multi-metric)
Deployment readiness	Custom eval suite + human review	Very high

The key insight: use the cheapest method that answers your question. Don't run a full HELM evaluation when MMLU would suffice. Don't rely on MMLU when you need to know about safety. Match the evaluation to the question.

The evaluation stack mirrors the capability stack: pretraining (L07) is evaluated by perplexity, post-training (L08) by human preference, PEFT (L09) by task performance, and agents (L10) by trajectory success. Each layer needs its own evaluation methodology. Understanding evaluation is understanding what progress looks like.