How do you measure something that can do everything? The science of telling good models from great ones.
You train two language models. Both cost millions of dollars. Both produce fluent English. Both can write poems, answer questions, and summarize documents. Which one is better?
This question seems simple until you try to answer it. "Better" at what? Better at answering trivia? Better at writing code? Better at following instructions without producing harmful content? Better at reasoning about physics? The moment you try to pin down "better," you realize you need a measuring stick — and the measuring stick you choose determines what you see.
Here's the uncomfortable truth: evaluation is the bottleneck of progress in NLP. We can train bigger models, invent cleverer architectures, and collect more data. But if our benchmarks don't capture what matters, we optimize for the wrong thing. Goodhart's Law — "when a measure becomes a target, it ceases to be a good measure" — haunts every leaderboard in AI.
The simulation below shows the core problem. A model is trained on a narrow benchmark — it learns to ace the test. But when you deploy it on real-world tasks outside the benchmark's distribution, it falls apart. The benchmark score told you the model was excellent. The real world told you it wasn't.
Watch a model ace a narrow benchmark but fail on out-of-distribution tasks. Click each evaluation mode to compare.
This is not hypothetical. GPT-3 scored 43.9% on MMLU when it launched in 2020. By 2023, GPT-4 hit 86.4%. Impressive! But users still found GPT-4 making basic mistakes in real conversations — hallucinating facts, misunderstanding context, failing at multi-step reasoning. The benchmark didn't lie, but it didn't tell the whole story.
The gap between "benchmark score" and "useful in practice" is the central problem of evaluation. This lesson teaches you the tools, frameworks, and pitfalls of measuring LLM quality — so you can evaluate models like a scientist, not a marketer.
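We'll build up from simple to complex evaluation strategies: single-task benchmarks, broad knowledge exams, holistic multi-metric evaluation, LLM judges, and finally the ways evaluation itself can fail.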
Before LLMs could do everything, NLP models were trained for one task at a time: sentiment analysis, textual entailment, question answering. Each task had its own dataset, its own metric, and its own leaderboard. Evaluating a model meant running it on that specific test set and computing the score.
This era gave us some of the most important benchmarks in NLP history. Understanding them matters because (a) they're still used as components of larger evaluations, and (b) their rise and fall teaches us how benchmarks saturate — and what happens when they do.
GLUE (General Language Understanding Evaluation) was created by Wang et al. in 2018 to provide a single number summarizing a model's language understanding. It bundled 9 tasks:
| Task | Type | What It Tests |
|---|---|---|
| CoLA | Acceptability | Is this sentence grammatical? |
| SST-2 | Sentiment | Is this movie review positive or negative? |
| MRPC | Paraphrase | Do these two sentences mean the same thing? |
| STS-B | Similarity | How similar are these two sentences? (0-5 scale) |
| QQP | Paraphrase | Are these two Quora questions duplicates? |
| MNLI | Entailment | Does the premise entail the hypothesis, contradict it, or neither (neutral)? |
| QNLI | QA/Entailment | Does the sentence answer the question? |
| RTE | Entailment | Does the premise entail the hypothesis? |
| WNLI | Coreference | Does the pronoun refer to the correct entity? |
The GLUE score was the average across all 9 tasks. When BERT was published in October 2018, it smashed the GLUE leaderboard, scoring 80.5 vs. the previous best of 72.8. Within a year, models surpassed the human baseline (87.1). The benchmark was effectively solved.
SuperGLUE (2019) was the harder sequel. It replaced the easier tasks with genuinely difficult ones: reading comprehension (ReCoRD), causal reasoning (COPA), word sense disambiguation (WiC), and multi-sentence reasoning (MultiRC). Human performance was 89.8. Models hit 90+ within two years.
This pattern — create benchmark, models catch up, benchmark saturates, create harder benchmark — repeats throughout NLP history. The simulation below shows this saturation timeline.
Watch how quickly models catch up to and surpass human performance on each benchmark. The dashed line is human-level. Click a benchmark to see its saturation curve.
SQuAD (Stanford Question Answering Dataset) followed the same arc. Version 1.1 (2016) asked models to find answer spans in paragraphs. Version 2.0 (2018) added unanswerable questions, forcing models to know what they don't know. On both versions, models surpassed human performance within a couple of years.
Benchmark saturation is not a failure — it's a signal. It means the specific capability that benchmark measures has been largely solved. The failure is continuing to report scores on a saturated benchmark as if they mean something. When every model scores 90+ on GLUE, GLUE can't distinguish between them.
The deeper issue: these benchmarks measure narrow, well-defined tasks. Real language understanding is broad, fuzzy, and context-dependent. A model that can classify sentiment perfectly might still fail at following a complex multi-step instruction. We need something broader.
Each benchmark uses specific metrics. Understanding them is crucial because metric choice changes what you optimize:
| Metric | Formula | Used When |
|---|---|---|
| Accuracy | Correct / Total | Classification (MMLU, GLUE sentiment) |
| F1 | 2 · (P · R) / (P + R) | Span extraction (SQuAD), balances precision/recall |
| Exact Match | Predicted == Gold | QA where partial credit is meaningless |
| BLEU | n-gram overlap with reference | Translation, summarization (being phased out) |
| ROUGE | Recall-based n-gram overlap | Summarization |
| Perplexity | exp(-avg log prob) | Language modeling (lower = better) |
Accuracy is the simplest: how many questions did you get right? For multiple-choice benchmarks like MMLU, this is the standard. F1 balances precision (of your answers, how many are correct?) and recall (of the correct answers, how many did you find?). Exact Match gives zero partial credit — "Barack Obama" is wrong if the gold answer is "Obama."
BLEU and ROUGE measure n-gram overlap between generated text and reference text. BLEU emphasizes precision (are the n-grams in your output actually in the reference?), while ROUGE emphasizes recall (are the n-grams in the reference present in your output?). Both are being replaced by model-based evaluation for open-ended generation.
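To make the difference between Exact Match and F1 concrete, here is a minimal sketch of both in the SQuAD style. The lowercasing and whitespace tokenization are simplifying assumptions (real evaluation scripts typically also strip punctuation and articles), and the function names are illustrative rather than taken from any official scorer.

```python
from collections import Counter


def exact_match(prediction: str, gold: str) -> bool:
    """True only if the normalized strings are identical (no partial credit)."""
    return prediction.strip().lower() == gold.strip().lower()


def token_f1(prediction: str, gold: str) -> float:
    """Token-level F1: harmonic mean of precision and recall over shared tokens."""
    pred_tokens = prediction.lower().split()
    gold_tokens = gold.lower().split()
    if not pred_tokens or not gold_tokens:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred_tokens == gold_tokens)
    # Multiset intersection counts each shared token at most as often as it
    # appears in both strings.
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)  # share of predicted tokens that are right
    recall = overlap / len(gold_tokens)     # share of gold tokens that were found
    return 2 * precision * recall / (precision + recall)


em = exact_match("Barack Obama", "Obama")
f1 = token_f1("Barack Obama", "Obama")
print(em, round(f1, 3))
```

For the "Barack Obama" vs. "Obama" case, Exact Match is `False` while F1 is about 0.667 (precision 1/2, recall 1/1), which is the partial credit that Exact Match refuses to give.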
GLUE and SuperGLUE test whether a model understands language. But do models actually know things? Can they pass a college exam in chemistry? An AP History test? A medical licensing exam? MMLU (Massive Multitask Language Understanding), introduced by Hendrycks et al. in 2020, tests exactly this.
MMLU is a 57-subject multiple-choice exam spanning the entire breadth of human academic knowledge. Each question has 4 options, one correct answer, and comes from one of 57 subjects grouped into 4 broad categories: STEM, Humanities, Social Sciences, and Other (professional subjects like law, medicine, accounting).
Previous benchmarks tested narrow NLP tasks. MMLU tests knowledge — the kind that requires years of study. Consider these real examples:
```text
MMLU examples

# Abstract Algebra
Q: Find the degree of the extension Q(√2, √3, √18) over Q.
(A) 0 (B) 4 (C) 2 (D) 6
Answer: (B) 4

# Clinical Knowledge
Q: Which of the following is the body's most abundant electrolyte?
(A) Potassium (B) Sodium (C) Calcium (D) Magnesium
Answer: (B) Sodium

# Moral Scenarios
Q: Is it morally permissible to break a promise to attend a friend's party
   in order to help a stranger in an emergency?
(A) Yes (B) No (C) It depends (D) Promises must never be broken
Answer: (A) Yes
```
Notice the range: abstract algebra requires mathematical reasoning, clinical knowledge requires memorized facts, and moral scenarios require ethical reasoning. A model that does well across all 57 subjects demonstrates broad knowledge, not narrow specialization.
The visualization below shows MMLU performance as a heatmap across all 57 subjects. Switch between models to see how their knowledge profiles differ. Some models are strong in STEM but weak in humanities; others show the reverse pattern. Hover over any cell to see an example question from that subject.
Each cell is one subject. Color intensity = accuracy. Switch models to compare knowledge profiles. Hover for example questions.
MMLU uses a few-shot prompting format. The model is shown 5 example questions with answers (the "5-shot" setting), then asked to answer a new question. No fine-tuning, no task-specific training. This tests what the model learned during pretraining alone.
```python
# MMLU evaluation: 5-shot prompting
# `model` and `gold_answer` are placeholders for your own model wrapper and label.
import re

prompt = """The following are multiple choice questions about abstract algebra.

Q: Find all c in Z_3 such that Z_3[x]/(x^2 + c) is a field.
(A) 0 (B) 1 (C) 2 (D) 3
A: (B)

Q: Statement 1: Every group of order p^2 is abelian. Statement 2: ...
(A) True, True (B) False, False ...
A: (A)

[3 more examples...]

Q: [NEW QUESTION]
(A) ... (B) ... (C) ... (D) ...
A:"""

# Generate a few tokens: an answer like " (B)" usually spans more than one
# token, so max_tokens=1 would cut it off.
output = model.generate(prompt, max_tokens=5)

# Extract the first parenthesized answer letter from the generated text
match = re.search(r"\(([ABCD])\)", output)
predicted = f"({match.group(1)})" if match else None  # "(A)", "(B)", "(C)", or "(D)"
correct = predicted == gold_answer
```
The metric is simple: accuracy — percentage of questions answered correctly. Random guessing gives 25% (4 options). The overall MMLU score is the average accuracy across all 57 subjects, weighted equally.
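The scoring rule described above (per-subject accuracy, then an equally weighted average over subjects) can be sketched as follows. The record format and the toy data are assumptions made for illustration; they are not the layout of the official MMLU files.

```python
from collections import defaultdict


def mmlu_score(records):
    """records: iterable of (subject, predicted_letter, gold_letter) tuples.

    Returns (per-subject accuracy dict, equally weighted average over subjects).
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for subject, predicted, gold in records:
        total[subject] += 1
        correct[subject] += int(predicted == gold)
    per_subject = {s: correct[s] / total[s] for s in total}
    overall = sum(per_subject.values()) / len(per_subject)
    return per_subject, overall


# Hypothetical toy results: 2 algebra questions, 4 clinical questions.
toy_records = [
    ("abstract_algebra", "B", "B"),
    ("abstract_algebra", "A", "B"),
    ("clinical_knowledge", "B", "B"),
    ("clinical_knowledge", "B", "B"),
    ("clinical_knowledge", "C", "B"),
    ("clinical_knowledge", "B", "B"),
]
per_subject, overall = mmlu_score(toy_records)
print(per_subject, overall)
```

Here the subject-averaged score is 0.625 (the mean of 0.5 and 0.75), while pooling all six questions would give 4/6. The two conventions diverge whenever subjects have different numbers of questions, so it is worth stating which one a reported score uses.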
| Model | Year | MMLU Score | Gain vs. Previous Row (points) |
|---|---|---|---|
| GPT-3 (175B) | 2020 | 43.9% | — |
| Chinchilla (70B) | 2022 | 67.6% | +23.7 |
| GPT-4 | 2023 | 86.4% | +18.8 |
| Claude 3 Opus | 2024 | 86.8% | +0.4 |
| GPT-4o | 2024 | 88.7% | +1.9 |
Notice the pattern: early gains were massive (GPT-3 to Chinchilla: +24 points). Recent gains are tiny (Claude 3 to GPT-4o: +2 points). MMLU is approaching saturation — just like GLUE did. When every frontier model scores 85-90%, the benchmark can barely distinguish them.
This is why the community has moved to harder variants: MMLU-Pro (10 choices instead of 4, harder questions) and GPQA (graduate-level questions that even domain experts find hard).
MMLU tells you whether a model knows facts. But is the model fair? Is it calibrated (when it says 80% sure, is it right 80% of the time)? Is it robust to rephrased questions? Does it generate toxic content?
HELM (Holistic Evaluation of Language Models), introduced by Liang et al. at Stanford in 2022, argues that accuracy alone is insufficient. A model that scores 90% on MMLU but produces racist outputs 5% of the time is not a "good" model. HELM evaluates models across seven dimensions simultaneously, giving you a complete picture rather than a single number.
HELM measures every model on every scenario across these seven axes:
| Metric | What It Measures | Why It Matters |
|---|---|---|
| Accuracy | Correct answers | The baseline: does the model know the right answer? |
| Calibration | Confidence matches correctness | A model that says "90% sure" should be right 90% of the time |
| Robustness | Stable under perturbation | Rephrasing a question shouldn't change the answer |
| Fairness | Equal performance across groups | The model shouldn't be worse at answering questions about minorities |
| Bias | Stereotypical associations | The model shouldn't associate certain jobs with certain genders/races |
| Toxicity | Harmful content generation | The model should refuse to generate slurs, threats, or hate speech |
| Efficiency | Compute cost per query | A 10% accuracy gain isn't worth 100x more compute for most applications |
HELM's key innovation is presenting results as a multi-dimensional profile rather than a single score. The radar chart below shows how different models trade off between these metrics. A model that excels in accuracy may be poorly calibrated. A model with low toxicity may sacrifice some accuracy.
Compare models across all 7 HELM metrics. Each axis goes from 0 (center) to 100 (edge). The ideal model fills the entire polygon.
The radar chart reveals something that single-number benchmarks hide: no model dominates on every axis. GPT-4 may lead in accuracy but trail in efficiency. An open-source model may have higher toxicity but better calibration. This is why HELM refuses to compute a single "HELM score" — it would destroy the very information the benchmark is designed to reveal.
Calibration deserves special attention because it's almost never reported on leaderboards, yet it's critical for real applications. A calibrated model is one whose confidence scores match reality. If it assigns 70% probability to answer A, then A should be correct about 70% of the time across many such predictions.
Why does this matter? Because downstream systems rely on model confidence to make decisions. A medical assistant that says "I'm 95% sure this is benign" had better be right 95% of the time. An uncalibrated model that says "95% sure" when it's actually right 60% of the time is dangerous.
HELM measures calibration using Expected Calibration Error (ECE):
Divide predictions into B bins by confidence. For each bin, compute the gap between average confidence and actual accuracy. ECE is the weighted average of these gaps. Perfect calibration gives ECE = 0.
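In symbols, ECE = Σ_b (n_b / N) · |acc(b) − conf(b)|, where n_b is the number of predictions in bin b and N is the total. The sketch below implements that binning procedure directly. Equal-width bins and the default of 10 bins are assumptions; HELM's own implementation may bin differently.

```python
def expected_calibration_error(confidences, correct, num_bins=10):
    """ECE with equal-width confidence bins.

    confidences: predicted probabilities in [0, 1] for the chosen answers.
    correct: booleans, True where the chosen answer was right.
    """
    n = len(confidences)
    bins = [[] for _ in range(num_bins)]
    for conf, ok in zip(confidences, correct):
        # Confidence 1.0 would index past the end, so clamp to the last bin.
        idx = min(int(conf * num_bins), num_bins - 1)
        bins[idx].append((conf, ok))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        # Weight each bin's gap by the share of predictions it holds.
        ece += (len(members) / n) * abs(avg_conf - accuracy)
    return ece


# Well calibrated: says 90%, right 9 times out of 10.
calibrated = expected_calibration_error([0.9] * 10, [True] * 9 + [False])
# Overconfident: says 95%, right 6 times out of 10.
overconfident = expected_calibration_error([0.95] * 10, [True] * 6 + [False] * 4)
print(round(calibrated, 3), round(overconfident, 3))
```

The first toy model has an ECE of essentially 0, the second an ECE of 0.35, matching the 95%-vs-60% gap in the medical assistant example.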
HELM tests models on 42 scenarios — combinations of datasets and tasks. Examples include: question answering (NaturalQuestions, TriviaQA), summarization (XSUM, CNN/DailyMail), sentiment (IMDB), toxicity detection, and code generation. Each scenario is evaluated on all 7 metrics.
The total evaluation matrix is enormous: 42 scenarios × 7 metrics × N models. This is HELM's strength and its weakness. Strength: it provides the most comprehensive picture of model capabilities ever assembled. Weakness: it's expensive to run (each evaluation costs significant compute) and hard to summarize (who reads a 42 × 7 table?).
Let's trace how a single HELM evaluation works end to end. You choose a model, a scenario (e.g., NaturalQuestions), and HELM generates prompts, collects outputs, and scores them on all 7 metrics.
```python
# Simplified HELM evaluation pipeline.
# `scenarios`, `model`, and the scoring helpers are placeholders, not HELM's real API.
from collections import defaultdict

all_results = []
for scenario in scenarios:
    results = defaultdict(float)  # fresh accumulators for each scenario
    prompts = scenario.generate_prompts(num_examples=1000)
    for prompt in prompts:
        # Get model output
        output = model.generate(prompt, max_tokens=256)

        # Score on ALL 7 metrics simultaneously
        results['accuracy'] += exact_match(output, prompt.gold)
        results['calibration'] += ece_score(output.logprobs, prompt.gold)
        results['robustness'] += perturb_and_compare(model, prompt)
        results['fairness'] += group_disparity(output, prompt.demographics)
        results['bias'] += stereotype_score(output)
        results['toxicity'] += perspective_api(output.text)
        results['efficiency'] += measure_cost(prompt, output)

    # Average across all prompts for this scenario
    for metric in results:
        results[metric] /= len(prompts)
    all_results.append((scenario, dict(results)))
```
Notice: robustness is measured by perturbing the input (adding typos, rephrasing, changing names) and checking if the answer changes. Fairness is measured by checking if accuracy differs across demographic groups (e.g., questions about different racial groups). Toxicity uses Google's Perspective API to score generated text for harmful content.
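A perturbation check of this kind can be sketched in a few lines. Everything here is a toy assumption: the only perturbation is a single adjacent-character swap (a typo), and `brittle_model` and `constant_model` are hypothetical stand-ins for a real model call, not anything from HELM.

```python
import random


def add_typo(text: str, rng: random.Random) -> str:
    """Swap two adjacent characters at a random position."""
    if len(text) < 2:
        return text
    i = rng.randrange(len(text) - 1)
    chars = list(text)
    chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)


def robustness(answer_fn, prompts, perturbations_per_prompt=5, seed=0):
    """Fraction of perturbed prompts whose answer matches the unperturbed answer."""
    rng = random.Random(seed)
    stable = 0
    trials = 0
    for prompt in prompts:
        baseline = answer_fn(prompt)
        for _ in range(perturbations_per_prompt):
            stable += int(answer_fn(add_typo(prompt, rng)) == baseline)
            trials += 1
    return stable / trials if trials else 1.0


def brittle_model(prompt: str) -> str:
    """Toy model that depends on the exact surface string 'liver'."""
    return "yes" if "liver" in prompt else "no"


def constant_model(prompt: str) -> str:
    """Toy model whose answer never changes."""
    return "yes"


prompts = ["Does the liver produce bile?"]
print(robustness(constant_model, prompts), robustness(brittle_model, prompts))
```

The constant model scores 1.0 by construction; the brittle model's score drops whenever a typo lands inside the word it keys on. A realistic harness would add the other perturbations mentioned above (rephrasing, name changes), which need more than string manipulation.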
The cost of a full HELM evaluation is substantial: running one model across all 42 scenarios requires tens of thousands of API calls. This is why HELM is typically run by well-funded research labs, not individual practitioners.
Multiple-choice benchmarks have a fundamental limitation: they can only test tasks with clearly defined correct answers. But most of what we want LLMs to do — write essays, explain concepts, have conversations, give advice — has no single correct answer. How do you evaluate open-ended generation?
The gold standard is human evaluation: pay human annotators to rate model outputs on criteria like helpfulness, accuracy, and harmlessness. This is what Anthropic, OpenAI, and Google do internally. But human evaluation is slow, expensive ($10-50 per comparison), noisy (annotators disagree), and doesn't scale (you can't re-run it every time you change a hyperparameter).
LLM-as-Judge is the pragmatic alternative: use a powerful language model (the "judge") to evaluate the outputs of other models. Instead of paying a human to read two answers and pick the better one, you prompt GPT-4 to do it.
The basic setup is a pairwise comparison. Given a question and two model responses (A and B), the judge is prompted to evaluate which response is better and explain why:
```python
# LLM-as-Judge: pairwise comparison
# `judge_model`, `question`, `response_a`, and `response_b` are placeholders.
judge_prompt = """[System]
You are a helpful assistant evaluating responses.

[Question]
{question}

[Response A]
{response_a}

[Response B]
{response_b}

Please evaluate which response is better in terms of helpfulness,
accuracy, and clarity. Explain your reasoning, then state your
verdict as: [[A]], [[B]], or [[Tie]].
"""

# Fill in the template before sending it to the judge
verdict = judge_model.generate(
    judge_prompt.format(question=question, response_a=response_a, response_b=response_b)
)
# Parse verdict to extract [[A]], [[B]], or [[Tie]]
```
The judge's prompt is carefully designed. Zheng et al. (2023) in the MT-Bench/Chatbot Arena paper showed that prompt design matters enormously: asking for a reasoning chain before the verdict improves agreement with human evaluators from ~65% to ~80%.
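Because the judge is asked to reason first and give its verdict last, the parsing step has to find the `[[A]]`, `[[B]]`, or `[[Tie]]` tag inside free text. A minimal sketch follows; taking the last tag (in case the reasoning mentions one earlier) is a design assumption, and the sample outputs are invented for illustration.

```python
import re

VERDICT_PATTERN = re.compile(r"\[\[(A|B|Tie)\]\]")


def parse_verdict(judge_output: str):
    """Return 'A', 'B', or 'Tie' from the last [[...]] tag, or None if absent."""
    matches = VERDICT_PATTERN.findall(judge_output)
    return matches[-1] if matches else None


sample = (
    "Response A is accurate but terse. Response B covers the edge cases "
    "and explains the reasoning more clearly. Verdict: [[B]]"
)
print(parse_verdict(sample))
print(parse_verdict("I cannot decide."))
```

Returning `None` for unparseable output, rather than guessing, lets the caller retry or discard the comparison instead of silently counting it as a tie.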
The simulation below lets you compare human judgments with LLM judge verdicts across different question types. Notice where they agree (factual questions) and where they diverge (subjective or nuanced questions).
Click a question type to see a sample comparison. Green = agreement between human and LLM judges. Red = disagreement.
LLM judges are not neutral. Research has identified several systematic biases:
Position bias: LLM judges tend to prefer whichever response is presented first (position A). Mitigation: run each comparison twice with A/B swapped and require consistent verdicts.
Verbosity bias: Longer responses are rated higher, even when the extra length adds nothing. A 500-word response that rambles is often preferred over a concise 100-word answer that nails the point.
Self-enhancement bias: Models rate their own outputs higher than other models' outputs. GPT-4 as judge tends to prefer GPT-4 responses. Claude as judge tends to prefer Claude responses. This is a serious problem when a company evaluates its own model with its own model.
Style over substance: LLM judges are swayed by formatting (bullet points, headers, bold text) and confident tone, even when the content is wrong. A beautifully formatted wrong answer often beats a plainly written correct one.
| Bias | Effect | Mitigation |
|---|---|---|
| Position bias | Prefers the first-listed response (A) | Swap A/B and require consistency |
| Verbosity bias | Prefers longer answers | Normalize for length, or instruct judge to penalize padding |
| Self-enhancement | Prefers own outputs | Use a different model family as judge |
| Style bias | Prefers formatting over accuracy | Separate accuracy and presentation scores |
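The position-bias mitigation (judge twice with the order swapped, and require consistent verdicts) might look like the sketch below. The `judge` callable and its `'A'`/`'B'`/`'Tie'` return convention are assumptions, and treating inconsistent verdicts as a tie is one conservative choice among several.

```python
def debiased_verdict(judge, question, response_1, response_2):
    """Judge twice with positions swapped; keep the verdict only if consistent.

    `judge(question, first, second)` returns 'A' (first wins), 'B' (second wins),
    or 'Tie'. The result is 'response_1', 'response_2', or 'tie'.
    """
    first_pass = judge(question, response_1, response_2)
    swapped_pass = judge(question, response_2, response_1)
    # Translate the swapped verdict back into the original labelling.
    unswap = {"A": "B", "B": "A", "Tie": "Tie"}
    second_pass = unswap[swapped_pass]
    if first_pass != second_pass:
        return "tie"  # the verdict flipped with position, so do not trust it
    return {"A": "response_1", "B": "response_2", "Tie": "tie"}[first_pass]


def position_biased_judge(question, first, second):
    """Toy judge that always prefers whatever is shown first."""
    return "A"


def length_judge(question, first, second):
    """Toy judge that prefers the longer response regardless of position."""
    if len(first) == len(second):
        return "Tie"
    return "A" if len(first) > len(second) else "B"


q = "What is 2 + 2?"
print(debiased_verdict(position_biased_judge, q, "4", "Four, because 2 + 2 = 4."))
print(debiased_verdict(length_judge, q, "4", "Four, because 2 + 2 = 4."))
```

The purely position-biased judge is neutralized (its two verdicts contradict each other, so the result is a tie), while the length-preferring judge passes the check with a consistent verdict. The swap test addresses position bias only; verbosity bias needs its own mitigation.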
MT-Bench (Zheng et al., 2023) is a curated set of 80 multi-turn questions across 8 categories: writing, roleplay, extraction, reasoning, math, coding, STEM, and humanities. Each question has two turns — a follow-up that tests the model's ability to maintain context and build on its previous answer.
The judge evaluates each response on a 1-10 scale. The scores are averaged across all 80 questions to produce a single MT-Bench score. This structure makes MT-Bench more reliable than open-ended pairwise comparison because the questions are fixed and the scoring is consistent.
| Model | MT-Bench Score | Category Strength |
|---|---|---|
| GPT-4 | 8.99 | Coding, Reasoning |
| Claude 3 Opus | 8.95 | Writing, Humanities |
| Llama 3 70B | 8.22 | STEM, Math |
| GPT-3.5-turbo | 7.94 | Extraction |
Chatbot Arena (LMSYS) takes a different approach: let the crowd decide. Users chat with two anonymous models side by side, then vote for the better one. The models are identified only after voting. The results are aggregated using an Elo rating system (the same system used in chess), producing a ranking that reflects real user preferences.
As of 2024, Chatbot Arena has collected over 1 million votes. Its ranking is considered the most reliable indicator of real-world LLM quality because it reflects genuine user preferences on genuine tasks — not synthetic benchmarks. The catch: it's biased toward conversational ability and can't measure specialized capabilities (medical QA, legal reasoning) without specialized users.
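For readers who have not seen Elo before, here is a minimal sketch of how pairwise votes turn into ratings. The starting rating of 1000, the K-factor of 32, and the model names are illustrative assumptions, not the Arena's actual settings.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))


def update_elo(rating_a, rating_b, outcome_a, k=32.0):
    """outcome_a: 1.0 if A wins, 0.0 if B wins, 0.5 for a tie."""
    delta = k * (outcome_a - expected_score(rating_a, rating_b))
    # Whatever A gains, B loses, so the total rating is conserved.
    return rating_a + delta, rating_b - delta


def rate(votes, initial=1000.0):
    """votes: iterable of (model_a, model_b, outcome_a) in chronological order."""
    ratings = {}
    for a, b, outcome_a in votes:
        ra, rb = ratings.get(a, initial), ratings.get(b, initial)
        ratings[a], ratings[b] = update_elo(ra, rb, outcome_a)
    return ratings


votes = [
    ("model_x", "model_y", 1.0),  # x beats y
    ("model_y", "model_z", 0.5),  # tie
    ("model_x", "model_z", 1.0),  # x beats z
]
ratings = rate(votes)
for name, rating in sorted(ratings.items(), key=lambda kv: -kv[1]):
    print(name, round(rating, 1))
```

An upset (a low-rated model beating a high-rated one) moves ratings more than an expected win, which is what lets the ranking converge from noisy individual votes. Online Elo depends on the order in which votes arrive, which is one reason a production leaderboard may prefer a batch-fitted model.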
Now that you understand the individual evaluation frameworks, let's put them all together. The dashboard below lets you compare models across multiple benchmarks, switch between visualization modes, and adjust metric weights to see how rankings change.
This is the core insight of evaluation: rankings are not objective facts — they depend on what you measure and how you weight it. A company that cares about safety will rank models differently than one that cares about raw accuracy. A startup optimizing for cost will favor efficient models even if they're slightly less accurate.
Use the controls below to explore the evaluation landscape. Switch between bar chart (compare one metric across models), radar chart (compare all metrics for selected models), and scatter plot (see accuracy vs. efficiency tradeoffs). Adjust the metric weight sliders to create your own composite ranking.
Full evaluation dashboard. Switch views, select models, adjust metric weights to see rankings change in real time.
Play with the weight sliders. Notice how dramatically the ranking changes. When you max out accuracy and zero out everything else, the largest models dominate. When you crank up efficiency, smaller models climb. When you prioritize safety, models from labs with strong RLHF (Anthropic, OpenAI) pull ahead. There is no "best model" — only the best model for your priorities.
When a company evaluates which LLM to deploy, they make implicit weighting decisions. A hospital deploying a medical assistant should weight calibration and accuracy far above efficiency — a wrong answer with high confidence could lead to misdiagnosis. A social media platform building a content moderation tool should weight toxicity detection and fairness above raw accuracy — biased moderation is worse than imperfect moderation.
Most LLM leaderboards implicitly weight one headline number at 100% and everything else at 0%. This is why Chatbot Arena rankings, MMLU scores, and HumanEval scores dominate AI discourse: they're easy to understand and easy to rank. But they paint a dangerously incomplete picture.
The scatter plot view (accuracy vs. efficiency) reveals another critical tradeoff: the Pareto frontier. Models on the frontier offer the best accuracy-efficiency tradeoff: no other model beats them on one axis without losing on the other. Models below the frontier are dominated: another model exists that is at least as good on both axes and strictly better on at least one. In practice, most deployments should target the Pareto frontier, not the accuracy leader.
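Finding the frontier is a small computation. The sketch below assumes each model is a `(name, accuracy, efficiency)` tuple where higher is better on both axes; the four models and their scores are hypothetical.

```python
def pareto_frontier(models):
    """Return the names of models that no other model dominates.

    models: list of (name, accuracy, efficiency); higher is better on both.
    A model is dominated if another is at least as good on both axes and
    strictly better on at least one.
    """
    frontier = []
    for name, acc, eff in models:
        dominated = any(
            o_acc >= acc and o_eff >= eff and (o_acc > acc or o_eff > eff)
            for _, o_acc, o_eff in models
        )
        if not dominated:
            frontier.append(name)
    return frontier


# Hypothetical scores on a 0-100 scale.
candidates = [
    ("large", 90, 20),
    ("medium", 80, 60),
    ("small", 65, 95),
    ("legacy", 70, 50),  # beaten by "medium" on both axes
]
print(pareto_frontier(candidates))
```

Only `legacy` drops out: `medium` is both more accurate and more efficient. The remaining three are all defensible choices, and which one to deploy depends on how much accuracy a given efficiency gain is worth.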
```python
# Computing composite score with custom weights
def rank_models(models, weights):
    """
    models: list of dicts with a 'name' key plus one numeric score per metric.
    weights: dict with keys 'accuracy', 'safety', 'efficiency', 'calibration'.
    Each weight 0-100. Normalized internally.
    Returns (composite, name) pairs, best first.
    """
    total = sum(weights.values())
    if total == 0:
        # No preference: every model ties, so keep the input order
        return [(0.0, m['name']) for m in models]
    normed = {k: v / total for k, v in weights.items()}
    scored = []
    for m in models:
        composite = sum(m[k] * normed[k] for k in normed)
        scored.append((composite, m['name']))
    return sorted(scored, reverse=True)

# Hospital use case: accuracy and calibration dominate
# (`models` is your own list of per-model metric dicts)
rank_models(models, {'accuracy': 80, 'safety': 60, 'efficiency': 10, 'calibration': 90})
```
Every evaluation method we've discussed has a fundamental flaw. Benchmarks can be gamed. LLM judges are biased. Human evaluation doesn't scale. But the deepest problem is one that threatens the entire benchmarking enterprise: data contamination.
Data contamination occurs when a model's training data includes questions (or answers) from the benchmark it's being evaluated on. The model isn't demonstrating knowledge — it's demonstrating memorization. It's like a student who memorized the answer key to last year's exam: they'll score well, but they haven't learned the material.
This is not hypothetical. LLMs are trained on massive web crawls that include blog posts, forum discussions, and academic papers that contain benchmark questions and answers. When MMLU questions appear in a Reddit discussion in the training data, the model has effectively "seen the test."
The simulation below shows how contamination inflates benchmark scores. Drag the contamination rate slider to see the gap between a model's true ability and its measured score.
Adjust the contamination rate to see how benchmark scores diverge from true ability. The gap between the two bars is the "contamination illusion."
Notice: contamination inflates weak models' scores more in relative terms. Suppose a model answers every contaminated question correctly and the rest at its true accuracy. A weak model with 40% true accuracy and 30% contamination then appears to score 0.30 + 0.70 × 0.40 = 58%, a 45% relative inflation. A strong model with 85% true accuracy and 30% contamination appears to score about 90%, only a 6% inflation. This means contamination can make a mediocre model look far more competitive with a genuinely strong one than it really is.
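The arithmetic behind those figures can be written out explicitly. The sketch assumes the simplest possible model of contamination: every contaminated question is answered correctly, and the remaining questions are answered at the model's true accuracy. That assumption reproduces the 58% figure exactly; real contamination is messier.

```python
def measured_accuracy(true_accuracy: float, contamination_rate: float) -> float:
    """Benchmark score if contaminated items are always answered correctly."""
    return contamination_rate + (1 - contamination_rate) * true_accuracy


def relative_inflation(true_accuracy: float, contamination_rate: float) -> float:
    """Score inflation as a fraction of the true accuracy."""
    measured = measured_accuracy(true_accuracy, contamination_rate)
    return (measured - true_accuracy) / true_accuracy


for label, true_acc in [("weak", 0.40), ("strong", 0.85)]:
    measured = measured_accuracy(true_acc, 0.30)
    inflation = relative_inflation(true_acc, 0.30)
    print(f"{label}: true={true_acc:.1%} measured={measured:.1%} inflation={inflation:.1%}")
```

Under this model the strong model's measured score is 89.5%, which rounds to the roughly 90% quoted above. The absolute gain shrinks as true accuracy rises because there are fewer wrong answers left for memorization to fix.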
Researchers have developed several techniques to detect whether a model's training data is contaminated:
Membership inference: Show the model a benchmark question and measure its perplexity (how surprised it is). If perplexity is abnormally low, the model may have seen the question before. Compare against control questions of similar difficulty that definitely aren't in the training data.
Canary strings: Insert unique, nonsensical strings into benchmark questions before publishing. If a model can complete these strings, it has seen the benchmark data.
Temporal splits: Evaluate on data created after the model's training cutoff. If a model trained on data through December 2023 scores well on questions from March 2024, it genuinely knows the material (unless the questions leaked into fine-tuning data).
Rephrasing: Rephrase benchmark questions in novel ways. If a model scores 90% on the original wording but 60% on semantically identical rephrasings, it memorized the surface form rather than understanding the concept.
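As one illustration of these techniques, the perplexity-based membership check can be sketched as below. It assumes you can obtain per-token log-probabilities (natural log) from the model for each question; the 0.5 ratio threshold and the sample log-probabilities are arbitrary illustrative values, not an established standard.

```python
import math


def perplexity(token_logprobs):
    """exp of the negative mean log-probability; lower means less surprised."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def looks_memorized(benchmark_logprobs, control_logprobs, ratio_threshold=0.5):
    """Flag a benchmark question whose perplexity is far below the control average.

    benchmark_logprobs: per-token log-probs for the benchmark question.
    control_logprobs: list of per-token log-prob lists for control questions
        of similar difficulty that are known to be absent from training data.
    """
    control_ppl = [perplexity(lp) for lp in control_logprobs]
    baseline = sum(control_ppl) / len(control_ppl)
    return perplexity(benchmark_logprobs) < ratio_threshold * baseline


# Hypothetical log-probs: the suspect question is predicted almost perfectly.
controls = [[-2.0] * 5, [-1.8] * 5]
suspect = [-0.1] * 5
ordinary = [-1.9] * 5
print(looks_memorized(suspect, controls), looks_memorized(ordinary, controls))
```

A low perplexity is evidence, not proof: short or formulaic questions are easy to predict even when unseen, which is why the comparison against matched controls matters more than any absolute threshold.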
"When a measure becomes a target, it ceases to be a good measure." This is Goodhart's Law, and it's the philosophical core of the evaluation problem. Once MMLU scores are used to market models, labs have incentives to optimize for MMLU specifically — even if it means overfitting to the benchmark distribution or inadvertently contaminating training data.
Real-world examples:
| Failure Mode | Example | Consequence |
|---|---|---|
| Teaching to the test | Training on MMLU-style questions | High MMLU, poor real-world performance |
| Contamination | Benchmark Q&A pairs in web crawl | Inflated scores, undetectable without auditing |
| Metric hacking | Optimizing BLEU instead of fluency | High BLEU, poor translation quality |
| Distribution shift | Fine-tuning on benchmark distribution | Scores don't transfer to other distributions |
Here's how a researcher might detect contamination in practice. Suppose you suspect a model has seen MMLU questions during training. You take 100 MMLU questions and rephrase each one — same concept, different wording:
```python
# Original MMLU question
original = "Which of the following is NOT a function of the liver?"
options = ["Bile production", "Insulin production", "Detoxification", "Glycogen storage"]

# Rephrased version (same concept, different surface form)
rephrased = "Which process below is NOT performed by the liver?"
options_r = ["Producing bile", "Synthesizing insulin", "Breaking down toxins", "Storing glycogen"]

# Example accuracies (in %) over the 100 originals and their 100 rephrasings.
# If the model scores 95% on originals but 70% on rephrasings,
# the 25-point gap suggests memorization, not understanding.
original_score = 95
rephrased_score = 70
gap = original_score - rephrased_score
if gap > 15:  # threshold in percentage points
    print("Warning: possible contamination (large original-vs-rephrased gap)")
```
A genuinely knowledgeable model should score approximately the same on both versions. A contaminated model memorized the exact wording and stumbles when the surface form changes. The gap between original and rephrased scores is a contamination signal.
Some of the most important qualities of a language model are inherently hard to benchmark:
Common sense: The ability to understand that in "the trophy doesn't fit in the suitcase because it's too big," the word "it" refers to the trophy, not the suitcase. We have benchmarks for this (Winograd Schema), but they're narrow and gameable.
Creativity: Can the model write a genuinely surprising poem? Generate a novel solution to a problem? No benchmark captures this because creativity is defined by unpredictability.
Consistency: Does the model give the same answer to the same question asked differently? Benchmarks test individual questions, not consistency across them.
Long-horizon reasoning: Can the model plan 20 steps ahead? Most benchmarks test single-turn or short-context abilities.
Evaluation is the connective tissue of NLP research. Every paper needs it, every product depends on it, and every claim about model quality is only as strong as the evaluation behind it. This lesson covered the full evaluation stack — from narrow task benchmarks to holistic multi-metric frameworks to scalable LLM judges.
| Paper | Year | Contribution |
|---|---|---|
| MMLU (Hendrycks et al.) | 2020 | 57-subject knowledge benchmark. The first test to measure breadth of world knowledge at scale. |
| HELM (Liang et al.) | 2022 | Holistic 7-metric evaluation. Introduced multi-dimensional profiling of LLM capabilities. |
| GLUE (Wang et al.) | 2018 | 9-task NLU benchmark. Unified evaluation for language understanding. |
| SuperGLUE (Wang et al.) | 2019 | Harder successor to GLUE. Added reading comprehension, causal reasoning, and WSD. |
| MT-Bench (Zheng et al.) | 2023 | LLM-as-Judge evaluation framework. Showed LLM judges agree with humans ~80% of the time. |
| Chatbot Arena (LMSYS) | 2023 | Crowdsourced pairwise evaluation with Elo ratings. 1M+ votes. Most trusted real-world ranking. |
| Lesson | Connection |
|---|---|
| L08: Post-training | RLHF uses human evaluation to train reward models. The reward model IS an evaluation model — it predicts human preferences. |
| L10: Agents | Agent evaluation is uniquely hard: must assess tool selection, reasoning quality, and final answer correctness across multi-step trajectories. |
| L12: ACL Trends | Evaluation methodology is a recurring theme at ACL: what counts as progress, what metrics to report, how to do responsible evaluation. |
Evaluation research is evolving rapidly. Key directions include:
Dynamic benchmarks: Benchmarks that are continuously updated with new questions to prevent contamination. DynaBench (Kiela et al., 2021) pioneered this approach — humans adversarially create questions that fool the current best model.
Process evaluation: Instead of just checking if the final answer is right, evaluate the reasoning process. Does the model show its work? Are intermediate steps correct? This matters for math, code, and multi-step reasoning.
Capability-specific evals: Targeted benchmarks for specific risks: WMDP (biosecurity), CyberSecEval (cybersecurity), and TruthfulQA (truthfulness). As models get more capable, evaluating specific risks becomes more important than overall benchmarking.
Multi-turn evaluation: Most benchmarks are single-turn (one question, one answer). Real usage is multi-turn (conversations, follow-ups, clarifications). Evaluating multi-turn interactions requires new methodologies.
When you need to evaluate a model, the right method depends on what you're measuring:
| What You're Measuring | Best Method | Cost |
|---|---|---|
| Factual knowledge | MMLU (5-shot) | Low (automated) |
| Reasoning ability | GSM8K, ARC, GPQA | Low (automated) |
| Code generation | HumanEval, SWE-Bench | Medium (execution needed) |
| Open-ended quality | LLM-as-Judge / MT-Bench | Medium (API calls) |
| Real user preference | Chatbot Arena / A/B test | High (human votes) |
| Safety profile | HELM, red-teaming | High (multi-metric) |
| Deployment readiness | Custom eval suite + human review | Very high |
The key insight: use the cheapest method that answers your question. Don't run a full HELM evaluation when MMLU would suffice. Don't rely on MMLU when you need to know about safety. Match the evaluation to the question.