Accuracy alone is a lie. HELM evaluates LLMs across 42 scenarios and 7 metrics — accuracy, calibration, robustness, fairness, bias, toxicity, and efficiency — because a model that's accurate but toxic or biased is not a good model.
A model scores 90% accuracy on a question-answering benchmark. Impressive, right? Ship it to production. But wait — when you feed it slightly rephrased questions, accuracy drops to 60%. When you check its confidence scores, the model is just as confident when it's wrong as when it's right. And when you analyze its outputs by demographic group, it performs 15 points worse on questions about non-Western cultures.
This model is accurate but dangerous. And the leaderboard told you it was great, because the leaderboard only measured accuracy.
This is the accuracy trap: the entire LLM evaluation ecosystem circa 2022 was built on a single metric. MMLU measured accuracy. HellaSwag measured accuracy. BIG-Bench measured accuracy. Models were ranked, funded, and deployed based on one number. Everything else — reliability, fairness, safety, cost — was invisible.
Liang et al. at Stanford's Center for Research on Foundation Models (CRFM) built HELM to fix this. Their framework evaluates 30+ language models across 42 different scenarios using 7 metrics. The result is a transparent, multi-dimensional leaderboard where you can see all of a model's strengths and weaknesses at once.
Think of it like a report card vs. a single test score. A test score of 90 tells you one thing. A report card showing A in math, C in writing, D in behavior, and F in participation tells you everything.
Compare two models: Model A has high accuracy but poor calibration, robustness, and fairness. Model B has slightly lower accuracy but is well-calibrated, robust, and fair. Click to toggle between the accuracy-only view and HELM's full view.
HELM's central innovation is measuring seven distinct dimensions of model quality. Each captures something different about what it means for a model to be "good." Let's unpack each one.
Does the model get the right answer? This is the metric everyone already knows. HELM uses task-appropriate accuracy: exact match for QA, F1 for text generation, BLEU for translation. Nothing new here — but now it's one voice in a chorus of seven, not a solo performance.
Calibration measures whether a model's confidence matches its actual accuracy. A well-calibrated model that says "I'm 80% confident" should be correct 80% of the time. A poorly calibrated model might be 95% confident and wrong 40% of the time. This is measured by the Expected Calibration Error (ECE).
Where we bin predictions by confidence, and for each bin compare the average confidence to the actual accuracy. Perfect calibration: ECE = 0.
Robustness measures how much accuracy drops when inputs are perturbed. HELM applies typos, paraphrases, and format changes to test questions. A robust model gives the same answer whether you ask "What's the capital of France?" or "whats the captial of france" or "Tell me the capital city of France."
Fairness measures whether accuracy is equal across demographic groups. If a medical QA model performs 20 points better on questions about diseases common in Western populations, that's a fairness failure. HELM measures the accuracy gap between best-performing and worst-performing groups.
Bias measures whether the model perpetuates stereotypes in its generated text. HELM uses metrics like stereotype rates in sentence completion: does the model associate "nurse" more with "she" than "he"? Does it associate "criminal" with specific racial terms?
Toxicity measures how often the model generates harmful, offensive, or unsafe content. HELM uses the Perspective API to score toxicity on a 0-1 scale. The key metric is the fraction of outputs exceeding a toxicity threshold (typically 0.5).
Efficiency measures the computational cost of running the model. This includes inference time, number of API calls, and estimated FLOPs. A model that's 5% more accurate but 100x more expensive isn't necessarily better for practical deployment.
| Metric | What It Measures | Ideal Value | Common Failure |
|---|---|---|---|
| Accuracy | Correct answers | 100% | Knowledge gaps |
| Calibration | Confidence matches reality | ECE = 0 | Overconfidence |
| Robustness | Stable under perturbation | 0% drop | Brittle to typos |
| Fairness | Equal across groups | 0% gap | Demographic bias |
| Bias | No stereotypes | 0% rate | Gender/race associations |
| Toxicity | No harmful content | 0% toxic | Provoked offensiveness |
| Efficiency | Low compute cost | Minimize | Over-parameterized |
Compare models across all 7 metrics simultaneously. Select a model to see its radar profile. A larger area means a better model — but notice how no model fills the entire radar.
HELM doesn't just add more metrics to the same benchmarks — it systematically covers the landscape of what language models are used for. The framework defines 42 scenarios, each representing a real use case with its own dataset and evaluation criteria.
Scenarios are organized by task type, giving comprehensive coverage of language model applications:
| Category | Count | Example Scenarios |
|---|---|---|
| Question Answering | 10 | NaturalQuestions, TriviaQA, BoolQ, MMLU subsets |
| Information Retrieval | 4 | MS MARCO, Natural Questions (retrieval) |
| Summarization | 4 | CNN/DailyMail, XSUM |
| Sentiment Analysis | 3 | IMDB, Yelp, Amazon reviews |
| Toxicity Detection | 3 | CivilComments, ToxiGen |
| Text Classification | 5 | AGNews, DBPedia, RAFT |
| Reasoning | 5 | HellaSwag, OpenBookQA, GSM8K |
| Language Generation | 4 | WikiText, RealToxicityPrompts |
| Other | 4 | Code generation, math, data-to-text |
Each scenario has four components:
The key insight is that the same task in different domains can produce very different results. A model might summarize news articles well but produce terrible medical summaries — because it has seen far more news than medical text during training. HELM captures this by crossing tasks with domains.
Because different models have different APIs (some are completion-based, some are chat-based, some support few-shot examples and others don't), HELM uses a standardized adaptation layer to convert each scenario into the format each model expects:
python # HELM's adaptation layer class Adaptation: """Converts a scenario into model-specific prompts.""" def __init__(self, method, num_shots, max_tokens): self.method = method # "generation" or "multiple_choice" self.num_shots = num_shots # 0, 1, 5, etc. self.max_tokens = max_tokens def adapt(self, scenario, model_api): # Select few-shot examples from training split examples = scenario.train[:self.num_shots] # Format prompt for this model's API if model_api == "completion": return format_completion(examples, scenario.test) elif model_api == "chat": return format_chat(examples, scenario.test) # Each scenario + adaptation = one "run" # 42 scenarios x N models = thousands of runs # Each run produces 7 metric scores
Explore HELM's 42 scenarios organized by category. Click a category to see its scenarios. Each dot represents one scenario — bigger dots have more evaluation instances.
Of HELM's seven metrics, calibration is the most overlooked and arguably the most important for real-world deployment. A model that doesn't know when it's wrong is a liability.
Imagine a weather forecast that says "90% chance of rain" every day. On days it says 90%, it actually rains 90% of the time. That forecast is perfectly calibrated — its stated probability matches reality. Now imagine a forecast that says "90% chance of rain" every day, but it only rains 50% of the time. That forecast is overconfident and poorly calibrated.
Language models have the same problem. When a model assigns probability 0.95 to answer choice "A", it should be correct about 95% of the time. HELM measures this with Expected Calibration Error (ECE).
Step 1: Collect all predictions with their confidence scores. Step 2: Bin them by confidence (e.g., 0-10%, 10-20%, ..., 90-100%). Step 3: For each bin, compute the gap between average confidence and actual accuracy. Step 4: Average the gaps, weighted by bin size.
Where B is the number of bins (typically 10), nb is the number of predictions in bin b, N is the total number of predictions, acc(b) is the accuracy within that bin, and conf(b) is the average confidence in that bin.
python import numpy as np def compute_ece(confidences, correct, n_bins=10): """ confidences: [N] array, model's stated probability for its answer correct: [N] bool array, whether the answer was right """ bin_boundaries = np.linspace(0, 1, n_bins + 1) ece = 0.0 for i in range(n_bins): # Find predictions in this confidence bin mask = (confidences >= bin_boundaries[i]) & \ (confidences < bin_boundaries[i+1]) n_b = mask.sum() if n_b == 0: continue # Average accuracy vs average confidence in this bin acc_b = correct[mask].mean() conf_b = confidences[mask].mean() # Weighted absolute gap ece += (n_b / len(confidences)) * abs(acc_b - conf_b) return ece # 0 = perfect calibration
HELM's evaluation revealed a striking pattern: RLHF-trained models are systematically overconfident. InstructGPT and ChatGPT-style models assign very high probabilities (0.9+) to their answers even when wrong. Base models (like raw GPT-3) are better calibrated but less accurate. This creates a perverse trade-off: making models more helpful (via RLHF) makes them worse at knowing when they're wrong.
In high-stakes applications (medical diagnosis, legal advice, financial predictions), you need to know when to trust the model and when to escalate to a human. A well-calibrated model lets you set confidence thresholds: "Show the model's answer if confidence > 0.9, route to human otherwise." A poorly calibrated model makes this impossible — its 0.9 confidence means nothing.
See how calibrated different models are. The diagonal line represents perfect calibration. Points above the line = overconfident (says 80% sure, correct only 50%). Select a model to see its calibration curve.
A model that breaks when you add a typo or performs worse on certain demographics isn't reliable. HELM's robustness and fairness metrics quantify these two critical failure modes.
HELM applies systematic perturbations to test inputs and measures how much accuracy drops. The perturbation types include:
| Perturbation | Example | What It Tests |
|---|---|---|
| Typos | "What is the caputal of France?" (capital → caputal) | Spelling tolerance |
| Lowercase | "what is the capital of france?" | Case sensitivity |
| Contractions | "What's the capital?" vs "What is the capital?" | Format flexibility |
| Paraphrase | "Name France's capital city" vs "What is the capital of France?" | Semantic equivalence |
| Dialect | "What be the capital of France?" (AAVE patterns) | Dialect robustness |
The robustness metric is the accuracy drop ratio: how much worse does the model do on perturbed inputs vs. clean inputs? A perfectly robust model has a drop of 0%. Most models in HELM show 5-15% accuracy drops on typos alone.
HELM measures fairness as the accuracy gap between the best-performing and worst-performing demographic groups. For scenarios with demographic metadata (age, gender, race), HELM computes accuracy separately for each group and reports the gap.
A gap of 0% means perfectly fair — equal accuracy for all groups. HELM found gaps of 10-20 percentage points on some scenarios, with models consistently performing worse on text from underrepresented groups.
HELM's data revealed a counterintuitive finding: the most accurate models are often the least robust. Why? Because high accuracy often comes from memorizing patterns in the training distribution. These memorized patterns are fragile — change the pattern slightly and the model fails. Models with slightly lower accuracy but better generalization tend to be more robust.
python # HELM robustness evaluation def evaluate_robustness(model, scenario, perturbations): # Score on clean data clean_acc = evaluate(model, scenario.test_data) results = {"clean": clean_acc} for pert_name, pert_fn in perturbations.items(): # Apply perturbation to each test input perturbed_data = [pert_fn(x) for x in scenario.test_data] pert_acc = evaluate(model, perturbed_data) drop = (clean_acc - pert_acc) / clean_acc results[pert_name] = {"acc": pert_acc, "drop": drop} return results # Typical output: # {"clean": 0.85, "typos": {"acc": 0.73, "drop": 0.14}, # "lowercase": {"acc": 0.82, "drop": 0.035}}
Apply different perturbations to see how model accuracy drops. Each perturbation type reveals a different kind of fragility. The orange bars show accuracy under perturbation, gray bars show the clean baseline.
The final behavioral metrics in HELM address what the model generates, not just what it knows. A model might answer questions perfectly but generate racist completions. These metrics catch that.
HELM uses the RealToxicityPrompts dataset: 100K naturally occurring prompts from the web, some benign and some designed to elicit toxic completions. The model generates continuations, and the Perspective API scores each completion for toxicity on a 0-1 scale.
The key metrics are:
| Metric | Definition | Ideal Value |
|---|---|---|
| Toxicity Rate | Fraction of outputs with toxicity score > 0.5 | 0% |
| Max Toxicity | Highest toxicity score across all outputs | < 0.5 |
| Avg Toxicity | Mean toxicity across all outputs | As low as possible |
python # Toxicity evaluation pipeline from perspectiveapi import PerspectiveClient def evaluate_toxicity(model, prompts, client): scores = [] for prompt in prompts: # Generate 25 completions per prompt (for stability) completions = model.generate(prompt, n=25, max_tokens=20) for completion in completions: score = client.score_toxicity(completion) # 0.0 to 1.0 scores.append(score) return { "toxicity_rate": mean([s > 0.5 for s in scores]), "max_toxicity": max(scores), "avg_toxicity": mean(scores) }
Bias is more subtle than toxicity. A model might never generate overtly toxic text but consistently associate certain occupations with certain genders, or certain behaviors with certain races. HELM measures bias through stereotype association tests.
The idea: present the model with sentence templates like "The [OCCUPATION] walked into the room. [PRONOUN] was..." and measure whether the model assigns higher probability to gendered pronouns based on the occupation. A biased model will complete "The nurse... She was..." more often than "The nurse... He was..."
Base models are more toxic than instruction-tuned models. Raw GPT-3 generates toxic completions at 2-3x the rate of InstructGPT, because RLHF specifically penalizes harmful outputs.
All models show bias. Even instruction-tuned models exhibit measurable gender and racial biases in sentence completion tasks. The biases mirror those found in training data — web text contains societal biases, and models absorb them.
Smaller models are sometimes less biased. Counterintuitively, larger models sometimes show more bias because they've memorized more societal stereotypes from training data. Smaller models, with less capacity for memorization, sometimes produce more uniform (less biased) completions.
Compare toxicity rates and bias scores across different model types. Toggle between toxicity view (how often the model generates harmful text) and bias view (how much it associates stereotypes with demographics).
The showcase of HELM isn't the paper — it's the live, interactive leaderboard. Anyone can visit crfm.stanford.edu/helm and explore model performance across all 42 scenarios and 7 metrics. No cherry-picking, no hidden results. Full transparency.
The HELM leaderboard follows three principles that set it apart from other benchmarks:
1. Multi-dimensional ranking. There is no single "HELM score." Instead, you see a table where each row is a model and each column is a metric. You can sort by any column. This prevents the reductive "model X is better than model Y" narrative when the truth is "model X is more accurate but less calibrated."
2. Full results, not summaries. Every individual prediction is stored and accessible. You can drill down from "GPT-3 scored 45% on NaturalQuestions" to the specific questions it got wrong. This enables post-hoc analysis that summary statistics hide.
3. Reproducible methodology. The entire evaluation code is open-source. The prompts, adaptations, and metrics are fully specified. Anyone can replicate any result or add a new model to the leaderboard.
Explore a simulated HELM leaderboard. Click column headers to sort by different metrics. Notice how rankings change depending on which metric you prioritize. No model is best at everything.
HELM changed how the industry thinks about evaluation. Before HELM, model releases highlighted cherry-picked benchmark scores. After HELM, there's a public expectation that models should be evaluated across multiple dimensions. The HELM-style multi-metric leaderboard has been adopted by Hugging Face's Open LLM Leaderboard, Chatbot Arena, and others.
python # Using HELM programmatically from helm.benchmark.run import run_benchmarking from helm.benchmark.scenarios import get_scenario # Define what to evaluate config = { "models": ["openai/gpt-3.5-turbo", "meta/llama-2-70b"], "scenarios": ["mmlu", "naturalqa", "real_toxicity_prompts"], "metrics": ["accuracy", "calibration", "robustness", "fairness", "bias", "toxicity", "efficiency"], "adaptations": {"num_shots": 5} } # Run evaluation (this takes hours/days) results = run_benchmarking(config) # Each result: {model, scenario, metric: value} # Export to leaderboard: results.to_csv("helm_results.csv")
HELM sits at the center of the LLM evaluation ecosystem. It both builds on earlier work and spawned new directions in multi-dimensional model assessment.
| Framework | Year | Relationship to HELM |
|---|---|---|
| GLUE/SuperGLUE | 2018-19 | Predecessors: single-metric linguistic benchmarks that HELM superseded |
| MMLU | 2021 | Included as one of HELM's 42 scenarios |
| BIG-Bench | 2022 | Parallel effort: 200+ tasks but accuracy-focused |
| Chatbot Arena | 2023 | Successor approach: human preference ranking rather than automatic metrics |
| LMSYS | 2023 | Elo-based ranking from pairwise comparisons |
| AlpacaEval | 2023 | Automatic preference evaluation, lighter weight than HELM |
Multi-dimensionality. The idea that accuracy alone is insufficient is now consensus. Every serious evaluation framework after HELM measures multiple aspects of model quality.
Transparency. Open code, open data, open results. This set the standard for reproducible evaluation.
Scale. Evaluating 30+ models across 42 scenarios with 7 metrics is enormously expensive. Few groups can replicate the full HELM evaluation. Chatbot Arena's crowdsourced approach scales better.
Automatic metrics miss nuance. Perspective API toxicity scores and ECE calibration numbers are proxies for complex phenomena. Human evaluation (as in Chatbot Arena) captures aspects that automatic metrics miss.
MMLU — The single-metric benchmark that HELM includes and extends beyond accuracy. Read the MMLU lesson →
Chain of Thought — HELM evaluates CoT models and finds they're more accurate but differently calibrated. Read the CoT lesson →
RLHF/DPO — The training method that creates HELM's calibration-accuracy trade-off. Read the DPO lesson →
See how LLM evaluation evolved from single-metric benchmarks to multi-dimensional frameworks to human preference ranking.