HELM (Liang 2022)

Chapter 0: The Accuracy Trap

A model scores 90% accuracy on a question-answering benchmark. Impressive, right? Ship it to production. But wait — when you feed it slightly rephrased questions, accuracy drops to 60%. When you check its confidence scores, the model is just as confident when it's wrong as when it's right. And when you analyze its outputs by demographic group, it performs 15 points worse on questions about non-Western cultures.

This model is accurate but dangerous. And the leaderboard told you it was great, because the leaderboard only measured accuracy.

This is the accuracy trap: the entire LLM evaluation ecosystem circa 2022 was built on a single metric. MMLU measured accuracy. HellaSwag measured accuracy. BIG-Bench measured accuracy. Models were ranked, funded, and deployed based on one number. Everything else — reliability, fairness, safety, cost — was invisible.

HELM's thesis: A model isn't "good" just because it gets the right answer. It must also (1) know when it doesn't know (calibration), (2) maintain performance under perturbation (robustness), (3) perform equally across demographic groups (fairness), (4) avoid stereotypes (bias), (5) not generate harmful content (toxicity), and (6) do it all efficiently (efficiency). A truly good model excels across all seven dimensions simultaneously.

Liang et al. at Stanford's Center for Research on Foundation Models (CRFM) built HELM to fix this. Their framework evaluates 30+ language models across 42 different scenarios using 7 metrics. The result is a transparent, multi-dimensional leaderboard where you can see all of a model's strengths and weaknesses at once.

Think of it like a report card vs. a single test score. A test score of 90 tells you one thing. A report card showing A in math, C in writing, D in behavior, and F in participation tells you everything.

Single-Metric vs Multi-Metric Evaluation

Compare two models: Model A has high accuracy but poor calibration, robustness, and fairness. Model B has slightly lower accuracy but is well-calibrated, robust, and fair. Click to toggle between the accuracy-only view and HELM's full view.

What is the "accuracy trap" that HELM aims to solve?

The entire LLM evaluation ecosystem was built on accuracy alone, hiding critical failures in calibration, robustness, fairness, bias, toxicity, and efficiency — HELM evaluates all 7 dimensions simultaneously Models were getting too accurate on benchmarks Accuracy metrics were being calculated incorrectly

Chapter 1: 7 Metrics

HELM's central innovation is measuring seven distinct dimensions of model quality. Each captures something different about what it means for a model to be "good." Let's unpack each one.

1. Accuracy

Does the model get the right answer? This is the metric everyone already knows. HELM uses task-appropriate accuracy: exact match for QA, F1 for text generation, BLEU for translation. Nothing new here — but now it's one voice in a chorus of seven, not a solo performance.

2. Calibration

Calibration measures whether a model's confidence matches its actual accuracy. A well-calibrated model that says "I'm 80% confident" should be correct 80% of the time. A poorly calibrated model might be 95% confident and wrong 40% of the time. This is measured by the Expected Calibration Error (ECE).

ECE = ∑_b=1^B (n_b/N) |acc(b) − conf(b)|

Where we bin predictions by confidence, and for each bin compare the average confidence to the actual accuracy. Perfect calibration: ECE = 0.

3. Robustness

Robustness measures how much accuracy drops when inputs are perturbed. HELM applies typos, paraphrases, and format changes to test questions. A robust model gives the same answer whether you ask "What's the capital of France?" or "whats the captial of france" or "Tell me the capital city of France."

4. Fairness

Fairness measures whether accuracy is equal across demographic groups. If a medical QA model performs 20 points better on questions about diseases common in Western populations, that's a fairness failure. HELM measures the accuracy gap between best-performing and worst-performing groups.

5. Bias

Bias measures whether the model perpetuates stereotypes in its generated text. HELM uses metrics like stereotype rates in sentence completion: does the model associate "nurse" more with "she" than "he"? Does it associate "criminal" with specific racial terms?

6. Toxicity

Toxicity measures how often the model generates harmful, offensive, or unsafe content. HELM uses the Perspective API to score toxicity on a 0-1 scale. The key metric is the fraction of outputs exceeding a toxicity threshold (typically 0.5).

7. Efficiency

Efficiency measures the computational cost of running the model. This includes inference time, number of API calls, and estimated FLOPs. A model that's 5% more accurate but 100x more expensive isn't necessarily better for practical deployment.

The seven metrics form a trade-off surface. No model dominates on all seven. RLHF models are more accurate but less calibrated (they're overconfident). Larger models are more accurate but less efficient. Models trained to refuse toxic prompts score better on toxicity but worse on accuracy (refusing to answer reduces correct answers). HELM makes these trade-offs visible.

Metric	What It Measures	Ideal Value	Common Failure
Accuracy	Correct answers	100%	Knowledge gaps
Calibration	Confidence matches reality	ECE = 0	Overconfidence
Robustness	Stable under perturbation	0% drop	Brittle to typos
Fairness	Equal across groups	0% gap	Demographic bias
Bias	No stereotypes	0% rate	Gender/race associations
Toxicity	No harmful content	0% toxic	Provoked offensiveness
Efficiency	Low compute cost	Minimize	Over-parameterized

7-Metric Radar Chart

Compare models across all 7 metrics simultaneously. Select a model to see its radar profile. A larger area means a better model — but notice how no model fills the entire radar.

What does HELM's calibration metric measure?

How fast the model generates text Whether the model's stated confidence matches its actual accuracy — a model 80% confident should be correct 80% of the time (measured by Expected Calibration Error) Whether the model gives the same answer to rephrased questions

Chapter 2: 42 Scenarios

HELM doesn't just add more metrics to the same benchmarks — it systematically covers the landscape of what language models are used for. The framework defines 42 scenarios, each representing a real use case with its own dataset and evaluation criteria.

The scenario taxonomy

Scenarios are organized by task type, giving comprehensive coverage of language model applications:

Category	Count	Example Scenarios
Question Answering	10	NaturalQuestions, TriviaQA, BoolQ, MMLU subsets
Information Retrieval	4	MS MARCO, Natural Questions (retrieval)
Summarization	4	CNN/DailyMail, XSUM
Sentiment Analysis	3	IMDB, Yelp, Amazon reviews
Toxicity Detection	3	CivilComments, ToxiGen
Text Classification	5	AGNews, DBPedia, RAFT
Reasoning	5	HellaSwag, OpenBookQA, GSM8K
Language Generation	4	WikiText, RealToxicityPrompts
Other	4	Code generation, math, data-to-text

Scenario structure

Each scenario has four components:

Task

What the model must do: answer a question, summarize text, classify sentiment, generate code.

↓

Domain

Where the data comes from: Wikipedia, news, Reddit, legal documents, medical records.

↓

Language

What language: English (primary), with some multilingual scenarios.

↓

Adaptations

How the data is formatted for the model: zero-shot, few-shot, prompt templates.

The key insight is that the same task in different domains can produce very different results. A model might summarize news articles well but produce terrible medical summaries — because it has seen far more news than medical text during training. HELM captures this by crossing tasks with domains.

The adaptation strategy

Because different models have different APIs (some are completion-based, some are chat-based, some support few-shot examples and others don't), HELM uses a standardized adaptation layer to convert each scenario into the format each model expects:

python
# HELM's adaptation layer
class Adaptation:
    """Converts a scenario into model-specific prompts."""
    def __init__(self, method, num_shots, max_tokens):
        self.method = method        # "generation" or "multiple_choice"
        self.num_shots = num_shots  # 0, 1, 5, etc.
        self.max_tokens = max_tokens

    def adapt(self, scenario, model_api):
        # Select few-shot examples from training split
        examples = scenario.train[:self.num_shots]

        # Format prompt for this model's API
        if model_api == "completion":
            return format_completion(examples, scenario.test)
        elif model_api == "chat":
            return format_chat(examples, scenario.test)

# Each scenario + adaptation = one "run"
# 42 scenarios x N models = thousands of runs
# Each run produces 7 metric scores

Coverage > depth. HELM intentionally prioritizes breadth. Rather than deeply probing one task (like MMLU does with 57 subjects of QA), HELM samples across the entire landscape of model uses. The philosophy: if a model fails on toxicity detection, that matters just as much as failing on question answering. One-dimensional benchmarks hide these failures.

Scenario Landscape

Explore HELM's 42 scenarios organized by category. Click a category to see its scenarios. Each dot represents one scenario — bigger dots have more evaluation instances.

Why does HELM use 42 different scenarios instead of focusing deeply on one task?

More scenarios means more questions It's easier to evaluate many tasks shallowly Because a model that excels at QA but fails at toxicity detection is still problematic — breadth across the full landscape of use cases reveals failures that single-task benchmarks hide

Chapter 3: Calibration

Of HELM's seven metrics, calibration is the most overlooked and arguably the most important for real-world deployment. A model that doesn't know when it's wrong is a liability.

What is calibration?

Imagine a weather forecast that says "90% chance of rain" every day. On days it says 90%, it actually rains 90% of the time. That forecast is perfectly calibrated — its stated probability matches reality. Now imagine a forecast that says "90% chance of rain" every day, but it only rains 50% of the time. That forecast is overconfident and poorly calibrated.

Language models have the same problem. When a model assigns probability 0.95 to answer choice "A", it should be correct about 95% of the time. HELM measures this with Expected Calibration Error (ECE).

How ECE works

Step 1: Collect all predictions with their confidence scores. Step 2: Bin them by confidence (e.g., 0-10%, 10-20%, ..., 90-100%). Step 3: For each bin, compute the gap between average confidence and actual accuracy. Step 4: Average the gaps, weighted by bin size.

ECE = ∑_b=1^B (n_b/N) · |acc(b) − conf(b)|

Where B is the number of bins (typically 10), n_b is the number of predictions in bin b, N is the total number of predictions, acc(b) is the accuracy within that bin, and conf(b) is the average confidence in that bin.

python
import numpy as np

def compute_ece(confidences, correct, n_bins=10):
    """
    confidences: [N] array, model's stated probability for its answer
    correct: [N] bool array, whether the answer was right
    """
    bin_boundaries = np.linspace(0, 1, n_bins + 1)
    ece = 0.0

    for i in range(n_bins):
        # Find predictions in this confidence bin
        mask = (confidences >= bin_boundaries[i]) & \
               (confidences < bin_boundaries[i+1])
        n_b = mask.sum()
        if n_b == 0:
            continue

        # Average accuracy vs average confidence in this bin
        acc_b = correct[mask].mean()
        conf_b = confidences[mask].mean()

        # Weighted absolute gap
        ece += (n_b / len(confidences)) * abs(acc_b - conf_b)

    return ece  # 0 = perfect calibration

What HELM found about calibration

HELM's evaluation revealed a striking pattern: RLHF-trained models are systematically overconfident. InstructGPT and ChatGPT-style models assign very high probabilities (0.9+) to their answers even when wrong. Base models (like raw GPT-3) are better calibrated but less accurate. This creates a perverse trade-off: making models more helpful (via RLHF) makes them worse at knowing when they're wrong.

The calibration-accuracy trade-off. RLHF training rewards the model for being confident and direct. Users prefer answers that sound certain. But this pushes the model toward overconfidence — it learns to hide uncertainty behind assertive language. HELM quantified this: InstructGPT's ECE is 2-3x worse than raw GPT-3's, even though its accuracy is higher.

Why calibration matters for deployment

In high-stakes applications (medical diagnosis, legal advice, financial predictions), you need to know when to trust the model and when to escalate to a human. A well-calibrated model lets you set confidence thresholds: "Show the model's answer if confidence > 0.9, route to human otherwise." A poorly calibrated model makes this impossible — its 0.9 confidence means nothing.

Calibration Visualizer

See how calibrated different models are. The diagonal line represents perfect calibration. Points above the line = overconfident (says 80% sure, correct only 50%). Select a model to see its calibration curve.

What did HELM discover about the calibration of RLHF-trained models (like InstructGPT)?

RLHF models are systematically overconfident — they assign high probabilities (0.9+) even when wrong, because RLHF rewards confident-sounding answers, making their ECE 2-3x worse than base models despite higher accuracy RLHF models are perfectly calibrated RLHF models are underconfident

Chapter 4: Robustness & Fairness

A model that breaks when you add a typo or performs worse on certain demographics isn't reliable. HELM's robustness and fairness metrics quantify these two critical failure modes.

Robustness: surviving perturbations

HELM applies systematic perturbations to test inputs and measures how much accuracy drops. The perturbation types include:

Perturbation	Example	What It Tests
Typos	"What is the caputal of France?" (capital → caputal)	Spelling tolerance
Lowercase	"what is the capital of france?"	Case sensitivity
Contractions	"What's the capital?" vs "What is the capital?"	Format flexibility
Paraphrase	"Name France's capital city" vs "What is the capital of France?"	Semantic equivalence
Dialect	"What be the capital of France?" (AAVE patterns)	Dialect robustness

The robustness metric is the accuracy drop ratio: how much worse does the model do on perturbed inputs vs. clean inputs? A perfectly robust model has a drop of 0%. Most models in HELM show 5-15% accuracy drops on typos alone.

robustness = 1 − (acc_clean − acc_perturbed) / acc_clean

Robustness reveals memorization. A model that memorizes specific question phrasings from training data will be fragile to paraphrases. A model that truly understands the concept will be robust. Robustness testing is a proxy for genuine understanding vs. surface-level pattern matching.

Fairness: equal performance across groups

HELM measures fairness as the accuracy gap between the best-performing and worst-performing demographic groups. For scenarios with demographic metadata (age, gender, race), HELM computes accuracy separately for each group and reports the gap.

fairness_gap = max_g(acc_g) − min_g(acc_g)

A gap of 0% means perfectly fair — equal accuracy for all groups. HELM found gaps of 10-20 percentage points on some scenarios, with models consistently performing worse on text from underrepresented groups.

Why these metrics correlate inversely with accuracy

HELM's data revealed a counterintuitive finding: the most accurate models are often the least robust. Why? Because high accuracy often comes from memorizing patterns in the training distribution. These memorized patterns are fragile — change the pattern slightly and the model fails. Models with slightly lower accuracy but better generalization tend to be more robust.

python
# HELM robustness evaluation
def evaluate_robustness(model, scenario, perturbations):
    # Score on clean data
    clean_acc = evaluate(model, scenario.test_data)

    results = {"clean": clean_acc}
    for pert_name, pert_fn in perturbations.items():
        # Apply perturbation to each test input
        perturbed_data = [pert_fn(x) for x in scenario.test_data]
        pert_acc = evaluate(model, perturbed_data)
        drop = (clean_acc - pert_acc) / clean_acc
        results[pert_name] = {"acc": pert_acc, "drop": drop}

    return results
# Typical output:
# {"clean": 0.85, "typos": {"acc": 0.73, "drop": 0.14},
#  "lowercase": {"acc": 0.82, "drop": 0.035}}

Robustness Stress Test

Apply different perturbations to see how model accuracy drops. Each perturbation type reveals a different kind of fragility. The orange bars show accuracy under perturbation, gray bars show the clean baseline.

Why did HELM find that the most accurate models are often the least robust?

Because high accuracy often comes from memorizing specific patterns in training data — these memorized patterns are fragile to perturbations, while models that truly generalize are more robust Because robust models are always less accurate Because perturbations change the correct answer

Chapter 5: Toxicity & Bias

The final behavioral metrics in HELM address what the model generates, not just what it knows. A model might answer questions perfectly but generate racist completions. These metrics catch that.

Measuring toxicity

HELM uses the RealToxicityPrompts dataset: 100K naturally occurring prompts from the web, some benign and some designed to elicit toxic completions. The model generates continuations, and the Perspective API scores each completion for toxicity on a 0-1 scale.

The key metrics are:

Metric	Definition	Ideal Value
Toxicity Rate	Fraction of outputs with toxicity score > 0.5	0%
Max Toxicity	Highest toxicity score across all outputs	< 0.5
Avg Toxicity	Mean toxicity across all outputs	As low as possible

python
# Toxicity evaluation pipeline
from perspectiveapi import PerspectiveClient

def evaluate_toxicity(model, prompts, client):
    scores = []
    for prompt in prompts:
        # Generate 25 completions per prompt (for stability)
        completions = model.generate(prompt, n=25, max_tokens=20)
        for completion in completions:
            score = client.score_toxicity(completion)  # 0.0 to 1.0
            scores.append(score)

    return {
        "toxicity_rate": mean([s > 0.5 for s in scores]),
        "max_toxicity": max(scores),
        "avg_toxicity": mean(scores)
    }

Measuring bias

Bias is more subtle than toxicity. A model might never generate overtly toxic text but consistently associate certain occupations with certain genders, or certain behaviors with certain races. HELM measures bias through stereotype association tests.

The idea: present the model with sentence templates like "The [OCCUPATION] walked into the room. [PRONOUN] was..." and measure whether the model assigns higher probability to gendered pronouns based on the occupation. A biased model will complete "The nurse... She was..." more often than "The nurse... He was..."

Toxicity and bias are partially in tension with accuracy. Safety-tuned models (with RLHF, constitutional AI) reduce toxicity by learning to refuse harmful prompts. But refusal reduces accuracy on legitimate questions — the model says "I can't help with that" instead of providing the correct answer. HELM makes this trade-off visible and quantifiable.

HELM's key findings on toxicity and bias

Base models are more toxic than instruction-tuned models. Raw GPT-3 generates toxic completions at 2-3x the rate of InstructGPT, because RLHF specifically penalizes harmful outputs.

All models show bias. Even instruction-tuned models exhibit measurable gender and racial biases in sentence completion tasks. The biases mirror those found in training data — web text contains societal biases, and models absorb them.

Smaller models are sometimes less biased. Counterintuitively, larger models sometimes show more bias because they've memorized more societal stereotypes from training data. Smaller models, with less capacity for memorization, sometimes produce more uniform (less biased) completions.

Toxicity & Bias Dashboard

Compare toxicity rates and bias scores across different model types. Toggle between toxicity view (how often the model generates harmful text) and bias view (how much it associates stereotypes with demographics).

Why might a safety-tuned model (RLHF) score worse on accuracy despite scoring better on toxicity?

Because safety tuning teaches the model to refuse potentially harmful prompts — but refusal on legitimate questions reduces accuracy, creating a measurable trade-off between safety and performance Because RLHF makes models dumber Because toxicity and accuracy are always inversely correlated

Chapter 6: HELM Dashboard

The showcase of HELM isn't the paper — it's the live, interactive leaderboard. Anyone can visit crfm.stanford.edu/helm and explore model performance across all 42 scenarios and 7 metrics. No cherry-picking, no hidden results. Full transparency.

Dashboard design principles

The HELM leaderboard follows three principles that set it apart from other benchmarks:

1. Multi-dimensional ranking. There is no single "HELM score." Instead, you see a table where each row is a model and each column is a metric. You can sort by any column. This prevents the reductive "model X is better than model Y" narrative when the truth is "model X is more accurate but less calibrated."

2. Full results, not summaries. Every individual prediction is stored and accessible. You can drill down from "GPT-3 scored 45% on NaturalQuestions" to the specific questions it got wrong. This enables post-hoc analysis that summary statistics hide.

3. Reproducible methodology. The entire evaluation code is open-source. The prompts, adaptations, and metrics are fully specified. Anyone can replicate any result or add a new model to the leaderboard.

Transparency as a design choice. Most LLM leaderboards are curated by the model creators — who have incentives to cherry-pick favorable benchmarks. HELM is run by an independent research center (Stanford CRFM) with no model to sell. The leaderboard includes models from OpenAI, Google, Meta, Anthropic, and others — all evaluated with the same code on the same data.

HELM Leaderboard Simulator

Explore a simulated HELM leaderboard. Click column headers to sort by different metrics. Notice how rankings change depending on which metric you prioritize. No model is best at everything.

Impact on the field

HELM changed how the industry thinks about evaluation. Before HELM, model releases highlighted cherry-picked benchmark scores. After HELM, there's a public expectation that models should be evaluated across multiple dimensions. The HELM-style multi-metric leaderboard has been adopted by Hugging Face's Open LLM Leaderboard, Chatbot Arena, and others.

python
# Using HELM programmatically
from helm.benchmark.run import run_benchmarking
from helm.benchmark.scenarios import get_scenario

# Define what to evaluate
config = {
    "models": ["openai/gpt-3.5-turbo", "meta/llama-2-70b"],
    "scenarios": ["mmlu", "naturalqa", "real_toxicity_prompts"],
    "metrics": ["accuracy", "calibration", "robustness",
               "fairness", "bias", "toxicity", "efficiency"],
    "adaptations": {"num_shots": 5}
}

# Run evaluation (this takes hours/days)
results = run_benchmarking(config)

# Each result: {model, scenario, metric: value}
# Export to leaderboard: results.to_csv("helm_results.csv")

What makes HELM's leaderboard different from other model benchmarks?

It uses bigger datasets It only evaluates open-source models It has no single score — models are shown across all 7 metrics simultaneously with sortable columns, all results are accessible down to individual predictions, and it's run by an independent lab (Stanford CRFM) not a model creator

Chapter 7: Connections

HELM sits at the center of the LLM evaluation ecosystem. It both builds on earlier work and spawned new directions in multi-dimensional model assessment.

HELM's place in evaluation history

Framework	Year	Relationship to HELM
GLUE/SuperGLUE	2018-19	Predecessors: single-metric linguistic benchmarks that HELM superseded
MMLU	2021	Included as one of HELM's 42 scenarios
BIG-Bench	2022	Parallel effort: 200+ tasks but accuracy-focused
Chatbot Arena	2023	Successor approach: human preference ranking rather than automatic metrics
LMSYS	2023	Elo-based ranking from pairwise comparisons
AlpacaEval	2023	Automatic preference evaluation, lighter weight than HELM

What HELM got right

Multi-dimensionality. The idea that accuracy alone is insufficient is now consensus. Every serious evaluation framework after HELM measures multiple aspects of model quality.

Transparency. Open code, open data, open results. This set the standard for reproducible evaluation.

What HELM got wrong

Scale. Evaluating 30+ models across 42 scenarios with 7 metrics is enormously expensive. Few groups can replicate the full HELM evaluation. Chatbot Arena's crowdsourced approach scales better.

Automatic metrics miss nuance. Perspective API toxicity scores and ECE calibration numbers are proxies for complex phenomena. Human evaluation (as in Chatbot Arena) captures aspects that automatic metrics miss.

The evaluation pendulum. The field swings between automatic benchmarks (HELM, MMLU) and human evaluation (Chatbot Arena). Automatic benchmarks are reproducible but gameable. Human evaluation captures real quality but is expensive and noisy. The best evaluation probably combines both — automatic metrics for screening, human evaluation for the final ranking.

MMLU — The single-metric benchmark that HELM includes and extends beyond accuracy. Read the MMLU lesson →

Chain of Thought — HELM evaluates CoT models and finds they're more accurate but differently calibrated. Read the CoT lesson →

RLHF/DPO — The training method that creates HELM's calibration-accuracy trade-off. Read the DPO lesson →

Evaluation Framework Timeline

See how LLM evaluation evolved from single-metric benchmarks to multi-dimensional frameworks to human preference ranking.

Era HELM (2022)

What is the main limitation of HELM's approach compared to Chatbot Arena's approach?

HELM's automatic metrics (toxicity scores, ECE) are proxies that miss nuance — human evaluation in Chatbot Arena captures aspects of quality that automatic metrics cannot, though at the cost of reproducibility and expense HELM evaluates too few models HELM is too expensive to run

HELM: Holistic Evaluation of Language Models