Evan Miller (Anthropic) — 2024

Adding Error Bars to Evals

LLM benchmarks report "65.5% vs 64.0%" without uncertainty. Is that a real difference, or noise? Rigorous statistics says: compute the error bar first, then decide.

Prerequisites: Mean & variance + Central Limit Theorem + Confidence intervals

Chapter 0: The Problem

Two fictional language models, Galleon and Dreadnought, are evaluated on three benchmarks:

| Eval | # Questions | Galleon | Dreadnought | Difference |
| --- | --- | --- | --- | --- |
| MATH | 5,000 | 65.5% | 63.0% | +2.5% |
| HumanEval | 164 | 83.6% | 86.7% | −3.1% |
| MGSM | 2,500 | 75.3% | 78.0% | −2.7% |

At face value, Dreadnought wins two of three evals. Case closed?

Not so fast. Look at HumanEval: only 164 questions. On 164 binary questions, each score has a standard error of roughly 3 percentage points, so a 3.1-point difference is indistinguishable from noise. Meanwhile the MATH result (5,000 questions, 2.5-point gap) is far more precise, with a standard error under 1 point per score. Without error bars, you cannot tell the difference.

This is not a hypothetical problem. Industry practice today is to bold the highest number, put it in a press release, and move on. Leaderboards rank models to the tenth of a percent. Technical reports trumpet a 0.5% improvement as evidence of superior architecture. But none of this means anything without a measure of uncertainty.

[Interactive figure: Eval Scores: With and Without Error Bars. The bare numbers look definitive; the error bars tell a different story.]

The core problem: Current LLM evaluation practice is "highest number is best" — bold the SOTA, move on. No error bars, no significance tests, no sample size justification. This paper brings the statistics that every other experimental science has used for decades.
HumanEval has 164 questions, MATH has 5,000. A 3% difference on HumanEval vs. a 2.5% difference on MATH — which is more likely to be statistically significant?

Chapter 1: The Key Insight

The fundamental conceptual move in this paper is simple but powerful:

Treat eval questions as random samples from a super-population. The 5,000 MATH questions are not all possible math questions — they are a sample from the infinite universe of math questions you could ask. The eval score is an estimate of a true underlying ability, and that estimate has sampling uncertainty.

Once you accept this framing, standard statistical machinery applies directly:

Decompose the score on question i into a conditional mean (how hard the question is, on average) and a random component (noise from sampling the model's output):

s_i = x_i + ε_i

Here x_i is the expected score on question i (e.g. the probability the model gets it right), and ε_i is zero-mean noise. The true eval score μ is the expected value over the super-population: μ = E[s] = E[x]. We estimate it with the sample mean s̄.

Two sources of variance drive the uncertainty:

  1. Var(x): Variance of question difficulty across the super-population. Some questions are easy (x_i near 1), some are hard (x_i near 0). This variance is a property of the eval and cannot be reduced.
  2. E[σ_i²]: Mean conditional variance. Even for a fixed question, the model's sampled output varies (temperature, nucleus sampling). This noise can be reduced by resampling or using next-token probabilities.

By the law of total variance, these add:

Var(μ̂) = Var(s)/n = (Var(x) + E[σ_i²]) / n

This decomposition is the engine of everything that follows. The first term is irreducible (it is the whole point of sampling from a super-population). The second term is the target for variance reduction techniques (Chapter 6). And the 1/n factor is why bigger evals give tighter CIs.
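A quick simulation can make the decomposition concrete. The sketch below assumes binary scores and question difficulties drawn uniformly from [0, 1] (an illustrative choice, not a property of any real eval), and checks that the variance of the raw scores is close to Var(x) + E[σ_i²].

```python
import random
import statistics

rng = random.Random(0)      # fixed seed so the run is reproducible
n_questions = 20_000        # illustrative size, not a real eval

# x_i: probability the model answers question i correctly (assumed U[0, 1])
x = [rng.random() for _ in range(n_questions)]
# s_i: one sampled 0/1 score per question, i.e. s_i = x_i + eps_i
s = [1.0 if rng.random() < xi else 0.0 for xi in x]

var_x = statistics.pvariance(x)                                  # Var(x)
mean_cond_var = statistics.fmean([xi * (1 - xi) for xi in x])    # E[sigma_i^2]
var_s = statistics.pvariance(s)                                  # Var(s)

print(f"Var(x)                = {var_x:.4f}")
print(f"E[sigma_i^2]          = {mean_cond_var:.4f}")
print(f"Var(x) + E[sigma_i^2] = {var_x + mean_cond_var:.4f}")
print(f"Var(s)                = {var_s:.4f}")
```

Under these assumptions the first two numbers should land near 1/12 and 1/6, and their sum near the simulated Var(s).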

Why does treating eval questions as samples from a super-population unlock statistical inference?

Chapter 2: Confidence Intervals for Accuracy

The simplest case: n independent questions, each scored 0 or 1. The model gets s̄ = k/n correct. What is the standard error?

The Bernoulli standard error

When scores are binary (right/wrong), the standard error has a clean form:

SE_Bernoulli = √(s̄(1 − s̄) / n)

Worked example. A model scores 65.5% on MATH (n = 5,000):

SE = √(0.655 × 0.345 / 5000) = √(0.0000452) = 0.0067 = 0.67%

The 95% confidence interval is 65.5% ± 1.96 × 0.67% = (64.2%, 66.8%).

Now consider HumanEval with only n = 164 at 83.6% accuracy:

SE = √(0.836 × 0.164 / 164) = √(0.000836) = 0.0289 = 2.89%

The 95% CI is (77.9%, 89.3%) — an 11-point spread! That "83.6%" could easily be 78% or 89%.
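The two worked examples above can be reproduced in a few lines. This is a minimal sketch using only the standard library; the function names are illustrative, not taken from the paper.

```python
import math

def bernoulli_se(accuracy, n):
    """Standard error of the mean of n independent 0/1 scores."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n)

def normal_ci(estimate, se, z=1.96):
    """Normal-approximation interval: estimate +/- z * SE."""
    return estimate - z * se, estimate + z * se

math_se = bernoulli_se(0.655, 5000)        # MATH example
humaneval_se = bernoulli_se(0.836, 164)    # HumanEval example
math_ci = normal_ci(0.655, math_se)
humaneval_ci = normal_ci(0.836, humaneval_se)

print(f"MATH:      SE = {math_se:.4f}, 95% CI = ({math_ci[0]:.3f}, {math_ci[1]:.3f})")
print(f"HumanEval: SE = {humaneval_se:.4f}, 95% CI = ({humaneval_ci[0]:.3f}, {humaneval_ci[1]:.3f})")
```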

The Wilson interval

For small n or extreme p (near 0 or 1), the normal approximation — s̄ ± 1.96 × SE — can give intervals outside [0, 1]. The Wilson interval corrects this:

CI = (s̄ + z²/(2n) ± z√(s̄(1−s̄)/n + z²/(4n²))) / (1 + z²/n)

For typical eval sizes (n > 100) and moderate accuracy, the normal and Wilson intervals are nearly identical. But if you have a small eval (n < 50) or accuracy > 95%, use Wilson.
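To see where the two intervals part ways, the sketch below compares them on a hypothetical small eval with 29 of 30 questions correct (made-up numbers chosen to sit in the small-n, high-accuracy regime).

```python
import math

def normal_interval(k, n, z=1.96):
    """Normal-approximation interval for k correct out of n."""
    p = k / n
    se = math.sqrt(p * (1 - p) / n)
    return p - z * se, p + z * se

def wilson_interval(k, n, z=1.96):
    """Wilson score interval for k correct out of n."""
    p = k / n
    centre = p + z * z / (2 * n)
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    denom = 1 + z * z / n
    return (centre - half_width) / denom, (centre + half_width) / denom

k, n = 29, 30   # hypothetical small eval
normal_lo, normal_hi = normal_interval(k, n)
wilson_lo, wilson_hi = wilson_interval(k, n)

print(f"normal: ({normal_lo:.3f}, {normal_hi:.3f})")   # upper bound exceeds 1
print(f"wilson: ({wilson_lo:.3f}, {wilson_hi:.3f})")   # stays inside [0, 1]
```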

The CLT standard error

When scores take fractional values (e.g. F1 scores, partial credit), use the general formula from the Central Limit Theorem:

SE_CLT = √( Σ(s_i − s̄)² / (n(n−1)) )

This is just the sample standard deviation divided by √n. The Bernoulli formula is a special case (up to the n versus n−1 divisor). Using SE_Bernoulli on non-binary data gives conservative (too wide) intervals; the paper catches Llama 3 making exactly this mistake.
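A small sketch of the comparison, on made-up scores: on fractional data the Bernoulli formula comes out wider than the CLT formula, while on 0/1 data the two agree up to a factor of √(n/(n−1)).

```python
import math
import statistics

def clt_se(scores):
    """Sample standard deviation divided by sqrt(n)."""
    return statistics.stdev(scores) / math.sqrt(len(scores))

def bernoulli_se(scores):
    """Bernoulli formula applied to the mean score (only valid for 0/1 scores)."""
    mean = statistics.fmean(scores)
    return math.sqrt(mean * (1 - mean) / len(scores))

# Made-up scores for illustration only
partial_credit = [1.0, 0.5, 0.75, 0.0, 1.0, 0.25, 0.5, 1.0, 0.75, 0.5]
binary = [1, 0, 1, 1, 0, 1, 1, 1, 0, 1]

print(f"partial credit: CLT = {clt_se(partial_credit):.4f}, "
      f"Bernoulli = {bernoulli_se(partial_credit):.4f}")
print(f"binary:         CLT = {clt_se(binary):.4f}, "
      f"Bernoulli = {bernoulli_se(binary):.4f}")
```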

[Interactive figure: Confidence Interval Calculator, showing how accuracy and sample size affect the CI width.]

Rule of thumb: For a binary eval at ~50% accuracy, you need about n = 1,000 questions to get a 95% CI of roughly ±3 percentage points. At n = 100, the CI is about ±10 points, nearly useless for distinguishing models.
A model scores 90% on an eval with 200 questions. What is the approximate 95% confidence interval?

Chapter 3: Clustering

Many evals group questions into clusters. DROP has passages with multiple questions each. MGSM translates each question into 10 languages. RACE has several questions per reading passage. Within a cluster, questions are not independent — if the model understands the passage, it likely gets all related questions right.

This violates the independence assumption of the CLT. The naive standard error pretends you have n independent data points, but you really have far fewer independent observations.

Think of it this way: if a passage has 10 questions and the model either understands the passage or doesn't, those 10 scores are essentially 10 copies of one coin flip. You wrote down 10 data points, but you only collected 1 bit of information. The naive formula thinks it has 10x more data than it does.

The clustered standard error

Let s_{i,c} be the score on question i within cluster c. The cluster-adjusted SE is:

SE_clustered² = SE_CLT² + (1/n²) Σ_c Σ_i Σ_{j≠i} (s_{i,c} − s̄)(s_{j,c} − s̄)

The extra term captures within-cluster correlations. When all questions in a cluster are perfectly correlated, each cluster acts as a single observation — and the SE can be much larger.
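The formula translates fairly directly into code. The sketch below is one possible implementation; it uses the identity Σ_i Σ_{j≠i} a_i a_j = (Σ a_i)² − Σ a_i² to avoid the double loop, and the toy data (four clusters of five identical scores) is made up to show the perfectly correlated extreme.

```python
import math
from collections import defaultdict

def naive_se(scores):
    """CLT standard error that treats every score as independent."""
    n = len(scores)
    mean = sum(scores) / n
    return math.sqrt(sum((s - mean) ** 2 for s in scores) / (n * (n - 1)))

def clustered_se(scores, cluster_ids):
    """Naive variance plus the within-cluster cross terms."""
    n = len(scores)
    mean = sum(scores) / n
    residuals_by_cluster = defaultdict(list)
    for s, c in zip(scores, cluster_ids):
        residuals_by_cluster[c].append(s - mean)
    cross = 0.0
    for residuals in residuals_by_cluster.values():
        total = sum(residuals)
        # sum over i != j of r_i * r_j == (sum r)^2 - sum r^2
        cross += total * total - sum(r * r for r in residuals)
    return math.sqrt(naive_se(scores) ** 2 + cross / (n * n))

# Toy data: within each cluster, every question gets the same score
scores = [1.0] * 5 + [0.0] * 5 + [1.0] * 5 + [0.0] * 5
cluster_ids = [0] * 5 + [1] * 5 + [2] * 5 + [3] * 5

ratio = clustered_se(scores, cluster_ids) / naive_se(scores)
print(f"naive SE     = {naive_se(scores):.4f}")
print(f"clustered SE = {clustered_se(scores, cluster_ids):.4f}")
print(f"ratio        = {ratio:.2f}")
```

If every question is placed in its own cluster, the cross terms vanish and the function returns the naive SE.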

Worked example: real numbers from the paper

| Eval | SE clustered | SE naive | Ratio |
| --- | --- | --- | --- |
| DROP (588 clusters, 9622 Qs) | 1.34 | 0.44 | 3.05x |
| RACE-H (1045 clusters, 3498 Qs) | 0.51% | 0.46% | 1.10x |
| MGSM (250 clusters, 2500 Qs) | 1.62% | 0.86% | 1.88x |

On DROP, the real uncertainty is 3x larger than the naive estimate. If you reported the naive CI, you would think your measurement was much more precise than it actually is.

[Interactive figure: Clustering Effect Simulator. The CI balloons when intra-cluster correlation is high.]

The sliding scale: Clustered SE interpolates between two extremes. At zero within-cluster correlation, SE_clustered = SE_naive. At perfect correlation, each cluster is one observation, and the effective sample size is the number of clusters, not questions.

The design effect quantifies this inflation: DEFF = 1 + (m−1)ρ, where m is cluster size and ρ is intra-cluster correlation. The effective sample size is n/DEFF. For DROP with ~16 questions per passage and high within-passage correlation, the design effect can be 9 or more — meaning 9,622 questions carry the information of only ~1,000 independent observations.
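The DROP arithmetic can be checked from the numbers already quoted. The sketch below backs the design effect out of the SE ratio (a design effect is a variance ratio, so it is the SE ratio squared) and then solves DEFF = 1 + (m−1)ρ for ρ. The implied ρ assumes equal-sized clusters, which is a simplification.

```python
def design_effect(cluster_size, rho):
    """DEFF = 1 + (m - 1) * rho for clusters of size m."""
    return 1.0 + (cluster_size - 1.0) * rho

# DROP figures quoted in the text
n_questions, n_clusters = 9622, 588
se_clustered, se_naive = 1.34, 0.44

deff = (se_clustered / se_naive) ** 2          # variance inflation
n_effective = n_questions / deff               # equivalent independent questions
mean_cluster_size = n_questions / n_clusters   # about 16 questions per passage
# Assumption: all clusters have the mean size
implied_rho = (deff - 1.0) / (mean_cluster_size - 1.0)

print(f"design effect         = {deff:.2f}")
print(f"effective sample size = {n_effective:.0f}")
print(f"implied rho           = {implied_rho:.2f}")
```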

When should you cluster? According to Abadie et al. (2022): whenever the sampling or assignment mechanism operates at the cluster level. If passages were sampled (not individual questions), cluster. If the same question appears in 10 languages, cluster by question. The rule is simple: cluster at the level of independent sampling.
An eval has 1,000 questions in 100 clusters of 10. The naive SE is 1.5%. If within-cluster correlation is high, the clustered SE could be closest to:

Chapter 4: Comparing Two Models

You rarely care about a single model's score in isolation. You care about: is Model A better than Model B? There are two ways to test this.

Unpaired analysis (naive)

If models were evaluated on different question sets, combine their SEs in quadrature:

SE_{A−B} = √(SE_A² + SE_B²)

Paired analysis (better)

If both models answer the same questions (the usual case), you can do much better. Compute the per-question difference d_i = s_{A,i} − s_{B,i}, then take the SE of those differences:

SE_paired = √( Σ(d_i − d̄)² / (n(n−1)) )

Why is this better? Because the variance of the difference is:

Var(d) = Var(s_A) + Var(s_B) − 2 Cov(s_A, s_B)

The covariance term is positive whenever both models agree on which questions are easy and which are hard — which is almost always true. This subtracts variance, giving a tighter CI.

Worked example

Galleon scores 65.5% on MATH, Dreadnought 63.0%. Both answer all 5,000 questions. Individually, each has SE ≈ 0.67%. The unpaired SE of the difference would be:

SE_unpaired = √(0.67² + 0.67²) = 0.95%

But if the per-question score correlation is 0.5, the paired SE is:

SE_paired = √(0.67² + 0.67² − 2 × 0.67 × 0.67 × 0.5) = 0.67%

The 95% CI on the 2.5% gap is 2.5% ± 1.96 × 0.67% = (1.2%, 3.8%). The interval is entirely positive — this is a statistically significant difference. The paired test gives a 30% narrower CI, which can be the difference between "significant" and "inconclusive."
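The worked example can be reproduced from the individual SEs and the correlation, and the same quantity can be computed directly from per-question scores. In the sketch below the function names and the ten-question score lists are made up for illustration.

```python
import math
import statistics

def se_of_difference(se_a, se_b, correlation=0.0):
    """SE of (A - B) from individual SEs and per-question score correlation."""
    return math.sqrt(se_a ** 2 + se_b ** 2 - 2 * correlation * se_a * se_b)

def paired_se(scores_a, scores_b):
    """SE of the mean per-question difference d_i = s_A,i - s_B,i."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    return statistics.stdev(diffs) / math.sqrt(len(diffs))

# Worked example from the text (values in percentage points)
unpaired = se_of_difference(0.67, 0.67)
paired = se_of_difference(0.67, 0.67, correlation=0.5)
gap = 2.5
ci = (gap - 1.96 * paired, gap + 1.96 * paired)
print(f"unpaired SE = {unpaired:.2f}, paired SE = {paired:.2f}")
print(f"95% CI on the gap = ({ci[0]:.1f}, {ci[1]:.1f})")

# Hypothetical per-question 0/1 scores for two models on the same questions
model_a = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]
model_b = [1, 0, 0, 1, 0, 1, 1, 0, 1, 0]
print(f"paired SE from scores = {paired_se(model_a, model_b):.4f}")
```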

[Interactive figure: Paired vs. Unpaired Testing. Higher correlation = bigger advantage for paired testing.]

McNemar's test for binary data

When scores are binary (right/wrong), the paired analysis has a classical equivalent: McNemar's test. Categorize each question into a 2x2 table:

|  | B correct | B wrong |
| --- | --- | --- |
| A correct | Both right (a) | Only A right (b) |
| A wrong | Only B right (c) | Both wrong (d) |

The cells (a) and (d) — where both models agree — carry no information about which is better. All the signal is in the discordant cells b and c. McNemar's test statistic is:

χ² = (b − c)² / (b + c)

Worked example. 5,000 MATH questions. Suppose: both right = 3,000, only Galleon right = 275, only Dreadnought right = 150, both wrong = 1,575. Then b = 275, c = 150:

χ² = (275 − 150)² / (275 + 150) = 125² / 425 = 36.76

With 1 degree of freedom, χ² > 3.84 is significant at p < 0.05. At 36.76, this is highly significant (p < 0.001). Note: only the 425 discordant questions out of 5,000 contribute; the 4,575 on which both models agree (both right or both wrong) are irrelevant.
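A minimal sketch of the test on the counts above. The p-value uses the fact that a χ² variable with 1 degree of freedom is a squared standard normal, so its upper tail is erfc(√(χ²/2)); no continuity correction is applied, matching the formula in the text.

```python
import math

def mcnemar(b, c):
    """McNemar chi-square statistic and p-value (1 degree of freedom)."""
    chi2 = (b - c) ** 2 / (b + c)
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value

# 2x2 counts from the worked example (A = Galleon, B = Dreadnought)
both_right, only_a, only_b, both_wrong = 3000, 275, 150, 1575
n = both_right + only_a + only_b + both_wrong

accuracy_a = (both_right + only_a) / n
accuracy_b = (both_right + only_b) / n
chi2, p_value = mcnemar(only_a, only_b)

print(f"accuracy A = {accuracy_a:.3f}, accuracy B = {accuracy_b:.3f}")
print(f"chi2 = {chi2:.2f}, p = {p_value:.2e}")
```

The counts are consistent with the headline scores: they imply 65.5% for Galleon and 63.0% for Dreadnought.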

Key recommendation: Always use paired tests when two models answer the same questions. It is free variance reduction. Report the pairwise difference, its SE, the 95% CI, and the score correlation between models.
Two models have individual SEs of 2% each, and their per-question scores have correlation 0.8. What is the paired SE of the difference?

Chapter 5: Sample Size Planning

Before running an eval, ask: how many questions do I need to reliably detect a δ-point improvement? This is power analysis: standard in clinical trials, rare in ML.

The sample size formula

Inputs: significance level α (Type I error), power 1−β (probability of detecting a real effect), and minimum detectable effect δ. For a paired test with K_A and K_B resamples per question for models A and B:

n = (z_{α/2} + z_β)² (ω² + σ_A²/K_A + σ_B²/K_B) / δ²

Where ω² = Var(x_A) + Var(x_B) − 2Cov(x_A, x_B) captures the variance of question-level score differences, and the σ_A² and σ_B² terms capture each model's conditional (sampling) variance.

Worked example

Suppose continuous scores with zero conditional variance (σ² = 0), ω² = 1/9 (from uniform scores with correlation 0.5). We want to detect δ = 3% with 80% power at α = 0.05:

n = (1.96 + 0.84)² × (1/9) / (0.03)² = 7.84 × 0.111 / 0.0009 ≈ 969 (using unrounded z-values; the rounded figures shown give about 968)

You need about 1,000 questions to detect a 3-point difference. This is why the paper recommends new evals have at least 1,000 questions.
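The sample-size formula is easy to wrap in a helper. This sketch uses `statistics.NormalDist` for the z-values; the function name and its defaults are illustrative choices.

```python
import math
from statistics import NormalDist

def required_questions(delta, omega_sq, sigma_sq_a=0.0, sigma_sq_b=0.0,
                       k_a=1, k_b=1, alpha=0.05, power=0.80):
    """Questions needed to detect a difference of delta with a paired test."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # two-sided critical value
    z_beta = NormalDist().inv_cdf(power)            # power = 1 - beta
    variance = omega_sq + sigma_sq_a / k_a + sigma_sq_b / k_b
    return math.ceil((z_alpha + z_beta) ** 2 * variance / delta ** 2)

# Worked example: omega^2 = 1/9, no conditional variance, delta = 3 points
n_needed = required_questions(delta=0.03, omega_sq=1 / 9)
print(n_needed)
```

Because n scales with 1/δ², halving the effect you want to detect quadruples the number of questions required.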

Inverting the formula gives the minimum detectable effect (MDE) for a fixed n:

δ = (z_{α/2} + z_β) √((ω² + σ_A²/K + σ_B²/K) / n)

The MDE tells you: "given this eval's size and noise level, what is the smallest real difference I could reliably detect?" If the MDE is 8% but you care about 2% differences, the eval is too small — do not run it and pretend the results are informative.

Second example: effect of resampling on MDE. With σ² = 1/6 (binary scores), ω² = 1/9, n = 198 questions. At K = 1 (no resampling):

δ = 2.80 × √((1/9 + 1/6 + 1/6) / 198) = 2.80 × √(0.444/198) = 2.80 × 0.0474 = 13.3%

At K = 10:

δ = 2.80 × √((1/9 + 1/60 + 1/60) / 198) = 2.80 × √(0.144/198) = 2.80 × 0.0270 = 7.6%

Resampling 10x cut the MDE nearly in half — from 13.3% to 7.6%. But the remaining 7.6% is dominated by ω² (question difficulty variance), which no amount of resampling can touch. To go lower, you need more questions.
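The MDE calculation, including the floor that resampling cannot get below, can be sketched as follows. The function name is illustrative, and passing K = infinity is simply a convenient way to zero out the conditional-variance terms.

```python
import math
from statistics import NormalDist

def minimum_detectable_effect(n, omega_sq, sigma_sq_a, sigma_sq_b,
                              k=1, alpha=0.05, power=0.80):
    """Smallest difference detectable with the given power on n questions."""
    z = NormalDist().inv_cdf(1 - alpha / 2) + NormalDist().inv_cdf(power)
    variance = omega_sq + sigma_sq_a / k + sigma_sq_b / k
    return z * math.sqrt(variance / n)

# Numbers from the second example
params = dict(n=198, omega_sq=1 / 9, sigma_sq_a=1 / 6, sigma_sq_b=1 / 6)
mde_k1 = minimum_detectable_effect(k=1, **params)
mde_k10 = minimum_detectable_effect(k=10, **params)
mde_floor = minimum_detectable_effect(k=math.inf, **params)   # omega^2 only

print(f"K = 1:   {mde_k1:.1%}")
print(f"K = 10:  {mde_k10:.1%}")
print(f"K = inf: {mde_floor:.1%}")
```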

[Interactive figure: Power Analysis Calculator, showing how many questions are needed as a function of the minimum detectable effect. Gray zone = infeasible (would need more questions than any existing eval).]

Translation: An eval with 200 questions can only detect differences of ~10%+ between models. If you need to measure 2-3 point improvements (common in LLM development), you need 1,000+ questions. Many popular evals are far too small for the claims made about them.
You have an eval with 400 questions and want to detect a 5% difference at 80% power. Is this feasible? (Assume ω² = 1/9, σ² = 0.)

Chapter 6: Variance Reduction

The total estimation variance has two parts. Only one is under your control:

Var(μ̂) = (Var(x) + E[σ_i²]) / n

Strategy 1: Resampling (answer K times)

Have the model answer each question K times. Use the mean score per question. The conditional variance becomes σ_i²/K.

Worked example. Binary scores, uniform difficulty (x ~ U[0,1]). Then Var(x) = 1/12 and E[σ_i²] = E[x(1−x)] = 1/6 (the average Bernoulli variance across difficulty levels). Going from K = 1 to K = 2:

Var(μ̂|K=1) = (1/12 + 1/6) / n = (1/4) / n
Var(μ̂|K=2) = (1/12 + 1/12) / n = (1/6) / n

That is a 33% reduction in variance by just asking twice. K = 4 gives 50% reduction. K = 6 gives 56%. But returns diminish quickly — once E[σ²]/K is much smaller than Var(x), more resampling is wasted effort. The theoretical limit is a 67% reduction (eliminating all conditional variance), achievable with next-token probabilities.
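The percentages above follow from exact fractions. A short sketch, assuming the same uniform-difficulty setup (Var(x) = 1/12, E[σ²] = 1/6):

```python
from fractions import Fraction

VAR_X = Fraction(1, 12)          # question-difficulty variance, x ~ U[0, 1]
MEAN_COND_VAR = Fraction(1, 6)   # E[x(1 - x)] for the same distribution

def variance_reduction(k):
    """Fractional drop in Var(mu_hat) from K = 1 to K = k resamples."""
    baseline = VAR_X + MEAN_COND_VAR
    resampled = VAR_X + MEAN_COND_VAR / k
    return 1 - resampled / baseline

for k in (2, 4, 6, 100):
    reduction = variance_reduction(k)
    print(f"K = {k:>3}: {reduction} ({float(reduction):.0%})")
```

The reduction approaches but never reaches 2/3, the share of total variance that is conditional in this setup.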

Critical warning: do not pool resamples. If you answer K = 5 times and have n = 1,000 questions, do NOT compute the SE across all 5,000 answers. That would violate independence (5 answers to the same question are correlated). Instead, average each question's K answers into a single score, then compute the SE across the 1,000 question-level means. Equivalently, treat resamples as a special case of clustering (Chapter 3).
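The pooling mistake is easy to demonstrate on simulated data. The sketch below assumes uniform question difficulty and binary answers (illustrative assumptions, as is the eval size) and compares the pooled SE with the SE computed over question-level means.

```python
import math
import random
import statistics

rng = random.Random(0)        # fixed seed for reproducibility
n_questions, k = 500, 5       # illustrative eval size and resample count

# Assumed difficulties x_i ~ U[0, 1]; each question is answered k times
x = [rng.random() for _ in range(n_questions)]
answers = [[1.0 if rng.random() < xi else 0.0 for _ in range(k)] for xi in x]

# Wrong: pool all k * n answers as if they were independent
pooled = [a for row in answers for a in row]
pooled_se = statistics.stdev(pooled) / math.sqrt(len(pooled))

# Right: average within each question, then take the SE across questions
question_means = [statistics.fmean(row) for row in answers]
question_level_se = statistics.stdev(question_means) / math.sqrt(n_questions)

print(f"pooled SE (too small) = {pooled_se:.4f}")
print(f"question-level SE     = {question_level_se:.4f}")
```

The pooled figure understates the uncertainty because it divides by √(K·n) even though only n questions were sampled.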

Strategy 2: Next-token probabilities

For multiple-choice evals without chain-of-thought: instead of sampling a token and checking if it is correct, read the probability of the correct token directly. This sets ε_i = 0 and completely eliminates conditional variance. The score becomes s_i = x_i = p_i.

This achieves the theoretical maximum variance reduction — equivalent to K = ∞.

Example: the model assigns p_i = 0.73 to the correct answer for question i. Instead of flipping a coin and recording 0 or 1, record 0.73 as the score. No sampling noise at all. The variance of μ̂ drops from (Var(x) + E[σ²])/n to Var(x)/n alone, a 2/3 reduction in the uniform-difficulty case.

When to use which: Next-token probabilities when available and no chain-of-thought is needed. Resampling (K = 2-6) when you must generate full answers. In both cases, compute the SE across question-level scores, not across all K*n individual answers (that would violate independence).

Do NOT reduce temperature

It is tempting to set T = 0 to reduce noise. But this can increase total variance.

Worked example. Question difficulty uniform x ~ U[0,1]. At T = 1: Var(x) = 1/12, E[σ²] = 1/6, for a total of 1/4. At T = 0: scores become binary (1 if x > 0.5, else 0). Now Var(x_{T=0}) = 1/4. The conditional variance disappears but question-difficulty variance triples, so the total is still 1/4 and nothing has been gained.

Even worse: temperature changes can bias the estimator. If x ~ U[1/3, 1], then E[x_{T=1}] = 2/3 but E[x_{T=0}] = 3/4. The mean score shifts by about 8 percentage points, and the question-difficulty variance Var(x) increases five-fold (from 1/27 to 3/16).
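Both temperature examples can be verified with exact fractions. The sketch below follows the simplification used above: at T = 0 the model is assumed to answer correctly exactly when x > 1/2.

```python
from fractions import Fraction

HALF = Fraction(1, 2)

def sampled_stats(lo, hi):
    """Mean and variance of x ~ U[lo, hi], the expected score at T = 1."""
    return (lo + hi) / 2, (hi - lo) ** 2 / 12

def greedy_stats(lo, hi):
    """Mean and variance of the T = 0 score, assumed to be 1 if x > 1/2 else 0."""
    if hi <= HALF:
        p = Fraction(0)
    else:
        p = (hi - max(lo, HALF)) / (hi - lo)   # P(x > 1/2)
    return p, p * (1 - p)

for lo, hi in [(Fraction(0), Fraction(1)), (Fraction(1, 3), Fraction(1))]:
    mean_t1, var_t1 = sampled_stats(lo, hi)
    mean_t0, var_t0 = greedy_stats(lo, hi)
    print(f"x ~ U[{lo}, {hi}]: mean {mean_t1} -> {mean_t0}, "
          f"Var(x) {var_t1} -> {var_t0}")
```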

[Figure: Resampling: Diminishing Returns. The curve shows total variance relative to K=1 as you increase resamples per question. The floor is set by Var(x), which resampling cannot reduce.]

Don't touch the thermostat. Reducing temperature shifts conditional variance into question-difficulty variance, and can even bias the estimator. Use resampling or next-token probabilities instead.
Why does resampling (K > 1) have diminishing returns?

Chapter 7: What Llama 3 Got Wrong

The Llama 3 technical report (Dubey et al., 2024) was notable for being one of the first industry reports to include confidence intervals on eval scores. This is laudable. But the paper identifies two systematic errors in their approach:

Error 1: Too narrow (ignored clustering)

For clustered evals like DROP, MGSM, and RACE, Llama 3 used the naive SE formula, ignoring within-cluster correlations. As we saw in Chapter 3, this can understate the true SE by 3x. Their reported CIs were anti-conservative — they appeared more precise than justified.

Real-world impact: On DROP, the naive SE was 0.44 but the clustered SE was 1.34 — over 3x larger. A difference that looks statistically significant under the naive CI might be completely consistent with noise under the correct clustered CI.

Error 2: Too wide (used Bernoulli for non-binary scores)

For evals with fractional scores (like F1 on DROP), Llama 3 used SE_Bernoulli = √(s̄(1−s̄)/n) even though scores were not binary. The Bernoulli formula assumes maximum variance for a given mean: it treats every score as 0 or 1. When scores are actually continuous (e.g., F1 scores between 0 and 1), the true variance is smaller, so the Bernoulli SE is conservative (too wide).

The correct approach: use SE_CLT computed from the actual sample variance of the scores.

Worked example: DROP

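DROP uses F1 scores (continuous, 0 to 1) with 588 passage clusters containing 9,622 questions total. Suppose a model scores 87.1. The Bernoulli formula then gives √(0.871 × 0.129 / 9622) ≈ 0.34.

The DROP figures from Chapter 3 are a naive CLT SE of 0.44 (too small without clustering) and a clustered SE of 1.34. Those two figures need not come from a model scoring exactly 87.1, so 0.34 and 0.44 are not directly comparable: on the same [0, 1] scores, the Bernoulli SE cannot be noticeably smaller than the CLT SE. What matters is the scale: either unclustered number is a small fraction of the clustered one. The two Llama 3 errors go in opposite directions on this eval (Bernoulli widens the interval, ignoring clustering narrows it), but the clustering error dominates. The net effect: still dangerously anti-conservative.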

The net effect

Ironically, these two errors sometimes offset: too-narrow from ignoring clustering, too-wide from using Bernoulli. But they offset by accident, not design, and the degree varies per eval. On MGSM (binary scores, clustered), only error 1 applies — CIs are purely anti-conservative. On non-clustered evals with continuous scores, only error 2 applies — CIs are purely conservative.

The lesson: Getting CIs right requires attention to both the score type (binary vs. continuous) and the sampling structure (independent vs. clustered). Using the wrong formula can be worse than no CI at all, because it gives false confidence in precision.
Llama 3 reported CIs for DROP (a clustered eval with F1 scores). Which combination of errors did they make?

Chapter 8: Practical Recommendations

The paper distills its analysis into concrete, actionable recommendations for anyone running or reporting LLM evals:

For reporting results

  1. Always report standard errors alongside mean scores, in parentheses below the mean (e.g., "65.5% (0.7%)"). This is standard practice in economics and medicine — ML should adopt it.
  2. Report the number of questions in each eval. Without this, readers cannot assess precision.
  3. Use clustered SEs for any eval with grouped questions (DROP, RACE, QuAC, SQuAD, MGSM). Report the number of clusters alongside the question count.
  4. Use SE_CLT, not SE_Bernoulli, unless scores are genuinely binary. The Bernoulli formula is a special case; using it on continuous scores gives inflated intervals.

For comparing models

  1. Use paired tests whenever two models answer the same questions. Report pairwise differences, paired SEs, 95% CIs, and the score correlation.
  2. Use clustered paired SEs on evals with grouped questions.
  3. Test for significance before claiming one model is better than another. If the 95% CI of the difference includes zero, the difference is not statistically significant.

For designing evals

  1. Run power analysis first. Use the sample-size formula to determine how many questions you need to detect the improvement you care about.
  2. Aim for 1,000+ questions as a baseline. Evals with ~100 questions can only detect very large differences.
  3. Consider resampling (K = 2-4) or using next-token probabilities to reduce conditional variance.
  4. Do not adjust temperature for the sake of variance reduction.

What to do right now

If you are writing a technical report today, do these three things:

  1. Compute SE_CLT from your per-question scores (not SE_Bernoulli unless scores are truly 0/1). Report it in parentheses.
  2. For any clustered eval (DROP, RACE, QuAC, SQuAD, MGSM), compute the clustered SE. It is a few extra lines of code.
  3. For model comparisons, compute per-question differences and report the paired SE. Include the score correlation between models.

These three steps would transform the interpretability of every leaderboard result. They cost essentially nothing — a few lines of NumPy after the eval run.

The cheat sheet

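| Scenario | Formula |
| --- | --- |
| Single model, binary, independent | SE = √(p(1−p)/n) |
| Single model, continuous, independent | SE = sd(s) / √n |
| Single model, clustered | SE_clustered (formula in Chapter 3) |
| Two models, unpaired | SE = √(SE_A² + SE_B²) |
| Two models, paired | SE = sd(d_i) / √n |
| Sample size (paired) | n = (z_{α/2} + z_β)² ω² / δ² (add σ_A²/K_A + σ_B²/K_B to ω² when conditional variance is nonzero) |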
You are writing a technical report comparing your new model against a baseline on MGSM (multilingual eval, 250 clusters of 10 questions each). What should you report?

Chapter 9: Connections

This paper sits at the intersection of classical statistics and modern ML evaluation practice. Here is how it connects to the broader landscape:

The statistics it draws on

In the LLM evaluation landscape

Broader context

This paper is part of a growing movement to professionalize LLM evaluation. As the field matures, "highest number wins" is not sustainable. A/B testing in industry already uses these tools — the same rigor should apply when comparing language models.

The recommendations are simple, free to implement, and would immediately improve the interpretability of every technical report and leaderboard. Every eval framework should compute and display SEs by default. Every leaderboard should show error bars. Every technical report should include pairwise analysis when comparing models. These are not novel statistical techniques — they are intro-level statistics applied with care to a domain that has been ignoring them.

Key formulas reference card

Single model: SE = √(s̄(1-s̄)/n) for binary, sd(s)/√n for continuous
Clustered: Add within-cluster cross-terms to naive variance
Paired difference: SE = sd(d_i)/√n, where d_i = s_{A,i} − s_{B,i}
Sample size: n = (z_{α/2} + z_β)² ω² / δ²
95% CI: estimate ± 1.96 × SE
What is the single most impactful change this paper recommends for the LLM evaluation community?