Error Bars for Evals

Chapter 0: The Problem

Two fictional language models, Galleon and Dreadnought, are evaluated on three benchmarks:

Eval	# Questions	Galleon	Dreadnought	Difference
MATH	5,000	65.5%	63.0%	+2.5%
HumanEval	164	83.6%	86.7%	−3.1%
MGSM	2,500	75.3%	78.0%	−2.7%

At face value, Dreadnought wins two of three evals. Case closed?

Not so fast. Look at HumanEval: only 164 questions. At that sample size each accuracy score carries a standard error of roughly 3 percentage points, so a 3.1-point gap is indistinguishable from noise. Meanwhile the MATH result (5,000 questions, 2.5-point gap) is far more trustworthy: each score there has a standard error under 0.7 points. Without error bars, you cannot tell these two situations apart.

This is not a hypothetical problem. Industry practice today is to bold the highest number, put it in a press release, and move on. Leaderboards rank models to the tenth of a percent. Technical reports trumpet a 0.5% improvement as evidence of superior architecture. But none of this means anything without a measure of uncertainty.

[Interactive figure: Eval Scores, With and Without Error Bars. The bare numbers look definitive; the error bars tell a different story.]

The core problem: Current LLM evaluation practice is "highest number is best" — bold the SOTA, move on. No error bars, no significance tests, no sample size justification. This paper brings the statistics that every other experimental science has used for decades.

HumanEval has 164 questions, MATH has 5,000. A 3% difference on HumanEval vs. a 2.5% difference on MATH — which is more likely to be statistically significant?

HumanEval, because 3% is a bigger gap
MATH, because with 30x more questions the standard error is much smaller, so the 2.5% gap is many standard errors away from zero
Neither: you cannot determine significance from this information

Chapter 1: The Key Insight

The fundamental conceptual move in this paper is simple but powerful:

Treat eval questions as random samples from a super-population. The 5,000 MATH questions are not all possible math questions — they are a sample from the infinite universe of math questions you could ask. The eval score is an estimate of a true underlying ability, and that estimate has sampling uncertainty.

Once you accept this framing, standard statistical machinery applies directly:

Central Limit Theorem gives you standard errors
Confidence intervals quantify uncertainty in a single score
Paired tests compare two models rigorously
Power analysis tells you how many questions you need

Decompose the score on question i into a conditional mean (how hard the question is, on average) and a random component (noise from sampling the model's output):

s_i = x_i + ε_i

Here x_i is the expected score on question i (e.g. the probability the model gets it right), and ε_i is zero-mean noise. The true eval score μ is the expected value over the super-population: μ = E[s] = E[x]. We estimate it with the sample mean, μ̂ = s̄.

Two sources of variance drive the uncertainty:

Var(x): Variance of question difficulty across the super-population. Some questions are easy (x_i near 1), some are hard (x_i near 0). This variance is a property of the eval and cannot be reduced.
E[σ_i²]: Mean conditional variance — even for a fixed question, the model's sampled output varies (temperature, nucleus sampling). This noise can be reduced by resampling or using next-token probabilities.

By the law of total variance, these add:

Var(μ̂) = Var(s)/n = (Var(x) + E[σ_i²]) / n

This decomposition is the engine of everything that follows. The first term is irreducible (it is the whole point of sampling from a super-population). The second term is the target for variance reduction techniques (Chapter 6). And the 1/n factor is why bigger evals give tighter CIs.

Why does treating eval questions as samples from a super-population unlock statistical inference?

Because the Central Limit Theorem applies to sample means from any distribution with finite variance, giving us standard errors and confidence intervals for the true eval score
Because it makes the questions easier
Because super-populations always follow a normal distribution

Chapter 2: Confidence Intervals for Accuracy

The simplest case: n independent questions, each scored 0 or 1. The model gets s̄ = k/n correct. What is the standard error?

The Bernoulli standard error

When scores are binary (right/wrong), the standard error has a clean form:

SE_Bernoulli = √(s̄(1 − s̄) / n)

Worked example. A model scores 65.5% on MATH (n = 5,000):

SE = √(0.655 × 0.345 / 5000) = √(0.0000452) = 0.0067 = 0.67%

The 95% confidence interval is 65.5% ± 1.96 × 0.67% = (64.2%, 66.8%).

Now consider HumanEval with only n = 164 at 83.6% accuracy:

SE = √(0.836 × 0.164 / 164) = √(0.000836) = 0.0289 = 2.89%

The 95% CI is (77.9%, 89.3%) — an 11-point spread! That "83.6%" could easily be 78% or 89%.
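The two calculations above can be reproduced in a few lines. This is a minimal sketch using only the Python standard library; the function names bernoulli_se and normal_ci are illustrative choices, not names from the paper.

```python
import math


def bernoulli_se(accuracy: float, n: int) -> float:
    """Standard error of the mean of n independent 0/1 scores."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n)


def normal_ci(accuracy: float, n: int, z: float = 1.96) -> tuple:
    """Normal-approximation interval: accuracy +/- z * SE (z = 1.96 for 95%)."""
    half_width = z * bernoulli_se(accuracy, n)
    return accuracy - half_width, accuracy + half_width


# Figures from the two worked examples above.
math_se = bernoulli_se(0.655, 5000)
math_ci = normal_ci(0.655, 5000)
humaneval_se = bernoulli_se(0.836, 164)
humaneval_ci = normal_ci(0.836, 164)

print(f"MATH:      SE = {math_se:.2%}, CI = ({math_ci[0]:.1%}, {math_ci[1]:.1%})")
print(f"HumanEval: SE = {humaneval_se:.2%}, CI = ({humaneval_ci[0]:.1%}, {humaneval_ci[1]:.1%})")
```

Because the SE shrinks with the square root of n, the 30x larger MATH eval yields an interval roughly four times narrower than HumanEval's.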

The Wilson interval

For small n or extreme p (near 0 or 1), the normal approximation — s̄ ± 1.96 × SE — can give intervals outside [0, 1]. The Wilson interval corrects this:

CI = (s̄ + z²/(2n) ± z√(s̄(1−s̄)/n + z²/(4n²))) / (1 + z²/n)

For typical eval sizes (n > 100) and moderate accuracy, the normal and Wilson intervals are nearly identical. But if you have a small eval (n < 50) or accuracy > 95%, use Wilson.
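To see where the two intervals part ways, the sketch below implements the Wilson formula as written above and compares it with the normal approximation. The small eval (39 of 40 correct) is a made-up example chosen to push the normal interval past 1; the function names are illustrative.

```python
import math


def normal_interval(k: int, n: int, z: float = 1.96) -> tuple:
    """Normal approximation: p +/- z * sqrt(p(1-p)/n)."""
    p = k / n
    half = z * math.sqrt(p * (1 - p) / n)
    return p - half, p + half


def wilson_interval(k: int, n: int, z: float = 1.96) -> tuple:
    """Wilson score interval for k successes out of n binary trials."""
    p = k / n
    denom = 1.0 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


# Hypothetical small eval with high accuracy: 39 of 40 correct.
small_normal = normal_interval(39, 40)
small_wilson = wilson_interval(39, 40)

# MATH-sized eval from the worked example: 3,275 of 5,000 correct (65.5%).
large_normal = normal_interval(3275, 5000)
large_wilson = wilson_interval(3275, 5000)

print("n = 40:   normal", small_normal, "Wilson", small_wilson)
print("n = 5000: normal", large_normal, "Wilson", large_wilson)
```

On the small eval the normal interval's upper end exceeds 100%, while the Wilson interval stays inside [0, 1]; on the large eval the two agree to within a fraction of a percentage point.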

The CLT standard error

When scores take fractional values (e.g. F1 scores, partial credit), use the general formula from the Central Limit Theorem:

SE_CLT = √( Σ(s_i − s̄)² / (n(n−1)) )

This is just the sample standard deviation divided by √n. The Bernoulli formula is a special case. Using SE_Bernoulli on non-binary data gives conservative (too wide) intervals — the paper catches Llama 3 making exactly this mistake.
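A short sketch of both claims, on made-up score lists: on 0/1 data the CLT formula matches the Bernoulli formula up to a factor of √(n/(n−1)), and on partial-credit data with the same mean the Bernoulli formula overstates the SE. The data and function names are illustrative assumptions.

```python
import math


def clt_se(scores) -> float:
    """General standard error: sqrt( sum (s_i - mean)^2 / (n (n - 1)) )."""
    n = len(scores)
    mean = sum(scores) / n
    return math.sqrt(sum((s - mean) ** 2 for s in scores) / (n * (n - 1)))


def bernoulli_se(scores) -> float:
    """Bernoulli formula applied to the mean, valid only for 0/1 scores."""
    n = len(scores)
    mean = sum(scores) / n
    return math.sqrt(mean * (1 - mean) / n)


# Hypothetical binary eval: 131 of 200 correct (mean 0.655).
binary = [1.0] * 131 + [0.0] * 69

# Hypothetical partial-credit eval with the same mean (0.655).
partial = [1.0] * 100 + [0.5] * 62 + [0.0] * 38

print("binary : CLT", clt_se(binary), "Bernoulli", bernoulli_se(binary))
print("partial: CLT", clt_se(partial), "Bernoulli", bernoulli_se(partial))
```

The partial-credit list has less spread than a 0/1 list with the same mean, which is why the Bernoulli formula is too wide there.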

[Interactive figure: Confidence Interval Calculator, showing how accuracy and sample size affect the CI width.]

Rule of thumb: For a binary eval at ~50% accuracy, you need about n = 1,000 questions to bring the 95% CI down to roughly ±3 percentage points. At n = 100, the CI is about ±10 points, which is nearly useless for distinguishing models.

A model scores 90% on an eval with 200 questions. What is the approximate 95% confidence interval?

SE = sqrt(0.9 x 0.1 / 200) = 2.1%, so CI is roughly (85.8%, 94.2%)
SE = 0.9 / 200 = 0.45%, so CI is (89.1%, 90.9%)
Cannot compute without individual question scores

Chapter 3: Clustering

Many evals group questions into clusters. DROP has passages with multiple questions each. MGSM translates each question into 10 languages. RACE has several questions per reading passage. Within a cluster, questions are not independent — if the model understands the passage, it likely gets all related questions right.

This violates the independence assumption of the CLT. The naive standard error pretends you have n independent data points, but you really have far fewer independent observations.

Think of it this way: if a passage has 10 questions and the model either understands the passage or doesn't, those 10 scores are essentially 10 copies of one coin flip. You wrote down 10 data points, but you only collected 1 bit of information. The naive formula thinks it has 10x more data than it does.

The clustered standard error

Let s_i,c be the score on question i within cluster c. The cluster-adjusted SE is:

SE_clustered² = SE_CLT² + (1/n²) Σ_c Σ_i Σ_j≠i (s_i,c − s̄)(s_j,c − s̄)

The extra term captures within-cluster correlations. When all questions in a cluster are perfectly correlated, each cluster acts as a single observation — and the SE can be much larger.
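The formula above translates fairly directly into code. The sketch below is one possible implementation, not the paper's: it uses the identity Σ_i Σ_j≠i d_i d_j = (Σ d_i)² − Σ d_i² to avoid the double loop, and the function name and toy data are assumptions made for illustration.

```python
import math
from collections import defaultdict


def clustered_se(scores, cluster_ids):
    """Return (clustered SE, naive CLT SE) for per-question scores."""
    n = len(scores)
    mean = sum(scores) / n
    naive_var = sum((s - mean) ** 2 for s in scores) / (n * (n - 1))

    # Group each question's deviation from the overall mean by cluster.
    deviations = defaultdict(list)
    for score, cluster in zip(scores, cluster_ids):
        deviations[cluster].append(score - mean)

    # Within-cluster cross terms: sum over i != j of d_i * d_j.
    cross = 0.0
    for devs in deviations.values():
        total = sum(devs)
        cross += total * total - sum(d * d for d in devs)

    return math.sqrt(naive_var + cross / (n * n)), math.sqrt(naive_var)


# Toy data: 100 clusters of 10 questions, perfectly correlated within a
# cluster (60 clusters all right, 40 clusters all wrong).
scores = [1.0] * 600 + [0.0] * 400
by_passage = [i // 10 for i in range(1000)]
se_clustered, se_naive = clustered_se(scores, by_passage)

# Same scores with every question in its own cluster (no grouping).
se_singletons, _ = clustered_se(scores, list(range(1000)))

print("naive:", se_naive, "clustered:", se_clustered)
```

With perfect within-cluster correlation the clustered SE is about √10 times the naive one, and with singleton clusters the cross terms vanish so the two coincide.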

Worked example: real numbers from the paper

Eval	Clusters	Questions	SE_clustered (points)	SE_naive (points)	Ratio
DROP	588	9,622	1.34	0.44	3.05x
RACE-H	1,045	3,498	0.51	0.46	1.10x
MGSM	250	2,500	1.62	0.86	1.88x

On DROP, the real uncertainty is 3x larger than the naive estimate. If you reported the naive CI, you would think your measurement was much more precise than it actually is.

[Interactive figure: Clustering Effect Simulator. The CI balloons as intra-cluster correlation and cluster size increase.]

The sliding scale: Clustered SE interpolates between two extremes. At zero within-cluster correlation, SE_clustered = SE_naive. At perfect correlation, each cluster is one observation, and the effective sample size is the number of clusters, not questions.

The design effect quantifies this inflation: DEFF = 1 + (m−1)ρ, where m is cluster size and ρ is intra-cluster correlation. The effective sample size is n/DEFF. For DROP, with ~16 questions per passage and high within-passage correlation, the SE ratio of 3.05 in the table above implies DEFF ≈ 3.05² ≈ 9.3, meaning 9,622 questions carry the information of only ~1,000 independent observations.
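The design-effect arithmetic is easy to tabulate. The sketch below assumes equal-sized clusters (the setting in which DEFF = 1 + (m−1)ρ applies) and uses a hypothetical eval of 1,000 questions in clusters of 10 with a naive SE of 1.5%; the function names are illustrative.

```python
import math


def design_effect(cluster_size: int, rho: float) -> float:
    """DEFF = 1 + (m - 1) * rho for equal-sized clusters."""
    return 1.0 + (cluster_size - 1) * rho


def effective_sample_size(n: int, cluster_size: int, rho: float) -> float:
    """Number of independent observations the n questions are worth."""
    return n / design_effect(cluster_size, rho)


def inflate_se(naive_se: float, cluster_size: int, rho: float) -> float:
    """The SE scales with the square root of the design effect."""
    return naive_se * math.sqrt(design_effect(cluster_size, rho))


# Hypothetical eval: 1,000 questions, clusters of 10, naive SE of 1.5%.
for rho in (0.0, 0.25, 0.5, 1.0):
    print(
        f"rho = {rho:.2f}  DEFF = {design_effect(10, rho):.2f}  "
        f"effective n = {effective_sample_size(1000, 10, rho):.0f}  "
        f"SE = {inflate_se(0.015, 10, rho):.2%}"
    )
```

At ρ = 0 nothing changes; at ρ = 1 the 1,000 questions are worth only 100 observations and the SE grows by √10.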

When should you cluster? According to Abadie et al. (2022): whenever the sampling or assignment mechanism operates at the cluster level. If passages were sampled (not individual questions), cluster. If the same question appears in 10 languages, cluster by question. The rule is simple: cluster at the level of independent sampling.

An eval has 1,000 questions in 100 clusters of 10. The naive SE is 1.5%. If within-cluster correlation is high, the clustered SE could be closest to:

1.5% (same as naive)
~4.7% (because effectively n = 100 clusters, SE scales up by sqrt(10))
0.15% (clustering reduces uncertainty)

Chapter 4: Comparing Two Models

You rarely care about a single model's score in isolation. You care about: is Model A better than Model B? There are two ways to test this.

Unpaired analysis (naive)

If models were evaluated on different question sets, combine their SEs in quadrature:

SE_A−B = √(SE_A² + SE_B²)

Paired analysis (better)

If both models answer the same questions — which is the usual case — you can do much better. Compute the per-question difference d_i = s_A,i − s_B,i, then take the SE of those differences:

SE_paired = √( Σ(d_i − d̄)² / (n(n−1)) )

Why is this better? Because the variance of the difference is:

Var(d) = Var(s_A) + Var(s_B) − 2 Cov(s_A, s_B)

The covariance term is positive whenever both models agree on which questions are easy and which are hard — which is almost always true. This subtracts variance, giving a tighter CI.

Worked example

Galleon scores 65.5% on MATH, Dreadnought 63.0%. Both answer all 5,000 questions. Individually, each has SE ≈ 0.67%. The unpaired SE of the difference would be:

SE_unpaired = √(0.67² + 0.67²) = 0.95%

But if the per-question score correlation is 0.5, the paired SE is:

SE_paired = √(0.67² + 0.67² − 2 × 0.67 × 0.67 × 0.5) = 0.67%

The 95% CI on the 2.5% gap is 2.5% ± 1.96 × 0.67% = (1.2%, 3.8%). The interval is entirely positive — this is a statistically significant difference. The paired test gives a 30% narrower CI, which can be the difference between "significant" and "inconclusive."
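The sketch below reproduces the 0.95% and 0.67% figures from the correlation formula, then checks on a small made-up dataset that computing the SE directly from per-question differences gives the same answer as going through the correlation. The 100-question score lists and all function names are hypothetical.

```python
import math


def mean(xs):
    return sum(xs) / len(xs)


def sample_cov(xs, ys):
    """Sample covariance with the n - 1 denominator."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)


def se_of_mean(xs):
    """CLT standard error: sample standard deviation divided by sqrt(n)."""
    return math.sqrt(sample_cov(xs, xs) / len(xs))


def paired_se_from_correlation(se_a, se_b, rho):
    """Paired SE from the two individual SEs and the score correlation."""
    return math.sqrt(se_a**2 + se_b**2 - 2.0 * rho * se_a * se_b)


# Worked example above: both SEs are 0.67 (percentage points).
unpaired = paired_se_from_correlation(0.67, 0.67, rho=0.0)
paired = paired_se_from_correlation(0.67, 0.67, rho=0.5)

# Hypothetical 0/1 scores on 100 shared questions:
# 60 both right, 15 only A right, 10 only B right, 15 both wrong.
scores_a = [1] * 60 + [1] * 15 + [0] * 10 + [0] * 15
scores_b = [1] * 60 + [0] * 15 + [1] * 10 + [0] * 15
diffs = [a - b for a, b in zip(scores_a, scores_b)]

se_direct = se_of_mean(diffs)
rho_hat = sample_cov(scores_a, scores_b) / math.sqrt(
    sample_cov(scores_a, scores_a) * sample_cov(scores_b, scores_b)
)
se_via_rho = paired_se_from_correlation(
    se_of_mean(scores_a), se_of_mean(scores_b), rho_hat
)

print("unpaired:", unpaired, "paired:", paired)
print("direct:", se_direct, "via correlation:", se_via_rho)
```

In practice the direct route (SE of the per-question differences) is the simpler one to implement; the correlation is worth reporting alongside it because it explains how much the pairing helped.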

[Interactive figure: Paired vs. Unpaired Testing. Higher score correlation means a bigger advantage for paired testing.]

McNemar's test for binary data

When scores are binary (right/wrong), the paired analysis has a classical equivalent: McNemar's test. Categorize each question into a 2x2 table:

	B correct	B wrong
A correct	Both right (a)	Only A right (b)
A wrong	Only B right (c)	Both wrong (d)

The cells (a) and (d) — where both models agree — carry no information about which is better. All the signal is in the discordant cells b and c. McNemar's test statistic is:

χ² = (b − c)² / (b + c)

Worked example. 5,000 MATH questions. Suppose: both right = 3,000, only Galleon right = 275, only Dreadnought right = 150, both wrong = 1,575. Then b = 275, c = 150:

χ² = (275 − 150)² / (275 + 150) = 125² / 425 = 36.76

With 1 degree of freedom, χ² > 3.84 is significant at p < 0.05. At 36.76, this is highly significant (p < 0.001). Note: only the 425 discordant questions out of 5,000 contribute; the 4,575 on which the two models agree (both right or both wrong) are irrelevant.
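The statistic and its p-value can be computed without any statistics library. The sketch below uses the uncorrected statistic exactly as written above (no continuity correction) and the fact that, for 1 degree of freedom, the chi-squared upper tail equals erfc(√(x/2)). Function names are illustrative.

```python
import math


def chi2_sf_1df(x: float) -> float:
    """Upper-tail probability of a chi-squared variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(x / 2.0))


def mcnemar(b: int, c: int) -> tuple:
    """Uncorrected McNemar statistic and p-value from the discordant counts."""
    statistic = (b - c) ** 2 / (b + c)
    return statistic, chi2_sf_1df(statistic)


# Discordant counts from the worked example: only Galleon right, only Dreadnought right.
statistic, p_value = mcnemar(b=275, c=150)
print(f"chi-squared = {statistic:.2f}, p = {p_value:.2e}")
```

Note that the concordant counts (3,000 and 1,575) never enter the calculation.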

Key recommendation: Always use paired tests when two models answer the same questions. It is free variance reduction. Report the pairwise difference, its SE, the 95% CI, and the score correlation between models.

Two models have individual SEs of 2% each, and their per-question scores have correlation 0.8. What is the paired SE of the difference?

sqrt(4 + 4) = 2.83% (unpaired)
sqrt(4 + 4 - 2*2*2*0.8) = sqrt(1.6) = 1.26%
2% * 0.8 = 1.6%

Chapter 5: Sample Size Planning

Before running an eval, ask: how many questions do I need to reliably detect a δ-point improvement? This is power analysis, standard in clinical trials but rare in ML.

The sample size formula

Inputs: significance level α (Type I error), power 1−β (probability of detecting a real effect), and minimum detectable effect δ. For a paired test with K resamples per question:

n = (z_α/2 + z_β)² (ω² + σ_A²/K_A + σ_B²/K_B) / δ²

Where ω² = Var(x_A) + Var(x_B) − 2Cov(x_A, x_B) captures the variance of question-level score differences, and σ² terms capture conditional (sampling) variance.

Worked example

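Suppose continuous scores with zero conditional variance (σ² = 0) and ω² = 1/9 (for instance, uniform difficulty with Var(x) = 1/12 for each model and a correlation of 1/3 between the two models' x values, since 1/12 + 1/12 − 2 × (1/3) × (1/12) = 1/9). We want to detect δ = 3% with 80% power at α = 0.05: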

n = (1.96 + 0.84)² × (1/9) / (0.03)² = 7.84 × 0.111 / 0.0009 ≈ 969 (using unrounded z-values; the rounded figures shown give about 968)

You need about 1,000 questions to detect a 3-point difference. This is why the paper recommends new evals have at least 1,000 questions.
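A sketch of the sample-size formula as a function, using statistics.NormalDist from the standard library for the z-values. The function name and argument names are illustrative; the defaults (α = 0.05, 80% power, no resampling) match the worked example.

```python
import math
from statistics import NormalDist


def required_questions(
    delta: float,
    omega_sq: float,
    sigma_sq_a: float = 0.0,
    sigma_sq_b: float = 0.0,
    k_a: int = 1,
    k_b: int = 1,
    alpha: float = 0.05,
    power: float = 0.80,
) -> float:
    """n = (z_{alpha/2} + z_beta)^2 (omega^2 + sigma_A^2/K_A + sigma_B^2/K_B) / delta^2."""
    z_alpha = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    z_beta = NormalDist().inv_cdf(power)
    variance = omega_sq + sigma_sq_a / k_a + sigma_sq_b / k_b
    return (z_alpha + z_beta) ** 2 * variance / delta**2


# Worked example: omega^2 = 1/9, no conditional variance, delta = 3 points.
n_three_points = required_questions(delta=0.03, omega_sq=1 / 9)

# Same setting with a larger target effect of 5 points.
n_five_points = required_questions(delta=0.05, omega_sq=1 / 9)

print(math.ceil(n_three_points), math.ceil(n_five_points))
```

Because n scales with 1/δ², halving the effect you want to detect quadruples the number of questions required.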

Inverting the formula gives the minimum detectable effect (MDE) for a fixed n:

δ = (z_α/2 + z_β) √((ω² + σ_A²/K_A + σ_B²/K_B) / n)

The MDE tells you: "given this eval's size and noise level, what is the smallest real difference I could reliably detect?" If the MDE is 8% but you care about 2% differences, the eval is too small — do not run it and pretend the results are informative.

Second example: effect of resampling on MDE. With σ² = 1/6 (binary scores), ω² = 1/9, n = 198 questions. At K = 1 (no resampling):

δ = 2.80 × √((1/9 + 1/6 + 1/6) / 198) = 2.80 × √(0.444/198) = 2.80 × 0.0474 = 13.3%

At K = 10:

δ = 2.80 × √((1/9 + 1/60 + 1/60) / 198) = 2.80 × √(0.144/198) = 2.80 × 0.0270 = 7.6%

Resampling 10x cut the MDE nearly in half — from 13.3% to 7.6%. But the remaining 7.6% is dominated by ω² (question difficulty variance), which no amount of resampling can touch. To go lower, you need more questions.
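The two MDE figures, and the floor that resampling cannot get below, can be checked with a short function. This is a sketch under the same assumptions as the example (ω² = 1/9, σ² = 1/6 for both models, n = 198, the same K for both models); the function name is illustrative.

```python
import math
from statistics import NormalDist


def minimum_detectable_effect(
    n: int,
    omega_sq: float,
    sigma_sq: float,
    k: int = 1,
    alpha: float = 0.05,
    power: float = 0.80,
) -> float:
    """MDE for a paired test when both models share sigma^2 and K."""
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0) + NormalDist().inv_cdf(power)
    return z * math.sqrt((omega_sq + 2.0 * sigma_sq / k) / n)


mde_k1 = minimum_detectable_effect(n=198, omega_sq=1 / 9, sigma_sq=1 / 6, k=1)
mde_k10 = minimum_detectable_effect(n=198, omega_sq=1 / 9, sigma_sq=1 / 6, k=10)

# Limit of infinite resampling: conditional variance is gone, omega^2 remains.
mde_floor = minimum_detectable_effect(n=198, omega_sq=1 / 9, sigma_sq=0.0)

print(f"K = 1: {mde_k1:.1%}   K = 10: {mde_k10:.1%}   floor: {mde_floor:.1%}")
```

The floor is only about one point below the K = 10 value, which is why further resampling buys little here and more questions are the only way to go lower.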

[Interactive figure: Power Analysis Calculator, showing the number of questions needed for a given minimum detectable effect. Gray zone = infeasible (would need more questions than any existing eval).]

Translation: An eval with 200 questions can only detect differences of ~10%+ between models. If you need to measure 2-3 point improvements (common in LLM development), you need 1,000+ questions. Many popular evals are far too small for the claims made about them.

You have an eval with 400 questions and want to detect a 5% difference at 80% power. Is this feasible? (Assume omega^2 = 1/9, sigma^2 = 0.)

Yes: n needed = 7.84 * (1/9) / 0.05^2 = 348 questions. 400 > 348, so you have enough power.
No: you always need at least 1,000 questions
Cannot determine without knowing the models' scores

Chapter 6: Variance Reduction

The total estimation variance has two parts. Only one is under your control:

Var(μ̂) = (Var(x) + E[σ_i²]) / n

Var(x) — question difficulty variance. Fixed by the super-population. Cannot reduce.
E[σ_i²] — conditional variance (sampling noise). This you can attack.

Strategy 1: Resampling (answer K times)

Have the model answer each question K times. Use the mean score per question. The conditional variance becomes σ_i²/K.

Worked example. Binary scores, uniform difficulty (x ~ U[0,1]). Then Var(x) = 1/12 and E[σ_i²] = E[x(1-x)] = 1/6 (the average Bernoulli variance across difficulty levels). Going from K = 1 to K = 2:

Var(μ̂|K=2) = (1/12 + 1/12) / n = (1/6) / n

Var(μ̂|K=1) = (1/12 + 1/6) / n = (1/4) / n

That is a 33% reduction in variance by just asking twice. K = 4 gives 50% reduction. K = 6 gives 56%. But returns diminish quickly — once E[σ²]/K is much smaller than Var(x), more resampling is wasted effort. The theoretical limit is a 67% reduction (eliminating all conditional variance), achievable with next-token probabilities.
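The percentages above follow from exact fractions. The sketch below assumes the same setting (binary scores, x ~ U[0,1], so Var(x) = 1/12 and E[σ²] = 1/6) and computes the variance reduction relative to K = 1; the names are illustrative.

```python
from fractions import Fraction

VAR_X = Fraction(1, 12)     # question-difficulty variance for x ~ U[0, 1]
COND_VAR = Fraction(1, 6)   # mean conditional variance E[x(1 - x)]


def variance_reduction(k: int) -> Fraction:
    """Fractional drop in Var(mu-hat) from answering K times instead of once."""
    baseline = VAR_X + COND_VAR
    resampled = VAR_X + COND_VAR / k
    return 1 - resampled / baseline


# Limit as K grows without bound: all conditional variance removed.
max_reduction = COND_VAR / (VAR_X + COND_VAR)

for k in (2, 4, 6, 20):
    print(f"K = {k:2d}: {float(variance_reduction(k)):.0%} reduction")
print(f"limit : {float(max_reduction):.0%} reduction")
```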

Critical warning: do not pool resamples. If the model answers each of n = 1,000 questions K = 5 times, do NOT compute the SE across all 5,000 answers. That would violate independence (5 answers to the same question are correlated). Instead, average each question's K answers into a single score, then compute the SE across the 1,000 question-level means. Equivalently, treat resamples as a special case of clustering (Chapter 3).
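A small simulation makes the warning concrete. The setup is an assumption for illustration only: 1,000 questions with difficulty x ~ U[0,1], each answered K = 5 times as independent Bernoulli(x) draws, with a fixed seed so the run is repeatable.

```python
import math
import random


def se_of_mean(xs) -> float:
    """CLT standard error: sqrt( sum (x - mean)^2 / (n (n - 1)) )."""
    n = len(xs)
    m = sum(xs) / n
    return math.sqrt(sum((x - m) ** 2 for x in xs) / (n * (n - 1)))


rng = random.Random(0)
n_questions, k = 1000, 5

# answers[i] holds the K sampled 0/1 scores for question i.
answers = []
for _ in range(n_questions):
    x = rng.random()  # probability the model answers this question correctly
    answers.append([1 if rng.random() < x else 0 for _ in range(k)])

# Correct: one averaged score per question, SE across the 1,000 means.
question_means = [sum(row) / k for row in answers]
correct_se = se_of_mean(question_means)

# Wrong: treat all 5,000 answers as independent observations.
pooled = [score for row in answers for score in row]
pooled_se = se_of_mean(pooled)

print(f"question-level SE: {correct_se:.4f}   pooled SE: {pooled_se:.4f}")
```

Both approaches give the same mean score, but the pooled SE comes out noticeably smaller than the question-level SE, which is exactly the false precision the warning is about.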

Strategy 2: Next-token probabilities

For multiple-choice evals without chain-of-thought: instead of sampling a token and checking if it is correct, read the probability of the correct token directly. This sets ε_i = 0 and completely eliminates conditional variance. The score becomes s_i = x_i = p_i.

This achieves the theoretical maximum variance reduction — equivalent to K = ∞.

Example: the model assigns p_i = 0.73 to the correct answer for question i. Instead of flipping a coin and recording 0 or 1, record 0.73 as the score. No sampling noise at all. The variance of μ̂ drops from (Var(x) + E[σ²])/n to Var(x)/n alone — a 2/3 reduction in the uniform-difficulty case.

When to use which: Next-token probabilities when available and no chain-of-thought is needed. Resampling (K = 2-6) when you must generate full answers. In both cases, compute the SE across question-level scores, not across all K*n individual answers (that would violate independence).

Do NOT reduce temperature

It is tempting to set T = 0 to reduce noise. But this does not reliably reduce total variance: it can simply convert conditional variance into question-difficulty variance, which resampling cannot remove.

Worked example. Question difficulty uniform x ~ U[0,1]. At T = 1: Var(x) = 1/12, E[σ²] = 1/6, for a total of 1/4. At T = 0: scores become binary (1 if x > 0.5, else 0). Now Var(x_T=0) = 1/4 and E[σ²] = 0. The conditional variance disappears but question-difficulty variance triples, so the total is still 1/4 and none of it can be resampled away.

Even worse: temperature changes can bias the estimator. If x ~ U[1/3, 1], then E[x_T=1] = 2/3 but E[x_T=0] = 3/4. The mean score shifts by 8 percentage points, and the question-difficulty variance Var(x) increases five-fold (from 1/27 to 3/16).
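The numbers in both temperature examples can be verified with exact fractions. The sketch below adopts the same simplifying model as the text: difficulty x is uniform on [a, b], and at T = 0 the model answers correctly exactly when x > 1/2. The function name and the returned keys are illustrative.

```python
from fractions import Fraction


def temperature_stats(a: Fraction, b: Fraction) -> dict:
    """Mean and variance components for x ~ U[a, b] at T = 1 and T = 0."""
    half = Fraction(1, 2)
    mean_t1 = (a + b) / 2
    var_x_t1 = (b - a) ** 2 / 12
    # E[x(1 - x)] = E[x] - E[x^2], with E[x^2] = Var(x) + E[x]^2.
    cond_var_t1 = mean_t1 - (var_x_t1 + mean_t1**2)
    # At T = 0 the score is deterministic: 1 when x > 1/2, else 0.
    mean_t0 = max(Fraction(0), b - max(a, half)) / (b - a)
    var_x_t0 = mean_t0 * (1 - mean_t0)
    return {
        "mean_t1": mean_t1,
        "var_x_t1": var_x_t1,
        "cond_var_t1": cond_var_t1,
        "mean_t0": mean_t0,
        "var_x_t0": var_x_t0,
    }


full_range = temperature_stats(Fraction(0), Fraction(1))
skewed = temperature_stats(Fraction(1, 3), Fraction(1))

print(full_range)
print(skewed)
```

In the first case the mean is unchanged and variance merely moves from one component to the other; in the second the mean itself shifts, so the T = 0 score estimates a different quantity.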

[Figure: Resampling, Diminishing Returns. The curve shows total variance relative to K = 1 as the number of resamples per question increases. The floor is set by Var(x), which resampling cannot reduce.]

Don't touch the thermostat. Reducing temperature shifts conditional variance into question-difficulty variance, and can even bias the estimator. Use resampling or next-token probabilities instead.

Why does resampling (K > 1) have diminishing returns?

Because once E[sigma^2]/K becomes much smaller than Var(x), further resampling only reduces the already-small conditional variance, while the dominant variance from question selection is untouched
Because the model gives the same answer every time
Because resampling increases Var(x)

Chapter 7: What Llama 3 Got Wrong

The Llama 3 technical report (Dubey et al., 2024) was notable for being one of the first industry reports to include confidence intervals on eval scores. This is laudable. But the paper identifies two systematic errors in their approach:

Error 1: Too narrow (ignored clustering)

For clustered evals like DROP, MGSM, and RACE, Llama 3 used the naive SE formula, ignoring within-cluster correlations. As we saw in Chapter 3, this can understate the true SE by 3x. Their reported CIs were anti-conservative — they appeared more precise than justified.

Real-world impact: On DROP, the naive SE was 0.44 but the clustered SE was 1.34 — over 3x larger. A difference that looks statistically significant under the naive CI might be completely consistent with noise under the correct clustered CI.

Error 2: Too wide (used Bernoulli for non-binary scores)

For evals with fractional scores (like F1 on DROP), Llama 3 used SE_Bernoulli = √(s̄(1−s̄)/n) even though scores were not binary. The Bernoulli formula assumes maximum variance for a given mean — it treats every score as 0 or 1. When scores are actually continuous (e.g., F1 scores between 0 and 1), the true variance is smaller, so the Bernoulli SE is conservative (too wide).

The correct approach: use SE_CLT computed from the actual sample variance of the scores.

Worked example: DROP

DROP uses F1 scores (continuous, 0 to 1) with 588 passage clusters containing 9,622 questions total. Suppose a model scores 87.1:

Llama 3 approach: SE_Bernoulli = √(0.871 × 0.129 / 9622) = 0.34. This ignores clustering, and for scores bounded in [0, 1] it is the largest value the naive SE can take at this mean.
Naive SE_CLT: computed from the actual sample variance of the F1 scores. At a mean of 87.1 it can be at most about 0.34, so the 0.44 naive SE in the Chapter 3 table cannot correspond to a score of 87.1 on this scale; it must come from a run with a different mean.
Correct clustered SE: 1.34 in the Chapter 3 table, three times the naive 0.44, because questions about the same passage are highly correlated.

In this case, the two Llama 3 errors go in opposite directions: the Bernoulli formula is at least as wide as SE_CLT, but ignoring clustering makes the interval far too narrow. Even the largest possible Bernoulli SE at n = 9,622 (about 0.51, reached at a mean of 50) is well below the clustered SE of 1.34. The net effect: still dangerously anti-conservative.

The net effect

Ironically, these two errors sometimes offset: too-narrow from ignoring clustering, too-wide from using Bernoulli. But they offset by accident, not design, and the degree varies per eval. On MGSM (binary scores, clustered), only error 1 applies — CIs are purely anti-conservative. On non-clustered evals with continuous scores, only error 2 applies — CIs are purely conservative.

The lesson: Getting CIs right requires attention to both the score type (binary vs. continuous) and the sampling structure (independent vs. clustered). Using the wrong formula can be worse than no CI at all, because it gives false confidence in precision.

Llama 3 reported CIs for DROP (a clustered eval with F1 scores). Which combination of errors did they make?

Only too narrow
Only too wide
Both: too narrow (ignored clustering) AND too wide (used Bernoulli on fractional F1 scores). The errors partially cancel by coincidence.

Chapter 8: Practical Recommendations

The paper distills its analysis into concrete, actionable recommendations for anyone running or reporting LLM evals:

For reporting results

Always report standard errors alongside mean scores, in parentheses below the mean (e.g., "65.5% (0.7%)"). This is standard practice in economics and medicine — ML should adopt it.
Report the number of questions in each eval. Without this, readers cannot assess precision.
Use clustered SEs for any eval with grouped questions (DROP, RACE, QuAC, SQuAD, MGSM). Report the number of clusters alongside the question count.
Use SE_CLT, not SE_Bernoulli, unless scores are genuinely binary. The Bernoulli formula is a special case — using it on continuous scores gives inflated intervals.

For comparing models

Use paired tests whenever two models answer the same questions. Report pairwise differences, paired SEs, 95% CIs, and the score correlation.
Use clustered paired SEs on evals with grouped questions.
Test for significance before claiming one model is better than another. If the 95% CI of the difference includes zero, the difference is not statistically significant.

For designing evals

Run power analysis first. Use the sample-size formula to determine how many questions you need to detect the improvement you care about.
Aim for 1,000+ questions as a baseline. Evals with ~100 questions can only detect very large differences.
Consider resampling (K = 2-4) or using next-token probabilities to reduce conditional variance.
Do not adjust temperature for the sake of variance reduction.

What to do right now

If you are writing a technical report today, do these three things:

Compute SE_CLT from your per-question scores (not SE_Bernoulli unless scores are truly 0/1). Report it in parentheses.
For any clustered eval (DROP, RACE, QuAC, SQuAD, MGSM), compute the clustered SE. It is a few extra lines of code.
For model comparisons, compute per-question differences and report the paired SE. Include the score correlation between models.

These three steps would transform the interpretability of every leaderboard result. They cost essentially nothing — a few lines of NumPy after the eval run.

The cheat sheet

Scenario	Formula
Single model, binary, independent	SE = √(p(1-p)/n)
Single model, continuous, independent	SE = sd(s) / √n
Single model, clustered	SE_clustered via the cluster-adjusted formula (Chapter 3)
Two models, unpaired	SE = √(SE_A² + SE_B²)
Two models, paired	SE = sd(d_i) / √n
Sample size (paired)	n = (z_α/2 + z_β)² (ω² + σ_A²/K_A + σ_B²/K_B) / δ²

You are writing a technical report comparing your new model against a baseline on MGSM (multilingual eval, 250 clusters of 10 questions each). What should you report?

Just the two model accuracies
Accuracies with naive SEs
Accuracies with clustered SEs, the pairwise difference with a clustered paired SE, the 95% CI, and the score correlation, plus the number of questions and clusters

Chapter 9: Connections

This paper sits at the intersection of classical statistics and modern ML evaluation practice. Here is how it connects to the broader landscape:

The statistics it draws on

Central Limit Theorem — the foundation of all the standard error formulas here
Clustered standard errors, long used by economists for survey and observational data with grouped observations (Abadie et al., 2022, give guidance on when to apply them)
Power analysis — standard in clinical trials and A/B testing; List, Sadoff & Wagner (2010) provide the experiment design framework this paper follows
McNemar's test — classical paired comparison for binary data, applicable when two models answer the same questions

In the LLM evaluation landscape

Chatbot Arena (Chiang et al., 2024) — one of the few eval platforms that uses confidence intervals (on Elo scores), but this paper addresses traditional Q&A evals
"With Little Power Comes Great Responsibility" (Card et al., 2020) — earlier work highlighting the power problem in NLP evaluations. Showed that many NLP papers lacked sufficient statistical power to support their claims.
Quantifying Variance in Evaluation Benchmarks (Madaan et al., 2024) — contemporaneous work on eval reliability, approaching the problem from the benchmark design side
Inspect framework (UK AISI) — correctly computes SE_CLT with its built-in stderr() metric, making it easy to follow these recommendations
OpenAI Evals — uses bootstrapping for standard errors, which this paper argues is unnecessary when the CLT applies (large n, finite variance)

Broader context

This paper is part of a growing movement to professionalize LLM evaluation. As the field matures, "highest number wins" is not sustainable. A/B testing in industry already uses these tools — the same rigor should apply when comparing language models.

The recommendations are simple, free to implement, and would immediately improve the interpretability of every technical report and leaderboard. Every eval framework should compute and display SEs by default. Every leaderboard should show error bars. Every technical report should include pairwise analysis when comparing models. These are not novel statistical techniques — they are intro-level statistics applied with care to a domain that has been ignoring them.

Key formulas reference card

Single model: SE = √(s̄(1-s̄)/n) for binary, sd(s)/√n for continuous
Clustered: Add within-cluster cross-terms to naive variance
Paired difference: SE = sd(d_i)/√n, where d_i = s_A,i - s_B,i
Sample size: n = (z_α/2 + z_β)² (ω² + σ_A²/K_A + σ_B²/K_B) / δ²
95% CI: estimate ± 1.96 × SE

What is the single most impactful change this paper recommends for the LLM evaluation community?

Report standard errors alongside eval scores, transforming "highest number wins" into rigorous statistical inference
Use more questions in evals
Stop using benchmarks entirely
