LLM benchmarks report "65.5% vs 64.0%" without uncertainty. Is that a real difference, or noise? Rigorous statistics says: compute the error bar first, then decide.
Two fictional language models, Galleon and Dreadnought, are evaluated on three benchmarks:
| Eval | # Questions | Galleon | Dreadnought | Difference |
|---|---|---|---|---|
| MATH | 5,000 | 65.5% | 63.0% | +2.5% |
| HumanEval | 164 | 83.6% | 86.7% | −3.1% |
| MGSM | 2,500 | 75.3% | 78.0% | −2.7% |
At face value, Dreadnought wins two of three evals. Case closed?
Not so fast. Look at HumanEval: only 164 questions. With so few binary questions, each score carries a standard error of roughly 3%, so a 3.1% gap is indistinguishable from noise. Meanwhile the MATH result (5,000 questions, 2.5% gap) is far more solid. Without error bars, you cannot tell the two situations apart.
This is not a hypothetical problem. Industry practice today is to bold the highest number, put it in a press release, and move on. Leaderboards rank models to the tenth of a percent. Technical reports trumpet a 0.5% improvement as evidence of superior architecture. But none of this means anything without a measure of uncertainty.
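The fundamental conceptual move in this paper is simple but powerful: treat the questions in an eval as a random sample drawn from a much larger, unseen super-population of possible questions.
Once you accept this framing, standard statistical machinery (standard errors, confidence intervals, hypothesis tests, power analysis) applies directly.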
Decompose the score on question i into a conditional mean (how hard the question is, on average) and a random component (noise from sampling the model's output): si = xi + εi.
Here xi is the expected score on question i (e.g. the probability the model gets it right), and εi is zero-mean noise. The true eval score μ is the expected value over the super-population: μ = E[s] = E[x]. We estimate it with the sample mean s̄.
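Two sources of variance drive the uncertainty: the spread of difficulty across questions, Var(x), which reflects which questions happened to be drawn; and the conditional variance σi² = Var(εi), the noise in the model's answer to a fixed question i.
By the law of total variance, these add: Var(s̄) = (Var(x) + E[σi²]) / n.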
This decomposition is the engine of everything that follows. The first term is irreducible (it is the whole point of sampling from a super-population). The second term is the target for variance reduction techniques (Chapter 6). And the 1/n factor is why bigger evals give tighter CIs.
The simplest case: n independent questions, each scored 0 or 1. The model gets s̄ = k/n correct. What is the standard error?
When scores are binary (right/wrong), the standard error has a clean form: SEBernoulli = √(s̄(1 − s̄)/n).
Worked example. A model scores 65.5% on MATH (n = 5,000): SE = √(0.655 × 0.345 / 5,000) ≈ 0.67%.
The 95% confidence interval is 65.5% ± 1.96 × 0.67% = (64.2%, 66.8%).
Now consider HumanEval with only n = 164 at 83.6% accuracy: SE = √(0.836 × 0.164 / 164) ≈ 2.89%.
The 95% CI is (77.9%, 89.3%) — an 11-point spread! That "83.6%" could easily be 78% or 89%.
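The two worked examples above can be reproduced in a few lines. This is a minimal sketch using only the Python standard library; the function names `bernoulli_se` and `normal_ci` are illustrative choices, not names from the paper.

```python
import math


def bernoulli_se(accuracy: float, n: int) -> float:
    """Standard error of the mean of n independent 0/1 scores."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n)


def normal_ci(accuracy: float, n: int, z: float = 1.96) -> tuple[float, float]:
    """Normal-approximation confidence interval (95% when z = 1.96)."""
    half_width = z * bernoulli_se(accuracy, n)
    return accuracy - half_width, accuracy + half_width


# The two evals from the worked examples in the text.
for name, acc, n in [("MATH", 0.655, 5000), ("HumanEval", 0.836, 164)]:
    lo, hi = normal_ci(acc, n)
    print(f"{name}: SE = {bernoulli_se(acc, n):.2%}, 95% CI = ({lo:.1%}, {hi:.1%})")
```

Running it gives the same intervals quoted above: roughly (64.2%, 66.8%) for MATH and (77.9%, 89.3%) for HumanEval.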
For small n or extreme p (near 0 or 1), the normal approximation s̄ ± 1.96 × SE can give intervals outside [0, 1]. The Wilson interval corrects this: with z = 1.96, it is (s̄ + z²/(2n) ± z √(s̄(1 − s̄)/n + z²/(4n²))) / (1 + z²/n).
For typical eval sizes (n > 100) and moderate accuracy, the normal and Wilson intervals are nearly identical. But if you have a small eval (n < 50) or accuracy > 95%, use Wilson.
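As a rough illustration of the Wilson interval, here is a small sketch in plain Python. The function name and the 49-out-of-50 example are hypothetical, chosen only to show a case where the normal interval would spill past 100%.

```python
import math


def wilson_interval(k: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for k successes out of n binary trials."""
    p = k / n
    denom = 1.0 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half_width = (z / denom) * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return center - half_width, center + half_width


# Hypothetical small eval: 49 of 50 correct (98% accuracy).
k, n = 49, 50
p = k / n
normal_hi = p + 1.96 * math.sqrt(p * (1 - p) / n)  # exceeds 1.0 here
print("normal upper bound:", round(normal_hi, 4))
print("Wilson interval:   ", wilson_interval(k, n))
```

For a large eval such as MATH (3,275 of 5,000 correct) the Wilson and normal intervals agree to within about a hundredth of a percentage point, which is why the simpler formula is usually fine.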
When scores take fractional values (e.g. F1 scores, partial credit), use the general formula from the Central Limit Theorem: SECLT = √( (1/(n − 1)) Σi (si − s̄)² / n ).
This is just the sample standard deviation divided by √n. The Bernoulli formula is a special case. Using SEBernoulli on non-binary data gives conservative (too wide) intervals — the paper catches Llama 3 making exactly this mistake.
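To see the gap between the two formulas on fractional scores, consider the sketch below. The eight F1-style scores are made up for illustration, and the function names are hypothetical.

```python
import math
import statistics


def clt_se(scores: list[float]) -> float:
    """General SE: sample standard deviation (n - 1 denominator) over sqrt(n)."""
    return statistics.stdev(scores) / math.sqrt(len(scores))


def bernoulli_se_from_mean(scores: list[float]) -> float:
    """Bernoulli SE computed from the mean alone; only valid for 0/1 scores."""
    mean = statistics.fmean(scores)
    return math.sqrt(mean * (1.0 - mean) / len(scores))


# Hypothetical fractional scores (e.g. per-question F1), mean 0.85.
f1_scores = [0.9, 0.8, 1.0, 0.7, 0.95, 0.85, 0.6, 1.0]
print("CLT SE:      ", round(clt_se(f1_scores), 4))
print("Bernoulli SE:", round(bernoulli_se_from_mean(f1_scores), 4))
```

On this toy data the Bernoulli figure is more than twice the CLT figure, because the Bernoulli formula assumes every score sits at 0 or 1.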
Many evals group questions into clusters. DROP has passages with multiple questions each. MGSM translates each question into 10 languages. RACE has several questions per reading passage. Within a cluster, questions are not independent — if the model understands the passage, it likely gets all related questions right.
This violates the independence assumption of the CLT. The naive standard error pretends you have n independent data points, but you really have far fewer independent observations.
Think of it this way: if a passage has 10 questions and the model either understands the passage or doesn't, those 10 scores are essentially 10 copies of one coin flip. You wrote down 10 data points, but you only collected 1 bit of information. The naive formula thinks it has 10x more data than it does.
Let si,c be the score on question i within cluster c. The cluster-adjusted SE is: SEclustered = √( SECLT² + (1/n²) Σc Σi Σj≠i (si,c − s̄)(sj,c − s̄) ).
The extra term captures within-cluster correlations. When all questions in a cluster are perfectly correlated, each cluster acts as a single observation — and the SE can be much larger.
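A cluster-adjusted SE of this kind can be sketched as follows. This is an illustrative implementation under the assumption that the adjustment is the naive CLT variance plus the sum of within-cluster cross products of deviations from the overall mean, divided by n²; the function name and the toy data are hypothetical.

```python
import math
from collections import defaultdict


def clustered_se(scores, cluster_ids) -> float:
    """Cluster-adjusted SE of the mean score.

    variance = naive CLT variance
             + (1 / n^2) * sum over clusters of sum_{i != j} d_i * d_j,
    where d_i is a score's deviation from the overall mean.
    """
    n = len(scores)
    mean = sum(scores) / n
    var_clt = sum((s - mean) ** 2 for s in scores) / (n - 1) / n

    # Collect deviations from the overall mean, grouped by cluster.
    deviations = defaultdict(list)
    for s, c in zip(scores, cluster_ids):
        deviations[c].append(s - mean)

    # sum_{i != j} d_i d_j equals (sum d)^2 - sum d^2 within each cluster.
    cross = 0.0
    for d in deviations.values():
        total = sum(d)
        cross += total * total - sum(x * x for x in d)

    return math.sqrt(var_clt + cross / n**2)


# Toy data: 4 passages with 3 questions each, perfectly correlated within a passage.
scores = [1, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0]
passages = [0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3]
naive = clustered_se(scores, list(range(len(scores))))  # every question its own cluster
print("naive SE:    ", round(naive, 4))
print("clustered SE:", round(clustered_se(scores, passages), 4))
```

Treating every question as its own cluster makes the cross term vanish, so the same function also returns the naive SE.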
| Eval | SEclustered | SEnaive | Ratio |
|---|---|---|---|
| DROP (588 clusters, 9622 Qs) | 1.34% | 0.44% | 3.05x |
| RACE-H (1045 clusters, 3498 Qs) | 0.51% | 0.46% | 1.10x |
| MGSM (250 clusters, 2500 Qs) | 1.62% | 0.86% | 1.88x |
On DROP, the real uncertainty is 3x larger than the naive estimate. If you reported the naive CI, you would think your measurement was much more precise than it actually is.
The design effect quantifies this inflation: DEFF = 1 + (m−1)ρ, where m is cluster size and ρ is intra-cluster correlation. The effective sample size is n/DEFF. For DROP with ~16 questions per passage and high within-passage correlation, the design effect can be 9 or more — meaning 9,622 questions carry the information of only ~1,000 independent observations.
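The design-effect arithmetic is simple enough to sketch directly. In the example below the intra-cluster correlation of 0.54 is an assumed value, picked only so that the design effect lands near the squared SE ratio for DROP (3.05² ≈ 9.3); it is not a figure reported in the text.

```python
def design_effect(cluster_size: float, icc: float) -> float:
    """DEFF = 1 + (m - 1) * rho for average cluster size m and intra-cluster correlation rho."""
    return 1.0 + (cluster_size - 1.0) * icc


def effective_sample_size(n: int, cluster_size: float, icc: float) -> float:
    """Number of independent observations that n clustered questions are worth."""
    return n / design_effect(cluster_size, icc)


# DROP: 9,622 questions in 588 passages; rho = 0.54 is an assumption for illustration.
n_questions, n_clusters, rho = 9622, 588, 0.54
m = n_questions / n_clusters
print("DEFF:         ", round(design_effect(m, rho), 2))
print("effective n:  ", round(effective_sample_size(n_questions, m, rho)))
```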
You rarely care about a single model's score in isolation. You care about: is Model A better than Model B? There are two ways to test this.
If models were evaluated on different question sets, combine their SEs in quadrature: SEunpaired = √(SEA² + SEB²).
If both models answer the same questions, which is the usual case, you can do much better. Compute the per-question difference di = sA,i − sB,i, then take the SE of those differences: SEpaired = sd(d) / √n.
Why is this better? Because the variance of the difference is: Var(sA − sB) = Var(sA) + Var(sB) − 2 Cov(sA, sB).
The covariance term is positive whenever both models agree on which questions are easy and which are hard — which is almost always true. This subtracts variance, giving a tighter CI.
Galleon scores 65.5% on MATH, Dreadnought 63.0%. Both answer all 5,000 questions. Individually, each has SE ≈ 0.67%. The unpaired SE of the difference would be: √(0.67² + 0.67²) ≈ 0.95%.
But if the per-question score correlation is 0.5, the paired SE is: √(0.67² + 0.67² − 2 × 0.5 × 0.67 × 0.67) = 0.67%.
The 95% CI on the 2.5% gap is 2.5% ± 1.96 × 0.67% = (1.2%, 3.8%). The interval is entirely positive — this is a statistically significant difference. The paired test gives a 30% narrower CI, which can be the difference between "significant" and "inconclusive."
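The comparison between the two approaches can be sketched in code. The helper names and the ten-question score vectors below are hypothetical; the last function just restates the variance identity in terms of a correlation so the 0.67% figure can be checked.

```python
import math
import statistics


def se_mean(values) -> float:
    """SE of a sample mean: sample standard deviation over sqrt(n)."""
    return statistics.stdev(values) / math.sqrt(len(values))


def unpaired_se(scores_a, scores_b) -> float:
    """SE of the difference when the two score sets are treated as independent."""
    return math.hypot(se_mean(scores_a), se_mean(scores_b))


def paired_se(scores_a, scores_b) -> float:
    """SE of the mean per-question difference (same questions, same order)."""
    return se_mean([a - b for a, b in zip(scores_a, scores_b)])


def paired_se_from_correlation(se_a: float, se_b: float, rho: float) -> float:
    """Paired SE implied by the individual SEs and the score correlation rho."""
    return math.sqrt(se_a**2 + se_b**2 - 2.0 * rho * se_a * se_b)


# Hypothetical 0/1 scores for two models on the same ten questions.
model_a = [1, 1, 1, 0, 1, 0, 1, 1, 0, 0]
model_b = [1, 1, 0, 0, 1, 0, 1, 0, 0, 0]
print("unpaired SE:", round(unpaired_se(model_a, model_b), 4))
print("paired SE:  ", round(paired_se(model_a, model_b), 4))
print("worked example:", round(paired_se_from_correlation(0.67, 0.67, 0.5), 2))
```

Because the two toy models mostly agree on which questions are hard, the paired SE comes out well below the unpaired one.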
When scores are binary (right/wrong), the paired analysis has a classical equivalent: McNemar's test. Categorize each question into a 2x2 table:
| | B correct | B wrong |
|---|---|---|
| A correct | Both right (a) | Only A right (b) |
| A wrong | Only B right (c) | Both wrong (d) |
The cells (a) and (d), where both models agree, carry no information about which is better. All the signal is in the discordant cells b and c. McNemar's test statistic is: χ² = (b − c)² / (b + c).
Worked example. 5,000 MATH questions. Suppose: both right = 3,000, only Galleon right = 275, only Dreadnought right = 150, both wrong = 1,575. Then b = 275, c = 150: χ² = (275 − 150)² / (275 + 150) = 15,625 / 425 ≈ 36.76.
With 1 degree of freedom, χ² > 3.84 is significant at p < 0.05. At 36.76, this is highly significant (p < 0.001). Note: only the 425 discordant questions out of 5,000 contribute; the 4,575 on which both models agree (both right or both wrong) are irrelevant.
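The McNemar calculation can be checked with a short sketch. It uses the uncorrected statistic, matching the worked example, and gets the p-value from the identity that a chi-square variable with 1 degree of freedom is a squared standard normal; the function name is an illustrative choice.

```python
import math


def mcnemar(b: int, c: int) -> tuple[float, float]:
    """McNemar's chi-square statistic (no continuity correction) and its p-value.

    b = questions only model A got right, c = questions only model B got right.
    """
    if b + c == 0:
        raise ValueError("no discordant pairs: the test is undefined")
    stat = (b - c) ** 2 / (b + c)
    # For 1 degree of freedom, P(chi2 > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value


stat, p = mcnemar(b=275, c=150)
print(f"chi2 = {stat:.2f}, p = {p:.2e}")
```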
Before running an eval, ask: how many questions do I need to reliably detect a δ-point improvement? This is power analysis, which is standard in clinical trials but rare in ML.
Inputs: significance level α (Type I error), power 1−β (probability of detecting a real effect), and minimum detectable effect δ. For a paired test with K resamples per question: n = (zα/2 + zβ)² (ω² + σA²/K + σB²/K) / δ².
Where ω² = Var(xA) + Var(xB) − 2Cov(xA, xB) captures the variance of question-level score differences, and σ² terms capture conditional (sampling) variance.
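Suppose continuous scores with zero conditional variance (σ² = 0) and ω² = 1/9 (for two models with U[0,1] question difficulties this corresponds to a cross-model correlation of 1/3; a correlation of 0.5 would give ω² = 1/12). We want to detect δ = 3% with 80% power at α = 0.05: n = (1.960 + 0.842)² × (1/9) / 0.03² ≈ 969.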
You need about 1,000 questions to detect a 3-point difference. This is why the paper recommends new evals have at least 1,000 questions.
Inverting the formula gives the minimum detectable effect (MDE) for a fixed n: δMDE = (zα/2 + zβ) √((ω² + σA²/K + σB²/K) / n).
The MDE tells you: "given this eval's size and noise level, what is the smallest real difference I could reliably detect?" If the MDE is 8% but you care about 2% differences, the eval is too small — do not run it and pretend the results are informative.
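Second example: effect of resampling on MDE. With σ² = 1/6 (binary scores) for both models, ω² = 1/9, n = 198 questions, and the same α and power as before. At K = 1 (no resampling): δMDE = 2.80 × √((1/9 + 1/6 + 1/6) / 198) ≈ 13.3%.
At K = 10: δMDE = 2.80 × √((1/9 + 1/60 + 1/60) / 198) ≈ 7.6%.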
Resampling 10x cut the MDE nearly in half — from 13.3% to 7.6%. But the remaining 7.6% is dominated by ω² (question difficulty variance), which no amount of resampling can touch. To go lower, you need more questions.
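Both power calculations above can be reproduced with the standard library. This is a sketch under the assumption that both models are resampled the same number of times K; the function and parameter names are hypothetical.

```python
import math
from statistics import NormalDist


def z_sum(alpha: float = 0.05, power: float = 0.80) -> float:
    """z_{alpha/2} + z_beta for a two-sided test at level alpha with the given power."""
    nd = NormalDist()
    return nd.inv_cdf(1.0 - alpha / 2.0) + nd.inv_cdf(power)


def required_n(delta, omega2, sigma2_a=0.0, sigma2_b=0.0, k=1,
               alpha=0.05, power=0.80) -> float:
    """Questions needed to detect a true difference delta in a paired comparison."""
    noise = omega2 + (sigma2_a + sigma2_b) / k
    return z_sum(alpha, power) ** 2 * noise / delta**2


def minimum_detectable_effect(n, omega2, sigma2_a=0.0, sigma2_b=0.0, k=1,
                              alpha=0.05, power=0.80) -> float:
    """Smallest true difference detectable with n questions (inverse of required_n)."""
    noise = omega2 + (sigma2_a + sigma2_b) / k
    return z_sum(alpha, power) * math.sqrt(noise / n)


print("n for a 3-point gap:", math.ceil(required_n(0.03, omega2=1 / 9)))
for k in (1, 10):
    mde = minimum_detectable_effect(198, 1 / 9, 1 / 6, 1 / 6, k=k)
    print(f"MDE at K = {k}: {mde:.1%}")
```

Calling `minimum_detectable_effect` with a very large K shows the floor set by ω²: for n = 198 it stays near 6.6% no matter how many resamples are drawn.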
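The total estimation variance has two parts, and only one is under your control for a fixed set of questions: Var(x)/n can only be reduced by adding questions, whereas E[σi²]/n can be reduced by making each question's score less noisy.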
Have the model answer each question K times. Use the mean score per question. The conditional variance becomes σi²/K.
Worked example. Binary scores, uniform difficulty (x ~ U[0,1]). Then Var(x) = 1/12 and E[σi²] = E[x(1-x)] = 1/6 (the average Bernoulli variance across difficulty levels). Going from K = 1 to K = 2: the per-question variance Var(x) + E[σi²]/K falls from 1/12 + 1/6 = 1/4 to 1/12 + 1/12 = 1/6.
That is a 33% reduction in variance by just asking twice. K = 4 gives 50% reduction. K = 6 gives 56%. But returns diminish quickly — once E[σ²]/K is much smaller than Var(x), more resampling is wasted effort. The theoretical limit is a 67% reduction (eliminating all conditional variance), achievable with next-token probabilities.
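The diminishing returns are easy to tabulate exactly. The sketch below uses `fractions.Fraction` so the results match the figures in the text without rounding; the function names are illustrative.

```python
from fractions import Fraction


def per_question_variance(var_x: Fraction, mean_sigma2: Fraction, k: int) -> Fraction:
    """Var(x) + E[sigma^2] / K: variance of a question's mean score over K samples."""
    return var_x + mean_sigma2 / k


def reduction_vs_single_sample(var_x: Fraction, mean_sigma2: Fraction, k: int) -> Fraction:
    """Fractional variance reduction from K resamples, relative to K = 1."""
    baseline = per_question_variance(var_x, mean_sigma2, 1)
    return 1 - per_question_variance(var_x, mean_sigma2, k) / baseline


# Uniform-difficulty example from the text: Var(x) = 1/12, E[sigma^2] = 1/6.
var_x, mean_sigma2 = Fraction(1, 12), Fraction(1, 6)
for k in (2, 4, 6, 100):
    r = reduction_vs_single_sample(var_x, mean_sigma2, k)
    print(f"K = {k:>3}: reduction = {float(r):.1%}")
print("limit as K grows:", 1 - var_x / (var_x + mean_sigma2))
```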
For multiple-choice evals without chain-of-thought: instead of sampling a token and checking if it is correct, read the probability of the correct token directly. This sets εi = 0 and completely eliminates conditional variance. The score becomes si = xi = pi.
This achieves the theoretical maximum variance reduction — equivalent to K = ∞.
Example: the model assigns pi = 0.73 to the correct answer for question i. Instead of flipping a coin and recording 0 or 1, record 0.73 as the score. No sampling noise at all. The variance of μ̂ drops from (Var(x) + E[σ²])/n to Var(x)/n alone — a 2/3 reduction in the uniform-difficulty case.
It is tempting to set T = 0 to reduce noise. But this can increase total variance.
Worked example. Question difficulty uniform x ~ U[0,1]. At T = 1: Var(x) = 1/12, E[σ²] = 1/6, for a total of 1/4. At T = 0: scores become binary (1 if x > 0.5, else 0). Now Var(xT=0) = 1/4. The conditional variance disappears but question-difficulty variance triples, so the total is still 1/4 and nothing has been gained. Worse, the remaining variance is now all of the kind that resampling cannot reduce.
Even worse: temperature changes can bias the estimator. If x ~ U[1/3, 1], then E[xT=1] = 2/3 but E[xT=0] = 3/4. The mean score shifts by 8 percentage points, and the question-difficulty variance increases five-fold (from 1/27 to 3/16).
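The numbers in both temperature examples follow from the moments of a uniform distribution, and can be verified exactly. The sketch assumes, as the examples do, that greedy decoding (T = 0) answers a question correctly exactly when x > 1/2; the function name and dictionary keys are hypothetical.

```python
from fractions import Fraction


def uniform_difficulty_summary(a: Fraction, b: Fraction) -> dict:
    """Exact means and question-difficulty variances for x ~ U[a, b], 0 <= a < b <= 1.

    Sampled (T = 1): the expected score on a question is x itself.
    Greedy (T = 0): the score is 1 if x > 1/2 and 0 otherwise (assumption from the text).
    """
    half = Fraction(1, 2)
    p_greedy = (b - max(a, half)) / (b - a) if b > half else Fraction(0)
    return {
        "mean_sampled": (a + b) / 2,
        "var_x_sampled": (b - a) ** 2 / 12,
        "mean_greedy": p_greedy,
        "var_x_greedy": p_greedy * (1 - p_greedy),
    }


for a, b in [(Fraction(0), Fraction(1)), (Fraction(1, 3), Fraction(1))]:
    print(f"x ~ U[{a}, {b}]:", uniform_difficulty_summary(a, b))
```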
Figure: total variance relative to K = 1 as the number of resamples per question increases. The floor is set by Var(x), which resampling cannot reduce.
The Llama 3 technical report (Dubey et al., 2024) was notable for being one of the first industry reports to include confidence intervals on eval scores. This is laudable. But the paper identifies two systematic errors in their approach:
For clustered evals like DROP, MGSM, and RACE, Llama 3 used the naive SE formula, ignoring within-cluster correlations. As we saw in Chapter 3, this can understate the true SE by 3x. Their reported CIs were anti-conservative — they appeared more precise than justified.
For evals with fractional scores (like F1 on DROP), Llama 3 used SEBernoulli = √(s̄(1−s̄)/n) even though scores were not binary. The Bernoulli formula assumes maximum variance for a given mean — it treats every score as 0 or 1. When scores are actually continuous (e.g., F1 scores between 0 and 1), the true variance is smaller, so the Bernoulli SE is conservative (too wide).
The correct approach: use SECLT computed from the actual sample variance of the scores.
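DROP uses F1 scores (continuous, 0 to 1) with 588 passage clusters containing 9,622 questions total. Suppose a model scores 87.1: SEBernoulli = √(0.871 × 0.129 / 9,622) ≈ 0.34.
In this case, the Bernoulli formula gives 0.34, the naive CLT figure from the clustering table is 0.44, and the clustered SE is 1.34. Both unclustered figures are far too small. Note that for a single set of scores in [0, 1] the Bernoulli SE is an upper bound on the CLT SE (up to a negligible n/(n − 1) factor), so the 0.34 and 0.44 quoted here were presumably computed from different score sets. In general the two Llama 3 errors go in opposite directions on this eval: Bernoulli is wider than CLT, but missing clustering is much narrower. The net effect: still dangerously anti-conservative.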
Ironically, these two errors sometimes offset: too-narrow from ignoring clustering, too-wide from using Bernoulli. But they offset by accident, not design, and the degree varies per eval. On MGSM (binary scores, clustered), only error 1 applies — CIs are purely anti-conservative. On non-clustered evals with continuous scores, only error 2 applies — CIs are purely conservative.
The paper distills its analysis into concrete, actionable recommendations for anyone running or reporting LLM evals:
If you are writing a technical report today, do these three things:
These three steps would transform the interpretability of every leaderboard result. They cost essentially nothing — a few lines of NumPy after the eval run.
| Scenario | Formula |
|---|---|
| Single model, binary, independent | SE = √(p(1-p)/n) |
| Single model, continuous, independent | SE = sd(s) / √n |
| Single model, clustered | SEclustered via Eq. 4 |
| Two models, unpaired | SE = √(SEA² + SEB²) |
| Two models, paired | SE = sd(di) / √n |
| Sample size (paired) | n = (zα/2+zβ)² ω² / δ² |
This paper sits at the intersection of classical statistics and modern ML evaluation practice. Here is how it connects to the broader landscape:
This paper is part of a growing movement to professionalize LLM evaluation. As the field matures, "highest number wins" is not sustainable. A/B testing in industry already uses these tools — the same rigor should apply when comparing language models.
The recommendations are simple, free to implement, and would immediately improve the interpretability of every technical report and leaderboard. Every eval framework should compute and display SEs by default. Every leaderboard should show error bars. Every technical report should include pairwise analysis when comparing models. These are not novel statistical techniques — they are intro-level statistics applied with care to a domain that has been ignoring them.