Each problem scales exponentially with attempts. The aggregate scales as a power law. Where does the polynomial come from? The answer: heavy-tailed difficulty distributions.
You give an LLM a math problem. It gets it wrong. You try again. Wrong. You try 100 times. On one of those 100 attempts, it gets it right. This is repeated sampling -- the simplest form of scaling inference compute. No fancy search, no chain-of-thought trees. Just ask again.
Brown et al. (2024) discovered something striking when they scaled this up across hundreds of math problems with Pythia models. They plotted -log(average success rate) against the number of attempts k on a log-log plot. It was a straight line. A power law:
Hughes et al. (2024) found the same thing independently when jailbreaking frontier LLMs -- Claude, GPT-4, Gemini. Power law scaling in the number of jailbreak attempts. Across completely different tasks and models, the same pattern. Proof assistants showed it too. Something deep and universal is at work.
But here is the puzzle. Take any single problem. The model either solves it or it doesn't on each independent attempt. If the probability of success on one try is p, then the probability of failing ALL k attempts is (1-p)k. That is exponential decay -- much faster than polynomial. Every individual problem scales exponentially with attempts.
So we have a paradox. Every problem individually decays exponentially (fast). But their average decays as a power law (slow). Averaging fast things should not produce something slow. And yet it does. This paper explains exactly why.
Compare: a single problem's exponential decay (teal) vs the aggregate power law (orange). On a log-log plot, the power law is a straight line; the exponential curves away.
The resolution to the puzzle is elegant. It is not the exponential scaling of each problem that creates the power law. It is the distribution of difficulties across problems.
Think of it this way. Some problems are easy -- the model solves them with p = 0.8 on any given try. After just k = 5 attempts, the failure rate is 0.25 = 0.00032. Essentially solved. Other problems are moderate -- p = 0.1, so failure after 5 attempts is 0.95 = 0.59. Takes more tries but gets solved eventually.
But a small fraction of problems are extremely hard. p = 0.001, or p = 0.0001. For these, even after k = 1000 attempts, the failure rate is still (0.999)1000 = 0.368. These problems resist being solved for a very long time. And here is the key: if the distribution of p values has a heavy left tail -- meaning there are just enough of these extremely hard problems -- then those hard problems collectively drag the aggregate success rate down from exponential to polynomial.
More precisely, the paper proves that aggregate power law scaling with exponent b happens if and only if the density of single-attempt success probabilities behaves like pb-1 near zero. This is a power law left tail in the difficulty distribution.
The word "if and only if" is important. It is not just that heavy tails are sufficient to produce power laws -- they are necessary. If you see a power law in the aggregate, the difficulty distribution MUST have a heavy left tail. And if you see no heavy tail, you CANNOT get a power law. The two are mathematically equivalent.
Let us derive the per-problem scaling carefully. Problem i has a single-attempt success probability passi@1. Each of k attempts is independent and identically distributed. The model either solves it (probability passi@1) or fails (probability 1 − passi@1) on each try.
What is the probability of solving problem i in at least one of k attempts? It is 1 minus the probability of failing ALL k attempts:
Now take the negative log. For large k, (1 − passi@1)k becomes small, and using the Taylor expansion log(1 − x) ≈ −x for small x:
This is a pure exponential in k. The base is (1 − passi@1), which is between 0 and 1, so it decays exponentially. The smaller passi@1 is, the closer the base is to 1, and the slower the decay.
Suppose passi@1 = 0.05 (the model solves it 5% of the time). After k = 10 attempts:
After k = 50 attempts:
After k = 100 attempts:
Even with just a 5% chance per try, 100 attempts virtually guarantees success. That is the power of exponential decay in failure probability. But what if passi@1 = 0.0001? Then (1 − 0.0001)100 = 0.99, still 99% failure. Even 10,000 attempts gives only 63% success. These ultra-hard problems are what create the power law.
Adjust p to see how the failure rate (1-p)k drops. Small p decays slowly -- these problems dominate the aggregate at large k.
Now we aggregate. The average success rate across P problems in a benchmark is:
In the continuous limit (replacing the sum with an integral over the distribution D of success probabilities):
where fD(p) is the probability density function of single-attempt success probabilities across problems.
Now, if all problems had the SAME difficulty p0, this integral collapses to (1 − p0)k, and −log(passD@k) would decay exponentially. No power law. If the difficulties follow a Gaussian distribution (concentrated around some mean), the aggregate still decays approximately exponentially -- the integral is dominated by the mean.
The magic happens only when the density fD(p) puts enough weight near p = 0. Why? Because those near-zero terms (1 − p)k ≈ e−pk decay incredibly slowly. Even as the easy and moderate problems are solved and contribute nothing to the failure sum, the hard problems with tiny p keep contributing for a long, long time. The aggregate is dominated by the tail.
Suppose we have just 5 problems with pass@1 = {0.5, 0.1, 0.01, 0.001, 0.0001}. The aggregate failure at k attempts is:
At k = 10: the first three terms are tiny (0.001, 0.35, 0.90). The sum is dominated by the last two terms (0.99910 = 0.99, 0.999910 = 0.999). Aggregate failure ≈ 0.45.
At k = 1000: the first four terms are negligible. Only 0.99991000 = 0.905 survives. Aggregate failure ≈ 0.18.
At k = 10000: even the hardest problem yields 0.999910000 = 0.368. Aggregate failure ≈ 0.074.
Notice how at each decade of k, a different problem "drops out" of the sum. The hardest problem dominates for longer and longer. This cascading dropout is exactly what produces the power law -- the rate at which problems drop out is controlled by the spacing of p values, which is controlled by the tail of the distribution.
Compare two distributions of problem difficulty. Uniform: all p values equally likely. Concentrated: all problems have similar p. Only the heavy-tailed distribution produces a power law aggregate.
This is the heart of the paper. Let us be precise about what "heavy-tailed" means here and see it in action.
A distribution of success probabilities p has a power law left tail if its density near p = 0 looks like:
When b < 1, this density actually blows up near zero -- there are more and more problems as difficulty increases. When b = 1, the density is flat near zero (like a uniform distribution). When b > 1, the density vanishes near zero, but if b is not too large, there are still enough hard problems to matter.
The paper proves that several standard distributions produce this behavior. Uniform(0, β) has a flat density, giving b = 1. Beta(α, β) has density proportional to pα−1 near zero, giving b = α. Kumaraswamy(α, β) similarly gives b = α.
The empirical distributions of passi@1 across Pythia models on MATH and frontier models on HarmBench are well fit by scaled Kumaraswamy distributions -- and they all have heavy left tails. This is not a coincidence. Benchmarks are designed to have a spread of difficulties, and that spread naturally creates the tail.
| Distribution | Density near p=0 | Aggregate exponent b |
|---|---|---|
| Uniform(0, β) | Constant (p0) | b = 1 |
| Beta(α, β) | pα−1 | b = α |
| Kumaraswamy(α, β) | pα−1 | b = α |
| Continuous Bernoulli(λ<0.5) | Constant | b = 1 |
| Reciprocal(α, β) | p−1 (too heavy) | Not a power law |
Notice: the Reciprocal distribution has a tail that is TOO heavy -- p−1 is not integrable near zero, meaning it does not satisfy the conditions of Theorem 3.1. The result is not a power law but something that decays faster due to the exponential prefactor.
Consider 200 math problems. Distribution A: all problems have pass@1 = 0.1. The aggregate decays as 0.9k -- purely exponential, no power law. Distribution B: pass@1 is Uniform(0, 0.2). The density is flat, so b = 1, giving −log(pass@k) ∝ k−1. Distribution C: pass@1 follows Beta(0.3, 5). The density near zero is p−0.7, so b = 0.3, giving −log(pass@k) ∝ k−0.3 -- a very slow power law.
Distribution C has the heaviest tail. It includes problems with pass@1 = 10−5 or even 10−6. These almost-impossible problems act as anchors, keeping the aggregate failure rate high even after thousands of attempts. Distribution A has no anchor problems at all -- once 0.9k decays, everything is solved.
The analogy: imagine a footrace where everyone runs at a different speed. If all runners are similarly fast (concentrated distribution), the last person finishes soon after the first -- exponential-like. But if the speeds are heavy-tailed (most people are fast, but a few are extraordinarily slow), the last finisher takes disproportionately long. The average finishing time is dominated by the slowest runners. The "power law" is the curve of how many runners have finished as a function of time.
Adjust the tail parameter. Left panel: the distribution of per-problem success probabilities. Right panel: the resulting aggregate scaling on a log-log plot. Watch the power law exponent change.
Let us prove the central theorem. We want to show that if fD(p) ≈ C · pb−1 near p = 0, then −log(passD@k) ∼ C · Γ(b) · k−b for large k.
Start from the aggregate failure probability:
For large k, (1 − p)k ≈ e−pk is negligible unless p is very small (of order 1/k). So the integral is dominated by the region near p = 0. Substitute u = pk, so p = u/k and dp = du/k:
The upper limit goes to ∞ because we extended it (the integrand is negligible for u > k). Collecting powers of k:
That integral on the right is exactly the Gamma function Γ(b) = ∫0∞ ub−1 e−u du. It is a pure constant that depends only on b, not on k. All the k-dependence is captured by the k−b prefactor. For large k, the failure probability 1 − passD@k is small, so:
This is the power law. The exponent b comes directly from the tail exponent of the density.
The paper also proves the converse: if −log(passD@k) ∼ A · k−b, then the density MUST satisfy fD(p) ∼ (A/Γ(b)) · pb−1 near zero. Power law scaling and heavy-tailed difficulty are two sides of the same coin.
If passi@1 ∼ Uniform(0, β), then fD(p) = 1/β for p ∈ (0, β). Near zero, fD(p) = 1/β = constant · p0, so b − 1 = 0, giving b = 1. The aggregate scales as k−1. And indeed:
If passi@1 ∼ Beta(α, β), the density near zero is proportional to pα−1. So b = α. The aggregate scales as k−α. A flatter tail (α = 0.5) means slower scaling (exponent −0.5). A steeper tail (α = 0.15) means even slower scaling (exponent −0.15).
For the empirical data: Pythia models on MATH have fitted α values between 0.15 and 0.45, depending on model size. Larger models have larger α (steeper tail, faster scaling) because they solve more problems easily, pushing the hard tail to smaller values. The distributional exponent b matches the least-squares exponent to within measurement error.
The theory predicts two testable things. First, per-problem scaling should be exponential. Second, the distribution of passi@1 should have a heavy left tail that predicts the aggregate exponent. The paper tests both.
Using Brown et al.'s data (Pythia 70M through 12B on 128 MATH problems) and Hughes et al.'s data (Claude 3.5, GPT-4o, Gemini 1.5, Llama 3 8B IT on 159 HarmBench prompts), the paper plots −log(passi@k) for each individual problem. On every single problem, for every model, the negative log success rate falls exponentially with k. Straight lines on a log-linear plot.
This is exactly what the theory predicts from the independence assumption. Each attempt is an independent coin flip with probability pi. The aggregate over attempts is purely exponential. The key finding: the aggregate over problems is where the power law emerges. Per-attempt scaling is trivial; per-problem distribution is where all the action is.
They then examine the histograms of passi@1 across problems. For all Pythia models on MATH and most frontier models on HarmBench, the distributions have clear power law-like left tails -- a concentration of problems with very small success probabilities.
Why a scaled Kumaraswamy? The standard Kumaraswamy(α, β) distribution has support on (0, 1), but most benchmarks have max success rates well below 1. On MATH with Pythia 70M, the highest pass@1 might be 0.05 -- not 1.0. So the paper uses a 3-parameter scaled version Kumaraswamy(α, β, c) where c shrinks the support to (0, c). This is critical for good fits. The α parameter controls the tail shape and directly gives the power law exponent b.
| Setting | Models | Problems | Per-problem | Aggregate | Tail? |
|---|---|---|---|---|---|
| MATH | Pythia 70M–12B | 128 | Exponential | Power law | Yes |
| HarmBench | Claude, GPT, Gemini | 159 | Exponential | Power law | Yes |
| HarmBench | Llama 3 8B IT | 159 | Exponential | NOT power law | No |
Simulated aggregate scaling for three model types. The "heavy-tailed" model (like Pythia on MATH) produces a clean power law. The "no tail" model (like Llama 3 on HarmBench) decays faster than any power law.
Here is where the theory becomes practically useful. The standard way to measure the power law exponent b is brute force: sample k = 1, 10, 100, 1000, 10000 attempts per problem, compute the aggregate success rate at each k, then fit a line in log-log space. This requires enormous compute -- you need huge k to see the asymptotic behavior.
The distributional approach offers a shortcut. Instead of running massive numbers of attempts, you can:
The paper tests this by backtesting on synthetic data with known ground-truth exponents. The distributional estimator achieves approximately 10x lower relative error than the standard least-squares fit. Equivalently, it needs 100–10,000x less inference compute to achieve the same accuracy.
Why is it so much better? Because the standard estimator needs enough samples to observe the power law at large k -- but that is precisely where compute is expensive. The distributional estimator only needs enough samples to characterize the left tail of the p-distribution, which can be done at much smaller sample sizes.
There is a subtlety. If you run N = 100 samples per problem, any problem with true pass@1 < 1/100 = 0.01 will show zero successes. You cannot measure its exact p. But you CAN count how many problems fell into this "zero-success bucket." If 30 out of 128 problems had zero successes, you know the distribution must place about 30/128 = 23% of its mass below 0.01.
The distributional estimator exploits this. It fits a scaled Kumaraswamy(α, β, c) distribution by discretizing into bins of width 1/N and matching the observed counts via maximum likelihood -- including the crucial zero-success bin. This is what gives it power: it extrapolates the tail structure below the measurement resolution.
You run N = 200 samples on 128 MATH problems with Pythia 1B. You observe: 40 problems with 0 successes, 15 with 1 success, 12 with 2 successes, and so on up to a few problems with 50+ successes. You fit Kumaraswamy(α, β, c) and get α = 0.22. The predicted aggregate exponent is b = α = 0.22. The standard approach would need k = 10,000+ samples per problem to measure this directly -- that is 50x more total compute.
python import numpy as np from scipy.optimize import minimize def distributional_exponent(successes, n_samples, n_bins=100): """Estimate power law exponent from per-problem success counts.""" # Estimate pass@1 for each problem p_hat = successes / n_samples # Count problems in zero-success bucket n_zero = np.sum(successes == 0) # Fit Kumaraswamy(alpha, beta, scale) via MLE def neg_log_lik(params): alpha, beta, scale = params if alpha <= 0 or beta <= 0 or scale <= 0: return 1e10 # ... discretized MLE (see paper Sec. 5) return nll result = minimize(neg_log_lik, [0.3, 3.0, 0.5]) alpha_hat = result.x[0] return alpha_hat # This IS the power law exponent b
Drag the sample budget slider. The distributional estimator (teal) recovers the true exponent with far fewer samples than least-squares (orange).
Not every model-benchmark combination produces a clean power law. The distributional framework explains exactly when and why the scaling curve bends away from a straight line in log-log space.
If the distribution of passi@1 has no heavy left tail -- meaning all problems can be solved reasonably often -- then the aggregate fails faster than any power law. The scaling curve bends downward (faster improvement). This is exactly what happened with Llama 3 8B IT on HarmBench.
Real benchmarks have a finite number of problems. With P problems, the smallest possible passi@1 is bounded below by roughly 1/N (where N is the number of samples used to estimate it). This means the "true" tail is truncated. At very large k, once even the hardest problem is effectively solved, the scaling must eventually fall off faster than the power law predicts. The power law is an asymptotic result; finite benchmarks produce finite-range approximations.
If the difficulty distribution is not smooth -- for instance, if there are two clusters of problems (easy and hard with a gap in between) -- the scaling may show a "kink" or change in slope on the log-log plot. Each cluster contributes its own approximate power law in different regimes of k.
Think of it this way: at small k, the aggregate is dominated by the easy cluster (they are being solved first). At large k, the easy problems are all solved, and the hard cluster takes over. If the two clusters have different effective exponents, the log-log plot shows a transition from one slope to another -- a visible bend or kink.
With only 10-20 problems, the empirical distribution of p values is a poor approximation of any smooth density. The power law may appear noisy or not at all. The theory is asymptotic in both k (many attempts) and P (many problems). Small P gives high variance in the observed exponent.
As a counterexample, the paper derives what happens with a Reciprocal(α, β) distribution (log-uniform). Its density near zero goes like p−1 / log(β/α), which does not have the pb−1 form for any finite b. The result:
This decays faster than any power law because of the (1 − α)k factor. No heavy left tail in the right form means no power law scaling.
Switch between three scenarios. Watch how the aggregate scaling (orange) deviates from a clean power law (yellow dashed) depending on the p-distribution shape.
This paper provides the missing theoretical foundation for a rapidly growing body of work on inference-time compute scaling. The insight -- that heavy-tailed difficulty distributions transform exponential per-problem scaling into aggregate power laws -- has implications far beyond repeated sampling.
Before this work, the power law was an empirical observation without a mechanistic explanation. Now we know: it is a mathematical consequence of the difficulty distribution. This transforms the power law from a mysterious empirical regularity into a predictable, controllable property of model-benchmark interactions. If you know the difficulty distribution, you know the scaling law -- and you can predict it cheaply.
The authors speculate that the same mechanism might explain pretraining scaling laws. If pretraining loss is a sum of many component losses, each decaying at a different rate with compute, and the distribution of those rates is heavy-tailed, then the aggregate loss would follow a power law in compute. Concretely:
If the distribution of rates α is heavy-tailed, then for large compute C, the sum is dominated by the slowest-decaying components -- the "hard-to-learn" features. These are the "dark matter of neural scaling laws." The aggregate loss looks like A · C−αmin, a power law. The hard-to-learn components play the same role as the hard-to-solve problems in repeated sampling.
This connection is speculative but tantalizing. It suggests that power laws in deep learning may not be mysterious -- they may be a universal consequence of heterogeneous difficulty distributions, whether the difficulty is across benchmark problems, training data points, or learned features.
For benchmark designers: If you want your benchmark to exhibit clean power law scaling (which makes extrapolation easier), ensure it includes a spread of difficulties with a heavy left tail. This means including some genuinely hard problems that the model almost never solves.
For practitioners scaling inference compute: Before investing in massive repeated sampling runs, measure the p-distribution first. A small pilot (100 samples per problem) tells you the exponent, which tells you the returns from further scaling. If b is small (say 0.1), doubling k from 1000 to 2000 only improves −log(pass@k) by a factor of 20.1 = 1.07. Diminishing returns.
For scaling law researchers: When fitting power laws to empirical data, always check the underlying distribution. A clean power law requires a clean heavy tail. If the tail is truncated, your extrapolations will be optimistic.
| Paper | Connection |
|---|---|
| Large Language Monkeys (Brown et al. 2024) | Discovered the aggregate power law for repeated sampling on MATH. This paper explains WHY. |
| Best-of-N Jailbreaking (Hughes et al. 2024) | Found the same power law for jailbreaking. This paper unifies both under the same distributional framework. |
| Scaling Test-Time Compute (Snell et al. 2024) | Studies more sophisticated inference-time strategies (verifiers, search). Power law understanding constrains what repeated sampling alone can achieve. |
| Chinchilla (Hoffmann et al. 2022) | Pretraining scaling laws. The authors conjecture similar distributional mechanisms underlie these. |
| Kaplan et al. 2020 | Original neural scaling laws paper. Power laws in loss vs compute/data/parameters. Same asymptotic math may apply. |
| Result | Statement |
|---|---|
| Per-problem exponential | −log(passi@k) ≈ (1 − pi)k for each problem |
| Aggregate power law | −log(passD@k) ≈ CΓ(b) k−b when the p-distribution has tail pb−1 |
| If and only if | Power law aggregate ⇔ heavy-tailed p-distribution |
| Distributional estimator | 10x lower error or 100-10,000x less compute for predicting exponent |
| Deviation diagnosis | No tail → faster than power law. Truncated tail → bends down. Bimodal → kink. |
Why do benchmarks have heavy-tailed difficulty? Is it inherent to the structure of mathematical problems, or an artifact of benchmark design? The authors conjecture both: benchmarks intentionally include a spread of difficulties, and real problem spaces naturally have fat tails.
Does the same mechanism explain pretraining scaling? The "dark matter" hypothesis suggests yes -- the hardest-to-learn features play the same role as the hardest-to-solve problems. But this remains unproven.
Can we engineer the tail? If the power law exponent is determined by the difficulty distribution, could we design benchmarks with specific exponents? Or filter problems to make the distribution more uniform and get faster-than-power-law improvement?