Schaeffer, Kazdan, Hughes, Juravsky et al. — Stanford & Google DeepMind, 2025

How Do LLMs Get Their
Power Laws?

Each problem scales exponentially with attempts. The aggregate scales as a power law. Where does the polynomial come from? The answer: heavy-tailed difficulty distributions.

Prerequisites: Basic probability + Logarithms. That's it.
10
Chapters
4+
Simulations

Chapter 0: The Puzzle

You give an LLM a math problem. It gets it wrong. You try again. Wrong. You try 100 times. On one of those 100 attempts, it gets it right. This is repeated sampling -- the simplest form of scaling inference compute. No fancy search, no chain-of-thought trees. Just ask again.

Brown et al. (2024) discovered something striking when they scaled this up across hundreds of math problems with Pythia models. They plotted -log(average success rate) against the number of attempts k on a log-log plot. It was a straight line. A power law:

−log(passD@k) ≈ a · k−b

Hughes et al. (2024) found the same thing independently when jailbreaking frontier LLMs -- Claude, GPT-4, Gemini. Power law scaling in the number of jailbreak attempts. Across completely different tasks and models, the same pattern. Proof assistants showed it too. Something deep and universal is at work.

But here is the puzzle. Take any single problem. The model either solves it or it doesn't on each independent attempt. If the probability of success on one try is p, then the probability of failing ALL k attempts is (1-p)k. That is exponential decay -- much faster than polynomial. Every individual problem scales exponentially with attempts.

So we have a paradox. Every problem individually decays exponentially (fast). But their average decays as a power law (slow). Averaging fast things should not produce something slow. And yet it does. This paper explains exactly why.

The paradox: Each problem scales exponentially. The aggregate scales as a power law. Exponential is much faster than polynomial. So how does averaging a bunch of fast-decaying exponentials produce a slow-decaying power law? Something about the averaging process must be transforming exponential into polynomial. But what?
Exponential vs Power Law

Compare: a single problem's exponential decay (teal) vs the aggregate power law (orange). On a log-log plot, the power law is a straight line; the exponential curves away.

Single-problem p0.10
Power law exponent b0.30
Why is aggregate power law scaling surprising given per-problem behavior?

Chapter 1: The Key Insight

The resolution to the puzzle is elegant. It is not the exponential scaling of each problem that creates the power law. It is the distribution of difficulties across problems.

Think of it this way. Some problems are easy -- the model solves them with p = 0.8 on any given try. After just k = 5 attempts, the failure rate is 0.25 = 0.00032. Essentially solved. Other problems are moderate -- p = 0.1, so failure after 5 attempts is 0.95 = 0.59. Takes more tries but gets solved eventually.

But a small fraction of problems are extremely hard. p = 0.001, or p = 0.0001. For these, even after k = 1000 attempts, the failure rate is still (0.999)1000 = 0.368. These problems resist being solved for a very long time. And here is the key: if the distribution of p values has a heavy left tail -- meaning there are just enough of these extremely hard problems -- then those hard problems collectively drag the aggregate success rate down from exponential to polynomial.

The core mechanism in one sentence: A heavy-tailed distribution of per-problem success probabilities transforms per-problem exponential scaling into aggregate power law scaling. The tail of the difficulty distribution sets the power law exponent.

More precisely, the paper proves that aggregate power law scaling with exponent b happens if and only if the density of single-attempt success probabilities behaves like pb-1 near zero. This is a power law left tail in the difficulty distribution.

The word "if and only if" is important. It is not just that heavy tails are sufficient to produce power laws -- they are necessary. If you see a power law in the aggregate, the difficulty distribution MUST have a heavy left tail. And if you see no heavy tail, you CANNOT get a power law. The two are mathematically equivalent.

Per-problem
Each problem i has success probability pi. Failure after k attempts = (1 − pi)k. Exponential in k.
Distribution
The collection of pi values across problems follows a heavy-tailed distribution. Many problems have tiny pi.
Aggregate
Averaging (1 − pi)k over the heavy-tailed distribution of p produces −log(pass@k) ∝ k−b. Power law.
What property of the difficulty distribution transforms exponential per-problem scaling into aggregate power law scaling?

Chapter 2: Per-Problem Exponential Scaling

Let us derive the per-problem scaling carefully. Problem i has a single-attempt success probability passi@1. Each of k attempts is independent and identically distributed. The model either solves it (probability passi@1) or fails (probability 1 − passi@1) on each try.

What is the probability of solving problem i in at least one of k attempts? It is 1 minus the probability of failing ALL k attempts:

passi@k = 1 − (1 − passi@1)k

Now take the negative log. For large k, (1 − passi@1)k becomes small, and using the Taylor expansion log(1 − x) ≈ −x for small x:

−log(passi@k) = −log(1 − (1 − passi@1)k) ≈ (1 − passi@1)k

This is a pure exponential in k. The base is (1 − passi@1), which is between 0 and 1, so it decays exponentially. The smaller passi@1 is, the closer the base is to 1, and the slower the decay.

Worked example

Suppose passi@1 = 0.05 (the model solves it 5% of the time). After k = 10 attempts:

passi@10 = 1 − 0.9510 = 1 − 0.599 = 0.401

After k = 50 attempts:

passi@50 = 1 − 0.9550 = 1 − 0.077 = 0.923

After k = 100 attempts:

passi@100 = 1 − 0.95100 = 1 − 0.0059 = 0.994

Even with just a 5% chance per try, 100 attempts virtually guarantees success. That is the power of exponential decay in failure probability. But what if passi@1 = 0.0001? Then (1 − 0.0001)100 = 0.99, still 99% failure. Even 10,000 attempts gives only 63% success. These ultra-hard problems are what create the power law.

The half-life formula: The number of attempts needed for a 50% chance of solving problem i is k1/2 = ln(0.5) / ln(1 − pi) ≈ 0.693 / pi for small pi. So a problem with p = 0.01 needs ∼69 attempts, p = 0.001 needs ∼693, and p = 0.0001 needs ∼6931. The half-life scales as 1/p -- linearly in the inverse success probability. This is why hard problems resist so stubbornly.
Per-Problem Failure Decay

Adjust p to see how the failure rate (1-p)k drops. Small p decays slowly -- these problems dominate the aggregate at large k.

passi@10.050
If a problem has passi@1 = 0.01, approximately how many attempts are needed for a 50% chance of solving it?

Chapter 3: The Aggregation Puzzle

Now we aggregate. The average success rate across P problems in a benchmark is:

passD@k = (1/P) ∑i=1P passi@k = 1 − (1/P) ∑i=1P (1 − passi@1)k

In the continuous limit (replacing the sum with an integral over the distribution D of success probabilities):

passD@k = 1 − ∫01 (1 − p)k · fD(p) dp

where fD(p) is the probability density function of single-attempt success probabilities across problems.

Now, if all problems had the SAME difficulty p0, this integral collapses to (1 − p0)k, and −log(passD@k) would decay exponentially. No power law. If the difficulties follow a Gaussian distribution (concentrated around some mean), the aggregate still decays approximately exponentially -- the integral is dominated by the mean.

The magic happens only when the density fD(p) puts enough weight near p = 0. Why? Because those near-zero terms (1 − p)k ≈ e−pk decay incredibly slowly. Even as the easy and moderate problems are solved and contribute nothing to the failure sum, the hard problems with tiny p keep contributing for a long, long time. The aggregate is dominated by the tail.

Why sums of exponentials can produce power laws: This is a known mathematical result (going back to Feller). A weighted sum of exponentials e−λt, where the weights on the rates λ follow a power law distribution, produces a power law in t. Here, λ maps to p (the per-problem success probability), and t maps to k (number of attempts). The distribution of λ values is the distribution of p values across problems.

Numerical walkthrough

Suppose we have just 5 problems with pass@1 = {0.5, 0.1, 0.01, 0.001, 0.0001}. The aggregate failure at k attempts is:

(1/5)[0.5k + 0.9k + 0.99k + 0.999k + 0.9999k]

At k = 10: the first three terms are tiny (0.001, 0.35, 0.90). The sum is dominated by the last two terms (0.99910 = 0.99, 0.999910 = 0.999). Aggregate failure ≈ 0.45.

At k = 1000: the first four terms are negligible. Only 0.99991000 = 0.905 survives. Aggregate failure ≈ 0.18.

At k = 10000: even the hardest problem yields 0.999910000 = 0.368. Aggregate failure ≈ 0.074.

Notice how at each decade of k, a different problem "drops out" of the sum. The hardest problem dominates for longer and longer. This cascading dropout is exactly what produces the power law -- the rate at which problems drop out is controlled by the spacing of p values, which is controlled by the tail of the distribution.

Same Problems, Different Distributions

Compare two distributions of problem difficulty. Uniform: all p values equally likely. Concentrated: all problems have similar p. Only the heavy-tailed distribution produces a power law aggregate.

DistributionMixed
If all problems in a benchmark had the same difficulty (same pass@1 probability), what would the aggregate scaling look like?

Chapter 4: Heavy-Tailed Distributions

This is the heart of the paper. Let us be precise about what "heavy-tailed" means here and see it in action.

A distribution of success probabilities p has a power law left tail if its density near p = 0 looks like:

fD(p) ≈ C · pb−1    as p → 0+

When b < 1, this density actually blows up near zero -- there are more and more problems as difficulty increases. When b = 1, the density is flat near zero (like a uniform distribution). When b > 1, the density vanishes near zero, but if b is not too large, there are still enough hard problems to matter.

The paper proves that several standard distributions produce this behavior. Uniform(0, β) has a flat density, giving b = 1. Beta(α, β) has density proportional to pα−1 near zero, giving b = α. Kumaraswamy(α, β) similarly gives b = α.

The empirical distributions of passi@1 across Pythia models on MATH and frontier models on HarmBench are well fit by scaled Kumaraswamy distributions -- and they all have heavy left tails. This is not a coincidence. Benchmarks are designed to have a spread of difficulties, and that spread naturally creates the tail.

DistributionDensity near p=0Aggregate exponent b
Uniform(0, β)Constant (p0)b = 1
Beta(α, β)pα−1b = α
Kumaraswamy(α, β)pα−1b = α
Continuous Bernoulli(λ<0.5)Constantb = 1
Reciprocal(α, β)p−1 (too heavy)Not a power law

Notice: the Reciprocal distribution has a tail that is TOO heavy -- p−1 is not integrable near zero, meaning it does not satisfy the conditions of Theorem 3.1. The result is not a power law but something that decays faster due to the exponential prefactor.

The tail IS the exponent: The power law exponent b in the aggregate scaling −log(pass@k) ∝ k−b is directly determined by the shape of the difficulty distribution near zero. Steeper tails (smaller b) mean slower power law scaling. Flatter tails (larger b) mean faster scaling. You can literally read off the scaling exponent from the histogram of per-problem success rates.

Concrete example: three distributions

Consider 200 math problems. Distribution A: all problems have pass@1 = 0.1. The aggregate decays as 0.9k -- purely exponential, no power law. Distribution B: pass@1 is Uniform(0, 0.2). The density is flat, so b = 1, giving −log(pass@k) ∝ k−1. Distribution C: pass@1 follows Beta(0.3, 5). The density near zero is p−0.7, so b = 0.3, giving −log(pass@k) ∝ k−0.3 -- a very slow power law.

Distribution C has the heaviest tail. It includes problems with pass@1 = 10−5 or even 10−6. These almost-impossible problems act as anchors, keeping the aggregate failure rate high even after thousands of attempts. Distribution A has no anchor problems at all -- once 0.9k decays, everything is solved.

The analogy: imagine a footrace where everyone runs at a different speed. If all runners are similarly fast (concentrated distribution), the last person finishes soon after the first -- exponential-like. But if the speeds are heavy-tailed (most people are fast, but a few are extraordinarily slow), the last finisher takes disproportionately long. The average finishing time is dominated by the slowest runners. The "power law" is the curve of how many runners have finished as a function of time.

Tail Heaviness Controls the Power Law

Adjust the tail parameter. Left panel: the distribution of per-problem success probabilities. Right panel: the resulting aggregate scaling on a log-log plot. Watch the power law exponent change.

Tail exponent b0.40
Num problems200
A benchmark has aggregate power law exponent b = 0.3. What does this tell you about the distribution of per-problem success rates?

Chapter 5: The Derivation

Let us prove the central theorem. We want to show that if fD(p) ≈ C · pb−1 near p = 0, then −log(passD@k) ∼ C · Γ(b) · k−b for large k.

Start from the aggregate failure probability:

1 − passD@k = ∫01 (1 − p)k fD(p) dp

For large k, (1 − p)k ≈ e−pk is negligible unless p is very small (of order 1/k). So the integral is dominated by the region near p = 0. Substitute u = pk, so p = u/k and dp = du/k:

01 e−pk · C · pb−1 dp ≈ ∫0 e−u · C · (u/k)b−1 · (du/k)

The upper limit goes to ∞ because we extended it (the integrand is negligible for u > k). Collecting powers of k:

= C · k−b0 ub−1 e−u du = C · k−b · Γ(b)

That integral on the right is exactly the Gamma function Γ(b) = ∫0 ub−1 e−u du. It is a pure constant that depends only on b, not on k. All the k-dependence is captured by the k−b prefactor. For large k, the failure probability 1 − passD@k is small, so:

−log(passD@k) ≈ 1 − passD@k ≈ C · Γ(b) · k−b

This is the power law. The exponent b comes directly from the tail exponent of the density.

The substitution is the entire trick. By substituting u = pk, we rescale the integration variable so that the region dominating the integral (p ∼ 1/k) maps to u ∼ 1. The density's behavior near zero, pb−1, then factors out as k−b. The remaining integral is the Gamma function -- a known constant. This is a standard Laplace-type asymptotic argument.

The paper also proves the converse: if −log(passD@k) ∼ A · k−b, then the density MUST satisfy fD(p) ∼ (A/Γ(b)) · pb−1 near zero. Power law scaling and heavy-tailed difficulty are two sides of the same coin.

What is Γ(b)? The Gamma function Γ(b) = ∫0 ub−1 e−u du is a generalization of the factorial. For positive integers, Γ(n) = (n−1)!. For non-integer values: Γ(0.5) = √π ≈ 1.77, Γ(0.3) ≈ 2.99, Γ(1) = 1. Since typical exponents b are between 0.1 and 0.5, Γ(b) is a modest constant (between about 2 and 10) that scales the prefactor of the power law.

Worked example: Uniform distribution

If passi@1 ∼ Uniform(0, β), then fD(p) = 1/β for p ∈ (0, β). Near zero, fD(p) = 1/β = constant · p0, so b − 1 = 0, giving b = 1. The aggregate scales as k−1. And indeed:

−log(passUniform@k) ≈ (1/β) · Γ(1) · k−1 = k−1

Worked example: Beta distribution

If passi@1 ∼ Beta(α, β), the density near zero is proportional to pα−1. So b = α. The aggregate scales as k−α. A flatter tail (α = 0.5) means slower scaling (exponent −0.5). A steeper tail (α = 0.15) means even slower scaling (exponent −0.15).

For the empirical data: Pythia models on MATH have fitted α values between 0.15 and 0.45, depending on model size. Larger models have larger α (steeper tail, faster scaling) because they solve more problems easily, pushing the hard tail to smaller values. The distributional exponent b matches the least-squares exponent to within measurement error.

In the derivation, why does the substitution u = pk work to extract the power law exponent?

Chapter 6: Empirical Validation

The theory predicts two testable things. First, per-problem scaling should be exponential. Second, the distribution of passi@1 should have a heavy left tail that predicts the aggregate exponent. The paper tests both.

Per-problem exponential scaling

Using Brown et al.'s data (Pythia 70M through 12B on 128 MATH problems) and Hughes et al.'s data (Claude 3.5, GPT-4o, Gemini 1.5, Llama 3 8B IT on 159 HarmBench prompts), the paper plots −log(passi@k) for each individual problem. On every single problem, for every model, the negative log success rate falls exponentially with k. Straight lines on a log-linear plot.

This is exactly what the theory predicts from the independence assumption. Each attempt is an independent coin flip with probability pi. The aggregate over attempts is purely exponential. The key finding: the aggregate over problems is where the power law emerges. Per-attempt scaling is trivial; per-problem distribution is where all the action is.

Heavy-tailed distributions

They then examine the histograms of passi@1 across problems. For all Pythia models on MATH and most frontier models on HarmBench, the distributions have clear power law-like left tails -- a concentration of problems with very small success probabilities.

Why a scaled Kumaraswamy? The standard Kumaraswamy(α, β) distribution has support on (0, 1), but most benchmarks have max success rates well below 1. On MATH with Pythia 70M, the highest pass@1 might be 0.05 -- not 1.0. So the paper uses a 3-parameter scaled version Kumaraswamy(α, β, c) where c shrinks the support to (0, c). This is critical for good fits. The α parameter controls the tail shape and directly gives the power law exponent b.

SettingModelsProblemsPer-problemAggregateTail?
MATHPythia 70M–12B128ExponentialPower lawYes
HarmBenchClaude, GPT, Gemini159ExponentialPower lawYes
HarmBenchLlama 3 8B IT159ExponentialNOT power lawNo
The Llama 3 exception proves the rule. Llama 3 8B Instruction Tuned could be jailbroken on EVERY prompt within the sampling budget. Its passi@1 distribution had no heavy left tail -- no extremely hard problems. The theory predicts: no heavy tail → no power law. And indeed, Llama 3 8B IT's aggregate scaling fell faster than any power law. The model was simply too easy to jailbreak.
Empirical Fit: Model Comparison

Simulated aggregate scaling for three model types. The "heavy-tailed" model (like Pythia on MATH) produces a clean power law. The "no tail" model (like Llama 3 on HarmBench) decays faster than any power law.

Model typeHeavy-tailed (Pythia-like)
Why did Llama 3 8B IT not exhibit power law scaling under Best-of-N jailbreaking?

Chapter 7: Predicting the Exponent

Here is where the theory becomes practically useful. The standard way to measure the power law exponent b is brute force: sample k = 1, 10, 100, 1000, 10000 attempts per problem, compute the aggregate success rate at each k, then fit a line in log-log space. This requires enormous compute -- you need huge k to see the asymptotic behavior.

The distributional approach offers a shortcut. Instead of running massive numbers of attempts, you can:

Step 1
Generate a moderate number of samples per problem (say N = 100). Estimate passi@1 for each problem.
Step 2
Fit a parametric distribution (scaled Kumaraswamy or Beta) to the observed passi@1 values. The key: handle the left tail carefully by counting how many problems had ZERO successes and fitting the distribution to match this tail mass.
Step 3
Simulate the aggregate passD@k at arbitrary k using the fitted distribution: passD@k = 1 − ∫(1−p)k f̂(p) dp. Read off the exponent b from the simulated curve.

The paper tests this by backtesting on synthetic data with known ground-truth exponents. The distributional estimator achieves approximately 10x lower relative error than the standard least-squares fit. Equivalently, it needs 100–10,000x less inference compute to achieve the same accuracy.

Why is it so much better? Because the standard estimator needs enough samples to observe the power law at large k -- but that is precisely where compute is expensive. The distributional estimator only needs enough samples to characterize the left tail of the p-distribution, which can be done at much smaller sample sizes.

The practical recipe: Run 100-1000 samples per problem (not 10,000). Estimate passi@1 for each. Fit a distribution to these estimates. Simulate to predict how the aggregate will scale at k = 10,000+. You skip 2-4 orders of magnitude of sampling cost.

Handling the left tail bucket

There is a subtlety. If you run N = 100 samples per problem, any problem with true pass@1 < 1/100 = 0.01 will show zero successes. You cannot measure its exact p. But you CAN count how many problems fell into this "zero-success bucket." If 30 out of 128 problems had zero successes, you know the distribution must place about 30/128 = 23% of its mass below 0.01.

The distributional estimator exploits this. It fits a scaled Kumaraswamy(α, β, c) distribution by discretizing into bins of width 1/N and matching the observed counts via maximum likelihood -- including the crucial zero-success bin. This is what gives it power: it extrapolates the tail structure below the measurement resolution.

Worked example

You run N = 200 samples on 128 MATH problems with Pythia 1B. You observe: 40 problems with 0 successes, 15 with 1 success, 12 with 2 successes, and so on up to a few problems with 50+ successes. You fit Kumaraswamy(α, β, c) and get α = 0.22. The predicted aggregate exponent is b = α = 0.22. The standard approach would need k = 10,000+ samples per problem to measure this directly -- that is 50x more total compute.

Implementation sketch

python
import numpy as np
from scipy.optimize import minimize

def distributional_exponent(successes, n_samples, n_bins=100):
    """Estimate power law exponent from per-problem success counts."""
    # Estimate pass@1 for each problem
    p_hat = successes / n_samples
    # Count problems in zero-success bucket
    n_zero = np.sum(successes == 0)

    # Fit Kumaraswamy(alpha, beta, scale) via MLE
    def neg_log_lik(params):
        alpha, beta, scale = params
        if alpha <= 0 or beta <= 0 or scale <= 0:
            return 1e10
        # ... discretized MLE (see paper Sec. 5)
        return nll

    result = minimize(neg_log_lik, [0.3, 3.0, 0.5])
    alpha_hat = result.x[0]
    return alpha_hat  # This IS the power law exponent b
Distributional Estimator vs Brute Force

Drag the sample budget slider. The distributional estimator (teal) recovers the true exponent with far fewer samples than least-squares (orange).

Samples per problem100
Why does the distributional estimator need 100-10,000x less compute than the standard least-squares approach?

Chapter 8: Deviations from Power Law

Not every model-benchmark combination produces a clean power law. The distributional framework explains exactly when and why the scaling curve bends away from a straight line in log-log space.

Case 1: No heavy tail

If the distribution of passi@1 has no heavy left tail -- meaning all problems can be solved reasonably often -- then the aggregate fails faster than any power law. The scaling curve bends downward (faster improvement). This is exactly what happened with Llama 3 8B IT on HarmBench.

Case 2: Finite support

Real benchmarks have a finite number of problems. With P problems, the smallest possible passi@1 is bounded below by roughly 1/N (where N is the number of samples used to estimate it). This means the "true" tail is truncated. At very large k, once even the hardest problem is effectively solved, the scaling must eventually fall off faster than the power law predicts. The power law is an asymptotic result; finite benchmarks produce finite-range approximations.

Case 3: Multimodal distribution

If the difficulty distribution is not smooth -- for instance, if there are two clusters of problems (easy and hard with a gap in between) -- the scaling may show a "kink" or change in slope on the log-log plot. Each cluster contributes its own approximate power law in different regimes of k.

Think of it this way: at small k, the aggregate is dominated by the easy cluster (they are being solved first). At large k, the easy problems are all solved, and the hard cluster takes over. If the two clusters have different effective exponents, the log-log plot shows a transition from one slope to another -- a visible bend or kink.

Case 4: Very small benchmarks

With only 10-20 problems, the empirical distribution of p values is a poor approximation of any smooth density. The power law may appear noisy or not at all. The theory is asymptotic in both k (many attempts) and P (many problems). Small P gives high variance in the observed exponent.

The unifying insight: Every deviation from clean power law scaling has a corresponding feature in the p-distribution. Bend downward? Missing left tail. Kink? Multimodal distribution. Slower-than-predicted? The tail is heavier than the fitted distribution assumed. The p-distribution is the complete explanation.

The Reciprocal distribution: not a power law

As a counterexample, the paper derives what happens with a Reciprocal(α, β) distribution (log-uniform). Its density near zero goes like p−1 / log(β/α), which does not have the pb−1 form for any finite b. The result:

−log(passReciprocal@k) ≈ (1 − α)k / k

This decays faster than any power law because of the (1 − α)k factor. No heavy left tail in the right form means no power law scaling.

Deviation Types

Switch between three scenarios. Watch how the aggregate scaling (orange) deviates from a clean power law (yellow dashed) depending on the p-distribution shape.

ScenarioHeavy tail (power law)
If the aggregate scaling curve bends downward (improves faster than the fitted power law predicts) at large k, what does this indicate about the p-distribution?

Chapter 9: Connections

This paper provides the missing theoretical foundation for a rapidly growing body of work on inference-time compute scaling. The insight -- that heavy-tailed difficulty distributions transform exponential per-problem scaling into aggregate power laws -- has implications far beyond repeated sampling.

Before this work, the power law was an empirical observation without a mechanistic explanation. Now we know: it is a mathematical consequence of the difficulty distribution. This transforms the power law from a mysterious empirical regularity into a predictable, controllable property of model-benchmark interactions. If you know the difficulty distribution, you know the scaling law -- and you can predict it cheaply.

Broader implications

The authors speculate that the same mechanism might explain pretraining scaling laws. If pretraining loss is a sum of many component losses, each decaying at a different rate with compute, and the distribution of those rates is heavy-tailed, then the aggregate loss would follow a power law in compute. Concretely:

L(C) = ∑α w(α) · C−α

If the distribution of rates α is heavy-tailed, then for large compute C, the sum is dominated by the slowest-decaying components -- the "hard-to-learn" features. These are the "dark matter of neural scaling laws." The aggregate loss looks like A · C−αmin, a power law. The hard-to-learn components play the same role as the hard-to-solve problems in repeated sampling.

This connection is speculative but tantalizing. It suggests that power laws in deep learning may not be mysterious -- they may be a universal consequence of heterogeneous difficulty distributions, whether the difficulty is across benchmark problems, training data points, or learned features.

Practical takeaways

For benchmark designers: If you want your benchmark to exhibit clean power law scaling (which makes extrapolation easier), ensure it includes a spread of difficulties with a heavy left tail. This means including some genuinely hard problems that the model almost never solves.

For practitioners scaling inference compute: Before investing in massive repeated sampling runs, measure the p-distribution first. A small pilot (100 samples per problem) tells you the exponent, which tells you the returns from further scaling. If b is small (say 0.1), doubling k from 1000 to 2000 only improves −log(pass@k) by a factor of 20.1 = 1.07. Diminishing returns.

For scaling law researchers: When fitting power laws to empirical data, always check the underlying distribution. A clean power law requires a clean heavy tail. If the tail is truncated, your extrapolations will be optimistic.

PaperConnection
Large Language Monkeys (Brown et al. 2024)Discovered the aggregate power law for repeated sampling on MATH. This paper explains WHY.
Best-of-N Jailbreaking (Hughes et al. 2024)Found the same power law for jailbreaking. This paper unifies both under the same distributional framework.
Scaling Test-Time Compute (Snell et al. 2024)Studies more sophisticated inference-time strategies (verifiers, search). Power law understanding constrains what repeated sampling alone can achieve.
Chinchilla (Hoffmann et al. 2022)Pretraining scaling laws. The authors conjecture similar distributional mechanisms underlie these.
Kaplan et al. 2020Original neural scaling laws paper. Power laws in loss vs compute/data/parameters. Same asymptotic math may apply.

Summary of key results

ResultStatement
Per-problem exponential−log(passi@k) ≈ (1 − pi)k for each problem
Aggregate power law−log(passD@k) ≈ CΓ(b) k−b when the p-distribution has tail pb−1
If and only ifPower law aggregate ⇔ heavy-tailed p-distribution
Distributional estimator10x lower error or 100-10,000x less compute for predicting exponent
Deviation diagnosisNo tail → faster than power law. Truncated tail → bends down. Bimodal → kink.

Open questions

Why do benchmarks have heavy-tailed difficulty? Is it inherent to the structure of mathematical problems, or an artifact of benchmark design? The authors conjecture both: benchmarks intentionally include a spread of difficulties, and real problem spaces naturally have fat tails.

Does the same mechanism explain pretraining scaling? The "dark matter" hypothesis suggests yes -- the hardest-to-learn features play the same role as the hardest-to-solve problems. But this remains unproven.

Can we engineer the tail? If the power law exponent is determined by the difficulty distribution, could we design benchmarks with specific exponents? Or filter problems to make the distribution more uniform and get faster-than-power-law improvement?

"The aggregate is a democracy. But the hardest problems are the loudest voters."
— The principle behind power law emergence