We limit models to one attempt per problem. What if you let them try 250 times? DeepSeek-Coder goes from 15.9% to 56% on SWE-bench Lite, beating the 43% single-sample SOTA. Coverage grows predictably with the number of samples, following an exponentiated power law.
You have an LLM. You ask it a coding question. It gets it wrong. You sigh, try a different prompt, maybe switch to a bigger model. Standard practice in 2024: one attempt, one answer, move on.
But think about how humans solve hard problems. A mathematician does not solve an olympiad problem on the first try. They attempt it multiple times, exploring different strategies. A programmer does not write bug-free code in a single pass. They iterate, test, fix.
Why do we restrict LLMs to a single shot?
This paper asks a deceptively simple question: if you sample many candidate solutions from an LLM instead of just one, how does the chance of getting at least one correct answer scale with the number of samples?
The answer turns out to be surprisingly clean. Coverage, the fraction of problems solved by at least one generated sample, grows predictably with the number of samples over four orders of magnitude: the log of coverage follows an approximate power law in the sample count. And in domains where you can automatically verify correctness (unit tests, proof checkers), this translates directly into better real-world performance.
For every problem, the model has some probability p of generating a correct solution on any given attempt. If p is 5%, then with one sample you solve it 5% of the time. With two independent samples, you solve it 1 − (1 − 0.05)^2 = 9.75% of the time. With 100 samples: 1 − 0.95^100 ≈ 99.4%.
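The arithmetic above can be checked with a few lines of Python. This is a minimal sketch that assumes the attempts are independent and share the same success probability; the function name `solve_probability` is made up for illustration.

```python
def solve_probability(p: float, k: int) -> float:
    """Chance that at least one of k independent samples is correct,
    given a per-sample success probability p."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be a probability in [0, 1]")
    if k < 0:
        raise ValueError("k must be non-negative")
    # All k samples fail with probability (1 - p)^k; take the complement.
    return 1.0 - (1.0 - p) ** k


# Reproduce the worked example: p = 5% with 1, 2, and 100 samples.
for k in (1, 2, 100):
    print(f"k={k:>3}  P(solved)={solve_probability(0.05, k):.4f}")
```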
That is the math for a single problem. But across a dataset of problems, each problem has a different p. Some problems are easy (p = 0.8), some hard (p = 0.001), some impossible (p = 0). The aggregate coverage — fraction of problems solved by at least one sample — is the average of these individual success probabilities.
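A sketch of the aggregate calculation, under the same independence assumption. The three per-problem pass rates are the illustrative values from the paragraph above, not measured data, and `aggregate_coverage` is a hypothetical helper name.

```python
def aggregate_coverage(pass_rates, k):
    """Average, over problems, of the chance that at least one of
    k independent samples solves the problem."""
    if not pass_rates:
        raise ValueError("need at least one problem")
    return sum(1.0 - (1.0 - p) ** k for p in pass_rates) / len(pass_rates)


# Illustrative dataset: one easy, one hard, one impossible problem.
pass_rates = [0.8, 0.001, 0.0]

for k in (1, 10, 100, 1000, 10000):
    print(f"k={k:>5}  coverage={aggregate_coverage(pass_rates, k):.3f}")
```

In this toy dataset coverage can never exceed two thirds, because no number of samples rescues the problem with p = 0.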
Why does this matter? Because predictability means you can plan. You can estimate how many samples you need to hit a target coverage. You can compare the cost of sampling more from a cheap model versus sampling once from an expensive model. It turns inference compute into a knob you can rationally tune.
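The authors decompose repeated sampling into two subproblems: coverage (does any sample solve the problem?) and verification (can you pick the correct sample out of the pile?).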
When both problems are solved — high coverage plus reliable verification — repeated sampling becomes a powerful, general-purpose amplifier for any model.
The authors fit a simple functional form to the observed coverage curves. Start with the GPT-4 technical report's observation that the log of the pass rate scales as a power law with compute. Apply the same idea here, but with the number of samples k instead of training FLOPs:

log(c) ≈ a · k^b

where c is coverage, k is the number of samples, and a, b are fitted parameters. Exponentiating both sides gives the final model:

c ≈ exp(a · k^b)
This is an exponentiated power law. Let's unpack what it means.
Take Llama-3-8B-Instruct on MATH. The fitted parameters are a = −1.33, b = −0.43. Let's compute coverage at several sample budgets:
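The values can be computed directly from the fit. The sketch below assumes the fitted form is c = exp(a · k^b), which is the form consistent with both quoted parameters being negative; the sample budgets are arbitrary choices and the outputs are predictions of the fitted curve, not measured results.

```python
import math


def fitted_coverage(k: float, a: float, b: float) -> float:
    """Exponentiated power law, assumed form: c = exp(a * k**b)."""
    return math.exp(a * k ** b)


# Fitted parameters quoted in the text for Llama-3-8B-Instruct on MATH.
A, B = -1.33, -0.43

for k in (1, 10, 100, 1000, 10000):
    print(f"k={k:>5}  predicted coverage={fitted_coverage(k, A, B):.1%}")
```

Because b is negative, k^b shrinks toward zero as k grows, so the exponent a · k^b approaches zero from below and predicted coverage approaches 100%.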
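One simple way to estimate a and b from measured coverage is to linearize the model: if c = exp(a · k^b) with a < 0, then log(−log c) = log(−a) + b · log k, which is a straight line in log k. The sketch below fits that line with ordinary least squares. This is an assumption about how one could fit the curve, not necessarily the authors' procedure, and the data points are synthetic, generated from the quoted parameters to check that the fit recovers them.

```python
import math


def fit_exponentiated_power_law(ks, coverages):
    """Least-squares fit of log(-log c) = log(-a) + b * log k.

    Requires every coverage value to lie strictly between 0 and 1.
    Returns (a, b) with a < 0.
    """
    xs = [math.log(k) for k in ks]
    ys = [math.log(-math.log(c)) for c in coverages]
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    b = sxy / sxx                       # slope of the line
    a = -math.exp(y_mean - b * x_mean)  # intercept is log(-a)
    return a, b


# Synthetic coverage values generated from a = -1.33, b = -0.43.
ks = [1, 10, 100, 1000, 10000]
coverages = [math.exp(-1.33 * k ** -0.43) for k in ks]

a_hat, b_hat = fit_exponentiated_power_law(ks, coverages)
print(f"a = {a_hat:.2f}, b = {b_hat:.2f}")
```

On real measurements the points will not sit exactly on the line, and a fit in this transformed space weights errors differently from a fit on coverage itself.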
Here is the crux of the paper: coverage is only useful if you can identify the correct solution from your pile of samples. The paper studies five tasks and finds a sharp divide.
Three tasks have built-in verification:
For these tasks, coverage = performance. Every increase in coverage from more samples translates directly into more problems solved.
Two tasks lack verifiers:
For these tasks, the paper tests three verification strategies: majority voting, reward model + best-of-N, and reward model + weighted majority voting. All three plateau around 100 samples. Coverage keeps climbing to 98%+ at 10,000 samples, but actual accuracy with these verifiers stalls at ~41%.
Coverage is what a perfect verifier would achieve; majority voting and reward models fall short of it, and the gap grows wider as samples increase.
The most striking result in the paper is on SWE-bench Lite — a dataset of 300 real-world GitHub issues where the model must read a repository, understand the bug, and produce a code patch.
The authors use DeepSeek-Coder-V2-Instruct with the open-source Moatless Tools agent framework. Each "sample" is one entire multi-turn trajectory: the model navigates the codebase, identifies relevant files, and generates a patch. All attempts are independent — no learning between them.
With a single attempt, DeepSeek-Coder solves 15.9% of issues. The single-sample SOTA at the time (CodeStory Aide using a mix of GPT-4o and Claude 3.5 Sonnet) was 43%.
With 250 independent samples per issue, DeepSeek-Coder's coverage reaches 56%. That is 13 percentage points above the SOTA, achieved by a weaker, cheaper model simply by sampling more.
The climb is smooth. At 1 sample: 15.9%. At 5 samples: ~30%. At 25 samples: ~42%. At 100 samples: ~50%. At 250 samples: 56%. Every doubling of the sample budget adds a few percentage points. The curve shows no sign of saturating at 250 — more samples would likely push it higher.
Beyond SWE-bench, the authors evaluate on four additional benchmarks with models ranging from 70M to 70B parameters.
Competitive programming problems with hidden test cases. Results with Llama-3-8B-Instruct:
Even a 2B parameter model starts solving competitive programming problems it had virtually zero chance of solving on a single attempt.
Math problems formalized in Lean4, verified by the proof checker:
With an oracle that can check final answers:
We saw in Chapter 3 that majority voting and reward models plateau. Let's understand why in more detail.
Consider a hard MATH problem where the model gets the right answer 1% of the time. Out of 1,000 samples, about 10 will be correct and 990 will be wrong. But many of the wrong answers will agree with each other — common mistakes are common. So majority voting picks the most popular wrong answer.
As you increase from 1,000 to 10,000 samples, you get ~100 correct answers and ~9,900 wrong ones. The correct answer is still a tiny minority. Majority voting still picks the wrong answer. Coverage increased (the correct solution exists), but precision did not improve.
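A small deterministic sketch of this failure mode follows. The 1% correct rate comes from the example above; the assumption that 30% of samples share one common mistake, and that every other wrong answer is unique, is hypothetical and chosen only for illustration.

```python
from collections import Counter


def majority_vote(answers):
    """Return the most frequent answer among the samples."""
    return Counter(answers).most_common(1)[0][0]


def make_samples(n, p_correct=0.01, p_common_mistake=0.30):
    """Build n answers using expected counts instead of random draws.

    Hypothetical split: a small correct minority, one popular wrong
    answer, and a long tail of wrong answers that are all different.
    """
    n_correct = round(n * p_correct)
    n_common = round(n * p_common_mistake)
    samples = ["correct"] * n_correct + ["common_mistake"] * n_common
    samples += [f"other_{i}" for i in range(n - len(samples))]
    return samples


for n in (1000, 10000):
    samples = make_samples(n)
    covered = "correct" in samples   # would a perfect verifier succeed?
    picked = majority_vote(samples)  # what majority voting returns
    print(f"n={n:>5}  covered={covered}  majority vote picks: {picked}")
```

The problem counts as covered at both budgets, yet the vote returns the common mistake both times, because adding samples grows every answer's count in proportion.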
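You might hope a reward model could identify the rare correct solutions. The authors test ArmoRM-Llama3-8B-v0.1, a strong reward model, with two strategies: best-of-N (keep the sample the reward model scores highest) and weighted majority voting (weight each sample's vote by its reward score).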
The reward model cannot reliably distinguish correct from incorrect solutions on hard problems. On easy problems (where most samples are correct anyway), it works fine — but those problems were already solved with fewer samples.
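The two reward-model strategies, best-of-N and weighted majority voting, can be sketched in a few lines. This is a generic illustration, not the authors' code: the answers and scores below are invented, and in practice the scores would come from a reward model such as the one named above.

```python
from collections import defaultdict


def best_of_n(answers, scores):
    """Return the answer of the single highest-scoring sample."""
    best_index = max(range(len(answers)), key=lambda i: scores[i])
    return answers[best_index]


def weighted_majority_vote(answers, scores):
    """Sum the scores of samples that share an answer and return
    the answer with the largest total."""
    totals = defaultdict(float)
    for answer, score in zip(answers, scores):
        totals[answer] += score
    return max(totals, key=totals.get)


# Hypothetical final answers and reward scores for five samples.
answers = ["42", "17", "17", "17", "42"]
scores = [0.9, 0.4, 0.5, 0.3, 0.2]

print("best-of-N:", best_of_n(answers, scores))
print("weighted majority vote:", weighted_majority_vote(answers, scores))
```

With these made-up numbers the two strategies disagree: best-of-N follows the single 0.9 score, while the weighted vote favors the answer with more total support. Both depend entirely on the reward model's scores separating correct from incorrect samples.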
At 10 samples, coverage might be 50% and majority voting accuracy 38%. A 12-point gap. At 10,000 samples, coverage might be 98% and majority voting accuracy 41%. A 57-point gap. The gap widens as samples increase, because coverage keeps climbing while verification methods stall.
Sampling 250 times costs 250x as much. Is it worth it? The answer depends on what you are comparing against.
The authors compare Llama-3-8B-Instruct (many samples) versus Llama-3-70B-Instruct (fewer samples) at the same total FLOP budget. The results vary by task:
The paper compares API costs for SWE-bench Lite using the Moatless Tools framework:
| Model | Cost per attempt | Attempts per issue | Issues solved | Total cost (300 issues) | Relative total cost |
|---|---|---|---|---|---|
| DeepSeek-Coder-V2 | $0.0072 | 5 | 29.6% | $10.80 | 1x |
| GPT-4o | $0.13 | 1 | 24.0% | $39.00 | 3.6x |
| Claude 3.5 Sonnet | $0.17 | 1 | 26.7% | $51.00 | 4.7x |
Five attempts per issue with DeepSeek solve more issues than a single attempt from GPT-4o or Claude, at roughly one-quarter to one-fifth of the total cost (3.6x and 4.7x cheaper, respectively). The cheap model sampled more beats the expensive model sampled once.
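The totals in the table follow from multiplying the per-attempt cost by the number of attempts and by the 300 issues in SWE-bench Lite. The sketch below reproduces that arithmetic; reading "cost per attempt" as the cost of one attempt on one issue is an assumption, though it is the reading that matches the table's totals.

```python
N_ISSUES = 300  # size of SWE-bench Lite


def total_cost(cost_per_attempt, attempts, n_issues=N_ISSUES):
    """Total API cost in USD for running `attempts` attempts on every issue."""
    return cost_per_attempt * attempts * n_issues


# (cost per attempt in USD, attempts per issue), taken from the table.
runs = {
    "DeepSeek-Coder-V2": (0.0072, 5),
    "GPT-4o": (0.13, 1),
    "Claude 3.5 Sonnet": (0.17, 1),
}

baseline = total_cost(*runs["DeepSeek-Coder-V2"])
for model, (cost, attempts) in runs.items():
    cost_total = total_cost(cost, attempts)
    print(f"{model:<20} ${cost_total:6.2f}  {cost_total / baseline:.1f}x")
```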
This paper establishes repeated sampling as a scaling axis for inference compute, parallel to the well-known axes of model size, training data, and training compute. The implications ripple through how we think about deploying LLMs.
With training scaling laws, we learned to predict performance from training FLOPs and allocate budgets rationally. The exponentiated power law for coverage does the same for inference. You can now ask: "How many samples do I need to solve 90% of problems in this class?" and get a quantitative answer.
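To make that question concrete, the fitted curve can be inverted. Assuming the form c = exp(a · k^b) with a < 0 and b < 0, solving for k gives k = (ln(c) / a)^(1/b). The sketch below uses the parameters quoted earlier for Llama-3-8B-Instruct on MATH (a = −1.33, b = −0.43); the function names are hypothetical, and the result is an extrapolation from the fitted curve, not a measured value.

```python
import math


def fitted_coverage(k: float, a: float, b: float) -> float:
    """Exponentiated power law, assumed form: c = exp(a * k**b)."""
    return math.exp(a * k ** b)


def samples_for_target(target: float, a: float, b: float) -> int:
    """Smallest whole number of samples whose predicted coverage
    reaches `target` under the fitted curve."""
    if not 0.0 < target < 1.0:
        raise ValueError("target must be strictly between 0 and 1")
    if a >= 0 or b >= 0:
        raise ValueError("this sketch assumes a < 0 and b < 0")
    # ln(c) = a * k**b  =>  k = (ln(c) / a) ** (1 / b)
    k = (math.log(target) / a) ** (1.0 / b)
    return max(1, math.ceil(k))


A, B = -1.33, -0.43  # fitted parameters quoted in the text
k_needed = samples_for_target(0.90, A, B)
print(f"samples needed for 90% predicted coverage: {k_needed}")
print(f"predicted coverage at that budget: {fitted_coverage(k_needed, A, B):.1%}")
```

The answer only holds for the task and model the curve was fitted on, and only to the extent the fit keeps holding beyond the measured range.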
The ability to amplify weak models is practically important. Open-source models are cheaper to run, can be self-hosted, and avoid API vendor lock-in. If sampling 100 times from Llama-3-8B beats sampling once from GPT-4o on your task, the economics strongly favor the open-source path.
In domains with automatic verification (code, formal proofs), repeated sampling already works today. The frontier is building better verifiers for domains without them. Process reward models, self-verification, execution-based checking, and formal verification tools are all active research areas that directly unlock the potential of repeated sampling.
The authors identify three ways to improve repeated sampling beyond the "dumb" independent sampling they study:
Large Language Monkeys sits at the foundation of the test-time compute scaling paradigm. Here is how it connects to the broader landscape.