Large Language Monkeys

Chapter 0: The Problem

You have an LLM. You ask it a coding question. It gets it wrong. You sigh, try a different prompt, maybe switch to a bigger model. Standard practice in 2024: one attempt, one answer, move on.

But think about how humans solve hard problems. A mathematician does not solve an olympiad problem on the first try. They attempt it multiple times, exploring different strategies. A programmer does not write bug-free code in a single pass. They iterate, test, fix.

Why do we restrict LLMs to a single shot?

The infinite monkey theorem: Given infinite time, a monkey hitting keys at random on a typewriter will almost surely type any given text — including the complete works of Shakespeare. The probability is astronomically low per attempt, but with enough attempts, it approaches certainty. LLMs are far better than random monkeys. What happens when you give them 10, 100, or 10,000 attempts?

This paper asks a deceptively simple question: if you sample many candidate solutions from an LLM instead of just one, how does the chance of getting at least one correct answer scale with the number of samples?

The answer turns out to be surprisingly clean. Coverage, the fraction of problems solved by at least one generated sample, grows smoothly and predictably with the number of samples over four orders of magnitude, and the curve is well described by an exponentiated power law. In domains where you can automatically verify correctness (unit tests, proof checkers), this translates directly into better real-world performance.

One Attempt vs Many: The Core Idea

[Interactive figure: fraction of problems with at least one correct solution as a function of the number of samples per problem.]

Why is repeated sampling a natural way to scale inference compute?

Each independent sample explores a different region of the solution space. More samples means more chances to stumble on a correct answer, especially for problems the model can solve with low but nonzero probability.
Repeated sampling improves the model weights through reinforcement.
It makes each individual sample higher quality by using more compute per token.

Chapter 1: The Key Insight

For every problem, the model has some probability p of generating a correct solution on any given attempt. If p is 5%, then with one sample you solve it 5% of the time. With two independent samples, you solve it 1 − (1 − 0.05)² = 9.75% of the time. With 100 samples: 1 − 0.95¹⁰⁰ = 99.4%.

That is the math for a single problem. But across a dataset of problems, each problem has a different p. Some problems are easy (p = 0.8), some hard (p = 0.001), some impossible (p = 0). The aggregate coverage — fraction of problems solved by at least one sample — is the average of these individual success probabilities.
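The arithmetic above can be checked in a few lines. This is a minimal sketch, assuming samples are independent; the function names and the difficulty mix in `ps` are illustrative (taken from the example values in the text), not measured data.

```python
def solve_prob(p: float, k: int) -> float:
    """Probability that at least one of k independent samples is correct."""
    return 1.0 - (1.0 - p) ** k


def coverage(ps: list[float], k: int) -> float:
    """Expected fraction of problems solved by at least one of k samples."""
    return sum(solve_prob(p, k) for p in ps) / len(ps)


# Illustrative difficulty mix from the text: easy, hard, impossible.
ps = [0.8, 0.001, 0.0]

for k in (1, 100, 10_000):
    print(f"k={k:>6}: coverage = {coverage(ps, k):.3f}")
```

Note that the impossible problem (p = 0) caps coverage for this toy dataset at two thirds, no matter how large k gets.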

The key finding: Coverage does not just increase with more samples; it increases in a remarkably predictable way. log(coverage) follows an approximate power law in the number of samples, so coverage plotted against log(samples) traces a smooth curve that is often close to linear over much of its range. From 1 to 10,000 samples, the curve is smooth and extrapolatable.

Why does this matter? Because predictability means you can plan. You can estimate how many samples you need to hit a target coverage. You can compare the cost of sampling more from a cheap model versus sampling once from an expensive model. It turns inference compute into a knob you can rationally tune.

Two Problems, Not One

The authors decompose repeated sampling into two subproblems:

Problem 1: Coverage

As the sample budget increases, can we generate a correct solution for more and more problems? This paper shows: yes, reliably, following an exponentiated power law.

↓ then ↓

Problem 2: Precision

Once we have many samples, can we identify which one is correct? This depends on whether automatic verifiers (unit tests, proof checkers) are available. Without them: trouble.

When both problems are solved — high coverage plus reliable verification — repeated sampling becomes a powerful, general-purpose amplifier for any model.

If a problem has a per-sample success probability of 2%, what is the probability of solving it in 100 independent samples?

2%
50%
1 − 0.98¹⁰⁰ ≈ 86.7%
100%

Chapter 2: The Scaling Law

The authors fit a simple functional form to the observed coverage curves. Start with the GPT-4 technical report's observation that log-pass-rate scales as a power law with compute. Apply the same idea here, but with number of samples k instead of training FLOPs:

log(c) ≈ a · k^b

where c is coverage, k is the number of samples, and a, b are fitted parameters. Exponentiating both sides gives the final model:

c ≈ exp(a · k^b)

This is an exponentiated power law. Let's unpack what it means.

a < 0 always (since log(c) is negative when c < 1). Larger |a| means lower initial coverage — the task is harder for this model.
b < 0 always (we need k^b to decrease as k grows, so that a · k^b approaches 0 and c approaches 1). Typical values: b ≈ −0.1 to −0.5.
At k = 1: c(1) = exp(a), the single-sample success rate.
As k → ∞: k^b → 0, so c → exp(0) = 1. Perfect coverage, eventually.

Worked Example

Take Llama-3-8B-Instruct on MATH. The fitted parameters are a = −1.33, b = −0.43. Let's compute coverage at several sample budgets:

k = 1: c = exp(−1.33 · 1^−0.43) = exp(−1.33) = 26.4%
k = 10: c = exp(−1.33 · 10^−0.43) = exp(−1.33 · 0.372) = exp(−0.495) = 61.0%
k = 100: c = exp(−1.33 · 100^−0.43) = exp(−1.33 · 0.138) = exp(−0.184) = 83.2%
k = 1000: c = exp(−1.33 · 1000^−0.43) = exp(−1.33 · 0.0513) = exp(−0.068) = 93.4%
k = 10000: c = exp(−1.33 · 10000^−0.43) = exp(−1.33 · 0.0191) = exp(−0.025) = 97.5%
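The worked example can be reproduced with a short script. This is a sketch of the formula c = exp(a · k^b) only; the function name `fitted_coverage` is made up for illustration, and the parameters are the ones quoted in the text.

```python
import math


def fitted_coverage(k: float, a: float, b: float) -> float:
    """Exponentiated power law: c = exp(a * k**b)."""
    return math.exp(a * k ** b)


# Parameters quoted in the text for Llama-3-8B-Instruct on MATH.
A, B = -1.33, -0.43

for k in (1, 10, 100, 1_000, 10_000):
    print(f"k={k:>6}: coverage = {fitted_coverage(k, A, B):.1%}")
```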

Log-linear growth: Each 10x increase in samples multiplies log(c) by the same factor, 10^−0.43 ≈ 0.37. Coverage goes from 26% to 61% to 83% to 93% to 97%, so the jumps themselves shrink (35, 22, 10, then 4 points) as coverage approaches 1. On a log-x axis, coverage traces a smooth S-curve whose middle section is close to linear. This behavior holds across four orders of magnitude for most task-model combinations.

Exponentiated Power Law Explorer

[Interactive figure: coverage vs number of samples on a log-x scale for adjustable parameters a (difficulty) and b (exponent); defaults a = −1.33, b = −0.43.]

Fit quality: For Llama-3-8B-Instruct on MATH, the mean absolute error between the power law fit and actual coverage is only 0.003 ± 0.003. On CodeContests: 0.002 ± 0.002. These are tight fits. The main exception is MiniF2F-MATH (formal proofs), where the fit is looser (error 0.030 ± 0.016).
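One simple way to fit a and b is to linearize the model: taking logs twice gives log(−log c) = log(−a) + b · log k, which is a straight line in log k. The sketch below uses ordinary least squares on that line. It is an assumption for illustration, not necessarily the fitting procedure the authors used, and the data points are synthetic values generated from the parameters quoted above.

```python
import math


def fit_exp_power_law(ks, cs):
    """Fit c = exp(a * k**b) by least squares on log(-log c) vs log k."""
    xs = [math.log(k) for k in ks]
    ys = [math.log(-math.log(c)) for c in cs]  # requires 0 < c < 1
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sxy / sxx                    # slope of the line
    a = -math.exp(my - b * mx)       # intercept is log(-a)
    return a, b


# Synthetic coverage values generated from a = -1.33, b = -0.43.
ks = [1, 10, 100, 1_000, 10_000]
cs = [math.exp(-1.33 * k ** -0.43) for k in ks]

a, b = fit_exp_power_law(ks, cs)
mae = sum(abs(math.exp(a * k ** b) - c) for k, c in zip(ks, cs)) / len(ks)
print(f"a = {a:.3f}, b = {b:.3f}, mean absolute error = {mae:.2e}")
```

On real coverage curves the points will not lie exactly on the line, and the mean absolute error computed this way is the quantity the fit-quality numbers above describe.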

In the exponentiated power law c = exp(a · k^b), what does making |a| larger (more negative) correspond to?

A harder task for the model: lower single-sample success rate, needing more samples to reach high coverage
A faster rate of coverage improvement with samples
A higher maximum coverage ceiling

Chapter 3: Verification Matters

Here is the crux of the paper: coverage is only useful if you can identify the correct solution from your pile of samples. The paper studies five tasks and finds a sharp divide.

With Automatic Verifiers

Three tasks have built-in verification:

CodeContests: Each problem has input-output test cases. Run the code, check if the output matches. Binary yes/no.
MiniF2F (Lean proofs): The Lean proof checker verifies whether each proof is valid. No ambiguity.
SWE-bench Lite: Each GitHub issue has a test suite. If the patched code passes the tests, it is correct.

For these tasks, coverage = performance. Every increase in coverage from more samples translates directly into more problems solved.
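With a binary verifier, picking a solution from the pool reduces to returning the first candidate that passes. The sketch below illustrates this with a toy task; the candidates are plain Python functions standing in for generated programs, and all names here are hypothetical.

```python
from typing import Callable, Iterable, Optional

Candidate = Callable[[int], int]


def passes_tests(candidate: Candidate, tests: Iterable[tuple[int, int]]) -> bool:
    """Binary verifier: True only if every test case matches."""
    for x, expected in tests:
        try:
            if candidate(x) != expected:
                return False
        except Exception:
            return False  # a crashing candidate counts as a failure
    return True


def first_verified(candidates, tests) -> Optional[int]:
    """Index of the first candidate that passes all tests, or None."""
    tests = list(tests)
    for i, cand in enumerate(candidates):
        if passes_tests(cand, tests):
            return i
    return None


# Toy task: square the input. Three hypothetical samples; only the last is right.
samples = [lambda x: x * 2, lambda x: x // 0, lambda x: x * x]
tests = [(2, 4), (3, 9), (-1, 1)]

print(first_verified(samples, tests))
```

Because the verifier never accepts a wrong candidate on these tests, the problem counts as solved exactly when at least one sample is correct, which is why coverage and performance coincide.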

Without Automatic Verifiers

Two tasks lack verifiers:

GSM8K: Grade-school math. The model produces a chain of thought and a final answer. You can check if the final answer matches, but in practice you need a method to pick an answer from many conflicting samples.
MATH: Harder math problems. Same issue.

For these tasks, the paper tests three verification strategies: majority voting, reward model + best-of-N, and reward model + weighted majority voting. All three plateau around 100 samples. Coverage keeps climbing to 98%+ at 10,000 samples, but actual accuracy with these verifiers stalls at ~41%.

The verification gap: With Llama-3-8B-Instruct on MATH, coverage at 10,000 samples is 98.4%. But majority voting achieves only 41.4%. That is a 57-percentage-point gap. The correct solutions exist in the sample pool — but the verifier cannot find them. This is the central bottleneck of repeated sampling without domain-specific verification tools.

Coverage vs Verification Performance

[Interactive figure: coverage (what is possible with a perfect verifier) versus the accuracy that majority voting and reward models actually achieve, as a function of samples on a log scale. The gap grows wider as samples increase.]

Why does majority voting plateau as you increase the number of samples beyond ~100?

New correct solutions are generated at low frequency. The rare correct answer gets outvoted by the far more common incorrect answers, so the majority remains wrong regardless of sample count
The model starts generating worse solutions at higher sample counts
Majority voting is computationally too expensive beyond 100 samples

Chapter 4: SWE-bench Results

The most striking result in the paper is on SWE-bench Lite — a dataset of 300 real-world GitHub issues where the model must read a repository, understand the bug, and produce a code patch.

The Setup

The authors use DeepSeek-Coder-V2-Instruct with the open-source Moatless Tools agent framework. Each "sample" is one entire multi-turn trajectory: the model navigates the codebase, identifies relevant files, and generates a patch. All attempts are independent — no learning between them.

With a single attempt, DeepSeek-Coder solves 15.9% of issues. The single-sample SOTA at the time (CodeStory Aide using a mix of GPT-4o and Claude 3.5 Sonnet) was 43%.

The Result

With 250 independent samples per issue, DeepSeek-Coder's coverage reaches 56%. That is 13 percentage points above the SOTA, achieved by a weaker, cheaper model simply by sampling more.

Why SWE-bench is special: Each issue has a test suite that can automatically verify candidate patches. So coverage directly equals performance. No verification bottleneck. This is the ideal case for repeated sampling: hard problems where correctness is automatically checkable.

The Coverage Curve

The climb is smooth. At 1 sample: 15.9%. At 5 samples: ~30%. At 25 samples: ~42%. At 100 samples: ~50%. At 250 samples: 56%. Every doubling of the sample budget adds a few percentage points. The curve shows no sign of saturating at 250 — more samples would likely push it higher.

Comparison with GPT-4o: Using the same Moatless Tools framework, a single GPT-4o attempt solves 24.0% of issues. Five DeepSeek attempts beat that (29.6%), at a fraction of the cost. The weaker model with more tries outperforms the stronger model with one try.

What makes SWE-bench Lite an ideal testbed for repeated sampling?

Each issue has a test suite that automatically verifies patches, so every coverage gain from more samples directly improves performance with no verification bottleneck
The problems are easy enough that one sample is usually sufficient
The dataset is small enough to run many samples cheaply

Chapter 5: Coding & Math Tasks

Beyond SWE-bench, the authors evaluate on four additional benchmarks with models ranging from 70M to 70B parameters.

CodeContests

Competitive programming problems with hidden test cases. Results with Llama-3-8B-Instruct:

pass@1 = 5.3%, pass@100 = 24.0%, pass@10k = 47.3%
Gemma-2B: pass@1 = 0.02%, pass@10k = 7.1%, a more than 300x increase in coverage

Even a 2B parameter model starts solving competitive programming problems it had virtually zero chance of solving on a single attempt.
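pass@k numbers like these are commonly computed with the unbiased estimator from Chen et al. (2021): draw n samples per problem, count the c correct ones, and compute the chance that a random size-k subset contains at least one of them. The sketch below assumes that estimator; the counts in `counts` are hypothetical.

```python
import math


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples of which c are correct."""
    if n - c < k:
        return 1.0  # every size-k subset must contain a correct sample
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)


def coverage(correct_counts: list[int], n: int, k: int) -> float:
    """Average pass@k over problems, i.e. estimated coverage at k samples."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)


# Hypothetical counts of correct samples out of n = 100 for four problems.
counts = [0, 1, 5, 60]
for k in (1, 10, 100):
    print(f"pass@{k} = {coverage(counts, 100, k):.3f}")
```

Estimating pass@k from a single large pool this way avoids re-sampling separately for every value of k.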

MiniF2F (Formal Proofs)

Math problems formalized in Lean4, verified by the proof checker:

Llama-3-8B-Instruct: pass@1 = 2.3%, pass@10k = 41.5%
Llama-3-70B-Instruct: pass@1 = 3.8%, pass@10k = 48.5%
Both exceed GPT-4o's single-attempt 22.3% with enough samples

MATH (Oracle Verifier)

With an oracle that can check final answers:

Llama-3-8B-Instruct: pass@1 = 26.6%, pass@10k = 97.7%
Even Pythia-160M (a 160-million parameter model): pass@1 = 0.27%, pass@10k = 57%

GSM8K (Oracle Verifier)

Llama-3-8B-Instruct: pass@1 = 75.8%, pass@10k = 99.2%
Llama-3-70B-Instruct: pass@1 = 88.3%, pass@10k = 99.2% (one problem has wrong ground truth)

Model family consistency: Within a model family (Llama, Gemma, Pythia), the coverage curves have the same shape on a log-x axis — they are just shifted horizontally. A bigger model starts higher but climbs at the same rate. This suggests the difficulty distribution of problems is a property of the task, not the model.

Gemma-2B increases coverage on CodeContests by more than 300x (from 0.02% to 7.1%) with 10,000 samples. What does this demonstrate?

Even very small models with near-zero single-sample success can solve a meaningful fraction of hard problems given enough attempts, because the model assigns nonzero probability to correct solutions
Gemma-2B is secretly better than larger models at competitive programming
The CodeContests test cases are too lenient

Chapter 6: The Verification Bottleneck

We saw in Chapter 3 that majority voting and reward models plateau. Let's understand why in more detail.

The Needle in the Haystack

Consider a hard MATH problem where the model gets the right answer 1% of the time. Out of 1,000 samples, about 10 will be correct and 990 will be wrong. But many of the wrong answers will agree with each other — common mistakes are common. So majority voting picks the most popular wrong answer.

As you increase from 1,000 to 10,000 samples, you get ~100 correct answers and ~9,900 wrong ones. The correct answer is still a tiny minority. Majority voting still picks the wrong answer. Coverage increased (the correct solution exists), but precision did not improve.
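The needle-in-the-haystack effect is easy to reproduce. In the sketch below, the answer pool is hypothetical: 1% of samples give the correct answer "42" and the rest are split between two common mistakes. Scaling the pool from 1,000 to 10,000 samples leaves the vote unchanged.

```python
from collections import Counter


def majority_vote(answers: list[str]) -> str:
    """Most frequent final answer in the pool."""
    return Counter(answers).most_common(1)[0][0]


def is_covered(answers: list[str], truth: str) -> bool:
    """True if at least one sample has the correct answer."""
    return truth in answers


def pool(scale: int) -> list[str]:
    """Hypothetical pool: 1% correct ("42"), the rest common mistakes."""
    return ["42"] * (10 * scale) + ["41"] * (600 * scale) + ["24"] * (390 * scale)


for scale in (1, 10):
    answers = pool(scale)
    print(len(answers), is_covered(answers, "42"), majority_vote(answers))
```

The problem is covered at both scales, yet majority voting returns the same wrong answer each time, because more samples do not change the relative frequencies.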

Reward Models Fail Too

You might hope a reward model could identify the rare correct solutions. The authors test ArmoRM-Llama3-8B-v0.1, a strong reward model. Two strategies:

Best-of-N: Score every sample, pick the highest-scoring one. Plateaus around 100 samples.
Weighted majority voting: Weight each sample's vote by its reward score. Also plateaus.

The reward model cannot reliably distinguish correct from incorrect solutions on hard problems. On easy problems (where most samples are correct anyway), it works fine — but those problems were already solved with fewer samples.
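The two reward-model strategies can be sketched as follows. The scores are made up to mimic an imperfect reward model that rates a plausible-but-wrong sample slightly above the one correct sample; they are not outputs of ArmoRM or any real model.

```python
from collections import defaultdict


def best_of_n(answers: list[str], scores: list[float]) -> str:
    """Answer of the single highest-scoring sample."""
    best = max(range(len(answers)), key=lambda i: scores[i])
    return answers[best]


def weighted_majority(answers: list[str], scores: list[float]) -> str:
    """Answer with the largest total reward score across samples."""
    totals: dict[str, float] = defaultdict(float)
    for ans, score in zip(answers, scores):
        totals[ans] += score
    return max(totals, key=totals.get)


# Hypothetical pool: the correct answer is "42".
answers = ["41", "41", "41", "42", "24"]
scores = [0.70, 0.65, 0.72, 0.71, 0.30]

print(best_of_n(answers, scores), weighted_majority(answers, scores))
```

Best-of-N only succeeds if the correct sample gets the top score, while weighted voting additionally needs the correct answer to outweigh the combined score of every popular mistake.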

A reassuring finding: The authors manually inspect 105 chains-of-thought from correct Llama-3-8B-Instruct samples on GSM8K. Over 90% have faithful, valid reasoning steps — even when the model rarely gets the right answer. The correct samples are not just lucky guesses with nonsense reasoning. There is signal for a verifier to exploit. Current verifiers just are not good enough to find it.

The Growing Gap

At 10 samples, coverage might be 50% and majority voting accuracy 38%. A 12-point gap. At 10,000 samples, coverage might be 98% and majority voting accuracy 41%. A 57-point gap. The gap widens as samples increase, because coverage keeps climbing while verification methods stall.

Implication for future work: The biggest bottleneck in scaling inference compute is not generating correct solutions — it is finding them. Better verifiers (process reward models, execution-based checking, self-consistency with reasoning) are the key to unlocking the full potential of repeated sampling on tasks without automatic verification.

Why do reward models fail to scale with the sample budget on math word problems?

For hard problems, correct solutions are rare in the sample pool and the reward model cannot reliably score them higher than plausible-but-wrong solutions that appear much more frequently
Reward models are too slow to evaluate thousands of samples
The correct solutions have lower quality reasoning chains

Chapter 7: Cost Analysis

Sampling 250 times costs 250x as much. Is it worth it? The answer depends on what you are comparing against.

FLOP-Matched Comparison

The authors compare Llama-3-8B-Instruct (many samples) versus Llama-3-70B-Instruct (fewer samples) at the same total FLOP budget. The results vary by task:

MATH, GSM8K, MiniF2F: 8B + many samples beats 70B + fewer samples at every FLOP budget. Sampling more from the small model is more efficient.
CodeContests: 70B is almost always more cost-effective. The task demands capabilities that the 8B model simply does not have, even with 10,000 tries.

When to sample more vs use a bigger model: If the smaller model has nonzero success on most problems (even at low rates), sampling more is efficient. If the smaller model has literally zero success on many problems (as with Pythia on CodeContests), no amount of sampling will help — use the bigger model.
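To get a feel for what a FLOP-matched comparison means, the sketch below estimates how many samples each model can afford under the same budget. It assumes the rough rule of about 2 FLOPs per parameter per generated token and ignores attention and prompt-processing costs, which a careful accounting would include. The budget and token count are hypothetical.

```python
def samples_within_budget(flop_budget: float, params: float, tokens_per_sample: int) -> int:
    """Samples affordable under the rough rule FLOPs ~= 2 * params * tokens."""
    flops_per_sample = 2.0 * params * tokens_per_sample
    return int(flop_budget // flops_per_sample)


TOKENS = 512    # assumed average generation length (hypothetical)
BUDGET = 1e16   # assumed total FLOP budget per problem (hypothetical)

k_8b = samples_within_budget(BUDGET, 8e9, TOKENS)
k_70b = samples_within_budget(BUDGET, 70e9, TOKENS)
print(k_8b, k_70b, round(k_8b / k_70b, 2))
```

Under this approximation the 8B model gets close to nine times as many attempts as the 70B model, so the question becomes whether those extra attempts raise its coverage above what the 70B model reaches with fewer.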

Dollar Cost Comparison (SWE-bench)

The paper compares API costs for SWE-bench Lite using the Moatless Tools framework:

Model	Cost per attempt (USD)	Attempts per issue	Issues solved	Total cost (300 issues)	Relative total cost
DeepSeek-Coder-V2	$0.0072	5	29.6%	$10.80	1x
GPT-4o	$0.13	1	24.0%	$39.00	3.6x
Claude 3.5 Sonnet	$0.17	1	26.7%	$51.00	4.7x

Five attempts with DeepSeek solve more issues than a single attempt from GPT-4o or Claude, at roughly 3.6x to 4.7x lower total cost. The cheap model sampled more beats the expensive model sampled once.
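The totals in the table above follow from multiplying cost per attempt by attempts per issue and by the 300 issues in SWE-bench Lite. The sketch below reproduces them; reading the totals this way is an assumption, but it matches every row.

```python
NUM_ISSUES = 300  # SWE-bench Lite size stated earlier in the text

# (model, cost per attempt in USD, attempts per issue), from the table above.
rows = [
    ("DeepSeek-Coder-V2", 0.0072, 5),
    ("GPT-4o", 0.13, 1),
    ("Claude 3.5 Sonnet", 0.17, 1),
]


def total_cost(cost_per_attempt: float, attempts: int, issues: int = NUM_ISSUES) -> float:
    """Total dollars to run the given number of attempts on every issue."""
    return cost_per_attempt * attempts * issues


totals = {name: total_cost(cost, attempts) for name, cost, attempts in rows}
baseline = totals["DeepSeek-Coder-V2"]
for name, cost in totals.items():
    print(f"{name:<20} ${cost:7.2f}  {cost / baseline:.1f}x")
```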

Throughput advantage: Repeated sampling is a distinct workload from serving chatbot requests. You can use high batch sizes, shared prefix optimization (all attempts share the same prompt), and prioritize throughput over latency. This makes the per-sample cost even lower in practice than the naive API pricing suggests.

Cost-Performance Frontier

[Interactive figure: number of DeepSeek attempts affordable at a given total budget in dollars, compared against single attempts from frontier models.]

Under what condition is it better to use a larger model with fewer samples instead of a smaller model with many samples?

When the smaller model has literally zero probability of solving many problems in the dataset, meaning no amount of sampling can generate a correct solution for those problems
When latency matters more than accuracy
When the dataset is small

Chapter 8: Implications

This paper establishes repeated sampling as a scaling axis for inference compute, parallel to the well-known axes of model size, training data, and training compute. The implications ripple through how we think about deploying LLMs.

Inference Compute as a Knob

With training scaling laws, we learned to predict performance from training FLOPs and allocate budgets rationally. The exponentiated power law for coverage does the same for inference. You can now ask: "How many samples do I need to solve 90% of problems in this class?" and get a quantitative answer.
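Answering that question amounts to inverting c = exp(a · k^b) for k, which gives k = (log(c) / a)^(1/b). The sketch below applies this to the parameters quoted earlier for Llama-3-8B-Instruct on MATH. It assumes the fitted curve keeps holding at the target coverage, which is an extrapolation whenever the target lies outside the measured range.

```python
import math


def samples_for_target(target: float, a: float, b: float) -> int:
    """Smallest integer k with exp(a * k**b) >= target, for a < 0 and b < 0."""
    if not 0.0 < target < 1.0:
        raise ValueError("target coverage must be strictly between 0 and 1")
    k = (math.log(target) / a) ** (1.0 / b)
    return max(1, math.ceil(k))


# Fitted parameters quoted earlier for Llama-3-8B-Instruct on MATH.
k90 = samples_for_target(0.90, a=-1.33, b=-0.43)
print(k90)
```

The result lands between the 100-sample and 1,000-sample rows of the worked example in Chapter 2, as expected for a 90% target.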

Weak Models + Many Samples

The ability to amplify weak models is practically important. Open-source models are cheaper to run, can be self-hosted, and avoid API vendor lock-in. If sampling 100 times from Llama-3-8B beats sampling once from GPT-4o on your task, the economics strongly favor the open-source path.

The Verifier Bottleneck is the Real Challenge

In domains with automatic verification (code, formal proofs), repeated sampling already works today. The frontier is building better verifiers for domains without them. Process reward models, self-verification, execution-based checking, and formal verification tools are all active research areas that directly unlock the potential of repeated sampling.

Future Directions from the Paper

The authors identify three ways to improve repeated sampling beyond the "dumb" independent sampling they study:

Solution diversity: Instead of relying solely on temperature for diversity, condition different samples with different metadata (as AlphaCode does) or use diverse prompting strategies.
Multi-turn interactions: Let the model iterate on its solutions using execution feedback (compile errors, test results) instead of single-shot generation.
Learning from previous attempts: Show the model its past failed attempts so it can try genuinely different strategies instead of repeating the same mistakes.

The bigger picture: This paper appeared in July 2024. Within months, OpenAI released o1 (which uses inference-time compute for chain-of-thought reasoning), and the field shifted toward "test-time compute scaling." Large Language Monkeys was one of the earliest systematic studies showing that inference compute, even in its simplest form, is a powerful and predictable scaling axis.

What is the most important bottleneck preventing repeated sampling from working on all tasks?

The lack of reliable automatic verifiers for domains like math word problems and open-ended reasoning, where majority voting and reward models plateau far below coverage
The computational cost of generating many samples
Models generating lower quality solutions at high temperatures

Chapter 9: Connections

Large Language Monkeys sits at the foundation of the test-time compute scaling paradigm. Here is how it connects to the broader landscape.

Scaling Test-Time Compute (Snell et al., 2024)

Goes beyond brute-force sampling. Studies adaptive allocation — spending more compute on harder problems. Shows that verifier-guided search can outperform repeated sampling with the same compute budget. Large Language Monkeys established the baseline that this work improves upon.

↓

CodeMonkeys (Ehrlich et al., 2025)

Direct follow-up by several of the same authors. Scales repeated sampling to SWE-bench Verified with up to 250 samples per issue. Uses execution feedback to guide sampling. Pushes the approach further with smarter verification.

↓

BrowseComp (OpenAI, 2025)

A benchmark of hard web research problems. Even with hundreds of browsing attempts, models struggle. Demonstrates that some tasks resist the "just sample more" strategy, motivating more sophisticated search and reasoning approaches.

↓

GEPA (Agrawal et al., 2026)

Instead of blind repeated sampling, GEPA uses reflective mutation — reading execution traces to diagnose failures and fix prompts. An evolution from "try 250 times independently" to "try, learn from the failure, try again smarter."

The trajectory: Large Language Monkeys showed that brute-force sampling works. Subsequent work asked: can we be smarter about it? Adaptive compute allocation, execution feedback, multi-turn refinement, and reflective prompt optimization all build on the empirical foundation this paper established.

How does the "scaling test-time compute" paradigm build on Large Language Monkeys?

Large Language Monkeys established that brute-force repeated sampling scales predictably; subsequent work like Snell et al. showed that adaptive allocation and verifier-guided search can achieve the same coverage with less compute
They are unrelated research directions
Test-time compute scaling replaces sampling with chain-of-thought reasoning