Ryan Ehrlich, Bradley Brown, Jordan Juravsky, Ronald Clark, Christopher Ré, Azalia Mirhoseini — Stanford, Oxford, 2025

CodeMonkeys: Scaling Test-Time Compute for Software Engineering

Fix real GitHub issues by combining serial (multi-turn iteration) and parallel (multi-trajectory sampling) test-time compute. 57.4% on SWE-bench Verified for ~$2300.

Prerequisites: What an LLM is + Basic software engineering intuition + What a test suite does
10 chapters, 4 simulations

Chapter 0: The Problem

You have a real GitHub issue. A user reports a bug in a popular Python library — maybe django, sympy, or scikit-learn. The issue description explains the expected behavior, the actual behavior, and maybe a minimal reproduction. Your job: read the codebase, find the right files, write a patch that resolves the issue, and make sure you didn't break anything else.

This is the SWE-bench problem. It's a benchmark of 500 real GitHub issues (the "Verified" split) from 12 popular Python repositories, each with a known solution and a test suite that checks whether your patch actually works. No toy problems — these are the same issues that real open-source developers solved by hand.

Why SWE-bench matters: It tests something much harder than code completion. You need to understand a multi-million-token codebase, locate the relevant files among hundreds, reason about the bug, write a correct patch, and avoid regressions. As of early 2025, no single-shot approach exceeds ~55%.

The standard approach is to give an LLM the issue description and some codebase context, ask it to generate a patch, and submit. Maybe you let it use tools — search files, run commands. Maybe you let it iterate a few turns. But you fundamentally generate one attempt and submit it.

The problem? Single-shot approaches plateau. An LLM might solve 40-50% of issues on its best attempt, but there's a long tail of issues where it gets stuck — wrong files, wrong diagnosis, wrong patch. More turns sometimes help, but the model can get locked into a bad trajectory. What if you could throw more compute at the problem?

SWE-bench: The Gap Between One Shot and Many Shots

A single attempt solves ~46% of issues. But if you sample 10 independent attempts and could magically pick the right one, coverage jumps to ~70%. The gap is the selection problem.

Why do single-shot approaches to SWE-bench plateau around 40-55%?

Chapter 1: The Key Insight

CodeMonkeys decomposes the problem into three subtasks — context, generation, selection — and scales test-time compute along two orthogonal axes.

Serial Scaling: Iterate Within a Trajectory

Instead of asking the model to generate a patch in one shot, let it iterate. The model writes a draft patch, generates a test script, runs the test, reads the output, revises the patch, re-runs the test. Each turn gives the model execution feedback it can use to refine its answer. More turns = more refinement. This is serial scaling.

Parallel Scaling: Sample Many Trajectories

Instead of running one trajectory, run many. Each trajectory starts from scratch with a positive sampling temperature, producing diverse approaches to the same issue. If trajectory #3 gets stuck on a bad diagnosis but trajectory #7 nails it, you haven't lost anything. More trajectories = more diversity. This is parallel scaling.

The combination is key. Serial scaling alone hits diminishing returns — a stuck trajectory doesn't get unstuck by more turns. Parallel scaling alone gives diversity but wastes compute on shallow attempts. CodeMonkeys combines both: each trajectory gets enough turns to iterate to a good solution, and you sample enough trajectories to cover diverse approaches.

1. Context: Identify relevant codebase files. Can we find the right files? Measured by recall.
2. Generation: Produce candidate edits (serial x parallel). Is the correct patch among any candidate? Measured by coverage.
3. Selection: Choose the best edit from candidates. Measured by final score.

The system uses Claude 3.5 Sonnet as the primary model and Qwen-2.5-Coder-32B for the initial file scan. For each of the 500 SWE-bench Verified issues, CodeMonkeys runs 10 parallel trajectories with up to 8 turns per state machine, then selects a final answer. The whole run costs ~$2300.

What are the two axes of test-time compute scaling in CodeMonkeys?

Chapter 2: Multi-Turn Trajectories

The core execution unit of CodeMonkeys is a pair of state machines: a Testing State Machine followed by an Editing State Machine. Let's walk through exactly what happens.

Step 1: The Testing State Machine

Given the issue description (but not the codebase context), the model writes a standalone Python test script that tries to reproduce the bug. The script uses exit codes to communicate: exit 1 if the bug is present, exit 0 if it's fixed. The model runs this script on the unedited codebase, reads the execution output, and iterates until the script correctly detects the bug.

Why test first? Writing a test doesn't require codebase context; you only need the issue description. By splitting testing from editing, CodeMonkeys avoids paying for expensive codebase context tokens during the test-writing phase. The testing state machine's prompts are much shorter because they omit the codebase files (about 74K tokens on average, up to the 128K limit), which saves on prompt cache read costs.
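To make the exit-code convention concrete, here is a minimal sketch of what such a reproduction script might look like. The function `parse_version` and its bug are invented for illustration and stand in for real library code; only the convention itself (1 = bug present, 0 = fixed) comes from the description above.

```python
def parse_version(text):
    # Hypothetical stand-in for the library code under test.
    # Illustrative bug: suffixes such as "1.2rc1" raise ValueError.
    return tuple(int(part) for part in text.split("."))


def reproduce_issue():
    """Return 1 if the bug is still present, 0 if it looks fixed."""
    try:
        parse_version("1.2rc1")
    except ValueError as exc:
        print(f"Bug reproduced: {exc}")
        return 1
    print("No error raised: the issue appears fixed")
    return 0


# A standalone script would finish by handing the code to the shell:
#     import sys; sys.exit(reproduce_issue())
```

Note that the script needs nothing beyond the behavior described in the issue, which is why this phase can run without any codebase context in the prompt.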

Step 2: The Editing State Machine

Now the model receives the full codebase context (~74K tokens of relevant files) plus the test script from step 1. It generates a codebase edit in aider-style diff format. Then it runs the test before the edit (should fail) and after the edit (should pass). If the test doesn't pass post-edit, the model sees the execution output and iterates.

Two-sided debugging. The model can also revise the test during editing. Why? A test written against the unedited codebase might always fail — even after a correct fix. By letting the model adjust the test during editing, it can verify that the test fails pre-edit and passes post-edit. This bidirectional checking is critical.
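The bidirectional check can be sketched as a small classifier over two runs of the same test. This is an illustrative sketch, not the authors' code: `RunResult`, `two_sided_verdict`, and the verdict strings are assumed names, and the exit codes follow the convention described earlier.

```python
from dataclasses import dataclass

BUG_PRESENT = 1  # exit code when the test still sees the bug
BUG_FIXED = 0    # exit code when the test sees the fix


@dataclass
class RunResult:
    exit_code: int
    output: str


def two_sided_verdict(pre_edit: RunResult, post_edit: RunResult) -> str:
    """Classify an (edit, test) pair from runs before and after the edit."""
    if pre_edit.exit_code != BUG_PRESENT:
        # The test cannot be trusted if it misses the bug on unedited code.
        return "revise test: it does not flag the bug on the unedited code"
    if post_edit.exit_code == BUG_FIXED:
        return "consistent: fails pre-edit, passes post-edit"
    # Either the edit is wrong or the test can never pass.
    return "revise edit or test: still failing after the edit"
```

The last branch is the ambiguous one: a post-edit failure alone does not say whether the edit or the test is at fault, which is why the model is allowed to revise either.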

The Full Trajectory

Testing SM: Input is the issue description. Write a test script, run it on the unedited repo, and iterate until the test correctly flags the bug (exit 1). No codebase files needed.
↓ test script
Editing SM: Input is the issue + codebase context + test script. Write an edit (aider diff), then run the test pre-edit (should fail) and post-edit (should pass). Iterate on failures. Can also revise the test.
↓ (edit, test) pair
Output: A candidate codebase edit + a testing script. Both the edit and the test are used downstream for selection.

Each state machine is limited to 8 iterations (model completions). The model can choose to approve its work and terminate early. Format corrections and malformed outputs also count against the limit.
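The iteration cap and early approval can be sketched as a bounded loop. This is a simplified sketch under assumptions: `propose` and `execute` are hypothetical callables standing in for a model completion and a sandboxed run, and format-correction turns are not modeled separately.

```python
from typing import Callable, Optional, Tuple

MAX_ITERATIONS = 8  # per-state-machine cap on model completions


def run_state_machine(
    propose: Callable[[Optional[str]], Tuple[str, bool]],
    execute: Callable[[str], str],
    max_iterations: int = MAX_ITERATIONS,
) -> Tuple[Optional[str], int]:
    """Iterate propose -> execute until approval or the cap is reached.

    `propose(feedback)` returns (draft, approved); `execute(draft)` returns
    the execution output shown to the model on the next turn.
    """
    draft: Optional[str] = None
    feedback: Optional[str] = None
    for used in range(1, max_iterations + 1):
        draft, approved = propose(feedback)  # one model completion
        if approved:
            return draft, used  # the model chose to terminate early
        feedback = execute(draft)
    return draft, max_iterations
```

The function returns the last draft together with the number of completions used, so a caller can tell an early approval apart from a trajectory that ran out of turns.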

Why does CodeMonkeys split test generation into a separate state machine from edit generation?

Chapter 3: Reading the Whole Codebase

Most SWE-bench codebases contain millions of tokens. You can't feed everything to the model. Existing approaches use embedding-based retrieval, iterative file-tree expansion, or tool-based browsing. CodeMonkeys takes a radically simpler approach: let an LLM read every file.

Amortized Context

The key insight is amortization. If you're generating 10 candidate edits per issue, you can run the expensive file scan once and share the results across all 10 trajectories. The scan costs $334 across all 500 issues — about 15% of the total budget. If you re-ran it for each of the 10 edits, it would become the most expensive step. Sharing makes the simple approach affordable.

The amortization math: Scanning all files for 500 issues costs $334 total, or about $0.67 per issue. If you ran the scan 10 times per issue (once per trajectory), it would cost about $6.68 per issue, or $3,340 total, more than the entire rest of the system. Sharing context across parallel trajectories makes brute-force retrieval viable.
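The arithmetic can be checked in a few lines. The dollar figures are the ones quoted in this walkthrough; the variable names are only for illustration, and per-issue values are computed from the unrounded total.

```python
SCAN_COST_TOTAL = 334.0   # USD, relevance scan over all issues
NUM_ISSUES = 500
NUM_TRAJECTORIES = 10
TOTAL_COST = 2291.90      # USD, the full system

shared_per_issue = SCAN_COST_TOTAL / NUM_ISSUES
unshared_per_issue = shared_per_issue * NUM_TRAJECTORIES
unshared_total = SCAN_COST_TOTAL * NUM_TRAJECTORIES
rest_of_system = TOTAL_COST - SCAN_COST_TOTAL

print(f"scan shared across trajectories: ${shared_per_issue:.2f} per issue")
print(f"scan repeated per trajectory:    ${unshared_per_issue:.2f} per issue")
print(f"repeated scan total ${unshared_total:,.0f} vs rest of system ${rest_of_system:,.2f}")
```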

The Two-Stage Pipeline

Stage 1 (relevance scan): Qwen-2.5-Coder-32B (run locally, cheap) reads every Python file in the repo (excluding test directories) and decides whether each file is relevant to the issue. It also generates a concise summary of how each relevant file relates to the issue. On average, this processes 2.94 million tokens per problem.

Stage 2 (ranking): Claude 3.5 Sonnet receives the file names, summaries, and token counts of all relevant files. It ranks them by importance, targeting ~60K tokens of context. Three ranking completions are averaged to produce a stable final ordering. Files are included up to a 128K token limit, yielding an average of 74,570 tokens of context, roughly a 39x reduction from the 2.94 million tokens scanned on average.
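One way to picture the ranking stage is mean-position averaging followed by greedy packing under a token budget. The text does not spell out how the three rankings are combined, so both functions below are assumptions for illustration, including the treatment of files missing from a ranking and the choice to stop at the first file that does not fit.

```python
from collections import defaultdict


def average_rankings(rankings):
    """Order files by mean position across several ranking completions.

    A file missing from one ranking is treated as ranked just past its end.
    """
    all_files = {f for ranking in rankings for f in ranking}
    positions = defaultdict(list)
    for ranking in rankings:
        index = {f: i for i, f in enumerate(ranking)}
        for f in all_files:
            positions[f].append(index.get(f, len(ranking)))
    # Ties are broken alphabetically so the result is deterministic.
    return sorted(all_files, key=lambda f: (sum(positions[f]) / len(rankings), f))


def pack_context(ordered_files, token_counts, limit=128_000):
    """Take files in rank order until the next one would exceed the limit."""
    chosen, used = [], 0
    for f in ordered_files:
        if used + token_counts[f] > limit:
            break
        chosen.append(f)
        used += token_counts[f]
    return chosen, used
```

For example, with rankings `[a, b, c]`, `[b, a, c]`, and `[a, c, b]`, file `a` has the lowest mean position and is packed first.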

Recall

With a 128K token limit, 92.6% of SWE-bench instances have all the correct files in context. The remaining 7.4% are cases where the needed files were either missed by the scan or ranked too low to fit.

Context Pipeline: From Millions of Tokens to 74K

Watch the two-stage compression pipeline, one stage at a time.

Why can CodeMonkeys afford to let an LLM read every file in the codebase?

Chapter 4: Serial Scaling

Serial scaling means giving each trajectory more turns to iterate. The model writes an edit, runs its test, sees what failed, and tries again. Let's see this in action.

What Happens Across Turns

In the editing state machine, a typical trajectory might look like this:

Turn 1: The model writes an initial edit based on the issue and context, then runs the test. The test crashes with an ImportError caused by a wrong module path.
Turn 2: The model fixes the import and runs the test again. The test runs but fails: the edit handled the main case but missed an edge case in the error message.
Turn 3: The model reads the test output, recognizes the edge case, and adjusts the patch. The test now fails pre-edit and passes post-edit, so the model approves.

Each turn provides execution feedback — the model sees the actual stdout/stderr from running its test. This is far richer than having the model self-reflect without execution. The authors note that "models cannot self-correct reasoning" without external feedback, which is why the test-execute-iterate loop is essential.

The first few turns matter most. The coverage-vs-cost curves show that the first 2-3 serial iterations produce the biggest improvement. Early iterations typically fix configuration errors, import issues, and easy-to-catch bugs. Later iterations encounter diminishing returns — the model may already have approved its work or may be stuck in a loop.

The Stuck Trajectory Problem

Serial scaling has a fundamental limitation: a model can incorrectly approve its own work and terminate the state machine early. If the model believes its edit is correct (even when it isn't), adding more iterations doesn't help because the state machine has already terminated. This is a key reason why parallel scaling is needed as a complement.

Serial Scaling: Watch a Trajectory Iterate

Step through turns of an editing state machine. Watch the edit quality improve with execution feedback.

Why does serial scaling hit diminishing returns?

Chapter 5: Parallel Scaling

Parallel scaling means running multiple independent trajectories for the same issue, each with a positive sampling temperature to introduce diversity. The trajectories don't know about each other — they're completely independent.

Coverage Increases Log-Linearly

This was the central finding of the authors' prior work, Large Language Monkeys: the fraction of problems solved by any sample (coverage) increases approximately log-linearly with the number of samples. If you plot coverage vs. number of trajectories on a log-x scale, you get a roughly straight line.

For CodeMonkeys with 10 trajectories and 8 iterations each, coverage reaches 69.8%. That means nearly 70% of SWE-bench Verified issues have a correct patch somewhere among the 10 candidates.

69.8% coverage vs. 57.4% final score. The gap between coverage and final score is the selection problem. If you had a perfect oracle that always picked the correct edit, you'd score 69.8%. Random selection gives only 45.8%. The actual selection method recovers about half of this gap.
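The three quantities in this comparison can be written down directly. The sketch below assumes a boolean matrix of per-candidate outcomes (a hypothetical input format); the final call plugs in the percentages quoted above.

```python
def coverage(results):
    """Fraction of issues solved by at least one candidate.

    `results[i][j]` is True when candidate j resolves issue i.
    """
    return sum(any(row) for row in results) / len(results)


def expected_random_score(results):
    """Expected score when one candidate per issue is picked uniformly."""
    return sum(sum(row) / len(row) for row in results) / len(results)


def gap_recovered(random_score, final_score, oracle_score):
    """Share of the random-to-oracle gap closed by a selection method."""
    return (final_score - random_score) / (oracle_score - random_score)


# Percentages from the text: random 45.8, final 57.4, oracle 69.8.
print(f"gap recovered: {gap_recovered(45.8, 57.4, 69.8):.0%}")
```

Coverage is the ceiling for any selector, and the random-selection score is the floor a selector has to beat, so the recovered share is a compact way to compare selection methods.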

Why Parallel Beats Deeper Serial

Consider two configurations that cost roughly the same but split the budget differently, one favoring more turns per trajectory and the other favoring more trajectories. They achieve similar coverage! The paper finds that, after the first few iterations, configurations with similar total cost reach similar coverage regardless of how the budget is split between serial and parallel. But there's a catch: more parallel samples make selection harder. You have more candidates to sift through, and more wrong answers diluting the signal.

The Fresh Start Advantage

Parallel scaling has one guarantee that serial scaling lacks: every new trajectory is a fresh start. A stuck trajectory that approved a wrong edit can't be saved by more turns. But a new parallel trajectory starts from scratch and may take a completely different approach to the same problem.

Parallel Scaling: Coverage vs. Number of Trajectories

Watch coverage grow as more trajectories are sampled. Each new trajectory is an independent attempt that may solve a problem no prior trajectory solved.

What is the "fresh start advantage" of parallel scaling over serial scaling?

Chapter 6: Selection

You have 10 candidate edits per issue, each with a corresponding test script. Coverage is 69.8%. Random selection gives 45.8%. How do you pick the right one?

Method 1: Majority Voting with Tests

Run each of the 10 model-generated tests on each of the 10 edits. That's 100 test executions per issue. Select the edit that passes the most tests. The intuition: if your edit is correct, it should pass most of the tests written by other trajectories (since those tests are trying to reproduce the same bug). Score: 53.0%.
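Cross-voting reduces to counting passes per edit over a tests-by-edits matrix. The sketch below assumes the matrix has already been filled in by running every test against every edit; the tie-breaking rule (lowest index wins) is an assumption, since the text does not say how ties are handled.

```python
def vote_counts(passes):
    """Count, for each edit, how many of the tests it passes.

    `passes[t][e]` is True when edit e passes test t.
    """
    num_edits = len(passes[0])
    return [sum(row[e] for row in passes) for e in range(num_edits)]


def majority_vote(passes):
    """Index of the edit passing the most tests (lowest index on ties)."""
    counts = vote_counts(passes)
    return max(range(len(counts)), key=lambda e: (counts[e], -e))
```

With 10 tests and 10 edits, `passes` is the 10 x 10 matrix behind the 100 executions per issue.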

Method 2: Model Selection

Show a model the issue description, codebase context, and all 10 candidate edits in git diff form. Ask it to pick the best one. Score: 52.0%. Surprisingly, this underperforms pure test voting. The model struggles to distinguish between 10 similar-looking diffs.

Method 3: Top-3 Filtering + Model Selection

First, use test voting to narrow down to the top 3 edits (the ones passing the most tests). Then use model selection among only 3 candidates instead of 10. Score: 55.6%. The filtering step removes noise, making the model's job much easier.
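Filtering before model selection can be sketched as a shortlist step followed by a pluggable chooser. Here `choose` is a hypothetical callable standing in for the model call, and ties in the vote counts are broken by lower index as an assumption.

```python
def top_k_by_votes(vote_counts, k=3):
    """Indices of the k edits with the most test votes (ties: lower index first)."""
    order = sorted(range(len(vote_counts)), key=lambda e: (-vote_counts[e], e))
    return order[:k]


def filter_then_select(vote_counts, choose, k=3):
    """Shortlist by votes, then let `choose` pick among the shortlist."""
    shortlist = top_k_by_votes(vote_counts, k)
    picked = choose(shortlist)
    if picked not in shortlist:
        raise ValueError("selector must pick from the shortlist")
    return picked
```

The design point is that the chooser only ever sees three candidates, so the cheap test signal does the coarse pruning and the model does the fine comparison.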

Method 4: Selection State Machine (Best)

Upgrade from a single model call to a multi-turn state machine. After top-3 filtering, the selection model can write new test scripts specifically designed to distinguish between the remaining candidates. It runs these tests on all 3 edits and the unedited codebase, then decides: write another distinguishing test, or pick a winner. Score: 57.4%.
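The decide-or-test loop might look like the sketch below. The callables `decide` and `run_test`, the turn cap of 8, and the fallback to the first shortlisted candidate are all assumptions made for illustration; the text only specifies that the model alternates between writing distinguishing tests and picking a winner.

```python
def run_selection(decide, run_test, candidates, max_turns=8):
    """Alternate between writing distinguishing tests and picking a winner.

    `decide(history)` returns ("test", script) or ("pick", candidate).
    `run_test(script, target)` returns an exit code; a target of None
    means the unedited codebase.
    """
    history = []
    for _ in range(max_turns):
        action, payload = decide(history)
        if action == "pick":
            return payload, history
        # Run the new test on the unedited codebase and on every candidate.
        targets = [None] + list(candidates)
        outcomes = {target: run_test(payload, target) for target in targets}
        history.append((payload, outcomes))
    return candidates[0], history  # assumed fallback if no pick was made
```

Including the unedited codebase among the targets lets the selector confirm that a new test actually detects the bug before trusting how the candidates fare on it.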

The voting mechanism explained: Each trajectory produces both an edit and a test. Trajectory #3's test might be "import the fixed module and check the error message format." Trajectory #7's test might be "trigger the bug with a specific input and check the return value." A correct edit should pass both tests. An incorrect edit might pass its own test but fail others. Cross-voting exposes this.

| Selection Method | Score (%) |
| --- | --- |
| Random Selection | 45.8 |
| Majority Voting (tests) | 53.0 |
| Model Selection (raw) | 52.0 |
| Top-3 Filter + Model | 55.6 |
| Selection State Machine | 57.4 |
| Oracle (coverage ceiling) | 69.8 |

The selection state machine costs <6% of total budget. Despite being the most expensive selection method, it adds only $132 across all 500 issues. That's a great return: +1.8 percentage points over top-3 model selection for 5.8% of the total cost.
Why does model selection (52.0%) underperform majority voting with tests (53.0%)?

Chapter 7: Results

Final Score: 57.4% on SWE-bench Verified

CodeMonkeys resolves 57.4% of SWE-bench Verified issues. At the time of publication (January 2025), this placed it among the top systems on the leaderboard.

Cost Breakdown

Total cost for all 500 issues: $2,291.90. That's about $4.58 per issue. Let's break it down.

| Stage | Cost (USD) | % of Total |
| --- | --- | --- |
| Relevance Scan (Qwen, local) | 334 | 14.6% |
| Ranking (Claude) | 20 | 0.9% |
| Test Generation (Claude) | 440 | 19.2% |
| Edit Generation (Claude) | 1,366 | 59.6% |
| Selection (Claude) | 132 | 5.8% |

Edit generation dominates. Nearly 60% of costs go to the editing state machines. The biggest sub-component is prompt cache reads — every editing turn re-reads the ~74K tokens of codebase context. This is why separating test generation (no codebase context needed) from edit generation was an important cost optimization.
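The shares can be recomputed from the whole-dollar stage costs listed above. Because those costs are rounded, they sum to $2,292 rather than the quoted $2,291.90; the dictionary keys are just labels for this sketch.

```python
stage_costs = {
    "relevance scan": 334,
    "ranking": 20,
    "test generation": 440,
    "edit generation": 1366,
    "selection": 132,
}

total = sum(stage_costs.values())
shares = {stage: 100 * cost / total for stage, cost in stage_costs.items()}
per_issue = total / 500

for stage, share in shares.items():
    print(f"{stage:16s} ${stage_costs[stage]:>5,}  {share:4.1f}%")
print(f"total ${total:,}, about ${per_issue:.2f} per issue")
```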

The Barrel of Monkeys

CodeMonkeys' selection method works on candidates from any source. The authors created an ensemble — the "Barrel of Monkeys" — by combining CodeMonkeys' edits with the top 4 SWE-bench submissions (Blackbox AI, CodeStory, Learn-by-interact, devlo). This 5-system ensemble has coverage of 80.8%. After selection: 66.2% — higher than any individual system.
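Ensemble coverage is the size of a set union: an issue counts as covered if any member system resolves it. The sketch below uses toy data with made-up system names, since per-issue results are not listed here.

```python
def ensemble_coverage(solved_by_system, num_issues):
    """Fraction of issues solved by at least one system in the ensemble.

    `solved_by_system` maps a system name to the set of issue ids it resolves.
    """
    solved_any = set().union(*solved_by_system.values())
    return len(solved_any) / num_issues


# Toy example: two systems that each solve 3 of 5 issues, overlapping on one.
toy = {"system_a": {1, 2, 3}, "system_b": {3, 4, 5}}
print(ensemble_coverage(toy, num_issues=5))
```

In the toy data each system alone covers 60% of issues while the union covers all five, which is the effect an ensemble relies on: systems that fail on different issues raise the ceiling available to selection.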

SWE-bench Verified Leaderboard

Comparing CodeMonkeys and Barrel of Monkeys to top submissions. Oracle selection shows the coverage ceiling.

Why does the Barrel of Monkeys (66.2%) outperform its best individual member (Blackbox AI at 62.8%)?

Chapter 8: Scaling Analysis

The paper provides one of the most thorough analyses of serial vs. parallel scaling tradeoffs in the literature. Let's dig into the data.

The Coverage Frontier

The authors sweep over all combinations of 1-10 parallel trajectories and 1-8 serial turns, measuring coverage and cost for each configuration. The key finding: after the first few iterations, configurations with similar total cost achieve similar coverage, regardless of how the budget is split.
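The sweep grid is easy to enumerate. In the sketch below, grouping by trajectories times turns is only a crude proxy for cost and is an assumption of this example: real cost depends on token usage and on trajectories that terminate early.

```python
from collections import defaultdict
from itertools import product


def sweep_configs(max_trajectories=10, max_turns=8):
    """All (trajectories, turns) pairs in the sweep."""
    return list(product(range(1, max_trajectories + 1), range(1, max_turns + 1)))


def group_by_proxy_budget(configs):
    """Group configs by trajectories * turns, a crude stand-in for cost."""
    groups = defaultdict(list)
    for trajectories, turns in configs:
        groups[trajectories * turns].append((trajectories, turns))
    return dict(groups)


configs = sweep_configs()
groups = group_by_proxy_budget(configs)
print(len(configs), "configurations;", "budget 40:", groups[40])
```

Configurations that land in the same group are the ones the frontier claim compares: the same approximate budget, split differently between depth and breadth.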

For example, at ~$500 total cost, configurations that split the budget differently between turns and trajectories converge to roughly the same coverage. But they don't converge to the same final score after selection, because more parallel samples make selection harder.

The first 2-3 turns are special. Going from 1 turn to 3 turns per trajectory produces a massive coverage jump. This makes sense: the first turn writes a draft, the second catches import/config errors, the third fixes the actual logic. Beyond that, additional turns encounter diminishing returns.

Serial vs. Parallel: Different Tradeoffs

| Dimension | Serial (more turns) | Parallel (more trajectories) |
| --- | --- | --- |
| What it provides | Refinement within an approach | Diversity across approaches |
| Diminishing returns | Model may approve early and terminate | Log-linear: each new sample has decreasing marginal coverage |
| Selection impact | Fewer candidates = easier selection | More candidates = harder selection |
| Failure mode | Stuck trajectory can't escape bad approach | Shallow attempts may not iterate enough to fix errors |
| Guarantee | No guarantee model uses extra turns | Each new trajectory guarantees a fresh start |

Majority Voting Score Also Shows Convergence

When using majority voting for selection (not the selection state machine), similar-cost configurations also yield similar final scores. This means the frontier property holds not just for coverage but also for the final output — at least under simple selection methods.

Practical recommendation: The paper suggests that the "sweet spot" is somewhere around 8-10 trajectories with 6-8 turns each. This gives enough serial depth for the first few critical iterations while providing enough parallel diversity to cover the long tail of problems that any single trajectory might miss.
Why don't configurations with the same coverage always achieve the same final score?

Chapter 9: Connections

CodeMonkeys sits at the intersection of test-time compute scaling, coding agents, and software engineering automation. Let's map its lineage.

Large Language Monkeys (Brown et al., 2024)

By the same authors. Established that coverage increases log-linearly with repeated sampling across math and coding tasks. CodeMonkeys is the direct follow-up: "How would you design a system if scaling test-time compute was a primary consideration?" Large Language Monkeys showed the potential; CodeMonkeys builds the actual system.

AlphaCode (Li et al., 2022)

DeepMind's system for competitive programming. Also uses massive parallel sampling (up to 1M samples) with a filtering/clustering pipeline for selection. CodeMonkeys operates at a much smaller scale (10 samples) but with richer per-sample compute (multi-turn iteration). AlphaCode showed that generating many candidates and selecting well can beat generating one perfect answer.

Agentless (Xia et al., 2024)

Also decomposes SWE-bench into context + generation + selection. But Agentless uses a fixed pipeline (localization → repair → selection) without multi-turn iteration. CodeMonkeys adds the iteration loop and scales both serial and parallel compute more aggressively.

Meta-Harness (Lee et al., 2026)

Takes the outer-loop optimization idea further: instead of designing the coding agent by hand, automatically search over the harness code that wraps the model. Meta-Harness could in principle discover a CodeMonkeys-like architecture through automated search.

RLEF / Process Reward Models

CodeMonkeys uses model-generated tests as a form of process verification. Each test checks whether the edit resolves the specific issue. This is related to process reward models that evaluate intermediate reasoning steps, but CodeMonkeys' tests are executable: they provide a signal grounded in actual execution rather than a learned estimate, although a model-written test can itself be wrong.

Cheat Sheet

| Aspect | CodeMonkeys |
| --- | --- |
| Task | SWE-bench Verified (500 real GitHub issues) |
| Primary model | Claude 3.5 Sonnet |
| Context model | Qwen-2.5-Coder-32B (local) |
| Parallel trajectories | 10 per issue |
| Serial turns | Up to 8 per state machine |
| Context strategy | Scan every file, rank, 128K limit (92.6% recall) |
| Coverage | 69.8% |
| Final score | 57.4% |
| Total cost | ~$2,300 ($4.58/issue) |
| Selection method | Test voting top-3 + selection state machine |
| Barrel of Monkeys | 66.2% (5-system ensemble) |

The bigger picture: CodeMonkeys demonstrates that test-time compute scaling works for software engineering, not just math and reasoning. The system design principles — iterate with execution feedback, sample diverse approaches, amortize shared costs, combine automated and model-based selection — likely generalize to any domain where outputs are verifiable.
What is the key design principle from CodeMonkeys that generalizes beyond SWE-bench?