CodeMonkeys — Veanors

Chapter 0: The Problem

You have a real GitHub issue. A user reports a bug in a popular Python library — maybe django, sympy, or scikit-learn. The issue description explains the expected behavior, the actual behavior, and maybe a minimal reproduction. Your job: read the codebase, find the right files, write a patch that resolves the issue, and make sure you didn't break anything else.

This is the SWE-bench problem. It's a benchmark of 500 real GitHub issues (the "Verified" split) from 12 popular Python repositories, each with a known solution and a test suite that checks whether your patch actually works. No toy problems — these are the same issues that real open-source developers solved by hand.

Why SWE-bench matters: It tests something much harder than code completion. You need to understand a multi-million-token codebase, locate the relevant files among hundreds, reason about the bug, write a correct patch, and avoid regressions. As of early 2025, no single-shot approach exceeds ~55%.

The standard approach is to give an LLM the issue description and some codebase context, ask it to generate a patch, and submit. Maybe you let it use tools — search files, run commands. Maybe you let it iterate a few turns. But you fundamentally generate one attempt and submit it.

The problem? Single-shot approaches plateau. An LLM might solve 40-50% of issues on its best attempt, but there's a long tail of issues where it gets stuck — wrong files, wrong diagnosis, wrong patch. More turns sometimes help, but the model can get locked into a bad trajectory. What if you could throw more compute at the problem?

SWE-bench: The Gap Between One Shot and Many Shots

A single attempt solves ~46% of issues. But if you sample 10 independent attempts and could magically pick the right one, coverage jumps to ~70%. The gap is the selection problem.

Why do single-shot approaches to SWE-bench plateau around 40-55%?

Because the model can get stuck in a bad trajectory (wrong files, wrong diagnosis), and a single attempt doesn't explore diverse approaches to the same problem
Because the benchmark is too noisy
Because SWE-bench only tests simple bugs

Chapter 1: The Key Insight

CodeMonkeys decomposes the problem into three subtasks — context, generation, selection — and scales test-time compute along two orthogonal axes.

Serial Scaling: Iterate Within a Trajectory

Instead of asking the model to generate a patch in one shot, let it iterate. The model writes a draft patch, generates a test script, runs the test, reads the output, revises the patch, re-runs the test. Each turn gives the model execution feedback it can use to refine its answer. More turns = more refinement. This is serial scaling.

Parallel Scaling: Sample Many Trajectories

Instead of running one trajectory, run many. Each trajectory starts from scratch with a positive sampling temperature, producing diverse approaches to the same issue. If trajectory #3 gets stuck on a bad diagnosis but trajectory #7 nails it, you haven't lost anything. More trajectories = more diversity. This is parallel scaling.

The combination is key. Serial scaling alone hits diminishing returns — a stuck trajectory doesn't get unstuck by more turns. Parallel scaling alone gives diversity but wastes compute on shallow attempts. CodeMonkeys combines both: each trajectory gets enough turns to iterate to a good solution, and you sample enough trajectories to cover diverse approaches.

1. Context

Identify relevant codebase files. Can we find the right files? Measured by recall.

↓

2. Generation

Produce candidate edits (serial x parallel). Is the correct patch among any candidate? Measured by coverage.

↓

3. Selection

Choose the best edit from candidates. Measured by final score.

The system uses Claude 3.5 Sonnet as the primary model and Qwen-2.5-Coder-32B for the initial file scan. For each of the 500 SWE-bench Verified issues, CodeMonkeys runs 10 parallel trajectories with up to 8 iterations per state machine, then selects a final answer. The whole run costs ~$2,300 in inference, counting both the Claude API fees and the locally run Qwen scan.

What are the two axes of test-time compute scaling in CodeMonkeys?

Training compute and inference compute
Serial (more turns per trajectory for refinement) and parallel (more independent trajectories for diversity)
Model size and context length

Chapter 2: Multi-Turn Trajectories

The core execution unit of CodeMonkeys is a pair of state machines: a Testing State Machine followed by an Editing State Machine. Let's walk through exactly what happens.

Step 1: The Testing State Machine

Given the issue description (but not the codebase context), the model writes a standalone Python test script that tries to reproduce the bug. The script uses exit codes to communicate: exit 1 if the bug is present, exit 0 if it's fixed. The model runs this script on the unedited codebase, reads the execution output, and iterates until the script correctly detects the bug.
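A minimal sketch of the exit-code convention described above, not the authors' actual harness. The `slugify` function and its whitespace bug are hypothetical stand-ins for a real library and issue; a real reproduction script would import the library under test instead. The harness side just runs the script in a subprocess and reads its return code.

```python
import subprocess
import sys
import textwrap

# Hypothetical reproduction script of the kind the testing state machine writes.
# Convention from the text: exit 1 if the bug is present, exit 0 if it is fixed.
REPRO_SCRIPT = textwrap.dedent('''
    import sys

    def slugify(text):
        # Stand-in for the library function named in a hypothetical issue.
        return text.lower().replace(" ", "-")

    expected = "hello-world"
    actual = slugify("  Hello World ")
    if actual != expected:
        print(f"BUG PRESENT: expected {expected!r}, got {actual!r}")
        sys.exit(1)
    print("BUG FIXED")
    sys.exit(0)
''')


def run_repro(script: str) -> tuple[int, str]:
    """Run a reproduction script in a fresh interpreter and capture its output."""
    proc = subprocess.run(
        [sys.executable, "-c", script],
        capture_output=True,
        text=True,
        timeout=60,
    )
    return proc.returncode, proc.stdout + proc.stderr


exit_code, output = run_repro(REPRO_SCRIPT)
print(exit_code, output.strip())
```

The captured output is what the model reads on the next turn, so printing the expected and actual values matters as much as the exit code itself.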

Why test first? Writing a test doesn't require codebase context; you only need the issue description. By splitting testing from editing, CodeMonkeys avoids paying for expensive codebase context tokens during the test-writing phase. The testing state machine's prompts are much shorter (they omit the code files, which can fill up to 128K tokens), which saves on prompt cache read costs.

Step 2: The Editing State Machine

Now the model receives the full codebase context (~74K tokens of relevant files) plus the test script from step 1. It generates a codebase edit in aider-style diff format. Then it runs the test before the edit (should fail) and after the edit (should pass). If the test doesn't pass post-edit, the model sees the execution output and iterates.

Two-sided debugging. The model can also revise the test during editing. Why? A test written against the unedited codebase might always fail — even after a correct fix. By letting the model adjust the test during editing, it can verify that the test fails pre-edit and passes post-edit. This bidirectional checking is critical.
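The pre-edit and post-edit check can be sketched as a small decision function. This is an illustration under the exit-code convention used in this chapter (0 means fixed, anything else means the bug is present or the script crashed); the function name and the returned labels are hypothetical.

```python
def classify_run(pre_edit_exit: int, post_edit_exit: int) -> str:
    """Interpret one test run before and after applying a candidate edit."""
    if pre_edit_exit == 0:
        # The test passes on the unedited repo, so it never detected the bug.
        return "revise test: does not reproduce the bug"
    if post_edit_exit == 0:
        # Fails before the edit and passes after it: the desired outcome.
        return "validated: fails pre-edit, passes post-edit"
    # Fails both before and after: the edit is wrong, or the test can never pass.
    return "revise edit or test: still failing post-edit"


for pre, post in [(1, 0), (0, 0), (1, 1)]:
    print(pre, post, "->", classify_run(pre, post))
```

The third case is the ambiguous one: execution output alone cannot say whether the edit or the test is at fault, which is why the model is allowed to revise either.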

The Full Trajectory

Testing SM

Input: issue description. Write test script. Run on unedited repo. Iterate until test correctly flags the bug (exit 1). No codebase files needed.

↓ test script

Editing SM

Input: issue + codebase context + test script. Write edit (aider diff). Run test pre-edit (should fail) and post-edit (should pass). Iterate on failures. Can also revise the test.

↓ (edit, test) pair

Output

A candidate codebase edit + a testing script. Both the edit and the test are used downstream for selection.

Each state machine is limited to 8 iterations (model completions). The model can choose to approve its work and terminate early. Format corrections and malformed outputs also count against the limit.
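The iteration limit and early approval can be sketched as a simple loop. This is a simplified illustration, not the authors' implementation: `propose` and `run_test` are hypothetical callables standing in for a model completion and a sandboxed test run, and the fake model below approves only after it sees a passing test.

```python
from dataclasses import dataclass
from typing import Callable, Optional

MAX_ITERATIONS = 8  # per-state-machine limit stated in the text


@dataclass
class Draft:
    edit: str
    approved: bool = False


def run_editing_state_machine(
    propose: Callable[[Optional[str]], Draft],
    run_test: Callable[[str], tuple[int, str]],
    max_iterations: int = MAX_ITERATIONS,
) -> tuple[Optional[str], int]:
    """Iterate until the model approves its edit or the budget runs out."""
    feedback: Optional[str] = None
    last_edit: Optional[str] = None
    for turn in range(1, max_iterations + 1):
        draft = propose(feedback)  # one model completion counts as one iteration
        last_edit = draft.edit
        if draft.approved:
            return last_edit, turn  # early termination
        exit_code, output = run_test(draft.edit)
        feedback = f"exit={exit_code}\n{output}"
    return last_edit, max_iterations


def make_fake_model() -> Callable[[Optional[str]], Draft]:
    """Hypothetical model: tries new edits until the test passes, then approves."""
    attempts = iter(["edit-v1", "edit-v2", "edit-v3"])
    state = {"last": ""}

    def propose(feedback: Optional[str]) -> Draft:
        if feedback is not None and feedback.startswith("exit=0"):
            return Draft(edit=state["last"], approved=True)
        state["last"] = next(attempts)
        return Draft(edit=state["last"])

    return propose


def fake_test(edit: str) -> tuple[int, str]:
    """Hypothetical test runner: only the second edit fixes the bug."""
    return (0, "ok") if edit == "edit-v2" else (1, "AssertionError")


result = run_editing_state_machine(make_fake_model(), fake_test)
print(result)
```

Nothing in the loop stops a model from approving a wrong edit; the approval flag is trusted as-is, which is the weakness Chapter 4 returns to.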

Why does CodeMonkeys split test generation into a separate state machine from edit generation?

Because test scripts are always trivial to write
Because the testing framework requires a separate model
Because test writing doesn't need codebase context, so the testing state machine avoids the expensive 128K-token prompt, reducing cache read costs

Chapter 3: Reading the Whole Codebase

Most SWE-bench codebases contain millions of tokens. You can't feed everything to the model. Existing approaches use embedding-based retrieval, iterative file-tree expansion, or tool-based browsing. CodeMonkeys takes a radically simpler approach: let an LLM read every file.

Amortized Context

The key insight is amortization. If you're generating 10 candidate edits per issue, you can run the expensive file scan once and share the results across all 10 trajectories. The scan costs $334 across all 500 issues — about 15% of the total budget. If you re-ran it for each of the 10 edits, it would become the most expensive step. Sharing makes the simple approach affordable.

The amortization math: Scanning all files for 500 issues costs $334 total. That's $0.67 per issue. If you ran it 10 times per issue (once per trajectory), it would be $6.70 per issue — $3,340 total, more than the entire rest of the system. Sharing context across parallel trajectories makes brute-force retrieval viable.
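The same arithmetic, written out so each figure can be checked. All inputs are numbers quoted in this document; the variable names are illustrative.

```python
scan_cost_total = 334.00   # relevance scan, all issues (USD)
total_cost = 2291.90       # whole system, all issues (USD)
num_issues = 500
trajectories_per_issue = 10

# Shared: one scan per issue, reused by every trajectory.
shared_per_issue = scan_cost_total / num_issues

# Unshared: one scan per trajectory.
unshared_per_issue = shared_per_issue * trajectories_per_issue
unshared_total = scan_cost_total * trajectories_per_issue

# Everything except the scan, for comparison.
rest_of_system = total_cost - scan_cost_total

print(f"shared scan:   ${shared_per_issue:.2f} per issue")
print(f"unshared scan: ${unshared_per_issue:.2f} per issue, ${unshared_total:,.0f} total")
print(f"rest of system: ${rest_of_system:,.2f}")
```

The unrounded per-issue figure is $6.68; the $6.70 in the text comes from multiplying the already rounded $0.67 by 10.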

The Two-Stage Pipeline

Stage 1 — Relevance scan: Qwen-2.5-Coder-32B (run locally, cheap) reads every Python file in the repo (excluding test directories) and decides whether each file is relevant to the issue. It also generates a concise summary of how each relevant file relates to the issue. On average, this processes 2.94 million tokens per problem.

Stage 2 — Ranking: Claude 3.5 Sonnet receives the file names, summaries, and token counts of all relevant files. It ranks them by importance, targeting ~60K tokens of context. Three ranking completions are averaged to produce a stable final ordering. Files are included up to a 128K token limit, yielding an average of 74,570 tokens of context — a 50.5x compression from the raw scan.
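A sketch of how three rankings might be merged and then cut at a token budget. The text does not spell out the averaging rule, so two choices here are assumptions: files are ordered by mean rank position (a file missing from one ranking is placed last in it), and inclusion stops at the first file that would exceed the limit. File names and token counts are made up.

```python
from collections import defaultdict


def merge_rankings(rankings: list[list[str]]) -> list[str]:
    """Order files by their mean position across several rankings."""
    all_files = {f for ranking in rankings for f in ranking}
    positions: dict[str, list[int]] = defaultdict(list)
    for ranking in rankings:
        index = {f: i for i, f in enumerate(ranking)}
        for f in all_files:
            # Assumption: a file absent from this ranking is treated as ranked last.
            positions[f].append(index.get(f, len(ranking)))
    # Tie-break on file name so the result is deterministic.
    return sorted(all_files, key=lambda f: (sum(positions[f]) / len(rankings), f))


def select_context(
    ordered: list[str], token_counts: dict[str, int], limit: int = 128_000
) -> tuple[list[str], int]:
    """Take files in rank order until the next one would exceed the limit."""
    chosen: list[str] = []
    used = 0
    for f in ordered:
        if used + token_counts[f] > limit:
            break
        chosen.append(f)
        used += token_counts[f]
    return chosen, used


rankings = [
    ["a.py", "b.py", "c.py"],
    ["b.py", "a.py", "c.py"],
    ["a.py", "c.py", "b.py"],
]
token_counts = {"a.py": 70_000, "b.py": 50_000, "c.py": 20_000}

ordered = merge_rankings(rankings)
chosen, used = select_context(ordered, token_counts)
print(ordered, chosen, used)
```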

Recall

With a 128K token limit, 92.6% of SWE-bench instances have all the correct files in context. The remaining 7.4% are cases where the needed files were either missed by the scan or ranked too low to fit.

Context Pipeline: From Millions of Tokens to 74K

A step-by-step view of the two-stage compression pipeline.

Why can CodeMonkeys afford to let an LLM read every file in the codebase?

Because the scan is run once and shared across all 10 parallel trajectories (amortization), making the per-trajectory cost negligible at ~15% of total budget
Because the codebases are small enough to fit in context
Because reading files is free with the Claude API

Chapter 4: Serial Scaling

Serial scaling means giving each trajectory more turns to iterate. The model writes an edit, runs its test, sees what failed, and tries again. Let's see this in action.

What Happens Across Turns

In the editing state machine, a typical trajectory might look like this:

Turn 1

Model writes initial edit based on issue + context. Runs test. Test crashes with ImportError — wrong module path.

↓

Turn 2

Model fixes the import. Runs test again. Test runs but fails — the edit handled the main case but missed an edge case in the error message.

↓

Turn 3

Model reads the test output and recognizes the missed edge case. Adjusts the patch. Runs test. The test now fails pre-edit and passes post-edit. Model approves.

Each turn provides execution feedback — the model sees the actual stdout/stderr from running its test. This is far richer than having the model self-reflect without execution. The authors note that "models cannot self-correct reasoning" without external feedback, which is why the test-execute-iterate loop is essential.

The first few turns matter most. The coverage-vs-cost curves show that the first 2-3 serial iterations produce the biggest improvement. Early iterations typically fix configuration errors, import issues, and easy-to-catch bugs. Later iterations encounter diminishing returns — the model may already have approved its work or may be stuck in a loop.

The Stuck Trajectory Problem

Serial scaling has a fundamental limitation: a model can incorrectly approve its own work and terminate the state machine early. If the model believes its edit is correct (even when it isn't), adding more iterations doesn't help because the state machine has already terminated. This is a key reason why parallel scaling is needed as a complement.

Serial Scaling: Watch a Trajectory Iterate

The turns of an editing state machine (up to 8), with edit quality improving as execution feedback arrives.

Why does serial scaling hit diminishing returns?

Because the API costs increase linearly
Because the model can incorrectly approve its work and terminate early, and stuck trajectories don't benefit from more allowed turns
Because the context window fills up

Chapter 5: Parallel Scaling

Parallel scaling means running multiple independent trajectories for the same issue, each with a positive sampling temperature to introduce diversity. The trajectories don't know about each other — they're completely independent.
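A toy sketch of independent sampling. Here `run_trajectory` is a hypothetical stand-in: a seeded random choice plays the role of temperature sampling, and the candidate "edits" are placeholder strings. The point is only that trajectories share no state and can run concurrently.

```python
import random
from concurrent.futures import ThreadPoolExecutor

# Placeholder approaches a trajectory might land on (illustrative only).
APPROACHES = ["patch-parser", "patch-validator", "patch-serializer"]


def run_trajectory(seed: int) -> dict:
    """One independent attempt; its own RNG stands in for temperature sampling."""
    rng = random.Random(seed)
    return {"seed": seed, "edit": rng.choice(APPROACHES)}


def sample_trajectories(n: int = 10) -> list[dict]:
    """Run n trajectories concurrently; no trajectory sees another's output."""
    with ThreadPoolExecutor(max_workers=n) as pool:
        return list(pool.map(run_trajectory, range(n)))


candidates = sample_trajectories(10)
print(sorted({c["edit"] for c in candidates}))
```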

Coverage Increases Log-Linearly

This was the central finding of the authors' prior work, Large Language Monkeys: the fraction of problems solved by any sample (coverage) increases approximately log-linearly with the number of samples. If you plot coverage vs. number of trajectories on a log-x scale, you get a roughly straight line.

For CodeMonkeys with 10 trajectories and 8 iterations each, coverage reaches 69.8%. That means nearly 70% of SWE-bench Verified issues have a correct patch somewhere among the 10 candidates.

69.8% coverage vs. 57.4% final score. The gap between coverage and final score is the selection problem. If you had a perfect oracle that always picked the correct edit, you'd score 69.8%. Random selection gives only 45.8%. The actual selection method recovers about half of the gap between random selection and the oracle: (57.4 - 45.8) / (69.8 - 45.8) ≈ 48%.
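A small sketch of how these quantities relate. The 4x3 results matrix is invented for illustration; only the three percentages at the bottom come from this document.

```python
def coverage(results: list[list[bool]]) -> float:
    """Fraction of issues where at least one candidate edit is correct."""
    return sum(any(row) for row in results) / len(results)


def random_selection_score(results: list[list[bool]]) -> float:
    """Expected score when one candidate per issue is picked uniformly at random."""
    return sum(sum(row) / len(row) for row in results) / len(results)


# Toy data: results[i][j] is True if trajectory j's edit resolves issue i.
toy = [
    [True, False, False],
    [False, False, False],
    [True, True, False],
    [False, True, True],
]
toy_coverage = coverage(toy)
toy_random = random_selection_score(toy)

# Reported figures from the text (percent of SWE-bench Verified issues).
oracle, random_pick, final = 69.8, 45.8, 57.4
gap_recovered = (final - random_pick) / (oracle - random_pick)

print(f"toy coverage={toy_coverage:.2f}, toy random={toy_random:.3f}")
print(f"fraction of random-to-oracle gap recovered: {gap_recovered:.1%}")
```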

Parallel vs. Deeper Serial at Equal Cost

Consider two configurations that cost roughly the same:

Config A: 2 trajectories, 8 turns each
Config B: 8 trajectories, 2 turns each

They achieve similar coverage! The paper finds that, after the first few iterations, configurations with similar total cost reach similar coverage regardless of how the budget is split between serial and parallel. But there's a catch: more parallel samples make selection harder. You have more candidates to sift through, and more wrong answers diluting the signal.

The Fresh Start Advantage

Parallel scaling has one guarantee that serial scaling lacks: every new trajectory is a fresh start. A stuck trajectory that approved a wrong edit can't be saved by more turns. But a new parallel trajectory starts from scratch and may take a completely different approach to the same problem.

Parallel Scaling: Coverage vs. Number of Trajectories

Coverage grows as more trajectories are sampled. Each new trajectory is an independent attempt that may solve a problem no prior trajectory solved.

What is the "fresh start advantage" of parallel scaling over serial scaling?

Parallel trajectories are cheaper per turn
Parallel trajectories use a larger model
A new trajectory starts from scratch and may take a completely different approach, unlike a stuck serial trajectory that already approved a wrong answer

Chapter 6: Selection

You have 10 candidate edits per issue, each with a corresponding test script. Coverage is 69.8%. Random selection gives 45.8%. How do you pick the right one?

Method 1: Majority Voting with Tests

Run each of the 10 model-generated tests on each of the 10 edits. That's 100 test executions per issue. Select the edit that passes the most tests. The intuition: if your edit is correct, it should pass most of the tests written by other trajectories (since those tests are trying to reproduce the same bug). Score: 53.0%.

Method 2: Model Selection

Show a model the issue description, codebase context, and all 10 candidate edits in git diff form. Ask it to pick the best one. Score: 52.0%. Surprisingly, this underperforms pure test voting. The model struggles to distinguish between 10 similar-looking diffs.

Method 3: Top-3 Filtering + Model Selection

First, use test voting to narrow down to the top 3 edits (the ones passing the most tests). Then use model selection among only 3 candidates instead of 10. Score: 55.6%. The filtering step removes noise, making the model's job much easier.

Method 4: Selection State Machine (Best)

Upgrade from a single model call to a multi-turn state machine. After top-3 filtering, the selection model can write new test scripts specifically designed to distinguish between the remaining candidates. It runs these tests on all 3 edits and the unedited codebase, then decides: write another distinguishing test, or pick a winner. Score: 57.4%.

The voting mechanism explained: Each trajectory produces both an edit and a test. Trajectory #3's test might be "import the fixed module and check the error message format." Trajectory #7's test might be "trigger the bug with a specific input and check the return value." A correct edit should pass both tests. An incorrect edit might pass its own test but fail others. Cross-voting exposes this.
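Cross-voting and the top-3 filter can be sketched with a pass/fail matrix. The matrix values are invented, and breaking ties by lowest edit index is an assumption; the text does not say how ties are handled.

```python
def vote_counts(passes: list[list[bool]]) -> list[int]:
    """passes[t][e] is True if test t passes on edit e; return votes per edit."""
    num_edits = len(passes[0])
    return [sum(row[e] for row in passes) for e in range(num_edits)]


def top_k_edits(passes: list[list[bool]], k: int = 3) -> list[int]:
    """Indices of the k edits passing the most tests (ties: lowest index first)."""
    counts = vote_counts(passes)
    return sorted(range(len(counts)), key=lambda e: (-counts[e], e))[:k]


# Toy example with 4 tests (rows) and 4 candidate edits (columns).
passes = [
    [True, False, True, False],
    [True, True, False, False],
    [True, False, True, False],
    [False, True, True, True],
]

counts = vote_counts(passes)
majority_winner = top_k_edits(passes, k=1)[0]
shortlist = top_k_edits(passes, k=3)
print(counts, majority_winner, shortlist)
```

With `k=1` this is Method 1; with `k=3` it produces the shortlist that Methods 3 and 4 hand to the model.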

Selection Method	Score %
Random Selection	45.8
Majority Voting (tests)	53.0
Model Selection (raw)	52.0
Top-3 Filter + Model	55.6
Selection State Machine	57.4
Oracle (coverage ceiling)	69.8

The selection state machine costs <6% of total budget. Despite being the most expensive selection method, it adds only $132 across all 500 issues. That's a good return: +1.8 percentage points over top-3 filtering plus model selection, for 5.8% of the total cost.

Why does model selection (52.0%) underperform majority voting with tests (53.0%)?

Because distinguishing between 10 similar-looking git diffs by reading alone is harder than running executable tests that provide ground-truth signal about whether an edit actually fixes the bug
Because the model used for selection is weaker
Because model selection is more expensive

Chapter 7: Results

Final Score: 57.4% on SWE-bench Verified

CodeMonkeys resolves 57.4% of SWE-bench Verified issues. At the time of publication (January 2025), this placed it among the top systems on the leaderboard.

Cost Breakdown

Total cost for all 500 issues: $2,291.90. That's about $4.58 per issue. Let's break it down.

Stage	Cost (USD)	% of Total
Relevance Scan (Qwen, local)	334	14.6%
Ranking (Claude)	20	0.9%
Test Generation (Claude)	440	19.2%
Edit Generation (Claude)	1,366	59.6%
Selection (Claude)	132	5.8%

Edit generation dominates. Nearly 60% of costs go to the editing state machines. The biggest sub-component is prompt cache reads — every editing turn re-reads the ~74K tokens of codebase context. This is why separating test generation (no codebase context needed) from edit generation was an important cost optimization.
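The table's percentages can be re-derived from its dollar figures. The rows are rounded to whole dollars, so they sum to $2,292 instead of the exact $2,291.90 quoted above; the shares agree to one decimal place either way.

```python
# Stage costs in USD, copied from the table above.
costs = {
    "Relevance Scan": 334,
    "Ranking": 20,
    "Test Generation": 440,
    "Edit Generation": 1366,
    "Selection": 132,
}
num_issues = 500

total = sum(costs.values())
shares = {stage: round(100 * cost / total, 1) for stage, cost in costs.items()}
per_issue = round(total / num_issues, 2)

for stage, share in shares.items():
    print(f"{stage:<16} ${costs[stage]:>5,}  {share:>5.1f}%")
print(f"total ${total:,}  (${per_issue} per issue)")
```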

The Barrel of Monkeys

CodeMonkeys' selection method works on candidates from any source. The authors created an ensemble — the "Barrel of Monkeys" — by combining CodeMonkeys' edits with the top 4 SWE-bench submissions (Blackbox AI, CodeStory, Learn-by-interact, devlo). This 5-system ensemble has coverage of 80.8%. After selection: 66.2% — higher than any individual system.

SWE-bench Verified Leaderboard

Comparing CodeMonkeys and Barrel of Monkeys to top submissions. Oracle selection shows the coverage ceiling.

Why does the Barrel of Monkeys (66.2%) outperform its best individual member (Blackbox AI at 62.8%)?

Because different systems solve different subsets of problems, so ensembling gives higher coverage (80.8%), and the selection state machine can identify the correct edit among diverse candidates
Because the Barrel uses a stronger model
Because ensemble methods always outperform individual methods

Chapter 8: Scaling Analysis

The paper provides one of the most thorough analyses of serial vs. parallel scaling tradeoffs in the literature. Let's dig into the data.

The Coverage Frontier

The authors sweep over all combinations of 1-10 parallel trajectories and 1-8 serial turns, measuring coverage and cost for each configuration. The key finding: after the first few iterations, configurations with similar total cost achieve similar coverage, regardless of how the budget is split.

For example, at ~$500 total cost:

3 trajectories x 8 turns = ~58% coverage
5 trajectories x 4 turns = ~58% coverage
10 trajectories x 2 turns = ~57% coverage

The configurations converge to roughly the same coverage. But they don't converge to the same final score after selection, because more parallel samples make selection harder.

The first 2-3 turns are special. Going from 1 turn to 3 turns per trajectory produces a massive coverage jump. This makes sense: the first turn writes a draft, the second catches import/config errors, the third fixes the actual logic. Beyond that, additional turns encounter diminishing returns.

Serial vs. Parallel: Different Tradeoffs

Dimension	Serial (more turns)	Parallel (more trajectories)
What it provides	Refinement within an approach	Diversity across approaches
Diminishing returns	Model may approve early and terminate	Log-linear: each new sample has decreasing marginal coverage
Selection impact	Fewer candidates = easier selection	More candidates = harder selection
Failure mode	Stuck trajectory can't escape bad approach	Shallow attempts may not iterate enough to fix errors
Guarantee	No guarantee model uses extra turns	Each new trajectory guarantees a fresh start

Majority Voting Score Also Shows Convergence

When using majority voting for selection (not the selection state machine), similar-cost configurations also yield similar final scores. This means the frontier property holds not just for coverage but also for the final output — at least under simple selection methods.

Practical recommendation: The paper suggests that the "sweet spot" is somewhere around 8-10 trajectories with 6-8 turns each. This gives enough serial depth for the first few critical iterations while providing enough parallel diversity to cover the long tail of problems that any single trajectory might miss.

Why don't configurations with the same coverage always achieve the same final score?

Because coverage is measured on a different dataset
Because parallel-heavy configurations use more expensive models
Because more parallel samples make selection harder: having a correct edit among many wrong ones is useless if you can't identify it

Chapter 9: Connections

CodeMonkeys sits at the intersection of test-time compute scaling, coding agents, and software engineering automation. Let's map its lineage.

Large Language Monkeys (Brown et al., 2024)

By the same authors. Established that coverage increases log-linearly with repeated sampling across math and coding tasks. CodeMonkeys is the direct follow-up: "How would you design a system if scaling test-time compute was a primary consideration?" Large Language Monkeys showed the potential; CodeMonkeys builds the actual system.

AlphaCode (Li et al., 2022)

DeepMind's system for competitive programming. Also uses massive parallel sampling (up to 1M samples) with a filtering/clustering pipeline for selection. CodeMonkeys operates at a much smaller scale (10 samples) but with richer per-sample compute (multi-turn iteration). AlphaCode showed that generating many candidates and selecting well can beat generating one perfect answer.

Agentless (Xia et al., 2024)

Also decomposes SWE-bench into context + generation + selection. But Agentless uses a fixed pipeline (localization → repair → selection) without multi-turn iteration. CodeMonkeys adds the iteration loop and scales both serial and parallel compute more aggressively.

Meta-Harness (Lee et al., 2026)

Takes the outer-loop optimization idea further: instead of designing the coding agent by hand, automatically search over the harness code that wraps the model. Meta-Harness could in principle discover a CodeMonkeys-like architecture through automated search.

RLEF / Process Reward Models

CodeMonkeys uses model-generated tests as a form of process verification. Each test checks whether the edit resolves the specific issue. This is related to process reward models that evaluate intermediate reasoning steps, but CodeMonkeys' tests are executable — they provide ground-truth signal, not learned estimates.

Cheat Sheet

Aspect	CodeMonkeys
Task	SWE-bench Verified (500 real GitHub issues)
Primary model	Claude 3.5 Sonnet
Context model	Qwen-2.5-Coder-32B (local)
Parallel trajectories	10 per issue
Serial turns	Up to 8 per state machine
Context strategy	Scan every file, rank, 128K limit (92.6% recall)
Coverage	69.8%
Final score	57.4%
Total cost	~$2,300 ($4.58/issue)
Selection method	Test voting top-3 + selection state machine
Barrel of Monkeys	66.2% (5-system ensemble)

The bigger picture: CodeMonkeys demonstrates that test-time compute scaling works for software engineering, not just math and reasoning. The system design principles — iterate with execution feedback, sample diverse approaches, amortize shared costs, combine automated and model-based selection — likely generalize to any domain where outputs are verifiable.

What is the key design principle from CodeMonkeys that generalizes beyond SWE-bench?

Using Claude 3.5 Sonnet specifically
Only working with Python codebases
Combine serial iteration (with execution feedback) and parallel diversity (with amortized shared costs), then select using verifiable signals, applicable wherever outputs can be tested

CodeMonkeys: Scaling Test-Time Compute for Software Engineering