Fix real GitHub issues by combining serial (multi-turn iteration) and parallel (multi-trajectory sampling) test-time compute. 57.4% on SWE-bench Verified for ~$2300.
You have a real GitHub issue. A user reports a bug in a popular Python library — maybe django, sympy, or scikit-learn. The issue description explains the expected behavior, the actual behavior, and maybe a minimal reproduction. Your job: read the codebase, find the right files, write a patch that resolves the issue, and make sure you didn't break anything else.
This is the SWE-bench problem. It's a benchmark of 500 real GitHub issues (the "Verified" split) from 12 popular Python repositories, each with a known solution and a test suite that checks whether your patch actually works. No toy problems — these are the same issues that real open-source developers solved by hand.
The standard approach is to give an LLM the issue description and some codebase context, ask it to generate a patch, and submit. Maybe you let it use tools — search files, run commands. Maybe you let it iterate a few turns. But you fundamentally generate one attempt and submit it.
The problem? Single-shot approaches plateau. An LLM might solve 40-50% of issues on its best attempt, but there's a long tail of issues where it gets stuck — wrong files, wrong diagnosis, wrong patch. More turns sometimes help, but the model can get locked into a bad trajectory. What if you could throw more compute at the problem?
A single attempt solves ~46% of issues. But if you sample 10 independent attempts and could magically pick the right one, coverage jumps to ~70%. The gap is the selection problem.
CodeMonkeys decomposes the problem into three subtasks — context, generation, selection — and scales test-time compute along two orthogonal axes.
Instead of asking the model to generate a patch in one shot, let it iterate. The model writes a draft patch, generates a test script, runs the test, reads the output, revises the patch, re-runs the test. Each turn gives the model execution feedback it can use to refine its answer. More turns = more refinement. This is serial scaling.
Instead of running one trajectory, run many. Each trajectory starts from scratch with a positive sampling temperature, producing diverse approaches to the same issue. If trajectory #3 gets stuck on a bad diagnosis but trajectory #7 nails it, you haven't lost anything. More trajectories = more diversity. This is parallel scaling.
The system uses Claude 3.5 Sonnet as the primary model and Qwen-2.5-Coder-32B for the initial file scan. For each of the 500 SWE-bench Verified issues, CodeMonkeys runs 10 parallel trajectories with up to 8 turns each, then selects a final answer. The whole run costs ~$2300 in inference, including the locally run Qwen scan.
The core execution unit of CodeMonkeys is a pair of state machines: a Testing State Machine followed by an Editing State Machine. Let's walk through exactly what happens.
Step 1 (Testing State Machine): Given the issue description (but not the codebase context), the model writes a standalone Python test script that tries to reproduce the bug. The script uses exit codes to communicate: exit 1 if the bug is present, exit 0 if it's fixed. The model runs this script on the unedited codebase, reads the execution output, and iterates until the script correctly detects the bug.
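As a rough illustration of the exit-code convention described above, a reproduction script might look like the following. The `library_function` stand-in and its apostrophe bug are hypothetical, chosen only so the sketch runs on its own; a real script would import the library named in the issue.

```python
import sys


def library_function(text: str) -> str:
    """Hypothetical stand-in for the library code named in the issue."""
    # str.title() capitalizes the letter after an apostrophe, which plays
    # the role of the reported bug in this sketch.
    return text.title()


def bug_is_present() -> bool:
    """Return True while the behavior described in the issue still occurs."""
    return library_function("they're here") != "They're Here"


def main() -> int:
    """Exit-code convention from the text: 1 = bug present, 0 = fixed."""
    if bug_is_present():
        print("Bug reproduced: apostrophe handling is wrong", file=sys.stderr)
        return 1
    print("Bug not reproduced: the issue appears fixed")
    return 0


# A standalone script would end with `sys.exit(main())`. The code is kept
# in a variable here so the sketch can be loaded without exiting.
exit_code = main()
```

Because the script reports through its exit code alone, the harness can run it against any version of the codebase without parsing its output.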
Step 2 (Editing State Machine): The model now receives the full codebase context (~74K tokens of relevant files) plus the test script from step 1. It generates a codebase edit in aider-style diff format. Then it runs the test before the edit (should fail) and after the edit (should pass). If the test doesn't pass post-edit, the model sees the execution output and iterates.
Each state machine is limited to 8 iterations (model completions). The model can choose to approve its work and terminate early. Format corrections and malformed outputs also count against the limit.
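The iteration cap and early approval can be sketched as a small loop. This is a minimal sketch, not the authors' implementation: the `Completion` shape, the `next_completion` callback, and the toy `scripted_model` are all hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple

MAX_ITERATIONS = 8  # per-state-machine cap stated in the text


@dataclass
class Completion:
    """One model completion (hypothetical shape, for illustration only)."""
    draft: Optional[str]   # None models a malformed or unparseable output
    approve: bool = False  # True when the model approves its own work


def run_state_machine(
    next_completion: Callable[[Optional[str], int], Completion],
    max_iterations: int = MAX_ITERATIONS,
) -> Tuple[Optional[str], int]:
    """Return (final draft, number of completions used)."""
    draft: Optional[str] = None
    for used in range(1, max_iterations + 1):
        completion = next_completion(draft, used)
        if completion.draft is None:
            continue  # malformed output still consumes one iteration
        draft = completion.draft
        if completion.approve:
            return draft, used  # early termination on approval
    return draft, max_iterations


def scripted_model(draft: Optional[str], turn: int) -> Completion:
    """Toy stand-in for the LLM: malformed, then a draft, then approval."""
    if turn == 1:
        return Completion(draft=None)
    if turn == 2:
        return Completion(draft="patch v1")
    return Completion(draft="patch v2", approve=True)


final_draft, completions_used = run_state_machine(scripted_model)
print(final_draft, completions_used)
```

In this toy run the malformed first completion uses up one of the eight iterations, and the machine stops as soon as the model approves, whether or not the approved draft is actually correct.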
Most SWE-bench codebases contain millions of tokens. You can't feed everything to the model. Existing approaches use embedding-based retrieval, iterative file-tree expansion, or tool-based browsing. CodeMonkeys takes a radically simpler approach: let an LLM read every file.
The key insight is amortization. If you're generating 10 candidate edits per issue, you can run the expensive file scan once and share the results across all 10 trajectories. The scan costs $334 across all 500 issues — about 15% of the total budget. If you re-ran it for each of the 10 edits, it would become the most expensive step. Sharing makes the simple approach affordable.
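The amortization argument is simple arithmetic on the figures quoted in this article. The sketch below assumes that re-running the scan per trajectory would scale its cost linearly, which is an assumption rather than a measured number.

```python
SCAN_COST_USD = 334.0          # relevance scan, all 500 issues (from the text)
TOTAL_COST_USD = 2291.90       # full run (from the text)
EDIT_GENERATION_USD = 1366.0   # largest single stage in the cost breakdown
TRAJECTORIES = 10

# Share of the budget when the scan is run once and shared.
shared_share = SCAN_COST_USD / TOTAL_COST_USD

# Hypothetical cost if the scan were repeated for every trajectory
# (assumes linear scaling).
unshared_scan_cost = SCAN_COST_USD * TRAJECTORIES

print(f"shared scan: {shared_share:.1%} of total")
print(f"unshared scan: ${unshared_scan_cost:,.0f}")
print("unshared scan exceeds edit generation:",
      unshared_scan_cost > EDIT_GENERATION_USD)
```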
Stage 1 (Relevance scan): Qwen-2.5-Coder-32B (run locally, cheap) reads every Python file in the repo (excluding test directories) and decides whether each file is relevant to the issue. It also generates a concise summary of how each relevant file relates to the issue. On average, this processes 2.94 million tokens per problem.
Stage 2 (Ranking): Claude 3.5 Sonnet receives the file names, summaries, and token counts of all relevant files. It ranks them by importance, targeting ~60K tokens of context. Three ranking completions are averaged to produce a stable final ordering. Files are included up to a 128K token limit, yielding an average of 74,570 tokens of context, roughly a 39x compression from the 2.94 million tokens scanned per problem.
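One plausible way to average several rankings and then fill the context window is sketched below. The details are assumptions, not taken from the paper: a file missing from a ranking is treated as if it were placed last, ties are broken alphabetically, and packing stops at the first file that would exceed the limit. The function names are hypothetical.

```python
from typing import Dict, List, Tuple


def average_rankings(rankings: List[List[str]]) -> List[str]:
    """Order files by their mean position across several rankings."""
    files = set().union(*rankings)

    def mean_position(name: str) -> float:
        # Assumption: a file absent from a ranking counts as ranked last.
        positions = [r.index(name) if name in r else len(r) for r in rankings]
        return sum(positions) / len(positions)

    return sorted(files, key=lambda name: (mean_position(name), name))


def pack_context(
    ordered: List[str], token_counts: Dict[str, int], limit: int = 128_000
) -> Tuple[List[str], int]:
    """Take files in rank order until the next one would exceed the limit."""
    chosen: List[str] = []
    used = 0
    for name in ordered:
        if used + token_counts[name] > limit:
            break  # assumption: stop at the first file that does not fit
        chosen.append(name)
        used += token_counts[name]
    return chosen, used


rankings = [["a.py", "b.py", "c.py"],
            ["b.py", "a.py", "c.py"],
            ["a.py", "c.py", "b.py"]]
tokens = {"a.py": 70_000, "b.py": 50_000, "c.py": 20_000}

final_order = average_rankings(rankings)
context_files, context_tokens = pack_context(final_order, tokens)
print(final_order, context_files, context_tokens)
```

Averaging positions smooths out the variation between individual ranking completions, which is the stability benefit the text describes.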
With a 128K token limit, 92.6% of SWE-bench instances have all the correct files in context. The remaining 7.4% are cases where the needed files were either missed by the scan or ranked too low to fit.
Serial scaling means giving each trajectory more turns to iterate. The model writes an edit, runs its test, sees what failed, and tries again. Let's see this in action.
In the editing state machine, a typical trajectory might look like this:
Each turn provides execution feedback — the model sees the actual stdout/stderr from running its test. This is far richer than having the model self-reflect without execution. The authors note that "models cannot self-correct reasoning" without external feedback, which is why the test-execute-iterate loop is essential.
Serial scaling has a fundamental limitation: a model can incorrectly approve its own work and terminate the state machine early. If the model believes its edit is correct (even when it isn't), adding more iterations doesn't help because the state machine has already terminated. This is a key reason why parallel scaling is needed as a complement.
Parallel scaling means running multiple independent trajectories for the same issue, each with a positive sampling temperature to introduce diversity. The trajectories don't know about each other — they're completely independent.
The central finding of the authors' prior work, Large Language Monkeys, was that the fraction of problems solved by any sample (coverage) increases approximately log-linearly with the number of samples. If you plot coverage vs. number of trajectories on a log-x scale, you get a roughly straight line.
For CodeMonkeys with 10 trajectories and 8 iterations each, coverage reaches 69.8%. That means nearly 70% of SWE-bench Verified issues have a correct patch somewhere among the 10 candidates.
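Coverage at k samples can be estimated from per-issue counts of correct trajectories. The sketch below uses the standard unbiased pass@k estimator; treating it as the exact estimator behind the numbers above is an assumption, and `toy_counts` is made-up data for illustration only.

```python
from math import comb
from typing import List


def coverage_at_k(n: int, c: int, k: int) -> float:
    """Chance that at least one of k samples, drawn without replacement
    from n samples of which c are correct, is correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when k > n - c, so coverage is 1 in that case.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_coverage(correct_counts: List[int], n: int, k: int) -> float:
    """Average coverage@k over issues."""
    return sum(coverage_at_k(n, c, k) for c in correct_counts) / len(correct_counts)


# Made-up example: 4 issues, 10 trajectories each, with this many correct.
toy_counts = [0, 1, 5, 10]
print(mean_coverage(toy_counts, n=10, k=1))
print(mean_coverage(toy_counts, n=10, k=10))
```

At k=1 this quantity is also the expected score of picking one candidate uniformly at random, which is why single-attempt accuracy and random selection coincide.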
Consider two configurations that cost roughly the same:
They achieve similar coverage! The paper finds that, after the first few iterations, configurations with similar total cost reach similar coverage regardless of how the budget is split between serial and parallel. But there's a catch: more parallel samples make selection harder. You have more candidates to sift through, and more wrong answers diluting the signal.
Parallel scaling has one guarantee that serial scaling lacks: every new trajectory is a fresh start. A stuck trajectory that approved a wrong edit can't be saved by more turns. But a new parallel trajectory starts from scratch and may take a completely different approach to the same problem.
You have 10 candidate edits per issue, each with a corresponding test script. Coverage is 69.8%. Random selection gives 45.8%. How do you pick the right one?
Majority voting with tests: run each of the 10 model-generated tests on each of the 10 edits. That's 100 test executions per issue. Select the edit that passes the most tests. The intuition: if your edit is correct, it should pass most of the tests written by other trajectories (since those tests are trying to reproduce the same bug). Score: 53.0%.
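Test voting reduces to counting passes per edit in a tests-by-edits matrix. A minimal sketch follows, using a made-up 3x3 matrix instead of the real 10x10 one; breaking ties by lowest edit index is an assumption.

```python
from typing import List


def vote_counts(passes: List[List[bool]]) -> List[int]:
    """passes[t][e] is True when test t passes on edit e."""
    n_edits = len(passes[0])
    return [sum(row[e] for row in passes) for e in range(n_edits)]


def select_by_voting(passes: List[List[bool]]) -> int:
    """Index of the edit passing the most tests (ties: lowest index)."""
    counts = vote_counts(passes)
    return max(range(len(counts)), key=lambda e: counts[e])


# Made-up example: rows are tests, columns are edits.
toy_passes = [
    [True, False, True],
    [True, True, False],
    [True, False, False],
]
print(vote_counts(toy_passes), select_by_voting(toy_passes))
```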
Model selection: show a model the issue description, codebase context, and all 10 candidate edits in git diff form. Ask it to pick the best one. Score: 52.0%. Surprisingly, this underperforms pure test voting. The model struggles to distinguish between 10 similar-looking diffs.
Top-3 filter plus model selection: first, use test voting to narrow down to the top 3 edits (the ones passing the most tests). Then use model selection among only 3 candidates instead of 10. Score: 55.6%. The filtering step removes noise, making the model's job much easier.
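The filter-then-select step can be sketched as a shortlist followed by a chooser. The `choose` callback stands in for the model call and is hypothetical, as is the alphabetical-style tie-breaking by edit index.

```python
from typing import Callable, List


def top_k_edits(counts: List[int], k: int = 3) -> List[int]:
    """Indices of the k edits with the most passing tests."""
    # Assumption: ties are broken by lowest edit index.
    return sorted(range(len(counts)), key=lambda e: (-counts[e], e))[:k]


def select_with_filter(
    counts: List[int], choose: Callable[[List[int]], int], k: int = 3
) -> int:
    """Shortlist by test votes, then let `choose` pick among the shortlist."""
    shortlist = top_k_edits(counts, k)
    pick = choose(shortlist)
    if pick not in shortlist:
        raise ValueError("chooser must return an edit from the shortlist")
    return pick


# Made-up vote counts for five edits, and a toy chooser.
toy_counts = [3, 7, 7, 2, 9]
shortlist = top_k_edits(toy_counts)
picked = select_with_filter(toy_counts, choose=lambda options: options[1])
print(shortlist, picked)
```

Restricting the chooser to the shortlist is what keeps weak candidates from diluting the model's decision.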
Selection state machine: upgrade from a single model call to a multi-turn state machine. After top-3 filtering, the selection model can write new test scripts specifically designed to distinguish between the remaining candidates. It runs these tests on all 3 edits and the unedited codebase, then decides: write another distinguishing test, or pick a winner. Score: 57.4%.
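A test only helps selection if the candidates disagree on it. The helper below is a hypothetical sketch of how the results of one such test might be summarized before being shown to the selection model; how the real system presents or uses these results is not specified here, and the exit codes in the example are made up.

```python
from typing import Dict, List

UNEDITED = "unedited"  # key for the baseline run on the original codebase


def is_distinguishing(exit_codes: Dict[str, int]) -> bool:
    """True when the candidate edits do not all produce the same exit code."""
    candidate_codes = {code for name, code in exit_codes.items() if name != UNEDITED}
    return len(candidate_codes) > 1


def passing_candidates(exit_codes: Dict[str, int]) -> List[str]:
    """Candidate edits on which the test exits with code 0."""
    return [name for name, code in exit_codes.items()
            if name != UNEDITED and code == 0]


# Made-up results for one model-written test.
toy_results = {UNEDITED: 1, "edit_a": 0, "edit_b": 1, "edit_c": 0}
print(is_distinguishing(toy_results), passing_candidates(toy_results))
```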
| Selection Method | Score % |
|---|---|
| Random Selection | 45.8 |
| Majority Voting (tests) | 53.0 |
| Model Selection (raw) | 52.0 |
| Top-3 Filter + Model | 55.6 |
| Selection State Machine | 57.4 |
| Oracle (coverage ceiling) | 69.8 |
CodeMonkeys resolves 57.4% of SWE-bench Verified issues. At the time of publication (January 2025), this placed it among the top systems on the leaderboard.
Total cost for all 500 issues: $2,291.90. That's about $4.58 per issue. Let's break it down.
| Stage | Cost (USD) | % of Total |
|---|---|---|
| Relevance Scan (Qwen, local) | 334 | 14.6% |
| Ranking (Claude) | 20 | 0.9% |
| Test Generation (Claude) | 440 | 19.2% |
| Edit Generation (Claude) | 1,366 | 59.6% |
| Selection (Claude) | 132 | 5.8% |
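The breakdown above can be checked with a few lines of arithmetic. The stage costs are the rounded figures from the table, so their sum differs from the exact total by a few cents.

```python
stage_costs_usd = {
    "Relevance Scan": 334,
    "Ranking": 20,
    "Test Generation": 440,
    "Edit Generation": 1366,
    "Selection": 132,
}

table_total = sum(stage_costs_usd.values())
shares = {stage: round(100 * cost / table_total, 1)
          for stage, cost in stage_costs_usd.items()}

EXACT_TOTAL_USD = 2291.90  # exact total quoted in the text
ISSUES = 500
per_issue = round(EXACT_TOTAL_USD / ISSUES, 2)

print(table_total, shares, per_issue)
```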
CodeMonkeys' selection method works on candidates from any source. The authors created an ensemble — the "Barrel of Monkeys" — by combining CodeMonkeys' edits with the top 4 SWE-bench submissions (Blackbox AI, CodeStory, Learn-by-interact, devlo). This 5-system ensemble has coverage of 80.8%. After selection: 66.2% — higher than any individual system.
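Ensemble coverage is the fraction of issues solved by at least one member system, which is why it can exceed every individual system's score. The sketch below uses made-up system names and issue IDs purely for illustration.

```python
from typing import Dict, Set


def ensemble_coverage(solved_by_system: Dict[str, Set[str]], all_issues: Set[str]) -> float:
    """Fraction of issues solved by at least one system in the ensemble."""
    solved_by_any = set().union(*solved_by_system.values())
    return len(solved_by_any & all_issues) / len(all_issues)


# Made-up example with three systems and five issues.
issues = {"i1", "i2", "i3", "i4", "i5"}
solved = {
    "system_a": {"i1", "i2"},
    "system_b": {"i2", "i3"},
    "system_c": {"i4"},
}

combined = ensemble_coverage(solved, issues)
best_single = max(len(s) / len(issues) for s in solved.values())
print(combined, best_single)
```

The gap between ensemble coverage and the post-selection score is the same selection problem as before, now applied to candidates from several systems.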
Comparing CodeMonkeys and Barrel of Monkeys to top submissions. Oracle selection shows the coverage ceiling.
The paper analyzes the tradeoff between serial and parallel scaling in detail. Let's dig into the data.
The authors sweep over all combinations of 1-10 parallel trajectories and 1-8 serial turns, measuring coverage and cost for each configuration. The key finding: after the first few iterations, configurations with similar total cost achieve similar coverage, regardless of how the budget is split.
For example, at ~$500 total cost:
The configurations converge to roughly the same coverage. But they don't converge to the same final score after selection, because more parallel samples make selection harder.
| Dimension | Serial (more turns) | Parallel (more trajectories) |
|---|---|---|
| What it provides | Refinement within an approach | Diversity across approaches |
| Diminishing returns | Model may approve early and terminate | Log-linear: each new sample has decreasing marginal coverage |
| Selection impact | Fewer candidates = easier selection | More candidates = harder selection |
| Failure mode | Stuck trajectory can't escape bad approach | Shallow attempts may not iterate enough to fix errors |
| Guarantee | No guarantee model uses extra turns | Each new trajectory guarantees a fresh start |
When using majority voting for selection (not the selection state machine), similar-cost configurations also yield similar final scores. This means the frontier property holds not just for coverage but also for the final output — at least under simple selection methods.
CodeMonkeys sits at the intersection of test-time compute scaling, coding agents, and software engineering automation. Let's map its lineage.
Large Language Monkeys is by the same authors. It established that coverage increases log-linearly with repeated sampling across math and coding tasks. CodeMonkeys is the direct follow-up: "How would you design a system if scaling test-time compute was a primary consideration?" Large Language Monkeys showed the potential; CodeMonkeys builds the actual system.
AlphaCode is DeepMind's system for competitive programming. It also uses massive parallel sampling (up to 1M samples) with a filtering/clustering pipeline for selection. CodeMonkeys operates at a much smaller scale (10 samples) but with richer per-sample compute (multi-turn iteration). AlphaCode showed that generating many candidates and selecting well can beat generating one perfect answer.
Agentless also decomposes SWE-bench into context + generation + selection. But Agentless uses a fixed pipeline (localization → repair → selection) without multi-turn iteration. CodeMonkeys adds the iteration loop and scales both serial and parallel compute more aggressively.
Meta-Harness takes the outer-loop optimization idea further: instead of designing the coding agent by hand, it automatically searches over the harness code that wraps the model. Meta-Harness could in principle discover a CodeMonkeys-like architecture through automated search.
CodeMonkeys uses model-generated tests as a form of process verification. Each test checks whether the edit resolves the specific issue. This is related to process reward models that evaluate intermediate reasoning steps, but CodeMonkeys' tests are executable: their signal comes from actually running code rather than from a learned estimate, although the tests are themselves model-written and can be wrong.
| Aspect | CodeMonkeys |
|---|---|
| Task | SWE-bench Verified (500 real GitHub issues) |
| Primary model | Claude 3.5 Sonnet |
| Context model | Qwen-2.5-Coder-32B (local) |
| Parallel trajectories | 10 per issue |
| Serial turns | Up to 8 per state machine |
| Context strategy | Scan every file, rank, 128K limit (92.6% recall) |
| Coverage | 69.8% |
| Final score | 57.4% |
| Total cost | ~$2,300 ($4.58/issue) |
| Selection method | Test voting top-3 + selection state machine |
| Barrel of Monkeys | 66.2% (5-system ensemble) |