A coding agent that modifies its own source code, validates changes on benchmarks, and grows an archive of ever-better versions of itself. 20% to 50% on SWE-bench without any human intervention.
You have a coding agent. It uses an LLM to read code, propose edits, and fix bugs. It solves about 20% of the tasks on SWE-bench. You want it to be better. What do you do?
The standard approach: a human engineer studies the agent's failures, figures out what went wrong, and manually improves the code. Maybe they add a better file-editing tool. Maybe they restructure the prompt. Maybe they add a retry mechanism. Each improvement takes hours of human thought. After a year of this, the best open-source agents solve about 50% of SWE-bench tasks.
Here is the fundamental bottleneck: every improvement to an AI agent currently requires a human to design it. The agent itself has no ability to improve its own architecture, tools, or workflow. It is a fixed system, frozen in whatever configuration its creators left it in.
This is not just a convenience problem. It is a scaling problem. Human designers have limited time, limited intuition, and limited ability to explore the vast space of possible agent designs. There are millions of ways to structure prompts, tools, workflows, and retry strategies. A human explores a few dozen. An automated system could explore thousands.
Here is the idea that makes the Darwin Godel Machine work: a coding agent's ability to solve coding tasks IS its ability to improve itself, because self-improvement is itself a coding task.
Think about what self-improvement means for a coding agent. The agent's behavior is determined by its Python source code: the prompt templates, the tools it can use, the workflow that orchestrates them. Improving the agent means modifying that source code. But modifying source code is exactly what a coding agent does for a living.
This is a profound observation. In most AI systems, the ability to solve downstream tasks is separate from the ability to improve the system itself. A chess engine gets better at chess but not at improving chess engines. A language model generates better text but does not modify its own architecture. But for a coding agent, the two are the same skill.
The DGM makes this loop concrete. It starts with a simple coding agent (two tools: bash and file editor). It asks that agent to solve coding benchmarks. It then asks the agent to modify its own code to do better. The modified agent is evaluated. If it improves, the improvement is kept. The cycle repeats.
In 2007, Jurgen Schmidhuber proposed the Godel Machine: a theoretical AI that modifies itself in a provably beneficial way. Before making any change to its own code, the system would construct a mathematical proof that the change would improve future performance. Only proven improvements would be applied.
This is a beautiful idea. It is also completely impractical.
Consider what a formal proof would require. The system would need to model its own future behavior precisely enough to guarantee that a code change leads to better outcomes. This means modeling the LLM's stochastic outputs, the diversity of tasks it will encounter, and the complex interactions between its tools, prompts, and workflow.
For any non-trivial system, this is computationally intractable. By Rice's theorem, every non-trivial property of a program's behavior is undecidable in general, so even simple guarantees cannot be checked automatically for arbitrary code. Proving that "adding a retry mechanism to a coding agent will improve its average SWE-bench score" is far beyond what formal methods can handle today.
The Darwin Godel Machine takes a pragmatic approach. Instead of proving that a change is beneficial, it tests it. Run the modified agent on a benchmark. Did the score go up? Keep it. Did it go down? Still keep it in the archive (it might be a useful stepping stone later). Did the agent lose the ability to edit code entirely? Discard it.
| Property | Godel Machine (2007) | Darwin Godel Machine (2026) |
|---|---|---|
| Validation | Formal proof of improvement | Empirical benchmark testing |
| Search strategy | Proof search | Evolutionary (archive + mutation) |
| Modification scope | Entire system | Agent codebase (foundation model stays frozen) |
| Guarantee | Provably optimal | No guarantee (but empirically effective) |
| Practical result | 0 modifications applied | 80 iterations, 20% → 50% on SWE-bench |
The DGM is surprisingly simple. Three components: an archive of agent variants, a mutation operator (the agent modifying its own code), and an evaluation function (benchmark testing). Here is the full algorithm:
```pseudocode
function DGM(base_agent, benchmark, T):
    score = evaluate(base_agent, benchmark)
    archive = {(base_agent, score)}
    for t = 1 to T:
        parents = select_parents(archive)      # fitness-proportionate + novelty
        for p in parents:
            child = p.modify(p)                # agent modifies its OWN code
            score = evaluate(child, benchmark)
            if child.can_edit_code():          # must retain basic ability
                archive.add((child, score))
    return archive
```
That is the entire algorithm. Notice what is not here: no gradient descent, no reward shaping, no prompt optimization framework. The search is driven by the agent's own ability to read failure logs, diagnose problems, and write code fixes.
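The loop above can be exercised end to end with a toy stand-in. The following is a minimal sketch, not the authors' implementation: `ToyAgent`, its `skill` number, the 30% breakage rate, and uniform parent choice are all assumptions made so the example runs without an LLM. Unlike the pseudocode, it checks `can_edit_code()` before scoring, to avoid spending evaluation on broken children.

```python
import random
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class ToyAgent:
    """Hypothetical stand-in for an agent codebase; `skill` replaces real code."""
    skill: float
    broken: bool = False

    def modify(self, rng: random.Random) -> "ToyAgent":
        # Stand-in for self-modification: nudge the skill, sometimes break the child.
        new_skill = min(1.0, max(0.0, self.skill + rng.gauss(0.02, 0.05)))
        return ToyAgent(skill=new_skill, broken=rng.random() < 0.3)

    def can_edit_code(self) -> bool:
        return not self.broken


def evaluate(agent: ToyAgent, n_tasks: int, rng: random.Random) -> float:
    """Fraction of simulated benchmark tasks solved."""
    solved = sum(rng.random() < agent.skill for _ in range(n_tasks))
    return solved / n_tasks


def dgm(base: ToyAgent, iterations: int, n_tasks: int = 60,
        children_per_iter: int = 2, seed: int = 0) -> List[Tuple[ToyAgent, float]]:
    rng = random.Random(seed)
    archive = [(base, evaluate(base, n_tasks, rng))]
    for _ in range(iterations):
        # Uniform parent choice keeps the sketch short (an assumption, not the DGM rule).
        parents = [rng.choice(archive)[0] for _ in range(children_per_iter)]
        for parent in parents:
            child = parent.modify(rng)
            if not child.can_edit_code():
                continue  # discard children that lost the basic ability
            # Every functioning child is archived, whatever its score.
            archive.append((child, evaluate(child, n_tasks, rng)))
    return archive


if __name__ == "__main__":
    archive = dgm(ToyAgent(skill=0.2), iterations=20)
    best_agent, best_score = max(archive, key=lambda entry: entry[1])
    print(f"archive size: {len(archive)}, best score: {best_score:.2f}")
```

Even in this toy form, the structural point is visible: nothing is ever removed from the archive, and the only hard filter is whether a child can still edit code.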
The key line is `p.modify(p)`: the parent agent modifies itself. This is self-referential. The agent that produces the next generation is not a fixed meta-agent; it is the improved version from the previous generation. Better agents produce better mutations.

The DGM starts with a deliberately simple coding agent. It has exactly two tools:

- a bash tool for running shell commands
- a file editor for viewing and changing files
That is it. No retrieval-augmented generation. No multi-attempt strategies. No context management. No test validation. These features are discovered automatically by the DGM during its evolutionary run. Starting simple is deliberate: it gives the system room to discover improvements that a human might not think of.
The agent's codebase is a Python project:
- `coding_agent.py`: Main entry point. The `forward()` function orchestrates the agent's workflow.
- `tools/`: Directory of tool implementations. Each tool has `tool_info()` (schema) and `tool_function()` (execution).
- `prompts/`: Prompt templates for different stages of the workflow.
- `utils/`: Shared utility functions.

During self-modification, the agent can add new files, modify existing ones, change the workflow logic, create entirely new tools, or restructure the prompt strategy. The search space is bounded only by what Python can express, and because Python is Turing-complete, that covers any computable process.
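To make the `tool_info()` / `tool_function()` convention concrete, here is a rough sketch of what a bash tool module could look like. Only the two function names come from the description above. The schema field names, the `timeout_s` parameter, and the output formatting are assumptions for illustration and may differ from the real codebase.

```python
import subprocess
from typing import Any, Dict


def tool_info() -> Dict[str, Any]:
    """Schema that describes the tool to the LLM (field names are assumed)."""
    return {
        "name": "bash",
        "description": "Run a shell command and return its output.",
        "input_schema": {
            "type": "object",
            "properties": {"command": {"type": "string"}},
            "required": ["command"],
        },
    }


def tool_function(command: str, timeout_s: float = 30.0) -> str:
    """Execute the command and return stdout, with stderr appended if present."""
    try:
        result = subprocess.run(
            command,
            shell=True,           # the agent passes a full shell command line
            capture_output=True,
            text=True,
            timeout=timeout_s,    # hypothetical guard against hanging commands
        )
    except subprocess.TimeoutExpired:
        return f"Error: command timed out after {timeout_s} seconds"
    output = result.stdout
    if result.stderr:
        output += "\n[stderr]\n" + result.stderr
    return output.strip()
```

Because each tool is a small module with this shape, a new tool is just a new file, which is one reason self-modification can add capabilities without touching the rest of the agent.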
[Interactive figure: the DGM archive as a tree. Each node is an agent variant, colored by benchmark score; edges show parent-child relationships.]
If the DGM simply kept the best agent and always mutated from it, it would be doing hill-climbing. Hill-climbing gets stuck in local optima. The agent might discover a line of improvement (say, better prompts) that prevents it from exploring a fundamentally different approach (say, multi-attempt strategies).
The DGM solves this with an archive: it keeps every agent that passes the basic code-editing test, regardless of its benchmark score. Low-scoring agents remain in the archive because they might contain stepping stones — partial innovations that a future mutation can build upon.
Not all agents are equally likely to be chosen as parents. The DGM uses a selection probability that balances two factors:
$$p_i = \frac{w_i}{\sum_j w_j}, \qquad w_i = s_i \, h_i$$

Where:

- Performance: $s_i = 1 / (1 + \exp(-\lambda(\alpha_i - \alpha_0)))$, a sigmoid of agent $i$'s benchmark score $\alpha_i$. High-performing agents get higher weight, but the sigmoid prevents any single agent from dominating.
- Novelty: $h_i = 1 / (1 + n_i)$, where $n_i$ is the number of functioning children. Agents with fewer children get explored more.

This is inspired by quality-diversity algorithms like MAP-Elites. The archive does not just track the best solution; it maintains a diverse population of interesting solutions. Every agent retains a non-zero selection probability, so any path to improvement remains reachable given enough compute.
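The two factors can be turned into sampling probabilities in a few lines. This sketch assumes the factors are multiplied and then normalized over the archive; that combination rule is an assumption here, as are the default values of `lam` and `alpha0`, which are placeholders rather than values stated in this article.

```python
import math
import random
from typing import List, Sequence


def selection_probabilities(scores: Sequence[float], child_counts: Sequence[int],
                            lam: float = 10.0, alpha0: float = 0.5) -> List[float]:
    """Probability of picking each archived agent as a parent."""
    weights = []
    for alpha, n_children in zip(scores, child_counts):
        performance = 1.0 / (1.0 + math.exp(-lam * (alpha - alpha0)))  # sigmoid of score
        novelty = 1.0 / (1.0 + n_children)                             # fewer children, more weight
        weights.append(performance * novelty)                          # assumed combination rule
    total = sum(weights)
    return [w / total for w in weights]


def sample_parents(scores: Sequence[float], child_counts: Sequence[int],
                   k: int, rng: random.Random) -> List[int]:
    """Draw k parent indices (with replacement) from the archive."""
    probs = selection_probabilities(scores, child_counts)
    return rng.choices(range(len(scores)), weights=probs, k=k)


if __name__ == "__main__":
    # Three agents: a weak one, and two equally strong ones with 0 and 3 children.
    probs = selection_probabilities([0.2, 0.5, 0.5], [0, 0, 3])
    print([round(p, 3) for p in probs])
```

In the example, the two agents with the same score end up with different probabilities: the one that has already produced three children is sampled less often, while the weak agent still keeps a small non-zero chance.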
Consider a concrete example from the paper. The DGM discovered "more granular file editing via string replacement" at node 24 in the SWE-bench archive tree. This was a relatively low-scoring agent at the time, but it introduced a fundamental capability: replacing a specific string in a file instead of rewriting the whole file. All subsequent high-performing agents built on it. If the archive had been pruned to only the current best, this stepping stone would have been lost.
| Strategy | SWE-bench | Polyglot |
|---|---|---|
| DGM (full archive + diversity) | 50.0% | 38.0% |
| DGM Greedy (always best parent) | 39.7% | 30.0% |
| DGM w/o open-ended (latest only) | 23.0% | 14.0% |
| DGM w/o self-improve (fixed meta-agent) | 39.0% | 28.0% |
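The cost of removing each component is easier to see as a gap to the full system. A small sketch that recomputes those gaps from the table values (the dictionary layout and function name are illustrative choices):

```python
from typing import Dict

# Scores copied from the ablation table above, in percent.
RESULTS: Dict[str, Dict[str, float]] = {
    "DGM (full archive + diversity)": {"SWE-bench": 50.0, "Polyglot": 38.0},
    "DGM Greedy (always best parent)": {"SWE-bench": 39.7, "Polyglot": 30.0},
    "DGM w/o open-ended (latest only)": {"SWE-bench": 23.0, "Polyglot": 14.0},
    "DGM w/o self-improve (fixed meta-agent)": {"SWE-bench": 39.0, "Polyglot": 28.0},
}
FULL = "DGM (full archive + diversity)"


def gaps_to_full(results: Dict[str, Dict[str, float]], benchmark: str) -> Dict[str, float]:
    """Percentage-point gap between the full DGM and each ablation."""
    full_score = results[FULL][benchmark]
    return {
        name: round(full_score - row[benchmark], 1)
        for name, row in results.items()
        if name != FULL
    }


if __name__ == "__main__":
    for benchmark in ("SWE-bench", "Polyglot"):
        print(benchmark, gaps_to_full(RESULTS, benchmark))
```

On both benchmarks the largest gap comes from dropping the archive (building only from the latest agent), which is the ablation that removes stepping stones entirely.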
In biological evolution, mutations are random. In the DGM, mutations are intelligent: an LLM reads the agent's failure logs, diagnoses what went wrong, and proposes a targeted code modification. This is not random search — it is informed, directed self-improvement.
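Each mutation involves two phases, each driven by a foundation model:

1. Diagnosis: a reasoning model reads the agent's evaluation logs and current code, then proposes one improvement.
2. Implementation: the coding agent itself implements that proposal by editing its own codebase.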
Notice the elegance: the implementation phase uses the agent's own capabilities. A better coding agent produces better self-modifications, which produce an even better coding agent. The mutation quality improves as the system evolves.
The diagnosis prompt is carefully structured. It provides the reasoning model with context such as the agent's failure logs from benchmark runs and the agent's current source code (`coding_agent.py`, `tools/`, `utils/`).

The output is a structured JSON object with five fields: `log_summarization`, `potential_improvements`, `improvement_proposal`, `implementation_suggestion`, and `problem_description`. The last field is formatted as a GitHub issue that the coding agent then solves.
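Since the diagnosis output feeds directly into the implementation phase, it is worth validating before use. The sketch below assumes the reasoning model returns a single JSON object with the five field names listed above; the helper names `parse_diagnosis` and `to_issue` and the example content are hypothetical.

```python
import json
from typing import Any, Dict

REQUIRED_FIELDS = (
    "log_summarization",
    "potential_improvements",
    "improvement_proposal",
    "implementation_suggestion",
    "problem_description",
)


def parse_diagnosis(raw: str) -> Dict[str, Any]:
    """Parse the model output and check that all five fields are present."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("diagnosis must be a JSON object")
    missing = [name for name in REQUIRED_FIELDS if name not in data]
    if missing:
        raise ValueError(f"diagnosis is missing fields: {missing}")
    return data


def to_issue(diagnosis: Dict[str, Any]) -> str:
    """Return the text handed to the coding agent as its task."""
    return str(diagnosis["problem_description"]).strip()


if __name__ == "__main__":
    # Made-up example content, for illustration only.
    raw = json.dumps({
        "log_summarization": "Edits overwrote whole files and dropped unrelated code.",
        "potential_improvements": ["Add a string-replacement edit command."],
        "improvement_proposal": "Support targeted string replacement in the editor tool.",
        "implementation_suggestion": "Extend tools/ with a replace operation.",
        "problem_description": "Editor tool: add targeted string replacement.",
    })
    print(to_issue(parse_diagnosis(raw)))
```

Rejecting malformed output early means a failed diagnosis costs one model call, not a full implementation attempt on a vague task.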
Every newly generated agent variant must be evaluated. Evaluation is expensive — running a coding agent on SWE-bench costs about $350 per 60 tasks with Claude 3.5 Sonnet. The DGM uses a staged evaluation strategy to manage this cost.
This staged approach is crucial for cost management. Only about 51.3% of generated agents even pass Stage 1 (retaining basic code-editing functionality). The rest lose their core capability during self-modification — they break themselves.
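A staged evaluator can be sketched as a cheap filter followed by a larger run. The details below are assumptions for illustration: the smoke-test size of 10, the rule that a functioning agent must solve at least one smoke task, and the `EvalResult` type are all hypothetical. The 60-task sample and the $350 per 60 tasks cost figure come from the text above.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence

COST_PER_TASK_USD = 350 / 60  # derived from "$350 per 60 tasks"


@dataclass
class EvalResult:
    passed_stage1: bool
    score: Optional[float]  # None when the agent was filtered out
    tasks_run: int

    @property
    def estimated_cost_usd(self) -> float:
        return self.tasks_run * COST_PER_TASK_USD


def staged_evaluate(solve: Callable[[str], bool], tasks: Sequence[str],
                    smoke_size: int = 10, full_size: int = 60) -> EvalResult:
    """Stage 1: small smoke test. Stage 2: the rest of the sample, only if stage 1 passes."""
    smoke_results = [bool(solve(task)) for task in tasks[:smoke_size]]
    if not any(smoke_results):
        # Assumed criterion: an agent that solves nothing is treated as broken.
        return EvalResult(False, None, len(smoke_results))
    rest_results = [bool(solve(task)) for task in tasks[smoke_size:full_size]]
    results = smoke_results + rest_results
    return EvalResult(True, sum(results) / len(results), len(results))


if __name__ == "__main__":
    tasks = [f"task-{i}" for i in range(60)]
    broken = staged_evaluate(lambda task: False, tasks)
    print(broken.passed_stage1, round(broken.estimated_cost_usd, 2))
```

Under these assumptions, a broken agent costs about a sixth of a full evaluation, which matters when roughly half of all generated agents fail the first stage.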
Consider what the DGM needs to validate. "Does adding a retry mechanism improve SWE-bench performance?" A formal proof would need to model the LLM's stochastic behavior, the distribution of tasks, and the complex interaction between the retry logic and the rest of the workflow. An empirical test simply runs the agent on 60 tasks and checks the score.
The downside of empirical validation is noise. LLMs are stochastic, so the same agent can score differently on different runs. The DGM handles this through staged evaluation: small initial samples to filter, larger samples for promising agents. The stability analysis shows a standard deviation of 2.3% across three independent DGM runs on Polyglot, confirming that the results are reproducible.
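To get a feel for how much noise a small sample carries, one can use the binomial standard error. This is a simplification that assumes tasks are independent and equally hard, which real benchmarks are not, so treat the numbers as a rough guide and not as the paper's analysis.

```python
import math


def standard_error(success_rate: float, n_tasks: int) -> float:
    """Binomial standard error of a measured success rate."""
    return math.sqrt(success_rate * (1.0 - success_rate) / n_tasks)


if __name__ == "__main__":
    for n_tasks in (60, 200):
        se = standard_error(0.5, n_tasks)
        print(f"{n_tasks} tasks: +/- {100 * se:.1f} points (one standard error)")
```

At a 50% success rate, a 60-task sample has a standard error of about 6.5 points, while 200 tasks brings it down to about 3.5 points. Small samples are good enough to filter out clearly bad agents, but ranking close competitors needs the larger evaluation.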
After 80 iterations, the DGM transforms a simple two-tool coding agent into a sophisticated system that rivals hand-crafted solutions built by expert teams over months.
The base agent: 20.0% success rate. Two tools, one prompt, no retry logic.
After 80 iterations of autonomous self-improvement: 50.0% success rate. The agent now has granular file viewing, string-replacement editing, multi-attempt strategies with ranking, history-aware patch generation, and automatic context summarization.
For comparison, the best checked open-source entry (OpenHands + CodeAct v2.1) achieves 51.0% on the same 200-task subset. The DGM comes within one point of a human-engineered solution.
[Interactive figure: DGM compared against its ablations on SWE-bench and Polyglot.]
The base agent starts at 14.2% on the full Polyglot benchmark. After 80 DGM iterations: 30.7%. For comparison, Aider — a popular open-source coding agent actively developed by human engineers for over a year — achieves 24.5% with the same model. The DGM surpasses a year of human engineering.
The most striking finding is that DGM improvements generalize:
The most philosophically interesting question about the DGM: does improvement ever stop? Or can it keep going indefinitely?
Look at the SWE-bench performance curve. It does not plateau within 80 iterations. Each ablation (no self-improvement, no archive) levels off early, while the full system shows no clear sign of saturation at iteration 80.
But we should be careful. The DGM is currently constrained by the frozen foundation model. The agent can improve its prompts, tools, and workflow, but it cannot retrain the underlying LLM. This sets a ceiling — at some point, all the low-hanging architectural improvements will be exhausted, and further progress will require better reasoning capabilities in the FM itself.
Open-endedness in the DGM draws from the broader open-endedness research program. A system is open-ended if it continuously generates novel and learnable artifacts. In the DGM's case:
The key insight from open-endedness research is that optimizing a single objective leads to convergence and stagnation. But maintaining diversity — even diversity that appears suboptimal in the short term — enables long-term innovation. This is why the archive matters so much.
A self-improving system raises obvious safety questions. The DGM takes several precautions:
The authors found no evidence of harmful or misaligned behavior in any generated agent. Self-modifications focused exclusively on improving coding capabilities. But they explicitly note that as FMs become more capable, the risks of self-improving systems will increase and require more robust safeguards.
The Darwin Godel Machine sits at the intersection of several research threads that have been converging over the past few years.
| System | Self-Referential? | Archive? | Empirical Validation? | Domain |
|---|---|---|---|---|
| Darwin Godel Machine | Yes (p.modify(p)) | Yes (full) | Yes (coding benchmarks) | Coding agents |
| ADAS (Hu et al.) | No (fixed meta-agent) | Yes | Yes | Agent design |
| Godel Agent (Yin et al.) | Yes | No | Partially | General agents |
| AlphaEvolve (Google) | No | Yes (program DB) | Yes | Algorithm discovery |
| Meta-Harness (Lee et al.) | No | Yes (filesystem) | Yes | Harness optimization |
| The AI Scientist (Lu et al.) | No | No | Yes (paper reviews) | Research papers |
| Self-Improving Agent (Robeyns) | Yes | No (latest only) | Yes | Coding agents |
DGM vs. ADAS: ADAS uses a fixed meta-agent to generate downstream agents. The meta-agent never improves. DGM is self-referential: the agent that proposes improvements is itself the improved version. The "DGM w/o self-improve" baseline approximates the ADAS setup, and it scores 39.0% against 50.0% for the full DGM, 11 points lower on SWE-bench.
DGM vs. Robeyns et al.: The concurrent self-improving agent work by Robeyns et al. is very similar but lacks the archive. It always builds from the latest version, which corresponds to the "DGM w/o open-ended exploration" baseline — it performs 27 points worse on SWE-bench.
DGM vs. AlphaEvolve: AlphaEvolve discovers algorithms (programs that solve specific mathematical problems). DGM discovers agents (programs that use LLMs to solve arbitrary coding problems). AlphaEvolve's search space is algorithmic code; DGM's search space is agent architecture.
DGM vs. Meta-Harness: Meta-Harness optimizes the code wrapping a fixed LLM for a specific task distribution. DGM optimizes a coding agent's entire codebase for general coding ability. Meta-Harness uses a fixed coding agent as proposer; DGM's proposer evolves.
The DGM represents a concrete step toward what Jeff Clune calls AI-Generating Algorithms (AI-GAs): AI systems that generate new AI systems. The vision is that instead of humans designing AI architectures by hand, we build systems that can design (and redesign) themselves.
The missing piece is training. The current DGM modifies agent code but keeps the foundation model frozen. The truly transformative version would rewrite its own training scripts to produce a better FM — closing the loop between architecture design and model training. The authors explicitly identify this as the most important direction for future work.