Darwin Godel Machine

Chapter 0: The Problem

You have a coding agent. It uses an LLM to read code, propose edits, and fix bugs. It solves about 20% of the tasks on SWE-bench. You want it to be better. What do you do?

The standard approach: a human engineer studies the agent's failures, figures out what went wrong, and manually improves the code. Maybe they add a better file-editing tool. Maybe they restructure the prompt. Maybe they add a retry mechanism. Each improvement takes hours of human thought. After a year of this, the best open-source agents solve about 50% of SWE-bench tasks.

Here is the fundamental bottleneck: every improvement to an AI agent currently requires a human to design it. The agent itself has no ability to improve its own architecture, tools, or workflow. It is a fixed system, frozen in whatever configuration its creators left it in.

The agent improvement paradox: Your coding agent is good enough to read code, understand logic, and propose edits to arbitrary repositories. But it cannot read, understand, or edit its own code. The same capabilities that make it useful for solving coding tasks could, in principle, be turned inward.

This is not just a convenience problem. It is a scaling problem. Human designers have limited time, limited intuition, and limited ability to explore the vast space of possible agent designs. There are millions of ways to structure prompts, tools, workflows, and retry strategies. A human explores a few dozen. An automated system could explore thousands.

Figure: Fixed Agent vs. Self-Improving Agent. A fixed agent stays at whatever level its human designer left it; a self-improving agent can change its own score from one iteration to the next.

Why is manual agent improvement a scaling bottleneck?

Because human designers have limited time and can only explore a tiny fraction of the vast design space of possible agent architectures
Because LLMs are too slow to run experiments
Because benchmarks are too noisy to measure improvement

Chapter 1: The Key Insight

Here is the idea that makes the Darwin Godel Machine work: a coding agent's ability to solve coding tasks IS its ability to improve itself, because self-improvement is itself a coding task.

Think about what self-improvement means for a coding agent. The agent's behavior is determined by its Python source code: the prompt templates, the tools it can use, the workflow that orchestrates them. Improving the agent means modifying that source code. But modifying source code is exactly what a coding agent does for a living.

The self-referential loop: If you get better at coding, you get better at modifying your own code. If you get better at modifying your own code, you get better at coding. This is not a vicious circle — it is a virtuous spiral. Each improvement compounds into the next.

This is a profound observation. In most AI systems, the ability to solve downstream tasks is separate from the ability to improve the system itself. A chess engine gets better at chess but not at improving chess engines. A language model generates better text but does not modify its own architecture. But for a coding agent, the two are the same skill.

The DGM makes this loop concrete. It starts with a simple coding agent (two tools: bash and file editor). It asks that agent to solve coding benchmarks. It then asks the agent to modify its own code to do better. The modified agent is evaluated. If it improves, the improvement is kept. The cycle repeats.

Agent solves coding tasks

The agent reads repositories, proposes edits, fixes bugs using its current tools and workflow

↓

Agent reads its own failures

Diagnostic FM analyzes benchmark logs, identifies where the agent struggled, proposes a feature

↓

Agent modifies its own code

The agent edits its own Python codebase — adding tools, changing prompts, restructuring workflow

↓

Modified agent is evaluated

New version is benchmarked. If it retains basic code-editing ability, it enters the archive

↻ repeat

Why "Darwin"? Because the process mirrors natural selection. Mutations (code edits) are generated. Fitness is measured (benchmark score). The fittest variants survive in the archive and serve as parents for future mutations. No single path is guaranteed — improvement emerges from population-level exploration.

Why is a coding agent uniquely positioned for self-improvement?

Because it has the largest context window
Because the skill needed to improve itself (modifying code) is the same skill it uses to solve downstream tasks, so improvement compounds
Because coding benchmarks are easy to game

Chapter 2: The Original Godel Machine

In 2007, Jurgen Schmidhuber proposed the Godel Machine: a theoretical AI that modifies itself in a provably beneficial way. Before making any change to its own code, the system would construct a mathematical proof that the change would improve future performance. Only proven improvements would be applied.

This is a beautiful idea. It is also completely impractical.

Why Formal Proofs Fail

Consider what a formal proof would require. The system would need to model its own future behavior precisely enough to guarantee that a code change leads to better outcomes. This means modeling the LLM's stochastic outputs, the diversity of tasks it will encounter, and the complex interactions between its tools, prompts, and workflow.

For any non-trivial system, this is computationally intractable. Even verifying simple properties of programs is undecidable in general (Rice's theorem). Proving that "adding a retry mechanism to a coding agent will improve its average SWE-bench score" is far beyond what formal methods can handle today.

The original Godel Machine: Provably optimal self-improvement. In practice: zero modifications ever pass the proof requirement. It is a theoretical ceiling that no real system can reach.

The DGM Relaxation

The Darwin Godel Machine takes a pragmatic approach. Instead of proving that a change is beneficial, it tests it. Run the modified agent on a benchmark. Did the score go up? Keep it. Did it go down? Still keep it in the archive (it might be a useful stepping stone later). Did the agent lose the ability to edit code entirely? Discard it.

Property	Godel Machine (2007)	Darwin Godel Machine (2026)
Validation	Formal proof of improvement	Empirical benchmark testing
Search strategy	Proof search	Evolutionary (archive + mutation)
Modification scope	Entire system	Agent codebase (frozen FM)
Guarantee	Provably optimal	No guarantee (but empirically effective)
Practical result	0 modifications applied	80 iterations, 20% → 50% on SWE-bench

Trading certainty for progress: The original Godel Machine is provably correct but never fires. The DGM is not provably correct but achieves 2.5x improvement on real coding benchmarks. This is the same tradeoff that makes empirical science work — you test hypotheses rather than proving them a priori.

Why is Schmidhuber's original Godel Machine impractical?

Because it requires too many GPUs
Because the proofs take too long to verify
Because formally proving that a code change improves a complex system is computationally intractable for any non-trivial modification

Chapter 3: DGM Architecture

The DGM is surprisingly simple. Three components: an archive of agent variants, a mutation operator (the agent modifying its own code), and an evaluation function (benchmark testing). Here is the full algorithm:

pseudocode
function DGM(base_agent, benchmark, T):
    score = evaluate(base_agent, benchmark)
    archive = {(base_agent, score)}

    for t = 1 to T:
        parents = select_parents(archive)    # fitness-proportionate + novelty
        for p in parents:
            child = p.modify(p)               # agent modifies its OWN code
            score = evaluate(child, benchmark)
            if child.can_edit_code():        # must retain basic ability
                archive.add((child, score))

    return archive

That is the entire algorithm. Notice what is not here: no gradient descent, no reward shaping, no prompt optimization framework. The search is driven by the agent's own ability to read failure logs, diagnose problems, and write code fixes.

The critical line is p.modify(p). The parent agent modifies itself. This is self-referential: the agent that produces the next generation is not a fixed meta-agent — it is the improved version from the previous generation. Better agents produce better mutations.
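The loop can be exercised end to end with a toy model. The sketch below is illustrative only: the `Agent` class, its single `skill` number, and the random "self-modification" are stand-ins invented here (a real run would call an LLM and a benchmark harness), and parent selection is uniform for brevity rather than the weighted scheme the DGM uses.

```python
import random
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Agent:
    """Toy stand-in for an agent codebase (hypothetical, not the paper's class)."""
    skill: float                      # pretend benchmark solve rate in [0, 1]
    parent: Optional["Agent"] = None
    functional: bool = True           # can it still edit code?

    def modify(self, rng: random.Random) -> "Agent":
        # The parent's own skill shifts the expected gain: self-reference in miniature.
        delta = rng.gauss(0.02 * (1 + self.skill), 0.05)
        broken = rng.random() < 0.5   # assumption: roughly half of edits break the agent
        new_skill = min(1.0, max(0.0, self.skill + delta))
        return Agent(skill=new_skill, parent=self, functional=not broken)


def evaluate(agent: Agent) -> float:
    # Placeholder for running the benchmark; broken agents solve nothing.
    return agent.skill if agent.functional else 0.0


def dgm(base: Agent, iterations: int, seed: int = 0) -> List[Tuple[Agent, float]]:
    rng = random.Random(seed)
    archive = [(base, evaluate(base))]
    for _ in range(iterations):
        parent, _ = rng.choice(archive)   # uniform here; the DGM weights this choice
        child = parent.modify(rng)
        score = evaluate(child)
        if child.functional:              # keep every working child, whatever its score
            archive.append((child, score))
    return archive


archive = dgm(Agent(skill=0.20), iterations=80)
best_score = max(score for _, score in archive)
```

Because every working child is archived, `best_score` can never fall below the base agent's score, while the archive itself also holds children that scored worse than their parents.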

The Base Agent

The DGM starts with a deliberately simple coding agent. It has exactly two tools:

Bash tool: Execute shell commands. Navigate files, run tests, inspect output.
Editor tool: View and edit entire files. No line-range viewing, no string replacement — just whole-file operations.

That is it. No retrieval-augmented generation. No multi-attempt strategies. No context management. No test validation. These features are discovered automatically by the DGM during its evolutionary run. Starting simple is deliberate: it gives the system room to discover improvements that a human might not think of.

What the Agent Actually Modifies

The agent's codebase is a Python project:

coding_agent.py — Main entry point. The forward() function orchestrates the agent's workflow.
tools/ — Directory of tool implementations. Each tool has tool_info() (schema) and tool_function() (execution).
prompts/ — Prompt templates for different stages of the workflow.
utils/ — Shared utility functions.

During self-modification, the agent can add new files, modify existing ones, change the workflow logic, create entirely new tools, or restructure the prompt strategy. The search space is effectively unconstrained: because Python is Turing-complete, the agent can in principle express any computable process.
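To make the tool layout concrete, here is a sketch of what one file under tools/ might look like, using the tool_info() / tool_function() pair described above. The tool itself (`view_lines`), the schema field names, and the error strings are assumptions made for illustration, not the DGM's actual code.

```python
from pathlib import Path


def tool_info() -> dict:
    """Schema the agent loop would hand to the model (field names are assumed)."""
    return {
        "name": "view_lines",
        "description": "Show a 1-indexed, inclusive range of lines from a file.",
        "input_schema": {
            "type": "object",
            "properties": {
                "path": {"type": "string"},
                "start": {"type": "integer"},
                "end": {"type": "integer"},
            },
            "required": ["path", "start", "end"],
        },
    }


def tool_function(path: str, start: int, end: int) -> str:
    """Return the requested lines, each prefixed with its line number."""
    if start < 1 or end < start:
        return f"Error: invalid range {start}-{end}"
    try:
        lines = Path(path).read_text().splitlines()
    except OSError as exc:
        return f"Error: {exc}"
    selected = lines[start - 1:end]
    return "\n".join(f"{start + i}: {line}" for i, line in enumerate(selected))
```

Returning errors as strings rather than raising keeps the failure visible to the model, which can then correct its next tool call.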

Figure: DGM evolution archive. Each node is an agent variant, colored by benchmark score; edges show parent-child relationships.

What makes the DGM's mutation operator different from standard meta-learning approaches like ADAS?

In DGM, the agent that proposes modifications is itself the improved version from prior iterations (p.modify(p)), not a fixed meta-agent. Better agents produce better mutations.
DGM uses a larger language model
DGM runs more evaluation iterations

Chapter 4: The Archive

If the DGM simply kept the best agent and always mutated from it, it would be doing hill-climbing. Hill-climbing gets stuck in local optima. The agent might discover a line of improvement (say, better prompts) that prevents it from exploring a fundamentally different approach (say, multi-attempt strategies).

The DGM solves this with an archive: it keeps every agent that passes the basic code-editing test, regardless of its benchmark score. Low-scoring agents remain in the archive because they might contain stepping stones — partial innovations that a future mutation can build upon.

Stepping stones: In the SWE-bench run, the lineage of the final best agent passes through two performance dips (iterations 4 and 56) where the agent temporarily got worse. A hill-climbing approach would have abandoned those paths. But the DGM's archive preserved them, and later mutations built on those "failed" experiments to create the best agent.
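Since the archive stores parent links, the lineage of any agent can be traced and its dips located. The following is a minimal sketch: the `Node` record and the scores are made up for illustration and do not come from the paper's archive.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Node:
    agent_id: int
    parent_id: Optional[int]
    score: float


def lineage(archive: Dict[int, Node], agent_id: int) -> List[Node]:
    """Walk parent links from the given agent back to the root; return root first."""
    path = []
    current: Optional[int] = agent_id
    while current is not None:
        node = archive[current]
        path.append(node)
        current = node.parent_id
    return path[::-1]


def dips(path: List[Node]) -> List[int]:
    """IDs of agents that scored below their own parent along the path."""
    return [child.agent_id for parent, child in zip(path, path[1:])
            if child.score < parent.score]


# Made-up scores, chosen only to show a dip that later pays off.
nodes = [Node(0, None, 0.20), Node(1, 0, 0.26), Node(2, 1, 0.22),
         Node(3, 2, 0.35), Node(4, 1, 0.27)]
archive = {n.agent_id: n for n in nodes}
best = max(archive.values(), key=lambda n: n.score)
path = lineage(archive, best.agent_id)
```

In this toy archive the best agent (3) descends from agent 2, which scored below its parent; a strategy that only ever extended the current best would have continued from agent 4 instead.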

Parent Selection

Not all agents are equally likely to be chosen as parents. The DGM uses a selection probability that balances two factors:

p_i ∝ s_i · h_i

Where:

s_i is a sigmoid-scaled performance score: 1 / (1 + exp(-λ(α_i - α₀))). High-performing agents get higher weight, but the sigmoid prevents any single agent from dominating.
h_i is a novelty bonus: 1 / (1 + n_i), where n_i is the number of functioning children. Agents with fewer children get explored more.

This is inspired by quality-diversity algorithms like MAP-Elites. The archive does not just track the best solution — it maintains a diverse population of interesting solutions. Every agent retains a non-zero selection probability, so any path to improvement remains reachable given enough compute.
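The selection rule can be written out directly. In this sketch the values λ = 10 and α₀ = 0.5 are assumed for illustration (the text above does not fix them), and the scores and child counts are invented.

```python
import math
import random
from typing import List, Sequence


def selection_probs(scores: Sequence[float], children: Sequence[int],
                    lam: float = 10.0, alpha0: float = 0.5) -> List[float]:
    """p_i proportional to s_i * h_i, normalized to sum to 1."""
    weights = []
    for alpha, n in zip(scores, children):
        s = 1.0 / (1.0 + math.exp(-lam * (alpha - alpha0)))  # sigmoid-scaled score
        h = 1.0 / (1.0 + n)                                   # novelty bonus
        weights.append(s * h)
    total = sum(weights)
    return [w / total for w in weights]


scores = [0.20, 0.35, 0.50]     # benchmark accuracy of three archived agents
children = [3, 0, 4]            # functioning children each agent already has
probs = selection_probs(scores, children)

# Sample one parent according to those probabilities.
parent_index = random.Random(0).choices(range(len(scores)), weights=probs, k=1)[0]
```

With these inputs the middle agent is the most likely parent: it scores lower than the third agent, but it has no children yet, so its novelty bonus outweighs the score gap. The weakest agent still keeps a small non-zero probability.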

Ablation result: DGM with greedy selection (always pick the best agent) achieves 39.7% on SWE-bench. DGM with the full archive and diversity-aware selection achieves 50.0%. The archive contributes over 10 percentage points of improvement.

Why Not Just Keep the Best?

Consider a concrete example from the paper. The DGM discovered "more granular file editing via string replacement" at node 24 in the SWE-bench archive tree. This was a relatively low-scoring agent at the time. But it introduced a fundamental capability (targeted string-replacement edits instead of whole-file replacement) that all subsequent high-performing agents built upon. If the archive had been pruned to only the current best, this stepping stone would have been lost.

Strategy	SWE-bench	Polyglot
DGM (full archive + diversity)	50.0%	38.0%
DGM Greedy (always best parent)	39.7%	30.0%
DGM w/o open-ended (latest only)	23.0%	14.0%
DGM w/o self-improve (fixed meta-agent)	39.0%	28.0%
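The gaps quoted in the text follow from the table by subtraction. A short check, with the scores copied from the rows above (the variable names are ours):

```python
# Scores in percent, copied from the ablation table.
results = {
    "DGM (full archive + diversity)": {"SWE-bench": 50.0, "Polyglot": 38.0},
    "DGM Greedy (always best parent)": {"SWE-bench": 39.7, "Polyglot": 30.0},
    "DGM w/o open-ended (latest only)": {"SWE-bench": 23.0, "Polyglot": 14.0},
    "DGM w/o self-improve (fixed meta-agent)": {"SWE-bench": 39.0, "Polyglot": 28.0},
}

full = results["DGM (full archive + diversity)"]

# Percentage-point gap between the full DGM and each ablation, per benchmark.
gaps = {
    name: {bench: round(full[bench] - score, 1) for bench, score in row.items()}
    for name, row in results.items()
    if row is not full
}

for name, row in gaps.items():
    print(f"{name}: " + ", ".join(f"{b} -{g:.1f} pts" for b, g in row.items()))
```

On both benchmarks the largest gap comes from removing open-ended exploration (building only from the latest agent).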

Why does the DGM keep low-scoring agents in the archive instead of pruning them?

To save memory for future analysis
Because evaluation is noisy and scores fluctuate
Because low-scoring agents may contain partial innovations (stepping stones) that future mutations can build upon to reach higher performance

Chapter 5: Mutation Operators

In biological evolution, mutations are random. In the DGM, mutations are intelligent: an LLM reads the agent's failure logs, diagnoses what went wrong, and proposes a targeted code modification. This is not random search — it is informed, directed self-improvement.

The Two-Phase Mutation Process

Each mutation involves two foundation model calls:

Phase 1: Diagnosis (o1)

A reasoning model reads: (1) the agent's full source code, (2) the benchmark failure logs, (3) the task description, (4) the agent's predicted patch, and (5) the test results. It outputs a structured JSON with a log analysis, potential improvements, one concrete proposal, an implementation suggestion, and a GitHub-issue-style problem description.

↓

Phase 2: Implementation (Claude 3.5 Sonnet)

The coding agent receives the problem description and implements it by editing its own codebase. It uses the same bash + editor tools it uses for solving coding tasks.

Notice the elegance: the implementation phase uses the agent's own capabilities. A better coding agent produces better self-modifications, which produce an even better coding agent. The mutation quality improves as the system evolves.

Real mutations discovered by the DGM on SWE-bench:
• Non-empty patch validation and retry (if the first attempt produces an empty patch, try again)
• Granular file viewing via line ranges (instead of viewing entire files)
• String replacement editing (instead of overwriting entire files)
• Multiple patch generations with ranking (generate several solutions, pick the best)
• History-aware patch generation (learn from previous failed attempts)
• Auto-summarize on context limit (compress conversation when context window fills up)
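Two of the mutations listed above are small enough to sketch. The functions below are our own illustrations of the ideas (string-replacement editing and non-empty-patch retry), not the code the DGM actually wrote; the names and the uniqueness rule are assumptions.

```python
from typing import Callable


def str_replace(text: str, old: str, new: str) -> str:
    """Replace exactly one occurrence of `old`; refuse missing or ambiguous matches."""
    count = text.count(old)
    if count == 0:
        raise ValueError("old string not found")
    if count > 1:
        raise ValueError(f"old string matches {count} times; add more context")
    return text.replace(old, new, 1)


def patch_with_retry(generate: Callable[[], str], max_attempts: int = 3) -> str:
    """Call generate() until it returns a non-empty patch or attempts run out."""
    for _ in range(max_attempts):
        patch = generate()
        if patch.strip():
            return patch
    return ""


source = "def area(r):\n    return 3.14 * r * r\n"
fixed = str_replace(source, "3.14", "math.pi")

attempts = iter(["", "   ", "--- a/f.py\n+++ b/f.py\n"])
patch = patch_with_retry(lambda: next(attempts))
```

Requiring a unique match means a vague edit fails loudly instead of silently changing the wrong place, which is one plausible reason a targeted edit tool helps compared with overwriting whole files.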

The Self-Modification Prompt

The diagnosis prompt is carefully structured. It provides the reasoning model with:

A summary of the codebase structure (coding_agent.py, tools/, utils/)
The full implementation of the current agent
A failed task log showing where the agent struggled
The ground-truth test patch (what the correct solution looked like)
Explicit instructions to propose ONE improvement that is general (not task-specific)

The output is a structured JSON with five fields: log_summarization, potential_improvements, improvement_proposal, implementation_suggestion, and problem_description. The last field is formatted as a GitHub issue that the coding agent then solves.
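A harness consuming this output would likely validate it before handing the issue to the coding agent. A minimal sketch, assuming the five field names above are top-level keys of a JSON object; the example payload is invented.

```python
import json

REQUIRED_FIELDS = (
    "log_summarization",
    "potential_improvements",
    "improvement_proposal",
    "implementation_suggestion",
    "problem_description",
)


def parse_diagnosis(raw: str) -> dict:
    """Parse the diagnosis JSON and check that all five fields are present."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("diagnosis must be a JSON object")
    missing = [field for field in REQUIRED_FIELDS if field not in data]
    if missing:
        raise ValueError(f"missing fields: {', '.join(missing)}")
    return data


# Invented example payload, for illustration only.
raw = json.dumps({
    "log_summarization": "Agent overwrote a large file and dropped unrelated code.",
    "potential_improvements": ["string-replacement editing", "line-range viewing"],
    "improvement_proposal": "Add a string-replacement edit command.",
    "implementation_suggestion": "Extend the editor tool with a replace command.",
    "problem_description": "Editor tool: support replacing a unique string in a file.",
})
issue_text = parse_diagnosis(raw)["problem_description"]
```

Only `problem_description` is passed on to the implementation phase; the other fields document how the proposal was reached.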

Self-improvement as a coding task: The DGM frames each self-modification as a GitHub issue to solve on its own repository. The agent receives "implement string-replacement editing in tools/edit.py" and treats it exactly like any other coding task. This means the agent's improving coding ability directly translates into better self-modifications.

Why do DGM mutations improve in quality over time?

Because the implementation phase uses the current (improved) agent, and a better coding agent writes better self-modifications
Because the diagnosis model is fine-tuned on each iteration
Because the benchmark gets easier over time

Chapter 6: Empirical Validation

Every newly generated agent variant must be evaluated. Evaluation is expensive — running a coding agent on SWE-bench costs about $350 per 60 tasks with Claude 3.5 Sonnet. The DGM uses a staged evaluation strategy to manage this cost.

The Three Stages

Stage 1: Functionality Check (10 tasks)

Can the agent still edit code at all? This is the minimum bar. Agents that fail to produce any valid edits are discarded entirely — they cannot self-modify or solve tasks.

↓ pass

Stage 2: General Assessment (50-60 tasks)

Estimate the agent's coding ability on a representative sample. For SWE-bench, 60 tasks from SWE-bench-verified-mini. For Polyglot, 50 random tasks across languages.

↓ score ≥ 40% and top-2 in archive

Stage 3: Full Evaluation (200 tasks)

Only for the most promising agents. 200 tasks on SWE-bench for accurate performance estimation. This is where the final numbers come from.

This staged approach is crucial for cost management. Only about 51.3% of generated agents even pass Stage 1 (retaining basic code-editing functionality). The rest lose their core capability during self-modification — they break themselves.
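The gating logic of the three stages can be sketched as one function. The callables `is_functional` and `solve_rate` are hypothetical hooks standing in for real benchmark runs, and the reading of "top-2 in archive" (the new score must at least tie the second-best archived score) is our interpretation.

```python
from typing import Callable, Optional, Sequence


def staged_eval(is_functional: Callable[[], bool],
                solve_rate: Callable[[int], float],
                archive_scores: Sequence[float],
                medium: int = 60, large: int = 200,
                threshold: float = 0.40, top_k: int = 2) -> Optional[float]:
    """Return the agent's score, or None if it fails the functionality check."""
    if not is_functional():                 # Stage 1: small functionality check
        return None
    score = solve_rate(medium)              # Stage 2: general assessment
    best = sorted(archive_scores, reverse=True)[:top_k]
    in_top_k = len(best) < top_k or score >= best[-1]
    if score >= threshold and in_top_k:     # Stage 3: only for promising agents
        score = solve_rate(large)
    return score


sizes_run = []


def fake_rate(n_tasks: int) -> float:
    # Stand-in for a benchmark run; records how many tasks were requested.
    sizes_run.append(n_tasks)
    return 0.45


result = staged_eval(lambda: True, fake_rate, archive_scores=[0.20, 0.50, 0.42])
```

Here the candidate clears both gates (0.45 is at least 40% and beats the second-best archived score of 0.42), so the 200-task evaluation runs after the 60-task one. A broken agent costs only the Stage 1 check.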

Half of all mutations are lethal. 48.7% of DGM-generated agents lose the ability to edit code entirely. This is not a failure of the system — it is a feature. Biological evolution also produces many non-viable offspring. The archive keeps the viable ones and discards the rest. The key metric is not mutation success rate; it is whether the best agent keeps improving.

Why Empirical Beats Formal

Consider what the DGM needs to validate. "Does adding a retry mechanism improve SWE-bench performance?" A formal proof would need to model the LLM's stochastic behavior, the distribution of tasks, and the complex interaction between the retry logic and the rest of the workflow. An empirical test simply runs the agent on 60 tasks and checks the score.

The downside of empirical validation is noise. LLMs are stochastic, so the same agent can score differently on different runs. The DGM handles this through staged evaluation: small initial samples to filter, larger samples for promising agents. The stability analysis shows a standard deviation of 2.3% across three independent DGM runs on Polyglot, which suggests that the results are reasonably reproducible.
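A rough way to see why sample size matters: if each task is treated as an independent pass/fail trial (a simplifying assumption, and a different quantity from the run-to-run figure above), the standard error of a measured solve rate p on n tasks is sqrt(p(1 - p) / n).

```python
import math


def standard_error(p: float, n: int) -> float:
    """Standard error of a solve rate p estimated from n independent tasks."""
    return math.sqrt(p * (1.0 - p) / n)


# Worst case p = 0.5 at the three stage sizes described in this chapter.
for n in (10, 60, 200):
    print(f"n={n:3d}  standard error = {100 * standard_error(0.5, n):.1f} points")
```

Under that assumption the uncertainty shrinks from about 16 points at 10 tasks to about 6.5 at 60 and about 3.5 at 200, which is why the 10-task stage is only a functionality filter and final numbers come from the 200-task stage.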

Cost breakdown: A single DGM run on SWE-bench (80 iterations) costs approximately $22,000 in API fees. The same run on Polyglot costs far less because o3-mini is cheaper and tasks are simpler. For context, leading SWE-bench agents were built by teams of expert engineers over months — the DGM achieves comparable performance in 2 weeks of autonomous operation.

Why does the DGM use a staged evaluation strategy instead of running all agents on the full benchmark?

Because nearly half of mutations break the agent entirely, and full evaluation is expensive ($350/60 tasks), so staged evaluation filters out non-viable agents cheaply before investing in thorough assessment
Because the full benchmark is too large to download
Because smaller benchmarks give more accurate results

Chapter 7: Results

After 80 iterations, the DGM transforms a simple two-tool coding agent into a sophisticated system that rivals hand-crafted solutions built by expert teams over months.

SWE-bench Performance

The base agent: 20.0% success rate. Two tools, one prompt, no retry logic.

After 80 iterations of autonomous self-improvement: 50.0% success rate. The agent now has granular file viewing, string-replacement editing, multi-attempt strategies with ranking, history-aware patch generation, and automatic context summarization.

For comparison, the best checked (verified) open-source entry, OpenHands + CodeAct v2.1, achieves 51.0% on the same 200-task subset. The DGM comes within one point of a human-engineered solution.

Figure: DGM results, performance over iterations. The DGM is compared against its ablations on SWE-bench and Polyglot.

Polyglot Performance

The base agent starts at 14.2% on the full Polyglot benchmark. After 80 DGM iterations: 30.7%. For comparison, Aider — a popular open-source coding agent actively developed by human engineers for over a year — achieves 24.5% with the same model. The DGM surpasses a year of human engineering.

Transfer Results

The most striking finding is that DGM improvements generalize:

Model transfer: Agents discovered with Claude 3.5 Sonnet improve performance when run with Claude 3.7 Sonnet (19% → 59.5%) or o3-mini (23% → 33%)
Benchmark transfer: The best agent from SWE-bench achieves 28.9% on Polyglot (vs. 14.2% base). The best agent from Polyglot achieves 24.5% on SWE-bench (vs. 20% base)
Language transfer: Agents evolved only on Python tasks still improve on C++, Rust, Go, Java, and JavaScript

These do not look like benchmark hacks. The transfer results indicate that the DGM discovers generally useful coding strategies rather than task-specific exploits. A retry mechanism helps on any benchmark. Line-range file viewing helps with any programming language. The improvements are architectural, not adversarial.

What evidence shows that DGM improvements are genuine rather than benchmark-specific exploits?

The agent's code is shorter after evolution
The benchmark scores are very high
Improvements transfer across different models, benchmarks, and programming languages, proving they reflect general coding strategies rather than task-specific exploits

Chapter 8: Open-Endedness

The most philosophically interesting question about the DGM: does improvement ever stop? Or can it keep going indefinitely?

Evidence for Continued Innovation

Look at the SWE-bench performance curve. It does not plateau in 80 iterations. Each component ablation (no self-improvement, no archive) stagnates well before the full DGM does. The full system shows no clear sign of saturation at iteration 80.

But we should be careful. The DGM is currently constrained by the frozen foundation model. The agent can improve its prompts, tools, and workflow, but it cannot retrain the underlying LLM. This sets a ceiling — at some point, all the low-hanging architectural improvements will be exhausted, and further progress will require better reasoning capabilities in the FM itself.

The open question: Would running the DGM for 800 iterations instead of 80 continue to improve? The authors hypothesize yes, but also acknowledge that the current version is bounded by the FM's capabilities. The truly open-ended version would modify its own training scripts and retrain the FM — a direction left for future work.

What "Open-Ended" Means Here

Open-endedness in the DGM draws from the broader open-endedness research program. A system is open-ended if it continuously generates novel and learnable artifacts. In the DGM's case:

Novel: Each mutation produces a new agent variant with different code. The archive grows with genuinely different implementations of similar capabilities (e.g., multiple approaches to granular file editing).
Learnable: New agents build on previous stepping stones. Innovation at iteration 60 uses capabilities discovered at iteration 24. The system's past informs its future.

The key insight from open-endedness research is that optimizing a single objective leads to convergence and stagnation. But maintaining diversity — even diversity that appears suboptimal in the short term — enables long-term innovation. This is why the archive matters so much.

The analogy to evolution: Biological evolution did not optimize for intelligence. It optimized for reproduction. But by maintaining a diverse archive of organisms (the biosphere), it produced stepping stones that eventually led to intelligence. The DGM similarly does not optimize for the single best agent — it maintains a diverse archive that enables unpredictable innovations.

Safety Implications

A self-improving system raises obvious safety questions. The DGM takes several precautions:

Sandboxing: All agent execution happens in isolated environments with no host access
Time limits: Strict per-execution time caps prevent resource exhaustion
Auditability: The archive provides a complete lineage of every modification, enabling rollback and review
Scope limitation: Self-improvement is confined to the Python codebase; the FM weights are frozen
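As one illustration of the time-limit precaution, a per-execution cap can be enforced with the standard library. This sketch shows only the time cap; it is not a sandbox, and the function name and return convention are ours rather than the DGM's.

```python
import subprocess
import sys
from typing import Tuple


def run_with_time_limit(code: str, seconds: float) -> Tuple[bool, str]:
    """Run Python source in a child process; return (finished, output or reason)."""
    try:
        done = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True, text=True, timeout=seconds,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising, so nothing keeps running.
        return False, "timed out"
    return True, done.stdout.strip()


ok, output = run_with_time_limit("print(1 + 1)", seconds=30)
finished, reason = run_with_time_limit("import time; time.sleep(5)", seconds=0.5)
```

Real isolation would additionally need a container or similar boundary so that agent-written code cannot touch the host filesystem or network.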

The authors found no evidence of harmful or misaligned behavior in any generated agent. Self-modifications focused exclusively on improving coding capabilities. But they explicitly note that as FMs become more capable, the risks of self-improving systems will increase and require more robust safeguards.

Why is maintaining an archive of diverse agents essential for open-ended improvement?

Because it reduces evaluation costs
Because optimizing a single objective leads to convergence and stagnation, while a diverse archive preserves stepping stones that enable future innovations — even through temporary performance dips
Because older agents have more reliable code

Chapter 9: Connections

The Darwin Godel Machine sits at the intersection of several research threads that have been converging over the past few years.

Related Systems

System	Self-Referential?	Archive?	Empirical Validation?	Domain
Darwin Godel Machine	Yes (p.modify(p))	Yes (full)	Yes (coding benchmarks)	Coding agents
ADAS (Hu et al.)	No (fixed meta-agent)	Yes	Yes	Agent design
Godel Agent (Yin et al.)	Yes	No	Partially	General agents
AlphaEvolve (Google)	No	Yes (program DB)	Yes	Algorithm discovery
Meta-Harness (Lee et al.)	No	Yes (filesystem)	Yes	Harness optimization
The AI Scientist (Lu et al.)	No	No	Yes (paper reviews)	Research papers
Self-Improving Agent (Robeyns)	Yes	No (latest only)	Yes	Coding agents

Key Distinctions

DGM vs. ADAS: ADAS uses a fixed meta-agent to generate downstream agents. The meta-agent never improves. DGM is self-referential: the agent that proposes improvements is itself the improved version. The ADAS setup corresponds to the "DGM w/o self-improve" baseline, which performs 11 points worse on SWE-bench (39.0% vs. 50.0%).

DGM vs. Robeyns et al.: The concurrent self-improving agent work by Robeyns et al. is very similar but lacks the archive. It always builds from the latest version, which corresponds to the "DGM w/o open-ended exploration" baseline — it performs 27 points worse on SWE-bench.

DGM vs. AlphaEvolve: AlphaEvolve discovers algorithms (programs that solve specific mathematical problems). DGM discovers agents (programs that use LLMs to solve arbitrary coding problems). AlphaEvolve's search space is algorithmic code; DGM's search space is agent architecture.

DGM vs. Meta-Harness: Meta-Harness optimizes the code wrapping a fixed LLM for a specific task distribution. DGM optimizes a coding agent's entire codebase for general coding ability. Meta-Harness uses a fixed coding agent as proposer; DGM's proposer evolves.

The Bigger Picture

The DGM represents a concrete step toward what Jeff Clune calls AI-Generating Algorithms (AI-GAs): AI systems that generate new AI systems. The vision is that instead of humans designing AI architectures by hand, we build systems that can design (and redesign) themselves.

The missing piece is training. The current DGM modifies agent code but keeps the foundation model frozen. The truly transformative version would rewrite its own training scripts to produce a better FM — closing the loop between architecture design and model training. The authors explicitly identify this as the most important direction for future work.

The trajectory: Schmidhuber (2007) proposed self-improving AI through formal proofs. The DGM (2026) achieves it through empirical validation and evolutionary search. The next step is self-improving AI that can also retrain its own foundation model. Each generation gets more capable and more autonomous.

What is the key capability that the DGM currently lacks for truly open-ended self-improvement?

The ability to access the internet
The ability to modify its own training scripts and retrain its foundation model, which would remove the ceiling imposed by the frozen FM's capabilities
The ability to run on multiple GPUs

Darwin Godel Machine: Open-Ended Evolution of Self-Improving Agents