Jenny Zhang, Shengran Hu, Cong Lu, Robert Lange, Jeff Clune — UBC, Vector Institute, Sakana AI, 2026

Darwin Godel Machine: Open-Ended Evolution of Self-Improving Agents

A coding agent that modifies its own source code, validates changes on benchmarks, and grows an archive of ever-better versions of itself. 20% to 50% on SWE-bench without any human intervention.

Prerequisites: What an LLM agent is + Basic idea of evolutionary algorithms

Chapter 0: The Problem

You have a coding agent. It uses an LLM to read code, propose edits, and fix bugs. It solves about 20% of the tasks on SWE-bench. You want it to be better. What do you do?

The standard approach: a human engineer studies the agent's failures, figures out what went wrong, and manually improves the code. Maybe they add a better file-editing tool. Maybe they restructure the prompt. Maybe they add a retry mechanism. Each improvement takes hours of human thought. After a year of this, the best open-source agents solve about 50% of SWE-bench tasks.

Here is the fundamental bottleneck: every improvement to an AI agent currently requires a human to design it. The agent itself has no ability to improve its own architecture, tools, or workflow. It is a fixed system, frozen in whatever configuration its creators left it in.

The agent improvement paradox: Your coding agent is good enough to read code, understand logic, and propose edits to arbitrary repositories. But it cannot read, understand, or edit its own code. The same capabilities that make it useful for solving coding tasks could, in principle, be turned inward.

This is not just a convenience problem. It is a scaling problem. Human designers have limited time, limited intuition, and limited ability to explore the vast space of possible agent designs. There are millions of ways to structure prompts, tools, workflows, and retry strategies. A human explores a few dozen. An automated system could explore thousands.

Fixed Agent vs. Self-Improving Agent (interactive simulation)

A fixed agent stays at whatever level its human designer left it. The simulation contrasts this with an agent that can improve itself over successive iterations.

Why is manual agent improvement a scaling bottleneck?

Chapter 1: The Key Insight

Here is the idea that makes the Darwin Godel Machine work: a coding agent's ability to solve coding tasks IS its ability to improve itself, because self-improvement is itself a coding task.

Think about what self-improvement means for a coding agent. The agent's behavior is determined by its Python source code: the prompt templates, the tools it can use, the workflow that orchestrates them. Improving the agent means modifying that source code. But modifying source code is exactly what a coding agent does for a living.

The self-referential loop: If you get better at coding, you get better at modifying your own code. If you get better at modifying your own code, you get better at coding. This is not a vicious circle — it is a virtuous spiral. Each improvement compounds into the next.

This is a profound observation. In most AI systems, the ability to solve downstream tasks is separate from the ability to improve the system itself. A chess engine gets better at chess but not at improving chess engines. A language model generates better text but does not modify its own architecture. But for a coding agent, the two are the same skill.

The DGM makes this loop concrete. It starts with a simple coding agent (two tools: bash and file editor). It asks that agent to solve coding benchmarks. It then asks the agent to modify its own code to do better. The modified agent is evaluated. If it improves, the improvement is kept. The cycle repeats.

1. Agent solves coding tasks. The agent reads repositories, proposes edits, and fixes bugs using its current tools and workflow.
2. Agent reads its own failures. A diagnostic foundation model (FM) analyzes benchmark logs, identifies where the agent struggled, and proposes a feature.
3. Agent modifies its own code. The agent edits its own Python codebase: adding tools, changing prompts, restructuring workflow.
4. Modified agent is evaluated. The new version is benchmarked. If it retains basic code-editing ability, it enters the archive.

↻ repeat

Why "Darwin"? Because the process mirrors natural selection. Mutations (code edits) are generated. Fitness is measured (benchmark score). Every viable variant survives in the archive, and fitter variants are more likely to serve as parents for future mutations. No single path is guaranteed; improvement emerges from population-level exploration.

Why is a coding agent uniquely positioned for self-improvement?

Chapter 2: The Original Godel Machine

In 2007, Jurgen Schmidhuber proposed the Godel Machine: a theoretical AI that modifies itself in a provably beneficial way. Before making any change to its own code, the system would construct a mathematical proof that the change would improve future performance. Only proven improvements would be applied.

This is a beautiful idea. It is also completely impractical.

Why Formal Proofs Fail

Consider what a formal proof would require. The system would need to model its own future behavior precisely enough to guarantee that a code change leads to better outcomes. This means modeling the LLM's stochastic outputs, the diversity of tasks it will encounter, and the complex interactions between its tools, prompts, and workflow.

For any non-trivial system, this is computationally intractable. By Rice's theorem, every non-trivial semantic property of programs is undecidable in general, so even simple behavioral properties cannot always be verified automatically. Proving that "adding a retry mechanism to a coding agent will improve its average SWE-bench score" is far beyond what formal methods can handle today.

The original Godel Machine: Provably optimal self-improvement. In practice: zero modifications ever pass the proof requirement. It is a theoretical ceiling that no real system can reach.

The DGM Relaxation

The Darwin Godel Machine takes a pragmatic approach. Instead of proving that a change is beneficial, it tests it. Run the modified agent on a benchmark. Did the score go up? Keep it. Did it go down? Still keep it in the archive (it might be a useful stepping stone later). Did the agent lose the ability to edit code entirely? Discard it.

| Property | Godel Machine (2007) | Darwin Godel Machine (2026) |
| --- | --- | --- |
| Validation | Formal proof of improvement | Empirical benchmark testing |
| Search strategy | Proof search | Evolutionary (archive + mutation) |
| Modification scope | Entire system | Agent codebase (frozen FM) |
| Guarantee | Provably optimal | No guarantee (but empirically effective) |
| Practical result | 0 modifications applied | 80 iterations, 20% → 50% on SWE-bench |

Trading certainty for progress: The original Godel Machine is provably correct but never fires. The DGM is not provably correct but achieves 2.5x improvement on real coding benchmarks. This is the same tradeoff that makes empirical science work: you test hypotheses rather than proving them a priori.

Why is Schmidhuber's original Godel Machine impractical?

Chapter 3: DGM Architecture

The DGM is surprisingly simple. Three components: an archive of agent variants, a mutation operator (the agent modifying its own code), and an evaluation function (benchmark testing). Here is the full algorithm:

pseudocode
function DGM(base_agent, benchmark, T):
    score = evaluate(base_agent, benchmark)
    archive = {(base_agent, score)}

    for t = 1 to T:
        parents = select_parents(archive)    # fitness-proportionate + novelty
        for p in parents:
            child = p.modify(p)               # agent modifies its OWN code
            score = evaluate(child, benchmark)
            if child.can_edit_code():        # must retain basic ability
                archive.add((child, score))

    return archive

That is the entire algorithm. Notice what is not here: no gradient descent, no reward shaping, no prompt optimization framework. The search is driven by the agent's own ability to read failure logs, diagnose problems, and write code fixes.

The critical line is p.modify(p). The parent agent modifies itself. This is self-referential: the agent that produces the next generation is not a fixed meta-agent — it is the improved version from the previous generation. Better agents produce better mutations.
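The loop above can be sketched as runnable Python. Everything in this sketch is a toy stand-in and an assumption for illustration: `ToyAgent`, its `skill` number, the random `modify` step, and `evaluate` are hypothetical placeholders, not the DGM codebase, and parent selection is simplified to score-weighted sampling of a single parent per iteration.

```python
import random
from dataclasses import dataclass


@dataclass
class ToyAgent:
    """Hypothetical stand-in for an agent codebase; `skill` replaces real code."""
    skill: float
    functional: bool = True

    def modify(self, target: "ToyAgent", rng: random.Random) -> "ToyAgent":
        # Stand-in for self-modification: nudge the skill up or down, and
        # sometimes break the child so it can no longer edit code.
        child_skill = min(1.0, max(0.0, target.skill + rng.uniform(-0.1, 0.15)))
        return ToyAgent(skill=child_skill, functional=rng.random() < 0.5)

    def can_edit_code(self) -> bool:
        return self.functional


def evaluate(agent: ToyAgent) -> float:
    # Stand-in for a benchmark run: a broken agent scores zero.
    return agent.skill if agent.functional else 0.0


def dgm(base_agent: ToyAgent, iterations: int, seed: int = 0):
    rng = random.Random(seed)
    archive = [(base_agent, evaluate(base_agent))]
    for _ in range(iterations):
        agents = [agent for agent, _ in archive]
        # Small offset so every archived agent keeps a non-zero chance.
        weights = [score + 1e-6 for _, score in archive]
        parent = rng.choices(agents, weights=weights, k=1)[0]
        child = parent.modify(parent, rng)   # the parent edits its own code
        score = evaluate(child)
        if child.can_edit_code():            # broken children are discarded
            archive.append((child, score))   # kept even if the score dropped
    return archive


archive = dgm(ToyAgent(skill=0.2), iterations=80)
best = max(score for _, score in archive)
```

The point of the sketch is structural: nothing outside the archive produces children, so whatever agent is sampled as a parent is also the thing doing the modifying.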

The Base Agent

The DGM starts with a deliberately simple coding agent. It has exactly two tools: a bash tool for running shell commands and a file editor that views and overwrites whole files.

That is it. No retrieval-augmented generation. No multi-attempt strategies. No context management. No test validation. These features are discovered automatically by the DGM during its evolutionary run. Starting simple is deliberate: it gives the system room to discover improvements that a human might not think of.

What the Agent Actually Modifies

The agent's codebase is a Python project:

During self-modification, the agent can add new files, modify existing ones, change the workflow logic, create entirely new tools, or restructure the prompt strategy. There is no built-in limit on what a modification can express: Python is Turing-complete, so in principle the agent can implement any computable process.

DGM Evolution: Interactive Archive (simulation)

The simulation shows the DGM growing its archive of agent variants. Each node is an agent. Color = benchmark score. Edges show parent-child relationships.
What makes the DGM's mutation operator different from standard meta-learning approaches like ADAS?

Chapter 4: The Archive

If the DGM simply kept the best agent and always mutated from it, it would be doing hill-climbing. Hill-climbing gets stuck in local optima. The agent might discover a line of improvement (say, better prompts) that prevents it from exploring a fundamentally different approach (say, multi-attempt strategies).

The DGM solves this with an archive: it keeps every agent that passes the basic code-editing test, regardless of its benchmark score. Low-scoring agents remain in the archive because they might contain stepping stones — partial innovations that a future mutation can build upon.

Stepping stones: In the SWE-bench run, the lineage of the final best agent passes through two performance dips (iterations 4 and 56) where the agent temporarily got worse. A hill-climbing approach would have abandoned those paths. But the DGM's archive preserved them, and later mutations built on those "failed" experiments to create the best agent.

Parent Selection

Not all agents are equally likely to be chosen as parents. The DGM uses a selection probability that balances two factors:

$p_i \propto s_i \cdot h_i$

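Where one factor is a fitness term derived from agent i's benchmark score and the other is a novelty term.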

This is inspired by quality-diversity algorithms like MAP-Elites. The archive does not just track the best solution — it maintains a diverse population of interesting solutions. Every agent retains a non-zero selection probability, so any path to improvement remains reachable given enough compute.
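A minimal sketch of score-times-novelty selection follows. The text above only states that the probability is proportional to a product of two factors, so the concrete forms used here are assumptions chosen for illustration: a sigmoid of the benchmark score for the fitness term and `1 / (1 + number of children)` for the novelty term. The parameters `lam` and `midpoint` and both function names are hypothetical.

```python
import math
import random


def selection_probabilities(scores, child_counts, lam=10.0, midpoint=0.5):
    """Return probabilities proportional to fitness * novelty per agent.

    Assumed forms (not given in the text):
      fitness = sigmoid(lam * (score - midpoint))
      novelty = 1 / (1 + child_count)
    """
    weights = []
    for score, n_children in zip(scores, child_counts):
        fitness = 1.0 / (1.0 + math.exp(-lam * (score - midpoint)))
        novelty = 1.0 / (1.0 + n_children)
        weights.append(fitness * novelty)
    total = sum(weights)
    return [w / total for w in weights]


def select_parents(archive_ids, probs, k, rng):
    # Sample k parents with replacement according to the probabilities.
    return rng.choices(archive_ids, weights=probs, k=k)


scores = [0.20, 0.35, 0.50]   # benchmark scores in [0, 1]
child_counts = [4, 1, 0]      # how often each agent has already been a parent
probs = selection_probabilities(scores, child_counts)
parents = select_parents(["a0", "a1", "a2"], probs, k=2, rng=random.Random(0))
```

Because both factors are strictly positive, every archived agent keeps a non-zero chance of being picked, which is the property the archive relies on.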

Ablation result: DGM with greedy selection (always pick the best agent) achieves 39.7% on SWE-bench. DGM with the full archive and diversity-aware selection achieves 50.0%. The archive contributes over 10 percentage points of improvement.
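The ablation gaps are simple differences, and a few lines of arithmetic make them easy to check. The scores are copied from this chapter's comparison of strategies; the dictionary layout and variable names are just an illustrative choice.

```python
# Percent of tasks solved, as reported in this chapter.
results = {
    "DGM (full archive + diversity)": {"SWE-bench": 50.0, "Polyglot": 38.0},
    "DGM Greedy (always best parent)": {"SWE-bench": 39.7, "Polyglot": 30.0},
    "DGM w/o open-ended (latest only)": {"SWE-bench": 23.0, "Polyglot": 14.0},
    "DGM w/o self-improve (fixed meta-agent)": {"SWE-bench": 39.0, "Polyglot": 28.0},
}

full = results["DGM (full archive + diversity)"]

# Percentage-point gap between the full DGM and each ablation.
gaps = {
    name: {bench: round(full[bench] - score, 1) for bench, score in row.items()}
    for name, row in results.items()
    if row is not full
}

for name, gap in gaps.items():
    print(f"{name}: {gap}")
```

Removing open-ended exploration costs the most on both benchmarks, while greedy selection and a fixed meta-agent each cost roughly 10 points.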

Why Not Just Keep the Best?

Consider a concrete example from the paper. The DGM discovered "more granular file editing via string replacement" at node 24 in the SWE-bench archive tree. This was a relatively low-scoring agent at the time. But it introduced a fundamental capability (targeted edits to part of a file instead of whole-file replacement) that all subsequent high-performing agents built upon. If the archive had been pruned to only the current best, this stepping stone would have been lost.

| Strategy | SWE-bench | Polyglot |
| --- | --- | --- |
| DGM (full archive + diversity) | 50.0% | 38.0% |
| DGM Greedy (always best parent) | 39.7% | 30.0% |
| DGM w/o open-ended (latest only) | 23.0% | 14.0% |
| DGM w/o self-improve (fixed meta-agent) | 39.0% | 28.0% |

Why does the DGM keep low-scoring agents in the archive instead of pruning them?

Chapter 5: Mutation Operators

In biological evolution, mutations are random. In the DGM, mutations are intelligent: an LLM reads the agent's failure logs, diagnoses what went wrong, and proposes a targeted code modification. This is not random search — it is informed, directed self-improvement.

The Two-Phase Mutation Process

Each mutation involves two foundation model calls:

Phase 1: Diagnosis (o1)
A reasoning model reads: (1) the agent's full source code, (2) the benchmark failure logs, (3) the task description, (4) the agent's predicted patch, and (5) the test results. It outputs a structured JSON with log analysis, potential improvements, one concrete proposal, and a GitHub-issue-style problem description.
Phase 2: Implementation (Claude 3.5 Sonnet)
The coding agent receives the problem description and implements it by editing its own codebase. It uses the same bash + editor tools it uses for solving coding tasks.

Notice the elegance: the implementation phase uses the agent's own capabilities. A better coding agent produces better self-modifications, which produce an even better coding agent. The mutation quality improves as the system evolves.

Real mutations discovered by the DGM on SWE-bench:
• Non-empty patch validation and retry (if the first attempt produces an empty patch, try again)
• Granular file viewing via line ranges (instead of viewing entire files)
• String replacement editing (instead of overwriting entire files)
• Multiple patch generations with ranking (generate several solutions, pick the best)
• History-aware patch generation (learn from previous failed attempts)
• Auto-summarize on context limit (compress conversation when context window fills up)

The Self-Modification Prompt

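The diagnosis prompt is carefully structured. It provides the reasoning model with the five inputs listed under Phase 1: the agent's full source code, the benchmark failure logs, the task description, the agent's predicted patch, and the test results.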

The output is a structured JSON with five fields: log_summarization, potential_improvements, improvement_proposal, implementation_suggestion, and problem_description. The last field is formatted as a GitHub issue that the coding agent then solves.
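To make the hand-off between the two phases concrete, here is a small sketch that parses such a JSON object and extracts the issue text for the coding agent. The five field names come from the text; the field types, the `Diagnosis` class, the `parse_diagnosis` helper, and the example payload are all assumptions made up for illustration.

```python
import json
from dataclasses import dataclass

FIELDS = (
    "log_summarization",
    "potential_improvements",
    "improvement_proposal",
    "implementation_suggestion",
    "problem_description",
)


@dataclass
class Diagnosis:
    # Field types are assumed; the text only names the fields.
    log_summarization: str
    potential_improvements: list
    improvement_proposal: str
    implementation_suggestion: str
    problem_description: str


def parse_diagnosis(raw: str) -> Diagnosis:
    """Parse the diagnosis output and fail loudly if a field is missing."""
    data = json.loads(raw)
    missing = [name for name in FIELDS if name not in data]
    if missing:
        raise ValueError(f"diagnosis is missing fields: {missing}")
    return Diagnosis(**{name: data[name] for name in FIELDS})


# Made-up example payload, for illustration only.
raw = json.dumps({
    "log_summarization": "Agent overwrote a whole file and dropped unrelated code.",
    "potential_improvements": ["string-replacement editing", "line-range viewing"],
    "improvement_proposal": "Add a string-replacement edit command.",
    "implementation_suggestion": "Extend the edit tool with a replace operation.",
    "problem_description": "Implement string-replacement editing in tools/edit.py",
})

diagnosis = parse_diagnosis(raw)
issue_for_agent = diagnosis.problem_description  # what Phase 2 receives
```

Validating the fields before Phase 2 is a design choice of this sketch: a malformed diagnosis is cheaper to reject here than after the agent has spent an implementation attempt on it.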

Self-improvement as a coding task: The DGM frames each self-modification as a GitHub issue to solve on its own repository. The agent receives "implement string-replacement editing in tools/edit.py" and treats it exactly like any other coding task. This means the agent's improving coding ability directly translates into better self-modifications.
Why do DGM mutations improve in quality over time?

Chapter 6: Empirical Validation

Every newly generated agent variant must be evaluated. Evaluation is expensive — running a coding agent on SWE-bench costs about $350 per 60 tasks with Claude 3.5 Sonnet. The DGM uses a staged evaluation strategy to manage this cost.

The Three Stages

Stage 1: Functionality Check (10 tasks)
Can the agent still edit code at all? This is the minimum bar. Agents that fail to produce any valid edits are discarded entirely — they cannot self-modify or solve tasks.
↓ pass
Stage 2: General Assessment (50-60 tasks)
Estimate the agent's coding ability on a representative sample. For SWE-bench, 60 tasks from SWE-bench-verified-mini. For Polyglot, 50 random tasks across languages.
↓ score ≥ 40% and top-2 in archive
Stage 3: Full Evaluation (200 tasks)
Only for the most promising agents. 200 tasks on SWE-bench for accurate performance estimation. This is where the final numbers come from.

This staged approach is crucial for cost management. Only about 51.3% of generated agents even pass Stage 1 (retaining basic code-editing functionality). The rest lose their core capability during self-modification — they break themselves.
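The gating logic of the three stages can be sketched as a single function. The task counts and the 40% threshold come from the text; `check_functional` and `run_tasks` are hypothetical hooks standing in for a real benchmark harness. Two details are assumptions of this sketch: "top-2 in archive" is read as "the candidate would rank in the top two once added", and the Stage 3 score replaces the Stage 2 estimate.

```python
from typing import Callable, List


def staged_evaluation(
    check_functional: Callable[[int], bool],
    run_tasks: Callable[[int], float],
    archive_scores: List[float],
    threshold: float = 0.40,
    top_k: int = 2,
) -> dict:
    # Stage 1: functionality check on 10 tasks. Broken agents are discarded.
    if not check_functional(10):
        return {"stage": 1, "kept": False, "score": None}

    # Stage 2: general assessment on a 60-task sample (SWE-bench setting).
    score = run_tasks(60)

    # Gate for Stage 3: score >= threshold AND among the top_k scores.
    ranked = sorted(archive_scores + [score], reverse=True)
    cutoff = ranked[min(top_k, len(ranked)) - 1]
    if score >= threshold and score >= cutoff:
        # Stage 3: full evaluation on 200 tasks.
        return {"stage": 3, "kept": True, "score": run_tasks(200)}

    return {"stage": 2, "kept": True, "score": score}


# Toy harness hooks that return fixed values, for illustration only.
broken = staged_evaluation(lambda n: False, lambda n: 0.0, [0.20, 0.30])
weak = staged_evaluation(lambda n: True, lambda n: 0.25, [0.20, 0.30])
strong = staged_evaluation(lambda n: True, lambda n: 0.45, [0.20, 0.30, 0.42])
crowded = staged_evaluation(lambda n: True, lambda n: 0.41, [0.45, 0.42])
```

Note that a weak agent still ends with `kept` set to true: only Stage 1 failures are discarded, while low scorers stay in the archive as possible stepping stones.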

Half of all mutations are lethal. 48.7% of DGM-generated agents lose the ability to edit code entirely. This is not a failure of the system — it is a feature. Biological evolution also produces many non-viable offspring. The archive keeps the viable ones and discards the rest. The key metric is not mutation success rate; it is whether the best agent keeps improving.

Why Empirical Beats Formal

Consider what the DGM needs to validate. "Does adding a retry mechanism improve SWE-bench performance?" A formal proof would need to model the LLM's stochastic behavior, the distribution of tasks, and the complex interaction between the retry logic and the rest of the workflow. An empirical test simply runs the agent on 60 tasks and checks the score.

The downside of empirical validation is noise. LLMs are stochastic, so the same agent can score differently on different runs. The DGM handles this through staged evaluation: small initial samples to filter, larger samples for promising agents. The stability analysis shows a standard deviation of 2.3% across three independent DGM runs on Polyglot, confirming that the results are reproducible.
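A rough way to see why larger samples are reserved for promising agents is the binomial standard error of a measured pass rate. This is a textbook approximation that assumes tasks are independent and equally hard, which real benchmark tasks are not; it is not the stability analysis mentioned above, and the function name is hypothetical.

```python
import math


def pass_rate_standard_error(p: float, n_tasks: int) -> float:
    """Binomial standard error of a pass rate measured on n_tasks tasks."""
    return math.sqrt(p * (1.0 - p) / n_tasks)


# Compare the three sample sizes used by the staged evaluation.
for n in (10, 60, 200):
    se = pass_rate_standard_error(0.5, n)
    print(f"n={n:3d}  standard error = {se:.3f}")
```

Under these assumptions, an agent with a true 50% pass rate has a standard error of about 6.5 percentage points on 60 tasks and about 3.5 points on 200, so small samples are adequate for filtering but not for reporting final numbers.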

Cost breakdown: A single DGM run on SWE-bench (80 iterations) costs approximately $22,000 in API fees. The same run on Polyglot costs far less because o3-mini is cheaper and tasks are simpler. For context, leading SWE-bench agents were built by teams of expert engineers over months — the DGM achieves comparable performance in 2 weeks of autonomous operation.
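The cost figures quoted in this chapter can be turned into per-task and per-iteration numbers with simple division. The inputs are the figures from the text; the linear extrapolation to a 200-task evaluation is an assumption of this sketch, since actual cost depends on how long each task runs.

```python
COST_PER_SAMPLE_RUN = 350.0   # dollars per 60 SWE-bench tasks (from the text)
SAMPLE_TASKS = 60
RUN_COST = 22_000.0           # dollars for one 80-iteration run (from the text)
ITERATIONS = 80

cost_per_task = COST_PER_SAMPLE_RUN / SAMPLE_TASKS
cost_per_iteration = RUN_COST / ITERATIONS

# Assumption: evaluation cost scales linearly with the number of tasks.
full_eval_cost = cost_per_task * 200

print(f"per task:       ${cost_per_task:.2f}")
print(f"per iteration:  ${cost_per_iteration:.2f}")
print(f"200-task eval:  ${full_eval_cost:.2f}")
```

At these rates a single full 200-task evaluation costs several times the average iteration, which is why Stage 3 is reserved for a small number of agents.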
Why does the DGM use a staged evaluation strategy instead of running all agents on the full benchmark?

Chapter 7: Results

After 80 iterations, the DGM transforms a simple two-tool coding agent into a sophisticated system that rivals hand-crafted solutions built by expert teams over months.

SWE-bench Performance

The base agent: 20.0% success rate. Two tools, one prompt, no retry logic.

After 80 iterations of autonomous self-improvement: 50.0% success rate. The agent now has granular file viewing, string-replacement editing, multi-attempt strategies with ranking, history-aware patch generation, and automatic context summarization.

For comparison, the checked open-source state-of-the-art (OpenHands + CodeAct v2.1) achieves 51.0% on the same 200-task subset. The DGM comes within one percentage point of the best human-engineered open-source solution.

DGM Results: Performance Over Iterations (interactive chart)

The chart compares the DGM against its ablations on SWE-bench and Polyglot.

Polyglot Performance

The base agent starts at 14.2% on the full Polyglot benchmark. After 80 DGM iterations: 30.7%. For comparison, Aider — a popular open-source coding agent actively developed by human engineers for over a year — achieves 24.5% with the same model. The DGM surpasses a year of human engineering.

Transfer Results

The most striking finding is that DGM improvements generalize:

These are not benchmark hacks. The transfer results indicate that the DGM discovers generally useful coding strategies rather than task-specific exploits. A retry mechanism helps on any benchmark. Line-range file viewing helps with any programming language. The improvements are architectural, not adversarial.
What evidence shows that DGM improvements are genuine rather than benchmark-specific exploits?

Chapter 8: Open-Endedness

The most philosophically interesting question about the DGM: does improvement ever stop? Or can it keep going indefinitely?

Evidence for Continued Innovation

Look at the SWE-bench performance curve. It does not plateau in 80 iterations. Each component ablation (no self-improvement, no archive) stagnates well before the full DGM does. The full system shows no clear sign of saturation at iteration 80.

But we should be careful. The DGM is currently constrained by the frozen foundation model. The agent can improve its prompts, tools, and workflow, but it cannot retrain the underlying LLM. This sets a ceiling — at some point, all the low-hanging architectural improvements will be exhausted, and further progress will require better reasoning capabilities in the FM itself.

The open question: Would running the DGM for 800 iterations instead of 80 continue to improve? The authors hypothesize yes, but also acknowledge that the current version is bounded by the FM's capabilities. The truly open-ended version would modify its own training scripts and retrain the FM — a direction left for future work.

What "Open-Ended" Means Here

Open-endedness in the DGM draws from the broader open-endedness research program. A system is open-ended if it continuously generates novel and learnable artifacts. In the DGM's case:

The key insight from open-endedness research is that optimizing a single objective leads to convergence and stagnation. But maintaining diversity — even diversity that appears suboptimal in the short term — enables long-term innovation. This is why the archive matters so much.

The analogy to evolution: Biological evolution did not optimize for intelligence. It optimized for reproduction. But by maintaining a diverse archive of organisms (the biosphere), it produced stepping stones that eventually led to intelligence. The DGM similarly does not optimize for the single best agent — it maintains a diverse archive that enables unpredictable innovations.

Safety Implications

A self-improving system raises obvious safety questions. The DGM takes several precautions:

The authors found no evidence of harmful or misaligned behavior in any generated agent. Self-modifications focused exclusively on improving coding capabilities. But they explicitly note that as FMs become more capable, the risks of self-improving systems will increase and require more robust safeguards.

Why is maintaining an archive of diverse agents essential for open-ended improvement?

Chapter 9: Connections

The Darwin Godel Machine sits at the intersection of several research threads that have been converging over the past few years.

Related Systems

| System | Self-Referential? | Archive? | Empirical Validation? | Domain |
| --- | --- | --- | --- | --- |
| Darwin Godel Machine | Yes (p.modify(p)) | Yes (full) | Yes (coding benchmarks) | Coding agents |
| ADAS (Hu et al.) | No (fixed meta-agent) | Yes | Yes | Agent design |
| Godel Agent (Yin et al.) | Yes | No | Partially | General agents |
| AlphaEvolve (Google) | No | Yes (program DB) | Yes | Algorithm discovery |
| Meta-Harness (Lee et al.) | No | Yes (filesystem) | Yes | Harness optimization |
| The AI Scientist (Lu et al.) | No | No | Yes (paper reviews) | Research papers |
| Self-Improving Agent (Robeyns) | Yes | No (latest only) | Yes | Coding agents |

Key Distinctions

DGM vs. ADAS: ADAS uses a fixed meta-agent to generate downstream agents. The meta-agent never improves. DGM is self-referential: the agent that proposes improvements is itself the improved version. The "DGM w/o self-improve" baseline reproduces the fixed-meta-agent setup inside the DGM and scores 11 points lower on SWE-bench (39.0% vs. 50.0%).

DGM vs. Robeyns et al.: The concurrent self-improving agent work by Robeyns et al. is very similar but lacks the archive. It always builds from the latest version, which corresponds to the "DGM w/o open-ended exploration" baseline. That baseline scores 27 points lower on SWE-bench (23.0% vs. 50.0%).

DGM vs. AlphaEvolve: AlphaEvolve discovers algorithms (programs that solve specific mathematical problems). DGM discovers agents (programs that use LLMs to solve arbitrary coding problems). AlphaEvolve's search space is algorithmic code; DGM's search space is agent architecture.

DGM vs. Meta-Harness: Meta-Harness optimizes the code wrapping a fixed LLM for a specific task distribution. DGM optimizes a coding agent's entire codebase for general coding ability. Meta-Harness uses a fixed coding agent as proposer; DGM's proposer evolves.

The Bigger Picture

The DGM represents a concrete step toward what Jeff Clune calls AI-Generating Algorithms (AI-GAs): AI systems that generate new AI systems. The vision is that instead of humans designing AI architectures by hand, we build systems that can design (and redesign) themselves.

The missing piece is training. The current DGM modifies agent code but keeps the foundation model frozen. The truly transformative version would rewrite its own training scripts to produce a better FM — closing the loop between architecture design and model training. The authors explicitly identify this as the most important direction for future work.

The trajectory: Schmidhuber (2007) proposed self-improving AI through formal proofs. The DGM (2026) achieves it through empirical validation and evolutionary search. The next step is self-improving AI that can also retrain its own foundation model. Each generation gets more capable and more autonomous.
What is the key capability that the DGM currently lacks for truly open-ended self-improvement?