Novikov, Vũ, Eisenberger et al. — Google DeepMind, 2025

AlphaEvolve: A Coding Agent for Scientific Discovery

An evolutionary coding agent that uses LLMs as mutation operators inside an automated search loop. It discovered the first improvement over Strassen's 1969 matrix-multiplication algorithm, optimized Google's data-center scheduling, and accelerated its own LLM training — all with provably correct results.

Prerequisites: Evolutionary algorithms (basics) + LLM code generation + Matrix multiplication
10
Chapters
4+
Simulations

Chapter 0: The Problem

Discovering a new algorithm is not like writing code to a spec. There is no specification. You are searching an enormous space of possible programs, and you do not even know what "better" looks like until you try something and measure it.

Think about matrix multiplication. In 1969, Volker Strassen shocked the mathematics community by showing that two 2×2 matrices could be multiplied with 7 scalar multiplications instead of the obvious 8. This single insight cascades: apply it recursively and you get sub-cubic algorithms for all matrix sizes. For 56 years, nobody found a way to do better for 4×4 matrices over the complex numbers. Thousands of researchers tried.

Or consider Google's data-center scheduling. Billions of compute jobs need to be assigned to machines every day. A slightly better heuristic — even 0.7% better — recovers enormous amounts of stranded resources across the fleet. But the search space of possible heuristics is effectively infinite, and each candidate requires realistic simulation to evaluate.

These problems share a structure: the search space is vast, the evaluation is automatable, and small improvements compound into massive impact. Traditional approaches — handcrafted heuristics, brute-force search, reinforcement learning on narrow tasks — have made progress, but they hit walls. Hand-design cannot explore fast enough. RL approaches like AlphaTensor work on single problems but do not generalize.

The core question: What if we could automate the entire creative loop — proposing ideas, testing them, keeping the good ones, combining them — using LLMs as the idea engine and code execution as the judge? That is what AlphaEvolve does. It turns the LLM into an evolution operator inside an automated discovery pipeline.

AlphaEvolve is the successor to FunSearch (Romera-Paredes et al., 2023), which showed that LLM-guided evolution could find new mathematical constructions. But FunSearch was limited: it could only evolve a single short Python function, it optimized a single metric, and it used small LLMs. AlphaEvolve removes all these constraints.

Why is algorithmic discovery fundamentally harder than software engineering?

Chapter 1: The Key Insight

The central idea is beautifully simple: use an LLM as the mutation operator in an evolutionary algorithm, with automated code execution as the fitness function.

In classical evolutionary algorithms, mutations are random: flip a bit, swap two elements, perturb a number. These mutations are blind — they do not understand the structure of the solution. Most random mutations to code produce garbage.

But LLMs understand code. Given a program and a description of what it does, an LLM can propose meaningful changes: swap an optimizer, add a regularization term, restructure a loop. Each proposed change is a hypothesis about how to improve the program. The evolutionary framework then tests these hypotheses automatically and keeps the winners.

Why this works: Evolution provides the exploration discipline — maintaining diversity, selecting for fitness, combining good ideas. The LLM provides the intelligence of the mutations — understanding code semantics, drawing on its training knowledge, proposing changes that are syntactically valid and semantically meaningful. Neither alone is sufficient. The LLM alone will hallucinate or get stuck in local optima. Evolution alone with random mutations is too slow. Together, they search effectively.

Here is the key difference from just asking an LLM to solve a problem:

ApproachExplorationGroundingDuration
Single LLM queryOne shotNone (may hallucinate)Seconds
LLM + chain of thoughtSequential reasoningSelf-consistency onlyMinutes
AlphaEvolvePopulation-based, thousands of trialsCode execution + automated evaluationHours to days

AlphaEvolve grounds every suggestion in reality: the proposed code is actually run, and the result is actually measured. This completely sidesteps LLM hallucinations. If the LLM proposes a change that makes the code worse (or breaks it), the evaluator catches it and that variant dies. If the change helps, it survives and seeds the next generation.

The system uses an ensemble of models: Gemini 2.0 Flash for high-throughput exploration (many cheap samples) and Gemini 2.0 Pro for occasional high-quality breakthroughs. Flash generates volume; Pro generates insight. The evolutionary framework harvests the best of both.

What makes AlphaEvolve's mutations fundamentally different from classical evolutionary algorithm mutations?

Chapter 2: The Pipeline

AlphaEvolve runs a continuous loop. Each iteration follows six steps:

1. Sample parent
Pick a program from the database + optional "inspiration" programs
2. Build prompt
Assemble parent code + scores + context + instructions into an LLM prompt
3. LLM generates diff
The LLM proposes code modifications as SEARCH/REPLACE blocks
4. Apply diff
Patch the parent program to create a child program
5. Evaluate
Run the child program and score it with automated evaluators
6. Update database
If promising, add the child to the program database for future iterations

Let us walk through each step in detail.

Step 1: Sample a parent

The program database holds all previously evaluated solutions with their scores. The system samples a parent program — the solution to be improved — plus zero or more inspiration programs that scored well on different metrics. This is the crossover mechanism: by showing the LLM multiple good solutions, it can combine ideas from different lineages.

Step 2: Build the prompt

The prompt is rich. It includes:

Step 3: LLM generates a diff

The LLM outputs its proposed changes in a SEARCH/REPLACE format:

<<<<<<< SEARCH
return optax.adam(self.learning_rate)
=======
return optax.adamw(self.learning_rate, weight_decay=1e-4)
>>>>>>> REPLACE

This format is important: it lets the LLM make targeted edits to specific parts of potentially large codebases without rewriting everything. For short programs, the LLM can also output the entire replacement code directly.

Step 4–6: Apply, evaluate, store

The diff is applied to create a child program. The child is executed and scored. If the child improves on any metric, it enters the database and becomes available as a parent or inspiration for future iterations.

Asynchronous execution: The pipeline is fully asynchronous. Many LLM queries and evaluations run concurrently. The system is optimized for throughput — maximize the number of ideas explored per unit time — rather than the speed of any single evaluation. This is why AlphaEvolve succeeds with just thousands of samples (not millions), since each sample is a high-quality, semantically meaningful mutation.
Why does AlphaEvolve use a SEARCH/REPLACE diff format instead of generating complete programs?

Chapter 3: Evolutionary Components

AlphaEvolve's evolutionary loop has four moving parts: the population, selection, crossover, and mutation. Each is reimagined for the LLM era.

Population: the program database

In classical evolution, the population is a flat set of genomes. AlphaEvolve's population is a structured program database — a collection of programs tagged with their evaluation scores across multiple metrics. The database uses an algorithm inspired by MAP-Elites and island-based models to balance exploration (try new things) with exploitation (refine what works).

Selection: sampling parents

When the system needs a parent, it does not just pick the best-scoring program. It samples from the database with a bias toward higher-scoring programs but enough randomness to maintain diversity. It can also sample "inspiration" programs — solutions that excel on different metrics — to inject variety.

Crossover: combining ideas via LLM

Classical crossover swaps subsequences between two parent genomes. AlphaEvolve's crossover is far more powerful: the LLM reads multiple good solutions and synthesizes new code that combines their ideas. It might take an optimizer from one solution and a loss function from another, adapting them to work together. This is semantic crossover — the LLM understands what each piece does.

Mutation: LLM-generated code changes

The LLM proposes modifications to the parent program. These mutations range from small (changing a hyperparameter) to large (rewriting an entire function with a new algorithm). Crucially, the mutations are informed: the LLM sees the evaluation scores, the code, and optional problem context, and reasons about what to change.

FunSearch vs. AlphaEvolve: FunSearch could only evolve a single short function (10-20 lines) in Python. AlphaEvolve evolves entire codebases — hundreds of lines across multiple functions and files, in any programming language. FunSearch optimized one metric; AlphaEvolve optimizes multiple simultaneously. FunSearch used small code-only LLMs; AlphaEvolve uses frontier models with rich context. This is not an incremental improvement — it is a qualitative leap in what the system can discover.
ComponentFunSearch (2023)AlphaEvolve (2025)
ScopeSingle function, 10-20 linesEntire codebase, hundreds of lines
LanguagePython onlyAny language
MetricsSingle objectiveMultiple objectives
LLMsSmall, code-onlyFrontier models (Gemini 2.0)
ContextMinimal (prior solutions)Rich (context, feedback, meta-prompts)
Evaluation≤20 min on 1 CPUHours, parallel, on accelerators
SamplesMillions neededThousands suffice
How does AlphaEvolve implement "crossover" compared to classical evolutionary algorithms?

Chapter 4: The Strassen Discovery

This is the headline result: AlphaEvolve found a way to multiply two 4×4 complex matrices using 48 scalar multiplications, breaking a 56-year-old barrier set by Strassen's algorithm.

Why matrix multiplication matters

Matrix multiplication is the single most important operation in scientific computing. It underpins everything from training neural networks to rendering graphics to simulating physics. The naive algorithm for multiplying two n×n matrices requires n³ multiplications. For n=4, that is 64 multiplications.

In 1969, Strassen showed a clever trick: you can multiply two 2×2 matrices with 7 multiplications instead of 8, by forming specific linear combinations. Apply this recursively: two 4×4 matrices can be broken into 2×2 blocks, and each block multiplication uses Strassen's trick. The result: 7² = 49 multiplications for 4×4 matrices.

For 56 years, 49 was the best known for 4×4 complex matrix multiplication. Nobody could do better.

How AlphaEvolve found 48

Matrix multiplication algorithms can be encoded as tensor decompositions. The product of an m×n matrix and an n×p matrix corresponds to a specific 3D tensor. A decomposition of this tensor into R rank-one terms gives an algorithm that uses R scalar multiplications. So the problem reduces to: find the lowest-rank decomposition of the matrix multiplication tensor.

AlphaEvolve was given:

AlphaEvolve then evolved the entire decomposition algorithm — not just one function, but the optimizer, the loss function, the initialization, the hyperparameters, and even the search strategy. Over 15 mutations, it introduced:

AdamW optimizer
Replaced Adam with AdamW + weight decay for better regularization
Gradient noise
Added stochastic gradient noise for exploration
Discretization loss
Penalized non-integer values to encourage exact decompositions
Target noise
Added noise to the target tensor for robustness
Cyclical annealing
Oscillating clipping thresholds to escape local minima
Soft clipping
Clipped complex-valued factor entries to bounded ranges
"Hallucination" loss
Randomly replaced decomposition values to encourage exploration (the LLM's own creative name)

None of these ideas is individually revolutionary. But the combination — discovered through automated evolution — was enough to push past the 56-year barrier. The final algorithm finds a rank-48 decomposition of the 4×4 complex matrix multiplication tensor.

What makes this remarkable: A general-purpose coding agent, with no domain-specific architecture for tensor decomposition, discovered a result that eluded specialist mathematicians and dedicated systems (including AlphaTensor, which used deep RL specifically designed for tensor decomposition) for over half a century. And the result is provably correct — the decomposition can be verified by direct computation.

The numbers

Matrix size ⟨m,n,p⟩NaivePrevious bestAlphaEvolve
⟨2,4,5⟩403332
⟨3,3,3⟩272323
⟨3,4,6⟩725654
⟨4,4,4⟩6449 (Strassen)48
⟨4,4,5⟩806261
⟨5,5,5⟩1259393

AlphaEvolve improved the state of the art for 14 different matrix multiplication targets. For all targets with m,n,p ≤ 5, it either matched or surpassed the best known solutions.

How did AlphaEvolve find the rank-48 decomposition for 4x4 matrix multiplication?

Chapter 5: Google Infrastructure

AlphaEvolve is not just a research tool. It has been deployed to optimize real production systems at Google, with measurable impact across four layers of the compute stack.

1. Data center scheduling (Borg)

Google's data centers run billions of jobs using the Borg cluster manager. When assigning a job to a machine, Borg uses a scoring heuristic to rank candidate machines. A bad assignment strands resources: a machine runs out of memory but still has free CPU, or vice versa.

AlphaEvolve evolved the scoring heuristic. The result is remarkably simple — just 7 lines of code:

def alpha_evolve_score(required, free):
  cpu_r = required.cpu / free.cpu
  mem_r = required.mem / free.mem
  return -1.0 * (cpu_r + mem_r
    + mem_r / cpu_r
    + cpu_r / mem_r)

The first two terms penalize high resource consumption. The last two terms — the ratios of ratios — penalize imbalanced consumption. A machine that is 80% on CPU but only 20% on memory gets a low score, because it will soon strand memory. This is exactly the right shape for a bin-packing heuristic.

Impact: Deployed fleet-wide, this heuristic continuously recovers 0.7% of Google's entire compute resources. AlphaEvolve was chosen over a deep RL approach because the code solution is interpretable, debuggable, and easy to deploy in a mission-critical system.

2. Gemini kernel engineering

Training Gemini requires optimized matrix multiplication kernels. The key optimization is tiling: dividing a large matrix multiply into smaller tiles that fit into accelerator memory. Choosing the right tile size (M, N, P) for each input shape requires deep knowledge of hardware intricacies.

AlphaEvolve evolved a tiling heuristic that automatically selects near-optimal tile sizes for any input shape. Result: 23% average kernel speedup over the existing expert-designed heuristic, translating to a 1% reduction in Gemini's overall training time. This also reduced kernel optimization time from months of engineering to days of automated search.

Self-improvement loop: Gemini is the LLM that powers AlphaEvolve. By optimizing Gemini's training kernels, AlphaEvolve made its own underlying model train faster. This is the first instance where an LLM, through an agent pipeline, optimized its own training process.

3. TPU circuit design

AlphaEvolve optimized a Verilog implementation of an arithmetic circuit in the TPU matrix multiplication unit. It found a simple code rewrite that removed unnecessary bits, reducing area and power. This is now integrated into an upcoming TPU — Gemini's first direct contribution to TPU hardware.

4. Compiler-generated code (FlashAttention)

AlphaEvolve directly optimized the XLA compiler's intermediate representation for FlashAttention. This is especially impressive because the IR was already highly optimized by the compiler. Result: 32% speedup of the FlashAttention kernel and 15% speedup of input/output processing.

Why was AlphaEvolve chosen over deep RL for the Borg scheduling heuristic?

Chapter 6: Evaluation & Verification

The evaluation system is what makes AlphaEvolve's discoveries trustworthy. Every proposed solution is tested automatically, and mathematical results are formally verified.

The evaluation function

For every problem, the user provides an evaluate function that maps a solution to a dictionary of scalar scores. This is the fitness function. It can be simple (count the size of a graph) or complex (train a model and measure accuracy). AlphaEvolve maximizes these scores.

def evaluate(solution) -> dict[str, float]:
  graph = solution.construct()
  if not satisfies_property(graph):
    return {"score": 0}
  return {"score": len(graph.nodes)}

Evaluation cascade

For expensive evaluations, AlphaEvolve uses a cascade: test on easy cases first, and only promote to harder (more expensive) tests if the solution passes. This is hypothesis testing applied to program evaluation — it prunes bad solutions early without wasting compute.

Stage 1: Syntax check
Does the code even parse?
↓ pass
Stage 2: Quick eval
Run on small inputs, check basic correctness
↓ pass
Stage 3: Full eval
Run on full test suite, measure all metrics
↓ pass
Stage 4: Stress test
Run from many random seeds, measure robustness

LLM-generated feedback

Some quality criteria are hard to capture in a numeric score — for example, "is the code simple?" AlphaEvolve can use separate LLM calls to grade these soft properties and include the feedback in the evaluation scores. This steers evolution toward solutions that are not only effective but also clean and interpretable.

Formal verification for math

For mathematical discoveries, correctness is non-negotiable. When AlphaEvolve claims it found a rank-48 tensor decomposition, the decomposition can be verified by direct computation: multiply the factors together and check that they exactly reproduce the matrix multiplication tensor. For the Strassen result, elements were rounded to the nearest half-integer and verified to be exact.

Why automated evaluation is the key constraint: AlphaEvolve can only be applied to problems where solutions can be automatically evaluated. This covers mathematics, computer science, and system optimization, but not (yet) domains requiring physical experiments or subjective human judgment. The evaluation function is the bridge between LLM creativity and verifiable truth.
What is the purpose of the evaluation cascade?

Chapter 7: Program Database

The program database is AlphaEvolve's memory. It stores every evaluated solution alongside its scores and serves them back to the LLMs as context for future mutations. Getting this right is crucial — it controls the exploration-exploitation tradeoff.

The exploration-exploitation dilemma

If the database always serves the single best solution as the parent, evolution converges fast but gets stuck in local optima. If it serves random solutions, it explores broadly but wastes time on dead ends. AlphaEvolve needs to do both.

MAP-Elites inspiration

The database is inspired by MAP-Elites, an algorithm from evolutionary robotics. MAP-Elites divides the solution space into cells based on behavioral descriptors — different ways a solution can be "different." It keeps the best solution in each cell. This maintains diversity automatically: solutions that are good in different ways all survive.

In AlphaEvolve, the behavioral dimensions come from the multiple evaluation metrics. A solution that is fast but uses a lot of memory occupies a different cell from one that is memory-efficient but slower. Both survive and can contribute ideas to future generations.

Island model

AlphaEvolve also uses an island model: multiple sub-populations evolve semi-independently, with occasional migration between islands. This prevents premature convergence. Different islands may discover different solution strategies, and when they exchange individuals, the LLM can combine ideas from independent evolutionary lineages.

Multiple metrics drive diversity

Even when the user only cares about one metric, AlphaEvolve benefits from optimizing multiple metrics simultaneously. The paper hypothesizes why: programs that excel under different definitions of "good" have structurally different code. By including diverse high-performers in the LLM's prompt, the system generates more varied candidates, increasing the chance of discovering novel approaches for the target metric.

Think of it as a library, not a leaderboard. The program database is not just tracking "what is the best solution." It is maintaining a diverse archive of strategies, patterns, and partial insights. The LLM browses this library for inspiration. A mediocre overall solution might contain one brilliant subroutine that, combined with another solution's loss function, yields a breakthrough.
Why does optimizing multiple metrics help, even when only one metric is the target?

Chapter 8: Results Across Domains

AlphaEvolve's breadth is its most surprising property. The same system — with no task-specific architecture changes — makes progress across mathematics, computer science, and engineering.

Mathematics: 50+ open problems

AlphaEvolve was applied to a curated set of over 50 mathematical problems spanning analysis, combinatorics, number theory, and geometry. Results:

Highlights include:

The key methodological insight: for mathematical problems, AlphaEvolve evolves heuristic search algorithms rather than the constructions themselves. Each generation produces a search program that is given a time budget and the best construction found so far. Evolution selects for search heuristics that improve already-good solutions. The final discoveries emerge from a sequence of specialized heuristics — early ones making large gains from scratch, later ones fine-tuning near-optimal configurations.

Ablation results

Ablation experiments show that every component matters:

AblationEffect on tensor decomposition
No evolution (just repeated LLM calls)Severe degradation — evolution is essential
No context in promptSignificant degradation — the LLM needs problem info
No meta-prompt evolutionModerate degradation — automated prompt tuning helps
No full-file evolution (single function only)Moderate degradation — multi-component changes matter
Small LLM onlySignificant degradation — frontier models contribute key breakthroughs

The ablation on "no evolution" is the most striking: repeatedly querying the LLM with the same initial program, even with a powerful model, performs far worse than the evolutionary approach. Evolution provides sustained, directed improvement that individual LLM calls cannot.

A collaboration between humans and AI: Many of the mathematical problems were suggested by external mathematicians, including Terence Tao and Javier Gomez Serrano, who also advised on formulation. This highlights the potential for synergistic partnerships: humans provide the right problems and evaluation criteria, AlphaEvolve provides tireless, intelligent search.
What does the "no evolution" ablation demonstrate?

Chapter 9: Connections

AlphaEvolve sits at the intersection of several research threads. Understanding where it fits helps you see both what it builds on and where the field is heading.

FunSearch (2023)

AlphaEvolve's direct predecessor. FunSearch used LLM-guided evolution to discover new mathematical constructions, notably for the cap set problem in extremal combinatorics. AlphaEvolve inherits the core idea — LLM as mutation operator — and removes every limitation: scope (single function → full codebase), language (Python only → any), metrics (single → multiple), models (small → frontier).

AlphaTensor (2022)

DeepMind's earlier system for discovering fast matrix multiplication algorithms using deep reinforcement learning. AlphaTensor was purpose-built for tensor decomposition and achieved strong results, but it was a specialized system. AlphaEvolve is general-purpose — it takes the same problem as one of many applications — and achieves better results on several targets.

Evolutionary programming tradition

Genetic programming (GP) has evolved programs since the 1980s. The key limitation was always the mutation operators: random AST manipulations rarely produce valid, meaningful code. AlphaEvolve solves this by using LLMs as the mutation operator. It is, in a sense, the culmination of the GP vision: evolve programs using operators that understand programs.

Agentic AI systems

AlphaEvolve is part of the broader movement toward agentic AI — systems that autonomously plan, execute, and iterate. Compared to other coding agents (SWE-Bench, Devin, etc.), AlphaEvolve focuses on discovery rather than task completion. It does not follow instructions; it searches for novel solutions in an open-ended space.

Self-improving AI

AlphaEvolve optimized its own LLM's training kernels and the FlashAttention implementation used in inference. These are early, modest examples of AI self-improvement: the feedback loops take months and the gains are incremental. But the paper explicitly envisions this becoming more powerful over time — better AlphaEvolve yields better Gemini, which yields better AlphaEvolve.

Test-time compute scaling

AlphaEvolve can be viewed as a test-time compute system: it takes a base LLM and amplifies its capability by spending more compute at inference time (through many evolutionary iterations). The key insight is that machine feedback (code execution + evaluation) sustains this scaling far beyond what self-consistency or chain-of-thought can achieve.

Limitations

The bigger picture: AlphaEvolve demonstrates that the combination of LLM creativity + automated evaluation + evolutionary discipline can achieve what neither component can alone. The LLM provides the ideas, execution provides the truth, and evolution provides the patience. This is a template for applying AI to any domain where you can write a scoring function.

Read the AlphaEvolve paper  |  ← Back to Veanors