An evolutionary coding agent that uses LLMs as mutation operators inside an automated search loop. It discovered the first improvement over Strassen's 1969 matrix-multiplication algorithm, optimized Google's data-center scheduling, and accelerated its own LLM training — all with provably correct results.
Discovering a new algorithm is not like writing code to a spec. There is no specification. You are searching an enormous space of possible programs, and you do not even know what "better" looks like until you try something and measure it.
Think about matrix multiplication. In 1969, Volker Strassen shocked the mathematics community by showing that two 2×2 matrices could be multiplied with 7 scalar multiplications instead of the obvious 8. This single insight cascades: apply it recursively and you get sub-cubic algorithms for all matrix sizes. For 56 years, nobody found a way to do better for 4×4 matrices over the complex numbers. Thousands of researchers tried.
Or consider Google's data-center scheduling. Billions of compute jobs need to be assigned to machines every day. A slightly better heuristic — even 0.7% better — recovers enormous amounts of stranded resources across the fleet. But the search space of possible heuristics is effectively infinite, and each candidate requires realistic simulation to evaluate.
These problems share a structure: the search space is vast, the evaluation is automatable, and small improvements compound into massive impact. Traditional approaches — handcrafted heuristics, brute-force search, reinforcement learning on narrow tasks — have made progress, but they hit walls. Hand-design cannot explore fast enough. RL approaches like AlphaTensor work on single problems but do not generalize.
AlphaEvolve is the successor to FunSearch (Romera-Paredes et al., 2023), which showed that LLM-guided evolution could find new mathematical constructions. But FunSearch was limited: it could only evolve a single short Python function, it optimized a single metric, and it used small LLMs. AlphaEvolve removes all these constraints.
The central idea is beautifully simple: use an LLM as the mutation operator in an evolutionary algorithm, with automated code execution as the fitness function.
In classical evolutionary algorithms, mutations are random: flip a bit, swap two elements, perturb a number. These mutations are blind — they do not understand the structure of the solution. Most random mutations to code produce garbage.
But LLMs understand code. Given a program and a description of what it does, an LLM can propose meaningful changes: swap an optimizer, add a regularization term, restructure a loop. Each proposed change is a hypothesis about how to improve the program. The evolutionary framework then tests these hypotheses automatically and keeps the winners.
Here is the key difference from just asking an LLM to solve a problem:
| Approach | Exploration | Grounding | Duration |
|---|---|---|---|
| Single LLM query | One shot | None (may hallucinate) | Seconds |
| LLM + chain of thought | Sequential reasoning | Self-consistency only | Minutes |
| AlphaEvolve | Population-based, thousands of trials | Code execution + automated evaluation | Hours to days |
AlphaEvolve grounds every suggestion in reality: the proposed code is actually run, and the result is actually measured. This completely sidesteps LLM hallucinations. If the LLM proposes a change that makes the code worse (or breaks it), the evaluator catches it and that variant dies. If the change helps, it survives and seeds the next generation.
The system uses an ensemble of models: Gemini 2.0 Flash for high-throughput exploration (many cheap samples) and Gemini 2.0 Pro for occasional high-quality breakthroughs. Flash generates volume; Pro generates insight. The evolutionary framework harvests the best of both.
AlphaEvolve runs a continuous loop. Each iteration follows six steps:
Let us walk through each step in detail.
The program database holds all previously evaluated solutions with their scores. The system samples a parent program — the solution to be improved — plus zero or more inspiration programs that scored well on different metrics. This is the crossover mechanism: by showing the LLM multiple good solutions, it can combine ideas from different lineages.
The prompt is rich. It includes:
# EVOLVE-BLOCK-START and # EVOLVE-BLOCK-END markers around the parts to modifyThe LLM outputs its proposed changes in a SEARCH/REPLACE format:
<<<<<<< SEARCH return optax.adam(self.learning_rate) ======= return optax.adamw(self.learning_rate, weight_decay=1e-4) >>>>>>> REPLACE
This format is important: it lets the LLM make targeted edits to specific parts of potentially large codebases without rewriting everything. For short programs, the LLM can also output the entire replacement code directly.
The diff is applied to create a child program. The child is executed and scored. If the child improves on any metric, it enters the database and becomes available as a parent or inspiration for future iterations.
AlphaEvolve's evolutionary loop has four moving parts: the population, selection, crossover, and mutation. Each is reimagined for the LLM era.
In classical evolution, the population is a flat set of genomes. AlphaEvolve's population is a structured program database — a collection of programs tagged with their evaluation scores across multiple metrics. The database uses an algorithm inspired by MAP-Elites and island-based models to balance exploration (try new things) with exploitation (refine what works).
When the system needs a parent, it does not just pick the best-scoring program. It samples from the database with a bias toward higher-scoring programs but enough randomness to maintain diversity. It can also sample "inspiration" programs — solutions that excel on different metrics — to inject variety.
Classical crossover swaps subsequences between two parent genomes. AlphaEvolve's crossover is far more powerful: the LLM reads multiple good solutions and synthesizes new code that combines their ideas. It might take an optimizer from one solution and a loss function from another, adapting them to work together. This is semantic crossover — the LLM understands what each piece does.
The LLM proposes modifications to the parent program. These mutations range from small (changing a hyperparameter) to large (rewriting an entire function with a new algorithm). Crucially, the mutations are informed: the LLM sees the evaluation scores, the code, and optional problem context, and reasons about what to change.
| Component | FunSearch (2023) | AlphaEvolve (2025) |
|---|---|---|
| Scope | Single function, 10-20 lines | Entire codebase, hundreds of lines |
| Language | Python only | Any language |
| Metrics | Single objective | Multiple objectives |
| LLMs | Small, code-only | Frontier models (Gemini 2.0) |
| Context | Minimal (prior solutions) | Rich (context, feedback, meta-prompts) |
| Evaluation | ≤20 min on 1 CPU | Hours, parallel, on accelerators |
| Samples | Millions needed | Thousands suffice |
This is the headline result: AlphaEvolve found a way to multiply two 4×4 complex matrices using 48 scalar multiplications, breaking a 56-year-old barrier set by Strassen's algorithm.
Matrix multiplication is the single most important operation in scientific computing. It underpins everything from training neural networks to rendering graphics to simulating physics. The naive algorithm for multiplying two n×n matrices requires n³ multiplications. For n=4, that is 64 multiplications.
In 1969, Strassen showed a clever trick: you can multiply two 2×2 matrices with 7 multiplications instead of 8, by forming specific linear combinations. Apply this recursively: two 4×4 matrices can be broken into 2×2 blocks, and each block multiplication uses Strassen's trick. The result: 7² = 49 multiplications for 4×4 matrices.
For 56 years, 49 was the best known for 4×4 complex matrix multiplication. Nobody could do better.
Matrix multiplication algorithms can be encoded as tensor decompositions. The product of an m×n matrix and an n×p matrix corresponds to a specific 3D tensor. A decomposition of this tensor into R rank-one terms gives an algorithm that uses R scalar multiplications. So the problem reduces to: find the lowest-rank decomposition of the matrix multiplication tensor.
AlphaEvolve was given:
AlphaEvolve then evolved the entire decomposition algorithm — not just one function, but the optimizer, the loss function, the initialization, the hyperparameters, and even the search strategy. Over 15 mutations, it introduced:
None of these ideas is individually revolutionary. But the combination — discovered through automated evolution — was enough to push past the 56-year barrier. The final algorithm finds a rank-48 decomposition of the 4×4 complex matrix multiplication tensor.
| Matrix size ⟨m,n,p⟩ | Naive | Previous best | AlphaEvolve |
|---|---|---|---|
| ⟨2,4,5⟩ | 40 | 33 | 32 |
| ⟨3,3,3⟩ | 27 | 23 | 23 |
| ⟨3,4,6⟩ | 72 | 56 | 54 |
| ⟨4,4,4⟩ | 64 | 49 (Strassen) | 48 |
| ⟨4,4,5⟩ | 80 | 62 | 61 |
| ⟨5,5,5⟩ | 125 | 93 | 93 |
AlphaEvolve improved the state of the art for 14 different matrix multiplication targets. For all targets with m,n,p ≤ 5, it either matched or surpassed the best known solutions.
AlphaEvolve is not just a research tool. It has been deployed to optimize real production systems at Google, with measurable impact across four layers of the compute stack.
Google's data centers run billions of jobs using the Borg cluster manager. When assigning a job to a machine, Borg uses a scoring heuristic to rank candidate machines. A bad assignment strands resources: a machine runs out of memory but still has free CPU, or vice versa.
AlphaEvolve evolved the scoring heuristic. The result is remarkably simple — just 7 lines of code:
def alpha_evolve_score(required, free): cpu_r = required.cpu / free.cpu mem_r = required.mem / free.mem return -1.0 * (cpu_r + mem_r + mem_r / cpu_r + cpu_r / mem_r)
The first two terms penalize high resource consumption. The last two terms — the ratios of ratios — penalize imbalanced consumption. A machine that is 80% on CPU but only 20% on memory gets a low score, because it will soon strand memory. This is exactly the right shape for a bin-packing heuristic.
Impact: Deployed fleet-wide, this heuristic continuously recovers 0.7% of Google's entire compute resources. AlphaEvolve was chosen over a deep RL approach because the code solution is interpretable, debuggable, and easy to deploy in a mission-critical system.
Training Gemini requires optimized matrix multiplication kernels. The key optimization is tiling: dividing a large matrix multiply into smaller tiles that fit into accelerator memory. Choosing the right tile size (M, N, P) for each input shape requires deep knowledge of hardware intricacies.
AlphaEvolve evolved a tiling heuristic that automatically selects near-optimal tile sizes for any input shape. Result: 23% average kernel speedup over the existing expert-designed heuristic, translating to a 1% reduction in Gemini's overall training time. This also reduced kernel optimization time from months of engineering to days of automated search.
AlphaEvolve optimized a Verilog implementation of an arithmetic circuit in the TPU matrix multiplication unit. It found a simple code rewrite that removed unnecessary bits, reducing area and power. This is now integrated into an upcoming TPU — Gemini's first direct contribution to TPU hardware.
AlphaEvolve directly optimized the XLA compiler's intermediate representation for FlashAttention. This is especially impressive because the IR was already highly optimized by the compiler. Result: 32% speedup of the FlashAttention kernel and 15% speedup of input/output processing.
The evaluation system is what makes AlphaEvolve's discoveries trustworthy. Every proposed solution is tested automatically, and mathematical results are formally verified.
For every problem, the user provides an evaluate function that maps a solution to a dictionary of scalar scores. This is the fitness function. It can be simple (count the size of a graph) or complex (train a model and measure accuracy). AlphaEvolve maximizes these scores.
def evaluate(solution) -> dict[str, float]: graph = solution.construct() if not satisfies_property(graph): return {"score": 0} return {"score": len(graph.nodes)}
For expensive evaluations, AlphaEvolve uses a cascade: test on easy cases first, and only promote to harder (more expensive) tests if the solution passes. This is hypothesis testing applied to program evaluation — it prunes bad solutions early without wasting compute.
Some quality criteria are hard to capture in a numeric score — for example, "is the code simple?" AlphaEvolve can use separate LLM calls to grade these soft properties and include the feedback in the evaluation scores. This steers evolution toward solutions that are not only effective but also clean and interpretable.
For mathematical discoveries, correctness is non-negotiable. When AlphaEvolve claims it found a rank-48 tensor decomposition, the decomposition can be verified by direct computation: multiply the factors together and check that they exactly reproduce the matrix multiplication tensor. For the Strassen result, elements were rounded to the nearest half-integer and verified to be exact.
The program database is AlphaEvolve's memory. It stores every evaluated solution alongside its scores and serves them back to the LLMs as context for future mutations. Getting this right is crucial — it controls the exploration-exploitation tradeoff.
If the database always serves the single best solution as the parent, evolution converges fast but gets stuck in local optima. If it serves random solutions, it explores broadly but wastes time on dead ends. AlphaEvolve needs to do both.
The database is inspired by MAP-Elites, an algorithm from evolutionary robotics. MAP-Elites divides the solution space into cells based on behavioral descriptors — different ways a solution can be "different." It keeps the best solution in each cell. This maintains diversity automatically: solutions that are good in different ways all survive.
In AlphaEvolve, the behavioral dimensions come from the multiple evaluation metrics. A solution that is fast but uses a lot of memory occupies a different cell from one that is memory-efficient but slower. Both survive and can contribute ideas to future generations.
AlphaEvolve also uses an island model: multiple sub-populations evolve semi-independently, with occasional migration between islands. This prevents premature convergence. Different islands may discover different solution strategies, and when they exchange individuals, the LLM can combine ideas from independent evolutionary lineages.
Even when the user only cares about one metric, AlphaEvolve benefits from optimizing multiple metrics simultaneously. The paper hypothesizes why: programs that excel under different definitions of "good" have structurally different code. By including diverse high-performers in the LLM's prompt, the system generates more varied candidates, increasing the chance of discovering novel approaches for the target metric.
AlphaEvolve's breadth is its most surprising property. The same system — with no task-specific architecture changes — makes progress across mathematics, computer science, and engineering.
AlphaEvolve was applied to a curated set of over 50 mathematical problems spanning analysis, combinatorics, number theory, and geometry. Results:
Highlights include:
The key methodological insight: for mathematical problems, AlphaEvolve evolves heuristic search algorithms rather than the constructions themselves. Each generation produces a search program that is given a time budget and the best construction found so far. Evolution selects for search heuristics that improve already-good solutions. The final discoveries emerge from a sequence of specialized heuristics — early ones making large gains from scratch, later ones fine-tuning near-optimal configurations.
Ablation experiments show that every component matters:
| Ablation | Effect on tensor decomposition |
|---|---|
| No evolution (just repeated LLM calls) | Severe degradation — evolution is essential |
| No context in prompt | Significant degradation — the LLM needs problem info |
| No meta-prompt evolution | Moderate degradation — automated prompt tuning helps |
| No full-file evolution (single function only) | Moderate degradation — multi-component changes matter |
| Small LLM only | Significant degradation — frontier models contribute key breakthroughs |
The ablation on "no evolution" is the most striking: repeatedly querying the LLM with the same initial program, even with a powerful model, performs far worse than the evolutionary approach. Evolution provides sustained, directed improvement that individual LLM calls cannot.
AlphaEvolve sits at the intersection of several research threads. Understanding where it fits helps you see both what it builds on and where the field is heading.
AlphaEvolve's direct predecessor. FunSearch used LLM-guided evolution to discover new mathematical constructions, notably for the cap set problem in extremal combinatorics. AlphaEvolve inherits the core idea — LLM as mutation operator — and removes every limitation: scope (single function → full codebase), language (Python only → any), metrics (single → multiple), models (small → frontier).
DeepMind's earlier system for discovering fast matrix multiplication algorithms using deep reinforcement learning. AlphaTensor was purpose-built for tensor decomposition and achieved strong results, but it was a specialized system. AlphaEvolve is general-purpose — it takes the same problem as one of many applications — and achieves better results on several targets.
Genetic programming (GP) has evolved programs since the 1980s. The key limitation was always the mutation operators: random AST manipulations rarely produce valid, meaningful code. AlphaEvolve solves this by using LLMs as the mutation operator. It is, in a sense, the culmination of the GP vision: evolve programs using operators that understand programs.
AlphaEvolve is part of the broader movement toward agentic AI — systems that autonomously plan, execute, and iterate. Compared to other coding agents (SWE-Bench, Devin, etc.), AlphaEvolve focuses on discovery rather than task completion. It does not follow instructions; it searches for novel solutions in an open-ended space.
AlphaEvolve optimized its own LLM's training kernels and the FlashAttention implementation used in inference. These are early, modest examples of AI self-improvement: the feedback loops take months and the gains are incremental. But the paper explicitly envisions this becoming more powerful over time — better AlphaEvolve yields better Gemini, which yields better AlphaEvolve.
AlphaEvolve can be viewed as a test-time compute system: it takes a base LLM and amplifies its capability by spending more compute at inference time (through many evolutionary iterations). The key insight is that machine feedback (code execution + evaluation) sustains this scaling far beyond what self-consistency or chain-of-thought can achieve.