AlphaEvolve — Veanors

Chapter 0: The Problem

Discovering a new algorithm is not like writing code to a spec. There is no specification. You are searching an enormous space of possible programs, and you do not even know what "better" looks like until you try something and measure it.

Think about matrix multiplication. In 1969, Volker Strassen shocked the mathematics community by showing that two 2×2 matrices could be multiplied with 7 scalar multiplications instead of the obvious 8. This single insight cascades: apply it recursively and you get sub-cubic algorithms for all matrix sizes. For 56 years, nobody found a way to do better for 4×4 matrices over the complex numbers. Thousands of researchers tried.

Or consider Google's data-center scheduling. Billions of compute jobs need to be assigned to machines every day. A slightly better heuristic — even 0.7% better — recovers enormous amounts of stranded resources across the fleet. But the search space of possible heuristics is effectively infinite, and each candidate requires realistic simulation to evaluate.

These problems share a structure: the search space is vast, the evaluation is automatable, and small improvements compound into massive impact. Traditional approaches — handcrafted heuristics, brute-force search, reinforcement learning on narrow tasks — have made progress, but they hit walls. Hand-design cannot explore fast enough. RL approaches like AlphaTensor work on single problems but do not generalize.

The core question: What if we could automate the entire creative loop — proposing ideas, testing them, keeping the good ones, combining them — using LLMs as the idea engine and code execution as the judge? That is what AlphaEvolve does. It turns the LLM into an evolution operator inside an automated discovery pipeline.

AlphaEvolve is the successor to FunSearch (Romera-Paredes et al., 2023), which showed that LLM-guided evolution could find new mathematical constructions. But FunSearch was limited: it could only evolve a single short Python function, it optimized a single metric, and it used small LLMs. AlphaEvolve removes all these constraints.

Why is algorithmic discovery fundamentally harder than software engineering?

There is no specification — you are searching a vast space of programs where "better" can only be determined by automated evaluation, and small improvements require sustained creative exploration The code is too long to write by hand LLMs are not good at writing code

Chapter 1: The Key Insight

The central idea is beautifully simple: use an LLM as the mutation operator in an evolutionary algorithm, with automated code execution as the fitness function.

In classical evolutionary algorithms, mutations are random: flip a bit, swap two elements, perturb a number. These mutations are blind — they do not understand the structure of the solution. Most random mutations to code produce garbage.

But LLMs understand code. Given a program and a description of what it does, an LLM can propose meaningful changes: swap an optimizer, add a regularization term, restructure a loop. Each proposed change is a hypothesis about how to improve the program. The evolutionary framework then tests these hypotheses automatically and keeps the winners.

Why this works: Evolution provides the exploration discipline — maintaining diversity, selecting for fitness, combining good ideas. The LLM provides the intelligence of the mutations — understanding code semantics, drawing on its training knowledge, proposing changes that are syntactically valid and semantically meaningful. Neither alone is sufficient. The LLM alone will hallucinate or get stuck in local optima. Evolution alone with random mutations is too slow. Together, they search effectively.

Here is the key difference from just asking an LLM to solve a problem:

Approach	Exploration	Grounding	Duration
Single LLM query	One shot	None (may hallucinate)	Seconds
LLM + chain of thought	Sequential reasoning	Self-consistency only	Minutes
AlphaEvolve	Population-based, thousands of trials	Code execution + automated evaluation	Hours to days

AlphaEvolve grounds every suggestion in reality: the proposed code is actually run, and the result is actually measured. This completely sidesteps LLM hallucinations. If the LLM proposes a change that makes the code worse (or breaks it), the evaluator catches it and that variant dies. If the change helps, it survives and seeds the next generation.

The system uses an ensemble of models: Gemini 2.0 Flash for high-throughput exploration (many cheap samples) and Gemini 2.0 Pro for occasional high-quality breakthroughs. Flash generates volume; Pro generates insight. The evolutionary framework harvests the best of both.

What makes AlphaEvolve's mutations fundamentally different from classical evolutionary algorithm mutations?

LLMs propose semantically meaningful code changes informed by their training knowledge, rather than blind random perturbations — so most mutations are valid programs with intentional modifications AlphaEvolve uses larger populations AlphaEvolve runs faster

Chapter 2: The Pipeline

AlphaEvolve runs a continuous loop. Each iteration follows six steps:

1. Sample parent

Pick a program from the database + optional "inspiration" programs

↓

2. Build prompt

Assemble parent code + scores + context + instructions into an LLM prompt

↓

3. LLM generates diff

The LLM proposes code modifications as SEARCH/REPLACE blocks

↓

4. Apply diff

Patch the parent program to create a child program

↓

5. Evaluate

Run the child program and score it with automated evaluators

↓

6. Update database

If promising, add the child to the program database for future iterations

Let us walk through each step in detail.

Step 1: Sample a parent

The program database holds all previously evaluated solutions with their scores. The system samples a parent program — the solution to be improved — plus zero or more inspiration programs that scored well on different metrics. This is the crossover mechanism: by showing the LLM multiple good solutions, it can combine ideas from different lineages.

Step 2: Build the prompt

The prompt is rich. It includes:

The parent program's full code, with # EVOLVE-BLOCK-START and # EVOLVE-BLOCK-END markers around the parts to modify
The parent's evaluation scores
Inspiration programs and their scores
System instructions explaining the diff format
Optional: problem context (equations, papers), stochastic formatting for diversity, and LLM-generated meta-prompts

Step 3: LLM generates a diff

The LLM outputs its proposed changes in a SEARCH/REPLACE format:

<<<<<<< SEARCH
return optax.adam(self.learning_rate)
=======
return optax.adamw(self.learning_rate, weight_decay=1e-4)
>>>>>>> REPLACE

This format is important: it lets the LLM make targeted edits to specific parts of potentially large codebases without rewriting everything. For short programs, the LLM can also output the entire replacement code directly.

Step 4–6: Apply, evaluate, store

The diff is applied to create a child program. The child is executed and scored. If the child improves on any metric, it enters the database and becomes available as a parent or inspiration for future iterations.

Asynchronous execution: The pipeline is fully asynchronous. Many LLM queries and evaluations run concurrently. The system is optimized for throughput — maximize the number of ideas explored per unit time — rather than the speed of any single evaluation. This is why AlphaEvolve succeeds with just thousands of samples (not millions), since each sample is a high-quality, semantically meaningful mutation.

Why does AlphaEvolve use a SEARCH/REPLACE diff format instead of generating complete programs?

Diffs allow targeted edits to specific parts of large codebases, letting AlphaEvolve evolve entire files (hundreds of lines) without regenerating unchanged code Complete programs are too long for LLMs to generate The diff format is faster to parse

Chapter 3: Evolutionary Components

AlphaEvolve's evolutionary loop has four moving parts: the population, selection, crossover, and mutation. Each is reimagined for the LLM era.

Population: the program database

In classical evolution, the population is a flat set of genomes. AlphaEvolve's population is a structured program database — a collection of programs tagged with their evaluation scores across multiple metrics. The database uses an algorithm inspired by MAP-Elites and island-based models to balance exploration (try new things) with exploitation (refine what works).

Selection: sampling parents

When the system needs a parent, it does not just pick the best-scoring program. It samples from the database with a bias toward higher-scoring programs but enough randomness to maintain diversity. It can also sample "inspiration" programs — solutions that excel on different metrics — to inject variety.

Crossover: combining ideas via LLM

Classical crossover swaps subsequences between two parent genomes. AlphaEvolve's crossover is far more powerful: the LLM reads multiple good solutions and synthesizes new code that combines their ideas. It might take an optimizer from one solution and a loss function from another, adapting them to work together. This is semantic crossover — the LLM understands what each piece does.

Mutation: LLM-generated code changes

The LLM proposes modifications to the parent program. These mutations range from small (changing a hyperparameter) to large (rewriting an entire function with a new algorithm). Crucially, the mutations are informed: the LLM sees the evaluation scores, the code, and optional problem context, and reasons about what to change.

FunSearch vs. AlphaEvolve: FunSearch could only evolve a single short function (10-20 lines) in Python. AlphaEvolve evolves entire codebases — hundreds of lines across multiple functions and files, in any programming language. FunSearch optimized one metric; AlphaEvolve optimizes multiple simultaneously. FunSearch used small code-only LLMs; AlphaEvolve uses frontier models with rich context. This is not an incremental improvement — it is a qualitative leap in what the system can discover.

Component	FunSearch (2023)	AlphaEvolve (2025)
Scope	Single function, 10-20 lines	Entire codebase, hundreds of lines
Language	Python only	Any language
Metrics	Single objective	Multiple objectives
LLMs	Small, code-only	Frontier models (Gemini 2.0)
Context	Minimal (prior solutions)	Rich (context, feedback, meta-prompts)
Evaluation	≤20 min on 1 CPU	Hours, parallel, on accelerators
Samples	Millions needed	Thousands suffice

How does AlphaEvolve implement "crossover" compared to classical evolutionary algorithms?

The LLM reads multiple high-scoring solutions and synthesizes new code that semantically combines their ideas, rather than blindly swapping code subsequences It randomly swaps functions between two programs It does not implement crossover at all

Chapter 4: The Strassen Discovery

This is the headline result: AlphaEvolve found a way to multiply two 4×4 complex matrices using 48 scalar multiplications, breaking a 56-year-old barrier set by Strassen's algorithm.

Why matrix multiplication matters

Matrix multiplication is the single most important operation in scientific computing. It underpins everything from training neural networks to rendering graphics to simulating physics. The naive algorithm for multiplying two n×n matrices requires n³ multiplications. For n=4, that is 64 multiplications.

In 1969, Strassen showed a clever trick: you can multiply two 2×2 matrices with 7 multiplications instead of 8, by forming specific linear combinations. Apply this recursively: two 4×4 matrices can be broken into 2×2 blocks, and each block multiplication uses Strassen's trick. The result: 7² = 49 multiplications for 4×4 matrices.

For 56 years, 49 was the best known for 4×4 complex matrix multiplication. Nobody could do better.

How AlphaEvolve found 48

Matrix multiplication algorithms can be encoded as tensor decompositions. The product of an m×n matrix and an n×p matrix corresponds to a specific 3D tensor. A decomposition of this tensor into R rank-one terms gives an algorithm that uses R scalar multiplications. So the problem reduces to: find the lowest-rank decomposition of the matrix multiplication tensor.

AlphaEvolve was given:

An initial program: a standard gradient-based tensor decomposition algorithm (Adam optimizer, reconstruction loss, random initialization)
An evaluation function: run the decomposition from multiple random seeds, report the best rank achieved

AlphaEvolve then evolved the entire decomposition algorithm — not just one function, but the optimizer, the loss function, the initialization, the hyperparameters, and even the search strategy. Over 15 mutations, it introduced:

AdamW optimizer

Replaced Adam with AdamW + weight decay for better regularization

↓

Gradient noise

Added stochastic gradient noise for exploration

↓

Discretization loss

Penalized non-integer values to encourage exact decompositions

↓

Target noise

Added noise to the target tensor for robustness

↓

Cyclical annealing

Oscillating clipping thresholds to escape local minima

↓

Soft clipping

Clipped complex-valued factor entries to bounded ranges

↓

"Hallucination" loss

Randomly replaced decomposition values to encourage exploration (the LLM's own creative name)

None of these ideas is individually revolutionary. But the combination — discovered through automated evolution — was enough to push past the 56-year barrier. The final algorithm finds a rank-48 decomposition of the 4×4 complex matrix multiplication tensor.

What makes this remarkable: A general-purpose coding agent, with no domain-specific architecture for tensor decomposition, discovered a result that eluded specialist mathematicians and dedicated systems (including AlphaTensor, which used deep RL specifically designed for tensor decomposition) for over half a century. And the result is provably correct — the decomposition can be verified by direct computation.

The numbers

Matrix size ⟨m,n,p⟩	Naive	Previous best	AlphaEvolve
⟨2,4,5⟩	40	33	32
⟨3,3,3⟩	27	23	23
⟨3,4,6⟩	72	56	54
⟨4,4,4⟩	64	49 (Strassen)	48
⟨4,4,5⟩	80	62	61
⟨5,5,5⟩	125	93	93

AlphaEvolve improved the state of the art for 14 different matrix multiplication targets. For all targets with m,n,p ≤ 5, it either matched or surpassed the best known solutions.

How did AlphaEvolve find the rank-48 decomposition for 4x4 matrix multiplication?

It evolved the entire tensor decomposition algorithm — optimizer, loss function, initialization, and search strategy — through 15 mutations, discovering techniques like discretization loss and cyclical annealing that together pushed past the 56-year barrier It used reinforcement learning to search over possible decompositions directly It enumerated all possible rank-48 decompositions

Chapter 5: Google Infrastructure

AlphaEvolve is not just a research tool. It has been deployed to optimize real production systems at Google, with measurable impact across four layers of the compute stack.

1. Data center scheduling (Borg)

Google's data centers run billions of jobs using the Borg cluster manager. When assigning a job to a machine, Borg uses a scoring heuristic to rank candidate machines. A bad assignment strands resources: a machine runs out of memory but still has free CPU, or vice versa.

AlphaEvolve evolved the scoring heuristic. The result is remarkably simple — just 7 lines of code:

def alpha_evolve_score(required, free):
  cpu_r = required.cpu / free.cpu
  mem_r = required.mem / free.mem
  return -1.0 * (cpu_r + mem_r
    + mem_r / cpu_r
    + cpu_r / mem_r)

The first two terms penalize high resource consumption. The last two terms — the ratios of ratios — penalize imbalanced consumption. A machine that is 80% on CPU but only 20% on memory gets a low score, because it will soon strand memory. This is exactly the right shape for a bin-packing heuristic.

Impact: Deployed fleet-wide, this heuristic continuously recovers 0.7% of Google's entire compute resources. AlphaEvolve was chosen over a deep RL approach because the code solution is interpretable, debuggable, and easy to deploy in a mission-critical system.

2. Gemini kernel engineering

Training Gemini requires optimized matrix multiplication kernels. The key optimization is tiling: dividing a large matrix multiply into smaller tiles that fit into accelerator memory. Choosing the right tile size (M, N, P) for each input shape requires deep knowledge of hardware intricacies.

AlphaEvolve evolved a tiling heuristic that automatically selects near-optimal tile sizes for any input shape. Result: 23% average kernel speedup over the existing expert-designed heuristic, translating to a 1% reduction in Gemini's overall training time. This also reduced kernel optimization time from months of engineering to days of automated search.

Self-improvement loop: Gemini is the LLM that powers AlphaEvolve. By optimizing Gemini's training kernels, AlphaEvolve made its own underlying model train faster. This is the first instance where an LLM, through an agent pipeline, optimized its own training process.

3. TPU circuit design

AlphaEvolve optimized a Verilog implementation of an arithmetic circuit in the TPU matrix multiplication unit. It found a simple code rewrite that removed unnecessary bits, reducing area and power. This is now integrated into an upcoming TPU — Gemini's first direct contribution to TPU hardware.

4. Compiler-generated code (FlashAttention)

AlphaEvolve directly optimized the XLA compiler's intermediate representation for FlashAttention. This is especially impressive because the IR was already highly optimized by the compiler. Result: 32% speedup of the FlashAttention kernel and 15% speedup of input/output processing.

Why was AlphaEvolve chosen over deep RL for the Borg scheduling heuristic?

AlphaEvolve's code solution is interpretable, debuggable, and easy to deploy in a mission-critical system — essential qualities that a neural network policy cannot provide Deep RL was too slow to train Deep RL could not be applied to scheduling problems

Chapter 6: Evaluation & Verification

The evaluation system is what makes AlphaEvolve's discoveries trustworthy. Every proposed solution is tested automatically, and mathematical results are formally verified.

The evaluation function

For every problem, the user provides an evaluate function that maps a solution to a dictionary of scalar scores. This is the fitness function. It can be simple (count the size of a graph) or complex (train a model and measure accuracy). AlphaEvolve maximizes these scores.

def evaluate(solution) -> dict[str, float]:
  graph = solution.construct()
  if not satisfies_property(graph):
    return {"score": 0}
  return {"score": len(graph.nodes)}

Evaluation cascade

For expensive evaluations, AlphaEvolve uses a cascade: test on easy cases first, and only promote to harder (more expensive) tests if the solution passes. This is hypothesis testing applied to program evaluation — it prunes bad solutions early without wasting compute.

Stage 1: Syntax check

Does the code even parse?

↓ pass

Stage 2: Quick eval

Run on small inputs, check basic correctness

↓ pass

Stage 3: Full eval

Run on full test suite, measure all metrics

↓ pass

Stage 4: Stress test

Run from many random seeds, measure robustness

LLM-generated feedback

Some quality criteria are hard to capture in a numeric score — for example, "is the code simple?" AlphaEvolve can use separate LLM calls to grade these soft properties and include the feedback in the evaluation scores. This steers evolution toward solutions that are not only effective but also clean and interpretable.

Formal verification for math

For mathematical discoveries, correctness is non-negotiable. When AlphaEvolve claims it found a rank-48 tensor decomposition, the decomposition can be verified by direct computation: multiply the factors together and check that they exactly reproduce the matrix multiplication tensor. For the Strassen result, elements were rounded to the nearest half-integer and verified to be exact.

Why automated evaluation is the key constraint: AlphaEvolve can only be applied to problems where solutions can be automatically evaluated. This covers mathematics, computer science, and system optimization, but not (yet) domains requiring physical experiments or subjective human judgment. The evaluation function is the bridge between LLM creativity and verifiable truth.

What is the purpose of the evaluation cascade?

It tests solutions on progressively harder stages, pruning bad candidates early to avoid wasting expensive compute on full evaluation of broken or low-quality programs It runs multiple LLMs in sequence for better code quality It trains the LLM on evaluation feedback

Chapter 7: Program Database

The program database is AlphaEvolve's memory. It stores every evaluated solution alongside its scores and serves them back to the LLMs as context for future mutations. Getting this right is crucial — it controls the exploration-exploitation tradeoff.

The exploration-exploitation dilemma

If the database always serves the single best solution as the parent, evolution converges fast but gets stuck in local optima. If it serves random solutions, it explores broadly but wastes time on dead ends. AlphaEvolve needs to do both.

MAP-Elites inspiration

The database is inspired by MAP-Elites, an algorithm from evolutionary robotics. MAP-Elites divides the solution space into cells based on behavioral descriptors — different ways a solution can be "different." It keeps the best solution in each cell. This maintains diversity automatically: solutions that are good in different ways all survive.

In AlphaEvolve, the behavioral dimensions come from the multiple evaluation metrics. A solution that is fast but uses a lot of memory occupies a different cell from one that is memory-efficient but slower. Both survive and can contribute ideas to future generations.

Island model

AlphaEvolve also uses an island model: multiple sub-populations evolve semi-independently, with occasional migration between islands. This prevents premature convergence. Different islands may discover different solution strategies, and when they exchange individuals, the LLM can combine ideas from independent evolutionary lineages.

Multiple metrics drive diversity

Even when the user only cares about one metric, AlphaEvolve benefits from optimizing multiple metrics simultaneously. The paper hypothesizes why: programs that excel under different definitions of "good" have structurally different code. By including diverse high-performers in the LLM's prompt, the system generates more varied candidates, increasing the chance of discovering novel approaches for the target metric.

Think of it as a library, not a leaderboard. The program database is not just tracking "what is the best solution." It is maintaining a diverse archive of strategies, patterns, and partial insights. The LLM browses this library for inspiration. A mediocre overall solution might contain one brilliant subroutine that, combined with another solution's loss function, yields a breakthrough.

Why does optimizing multiple metrics help, even when only one metric is the target?

Programs excelling under different metrics have structurally different code, and showing these diverse solutions to the LLM generates more varied candidates that are more likely to contain novel approaches for the target metric Multiple metrics make the evaluation function more accurate It prevents the LLM from overfitting to one metric

Chapter 8: Results Across Domains

AlphaEvolve's breadth is its most surprising property. The same system — with no task-specific architecture changes — makes progress across mathematics, computer science, and engineering.

Mathematics: 50+ open problems

AlphaEvolve was applied to a curated set of over 50 mathematical problems spanning analysis, combinatorics, number theory, and geometry. Results:

~75% of problems: matched the best known constructions (many likely already optimal)
~20% of problems: discovered new constructions that surpass the state of the art

Highlights include:

Kissing number in 11 dimensions: Found 593 non-overlapping unit spheres touching a central sphere, beating the previous record of 592
Erdős minimum overlap problem: New upper bound, improving on the previous record
Autocorrelation inequalities: Improved best known bounds on several problems in Fourier analysis
Packing problems: New results for polygon packing and variants of the Heilbronn problem

The key methodological insight: for mathematical problems, AlphaEvolve evolves heuristic search algorithms rather than the constructions themselves. Each generation produces a search program that is given a time budget and the best construction found so far. Evolution selects for search heuristics that improve already-good solutions. The final discoveries emerge from a sequence of specialized heuristics — early ones making large gains from scratch, later ones fine-tuning near-optimal configurations.

Ablation results

Ablation experiments show that every component matters:

Ablation	Effect on tensor decomposition
No evolution (just repeated LLM calls)	Severe degradation — evolution is essential
No context in prompt	Significant degradation — the LLM needs problem info
No meta-prompt evolution	Moderate degradation — automated prompt tuning helps
No full-file evolution (single function only)	Moderate degradation — multi-component changes matter
Small LLM only	Significant degradation — frontier models contribute key breakthroughs

The ablation on "no evolution" is the most striking: repeatedly querying the LLM with the same initial program, even with a powerful model, performs far worse than the evolutionary approach. Evolution provides sustained, directed improvement that individual LLM calls cannot.

A collaboration between humans and AI: Many of the mathematical problems were suggested by external mathematicians, including Terence Tao and Javier Gomez Serrano, who also advised on formulation. This highlights the potential for synergistic partnerships: humans provide the right problems and evaluation criteria, AlphaEvolve provides tireless, intelligent search.

What does the "no evolution" ablation demonstrate?

Repeatedly querying an LLM with the same initial program performs far worse than the evolutionary approach — evolution provides sustained, directed improvement that single-shot LLM calls cannot achieve LLMs cannot generate code without evolutionary guidance Evolution is only useful for simple problems

Chapter 9: Connections

AlphaEvolve sits at the intersection of several research threads. Understanding where it fits helps you see both what it builds on and where the field is heading.

FunSearch (2023)

AlphaEvolve's direct predecessor. FunSearch used LLM-guided evolution to discover new mathematical constructions, notably for the cap set problem in extremal combinatorics. AlphaEvolve inherits the core idea — LLM as mutation operator — and removes every limitation: scope (single function → full codebase), language (Python only → any), metrics (single → multiple), models (small → frontier).

AlphaTensor (2022)

DeepMind's earlier system for discovering fast matrix multiplication algorithms using deep reinforcement learning. AlphaTensor was purpose-built for tensor decomposition and achieved strong results, but it was a specialized system. AlphaEvolve is general-purpose — it takes the same problem as one of many applications — and achieves better results on several targets.

Evolutionary programming tradition

Genetic programming (GP) has evolved programs since the 1980s. The key limitation was always the mutation operators: random AST manipulations rarely produce valid, meaningful code. AlphaEvolve solves this by using LLMs as the mutation operator. It is, in a sense, the culmination of the GP vision: evolve programs using operators that understand programs.

Agentic AI systems

AlphaEvolve is part of the broader movement toward agentic AI — systems that autonomously plan, execute, and iterate. Compared to other coding agents (SWE-Bench, Devin, etc.), AlphaEvolve focuses on discovery rather than task completion. It does not follow instructions; it searches for novel solutions in an open-ended space.

Self-improving AI

AlphaEvolve optimized its own LLM's training kernels and the FlashAttention implementation used in inference. These are early, modest examples of AI self-improvement: the feedback loops take months and the gains are incremental. But the paper explicitly envisions this becoming more powerful over time — better AlphaEvolve yields better Gemini, which yields better AlphaEvolve.

Test-time compute scaling

AlphaEvolve can be viewed as a test-time compute system: it takes a base LLM and amplifies its capability by spending more compute at inference time (through many evolutionary iterations). The key insight is that machine feedback (code execution + evaluation) sustains this scaling far beyond what self-consistency or chain-of-thought can achieve.

Limitations

Requires an automated evaluator — cannot handle problems needing physical experiments or human judgment
LLM-generated evaluation (for soft criteria like code simplicity) is not yet a focus
Compute cost is significant: each run uses many LLM queries and evaluation hours
The system finds heuristics and constructions, not proofs — mathematical insight must come from elsewhere

The bigger picture: AlphaEvolve demonstrates that the combination of LLM creativity + automated evaluation + evolutionary discipline can achieve what neither component can alone. The LLM provides the ideas, execution provides the truth, and evolution provides the patience. This is a template for applying AI to any domain where you can write a scoring function.

Read the AlphaEvolve paper | ← Back to Veanors

AlphaEvolve: A Coding Agent for Scientific Discovery

Chapter 0: The Problem

Chapter 1: The Key Insight

Chapter 2: The Pipeline

Step 1: Sample a parent

Step 2: Build the prompt

Step 3: LLM generates a diff

Step 4–6: Apply, evaluate, store

Chapter 3: Evolutionary Components

Population: the program database

Selection: sampling parents

Crossover: combining ideas via LLM

Mutation: LLM-generated code changes

Chapter 4: The Strassen Discovery

Why matrix multiplication matters

How AlphaEvolve found 48

The numbers

Chapter 5: Google Infrastructure

1. Data center scheduling (Borg)

2. Gemini kernel engineering

3. TPU circuit design

4. Compiler-generated code (FlashAttention)

Chapter 6: Evaluation & Verification

The evaluation function

Evaluation cascade

LLM-generated feedback

Formal verification for math

Chapter 7: Program Database

The exploration-exploitation dilemma

MAP-Elites inspiration

Island model

Multiple metrics drive diversity

Chapter 8: Results Across Domains

Mathematics: 50+ open problems

Ablation results

Chapter 9: Connections

FunSearch (2023)

AlphaTensor (2022)

Evolutionary programming tradition

Agentic AI systems

Self-improving AI

Test-time compute scaling

Limitations