How DeepMind used LLMs not to answer questions, but to evolve programs that make genuine mathematical discoveries — including a 20-year breakthrough in extremal combinatorics.
You ask an LLM to solve an open mathematical problem. It writes a confident, fluent answer. The math looks right. The notation is perfect. But when you check the actual result, it's wrong. Not subtly wrong — fundamentally wrong. The LLM hallucinated a proof that proves nothing.
This is not a fluke. LLMs suffer from hallucinations — they generate plausible-sounding but factually incorrect outputs. For creative writing, this is a feature. For mathematics and science, it's fatal. A single wrong step in a proof invalidates everything that follows. A single wrong digit in a construction makes it worthless.
The problem gets worse as questions get harder. For well-known textbook problems, LLMs can often retrieve memorized solutions. But for open problems — questions where humanity doesn't know the answer yet — there's nothing to retrieve. The LLM is forced to reason, and it hallucinates.
You might think: "Just prompt the LLM more carefully." Researchers tried. Chain-of-thought prompting, self-consistency, multi-step verification — these help on known problems. But they all share a fundamental flaw: the LLM is both the generator and the verifier. It's checking its own homework.
What if, instead of trusting the LLM's answer, we could actually run the answer and measure whether it works? What if the LLM's output was not a final answer, but a program that computes an answer — and we could execute that program to check?
Click to see what happens when we ask an LLM for a direct answer versus a program. The program can be verified by running it.
This is the insight that launched FunSearch. Published in Nature in January 2024 by Romera-Paredes et al. at DeepMind, FunSearch doesn't ask an LLM to solve math problems. It asks the LLM to write programs that solve math problems — then runs those programs and scores the results. The LLM provides creativity. A computer provides verification. Together, they made genuine mathematical discoveries.
FunSearch stands for "searching in the function space." That name encodes the entire idea: instead of searching for a mathematical object (a number, a set, a proof), search for a function — a program — that constructs the object. Then run the program and check if the object is any good.
This reframing changes everything. A mathematical answer is opaque — you can't tell if it's correct without understanding the full proof. But a program is executable. You run it, you get an output, and you measure that output against a known scoring function. No trust required. No hallucination risk. The program either produces a valid construction or it doesn't.
FunSearch maintains a database of programs, ranked by how well they score. At each iteration, it samples a few programs from this database, combines them into a prompt, and sends them to a pretrained LLM. The LLM generates a new program (a variation or improvement). This new program is executed in a sandbox, scored, and — if it's correct and scores well — added back to the database. Repeat millions of times.
In traditional evolutionary algorithms, mutations are random: flip a bit, change a number, swap two lines. These random mutations almost never produce meaningful improvements on complex programs. The LLM replaces this with intelligent mutation. It reads existing programs, understands their structure, and proposes new variations that are syntactically valid and semantically meaningful.
Think of the LLM as a junior programmer who can read two working solutions and suggest a creative improvement. Most suggestions won't be better — but some will. And unlike random mutation, the LLM's suggestions at least compile and make structural sense.
Many systems run LLM-generated code. FunSearch is different in three crucial ways:
| Aspect | Typical LLM + Code | FunSearch |
|---|---|---|
| Goal | Solve a known problem | Discover new solutions to open problems |
| Iterations | 1-10 attempts | Millions of attempts |
| Memory | No memory across attempts | Programs database accumulates best solutions |
| Diversity | Single thread | Island-based populations prevent premature convergence |
| Output | A working solution | Programs better than any human-written solution |
Imagine you're breeding roses. If you have one garden and always pick the reddest rose to breed next, you'll quickly converge to one particular shade of red — and you'll never discover that a completely different genetic line could produce a deeper, richer red. You got stuck in a local optimum.
FunSearch faces the same risk. If it keeps all programs in one pool and always samples the best, it will quickly converge to minor variations of the first good solution it finds. To prevent this, FunSearch borrows a technique from evolutionary computation called the island model.
The programs database is split into multiple islands (also called subpopulations). Each island is created and evolved independently. When sampling programs to build a prompt, FunSearch first picks an island, then samples programs from within that island. The new program generated by the LLM is added back to the same island.
Periodically, a reset event occurs. The islands in the bottom half (worst average scores) are wiped clean. Their programs are discarded. Each empty island is then re-initialized by cloning the best individual from one of the surviving islands. Evolution continues from these fresh starting points.
Inside each island, programs are further organized into clusters based on their signature — the tuple of scores they achieve on different test inputs. Two programs with the same signature (same scores on every input) are grouped together. Within a cluster, shorter programs are preferred (they're simpler, more generalizable).
When sampling from an island, FunSearch first picks a cluster (favoring higher-scoring ones), then picks a program from within that cluster (favoring shorter ones). This two-level sampling ensures diversity: instead of always picking the single highest-scoring program, it samples from different behavioral clusters.
Watch 5 islands evolve independently. Each dot is a program; vertical position = score. Periodically, the worst islands reset from the best. Notice how diversity is maintained.
To understand what FunSearch discovered, we need to understand the problem it attacked. The cap set problem is a famous open question in a branch of mathematics called extremal combinatorics — the study of "how large can a structure be while obeying certain constraints?"
Imagine you have cards, each showing a string of digits. Each digit can be 0, 1, or 2. With 3 digits per card, there are 33 = 27 possible cards: 000, 001, 002, 010, ..., 222. Now, three cards form a line (also called an "arithmetic progression" in Z3n) if, in every digit position, the three values are either all the same or all different.
For example, with n=3 digits:
| Card A | Card B | Card C | Position 1 | Position 2 | Position 3 | Line? |
|---|---|---|---|---|---|---|
| 0 1 2 | 1 1 2 | 2 1 2 | 0,1,2 (all diff) | 1,1,1 (all same) | 2,2,2 (all same) | Yes |
| 0 0 0 | 1 1 1 | 2 2 2 | 0,1,2 (all diff) | 0,1,2 (all diff) | 0,1,2 (all diff) | Yes |
| 0 0 1 | 1 0 2 | 2 1 0 | 0,1,2 (all diff) | 0,0,1 (mixed!) | 1,2,0 (all diff) | No |
A cap set is a collection of cards such that no three cards form a line. The cap set problem asks: what is the largest possible cap set in n dimensions?
The cap set problem has been studied for decades. Here's what was known:
| Dimension n | 3 | 4 | 5 | 6 | 7 | 8 |
|---|---|---|---|---|---|---|
| Best known | 9 | 20 | 45 | 112 | 236 | 496 |
For n=8, the best known cap set had 496 elements. This record had stood for 20 years. The exact maximum for n=8 was unknown — 496 was a lower bound, and the upper bound was much higher.
FunSearch didn't just match the known results — it broke the 20-year record for n=8, finding a cap set of size 512. It also found constructions matching the best known sizes for n=3 through 7.
| Dimension n | 3 | 4 | 5 | 6 | 7 | 8 |
|---|---|---|---|---|---|---|
| Best known | 9 | 20 | 45 | 112 | 236 | 496 |
| FunSearch | 9 | 20 | 45 | 112 | 236 | 512 |
FunSearch didn't search for cap sets directly. Instead, it evolved a priority function — a program that assigns a score to each potential element, telling a greedy algorithm which elements to add first. The greedy algorithm builds the cap set one element at a time, always adding the highest-priority valid element.
python def solve(n): """Build a cap set of dimension n using priority.""" elements = all_elements(n) # all 3^n vectors scores = [priority(el, n) for el in elements] order = argsort(scores, descending=True) capset = [] for el in elements[order]: if can_be_added(el, capset): # no 3 collinear? capset.append(el) return capset
The function priority(element, n) is the only part FunSearch evolves. Everything else — the greedy loop, the validity check, the element enumeration — is fixed boilerplate provided in the skeleton. FunSearch focuses the LLM's creativity on exactly the part that matters most.
The cap set result proved FunSearch could make mathematical discoveries. But could it solve practical problems too? The team applied FunSearch to online bin packing, a classic problem in combinatorial optimization with real-world applications from logistics to cloud computing.
You have an infinite supply of bins, each with capacity 1.0. Items arrive one at a time, each with a size between 0 and 1. When an item arrives, you must immediately place it in a bin (you can't wait to see future items). Your goal: use as few bins as possible.
This is online bin packing — you make irrevocable decisions as items arrive, without knowing what comes next. It's like packing boxes into a moving truck one-by-one, without seeing the remaining pile. Every logistics company, every cloud scheduler, every warehouse manager faces this problem daily.
Decades of research produced well-known heuristics:
| Heuristic | Strategy | Weakness |
|---|---|---|
| First Fit | Place item in the first bin it fits | Leaves scattered gaps |
| Best Fit | Place item in the bin with least remaining space where it fits | Packs too tightly too early |
| First Fit Decreasing | Sort items large-to-small, then First Fit | Requires seeing all items (offline only) |
For bin packing, FunSearch evolved a heuristic function that scores each bin for each incoming item. Given an item and the array of current bin capacities, the function returns a priority score for each bin. The item goes into the highest-scoring bin.
python def heuristic(item: float, bins: np.ndarray) -> np.ndarray: """Return a score for each bin. Item goes in highest-scoring bin.""" score = 1000 * np.ones(bins.shape) # Penalize bins with large remaining capacity score -= bins - item # Find the best-fit bin index = np.argmin(bins) # Bonus: also check if fit is very tight score[index] -= (bins[index] - item) ** 2 return score
The evolved heuristics consistently outperformed First Fit and Best Fit across multiple benchmark distributions (OR1-OR5, Weibull1-Weibull5). The improvements were robust — not tuned to one particular distribution but generalizing across all of them.
| Method | OR1 | OR2 | OR3 | OR4 | Weibull1 |
|---|---|---|---|---|---|
| First Fit | 6.42% | 6.45% | 5.74% | 5.23% | 4.23% |
| Best Fit | 5.81% | 6.06% | 5.27% | 4.44% | 3.88% |
| FunSearch | 3.00% | 4.99% | 3.47% | 3.68% | 3.05% |
Values show fraction of excess bins used (lower is better). FunSearch results from 100,000 items.
Items arrive one at a time. Click a bin to place the current item. Try to minimize wasted space. The "Auto: Best Fit" button runs the classic Best Fit heuristic for comparison.
The LLM generates millions of programs. Most are garbage — they crash, produce invalid outputs, or score worse than what we already have. The evaluation pipeline is the gatekeeper that decides which programs survive. It's the "natural selection" in FunSearch's evolution.
Every FunSearch problem starts with a specification consisting of two parts:
1. The skeleton — boilerplate code that defines the overall algorithm. This is fixed and never changes. For the cap set problem, the skeleton contains the greedy loop that builds a cap set one element at a time. For bin packing, it contains the loop that assigns items to bins.
2. The evolve target — the one function the LLM must write. This is the creative part. For cap sets, it's the priority(element, n) function. For bin packing, it's the heuristic(item, bins) function.
When the LLM generates a new program, the evaluation pipeline:
Programs that crash, exceed time limits, produce wrong types, or violate constraints are immediately discarded. Only correct, complete programs enter the scoring phase.
A program isn't scored on a single input — it's evaluated on a set of inputs. For cap sets, these are different dimensionalities n. For bin packing, these are different item distributions. The scores across inputs are combined into an overall score (typically summed). This means a program must be generally good, not just good on one case.
The tuple of individual scores forms the program's signature. Two programs with identical signatures behave identically on all test inputs — they're functionally equivalent. Programs are clustered by signature within each island.
FunSearch is implemented as a distributed system with three types of workers that communicate asynchronously:
| Worker | Role | Count |
|---|---|---|
| Programs Database | Stores and serves programs, handles island management | 1 (central) |
| Samplers | Query the database, build prompts, call the LLM | Many (15+ in practice) |
| Evaluators | Receive programs from samplers, execute in sandbox, score | Many (concurrent) |
This architecture is embarrassingly parallel. While one sampler waits for the LLM to generate a program, another sampler is building a prompt, and evaluators are simultaneously scoring previously generated programs. The database is the only shared state. This lets FunSearch evaluate ~106 programs at practical cost — the paper ran experiments for hours, not weeks.
The programs database is FunSearch's long-term memory. While the LLM has no memory between calls (it's a frozen, stateless model), the database accumulates the best programs discovered so far. It's the population in evolutionary terms — and its design is what makes FunSearch work at scale.
The database is organized hierarchically:
When building a prompt, FunSearch needs to sample k programs (typically k=2) from the database. The sampling has two levels of bias:
Level 1 — Cluster selection: Higher-scoring clusters are more likely to be sampled. This ensures we learn from the best. But lower-scoring clusters still have a chance, maintaining diversity.
Level 2 — Program selection: Within a cluster, shorter programs are preferred. Why? Shorter programs are more likely to capture a general principle rather than being overfit to specific inputs. They're also easier for the LLM to read and modify. Think of it as an Occam's razor for code.
When k=2 programs are sampled, they are placed in the prompt as successive "versions" with ascending scores. The prompt looks like:
python def priority_v0(element, n): """Returns priority for adding element to cap set.""" # (code from lower-scoring sampled program) ... def priority_v1(element, n): """Improved version of priority_v0.""" # (code from higher-scoring sampled program) ... def priority_v2(element, n): """Improved version of priority_v1."""
The LLM must complete priority_v2. By showing it two previous versions (bad → good), the prompt implicitly communicates: "here's the trend — continue improving." The LLM sees the trajectory of improvement and extrapolates. This is more effective than showing a single best program, because it gives the LLM a direction of improvement, not just a target.
The prompt is the interface between the programs database and the LLM. It's surprisingly simple — no elaborate chain-of-thought, no system message engineering. Just code.
A FunSearch prompt consists of three parts assembled in order:
1. The skeleton header — imports, docstring, and the fixed boilerplate functions (the "solve" function, evaluation harness, etc.). This gives the LLM context about what problem we're solving.
2. Sampled program versions — k programs from the database, appended as priority_v0, priority_v1, etc. These are sorted worst-to-best by score. Each gets a docstring like "Improved version of priority_v0."
3. The function header to complete — just the def priority_v2(element, n): line and its docstring. The LLM must generate the body.
python """Finds large cap sets.""" import numpy as np import utils_capset def main(n): """Run solve on n-dimensional cap set and evaluate.""" solution = solve(n) return evaluate(solution, n) # ... (skeleton functions: solve, evaluate, etc.) ... def priority_v0(element, n): """Returns the priority of adding element to the cap set.""" score = 0 for d in element: score += d return score def priority_v1(element, n): """Improved version of priority_v0.""" score = n * 0.5 for i, d in enumerate(element): if d == 0: score += 1 elif d == 1: score -= 0.5 return score def priority_v2(element, n): """Improved version of priority_v1."""
The LLM completes the body of priority_v2. It might copy the structure of v1 and tweak coefficients. It might combine ideas from v0 and v1. It might try something completely new. The beauty is that any syntactically valid Python function body is a valid candidate — and we'll score it empirically.
FunSearch uses a high sampling temperature (Tsample is a hyperparameter). Higher temperature means more diverse, more "creative" outputs. Most will be worse, but the occasional breakthrough makes it worthwhile. This is the opposite of typical LLM usage, where low temperature is preferred for accuracy. Here, accuracy is checked by execution — so we want maximum creativity.
Now you understand every component. Let's watch the full FunSearch loop run. In this simulation, you control an evolutionary search that evolves simple mathematical functions. Programs compete to maximize a scoring function. Islands maintain diversity. The LLM (simulated) proposes mutations. The best programs survive.
Watch programs evolve across islands. Each bar is a program; height = score. Colors represent different islands. The LLM generates new candidates (flashing), which are evaluated and placed into islands. Worst islands periodically reset.
Try these experiments:
| Experiment | What to observe |
|---|---|
| Low creativity (0.1-0.3) | Programs barely change. Evolution stalls early. Local optimum. |
| High creativity (1.5-2.0) | Wild mutations. Most are terrible, but occasional breakthroughs push the frontier. |
| 2 islands | Less diversity. Faster initial convergence, but lower final score. |
| 6 islands | More exploration. Slower start, but more likely to find the global optimum. |
| Force Reset | Watch worst islands get wiped and re-seeded. Scores in reset islands jump up. |
Best score per generation. Watch for the characteristic staircase pattern: long plateaus punctuated by sudden jumps when a breakthrough mutation is found.
FunSearch, published in January 2024, proved that LLM-driven program evolution could make genuine scientific discoveries. It was the starting gun. What came next?
DeepMind's successor to FunSearch, AlphaEvolve, extends the core idea in several directions:
| Aspect | FunSearch | AlphaEvolve |
|---|---|---|
| What evolves | A single function | Entire codebases (multi-file) |
| LLM | Codey (PaLM 2) | Gemini (mixture of models) |
| Edit style | Rewrite entire function | Diff-based edits (patches) |
| Evaluator | Simple score function | LLM-assisted meta-evaluation |
| Applications | Cap sets, bin packing | Matrix multiplication, chip design, kernel optimization |
AlphaEvolve discovered faster matrix multiplication algorithms (improving on Strassen's 1969 result), optimized Google's data center scheduling, and found improvements in hardware circuit design. The pattern FunSearch established — LLM creativity + automated verification + evolutionary search — turns out to be remarkably general.
FunSearch has clear requirements — and clear limitations:
| Requirement | Why |
|---|---|
| Efficient evaluator | Must score 106 programs. If evaluation takes hours, FunSearch is infeasible. |
| Rich scoring signal | Binary pass/fail is too sparse. A continuous score lets evolution make incremental progress. |
| Isolated critical function | The LLM must only write one function. Full program synthesis is too hard. |
| Not formal proofs | FunSearch finds constructions, not proofs. It can show "a cap set of size 512 exists" but not "512 is optimal." |
FunSearch connects to several threads in AI research:
Genetic programming — FunSearch is, at its core, a genetic programming system where the LLM replaces random mutation. The island model, tournament selection, and population-based search all come from decades of evolutionary computation research.
Program synthesis — The dream of automatically writing programs goes back to the 1960s. FunSearch shows that LLMs make program synthesis practical for real-world problems by providing a strong prior over code structure.
AI for science — FunSearch belongs to a growing family of systems where AI makes genuine scientific contributions: AlphaFold (protein structure), GNoME (materials discovery), and now FunSearch (mathematics). The common thread is that these systems don't just interpolate training data — they discover structures humans hadn't found.