FunSearch — From Absolute Zero to Mastery

Chapter 0: The Problem with LLMs Doing Math

You ask an LLM to solve an open mathematical problem. It writes a confident, fluent answer. The math looks right. The notation is perfect. But when you check the actual result, it's wrong. Not subtly wrong — fundamentally wrong. The LLM hallucinated a proof that proves nothing.

This is not a fluke. LLMs suffer from hallucinations — they generate plausible-sounding but factually incorrect outputs. For creative writing, this is a feature. For mathematics and science, it's fatal. A single wrong step in a proof invalidates everything that follows. A single wrong digit in a construction makes it worthless.

The problem gets worse as questions get harder. For well-known textbook problems, LLMs can often retrieve memorized solutions. But for open problems — questions where humanity doesn't know the answer yet — there's nothing to retrieve. The LLM is forced to reason, and it hallucinates.

The core tension: LLMs are remarkably creative at generating code and mathematical expressions. But they have no reliable way to verify whether their output is correct. Creativity without verification is just confident guessing.

Why Not Just Ask for Better Answers?

You might think: "Just prompt the LLM more carefully." Researchers tried. Chain-of-thought prompting, self-consistency, multi-step verification — these help on known problems. But they all share a fundamental flaw: the LLM is both the generator and the verifier. It's checking its own homework.

What if, instead of trusting the LLM's answer, we could actually run the answer and measure whether it works? What if the LLM's output was not a final answer, but a program that computes an answer — and we could execute that program to check?

LLM Outputs: Answers vs Programs

Click to see what happens when we ask an LLM for a direct answer versus a program. The program can be verified by running it.

This is the insight that launched FunSearch. Published in Nature in January 2024 by Romera-Paredes et al. at DeepMind, FunSearch doesn't ask an LLM to solve math problems. It asks the LLM to write programs that solve math problems — then runs those programs and scores the results. The LLM provides creativity. A computer provides verification. Together, they made genuine mathematical discoveries.

Why can't we simply trust an LLM's direct answer to an open mathematical problem?

LLMs are too slow at computation LLMs hallucinate — they generate plausible but incorrect outputs, and can't reliably self-verify LLMs don't know any mathematics

Chapter 1: The FunSearch Insight — Evolve Programs, Not Outputs

FunSearch stands for "searching in the function space." That name encodes the entire idea: instead of searching for a mathematical object (a number, a set, a proof), search for a function — a program — that constructs the object. Then run the program and check if the object is any good.

This reframing changes everything. A mathematical answer is opaque — you can't tell if it's correct without understanding the full proof. But a program is executable. You run it, you get an output, and you measure that output against a known scoring function. No trust required. No hallucination risk. The program either produces a valid construction or it doesn't.

Key insight: Programs are interpretable and verifiable in ways that raw outputs are not. A program that generates a large cap set can be inspected, understood, and — critically — scored by running it on actual inputs. This separates the creative process (LLM) from the verification process (execution).

The Architecture in One Paragraph

FunSearch maintains a database of programs, ranked by how well they score. At each iteration, it samples a few programs from this database, combines them into a prompt, and sends them to a pretrained LLM. The LLM generates a new program (a variation or improvement). This new program is executed in a sandbox, scored, and — if it's correct and scores well — added back to the database. Repeat millions of times.

Programs Database

Stores programs ranked by score

↓ sample k best programs

Build Prompt

Skeleton + sampled programs as examples

↓ send to LLM

Pretrained LLM

Generates a new program (creative mutation)

↓ new program

Evaluate

Run program in sandbox, compute score

↓ if correct, add to database

Programs Database

Updated with new program & score

↻ repeat ~10⁶ times

LLM as Mutation Operator

In traditional evolutionary algorithms, mutations are random: flip a bit, change a number, swap two lines. These random mutations almost never produce meaningful improvements on complex programs. The LLM replaces this with intelligent mutation. It reads existing programs, understands their structure, and proposes new variations that are syntactically valid and semantically meaningful.

Think of the LLM as a junior programmer who can read two working solutions and suggest a creative improvement. Most suggestions won't be better — but some will. And unlike random mutation, the LLM's suggestions at least compile and make structural sense.

Why not fine-tune the LLM? FunSearch uses a pretrained, frozen LLM (Codey, based on PaLM 2). No fine-tuning, no gradient updates. The LLM's weights never change. All the "learning" happens in the programs database — which programs survive and get sampled. The evolutionary pressure is external to the model.

What Makes This Different from "LLM + Code Execution"?

Many systems run LLM-generated code. FunSearch is different in three crucial ways:

Aspect	Typical LLM + Code	FunSearch
Goal	Solve a known problem	Discover new solutions to open problems
Iterations	1-10 attempts	Millions of attempts
Memory	No memory across attempts	Programs database accumulates best solutions
Diversity	Single thread	Island-based populations prevent premature convergence
Output	A working solution	Programs better than any human-written solution

What does FunSearch evolve?

The weights of the LLM Programs stored in a database, scored by execution Mathematical proofs in natural language

Chapter 2: Island-Based Evolution

Imagine you're breeding roses. If you have one garden and always pick the reddest rose to breed next, you'll quickly converge to one particular shade of red — and you'll never discover that a completely different genetic line could produce a deeper, richer red. You got stuck in a local optimum.

FunSearch faces the same risk. If it keeps all programs in one pool and always samples the best, it will quickly converge to minor variations of the first good solution it finds. To prevent this, FunSearch borrows a technique from evolutionary computation called the island model.

The island model: Instead of one big population, maintain multiple separate populations (islands) that evolve independently. Each island explores a different region of program space. Periodically, the worst-performing islands are wiped and re-seeded from the best islands. This balances exploration (islands diverge) with exploitation (good genes spread).

How Islands Work in FunSearch

The programs database is split into multiple islands (also called subpopulations). Each island is created and evolved independently. When sampling programs to build a prompt, FunSearch first picks an island, then samples programs from within that island. The new program generated by the LLM is added back to the same island.

Periodically, a reset event occurs. The islands in the bottom half (worst average scores) are wiped clean. Their programs are discarded. Each empty island is then re-initialized by cloning the best individual from one of the surviving islands. Evolution continues from these fresh starting points.

Initialize

Create N islands, each with a trivial seed program

↓

Evolve Separately

Each island runs independent LLM-evolution cycles

↓ after K iterations

Reset Worst

Bottom 50% of islands are wiped

↓

Re-seed

Clone best programs from surviving islands into empty ones

↻ repeat until termination

Within Each Island: Clusters

Inside each island, programs are further organized into clusters based on their signature — the tuple of scores they achieve on different test inputs. Two programs with the same signature (same scores on every input) are grouped together. Within a cluster, shorter programs are preferred (they're simpler, more generalizable).

When sampling from an island, FunSearch first picks a cluster (favoring higher-scoring ones), then picks a program from within that cluster (favoring shorter ones). This two-level sampling ensures diversity: instead of always picking the single highest-scoring program, it samples from different behavioral clusters.

Island Model: Populations Diverge & Reset

Watch 5 islands evolve independently. Each dot is a program; vertical position = score. Periodically, the worst islands reset from the best. Notice how diversity is maintained.

Gen 0

Why not just one big population? Experiments in the paper showed that removing islands caused FunSearch to converge prematurely and produce worse results. The island model is not optional decoration — it's a critical component. With islands, different populations can explore fundamentally different algorithmic strategies in parallel.

What happens to the worst-performing islands during a reset?

They receive random new programs They are wiped and re-seeded with clones of the best programs from surviving islands Their programs are merged with the best island

Chapter 3: The Cap Set Problem

To understand what FunSearch discovered, we need to understand the problem it attacked. The cap set problem is a famous open question in a branch of mathematics called extremal combinatorics — the study of "how large can a structure be while obeying certain constraints?"

Building Intuition: The Card Game

Imagine you have cards, each showing a string of digits. Each digit can be 0, 1, or 2. With 3 digits per card, there are 3³ = 27 possible cards: 000, 001, 002, 010, ..., 222. Now, three cards form a line (also called an "arithmetic progression" in Z₃ⁿ) if, in every digit position, the three values are either all the same or all different.

For example, with n=3 digits:

Card A	Card B	Card C	Position 1	Position 2	Position 3	Line?
0 1 2	1 1 2	2 1 2	0,1,2 (all diff)	1,1,1 (all same)	2,2,2 (all same)	Yes
0 0 0	1 1 1	2 2 2	0,1,2 (all diff)	0,1,2 (all diff)	0,1,2 (all diff)	Yes
0 0 1	1 0 2	2 1 0	0,1,2 (all diff)	0,0,1 (mixed!)	1,2,0 (all diff)	No

A cap set is a collection of cards such that no three cards form a line. The cap set problem asks: what is the largest possible cap set in n dimensions?

Why "cap set"? In geometry, a "cap" is a set of points with no three collinear. In the space Z₃ⁿ (vectors of length n over {0,1,2}), "collinear" means "forming an arithmetic progression mod 3." The cap set problem has connections to the Roth sunflower problem, the Erdős-Ginzburg-Ziv theorem, and the polynomial method in combinatorics.

The State of the Art Before FunSearch

The cap set problem has been studied for decades. Here's what was known:

Dimension n	3	4	5	6	7	8
Best known	9	20	45	112	236	496

For n=8, the best known cap set had 496 elements. This record had stood for 20 years. The exact maximum for n=8 was unknown — 496 was a lower bound, and the upper bound was much higher.

What FunSearch Discovered

FunSearch didn't just match the known results — it broke the 20-year record for n=8, finding a cap set of size 512. It also found constructions matching the best known sizes for n=3 through 7.

Dimension n	3	4	5	6	7	8
Best known	9	20	45	112	236	496
FunSearch	9	20	45	112	236	512

This is a genuine scientific discovery. The cap set of size 512 in dimension 8 was previously unknown to mathematics. FunSearch didn't look it up — it evolved a program that constructs it. When mathematicians inspected the program, they found it used a novel symmetry structure they hadn't considered. This is not retrieval. This is creation.

How FunSearch Approached It

FunSearch didn't search for cap sets directly. Instead, it evolved a priority function — a program that assigns a score to each potential element, telling a greedy algorithm which elements to add first. The greedy algorithm builds the cap set one element at a time, always adding the highest-priority valid element.

python
def solve(n):
    """Build a cap set of dimension n using priority."""
    elements = all_elements(n)  # all 3^n vectors
    scores = [priority(el, n) for el in elements]
    order = argsort(scores, descending=True)
    capset = []
    for el in elements[order]:
        if can_be_added(el, capset):  # no 3 collinear?
            capset.append(el)
    return capset

The function priority(element, n) is the only part FunSearch evolves. Everything else — the greedy loop, the validity check, the element enumeration — is fixed boilerplate provided in the skeleton. FunSearch focuses the LLM's creativity on exactly the part that matters most.

What did FunSearch actually evolve to attack the cap set problem?

The cap set itself as a list of vectors A mathematical proof of the optimal size A priority function that tells a greedy algorithm which elements to add first

Chapter 4: Bin Packing — From Theory to Practice

The cap set result proved FunSearch could make mathematical discoveries. But could it solve practical problems too? The team applied FunSearch to online bin packing, a classic problem in combinatorial optimization with real-world applications from logistics to cloud computing.

The Bin Packing Problem

You have an infinite supply of bins, each with capacity 1.0. Items arrive one at a time, each with a size between 0 and 1. When an item arrives, you must immediately place it in a bin (you can't wait to see future items). Your goal: use as few bins as possible.

This is online bin packing — you make irrevocable decisions as items arrive, without knowing what comes next. It's like packing boxes into a moving truck one-by-one, without seeing the remaining pile. Every logistics company, every cloud scheduler, every warehouse manager faces this problem daily.

Why is this hard? If you could see all items upfront (offline), you could plan the perfect packing. But online, you're flying blind. Place a 0.3 item in a bin now, and a 0.7 item might arrive next — a perfect pair you missed. The gap between offline optimal and the best online strategy is provably nonzero.

Classic Heuristics

Decades of research produced well-known heuristics:

Heuristic	Strategy	Weakness
First Fit	Place item in the first bin it fits	Leaves scattered gaps
Best Fit	Place item in the bin with least remaining space where it fits	Packs too tightly too early
First Fit Decreasing	Sort items large-to-small, then First Fit	Requires seeing all items (offline only)

What FunSearch Evolved

For bin packing, FunSearch evolved a heuristic function that scores each bin for each incoming item. Given an item and the array of current bin capacities, the function returns a priority score for each bin. The item goes into the highest-scoring bin.

python
def heuristic(item: float, bins: np.ndarray) -> np.ndarray:
    """Return a score for each bin. Item goes in highest-scoring bin."""
    score = 1000 * np.ones(bins.shape)
    # Penalize bins with large remaining capacity
    score -= bins - item
    # Find the best-fit bin
    index = np.argmin(bins)
    # Bonus: also check if fit is very tight
    score[index] -= (bins[index] - item) ** 2
    return score

The evolved heuristics consistently outperformed First Fit and Best Fit across multiple benchmark distributions (OR1-OR5, Weibull1-Weibull5). The improvements were robust — not tuned to one particular distribution but generalizing across all of them.

A surprising discovery: The evolved heuristics often preferred symmetric solutions — requiring less information to specify and being more parsimonious than human-designed heuristics. They also tended to avoid filling bins too tightly, instead packing the item only if the fit is very tight, leaving flexible gaps for future items. This "just tight enough" strategy was not something human designers typically used.

Results

Method	OR1	OR2	OR3	OR4	Weibull1
First Fit	6.42%	6.45%	5.74%	5.23%	4.23%
Best Fit	5.81%	6.06%	5.27%	4.44%	3.88%
FunSearch	3.00%	4.99%	3.47%	3.68%	3.05%

Values show fraction of excess bins used (lower is better). FunSearch results from 100,000 items.

Online Bin Packing: Try It Yourself

Items arrive one at a time. Click a bin to place the current item. Try to minimize wasted space. The "Auto: Best Fit" button runs the classic Best Fit heuristic for comparison.

Click a bin to place item

In online bin packing, what constraint makes the problem fundamentally harder than offline packing?

You must place each item immediately as it arrives, without seeing future items The bins have variable capacities You can only use a limited number of bins

Chapter 5: The Evaluation Pipeline

The LLM generates millions of programs. Most are garbage — they crash, produce invalid outputs, or score worse than what we already have. The evaluation pipeline is the gatekeeper that decides which programs survive. It's the "natural selection" in FunSearch's evolution.

The Specification: Skeleton + Evolve Target

Every FunSearch problem starts with a specification consisting of two parts:

1. The skeleton — boilerplate code that defines the overall algorithm. This is fixed and never changes. For the cap set problem, the skeleton contains the greedy loop that builds a cap set one element at a time. For bin packing, it contains the loop that assigns items to bins.

2. The evolve target — the one function the LLM must write. This is the creative part. For cap sets, it's the priority(element, n) function. For bin packing, it's the heuristic(item, bins) function.

Why this split? A fixed skeleton constrains the search space enormously. The LLM doesn't need to reinvent the wheel — it doesn't need to figure out how to enumerate elements, check validity, or loop through bins. It only needs to write the critical decision function. This is like giving a chess engine the rules of chess and asking it to learn strategy — not asking it to discover chess from scratch.

Execution and Scoring

When the LLM generates a new program, the evaluation pipeline:

1. Parse

Check if the code is syntactically valid Python

↓

2. Sandbox

Run in isolated container with time/memory limits

↓

3. Execute

Call the solve function with test inputs

↓

4. Validate

Check outputs: correct type? Within bounds? No errors?

↓

5. Score

Compute score across multiple test inputs, aggregate

Programs that crash, exceed time limits, produce wrong types, or violate constraints are immediately discarded. Only correct, complete programs enter the scoring phase.

Scoring Across Multiple Inputs

A program isn't scored on a single input — it's evaluated on a set of inputs. For cap sets, these are different dimensionalities n. For bin packing, these are different item distributions. The scores across inputs are combined into an overall score (typically summed). This means a program must be generally good, not just good on one case.

The tuple of individual scores forms the program's signature. Two programs with identical signatures behave identically on all test inputs — they're functionally equivalent. Programs are clustered by signature within each island.

The Distributed System

FunSearch is implemented as a distributed system with three types of workers that communicate asynchronously:

Worker	Role	Count
Programs Database	Stores and serves programs, handles island management	1 (central)
Samplers	Query the database, build prompts, call the LLM	Many (15+ in practice)
Evaluators	Receive programs from samplers, execute in sandbox, score	Many (concurrent)

This architecture is embarrassingly parallel. While one sampler waits for the LLM to generate a program, another sampler is building a prompt, and evaluators are simultaneously scoring previously generated programs. The database is the only shared state. This lets FunSearch evaluate ~10⁶ programs at practical cost — the paper ran experiments for hours, not weeks.

Scale of evaluation: FunSearch evaluates on the order of 10⁶ total samples (programs). The paper uses a fast inference LLM (Codey) to keep the generation cost manageable, and runs samplers and evaluators concurrently. The key tradeoff: faster inference (lower quality per sample) beats slower inference (higher quality per sample) because the evaluation pipeline filters everything anyway.

What is the "evolve target" in a FunSearch specification?

The single function the LLM must write — the critical decision function, while everything else is fixed boilerplate The entire program including all helper functions The test cases used for evaluation

Chapter 6: The Programs Database

The programs database is FunSearch's long-term memory. While the LLM has no memory between calls (it's a frozen, stateless model), the database accumulates the best programs discovered so far. It's the population in evolutionary terms — and its design is what makes FunSearch work at scale.

Storage Structure

The database is organized hierarchically:

Database

Contains multiple islands

↓

Island

Contains multiple clusters

↓

Cluster

Programs with the same score signature

↓

Program

Source code + score + length

Sampling: How Programs Are Selected

When building a prompt, FunSearch needs to sample k programs (typically k=2) from the database. The sampling has two levels of bias:

Level 1 — Cluster selection: Higher-scoring clusters are more likely to be sampled. This ensures we learn from the best. But lower-scoring clusters still have a chance, maintaining diversity.

Level 2 — Program selection: Within a cluster, shorter programs are preferred. Why? Shorter programs are more likely to capture a general principle rather than being overfit to specific inputs. They're also easier for the LLM to read and modify. Think of it as an Occam's razor for code.

Why favor short programs? A 200-line function that scores well might be full of hard-coded special cases. A 20-line function that scores equally well has discovered a pattern. When the LLM reads the shorter program, it's more likely to understand the underlying idea and extend it creatively. Length is a proxy for generality.

Version History in Prompts

When k=2 programs are sampled, they are placed in the prompt as successive "versions" with ascending scores. The prompt looks like:

python
def priority_v0(element, n):
    """Returns priority for adding element to cap set."""
    # (code from lower-scoring sampled program)
    ...

def priority_v1(element, n):
    """Improved version of priority_v0."""
    # (code from higher-scoring sampled program)
    ...

def priority_v2(element, n):
    """Improved version of priority_v1."""

The LLM must complete priority_v2. By showing it two previous versions (bad → good), the prompt implicitly communicates: "here's the trend — continue improving." The LLM sees the trajectory of improvement and extrapolates. This is more effective than showing a single best program, because it gives the LLM a direction of improvement, not just a target.

Why k=2? The paper found that k=2 (two example programs per prompt) works best. k=1 gives the LLM no sense of direction. k=3+ makes the prompt too long, reducing the LLM's ability to synthesize the examples. Two examples hit the sweet spot: enough context to show a trend, short enough to leave room for creativity.

Within a cluster, which programs are preferred when sampling for a prompt?

The longest programs, because they contain more logic The shortest programs, because brevity correlates with generality Programs are selected uniformly at random

Chapter 7: The Prompt — Talking to the LLM

The prompt is the interface between the programs database and the LLM. It's surprisingly simple — no elaborate chain-of-thought, no system message engineering. Just code.

Prompt Structure

A FunSearch prompt consists of three parts assembled in order:

1. The skeleton header — imports, docstring, and the fixed boilerplate functions (the "solve" function, evaluation harness, etc.). This gives the LLM context about what problem we're solving.

2. Sampled program versions — k programs from the database, appended as priority_v0, priority_v1, etc. These are sorted worst-to-best by score. Each gets a docstring like "Improved version of priority_v0."

3. The function header to complete — just the def priority_v2(element, n): line and its docstring. The LLM must generate the body.

The entire prompt is valid Python. There's no natural language instruction like "Please write a better function." The prompt is just a Python file with a missing function body. The LLM's code completion training kicks in naturally — it sees a sequence of increasingly better functions and continues the pattern. This leverages the LLM's strongest capability (code completion) rather than asking it to follow complex instructions.

What the LLM Sees

python
"""Finds large cap sets."""
import numpy as np
import utils_capset

def main(n):
    """Run solve on n-dimensional cap set and evaluate."""
    solution = solve(n)
    return evaluate(solution, n)

# ... (skeleton functions: solve, evaluate, etc.) ...

def priority_v0(element, n):
    """Returns the priority of adding element to the cap set."""
    score = 0
    for d in element:
        score += d
    return score

def priority_v1(element, n):
    """Improved version of priority_v0."""
    score = n * 0.5
    for i, d in enumerate(element):
        if d == 0:
            score += 1
        elif d == 1:
            score -= 0.5
    return score

def priority_v2(element, n):
    """Improved version of priority_v1."""

The LLM completes the body of priority_v2. It might copy the structure of v1 and tweak coefficients. It might combine ideas from v0 and v1. It might try something completely new. The beauty is that any syntactically valid Python function body is a valid candidate — and we'll score it empirically.

Temperature and Sampling

FunSearch uses a high sampling temperature (T_sample is a hyperparameter). Higher temperature means more diverse, more "creative" outputs. Most will be worse, but the occasional breakthrough makes it worthwhile. This is the opposite of typical LLM usage, where low temperature is preferred for accuracy. Here, accuracy is checked by execution — so we want maximum creativity.

The LLM is disposable. FunSearch doesn't need a state-of-the-art LLM. It needs a fast one. The paper used Codey (PaLM 2-based), chosen for inference speed, not benchmark performance. The quality of any individual generation matters less than the quantity — because the evaluation pipeline filters everything. A fast, mediocre LLM that generates 10x more candidates outperforms a slow, brilliant one.

Why does FunSearch use high sampling temperature when generating programs?

To reduce the cost of LLM inference To maximize diversity of generated programs, since correctness is verified by execution, not the LLM To match the training temperature of the model

Chapter 8: Showcase — Watch Evolution Happen

Now you understand every component. Let's watch the full FunSearch loop run. In this simulation, you control an evolutionary search that evolves simple mathematical functions. Programs compete to maximize a scoring function. Islands maintain diversity. The LLM (simulated) proposes mutations. The best programs survive.

FunSearch Evolutionary Loop

Watch programs evolve across islands. Each bar is a program; height = score. Colors represent different islands. The LLM generates new candidates (flashing), which are evaluated and placed into islands. Worst islands periodically reset.

Mutation creativity 0.8

Islands 4

Generation: 0 Best: 0.00 Resets: 0

Try these experiments:

Experiment	What to observe
Low creativity (0.1-0.3)	Programs barely change. Evolution stalls early. Local optimum.
High creativity (1.5-2.0)	Wild mutations. Most are terrible, but occasional breakthroughs push the frontier.
2 islands	Less diversity. Faster initial convergence, but lower final score.
6 islands	More exploration. Slower start, but more likely to find the global optimum.
Force Reset	Watch worst islands get wiped and re-seeded. Scores in reset islands jump up.

What you're seeing: This is the same loop FunSearch runs millions of times. In the real system, the "mutation" is an LLM call, evaluation runs real code in a sandbox, and the search runs for hours or days. The dynamics — exploration vs exploitation, island diversity, periodic resets — are faithful to the paper.

Score Over Time

Best score per generation. Watch for the characteristic staircase pattern: long plateaus punctuated by sudden jumps when a breakthrough mutation is found.

Chapter 9: Beyond FunSearch — AlphaEvolve and the Future

FunSearch, published in January 2024, proved that LLM-driven program evolution could make genuine scientific discoveries. It was the starting gun. What came next?

AlphaEvolve (2025)

DeepMind's successor to FunSearch, AlphaEvolve, extends the core idea in several directions:

Aspect	FunSearch	AlphaEvolve
What evolves	A single function	Entire codebases (multi-file)
LLM	Codey (PaLM 2)	Gemini (mixture of models)
Edit style	Rewrite entire function	Diff-based edits (patches)
Evaluator	Simple score function	LLM-assisted meta-evaluation
Applications	Cap sets, bin packing	Matrix multiplication, chip design, kernel optimization

AlphaEvolve discovered faster matrix multiplication algorithms (improving on Strassen's 1969 result), optimized Google's data center scheduling, and found improvements in hardware circuit design. The pattern FunSearch established — LLM creativity + automated verification + evolutionary search — turns out to be remarkably general.

The Broader Pattern

FunSearch's lasting contribution is not any specific mathematical result. It's the paradigm: use LLMs not as answer machines but as creative search operators in a loop with automated verification. This works whenever you can (1) express solutions as programs, (2) automatically score those programs, and (3) iterate at scale. The LLM doesn't need to be right — it needs to be creative. The evaluator ensures correctness.

What FunSearch Needs to Work

FunSearch has clear requirements — and clear limitations:

Requirement	Why
Efficient evaluator	Must score 10⁶ programs. If evaluation takes hours, FunSearch is infeasible.
Rich scoring signal	Binary pass/fail is too sparse. A continuous score lets evolution make incremental progress.
Isolated critical function	The LLM must only write one function. Full program synthesis is too hard.
Not formal proofs	FunSearch finds constructions, not proofs. It can show "a cap set of size 512 exists" but not "512 is optimal."

Connections

FunSearch connects to several threads in AI research:

Genetic programming — FunSearch is, at its core, a genetic programming system where the LLM replaces random mutation. The island model, tournament selection, and population-based search all come from decades of evolutionary computation research.

Program synthesis — The dream of automatically writing programs goes back to the 1960s. FunSearch shows that LLMs make program synthesis practical for real-world problems by providing a strong prior over code structure.

AI for science — FunSearch belongs to a growing family of systems where AI makes genuine scientific contributions: AlphaFold (protein structure), GNoME (materials discovery), and now FunSearch (mathematics). The common thread is that these systems don't just interpolate training data — they discover structures humans hadn't found.

The Feynman test: "What I cannot create, I do not understand." FunSearch flips this: by creating programs that solve open problems, it demonstrates a form of understanding — not in the LLM itself, but in the system of LLM + evolution + evaluation. No single component understands the mathematics. But the system, collectively, discovers new mathematics. Whether that counts as "understanding" is a question for philosophers. The cap set of size 512 doesn't care.

What is FunSearch's lasting paradigm contribution?

That LLMs can directly solve any mathematical problem Using LLMs as creative search operators in a loop with automated verification — creativity from the LLM, correctness from execution That evolutionary algorithms are obsolete

Understand FunSearch