How DeepMind used LLMs not to answer questions, but to discover new mathematics — by searching the space of programs, not the space of text.
You have a hard math problem. Really hard — the kind where the best known solutions come from 40 years of human intuition, custom heuristics, and occasionally lucky guesses. You try asking an LLM to solve it. The LLM confidently generates an answer. The answer is wrong.
This is the fundamental limitation of LLMs on hard reasoning tasks: they produce plausible-looking text, but they cannot verify their own output. They hallucinate proofs. They invent numbers. They sound certain about things that are false.
But here is the key insight that makes FunSearch work: while LLMs are bad at producing correct answers, they are remarkably good at producing creative variations of programs. And programs, unlike natural language, can be automatically checked.
Click "Generate" to see the difference. The red path (direct LLM) hallucinates answers. The green path (LLM + evaluator) only keeps verified solutions.
In December 2023, DeepMind published FunSearch (short for "searching in the Function space"). It used a pre-trained LLM — specifically a version of Codey, their code-generating model — paired with an automated evaluator inside an evolutionary loop. The result? New mathematical discoveries that surpassed decades of human effort.
Not incremental improvements. Not minor tweaks. FunSearch discovered genuinely new constructions for the cap set problem in combinatorics — structures that no mathematician had found before — and produced bin-packing heuristics that beat the state of the art on multiple benchmark distributions.
FunSearch is built on three components that form a closed loop. Understanding this loop is the single most important thing in this lesson. Everything else is detail.
A pre-trained code-generation LLM (Codey, based on PaLM 2) receives a prompt containing the problem specification and a few existing solution programs. It generates new candidate programs — creative variations, recombinations, and novel approaches. The LLM never sees scores or fitness values. It just sees code and produces code.
Every program the LLM generates is executed in a sandbox. The evaluator runs the program on test inputs and scores the output. This is the critical piece: the evaluator is deterministic and automated. There is no ambiguity. A program either produces a large cap set or it doesn't. It either packs bins efficiently or it doesn't. No LLM judgment is needed here — just math.
A structured database stores the best programs found so far, organized by score into "islands" to maintain diversity. When the LLM needs inspiration, the system samples a few high-scoring programs from the database and includes them in the prompt.
The name "FunSearch" is a play on words: it searches in the space of functions. The LLM doesn't generate solutions directly — it generates programs that generate solutions. This is a crucial distinction. A cap set of size 512 in F38 is a specific mathematical object. But the program that constructs such a set is a transferable recipe that can be analyzed, modified, and applied at larger scales.
Watch the evolutionary loop in action. Each dot is a candidate program. Green = accepted (beat threshold). Red = rejected. The yellow line tracks the best score found so far.
Let's talk about the problem that made headlines. Imagine a game with cards. Each card has properties that take one of three values (red/green/blue, circle/square/triangle, solid/striped/empty). A "set" is a collection of three cards where, for each property, the three cards are either all the same or all different. The cap set problem asks: what is the largest collection of cards you can choose such that NO three cards form a "set"?
This is exactly the mathematical cap set problem, just dressed up. Formally: in F3n (vectors of length n where each entry is 0, 1, or 2), a cap set is a subset where no three elements sum to zero modulo 3. The question is: how large can this set be?
The cap set problem connects to deep questions in additive combinatorics, coding theory, and complexity theory. The famous "cap set conjecture" (proved by Croot, Lev, Pach and Ellenberg-Gijswijt in 2016) showed that cap sets in F3n can't be too large — they must be smaller than (2.756...)n. But the conjecture only gives an upper bound. For specific dimensions, the exact answer is unknown for n ≥ 7.
The best known cap set constructions for large n had stood for over 40 years, built from a technique called the "Hill cap" construction. FunSearch discovered new constructions that beat these bounds. Specifically:
| Dimension n | Previous Best | FunSearch | Improvement |
|---|---|---|---|
| n = 8 | 496 | 512 | +16 elements |
| Large n | Hill cap method | New asymptotic construction | Beats 40-year record |
But the real breakthrough wasn't just the numbers. It was that the programs FunSearch found contained interpretable structure. Mathematicians could look at the code and extract new insights about why certain constructions work. The programs weren't black boxes — they were compressed mathematical knowledge.
In F32, there are 9 points (a 3×3 grid). Click points to add them to your set. A red line appears if three selected points are collinear (form a "set"). Try to select as many points as possible without creating any red lines. The maximum is 4.
Cap sets are beautiful mathematics, but FunSearch also tackled a problem with direct industrial impact: online bin packing. You work at a warehouse. Items arrive one at a time, each with a known size (between 0 and 1). You must place each item into a bin (capacity = 1) immediately — you can't wait to see future items. Your goal: use as few bins as possible.
This is the online bin packing problem, and it shows up everywhere: packing containers on ships, scheduling jobs on servers, allocating memory in operating systems, cutting stock in manufacturing. Any time you must assign incoming items to fixed-capacity resources without knowing the future, you're bin packing.
Decades of research have produced well-known heuristics:
| Heuristic | Rule | Quality |
|---|---|---|
| First Fit | Place item in first bin that has room | Decent baseline |
| Best Fit | Place item in fullest bin that has room | Better on average |
| First Fit Decreasing | Sort items largest-first, then First Fit | Excellent (offline only) |
| Harmonic | Classify items by size into classes, pack by class | Good worst-case bound |
But these are general-purpose. If you know the distribution that items are drawn from (say, uniform on [0,1], or a Weibull distribution skewed toward small items), you should be able to do better by exploiting that structure.
FunSearch discovered heuristics that outperformed all known algorithms on specific distributions. The key was that FunSearch could tailor its heuristic to the distribution. For uniform [0,1] items, FunSearch found a program that used adaptive bin-class boundaries — it assigned items to categories based on their size, but the category boundaries were not the obvious ones (1/2, 1/3, 1/4...). Instead, they were slightly asymmetric, tuned to the distribution in ways that human designers had not considered.
Items arrive one at a time. Compare First Fit vs Best Fit vs FunSearch-style (adaptive boundaries). Watch how many bins each uses.
Everyone focuses on the LLM. But the evaluator is what makes FunSearch actually work. Without it, you just have an LLM generating random programs — most of which are wrong, crash, or produce terrible solutions. The evaluator is the selection pressure that turns random variation into directed search.
The evaluator must have four properties:
In FunSearch, the user provides a specification — a Python file that defines:
python def evaluate(program_output): """Score a candidate solution. Higher = better.""" # For cap sets: return the size of the set # IF no three elements sum to zero mod 3 # ELSE return 0 (invalid) if is_valid_cap_set(program_output): return len(program_output) return 0 def priority(item, bins): """The function the LLM must evolve.""" # This is the "skeleton" - LLM fills in the body pass
The specification separates what we want (the evaluate function, written by the human) from how to achieve it (the priority function, evolved by the LLM). The human never writes the solution. They write the test.
Because the LLM generates arbitrary code, every candidate program runs in a sandbox with strict timeouts (typically 30 seconds) and no network access. Programs that crash, hang, or produce malformed output simply score zero. This makes the system robust to the vast majority of LLM failures — bad programs are silently discarded.
Each bar represents a candidate program. Height = score. Click "Evaluate Batch" to generate a new batch and see how the evaluator filters them. Most programs fail (score 0) or are mediocre. Only a few are good.
FunSearch doesn't just randomly generate programs and keep the best one. It uses an evolutionary strategy inspired by biological evolution. The key data structure is the programs database, which maintains a population of programs organized into islands.
Think of the Galapagos. Darwin's finches evolved different beak shapes on different islands because each island had different selection pressures. FunSearch does the same thing: it maintains multiple islands, each with its own population of programs. Within an island, programs are ranked by score. Periodically, the best program from one island migrates to another, cross-pollinating good ideas.
Why islands? Without them, the population quickly converges to variations of one good solution. With islands, different lineages explore different parts of the solution space. One island might discover a clever recursion trick. Another might find an efficient data structure. When they cross, the offspring might combine both.
When it's time to generate a new candidate, FunSearch:
The prompt is constructed carefully. It includes the problem specification (so the LLM knows what the function signature should be), followed by 2-3 existing programs at different score levels. The LLM sees programs that are "pretty good" and "very good" and is implicitly challenged to produce something "even better." This is in-context learning as evolutionary recombination.
Each island holds at most K programs (typically K = 40). When a new high-scoring program arrives and the island is full, the lowest-scoring program is evicted. Additionally, programs are grouped by score into clusters, and sampling is biased toward higher clusters. This creates soft selection pressure: mediocre programs survive for a while (providing diversity) but are eventually displaced by better ones.
Watch 3 islands evolve independently. Each column is an island, each dot is a program (height = score). Occasionally a program migrates between islands (dashed line).
You don't need DeepMind's infrastructure to build a FunSearch-style system. The core loop is surprisingly simple. Let's walk through a minimal implementation step by step.
Pick a problem where you can automatically score solutions. Here's a toy example: find a Python function that sorts a list using as few comparisons as possible.
python # specification.py — the human-written part def evaluate(sort_fn): """Score a sorting function by comparison count.""" test_cases = [ [3, 1, 4, 1, 5], [9, 2, 6, 5, 3, 8, 7], list(range(20, 0, -1)), # worst case for naive sorts ] total_comparisons = 0 for case in test_cases: result, comps = sort_fn(case[:]) # copy to avoid mutation if result != sorted(case): return 0 # incorrect = score 0 total_comparisons += comps # Higher score = fewer comparisons (invert) return 1000 / (1 + total_comparisons) # Skeleton the LLM will evolve: SKELETON = """ def sort_fn(arr): comparisons = 0 # YOUR CODE HERE return arr, comparisons """
python import random from openai import OpenAI # or any LLM API client = OpenAI() class ProgramsDB: def __init__(self, n_islands=3, capacity=20): self.islands = [[] for _ in range(n_islands)] self.capacity = capacity def add(self, island_idx, program, score): island = self.islands[island_idx] island.append((score, program)) island.sort(reverse=True) if len(island) > self.capacity: island.pop() # evict worst def sample(self, island_idx, k=2): island = self.islands[island_idx] if len(island) < k: return [p for _, p in island] # Bias toward high-scoring programs weights = [1.0 / (i + 1) for i in range(len(island))] chosen = random.choices(island, weights=weights, k=k) return [p for _, p in chosen]
python def build_prompt(spec, examples): prompt = f"Problem specification:\n{spec}\n\n" for i, ex in enumerate(examples): prompt += f"Example solution {i+1}:\n{ex}\n\n" prompt += "Write a BETTER version. Return only the function." return prompt def run_funsearch(spec, skeleton, evaluate, n_iters=500): db = ProgramsDB(n_islands=3) # Seed each island with the skeleton for i in range(3): db.add(i, skeleton, evaluate(skeleton)) best_score = 0 for it in range(n_iters): island_idx = it % 3 # round-robin examples = db.sample(island_idx, k=2) prompt = build_prompt(spec, examples) # Call LLM with high temperature response = client.chat.completions.create( model="gpt-4", messages=[{"role": "user", "content": prompt}], temperature=1.0, ) candidate = response.choices[0].message.content # Evaluate in sandbox (try/except as minimal sandbox) try: score = evaluate(candidate) except: score = 0 if score > 0: db.add(island_idx, candidate, score) if score > best_score: best_score = score print(f"Iter {it}: new best = {score:.3f}") return db
| Decision | Recommendation |
|---|---|
| LLM choice | Any code-generation model works. GPT-4, Claude, Codestral. Bigger models give more creative programs but cost more per call. |
| Temperature | Start at 1.0. Lower if too many failures, raise if solutions converge too fast. |
| Island count | 3-10 islands. More islands = more diversity = slower convergence. 3 is a good default. |
| Iterations | FunSearch ran millions of evaluations. For simpler problems, 500-5000 is enough. |
| Sandbox | At minimum, use subprocess with timeout. For production, use Docker or Firecracker. |
Now you get to be FunSearch. This simulation lets you run an evolutionary loop on a simplified heuristic optimization problem: finding a scoring function for online bin packing. Watch programs evolve, islands compete, and scores climb over generations.
Each row is an island. Dots represent programs (position = score). The yellow diamond is the global best. Adjust mutation rate and island count, then watch evolution in action.
Try cranking up islands to 6 and migration rate to 0.3. Notice how ideas spread quickly between islands — too quickly. The population homogenizes. Now try migration at 0.02. Islands develop unique "personalities" (solution strategies) but miss out on cross-pollination. The original FunSearch used infrequent migration to balance these forces.
FunSearch's success teaches us several deep lessons about how to use LLMs for discovery. These lessons apply far beyond cap sets and bin packing.
The single most important design decision in FunSearch is that nothing the LLM produces is trusted. Every program is verified. This is the opposite of how most people use LLMs (generate text, read it, maybe check it manually). FunSearch treats the LLM as a generator of hypotheses, never as a source of truth.
FunSearch doesn't evolve mathematical objects directly. It evolves programs that construct mathematical objects. Why? Because programs are compositional and transferable. A cap set of size 512 is just a list of numbers. But the program that constructs it captures the method — the insight, the trick, the structure. That program can be applied at any dimension, analyzed for patterns, and combined with other programs.
Without the island model, FunSearch converges prematurely. The best-performing setup used 10-15 islands with infrequent migration. This mirrors evolutionary biology: geographic isolation drives speciation, and hybrid vigor comes from crossing distinct lineages.
The LLM can generate thousands of candidate programs per hour. But every one needs to be evaluated. If evaluation takes 30 seconds, you can evaluate ~120 programs per hour per core. DeepMind ran evaluations in parallel on hundreds of cores. For your own FunSearch, the speed of your evaluator determines everything.
| Good Candidates | Bad Candidates |
|---|---|
| Combinatorial optimization (packing, scheduling, routing) | Tasks with no clear scoring metric |
| Algorithm design (sorting, hashing, compression) | Tasks requiring subjective judgment |
| Mathematical constructions (cap sets, Ramsey numbers) | Tasks where evaluation is as hard as the problem |
| Game playing heuristics (evaluation functions) | Tasks requiring real-world physical testing |
| ML hyperparameter search (fitness = validation loss) | Open-ended creative writing |
FunSearch represents a new paradigm: LLMs as hypothesis generators inside formal verification loops. This idea is spreading rapidly.
| System | Approach | Domain |
|---|---|---|
| AlphaCode / AlphaCode 2 | LLM + evaluator for competitive programming | Code contests |
| AlphaProof | LLM + formal proof verifier for IMO problems | Mathematics |
| AlphaGeometry | Neuro-symbolic system for geometry | Olympiad geometry |
| EvoPrompting | Evolutionary search over prompts (not programs) | Any LLM task |
| OpenELM | Open-source evolutionary LLM search | General optimization |
The common thread: separate generation from verification. Let the LLM be creative, let the evaluator be strict, and let evolution do the rest.
Can we search over proofs, not just programs? FunSearch finds constructions but doesn't prove optimality. Combining FunSearch-style search with formal proof systems (Lean, Coq, Isabelle) could yield not just new math but verified new math.
How far can we push the evaluator? The biggest limitation is that the evaluator must be automated and fast. Can we use approximate evaluators (learned reward models) for problems where exact evaluation is expensive? This bridges FunSearch toward RLHF-style methods.
What about multi-objective optimization? FunSearch optimizes a single score. Real-world problems often have trade-offs (speed vs memory, accuracy vs latency). Multi-objective FunSearch would maintain a Pareto front instead of a single best score.
If you found this lesson interesting, explore these related topics: