FunSearch (Applied) — From Absolute Zero to Mastery

Chapter 0: Why Search for Programs?

You have a hard math problem. Really hard — the kind where the best known solutions come from 40 years of human intuition, custom heuristics, and occasionally lucky guesses. You try asking an LLM to solve it. The LLM confidently generates an answer. The answer is wrong.

This is the fundamental limitation of LLMs on hard reasoning tasks: they produce plausible-looking text, but they cannot verify their own output. They hallucinate proofs. They invent numbers. They sound certain about things that are false.

But here is the key insight that makes FunSearch work: while LLMs are bad at producing correct answers, they are remarkably good at producing creative variations of programs. And programs, unlike natural language, can be automatically checked.

The FunSearch insight: Don't ask an LLM for answers. Ask it for programs that produce answers. Then check the answers with an automated evaluator. Keep the good programs, throw away the bad ones, and evolve toward solutions no human has ever found.

LLM Direct vs. LLM + Evaluator

Click "Generate" to see the difference. The red path (direct LLM) hallucinates answers. The green path (LLM + evaluator) only keeps verified solutions.

In December 2023, DeepMind published FunSearch (short for "searching in the Function space"). It used a pre-trained LLM — specifically a version of Codey, their code-generating model — paired with an automated evaluator inside an evolutionary loop. The result? New mathematical discoveries that surpassed decades of human effort.

Not incremental improvements. Not minor tweaks. FunSearch discovered genuinely new constructions for the cap set problem in combinatorics — structures that no mathematician had found before — and produced bin-packing heuristics that beat the state of the art on multiple benchmark distributions.

Why can't we just ask an LLM to solve hard math problems directly?

LLMs are too slow for math LLMs generate plausible-looking but unverified answers that may be wrong Math requires GPUs, not LLMs

Chapter 1: The Core Idea

FunSearch is built on three components that form a closed loop. Understanding this loop is the single most important thing in this lesson. Everything else is detail.

Component 1: The LLM (Idea Generator)

A pre-trained code-generation LLM (Codey, based on PaLM 2) receives a prompt containing the problem specification and a few existing solution programs. It generates new candidate programs — creative variations, recombinations, and novel approaches. The LLM never sees scores or fitness values. It just sees code and produces code.

Component 2: The Evaluator (Truth Filter)

Every program the LLM generates is executed in a sandbox. The evaluator runs the program on test inputs and scores the output. This is the critical piece: the evaluator is deterministic and automated. There is no ambiguity. A program either produces a large cap set or it doesn't. It either packs bins efficiently or it doesn't. No LLM judgment is needed here — just math.

Component 3: The Programs Database (Memory)

A structured database stores the best programs found so far, organized by score into "islands" to maintain diversity. When the LLM needs inspiration, the system samples a few high-scoring programs from the database and includes them in the prompt.

LLM (Codey)

Generates candidate Python functions

↓

Sandbox Execution

Run program, catch errors, measure output

↓

Evaluator

Score: how good is the solution?

↓

Programs Database

Store if score beats existing programs

↻ sample best → prompt LLM

Why this works: The LLM provides creativity (it can produce novel code patterns humans wouldn't think of). The evaluator provides correctness (it filters out the 99% of LLM output that is garbage). The database provides memory (good ideas accumulate over thousands of iterations). No single component could do this alone.

The name "FunSearch" is a play on words: it searches in the space of functions. The LLM doesn't generate solutions directly — it generates programs that generate solutions. This is a crucial distinction. A cap set of size 512 in F₃⁸ is a specific mathematical object. But the program that constructs such a set is a transferable recipe that can be analyzed, modified, and applied at larger scales.

The FunSearch Loop

Watch the evolutionary loop in action. Each dot is a candidate program. Green = accepted (beat threshold). Red = rejected. The yellow line tracks the best score found so far.

What does the LLM generate in FunSearch?

Direct mathematical solutions (numbers, sets) Natural language proofs Programs (Python functions) that produce solutions

Chapter 2: The Cap Set Breakthrough

Let's talk about the problem that made headlines. Imagine a game with cards. Each card has properties that take one of three values (red/green/blue, circle/square/triangle, solid/striped/empty). A "set" is a collection of three cards where, for each property, the three cards are either all the same or all different. The cap set problem asks: what is the largest collection of cards you can choose such that NO three cards form a "set"?

This is exactly the mathematical cap set problem, just dressed up. Formally: in F₃ⁿ (vectors of length n where each entry is 0, 1, or 2), a cap set is a subset where no three elements sum to zero modulo 3. The question is: how large can this set be?

Formal definition: A cap set in F₃ⁿ is a set S of vectors such that for any three distinct vectors a, b, c in S, we have a + b + c ≠ 0 (mod 3). Equivalently, no three points are collinear in the affine geometry over F₃.

Why it matters

The cap set problem connects to deep questions in additive combinatorics, coding theory, and complexity theory. The famous "cap set conjecture" (proved by Croot, Lev, Pach and Ellenberg-Gijswijt in 2016) showed that cap sets in F₃ⁿ can't be too large — they must be smaller than (2.756...)ⁿ. But the conjecture only gives an upper bound. For specific dimensions, the exact answer is unknown for n ≥ 7.

What FunSearch found

The best known cap set constructions for large n had stood for over 40 years, built from a technique called the "Hill cap" construction. FunSearch discovered new constructions that beat these bounds. Specifically:

Dimension n	Previous Best	FunSearch	Improvement
n = 8	496	512	+16 elements
Large n	Hill cap method	New asymptotic construction	Beats 40-year record

But the real breakthrough wasn't just the numbers. It was that the programs FunSearch found contained interpretable structure. Mathematicians could look at the code and extract new insights about why certain constructions work. The programs weren't black boxes — they were compressed mathematical knowledge.

Cap Set Explorer (F₃²)

In F₃², there are 9 points (a 3×3 grid). Click points to add them to your set. A red line appears if three selected points are collinear (form a "set"). Try to select as many points as possible without creating any red lines. The maximum is 4.

Selected: 0 / 9 — Lines: 0

The human analogy: Imagine you're a mathematician who has been trying to build large cap sets by hand for years. FunSearch is like having a tireless assistant who writes construction algorithms, tests millions of them, and hands you only the best ones. The constructions it found used a "symmetry-breaking" trick that human mathematicians had not considered.

What makes a cap set in F₃ⁿ?

A set of vectors where no three sum to zero (mod 3) The largest possible subset of F₃ⁿ A set of vectors that are all linearly independent

Chapter 3: Bin Packing — The Practical Win

Cap sets are beautiful mathematics, but FunSearch also tackled a problem with direct industrial impact: online bin packing. You work at a warehouse. Items arrive one at a time, each with a known size (between 0 and 1). You must place each item into a bin (capacity = 1) immediately — you can't wait to see future items. Your goal: use as few bins as possible.

This is the online bin packing problem, and it shows up everywhere: packing containers on ships, scheduling jobs on servers, allocating memory in operating systems, cutting stock in manufacturing. Any time you must assign incoming items to fixed-capacity resources without knowing the future, you're bin packing.

Classical heuristics

Decades of research have produced well-known heuristics:

Heuristic	Rule	Quality
First Fit	Place item in first bin that has room	Decent baseline
Best Fit	Place item in fullest bin that has room	Better on average
First Fit Decreasing	Sort items largest-first, then First Fit	Excellent (offline only)
Harmonic	Classify items by size into classes, pack by class	Good worst-case bound

But these are general-purpose. If you know the distribution that items are drawn from (say, uniform on [0,1], or a Weibull distribution skewed toward small items), you should be able to do better by exploiting that structure.

What FunSearch found

FunSearch discovered heuristics that outperformed all known algorithms on specific distributions. The key was that FunSearch could tailor its heuristic to the distribution. For uniform [0,1] items, FunSearch found a program that used adaptive bin-class boundaries — it assigned items to categories based on their size, but the category boundaries were not the obvious ones (1/2, 1/3, 1/4...). Instead, they were slightly asymmetric, tuned to the distribution in ways that human designers had not considered.

Verified result: On the OR-Library benchmark distributions, FunSearch heuristics used fewer bins than First Fit Decreasing, Best Fit, and even custom hand-tuned heuristics. The improvements ranged from 0.5% to over 1% — which at warehouse scale means thousands of fewer bins per day.

Online Bin Packing Simulator

Items arrive one at a time. Compare First Fit vs Best Fit vs FunSearch-style (adaptive boundaries). Watch how many bins each uses.

Speed 5

Why can FunSearch beat classical bin packing heuristics?

It uses more memory It discovers heuristics tailored to specific item distributions, unlike general-purpose rules It can look at future items before placing the current one

Chapter 4: The Evaluator — The Unsung Hero

Everyone focuses on the LLM. But the evaluator is what makes FunSearch actually work. Without it, you just have an LLM generating random programs — most of which are wrong, crash, or produce terrible solutions. The evaluator is the selection pressure that turns random variation into directed search.

What makes a good evaluator?

The evaluator must have four properties:

Automated. No human in the loop. The evaluator runs code, measures output, returns a number. Thousands of evaluations per hour.

Fast. Each evaluation should take seconds, not hours. This rules out problems where checking a solution requires expensive computation.

Deterministic. Same program, same score, every time. No stochasticity in evaluation (though the programs themselves may be stochastic — you average over multiple runs).

Discriminative. The score must distinguish good programs from great ones. A binary "correct/incorrect" isn't enough — you need a gradient to climb.

The specification pattern

In FunSearch, the user provides a specification — a Python file that defines:

python
def evaluate(program_output):
    """Score a candidate solution. Higher = better."""
    # For cap sets: return the size of the set
    #   IF no three elements sum to zero mod 3
    #   ELSE return 0 (invalid)
    if is_valid_cap_set(program_output):
        return len(program_output)
    return 0

def priority(item, bins):
    """The function the LLM must evolve."""
    # This is the "skeleton" - LLM fills in the body
    pass

The specification separates what we want (the evaluate function, written by the human) from how to achieve it (the priority function, evolved by the LLM). The human never writes the solution. They write the test.

The specification is the interface between human intent and machine creativity. It's the most important piece a human contributes to FunSearch. A well-designed spec makes the difference between FunSearch finding nothing and FunSearch making discoveries. The spec must encode the problem's constraints precisely enough that bad solutions score zero, while providing a smooth gradient from "mediocre" to "excellent."

Sandboxing

Because the LLM generates arbitrary code, every candidate program runs in a sandbox with strict timeouts (typically 30 seconds) and no network access. Programs that crash, hang, or produce malformed output simply score zero. This makes the system robust to the vast majority of LLM failures — bad programs are silently discarded.

Evaluator Scoring

Each bar represents a candidate program. Height = score. Click "Evaluate Batch" to generate a new batch and see how the evaluator filters them. Most programs fail (score 0) or are mediocre. Only a few are good.

Click to generate a batch

What does the human provide in a FunSearch specification?

The evaluation function (how to score solutions) and the skeleton of the function to evolve A complete solution for the LLM to improve Training data for fine-tuning the LLM

Chapter 5: Evolutionary Strategy

FunSearch doesn't just randomly generate programs and keep the best one. It uses an evolutionary strategy inspired by biological evolution. The key data structure is the programs database, which maintains a population of programs organized into islands.

Islands for diversity

Think of the Galapagos. Darwin's finches evolved different beak shapes on different islands because each island had different selection pressures. FunSearch does the same thing: it maintains multiple islands, each with its own population of programs. Within an island, programs are ranked by score. Periodically, the best program from one island migrates to another, cross-pollinating good ideas.

Why islands? Without them, the population quickly converges to variations of one good solution. With islands, different lineages explore different parts of the solution space. One island might discover a clever recursion trick. Another might find an efficient data structure. When they cross, the offspring might combine both.

Sampling and prompting

When it's time to generate a new candidate, FunSearch:

Step 1

Pick an island (round-robin or weighted)

↓

Step 2

Sample 2-3 programs from the island, biased toward high scores

↓

Step 3

Build a prompt: specification + sampled programs + "write a better version"

↓

Step 4

LLM generates a new program

↓

Step 5

Evaluate → if score is high enough, add to island

The prompt is constructed carefully. It includes the problem specification (so the LLM knows what the function signature should be), followed by 2-3 existing programs at different score levels. The LLM sees programs that are "pretty good" and "very good" and is implicitly challenged to produce something "even better." This is in-context learning as evolutionary recombination.

Temperature matters. FunSearch uses a relatively high temperature (around 1.0) for sampling from the LLM. Lower temperature would produce safe, incremental variations. Higher temperature increases the chance of creative leaps — and creative failures. The evaluator filters out the failures, so the high temperature is a net positive.

Population management

Each island holds at most K programs (typically K = 40). When a new high-scoring program arrives and the island is full, the lowest-scoring program is evicted. Additionally, programs are grouped by score into clusters, and sampling is biased toward higher clusters. This creates soft selection pressure: mediocre programs survive for a while (providing diversity) but are eventually displaced by better ones.

Island Evolution Sim

Watch 3 islands evolve independently. Each column is an island, each dot is a program (height = score). Occasionally a program migrates between islands (dashed line).

Why does FunSearch use multiple "islands" in its programs database?

To run faster on multiple GPUs To process different problem types simultaneously To maintain diversity and prevent premature convergence to one solution style

Chapter 6: Build Your Own FunSearch

You don't need DeepMind's infrastructure to build a FunSearch-style system. The core loop is surprisingly simple. Let's walk through a minimal implementation step by step.

Step 1: Define your specification

Pick a problem where you can automatically score solutions. Here's a toy example: find a Python function that sorts a list using as few comparisons as possible.

python
# specification.py — the human-written part

def evaluate(sort_fn):
    """Score a sorting function by comparison count."""
    test_cases = [
        [3, 1, 4, 1, 5],
        [9, 2, 6, 5, 3, 8, 7],
        list(range(20, 0, -1)),  # worst case for naive sorts
    ]
    total_comparisons = 0
    for case in test_cases:
        result, comps = sort_fn(case[:])  # copy to avoid mutation
        if result != sorted(case):
            return 0  # incorrect = score 0
        total_comparisons += comps
    # Higher score = fewer comparisons (invert)
    return 1000 / (1 + total_comparisons)

# Skeleton the LLM will evolve:
SKELETON = """
def sort_fn(arr):
    comparisons = 0
    # YOUR CODE HERE
    return arr, comparisons
"""

Step 2: The evolutionary loop

python
import random
from openai import OpenAI  # or any LLM API

client = OpenAI()

class ProgramsDB:
    def __init__(self, n_islands=3, capacity=20):
        self.islands = [[] for _ in range(n_islands)]
        self.capacity = capacity

    def add(self, island_idx, program, score):
        island = self.islands[island_idx]
        island.append((score, program))
        island.sort(reverse=True)
        if len(island) > self.capacity:
            island.pop()  # evict worst

    def sample(self, island_idx, k=2):
        island = self.islands[island_idx]
        if len(island) < k:
            return [p for _, p in island]
        # Bias toward high-scoring programs
        weights = [1.0 / (i + 1) for i in range(len(island))]
        chosen = random.choices(island, weights=weights, k=k)
        return [p for _, p in chosen]

Step 3: Prompt construction and main loop

python
def build_prompt(spec, examples):
    prompt = f"Problem specification:\n{spec}\n\n"
    for i, ex in enumerate(examples):
        prompt += f"Example solution {i+1}:\n{ex}\n\n"
    prompt += "Write a BETTER version. Return only the function."
    return prompt

def run_funsearch(spec, skeleton, evaluate, n_iters=500):
    db = ProgramsDB(n_islands=3)
    # Seed each island with the skeleton
    for i in range(3):
        db.add(i, skeleton, evaluate(skeleton))

    best_score = 0
    for it in range(n_iters):
        island_idx = it % 3  # round-robin
        examples = db.sample(island_idx, k=2)
        prompt = build_prompt(spec, examples)

        # Call LLM with high temperature
        response = client.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user", "content": prompt}],
            temperature=1.0,
        )
        candidate = response.choices[0].message.content

        # Evaluate in sandbox (try/except as minimal sandbox)
        try:
            score = evaluate(candidate)
        except:
            score = 0

        if score > 0:
            db.add(island_idx, candidate, score)
        if score > best_score:
            best_score = score
            print(f"Iter {it}: new best = {score:.3f}")

    return db

That's ~60 lines for the entire core loop. The real engineering challenges are in the evaluator design (making it fast and discriminative), the sandboxing (Docker containers, gVisor, or Firecracker), and scaling to thousands of parallel LLM calls. But the algorithm itself is straightforward.

Practical tips for your own FunSearch

Decision	Recommendation
LLM choice	Any code-generation model works. GPT-4, Claude, Codestral. Bigger models give more creative programs but cost more per call.
Temperature	Start at 1.0. Lower if too many failures, raise if solutions converge too fast.
Island count	3-10 islands. More islands = more diversity = slower convergence. 3 is a good default.
Iterations	FunSearch ran millions of evaluations. For simpler problems, 500-5000 is enough.
Sandbox	At minimum, use subprocess with timeout. For production, use Docker or Firecracker.

In a FunSearch-style system, what does the human write?

The LLM prompts for each iteration The initial solution to improve The evaluator function and the skeleton of the function to evolve

Chapter 7: Showcase — Evolve a Heuristic

Now you get to be FunSearch. This simulation lets you run an evolutionary loop on a simplified heuristic optimization problem: finding a scoring function for online bin packing. Watch programs evolve, islands compete, and scores climb over generations.

FunSearch Simulator

Each row is an island. Dots represent programs (position = score). The yellow diamond is the global best. Adjust mutation rate and island count, then watch evolution in action.

Mutation Rate 0.3

Islands 3

Migration Rate 0.08

Generation: 0 — Global best: 0.000

Experiment: Try setting mutation rate very low (0.05). Watch how all islands converge to the same score — they get stuck. Now crank it to 0.8. Scores jump wildly but don't improve reliably. The sweet spot is 0.2-0.4, where you get both exploration and steady improvement. This is exactly the temperature dilemma in FunSearch.

Try cranking up islands to 6 and migration rate to 0.3. Notice how ideas spread quickly between islands — too quickly. The population homogenizes. Now try migration at 0.02. Islands develop unique "personalities" (solution strategies) but miss out on cross-pollination. The original FunSearch used infrequent migration to balance these forces.

Chapter 8: Lessons Learned

FunSearch's success teaches us several deep lessons about how to use LLMs for discovery. These lessons apply far beyond cap sets and bin packing.

Lesson 1: Verify, don't trust

The single most important design decision in FunSearch is that nothing the LLM produces is trusted. Every program is verified. This is the opposite of how most people use LLMs (generate text, read it, maybe check it manually). FunSearch treats the LLM as a generator of hypotheses, never as a source of truth.

The verification principle: LLMs are useful in proportion to how cheaply you can verify their output. Problems where verification is hard (open-ended writing, ethical reasoning) are bad candidates for FunSearch. Problems where verification is cheap (math, code, optimization) are perfect.

Lesson 2: Programs beat solutions

FunSearch doesn't evolve mathematical objects directly. It evolves programs that construct mathematical objects. Why? Because programs are compositional and transferable. A cap set of size 512 is just a list of numbers. But the program that constructs it captures the method — the insight, the trick, the structure. That program can be applied at any dimension, analyzed for patterns, and combined with other programs.

Lesson 3: Diversity is essential

Without the island model, FunSearch converges prematurely. The best-performing setup used 10-15 islands with infrequent migration. This mirrors evolutionary biology: geographic isolation drives speciation, and hybrid vigor comes from crossing distinct lineages.

Lesson 4: The evaluator is the bottleneck

The LLM can generate thousands of candidate programs per hour. But every one needs to be evaluated. If evaluation takes 30 seconds, you can evaluate ~120 programs per hour per core. DeepMind ran evaluations in parallel on hundreds of cores. For your own FunSearch, the speed of your evaluator determines everything.

When to use (and not use) FunSearch

Good Candidates	Bad Candidates
Combinatorial optimization (packing, scheduling, routing)	Tasks with no clear scoring metric
Algorithm design (sorting, hashing, compression)	Tasks requiring subjective judgment
Mathematical constructions (cap sets, Ramsey numbers)	Tasks where evaluation is as hard as the problem
Game playing heuristics (evaluation functions)	Tasks requiring real-world physical testing
ML hyperparameter search (fitness = validation loss)	Open-ended creative writing

Which of these problems is the WORST fit for a FunSearch-style approach?

Finding efficient sorting networks Designing bin packing heuristics for a known distribution Writing compelling marketing copy for a product

Chapter 9: Beyond FunSearch

FunSearch represents a new paradigm: LLMs as hypothesis generators inside formal verification loops. This idea is spreading rapidly.

What came next

System	Approach	Domain
AlphaCode / AlphaCode 2	LLM + evaluator for competitive programming	Code contests
AlphaProof	LLM + formal proof verifier for IMO problems	Mathematics
AlphaGeometry	Neuro-symbolic system for geometry	Olympiad geometry
EvoPrompting	Evolutionary search over prompts (not programs)	Any LLM task
OpenELM	Open-source evolutionary LLM search	General optimization

The common thread: separate generation from verification. Let the LLM be creative, let the evaluator be strict, and let evolution do the rest.

Open questions

Can we search over proofs, not just programs? FunSearch finds constructions but doesn't prove optimality. Combining FunSearch-style search with formal proof systems (Lean, Coq, Isabelle) could yield not just new math but verified new math.

How far can we push the evaluator? The biggest limitation is that the evaluator must be automated and fast. Can we use approximate evaluators (learned reward models) for problems where exact evaluation is expensive? This bridges FunSearch toward RLHF-style methods.

What about multi-objective optimization? FunSearch optimizes a single score. Real-world problems often have trade-offs (speed vs memory, accuracy vs latency). Multi-objective FunSearch would maintain a Pareto front instead of a single best score.

The big picture: FunSearch shows that LLMs are not just text generators — they are components in larger search systems. The future of AI-assisted discovery is not "ask a model and trust the answer." It's "ask a model, verify the answer, evolve, repeat." The model provides the spark; the system provides the rigor.

Connections

If you found this lesson interesting, explore these related topics:

Reward Alignment — RLHF uses a similar loop: generate → evaluate → update. FunSearch is RLHF for programs, not chat. Read the lesson →

Transformers — The LLM at the heart of FunSearch is a Transformer model. Understanding attention and generation helps you tune the search. Read the lesson →

What is the central paradigm shift that FunSearch represents?

Using bigger LLMs for harder problems Using LLMs as hypothesis generators inside automated verification loops, not as direct answer machines Replacing mathematicians with AI

FunSearchApplied

Chapter 0: Why Search for Programs?

Chapter 1: The Core Idea

Component 1: The LLM (Idea Generator)

Component 2: The Evaluator (Truth Filter)

Component 3: The Programs Database (Memory)

Chapter 2: The Cap Set Breakthrough

Why it matters

What FunSearch found

Chapter 3: Bin Packing — The Practical Win

Classical heuristics

What FunSearch found

Chapter 4: The Evaluator — The Unsung Hero

What makes a good evaluator?

The specification pattern

Sandboxing

Chapter 5: Evolutionary Strategy

Islands for diversity

Sampling and prompting

Population management

Chapter 6: Build Your Own FunSearch

Step 1: Define your specification

Step 2: The evolutionary loop

Step 3: Prompt construction and main loop

Practical tips for your own FunSearch

Chapter 7: Showcase — Evolve a Heuristic

Chapter 8: Lessons Learned

Lesson 1: Verify, don't trust

Lesson 2: Programs beat solutions

Lesson 3: Diversity is essential

Lesson 4: The evaluator is the bottleneck

When to use (and not use) FunSearch

Chapter 9: Beyond FunSearch

What came next

Open questions

Connections

FunSearch
Applied