LLM-Guided Program Search

FunSearch
Applied

How DeepMind used LLMs not to answer questions, but to discover new mathematics — by searching the space of programs, not the space of text.

Prerequisites: Basic Python + What an LLM is. That's it.
10
Chapters
8+
Simulations
0
Assumed Knowledge

Chapter 0: Why Search for Programs?

You have a hard math problem. Really hard — the kind where the best known solutions come from 40 years of human intuition, custom heuristics, and occasionally lucky guesses. You try asking an LLM to solve it. The LLM confidently generates an answer. The answer is wrong.

This is the fundamental limitation of LLMs on hard reasoning tasks: they produce plausible-looking text, but they cannot verify their own output. They hallucinate proofs. They invent numbers. They sound certain about things that are false.

But here is the key insight that makes FunSearch work: while LLMs are bad at producing correct answers, they are remarkably good at producing creative variations of programs. And programs, unlike natural language, can be automatically checked.

The FunSearch insight: Don't ask an LLM for answers. Ask it for programs that produce answers. Then check the answers with an automated evaluator. Keep the good programs, throw away the bad ones, and evolve toward solutions no human has ever found.
LLM Direct vs. LLM + Evaluator

Click "Generate" to see the difference. The red path (direct LLM) hallucinates answers. The green path (LLM + evaluator) only keeps verified solutions.

In December 2023, DeepMind published FunSearch (short for "searching in the Function space"). It used a pre-trained LLM — specifically a version of Codey, their code-generating model — paired with an automated evaluator inside an evolutionary loop. The result? New mathematical discoveries that surpassed decades of human effort.

Not incremental improvements. Not minor tweaks. FunSearch discovered genuinely new constructions for the cap set problem in combinatorics — structures that no mathematician had found before — and produced bin-packing heuristics that beat the state of the art on multiple benchmark distributions.

Why can't we just ask an LLM to solve hard math problems directly?

Chapter 1: The Core Idea

FunSearch is built on three components that form a closed loop. Understanding this loop is the single most important thing in this lesson. Everything else is detail.

Component 1: The LLM (Idea Generator)

A pre-trained code-generation LLM (Codey, based on PaLM 2) receives a prompt containing the problem specification and a few existing solution programs. It generates new candidate programs — creative variations, recombinations, and novel approaches. The LLM never sees scores or fitness values. It just sees code and produces code.

Component 2: The Evaluator (Truth Filter)

Every program the LLM generates is executed in a sandbox. The evaluator runs the program on test inputs and scores the output. This is the critical piece: the evaluator is deterministic and automated. There is no ambiguity. A program either produces a large cap set or it doesn't. It either packs bins efficiently or it doesn't. No LLM judgment is needed here — just math.

Component 3: The Programs Database (Memory)

A structured database stores the best programs found so far, organized by score into "islands" to maintain diversity. When the LLM needs inspiration, the system samples a few high-scoring programs from the database and includes them in the prompt.

LLM (Codey)
Generates candidate Python functions
Sandbox Execution
Run program, catch errors, measure output
Evaluator
Score: how good is the solution?
Programs Database
Store if score beats existing programs
↻ sample best → prompt LLM
Why this works: The LLM provides creativity (it can produce novel code patterns humans wouldn't think of). The evaluator provides correctness (it filters out the 99% of LLM output that is garbage). The database provides memory (good ideas accumulate over thousands of iterations). No single component could do this alone.

The name "FunSearch" is a play on words: it searches in the space of functions. The LLM doesn't generate solutions directly — it generates programs that generate solutions. This is a crucial distinction. A cap set of size 512 in F38 is a specific mathematical object. But the program that constructs such a set is a transferable recipe that can be analyzed, modified, and applied at larger scales.

The FunSearch Loop

Watch the evolutionary loop in action. Each dot is a candidate program. Green = accepted (beat threshold). Red = rejected. The yellow line tracks the best score found so far.

What does the LLM generate in FunSearch?

Chapter 2: The Cap Set Breakthrough

Let's talk about the problem that made headlines. Imagine a game with cards. Each card has properties that take one of three values (red/green/blue, circle/square/triangle, solid/striped/empty). A "set" is a collection of three cards where, for each property, the three cards are either all the same or all different. The cap set problem asks: what is the largest collection of cards you can choose such that NO three cards form a "set"?

This is exactly the mathematical cap set problem, just dressed up. Formally: in F3n (vectors of length n where each entry is 0, 1, or 2), a cap set is a subset where no three elements sum to zero modulo 3. The question is: how large can this set be?

Formal definition: A cap set in F3n is a set S of vectors such that for any three distinct vectors a, b, c in S, we have a + b + c ≠ 0 (mod 3). Equivalently, no three points are collinear in the affine geometry over F3.

Why it matters

The cap set problem connects to deep questions in additive combinatorics, coding theory, and complexity theory. The famous "cap set conjecture" (proved by Croot, Lev, Pach and Ellenberg-Gijswijt in 2016) showed that cap sets in F3n can't be too large — they must be smaller than (2.756...)n. But the conjecture only gives an upper bound. For specific dimensions, the exact answer is unknown for n ≥ 7.

What FunSearch found

The best known cap set constructions for large n had stood for over 40 years, built from a technique called the "Hill cap" construction. FunSearch discovered new constructions that beat these bounds. Specifically:

Dimension nPrevious BestFunSearchImprovement
n = 8496512+16 elements
Large nHill cap methodNew asymptotic constructionBeats 40-year record

But the real breakthrough wasn't just the numbers. It was that the programs FunSearch found contained interpretable structure. Mathematicians could look at the code and extract new insights about why certain constructions work. The programs weren't black boxes — they were compressed mathematical knowledge.

Cap Set Explorer (F₃²)

In F32, there are 9 points (a 3×3 grid). Click points to add them to your set. A red line appears if three selected points are collinear (form a "set"). Try to select as many points as possible without creating any red lines. The maximum is 4.

Selected: 0 / 9 — Lines: 0
The human analogy: Imagine you're a mathematician who has been trying to build large cap sets by hand for years. FunSearch is like having a tireless assistant who writes construction algorithms, tests millions of them, and hands you only the best ones. The constructions it found used a "symmetry-breaking" trick that human mathematicians had not considered.
What makes a cap set in F3n?

Chapter 3: Bin Packing — The Practical Win

Cap sets are beautiful mathematics, but FunSearch also tackled a problem with direct industrial impact: online bin packing. You work at a warehouse. Items arrive one at a time, each with a known size (between 0 and 1). You must place each item into a bin (capacity = 1) immediately — you can't wait to see future items. Your goal: use as few bins as possible.

This is the online bin packing problem, and it shows up everywhere: packing containers on ships, scheduling jobs on servers, allocating memory in operating systems, cutting stock in manufacturing. Any time you must assign incoming items to fixed-capacity resources without knowing the future, you're bin packing.

Classical heuristics

Decades of research have produced well-known heuristics:

HeuristicRuleQuality
First FitPlace item in first bin that has roomDecent baseline
Best FitPlace item in fullest bin that has roomBetter on average
First Fit DecreasingSort items largest-first, then First FitExcellent (offline only)
HarmonicClassify items by size into classes, pack by classGood worst-case bound

But these are general-purpose. If you know the distribution that items are drawn from (say, uniform on [0,1], or a Weibull distribution skewed toward small items), you should be able to do better by exploiting that structure.

What FunSearch found

FunSearch discovered heuristics that outperformed all known algorithms on specific distributions. The key was that FunSearch could tailor its heuristic to the distribution. For uniform [0,1] items, FunSearch found a program that used adaptive bin-class boundaries — it assigned items to categories based on their size, but the category boundaries were not the obvious ones (1/2, 1/3, 1/4...). Instead, they were slightly asymmetric, tuned to the distribution in ways that human designers had not considered.

Verified result: On the OR-Library benchmark distributions, FunSearch heuristics used fewer bins than First Fit Decreasing, Best Fit, and even custom hand-tuned heuristics. The improvements ranged from 0.5% to over 1% — which at warehouse scale means thousands of fewer bins per day.
Online Bin Packing Simulator

Items arrive one at a time. Compare First Fit vs Best Fit vs FunSearch-style (adaptive boundaries). Watch how many bins each uses.

Speed 5
Why can FunSearch beat classical bin packing heuristics?

Chapter 4: The Evaluator — The Unsung Hero

Everyone focuses on the LLM. But the evaluator is what makes FunSearch actually work. Without it, you just have an LLM generating random programs — most of which are wrong, crash, or produce terrible solutions. The evaluator is the selection pressure that turns random variation into directed search.

What makes a good evaluator?

The evaluator must have four properties:

Automated. No human in the loop. The evaluator runs code, measures output, returns a number. Thousands of evaluations per hour.
Fast. Each evaluation should take seconds, not hours. This rules out problems where checking a solution requires expensive computation.
Deterministic. Same program, same score, every time. No stochasticity in evaluation (though the programs themselves may be stochastic — you average over multiple runs).
Discriminative. The score must distinguish good programs from great ones. A binary "correct/incorrect" isn't enough — you need a gradient to climb.

The specification pattern

In FunSearch, the user provides a specification — a Python file that defines:

python
def evaluate(program_output):
    """Score a candidate solution. Higher = better."""
    # For cap sets: return the size of the set
    #   IF no three elements sum to zero mod 3
    #   ELSE return 0 (invalid)
    if is_valid_cap_set(program_output):
        return len(program_output)
    return 0

def priority(item, bins):
    """The function the LLM must evolve."""
    # This is the "skeleton" - LLM fills in the body
    pass

The specification separates what we want (the evaluate function, written by the human) from how to achieve it (the priority function, evolved by the LLM). The human never writes the solution. They write the test.

The specification is the interface between human intent and machine creativity. It's the most important piece a human contributes to FunSearch. A well-designed spec makes the difference between FunSearch finding nothing and FunSearch making discoveries. The spec must encode the problem's constraints precisely enough that bad solutions score zero, while providing a smooth gradient from "mediocre" to "excellent."

Sandboxing

Because the LLM generates arbitrary code, every candidate program runs in a sandbox with strict timeouts (typically 30 seconds) and no network access. Programs that crash, hang, or produce malformed output simply score zero. This makes the system robust to the vast majority of LLM failures — bad programs are silently discarded.

Evaluator Scoring

Each bar represents a candidate program. Height = score. Click "Evaluate Batch" to generate a new batch and see how the evaluator filters them. Most programs fail (score 0) or are mediocre. Only a few are good.

Click to generate a batch
What does the human provide in a FunSearch specification?

Chapter 5: Evolutionary Strategy

FunSearch doesn't just randomly generate programs and keep the best one. It uses an evolutionary strategy inspired by biological evolution. The key data structure is the programs database, which maintains a population of programs organized into islands.

Islands for diversity

Think of the Galapagos. Darwin's finches evolved different beak shapes on different islands because each island had different selection pressures. FunSearch does the same thing: it maintains multiple islands, each with its own population of programs. Within an island, programs are ranked by score. Periodically, the best program from one island migrates to another, cross-pollinating good ideas.

Why islands? Without them, the population quickly converges to variations of one good solution. With islands, different lineages explore different parts of the solution space. One island might discover a clever recursion trick. Another might find an efficient data structure. When they cross, the offspring might combine both.

Sampling and prompting

When it's time to generate a new candidate, FunSearch:

Step 1
Pick an island (round-robin or weighted)
Step 2
Sample 2-3 programs from the island, biased toward high scores
Step 3
Build a prompt: specification + sampled programs + "write a better version"
Step 4
LLM generates a new program
Step 5
Evaluate → if score is high enough, add to island

The prompt is constructed carefully. It includes the problem specification (so the LLM knows what the function signature should be), followed by 2-3 existing programs at different score levels. The LLM sees programs that are "pretty good" and "very good" and is implicitly challenged to produce something "even better." This is in-context learning as evolutionary recombination.

Temperature matters. FunSearch uses a relatively high temperature (around 1.0) for sampling from the LLM. Lower temperature would produce safe, incremental variations. Higher temperature increases the chance of creative leaps — and creative failures. The evaluator filters out the failures, so the high temperature is a net positive.

Population management

Each island holds at most K programs (typically K = 40). When a new high-scoring program arrives and the island is full, the lowest-scoring program is evicted. Additionally, programs are grouped by score into clusters, and sampling is biased toward higher clusters. This creates soft selection pressure: mediocre programs survive for a while (providing diversity) but are eventually displaced by better ones.

Island Evolution Sim

Watch 3 islands evolve independently. Each column is an island, each dot is a program (height = score). Occasionally a program migrates between islands (dashed line).

Why does FunSearch use multiple "islands" in its programs database?

Chapter 6: Build Your Own FunSearch

You don't need DeepMind's infrastructure to build a FunSearch-style system. The core loop is surprisingly simple. Let's walk through a minimal implementation step by step.

Step 1: Define your specification

Pick a problem where you can automatically score solutions. Here's a toy example: find a Python function that sorts a list using as few comparisons as possible.

python
# specification.py — the human-written part

def evaluate(sort_fn):
    """Score a sorting function by comparison count."""
    test_cases = [
        [3, 1, 4, 1, 5],
        [9, 2, 6, 5, 3, 8, 7],
        list(range(20, 0, -1)),  # worst case for naive sorts
    ]
    total_comparisons = 0
    for case in test_cases:
        result, comps = sort_fn(case[:])  # copy to avoid mutation
        if result != sorted(case):
            return 0  # incorrect = score 0
        total_comparisons += comps
    # Higher score = fewer comparisons (invert)
    return 1000 / (1 + total_comparisons)

# Skeleton the LLM will evolve:
SKELETON = """
def sort_fn(arr):
    comparisons = 0
    # YOUR CODE HERE
    return arr, comparisons
"""

Step 2: The evolutionary loop

python
import random
from openai import OpenAI  # or any LLM API

client = OpenAI()

class ProgramsDB:
    def __init__(self, n_islands=3, capacity=20):
        self.islands = [[] for _ in range(n_islands)]
        self.capacity = capacity

    def add(self, island_idx, program, score):
        island = self.islands[island_idx]
        island.append((score, program))
        island.sort(reverse=True)
        if len(island) > self.capacity:
            island.pop()  # evict worst

    def sample(self, island_idx, k=2):
        island = self.islands[island_idx]
        if len(island) < k:
            return [p for _, p in island]
        # Bias toward high-scoring programs
        weights = [1.0 / (i + 1) for i in range(len(island))]
        chosen = random.choices(island, weights=weights, k=k)
        return [p for _, p in chosen]

Step 3: Prompt construction and main loop

python
def build_prompt(spec, examples):
    prompt = f"Problem specification:\n{spec}\n\n"
    for i, ex in enumerate(examples):
        prompt += f"Example solution {i+1}:\n{ex}\n\n"
    prompt += "Write a BETTER version. Return only the function."
    return prompt

def run_funsearch(spec, skeleton, evaluate, n_iters=500):
    db = ProgramsDB(n_islands=3)
    # Seed each island with the skeleton
    for i in range(3):
        db.add(i, skeleton, evaluate(skeleton))

    best_score = 0
    for it in range(n_iters):
        island_idx = it % 3  # round-robin
        examples = db.sample(island_idx, k=2)
        prompt = build_prompt(spec, examples)

        # Call LLM with high temperature
        response = client.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user", "content": prompt}],
            temperature=1.0,
        )
        candidate = response.choices[0].message.content

        # Evaluate in sandbox (try/except as minimal sandbox)
        try:
            score = evaluate(candidate)
        except:
            score = 0

        if score > 0:
            db.add(island_idx, candidate, score)
        if score > best_score:
            best_score = score
            print(f"Iter {it}: new best = {score:.3f}")

    return db
That's ~60 lines for the entire core loop. The real engineering challenges are in the evaluator design (making it fast and discriminative), the sandboxing (Docker containers, gVisor, or Firecracker), and scaling to thousands of parallel LLM calls. But the algorithm itself is straightforward.

Practical tips for your own FunSearch

DecisionRecommendation
LLM choiceAny code-generation model works. GPT-4, Claude, Codestral. Bigger models give more creative programs but cost more per call.
TemperatureStart at 1.0. Lower if too many failures, raise if solutions converge too fast.
Island count3-10 islands. More islands = more diversity = slower convergence. 3 is a good default.
IterationsFunSearch ran millions of evaluations. For simpler problems, 500-5000 is enough.
SandboxAt minimum, use subprocess with timeout. For production, use Docker or Firecracker.
In a FunSearch-style system, what does the human write?

Chapter 7: Showcase — Evolve a Heuristic

Now you get to be FunSearch. This simulation lets you run an evolutionary loop on a simplified heuristic optimization problem: finding a scoring function for online bin packing. Watch programs evolve, islands compete, and scores climb over generations.

FunSearch Simulator

Each row is an island. Dots represent programs (position = score). The yellow diamond is the global best. Adjust mutation rate and island count, then watch evolution in action.

Mutation Rate 0.3
Islands 3
Migration Rate 0.08
Generation: 0 — Global best: 0.000
Experiment: Try setting mutation rate very low (0.05). Watch how all islands converge to the same score — they get stuck. Now crank it to 0.8. Scores jump wildly but don't improve reliably. The sweet spot is 0.2-0.4, where you get both exploration and steady improvement. This is exactly the temperature dilemma in FunSearch.

Try cranking up islands to 6 and migration rate to 0.3. Notice how ideas spread quickly between islands — too quickly. The population homogenizes. Now try migration at 0.02. Islands develop unique "personalities" (solution strategies) but miss out on cross-pollination. The original FunSearch used infrequent migration to balance these forces.

Chapter 8: Lessons Learned

FunSearch's success teaches us several deep lessons about how to use LLMs for discovery. These lessons apply far beyond cap sets and bin packing.

Lesson 1: Verify, don't trust

The single most important design decision in FunSearch is that nothing the LLM produces is trusted. Every program is verified. This is the opposite of how most people use LLMs (generate text, read it, maybe check it manually). FunSearch treats the LLM as a generator of hypotheses, never as a source of truth.

The verification principle: LLMs are useful in proportion to how cheaply you can verify their output. Problems where verification is hard (open-ended writing, ethical reasoning) are bad candidates for FunSearch. Problems where verification is cheap (math, code, optimization) are perfect.

Lesson 2: Programs beat solutions

FunSearch doesn't evolve mathematical objects directly. It evolves programs that construct mathematical objects. Why? Because programs are compositional and transferable. A cap set of size 512 is just a list of numbers. But the program that constructs it captures the method — the insight, the trick, the structure. That program can be applied at any dimension, analyzed for patterns, and combined with other programs.

Lesson 3: Diversity is essential

Without the island model, FunSearch converges prematurely. The best-performing setup used 10-15 islands with infrequent migration. This mirrors evolutionary biology: geographic isolation drives speciation, and hybrid vigor comes from crossing distinct lineages.

Lesson 4: The evaluator is the bottleneck

The LLM can generate thousands of candidate programs per hour. But every one needs to be evaluated. If evaluation takes 30 seconds, you can evaluate ~120 programs per hour per core. DeepMind ran evaluations in parallel on hundreds of cores. For your own FunSearch, the speed of your evaluator determines everything.

When to use (and not use) FunSearch

Good CandidatesBad Candidates
Combinatorial optimization (packing, scheduling, routing)Tasks with no clear scoring metric
Algorithm design (sorting, hashing, compression)Tasks requiring subjective judgment
Mathematical constructions (cap sets, Ramsey numbers)Tasks where evaluation is as hard as the problem
Game playing heuristics (evaluation functions)Tasks requiring real-world physical testing
ML hyperparameter search (fitness = validation loss)Open-ended creative writing
Which of these problems is the WORST fit for a FunSearch-style approach?

Chapter 9: Beyond FunSearch

FunSearch represents a new paradigm: LLMs as hypothesis generators inside formal verification loops. This idea is spreading rapidly.

What came next

SystemApproachDomain
AlphaCode / AlphaCode 2LLM + evaluator for competitive programmingCode contests
AlphaProofLLM + formal proof verifier for IMO problemsMathematics
AlphaGeometryNeuro-symbolic system for geometryOlympiad geometry
EvoPromptingEvolutionary search over prompts (not programs)Any LLM task
OpenELMOpen-source evolutionary LLM searchGeneral optimization

The common thread: separate generation from verification. Let the LLM be creative, let the evaluator be strict, and let evolution do the rest.

Open questions

Can we search over proofs, not just programs? FunSearch finds constructions but doesn't prove optimality. Combining FunSearch-style search with formal proof systems (Lean, Coq, Isabelle) could yield not just new math but verified new math.

How far can we push the evaluator? The biggest limitation is that the evaluator must be automated and fast. Can we use approximate evaluators (learned reward models) for problems where exact evaluation is expensive? This bridges FunSearch toward RLHF-style methods.

What about multi-objective optimization? FunSearch optimizes a single score. Real-world problems often have trade-offs (speed vs memory, accuracy vs latency). Multi-objective FunSearch would maintain a Pareto front instead of a single best score.

The big picture: FunSearch shows that LLMs are not just text generators — they are components in larger search systems. The future of AI-assisted discovery is not "ask a model and trust the answer." It's "ask a model, verify the answer, evolve, repeat." The model provides the spark; the system provides the rigor.

Connections

If you found this lesson interesting, explore these related topics:

Reward Alignment — RLHF uses a similar loop: generate → evaluate → update. FunSearch is RLHF for programs, not chat. Read the lesson →
Transformers — The LLM at the heart of FunSearch is a Transformer model. Understanding attention and generation helps you tune the search. Read the lesson →
What is the central paradigm shift that FunSearch represents?