Amplifying Human Performance in Combinatorial Competitive Programming

Chapter 0: The Problem

You are a team of four in Google's Hash Code competition. You have four hours. The problem drops: assign servers to data center pools to maximize guaranteed capacity. Or schedule a fleet of self-driving cars across a city to maximize ride completions. Or design traffic light schedules for every intersection in a map.

These are NP-hard combinatorial optimization problems. There is no known polynomial-time algorithm that finds the optimal solution. Every team knows this going in. The game is not about finding the best solution — it is about finding the best heuristic that produces the highest-scoring solution within the time limit.

In practice, top teams follow a predictable pattern. First, they write a greedy algorithm — a backbone that builds a solution step by step, always making the locally best choice. Then they write a scoring function that tells the greedy algorithm what "locally best" means. Finally, they spend the remaining time hand-tuning the scoring function, sometimes wrapping it in randomized hill-climbing.

The core question: What if the scoring-function tuning — the creative, time-consuming part that separates good teams from great teams — could be automated by an LLM-guided evolutionary search? That is exactly what this paper shows. Human competitors write the backbone. FunSearch evolves the scoring function. Together, they outperform the top human teams.

The collaboration is asymmetric and deliberate. Humans are good at understanding the problem structure, designing data representations, and writing greedy frameworks. But searching the space of possible scoring functions — where a small constant or a subtle formula change can mean thousands of points — is tedious, error-prone, and perfectly suited for automated exploration.

The results are striking: across all eight Hash Code online qualification rounds from 2015 to 2022, their evolved solutions consistently rank in the top percentile. On five of eight contests, they outperform the highest-scoring human team. And these results hold even when limiting FunSearch to just two hours of evolution time.

Why are combinatorial optimization competitions a better fit for LLM-guided evolution than standard competitive programming (Codeforces-style)?

Optimization problems reward partial solutions and incremental improvements, so evolved scoring functions that are slightly better produce measurably higher scores — whereas Codeforces needs fully correct code with no partial credit LLMs are better at writing Python than C++ Hash Code has easier problems

Chapter 1: Codeforces vs Hash Code

To appreciate why this approach works, we need to understand the fundamental difference between two kinds of competitive programming.

Codeforces-style: correctness competitions

Codeforces-style problems require polynomial-time algorithms — you need to find the exact, correct answer. There is no reward for "close." The test cases often include hidden edge cases designed to break naive solutions. Many tasks follow well-known design patterns (segment trees, dynamic programming, network flow) that experienced competitors can recognize and apply.

This makes AI-based approaches fragile: the model needs to produce perfectly correct code that handles every edge case, including ones not visible to the competitor. A single off-by-one error means zero points.

Hash Code-style: optimization competitions

Hash Code problems are intractable. Optimal solutions are unknown. The test cases are visible upfront — no hidden inputs. Every valid solution scores something, and the score measures quality on a continuous scale. Even the best human teams produce suboptimal solutions and know it.

Property	Codeforces	Hash Code
Solution type	Exact, correct	Heuristic, approximate
Scoring	Binary (pass/fail per test)	Continuous (quality metric)
Test cases	Hidden, adversarial	Visible, realistic
Optimal known?	Yes (polynomial algorithm exists)	No (NP-hard)
Partial credit	No	Yes — every valid solution scores
Improvement style	Insight-based (find the right algorithm)	Incremental (tune the heuristic)

Why this matters for AI: Optimization competitions provide a smooth fitness landscape. A slightly better scoring function produces a slightly higher score. This is exactly what evolutionary search needs — a gradient to follow. In correctness competitions, the landscape is flat (zero points) with rare spikes (full credit), making search extremely difficult.

Hash Code mechanics

Hash Code was a global team programming competition organized by Google. At its peak, it attracted over 125,000 competitors. Each qualification round presented a single NP-hard optimization problem with multiple test inputs. Teams had four hours to submit solutions. The top ~50 teams advanced to the final round.

The competition ran from 2015 to 2022, giving us eight qualification rounds to benchmark against. The paper uses all eight, accessing the problems and competitor scores from the public archive.

Teams had two phases: (1) parse the input and design a backbone algorithm, (2) iterate on the scoring function to squeeze out more points. The authors split this identically: the humans handle phase 1, FunSearch handles phase 2. They budgeted two hours for the human backbone design and two hours for FunSearch evolution.

A Hash Code team writes a greedy algorithm that produces valid solutions but scores in the 30th percentile. What is the most likely bottleneck?

The input parser has bugs The greedy backbone is too slow The scoring function that guides greedy decisions is too simple — it does not capture the problem's nuances, so the greedy choices are locally reasonable but globally suboptimal

Chapter 2: The Workflow

The approach decomposes the competition into two cleanly separated roles: the human competitor designs the algorithmic backbone, and FunSearch evolves the scoring function that guides it.

1. Human writes backbone

Greedy algorithm + input parser + evaluator — ~400 lines of Python

↓

2. Human writes base scoring function

A simple heuristic: return 1 or a basic ratio — gets a baseline score

↓

3. FunSearch evolves scoring function

LLM proposes mutations; evaluator scores them; population database keeps the best

↓

4. (Optional) Analyze & iterate backbone

Human inspects evolved functions, adjusts backbone structure, re-runs FunSearch

↓

5. Submit best solution

The evolved scoring function + backbone produce the final output

What the backbone looks like

Each backbone has three components:

Input parser: reads the problem-specific input file into data structures (dataclasses, lists, dictionaries)
Greedy algorithm: builds a solution step by step, calling score_greedy() at each decision point to rank candidates
Evaluator: computes the final score of a complete solution (used as the fitness function for FunSearch)

The scoring function is the only part that FunSearch evolves. It has a fixed signature — it takes the current state and a candidate action, and returns a numeric score. The backbone is frozen during evolution.

Think of it this way: The backbone is a robot arm that can pick and place objects. The scoring function is the brain that decides which object to pick next. The arm's mechanics don't change — but a smarter brain makes enormously better choices.

Why this split works

The backbone requires deep understanding of the problem: what data structures to use, what order to iterate in, how to handle constraints. This is where human insight shines. The scoring function, in contrast, is a numerical fitness landscape — a function from problem state to a real number. Small changes to constants, added terms, or restructured conditionals can shift the score significantly. This is where automated search shines.

The combined implementations are modest: about 400 lines of Python including docstrings, plus roughly 200 lines for the evaluator. The backbone alone is insufficient to win — base scoring functions typically land between the 20th and 60th percentile. But FunSearch-evolved scoring functions push the same backbone into the top 1%.

Key constraint: Each test input within a qualification round is treated as a separate optimization problem. Different inputs may benefit from different scoring functions. The paper runs FunSearch independently per input, sometimes even using different backbone variants for different inputs within the same contest.

Why does FunSearch evolve only the scoring function and not the entire backbone?

The scoring function is a small, self-contained function whose output can be evaluated automatically — evolving the entire backbone would mean evolving hundreds of lines of tightly coupled code with data structures, parsing, and constraint handling that the LLM would likely break The backbone is too long for the LLM context window The LLM cannot write Python dataclasses

Chapter 3: FunSearch Internals

FunSearch (Romera-Paredes et al., 2024) is the engine that evolves scoring functions. It pairs a pre-trained LLM with a systematic evaluator inside an evolutionary loop. Let us trace through the complete mechanism.

The evolutionary loop

Sample parents

Pick best-performing programs from the population database

↓

Build LLM prompt

Show parent code + its score + task description

↓

LLM generates mutation

Proposes modifications or entirely new function body

↓

Evaluate candidate

Run backbone with new scoring function, measure fitness

↓

Update population

Insert if fitness exceeds threshold; island-based diversity maintenance

The evaluator is critical. It runs the entire backbone with the proposed scoring function and returns the actual competition score. This is not a proxy metric — it is the real thing. If the proposed function crashes, returns invalid types, or produces worse scores, the candidate is discarded.

Configuration for competitive programming

The paper makes several adaptations from the original FunSearch setup:

Parameter	Original FunSearch	This paper
LLM	PaLM 2 (fine-tuned on code)	Gemini 1.5 Flash 002 (general, off-the-shelf)
Evolution target	Single scoring function	Scoring function with switchable variable (project vs role)
Memory limit	Default	10 GB per program
Time limit	Default	1,800 seconds wall-clock per evaluation
Multiple choice points	No	Yes — one function with a boolean switch for two choice contexts

No fine-tuning needed: A key finding is that modern general-purpose LLMs (Gemini 1.5 Flash 002) have sufficient code-writing capabilities to serve as the mutation operator. The original FunSearch used a code-specialized fine-tuned PaLM 2. This paper shows that fine-tuning is no longer necessary — the evolutionary framework compensates by filtering out bad mutations automatically.

The switching variable trick

Many Hash Code problems have multiple decision points in the greedy algorithm. For example, the 2022 Mentorship and Teamwork problem requires choosing both which project to work on and which role to assign to each person. Rather than evolving two separate functions, the paper uses a single function with a boolean parameter rate_project:

def score_greedy(project, role_id, assignments,
                   rate_project: bool) -> int:
    """Score choosing a project (rate_project=True)
       or assigning a role (rate_project=False)."""
    if rate_project:
        return 1  # base: all projects equal
    else:
        skill, level = project.roles[role_id]
        return 1  # base: all roles equal

FunSearch evolves both branches simultaneously. The boolean switch lets a single function body encode two different scoring policies, and the evolutionary pressure optimizes both in tandem.

Island-based population

FunSearch maintains multiple "islands" — independent sub-populations that evolve separately. Periodically, the best programs from one island migrate to others. This prevents premature convergence: different islands may explore different regions of the scoring-function space, and cross-pollination combines their discoveries.

The prompt structure

What does the LLM actually see? The prompt contains:

The task description (docstring of the function being evolved)
The backbone code (frozen, for context)
The parent scoring function (the best-so-far version to improve)
Optionally, "inspiration" programs from other islands with their scores
An instruction to produce an improved version of the scoring function

The LLM responds with a modified function. FunSearch applies the modification, runs the backbone with the new function, and scores the result. If the score improves, the new function enters the population. If not, it is discarded. This is pure selection pressure — no gradient, no reward shaping, just "did the score go up?"

Computational budget

Within two hours, the system evaluates around 10,500 programs on average across the eight Hash Code problems. Each evaluation runs the full backbone + scoring function on the test input, which can take up to 30 minutes for the most complex problems. The LLM calls themselves are fast (Gemini 1.5 Flash generates a candidate in seconds); the bottleneck is evaluation. This is why the scoring function is kept small — a shorter function means faster evaluation means more generations in the same wall-clock budget.

Why does this paper use Gemini 1.5 Flash 002 instead of a code-specialized fine-tuned model?

Fine-tuned models are more expensive General-purpose LLMs now have sufficient code capabilities, and the evolutionary framework's fitness-based filtering compensates for the lack of specialization — bad mutations are simply discarded, so the LLM only needs to produce good code sometimes Gemini is faster than PaLM 2

Chapter 4: Backbone + Score Function

Let us look at the anatomy of a backbone in detail. Every backbone follows the same three-part pattern, regardless of the specific Hash Code problem.

Part 1: Input parser

The parser reads the competition input file and constructs typed Python objects. For the 2015 Data Center problem, this means Server dataclasses (with index, size, and capacity), row blocks (sorted lists of available slots), and pool counts. The parser is straightforward — read the file, parse integers, build lists.

Part 2: Greedy algorithm

The greedy algorithm iterates through decision points (e.g., "which server to place next" or "which ride to assign to this car") and at each point calls score_greedy() on every candidate. It picks the candidate with the highest score. This is the backbone's skeleton — it never changes during evolution.

# Simplified greedy loop (data center placement)
while open_rows and open_servers:
    for row in open_rows:
        break
    best_score = -1
    best_server = None
    for s_id in open_servers:
        curr_score = score_greedy(servers[s_id], row,
                                   None, None, True)
        if best_server is None or curr_score > best_score:
            best_score = curr_score
            best_server = s_id
    # Place best_server in current row...
    placement[best_server] = (row, col, best_pool)

Part 3: Evaluator

The evaluator runs the full solution and returns the official competition score. This becomes FunSearch's fitness function. Critically, the evaluator is not the scoring function — it is the ground truth. The scoring function guides the greedy algorithm's local decisions; the evaluator measures the global quality of the resulting solution.

The scoring function is NOT the fitness function. The scoring function tells the greedy algorithm "choose this server next because it scores 3.7." The evaluator says "the complete placement scores 348 points." FunSearch optimizes the scoring function to maximize the evaluator's output. This indirection is what makes the approach work — the scoring function shapes local decisions, and the evaluator judges their global consequence.

Evolution in action: from trivial to elaborate

Consider the 2015 Data Center problem. The base scoring function is as simple as possible:

if rate_server:
    return server.capacity / server.size
for c_row in pools_per_row:
    total_sum += pools_per_row[c_row][pool]
return -total_sum + pools_per_row[row][pool]

This achieves a score of 348. After two hours of FunSearch evolution, the scoring function becomes dramatically more elaborate — a chain of 15 LLM mutation calls from the original. The evolved version includes:

Assertions that guard against invalid inputs
An algebraic expression with evolved constants for scoring
Exception handlers that would trigger on invalid inputs (never reached in practice but inserted by the LLM)
A return -100 statement that penalizes empty pools

The evolved function scores 405 — sufficient for second place overall. With more evolution time, it reaches 413 — rank 1.

15 mutation calls. The final scoring function is the result of 15 successive LLM-proposed modifications, each one tested and accepted because it improved the evaluator score. No single mutation was a large leap — the improvement accumulated through many small, verified steps.

The base scoring function for the Data Center problem returns server.capacity / server.size. What is this heuristic doing?

Preferring servers with the highest capacity-to-size ratio — maximizing capacity per slot of row space consumed, which is a natural greedy heuristic for packing problems Preferring the largest servers Preferring servers that fit in the most rows

Chapter 5: Case Study — Data Center (2015)

The 2015 Hash Code qualification round asked teams to optimize the placement of servers in a data center. Let us trace through the entire problem, backbone, and evolution to see the approach in concrete detail.

The problem

A data center has R rows of slots. Some slots are unavailable (occupied by other hardware). Each row has contiguous blocks of available slots. You have S servers, each with a size (how many slots it occupies) and a capacity (how much compute it provides). You must assign servers to rows and organize them into P pools.

The objective is to maximize the guaranteed capacity of the worst-off pool. A pool's guaranteed capacity is its total capacity minus the maximum capacity it has in any single row — because if that row goes down, the pool loses that capacity. This is a min-max problem: you want to spread each pool's capacity across rows so that no single row failure is catastrophic.

guaranteed(pool) = total_capacity(pool) − max_row(capacity_in_row(pool, row))

score = min_pool guaranteed(pool)

Intuition: Imagine a pool with 100 total capacity, but 60 of it is in row 3. If row 3 fails, the pool drops to 40. The guaranteed capacity is 100 − 60 = 40. A better distribution — say 34, 33, 33 across three rows — gives guaranteed capacity 100 − 34 = 66. The scoring function must encode this spreading logic implicitly.

The backbone

The greedy algorithm makes two interleaved decisions:

Choose a server to place next (rate_server = True)
Choose a pool to assign it to (rate_server = False)

For each decision, it calls score_greedy() with the appropriate boolean flag and picks the highest-scoring option.

Evolution trace

The base scoring function for the server choice is simply capacity / size. For the pool choice, it returns a simple metric: -total_sum + pools_per_row[row][pool] — preferring pools that have less capacity in the current row (to spread capacity evenly).

After FunSearch runs for two hours, the evolved function balloons to dozens of lines with:

Dynamic thresholds: different scoring paths depending on whether server_capacity >= total_sum / 2.0
Logarithmic scaling: log(pools_per_row) * 0.0007 terms that gently discourage over-packing
Evolved magic numbers: constants like 0.0004, 1.006, 1.13 that were discovered through evolution, not human intuition
A hard threshold: return -100 when server size exceeds 125, which the LLM discovered penalizes allocating very large servers

# Evolved scoring function snippet (simplified)
assert server_size > 0
if rate_server:
    if server_size > 125:
        return -100  # Penalize very large servers
    cap = (0.7 - 4.96 * server_size / 125) + (
        server_capacity / 122)
    if server_capacity / server_size > 7.5:
        cap += 0.0
    if server_capacity / server_size > 7.96:
        cap += 11.0  # Bonus for high-efficiency servers
    return cap

The evolved function discovers a strategy humans did not: penalizing large servers with return -100 when server.size > 125. This is counter-intuitive — large servers often have high capacity. But the scoring function discovered that placing many small, efficient servers provides better spreading across pools than a few large ones. The return -100 acts as a hard filter that fundamentally changes the allocation strategy.

Walking through the math

Let us derive why capacity-per-slot is a reasonable starting heuristic but insufficient. Suppose you have two servers:

Server A: size = 2, capacity = 10 (ratio = 5.0)
Server B: size = 5, capacity = 20 (ratio = 4.0)

The ratio heuristic prefers A. But consider: if you place both in row 0 and both in pool 0, the guaranteed capacity is (10+20) − (10+20) = 0. If instead you spread them across rows, guaranteed capacity is (10+20) − max(10,20) = 10. The ratio heuristic says nothing about which pool or which row to use — that is what the pool-selection branch of the scoring function must learn.

guaranteed(pool) = ∑_r cap(pool,r) − max_r cap(pool,r)

Maximizing this requires minimizing the max-row concentration. An ideal allocation distributes each pool's servers evenly across rows. But rows have different amounts of available space, and servers have different sizes, creating a complex packing-and-spreading tradeoff that the evolved function must navigate.

The full evolved function anatomy

The final rank-1 scoring function (after extended evolution) has this structure:

# Evolved score_greedy (Data Center, rank-1 variant)
assert server_size > 0
assert server_capacity > 0
if pool is not None:
    total_sum = pools_per_row[row][pool]
    max_sum = 0
    pool_size = 0
    for p in pools_per_row:
        total_sum += pools_per_row[p][pool]
        max_sum = max(pools_per_row[p][pool], max_sum)
        if p == pool:
            pool_list = pools_per_row[p][pool]
    if total_sum == 0:
        return row_score  # Cold-start: pick by row
    else:
        pool_score = -0.5 + total_sum / server_size * \
                     max_sum / server_size
        pool_bonus = 0.15 + (total_sum - pool_size)
        # ... 30+ more lines of evolved logic

The key insight: the evolved function considers the current state of pool allocations (via pools_per_row) and makes decisions that account for the global distribution, not just the immediate server quality. This is emergent multi-step planning encoded in a single scoring function.

Interactive simulation: watch how FunSearch evolves a scoring function for a simplified data center placement. Click "Evolve" to run 5 generations. The bars show pool guaranteed capacities — the overall score is the minimum bar height (shown in warm/orange). The chart tracks score improvement over generations.

Why does the evolved scoring function include return -100 for servers with size > 125?

Large servers crash the evaluator Large servers consume many slots but concentrating capacity in fewer, bigger servers makes pools more vulnerable to single-row failures — the evolved heuristic discovered that many small efficient servers provide better guaranteed capacity through better spreading It is a bug that happens to help

Chapter 6: Case Study — Self-Driving Rides (2018)

The 2018 Hash Code problem takes us from data centers to city streets. You manage a fleet of self-driving cars, and you need to assign ride requests to maximize the total score.

The problem

You are given F cars and a set of ride orders. Each ride has a start point, an end point, an earliest start time, and a latest finish time. Completing a ride earns points equal to the Manhattan distance of the ride. There is a bonus B for starting a ride at exactly its earliest start time. The task: assign rides to cars to maximize the sum of distances plus bonuses, subject to the constraint that rides must finish before their deadline.

score(ride) = manhattan(start, end) + B · [pickup_time = earliest_start]

The backbone

The greedy algorithm maintains a priority queue of cars, keyed by the time at which each car finishes its latest ride. At each step, the car that becomes free earliest is popped from the queue, and the scoring function chooses the best available ride for it.

# Base scoring function: pick the first feasible ride
for i, r in enumerate(rides):
    pickup_time = time + r.distance_to_start(coords)
    if (pickup_time >= r.earliest_start) and (
        pickup_time + r.length() < r.latest_finish):
        return i  # First feasible ride
return -1  # No feasible ride

This baseline scores 3,528,556 points on the d_metropolis test case. After two hours of evolution, the function becomes dramatically more sophisticated.

What the LLM discovers

The first meaningful evolution corrects a subtle inefficiency in the base function:

Discovery 1: relaxing the pickup constraint. The base function requires pickup_time >= r.earliest_start. The evolved version removes this constraint — realizing that the car can simply wait if it arrives before the earliest start. This means the car arrives early and waits, but this is often better than skipping the ride entirely. Score jumps to 11,739,630.

After more evolution, the scoring function develops an elaborate multi-factor scoring system:

# Evolved scoring function (simplified)
rides_by_length = [(r.length(), distance(coords, r.start))
                    for i, r in enumerate(rides)
                    if r.latest_finish >= time // 2]
rides_by_length.sort(reverse=True)

for (ride_len, dist_to_start) in rides_by_length:
    r = rides[i]
    bonus_points = 20000 if time < 3600 or time > 20900 else 0
    if r.latest_finish <= time + 1.5 * ride_len:
        bonus_points += 76 + free_time
    # Distance penalty from ending location
    score = 1100 * r.distance_to_start(coords) / 1200
    # Time-of-day penalties
    if rides_by_length[:5]:
        score *= 25 * r.distance_to_start(coords) / 1200

This evolved function scores 12,296,845 on the metropolis input — enough to outperform the rank-1 human team across all inputs combined.

Evolved strategies: The function develops time-of-day awareness (different bonuses for early vs late in the simulation), ride-length sorting (prioritize long rides), distance penalties, and tight-deadline bonuses. None of these were in the original function or suggested by the human competitor. The LLM synthesized them through iterative mutation and selection.

Dissecting the evolution trace

The evolution from 3.5M to 12.3M points happens in several distinct phases, each corresponding to a qualitative shift in the scoring function's strategy:

Phase	Score	Key mutation	Strategy shift
Base	3,528,556	—	Pick first feasible ride
Phase 1	11,739,630	Remove early-arrival filter	Allow waiting for rides
Phase 2	~11,900,000	Sort by ride length	Prioritize long (high-value) rides
Phase 3	~12,100,000	Add distance penalty	Minimize deadheading (empty travel)
Phase 4	12,296,845	Time-aware bonuses	Different behavior early vs late

Notice the pattern: Phase 1 is a single correction to a logic bug (the overly strict feasibility check). This alone triples the score. Phases 2–4 are refinements that add maybe 5% each. The big win is always the first structural insight; the rest is polish. But in competition, that polish separates finalists from the top team.

Why the priority queue matters

The backbone uses a min-heap keyed by the time each car becomes free. This is not just a data structure choice — it encodes a scheduling policy. The car that finishes earliest gets the next ride. This means busy downtown cars cycle quickly through short rides, while suburban cars may take one long ride and sit idle. The scoring function must work within this constraint: it cannot re-assign rides across cars, only choose which ride a specific car takes next.

# Backbone priority queue (simplified)
cars = [(0, (0, 0))] * f  # (free_time, coords) for each car
while True:
    time, coords = heapq.heappop(cars)
    if time >= tot_time: break
    idx = pick_ride(coords, time, tuple(rides))
    if idx < 0: continue  # No feasible ride
    r = rides.pop(idx)
    pickup_time = time + r.distance_to_start(coords)
    pickup_time = max(pickup_time, r.earliest_start)
    free_time = pickup_time + r.length()
    if free_time <= r.latest_finish:
        score += r.length()
        if pickup_time == r.earliest_start:
            score += b  # Bonus for on-time pickup
    heapq.heappush(cars, (free_time, r.end))

Interactive: a simplified ride-assignment simulation. Cars (circles) pick up rides (lines). Click "Evolve" to improve the scoring function. Watch how evolved strategies reduce deadheading (empty travel shown as dashed lines).

What insight did FunSearch discover about the self-driving rides problem that was not in the human-written base function?

Cars can arrive before the earliest start time and wait — the base function incorrectly filtered out these rides as infeasible, missing a huge class of profitable assignments that tripled the score Cars should always pick the shortest ride The bonus is not worth pursuing

Chapter 7: Case Study — Traffic Signaling (2021)

The 2021 Hash Code problem is about designing traffic light schedules for an entire city. This is where the evolved scoring functions become truly intricate.

The problem

You are given a graph of one-way streets connecting intersections, plus a set of cars that each follow a predetermined route. At each intersection, there is a traffic light that cycles through its incoming streets one at a time, giving each street a green light for some duration. You decide: in what order and for how long each street gets a green light.

Cars that complete their route by the deadline earn points: a fixed bonus plus a time bonus inversely proportional to how long the trip took. The objective is to maximize the total score across all cars.

The three-phase backbone

The backbone is more complex than the previous examples. It runs a full simulation in three phases:

Phase 1: Position simulation. Run all cars under the assumption that all green lights last the same duration. Record which streets are used and where cars end up.
Phase 2: Duration prediction. For every incoming street at every intersection, predict the optimal green light duration using the scoring function.
Phase 3: Final simulation. Run the simulation again with the predicted durations. Remove any streets with only failed cars (to reduce congestion).

The scoring function is called with a boolean give_pos that switches between two roles:

give_pos = True: predict the position (order) of a street in the traffic light schedule
give_pos = False: predict the duration (green light length) for a street

What the LLM discovers

The base scoring function is deliberately simple:

if give_pos:
    return 1  # All positions equal
else:
    return int(len(used_streets[street.name]) / 1000 + 0.33)

This achieves 1,038,868 points on the testcase f_forever_jammed. After evolution, the position formula becomes:

pos = max(0, min(N̂_str − 1, ⌊N_cars/200⌋ + 2))

where N̂_str is the total number of streets that have entered this intersection so far and N_cars is the number of cars using this street. This formula gives precedence to streets encountered earlier at a particular intersection (smaller N̂_str) and to streets with higher traffic (larger N_cars).

Logarithmic green durations. The evolved duration formula is: ⌊(N_cars·N̂_str/1000 + 1/10) · ln(N_cars + 1) + 1⌋. This means green light duration should grow logarithmically with expected congestion. This is a principled heuristic — doubling the traffic does not justify doubling the green time, because other streets also need their turn.

The evolved function also develops biases: position allocation favors later positions for heavily used streets (load of at least 500 cars needed before any shift), while green durations scale with both the number of cars and the diversity of streets at each intersection.

With all evolved functions across individual inputs, the method achieves a comfortable rank-1 result on this problem, scoring 1,465,888 total across all inputs.

The position formula in detail

Let us unpack the evolved position formula step by step. The function must assign each incoming street at each intersection a position in the green-light cycle (0, 1, 2, ..., n-1 where n is the number of incoming streets with traffic).

pos(street) = max(0, min(N̂_str − 1, ⌊N_cars / 200⌋ + 2))

Where:

N_cars = total number of cars using this street across the simulation
N̂_str = running count of streets that have been assigned to this intersection so far

The min ensures the position cannot exceed the current count of streets (you cannot be position 5 if only 3 streets have been scheduled). The max(0, ...) prevents negative positions. The core term ⌊N_cars/200⌋ + 2 means:

Streets with fewer than 200 cars get position 2 (mid-cycle — not first, not last)
Streets with 200–399 cars get position 3
Streets with 400–599 cars get position 4
Very busy streets get pushed to later positions in the cycle

Counter-intuitive: You might expect busy streets to go first in the cycle. But the evolved function puts them later. The likely reason: by the time a busy street's green phase starts, cars from other streets have cleared the intersection, reducing congestion. Busy streets need more green time (handled by the duration function) but not necessarily an early slot.

The duration formula in detail

For wider green-light shifts, the evolved formula is even more interesting:

duration(street) = ⌊(N_cars · N̂_str / 1000 + 1/10) · ln(N_cars + 1) + 1⌋

This has three interacting terms:

N_cars · N̂_str / 1000: a product of traffic volume and intersection complexity. Busier intersections with more streets get longer greens. The /1000 scaling keeps this term manageable.
ln(N_cars + 1): the logarithmic damping. Doubling traffic from 100 to 200 adds ln(2) ≈ 0.7 seconds, not a full doubling. This prevents any single street from monopolizing the cycle.
+1 floor: every street gets at least 1 time unit of green, preventing starvation.

The combination is sophisticated. At a simple intersection with 2 streets and 50 cars each, durations are roughly ⌊(50·2/1000 + 0.1) · ln(51) + 1⌋ = ⌊0.2 · 3.9 + 1⌋ = 1. At a complex intersection with 8 streets and 500 cars, it is ⌊(500·8/1000 + 0.1) · ln(501) + 1⌋ = ⌊4.1 · 6.2 + 1⌋ = 26. The scaling is dramatic but appropriate.

Why should green light duration grow logarithmically rather than linearly with traffic volume?

Linear growth would starve low-traffic streets — if a street with 1000 cars gets a 10-second green while a street with 100 gets 1 second, the 100-car street's vehicles will bottleneck, creating cascading congestion. Logarithmic scaling gives busy streets more time but not proportionally more, maintaining fairness Logarithms are cheaper to compute The problem requires it

Chapter 8: Results & Rankings

Now let us step back and look at the aggregate results across all eight Hash Code qualification rounds.

Headline numbers

Year	Problem	Backbone %ile	FunSearch 2h %ile	FunSearch ∞ %ile	Beat #1?
2015	Data Center	~50th	Top 1%	Top 1%	Yes
2016	Delivery	~80th	Top 1%	Top 1%	No
2017	Streaming Videos	~55th	Top 1%	Top 1%	No
2018	Self-Driving Rides	~20th	Top 1%	Top 1%	Yes
2019	Photo Slideshow	~40th	Top 1%	Top 1%	No
2020	Book Scanning	~60th	Top 1%	Top 1%	Yes
2021	Traffic Signaling	~45th	Top 1%	Top 1%	Yes
2022	Mentorship	~90th	Top 1%	Top 1%	Yes

Five out of eight rank-1 results. On five problems (2015, 2018, 2020, 2021, 2022), the evolved solutions outperform the best human team. On the other three, they still land in the top percentile. The backbone alone never reaches the top percentile on any problem — the scoring function evolution is always the decisive factor.

Two key questions answered

Can FunSearch deliver meaningful improvements? Yes. Across all eight contests, evolving the scoring function pushed the backbone from mid-range to top percentile. The backbone alone cannot achieve the top percentile or qualify for finals in any year.

Are the improvements obtainable under contest conditions? Yes. Even limiting FunSearch to just two hours of evolution time, the results are sufficient for the top percentile in all eight contests. In most cases, the two-hour results are close to the unlimited-time results.

The starting-point variance

An interesting pattern: the backbone's starting percentile varies dramatically across years. In 2018, the backbone starts below the 20th percentile. In 2022, it starts above the 90th percentile. This reflects the varying complexity of writing a good backbone — some problems are harder to parse and structure than others. But FunSearch consistently amplifies whatever starting point it receives into the top percentile.

Visualization: backbone starting percentile (gray) vs FunSearch evolved percentile (warm) across all eight Hash Code years. The green line marks the finalist cutoff. Hover over bars for details.

AtCoder Heuristic Contest 039

To test generalization beyond Hash Code, the authors applied their method to a recent AtCoder Heuristic Contest (AHC 039) — a "Purse Seine Fishing" problem about building polygons on 2D grids to maximize enclosed "mackerel" points while minimizing "sardine" points.

This is a fundamentally different setting: test cases are private (not visible), the code must be self-contained (no external tools), and the evaluation metric differs from the internal fitness function. Despite these challenges, FunSearch achieves an estimated rank between 9th and 17th place out of 150+ participants — demonstrating that the approach transfers to substantially different competition formats.

Post-hoc evaluation note: The Hash Code scores were computed post-hoc using the public archive. The AtCoder contest was held in November 2024, and the FunSearch approach was applied to a variant of the contest problems. The Gemini model used had no prior exposure to contest-specific strategies.

What is the most significant limitation of the Hash Code evaluation?

The problems are too easy FunSearch is too slow The evaluation is post-hoc, not real-time — the contest environment cannot be perfectly replicated (time pressure, team dynamics, limited compute), and the LLM may have seen Hash Code-related content in its training data

Chapter 9: Connections

This paper sits at the intersection of several active research threads. Let us map the connections.

FunSearch lineage

This work is a direct application of FunSearch (Romera-Paredes et al., 2024), which demonstrated LLM-guided evolution for mathematical discovery (finding large cap sets and new bin-packing heuristics). The key advance here is showing that FunSearch works not just on abstract mathematical problems but on practical competitive programming tasks with complex backbones.

The successor, AlphaEvolve (Novikov et al., 2025), takes this further by evolving entire codebases (not just single functions), using multi-objective optimization, and applying to Google's production infrastructure. This paper is the bridge between FunSearch's proof-of-concept and AlphaEvolve's generality.

FunSearch (2024)

Evolve a single function for math/combinatorics

↓

This paper (2024)

Evolve scoring functions inside human-written backbones for competitions

↓

AlphaEvolve (2025)

Evolve entire codebases, multi-file, multi-objective, production-grade

AlphaCode comparison

AlphaCode (Li et al., 2022) and AlphaCode 2 tackle Codeforces-style problems by generating thousands of candidate solutions and filtering them. This approach differs fundamentally: it does not generate complete solutions but evolves a component (the scoring function) within a human-designed system. The collaboration is the point — neither the human backbone nor the evolved function alone is competitive.

System	Competition type	Human role	AI role	Output
AlphaCode	Correctness (Codeforces)	None	Generate + filter complete solutions	Full programs
This paper	Optimization (Hash Code)	Design backbone	Evolve scoring function	Function within backbone
AlphaEvolve	Any optimization	Design initial code	Evolve any marked function(s)	Evolved codebase

The collaboration thesis

The paper makes a broader argument about human-AI collaboration in programming. Rather than replacing human programmers, the most productive near-term approach is amplification: humans design the architecture and constraints, AI searches the space of solutions within those constraints. The human provides structure and understanding; the AI provides tireless exploration and the occasional non-obvious insight (like the return -100 penalty or the logarithmic green duration).

The deeper lesson: The scoring functions evolved by FunSearch are not just "better tuned" versions of the human originals. They develop genuinely novel strategies — time-of-day awareness, logarithmic scaling laws, hard filters on server sizes — that human competitors did not discover. The LLM is not just optimizing parameters; it is discovering new algorithmic ideas through the combination of code understanding and evolutionary pressure.

Open questions

Backbone sensitivity: How much does the quality of the backbone matter? Could a very weak backbone still be amplified to competitive levels?
Real-time application: Could this approach work within the actual four-hour competition window, including backbone design time?
Transfer learning: Can evolved scoring functions from one year's problem help on the next year's?
Beyond greedy: Most backbones are greedy. Could this work with more complex algorithmic frameworks (branch-and-bound, constraint programming)?

What is the key structural difference between this paper's approach and AlphaCode?

This paper evolves a single scoring function within a human-designed backbone via evolutionary search, while AlphaCode generates complete standalone programs via massive sampling and filtering — the collaboration between human structure and AI search is the fundamental innovation This paper uses a better LLM AlphaCode solves harder problems