A collaborative human-AI system where competitors write greedy algorithm backbones and FunSearch evolves the scoring functions. Applied to all eight Google Hash Code qualification rounds (2015–2022), the evolved solutions reach the top percentile and outperform the best human teams on five of eight contests — all within two hours of evolution.
You are a team of four in Google's Hash Code competition. You have four hours. The problem drops: assign servers to data center pools to maximize guaranteed capacity. Or schedule a fleet of self-driving cars across a city to maximize ride completions. Or design traffic light schedules for every intersection in a map.
These are NP-hard combinatorial optimization problems. There is no known polynomial-time algorithm that finds the optimal solution. Every team knows this going in. The game is not about finding the best solution — it is about finding the best heuristic that produces the highest-scoring solution within the time limit.
In practice, top teams follow a predictable pattern. First, they write a greedy algorithm — a backbone that builds a solution step by step, always making the locally best choice. Then they write a scoring function that tells the greedy algorithm what "locally best" means. Finally, they spend the remaining time hand-tuning the scoring function, sometimes wrapping it in randomized hill-climbing.
The collaboration is asymmetric and deliberate. Humans are good at understanding the problem structure, designing data representations, and writing greedy frameworks. But searching the space of possible scoring functions — where a small constant or a subtle formula change can mean thousands of points — is tedious, error-prone, and perfectly suited for automated exploration.
The results are striking: across all eight Hash Code online qualification rounds from 2015 to 2022, their evolved solutions consistently rank in the top percentile. On five of eight contests, they outperform the highest-scoring human team. And these results hold even when limiting FunSearch to just two hours of evolution time.
To appreciate why this approach works, we need to understand the fundamental difference between two kinds of competitive programming.
Codeforces-style problems require polynomial-time algorithms — you need to find the exact, correct answer. There is no reward for "close." The test cases often include hidden edge cases designed to break naive solutions. Many tasks follow well-known design patterns (segment trees, dynamic programming, network flow) that experienced competitors can recognize and apply.
This makes AI-based approaches fragile: the model needs to produce perfectly correct code that handles every edge case, including ones not visible to the competitor. A single off-by-one error means zero points.
Hash Code problems are intractable. Optimal solutions are unknown. The test cases are visible upfront — no hidden inputs. Every valid solution scores something, and the score measures quality on a continuous scale. Even the best human teams produce suboptimal solutions and know it.
| Property | Codeforces | Hash Code |
|---|---|---|
| Solution type | Exact, correct | Heuristic, approximate |
| Scoring | Binary (pass/fail per test) | Continuous (quality metric) |
| Test cases | Hidden, adversarial | Visible, realistic |
| Optimal known? | Yes (polynomial algorithm exists) | No (NP-hard) |
| Partial credit | No | Yes — every valid solution scores |
| Improvement style | Insight-based (find the right algorithm) | Incremental (tune the heuristic) |
Hash Code was a global team programming competition organized by Google. At its peak, it attracted over 125,000 competitors. Each qualification round presented a single NP-hard optimization problem with multiple test inputs. Teams had four hours to submit solutions. The top ~50 teams advanced to the final round.
The competition ran from 2015 to 2022, giving us eight qualification rounds to benchmark against. The paper uses all eight, accessing the problems and competitor scores from the public archive.
Teams had two phases: (1) parse the input and design a backbone algorithm, (2) iterate on the scoring function to squeeze out more points. The authors split this identically: the humans handle phase 1, FunSearch handles phase 2. They budgeted two hours for the human backbone design and two hours for FunSearch evolution.
The approach decomposes the competition into two cleanly separated roles: the human competitor designs the algorithmic backbone, and FunSearch evolves the scoring function that guides it.
return 1 or a basic ratio — gets a baseline scoreEach backbone has three components:
score_greedy() at each decision point to rank candidatesThe scoring function is the only part that FunSearch evolves. It has a fixed signature — it takes the current state and a candidate action, and returns a numeric score. The backbone is frozen during evolution.
The backbone requires deep understanding of the problem: what data structures to use, what order to iterate in, how to handle constraints. This is where human insight shines. The scoring function, in contrast, is a numerical fitness landscape — a function from problem state to a real number. Small changes to constants, added terms, or restructured conditionals can shift the score significantly. This is where automated search shines.
The combined implementations are modest: about 400 lines of Python including docstrings, plus roughly 200 lines for the evaluator. The backbone alone is insufficient to win — base scoring functions typically land between the 20th and 60th percentile. But FunSearch-evolved scoring functions push the same backbone into the top 1%.
FunSearch (Romera-Paredes et al., 2024) is the engine that evolves scoring functions. It pairs a pre-trained LLM with a systematic evaluator inside an evolutionary loop. Let us trace through the complete mechanism.
The evaluator is critical. It runs the entire backbone with the proposed scoring function and returns the actual competition score. This is not a proxy metric — it is the real thing. If the proposed function crashes, returns invalid types, or produces worse scores, the candidate is discarded.
The paper makes several adaptations from the original FunSearch setup:
| Parameter | Original FunSearch | This paper |
|---|---|---|
| LLM | PaLM 2 (fine-tuned on code) | Gemini 1.5 Flash 002 (general, off-the-shelf) |
| Evolution target | Single scoring function | Scoring function with switchable variable (project vs role) |
| Memory limit | Default | 10 GB per program |
| Time limit | Default | 1,800 seconds wall-clock per evaluation |
| Multiple choice points | No | Yes — one function with a boolean switch for two choice contexts |
Many Hash Code problems have multiple decision points in the greedy algorithm. For example, the 2022 Mentorship and Teamwork problem requires choosing both which project to work on and which role to assign to each person. Rather than evolving two separate functions, the paper uses a single function with a boolean parameter rate_project:
def score_greedy(project, role_id, assignments, rate_project: bool) -> int: """Score choosing a project (rate_project=True) or assigning a role (rate_project=False).""" if rate_project: return 1 # base: all projects equal else: skill, level = project.roles[role_id] return 1 # base: all roles equal
FunSearch evolves both branches simultaneously. The boolean switch lets a single function body encode two different scoring policies, and the evolutionary pressure optimizes both in tandem.
FunSearch maintains multiple "islands" — independent sub-populations that evolve separately. Periodically, the best programs from one island migrate to others. This prevents premature convergence: different islands may explore different regions of the scoring-function space, and cross-pollination combines their discoveries.
What does the LLM actually see? The prompt contains:
The LLM responds with a modified function. FunSearch applies the modification, runs the backbone with the new function, and scores the result. If the score improves, the new function enters the population. If not, it is discarded. This is pure selection pressure — no gradient, no reward shaping, just "did the score go up?"
Within two hours, the system evaluates around 10,500 programs on average across the eight Hash Code problems. Each evaluation runs the full backbone + scoring function on the test input, which can take up to 30 minutes for the most complex problems. The LLM calls themselves are fast (Gemini 1.5 Flash generates a candidate in seconds); the bottleneck is evaluation. This is why the scoring function is kept small — a shorter function means faster evaluation means more generations in the same wall-clock budget.
Let us look at the anatomy of a backbone in detail. Every backbone follows the same three-part pattern, regardless of the specific Hash Code problem.
The parser reads the competition input file and constructs typed Python objects. For the 2015 Data Center problem, this means Server dataclasses (with index, size, and capacity), row blocks (sorted lists of available slots), and pool counts. The parser is straightforward — read the file, parse integers, build lists.
The greedy algorithm iterates through decision points (e.g., "which server to place next" or "which ride to assign to this car") and at each point calls score_greedy() on every candidate. It picks the candidate with the highest score. This is the backbone's skeleton — it never changes during evolution.
# Simplified greedy loop (data center placement) while open_rows and open_servers: for row in open_rows: break best_score = -1 best_server = None for s_id in open_servers: curr_score = score_greedy(servers[s_id], row, None, None, True) if best_server is None or curr_score > best_score: best_score = curr_score best_server = s_id # Place best_server in current row... placement[best_server] = (row, col, best_pool)
The evaluator runs the full solution and returns the official competition score. This becomes FunSearch's fitness function. Critically, the evaluator is not the scoring function — it is the ground truth. The scoring function guides the greedy algorithm's local decisions; the evaluator measures the global quality of the resulting solution.
Consider the 2015 Data Center problem. The base scoring function is as simple as possible:
if rate_server: return server.capacity / server.size for c_row in pools_per_row: total_sum += pools_per_row[c_row][pool] return -total_sum + pools_per_row[row][pool]
This achieves a score of 348. After two hours of FunSearch evolution, the scoring function becomes dramatically more elaborate — a chain of 15 LLM mutation calls from the original. The evolved version includes:
return -100 statement that penalizes empty poolsThe evolved function scores 405 — sufficient for second place overall. With more evolution time, it reaches 413 — rank 1.
server.capacity / server.size. What is this heuristic doing?The 2015 Hash Code qualification round asked teams to optimize the placement of servers in a data center. Let us trace through the entire problem, backbone, and evolution to see the approach in concrete detail.
A data center has R rows of slots. Some slots are unavailable (occupied by other hardware). Each row has contiguous blocks of available slots. You have S servers, each with a size (how many slots it occupies) and a capacity (how much compute it provides). You must assign servers to rows and organize them into P pools.
The objective is to maximize the guaranteed capacity of the worst-off pool. A pool's guaranteed capacity is its total capacity minus the maximum capacity it has in any single row — because if that row goes down, the pool loses that capacity. This is a min-max problem: you want to spread each pool's capacity across rows so that no single row failure is catastrophic.
The greedy algorithm makes two interleaved decisions:
For each decision, it calls score_greedy() with the appropriate boolean flag and picks the highest-scoring option.
The base scoring function for the server choice is simply capacity / size. For the pool choice, it returns a simple metric: -total_sum + pools_per_row[row][pool] — preferring pools that have less capacity in the current row (to spread capacity evenly).
After FunSearch runs for two hours, the evolved function balloons to dozens of lines with:
server_capacity >= total_sum / 2.0log(pools_per_row) * 0.0007 terms that gently discourage over-packing0.0004, 1.006, 1.13 that were discovered through evolution, not human intuitionreturn -100 when server size exceeds 125, which the LLM discovered penalizes allocating very large servers# Evolved scoring function snippet (simplified) assert server_size > 0 if rate_server: if server_size > 125: return -100 # Penalize very large servers cap = (0.7 - 4.96 * server_size / 125) + ( server_capacity / 122) if server_capacity / server_size > 7.5: cap += 0.0 if server_capacity / server_size > 7.96: cap += 11.0 # Bonus for high-efficiency servers return cap
return -100 when server.size > 125. This is counter-intuitive — large servers often have high capacity. But the scoring function discovered that placing many small, efficient servers provides better spreading across pools than a few large ones. The return -100 acts as a hard filter that fundamentally changes the allocation strategy.Let us derive why capacity-per-slot is a reasonable starting heuristic but insufficient. Suppose you have two servers:
The ratio heuristic prefers A. But consider: if you place both in row 0 and both in pool 0, the guaranteed capacity is (10+20) − (10+20) = 0. If instead you spread them across rows, guaranteed capacity is (10+20) − max(10,20) = 10. The ratio heuristic says nothing about which pool or which row to use — that is what the pool-selection branch of the scoring function must learn.
Maximizing this requires minimizing the max-row concentration. An ideal allocation distributes each pool's servers evenly across rows. But rows have different amounts of available space, and servers have different sizes, creating a complex packing-and-spreading tradeoff that the evolved function must navigate.
The final rank-1 scoring function (after extended evolution) has this structure:
# Evolved score_greedy (Data Center, rank-1 variant) assert server_size > 0 assert server_capacity > 0 if pool is not None: total_sum = pools_per_row[row][pool] max_sum = 0 pool_size = 0 for p in pools_per_row: total_sum += pools_per_row[p][pool] max_sum = max(pools_per_row[p][pool], max_sum) if p == pool: pool_list = pools_per_row[p][pool] if total_sum == 0: return row_score # Cold-start: pick by row else: pool_score = -0.5 + total_sum / server_size * \ max_sum / server_size pool_bonus = 0.15 + (total_sum - pool_size) # ... 30+ more lines of evolved logic
The key insight: the evolved function considers the current state of pool allocations (via pools_per_row) and makes decisions that account for the global distribution, not just the immediate server quality. This is emergent multi-step planning encoded in a single scoring function.
Interactive simulation: watch how FunSearch evolves a scoring function for a simplified data center placement. Click "Evolve" to run 5 generations. The bars show pool guaranteed capacities — the overall score is the minimum bar height (shown in warm/orange). The chart tracks score improvement over generations.
return -100 for servers with size > 125?The 2018 Hash Code problem takes us from data centers to city streets. You manage a fleet of self-driving cars, and you need to assign ride requests to maximize the total score.
You are given F cars and a set of ride orders. Each ride has a start point, an end point, an earliest start time, and a latest finish time. Completing a ride earns points equal to the Manhattan distance of the ride. There is a bonus B for starting a ride at exactly its earliest start time. The task: assign rides to cars to maximize the sum of distances plus bonuses, subject to the constraint that rides must finish before their deadline.
The greedy algorithm maintains a priority queue of cars, keyed by the time at which each car finishes its latest ride. At each step, the car that becomes free earliest is popped from the queue, and the scoring function chooses the best available ride for it.
# Base scoring function: pick the first feasible ride for i, r in enumerate(rides): pickup_time = time + r.distance_to_start(coords) if (pickup_time >= r.earliest_start) and ( pickup_time + r.length() < r.latest_finish): return i # First feasible ride return -1 # No feasible ride
This baseline scores 3,528,556 points on the d_metropolis test case. After two hours of evolution, the function becomes dramatically more sophisticated.
The first meaningful evolution corrects a subtle inefficiency in the base function:
pickup_time >= r.earliest_start. The evolved version removes this constraint — realizing that the car can simply wait if it arrives before the earliest start. This means the car arrives early and waits, but this is often better than skipping the ride entirely. Score jumps to 11,739,630.After more evolution, the scoring function develops an elaborate multi-factor scoring system:
# Evolved scoring function (simplified) rides_by_length = [(r.length(), distance(coords, r.start)) for i, r in enumerate(rides) if r.latest_finish >= time // 2] rides_by_length.sort(reverse=True) for (ride_len, dist_to_start) in rides_by_length: r = rides[i] bonus_points = 20000 if time < 3600 or time > 20900 else 0 if r.latest_finish <= time + 1.5 * ride_len: bonus_points += 76 + free_time # Distance penalty from ending location score = 1100 * r.distance_to_start(coords) / 1200 # Time-of-day penalties if rides_by_length[:5]: score *= 25 * r.distance_to_start(coords) / 1200
This evolved function scores 12,296,845 on the metropolis input — enough to outperform the rank-1 human team across all inputs combined.
The evolution from 3.5M to 12.3M points happens in several distinct phases, each corresponding to a qualitative shift in the scoring function's strategy:
| Phase | Score | Key mutation | Strategy shift |
|---|---|---|---|
| Base | 3,528,556 | — | Pick first feasible ride |
| Phase 1 | 11,739,630 | Remove early-arrival filter | Allow waiting for rides |
| Phase 2 | ~11,900,000 | Sort by ride length | Prioritize long (high-value) rides |
| Phase 3 | ~12,100,000 | Add distance penalty | Minimize deadheading (empty travel) |
| Phase 4 | 12,296,845 | Time-aware bonuses | Different behavior early vs late |
Notice the pattern: Phase 1 is a single correction to a logic bug (the overly strict feasibility check). This alone triples the score. Phases 2–4 are refinements that add maybe 5% each. The big win is always the first structural insight; the rest is polish. But in competition, that polish separates finalists from the top team.
The backbone uses a min-heap keyed by the time each car becomes free. This is not just a data structure choice — it encodes a scheduling policy. The car that finishes earliest gets the next ride. This means busy downtown cars cycle quickly through short rides, while suburban cars may take one long ride and sit idle. The scoring function must work within this constraint: it cannot re-assign rides across cars, only choose which ride a specific car takes next.
# Backbone priority queue (simplified) cars = [(0, (0, 0))] * f # (free_time, coords) for each car while True: time, coords = heapq.heappop(cars) if time >= tot_time: break idx = pick_ride(coords, time, tuple(rides)) if idx < 0: continue # No feasible ride r = rides.pop(idx) pickup_time = time + r.distance_to_start(coords) pickup_time = max(pickup_time, r.earliest_start) free_time = pickup_time + r.length() if free_time <= r.latest_finish: score += r.length() if pickup_time == r.earliest_start: score += b # Bonus for on-time pickup heapq.heappush(cars, (free_time, r.end))
Interactive: a simplified ride-assignment simulation. Cars (circles) pick up rides (lines). Click "Evolve" to improve the scoring function. Watch how evolved strategies reduce deadheading (empty travel shown as dashed lines).
The 2021 Hash Code problem is about designing traffic light schedules for an entire city. This is where the evolved scoring functions become truly intricate.
You are given a graph of one-way streets connecting intersections, plus a set of cars that each follow a predetermined route. At each intersection, there is a traffic light that cycles through its incoming streets one at a time, giving each street a green light for some duration. You decide: in what order and for how long each street gets a green light.
Cars that complete their route by the deadline earn points: a fixed bonus plus a time bonus inversely proportional to how long the trip took. The objective is to maximize the total score across all cars.
The backbone is more complex than the previous examples. It runs a full simulation in three phases:
The scoring function is called with a boolean give_pos that switches between two roles:
give_pos = True: predict the position (order) of a street in the traffic light schedulegive_pos = False: predict the duration (green light length) for a streetThe base scoring function is deliberately simple:
if give_pos: return 1 # All positions equal else: return int(len(used_streets[street.name]) / 1000 + 0.33)
This achieves 1,038,868 points on the testcase f_forever_jammed. After evolution, the position formula becomes:
where N̂str is the total number of streets that have entered this intersection so far and Ncars is the number of cars using this street. This formula gives precedence to streets encountered earlier at a particular intersection (smaller N̂str) and to streets with higher traffic (larger Ncars).
The evolved function also develops biases: position allocation favors later positions for heavily used streets (load of at least 500 cars needed before any shift), while green durations scale with both the number of cars and the diversity of streets at each intersection.
With all evolved functions across individual inputs, the method achieves a comfortable rank-1 result on this problem, scoring 1,465,888 total across all inputs.
Let us unpack the evolved position formula step by step. The function must assign each incoming street at each intersection a position in the green-light cycle (0, 1, 2, ..., n-1 where n is the number of incoming streets with traffic).
Where:
The min ensures the position cannot exceed the current count of streets (you cannot be position 5 if only 3 streets have been scheduled). The max(0, ...) prevents negative positions. The core term ⌊Ncars/200⌋ + 2 means:
For wider green-light shifts, the evolved formula is even more interesting:
This has three interacting terms:
The combination is sophisticated. At a simple intersection with 2 streets and 50 cars each, durations are roughly ⌊(50·2/1000 + 0.1) · ln(51) + 1⌋ = ⌊0.2 · 3.9 + 1⌋ = 1. At a complex intersection with 8 streets and 500 cars, it is ⌊(500·8/1000 + 0.1) · ln(501) + 1⌋ = ⌊4.1 · 6.2 + 1⌋ = 26. The scaling is dramatic but appropriate.
Now let us step back and look at the aggregate results across all eight Hash Code qualification rounds.
| Year | Problem | Backbone %ile | FunSearch 2h %ile | FunSearch ∞ %ile | Beat #1? |
|---|---|---|---|---|---|
| 2015 | Data Center | ~50th | Top 1% | Top 1% | Yes |
| 2016 | Delivery | ~80th | Top 1% | Top 1% | No |
| 2017 | Streaming Videos | ~55th | Top 1% | Top 1% | No |
| 2018 | Self-Driving Rides | ~20th | Top 1% | Top 1% | Yes |
| 2019 | Photo Slideshow | ~40th | Top 1% | Top 1% | No |
| 2020 | Book Scanning | ~60th | Top 1% | Top 1% | Yes |
| 2021 | Traffic Signaling | ~45th | Top 1% | Top 1% | Yes |
| 2022 | Mentorship | ~90th | Top 1% | Top 1% | Yes |
Can FunSearch deliver meaningful improvements? Yes. Across all eight contests, evolving the scoring function pushed the backbone from mid-range to top percentile. The backbone alone cannot achieve the top percentile or qualify for finals in any year.
Are the improvements obtainable under contest conditions? Yes. Even limiting FunSearch to just two hours of evolution time, the results are sufficient for the top percentile in all eight contests. In most cases, the two-hour results are close to the unlimited-time results.
An interesting pattern: the backbone's starting percentile varies dramatically across years. In 2018, the backbone starts below the 20th percentile. In 2022, it starts above the 90th percentile. This reflects the varying complexity of writing a good backbone — some problems are harder to parse and structure than others. But FunSearch consistently amplifies whatever starting point it receives into the top percentile.
Visualization: backbone starting percentile (gray) vs FunSearch evolved percentile (warm) across all eight Hash Code years. The green line marks the finalist cutoff. Hover over bars for details.
To test generalization beyond Hash Code, the authors applied their method to a recent AtCoder Heuristic Contest (AHC 039) — a "Purse Seine Fishing" problem about building polygons on 2D grids to maximize enclosed "mackerel" points while minimizing "sardine" points.
This is a fundamentally different setting: test cases are private (not visible), the code must be self-contained (no external tools), and the evaluation metric differs from the internal fitness function. Despite these challenges, FunSearch achieves an estimated rank between 9th and 17th place out of 150+ participants — demonstrating that the approach transfers to substantially different competition formats.
This paper sits at the intersection of several active research threads. Let us map the connections.
This work is a direct application of FunSearch (Romera-Paredes et al., 2024), which demonstrated LLM-guided evolution for mathematical discovery (finding large cap sets and new bin-packing heuristics). The key advance here is showing that FunSearch works not just on abstract mathematical problems but on practical competitive programming tasks with complex backbones.
The successor, AlphaEvolve (Novikov et al., 2025), takes this further by evolving entire codebases (not just single functions), using multi-objective optimization, and applying to Google's production infrastructure. This paper is the bridge between FunSearch's proof-of-concept and AlphaEvolve's generality.
AlphaCode (Li et al., 2022) and AlphaCode 2 tackle Codeforces-style problems by generating thousands of candidate solutions and filtering them. This approach differs fundamentally: it does not generate complete solutions but evolves a component (the scoring function) within a human-designed system. The collaboration is the point — neither the human backbone nor the evolved function alone is competitive.
| System | Competition type | Human role | AI role | Output |
|---|---|---|---|---|
| AlphaCode | Correctness (Codeforces) | None | Generate + filter complete solutions | Full programs |
| This paper | Optimization (Hash Code) | Design backbone | Evolve scoring function | Function within backbone |
| AlphaEvolve | Any optimization | Design initial code | Evolve any marked function(s) | Evolved codebase |
The paper makes a broader argument about human-AI collaboration in programming. Rather than replacing human programmers, the most productive near-term approach is amplification: humans design the architecture and constraints, AI searches the space of solutions within those constraints. The human provides structure and understanding; the AI provides tireless exploration and the occasional non-obvious insight (like the return -100 penalty or the logarithmic green duration).