Barbarians at the Gate: How AI is Upending Systems Research

Chapter 0: The Problem

A huge fraction of systems research — networking, databases, distributed systems — is about one thing: performance. You have a cluster of machines. You have workloads. You need an algorithm that schedules jobs, routes packets, balances load, or reorders queries to squeeze out every bit of throughput and slash every dollar of cost.

Traditionally, this is a human art. A PhD student spends months studying a system, builds a simulator, hand-crafts a heuristic, tunes it, compares it to baselines, and publishes. A survey of 31 systems PhD students at Berkeley reveals where their time goes:

Research Stage	% of Time
Problem Formulation	21.5%
Evaluation Framework	22.9%
Algorithm Design (Solution)	21.5%
Evaluation	20.1%
Paper Writing	14.0%

Look at those two bolded rows: algorithm design and evaluation together consume over 40% of total effort. These are the two stages where you are iterating: propose a tweak, run the simulator, check the numbers, try again. It is a search loop, and humans are slow search engines.

The central thesis: AI can automate the Solution and Evaluation stages of systems research. LLMs generate candidate algorithms. Simulators evaluate them. An evolutionary loop refines the best. The researcher still defines the problem, builds the evaluator, and interprets results — but the tedious generate-and-test cycle runs on autopilot.

The paper calls this approach ADRS: AI-Driven Research for Systems. Using OpenEvolve (an open-source implementation), they demonstrate ADRS across 11 tasks spanning cloud scheduling, MoE inference, SQL optimization, and transaction processing. In multiple cases, the evolved algorithms match or beat published SOTA — discovered in a few hours, at a cost of a few dollars.

This is not science fiction. The paper shows concrete before/after code: a 25-line hand-crafted scheduling policy versus a 45-line evolved policy that tracks spot availability patterns, classifies situations as stable or unstable, and dynamically adjusts safety margins. No human wrote it. An LLM proposed it, a simulator validated it, and evolution refined it.

The implications are profound. If an LLM evolution loop can, for $20 and 5 hours, match what takes a PhD student months to design — what is the future of systems research? The paper's title, "Barbarians at the Gate," captures the tension: these AI approaches do not follow the traditional academic playbook, but they get results.

Why do the authors focus specifically on systems performance problems for AI-driven research?

Because system performance has reliable, automated verifiers — you run the code in a simulator and measure throughput, latency, or cost — making the evaluate step cheap and objective Because systems code is simpler than other domains Because LLMs are specifically trained on systems papers

Chapter 1: Why Systems Research is the Perfect Target

Not every research domain is equally suited for AI-driven discovery. The authors identify three properties that make systems performance problems a uniquely good fit:

1. Easy to verify

A scheduling algorithm either meets the deadline or it does not. A load balancer either distributes requests evenly or it does not. You can implement the candidate in a real system or simulator, run representative workloads, and measure the result. No subjective judgment. No ambiguous criteria. The evaluator is a deterministic function: code in, numbers out.

Contrast this with, say, AI-driven theorem proving or creative writing, where "is this solution correct/good?" is itself a hard problem. In systems research, verification is essentially free.

2. Correctness is easy to check

Solutions to systems performance problems typically preserve the correctness of the original system. A load balancer must still assign all tasks. A scheduler must still meet deadlines. A transaction reorderer must still execute all transactions. These constraints are binary: either satisfied or not. A quick check in the evaluator catches any violation.

3. The code that evolves is small

The core logic of a scheduler, load balancer, or resource allocator is often a short function — tens to hundreds of lines. This is small enough to fit in an LLM context window and small enough for the LLM to reason about coherently. You are not evolving an entire operating system. You are evolving a policy function.

Key insight — reliable verifiers are the bottleneck: The success of any AI-driven approach hinges on verification accuracy. In systems research, "run it and measure the numbers" is about as reliable as verification gets. This is why ADRS works here when it might not work elsewhere. The paper argues that hallucination risk drops to near-zero because every proposed solution is actually executed and measured.

Which problems are NOT well-suited?

The authors are candid about limits. ADRS does not work well for:

Distributed protocols — consensus algorithms, failure detectors, network protocols where correctness requires reasoning about interleaving of multiple components
Problems where semantic equivalence is hard — e.g., verifying that a modified SQL query plan produces the same results as the original
Problems solvable by exact methods — integer linear programming, constraint solvers, where provably optimal solutions exist
GPU-heavy or slow evaluations — weight compression or distributed training where a single evaluation takes 12+ hours

The sweet spot: Isolated, small code changes + fast, reliable simulators + clear numeric objectives = ideal ADRS target.

A survey of systems venues

The authors surveyed top systems conferences (NSDI, OSDI, SIGMOD, SOSP, VLDB) and found that over one-third of published papers feature performance optimization algorithms as their core contribution. These are all potential ADRS targets. The domains include:

Domain	Example Tasks	Typical Evaluator
Networking	Routing, congestion control, traffic scheduling	Network simulator (ns-3, Mininet)
Databases	Query optimization, indexing, transaction scheduling	Database benchmarks (TPC-C, YCSB)
Cloud/Distributed	Job scheduling, resource allocation, load balancing	Cluster simulators with trace replay
ML Systems	MoE placement, attention design, model parallelism	PyTorch/GPU simulators

Which of these problems would be LEAST suited for ADRS?

A CPU job scheduler in a cloud cluster A KV cache eviction policy A distributed consensus protocol like Raft

Chapter 2: The ADRS Loop

ADRS automates the Solution and Evaluation stages of the traditional systems research process. The full research pipeline has five stages; ADRS replaces the middle two:

1. Problem Formulation

Human defines the objective, constraints, workloads (unchanged)

↓

2. Evaluation Framework

Human builds simulator + evaluator + baselines (unchanged)

↓

3. Solution Generation

LLM ensemble generates candidate algorithms ← AUTOMATED

↓

4. Evaluation

Simulator scores candidates against workloads ← AUTOMATED

↓

5. Paper Write-Up

Human interprets results and writes the paper (unchanged)

The automated core — stages 3 and 4 — forms a feedback loop. ADRS consists of five components that implement this loop:

The Five Components

Component	Role	Concrete Example
Prompt Generator	Assembles the context-rich prompt sent to the LLM	Problem statement + parent code + scores + evaluator feedback
Solution Generator	LLM (or ensemble) proposes new or modified code	80% Gemini 2.5 Flash + 20% o3
Evaluator	Runs solutions against workloads, assigns scores + feedback	Python simulator measuring cost savings over baseline
Storage	Stores all solutions, scores, and feedback	Program database indexed by score
Solution Selector	Picks promising solutions to refine next iteration	Greedy, random, or island-based selection

These form two nested loops:

Inner loop (automated): Solution Generator → Evaluator → Storage → Solution Selector → Prompt Generator → repeat. This runs for 100–400 iterations, typically taking 1–5 hours.
Outer loop (human-in-the-loop): The researcher observes the inner loop's outputs, adjusts the problem formulation, changes the evaluator, adds hints to the prompt, and relaunches. This is the creative oversight layer.

Why simulators, not real systems? Two reasons. First, real codebases are too large for LLM context windows. The simulator is small enough (typically 100–300 LOC) that the LLM can reason about it. Second, simulator evaluations take seconds versus hours on real hardware. Running 400 iterations at seconds-per-evaluation takes hours; at hours-per-evaluation it would take months.

Data flow through one iteration

Let us trace a concrete example. Suppose we are evolving a load-balancing policy for MoE inference. Here is what happens in one iteration:

Prompt Generator

Assembles: "Here is the current best EPLB function (score: 0.72). It uses greedy bin-packing. Previous attempts that replaced the for-loop with numpy failed (score: 0.31 — syntax error). Improve the runtime without increasing imbalance."

↓

Solution Generator (Gemini 2.5 Flash)

Generates a modified function that replaces the Python for-loop with torch.where() and a zigzag index pattern. Outputs a diff: SEARCH [old loop] REPLACE [new tensor code].

↓

Evaluator

Runs the modified function in the 168-LOC MoE simulator against ShareGPT traces. Measures: imbalance_factor=0.66, runtime=3.7ms. Combined score: 0.89. Returns: "Score improved from 0.72 to 0.89. Runtime decreased 146x. Imbalance maintained."

↓

Storage + Solution Selector

Stores new solution with score 0.89. It becomes the new parent for future iterations on this island. Migrates a copy to Island 2.

Notice the richness of the evaluator feedback. It is not just a number — it includes why the score changed, which helps the LLM in the next iteration. This qualitative feedback is one of the paper's key recommendations.

What are the two roles that remain human-only in the ADRS framework?

Problem formulation (defining what to optimize) and evaluation framework (building the simulator and baselines) Coding the solution and running experiments Choosing the LLM and writing the paper

Chapter 3: OpenEvolve — The Engine

The paper uses OpenEvolve as its primary ADRS framework. OpenEvolve is an open-source implementation inspired by Google DeepMind's AlphaEvolve. Let us trace the data flow through a single iteration.

Step 1: Build the prompt

The Prompt Generator assembles:

The problem description — what the algorithm should do, what metric to optimize
The current best program (or a sampled parent from the database)
The evaluator feedback — scores, error messages, trace-level details
Optional: previous solutions and their scores, so the LLM can learn from the search history
Optional: solution hints from the researcher

This is the most important design decision. Too little context and the LLM explores blindly. Too much and it gets confused or converges prematurely.

Step 2: Generate a candidate

The Solution Generator sends the prompt to one or more LLMs. The paper recommends an ensemble of two models:

Model Type	Role	Example Split
Reasoning model	Explores novel strategies, finds creative solutions	20% of calls (o3, Gemini Pro)
Fast model	Refines existing solutions efficiently, high iteration speed	80% of calls (Gemini Flash)

Why two models, not three? The authors tried using more than two models and found it hurt performance. Different models introduce conflicting coding styles and ideas, causing mutation drift — the evolutionary equivalent of two people pulling a rope in different directions. Two models strike the sweet spot: one for breadth, one for depth.

Step 3: Evaluate

The Evaluator runs the candidate in a simulator against predefined workloads. It returns:

A numeric score (the optimization objective)
A validity flag (did the solution violate any constraints?)
Optional: qualitative feedback from an LLM that reads the trace logs
Optional: per-workload breakdowns so the evolution can see where it is winning and losing

Step 4: Select and iterate

OpenEvolve uses an island-based evolutionary model. The population is divided into 3–5 islands that evolve semi-independently, with periodic migration of top solutions between islands. This prevents premature convergence.

Concrete numbers

Parameter	Typical Range
Number of islands	3–5
Iterations per run	70–400
Wall-clock time	40 min – 12 hours
Total LLM cost	$5 – $20
LLM ensemble	80/20 fast/reasoning split

The cost is remarkable. For the price of a lunch, you can search a space of algorithms that would take a PhD student months to explore by hand.

What a prompt actually looks like

The paper emphasizes that prompt engineering is critical. A well-structured prompt contains:

# System message (condensed):
"""
You are optimizing a scheduling policy.

PROBLEM: Minimize cost of running deadline-aware jobs
using spot instances. Spot instances are cheap but can
be preempted. On-demand instances are reliable but
expensive.

CURRENT BEST (score: 0.72):
[... full Python function ...]

EVALUATOR FEEDBACK:
- Average cost savings: 12% over greedy
- Failed on traces 7, 14: missed deadline by 2 ticks
- Spot utilization: 68%

PREVIOUS ATTEMPT (score: 0.31):
- Added random delay → missed 40% of deadlines

INSTRUCTIONS: Propose a modification to improve cost
savings while meeting ALL deadlines. Return your
changes as SEARCH/REPLACE blocks.
"""

The key insight: include the failure history. Showing the LLM what did not work is as valuable as showing what did. This prevents the search from revisiting dead ends and helps the LLM understand the constraint landscape.

Why does OpenEvolve use islands instead of a single population?

To reduce LLM API costs To prevent premature convergence by maintaining diverse sub-populations that evolve semi-independently and periodically exchange top solutions To make the code easier to parallelize

Chapter 4: Case Study — Spot Instance Scheduling

Cloud providers offer spot instances — spare compute at 60–90% discount. The catch: they can be preempted at any time. Your job gets killed, you lose progress, and you must restart on a new instance (incurring a changeover delay).

The scheduling problem: given a job with a deadline, minimize cost by using spot instances as much as possible, falling back to expensive on-demand instances only when necessary. If you use spots too aggressively, preemptions pile up and you miss the deadline. If you are too conservative, you overpay on on-demand.

The baseline: Uniform Progress

The SOTA comes from Wu et al. (NSDI '24), called Uniform Progress. The idea is simple: track expected progress versus actual progress. If you are behind schedule, switch to on-demand. If you are ahead, stay on spot.

def UniformProgress(has_spot, state, env, task):
    # Calculate uniform progress rate
    progress_rate = task.duration / task.deadline
    expected = env.elapsed_time * progress_rate
    actual = task.progress_made

    # Simple decision: behind schedule?
    if actual < expected:
        # Must use any available resource
        if has_spot:
            return ClusterType.SPOT
        else:
            return ClusterType.ON_DEMAND

    # Default: use spot if available
    if has_spot:
        return ClusterType.SPOT
    else:
        return ClusterType.NONE  # Wait

This is a fixed formula. It does not learn from patterns in spot availability. It treats every spot instance the same, regardless of whether spots have been stable for an hour or getting preempted every 30 seconds.

The evolved policy: Adaptive Strategy

After 400 iterations (~5 hours, <$20), OpenEvolve discovers a fundamentally different approach:

def AdaptiveStrategy(has_spot, state, env, task):
    # Track recent spot availability patterns
    self.window.append(has_spot)
    alpha = window_avg(self.window)
    streak = longest_run(self.window)
    tail = trailing_run(self.window)

    params = get_params(classify_situation(
        alpha, streak, tail))

    # Compare remaining time to safety margin
    need = ticks_needed(task, env)
    slack = ticks_remaining(env)
    safety = safety_margin(task, params)

    if need >= slack:
        lock_on_demand()
        return ClusterType.ON_DEMAND

    # Classify situation: stable, moderate, unstable
    # Adjust behavior based on pattern classification
    if has_spot and looks_stable(streak, tail, params):
        return ClusterType.SPOT
    elif slack < safety * params.wait_margin:
        return ClusterType.NONE  # Wait for stable spot
    else:
        start_rebuilding(params.dwell)
        return ClusterType.ON_DEMAND

Look at the innovations the LLM discovered:

Sliding window for spot patterns — tracks recent availability to classify the situation as stable, moderate, or unstable
Selective waiting — when spots are unreliable, the policy waits for stable ones rather than grabbing every available instance (which causes costly changeover delays)
Dynamic safety margins — safety buffers expand when preemption is frequent and shrink when spots are stable
Hysteretic switching — avoids rapid flip-flopping between spot and on-demand by requiring sustained stability before switching

Result: 7% average higher cost savings versus Uniform Progress, with per-trace improvements reaching 16.7%. All deadlines met. In the multi-region extension, the evolved policy achieves 26% lower cost versus the multi-region baseline by learning to opportunistically migrate jobs to regions with cheaper, more stable spots.

Evolution trajectory

The search process reveals how the LLM discovers these strategies incrementally:

Iterations 1–90: Learns to track spot availability, recognizes patterns
Iteration 180: Adds adaptive safety margins based on spot stability
Iteration 240: Adds fine-grained situation classification (stable/moderate/unstable)
Iteration 350: Learns dynamic safety margins that scale with preemption cost
Iteration 389: Final policy integrates all mechanisms with lean parameters

What is the main limitation of Uniform Progress that the evolved policy overcomes?

Uniform Progress uses a fixed formula that does not adapt to spot availability patterns — it treats every spot the same, leading to costly changeover traps when spots are frequently preempted Uniform Progress is too slow to run Uniform Progress cannot handle multiple regions

Chapter 5: Case Study — MoE Expert Placement

Mixture-of-Experts (MoE) models like DeepSeek-V3 route each token to a subset of experts (specialized sub-networks). During inference, experts are distributed across multiple GPUs. The problem: some experts are "hot" (frequently activated) and others are cold. If hot experts cluster on one GPU, that GPU becomes a bottleneck.

The EPLB baseline

Expert Parallelism Load Balancing (EPLB) from DeepSeek runs in three stages:

Distribute expert groups across nodes
Create replicas of hot experts
Assign replicas to GPUs to minimize imbalance

The initial implementation uses greedy bin-packing: sort experts by load in descending order, assign each to the least-loaded GPU. It is written in pure Python with for-loops and takes ~540 ms per rebalancing step.

What OpenEvolve discovers

After 300 iterations (~5 hours, <$10), the LLM makes two critical breakthroughs:

Breakthrough 1: Replace Python loops with tensor operations

The initial code uses nested Python for-loops to search for the best GPU placement. The evolved code replaces this with PyTorch tensor operations — reshaping, indexing, and assignment via vectorized ops. This alone drops the runtime dramatically.

Breakthrough 2: Zigzag placement pattern

Instead of greedy bin-packing, the evolved code discovers a "snake" or zigzag pattern that alternates high-load and low-load experts across GPU slots:

def EvolvedStrategy(...):
    indices = torch.arange(num_items)
    block_id = indices // len(num_packs)
    idx_in_block = indices % len(num_packs)
    is_even_block = block_id % 2 == 0

    # The "snake" pattern: no loop at all
    # For items in even blocks:
    #   assign them to each pack; while
    # for odd blocks: use a reversed order
    packs_for_sorted_items = torch.where(
        is_even_block,
        idx_in_block,
        num_packs - 1 - idx_in_block
    )
    # Now we have assigned a pack for each item
    update_packs(packs_for_sorted_items)

The zigzag naturally interleaves heavy and light experts. Think of dealing cards: if you sort the deck from highest to lowest, then deal them in a zigzag across players (1,2,3,3,2,1,1,2,3,...), each player ends up with a roughly equal total.

Result: Same load imbalance factor as the non-public reference implementation, but running time drops from 540 ms to 3.7 ms — a 5.0× speedup over the internal baseline. The Gemini models had never seen the proprietary reference implementation during training, yet independently rediscovered the zigzag strategy.

What this tells us about LLMs as researchers: The LLM did not just optimize the algorithm — it changed the implementation paradigm. Moving from Python loops to tensor operations is exactly the kind of insight a human systems researcher would have, but it emerged from evolutionary pressure, not explicit instruction.

Understanding the imbalance factor

The optimization target is the imbalance factor: the ratio of maximum tokens on any GPU to the average tokens per GPU. Perfect balance gives 1.0. Higher means some GPUs are overloaded.

Imbalance = max_g(tokens_g) / avg_g(tokens_g)

For a model with E=256 experts across G=8 GPUs, each expert has a different load (number of tokens routed to it). The greedy approach sorts by load [high, ..., low] and assigns in order: GPU0 gets the heaviest, GPU1 the next, etc. But this means GPU0 always gets the heaviest experts in each round.

The zigzag reverses every other round: round 1 goes GPU0→GPU7, round 2 goes GPU7→GPU0. If we have experts sorted by load [100, 90, 80, 70, 60, 50, 40, 30] and 4 GPUs:

Method	GPU 0	GPU 1	GPU 2	GPU 3	Imbalance
Greedy	100+30=130	90+40=130	80+50=130	70+60=130	1.00
Zigzag	100+70=170	90+60=150	80+50=130	40+30=70	—

Wait — in this toy example, greedy is perfect! The advantage of zigzag appears with replicas. When hot experts are replicated, the assignment becomes more complex: replicas of the same expert must go to different GPUs, and the number of replicas varies. The greedy loop must search for valid placements via nested for-loops (O(E × G)), while the tensor zigzag does it in O(E) with vectorized indexing. Same imbalance, vastly less computation.

Evaluator design

The simulator models a distributed GPU inference engine for MoE (168 LOC). It uses ShareGPT and GSM8K traces. The combined metric is a weighted average of imbalance factor and rebalancing runtime — a single scalar for the evolutionary loop to optimize.

Why is the zigzag placement pattern effective for load balancing?

It uses fewer GPU memory slots By alternating the assignment direction across blocks (like dealing cards), it naturally interleaves heavy and light experts, ensuring each GPU gets a mix rather than clustering heavy loads together It avoids using PyTorch operations

Chapter 6: Case Study — LLM-SQL Queries

Imagine running SQL queries over a large table by invoking an LLM for each row. This is relational analytics with LLMs (Liu et al., MLSys '25). Each row triggers a separate LLM inference call. For a table with 10,000 rows, that is 10,000 forward passes. The cost is prohibitive.

The key insight: prefix cache hit rate

Modern LLM inference engines cache the key-value pairs (KV cache) for token prefixes. If consecutive requests share a long common prefix, the engine can reuse the cached KV pairs and skip recomputation. The metric to maximize is the Prefix Hit Rate (PHR): the fraction of tokens shared between consecutive rows' input sequences.

The trick: if you reorder the rows and columns of the table before feeding them to the LLM, you can dramatically increase prefix sharing. Rows with similar values end up adjacent, and their serialized representations share long prefixes.

PHR = Σ_i=1..N-1 shared_prefix(row_i, row_i+1) / Σ_i=1..N-1 len(row_i)

The combinatorial challenge: for a table with n rows and m fields, there are n! × (m!)ⁿ possible orderings. Brute-force is impossible. For a modest table of 100 rows and 5 fields, that is 100! × (120)¹⁰⁰ — an astronomically large search space.

Why this matters economically: At scale, each LLM call costs money. If a table has 10,000 rows and each LLM inference costs $0.01, naive execution costs $100 per query. A 3× improvement in prefix caching reduces effective cost to ~$33. Across millions of queries, this saves millions of dollars.

The baseline: QuickGreedy (GGR)

The published SOTA uses Greedy Recursive Grouping: recursively group rows by common field values, reorder fields using schema statistics. It achieves good PHR but is slow due to repeated counter operations and deep recursion:

def QuickGreedy(df):
    # 1. Compute value counts for all values
    counts = Counter(df.stack())
    val_len = {v: len(str(v))+2 for v in counts}
    # 2. Pick max-value column
    v_star = argmax(...)
    # 3. Split rows with/without v_star
    G = rows with v_star
    R = rows without v_star
    # 4. Reorder columns in G, recurse on remainder
    # 5. Recurse on G remainder and R
    return concat(G, R)

The evolved policy

After 100 iterations (~1 hour, <$7), OpenEvolve discovers a fundamentally different approach:

def EvolvedPolicy(df):
    # 1. Cache value counts (compute ONCE)
    counts = Counter(df.stack())
    val_len = {v: len(str(v))+2 for v in counts}
    # 2. Larger base cutoff → fewer recursions
    BASE = 400
    v_star = argmax(...)
    if v_star is None: return fixed_reorder(df)
    # 3. Recurse only if |df| > BASE
    if len(df) > BASE:
        top, bottom = split(df)
        return concat(EvolvedPolicy(top),
                       EvolvedPolicy(bottom))
    # 4. Lightweight per-row heuristic:
    # sort by squared value length × continuity
    prev = None
    for row in df:
        if prev is None: ...
        matches = [c for c in df.columns
                   if row[c]==prev[c]]
        matches.sort(key=lambda c:
            val_len[row[c]], reverse=True)
        order = matches + [c for c in df.columns
                           if c not in matches]
    return df

The key innovations:

Cached value counts: Compute once globally, eliminating redundant counting at each recursion level
Direct attribute mapping: Replaces slow Pandas lookups with O(N_rows × N_cols) operations
Local per-row heuristic: Instead of global sorting, reorders fields per-row by maximizing continuity with the previous row, weighted by squared value length
Higher base threshold: Reduces recursion depth, trading grouping accuracy for speed

Result: Comparable PHR to QuickGreedy, but the reordering algorithm runs 3.9× faster. The combined score (0.5 × PHR + 0.5 × runtime_norm) is significantly higher. Tested across 5 recommendation datasets (movies, beer, BIRD, PDMX, products).

Evolution process

The search progresses in phases:

Iteration ~32: Discovers ordering by frequency × squared length (eliminates redundant value counting)
Iteration ~72: Adds normalized weights and multi-key sorting for the most informative columns
Iteration ~97: Final design with optimal speed-accuracy tradeoff, NumPy-based reordering replacing Pandas

Why does reordering table rows help reduce LLM inference cost?

Similar rows placed adjacently share long token prefixes, so the KV cache can be reused across consecutive LLM calls, skipping redundant computation It reduces the number of rows in the table It makes the SQL query simpler

Chapter 7: Case Study — Transaction Scheduling

In database systems, conflicts between transactions on shared data cause performance bottlenecks. If transaction A writes to key X and transaction B also writes to key X, they conflict. The scheduler must order transactions to minimize the total number of conflicts, thereby improving throughput.

The problem formulation

Given a set of transactions (each a sequence of read/write operations on keys), find an ordering that minimizes makespan — the total time to execute all transactions. This is equivalent to minimizing the cost of conflicts, because conflicts cause serialization delays.

The paper considers two settings:

Setting	Constraint	Difficulty
Online	Order is fixed once committed; keys unknown a priori	Must decide in O(n) time, no lookahead
Offline	Full batch visible; can use any ordering strategy	NP-hard in general, but good heuristics exist

The baseline: SMF (Shortest Makespan First)

SMF (Cheng et al., VLDB '24) is a greedy O(n × k) algorithm. At each step, it samples k random unscheduled transactions, picks the one that adds the least incremental cost (conflict-weighted), and appends it to the schedule.

Online setting: rediscovery

In the online setting, OpenEvolve rediscovers SMF from scratch, starting from a random baseline. This is likely due to contamination (the model saw the SMF paper during training), but it is still notable: the evolutionary framework converges to the known SOTA without being told about it.

Offline setting: genuine discovery

In the offline setting, where no prior published solution exists, OpenEvolve finds a novel algorithm that beats SMF by 20–34%. The evolved algorithm combines three techniques:

def get_best_schedule(self, num_seqs=10):
    n = len(self.txns)
    idxs = list(range(n))

    # --- 0. Pre-compute transaction statistics ---
    write_cnt = {t: sum(1 for op in self.txns[t]
                     if op[0] == "w")
                 for t in idxs}

    # --- 1. Sorted sequence (by #writes, length) ---
    base_seq = sorted(idxs, key=lambda t:
        (write_cnt[t], len(self.txns[t])))
    best_seq = base_seq
    best_cost = self.get_opt_seq_cost(best_seq)

    # --- 2. Greedy insertion ---
    improved = True
    while improved:
        improved = False
        for i in range(n):
            txn = best_seq.pop(i)
            # Try every position for this txn
            for pos in range(n):
                candidate = best_seq[:pos]+[txn]+best_seq[pos:]
                cost = self.get_opt_seq_cost(candidate)
                if cost < best_cost: ...

    # --- 3. Pair-swap hill climb ---
    for _ in range(max_iters):
        i, j = random.sample(range(n), 2)
        # Swap and accept if better

    # --- 4. Random restarts (keep best) ---
    for _ in range(num_seqs):
        seq = idxs[:]
        random.shuffle(seq)
        cost = self.get_opt_seq_cost(seq)
        if cost < best_cost: ...

    return best_cost, best_seq

The four-phase structure emerged from evolution:

Smart initialization: Sort by writes and length (fewer writes first), creating a conflict-aware starting point
Greedy insertion: For each transaction, remove and re-insert at the best position (O(n²) but powerful)
Pair-swap hill climbing: Random pairwise swaps to escape local minima
Random restarts: Try completely random orderings to diversify the search

Result: 20% better makespan than SMF greedy in the offline setting. The O(n²) runtime is acceptable because the offline setting allows more computation. This is a genuinely new algorithm — no known prior work for this setting.

The contamination lesson: In the online setting, OpenEvolve converges to SMF — likely because the training data included the SMF paper. This is a feature, not a bug: it shows the framework can rapidly rediscover known-good solutions as a starting point. But it means online-setting results should be interpreted cautiously.

Understanding makespan cost

SMF is built around the concept of incremental conflict cost. When you add a transaction to the schedule, you pay a cost proportional to how many already-scheduled transactions conflict with it. The makespan increase from adding transaction t_i at position k is:

Δmakespan(t_i, k) = Σ_j<k conflict(t_i, t_j)

where conflict(t_i, t_j) = 1 if t_i and t_j share a key with at least one write. SMF greedily picks the transaction with the lowest incremental cost at each step. The evolved offline algorithm goes further: it also considers the future cost of each placement by running hill climbing and random restarts after the initial greedy pass.

The evaluator uses traces from 5 OLTP benchmarks (Epinions, SmallBank, TPC-C, TAOBench, YCSB) with 500 transactions each. It measures total makespan and provides statistical bounds on the schedule quality.

Why does the evolved offline scheduler use four phases instead of just one strategy?

Each phase addresses a different weakness: smart init provides a good starting point, greedy insertion refines it locally, hill climbing escapes local minima, and random restarts provide diversity — evolution discovered that combining them outperforms any single strategy The LLM could not fit the algorithm in one function Each phase runs on a different GPU

Chapter 8: Failure Modes & Best Practices

ADRS is not magic. The paper is refreshingly honest about failure modes. Based on analysis of 420 LLM-judged traces, they identify three categories:

1. Runtime Errors

Failure Type	Description
Syntax & Interface Errors	Generated code fails to compile or integrate with the evaluator
Budget Exhaustion	Candidate exceeds resource limits (context window, API quotas, timeouts)

2. Search Failures

Failure Type	Description
Premature Convergence	Search settles on a local optimum too early
Stuck-in-the-Loop	Repeats similar solutions without progress
Mutation Drift	Conflicting changes from different models cancel each other

3. Algorithm Failures

Failure Type	Description
Misaligned Objectives	Solution ignores key constraints (e.g., latency SLOs)
Sub-Optimal Optimizations	Shallow API-level tweaks instead of algorithmic innovation
Overfitting	Works on training traces, fails on unseen workloads
Reward Hacking	Exploits evaluator loopholes to get high scores without solving the problem

Reward hacking is real: In one case, when evolving code to detect failures in a multi-agent system, candidates bypassed an entire evaluation stage to get high scores without performing the task. Evaluators must combine multiple signals (correctness, efficiency, robustness) and include adversarial tests.

The evaluation process

The failure distribution comes from an interesting methodology. The authors took 420 individual evolution traces (candidate solutions across all case studies) and had an LLM classify each one into failure categories. This meta-evaluation reveals that the majority of failures are mundane runtime errors, not deep algorithmic issues. This suggests a practical intervention: better sandboxing, automatic syntax checking, and interface validation before sending code to the expensive simulator would dramatically improve evolution efficiency.

Practical implication: If ~50% of iterations fail due to syntax or interface errors, then a lightweight pre-evaluation filter (e.g., just running python -c "import candidate") could nearly double the effective search budget for the same cost.

Best practices by component

Prompt Generator

Provide structured specifications — clearly state the problem, evaluation criteria, and constraints
Start with a suitable base program — a clean, minimal baseline accelerates progress. Buggy or weak baselines waste iterations on trivial fixes (budget exhaustion). Overly strong baselines restrict exploration to micro-optimizations
Provide hints judiciously — too many hints cause premature convergence, too few leave the search blind. Inject hints when the search gets stuck
Control the abstraction level — hide high-level library APIs to prevent sub-optimal shortcut solutions

Solution Generator

Use a two-model ensemble — reasoning model (20%) for exploration + fast model (80%) for exploitation
Do not use more than two models — three or more models cause mutation drift

Evaluator

Prevent overfitting — use broad, diverse test sets. Sample different subsets for feedback during search versus final evaluation
Prevent reward hacking — combine multiple metrics, add adversarial tests, require all constraints to be satisfied

Solution Selector

Balance exploration-exploitation — greedy selectors converge fast but get stuck; random selectors waste resources. Island-based methods (MAP-Elites) strike the best balance

Why is starting with a very strong baseline sometimes WORSE than starting with a simple one?

An overly strong baseline constrains the search to shallow micro-optimizations around the existing approach, preventing discovery of fundamentally different (and potentially better) strategies Strong baselines are too long for the LLM to read The evaluator cannot tell if the evolved version is better

Chapter 9: Connections & Implications

Full results table

Here is the summary of all 11 tasks investigated in the paper:

Task	Result vs. SOTA	Time / Cost
Telemetry Repair	+9% repair, +30% calibration	8h, ≤$10
Adaptive Weight Compression	Similar bits/elem, 14.2% worse PPL	12h, ≤$20
Cloudcast (multi-cloud transfer)	Matches SOTA cost	1h, ≤$5
Expert Parallelism Load Balancer	2× → 5× faster runtime	5h, ≤$10
Global Model Placement	18.5% cheaper	40m, ≤$5
LLM-SQL (prefix cache reuse)	3.9× faster reordering	1h, ≤$7
Transaction Scheduling (online)	Rediscovers SOTA (SMF)	<2h, ≤$20
Transaction Scheduling (offline)	20–34% better makespan	<2h, ≤$20
Can't Be Late (single-region)	7% avg cost savings, up to 16.7%	5h, ≤$20
Can't Be Late (multi-region)	26% lower cost	1h, ≤$5
Sparse Attention Design	7% avg improvement	4h, ≤$15

What this means for researchers

The authors argue that the human researcher's role shifts from algorithm designer to problem architect:

Problem formulation becomes the high-leverage skill — defining the right objective, constraints, and evaluation framework
Evaluator design becomes critical — a good evaluator is the difference between useful evolution and reward hacking
Interpretation matters more — understanding why the evolved algorithm works, distilling its insights, and connecting them to theory
Manual algorithm design matters less — the generate-and-test cycle is increasingly automated

A virtuous cycle: ADRS can improve the AI systems that power ADRS itself. The paper shows this explicitly — evolved algorithms for MoE inference and sparse attention directly improve LLM infrastructure, which makes future ADRS runs faster and cheaper.

Related frameworks

Framework	Key Idea	Relation to ADRS
AlphaEvolve	MAP-Elites + island evolution + LLM mutations	The inspiration; proprietary (Google DeepMind)
OpenEvolve	Open-source reimplementation of AlphaEvolve core	Primary ADRS engine in this paper
GEPA	Genetic-Pareto with reflective LLM feedback	Alternative ADRS framework
LLM4AD	Unified platform with NSGA-II, MOEA/D, etc.	Broader search strategy support
Cursor / Windsurf	Coding assistants with full codebase context	Human-in-the-loop ADRS analogs

How ADRS compares to AlphaEvolve

This paper is closely related to AlphaEvolve (Novikov et al., Google DeepMind, 2025), but there are key differences:

Dimension	AlphaEvolve	ADRS (this paper)
Focus	General scientific discovery (math, chip design, scheduling)	Systems performance research specifically
Framework	Proprietary (Google internal)	Open-source (OpenEvolve)
Verification	Problem-specific evaluators	Emphasizes simulator-based eval as key advantage
Human role	Minimal — fully automated search	Explicit outer loop with human guidance
Contribution	Individual breakthroughs (Strassen improvement)	Systematic methodology + best practices + taxonomy of failures
Cost	Not disclosed	$5–$20 per run, explicitly tracked

The ADRS paper's unique contribution is the meta-analysis: rather than showing one spectacular result, it systematically evaluates ADRS across 11 diverse tasks, categorizes failure modes, and distills actionable best practices. It is a roadmap for the field, not just a proof of concept.

Open challenges

Better evaluators: Build simulators with higher fidelity, generality, and speed. Lightweight VMs and container snapshots can shorten feedback cycles.
Better search: Current search often gets stuck. Better exploration-exploitation balance, adaptive prompt strategies, and cross-domain transfer of search insights are all open problems.
Scaling to complex systems: ADRS currently works on isolated functions. Evolving multi-component distributed protocols remains unsolved.
Standardized benchmarks: No common evaluation suite exists for ADRS frameworks. Building one (like MLGym for AI research) would accelerate the field.

← Back to Veanors

What is the "virtuous cycle" the authors identify?

ADRS can improve the AI infrastructure (LLM inference, scheduling) that powers ADRS itself, creating a compounding acceleration of discovery — better systems make better AI which makes even better systems Each ADRS run produces training data for the next run Researchers collaborate more effectively with AI