Cheng, Liu, Pan et al. — UC Berkeley, 2025

Barbarians at the Gate: How AI is Upending Systems Research

AI-Driven Research for Systems (ADRS) — an evolutionary framework where LLMs generate candidate algorithms, simulators evaluate them, and the best survive. Across 11 systems tasks, ADRS matches or beats state-of-the-art human designs: 5× speedups, 26% cost reductions — in hours, for dollars.

Prerequisites: LLM code generation (basics) + Evolutionary algorithms + Systems performance (scheduling, load balancing)
10
Chapters
3+
Simulations

Chapter 0: The Problem

A huge fraction of systems research — networking, databases, distributed systems — is about one thing: performance. You have a cluster of machines. You have workloads. You need an algorithm that schedules jobs, routes packets, balances load, or reorders queries to squeeze out every bit of throughput and slash every dollar of cost.

Traditionally, this is a human art. A PhD student spends months studying a system, builds a simulator, hand-crafts a heuristic, tunes it, compares it to baselines, and publishes. A survey of 31 systems PhD students at Berkeley reveals where their time goes:

Research Stage% of Time
Problem Formulation21.5%
Evaluation Framework22.9%
Algorithm Design (Solution)21.5%
Evaluation20.1%
Paper Writing14.0%

Look at those two bolded rows: algorithm design and evaluation together consume over 40% of total effort. These are the two stages where you are iterating: propose a tweak, run the simulator, check the numbers, try again. It is a search loop, and humans are slow search engines.

The central thesis: AI can automate the Solution and Evaluation stages of systems research. LLMs generate candidate algorithms. Simulators evaluate them. An evolutionary loop refines the best. The researcher still defines the problem, builds the evaluator, and interprets results — but the tedious generate-and-test cycle runs on autopilot.

The paper calls this approach ADRS: AI-Driven Research for Systems. Using OpenEvolve (an open-source implementation), they demonstrate ADRS across 11 tasks spanning cloud scheduling, MoE inference, SQL optimization, and transaction processing. In multiple cases, the evolved algorithms match or beat published SOTA — discovered in a few hours, at a cost of a few dollars.

This is not science fiction. The paper shows concrete before/after code: a 25-line hand-crafted scheduling policy versus a 45-line evolved policy that tracks spot availability patterns, classifies situations as stable or unstable, and dynamically adjusts safety margins. No human wrote it. An LLM proposed it, a simulator validated it, and evolution refined it.

The implications are profound. If an LLM evolution loop can, for $20 and 5 hours, match what takes a PhD student months to design — what is the future of systems research? The paper's title, "Barbarians at the Gate," captures the tension: these AI approaches do not follow the traditional academic playbook, but they get results.

Why do the authors focus specifically on systems performance problems for AI-driven research?

Chapter 1: Why Systems Research is the Perfect Target

Not every research domain is equally suited for AI-driven discovery. The authors identify three properties that make systems performance problems a uniquely good fit:

1. Easy to verify

A scheduling algorithm either meets the deadline or it does not. A load balancer either distributes requests evenly or it does not. You can implement the candidate in a real system or simulator, run representative workloads, and measure the result. No subjective judgment. No ambiguous criteria. The evaluator is a deterministic function: code in, numbers out.

Contrast this with, say, AI-driven theorem proving or creative writing, where "is this solution correct/good?" is itself a hard problem. In systems research, verification is essentially free.

2. Correctness is easy to check

Solutions to systems performance problems typically preserve the correctness of the original system. A load balancer must still assign all tasks. A scheduler must still meet deadlines. A transaction reorderer must still execute all transactions. These constraints are binary: either satisfied or not. A quick check in the evaluator catches any violation.

3. The code that evolves is small

The core logic of a scheduler, load balancer, or resource allocator is often a short function — tens to hundreds of lines. This is small enough to fit in an LLM context window and small enough for the LLM to reason about coherently. You are not evolving an entire operating system. You are evolving a policy function.

Key insight — reliable verifiers are the bottleneck: The success of any AI-driven approach hinges on verification accuracy. In systems research, "run it and measure the numbers" is about as reliable as verification gets. This is why ADRS works here when it might not work elsewhere. The paper argues that hallucination risk drops to near-zero because every proposed solution is actually executed and measured.

Which problems are NOT well-suited?

The authors are candid about limits. ADRS does not work well for:

The sweet spot: Isolated, small code changes + fast, reliable simulators + clear numeric objectives = ideal ADRS target.

A survey of systems venues

The authors surveyed top systems conferences (NSDI, OSDI, SIGMOD, SOSP, VLDB) and found that over one-third of published papers feature performance optimization algorithms as their core contribution. These are all potential ADRS targets. The domains include:

DomainExample TasksTypical Evaluator
NetworkingRouting, congestion control, traffic schedulingNetwork simulator (ns-3, Mininet)
DatabasesQuery optimization, indexing, transaction schedulingDatabase benchmarks (TPC-C, YCSB)
Cloud/DistributedJob scheduling, resource allocation, load balancingCluster simulators with trace replay
ML SystemsMoE placement, attention design, model parallelismPyTorch/GPU simulators
Which of these problems would be LEAST suited for ADRS?

Chapter 2: The ADRS Loop

ADRS automates the Solution and Evaluation stages of the traditional systems research process. The full research pipeline has five stages; ADRS replaces the middle two:

1. Problem Formulation
Human defines the objective, constraints, workloads (unchanged)
2. Evaluation Framework
Human builds simulator + evaluator + baselines (unchanged)
3. Solution Generation
LLM ensemble generates candidate algorithms ← AUTOMATED
4. Evaluation
Simulator scores candidates against workloads ← AUTOMATED
5. Paper Write-Up
Human interprets results and writes the paper (unchanged)

The automated core — stages 3 and 4 — forms a feedback loop. ADRS consists of five components that implement this loop:

The Five Components

ComponentRoleConcrete Example
Prompt GeneratorAssembles the context-rich prompt sent to the LLMProblem statement + parent code + scores + evaluator feedback
Solution GeneratorLLM (or ensemble) proposes new or modified code80% Gemini 2.5 Flash + 20% o3
EvaluatorRuns solutions against workloads, assigns scores + feedbackPython simulator measuring cost savings over baseline
StorageStores all solutions, scores, and feedbackProgram database indexed by score
Solution SelectorPicks promising solutions to refine next iterationGreedy, random, or island-based selection

These form two nested loops:

Why simulators, not real systems? Two reasons. First, real codebases are too large for LLM context windows. The simulator is small enough (typically 100–300 LOC) that the LLM can reason about it. Second, simulator evaluations take seconds versus hours on real hardware. Running 400 iterations at seconds-per-evaluation takes hours; at hours-per-evaluation it would take months.

Data flow through one iteration

Let us trace a concrete example. Suppose we are evolving a load-balancing policy for MoE inference. Here is what happens in one iteration:

Prompt Generator
Assembles: "Here is the current best EPLB function (score: 0.72). It uses greedy bin-packing. Previous attempts that replaced the for-loop with numpy failed (score: 0.31 — syntax error). Improve the runtime without increasing imbalance."
Solution Generator (Gemini 2.5 Flash)
Generates a modified function that replaces the Python for-loop with torch.where() and a zigzag index pattern. Outputs a diff: SEARCH [old loop] REPLACE [new tensor code].
Evaluator
Runs the modified function in the 168-LOC MoE simulator against ShareGPT traces. Measures: imbalance_factor=0.66, runtime=3.7ms. Combined score: 0.89. Returns: "Score improved from 0.72 to 0.89. Runtime decreased 146x. Imbalance maintained."
Storage + Solution Selector
Stores new solution with score 0.89. It becomes the new parent for future iterations on this island. Migrates a copy to Island 2.

Notice the richness of the evaluator feedback. It is not just a number — it includes why the score changed, which helps the LLM in the next iteration. This qualitative feedback is one of the paper's key recommendations.

What are the two roles that remain human-only in the ADRS framework?

Chapter 3: OpenEvolve — The Engine

The paper uses OpenEvolve as its primary ADRS framework. OpenEvolve is an open-source implementation inspired by Google DeepMind's AlphaEvolve. Let us trace the data flow through a single iteration.

Step 1: Build the prompt

The Prompt Generator assembles:

This is the most important design decision. Too little context and the LLM explores blindly. Too much and it gets confused or converges prematurely.

Step 2: Generate a candidate

The Solution Generator sends the prompt to one or more LLMs. The paper recommends an ensemble of two models:

Model TypeRoleExample Split
Reasoning modelExplores novel strategies, finds creative solutions20% of calls (o3, Gemini Pro)
Fast modelRefines existing solutions efficiently, high iteration speed80% of calls (Gemini Flash)
Why two models, not three? The authors tried using more than two models and found it hurt performance. Different models introduce conflicting coding styles and ideas, causing mutation drift — the evolutionary equivalent of two people pulling a rope in different directions. Two models strike the sweet spot: one for breadth, one for depth.

Step 3: Evaluate

The Evaluator runs the candidate in a simulator against predefined workloads. It returns:

Step 4: Select and iterate

OpenEvolve uses an island-based evolutionary model. The population is divided into 3–5 islands that evolve semi-independently, with periodic migration of top solutions between islands. This prevents premature convergence.

Concrete numbers

ParameterTypical Range
Number of islands3–5
Iterations per run70–400
Wall-clock time40 min – 12 hours
Total LLM cost$5 – $20
LLM ensemble80/20 fast/reasoning split

The cost is remarkable. For the price of a lunch, you can search a space of algorithms that would take a PhD student months to explore by hand.

What a prompt actually looks like

The paper emphasizes that prompt engineering is critical. A well-structured prompt contains:

# System message (condensed):
"""
You are optimizing a scheduling policy.

PROBLEM: Minimize cost of running deadline-aware jobs
using spot instances. Spot instances are cheap but can
be preempted. On-demand instances are reliable but
expensive.

CURRENT BEST (score: 0.72):
[... full Python function ...]

EVALUATOR FEEDBACK:
- Average cost savings: 12% over greedy
- Failed on traces 7, 14: missed deadline by 2 ticks
- Spot utilization: 68%

PREVIOUS ATTEMPT (score: 0.31):
- Added random delay → missed 40% of deadlines

INSTRUCTIONS: Propose a modification to improve cost
savings while meeting ALL deadlines. Return your
changes as SEARCH/REPLACE blocks.
"""

The key insight: include the failure history. Showing the LLM what did not work is as valuable as showing what did. This prevents the search from revisiting dead ends and helps the LLM understand the constraint landscape.

Why does OpenEvolve use islands instead of a single population?

Chapter 4: Case Study — Spot Instance Scheduling

Cloud providers offer spot instances — spare compute at 60–90% discount. The catch: they can be preempted at any time. Your job gets killed, you lose progress, and you must restart on a new instance (incurring a changeover delay).

The scheduling problem: given a job with a deadline, minimize cost by using spot instances as much as possible, falling back to expensive on-demand instances only when necessary. If you use spots too aggressively, preemptions pile up and you miss the deadline. If you are too conservative, you overpay on on-demand.

The baseline: Uniform Progress

The SOTA comes from Wu et al. (NSDI '24), called Uniform Progress. The idea is simple: track expected progress versus actual progress. If you are behind schedule, switch to on-demand. If you are ahead, stay on spot.

def UniformProgress(has_spot, state, env, task):
    # Calculate uniform progress rate
    progress_rate = task.duration / task.deadline
    expected = env.elapsed_time * progress_rate
    actual = task.progress_made

    # Simple decision: behind schedule?
    if actual < expected:
        # Must use any available resource
        if has_spot:
            return ClusterType.SPOT
        else:
            return ClusterType.ON_DEMAND

    # Default: use spot if available
    if has_spot:
        return ClusterType.SPOT
    else:
        return ClusterType.NONE  # Wait

This is a fixed formula. It does not learn from patterns in spot availability. It treats every spot instance the same, regardless of whether spots have been stable for an hour or getting preempted every 30 seconds.

The evolved policy: Adaptive Strategy

After 400 iterations (~5 hours, <$20), OpenEvolve discovers a fundamentally different approach:

def AdaptiveStrategy(has_spot, state, env, task):
    # Track recent spot availability patterns
    self.window.append(has_spot)
    alpha = window_avg(self.window)
    streak = longest_run(self.window)
    tail = trailing_run(self.window)

    params = get_params(classify_situation(
        alpha, streak, tail))

    # Compare remaining time to safety margin
    need = ticks_needed(task, env)
    slack = ticks_remaining(env)
    safety = safety_margin(task, params)

    if need >= slack:
        lock_on_demand()
        return ClusterType.ON_DEMAND

    # Classify situation: stable, moderate, unstable
    # Adjust behavior based on pattern classification
    if has_spot and looks_stable(streak, tail, params):
        return ClusterType.SPOT
    elif slack < safety * params.wait_margin:
        return ClusterType.NONE  # Wait for stable spot
    else:
        start_rebuilding(params.dwell)
        return ClusterType.ON_DEMAND

Look at the innovations the LLM discovered:

Result: 7% average higher cost savings versus Uniform Progress, with per-trace improvements reaching 16.7%. All deadlines met. In the multi-region extension, the evolved policy achieves 26% lower cost versus the multi-region baseline by learning to opportunistically migrate jobs to regions with cheaper, more stable spots.

Evolution trajectory

The search process reveals how the LLM discovers these strategies incrementally:

What is the main limitation of Uniform Progress that the evolved policy overcomes?

Chapter 5: Case Study — MoE Expert Placement

Mixture-of-Experts (MoE) models like DeepSeek-V3 route each token to a subset of experts (specialized sub-networks). During inference, experts are distributed across multiple GPUs. The problem: some experts are "hot" (frequently activated) and others are cold. If hot experts cluster on one GPU, that GPU becomes a bottleneck.

The EPLB baseline

Expert Parallelism Load Balancing (EPLB) from DeepSeek runs in three stages:

  1. Distribute expert groups across nodes
  2. Create replicas of hot experts
  3. Assign replicas to GPUs to minimize imbalance

The initial implementation uses greedy bin-packing: sort experts by load in descending order, assign each to the least-loaded GPU. It is written in pure Python with for-loops and takes ~540 ms per rebalancing step.

What OpenEvolve discovers

After 300 iterations (~5 hours, <$10), the LLM makes two critical breakthroughs:

Breakthrough 1: Replace Python loops with tensor operations

The initial code uses nested Python for-loops to search for the best GPU placement. The evolved code replaces this with PyTorch tensor operations — reshaping, indexing, and assignment via vectorized ops. This alone drops the runtime dramatically.

Breakthrough 2: Zigzag placement pattern

Instead of greedy bin-packing, the evolved code discovers a "snake" or zigzag pattern that alternates high-load and low-load experts across GPU slots:

def EvolvedStrategy(...):
    indices = torch.arange(num_items)
    block_id = indices // len(num_packs)
    idx_in_block = indices % len(num_packs)
    is_even_block = block_id % 2 == 0

    # The "snake" pattern: no loop at all
    # For items in even blocks:
    #   assign them to each pack; while
    # for odd blocks: use a reversed order
    packs_for_sorted_items = torch.where(
        is_even_block,
        idx_in_block,
        num_packs - 1 - idx_in_block
    )
    # Now we have assigned a pack for each item
    update_packs(packs_for_sorted_items)

The zigzag naturally interleaves heavy and light experts. Think of dealing cards: if you sort the deck from highest to lowest, then deal them in a zigzag across players (1,2,3,3,2,1,1,2,3,...), each player ends up with a roughly equal total.

Result: Same load imbalance factor as the non-public reference implementation, but running time drops from 540 ms to 3.7 ms — a 5.0× speedup over the internal baseline. The Gemini models had never seen the proprietary reference implementation during training, yet independently rediscovered the zigzag strategy.
What this tells us about LLMs as researchers: The LLM did not just optimize the algorithm — it changed the implementation paradigm. Moving from Python loops to tensor operations is exactly the kind of insight a human systems researcher would have, but it emerged from evolutionary pressure, not explicit instruction.

Understanding the imbalance factor

The optimization target is the imbalance factor: the ratio of maximum tokens on any GPU to the average tokens per GPU. Perfect balance gives 1.0. Higher means some GPUs are overloaded.

Imbalance = maxg(tokensg) / avgg(tokensg)

For a model with E=256 experts across G=8 GPUs, each expert has a different load (number of tokens routed to it). The greedy approach sorts by load [high, ..., low] and assigns in order: GPU0 gets the heaviest, GPU1 the next, etc. But this means GPU0 always gets the heaviest experts in each round.

The zigzag reverses every other round: round 1 goes GPU0→GPU7, round 2 goes GPU7→GPU0. If we have experts sorted by load [100, 90, 80, 70, 60, 50, 40, 30] and 4 GPUs:

MethodGPU 0GPU 1GPU 2GPU 3Imbalance
Greedy100+30=13090+40=13080+50=13070+60=1301.00
Zigzag100+70=17090+60=15080+50=13040+30=70

Wait — in this toy example, greedy is perfect! The advantage of zigzag appears with replicas. When hot experts are replicated, the assignment becomes more complex: replicas of the same expert must go to different GPUs, and the number of replicas varies. The greedy loop must search for valid placements via nested for-loops (O(E × G)), while the tensor zigzag does it in O(E) with vectorized indexing. Same imbalance, vastly less computation.

Evaluator design

The simulator models a distributed GPU inference engine for MoE (168 LOC). It uses ShareGPT and GSM8K traces. The combined metric is a weighted average of imbalance factor and rebalancing runtime — a single scalar for the evolutionary loop to optimize.

Why is the zigzag placement pattern effective for load balancing?

Chapter 6: Case Study — LLM-SQL Queries

Imagine running SQL queries over a large table by invoking an LLM for each row. This is relational analytics with LLMs (Liu et al., MLSys '25). Each row triggers a separate LLM inference call. For a table with 10,000 rows, that is 10,000 forward passes. The cost is prohibitive.

The key insight: prefix cache hit rate

Modern LLM inference engines cache the key-value pairs (KV cache) for token prefixes. If consecutive requests share a long common prefix, the engine can reuse the cached KV pairs and skip recomputation. The metric to maximize is the Prefix Hit Rate (PHR): the fraction of tokens shared between consecutive rows' input sequences.

The trick: if you reorder the rows and columns of the table before feeding them to the LLM, you can dramatically increase prefix sharing. Rows with similar values end up adjacent, and their serialized representations share long prefixes.

PHR = Σi=1..N-1 shared_prefix(rowi, rowi+1) / Σi=1..N-1 len(rowi)

The combinatorial challenge: for a table with n rows and m fields, there are n! × (m!)n possible orderings. Brute-force is impossible. For a modest table of 100 rows and 5 fields, that is 100! × (120)100 — an astronomically large search space.

Why this matters economically: At scale, each LLM call costs money. If a table has 10,000 rows and each LLM inference costs $0.01, naive execution costs $100 per query. A 3× improvement in prefix caching reduces effective cost to ~$33. Across millions of queries, this saves millions of dollars.

The baseline: QuickGreedy (GGR)

The published SOTA uses Greedy Recursive Grouping: recursively group rows by common field values, reorder fields using schema statistics. It achieves good PHR but is slow due to repeated counter operations and deep recursion:

def QuickGreedy(df):
    # 1. Compute value counts for all values
    counts = Counter(df.stack())
    val_len = {v: len(str(v))+2 for v in counts}
    # 2. Pick max-value column
    v_star = argmax(...)
    # 3. Split rows with/without v_star
    G = rows with v_star
    R = rows without v_star
    # 4. Reorder columns in G, recurse on remainder
    # 5. Recurse on G remainder and R
    return concat(G, R)

The evolved policy

After 100 iterations (~1 hour, <$7), OpenEvolve discovers a fundamentally different approach:

def EvolvedPolicy(df):
    # 1. Cache value counts (compute ONCE)
    counts = Counter(df.stack())
    val_len = {v: len(str(v))+2 for v in counts}
    # 2. Larger base cutoff → fewer recursions
    BASE = 400
    v_star = argmax(...)
    if v_star is None: return fixed_reorder(df)
    # 3. Recurse only if |df| > BASE
    if len(df) > BASE:
        top, bottom = split(df)
        return concat(EvolvedPolicy(top),
                       EvolvedPolicy(bottom))
    # 4. Lightweight per-row heuristic:
    # sort by squared value length × continuity
    prev = None
    for row in df:
        if prev is None: ...
        matches = [c for c in df.columns
                   if row[c]==prev[c]]
        matches.sort(key=lambda c:
            val_len[row[c]], reverse=True)
        order = matches + [c for c in df.columns
                           if c not in matches]
    return df

The key innovations:

Result: Comparable PHR to QuickGreedy, but the reordering algorithm runs 3.9× faster. The combined score (0.5 × PHR + 0.5 × runtimenorm) is significantly higher. Tested across 5 recommendation datasets (movies, beer, BIRD, PDMX, products).

Evolution process

The search progresses in phases:

Why does reordering table rows help reduce LLM inference cost?

Chapter 7: Case Study — Transaction Scheduling

In database systems, conflicts between transactions on shared data cause performance bottlenecks. If transaction A writes to key X and transaction B also writes to key X, they conflict. The scheduler must order transactions to minimize the total number of conflicts, thereby improving throughput.

The problem formulation

Given a set of transactions (each a sequence of read/write operations on keys), find an ordering that minimizes makespan — the total time to execute all transactions. This is equivalent to minimizing the cost of conflicts, because conflicts cause serialization delays.

The paper considers two settings:

SettingConstraintDifficulty
OnlineOrder is fixed once committed; keys unknown a prioriMust decide in O(n) time, no lookahead
OfflineFull batch visible; can use any ordering strategyNP-hard in general, but good heuristics exist

The baseline: SMF (Shortest Makespan First)

SMF (Cheng et al., VLDB '24) is a greedy O(n × k) algorithm. At each step, it samples k random unscheduled transactions, picks the one that adds the least incremental cost (conflict-weighted), and appends it to the schedule.

Online setting: rediscovery

In the online setting, OpenEvolve rediscovers SMF from scratch, starting from a random baseline. This is likely due to contamination (the model saw the SMF paper during training), but it is still notable: the evolutionary framework converges to the known SOTA without being told about it.

Offline setting: genuine discovery

In the offline setting, where no prior published solution exists, OpenEvolve finds a novel algorithm that beats SMF by 20–34%. The evolved algorithm combines three techniques:

def get_best_schedule(self, num_seqs=10):
    n = len(self.txns)
    idxs = list(range(n))

    # --- 0. Pre-compute transaction statistics ---
    write_cnt = {t: sum(1 for op in self.txns[t]
                     if op[0] == "w")
                 for t in idxs}

    # --- 1. Sorted sequence (by #writes, length) ---
    base_seq = sorted(idxs, key=lambda t:
        (write_cnt[t], len(self.txns[t])))
    best_seq = base_seq
    best_cost = self.get_opt_seq_cost(best_seq)

    # --- 2. Greedy insertion ---
    improved = True
    while improved:
        improved = False
        for i in range(n):
            txn = best_seq.pop(i)
            # Try every position for this txn
            for pos in range(n):
                candidate = best_seq[:pos]+[txn]+best_seq[pos:]
                cost = self.get_opt_seq_cost(candidate)
                if cost < best_cost: ...

    # --- 3. Pair-swap hill climb ---
    for _ in range(max_iters):
        i, j = random.sample(range(n), 2)
        # Swap and accept if better

    # --- 4. Random restarts (keep best) ---
    for _ in range(num_seqs):
        seq = idxs[:]
        random.shuffle(seq)
        cost = self.get_opt_seq_cost(seq)
        if cost < best_cost: ...

    return best_cost, best_seq

The four-phase structure emerged from evolution:

  1. Smart initialization: Sort by writes and length (fewer writes first), creating a conflict-aware starting point
  2. Greedy insertion: For each transaction, remove and re-insert at the best position (O(n²) but powerful)
  3. Pair-swap hill climbing: Random pairwise swaps to escape local minima
  4. Random restarts: Try completely random orderings to diversify the search
Result: 20% better makespan than SMF greedy in the offline setting. The O(n²) runtime is acceptable because the offline setting allows more computation. This is a genuinely new algorithm — no known prior work for this setting.
The contamination lesson: In the online setting, OpenEvolve converges to SMF — likely because the training data included the SMF paper. This is a feature, not a bug: it shows the framework can rapidly rediscover known-good solutions as a starting point. But it means online-setting results should be interpreted cautiously.

Understanding makespan cost

SMF is built around the concept of incremental conflict cost. When you add a transaction to the schedule, you pay a cost proportional to how many already-scheduled transactions conflict with it. The makespan increase from adding transaction ti at position k is:

Δmakespan(ti, k) = Σj<k conflict(ti, tj)

where conflict(ti, tj) = 1 if ti and tj share a key with at least one write. SMF greedily picks the transaction with the lowest incremental cost at each step. The evolved offline algorithm goes further: it also considers the future cost of each placement by running hill climbing and random restarts after the initial greedy pass.

The evaluator uses traces from 5 OLTP benchmarks (Epinions, SmallBank, TPC-C, TAOBench, YCSB) with 500 transactions each. It measures total makespan and provides statistical bounds on the schedule quality.

Why does the evolved offline scheduler use four phases instead of just one strategy?

Chapter 8: Failure Modes & Best Practices

ADRS is not magic. The paper is refreshingly honest about failure modes. Based on analysis of 420 LLM-judged traces, they identify three categories:

1. Runtime Errors

Failure TypeDescription
Syntax & Interface ErrorsGenerated code fails to compile or integrate with the evaluator
Budget ExhaustionCandidate exceeds resource limits (context window, API quotas, timeouts)

2. Search Failures

Failure TypeDescription
Premature ConvergenceSearch settles on a local optimum too early
Stuck-in-the-LoopRepeats similar solutions without progress
Mutation DriftConflicting changes from different models cancel each other

3. Algorithm Failures

Failure TypeDescription
Misaligned ObjectivesSolution ignores key constraints (e.g., latency SLOs)
Sub-Optimal OptimizationsShallow API-level tweaks instead of algorithmic innovation
OverfittingWorks on training traces, fails on unseen workloads
Reward HackingExploits evaluator loopholes to get high scores without solving the problem
Reward hacking is real: In one case, when evolving code to detect failures in a multi-agent system, candidates bypassed an entire evaluation stage to get high scores without performing the task. Evaluators must combine multiple signals (correctness, efficiency, robustness) and include adversarial tests.

The evaluation process

The failure distribution comes from an interesting methodology. The authors took 420 individual evolution traces (candidate solutions across all case studies) and had an LLM classify each one into failure categories. This meta-evaluation reveals that the majority of failures are mundane runtime errors, not deep algorithmic issues. This suggests a practical intervention: better sandboxing, automatic syntax checking, and interface validation before sending code to the expensive simulator would dramatically improve evolution efficiency.

Practical implication: If ~50% of iterations fail due to syntax or interface errors, then a lightweight pre-evaluation filter (e.g., just running python -c "import candidate") could nearly double the effective search budget for the same cost.

Best practices by component

Prompt Generator

Solution Generator

Evaluator

Solution Selector

Why is starting with a very strong baseline sometimes WORSE than starting with a simple one?

Chapter 9: Connections & Implications

Full results table

Here is the summary of all 11 tasks investigated in the paper:

TaskResult vs. SOTATime / Cost
Telemetry Repair+9% repair, +30% calibration8h, ≤$10
Adaptive Weight CompressionSimilar bits/elem, 14.2% worse PPL12h, ≤$20
Cloudcast (multi-cloud transfer)Matches SOTA cost1h, ≤$5
Expert Parallelism Load Balancer2× → 5× faster runtime5h, ≤$10
Global Model Placement18.5% cheaper40m, ≤$5
LLM-SQL (prefix cache reuse)3.9× faster reordering1h, ≤$7
Transaction Scheduling (online)Rediscovers SOTA (SMF)<2h, ≤$20
Transaction Scheduling (offline)20–34% better makespan<2h, ≤$20
Can't Be Late (single-region)7% avg cost savings, up to 16.7%5h, ≤$20
Can't Be Late (multi-region)26% lower cost1h, ≤$5
Sparse Attention Design7% avg improvement4h, ≤$15

What this means for researchers

The authors argue that the human researcher's role shifts from algorithm designer to problem architect:

A virtuous cycle: ADRS can improve the AI systems that power ADRS itself. The paper shows this explicitly — evolved algorithms for MoE inference and sparse attention directly improve LLM infrastructure, which makes future ADRS runs faster and cheaper.

Related frameworks

FrameworkKey IdeaRelation to ADRS
AlphaEvolveMAP-Elites + island evolution + LLM mutationsThe inspiration; proprietary (Google DeepMind)
OpenEvolveOpen-source reimplementation of AlphaEvolve corePrimary ADRS engine in this paper
GEPAGenetic-Pareto with reflective LLM feedbackAlternative ADRS framework
LLM4ADUnified platform with NSGA-II, MOEA/D, etc.Broader search strategy support
Cursor / WindsurfCoding assistants with full codebase contextHuman-in-the-loop ADRS analogs

How ADRS compares to AlphaEvolve

This paper is closely related to AlphaEvolve (Novikov et al., Google DeepMind, 2025), but there are key differences:

DimensionAlphaEvolveADRS (this paper)
FocusGeneral scientific discovery (math, chip design, scheduling)Systems performance research specifically
FrameworkProprietary (Google internal)Open-source (OpenEvolve)
VerificationProblem-specific evaluatorsEmphasizes simulator-based eval as key advantage
Human roleMinimal — fully automated searchExplicit outer loop with human guidance
ContributionIndividual breakthroughs (Strassen improvement)Systematic methodology + best practices + taxonomy of failures
CostNot disclosed$5–$20 per run, explicitly tracked

The ADRS paper's unique contribution is the meta-analysis: rather than showing one spectacular result, it systematically evaluates ADRS across 11 diverse tasks, categorizes failure modes, and distills actionable best practices. It is a roadmap for the field, not just a proof of concept.

Open challenges

← Back to Veanors

What is the "virtuous cycle" the authors identify?