AI-Driven Research for Systems (ADRS) — an evolutionary framework where LLMs generate candidate algorithms, simulators evaluate them, and the best survive. Across 11 systems tasks, ADRS matches or beats state-of-the-art human designs: 5× speedups, 26% cost reductions — in hours, for dollars.
A huge fraction of systems research — networking, databases, distributed systems — is about one thing: performance. You have a cluster of machines. You have workloads. You need an algorithm that schedules jobs, routes packets, balances load, or reorders queries to squeeze out every bit of throughput and slash every dollar of cost.
Traditionally, this is a human art. A PhD student spends months studying a system, builds a simulator, hand-crafts a heuristic, tunes it, compares it to baselines, and publishes. A survey of 31 systems PhD students at Berkeley reveals where their time goes:
| Research Stage | % of Time |
|---|---|
| Problem Formulation | 21.5% |
| Evaluation Framework | 22.9% |
| Algorithm Design (Solution) | 21.5% |
| Evaluation | 20.1% |
| Paper Writing | 14.0% |
Look at those two bolded rows: algorithm design and evaluation together consume over 40% of total effort. These are the two stages where you are iterating: propose a tweak, run the simulator, check the numbers, try again. It is a search loop, and humans are slow search engines.
The paper calls this approach ADRS: AI-Driven Research for Systems. Using OpenEvolve (an open-source implementation), they demonstrate ADRS across 11 tasks spanning cloud scheduling, MoE inference, SQL optimization, and transaction processing. In multiple cases, the evolved algorithms match or beat published SOTA — discovered in a few hours, at a cost of a few dollars.
This is not science fiction. The paper shows concrete before/after code: a 25-line hand-crafted scheduling policy versus a 45-line evolved policy that tracks spot availability patterns, classifies situations as stable or unstable, and dynamically adjusts safety margins. No human wrote it. An LLM proposed it, a simulator validated it, and evolution refined it.
The implications are profound. If an LLM evolution loop can, for $20 and 5 hours, match what takes a PhD student months to design — what is the future of systems research? The paper's title, "Barbarians at the Gate," captures the tension: these AI approaches do not follow the traditional academic playbook, but they get results.
Not every research domain is equally suited for AI-driven discovery. The authors identify three properties that make systems performance problems a uniquely good fit:
A scheduling algorithm either meets the deadline or it does not. A load balancer either distributes requests evenly or it does not. You can implement the candidate in a real system or simulator, run representative workloads, and measure the result. No subjective judgment. No ambiguous criteria. The evaluator is a deterministic function: code in, numbers out.
Contrast this with, say, AI-driven theorem proving or creative writing, where "is this solution correct/good?" is itself a hard problem. In systems research, verification is essentially free.
Solutions to systems performance problems typically preserve the correctness of the original system. A load balancer must still assign all tasks. A scheduler must still meet deadlines. A transaction reorderer must still execute all transactions. These constraints are binary: either satisfied or not. A quick check in the evaluator catches any violation.
The core logic of a scheduler, load balancer, or resource allocator is often a short function — tens to hundreds of lines. This is small enough to fit in an LLM context window and small enough for the LLM to reason about coherently. You are not evolving an entire operating system. You are evolving a policy function.
The authors are candid about limits. ADRS does not work well for:
The authors surveyed top systems conferences (NSDI, OSDI, SIGMOD, SOSP, VLDB) and found that over one-third of published papers feature performance optimization algorithms as their core contribution. These are all potential ADRS targets. The domains include:
| Domain | Example Tasks | Typical Evaluator |
|---|---|---|
| Networking | Routing, congestion control, traffic scheduling | Network simulator (ns-3, Mininet) |
| Databases | Query optimization, indexing, transaction scheduling | Database benchmarks (TPC-C, YCSB) |
| Cloud/Distributed | Job scheduling, resource allocation, load balancing | Cluster simulators with trace replay |
| ML Systems | MoE placement, attention design, model parallelism | PyTorch/GPU simulators |
ADRS automates the Solution and Evaluation stages of the traditional systems research process. The full research pipeline has five stages; ADRS replaces the middle two:
The automated core — stages 3 and 4 — forms a feedback loop. ADRS consists of five components that implement this loop:
| Component | Role | Concrete Example |
|---|---|---|
| Prompt Generator | Assembles the context-rich prompt sent to the LLM | Problem statement + parent code + scores + evaluator feedback |
| Solution Generator | LLM (or ensemble) proposes new or modified code | 80% Gemini 2.5 Flash + 20% o3 |
| Evaluator | Runs solutions against workloads, assigns scores + feedback | Python simulator measuring cost savings over baseline |
| Storage | Stores all solutions, scores, and feedback | Program database indexed by score |
| Solution Selector | Picks promising solutions to refine next iteration | Greedy, random, or island-based selection |
These form two nested loops:
Let us trace a concrete example. Suppose we are evolving a load-balancing policy for MoE inference. Here is what happens in one iteration:
Notice the richness of the evaluator feedback. It is not just a number — it includes why the score changed, which helps the LLM in the next iteration. This qualitative feedback is one of the paper's key recommendations.
The paper uses OpenEvolve as its primary ADRS framework. OpenEvolve is an open-source implementation inspired by Google DeepMind's AlphaEvolve. Let us trace the data flow through a single iteration.
The Prompt Generator assembles:
This is the most important design decision. Too little context and the LLM explores blindly. Too much and it gets confused or converges prematurely.
The Solution Generator sends the prompt to one or more LLMs. The paper recommends an ensemble of two models:
| Model Type | Role | Example Split |
|---|---|---|
| Reasoning model | Explores novel strategies, finds creative solutions | 20% of calls (o3, Gemini Pro) |
| Fast model | Refines existing solutions efficiently, high iteration speed | 80% of calls (Gemini Flash) |
The Evaluator runs the candidate in a simulator against predefined workloads. It returns:
OpenEvolve uses an island-based evolutionary model. The population is divided into 3–5 islands that evolve semi-independently, with periodic migration of top solutions between islands. This prevents premature convergence.
| Parameter | Typical Range |
|---|---|
| Number of islands | 3–5 |
| Iterations per run | 70–400 |
| Wall-clock time | 40 min – 12 hours |
| Total LLM cost | $5 – $20 |
| LLM ensemble | 80/20 fast/reasoning split |
The cost is remarkable. For the price of a lunch, you can search a space of algorithms that would take a PhD student months to explore by hand.
The paper emphasizes that prompt engineering is critical. A well-structured prompt contains:
# System message (condensed): """ You are optimizing a scheduling policy. PROBLEM: Minimize cost of running deadline-aware jobs using spot instances. Spot instances are cheap but can be preempted. On-demand instances are reliable but expensive. CURRENT BEST (score: 0.72): [... full Python function ...] EVALUATOR FEEDBACK: - Average cost savings: 12% over greedy - Failed on traces 7, 14: missed deadline by 2 ticks - Spot utilization: 68% PREVIOUS ATTEMPT (score: 0.31): - Added random delay → missed 40% of deadlines INSTRUCTIONS: Propose a modification to improve cost savings while meeting ALL deadlines. Return your changes as SEARCH/REPLACE blocks. """
The key insight: include the failure history. Showing the LLM what did not work is as valuable as showing what did. This prevents the search from revisiting dead ends and helps the LLM understand the constraint landscape.
Cloud providers offer spot instances — spare compute at 60–90% discount. The catch: they can be preempted at any time. Your job gets killed, you lose progress, and you must restart on a new instance (incurring a changeover delay).
The scheduling problem: given a job with a deadline, minimize cost by using spot instances as much as possible, falling back to expensive on-demand instances only when necessary. If you use spots too aggressively, preemptions pile up and you miss the deadline. If you are too conservative, you overpay on on-demand.
The SOTA comes from Wu et al. (NSDI '24), called Uniform Progress. The idea is simple: track expected progress versus actual progress. If you are behind schedule, switch to on-demand. If you are ahead, stay on spot.
def UniformProgress(has_spot, state, env, task): # Calculate uniform progress rate progress_rate = task.duration / task.deadline expected = env.elapsed_time * progress_rate actual = task.progress_made # Simple decision: behind schedule? if actual < expected: # Must use any available resource if has_spot: return ClusterType.SPOT else: return ClusterType.ON_DEMAND # Default: use spot if available if has_spot: return ClusterType.SPOT else: return ClusterType.NONE # Wait
This is a fixed formula. It does not learn from patterns in spot availability. It treats every spot instance the same, regardless of whether spots have been stable for an hour or getting preempted every 30 seconds.
After 400 iterations (~5 hours, <$20), OpenEvolve discovers a fundamentally different approach:
def AdaptiveStrategy(has_spot, state, env, task): # Track recent spot availability patterns self.window.append(has_spot) alpha = window_avg(self.window) streak = longest_run(self.window) tail = trailing_run(self.window) params = get_params(classify_situation( alpha, streak, tail)) # Compare remaining time to safety margin need = ticks_needed(task, env) slack = ticks_remaining(env) safety = safety_margin(task, params) if need >= slack: lock_on_demand() return ClusterType.ON_DEMAND # Classify situation: stable, moderate, unstable # Adjust behavior based on pattern classification if has_spot and looks_stable(streak, tail, params): return ClusterType.SPOT elif slack < safety * params.wait_margin: return ClusterType.NONE # Wait for stable spot else: start_rebuilding(params.dwell) return ClusterType.ON_DEMAND
Look at the innovations the LLM discovered:
The search process reveals how the LLM discovers these strategies incrementally:
Mixture-of-Experts (MoE) models like DeepSeek-V3 route each token to a subset of experts (specialized sub-networks). During inference, experts are distributed across multiple GPUs. The problem: some experts are "hot" (frequently activated) and others are cold. If hot experts cluster on one GPU, that GPU becomes a bottleneck.
Expert Parallelism Load Balancing (EPLB) from DeepSeek runs in three stages:
The initial implementation uses greedy bin-packing: sort experts by load in descending order, assign each to the least-loaded GPU. It is written in pure Python with for-loops and takes ~540 ms per rebalancing step.
After 300 iterations (~5 hours, <$10), the LLM makes two critical breakthroughs:
The initial code uses nested Python for-loops to search for the best GPU placement. The evolved code replaces this with PyTorch tensor operations — reshaping, indexing, and assignment via vectorized ops. This alone drops the runtime dramatically.
Instead of greedy bin-packing, the evolved code discovers a "snake" or zigzag pattern that alternates high-load and low-load experts across GPU slots:
def EvolvedStrategy(...): indices = torch.arange(num_items) block_id = indices // len(num_packs) idx_in_block = indices % len(num_packs) is_even_block = block_id % 2 == 0 # The "snake" pattern: no loop at all # For items in even blocks: # assign them to each pack; while # for odd blocks: use a reversed order packs_for_sorted_items = torch.where( is_even_block, idx_in_block, num_packs - 1 - idx_in_block ) # Now we have assigned a pack for each item update_packs(packs_for_sorted_items)
The zigzag naturally interleaves heavy and light experts. Think of dealing cards: if you sort the deck from highest to lowest, then deal them in a zigzag across players (1,2,3,3,2,1,1,2,3,...), each player ends up with a roughly equal total.
The optimization target is the imbalance factor: the ratio of maximum tokens on any GPU to the average tokens per GPU. Perfect balance gives 1.0. Higher means some GPUs are overloaded.
For a model with E=256 experts across G=8 GPUs, each expert has a different load (number of tokens routed to it). The greedy approach sorts by load [high, ..., low] and assigns in order: GPU0 gets the heaviest, GPU1 the next, etc. But this means GPU0 always gets the heaviest experts in each round.
The zigzag reverses every other round: round 1 goes GPU0→GPU7, round 2 goes GPU7→GPU0. If we have experts sorted by load [100, 90, 80, 70, 60, 50, 40, 30] and 4 GPUs:
| Method | GPU 0 | GPU 1 | GPU 2 | GPU 3 | Imbalance |
|---|---|---|---|---|---|
| Greedy | 100+30=130 | 90+40=130 | 80+50=130 | 70+60=130 | 1.00 |
| Zigzag | 100+70=170 | 90+60=150 | 80+50=130 | 40+30=70 | — |
Wait — in this toy example, greedy is perfect! The advantage of zigzag appears with replicas. When hot experts are replicated, the assignment becomes more complex: replicas of the same expert must go to different GPUs, and the number of replicas varies. The greedy loop must search for valid placements via nested for-loops (O(E × G)), while the tensor zigzag does it in O(E) with vectorized indexing. Same imbalance, vastly less computation.
The simulator models a distributed GPU inference engine for MoE (168 LOC). It uses ShareGPT and GSM8K traces. The combined metric is a weighted average of imbalance factor and rebalancing runtime — a single scalar for the evolutionary loop to optimize.
Imagine running SQL queries over a large table by invoking an LLM for each row. This is relational analytics with LLMs (Liu et al., MLSys '25). Each row triggers a separate LLM inference call. For a table with 10,000 rows, that is 10,000 forward passes. The cost is prohibitive.
Modern LLM inference engines cache the key-value pairs (KV cache) for token prefixes. If consecutive requests share a long common prefix, the engine can reuse the cached KV pairs and skip recomputation. The metric to maximize is the Prefix Hit Rate (PHR): the fraction of tokens shared between consecutive rows' input sequences.
The trick: if you reorder the rows and columns of the table before feeding them to the LLM, you can dramatically increase prefix sharing. Rows with similar values end up adjacent, and their serialized representations share long prefixes.
The combinatorial challenge: for a table with n rows and m fields, there are n! × (m!)n possible orderings. Brute-force is impossible. For a modest table of 100 rows and 5 fields, that is 100! × (120)100 — an astronomically large search space.
The published SOTA uses Greedy Recursive Grouping: recursively group rows by common field values, reorder fields using schema statistics. It achieves good PHR but is slow due to repeated counter operations and deep recursion:
def QuickGreedy(df): # 1. Compute value counts for all values counts = Counter(df.stack()) val_len = {v: len(str(v))+2 for v in counts} # 2. Pick max-value column v_star = argmax(...) # 3. Split rows with/without v_star G = rows with v_star R = rows without v_star # 4. Reorder columns in G, recurse on remainder # 5. Recurse on G remainder and R return concat(G, R)
After 100 iterations (~1 hour, <$7), OpenEvolve discovers a fundamentally different approach:
def EvolvedPolicy(df): # 1. Cache value counts (compute ONCE) counts = Counter(df.stack()) val_len = {v: len(str(v))+2 for v in counts} # 2. Larger base cutoff → fewer recursions BASE = 400 v_star = argmax(...) if v_star is None: return fixed_reorder(df) # 3. Recurse only if |df| > BASE if len(df) > BASE: top, bottom = split(df) return concat(EvolvedPolicy(top), EvolvedPolicy(bottom)) # 4. Lightweight per-row heuristic: # sort by squared value length × continuity prev = None for row in df: if prev is None: ... matches = [c for c in df.columns if row[c]==prev[c]] matches.sort(key=lambda c: val_len[row[c]], reverse=True) order = matches + [c for c in df.columns if c not in matches] return df
The key innovations:
The search progresses in phases:
In database systems, conflicts between transactions on shared data cause performance bottlenecks. If transaction A writes to key X and transaction B also writes to key X, they conflict. The scheduler must order transactions to minimize the total number of conflicts, thereby improving throughput.
Given a set of transactions (each a sequence of read/write operations on keys), find an ordering that minimizes makespan — the total time to execute all transactions. This is equivalent to minimizing the cost of conflicts, because conflicts cause serialization delays.
The paper considers two settings:
| Setting | Constraint | Difficulty |
|---|---|---|
| Online | Order is fixed once committed; keys unknown a priori | Must decide in O(n) time, no lookahead |
| Offline | Full batch visible; can use any ordering strategy | NP-hard in general, but good heuristics exist |
SMF (Cheng et al., VLDB '24) is a greedy O(n × k) algorithm. At each step, it samples k random unscheduled transactions, picks the one that adds the least incremental cost (conflict-weighted), and appends it to the schedule.
In the online setting, OpenEvolve rediscovers SMF from scratch, starting from a random baseline. This is likely due to contamination (the model saw the SMF paper during training), but it is still notable: the evolutionary framework converges to the known SOTA without being told about it.
In the offline setting, where no prior published solution exists, OpenEvolve finds a novel algorithm that beats SMF by 20–34%. The evolved algorithm combines three techniques:
def get_best_schedule(self, num_seqs=10): n = len(self.txns) idxs = list(range(n)) # --- 0. Pre-compute transaction statistics --- write_cnt = {t: sum(1 for op in self.txns[t] if op[0] == "w") for t in idxs} # --- 1. Sorted sequence (by #writes, length) --- base_seq = sorted(idxs, key=lambda t: (write_cnt[t], len(self.txns[t]))) best_seq = base_seq best_cost = self.get_opt_seq_cost(best_seq) # --- 2. Greedy insertion --- improved = True while improved: improved = False for i in range(n): txn = best_seq.pop(i) # Try every position for this txn for pos in range(n): candidate = best_seq[:pos]+[txn]+best_seq[pos:] cost = self.get_opt_seq_cost(candidate) if cost < best_cost: ... # --- 3. Pair-swap hill climb --- for _ in range(max_iters): i, j = random.sample(range(n), 2) # Swap and accept if better # --- 4. Random restarts (keep best) --- for _ in range(num_seqs): seq = idxs[:] random.shuffle(seq) cost = self.get_opt_seq_cost(seq) if cost < best_cost: ... return best_cost, best_seq
The four-phase structure emerged from evolution:
SMF is built around the concept of incremental conflict cost. When you add a transaction to the schedule, you pay a cost proportional to how many already-scheduled transactions conflict with it. The makespan increase from adding transaction ti at position k is:
where conflict(ti, tj) = 1 if ti and tj share a key with at least one write. SMF greedily picks the transaction with the lowest incremental cost at each step. The evolved offline algorithm goes further: it also considers the future cost of each placement by running hill climbing and random restarts after the initial greedy pass.
The evaluator uses traces from 5 OLTP benchmarks (Epinions, SmallBank, TPC-C, TAOBench, YCSB) with 500 transactions each. It measures total makespan and provides statistical bounds on the schedule quality.
ADRS is not magic. The paper is refreshingly honest about failure modes. Based on analysis of 420 LLM-judged traces, they identify three categories:
| Failure Type | Description |
|---|---|
| Syntax & Interface Errors | Generated code fails to compile or integrate with the evaluator |
| Budget Exhaustion | Candidate exceeds resource limits (context window, API quotas, timeouts) |
| Failure Type | Description |
|---|---|
| Premature Convergence | Search settles on a local optimum too early |
| Stuck-in-the-Loop | Repeats similar solutions without progress |
| Mutation Drift | Conflicting changes from different models cancel each other |
| Failure Type | Description |
|---|---|
| Misaligned Objectives | Solution ignores key constraints (e.g., latency SLOs) |
| Sub-Optimal Optimizations | Shallow API-level tweaks instead of algorithmic innovation |
| Overfitting | Works on training traces, fails on unseen workloads |
| Reward Hacking | Exploits evaluator loopholes to get high scores without solving the problem |
The failure distribution comes from an interesting methodology. The authors took 420 individual evolution traces (candidate solutions across all case studies) and had an LLM classify each one into failure categories. This meta-evaluation reveals that the majority of failures are mundane runtime errors, not deep algorithmic issues. This suggests a practical intervention: better sandboxing, automatic syntax checking, and interface validation before sending code to the expensive simulator would dramatically improve evolution efficiency.
python -c "import candidate") could nearly double the effective search budget for the same cost.Here is the summary of all 11 tasks investigated in the paper:
| Task | Result vs. SOTA | Time / Cost |
|---|---|---|
| Telemetry Repair | +9% repair, +30% calibration | 8h, ≤$10 |
| Adaptive Weight Compression | Similar bits/elem, 14.2% worse PPL | 12h, ≤$20 |
| Cloudcast (multi-cloud transfer) | Matches SOTA cost | 1h, ≤$5 |
| Expert Parallelism Load Balancer | 2× → 5× faster runtime | 5h, ≤$10 |
| Global Model Placement | 18.5% cheaper | 40m, ≤$5 |
| LLM-SQL (prefix cache reuse) | 3.9× faster reordering | 1h, ≤$7 |
| Transaction Scheduling (online) | Rediscovers SOTA (SMF) | <2h, ≤$20 |
| Transaction Scheduling (offline) | 20–34% better makespan | <2h, ≤$20 |
| Can't Be Late (single-region) | 7% avg cost savings, up to 16.7% | 5h, ≤$20 |
| Can't Be Late (multi-region) | 26% lower cost | 1h, ≤$5 |
| Sparse Attention Design | 7% avg improvement | 4h, ≤$15 |
The authors argue that the human researcher's role shifts from algorithm designer to problem architect:
| Framework | Key Idea | Relation to ADRS |
|---|---|---|
| AlphaEvolve | MAP-Elites + island evolution + LLM mutations | The inspiration; proprietary (Google DeepMind) |
| OpenEvolve | Open-source reimplementation of AlphaEvolve core | Primary ADRS engine in this paper |
| GEPA | Genetic-Pareto with reflective LLM feedback | Alternative ADRS framework |
| LLM4AD | Unified platform with NSGA-II, MOEA/D, etc. | Broader search strategy support |
| Cursor / Windsurf | Coding assistants with full codebase context | Human-in-the-loop ADRS analogs |
This paper is closely related to AlphaEvolve (Novikov et al., Google DeepMind, 2025), but there are key differences:
| Dimension | AlphaEvolve | ADRS (this paper) |
|---|---|---|
| Focus | General scientific discovery (math, chip design, scheduling) | Systems performance research specifically |
| Framework | Proprietary (Google internal) | Open-source (OpenEvolve) |
| Verification | Problem-specific evaluators | Emphasizes simulator-based eval as key advantage |
| Human role | Minimal — fully automated search | Explicit outer loop with human guidance |
| Contribution | Individual breakthroughs (Strassen improvement) | Systematic methodology + best practices + taxonomy of failures |
| Cost | Not disclosed | $5–$20 per run, explicitly tracked |
The ADRS paper's unique contribution is the meta-analysis: rather than showing one spectacular result, it systematically evaluates ADRS across 11 diverse tasks, categorizes failure modes, and distills actionable best practices. It is a roadmap for the field, not just a proof of concept.