Yoonho Lee, Roshen Nair, Qizheng Zhang, Omar Khattab, Kangwook Lee, Chelsea Finn — Stanford, MIT, KRAFTON, 2026

Meta-Harness: End-to-End Optimization of Model Harnesses

The code wrapping an LLM matters as much as the model itself (6x performance gap). Meta-Harness automates harness search by giving a coding agent filesystem access to all prior candidates' source code, execution traces, and scores.

Prerequisites: LLM prompting basics + What an agent is + Basic optimization intuition
10 Chapters · 4 Simulations

Chapter 0: The Problem

You have the same LLM — same weights, same architecture, same everything. You change only the code that wraps it: what goes in the prompt, how context is managed, what examples are retrieved, how memory is updated. The performance gap? 6x on the same benchmark.

This is the harness engineering problem. The harness — the code around the model — often matters as much as the model itself. And right now, designing a good harness is entirely manual. An engineer inspects failures, tweaks prompts, adjusts retrieval logic, and repeats. It's slow, it doesn't scale, and it depends entirely on the engineer's intuition.

The 6x gap is real: On SWE-Bench Mobile, changing only the harness code around a fixed LLM produces a 6x difference in pass rate. The same model. The same benchmark. The only variable is the code that decides what the model sees.

Why is this hard to automate? Because a harness is a program, not a prompt template. It decides what to store, when to retrieve it, how to construct prompts, and how to update state after each model response. A single choice about what to put in context can affect behavior many reasoning steps later. The feedback signal is distributed across code, scores, and execution traces in a way that's extremely hard to compress into a simple gradient or summary.

Existing text optimizers — OPRO, TextGrad, AlphaEvolve, GEPA — try to iteratively improve prompts using feedback from prior attempts. But they compress that feedback too aggressively: some condition only on scalar scores, others use short LLM-generated summaries, others restrict feedback to templates. The total diagnostic context they work with ranges from 100 to 30,000 tokens per optimization step. For harness search, a single evaluation can produce 10,000,000 tokens of diagnostic information.

[Interactive figure: Performance Gap: Same Model, Different Harnesses. The same base model achieves wildly different scores depending on its harness.]

Why is harness engineering hard to automate with existing text optimizers?

Chapter 1: The Key Insight

Here's the central idea of Meta-Harness: give a coding agent full filesystem access to all prior experience, and let it decide what to inspect.

Instead of compressing feedback into a summary or a score, Meta-Harness stores everything: every candidate harness's source code, every execution trace (the actual prompts sent, the model's responses, the state updates), and every evaluation score. All of it goes into a filesystem. The proposer — a coding agent — navigates this filesystem freely using standard developer tools: grep, cat, ls. It reads what it needs, ignores what it doesn't.

Why this is different from everything before: Prior text optimizers are like a student studying from flashcard summaries. Meta-Harness is like a student with access to every past exam, every solution attempt, every teacher's comment, and the ability to search through them at will. The student decides what to review based on what went wrong.

The proposer isn't a raw LLM receiving a fixed prompt assembled by the outer loop. It's a coding agent — Claude Code with Opus-4.6 — that can invoke developer tools and modify code. This matters because the amount of experience quickly exceeds any model's context window. In the most demanding setting, the proposer reads a median of 82 files per iteration, referencing over 20 prior candidates per step. The total diagnostic information from a single evaluation reaches up to 10 million tokens.

Prior Optimizers: Compress feedback into summaries or scores (100-30K tokens). Fixed prompt template. The agent sees a curated, lossy view of history.

Meta-Harness: Store everything in a filesystem (10M tokens per eval). A coding agent freely navigates with grep/cat. The agent decides what to inspect.

Concrete numbers: In the TerminalBench-2 setting, the proposer reads a median of 82 files per iteration. Each evaluation produces up to 10,000,000 tokens of logs. The full filesystem is far larger than the proposer's context window, so the agent must search through it rather than ingest it whole. This selective inspection is the key capability that makes rich feedback usable.

What is the key difference between Meta-Harness's feedback mechanism and prior text optimizers?

Chapter 2: What is a Harness?

Before we can optimize harnesses, we need to define what one is. A harness is a stateful program that wraps a language model and determines what context the model sees at each step. It handles prompt construction, retrieval, memory updates, and orchestration.

A harness is NOT just a prompt. A prompt is a static string. A harness is a program that dynamically constructs prompts based on task state, retrieved context, stored memories, and prior model outputs. Think of it as the operating system for an LLM application.

The Formal Objective

Let M be a fixed language model and X a task distribution. For a harness H and task instance x from X, we execute a rollout trajectory:

τ ~ p_M(H, x)

The harness constructs prompts for M, the model responds, and the harness updates its state after each interaction. A task-specific reward function r(τ, x) scores the trajectory. The objective:

H* = argmax_H E_{x ~ X, τ ~ p_M(H, x)} [ r(τ, x) ]

In plain English: find the harness program that makes the model perform best, on average, across the task distribution. When multiple objectives matter (accuracy and context cost), the system evaluates candidates under Pareto dominance and reports the resulting frontier.
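
The Pareto comparison described above can be sketched in a few lines. This is an illustrative sketch, not the paper's code: the `Candidate` fields and the sample numbers are made-up assumptions. Only the idea (higher accuracy is better, lower context cost is better) comes from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str            # hypothetical harness label
    accuracy: float      # higher is better
    context_tokens: int  # lower is better


def dominates(a: Candidate, b: Candidate) -> bool:
    """a dominates b if it is no worse on both axes and strictly better on one."""
    no_worse = a.accuracy >= b.accuracy and a.context_tokens <= b.context_tokens
    strictly_better = a.accuracy > b.accuracy or a.context_tokens < b.context_tokens
    return no_worse and strictly_better


def pareto_frontier(cands: list[Candidate]) -> list[Candidate]:
    """Keep every candidate that no other candidate dominates."""
    return [c for c in cands if not any(dominates(o, c) for o in cands)]


# Made-up numbers for illustration only.
pool = [
    Candidate("A", 40.0, 50_000),
    Candidate("B", 48.0, 12_000),  # dominates A and D
    Candidate("C", 52.0, 30_000),  # more accurate than B, but costlier
    Candidate("D", 45.0, 12_000),
]
frontier = pareto_frontier(pool)
print([c.name for c in frontier])
```

Neither B nor C dominates the other, so both stay on the frontier. That is why the system reports a frontier rather than a single winner when accuracy and context cost are both tracked.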

Why "meta"? Meta-Harness is itself a harness — it's the code that determines what the proposer model sees during search. It's a harness for optimizing harnesses. The task-specific harnesses being optimized are single-file Python programs that modify prompting, retrieval, memory, and orchestration logic.

In practice, each harness is implemented as a single Python file. This keeps the search space navigable: the agent can read, understand, and modify a single file more reliably than a multi-file project. The file contains all the logic for interacting with the base model M on a given task.
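
To make "a harness is a program" concrete, here is a minimal sketch of what a single-file harness could look like. Everything in it is a hypothetical stand-in: `ToyHarness`, its word-overlap retrieval, and the `dummy_model` function are assumptions for illustration, not the interface used in the paper.

```python
from dataclasses import dataclass, field
from typing import Callable

# Stand-in for the fixed base model M: any function from prompt text to reply text.
Model = Callable[[str], str]


@dataclass
class ToyHarness:
    """Hypothetical single-file harness: memory, retrieval, and prompt construction."""
    memory: list[str] = field(default_factory=list)
    max_notes: int = 3  # how many stored notes to place in context

    def retrieve(self, task: str) -> list[str]:
        # Naive retrieval: rank stored notes by word overlap with the task.
        words = set(task.lower().split())
        ranked = sorted(self.memory, key=lambda n: -len(words & set(n.lower().split())))
        return ranked[: self.max_notes]

    def build_prompt(self, task: str) -> str:
        notes = "\n".join(f"- {n}" for n in self.retrieve(task))
        return f"Notes from earlier steps:\n{notes}\n\nTask: {task}\nAnswer:"

    def step(self, model: Model, task: str) -> str:
        reply = model(self.build_prompt(task))
        self.memory.append(f"{task} -> {reply}")  # state update after each response
        return reply


def dummy_model(prompt: str) -> str:
    """Dummy model so the sketch runs without any API: reports how many notes it saw."""
    n_notes = sum(line.startswith("- ") for line in prompt.splitlines())
    return f"seen {n_notes} notes"


harness = ToyHarness()
first = harness.step(dummy_model, "classify: chest pain")
second = harness.step(dummy_model, "classify: chest tightness")
print(first, "|", second)
```

The second call sees one note because the first call wrote to memory. That dependence of later prompts on earlier state is what separates a harness from a static prompt string.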

What does the formal objective H* = argmax E[r(τ, x)] mean in plain language?

Chapter 3: The Search Loop

The Meta-Harness search loop is deliberately simple. No evolutionary operators. No parent-selection rules. No fitness-proportionate sampling. Just a coding agent with a growing filesystem.

The Algorithm

Initialize: Start with a population H of baseline harnesses (e.g., zero-shot, few-shot, ACE, MCE). Evaluate each. Store source code, scores, and execution traces in the filesystem D.
For each iteration t = 1...N:
    Propose: The proposer agent reads the filesystem D. It browses prior harnesses' code, scores, and traces using grep/cat. It proposes k new harnesses.
    Evaluate: Each proposed harness that passes interface validation is evaluated on the task distribution. Code, reasoning traces, and scores are stored back in D.
Return: After N iterations, return the Pareto frontier of harnesses stored in D.

The simplicity is deliberate. By leaving diagnosis and edit decisions to the proposer rather than hard-coding search heuristics, Meta-Harness can improve automatically as coding agents become more capable. A typical run evaluates ~60 harnesses over 20 iterations, with 2 candidates per iteration.
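
The loop above is small enough to sketch end to end. The version below is a toy under stated assumptions: the directory layout (`harness.py`, `score.json`, `trace.log`), the `toy_propose` stand-in for the coding agent, and the `toy_evaluate` scoring rule are all hypothetical. Interface validation is omitted, and because the toy has a single objective it returns the best score rather than a Pareto frontier.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable


def store(root: Path, name: str, code: str, score: float, trace: str) -> None:
    """Write one candidate's code, score, and trace into its own directory."""
    d = root / name
    d.mkdir(parents=True)
    (d / "harness.py").write_text(code)
    (d / "score.json").write_text(json.dumps({"score": score}))
    (d / "trace.log").write_text(trace)


def read_score(d: Path) -> float:
    return json.loads((d / "score.json").read_text())["score"]


def search(root: Path, baselines: list[str], propose: Callable, evaluate: Callable,
           n_iters: int, k: int) -> float:
    # Initialize: evaluate the baselines and store everything.
    for i, code in enumerate(baselines):
        score, trace = evaluate(code)
        store(root, f"base_{i}", code, score, trace)
    # Iterate: the proposer inspects the directory tree, then proposes k candidates.
    for t in range(n_iters):
        for j, code in enumerate(propose(root, k)):
            score, trace = evaluate(code)
            store(root, f"iter{t}_cand{j}", code, score, trace)
    return max(read_score(d) for d in root.iterdir())


def toy_evaluate(code: str) -> tuple[float, str]:
    # Toy score: the number of "# tweak" lines in the harness source.
    score = float(code.count("# tweak"))
    return score, f"ran harness with {int(score)} tweaks"


def toy_propose(root: Path, k: int) -> list[str]:
    # Stand-in for the coding agent: copy the best stored harness and append a tweak.
    best = max(root.iterdir(), key=read_score)
    code = (best / "harness.py").read_text()
    return [code + "\n# tweak" for _ in range(k)]


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    best_score = search(root, ["def run():\n    pass"], toy_propose, toy_evaluate,
                        n_iters=3, k=2)
    n_stored = len(list(root.iterdir()))
print(n_stored, best_score)
```

Note that `search` itself contains no selection or mutation logic. Every decision about what to read and what to change lives inside `propose`, which is the design choice the text describes.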

What Gets Stored Per Candidate

Every evaluated harness gets its own directory in the filesystem containing its source code, its evaluation scores, and its execution traces.

[Interactive simulation: Interactive Search Loop. Step through the Meta-Harness search loop and watch the filesystem grow as candidates are evaluated and stored. The agent reads prior experience before each proposal.]

No parent-selection rule: Unlike evolutionary methods that select parents based on fitness, Meta-Harness imposes no restriction on what the proposer inspects. The agent is free to read any prior harness and its execution trace. In practice, it often starts from a strong prior harness, but this is an emergent strategy, not a hard-coded rule.
Why does Meta-Harness avoid hard-coded search heuristics like parent selection or mutation operators?

Chapter 4: Why Filesystem, Not Summaries?

This is the engineering decision that makes Meta-Harness work. Prior text optimizers compress feedback for a practical reason: scalability. But that compression loses the information needed to trace downstream failures back to earlier harness decisions.

The Compression Problem

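Consider what happens when a harness fails on a task. The failure might be caused by any of the harness's decisions: what it stored, what it retrieved, how it constructed the prompt, or how it updated state after a model response.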

A scalar score tells you the harness failed. An LLM-generated summary might say "the model struggled with multi-step reasoning." But the raw execution trace shows you exactly which prompt was sent, what the model said, what state was updated, and where the chain of reasoning went off the rails. That's the information you need to fix it.

The ablation proves it: When the proposer receives only scores (no traces), it achieves 34.6 median accuracy. With scores plus LLM-generated summaries, it reaches 34.9. With full filesystem access to raw traces, it jumps to 50.0 — and even its median candidate outperforms the best candidate from either ablation. Summaries don't recover the missing signal.

Context Budgets Across Methods

| Method | History Access | Log Content | MTok/iter |
| --- | --- | --- | --- |
| OPRO | Window | past (solution, score) pairs | 0.002 |
| TextGrad | Last | textual feedback on current artifact | 0.015 |
| AlphaEvolve | Window | program database + eval scores | 0.022 |
| GEPA | Summary | reflective feedback from rollout traces | 0.008 |
| Feedback Descent | Summary | comparison + textual feedback | 0.012 |
| TTT-Discover | Window | prev solution fragment | 0.026 |
| Meta-Harness | Full | all logs and scores | 10.0 |

Three orders of magnitude. Meta-Harness operates with roughly 400x to 5,000x more diagnostic context per optimization step than the prior methods in the table. The key is that the agent doesn't ingest all 10M tokens at once. It selectively inspects the parts it needs via grep and cat. The filesystem is an external memory that the agent queries on demand.
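
The size of the gap can be checked directly from the table. The snippet below is plain arithmetic on the MTok/iter column and assumes nothing beyond those values.

```python
# MTok/iter values copied from the table above.
mtok_per_iter = {
    "OPRO": 0.002,
    "TextGrad": 0.015,
    "AlphaEvolve": 0.022,
    "GEPA": 0.008,
    "Feedback Descent": 0.012,
    "TTT-Discover": 0.026,
}
meta_harness = 10.0

# How many times more context Meta-Harness has than each prior method.
ratios = {name: meta_harness / v for name, v in mtok_per_iter.items()}
for name, r in sorted(ratios.items(), key=lambda kv: kv[1]):
    print(f"{name:17s} {r:6.0f}x")
```

The smallest ratio is against TTT-Discover (about 385x) and the largest against OPRO (about 5000x), so "three orders of magnitude" is an order-of-magnitude summary of that range.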
[Figure: Context Budget Comparison (log scale). Tokens of diagnostic information available per optimization step, on a logarithmic scale. Meta-Harness operates with ~1000x more context than prior methods.]

Why Not a Vector Database?

A natural question: why use a plain filesystem instead of a vector DB or structured knowledge graph? Because the agent's queries are unpredictable. Sometimes it needs to compare two specific harnesses' retrieval logic. Sometimes it needs to grep for a particular error message across all traces. Sometimes it needs to read the reasoning log of the highest-scoring candidate. A filesystem with standard UNIX tools supports all of these access patterns without pre-committing to a retrieval strategy.
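
As one example of these access patterns, here is a sketch of the "grep for an error message across all traces" query written with the Python standard library. The layout (one directory per candidate, each with a `trace.log`) and the sample log lines are assumptions made for the example; the real proposer simply runs grep.

```python
import tempfile
from pathlib import Path


def grep_traces(root: Path, needle: str) -> list[tuple[str, int, str]]:
    """Return (candidate, line number, line) for every trace line containing needle."""
    hits = []
    for trace in sorted(root.glob("*/trace.log")):
        for lineno, line in enumerate(trace.read_text().splitlines(), start=1):
            if needle in line:
                hits.append((trace.parent.name, lineno, line))
    return hits


# Build a tiny made-up filesystem with two candidates.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    sample_logs = {
        "cand_a": "step 1 ok\nstep 2 ok",
        "cand_b": "step 1 ok\nERROR: retrieval returned 0 examples",
    }
    for name, log in sample_logs.items():
        (root / name).mkdir()
        (root / name / "trace.log").write_text(log)
    hits = grep_traces(root, "ERROR")
print(hits)
```

No index or embedding has to be built in advance. The same tree also answers a "diff two harnesses" or "read the best candidate's log" query with equally little setup.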

In the ablation study, what happened when the proposer received LLM-generated summaries instead of raw execution traces?

Chapter 5: Code-Space Search

Meta-Harness searches over full Python programs, not prompt templates or configuration files. This is a fundamental difference from prior text optimization methods.

Why Code, Not Templates?

A prompt template has a fixed structure with slots to fill. A code-space harness can change anything: the retrieval algorithm, the memory data structure, the prompt construction logic, the error handling, the multi-step orchestration strategy. The agent can rewrite the retrieval function from dense search to BM25, add a caching layer, change how examples are formatted, or restructure the entire control flow.

Natural regularization bias. Representing harnesses as programs provides a surprising benefit: coding models tend to propose coherent algorithms rather than brittle, hard-coded solutions. A model trained on millions of codebases has strong priors about what "good code" looks like. This biases the search toward reusable context-management procedures rather than task-specific hacks like if-chains or hard-coded class mappings.

What the Agent Can Modify

Retrieval Logic: Switch between dense/sparse retrieval, change k, modify the query construction, add re-ranking.
Memory Strategy: Change what gets stored, when it's retrieved, how it's formatted, when it expires.
Prompt Construction: Rewrite system prompts, change example selection, modify formatting, add constraints.
Orchestration: Change multi-step workflows, tool use patterns, retry logic, error handling, state management.

Credit Assignment Through Traces

Code-space search makes credit assignment possible. When the agent reads an execution trace, it can see: "This harness used BM25 retrieval and got 3 irrelevant examples for problem #47, which led the model to apply the wrong proof technique." That's actionable. The agent can then modify the retrieval function specifically, rather than making a vague "improve retrieval" edit.

The action space matches training. Coding agents are trained on read-write-execute workflows over codebases. Meta-Harness's loop — read prior code, reason about failures, propose an edit, test it — is exactly the workflow these agents are optimized for. This alignment between the search space and the agent's training distribution is a key advantage.

In practice, the proposer is guided by a minimal domain-specific skill that describes where to write new harnesses, how to inspect previous harnesses and their execution traces, and what files it can and cannot modify. This skill acts as a soft constraint on the search space without hard-coding any search strategy.

Why does code-space search provide a natural regularization bias?

Chapter 6: Results — Text Classification

The first test domain is online text classification: an LLM receives labeled examples one at a time, updates its memory, and is evaluated on a held-out test set. The model is GPT-OSS-120B. Three datasets: LawBench (215 criminal charge classes), Symptom2Disease (22 classes), and USPTO-50k (180 chemical reaction classes).
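
The online protocol can be sketched as follows. This is a toy under stated assumptions: `overlap_predict` is a hypothetical stand-in for "LLM plus harness", the memory update is a plain append, and the four examples are invented. A real harness would decide what to store and how to use it.

```python
from typing import Callable

Example = tuple[str, str]  # (text, label)


def run_online(predict: Callable[[list[Example], str], str],
               stream: list[Example], test: list[Example]) -> float:
    """Feed labeled examples one at a time into memory, then score on held-out data."""
    memory: list[Example] = []
    for text, label in stream:
        memory.append((text, label))  # memory update after each labeled example
    correct = sum(predict(memory, text) == label for text, label in test)
    return correct / len(test)


def overlap_predict(memory: list[Example], text: str) -> str:
    # Stand-in predictor: copy the label of the stored example with most shared words.
    words = set(text.split())
    best = max(memory, key=lambda ex: len(words & set(ex[0].split())))
    return best[1]


# Invented examples for illustration only.
stream = [("fever and cough", "flu"), ("itchy red rash", "allergy")]
test = [("dry cough fever", "flu"), ("red rash on arm", "allergy")]
accuracy = run_online(overlap_predict, stream, test)
print(accuracy)
```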

vs. Hand-Designed Harnesses

The baselines are the current state-of-the-art hand-designed harnesses for this problem, including ACE.

Meta-Harness reaches 48.6% accuracy — a 7.7-point improvement over ACE — while using only 11.4K context tokens. That's 4x fewer tokens than ACE. Higher accuracy, lower cost.

Better AND cheaper: Meta-Harness doesn't just match ACE's accuracy with fewer tokens. It surpasses ACE by 7.7 points while using 4x fewer context tokens. This is a strict Pareto improvement — better on both dimensions simultaneously.

vs. Text Optimizers

For a fair comparison against other search methods (same proposer model, same evaluation budget):

| Method | Median Acc | Best Acc |
| --- | --- | --- |
| GEPA | 32.6 | 40.2 |
| Best-of-N | 34.0 | 44.2 |
| OpenEvolve | 39.1 | 43.3 |
| TTT-Discover | 34.1 | 45.6 |
| Meta-Harness | 50.0 | 56.7 |

Meta-Harness matches the best prior text optimizers' final accuracy after just 4 evaluations, while they need 60. Its final accuracy surpasses theirs by more than 10 points.

Out-of-Distribution Generalization

The discovered harness generalizes to 9 entirely new datasets unseen during search, achieving 73.1% average accuracy vs 70.2% for ACE. This confirms Meta-Harness captures generally effective strategies rather than overfitting to the search datasets.

10x faster convergence. Meta-Harness matches the best prior text optimizers with 10x fewer evaluations, then continues improving to a substantially higher final accuracy. The authors attribute this to the design choice of preserving full experience history rather than compressing it.
How does Meta-Harness compare to ACE on the accuracy-vs-context tradeoff?

Chapter 7: Results — Math & Coding

Retrieval-Augmented Math Reasoning

The setup: augment an LLM with the ability to retrieve examples from a large corpus of 500,000+ solved problems before attempting IMO-level math problems. The question isn't whether retrieval helps — it's whether Meta-Harness can discover the right retrieval policy.

The search optimizes over retrieval harnesses for 40 iterations using GPT-OSS-20B on a 250-problem search set. A single winning harness is then evaluated on 200 previously unseen IMO-level problems, across five models (including four never seen during search).

+4.7 points on IMO-level problems, averaged across 5 models. The discovered retrieval harness outperforms no-retrieval on all five models, including four models not seen during search. It operates on top of plain BM25 lexical retrieval, with no dense encoder needed. The harness improves how retrieval is used (query construction, example selection, formatting), not the retrieval mechanism itself.
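
For readers who haven't met BM25, here is a minimal sketch of the standard Okapi BM25 scoring formula. It is not the discovered harness: the whitespace tokenizer, the defaults `k1=1.5` and `b=0.75`, and the three-document corpus are assumptions chosen for illustration.

```python
import math
from collections import Counter


def bm25_scores(query: str, docs: list[str], k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Okapi BM25 score of each document for the query (whitespace tokenization)."""
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs
    # Document frequency: in how many documents each term appears.
    df = Counter(term for toks in tokenized for term in set(toks))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            # Term frequency saturates via k1; b controls length normalization.
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


# Made-up corpus of problem statements.
corpus = [
    "prove the inequality holds for all positive reals",
    "count lattice paths in a grid",
    "find all primes dividing the sum",
]
scores = bm25_scores("prove inequality for positive reals", corpus)
best_index = max(range(len(corpus)), key=lambda i: scores[i])
print(best_index, [round(s, 3) for s in scores])
```

The scoring function is fixed and purely lexical. What the harness search changes is everything around it, such as which query string is passed in and how the returned examples are selected and formatted.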

Agentic Coding: TerminalBench-2

TerminalBench-2 evaluates LLM agents on 89 challenging tasks requiring long-horizon, fully autonomous execution. This is an actively contested benchmark with multiple teams directly optimizing for it.

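| Harness (Opus 4.6) | Pass % |
| --- | --- |
| Claude Code | 58.0 |
| Terminus 2 | 62.9 |
| Terminus-KIRA | 74.7 |
| Capy | 75.3 |
| Meta-Harness | 76.4 |

| Harness (Haiku 4.5) | Pass % |
| --- | --- |
| Claude Code | 27.5 |
| Mini-SWE-Agent | 29.8 |
| Terminus-KIRA | 33.7 |
| Goose | 35.5 |
| Meta-Harness | 37.6 |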

On Haiku 4.5, Meta-Harness ranks #1 among all agents on the leaderboard. On Opus 4.6, it ranks #2 (76.4% vs ForgeCode's reported 81.8%, though ForgeCode's result could not be reproduced from their published code).

[Interactive figure: Results: Meta-Harness vs Baselines. Performance comparison across three evaluation domains.]

What makes the TerminalBench-2 result particularly impressive?

Chapter 8: What the Agent Actually Does

The search trajectories reveal how the proposer agent behaves in practice. This is where the filesystem access pays off — the agent develops sophisticated strategies for navigating the search space.

A Real Search Trajectory (TerminalBench-2)

Here's what the proposer actually does across iterations, reconstructed from the paper's appendix:

Early Iterations: The agent reads broadly across multiple prior harnesses. It combines plausible structural fixes (e.g., better error handling) with prompt-template edits. Both candidates regress.
Hypothesis Formation: The agent reads the execution traces of the regressed candidates. It explicitly hypothesizes that the regressions were confounded: the structural change and prompt rewrite were entangled, making it impossible to tell which caused the regression.
Isolation Strategy: The agent isolates the structural changes from the prompt rewrite. It tests each independently, trying to identify the causal variable.
Pivot to Safety: After repeated regressions, the agent pivots toward a safer additive modification strategy, making small, targeted changes to a known-good harness rather than attempting large rewrites.
Best Candidate: The safe additive modification becomes the best candidate in the run. The agent's causal reasoning, enabled by filesystem access, led it to the right strategy.

This is not random search. The agent forms hypotheses ("the prompt rewrite confounded the structural change"), designs experiments to test them (isolate the variables), and updates its strategy based on results (pivot to safer edits). This causal reasoning is only possible because the agent can read the raw execution traces. A summary saying "candidate regressed" wouldn't be enough.
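
The isolation step is ordinary experimental design, and a toy version fits in a few lines. The scoring function and its numbers below are invented for illustration and are not taken from the paper's trajectory. The sketch only shows why testing each edit alone separates two confounded changes.

```python
from itertools import product
from typing import Callable


def isolate(evaluate: Callable[[bool, bool], float]) -> dict[str, float]:
    """Score all four combinations of two edits and report each change vs. the base."""
    scores = {(s, p): evaluate(s, p) for s, p in product([False, True], repeat=2)}
    base = scores[(False, False)]
    return {
        "structural_only": scores[(True, False)] - base,
        "prompt_only": scores[(False, True)] - base,
        "both": scores[(True, True)] - base,
    }


def toy_evaluate(structural: bool, prompt_rewrite: bool) -> float:
    # Made-up scoring: the structural fix helps a little, the prompt rewrite hurts more.
    return 50.0 + (2.0 if structural else 0.0) - (5.0 if prompt_rewrite else 0.0)


effects = isolate(toy_evaluate)
print(effects)
```

In this toy, applying both edits together regresses by 3 points, which on its own would suggest abandoning the structural fix. Testing each edit separately shows the structural fix helps and the prompt rewrite is the culprit.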

What 82 Files Looks Like

In a typical iteration, the proposer reads prior harnesses' source code, checks their scores, and inspects their execution traces.

This adds up to a median of 82 files per iteration, referencing over 20 prior candidates per step. The agent is doing what a skilled engineer would do — reading code, checking logs, forming hypotheses, testing edits — but at a scale and speed no human could match.

Why this only became practical recently: The authors note that this workflow "only became practical recently, following major improvements in coding-agent capabilities around early 2026." The proposer must reliably read code, reason about multi-step failures, and generate valid Python programs. Earlier coding agents weren't consistent enough for this loop to converge.
What did the proposer agent do when it observed that two candidates both regressed?

Chapter 9: Connections

Meta-Harness sits at the intersection of several important research threads. Let's map where it fits.

Relation to AlphaEvolve / OpenEvolve

AlphaEvolve (Google DeepMind) searches over executable code for mathematical functions and algorithms. Like Meta-Harness, it uses LLMs as mutation operators over code. But AlphaEvolve operates on designated functions within fixed scaffolds and uses a structured program database with eval scores (~22K tokens/iter). Meta-Harness searches over entire harness implementations with unrestricted filesystem access (~10M tokens/iter). On text classification, Meta-Harness outperforms OpenEvolve by 13+ points in best accuracy (56.7 vs. 43.3) and by about 11 points in median accuracy (50.0 vs. 39.1).

Relation to GEPA

GEPA uses "reflective feedback from rollout traces" — but these are LLM-generated summaries of traces, not the raw traces themselves. Meta-Harness's ablation shows this compression loses critical diagnostic information.

Relation to OPRO / TextGrad

OPRO conditions only on past (solution, score) pairs. TextGrad uses textual feedback on the current artifact only. Both operate in a much narrower feedback regime than Meta-Harness.

Relation to AI Scientist

The AI Scientist uses coding agents to propose and evaluate scientific hypotheses. Meta-Harness applies a similar agent-driven search loop, but specialized to harness optimization rather than scientific discovery.

Relation to Coding Agents (Claude Code, Cursor)

Meta-Harness's proposer is a coding agent (Claude Code with Opus-4.6). The key contribution isn't the agent itself but the outer loop: what information the agent sees, how experience is accumulated, and how the search is structured.

The Bitter Lesson Connection

The authors explicitly cite Sutton's Bitter Lesson: "once a search space becomes accessible, stronger general-purpose agents can outperform hand-engineered solutions." Meta-Harness is a concrete instance of this pattern applied to harness engineering.

Cheat Sheet

| Aspect | Meta-Harness |
| --- | --- |
| What it optimizes | Harness code (Python programs wrapping an LLM) |
| Proposer | Claude Code with Opus-4.6 |
| Feedback mechanism | Full filesystem of code + traces + scores |
| Context per eval | Up to 10M tokens |
| Files read/iteration | Median 82 |
| Search budget | ~60 harnesses over 20 iterations |
| Text classification | +7.7 over ACE, 4x fewer tokens |
| Math reasoning | +4.7 on 200 IMO-level problems, 5 models |
| Agentic coding | #1 Haiku 4.5 on TerminalBench-2 (37.6%) |
| Key ablation finding | Full traces >> summaries ≈ scores only (50.0 vs. 34.9 vs. 34.6 median accuracy) |
| Convergence speed | 10x fewer evals than best text optimizer |

The broader lesson: The harness around a model is a first-class optimization target. When you give a capable coding agent unrestricted access to its own prior experience, it can discover harness strategies that outperform expert-designed ones, and the strategies generalize across tasks and models.
According to the authors, what is the main advantage of Meta-Harness over prior methods?