The code wrapping an LLM can matter as much as the model itself: changing only the harness can open a 6x performance gap on the same benchmark. Meta-Harness automates harness search by giving a coding agent filesystem access to all prior candidates' source code, execution traces, and scores.
You have the same LLM — same weights, same architecture, same everything. You change only the code that wraps it: what goes in the prompt, how context is managed, what examples are retrieved, how memory is updated. The performance gap? 6x on the same benchmark.
This is the harness engineering problem. The harness — the code around the model — often matters as much as the model itself. And right now, designing a good harness is entirely manual. An engineer inspects failures, tweaks prompts, adjusts retrieval logic, and repeats. It's slow, it doesn't scale, and it depends entirely on the engineer's intuition.
Why is this hard to automate? Because a harness is a program, not a prompt template. It decides what to store, when to retrieve it, how to construct prompts, and how to update state after each model response. A single choice about what to put in context can affect behavior many reasoning steps later. The feedback signal is distributed across code, scores, and execution traces in a way that's extremely hard to compress into a simple gradient or summary.
Existing text optimizers (OPRO, TextGrad, AlphaEvolve, GEPA) try to iteratively improve prompts using feedback from prior attempts. But they compress that feedback too aggressively: some condition only on scalar scores, others use short LLM-generated summaries, and others restrict feedback to templates. The total diagnostic context they work with ranges from 100 to 30,000 tokens per optimization step. For harness search, a single evaluation can produce up to 10,000,000 tokens of diagnostic information.
The same base model achieves wildly different scores depending on its harness.
Here's the central idea of Meta-Harness: give a coding agent full filesystem access to all prior experience, and let it decide what to inspect.
Instead of compressing feedback into a summary or a score, Meta-Harness stores everything: every candidate harness's source code, every execution trace (the actual prompts sent, the model's responses, the state updates), and every evaluation score. All of it goes into a filesystem. The proposer — a coding agent — navigates this filesystem freely using standard developer tools: grep, cat, ls. It reads what it needs, ignores what it doesn't.
The proposer isn't a raw LLM receiving a fixed prompt assembled by the outer loop. It's a coding agent — Claude Code with Opus-4.6 — that can invoke developer tools and modify code. This matters because the amount of experience quickly exceeds any model's context window. In the most demanding setting, the proposer reads a median of 82 files per iteration, referencing over 20 prior candidates per step. The total diagnostic information from a single evaluation reaches up to 10 million tokens.
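Before we can optimize harnesses, we need to define what one is. A harness is a stateful program that wraps a language model and determines what context the model sees at each step. It handles what to store, when to retrieve it, how to construct prompts, and how to update state after each model response.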
Let M be a fixed language model and X a task distribution. For a harness H and task instance x from X, we execute a rollout trajectory:
The harness constructs prompts for M, the model responds, and the harness updates its state after each interaction. A task-specific reward function r(τ, x) scores the trajectory. The objective:
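The rollout and the objective described above can be written out as follows. This is a reconstruction from the surrounding definitions rather than a quotation: the symbol $p_M(H, x)$, standing for the distribution over trajectories induced by running harness $H$ with model $M$ on instance $x$, is an assumed notation.

$$
\tau \sim p_M(H, x)
$$

$$
H^{*} = \arg\max_{H} \; \mathbb{E}_{x \sim X,\; \tau \sim p_M(H, x)} \big[\, r(\tau, x) \,\big]
$$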
In plain English: find the harness program that makes the model perform best, on average, across the task distribution. When multiple objectives matter (accuracy and context cost), the system evaluates candidates under Pareto dominance and reports the resulting frontier.
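To make the Pareto-dominance idea concrete, here is a minimal sketch for two objectives, accuracy (higher is better) and context tokens (lower is better). The `Candidate` type, the function names, and all the scores are hypothetical, invented for illustration; they are not taken from the Meta-Harness code.

```python
from typing import NamedTuple


class Candidate(NamedTuple):
    name: str
    accuracy: float        # higher is better
    context_tokens: float  # lower is better


def dominates(a: Candidate, b: Candidate) -> bool:
    """True if a is no worse than b on both objectives and strictly better on one."""
    no_worse = a.accuracy >= b.accuracy and a.context_tokens <= b.context_tokens
    strictly_better = a.accuracy > b.accuracy or a.context_tokens < b.context_tokens
    return no_worse and strictly_better


def pareto_frontier(candidates: list[Candidate]) -> list[Candidate]:
    """Keep every candidate that no other candidate dominates."""
    return [
        c for c in candidates
        if not any(dominates(other, c) for other in candidates)
    ]


# Hypothetical scores, invented for illustration only.
pool = [
    Candidate("h0", accuracy=40.0, context_tokens=50_000),
    Candidate("h1", accuracy=45.0, context_tokens=12_000),
    Candidate("h2", accuracy=44.0, context_tokens=30_000),
    Candidate("h3", accuracy=47.0, context_tokens=20_000),
]
frontier = pareto_frontier(pool)
print([c.name for c in frontier])
```

Here `h1` and `h3` survive: neither beats the other on both axes, so the choice between them is a cost/accuracy trade-off rather than a ranking.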
In practice, each harness is implemented as a single Python file. This keeps the search space navigable: the agent can read, understand, and modify a single file more reliably than a multi-file project. The file contains all the logic for interacting with the base model M on a given task.
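As a rough picture of what such a single-file harness might look like, the sketch below wires together the four responsibilities named earlier: storing, retrieving, prompt construction, and state updates. Everything here is an assumption made for illustration: the `Harness` class, its method names, the word-overlap retrieval, and the toy `echo_last_label` stand-in for the model M. It is not a harness from the paper.

```python
from dataclasses import dataclass, field
from typing import Callable

# Stand-in for the fixed base model M: any function from prompt text to response text.
Model = Callable[[str], str]


@dataclass
class Harness:
    """Hypothetical single-file harness: owns state and decides what M sees."""
    model: Model
    memory: list[tuple[str, str]] = field(default_factory=list)  # (input, label) pairs
    max_examples: int = 3

    def retrieve(self, query: str) -> list[tuple[str, str]]:
        # Naive retrieval: rank stored examples by word overlap with the query.
        words = set(query.lower().split())
        ranked = sorted(
            self.memory,
            key=lambda ex: len(words & set(ex[0].lower().split())),
            reverse=True,
        )
        return ranked[: self.max_examples]

    def build_prompt(self, query: str) -> str:
        shots = "\n".join(f"Input: {x}\nLabel: {y}" for x, y in self.retrieve(query))
        return f"{shots}\nInput: {query}\nLabel:".lstrip()

    def step(self, query: str) -> str:
        return self.model(self.build_prompt(query)).strip()

    def update(self, query: str, label: str) -> None:
        # State update once the true label is known.
        self.memory.append((query, label))


def echo_last_label(prompt: str) -> str:
    """Toy model: repeats the label of the last example shown in the prompt, if any."""
    labels = [
        line[len("Label: "):]
        for line in prompt.splitlines()
        if line.startswith("Label: ")
    ]
    return labels[-1] if labels else "unknown"


harness = Harness(model=echo_last_label)
first = harness.step("chest pain and cough")      # memory is empty at this point
harness.update("chest pain and cough", "bronchitis")
second = harness.step("cough and fever")          # now one example is retrieved
print(first, second)
```

Because all of the logic lives in one file, an edit to `retrieve` or `build_prompt` is a small, readable diff, which is the property the single-file convention is meant to preserve.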
The Meta-Harness search loop is deliberately simple. No evolutionary operators. No parent-selection rules. No fitness-proportionate sampling. Just a coding agent with a growing filesystem.
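Every evaluated harness gets its own directory in the filesystem, containing its source code, its execution traces (the prompts sent, the model's responses, the state updates), and its evaluation scores.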
The filesystem grows as candidates are evaluated and stored, and the agent reads this prior experience before each proposal.
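The loop can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the directory layout (`candidate_000/harness.py`, `score.json`, `traces.jsonl`), the `Proposer` and `Evaluator` signatures, and the toy stand-ins are all hypothetical. In the real system the proposer is a coding agent that browses the directory with its own tools.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable

# Hypothetical interfaces: the proposer is handed only the experience directory.
Proposer = Callable[[Path], str]                       # returns new harness source code
Evaluator = Callable[[str], tuple[float, list[dict]]]  # returns (score, trace records)


def store_candidate(root: Path, index: int, source: str,
                    score: float, traces: list[dict]) -> Path:
    """Write one evaluated candidate to its own directory (layout is an assumption)."""
    cand = root / f"candidate_{index:03d}"
    cand.mkdir(parents=True)
    (cand / "harness.py").write_text(source)
    (cand / "score.json").write_text(json.dumps({"score": score}))
    with (cand / "traces.jsonl").open("w") as f:
        for record in traces:
            f.write(json.dumps(record) + "\n")
    return cand


def search(root: Path, propose: Proposer, evaluate: Evaluator, iterations: int) -> Path:
    """Propose, evaluate, store, repeat; return the best candidate's directory."""
    for i in range(iterations):
        source = propose(root)  # the agent inspects prior candidates however it likes
        score, traces = evaluate(source)
        store_candidate(root, i, source, score, traces)
    return max(
        sorted(root.glob("candidate_*")),
        key=lambda d: json.loads((d / "score.json").read_text())["score"],
    )


# Toy stand-ins so the loop runs end to end.
def toy_propose(root: Path) -> str:
    n_prior = len(list(root.glob("candidate_*")))
    return f"MAX_EXAMPLES = {n_prior + 1}\n"


def toy_evaluate(source: str) -> tuple[float, list[dict]]:
    k = int(source.split("=")[1])
    score = 10.0 * k if k <= 3 else 25.0  # invented scoring rule
    return score, [{"prompt": f"k={k}", "response": "...", "correct": k == 3}]


with tempfile.TemporaryDirectory() as tmp:
    best = search(Path(tmp), toy_propose, toy_evaluate, iterations=4)
    best_name = best.name
    n_stored = len(list(Path(tmp).glob("candidate_*")))

print(best_name, n_stored)
```

Note what the loop does not contain: no selection rule, no mutation operator, no summarization step. All of the search strategy lives inside `propose`.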
This is the engineering decision that makes Meta-Harness work. Prior text optimizers compress feedback for a practical reason: scalability. But that compression loses the information needed to trace downstream failures back to earlier harness decisions.
Consider what happens when a harness fails on a task. The failure might be caused by:
A scalar score tells you the harness failed. An LLM-generated summary might say "the model struggled with multi-step reasoning." But the raw execution trace shows you exactly which prompt was sent, what the model said, what state was updated, and where the chain of reasoning went off the rails. That's the information you need to fix it.
| Method | History Access | Log Content | MTok/iter |
|---|---|---|---|
| OPRO | Window | past (solution, score) pairs | 0.002 |
| TextGrad | Last | textual feedback on current artifact | 0.015 |
| AlphaEvolve | Window | program database + eval scores | 0.022 |
| GEPA | Summary | reflective feedback from rollout traces | 0.008 |
| Feedback Descent | Summary | comparison + textual feedback | 0.012 |
| TTT-Discover | Window | prev solution fragment | 0.026 |
| Meta-Harness | Full | all logs and scores | 10.0 |
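Tokens of diagnostic information available per optimization step, in millions of tokens (MTok). Meta-Harness operates with roughly 400x to 5,000x more context than the prior methods in the table (10.0 MTok against 0.026 MTok for TTT-Discover and 0.002 MTok for OPRO).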
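How much more context Meta-Harness has depends on which baseline it is compared with. The short calculation below uses only the MTok/iter values from the table; the variable names are arbitrary.

```python
# MTok/iter values copied from the table above.
mtok_per_iter = {
    "OPRO": 0.002,
    "TextGrad": 0.015,
    "AlphaEvolve": 0.022,
    "GEPA": 0.008,
    "Feedback Descent": 0.012,
    "TTT-Discover": 0.026,
}
META_HARNESS_MTOK = 10.0

# Ratio of Meta-Harness context to each prior method's context.
ratios = {name: META_HARNESS_MTOK / mtok for name, mtok in mtok_per_iter.items()}

for name, ratio in sorted(ratios.items(), key=lambda kv: kv[1]):
    print(f"{name:<17} {ratio:>7.0f}x")
```

The smallest ratio is against TTT-Discover and the largest against OPRO, so the gap spans roughly two and a half to nearly four orders of magnitude.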
A natural question: why use a plain filesystem instead of a vector DB or structured knowledge graph? Because the agent's queries are unpredictable. Sometimes it needs to compare two specific harnesses' retrieval logic. Sometimes it needs to grep for a particular error message across all traces. Sometimes it needs to read the reasoning log of the highest-scoring candidate. A filesystem with standard UNIX tools supports all of these access patterns without pre-committing to a retrieval strategy.
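The access patterns above map onto ordinary file operations. The sketch below imitates a recursive `grep` in Python over a toy experience directory. The file names (`traces.log`, `harness.py`), the directory names, and the log contents are hypothetical, chosen only to make the example self-contained.

```python
import tempfile
from pathlib import Path


def grep(root: Path, needle: str, pattern: str = "*") -> list[tuple[str, int, str]]:
    """Return (relative path, line number, line) for every line containing needle."""
    hits = []
    for path in sorted(root.rglob(pattern)):
        if not path.is_file():
            continue
        for lineno, line in enumerate(path.read_text().splitlines(), start=1):
            if needle in line:
                hits.append((path.relative_to(root).as_posix(), lineno, line))
    return hits


# Toy experience directory with two candidates (contents are invented).
toy_runs = {
    "candidate_000": ("# retrieval: dense\n", ["ok", "KeyError: 'label'"]),
    "candidate_001": ("# retrieval: bm25\n", ["ok", "ok"]),
}

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for name, (source, trace_lines) in toy_runs.items():
        cand = root / name
        cand.mkdir()
        (cand / "harness.py").write_text(source)
        (cand / "traces.log").write_text("\n".join(trace_lines))

    # "Which traces contain this error?"
    error_hits = grep(root, "KeyError", "traces.log")
    # "How does each harness do retrieval?"
    code_hits = grep(root, "retrieval", "harness.py")

print(error_hits)
print(code_hits)
```

Neither query had to be anticipated when the files were written, which is the point: a vector index or a schema would fix the useful questions in advance.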
Meta-Harness searches over full Python programs, not prompt templates or configuration files. This is a fundamental difference from prior text optimization methods.
A prompt template has a fixed structure with slots to fill. A code-space harness can change anything: the retrieval algorithm, the memory data structure, the prompt construction logic, the error handling, the multi-step orchestration strategy. The agent can rewrite the retrieval function from dense search to BM25, add a caching layer, change how examples are formatted, or restructure the entire control flow.
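To illustrate the kind of edit that code-space search allows, here is a small sketch in which the retrieval scorer is a swappable function and one option is a textbook Okapi BM25. The parameter defaults `k1=1.5` and `b=0.75`, the whitespace tokenizer, and the toy corpus are assumptions for the example. This is not the retrieval logic of any harness discovered by Meta-Harness.

```python
import math
from collections import Counter
from typing import Callable

Scorer = Callable[[str, list[str]], list[float]]


def bm25_scores(query: str, docs: list[str], k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Okapi BM25 score of each document for the query (whitespace tokenization)."""
    tokenized = [d.lower().split() for d in docs]
    avg_len = sum(len(t) for t in tokenized) / len(tokenized)
    # Document frequency: number of documents containing each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    n = len(docs)
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            length_norm = 1 - b + b * len(tokens) / avg_len
            score += idf * tf[term] * (k1 + 1) / (tf[term] + k1 * length_norm)
        scores.append(score)
    return scores


def retrieve(query: str, docs: list[str], k: int = 2, scorer: Scorer = bm25_scores) -> list[str]:
    """Return the k highest-scoring documents; swap `scorer` to change the policy."""
    scores = scorer(query, docs)
    order = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
    return [docs[i] for i in order[:k]]


# Toy corpus, invented for illustration.
docs = [
    "prove the inequality using induction on n",
    "compute the area of a triangle",
    "induction proof for sum of squares",
]
scores = bm25_scores("induction proof", docs)
top = retrieve("induction proof", docs, k=2)
print(top)
```

A prompt-template optimizer could not make this change, because the scorer is not a slot in a template. An agent editing code can replace `bm25_scores` with a dense scorer, or the reverse, in one edit.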
Code-space search makes credit assignment possible. When the agent reads an execution trace, it can see: "This harness used BM25 retrieval and got 3 irrelevant examples for problem #47, which led the model to apply the wrong proof technique." That's actionable. The agent can then modify the retrieval function specifically, rather than making a vague "improve retrieval" edit.
In practice, the proposer is guided by a minimal domain-specific skill that describes where to write new harnesses, how to inspect previous harnesses and their execution traces, and what files it can and cannot modify. This skill acts as a soft constraint on the search space without hard-coding any search strategy.
The first test domain is online text classification: an LLM receives labeled examples one at a time, updates its memory, and is evaluated on a held-out test set. The model is GPT-OSS-120B. Three datasets: LawBench (215 criminal charge classes), Symptom2Disease (22 classes), and USPTO-50k (180 chemical reaction classes).
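The evaluation protocol described here can be sketched as a loop: feed labeled examples one at a time, let the harness update its memory, then score on held-out data. The `OnlineHarness` protocol, the `run_online` function, and the `OverlapHarness` toy (which makes no LLM call at all) are hypothetical names for illustration. A real harness would call the model inside `predict`.

```python
from typing import Protocol


class OnlineHarness(Protocol):
    def update(self, text: str, label: str) -> None: ...
    def predict(self, text: str) -> str: ...


def run_online(harness: OnlineHarness,
               stream: list[tuple[str, str]],
               test_set: list[tuple[str, str]]) -> float:
    """Show labeled examples one at a time, then measure held-out accuracy."""
    for text, label in stream:
        harness.update(text, label)
    correct = sum(harness.predict(text) == label for text, label in test_set)
    return correct / len(test_set)


class OverlapHarness:
    """Toy harness with no LLM: predicts the label of the most word-similar stored example."""

    def __init__(self) -> None:
        self.memory: list[tuple[str, str]] = []

    def update(self, text: str, label: str) -> None:
        self.memory.append((text, label))

    def predict(self, text: str) -> str:
        if not self.memory:
            return "unknown"
        words = set(text.lower().split())
        best = max(self.memory, key=lambda ex: len(words & set(ex[0].lower().split())))
        return best[1]


# Invented examples, loosely in the style of a symptom-to-disease task.
stream = [
    ("fever and dry cough", "flu"),
    ("itchy red skin rash", "allergy"),
]
test_set = [
    ("persistent cough with fever", "flu"),
    ("red rash on arms", "allergy"),
    ("sharp knee pain", "arthritis"),
]
accuracy = run_online(OverlapHarness(), stream, test_set)
print(f"{accuracy:.3f}")
```

The third test item fails because its class never appeared in the stream, a reminder that in this setting accuracy depends on what the harness chose to remember as much as on the model.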
The baselines are the current state-of-the-art hand-designed harnesses for this problem:
Meta-Harness reaches 48.6% accuracy — a 7.7-point improvement over ACE — while using only 11.4K context tokens. That's 4x fewer tokens than ACE. Higher accuracy, lower cost.
For a fair comparison against other search methods (same proposer model, same evaluation budget):
| Method | Median Acc | Best Acc |
|---|---|---|
| GEPA | 32.6 | 40.2 |
| Best-of-N | 34.0 | 44.2 |
| OpenEvolve | 39.1 | 43.3 |
| TTT-Discover | 34.1 | 45.6 |
| Meta-Harness | 50.0 | 56.7 |
Meta-Harness matches the best prior text optimizers' final accuracy after just 4 evaluations, while they need 60. Its final accuracy surpasses theirs by more than 10 points.
The discovered harness generalizes to 9 entirely new datasets unseen during search, achieving 73.1% average accuracy vs 70.2% for ACE. This confirms Meta-Harness captures generally effective strategies rather than overfitting to the search datasets.
The setup: augment an LLM with the ability to retrieve examples from a large corpus of 500,000+ solved problems before attempting IMO-level math problems. The question isn't whether retrieval helps — it's whether Meta-Harness can discover the right retrieval policy.
The search optimizes over retrieval harnesses for 40 iterations using GPT-OSS-20B on a 250-problem search set. A single winning harness is then evaluated on 200 previously unseen IMO-level problems, across five models (including four never seen during search).
TerminalBench-2 evaluates LLM agents on 89 challenging tasks requiring long-horizon, fully autonomous execution. This is an actively contested benchmark with multiple teams directly optimizing for it.
| Harness (Opus 4.6) | Pass % |
|---|---|
| Claude Code | 58.0 |
| Terminus 2 | 62.9 |
| Terminus-KIRA | 74.7 |
| Capy | 75.3 |
| Meta-Harness | 76.4 |

| Harness (Haiku 4.5) | Pass % |
|---|---|
| Claude Code | 27.5 |
| Mini-SWE-Agent | 29.8 |
| Terminus-KIRA | 33.7 |
| Goose | 35.5 |
| Meta-Harness | 37.6 |
On Haiku 4.5, Meta-Harness ranks #1 among all agents on the leaderboard. On Opus 4.6, it ranks #2 (76.4% vs ForgeCode's reported 81.8%, though ForgeCode's result could not be reproduced from their published code).
Performance comparison across three evaluation domains.
The search trajectories reveal how the proposer agent behaves in practice. This is where the filesystem access pays off — the agent develops sophisticated strategies for navigating the search space.
Here's what the proposer actually does across iterations, reconstructed from the paper's appendix:
In a typical iteration, the proposer:
This adds up to a median of 82 files per iteration, referencing over 20 prior candidates per step. The agent is doing what a skilled engineer would do — reading code, checking logs, forming hypotheses, testing edits — but at a scale and speed no human could match.
Meta-Harness sits at the intersection of several important research threads. Let's map where it fits.
AlphaEvolve (Google DeepMind) searches over executable code for mathematical functions and algorithms. Like Meta-Harness, it uses LLMs as mutation operators over code. But AlphaEvolve operates on designated functions within fixed scaffolds and uses a structured program database with eval scores (~22K tokens/iter). Meta-Harness searches over entire harness implementations with unrestricted filesystem access (~10M tokens/iter). On text classification, Meta-Harness's best accuracy exceeds OpenEvolve's by 13.4 points (56.7 vs 43.3), and its median accuracy by 10.9 points (50.0 vs 39.1).
GEPA uses "reflective feedback from rollout traces" — but these are LLM-generated summaries of traces, not the raw traces themselves. Meta-Harness's ablation shows this compression loses critical diagnostic information.
OPRO conditions only on past (solution, score) pairs. TextGrad uses textual feedback on the current artifact only. Both operate in a much narrower feedback regime than Meta-Harness.
The AI Scientist uses coding agents to propose and evaluate scientific hypotheses. Meta-Harness applies a similar agent-driven search loop, but specialized to harness optimization rather than scientific discovery.
Meta-Harness's proposer is a coding agent (Claude Code with Opus-4.6). The key contribution isn't the agent itself but the outer loop: what information the agent sees, how experience is accumulated, and how the search is structured.
The authors explicitly cite Sutton's Bitter Lesson: "once a search space becomes accessible, stronger general-purpose agents can outperform hand-engineered solutions." Meta-Harness is a concrete instance of this pattern applied to harness engineering.
| Aspect | Meta-Harness |
|---|---|
| What it optimizes | Harness code (Python programs wrapping an LLM) |
| Proposer | Claude Code with Opus-4.6 |
| Feedback mechanism | Full filesystem of code + traces + scores |
| Context per eval | Up to 10M tokens |
| Files read/iteration | Median 82 |
| Search budget | ~60 harnesses over 20 iterations |
| Text classification | +7.7 over ACE, 4x fewer tokens |
| Math reasoning | +4.7 on 200 IMO-level problems, 5 models |
| Agentic coding | #1 Haiku 4.5 on TerminalBench-2 (37.6%) |
| Key ablation finding | Full traces >> summaries >> scores only |
| Convergence speed | 10x fewer evals than best text optimizer |