Meta-Harness — Veanors

Chapter 0: The Problem

You have the same LLM — same weights, same architecture, same everything. You change only the code that wraps it: what goes in the prompt, how context is managed, what examples are retrieved, how memory is updated. The performance gap? 6x on the same benchmark.

This is the harness engineering problem. The harness — the code around the model — often matters as much as the model itself. And right now, designing a good harness is entirely manual. An engineer inspects failures, tweaks prompts, adjusts retrieval logic, and repeats. It's slow, it doesn't scale, and it depends entirely on the engineer's intuition.

The 6x gap is real: On SWE-Bench Mobile, changing only the harness code around a fixed LLM produces a 6x difference in pass rate. The same model. The same benchmark. The only variable is the code that decides what the model sees.

Why is this hard to automate? Because a harness is a program, not a prompt template. It decides what to store, when to retrieve it, how to construct prompts, and how to update state after each model response. A single choice about what to put in context can affect behavior many reasoning steps later. The feedback signal is distributed across code, scores, and execution traces in a way that's extremely hard to compress into a simple gradient or summary.

Existing text optimizers — OPRO, TextGrad, AlphaEvolve, GEPA — try to iteratively improve prompts using feedback from prior attempts. But they compress that feedback too aggressively: some condition only on scalar scores, others use short LLM-generated summaries, others restrict feedback to templates. The total diagnostic context they work with ranges from 100 to 30,000 tokens per optimization step. For harness search, a single evaluation can produce 10,000,000 tokens of diagnostic information.

Performance Gap: Same Model, Different Harnesses

The same base model achieves wildly different scores depending on its harness.

Why is harness engineering hard to automate with existing text optimizers?

Because harness changes affect behavior many steps later, and the relevant diagnostic info (10M tokens) far exceeds what existing optimizers can process (100-30K tokens)
Because LLMs are too slow to evaluate harness candidates
Because harness code requires specialized programming languages

Chapter 1: The Key Insight

Here's the central idea of Meta-Harness: give a coding agent full filesystem access to all prior experience, and let it decide what to inspect.

Instead of compressing feedback into a summary or a score, Meta-Harness stores everything: every candidate harness's source code, every execution trace (the actual prompts sent, the model's responses, the state updates), and every evaluation score. All of it goes into a filesystem. The proposer — a coding agent — navigates this filesystem freely using standard developer tools: grep, cat, ls. It reads what it needs, ignores what it doesn't.

Why this is different from everything before: Prior text optimizers are like a student studying from flashcard summaries. Meta-Harness is like a student with access to every past exam, every solution attempt, every teacher's comment, and the ability to search through them at will. The student decides what to review based on what went wrong.

The proposer isn't a raw LLM receiving a fixed prompt assembled by the outer loop. It's a coding agent — Claude Code with Opus-4.6 — that can invoke developer tools and modify code. This matters because the amount of experience quickly exceeds any model's context window. In the most demanding setting, the proposer reads a median of 82 files per iteration, referencing over 20 prior candidates per step. The total diagnostic information from a single evaluation reaches up to 10 million tokens.

Prior Optimizers

Compress feedback into summaries or scores (100-30K tokens). Fixed prompt template. Agent sees a curated, lossy view of history.

↓ vs ↓

Meta-Harness

Store EVERYTHING in a filesystem (10M tokens per eval). Coding agent freely navigates with grep/cat. Agent decides what to inspect.

Concrete numbers: In the TerminalBench-2 setting, the proposer reads a median of 82 files per iteration. Each evaluation produces up to 10,000,000 tokens of logs. The full filesystem is far larger than the proposer's context window — so the agent must search through it, not ingest it whole. This selective inspection is the key capability that makes rich feedback usable.

What is the key difference between Meta-Harness's feedback mechanism and prior text optimizers?

Meta-Harness uses a bigger language model
Meta-Harness stores all raw experience (code, traces, scores) in a filesystem and lets a coding agent selectively inspect it, rather than compressing feedback into fixed summaries
Meta-Harness runs more evaluation iterations

Chapter 2: What is a Harness?

Before we can optimize harnesses, we need to define what one is. A harness is a stateful program that wraps a language model and determines what context the model sees at each step. It handles:

Prompt construction: What instructions, examples, and context go into each prompt?
Context management: As the conversation or task progresses, what gets kept, what gets dropped, what gets summarized?
Retrieval: When the model needs external information, what gets retrieved and how?
Memory: What gets stored between interactions for later use?
Orchestration: How are multi-step workflows managed? When does the model use tools? When does it retry?

A harness is NOT just a prompt. A prompt is a static string. A harness is a program that dynamically constructs prompts based on task state, retrieved context, stored memories, and prior model outputs. Think of it as the operating system for an LLM application.
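
To make the prompt-versus-program distinction concrete, here is a minimal sketch of a harness as a stateful program. Everything in it is hypothetical: the `Harness` class, its method names, the word-overlap retrieval, and the `echo_model` stand-in are illustrative assumptions, not the interface Meta-Harness uses.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Harness:
    """Hypothetical stateful wrapper around a text-in/text-out model."""
    model: Callable[[str], str]          # any function mapping prompt -> response
    instructions: str = "Answer concisely."
    max_memory: int = 3                  # context management: keep only recent items
    memory: List[str] = field(default_factory=list)

    def retrieve(self, query: str, k: int = 2) -> List[str]:
        # Retrieval: rank stored memories by naive word overlap with the query.
        words = set(query.lower().split())
        ranked = sorted(self.memory,
                        key=lambda m: len(words & set(m.lower().split())),
                        reverse=True)
        return ranked[:k]

    def build_prompt(self, task: str) -> str:
        # Prompt construction: instructions + retrieved context + task.
        context = "\n".join(self.retrieve(task))
        return f"{self.instructions}\n\nContext:\n{context}\n\nTask: {task}"

    def step(self, task: str) -> str:
        prompt = self.build_prompt(task)
        response = self.model(prompt)
        # Memory update: store the interaction, dropping the oldest when full.
        self.memory.append(f"{task} -> {response}")
        self.memory = self.memory[-self.max_memory:]
        return response


def echo_model(prompt: str) -> str:
    """Stand-in for an LLM call: returns the last line of the prompt."""
    return prompt.splitlines()[-1]


harness = Harness(model=echo_model)
first = harness.step("classify: chest pain")
second = harness.step("classify: chest tightness")
```

The prompt for the second call differs from the first even though the instructions are identical, because the harness state changed in between. That state dependence is what a static prompt string cannot express.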

The Formal Objective

Let M be a fixed language model and X a task distribution. For a harness H and task instance x from X, we execute a rollout trajectory:

τ ~ p_M(H, x)

The harness constructs prompts for M, the model responds, and the harness updates its state after each interaction. A task-specific reward function r(τ, x) scores the trajectory. The objective:

H* = argmax_H E_{x~X, τ~p_M(H,x)} [ r(τ, x) ]

In plain English: find the harness program that makes the model perform best, on average, across the task distribution. When multiple objectives matter (accuracy and context cost), the system evaluates candidates under Pareto dominance and reports the resulting frontier.
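
The expectation in the objective cannot be computed exactly, so in practice it would be estimated by sampling tasks and rollouts. The sketch below shows one way to do that, assuming a harness can be treated as a function from a task to a trajectory; `estimate_objective`, the toy tasks, and the two candidate harnesses are all made up for illustration.

```python
from statistics import mean
from typing import Callable, Sequence


def estimate_objective(rollout: Callable[[str], str],
                       tasks: Sequence[str],
                       reward: Callable[[str, str], float],
                       rollouts_per_task: int = 1) -> float:
    """Monte Carlo estimate of E_{x, tau}[r(tau, x)] for one harness."""
    rewards = []
    for x in tasks:                       # x ~ X (here: a finite sample)
        for _ in range(rollouts_per_task):
            tau = rollout(x)              # tau ~ p_M(H, x)
            rewards.append(reward(tau, x))
    return mean(rewards)


# Toy setup: a "trajectory" is just the final answer string.
tasks = ["2+2", "3+3", "5+1"]
answers = {"2+2": "4", "3+3": "6", "5+1": "6"}


def exact_match(tau: str, x: str) -> float:
    return 1.0 if tau == answers[x] else 0.0


candidates = {
    "always_six": lambda x: "6",
    "evaluates": lambda x: str(sum(int(p) for p in x.split("+"))),
}
scores = {name: estimate_objective(h, tasks, exact_match)
          for name, h in candidates.items()}
best = max(scores, key=scores.get)       # argmax over the candidate set
```

The argmax here runs over a finite candidate set, which is also all a search procedure can do in practice: it never sees the full space of harness programs, only the ones it has proposed and evaluated.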

Why "meta"? Meta-Harness is itself a harness — it's the code that determines what the proposer model sees during search. It's a harness for optimizing harnesses. The task-specific harnesses being optimized are single-file Python programs that modify prompting, retrieval, memory, and orchestration logic.

In practice, each harness is implemented as a single Python file. This keeps the search space navigable: the agent can read, understand, and modify a single file more reliably than a multi-file project. The file contains all the logic for interacting with the base model M on a given task.
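
A single-file harness can be loaded and checked with the standard library alone. The sketch below assumes a hypothetical interface in which every harness file defines a callable `run(task, model)`. The text does not specify the real entry point, so treat `REQUIRED` and `load_harness` as placeholders.

```python
import importlib.util
import tempfile
from pathlib import Path

REQUIRED = ("run",)  # assumed entry point; the real interface is not given in the text


def load_harness(path: Path):
    """Import a single-file harness and check it exposes the assumed interface."""
    spec = importlib.util.spec_from_file_location(path.stem, path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)       # raises if the file fails to import
    missing = [n for n in REQUIRED if not callable(getattr(module, n, None))]
    if missing:
        raise ValueError(f"{path.name} is missing callables: {missing}")
    return module


with tempfile.TemporaryDirectory() as tmp:
    good = Path(tmp) / "good_harness.py"
    good.write_text("def run(task, model):\n    return model('Q: ' + task)\n")
    bad = Path(tmp) / "bad_harness.py"
    bad.write_text("PROMPT = 'hello'\n")

    output = load_harness(good).run("2+2", lambda p: p.upper())
    try:
        load_harness(bad)
        bad_rejected = False
    except ValueError:
        bad_rejected = True
```

A check like this is cheap compared with a full evaluation, so it makes sense to run it before spending any evaluation budget on a proposed file.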

What does the formal objective H* = argmax E[r(τ, x)] mean in plain language?

Find the harness program that maximizes the expected reward when the base model executes tasks through it
Find the model weights that maximize accuracy
Find the prompt template that gets the highest score on one specific task

Chapter 3: The Search Loop

The Meta-Harness search loop is deliberately simple. No evolutionary operators. No parent-selection rules. No fitness-proportionate sampling. Just a coding agent with a growing filesystem.

The Algorithm

Initialize

Start with a population H of baseline harnesses (e.g., zero-shot, few-shot, ACE, MCE). Evaluate each. Store source code + scores + execution traces in the filesystem D.

↓

For each iteration t = 1...N

The proposer agent reads the filesystem D. It browses prior harnesses' code, scores, and traces using grep/cat. It proposes k new harnesses.

↓

Evaluate

Each proposed harness that passes interface validation is evaluated on the task distribution. Code, reasoning traces, and scores are stored back in D.

↓

Return

After N iterations, return the Pareto frontier of harnesses stored in D.

The simplicity is deliberate. By leaving diagnosis and edit decisions to the proposer rather than hard-coding search heuristics, Meta-Harness can improve automatically as coding agents become more capable. A typical run evaluates ~60 harnesses over 20 iterations, or about 3 candidates per iteration on average.
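
The loop above can be sketched in a few lines. This is a toy rendering under stated assumptions, not the authors' code: `meta_search`, `propose`, `evaluate`, and `is_valid` are hypothetical names, the "filesystem" is an in-memory list, and the proposer is a trivial function rather than a coding agent.

```python
from typing import Callable, Dict, List

Record = Dict[str, object]   # {"code", "score", "trace"}


def meta_search(baselines: List[str],
                propose: Callable[[List[Record], int], List[str]],
                evaluate: Callable[[str], Record],
                is_valid: Callable[[str], bool],
                iterations: int,
                k: int) -> List[Record]:
    """Run the loop: propose from full history, validate, evaluate, store."""
    store: List[Record] = [evaluate(code) for code in baselines]   # initialize D
    for _ in range(iterations):
        proposals = propose(store, k)          # proposer sees the whole history
        for code in proposals:
            if is_valid(code):                 # interface validation
                store.append(evaluate(code))   # code, score, trace go back into D
    return store


# Toy stand-ins: "code" is a string and its score is the number of '+' characters.
def toy_evaluate(code: str) -> Record:
    return {"code": code, "score": code.count("+"), "trace": f"ran {code!r}"}


def toy_propose(store: List[Record], k: int) -> List[str]:
    best = max(store, key=lambda r: r["score"])
    return [best["code"] + "+" * (i + 1) for i in range(k)]


history = meta_search(["base"], toy_propose, toy_evaluate,
                      is_valid=lambda c: len(c) < 10, iterations=3, k=2)
```

Note what the loop does not contain: no parent selection, no mutation operator, no sampling rule. All of that is delegated to `propose`, which receives the entire store. The sketch returns the full history; selecting a frontier from it would be a separate step.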

What Gets Stored Per Candidate

Every evaluated harness gets its own directory in the filesystem containing:

Source code: The complete single-file Python harness
Evaluation scores: Per-task and aggregate metrics
Execution traces: The actual prompts sent to the base model, the model's responses, state updates, tool calls — the full diagnostic log of what happened during evaluation
Proposer reasoning: The agent's reasoning about why it made specific design choices
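
One possible on-disk layout for the four items above is sketched below. The file names (`harness.py`, `scores.json`, `trace.jsonl`, `reasoning.md`) and the `store_candidate` helper are assumptions chosen for illustration; the text only says that each candidate gets its own directory.

```python
import json
import tempfile
from pathlib import Path


def store_candidate(root: Path, name: str, code: str, scores: dict,
                    trace: list, reasoning: str) -> Path:
    """Write one candidate's artifacts into its own directory (assumed layout)."""
    cand = root / name
    cand.mkdir(parents=True, exist_ok=False)
    (cand / "harness.py").write_text(code)
    (cand / "scores.json").write_text(json.dumps(scores, indent=2))
    # One JSON object per line keeps large traces searchable line by line.
    with (cand / "trace.jsonl").open("w") as f:
        for event in trace:
            f.write(json.dumps(event) + "\n")
    (cand / "reasoning.md").write_text(reasoning)
    return cand


with tempfile.TemporaryDirectory() as tmp:
    cand_dir = store_candidate(
        Path(tmp), "cand_001",
        code="def run(task, model):\n    return model(task)\n",
        scores={"aggregate": 0.5, "per_task": {"t1": 1.0, "t2": 0.0}},
        trace=[{"task": "t1", "prompt": "t1", "response": "ok"},
               {"task": "t2", "prompt": "t2", "response": "error: timeout"}],
        reasoning="Baseline pass-through harness.",
    )
    stored_files = sorted(p.name for p in cand_dir.iterdir())
    reloaded = json.loads((cand_dir / "scores.json").read_text())
```

Plain files in predictable locations are what make tools like `grep`, `cat`, and `ls` sufficient for the proposer; nothing needs a custom query API.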

Interactive Search Loop

The filesystem grows as candidates are evaluated and stored, and the agent reads prior experience before each proposal.

No parent-selection rule: Unlike evolutionary methods that select parents based on fitness, Meta-Harness imposes no restriction on what the proposer inspects. The agent is free to read any prior harness and its execution trace. In practice, it often starts from a strong prior harness, but this is an emergent strategy, not a hard-coded rule.

Why does Meta-Harness avoid hard-coded search heuristics like parent selection or mutation operators?

Because those heuristics are too computationally expensive
Because the population is too small for evolutionary strategies
Because delegating all diagnosis and proposal decisions to the coding agent lets the system improve automatically as agents get better, without being constrained by fixed heuristics

Chapter 4: Why Filesystem, Not Summaries?

This is the engineering decision that makes Meta-Harness work. Prior text optimizers compress feedback for a practical reason: scalability. But that compression loses the information needed to trace downstream failures back to earlier harness decisions.

The Compression Problem

Consider what happens when a harness fails on a task. The failure might be caused by:

A retrieval decision 5 steps earlier that surfaced the wrong examples
A memory update that stored the wrong information
A prompt construction choice that omitted a key constraint
An interaction between two components that each seemed reasonable in isolation

A scalar score tells you the harness failed. An LLM-generated summary might say "the model struggled with multi-step reasoning." But the raw execution trace shows you exactly which prompt was sent, what the model said, what state was updated, and where the chain of reasoning went off the rails. That's the information you need to fix it.

The ablation supports this: when the proposer receives only scores (no traces), it achieves 34.6 median accuracy. With scores plus LLM-generated summaries, it reaches 34.9. With full filesystem access to raw traces, median accuracy jumps to 50.0, and even the median full-access candidate outperforms the best candidate from either ablation. Summaries don't recover the missing signal.

Context Budgets Across Methods

Method	History Access	Log Content	MTok/iter
OPRO	Window	past (solution, score) pairs	0.002
TextGrad	Last	textual feedback on current artifact	0.015
AlphaEvolve	Window	program database + eval scores	0.022
GEPA	Summary	reflective feedback from rollout traces	0.008
Feedback Descent	Summary	comparison + textual feedback	0.012
TTT-Discover	Window	prev solution fragment	0.026
Meta-Harness	Full	all logs and scores	10.0

Roughly three orders of magnitude. By the table above, Meta-Harness operates with about 400x to 5,000x more diagnostic context per optimization step than the prior methods (10 MTok versus 0.002 to 0.026 MTok). The key is that the agent doesn't ingest all 10M tokens at once; it selectively inspects the parts it needs via grep and cat. The filesystem is an external memory that the agent queries on demand.
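
The size of the gap can be checked directly from the table. The short script below only divides the Meta-Harness figure by each prior method's MTok/iter value; the numbers are copied from the table above and nothing else is assumed.

```python
# MTok/iter values copied from the table above.
mtok_per_iter = {
    "OPRO": 0.002,
    "TextGrad": 0.015,
    "AlphaEvolve": 0.022,
    "GEPA": 0.008,
    "Feedback Descent": 0.012,
    "TTT-Discover": 0.026,
}
META_HARNESS = 10.0

# How many times more context Meta-Harness has than each prior method.
ratios = {method: META_HARNESS / value for method, value in mtok_per_iter.items()}
for method, ratio in sorted(ratios.items(), key=lambda kv: kv[1]):
    print(f"{method:17s} {ratio:7.0f}x")
```

The ratios run from about 385x (TTT-Discover) to 5,000x (OPRO), which brackets the "roughly 1000x" shorthand.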

Context Budget Comparison (log scale)

Tokens of diagnostic information available per optimization step. Note the logarithmic scale: Meta-Harness operates with roughly 400x to 5,000x more context than the prior methods shown.

Why Not a Vector Database?

A natural question: why use a plain filesystem instead of a vector DB or structured knowledge graph? Because the agent's queries are unpredictable. Sometimes it needs to compare two specific harnesses' retrieval logic. Sometimes it needs to grep for a particular error message across all traces. Sometimes it needs to read the reasoning log of the highest-scoring candidate. A filesystem with standard UNIX tools supports all of these access patterns without pre-committing to a retrieval strategy.
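
As an illustration of the "grep for a particular error message across all traces" access pattern, here is a small Python equivalent. The directory layout (`<candidate>/trace.log`) and the `grep_traces` helper are assumptions for this sketch; the proposer itself would simply call `grep` from the shell.

```python
import re
import tempfile
from pathlib import Path
from typing import Iterator, Tuple


def grep_traces(root: Path, pattern: str,
                glob: str = "*/trace.log") -> Iterator[Tuple[str, int, str]]:
    """Yield (candidate, line_number, line) for trace lines matching a regex."""
    regex = re.compile(pattern)
    for path in sorted(root.glob(glob)):
        with path.open() as f:
            for lineno, line in enumerate(f, start=1):
                if regex.search(line):
                    yield path.parent.name, lineno, line.rstrip("\n")


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    # Hypothetical trace contents for three candidates.
    logs = {
        "cand_001": ["prompt sent", "response ok"],
        "cand_002": ["prompt sent", "ERROR: tool call timed out"],
        "cand_003": ["ERROR: context overflow", "response ok"],
    }
    for name, lines in logs.items():
        (root / name).mkdir()
        (root / name / "trace.log").write_text("\n".join(lines) + "\n")

    hits = list(grep_traces(root, r"ERROR"))
```

The query needs no index and no embedding model, and the next query can be something entirely different, such as a diff between two source files. That flexibility is the argument for a plain filesystem.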

In the ablation study, what happened when the proposer received LLM-generated summaries instead of raw execution traces?

Summaries barely helped over scores-only (34.9 vs 34.6 median), while full traces reached 50.0; summaries may actually hurt by compressing away diagnostically useful details
Summaries performed nearly as well as full traces
Summaries were faster to process but slightly less accurate

Chapter 5: Code-Space Search

Meta-Harness searches over full Python programs, not prompt templates or configuration files. This is a fundamental difference from prior text optimization methods.

Why Code, Not Templates?

A prompt template has a fixed structure with slots to fill. A code-space harness can change anything: the retrieval algorithm, the memory data structure, the prompt construction logic, the error handling, the multi-step orchestration strategy. The agent can rewrite the retrieval function from dense search to BM25, add a caching layer, change how examples are formatted, or restructure the entire control flow.

Natural regularization bias. Representing harnesses as programs provides a surprising benefit: coding models tend to propose coherent algorithms rather than brittle, hard-coded solutions. A model trained on millions of codebases has strong priors about what "good code" looks like. This biases the search toward reusable context-management procedures rather than task-specific hacks like if-chains or hard-coded class mappings.

What the Agent Can Modify

Retrieval Logic

Switch between dense/sparse retrieval, change k, modify the query construction, add re-ranking

↓

Memory Strategy

Change what gets stored, when it's retrieved, how it's formatted, when it expires

↓

Prompt Construction

Rewrite system prompts, change example selection, modify formatting, add constraints

↓

Orchestration

Change multi-step workflows, tool use patterns, retry logic, error handling, state management

Credit Assignment Through Traces

Code-space search makes credit assignment possible. When the agent reads an execution trace, it can see: "This harness used BM25 retrieval and got 3 irrelevant examples for problem #47, which led the model to apply the wrong proof technique." That's actionable. The agent can then modify the retrieval function specifically, rather than making a vague "improve retrieval" edit.

The action space matches training. Coding agents are trained on read-write-execute workflows over codebases. Meta-Harness's loop — read prior code, reason about failures, propose an edit, test it — is exactly the workflow these agents are optimized for. This alignment between the search space and the agent's training distribution is a key advantage.

In practice, the proposer is guided by a minimal domain-specific skill that describes where to write new harnesses, how to inspect previous harnesses and their execution traces, and what files it can and cannot modify. This skill acts as a soft constraint on the search space without hard-coding any search strategy.

Why does code-space search provide a natural regularization bias?

Because Python has strict type checking that prevents bad code
Because the code is shorter than prompt templates
Because coding models trained on millions of codebases have strong priors toward coherent algorithms, biasing search away from brittle, task-specific hacks

Chapter 6: Results — Text Classification

The first test domain is online text classification: an LLM receives labeled examples one at a time, updates its memory, and is evaluated on a held-out test set. The model is GPT-OSS-120B. Three datasets: LawBench (215 criminal charge classes), Symptom2Disease (22 classes), and USPTO-50k (180 chemical reaction classes).

vs. Hand-Designed Harnesses

The baselines are the current state-of-the-art hand-designed harnesses for this problem:

ACE (Agentic Context Engineering): uses reflective memory curation. Achieves 40.9% avg accuracy but uses 50.8K context tokens.
MCE (Meta Context Engineering): maintains a library of natural-language skills. Achieves 40.0% accuracy with 28.5K tokens.

Meta-Harness reaches 48.6% accuracy — a 7.7-point improvement over ACE — while using only 11.4K context tokens. That's 4x fewer tokens than ACE. Higher accuracy, lower cost.

Better AND cheaper: Meta-Harness doesn't just match ACE's accuracy with fewer tokens. It surpasses ACE by 7.7 points while using 4x fewer context tokens. This is a strict Pareto improvement — better on both dimensions simultaneously.
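
The Pareto claim can be verified mechanically from the three (accuracy, context tokens) pairs quoted above. The `dominates` and `pareto_frontier` helpers below are a generic sketch of Pareto dominance for "higher accuracy, fewer tokens"; they are not taken from the Meta-Harness code.

```python
from typing import Dict, List, Tuple

# (average accuracy %, context tokens in thousands), as quoted in the text above.
results: Dict[str, Tuple[float, float]] = {
    "ACE": (40.9, 50.8),
    "MCE": (40.0, 28.5),
    "Meta-Harness": (48.6, 11.4),
}


def dominates(a: Tuple[float, float], b: Tuple[float, float]) -> bool:
    """a dominates b: accuracy no lower, tokens no higher, strictly better on one."""
    acc_a, tok_a = a
    acc_b, tok_b = b
    return acc_a >= acc_b and tok_a <= tok_b and (acc_a > acc_b or tok_a < tok_b)


def pareto_frontier(points: Dict[str, Tuple[float, float]]) -> List[str]:
    """Names of points that no other point dominates."""
    return [name for name, p in points.items()
            if not any(dominates(q, p)
                       for other, q in points.items() if other != name)]


frontier = pareto_frontier(results)
token_ratio = results["ACE"][1] / results["Meta-Harness"][1]   # about 4.46
```

Among these three harnesses the frontier contains only Meta-Harness. ACE and MCE do not dominate each other (ACE is more accurate, MCE is cheaper), but both are dominated by Meta-Harness.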

vs. Text Optimizers

For a fair comparison against other search methods (same proposer model, same evaluation budget):

Method	Median Acc	Best Acc
GEPA	32.6	40.2
Best-of-N	34.0	44.2
OpenEvolve	39.1	43.3
TTT-Discover	34.1	45.6
Meta-Harness	50.0	56.7

Meta-Harness matches the best prior text optimizers' final accuracy after just 4 evaluations, while they need 60. Its final accuracy surpasses theirs by more than 10 points.

Out-of-Distribution Generalization

The discovered harness generalizes to 9 entirely new datasets unseen during search, achieving 73.1% average accuracy vs 70.2% for ACE. This confirms Meta-Harness captures generally effective strategies rather than overfitting to the search datasets.

10x faster convergence. Meta-Harness matches the best prior text optimizers with 10x fewer evaluations, then continues improving to a substantially higher final accuracy. The authors attribute this to the design choice of preserving full experience history rather than compressing it.

How does Meta-Harness compare to ACE on the accuracy-vs-context tradeoff?

Meta-Harness uses more tokens but achieves higher accuracy
Meta-Harness achieves 7.7 points higher accuracy while using 4x fewer context tokens, a strict Pareto improvement
Meta-Harness matches ACE's accuracy with similar token usage

Chapter 7: Results — Math & Coding

Retrieval-Augmented Math Reasoning

The setup: augment an LLM with the ability to retrieve examples from a large corpus of 500,000+ solved problems before attempting IMO-level math problems. The question isn't whether retrieval helps — it's whether Meta-Harness can discover the right retrieval policy.

The search optimizes over retrieval harnesses for 40 iterations using GPT-OSS-20B on a 250-problem search set. A single winning harness is then evaluated on 200 previously unseen IMO-level problems, across five models (including four never seen during search).

+4.7 points on average on held-out IMO-level problems across 5 models. The discovered retrieval harness outperforms no-retrieval on all five models, including four models not seen during search. It operates on top of plain BM25 lexical retrieval, with no dense encoder needed. The harness improves how retrieval is used (query construction, example selection, formatting), not the retrieval mechanism itself.
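
For readers unfamiliar with BM25, the sketch below implements the standard Okapi BM25 scoring formula over whitespace tokens. It shows only the retrieval mechanism that the harness builds on; the toy corpus is invented, the `k1` and `b` defaults are common conventions rather than values from the paper, and the discovered harness's query construction and example selection are not reproduced here.

```python
import math
from collections import Counter
from typing import List


def bm25_scores(query: str, docs: List[str],
                k1: float = 1.5, b: float = 0.75) -> List[float]:
    """Score each document against the query with Okapi BM25 (whitespace tokens)."""
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for toks in tokenized for term in set(toks))

    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in set(query.lower().split()):
            if term not in tf:
                continue
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            # Length normalization: longer-than-average documents are penalized.
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


# Hypothetical mini-corpus of solved-problem descriptions.
corpus = [
    "prove the inequality using AM-GM on three positive reals",
    "count lattice paths with a bijection argument",
    "apply AM-GM inequality after substitution",
]
scores = bm25_scores("AM-GM inequality", corpus)
ranking = sorted(range(len(corpus)), key=lambda i: scores[i], reverse=True)
```

Both matching documents contain each query term once, so the shorter one ranks first through length normalization. A harness that changes how the query string is built changes these scores without touching the scoring function, which is the kind of edit the paragraph above describes.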

Agentic Coding: TerminalBench-2

TerminalBench-2 evaluates LLM agents on 89 challenging tasks requiring long-horizon, fully autonomous execution. This is an actively contested benchmark with multiple teams directly optimizing for it.

Harness (Opus 4.6)	Pass %
Claude Code	58.0
Terminus 2	62.9
Terminus-KIRA	74.7
Capy	75.3
Meta-Harness	76.4

Harness (Haiku 4.5)	Pass %
Claude Code	27.5
Mini-SWE-Agent	29.8
Terminus-KIRA	33.7
Goose	35.5
Meta-Harness	37.6

On Haiku 4.5, Meta-Harness ranks #1 among all agents on the leaderboard. On Opus 4.6, it ranks #2 (76.4% vs ForgeCode's reported 81.8%, though ForgeCode's result could not be reproduced from their published code).

Results: Meta-Harness vs Baselines

Performance comparison across three evaluation domains.

What makes the TerminalBench-2 result particularly impressive?

An automatic search method (#1 on Haiku 4.5) outperforms multiple hand-engineered agents on an actively contested benchmark where teams are directly optimizing their harnesses
Meta-Harness uses a more powerful base model than the competitors
The benchmark tasks are easier than typical coding tasks

Chapter 8: What the Agent Actually Does

The search trajectories reveal how the proposer agent behaves in practice. This is where the filesystem access pays off — the agent develops sophisticated strategies for navigating the search space.

A Real Search Trajectory (TerminalBench-2)

Here's what the proposer actually does across iterations, reconstructed from the paper's appendix:

Early Iterations

The agent reads broadly across multiple prior harnesses. It combines plausible structural fixes (e.g., better error handling) with prompt-template edits. Both candidates regress.

↓

Hypothesis Formation

The agent reads the execution traces of the regressed candidates. It explicitly hypothesizes that the regressions were confounded — the structural change and prompt rewrite were entangled, making it impossible to tell which caused the regression.

↓

Isolation Strategy

The agent isolates the structural changes from the prompt rewrite. It tests each independently, trying to identify the causal variable.

↓

Pivot to Safety

After repeated regressions, the agent pivots toward a safer additive modification strategy — making small, targeted changes to a known-good harness rather than attempting large rewrites.

↓

Best Candidate

The safe additive modification becomes the best candidate in the run. The agent's causal reasoning, enabled by filesystem access, led it to the right strategy.

This is not random search. The agent forms hypotheses ("the prompt rewrite confounded the structural change"), designs experiments to test them (isolate the variables), and updates its strategy based on results (pivot to safer edits). This causal reasoning is only possible because the agent can read the raw execution traces — a summary saying "candidate regressed" wouldn't be enough.
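
The isolation step amounts to a small factorial experiment: evaluate the baseline, each edit alone, and both together, then separate the individual effects from their interaction. The sketch below shows the arithmetic. The `isolate` helper and every score in `toy_scores` are made up for illustration and do not come from the paper.

```python
from typing import Callable, Dict, FrozenSet


def isolate(evaluate: Callable[[FrozenSet[str]], float],
            a: str, b: str) -> Dict[str, float]:
    """Estimate each edit's effect alone and the interaction between the two."""
    base = evaluate(frozenset())
    only_a = evaluate(frozenset({a}))
    only_b = evaluate(frozenset({b}))
    both = evaluate(frozenset({a, b}))
    return {
        a: only_a - base,
        b: only_b - base,
        # Nonzero when the combined effect differs from the sum of the parts.
        "interaction": both - only_a - only_b + base,
    }


# Hypothetical pass rates for the four combinations of two edits.
toy_scores = {
    frozenset(): 60.0,
    frozenset({"structural_fix"}): 62.0,
    frozenset({"prompt_rewrite"}): 53.0,
    frozenset({"structural_fix", "prompt_rewrite"}): 55.0,
}
effects = isolate(toy_scores.__getitem__, "structural_fix", "prompt_rewrite")
```

In this invented example the combined candidate regresses by 5 points, yet the structural fix on its own helps by 2. Evaluating only the combined candidate would have hidden that, which is the confound the agent hypothesized.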

What 82 Files Looks Like

In a typical iteration, the proposer:

Reads the top 5 scoring harnesses' source code in full
Greps across all execution traces for specific error patterns
Compares the retrieval logic of two harnesses that performed differently on similar tasks
Reads the reasoning logs of candidates that regressed, looking for confounded edits
Checks the scores breakdown per task, looking for patterns in which tasks improved vs degraded

This adds up to a median of 82 files per iteration, referencing over 20 prior candidates per step. The agent is doing what a skilled engineer would do — reading code, checking logs, forming hypotheses, testing edits — but at a scale and speed no human could match.
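
The last item in the list, checking the per-task score breakdown, is easy to picture in code. The sketch below compares two candidates task by task; `diff_scores` and the pass/fail values are hypothetical.

```python
from typing import Dict, List, Tuple


def diff_scores(before: Dict[str, float], after: Dict[str, float],
                tol: float = 1e-9) -> Tuple[List[str], List[str]]:
    """Return (improved, degraded) task ids shared by both candidates."""
    improved, degraded = [], []
    for task in sorted(before.keys() & after.keys()):
        delta = after[task] - before[task]
        if delta > tol:
            improved.append(task)
        elif delta < -tol:
            degraded.append(task)
    return improved, degraded


# Hypothetical per-task pass/fail scores for a parent and an edited harness.
parent = {"t01": 1.0, "t02": 0.0, "t03": 1.0, "t04": 0.0}
edited = {"t01": 1.0, "t02": 1.0, "t03": 0.0, "t04": 0.0}
improved, degraded = diff_scores(parent, edited)
```

Both candidates here have the same aggregate score, so a scalar summary would report no change. The per-task view shows that one task was fixed and another broke, and it points to exactly which traces are worth reading next.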

Why this only became practical recently: The authors note that this workflow "only became practical recently, following major improvements in coding-agent capabilities around early 2026." The proposer must reliably read code, reason about multi-step failures, and generate valid Python programs. Earlier coding agents weren't consistent enough for this loop to converge.

What did the proposer agent do when it observed that two candidates both regressed?

It randomly sampled a new harness from scratch
It hypothesized the regressions were confounded (structural change + prompt rewrite entangled), isolated the variables, and eventually pivoted to safer additive modifications
It increased the number of evaluation samples

Chapter 9: Connections

Meta-Harness sits at the intersection of several important research threads. Let's map where it fits.

Relation to AlphaEvolve / OpenEvolve

AlphaEvolve (Google DeepMind) searches over executable code for mathematical functions and algorithms. Like Meta-Harness, it uses LLMs as mutation operators over code. But AlphaEvolve operates on designated functions within fixed scaffolds and uses a structured program database with eval scores (~22K tokens/iter). Meta-Harness searches over entire harness implementations with unrestricted filesystem access (~10M tokens/iter). On text classification, Meta-Harness outperforms OpenEvolve by about 11 points in median accuracy (50.0 vs 39.1) and by 13+ points in best accuracy (56.7 vs 43.3).

Relation to GEPA

GEPA uses "reflective feedback from rollout traces" — but these are LLM-generated summaries of traces, not the raw traces themselves. Meta-Harness's ablation shows this compression loses critical diagnostic information.

Relation to OPRO / TextGrad

OPRO conditions only on past (solution, score) pairs. TextGrad uses textual feedback on the current artifact only. Both operate in a much narrower feedback regime than Meta-Harness.

Relation to AI Scientist

The AI Scientist uses coding agents to propose and evaluate scientific hypotheses. Meta-Harness applies a similar agent-driven search loop, but specialized to harness optimization rather than scientific discovery.

Relation to Coding Agents (Claude Code, Cursor)

Meta-Harness's proposer is a coding agent (Claude Code with Opus-4.6). The key contribution isn't the agent itself but the outer loop: what information the agent sees, how experience is accumulated, and how the search is structured.

The Bitter Lesson Connection

The authors explicitly cite Sutton's Bitter Lesson: "once a search space becomes accessible, stronger general-purpose agents can outperform hand-engineered solutions." Meta-Harness is a concrete instance of this pattern applied to harness engineering.

Cheat Sheet

Aspect	Meta-Harness
What it optimizes	Harness code (Python programs wrapping an LLM)
Proposer	Claude Code with Opus-4.6
Feedback mechanism	Full filesystem of code + traces + scores
Context per eval	Up to 10M tokens
Files read/iteration	Median 82
Search budget	~60 harnesses over 20 iterations
Text classification	+7.7 over ACE, 4x fewer tokens
Math reasoning	+4.7 on 200 IMO-level problems, 5 models
Agentic coding	#1 Haiku 4.5 on TerminalBench-2 (37.6%)
Key ablation finding	Full traces >> summaries ≈ scores only (50.0 vs 34.9 vs 34.6 median)
Convergence speed	10x fewer evals than best text optimizer

The broader lesson: The harness around a model is a first-class optimization target. When you give a capable coding agent unrestricted access to its own prior experience, it can discover harness strategies that outperform expert-designed ones — and the strategies generalize across tasks and models.

According to the authors, what is the main advantage of Meta-Harness over prior methods?

Using a more powerful base model
Having a larger training dataset
Not just search over code, but search with selective access to full prior diagnostic experience (raw code, execution traces, and scores) rather than compressed summaries

Meta-Harness: End-to-End Optimization of Model Harnesses