Ning, Tieu, Fu, Wei, Li, Bei et al. — UIUC, Meta, Stanford, 2026

Code as Agent Harness

A unified framework for executable, verifiable, and stateful AI agent systems — where code isn't just what agents produce, but the infrastructure they operate through.

Prerequisites: What an LLM is + Basic agent concepts (prompting, tool use)
12 Chapters · 9 Simulations

Chapter 0: The Problem

You give an LLM a task: "Fix the authentication bug in issue #347." The model generates a patch. It looks right. But the patch doesn't compile because it imports a function that was renamed two commits ago. The model doesn't know the repo's test framework. It can't run the tests to check. It has no memory of what it tried before. It can't even see the file it needs to edit.

This isn't a model intelligence problem. The model could fix the bug — it just can't operate. It has no way to read the repository, execute code, verify its output, remember what worked, or coordinate with other agents. It's a brain without a body.

The core thesis: Code is no longer just what agents produce. It increasingly serves as an operational substrate for agent reasoning, acting, environment modeling, and execution-based verification. The code wrapping the model — the harness — determines whether a capable LLM becomes a capable agent.

Think about what happens when you use a coding assistant like Claude Code or Codex. The model doesn't just generate text. It reads files, runs shell commands, executes tests, edits code, manages context across hundreds of interactions, and decides when to ask for human approval. All of that infrastructure — the tools, the permissions, the memory, the execution sandbox — is the harness.

The paper draws a critical three-way distinction that most surveys miss:

  1. Model-internal capabilities — the LLM's reasoning, perception, planning, simulation, and evaluation abilities. These are the weights.
  2. System-provided harness infrastructure — predefined tools, APIs, sandboxes, memory systems, validators, permission boundaries, telemetry, and workflows. This is the operating system.
  3. Agent-initiated code artifacts — interactive code objects that agents create, execute, observe, revise, persist, and share within the task execution loop. Regression tests, temporary tools, DSL programs, reusable skills, intermediate program states. This is the medium of agency.

Most prior surveys collapse (2) and (3). This paper's contribution is distinguishing them: the harness infrastructure is predefined, but agent-initiated code artifacts are emergent — the agent creates them during execution, and they feed back into its own capabilities.

And here's the striking finding: the same model, with different harnesses, produces a 6x performance gap on the same benchmark. Same weights. Same architecture. The only variable is the code that decides what the model sees and does.

[Interactive figure: The Harness Gap: Same Model, Different Infrastructure. The model is identical across harness designs; only the surrounding code changes.]

Why "harness"? In software testing, a test harness is the infrastructure that sets up the environment, runs the test, captures results, and reports outcomes. An agent harness does the same for an LLM: it sets up context, routes actions, captures feedback, and governs state transitions. The harness is the operating system for the agent. The paper identifies three properties that make code uniquely suited as this medium: executable (model outputs become operations with formally verifiable outcomes), inspectable (intermediate computation is exposed as structured traces), and stateful (the evolving program represents task progress in a persistent, modifiable form). Inside a harness, code is also governed (constrained by permissions, verification, and accountability).

What Counts as "Code" in This Framework?

The survey explicitly defines what counts as "code": executable or machine-checkable artifacts including programs, scripts, formal specifications, proof scripts, API schemas, tool definitions, tests, repositories, simulators, configuration files, and execution artifacts like traces and logs.

Not code: raw perception, physical state, human intent, or model-internal reasoning. These may be represented through code, but they're not code themselves. This boundary matters because it determines what the harness can verify. If an artifact is code, the harness can execute it, check it, and feed back structured results. If it's not code, the harness must either convert it to code or rely on weaker verification signals.

Why does the same LLM produce vastly different results with different harnesses?

Chapter 1: The Three-Layer Taxonomy

The survey organizes the entire code-as-agent-harness landscape into three connected layers. This isn't just a convenient filing system — it follows how code becomes an operational medium inside the agent loop: it first enters as a harness interface for reasoning, acting, and environment representation; it then supports harness mechanisms that manage planning, memory, tool use, execution, and repair over time; and it finally becomes a shared artifact through which multiple agents coordinate.

Layer 1: Harness Interface

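This is where code connects the agent to the world. Code plays three distinct roles at this layer: a substrate for reasoning (Chapter 2), an interface for acting (Chapter 3), and a representation of the environment (Chapter 4).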

Layer 2: Harness Mechanisms

Once the interface connects the agent to the world, the agent needs systems to operate reliably over time. Five mechanisms manage this: planning, memory, tool use, execution, and repair.

Layer 3: Multi-Agent Orchestration

When tasks exceed what a single agent can handle — due to context limits, specialization needs, or verification requirements — the harness scales to multiple agents sharing code artifacts as their coordination medium. The paper examines role specialization, interaction modes (collaborative, adversarial, debate), workflow topologies (chain, star, cyclic, hierarchical, adaptive), execution feedback integration, shared-harness synchronization, and the concept of harness-state convergence.

[Interactive figure: The Three-Layer Harness Architecture, showing each layer's components and their connections across layers.]

Why code specifically? Code has three properties that make it uniquely suited as a harness medium: (1) Executable — model outputs become operations with formally verifiable outcomes. (2) Inspectable — intermediate computation is exposed as structured traces the harness can read and act upon. (3) Stateful — the evolving program represents task progress in a persistent, modifiable form across steps. Natural language has none of these properties. JSON schemas have executability but not statefulness. Code has all three.

The Three Elements of Agent Systems

A worked example clarifies the three-way distinction. Consider a coding agent resolving a GitHub issue:

Model-Internal Capabilities
The LLM understands Python syntax, can reason about control flow, and knows common patterns for authentication systems. These are baked into the weights.
System-Provided Harness Infrastructure
The harness provides: file read/write tools, shell access, test runners, permission gates, context management, memory systems. These are predefined before the agent starts.
Agent-Initiated Code Artifacts
During execution, the agent creates: a regression test for the bug, a temporary debugging script, a PLAN.md file, a patch file. These emerge from the interaction and feed back into the agent's capabilities.

Prior surveys treat (2) and (3) as the same thing. This paper's key conceptual move is separating them, because agent-initiated code artifacts represent a feedback loop: the agent improves its own capabilities by creating executable artifacts that persist across steps.


Chapter 2: Code for Reasoning

Ask an LLM to compute 347 × 892. It might get it wrong. Ask it to write a Python program that computes 347 × 892, then execute that program. It gets it right every time. This is the simplest form of code-for-reasoning: delegating computation from the model's unreliable internal arithmetic to an external, deterministic executor.

But code-for-reasoning goes far beyond arithmetic. The survey identifies three paradigms, each representing a deeper integration of code into the reasoning process:

Paradigm 1: Program-Delegated Reasoning

The model generates executable programs as the primary interface between problem decomposition and computation. Instead of chain-of-thought in natural language, the model produces code that an interpreter executes.

Program-of-Thoughts (PoT) was among the first to formalize this: decompose a reasoning problem into an executable program, run it, return the result. The key insight is separating what to compute (the model's job) from how to compute it (the interpreter's job).

Chain-of-Thought
Model reasons AND computes in natural language. "347 × 892 = let me think... 300 × 892 = 267,600... 47 × 892 = 41,924... total = 309,524." (Often wrong.)
↓ vs ↓
Program-of-Thoughts
Model writes: print(347 * 892). Interpreter executes. Returns: 309524. (Always right.)
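
The delegation step can be sketched in a few lines. This is a minimal illustration, not the PoT implementation: the `run_generated_program` helper and the `generated_program` string are assumptions made for the example, and a child interpreter process is not a security sandbox.

```python
import subprocess
import sys

def run_generated_program(source: str, timeout_s: float = 5.0) -> str:
    """Execute model-written Python in a child interpreter and return its stdout."""
    completed = subprocess.run(
        [sys.executable, "-c", source],
        capture_output=True,
        text=True,
        timeout=timeout_s,
        check=True,  # raise CalledProcessError if the program exits non-zero
    )
    return completed.stdout.strip()

# Hypothetical model output for the prompt "compute 347 × 892".
generated_program = "print(347 * 892)"
answer = run_generated_program(generated_program)
print(answer)
```

The model only decides what to compute; the arithmetic is done by the interpreter, and a crash or timeout surfaces as an exception the harness can feed back.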

CodeI/O pushes this further: it transforms programs into input-output prediction tasks, exposing reasoning primitives like state-space search, decision-tree traversal, and modular decomposition while keeping procedural rigor through executable verification. The key idea: instead of teaching the model to reason in natural language, teach it to predict what code would produce — then verify by actually running it.

PAL (Program-Aided Language models), developed concurrently with PoT, has the model generate Python programs in which natural language comments serve as the reasoning trace. The comments explain why each step is taken; the code computes what each step produces. This creates a dual-channel reasoning signal: human-readable rationale plus machine-verifiable computation.

Paradigm 2: Formal Verification and Symbolic Reasoning

What if the reasoning itself needs to be proved correct, not just executed? Formal verification uses machine-checkable languages — Lean, Isabelle, Coq — where every derivation step is verified by a proof checker.

Systems like DeepSeek-Prover-V2 combine LLM generation with proof-assistant feedback: the model proposes proof steps, the verifier checks them, and failures guide the next attempt. Lean4Agent extends this to agent workflows themselves — using Lean to model and verify that an agent's trajectory satisfies safety contracts.

The survey identifies a crucial data flow pattern in formal verification systems:

# Formal verification loop (pseudocode)
def prove_theorem(statement, model, verifier, max_attempts=10):
    proof_state = verifier.initialize(statement)
    for attempt in range(max_attempts):
        # Model proposes next tactic
        tactic = model.generate(
            context=proof_state.goals,      # Current proof obligations
            feedback=proof_state.errors,     # Prior tactic failures
            history=proof_state.trace        # Successful steps so far
        )
        # Verifier checks the tactic deterministically
        result = verifier.apply_tactic(proof_state, tactic)
        if result.success:
            proof_state = result.new_state
            if proof_state.is_complete:
                return proof_state.proof   # Verified proof
        else:
            proof_state.errors.append(result.error)
    return None  # Failed to prove

The verification spectrum: Program-delegated reasoning verifies that computation is correct (the interpreter guarantees 347 × 892 = 309524). Formal verification proves that reasoning is sound (every step follows logically from axioms). The first catches arithmetic errors; the second catches logical errors. Both use code as the verification medium. The paper maps these to a continuous spectrum from "weakly checked" (execute and observe output) to "strongly checked" (machine-verified proof).

Paradigm 3: Iterative Code-Grounded Reasoning

The most powerful paradigm treats reasoning as a closed loop: generate code → execute it → observe the trace → refine. The execution trace itself becomes the reasoning signal.

R1-Code-Interpreter interleaves reasoning and multiple rounds of code execution within persistent interactive sessions. The model reasons in text, writes code to test a hypothesis, observes the result, updates its reasoning, writes more code. Each execution is a checkpoint that grounds the reasoning in reality.

CodePRM learns reward functions over reasoning-execution trajectories, using execution outcomes to evaluate and refine intermediate reasoning steps. The harness doesn't just execute code — it scores the reasoning quality based on what the execution reveals. This is a process reward model (PRM) for code, where the reward signal comes from deterministic execution rather than human labeling.

Why this matters: In purely textual reasoning, the model can confidently produce a wrong answer and never know it. In code-grounded reasoning, the interpreter acts as a reality check at every step. The model can't hallucinate 347 × 892 = 310,000 because the interpreter will disagree. This is the fundamental advantage of executable reasoning: the harness has deterministic sensors for detecting errors.

Data Flow: The Three Paradigms Side-by-Side

Let's trace how a single problem flows through each paradigm. The problem: "What is the probability of drawing two aces from a standard deck without replacement?"
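
For the program-delegated paradigm, the trace might look like the sketch below. The variable names are illustrative, and the exhaustive enumeration is added here as an independent check rather than taken from the paper.

```python
from fractions import Fraction
from itertools import combinations

# Closed form: P(first card is an ace) * P(second is an ace | first was an ace).
closed_form = Fraction(4, 52) * Fraction(3, 51)

# Independent check: enumerate every unordered two-card hand.
deck = [(rank, suit) for rank in range(1, 14) for suit in "SHDC"]  # rank 1 = ace
hands = list(combinations(deck, 2))
ace_hands = [hand for hand in hands if all(rank == 1 for rank, _ in hand)]
enumerated = Fraction(len(ace_hands), len(hands))

print(closed_form, enumerated)
```

Both routes give 1/221. The agreement between the formula and the brute-force count is the kind of execution-based verification signal that pure textual reasoning lacks.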

[Interactive figure: Code-for-Reasoning: Three Paradigms, comparing each paradigm's data flow, verification signal, and failure mode.]

What is the fundamental advantage of code-grounded reasoning over pure chain-of-thought?

Chapter 3: Code for Acting

A robot needs to pick up a cup. An agent needs to click "Submit" on a web form. A coding assistant needs to edit line 47 of auth.py. In each case, the model's internal reasoning must be translated into an executable operation that changes the state of the world. Code is that translation layer.

The central challenge is grounding: mapping abstract language outputs into executable behaviors that respect the constraints of the target environment — physical limits, API schemas, safety requirements, timing constraints.

Grounded Skill Selection

The simplest form: the model selects from a library of pre-defined executable skills. SayCan (Google, 2022) pioneered this by combining an LLM's semantic knowledge with an affordance model that scores which actions are physically possible right now.

The actual computation is a product of two scores:

$$a^* = \arg\max_{a \in A} \; P_{\text{LLM}}(a \mid \text{instruction}, \text{history}) \times P_{\text{affordance}}(\text{feasible} \mid a, \text{state})$$

The LLM scores relevance: "pick up mug" is more relevant to "make coffee" than "open fridge". The affordance model scores feasibility: can the robot actually execute this action given its current position, gripper state, and reachable objects?

User Intent
"Make me a coffee"
LLM Scoring
P("pick up mug" | "make coffee") = 0.8
P("open fridge" | "make coffee") = 0.1
×
Affordance Scoring
P(can_execute("pick up mug") | current_state) = 0.9
P(can_execute("open fridge") | current_state) = 0.7
Selected Action
pick_up_mug() — highest combined score: 0.8 × 0.9 = 0.72

The critical engineering decision: the LLM never directly controls the robot. It scores which skill to invoke. The skill itself is a pre-built controller with safety guarantees. This separation — LLM for intent, code for execution — is the harness pattern.
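
The selection rule is small enough to sketch directly. This is an illustration using the scores from the example above; the `select_skill` function and the dictionary layout are assumptions, not SayCan's actual interface.

```python
from typing import Dict

def select_skill(llm_scores: Dict[str, float], affordance_scores: Dict[str, float]) -> str:
    """Return the skill with the highest relevance x feasibility product."""
    combined = {
        # A skill with no affordance estimate is treated as infeasible.
        skill: llm_scores[skill] * affordance_scores.get(skill, 0.0)
        for skill in llm_scores
    }
    return max(combined, key=combined.get)

llm_scores = {"pick up mug": 0.8, "open fridge": 0.1}
affordance_scores = {"pick up mug": 0.9, "open fridge": 0.7}
chosen = select_skill(llm_scores, affordance_scores)
print(chosen)
```

Because the output is only a key into a fixed skill table, the model can never emit an action outside the set the harness has already vetted.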

Programmatic Policy Generation

What if the skill library doesn't have the right skill? Code as Policies (Liang et al., 2023) lets the model write new robot control programs on the fly. The model generates Python code that directly calls perception and control APIs:

# Model-generated robot policy
def pour_into_tallest_cup(robot):
    cups = robot.detect("cup")
    tallest = max(cups, key=lambda c: c.height)
    robot.pick(robot.detect("pitcher")[0])
    robot.pour(tallest.position)

The model writes the program; the robot's API handles the actual motor commands. This is more flexible than skill selection but introduces a new risk: the generated code might be unsafe (pour into a cup that's upside down, move the arm into a collision). The harness must validate generated code before execution.
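
One possible validation step is a static check over the generated program before it runs. The sketch below uses Python's standard `ast` module; the allowlists and the `validate_policy` helper are hypothetical, and a check like this only catches structural violations (forbidden imports or calls), not physical hazards such as an upside-down cup.

```python
import ast

ALLOWED_ROBOT_METHODS = {"detect", "pick", "pour"}  # hypothetical allowlist
ALLOWED_BUILTINS = {"max", "min", "len", "sorted"}

def validate_policy(source: str) -> list[str]:
    """Return a list of violations; an empty list means the static check passed."""
    violations = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.Import, ast.ImportFrom)):
            violations.append(f"line {node.lineno}: imports are not allowed")
        elif isinstance(node, ast.Call):
            func = node.func
            if isinstance(func, ast.Attribute) and func.attr not in ALLOWED_ROBOT_METHODS:
                violations.append(f"line {node.lineno}: method '{func.attr}' is not allowed")
            elif isinstance(func, ast.Name) and func.id not in ALLOWED_BUILTINS:
                violations.append(f"line {node.lineno}: function '{func.id}' is not allowed")
    return violations

safe_policy = """
def pour_into_tallest_cup(robot):
    cups = robot.detect("cup")
    tallest = max(cups, key=lambda c: c.height)
    robot.pick(robot.detect("pitcher")[0])
    robot.pour(tallest.position)
"""
unsafe_policy = "import os\nos.system('rm -rf /tmp/x')\n"

print(validate_policy(safe_policy))
print(validate_policy(unsafe_policy))
```

A real harness would likely pair a static gate like this with runtime checks (simulation, collision detection, or human approval) for the failures that syntax alone cannot reveal.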

Lifelong Code-Based Agents

Voyager (Wang et al., 2023) takes this to the extreme: an agent that plays Minecraft by writing JavaScript programs, storing successful programs in a skill library, and composing them to solve increasingly complex tasks. The skill library grows over time — the agent literally programs itself into greater capability.

The data flow is a self-reinforcing loop:

1. Automatic Curriculum
GPT-4 proposes increasingly difficult tasks based on the agent's current skill level and inventory.
2. Code Generation
Model writes a JavaScript program using the Mineflayer API to accomplish the task.
3. Execution + Self-Verification
Program runs in Minecraft. Success is checked by environment criteria (inventory changes, block placements).
4. Skill Library
Successful programs are stored with descriptions. Future tasks can retrieve and compose stored skills. Library grows from 0 to 70+ skills.
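
The store-and-retrieve half of this loop can be sketched as follows. This is a simplified illustration, not Voyager's code: the `SkillLibrary` class, the skill names, and the keyword-overlap retrieval (a stand-in for similarity search over descriptions) are all assumptions.

```python
from dataclasses import dataclass, field

@dataclass
class SkillLibrary:
    """Stores verified programs keyed by name, with a description for retrieval."""
    skills: dict = field(default_factory=dict)

    def add_if_verified(self, name: str, description: str, source: str, verified: bool) -> bool:
        # Only programs that passed the environment check are kept.
        if verified:
            self.skills[name] = {"description": description, "source": source}
        return verified

    def retrieve(self, task: str, top_k: int = 2) -> list:
        # Keyword overlap stands in for the similarity search a real system would use.
        task_words = set(task.lower().split())
        scored = [
            (len(task_words & set(entry["description"].lower().split())), name)
            for name, entry in self.skills.items()
        ]
        return [name for score, name in sorted(scored, reverse=True)[:top_k] if score > 0]

library = SkillLibrary()
library.add_if_verified("mineWood", "mine a wood log from a tree", "// body omitted", True)
library.add_if_verified("craftTable", "craft a crafting table from wood planks", "// body omitted", True)
library.add_if_verified("fightZombie", "fight a zombie with a sword", "// body omitted", False)

print(library.retrieve("craft a wooden pickaxe from wood"))
```

The failed `fightZombie` attempt never enters the library, so later tasks can only compose programs that already passed verification.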

SkillOS (Ouyang et al., 2026) extends this pattern with RL-trained skill curation, automatically deciding which skills to keep, merge, or discard based on usage patterns and success rates. The skill library isn't just a growing list — it's a managed, evolving codebase.

The key pattern across all three: Code sits between the model and the environment. It's never raw model output → world. It's always model output → code → environment APIs → world. The code layer provides: (1) a structured interface the model can target, (2) a validation point where the harness can check safety, and (3) a reusable artifact that can be stored, versioned, and composed.

GUI/OS Agents: The POMDP Formulation

GUI agents reveal the code-as-action thesis most clearly. A web page is rendered code (HTML/CSS/JS). An action is a code call (click, type, scroll). An evaluation is a code check (did the DOM change correctly?). The entire perception-action-evaluation loop is code.

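The paper formalizes this as a partially observable Markov decision process (POMDP): the agent acts on partial observations of the interface, such as a screenshot or DOM snapshot, rather than on the full underlying state.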

The CodeAct paradigm pushes this further: instead of emitting JSON tool calls, agents emit Python or JavaScript snippets that compose primitives. Cradle uses an LMM to output executable Python that drives keyboard and mouse for any application, including games. Native GUI models like UI-TARS and ShowUI collapse the planner→grounder→executor pipeline into a single VLA model whose output token stream is itself runnable code.

Why does Code-as-Policies generate Python programs rather than directly outputting motor commands?

Chapter 4: Code for Environment

We've seen code as a reasoning substrate and an action interface. Now the third role: code as the environment itself. When a coding agent works on a repository, the repository is the world. Files are the state. Tests are the laws of physics. Git history is the timeline. Execution traces are observations.

Structured World Representations

An agent working on a codebase needs a model of that codebase. Not just the raw files, but the structure: which classes depend on which, what functions call what, where the tests are, what the build system does.

WorldCoder (Tang et al., 2024) takes this idea literally: it builds world models by writing code. The agent interacts with an environment, observes transitions, and writes Python programs that predict what will happen next. The world model is executable — the agent can run it to simulate future states without actually taking actions.

Why code-as-world-model beats neural world models here: A learned neural world model is a black box. You can query it but you can't inspect why it predicts what it does. A code-based world model is transparent: you can read the rules, find bugs in them, and fix them. The agent can literally debug its own understanding of the world.

Execution-Trace World Modeling

Sometimes the world model isn't a program you write from scratch — it's the execution trace of the program you're working on. CWM (Code World Model) trains an LLM to predict code execution behavior: given a program, predict its output, variable states, and control flow without actually running it.

This is world modeling in the most literal sense: the model learns the physics of code execution. When the model can accurately predict what a program will do, it can reason about edits without executing them — a crucial capability for agents working on codebases where tests are slow or running infrastructure is expensive.

The data flow for execution-trace world modeling:

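# CWM: the model predicts execution traces
import contextlib
import io

input_program = """
def fibonacci(n):
    if n <= 1: return n
    return fibonacci(n-1) + fibonacci(n-2)
print(fibonacci(6))
"""

# Model predicts without executing (illustrative values):
predicted_output = "8"
predicted_trace = {
    "call_depth": 6,
    "return_values": [0, 1, 1, 2, 3, 5, 8],
    "total_calls": 25
}

# Verification: actually run and compare.
# exec() returns None, so capture what the program prints instead.
buffer = io.StringIO()
with contextlib.redirect_stdout(buffer):
    exec(input_program, {})
actual_output = buffer.getvalue().strip()  # "8"
assert actual_output == predicted_output   # prediction matches execution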

Code-Grounded Evaluation Environments

Benchmarks for agent evaluation are themselves code-as-environment artifacts. SWE-Bench defines a task as: given an issue description and a repository snapshot, produce a patch that makes the failing tests pass. The repository is the environment. Tests are the reward function. Everything is executable.

More sophisticated environments like WebArena and OSWorld instantiate full browser or operating system environments where agent actions (code) produce observable state changes (code execution) that are evaluated by scripts (more code). The entire loop is code.

The environment trilogy: (1) The environment state is code (repository files, DOM tree, OS state). (2) Agent actions are code (shell commands, API calls, UI actions). (3) Evaluation is code (test scripts, assertion checkers, state validators). When all three are code, the harness can make the entire agent-environment interaction inspectable, replayable, and verifiable.

Verifiable Environment Construction

Endless Terminals (Gandhi et al., 2026) takes the logical next step: using code to generate evaluation environments. Instead of hand-crafting benchmarks, it uses programs to procedurally generate terminal-based tasks with verifiable solutions. The environment, the task, and the evaluator are all synthesized from code.

This has profound scaling implications. Traditional benchmarks require human labor per task. Code-generated benchmarks can produce unlimited tasks with built-in verification. The trade-off: generated tasks may lack the semantic richness of real-world issues.
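
A toy version of the idea, to make "built-in verification" concrete. This sketch is not the Endless Terminals generator: the task family (counting ERROR lines in a log), the `generate_task` and `verify` helpers, and the file naming are all assumptions made for illustration.

```python
import random
import tempfile
from pathlib import Path

def generate_task(seed: int, workdir: Path) -> dict:
    """Create a log file and a task whose correct answer is known by construction."""
    rng = random.Random(seed)
    levels = [rng.choice(["INFO", "WARN", "ERROR"]) for _ in range(rng.randint(20, 40))]
    log_path = workdir / f"app_{seed}.log"
    log_path.write_text("".join(f"{level} event {i}\n" for i, level in enumerate(levels)))
    return {
        "prompt": f"Count the lines in {log_path.name} that start with ERROR.",
        "log_path": log_path,
        "expected": levels.count("ERROR"),
    }

def verify(task: dict, agent_answer: str) -> bool:
    """Evaluator: compare the agent's answer with the answer fixed at generation time."""
    return agent_answer.strip() == str(task["expected"])

with tempfile.TemporaryDirectory() as tmp:
    task = generate_task(seed=7, workdir=Path(tmp))
    # Stand-in for an agent: solve the task by reading the file directly.
    solved = sum(line.startswith("ERROR") for line in task["log_path"].read_text().splitlines())
    print(task["prompt"], verify(task, str(solved)))
```

Changing the seed yields a new task with a new ground truth at no human cost, which is the scaling property described above. It also shows the limitation: every task in this family exercises the same narrow mechanic.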

Traditional Benchmark
Human writes 500 tasks + solutions + evaluation scripts. Expensive, limited scale, static. Cannot adapt to new domains without more human effort.
↓ vs ↓
Code-Generated Benchmark
Program generates tasks + solutions + evaluators procedurally. Cheap, unlimited scale, dynamic. Verification is built into the generation process. Risk: may test tool-use mechanics rather than genuine problem-solving.

What makes a code-based world model more useful for debugging than a neural world model?

Chapter 5: Planning for Agent Harness

Real-world software tasks are never one-shot. You don't just write a function — you read the issue, find the relevant files, understand the dependencies, plan an approach, implement it, test it, fix the tests, and iterate. Planning is the harness mechanism that organizes this long-horizon process.

The paper makes a critical framing move: planning is not merely an internal reasoning capability of the LLM. It's a form of harness control — it structures how the agent externalizes intent into executable steps, schedules interactions with code artifacts and tools, and regulates the trajectory of reasoning, execution, and revision over time.

1. Linear Decomposition

The simplest approach: break the task into a numbered list of steps, then execute them in order. Self-Planning has the model first produce a step-by-step plan in text, then generate code step by step under the plan's guidance. ReAct (Yao et al., 2022) is a lightweight precursor: interleave thoughts, actions, and observations in a serial trajectory where each reasoning step constrains the next action.

The 2026 evolution is significant: plans are no longer ephemeral prompt artifacts. They are persistent filesystem objects (PLAN.md, Implement.md, status.log) that can be reviewed by humans, versioned with Git, consumed by subagents, and reloaded across context resets. The plan becomes a harness-backed control object, not just a reasoning trace.

Limitation: commits to a single trajectory. When the initial plan is wrong, recovery is limited to patching, not rethinking.

2. Structure-Grounded Planning

Instead of planning from free-form text, ground the plan in the structure of the task environment. CodePlan constructs a dependency graph over files and derives edit obligations from change-impact analysis. The plan follows the dependency structure, not the model's intuition.

Key insight: the structure of the codebase (what depends on what) tells you the order of operations. Edit the dependency before the dependent. Run the tests for the edited module before moving on. The plan emerges from the program structure itself.
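
Deriving an edit order from a dependency graph is a topological sort. The sketch below uses Python's standard `graphlib` module; the file names and the `depends_on` graph are hypothetical, and this illustrates the ordering principle rather than CodePlan's change-impact analysis.

```python
from graphlib import TopologicalSorter

# Hypothetical module graph: each key depends on the modules in its set.
depends_on = {
    "auth/session.py": {"auth/tokens.py"},
    "api/login.py": {"auth/session.py", "auth/tokens.py"},
    "tests/test_login.py": {"api/login.py"},
    "auth/tokens.py": set(),
}

def edit_order(graph: dict) -> list:
    """Return files so that every dependency precedes its dependents."""
    return list(TopologicalSorter(graph).static_order())

plan = edit_order(depends_on)
print(plan)
```

For this graph the order is fully determined: tokens, then session, then login, then the test file. A cyclic dependency would raise `graphlib.CycleError`, which is itself a useful planning signal.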

Specialized domains follow the same pattern. VerilogCoder grounds subtask planning in a Task and Circuit Relation Graph where each subtask is enriched with signals, transitions, and examples. DomAgent combines top-down knowledge graphs with bottom-up examples for domain-specific code generation.

3. Search-Based Planning

Don't commit to one plan — explore multiple alternatives and use feedback to select the best. The survey identifies three levels of search:

Meta-Harness pushes search to the harness level itself: it searches over harness code by giving an agent access to prior source code, scores, and execution traces through a filesystem. The search isn't just over solutions — it's over the infrastructure that generates solutions.

4. Orchestration-Based Planning

Planning isn't done by one agent deciding what to do next — it's an emergent property of how multiple modules coordinate. Three patterns:

  1. Feedback-centered: Distribute coding, testing, analysis, and repair across modules. Planning emerges from repeated execution-grounded feedback. AgentCoder's programmer/tester/executor loop.
  2. Staged workflow: Structure code generation as a software-process pipeline: comprehension → retrieval → planning → coding → debugging → repair. MapCoder and Blueprint2Code.
  3. Controller-centric: Planning is embedded in the routing substrate. Natural-Language Agent Harnesses write harness logic as editable natural language executed by an Intelligent Harness Runtime (IHR) that converts high-level instructions into constrained execution steps.

[Interactive figure: Planning Paradigms: Side-by-Side Comparison, showing how the same task ("Fix authentication bug #347") is planned under each approach and how the execution path, branching, and feedback differ.]

The planning-evaluation trap: The survey makes a crucial and often-ignored point: planning conclusions depend heavily on the evaluation environment. If tests are weak, revision budgets unrealistic, or benchmarks don't stress long-range dependencies, then reported planning gains may not reflect genuine improvements. Many current conclusions about planning depend on surrounding conditions: execution environments, feedback quality, tool access, trajectory budgets, and whether the benchmark truly stresses multi-step coordination rather than localized patch generation. Planning is a harness problem between agent and environment, not just a model capability.

| Method | Category | Core Mechanism | Feedback |
|---|---|---|---|
| Self-Planning | Linear | Stepwise decomposition | None |
| WebAgent | Linear | Sub-instruction sequencing | Runtime exception |
| CodePlan | Structure | Plan graph from dependencies | Critique |
| VerilogCoder | Structure | Task-circuit relation graph | Test pass/fail |
| ReThinkMCTS | Search | MCTS over reasoning paths | Critique + tests |
| CodeTree | Search | Trajectory tree search | Test pass/fail |
| MapCoder | Orchestration | Role orchestration | Critique + tests |
| Blueprint2Code | Orchestration | Blueprint-to-code pipeline | Critique |

Why does structure-grounded planning often outperform linear decomposition for repository-level tasks?

Chapter 6: Memory & Context Engineering

A coding agent is debugging a complex issue. It's been working for 200 turns. It found the root cause 150 turns ago, explored three fix strategies, identified the right one 50 turns ago, and is now implementing the fix. But its context window only holds the last 30 turns. How does it remember the root cause? The rejected strategies? The rationale for the chosen approach?

This is the memory problem. Real software engineering is inherently long-horizon and state-intensive. Memory isn't a nice-to-have — it's infrastructure.

Memory is NOT just a bigger context window. The paper is explicit: memory is a state-management layer that decides which information stays in the active model context, which gets compacted into summaries, and which gets offloaded to durable external storage. Without effective memory, an agent loses critical clues during long-range reasoning, repeats completed searches, or breaks local consistency established earlier during later modifications. The wrong decision at any of these three boundaries degrades agent performance.
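
The three boundaries can be made concrete with the 200-turn scenario above. This is a deliberately simple sketch: the `Turn` record, the `pinned` flag, and the truncation used as a stand-in for summarization are assumptions, not a mechanism from the paper.

```python
from dataclasses import dataclass

@dataclass
class Turn:
    index: int
    text: str
    pinned: bool = False  # marks findings that must survive outside the window

def partition_memory(turns: list, window: int = 30, summary_chars: int = 80) -> dict:
    """Split history into active context, compacted summaries, and offloaded turns."""
    recent, older = turns[-window:], turns[:-window]
    return {
        "active": [t.text for t in recent],  # stays verbatim in the prompt
        "compacted": [f"[turn {t.index}] {t.text[:summary_chars]}" for t in older if t.pinned],
        "offloaded": [t.index for t in older if not t.pinned],  # retrievable on demand
    }

turns = [Turn(i, f"turn {i} notes") for i in range(1, 201)]
turns[49].pinned = True    # turn 50: root cause found 150 turns ago
turns[149].pinned = True   # turn 150: fix strategy chosen 50 turns ago

memory = partition_memory(turns)
print(len(memory["active"]), memory["compacted"], len(memory["offloaded"]))
```

The interesting design question is who sets `pinned`: a rule, the model itself, or a learned policy. A mistake there loses the root cause just as surely as a small context window would.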

The Six Memory Components

1. Working Memory — What the agent needs for its next action. Structured prompt regions, state summaries, failed-test records, file lists, critical stack information. SWE-agent and RepairAgent show that even with the same model, repository-level performance varies substantially based on how working memory is organized. CodeMem treats context as a managed resource with budgeted memory slots to stabilize multi-step edits.

2. Semantic Memory — Task-relevant evidence from the external codebase. Not "recall what I learned" but "find what I need." AutoCodeRover retrieves class definitions, function implementations, and call relations aligned with program structure. RepoCoder iteratively rewrites queries to improve retrieval. CodeRAG uses multi-path retrieval with reranking. The key insight: retrieve evidence aligned with program structure (AST-based chunking, call graphs), not just text similarity.

3. Experiential Memory — Reusable experience from past tasks. MemGovern stores repair trajectories, failure cases, and strategy patterns — but with quality controls. ExpeL reuses reflections as task-solving strategies. The paper makes a strong claim: ungoverned experience introduces noise, error propagation, and false retrievals. Governed experience — curated and validated — enables cross-task transfer. The quality of stored experience matters more than its scale.

4. Long-Term Memory — Knowledge that persists across sessions. MemCoder stores validated commits and human-approved solutions. TALM incorporates long-term memory into multi-agent generation, retrieving prior traces and consolidating overlapping memories. The focus shifts from what to store to when to write, when to compress, when to retrieve, and how to avoid contamination. Memory without governance becomes a burden that amplifies noise, staleness, and error.

5. Multi-Agent Memory — Shared state across agent roles. In multi-agent systems, memory isn't just individual storage — it's a coordination medium. MIRIX routes shared memory across specialized roles. ChatDev maintains context across role-based phases. The challenge: controlling sharing granularity, preventing information flooding, and supporting bidirectional access between high-level plans and fine-grained traces.

6. Context Compaction & State Offloading — The cross-cutting mechanism. Long-horizon tasks generate massive artifacts: build logs, execution traces, repo diffs. A single SWE-Bench evaluation can produce millions of tokens of diagnostic information. Three compression strategies:

Context Compaction
LongCodeZip: coarse-to-fine compression preserving reasoning cues. SWE-Pruner: task-aware pruning of irrelevant context. SWEZZE: lightweight learned compression distilling fix-relevant evidence.
State Offloading
Store full-fidelity artifacts outside the active window. Agent receives compact summaries + resource identifiers (file paths, MCP URIs) to retrieve full data on demand.
Retrieval Handles
Instead of "here's the full 50KB test log," give: "test_auth.py failed at line 47: AssertionError: expected 200, got 403. Full log at /traces/run_42/test_output.log."
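State offloading with a retrieval handle can be sketched in a few lines. This is a minimal illustration, not a method from the survey: `offload_log` and `fetch` are hypothetical helpers, a temporary directory stands in for the trace store, and the "first line that mentions an error" summary heuristic is an assumption.

```python
import tempfile
from pathlib import Path


def offload_log(log_text: str, store_dir: Path, run_id: str) -> dict:
    """Write the full log to disk and return a compact handle for the context window.

    Hypothetical helper: the summary heuristic is for illustration only.
    """
    store_dir.mkdir(parents=True, exist_ok=True)
    path = store_dir / f"{run_id}_test_output.log"
    path.write_text(log_text, encoding="utf-8")

    lines = log_text.splitlines()
    # Keep the first line that looks like a failure; fall back to the last line.
    summary = next(
        (ln for ln in lines if "Error" in ln or "FAILED" in ln),
        lines[-1] if lines else "",
    )
    return {
        "summary": summary,
        "handle": str(path),
        "bytes": len(log_text.encode("utf-8")),
    }


def fetch(handle: str) -> str:
    """Retrieve the full-fidelity artifact on demand."""
    return Path(handle).read_text(encoding="utf-8")


full_log = "\n".join(
    ["collecting tests ..."] * 500
    + ["test_auth.py:47: AssertionError: expected 200, got 403"]
)
store = Path(tempfile.mkdtemp())
handle = offload_log(full_log, store, "run_42")
```

Only `handle["summary"]` and `handle["handle"]` need to enter the active window; the agent calls `fetch` when it actually needs the full log.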
Method | Memory Type | Managed State | Harness Operation
SWE-agent | Working | Repair trajectory | Structured state tracking
CodeMem | Working | Edit state + slots | Budgeted slot management
AutoCodeRover | Semantic | Repo structure | Structure-aware retrieval
RepoCoder | Semantic | Repo context | Iterative query rewriting
MemGovern | Experiential | Trajectories + critiques | Governed experience replay
MemCoder | Long-term | Commits + fixes | Structured memory + self-internalization
MIRIX | Multi-agent | Cross-agent state | Cross-agent memory routing
LongCodeZip | Compaction | Long code context | Coarse-to-fine compression
The meta-insight: Memory in code agents has a unique advantage over memory in general-purpose agents: stored items (functions, tests, traces, commits) are themselves executable. You can re-run a stored test to verify it's still valid. You can re-execute a stored fix to check it still applies. Memory items can be verified, not just recalled. This narrows the gap between memory and environment.
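The idea that memory items can be verified, not just recalled, might look like the sketch below. It assumes a hypothetical `MemoryItem` schema in which each stored experience carries an executable check, and a toy dictionary stands in for the repository; none of these names come from the paper or from any of the systems it surveys.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class MemoryItem:
    """A stored experience plus an executable check (hypothetical schema)."""
    description: str
    check: Callable[[], bool]  # re-runs the stored test or fix against current state


class VerifiedMemory:
    def __init__(self) -> None:
        self._items: list[MemoryItem] = []

    def write(self, item: MemoryItem) -> bool:
        # Governance on write: only admit items whose check passes right now.
        if self._passes(item):
            self._items.append(item)
            return True
        return False

    def recall(self) -> list[MemoryItem]:
        # Governance on read: re-execute each check and drop stale items.
        self._items = [it for it in self._items if self._passes(it)]
        return list(self._items)

    @staticmethod
    def _passes(item: MemoryItem) -> bool:
        try:
            return bool(item.check())
        except Exception:
            return False  # a crashing check counts as a failed verification


# Toy "repository state" standing in for real code under test.
repo = {"status_for_anonymous": 403}
memory = VerifiedMemory()
memory.write(MemoryItem("anonymous users get 403",
                        lambda: repo["status_for_anonymous"] == 403))
# This check raises KeyError, so the item is rejected at write time.
memory.write(MemoryItem("admin route exists",
                        lambda: repo["admin_route"] is not None))

fresh = [it.description for it in memory.recall()]
repo["status_for_anonymous"] = 401  # the codebase changes; the stored item goes stale
stale = [it.description for it in memory.recall()]
```

Re-running every check on every recall would be expensive in a real harness; when to re-verify is exactly the kind of governance decision the chapter describes.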
Why is ungoverned experiential memory (storing everything) often worse than governed experiential memory (curated storage)?

Chapter 7: Tools & Control Loops

An agent that can reason, act, remember, and plan still needs one more thing: governed access to the external world. Tools are the action/observation layer — and the control loop determines how tool results feed back into the next action.

Four Types of Tool Use

1. Function-Oriented: Tools that fill gaps in the model's knowledge. ToolCoder integrates API search tools into code generation — the model decides when it needs documentation and queries for it. Key challenge: query formulation and result selection from long-tail APIs and private libraries.

2. Environment-Interaction: Tools for operating inside the development environment. SWE-agent formalizes the agent-computer interface: shell commands, file editing, test execution. CodeAgent navigates repositories, inspects dependencies, and validates through testing. These aren't knowledge tools — they're hands.

3. Verification-Driven: Tools as deterministic sensors. Tests, compilers, linters, type checkers, static analyzers — these don't help the agent do things, they help the harness verify things. AgentCoder's three-agent loop (programmer, test designer, executor) exemplifies this: the tool's job is feedback, not action.

4. Workflow-Orchestration: Tools that coordinate other tools. Tool schemas, sessions, guardrails, handoffs, sandboxes. ToolNet learns a multi-tool routing policy. MapCoder assigns agents to recall, planning, coding, and debugging stages, each with its own tool set.

The lifecycle hook pattern: Modern harnesses don't just call tools — they govern tool lifecycle. Pre-use hooks validate arguments, check permissions, block risky commands (no rm -rf /, no credential exfiltration). Post-use hooks sanitize outputs, compact logs, update memory, trigger verification. This turns tool use from raw model action into a monitored state transition with audit trails. The hook pattern is what separates "a model calling an API" from "a governed agent operating through a harness."
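A minimal sketch of the hook pattern, under stated assumptions: `governed_call`, `pre_use`, `post_use`, and the deny-list are hypothetical names, and substring matching is only a stand-in for real argument validation (it is not a security boundary on its own).

```python
audit_log: list[dict] = []

# Illustrative deny-list only; a real harness would parse and validate commands.
BLOCKED_SUBSTRINGS = ("rm -rf /", "cat ~/.ssh")


class ToolBlocked(Exception):
    """Raised by the pre-use hook when a call must not run."""


def pre_use(tool_name: str, args: dict) -> None:
    """Pre-use hook: validate arguments and block risky commands."""
    command = str(args.get("command", ""))
    for bad in BLOCKED_SUBSTRINGS:
        if bad in command:
            raise ToolBlocked(f"{tool_name}: blocked pattern {bad!r}")


def post_use(output: str, max_chars: int = 200) -> str:
    """Post-use hook: compact long output before it reaches the model."""
    if len(output) > max_chars:
        return output[:max_chars] + f"... [{len(output) - max_chars} chars truncated]"
    return output


def governed_call(tool_name: str, tool, args: dict) -> str:
    """Run a tool as a monitored state transition with an audit entry."""
    entry = {"tool": tool_name, "args": args, "status": "ok"}
    try:
        pre_use(tool_name, args)
        result = post_use(tool(**args))
    except ToolBlocked as exc:
        entry["status"] = "blocked"
        result = str(exc)
    audit_log.append(entry)
    return result


def fake_shell(command: str) -> str:
    # Stand-in for a real shell tool; returns a long, noisy output.
    return f"ran: {command}\n" + "x" * 500


ok = governed_call("shell", fake_shell, {"command": "pytest -q"})
blocked = governed_call("shell", fake_shell, {"command": "rm -rf / --no-preserve-root"})
```

The model only ever sees the return value of `governed_call`; the audit log records both the allowed and the blocked attempt.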

The Plan-Execute-Verify (PEV) Loop

All of these mechanisms — planning, memory, tools, verification — come together in a single control loop. The survey calls it PEV: Plan-Execute-Verify. This is the paper's core formalization of harness-level control.

The paper develops PEV from a specific observation: traditional debugging research treats code repair as "generate patch → check tests." But in production agent systems, the control problem is much broader. The harness must handle three phases:

PEV Phase 1: Planning as Contract Formation

The plan isn't a vague intent — it's an explicit contract: which files to edit, what invariants to maintain, what validation commands to run, what rollback points exist. This is a harness artifact, not just a reasoning trace. In industrial systems, these contracts are persisted as filesystem objects (PLAN.md) that can be reviewed by humans before execution begins.
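As a sketch of what a plan-as-contract could look like in code, assuming a hypothetical `PlanContract` schema whose fields mirror the items listed above (the file path and commands are made up for illustration):

```python
from dataclasses import dataclass, field


@dataclass
class PlanContract:
    """Hypothetical plan schema; field names mirror the contract items in the text."""
    goal: str
    files_to_edit: list[str]
    invariants: list[str]
    validation_commands: list[str]
    rollback_points: list[str] = field(default_factory=list)

    def to_markdown(self) -> str:
        def section(title: str, items: list[str]) -> str:
            body = "\n".join(f"- {item}" for item in items) or "- (none)"
            return f"## {title}\n{body}"

        return "\n\n".join([
            f"# PLAN: {self.goal}",
            section("Files to edit", self.files_to_edit),
            section("Invariants", self.invariants),
            section("Validation commands", self.validation_commands),
            section("Rollback points", self.rollback_points),
        ]) + "\n"

    def allows_edit(self, path: str) -> bool:
        # The harness can reject any edit outside the agreed scope.
        return path in self.files_to_edit


plan = PlanContract(
    goal="Fix authentication bug #347",
    files_to_edit=["auth/session.py"],  # hypothetical path
    invariants=["public API of auth.login() is unchanged"],
    validation_commands=["pytest tests/test_auth.py"],
)
plan_md = plan.to_markdown()  # a harness would persist this as PLAN.md for review
```

Because the contract is data rather than free text, the harness can enforce it mechanically: `plan.allows_edit(path)` becomes a check it runs before every file write.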

PEV Phase 2: Sandboxed Execution with Tiered Permissions

The plan is executed inside a sandboxed, permissioned environment. The sandbox provides an isolated filesystem, shell, and runtime, plus resource boundaries. Permissions are tiered: read-only, sandbox-edit, and full-access.
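A tiered permission check can be sketched as an ordered enum. The three tier names (read-only, sandbox-edit, full-access) are the ones this article uses; the action names and the action-to-tier mapping are assumptions made up for the example.

```python
from enum import IntEnum


class Tier(IntEnum):
    READ_ONLY = 0
    SANDBOX_EDIT = 1
    FULL_ACCESS = 2


# Hypothetical mapping from harness actions to the minimum tier they require.
REQUIRED_TIER = {
    "read_file": Tier.READ_ONLY,
    "edit_file": Tier.SANDBOX_EDIT,
    "run_tests": Tier.SANDBOX_EDIT,
    "git_push": Tier.FULL_ACCESS,
}


def authorize(action: str, granted: Tier) -> str:
    """Return "allow" if the granted tier covers the action, else "escalate"."""
    # Unknown actions default to the highest tier (fail closed).
    required = REQUIRED_TIER.get(action, Tier.FULL_ACCESS)
    return "allow" if granted >= required else "escalate"


decisions = {
    action: authorize(action, Tier.SANDBOX_EDIT)
    for action in ("read_file", "edit_file", "git_push", "drop_database")
}
```

With a sandbox-edit grant, reads and edits are allowed while `git_push` and the unrecognized `drop_database` are escalated to a human.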

PEV Phase 3: Verification through Deterministic Sensors

The harness uses deterministic sensors — compilers, linters, test suites, static analyzers — to compare the new state against constraints. If verification fails, the same sensor evidence determines the next action: diagnose, retrieve context, regenerate, route to another agent, or escalate to a human.
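One way to picture "the same sensor evidence determines the next action" is a routing function over sensor results. The policy table below is a hypothetical illustration; the paper lists the possible next actions but does not prescribe which sensor maps to which action.

```python
from dataclasses import dataclass


@dataclass
class SensorResult:
    sensor: str        # e.g. "compiler", "linter", "tests", "static_analyzer"
    passed: bool
    evidence: str = ""


# Hypothetical routing policy: which follow-up each failing sensor triggers.
NEXT_ACTION = {
    "compiler": "regenerate",
    "linter": "regenerate",
    "tests": "diagnose",
    "static_analyzer": "escalate_to_human",
}


def next_action(results: list[SensorResult], attempts: int, max_attempts: int = 3) -> str:
    """Choose the next harness action from deterministic sensor evidence."""
    failures = [r for r in results if not r.passed]
    if not failures:
        return "accept"
    if attempts >= max_attempts:
        return "escalate_to_human"  # retry budget exhausted
    # Route on the first failing sensor; unknown sensors trigger more retrieval.
    return NEXT_ACTION.get(failures[0].sensor, "retrieve_context")


results = [
    SensorResult("compiler", True),
    SensorResult("tests", False, "test_auth.py: expected 200, got 403"),
]
action = next_action(results, attempts=1)
```

Here the compiler passes but the tests fail, so the harness routes to diagnosis rather than blindly regenerating the patch.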

The PEV Control Loop

Watch the agent loop through plan-execute-verify cycles. Click "Step" to advance one cycle, or "Run" for continuous execution. Notice how verification failures trigger re-planning.

The cybernetic governor: The harness acts as a cybernetic governor — a control layer that observes effects of agent actions and regulates subsequent state transitions. It doesn't just forward error messages to the model. It observes through deterministic sensors, decides whether to continue/revise/escalate, and enforces permission boundaries. The agent proposes; the harness disposes. This framing elevates debugging from "fix the code" to "control the agent-environment interaction loop."
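The governor idea can be sketched as a bounded control loop. This is a toy under stated assumptions: `pev_loop` and the three callables are hypothetical, the "sandbox" is a plain function call, and the retry budget of three cycles is arbitrary.

```python
from typing import Any, Callable, Optional, Tuple


def pev_loop(
    plan: Callable[[Optional[str]], Any],
    execute: Callable[[Any], Any],
    verify: Callable[[Any], Tuple[bool, Optional[str]]],
    max_cycles: int = 3,
) -> dict:
    """Run plan -> execute -> verify until verification passes or the budget runs out."""
    feedback: Optional[str] = None
    history = []
    for cycle in range(1, max_cycles + 1):
        current_plan = plan(feedback)   # re-plan using the last sensor evidence
        state = execute(current_plan)   # would run inside a sandbox in practice
        ok, feedback = verify(state)    # deterministic check of the new state
        history.append((cycle, current_plan, ok))
        if ok:
            return {"status": "verified", "cycles": cycle, "history": history}
    # The harness, not the model, decides to stop and hand off to a human.
    return {"status": "escalated", "cycles": max_cycles, "history": history}


# Toy agent: the first patch fails verification, the revised patch passes.
def make_plan(feedback: Optional[str]) -> str:
    return "patch_v1" if feedback is None else "patch_v2"


def run_in_sandbox(plan_name: str) -> dict:
    return {"patch": plan_name}


def run_sensors(state: dict) -> Tuple[bool, Optional[str]]:
    if state["patch"] == "patch_v2":
        return True, None
    return False, "test_auth.py: expected 200, got 403"


result = pev_loop(make_plan, run_in_sandbox, run_sensors)
```

The agent proposes through `plan`; the loop itself decides whether to continue, revise, or escalate.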
Why do modern harnesses use tiered permissions (read-only, sandbox-edit, full-access) rather than giving agents unrestricted access?

Chapter 8: Agentic Harness Engineering

Here's the most meta idea in the entire survey: using agents to measure and improve the harness itself. The paper calls this Agentic Harness Engineering (AHE). The key observation: many observed agent failures arise from the harness, not the model. Missing context, brittle tool interfaces, weak validators, excessive token cost, poor retry policies.

Deep Telemetry as the Optimization Substrate

AHE starts with observability. The harness instruments every interaction to produce detailed telemetry: prompts, tool calls, costs, failures, and latencies.

This telemetry is the raw data from which AHE diagnoses harness problems. Without it, you're debugging blind.
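To make "diagnosing from telemetry" concrete, here is a small sketch that aggregates tool-call events into per-tool failure rates. The `ToolEvent` fields and the sample events are invented for illustration; real harness telemetry would carry far more detail.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class ToolEvent:
    """One instrumented tool call (hypothetical schema)."""
    task_id: str
    tool: str
    ok: bool
    latency_s: float
    tokens: int


def failure_rate_by_tool(events: list[ToolEvent]) -> dict[str, float]:
    calls = Counter(e.tool for e in events)
    failures = Counter(e.tool for e in events if not e.ok)
    return {tool: failures[tool] / n for tool, n in calls.items()}


def weakest_tool(events: list[ToolEvent]) -> str:
    """Return the tool with the highest failure rate, a candidate for harness repair."""
    rates = failure_rate_by_tool(events)
    return max(rates, key=rates.get)


events = [
    ToolEvent("t1", "file_search", False, 0.8, 1200),
    ToolEvent("t1", "file_search", False, 0.7, 1100),
    ToolEvent("t2", "file_search", True, 0.6, 900),
    ToolEvent("t1", "run_tests", True, 12.0, 300),
    ToolEvent("t2", "run_tests", True, 11.5, 280),
]
rates = failure_rate_by_tool(events)
suspect = weakest_tool(events)
```

A high failure rate does not prove the tool is the root cause, but it tells you where to look first.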

The Evolution Agent

An Evolution Agent operates on the telemetry to propose harness improvements. The workflow:

1. Collect
Gather deep telemetry from agent runs: prompts, tool calls, costs, failures, latencies across many task instances.
2. Diagnose
Identify which harness component is the bottleneck. Is it context packing? Tool descriptions? Retry policy? Permission gates?
3. Propose
Generate candidate harness mutations: rewrite tool descriptions, change context packing order, add a new validator, adjust retry budget.
4. Evaluate
Run the mutated harness on held-out task instances. Compare telemetry before and after. Does it solve more tasks? Faster? Cheaper?
5. Promote or Revert
If the mutation passes regression tests AND improves metrics, promote it. Otherwise, revert. All mutations are governed by the same PEV loop.

Governed Harness Mutation

The Evolution Agent is itself subject to the PEV loop — it plans a harness mutation, executes it in isolation, verifies via telemetry, and escalates risky changes to humans. This prevents the self-modification problem: you don't want an agent modifying its own safety boundaries without oversight.

The paper draws a parallel to AlphaEvolve (Novikov et al., 2025), which applies evolutionary coding to program optimization. AHE extends this principle: the harness is a program, and it can be optimized by the same evolutionary search methods that optimize other programs.

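# Agentic Harness Engineering: Evolution Agent sketch.
# The three injected callables and EvalResult are placeholders for
# system-specific logic; the paper does not define their interfaces.
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class EvalResult:
    improved: bool     # metrics beat the unmodified harness
    regressions: int   # previously solved tasks that now fail


class EvolutionAgent:
    def __init__(self, analyze_telemetry: Callable, propose_fix: Callable,
                 evaluate: Callable[..., EvalResult]):
        self.analyze_telemetry = analyze_telemetry
        self.propose_fix = propose_fix
        self.evaluate = evaluate

    def evolve(self, harness: Any, telemetry: Any, eval_tasks: Any) -> Any:
        # Diagnose: identify the weakest harness component
        diagnosis = self.analyze_telemetry(telemetry)
        # e.g., "tool_description for file_search is ambiguous,
        #         causing 34% of localization failures"

        # Propose: generate candidate mutation
        mutation = self.propose_fix(diagnosis, harness)
        # e.g., rewrite file_search tool description with examples

        # Evaluate: run mutated harness on held-out tasks.
        # apply() is assumed to return a new harness and leave the original intact.
        mutated_harness = harness.apply(mutation)
        results = self.evaluate(mutated_harness, eval_tasks)

        # Promote only if regression-free improvement
        if results.improved and not results.regressions:
            return mutated_harness
        return harness  # Revert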
What the paper leaves open: AHE is presented as a promising direction, but the paper acknowledges that regression-free improvement is largely unsolved. When the Evolution Agent modifies the harness, how do you know it didn't break something else? Current systems evaluate on held-out tasks, but the space of possible regressions is vast, and the system being changed is the same one that runs the tests. The paper flags this as one of its six open problems.
Why is Agentic Harness Engineering (AHE) described as "optimizing the optimizer"?

Chapter 9: Multi-Agent Orchestration

One agent, one task, one context window. This works for function-level edits. But for repository-level engineering — where you need to understand 100 files, write code, generate tests, review for security, and debug failures — a single agent hits three walls:

  1. Context limit: Can't hold the entire codebase + interaction history + execution traces in memory
  2. Specialization: One generalist agent for planning, synthesis, testing, review, and debugging is inefficient
  3. Self-correction: Without independent verification channels, the agent can't reliably detect its own errors

Functional Role Specialization

The idea mirrors human software teams: different agents for different jobs. The paper identifies six canonical roles: manager, coder, tester, reviewer, executor, and debugger.

Interaction Modes

Harness-State Convergence

The paper introduces a formal concept of harness-state convergence: multiple agents working toward a shared code state that satisfies correctness, security, performance, and consensus criteria simultaneously. Three convergence types:

Test-Verified Convergence
The shared code state must pass a deterministic test suite. Strongest signal — objective, reproducible, but limited to what tests cover. Used by AgentCoder, SWE-agent.
Consensus Convergence
Multiple reviewer agents vote or agree. CANDOR uses majority voting among three panelists. QualityFlow uses a single quality checker as the gating signal — 75-84% of problems converge after the first generator call.
Implicit Convergence (Warning!)
Pipeline terminates after a fixed number of stages or iterations with no objective quality criterion. ChatDev terminates after fixed phases. MetaGPT completes fixed SOP stages. The most prevalent pattern in the literature and the most significant gap. Without an objective representation of program state, systems have no principled criterion for convergence.
The synchronization problem: When multiple agents edit the same codebase concurrently, one agent's edit can silently invalidate another agent's assumptions. SyncMind formalizes this as belief-state divergence — each agent maintains its own model of the world, and these models can drift apart. Current best step-level attribution accuracy: 14-53%. We can't even reliably tell which agent caused a failure. The harness must detect and reconcile these divergences, or risk agents working at cross-purposes.
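One simple way a harness could detect a stale view before it causes damage is an optimistic concurrency check on writes. To be clear, this is a standard versioning technique swapped in for illustration, not SyncMind's formalization, and `SharedRepo` is a hypothetical in-memory stand-in for a real repository.

```python
import hashlib


class SharedRepo:
    """In-memory repo that rejects writes based on an outdated read."""

    def __init__(self, files: dict[str, str]) -> None:
        self._files = dict(files)

    @staticmethod
    def _digest(text: str) -> str:
        return hashlib.sha256(text.encode("utf-8")).hexdigest()

    def read(self, path: str) -> tuple[str, str]:
        """Return the file content and a digest recording what the agent saw."""
        text = self._files[path]
        return text, self._digest(text)

    def write(self, path: str, new_text: str, seen_digest: str) -> bool:
        """Apply the edit only if the file is unchanged since the agent read it."""
        if self._digest(self._files[path]) != seen_digest:
            return False  # the agent's belief about this file is stale
        self._files[path] = new_text
        return True


repo = SharedRepo({"auth.py": "def login(): return 403\n"})

# Two agents read the same version of the file.
_, seen_by_a = repo.read("auth.py")
_, seen_by_b = repo.read("auth.py")

a_ok = repo.write("auth.py", "def login(): return 200\n", seen_by_a)
b_ok = repo.write("auth.py", "def login(user): return 403\n", seen_by_b)  # stale view

# Agent B must re-read (refresh its belief) before its edit is accepted.
current, fresh_digest = repo.read("auth.py")
b_retry_ok = repo.write("auth.py", current + "# reviewed by B\n", fresh_digest)
```

This catches conflicting edits to the same file, but not semantic divergence across files, which is the harder part of the problem described above.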
Multi-Agent Topologies

Click each topology to see how agents are connected, how information flows, and which failure modes each topology is susceptible to.

The survey identifies a critical trend: adaptive topologies that restructure themselves at runtime. Dynamic agent pool scaling (spin up more coders when parallelism helps), feedback-driven DAG restructuring (add a security reviewer when the task touches auth), and runtime self-reorganization.

The Shared Code-Centric Substrate

The survey's key position on multi-agent systems: the shared repository is not just a communication channel — it's the convergence target. All agents read from and write to the same codebase. Tests pass or fail for all agents equally. The repository is the shared state.

This is what the paper calls program-state convergence: the repository + its tests + execution traces become both the medium of communication and the definition of success. Compare this to multi-agent systems that communicate through natural language messages — language is ambiguous, subjective, and unverifiable. Code is deterministic.

Why is implicit convergence (fixed iterations, no quality criterion) the most significant gap in multi-agent coding systems?

Chapter 10: Showcase — Build an Agent Harness

Now let's put everything together. This interactive simulation lets you configure and run an agent harness to solve a coding task. You'll see how each layer of the taxonomy affects the outcome — and experience firsthand why harness design dominates model capability.

What you're about to build: A configurable agent harness that resolves GitHub issues. You'll choose the planning strategy, memory configuration, tool access, and verification level. Then you'll watch the agent attempt to solve "Fix authentication bug #347" and see how your harness choices affect the trajectory, cost, and success rate.
Agent Harness Simulator

Configure each harness component, then click "Run Agent" to see the execution trajectory. Try different configurations to see the harness gap in action.

What to Try

Run the simulator with these configurations to observe the harness gap:

  1. Bare minimum: No planning, no memory, no tools, no verification. The agent generates a patch from the issue description alone. Watch it fail.
  2. Tool-augmented: No planning, working memory, full tools, test verification. The agent can read files and run tests but has no strategy. Watch it wander.
  3. Full harness: Structure-grounded planning, full memory, full tools, full verification. Watch it systematically locate the bug, plan the fix, execute, verify, and succeed.
  4. Overkill search: Search (MCTS) planning, full memory, full tools, full verification. Watch it explore multiple branches — more thorough but more expensive.
The key takeaway: Same model in all four configurations. The difference is entirely the harness. Configuration 1 might solve 5% of tasks. Configuration 3 might solve 45%. Configuration 4 might solve 50% but cost 5x more. The harness determines the cost-performance frontier.
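The cost-performance trade-off between configurations 3 and 4 is easier to see as arithmetic. The sketch below uses the article's illustrative solve rates (45% and 50%) and reads "cost 5x more" as five times configuration 3's per-task cost; the unit cost of 1.0 per task and the batch of 100 tasks are assumptions.

```python
def cost_per_solve(solved: int, cost_per_task: float, n_tasks: int = 100) -> float:
    """Total spend divided by the number of tasks actually solved."""
    return (cost_per_task * n_tasks) / solved


N_TASKS = 100
full_harness = {"solved": 45, "cost_per_task": 1.0}   # configuration 3 (unit cost assumed)
mcts_harness = {"solved": 50, "cost_per_task": 5.0}   # configuration 4 (5x the unit cost)

full_cps = cost_per_solve(full_harness["solved"], full_harness["cost_per_task"], N_TASKS)
mcts_cps = cost_per_solve(mcts_harness["solved"], mcts_harness["cost_per_task"], N_TASKS)

# Marginal cost of each additional solved task when upgrading from 3 to 4.
extra_cost = (mcts_harness["cost_per_task"] - full_harness["cost_per_task"]) * N_TASKS
extra_solved = mcts_harness["solved"] - full_harness["solved"]
marginal_cost = extra_cost / extra_solved
```

Under these assumptions, configuration 3 spends about 2.2 units per solved task, configuration 4 spends 10, and each of the five extra solves costs 80 units. Whether that is worth it depends on how much a solved task is worth to you.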
In the simulator, what is the single most impactful harness component to add (going from "bare minimum" to the next useful configuration)?

Chapter 11: Open Problems & Connections

The survey identifies six fundamental open problems that define the frontier of harness engineering. Each represents a gap between what current harnesses can do and what reliable, long-horizon agent systems require.

1. Harness-Level Evaluation and Oracle Adequacy

Current benchmarks mostly measure "did the agent solve the task?" This misses how it solved it — cost, time, number of retries, safety violations, hallucinated steps. The field needs process-level metrics that evaluate the trajectory, not just the endpoint.

A harness that solves a task in 3 steps with $0.10 is better than one that solves it in 300 steps with $50 — but current benchmarks score them identically. Worse, the oracle-adequacy crisis: benchmarks like SWE-Bench rely on test suites that may not fully specify correct behavior. A patch that passes the tests may introduce security vulnerabilities, break untested edge cases, or violate project conventions.
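The gap between endpoint scoring and process-level scoring shows up in a short sketch. The `Trajectory` fields and the dominance comparison are assumptions chosen for illustration; the paper calls for process-level metrics but this is not a metric it defines.

```python
from dataclasses import dataclass


@dataclass
class Trajectory:
    solved: bool
    steps: int
    cost_usd: float
    safety_violations: int = 0


def endpoint_score(t: Trajectory) -> float:
    """What most current benchmarks report: solved or not."""
    return 1.0 if t.solved else 0.0


def dominates(a: Trajectory, b: Trajectory) -> bool:
    """True if `a` is at least as good as `b` on every axis and strictly better on one."""
    # Encode every axis so that lower is better.
    a_key = (not a.solved, a.steps, a.cost_usd, a.safety_violations)
    b_key = (not b.solved, b.steps, b.cost_usd, b.safety_violations)
    return all(x <= y for x, y in zip(a_key, b_key)) and a_key != b_key


lean = Trajectory(solved=True, steps=3, cost_usd=0.10)
heavy = Trajectory(solved=True, steps=300, cost_usd=50.0)

same_endpoint = endpoint_score(lean) == endpoint_score(heavy)
lean_is_better = dominates(lean, heavy)
```

A dominance check avoids inventing weights for steps versus dollars; it simply refuses to call two trajectories equal when one is better on every recorded axis.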

2. Semantic Verification Beyond Executable Feedback

Tests are the primary verification mechanism, but tests are often incomplete. The verifier surface needs to expand beyond functional correctness to include:

None of these can be verified purely by running tests. They require richer verification signals — static analysis, LLM-as-judge, human review — each with its own failure modes.

3. Regression-Free Harness Improvement

When the Evolution Agent modifies the harness, how do you know it didn't break something else? Current AHE systems evaluate on held-out tasks, but the space of possible regressions is vast. The harness self-evolution problem: you're changing the system that tests itself. If the harness mutation changes what "passing a test" means, the evaluation becomes circular.
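A per-task comparison shows why an aggregate metric is not enough to call a mutation regression-free. This is a minimal sketch with hypothetical names and made-up task outcomes; it only detects regressions on the tasks you happen to evaluate and does nothing about the circularity problem described above.

```python
def compare_runs(before: dict[str, bool], after: dict[str, bool]) -> dict:
    """Compare per-task outcomes of the original and the mutated harness."""
    shared = before.keys() & after.keys()
    regressions = sorted(t for t in shared if before[t] and not after[t])
    fixes = sorted(t for t in shared if not before[t] and after[t])
    return {
        "regressions": regressions,
        "fixes": fixes,
        # Promote only when something improved and nothing previously solved broke.
        "promote": bool(fixes) and not regressions,
    }


before = {"task_1": True, "task_2": False, "task_3": False}
after = {"task_1": False, "task_2": True, "task_3": True}

report = compare_runs(before, after)
rate_before = sum(before.values()) / len(before)
rate_after = sum(after.values()) / len(after)
```

In this example the solve rate doubles (one of three tasks to two of three), yet `task_1` regressed, so the mutation is not promoted.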

4. Consistent Shared State Across Multiple Agents

SyncMind's belief-state divergence problem: when multiple agents concurrently edit a shared codebase, each agent's model of the world can drift from reality. Current best step-level attribution accuracy: 14-53%. We can't even reliably tell which agent caused a failure, let alone prevent the divergence proactively.

5. Human Oversight for Safety-Critical Actions

The permission-tier model (read-only / sandbox-edit / full-access) is a start, but production agents need:

6. Extensions to Multimodal Environments

GUI agents see screenshots. Embodied agents see camera frames. Current harnesses handle text and code well but struggle with multimodal memory (which visual details to store?), visual grounding (mapping language to image regions), implicit feedback (a button looks clicked but didn't trigger the right state), and multimodal skill evolution.

The science of harness engineering: The survey argues that the field is moving toward a broader science of harness engineering. The central object of study is no longer only the model or the generated program, but the complete closed-loop system: context, memory, tools, execution, feedback, safety, coordination, and evaluation. The most important future systems will combine four properties: executable (grounded in code), inspectable (exposing plans, state, provenance), stateful (preserving information across trajectories), and governed (constrained by permissions, verification, and accountability).

Applications: Where the Harness Is Already Deployed

Code Assistants — the clearest instantiation. The evolution tells the story: 2021 inline completion → 2023 issue-to-patch (SWE-agent) → 2025 autonomous SWE agent (Claude Code, Codex) → 2026 harness as training data. LingmaAgent (Alibaba Cloud) resolves 16.9% of in-house issues fully autonomously and 43.3% with manual intervention — in production, not on benchmarks. Cursor trains with continuous online RL on real usage traces. The harness feeds back into the model — the boundary between "the agent" and "the harness around the agent" is becoming a learnable surface.

Scientific Discovery: AI Scientist v2 uses an agentic tree search to produce workshop-quality papers, including the first AI-authored paper accepted at an ICLR workshop. The harness manages hypotheses, experiments, analyses, and laboratory protocols as executable pipelines.

GUI/OS Agents — Web pages are code. Actions are code. Evaluation is code. The most literal code-as-environment setting.

Embodied Agents — Voyager, RoboCodeX, Code-as-Policies. Code as the bridge between intent and physical execution.

Personalization — Agents that adapt recommendation policies through structured user feedback and editable preference states.

Connections

Cheat Sheet: The Harness Framework

Layer 1: Interface
Reasoning: PoT, formal verification, iterative code-grounded, CodePRM
Acting: skill selection (SayCan), policy generation (CaP), lifelong agents (Voyager, SkillOS)
Environment: world models, traces, eval environments, verifiable construction
Layer 2: Mechanisms
Planning: linear, structure-grounded, search (MCTS), orchestration
Memory: working, semantic, experiential, long-term, multi-agent + compaction
Tools: function, environment, verification, workflow
Control: PEV loop (plan as contract, sandboxed execution, deterministic sensors)
Engineering: telemetry, evolution agents, governed mutation
Layer 3: Multi-Agent
Roles: manager, coder, tester, reviewer, executor, debugger
Modes: collaborative, adversarial, debate/red-teaming
Topologies: chain, star, cyclic, hierarchical, adaptive
Convergence: test-verified, consensus, implicit (the gap)
Substrate: shared repository as convergence target
What are the four properties that the survey argues the most important future agent systems will combine?