A unified framework for executable, verifiable, and stateful AI agent systems — where code isn't just what agents produce, but the infrastructure they operate through.
You give an LLM a task: "Fix the authentication bug in issue #347." The model generates a patch. It looks right. But it doesn't compile because it imported a function that was renamed two commits ago. It doesn't know the repo's test framework. It can't run the tests to check. It has no memory of what it tried before. It can't even see the file it needs to edit.
This isn't a model intelligence problem. The model could fix the bug — it just can't operate. It has no way to read the repository, execute code, verify its output, remember what worked, or coordinate with other agents. It's a brain without a body.
Think about what happens when you use a coding assistant like Claude Code or Codex. The model doesn't just generate text. It reads files, runs shell commands, executes tests, edits code, manages context across hundreds of interactions, and decides when to ask for human approval. All of that infrastructure — the tools, the permissions, the memory, the execution sandbox — is the harness.
The paper draws a critical three-way distinction that most surveys miss:
Most prior surveys collapse (2) and (3). This paper's contribution is distinguishing them: the harness infrastructure is predefined, but agent-initiated code artifacts are emergent — the agent creates them during execution, and they feed back into its own capabilities.
And here's the striking finding: the same model, with different harnesses, produces a 6x performance gap on the same benchmark. Same weights. Same architecture. The only variable is the code that decides what the model sees and does.
The survey explicitly defines what counts as "code": executable or machine-checkable artifacts including programs, scripts, formal specifications, proof scripts, API schemas, tool definitions, tests, repositories, simulators, configuration files, and execution artifacts like traces and logs.
Not code: raw perception, physical state, human intent, or model-internal reasoning. These may be represented through code, but they're not code themselves. This boundary matters because it determines what the harness can verify. If an artifact is code, the harness can execute it, check it, and feed back structured results. If it's not code, the harness must either convert it to code or rely on weaker verification signals.
The survey organizes the entire code-as-agent-harness landscape into three connected layers. This isn't just a convenient filing system — it follows how code becomes an operational medium inside the agent loop: it first enters as a harness interface for reasoning, acting, and environment representation; it then supports harness mechanisms that manage planning, memory, tool use, execution, and repair over time; and it finally becomes a shared artifact through which multiple agents coordinate.
This is where code connects the agent to the world. Code plays three distinct roles at this layer:
Once the interface connects the agent to the world, the agent needs mechanisms to operate reliably over time. Five mechanisms:
When tasks exceed what a single agent can handle — due to context limits, specialization needs, or verification requirements — the harness scales to multiple agents sharing code artifacts as their coordination medium. The paper examines role specialization, interaction modes (collaborative, adversarial, debate), workflow topologies (chain, star, cyclic, hierarchical, adaptive), execution feedback integration, shared-harness synchronization, and the concept of harness-state convergence.
A worked example clarifies the three-way distinction. Consider a coding agent resolving a GitHub issue:
Prior surveys treat (2) and (3) as the same thing. This paper's key conceptual move is separating them, because agent-initiated code artifacts represent a feedback loop: the agent improves its own capabilities by creating executable artifacts that persist across steps.
Ask an LLM to compute 347 × 892. It might get it wrong. Ask it to write a Python program that computes 347 × 892, then execute that program, and the arithmetic comes out right every time. This is the simplest form of code-for-reasoning: delegating computation from the model's unreliable internal arithmetic to an external, deterministic executor.
But code-for-reasoning goes far beyond arithmetic. The survey identifies three paradigms, each representing a deeper integration of code into the reasoning process:
The model generates executable programs as the primary interface between problem decomposition and computation. Instead of chain-of-thought in natural language, the model produces code that an interpreter executes.
Program-of-Thoughts (PoT) was among the first to formalize this: decompose a reasoning problem into an executable program, run it, return the result. The key insight is separating what to compute (the model's job) from how to compute it (the interpreter's job).
Example: the model writes print(347 * 892); the interpreter executes it and returns 309524, the correct product, every time.
CodeI/O pushes this further: it transforms programs into input-output prediction tasks, exposing reasoning primitives like state-space search, decision-tree traversal, and modular decomposition while keeping procedural rigor through executable verification. The key idea: instead of teaching the model to reason in natural language, teach it to predict what code would produce, then verify by actually running it.
PAL (Program-Aided Language models) takes a closely related approach: the model generates Python programs in which natural language comments serve as the reasoning trace. The comments explain why each step is taken; the code computes what each step produces. This creates a dual-channel reasoning signal: human-readable rationale plus machine-verifiable computation.
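To make the dual-channel idea concrete, here is a minimal sketch of a PAL-style program and a tiny runner. The program text is written by hand to stand in for model output, and `run_generated_program` is a hypothetical helper, not part of PAL or PoT. A real harness would run untrusted code in a sandbox rather than with a bare `exec`.

```python
# Stand-in for model output: comments carry the rationale,
# statements carry the computation.
GENERATED_PROGRAM = '''
# The task asks for the product of 347 and 892.
a = 347
b = 892
# The multiplication is delegated to the interpreter.
answer = a * b
'''


def run_generated_program(source: str) -> int:
    """Execute program text in a fresh namespace and return its `answer`.

    Assumption: the generated program stores its result in a variable
    named `answer`. This convention is illustrative only.
    """
    namespace: dict = {}
    exec(source, namespace)  # trusted input only; sandbox this in practice
    return namespace["answer"]


result = run_generated_program(GENERATED_PROGRAM)
print(result)  # 309524
```

The comments are ignored by the interpreter but remain available to a human reviewer, which is the point of the two channels.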
What if the reasoning itself needs to be proved correct, not just executed? Formal verification uses machine-checkable languages — Lean, Isabelle, Coq — where every derivation step is verified by a proof checker.
Systems like DeepSeek-Prover-V2 combine LLM generation with proof-assistant feedback: the model proposes proof steps, the verifier checks them, and failures guide the next attempt. Lean4Agent extends this to agent workflows themselves — using Lean to model and verify that an agent's trajectory satisfies safety contracts.
The survey identifies a crucial data flow pattern in formal verification systems:
```python
# Formal verification loop (pseudocode)
def prove_theorem(statement, model, verifier, max_attempts=10):
    proof_state = verifier.initialize(statement)
    for attempt in range(max_attempts):
        # Model proposes next tactic
        tactic = model.generate(
            context=proof_state.goals,    # Current proof obligations
            feedback=proof_state.errors,  # Prior tactic failures
            history=proof_state.trace,    # Successful steps so far
        )
        # Verifier checks the tactic deterministically
        result = verifier.apply_tactic(proof_state, tactic)
        if result.success:
            proof_state = result.new_state
            if proof_state.is_complete:
                return proof_state.proof  # Verified proof
        else:
            proof_state.errors.append(result.error)
    return None  # Failed to prove
```
The most powerful paradigm treats reasoning as a closed loop: generate code → execute it → observe the trace → refine. The execution trace itself becomes the reasoning signal.
R1-Code-Interpreter interleaves reasoning and multiple rounds of code execution within persistent interactive sessions. The model reasons in text, writes code to test a hypothesis, observes the result, updates its reasoning, writes more code. Each execution is a checkpoint that grounds the reasoning in reality.
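The generate → execute → observe → refine loop can be sketched in a few lines. This is an illustration under stated assumptions, not R1-Code-Interpreter's implementation: `propose` is a hypothetical stub that returns canned drafts where a real system would call a model, and `execute` uses a bare `exec` where a real harness would use a sandboxed session.

```python
from typing import Optional


def execute(source: str):
    """Run candidate code and return (ok, observation)."""
    namespace: dict = {}
    try:
        exec(source, namespace)  # assumption: sandboxed in a real harness
        return True, repr(namespace.get("result"))
    except Exception as exc:
        # The error text is the observation fed back to the next round.
        return False, f"{type(exc).__name__}: {exc}"


def propose(attempt: int, feedback: Optional[str]) -> str:
    """Hypothetical model stub. A real model would condition on `feedback`;
    this stub simply returns the next canned draft."""
    drafts = [
        "values = [2, 4, 6]\nresult = sum(values) / count",        # bug: undefined name
        "values = [2, 4, 6]\nresult = sum(values) / len(values)",  # repaired draft
    ]
    return drafts[min(attempt, len(drafts) - 1)]


def solve(max_rounds: int = 3):
    feedback = None
    history = []
    for attempt in range(max_rounds):
        code = propose(attempt, feedback)
        ok, observation = execute(code)
        history.append((ok, observation))
        if ok:
            return observation, history
        feedback = observation  # execution evidence drives the next draft
    return None, history


answer, history = solve()
print(answer)   # 4.0
print(history)  # first round fails with a NameError, second succeeds
```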
CodePRM learns reward functions over reasoning-execution trajectories, using execution outcomes to evaluate and refine intermediate reasoning steps. The harness doesn't just execute code — it scores the reasoning quality based on what the execution reveals. This is a process reward model (PRM) for code, where the reward signal comes from deterministic execution rather than human labeling.
Let's trace how a single problem flows through each paradigm. The problem: "What is the probability of drawing two aces from a standard deck without replacement?"
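For the program-based paradigm, the model's output for this problem might look like the following hand-written sketch (it is not taken from the survey). It computes the answer exactly with fractions and cross-checks it with a counting argument.

```python
from fractions import Fraction
from math import comb

ACES, DECK = 4, 52

# Sequential view: first card is an ace, then the second is an ace too.
p_first = Fraction(ACES, DECK)
p_second_given_first = Fraction(ACES - 1, DECK - 1)
p_two_aces = p_first * p_second_given_first

# Counting view: ace pairs divided by all two-card hands.
p_by_counting = Fraction(comb(ACES, 2), comb(DECK, 2))

print(p_two_aces)                    # 1/221
print(p_two_aces == p_by_counting)   # True
print(round(float(p_two_aces), 6))   # 0.004525
```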
A robot needs to pick up a cup. An agent needs to click "Submit" on a web form. A coding assistant needs to edit line 47 of auth.py. In each case, the model's internal reasoning must be translated into an executable operation that changes the state of the world. Code is that translation layer.
The central challenge is grounding: mapping abstract language outputs into executable behaviors that respect the constraints of the target environment — physical limits, API schemas, safety requirements, timing constraints.
The simplest form: the model selects from a library of pre-defined executable skills. SayCan (Google, 2022) pioneered this by combining an LLM's semantic knowledge with an affordance model that scores which actions are physically possible right now.
The actual computation is a product of two scores:
The LLM scores relevance: "pick up mug" is more relevant to "make coffee" than "open fridge". The affordance model scores feasibility: can the robot actually execute this action given its current position, gripper state, and reachable objects?
The critical engineering decision: the LLM never directly controls the robot. It scores which skill to invoke. The skill itself is a pre-built controller with safety guarantees. This separation — LLM for intent, code for execution — is the harness pattern.
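The product-of-scores selection can be illustrated with a short sketch. The skill names and all numeric scores below are made up for illustration, and `select_skill` is a hypothetical helper rather than SayCan's code. In the real system the two scores come from a language model and a learned affordance function.

```python
def select_skill(llm_scores: dict, affordance_scores: dict):
    """Pick the skill with the highest relevance * feasibility product."""
    combined = {
        skill: llm_scores[skill] * affordance_scores.get(skill, 0.0)
        for skill in llm_scores
    }
    best = max(combined, key=combined.get)
    return best, combined


# Illustrative numbers for the instruction "make coffee".
llm_scores = {"pick up mug": 0.6, "open fridge": 0.1, "go to counter": 0.3}
# The mug is out of reach right now, so its feasibility is low.
affordance_scores = {"pick up mug": 0.2, "open fridge": 0.9, "go to counter": 0.9}

best, combined = select_skill(llm_scores, affordance_scores)
print(best)  # go to counter
```

The most relevant skill loses here because it is not currently feasible, which is exactly the behavior the product is meant to produce.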
What if the skill library doesn't have the right skill? Code as Policies (Liang et al., 2023) lets the model write new robot control programs on the fly. The model generates Python code that directly calls perception and control APIs:
```python
# Model-generated robot policy
def pour_into_tallest_cup(robot):
    cups = robot.detect("cup")
    tallest = max(cups, key=lambda c: c.height)
    robot.pick(robot.detect("pitcher")[0])
    robot.pour(tallest.position)
```
The model writes the program; the robot's API handles the actual motor commands. This is more flexible than skill selection but introduces a new risk: the generated code might be unsafe (pour into a cup that's upside down, move the arm into a collision). The harness must validate generated code before execution.
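One simple form of pre-execution validation is a static allowlist check over the generated program's syntax tree. The sketch below assumes a hypothetical robot API with only `detect`, `pick`, and `pour`; the allowlists and the `validate_policy` helper are illustrative, not from Code as Policies. A static check like this only restricts which calls appear. It cannot detect physical hazards such as an upside-down cup, which need runtime checks.

```python
import ast

ALLOWED_ROBOT_METHODS = {"detect", "pick", "pour"}  # hypothetical API surface
ALLOWED_FUNCTIONS = {"max", "min", "len", "sorted"}


def validate_policy(source: str) -> list:
    """Return a list of violations; an empty list means the check passed."""
    try:
        tree = ast.parse(source)
    except SyntaxError as exc:
        return [f"syntax error: {exc.msg}"]
    violations = []
    for node in ast.walk(tree):
        if isinstance(node, (ast.Import, ast.ImportFrom)):
            violations.append("imports are not allowed")
        elif isinstance(node, ast.Call):
            func = node.func
            if isinstance(func, ast.Attribute) and func.attr not in ALLOWED_ROBOT_METHODS:
                violations.append(f"method not allowed: {func.attr}")
            elif isinstance(func, ast.Name) and func.id not in ALLOWED_FUNCTIONS:
                violations.append(f"function not allowed: {func.id}")
    return violations


SAFE_POLICY = '''
def pour_into_tallest_cup(robot):
    cups = robot.detect("cup")
    tallest = max(cups, key=lambda c: c.height)
    robot.pick(robot.detect("pitcher")[0])
    robot.pour(tallest.position)
'''

UNSAFE_POLICY = '''
import os
os.system("rm -rf /")
'''

safe_report = validate_policy(SAFE_POLICY)
unsafe_report = validate_policy(UNSAFE_POLICY)
print(safe_report)    # []
print(unsafe_report)  # lists the import and the disallowed method call
```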
Voyager (Wang et al., 2023) takes this to the extreme: an agent that plays Minecraft by writing JavaScript programs, storing successful programs in a skill library, and composing them to solve increasingly complex tasks. The skill library grows over time — the agent literally programs itself into greater capability.
The data flow is a self-reinforcing loop:
SkillOS (Ouyang et al., 2026) extends this pattern with RL-trained skill curation, automatically deciding which skills to keep, merge, or discard based on usage patterns and success rates. The skill library isn't just a growing list — it's a managed, evolving codebase.
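A skill library of this kind can be sketched as a small Python class: skills are stored only after an execution check passes, retrieved by description, and composed into larger behaviors. Everything here is an assumption for illustration (the `Skill` and `SkillLibrary` names, the dictionary-based world state, and the keyword retrieval). Voyager itself stores JavaScript programs, and neither it nor SkillOS is implemented this way.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Skill:
    name: str
    description: str
    fn: Callable[[dict], dict]


class SkillLibrary:
    def __init__(self):
        self._skills = {}

    def add_if_verified(self, skill: Skill, test_state: dict) -> bool:
        """Store the skill only if it runs without error on a test state."""
        try:
            skill.fn(dict(test_state))
        except Exception:
            return False
        self._skills[skill.name] = skill
        return True

    def retrieve(self, query: str) -> list:
        """Naive keyword overlap; a stand-in for similarity search."""
        words = set(query.lower().split())
        return [s.name for s in self._skills.values()
                if words & set(s.description.lower().split())]

    def compose(self, *names: str):
        """Chain stored skills into a new callable."""
        fns = [self._skills[n].fn for n in names]

        def composed(state: dict) -> dict:
            for fn in fns:
                state = fn(state)
            return state
        return composed


def mine_wood(state):
    return {**state, "wood": state.get("wood", 0) + 1}


def craft_planks(state):
    if state.get("wood", 0) < 1:
        raise ValueError("need wood")
    return {**state, "wood": state["wood"] - 1, "planks": state.get("planks", 0) + 4}


def broken_skill(state):
    return state["missing_key"]  # always raises KeyError


library = SkillLibrary()
added = [
    library.add_if_verified(Skill("mine_wood", "collect wood from a tree", mine_wood), {}),
    library.add_if_verified(Skill("craft_planks", "craft planks from wood", craft_planks), {"wood": 1}),
    library.add_if_verified(Skill("broken", "always fails", broken_skill), {}),
]
final_state = library.compose("mine_wood", "craft_planks")({})
print(added)        # [True, True, False]
print(final_state)  # {'wood': 0, 'planks': 4}
```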
GUI agents reveal the code-as-action thesis most clearly. A web page is rendered code (HTML/CSS/JS). An action is a code call (click, type, scroll). An evaluation is a code check (did the DOM change correctly?). The entire perception-action-evaluation loop is code.
The paper formalizes this as a POMDP:
Example actions: element.click(), setText(node_id, "..."), pyautogui.click(x,y).
The CodeAct paradigm pushes this further: instead of emitting JSON tool calls, agents emit Python or JavaScript snippets that compose primitives. Cradle uses an LMM to output executable Python that drives keyboard and mouse for any application, including games. Native GUI models like UI-TARS and ShowUI collapse the planner→grounder→executor pipeline into a single VLA model whose output token stream is itself runnable code.
We've seen code as a reasoning substrate and an action interface. Now the third role: code as the environment itself. When a coding agent works on a repository, the repository is the world. Files are the state. Tests are the laws of physics. Git history is the timeline. Execution traces are observations.
An agent working on a codebase needs a model of that codebase. Not just the raw files, but the structure: which classes depend on which, what functions call what, where the tests are, what the build system does.
WorldCoder (Tang et al., 2024) takes this idea literally: it builds world models by writing code. The agent interacts with an environment, observes transitions, and writes Python programs that predict what will happen next. The world model is executable — the agent can run it to simulate future states without actually taking actions.
Sometimes the world model isn't a program you write from scratch — it's the execution trace of the program you're working on. CWM (Code World Model) trains an LLM to predict code execution behavior: given a program, predict its output, variable states, and control flow without actually running it.
This is world modeling in the most literal sense: the model learns the physics of code execution. When the model can accurately predict what a program will do, it can reason about edits without executing them — a crucial capability for agents working on codebases where tests are slow or running infrastructure is expensive.
The data flow for execution-trace world modeling:
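```python
# CWM: The model predicts execution traces
import contextlib
import io

input_program = """
def fibonacci(n):
    if n <= 1:
        return n
    return fibonacci(n-1) + fibonacci(n-2)

print(fibonacci(6))
"""

# Model predicts without executing:
predicted_output = "8"
predicted_trace = {
    "call_depth": 6,
    "return_values": [0, 1, 1, 2, 3, 5, 8],
    "total_calls": 25,
}

# Verification: actually run and compare.
# exec() returns None, so the printed output is captured from stdout.
buffer = io.StringIO()
with contextlib.redirect_stdout(buffer):
    exec(input_program, {})
actual_output = buffer.getvalue().strip()  # "8"
assert actual_output == predicted_output   # prediction matches execution
```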
Benchmarks for agent evaluation are themselves code-as-environment artifacts. SWE-Bench defines a task as: given an issue description and a repository snapshot, produce a patch that makes the failing tests pass. The repository is the environment. Tests are the reward function. Everything is executable.
More sophisticated environments like WebArena and OSWorld instantiate full browser or operating system environments where agent actions (code) produce observable state changes (code execution) that are evaluated by scripts (more code). The entire loop is code.
Endless Terminals (Gandhi et al., 2026) takes the logical next step: using code to generate evaluation environments. Instead of hand-crafting benchmarks, it uses programs to procedurally generate terminal-based tasks with verifiable solutions. The environment, the task, and the evaluator are all synthesized from code.
This has profound scaling implications. Traditional benchmarks require human labor per task. Code-generated benchmarks can produce unlimited tasks with built-in verification. The trade-off: generated tasks may lack the semantic richness of real-world issues.
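The idea of tasks with built-in verification can be shown with a toy generator. This is a sketch under stated assumptions, not Endless Terminals' method: the log-counting task, the `Task` dataclass, and the helper names are all hypothetical. The point is that the generator knows the expected answer because it constructed the environment.

```python
import random
import tempfile
from dataclasses import dataclass
from pathlib import Path


@dataclass
class Task:
    prompt: str
    workdir: Path
    expected: str


def generate_task(seed: int, root: Path) -> Task:
    """Build a small file-based task whose answer is known by construction."""
    rng = random.Random(seed)
    workdir = root / f"task_{seed}"
    workdir.mkdir(parents=True, exist_ok=True)
    levels = ["INFO", "WARN", "ERROR"]
    lines = [f"{rng.choice(levels)} event {i}" for i in range(rng.randint(20, 40))]
    (workdir / "app.log").write_text("\n".join(lines) + "\n")
    expected = str(sum(line.startswith("ERROR") for line in lines))
    return Task("Count the lines in app.log that start with ERROR.", workdir, expected)


def verify(task: Task, answer: str) -> bool:
    """Evaluator: exact match against the generator's ground truth."""
    return answer.strip() == task.expected


def reference_solver(task: Task) -> str:
    """Stands in for the agent; it only sees the working directory."""
    text = (task.workdir / "app.log").read_text()
    return str(sum(line.startswith("ERROR") for line in text.splitlines()))


with tempfile.TemporaryDirectory() as tmp:
    tasks = [generate_task(seed, Path(tmp)) for seed in range(5)]
    results = [verify(t, reference_solver(t)) for t in tasks]
    wrong_answer_accepted = verify(tasks[0], "-1")

print(results)                # [True, True, True, True, True]
print(wrong_answer_accepted)  # False
```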
Real-world software tasks are never one-shot. You don't just write a function — you read the issue, find the relevant files, understand the dependencies, plan an approach, implement it, test it, fix the tests, and iterate. Planning is the harness mechanism that organizes this long-horizon process.
The paper makes a critical framing move: planning is not merely an internal reasoning capability of the LLM. It's a form of harness control — it structures how the agent externalizes intent into executable steps, schedules interactions with code artifacts and tools, and regulates the trajectory of reasoning, execution, and revision over time.
The simplest approach: break the task into a numbered list of steps, then execute them in order. Self-Planning has the model first produce a step-by-step plan in text, then generate code step by step under the plan's guidance. ReAct (Yao et al., 2022) is a lightweight precursor: interleave thoughts, actions, and observations in a serial trajectory where each reasoning step constrains the next action.
The 2026 evolution is significant: plans are no longer ephemeral prompt artifacts. They're persistent filesystem objects — PLAN.md, Implement.md, status.log — that can be reviewed by humans, versioned with Git, consumed by subagents, and reloaded across context resets. The plan becomes a harness-backed control object, not just a reasoning trace.
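A persistent plan can be as simple as a checklist file that the harness rewrites as steps complete. The sketch below assumes a Markdown checkbox layout for PLAN.md; that layout, the step texts, and the helper names are illustrative choices, not a format defined by the survey.

```python
import tempfile
from pathlib import Path


def write_plan(path: Path, steps: list) -> None:
    """Persist (done, text) pairs as a Markdown checklist."""
    lines = ["# PLAN", ""]
    lines += [f"- [{'x' if done else ' '}] {text}" for done, text in steps]
    path.write_text("\n".join(lines) + "\n")


def load_plan(path: Path) -> list:
    """Parse the checklist back into (done, text) pairs."""
    steps = []
    for line in path.read_text().splitlines():
        if line.startswith("- [x] "):
            steps.append((True, line[6:]))
        elif line.startswith("- [ ] "):
            steps.append((False, line[6:]))
    return steps


def mark_done(path: Path, index: int) -> None:
    steps = load_plan(path)
    steps[index] = (True, steps[index][1])
    write_plan(path, steps)


def next_step(path: Path):
    """What a fresh context would pick up after a reset."""
    for done, text in load_plan(path):
        if not done:
            return text
    return None


with tempfile.TemporaryDirectory() as tmp:
    plan_path = Path(tmp) / "PLAN.md"
    write_plan(plan_path, [
        (False, "Reproduce the failing test"),
        (False, "Edit auth.py"),
        (False, "Run the test suite"),
    ])
    mark_done(plan_path, 0)
    # Simulated context reset: nothing is kept in memory, only the file.
    resumed = next_step(plan_path)
    plan_text = plan_path.read_text()

print(resumed)  # Edit auth.py
```

Because the state lives in a file, the same plan can be reviewed by a human or committed to Git without any extra machinery.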
Limitation: linear planning commits to a single trajectory. When the initial plan is wrong, recovery is limited to patching, not rethinking.
Instead of planning from free-form text, ground the plan in the structure of the task environment. CodePlan constructs a dependency graph over files and derives edit obligations from change-impact analysis. The plan follows the dependency structure, not the model's intuition.
Key insight: the structure of the codebase (what depends on what) tells you the order of operations. Edit the dependency before the dependent. Run the tests for the edited module before moving on. The plan emerges from the program structure itself.
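Deriving edit order from dependency structure is a topological sort. The sketch below uses the standard library's `graphlib` (Python 3.9+) on a made-up four-file repository. The file names and the `impacted` and `edit_plan` helpers are hypothetical, and this is a simplified illustration of the idea rather than CodePlan's algorithm.

```python
from graphlib import TopologicalSorter

# module -> modules it depends on (hypothetical repository)
dependencies = {
    "api/routes.py": {"auth/session.py"},
    "auth/session.py": {"auth/tokens.py"},
    "auth/tokens.py": set(),
    "tests/test_routes.py": {"api/routes.py"},
}


def impacted(changed: str, deps: dict) -> set:
    """The changed module plus everything that transitively depends on it."""
    result = {changed}
    grew = True
    while grew:
        grew = False
        for module, needs in deps.items():
            if module not in result and needs & result:
                result.add(module)
                grew = True
    return result


def edit_plan(changed: str, deps: dict) -> list:
    """Order the impacted modules so dependencies come before dependents."""
    scope = impacted(changed, deps)
    subgraph = {module: deps[module] & scope for module in scope}
    return list(TopologicalSorter(subgraph).static_order())


plan = edit_plan("auth/tokens.py", dependencies)
print(plan)
# ['auth/tokens.py', 'auth/session.py', 'api/routes.py', 'tests/test_routes.py']
```

Changing a leaf such as api/routes.py yields a much shorter plan, since only its test depends on it.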
Specialized domains follow the same pattern. VerilogCoder grounds subtask planning in a Task and Circuit Relation Graph where each subtask is enriched with signals, transitions, and examples. DomAgent combines top-down knowledge graphs with bottom-up examples for domain-specific code generation.
Don't commit to one plan — explore multiple alternatives and use feedback to select the best. The survey identifies three levels of search:
Meta-Harness pushes search to the harness level itself: it searches over harness code by giving an agent access to prior source code, scores, and execution traces through a filesystem. The search isn't just over solutions — it's over the infrastructure that generates solutions.
Planning isn't done by one agent deciding what to do next — it's an emergent property of how multiple modules coordinate. Three patterns:
| Method | Category | Core Mechanism | Feedback |
|---|---|---|---|
| Self-Planning | Linear | Stepwise decomposition | None |
| WebAgent | Linear | Sub-instruction sequencing | Runtime exception |
| CodePlan | Structure | Plan graph from dependencies | Critique |
| VerilogCoder | Structure | Task-circuit relation graph | Test pass/fail |
| ReThinkMCTS | Search | MCTS over reasoning paths | Critique + tests |
| CodeTree | Search | Trajectory tree search | Test pass/fail |
| MapCoder | Orchestration | Role orchestration | Critique + tests |
| Blueprint2Code | Orchestration | Blueprint-to-code pipeline | Critique |
A coding agent is debugging a complex issue. It's been working for 200 turns. It found the root cause 150 turns ago, explored three fix strategies, identified the right one 50 turns ago, and is now implementing the fix. But its context window only holds the last 30 turns. How does it remember the root cause? The rejected strategies? The rationale for the chosen approach?
This is the memory problem. Real software engineering is inherently long-horizon and state-intensive. Memory isn't a nice-to-have — it's infrastructure.
1. Working Memory — What the agent needs for its next action. Structured prompt regions, state summaries, failed-test records, file lists, critical stack information. SWE-agent and RepairAgent show that even with the same model, repository-level performance varies substantially based on how working memory is organized. CodeMem treats context as a managed resource with budgeted memory slots to stabilize multi-step edits.
2. Semantic Memory — Task-relevant evidence from the external codebase. Not "recall what I learned" but "find what I need." AutoCodeRover retrieves class definitions, function implementations, and call relations aligned with program structure. RepoCoder iteratively rewrites queries to improve retrieval. CodeRAG uses multi-path retrieval with reranking. The key insight: retrieve evidence aligned with program structure (AST-based chunking, call graphs), not just text similarity.
3. Experiential Memory — Reusable experience from past tasks. MemGovern stores repair trajectories, failure cases, and strategy patterns — but with quality controls. ExpeL reuses reflections as task-solving strategies. The paper makes a strong claim: ungoverned experience introduces noise, error propagation, and false retrievals. Governed experience — curated and validated — enables cross-task transfer. The quality of stored experience matters more than its scale.
4. Long-Term Memory — Knowledge that persists across sessions. MemCoder stores validated commits and human-approved solutions. TALM incorporates long-term memory into multi-agent generation, retrieving prior traces and consolidating overlapping memories. The focus shifts from what to store to when to write, when to compress, when to retrieve, and how to avoid contamination. Memory without governance becomes a burden that amplifies noise, staleness, and error.
5. Multi-Agent Memory — Shared state across agent roles. In multi-agent systems, memory isn't just individual storage — it's a coordination medium. MIRIX routes shared memory across specialized roles. ChatDev maintains context across role-based phases. The challenge: controlling sharing granularity, preventing information flooding, and supporting bidirectional access between high-level plans and fine-grained traces.
6. Context Compaction & State Offloading — The cross-cutting mechanism. Long-horizon tasks generate massive artifacts: build logs, execution traces, repo diffs. A single SWE-Bench evaluation can produce millions of tokens of diagnostic information. Three compression strategies:
Example of a compacted entry that points to offloaded state: "AssertionError: expected 200, got 403. Full log at /traces/run_42/test_output.log."

| Method | Memory Type | Managed State | Harness Operation |
|---|---|---|---|
| SWE-agent | Working | Repair trajectory | Structured state tracking |
| CodeMem | Working | Edit state + slots | Budgeted slot management |
| AutoCodeRover | Semantic | Repo structure | Structure-aware retrieval |
| RepoCoder | Semantic | Repo context | Iterative query rewriting |
| MemGovern | Experiential | Trajectories + critiques | Governed experience replay |
| MemCoder | Long-term | Commits + fixes | Structured memory + self-internalization |
| MIRIX | Multi-agent | Cross-agent state | Cross-agent memory routing |
| LongCodeZip | Compaction | Long code context | Coarse-to-fine compression |
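Context compaction with state offloading can be sketched as: write the full artifact to disk, keep only the salient lines plus a pointer in context. The `compact_log` helper, its keyword filter, and the synthetic log below are assumptions for illustration. They are not how LongCodeZip or any listed system works.

```python
import tempfile
from pathlib import Path


def compact_log(log_text: str, offload_path: Path, max_lines: int = 3) -> str:
    """Offload the full log and return a short summary with a pointer to it."""
    offload_path.parent.mkdir(parents=True, exist_ok=True)
    offload_path.write_text(log_text)
    lines = log_text.splitlines()
    # Crude salience filter: keep failure lines, else fall back to the tail.
    salient = [line for line in lines if "Error" in line or "FAILED" in line]
    salient = salient[:max_lines] or lines[-max_lines:]
    return " | ".join(salient) + f" | Full log at {offload_path}."


# Synthetic test output: many passing lines and one failure.
raw_log = "\n".join(
    [f"PASSED test_case_{i}" for i in range(500)]
    + ["FAILED test_login", "AssertionError: expected 200, got 403"]
)

with tempfile.TemporaryDirectory() as tmp:
    offload_path = Path(tmp) / "traces" / "run_42" / "test_output.log"
    summary = compact_log(raw_log, offload_path)
    offloaded_ok = offload_path.read_text() == raw_log

print(len(raw_log), "->", len(summary))  # the summary is far shorter
print(summary)
```

Nothing is lost: the agent can reopen the offloaded file if the summary turns out to be insufficient.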
An agent that can reason, act, remember, and plan still needs one more thing: governed access to the external world. Tools are the action/observation layer — and the control loop determines how tool results feed back into the next action.
1. Function-Oriented: Tools that fill gaps in the model's knowledge. ToolCoder integrates API search tools into code generation — the model decides when it needs documentation and queries for it. Key challenge: query formulation and result selection from long-tail APIs and private libraries.
2. Environment-Interaction: Tools for operating inside the development environment. SWE-agent formalizes the agent-computer interface: shell commands, file editing, test execution. CodeAgent navigates repositories, inspects dependencies, and validates through testing. These aren't knowledge tools — they're hands.
3. Verification-Driven: Tools as deterministic sensors. Tests, compilers, linters, type checkers, static analyzers — these don't help the agent do things, they help the harness verify things. AgentCoder's three-agent loop (programmer, test designer, executor) exemplifies this: the tool's job is feedback, not action.
4. Workflow-Orchestration: Tools that coordinate other tools. Tool schemas, sessions, guardrails, handoffs, sandboxes. ToolNet learns a multi-tool routing policy. MapCoder assigns agents to recall, planning, coding, and debugging stages, each with its own tool set.
Pre-use hooks enforce policy before a tool call runs (no rm -rf /, no credential exfiltration). Post-use hooks sanitize outputs, compact logs, update memory, and trigger verification. This turns tool use from raw model action into a monitored state transition with audit trails. The hook pattern is what separates "a model calling an API" from "a governed agent operating through a harness."
All of these mechanisms (planning, memory, tools, verification) come together in a single control loop. The survey calls it PEV: Plan-Execute-Verify. This is the paper's core formalization of harness-level control.
The paper develops PEV from a specific observation: traditional debugging research treats code repair as "generate patch → check tests." But in production agent systems, the control problem is much broader. The harness must:
The plan isn't a vague intent — it's an explicit contract: which files to edit, what invariants to maintain, what validation commands to run, what rollback points exist. This is a harness artifact, not just a reasoning trace. In industrial systems, these contracts are persisted as filesystem objects (PLAN.md) that can be reviewed by humans before execution begins.
The plan is executed inside a sandboxed, permissioned environment. The sandbox provides an isolated filesystem, shell, and runtime with resource boundaries. Permissions are tiered: read-only, sandbox-edit, and full-access.
The harness uses deterministic sensors — compilers, linters, test suites, static analyzers — to compare the new state against constraints. If verification fails, the same sensor evidence determines the next action: diagnose, retrieve context, regenerate, route to another agent, or escalate to a human.
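The three phases can be tied together in a short loop. This is a minimal sketch, not the survey's formalization: the in-memory `workspace` dictionary stands in for a sandboxed filesystem, `planner` is a hypothetical stub that returns a wrong patch first and a repaired one once it has failure evidence, and `verify` is a single assertion standing in for a test suite.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Plan:
    target_file: str
    new_content: str
    rationale: str


def verify(workspace: dict):
    """Deterministic sensor: load calc.py and check one invariant."""
    namespace: dict = {}
    try:
        exec(compile(workspace["calc.py"], "calc.py", "exec"), namespace)
        assert namespace["add"](2, 3) == 5, "add(2, 3) should be 5"
        return True, "ok"
    except Exception as exc:
        return False, f"{type(exc).__name__}: {exc}"


def planner(evidence: Optional[str]) -> Plan:
    """Hypothetical model stub: a wrong first patch, then a repair."""
    if evidence is None:
        return Plan("calc.py", "def add(a, b):\n    return a - b\n", "first guess")
    return Plan("calc.py", "def add(a, b):\n    return a + b\n", f"repair after: {evidence}")


def pev_loop(workspace: dict, max_cycles: int = 3):
    evidence = None
    log = []
    for cycle in range(max_cycles):
        plan = planner(evidence)                                        # Plan
        candidate = {**workspace, plan.target_file: plan.new_content}   # Execute on a copy
        ok, evidence = verify(candidate)                                # Verify
        log.append((cycle, ok, evidence))
        if ok:
            return candidate, log  # commit the verified state
        # On failure the copy is discarded (rollback); evidence feeds the next plan.
    return None, log  # budget exhausted: escalate to a human


original = {"calc.py": "def add(a, b):\n    return a * b\n"}
final, log = pev_loop(original)
print(log)
# [(0, False, 'AssertionError: add(2, 3) should be 5'), (1, True, 'ok')]
```

The original workspace is never mutated, so a failed cycle needs no cleanup.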
Here's the most meta idea in the entire survey: using agents to measure and improve the harness itself. The paper calls this Agentic Harness Engineering (AHE). The key observation: many observed agent failures arise from the harness, not the model. Common causes include missing context, brittle tool interfaces, weak validators, excessive token cost, and poor retry policies.
AHE starts with observability. The harness instruments every interaction to produce detailed telemetry:
This telemetry is the raw data from which AHE diagnoses harness problems. Without it, you're debugging blind.
An Evolution Agent operates on the telemetry to propose harness improvements. The workflow:
The Evolution Agent is itself subject to the PEV loop — it plans a harness mutation, executes it in isolation, verifies via telemetry, and escalates risky changes to humans. This prevents the self-modification problem: you don't want an agent modifying its own safety boundaries without oversight.
The paper draws a parallel to AlphaEvolve (Novikov et al., 2025), which applies evolutionary coding to program optimization. AHE extends this principle: the harness is a program, and it can be optimized by the same evolutionary search methods that optimize other programs.
```python
# Agentic Harness Engineering: Evolution Agent pseudocode
class EvolutionAgent:
    def evolve(self, harness, telemetry, eval_tasks):
        # Diagnose: identify the weakest harness component
        diagnosis = self.analyze_telemetry(telemetry)
        # e.g., "tool_description for file_search is ambiguous,
        #        causing 34% of localization failures"

        # Propose: generate candidate mutation
        mutation = self.propose_fix(diagnosis, harness)
        # e.g., rewrite file_search tool description with examples

        # Evaluate: run mutated harness on held-out tasks
        mutated_harness = harness.apply(mutation)
        results = self.evaluate(mutated_harness, eval_tasks)

        # Promote only if regression-free improvement
        if results.improved and not results.regressions:
            return mutated_harness
        return harness  # Revert
```
One agent, one task, one context window. This works for function-level edits. But for repository-level engineering — where you need to understand 100 files, write code, generate tests, review for security, and debug failures — a single agent hits three walls:
Mirror human software teams: different agents for different jobs. The paper identifies six canonical roles:
The paper introduces a formal concept of harness-state convergence: multiple agents working toward a shared code state that satisfies correctness, security, performance, and consensus criteria simultaneously. Three convergence types:
The survey identifies a critical trend: adaptive topologies that restructure themselves at runtime. Examples include dynamic agent pool scaling (spin up more coders when parallelism helps), feedback-driven DAG restructuring (add a security reviewer when the task touches auth), and runtime self-reorganization.
The survey's key position on multi-agent systems: the shared repository is not just a communication channel — it's the convergence target. All agents read from and write to the same codebase. Tests pass or fail for all agents equally. The repository is the shared state.
This is what the paper calls program-state convergence: the repository + its tests + execution traces become both the medium of communication and the definition of success. Compare this to multi-agent systems that communicate through natural language messages — language is ambiguous, subjective, and unverifiable. Code is deterministic.
The survey identifies six fundamental open problems that define the frontier of harness engineering. Each represents a gap between what current harnesses can do and what reliable, long-horizon agent systems require.
Current benchmarks mostly measure "did the agent solve the task?" This misses how it solved it — cost, time, number of retries, safety violations, hallucinated steps. The field needs process-level metrics that evaluate the trajectory, not just the endpoint.
A harness that solves a task in 3 steps for $0.10 is better than one that solves it in 300 steps for $50, but current benchmarks score them identically. Worse, there is an oracle-adequacy crisis: benchmarks like SWE-Bench rely on test suites that may not fully specify correct behavior. A patch that passes the tests may introduce security vulnerabilities, break untested edge cases, or violate project conventions.
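The gap between outcome scoring and process scoring can be made concrete with a small sketch. The `Trajectory` fields and the dominance comparison are illustrative assumptions, not metrics proposed by the survey. The comparison deliberately avoids inventing weights: one run beats another only if it is no worse on every dimension and strictly better on at least one.

```python
from dataclasses import dataclass


@dataclass
class Trajectory:
    solved: bool
    steps: int
    cost_usd: float
    retries: int = 0
    safety_violations: int = 0


def outcome_score(t: Trajectory) -> float:
    """What an endpoint-only benchmark sees."""
    return 1.0 if t.solved else 0.0


def process_vector(t: Trajectory) -> tuple:
    """Lower is better in every position."""
    return (0 if t.solved else 1, t.steps, t.cost_usd, t.retries, t.safety_violations)


def dominates(a: Trajectory, b: Trajectory) -> bool:
    """True if `a` is at least as good everywhere and strictly better somewhere."""
    va, vb = process_vector(a), process_vector(b)
    return all(x <= y for x, y in zip(va, vb)) and any(x < y for x, y in zip(va, vb))


frugal = Trajectory(solved=True, steps=3, cost_usd=0.10)
wasteful = Trajectory(solved=True, steps=300, cost_usd=50.0, retries=40)

print(outcome_score(frugal) == outcome_score(wasteful))  # True: identical on outcome
print(dominates(frugal, wasteful))                       # True: clearly better on process
```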
Tests are the primary verification mechanism, but tests are often incomplete. The verifier surface needs to expand beyond functional correctness to include:
None of these can be verified purely by running tests. They require richer verification signals — static analysis, LLM-as-judge, human review — each with its own failure modes.
When the Evolution Agent modifies the harness, how do you know it didn't break something else? Current AHE systems evaluate on held-out tasks, but the space of possible regressions is vast. The harness self-evolution problem: you're changing the system that tests itself. If the harness mutation changes what "passing a test" means, the evaluation becomes circular.
SyncMind's belief-state divergence problem: when multiple agents concurrently edit a shared codebase, each agent's model of the world can drift from reality. Current best step-level attribution accuracy: 14-53%. We can't even reliably tell which agent caused a failure, let alone prevent the divergence proactively.
The permission-tier model (read-only / sandbox-edit / full-access) is a start, but production agents need:
GUI agents see screenshots. Embodied agents see camera frames. Current harnesses handle text and code well but struggle with multimodal memory (which visual details to store?), visual grounding (mapping language to image regions), implicit feedback (a button looks clicked but didn't trigger the right state), and multimodal skill evolution.
Code Assistants — the clearest instantiation. The evolution tells the story: 2021 inline completion → 2023 issue-to-patch (SWE-agent) → 2025 autonomous SWE agent (Claude Code, Codex) → 2026 harness as training data. LingmaAgent (Alibaba Cloud) resolves 16.9% of in-house issues fully autonomously and 43.3% with manual intervention — in production, not on benchmarks. Cursor trains with continuous online RL on real usage traces. The harness feeds back into the model — the boundary between "the agent" and "the harness around the agent" is becoming a learnable surface.
Scientific Discovery — AI Scientist v2 uses an agentic tree search to produce workshop-quality papers, including the first AI-authored paper accepted at an ICLR workshop. The harness manages hypotheses, experiments, analyses, and laboratory protocols as executable pipelines.
GUI/OS Agents — Web pages are code. Actions are code. Evaluation is code. The most literal code-as-environment setting.
Embodied Agents — Voyager, RoboCodeX, Code-as-Policies. Code as the bridge between intent and physical execution.
Personalization — Agents that adapt recommendation policies through structured user feedback and editable preference states.