Dive into Claude Code

Chapter 0: The Problem

You start typing a function name and your editor suggests the next line. That was 2021 — autocomplete. By 2024, you could ask a chat assistant to rewrite an entire file. But neither actually does anything. They suggest. You copy-paste. You run the tests. You fix what broke.

Now imagine something different: you describe a bug, and the tool reads the code, runs the failing test, edits three files, re-runs the test, sees it pass, and tells you what it did. The tool acts in your codebase, autonomously, in a loop.

This is the shift from suggestion to agency. And it introduces architectural requirements that have no counterpart in autocomplete tools.

The fundamental shift: An autocomplete tool maps input to output. An agent maps a goal to a sequence of actions, observing intermediate results and adjusting course. This means the system needs safety boundaries (what if it runs rm -rf /?), context management (what if the conversation exceeds the model's memory?), extensibility (what if the user needs custom tools?), and persistence (what if the user closes the terminal mid-task?).

This paper by Liu et al. reverse-engineers Claude Code — Anthropic's agentic coding tool — from its publicly available TypeScript source code. The goal is not to document one product, but to map the design space that every production AI agent must navigate: recurring questions about safety posture, context management, extensibility, delegation, and persistence.

The authors identify a remarkable ratio: only about 1.6% of Claude Code's codebase is AI decision logic. The remaining 98.4% is deterministic infrastructure — the operational harness that makes agency safe, reliable, and useful. The core agent loop is trivially simple. Everything interesting lives around it.

Running example throughout: The paper traces a single task — "Fix the failing test in auth.test.ts" — through every architectural layer: the agent loop, permission gates, tool dispatch, context assembly, subagent delegation, and session persistence. We will do the same.

What is the fundamental architectural difference between a code autocomplete tool and an agentic coding system?

An agent maps goals to action sequences, acting in a loop and observing results, which requires safety, context management, persistence, and extensibility that autocomplete never needs
An agent uses a larger language model
An agent runs in the terminal instead of the IDE

Chapter 1: Five Values

Before looking at code, the paper asks a deeper question: what does the system believe matters? Every architectural decision in Claude Code traces back to five human values that its creators prioritize. These are not abstract philosophy — they produce concrete implementation choices.

1. Human Decision Authority

The human retains ultimate control. Not "the human can technically override," but "the architecture is designed so that humans can observe, approve, reject, interrupt, and audit." When Anthropic found that users approve 93% of permission prompts (approval fatigue), the response was not more warnings. It was restructuring the problem: defined sandboxed boundaries within which the agent works freely, reducing the number of decisions humans must make rather than adding more.

2. Safety, Security, and Privacy

Distinct from authority. Authority is the human's power to choose; safety is the system's obligation to protect even when that power lapses. The auto-mode threat model targets four risk categories: overeager behavior, honest mistakes, prompt injection, and model misalignment.

3. Reliable Execution

The agent does what the human actually meant, stays coherent over time, and supports verification. This spans single-turn correctness and long-horizon dependability across context boundaries, session resumption, and multi-agent delegation.

4. Capability Amplification

Approximately 27% of Claude Code-assisted tasks (per Anthropic's internal survey of 132 engineers) were work that would not have been attempted without the tool. The system enables qualitatively new workflows, not just faster existing ones. The architecture invests in deterministic infrastructure rather than decision scaffolding.

5. Contextual Adaptability

The system fits the user's specific project, tools, conventions, and skill level — and the relationship improves over time. Auto-approve rates increase from ~20% at fewer than 50 sessions to over 40% by 750 sessions. Trust is co-constructed, not fixed.

From values to principles: These five values are operationalized through thirteen design principles. For example, Human Decision Authority motivates "deny-first with human escalation" (unrecognized actions are blocked, not allowed). Capability Amplification motivates "minimal scaffolding, maximal operational harness" (don't constrain the model's choices — give it rich infrastructure to act within). Each principle answers a recurring design question that any production agent must face.

Value	Design principles
Human Authority	Deny-first evaluation, graduated trust spectrum, append-only auditable state, externalized policy, values over rigid rules
Safety	Defense in depth (layered mechanisms), deny-first defaults, reversibility-weighted risk assessment, isolated subagent boundaries
Reliability	Context as scarce resource with progressive management, append-only durable state, graceful recovery
Capability	Minimal scaffolding / maximal harness, composable extensibility, reversibility-weighted risk
Adaptability	Transparent file-based memory, composable multi-mechanism extensibility, graduated trust, externalized programmable policy

What the architecture does NOT do: It does not impose explicit planning graphs on the model's reasoning. It does not provide a single unified extension mechanism. It does not restore session-scoped trust state across resume. These absences are consistent with the principles — they are deliberate design choices, not oversights.

Why did Anthropic respond to the 93% permission-approval rate by reducing the number of decisions rather than adding more warnings?

Because approval fatigue makes interactive confirmation behaviorally unreliable as a sole safety mechanism: users habitually approve without review, so the system must maintain safety independently of human vigilance
Because showing warnings is too expensive computationally
Because 93% approval means the permissions were mostly unnecessary

Chapter 2: The Agent Loop

The core of Claude Code is a while loop. Seriously. The queryLoop() function in query.ts is an async generator that repeats: call the model, check if the response contains tool calls, execute the tools, feed results back, repeat. When the model produces only text (no tool calls), the turn is complete.

The deceptive simplicity: The loop itself is trivial. But it sits inside a massive operational harness: context assembly before every model call, a permission gate for every tool invocation, a compaction pipeline that fires pre-call, recovery mechanisms for token limits and API failures, and streaming tool execution with concurrency control. The loop is the kernel. Everything else is the operating system.

The Turn Pipeline

Each iteration of the loop follows a fixed sequence:

Settings resolution. Immutable parameters: system prompt, user context, permission callback, model config.
Mutable state. A single State object stores messages, tool context, compaction tracking, recovery counters. Updated via whole-object assignment at seven "continue sites."
Context assembly. Retrieve messages from the last compact boundary forward. Compacted content is represented by its summary, not the original.
Pre-model shapers. Five context shapers execute sequentially (Chapter 4).
Model call. Stream the response with assembled messages, system prompt, tool schemas, thinking config.
Tool-use dispatch. If the response contains tool_use blocks, route to the tool orchestration layer.
Permission gate. Every tool request passes through the permission system (Chapter 3).
Tool execution + result collection. Results are appended as tool_result messages; loop continues.
Stop condition. No tool_use blocks in the response = turn complete.
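
The turn pipeline above can be sketched as a plain loop. This is a simplified, synchronous illustration, not the actual queryLoop(): the source describes an async generator with streaming, permission gates, and compaction, all of which are omitted here. Every type and name below (Message, Model, runLoop, scriptedModel) is hypothetical.

```typescript
// Hypothetical types for illustration; not Claude Code's actual definitions.
type ToolCall = { id: string; name: string; input: string };
type ModelResponse = { text: string; toolCalls: ToolCall[] };
type Message =
  | { role: "user"; content: string }
  | { role: "assistant"; content: string; toolCalls: ToolCall[] }
  | { role: "tool_result"; toolCallId: string; content: string };

type Model = (messages: Message[]) => ModelResponse;
type ToolRunner = (call: ToolCall) => string;
type StopReason = "no_tool_use" | "max_turns";

function runLoop(
  prompt: string,
  model: Model,
  runTool: ToolRunner,
  maxTurns: number,
): { messages: Message[]; stopReason: StopReason } {
  const messages: Message[] = [{ role: "user", content: prompt }];
  for (let turn = 0; turn < maxTurns; turn++) {
    // Model call: the response may contain text, tool calls, or both.
    const response = model(messages);
    messages.push({ role: "assistant", content: response.text, toolCalls: response.toolCalls });
    // Stop condition: a response without tool calls ends the loop.
    if (response.toolCalls.length === 0) {
      return { messages, stopReason: "no_tool_use" };
    }
    // Tool execution: each result is appended and fed to the next iteration.
    for (const call of response.toolCalls) {
      messages.push({ role: "tool_result", toolCallId: call.id, content: runTool(call) });
    }
  }
  return { messages, stopReason: "max_turns" };
}

// Scripted stand-in model: requests one test run, then answers in plain text.
const scriptedModel: Model = (messages) =>
  messages.some((m) => m.role === "tool_result")
    ? { text: "The test passes now.", toolCalls: [] }
    : { text: "Running the test.", toolCalls: [{ id: "t1", name: "Bash", input: "npm test" }] };

const result = runLoop("Fix the failing test in auth.test.ts", scriptedModel, () => "1 passed", 10);
// result.stopReason is "no_tool_use"; result.messages holds 4 entries.
```

Even in this toy form, the loop body is only a few lines; everything the chapter calls the harness would wrap the model call and the tool call.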

Tool Dispatch: Concurrent Reads, Serial Writes

When the model emits multiple tool calls, the StreamingToolExecutor begins executing them as they stream in, reducing latency. Read-only operations (file reads, searches) run in parallel. State-modifying operations (shell commands, file edits) are serialized. A sibling abort controller fires when any Bash tool errors, killing other in-flight subprocesses.
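
One way to picture the read/write split is as a batching step that runs before execution. The sketch below is an assumption-laden illustration, not the StreamingToolExecutor itself: it assumes each call carries a readOnly flag and that only consecutive read-only calls are grouped, so the model's ordering around writes is preserved. The names planBatches and Batch are hypothetical.

```typescript
// Hypothetical shapes; the readOnly flag is an assumption for this sketch.
type ToolCall = { name: string; readOnly: boolean };
type Batch = { parallel: boolean; calls: ToolCall[] };

// Group consecutive read-only calls into one parallel batch. Each
// state-modifying call gets its own serial batch so ordering is preserved.
function planBatches(calls: ToolCall[]): Batch[] {
  const batches: Batch[] = [];
  for (const call of calls) {
    const last = batches[batches.length - 1];
    if (call.readOnly && last !== undefined && last.parallel) {
      last.calls.push(call);
    } else {
      batches.push({ parallel: call.readOnly, calls: [call] });
    }
  }
  return batches;
}

const plan = planBatches([
  { name: "FileRead", readOnly: true },
  { name: "Grep", readOnly: true },
  { name: "FileEdit", readOnly: false },
  { name: "FileRead", readOnly: true },
]);
// plan has three batches: [FileRead, Grep] in parallel, [FileEdit] serial, [FileRead] in parallel.
```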

Stop Conditions

Five conditions can terminate the loop:

No tool use: The model produces only text (the primary, happy-path exit).
Max turns: A configurable limit is reached.
Context overflow: The API returns prompt_too_long after recovery attempts fail.
Hook intervention: A PostToolUse hook sets hook_stopped_continuation.
Explicit abort: The abort controller signal fires (user presses Ctrl+C).

ReAct pattern: The loop follows ReAct (Yao et al., 2022): the model generates reasoning and tool invocations, the harness executes the actions, and the results feed the next iteration. Alternatives include explicit graph-based routing (LangGraph) and tree-search methods (LATS). Claude Code trades search completeness for simplicity and latency: each turn commits to one action sequence without backtracking.

Recovery Mechanisms

The loop includes several self-healing behaviors:

Max output tokens escalation: If the response hits the output cap, retry with a higher limit (up to 3 attempts).
Reactive compaction: When context is near capacity, summarize just enough to free space (fires at most once per turn).
Prompt-too-long handling: Try context collapse, then reactive compaction, then terminate.
Streaming fallback: Switch strategies if the streaming API fails.
Fallback model: Switch to an alternative model if the primary fails.
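
As an illustration of the first mechanism, max output tokens escalation can be sketched as a bounded retry. Two details are assumptions not stated in the source: the limit is doubled on each retry, and "up to 3 attempts" is read as three retries after the initial call. The names Generate and generateWithEscalation are hypothetical.

```typescript
type Generation = { text: string; hitOutputCap: boolean };
type Generate = (maxOutputTokens: number) => Generation;

function generateWithEscalation(
  generate: Generate,
  initialLimit: number,
  maxRetries: number = 3,
): { generation: Generation; limitUsed: number; retries: number } {
  let limit = initialLimit;
  let generation = generate(limit);
  let retries = 0;
  // Retry only while the response was cut off and the retry budget remains.
  while (generation.hitOutputCap && retries < maxRetries) {
    limit *= 2; // assumption: the source does not say how the limit grows
    retries++;
    generation = generate(limit);
  }
  return { generation, limitUsed: limit, retries };
}

// Stand-in model that needs 3000 output tokens to finish its answer.
const needs3000: Generate = (max) => ({ text: "...", hitOutputCap: max < 3000 });
const outcome = generateWithEscalation(needs3000, 1000);
// outcome.retries is 2 and outcome.limitUsed is 4000 (1000, then 2000, then 4000).
```

If the cap is still hit after the last retry, the sketch returns the truncated generation and leaves the decision to the caller.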

Why is the core agent loop described as "deceptively simple"?

Because the loop itself is a trivial while-loop, but it only works because of the massive surrounding infrastructure: context assembly, permission gates, compaction, recovery, and streaming execution
It has deceptive error handling
The model decides when to stop, which creates unpredictability

Chapter 3: Permission & Safety

When the model decides to run npm test to reproduce the auth test failure, the request enters a multi-layered permission pipeline. The default posture: deny or ask, never allow silently.

Seven Permission Modes

The system offers a graduated autonomy spectrum — from fully supervised to nearly autonomous:

Mode	Behavior	Autonomy
plan	Model creates a plan; execution only after user approval	Lowest
default	Standard interactive use; most operations need user approval	Low
acceptEdits	File edits + certain shell commands auto-approved; others need approval	Medium
auto	ML classifier evaluates safety; auto-approves or escalates	High
dontAsk	No prompting, but deny rules still enforced	Higher
bypassPermissions	Skips most prompts; safety-critical checks and bypass-immune rules remain	Highest
bubble	Internal-only: subagent permissions escalate to parent terminal	N/A

Seven Independent Safety Layers

A request must pass through all applicable layers. Any single layer can block it:

Tool pre-filtering: Blanket-denied tools are removed from the model's view before it can even try to invoke them.
Deny-first rule evaluation: Deny rules always beat allow rules, even when the allow rule is more specific. A broad "deny all shell commands" cannot be overridden by a narrow "allow npm test."
Permission mode constraints: The active mode determines baseline handling for requests that match no explicit rule.
Auto-mode classifier: An ML-based classifier evaluates tool safety — can deny requests the rule system would allow.
Shell sandboxing: Even approved shell commands may execute inside a sandbox restricting filesystem and network access.
Non-restoration on resume: Session-scoped permissions are not restored when resuming — users must re-grant.
Hook-based interception: PreToolUse hooks can modify permission decisions; PermissionRequest hooks can resolve asynchronously.
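
The deny-first rule in the list above can be made concrete with a small evaluator. The rule shape here (a tool name plus an optional command prefix) is a hypothetical simplification, not Claude Code's actual rule syntax, and returning "ask" when no rule matches is an assumption standing in for whatever the active permission mode would decide.

```typescript
type Decision = "allow" | "deny" | "ask";
// Hypothetical rule shape: a rule without commandPrefix matches every command for that tool.
type Rule = { effect: "allow" | "deny"; tool: string; commandPrefix?: string };
type Request = { tool: string; command: string };

function matches(rule: Rule, req: Request): boolean {
  if (rule.tool !== req.tool) return false;
  return rule.commandPrefix === undefined || req.command.startsWith(rule.commandPrefix);
}

function evaluate(rules: Rule[], req: Request): Decision {
  const matching = rules.filter((r) => matches(r, req));
  // Deny is checked first, so specificity of an allow rule never matters.
  if (matching.some((r) => r.effect === "deny")) return "deny";
  if (matching.some((r) => r.effect === "allow")) return "allow";
  return "ask"; // assumption: unmatched requests fall through to the mode's baseline
}

const rules: Rule[] = [
  { effect: "deny", tool: "Bash" }, // broad: deny all shell commands
  { effect: "allow", tool: "Bash", commandPrefix: "npm test" }, // narrow: allow npm test
];
const decision = evaluate(rules, { tool: "Bash", command: "npm test" });
// decision is "deny": the broad deny rule beats the narrower allow rule.
```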

Defense in depth, not defense in series: The layers are designed to be independent, so that if one fails, another still catches the violation. The paper notes a real tension, though: security researchers found that commands with 50+ subcommands fall back to a single generic prompt (because per-subcommand parsing caused UI freezes). Defense in depth fails when layers share failure modes.

The Auto-Mode Classifier

When enabled, the classifier loads a base system prompt, an external permissions template, and (for internal users) a separate internal template. It evaluates the proposed tool invocation against the conversation transcript and produces: allow, deny, or request manual approval.

Crucially, when a deny occurs, the system treats it as a routing signal, not a hard stop. The model receives the denial reason, revises its approach, and attempts a safer alternative in the next loop iteration.

Pre-trust initialization vulnerability: Two independently verified CVEs share a root cause: code executing during project initialization (hooks, MCP server connections, settings resolution) runs before the interactive trust dialog is presented. This reveals that the permission pipeline captures spatial ordering (which layers check what) but not temporal ordering (when each layer becomes active during startup).

Why does Claude Code use deny-first rule evaluation where deny rules always override allow rules, even more specific ones?

Because a safety boundary must be absolute: a broad safety prohibition like "deny all shell commands" must not be circumventable by a narrow allow rule, or the entire safety architecture collapses
Because it is simpler to implement than allow-first
Because most tool invocations are dangerous

Chapter 4: Context Management

By the time our "fix auth.test.ts" task has run a few iterations, the context window is filling up: the original request, npm test output, file reads, error messages, edit attempts, re-test outputs. The context window (200K–1M tokens) is the binding resource constraint — the one resource that, when exhausted, halts everything.

Claude Code does not use simple truncation. It uses a five-layer compaction pipeline that applies progressively more aggressive compression, escalating only when cheaper strategies prove insufficient.

The Five Layers

Layer 1: Budget Reduction

Always active. Enforces per-message size limits on tool results. Replaces oversized outputs with content references. Cheap, targeted, lossless for small outputs.

↓

Layer 2: Snip

Feature-gated. Lightweight trim of older history segments. Returns {messages, tokensFreed, boundaryMessage}. Quick, removes temporal depth.

↓

Layer 3: Microcompact

Feature-gated. Fine-grained compression with an optional cache-aware path. When enabled, boundary messages are deferred until after the API response so they can use actual cache deletion counts rather than estimates.

↓

Layer 4: Context Collapse

Feature-gated. A read-time projection — replaces the message array with a virtual view. The full history remains available for reconstruction, but the model sees the collapsed version. Nothing is mutated.

↓

Layer 5: Auto-compact

User-configurable. The nuclear option: a full model-generated summary via a separate compaction call. Fires only when all four previous layers are insufficient.

Lazy degradation principle: Apply the least disruptive compression first. Budget reduction costs nothing. Snip is fast. Microcompact is clever. Context collapse is virtual. Auto-compact is expensive (a separate model call). Each layer runs only if previous layers left context pressure unresolved. The graduated design contrasts with simpler systems that use single-pass truncation or a single summarization step.
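
The escalation logic can be sketched as a loop over layers ordered from cheapest to most expensive, stopping as soon as the context fits. This models only the "escalate when still over budget" behavior: it treats all five layers uniformly (the source describes budget reduction as always active), and the token savings per layer are made-up numbers. The names Layer and relievePressure are hypothetical.

```typescript
type Layer = { name: string; enabled: boolean; compact: (tokens: number) => number };

function relievePressure(
  tokens: number,
  budget: number,
  layers: Layer[],
): { tokens: number; fired: string[] } {
  const fired: string[] = [];
  let current = tokens;
  for (const layer of layers) {
    if (current <= budget) break; // pressure resolved: more expensive layers never run
    if (!layer.enabled) continue; // feature-gated layers may be switched off
    current = layer.compact(current);
    fired.push(layer.name);
  }
  return { tokens: current, fired };
}

// Illustrative savings only; real layers operate on messages, not token counts.
const layers: Layer[] = [
  { name: "budgetReduction", enabled: true, compact: (t) => t - 10000 },
  { name: "snip", enabled: true, compact: (t) => t - 20000 },
  { name: "microcompact", enabled: false, compact: (t) => t - 20000 },
  { name: "contextCollapse", enabled: true, compact: (t) => t - 40000 },
  { name: "autoCompact", enabled: true, compact: (t) => Math.floor(t / 4) },
];

const mild = relievePressure(215000, 200000, layers);
// mild.fired is ["budgetReduction", "snip"]; mild.tokens is 185000.
const severe = relievePressure(400000, 200000, layers);
// severe.fired ends with "autoCompact"; severe.tokens is 82500.
```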

Beyond Compaction: Other Context-Saving Decisions

Context pressure shapes decisions across the entire system, not just the compaction pipeline:

CLAUDE.md lazy loading: Base hierarchy loads at startup, but subdirectory files load only when the agent reads files in those directories.
Deferred tool schemas: Some tools include only their names initially; full schemas are loaded on demand via ToolSearch.
Subagent summary-only return: Subagents return only summary text, not their full conversation history.
Per-tool-result budget: Individual tool results are capped at a configurable size.
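
The per-tool-result budget can be illustrated as a cap that swaps an oversized output for a preview plus a reference. This is a minimal sketch under stated assumptions: the budget is counted in characters rather than tokens, the reference format result://... is invented for the example, and the in-memory store array stands in for wherever full outputs would actually be kept.

```typescript
type StoredResult = { ref: string; fullText: string };

function applyResultBudget(
  toolCallId: string,
  output: string,
  maxChars: number,
  store: StoredResult[],
): string {
  // Small outputs pass through unchanged, so the cap is lossless for them.
  if (output.length <= maxChars) return output;
  const ref = `result://${toolCallId}`; // hypothetical reference format
  store.push({ ref, fullText: output });
  const preview = output.slice(0, maxChars);
  return `${preview}\n[truncated ${output.length - maxChars} chars; full output at ${ref}]`;
}

const store: StoredResult[] = [];
const small = applyResultBudget("t1", "1 passed", 100, store);
const large = applyResultBudget("t2", "x".repeat(500), 100, store);
// small is returned as-is; large is a 100-character preview followed by a reference line.
```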

Context Window Assembly Order

Six layers are assembled into the context window, each loaded at different times:

System layer (startup): System prompt, environment info (git status), skill descriptions, MCP tool names, output styles.
Project config (startup / lazy): CLAUDE.md hierarchy (4 levels: managed, user, project, local), plus directory-specific files and path-scoped rules that load lazily.
Memory (startup): Auto memory entries, prefetched asynchronously.
Conversation (carry forward): History + subagent summaries, subject to compaction.
Runtime (carry forward): File reads, command outputs, tool results.
On-demand (lazy): Deferred tool definitions loaded via ToolSearch.

The transparency trade-off: Compression is largely invisible to the user. When budget reduction replaces a long tool output, when context collapse substitutes a summary, or when snip trims older history, the user has no easy way to inspect what was lost. The five-layer design achieves effective management at the cost of opacity.

Why does Claude Code use five compaction layers instead of a single summarization step?

Because five layers are faster than one
Because each layer is specialized for a different type of context pressure
Because the graduated design applies the least disruptive compression first and escalates only when cheaper strategies are insufficient; a full model-generated summary is expensive and only fires as a last resort

Chapter 5: Extensibility

Once the npm test command has cleared the permission system and Claude turns to repairing auth.test.ts, the next question is: what tools are available for the repair? The model sees not just built-in tools like BashTool and FileReadTool, but also database queries from an MCP server, a custom lint skill, and tools from an installed plugin. These arrive through four distinct mechanisms.

Why Four Mechanisms?

A natural question. The answer lies in context cost. Different kinds of extensibility consume different amounts of the bounded context window, and a single mechanism cannot span the full range without forcing unnecessary trade-offs.

Mechanism	Unique Capability	Context Cost	Insertion Point
MCP Servers	External service integration (multi-transport: stdio, SSE, HTTP, WebSocket)	High (tool schemas)	Tool pool
Plugins	Multi-component packaging + distribution (10 component types)	Medium (varies)	All three points
Skills	Domain-specific instructions + meta-tool invocation	Low (descriptions only)	Context injection
Hooks	Lifecycle interception + event-driven automation (27 event types)	Zero by default	Pre/post tool execution

The graduated cost ordering: Hooks: zero context. Skills: low (only frontmatter descriptions stay in the prompt). Plugins: medium (varies by components). MCP servers: high (full tool schemas). This means cheap extensions can scale widely without exhausting the context window, while expensive ones are reserved for cases that genuinely require new tool surfaces.

Three Injection Points

Every agent loop iteration has three phases where extensions can plug in:

assemble(): What the model sees. CLAUDE.md files, skill descriptions, MCP resources, output styles, UserPromptSubmit hooks, SessionStart hooks.
model(): What the model can reach. Built-in tools, MCP tools, SkillTool (meta-tool that launches a skill), AgentTool (meta-tool that spawns a sub-agent).
execute(): Whether and how an action runs. Permission rules, PreToolUse hooks (approve/block/rewrite), PostToolUse hooks (mutate output/inject context), Stop hooks.
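
To make the execute() point concrete, here is a conceptual sketch of a PreToolUse chain in which each hook may approve, block, or rewrite a request. Modeling hooks as in-process functions is an assumption made purely for illustration; the source does not describe how hooks are configured or invoked, and the types and the runPreToolUseHooks name are hypothetical.

```typescript
type ToolRequest = { tool: string; input: string };
type HookResult =
  | { action: "approve" }
  | { action: "block"; reason: string }
  | { action: "rewrite"; input: string };
type PreToolUseHook = (req: ToolRequest) => HookResult;

function runPreToolUseHooks(
  req: ToolRequest,
  hooks: PreToolUseHook[],
): { blocked: false; request: ToolRequest } | { blocked: true; reason: string } {
  let current = req;
  for (const hook of hooks) {
    const result = hook(current);
    // The first block wins; later hooks never see the request.
    if (result.action === "block") return { blocked: true, reason: result.reason };
    // A rewrite changes what later hooks, and finally the tool, receive.
    if (result.action === "rewrite") current = { ...current, input: result.input };
  }
  return { blocked: false, request: current };
}

const hooks: PreToolUseHook[] = [
  (req) =>
    req.input.includes("rm -rf")
      ? { action: "block", reason: "destructive command" }
      : { action: "approve" },
  (req) =>
    req.tool === "Bash" && req.input === "npm test"
      ? { action: "rewrite", input: "npm test -- auth.test.ts" }
      : { action: "approve" },
];

const hookOutcome = runPreToolUseHooks({ tool: "Bash", input: "npm test" }, hooks);
// hookOutcome is not blocked; its request input is "npm test -- auth.test.ts".
```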

Tool Pool Assembly

The assembleToolPool() function is the single source of truth for combining built-in and external tools. It follows a five-step pipeline: base tool enumeration (up to 54 tools), mode filtering, deny rule pre-filtering, MCP tool integration, and deduplication (built-ins take precedence).
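
The five steps can be sketched as a chain of filters. This is an illustration under assumptions, not the real assembleToolPool(): the availableIn field is a hypothetical stand-in for mode filtering, deny rules are reduced to a list of tool names, and applying the deny filter to MCP tools as they are merged is an assumption the text above does not confirm.

```typescript
// Hypothetical tool shape; availableIn lists the modes in which a tool is offered.
type Tool = { name: string; source: "builtin" | "mcp"; availableIn: string[] };

function assembleToolPoolSketch(
  builtins: Tool[],
  mcpTools: Tool[],
  mode: string,
  deniedNames: string[],
): Tool[] {
  const notDenied = (t: Tool) => !deniedNames.includes(t.name);
  const base = builtins.slice(); // 1. base tool enumeration
  const forMode = base.filter((t) => t.availableIn.includes(mode)); // 2. mode filtering
  const allowedBuiltins = forMode.filter(notDenied); // 3. deny rule pre-filtering
  // 4. MCP tool integration (deny filter on MCP tools is an assumption)
  const combined = allowedBuiltins.concat(mcpTools.filter(notDenied));
  // 5. deduplication: built-ins come first in the list, so they win name collisions
  const seen = new Set<string>();
  return combined.filter((t) => {
    if (seen.has(t.name)) return false;
    seen.add(t.name);
    return true;
  });
}

const pool = assembleToolPoolSketch(
  [
    { name: "Bash", source: "builtin", availableIn: ["default", "plan"] },
    { name: "Read", source: "builtin", availableIn: ["default", "plan"] },
    { name: "Edit", source: "builtin", availableIn: ["default"] },
  ],
  [
    { name: "Read", source: "mcp", availableIn: [] },
    { name: "db_query", source: "mcp", availableIn: [] },
  ],
  "plan",
  ["Bash"],
);
// pool contains the built-in Read and the MCP tool db_query, in that order.
```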

27 hook events: The hook system spans tool authorization (PreToolUse, PostToolUse, PermissionRequest, PermissionDenied), session lifecycle (SessionStart, SessionEnd, Setup, Stop), user interaction (UserPromptSubmit, Elicitation), subagent coordination (SubagentStart, SubagentStop, TeammateIdle), context management (PreCompact, PostCompact, InstructionsLoaded), and workspace events (CwdChanged, FileChanged, WorktreeCreate). Of these, 5 are safety-related; the remaining 22 serve lifecycle and orchestration.

Why does Claude Code use four extension mechanisms instead of one unified tool API?

Because different kinds of extensibility impose different costs on the bounded context window (hooks consume zero context, skills consume low, MCP consumes high), and a single mechanism cannot span this full range without forcing unnecessary trade-offs
Because four APIs are more secure than one
Because each mechanism was developed by a different team

Chapter 6: Subagent Delegation

When Claude determines that fixing the auth test requires first understanding the authentication module's structure, it can delegate this exploration to a subagent. The delegation mechanism is the Agent tool — a meta-tool that spawns an isolated child agent running its own instance of the same queryLoop().

Built-in Subagent Types

Up to six types, depending on feature flags:

Explore: Read/search-oriented. Write and edit tools are in its deny-list.
Plan: Creates structured plans; execution uses standard permissions.
General-purpose: Broadly capable, used when explicitly requested.
Claude Code Guide: Onboarding and documentation assistance.
Verification: Runs validation checks (test suites, linting).
Statusline-setup: Terminal status line configuration.

Beyond built-ins, users define custom subagents via .claude/agents/*.md files. Each file's markdown body is the agent's system prompt, and YAML frontmatter specifies tools, model, permissions, hooks, memory scope, and isolation mode. A custom agent is a fully configured, isolated sub-system.

Three Isolation Modes

Mode	How	Trade-off
Worktree	Creates a temporary git worktree — the subagent gets its own copy of the repository	Filesystem-level separation with zero external dependencies
Remote	Launches in a remote Claude Code environment (internal-only), always background	Full environment isolation but requires infrastructure
In-process	Shares the filesystem with parent but has its own isolated conversation context	Lightest weight, but file conflicts possible

Summary-only return: Subagents return only their final response text and metadata to the parent. The full subagent conversation history never enters the parent's context window. This is a critical context-conservation choice. Agent teams consume approximately 7x the tokens of a standard session — making summary-only return essential for preventing context explosion.
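
Summary-only return can be pictured as a function boundary: the child accumulates its own message list, and only one string crosses back to the parent. The sketch below replays a canned list of child steps instead of running a real loop, and every name in it (runSubagent, SubagentResult, the metadata field) is hypothetical.

```typescript
type Message = { role: "user" | "assistant" | "tool_result"; content: string };
type SubagentResult = { summary: string; metadata: { messageCount: number } };

// The child's messages live only inside this function; the caller never sees them.
function runSubagent(prompt: string, steps: Message[]): SubagentResult {
  const childMessages: Message[] = [{ role: "user", content: prompt }, ...steps];
  const assistantTurns = childMessages.filter((m) => m.role === "assistant");
  const final = assistantTurns[assistantTurns.length - 1];
  return {
    summary: final !== undefined ? final.content : "",
    metadata: { messageCount: childMessages.length },
  };
}

const parentMessages: Message[] = [
  { role: "user", content: "Fix the failing test in auth.test.ts" },
];

const child = runSubagent("Map the structure of the auth module.", [
  { role: "assistant", content: "Listing files." },
  { role: "tool_result", content: "auth/index.ts\nauth/token.ts\nauth/session.ts" },
  { role: "assistant", content: "Reading token.ts." },
  { role: "tool_result", content: "(several hundred lines of source)" },
  { role: "assistant", content: "Auth has three modules; token validation lives in token.ts." },
]);

// Only the summary enters the parent's context, as a single tool result.
parentMessages.push({ role: "tool_result", content: child.summary });
// parentMessages has 2 entries, while the child processed 6 messages.
```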

Permission Override Logic

When a subagent defines a permissionMode, the override applies unless the parent is already in bypassPermissions, acceptEdits, or auto mode — those always take precedence because they represent explicit user decisions about autonomy.

Prompt visibility is resolved in order: an explicit canShowPermissionPrompts setting is checked first, then bubble mode (always show, since prompts escalate to the parent), then the default (sync agents show prompts, async agents don't).
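
Both rules in this section reduce to short precedence checks. The sketch below is an interpretation of the prose, not the actual implementation: the function names and option shape are hypothetical, and treating a subagent with no permissionMode as inheriting the parent's mode is an assumption.

```typescript
type PermissionMode =
  | "plan" | "default" | "acceptEdits" | "auto"
  | "dontAsk" | "bypassPermissions" | "bubble";

// Parent modes that represent explicit user decisions about autonomy.
const PARENT_MODES_THAT_WIN: PermissionMode[] = ["bypassPermissions", "acceptEdits", "auto"];

function effectiveMode(parent: PermissionMode, subagent?: PermissionMode): PermissionMode {
  if (subagent === undefined) return parent; // assumption: no override means inherit
  return PARENT_MODES_THAT_WIN.includes(parent) ? parent : subagent;
}

function shouldShowPrompts(opts: {
  canShowPermissionPrompts?: boolean;
  mode: PermissionMode;
  isAsync: boolean;
}): boolean {
  // 1. An explicit setting wins over everything else.
  if (opts.canShowPermissionPrompts !== undefined) return opts.canShowPermissionPrompts;
  // 2. Bubble mode always shows, because prompts escalate to the parent.
  if (opts.mode === "bubble") return true;
  // 3. Default: sync agents show prompts, async agents do not.
  return !opts.isAsync;
}

const modeA = effectiveMode("auto", "plan"); // "auto": the parent's explicit choice wins
const modeB = effectiveMode("default", "plan"); // "plan": the subagent override applies
const promptA = shouldShowPrompts({ mode: "bubble", isAsync: true }); // true
const promptB = shouldShowPrompts({ mode: "default", isAsync: true }); // false
```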

Sidechain Transcripts

Each subagent writes its own transcript as a separate .jsonl file with a .meta.json metadata file. This sidechain design means subagent histories are preserved for debugging and auditing but do not inflate the parent's session file.

The key difference from SkillTool: SkillTool injects instructions into the current context window. AgentTool spawns a new, isolated context window. The trade-off: most subagent invocations require a self-contained prompt because the default path does not inherit the parent's conversation history.

Why do subagents return only summary text to the parent, not their full conversation history?

Because context is the binding resource constraint: agent teams already consume ~7x standard tokens, so injecting full subagent histories would cause context explosion in the parent
Because the parent cannot parse subagent tool results
Because subagent conversations are confidential

Chapter 7: Session Persistence

By now our auth-test task has accumulated a full transcript: the original prompt, tool invocations and results, compaction boundaries, and the subagent summary. The question: which artifacts are durably recorded, and what can be recovered later?

Append-Only JSONL Transcripts

Session transcripts are stored as mostly append-only JSONL files (with explicit cleanup rewrites as a rare exception). Every event is human-readable, version-controllable, and reconstructable without specialized tooling. Three channels operate independently:

Session transcripts: Conversation records (user, assistant, attachment, system messages + compaction events + metadata). One file per session, project-scoped.
Global prompt history: User prompts only, stored in history.jsonl. Supports Up-arrow and Ctrl+R navigation.
Subagent sidechains: Separate .jsonl + .meta.json per subagent.

Append-only is a deliberate choice: It favors auditability and simplicity over query power. Every event is preserved, enabling resume, fork, and audit. The cost: richer queries like "show me all tool calls that modified file X across sessions" require post-hoc reconstruction rather than direct lookup. A database-backed alternative would enable richer queries but introduce deployment dependencies and reduce transparency.
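
The trade-off is easy to see in a toy version of the format. In this sketch an in-memory string stands in for the transcript file, and the event shape (uuid, type, content) is a hypothetical simplification of whatever the real records contain. Writing is a single append; any richer question requires parsing and scanning every line.

```typescript
type TranscriptEvent = { uuid: string; type: "user" | "assistant" | "system"; content: string };

// Appending never touches earlier lines: one JSON object per line.
function appendEvent(transcript: string, event: TranscriptEvent): string {
  return transcript + JSON.stringify(event) + "\n";
}

// Reading is a full replay: split into lines, skip blanks, parse each one.
function readEvents(transcript: string): TranscriptEvent[] {
  return transcript
    .split("\n")
    .filter((line) => line.trim() !== "")
    .map((line) => JSON.parse(line) as TranscriptEvent);
}

let transcript = "";
transcript = appendEvent(transcript, { uuid: "u1", type: "user", content: "Fix the failing test in auth.test.ts" });
transcript = appendEvent(transcript, { uuid: "a1", type: "assistant", content: "Running npm test." });

const events = readEvents(transcript);
// A "query" is a post-hoc scan over the replayed events, not an indexed lookup.
const assistantOnly = events.filter((e) => e.type === "assistant");
```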

Resume, Fork, and NOT Restoring Permissions

The --resume flag rebuilds the conversation by replaying the transcript. Fork creates a new session from an existing one. But neither restores session-scoped permissions. Users must grant them again.

This is a deliberate safety-conservative choice: sessions are treated as isolated trust domains. Restoring previously granted permissions on resume would risk carrying stale trust decisions into a changed context. The architecture accepts user friction as the cost of the safety invariant that trust is always established in the current session.
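
In code terms, the invariant is that resume restores one kind of state and deliberately not the other. A minimal sketch, with a hypothetical LiveSession shape and resumeFrom function:

```typescript
type LiveSession = { messages: string[]; sessionGrants: string[] };

// Conversation history is rebuilt from the transcript; session-scoped
// grants always start empty, so trust is re-established in the new session.
function resumeFrom(transcriptMessages: string[]): LiveSession {
  return { messages: transcriptMessages.slice(), sessionGrants: [] };
}

const earlier: LiveSession = {
  messages: ["Fix the failing test in auth.test.ts", "Running npm test."],
  sessionGrants: ["Bash(npm test)"], // hypothetical grant label
};

const resumed = resumeFrom(earlier.messages);
// resumed.messages matches the earlier conversation; resumed.sessionGrants is empty.
```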

Compaction + Persistence Integration

The compact_boundary marker records headUuid, anchorUuid, and tailUuid. These UUIDs enable the session loader to patch the message chain at read time. The mostly-append design means compaction never modifies or deletes previously written transcript lines; it only appends new boundary and summary events.

File-history checkpoints: The "checkpoints" in Claude Code are file-level snapshots stored at ~/.claude/filehistory/<sessionId>/. They support --rewind-files for reverting filesystem changes — these are file snapshots, not a generic checkpoint store.

Why does resume NOT restore session-scoped permissions?

Because sessions are treated as isolated trust domains: carrying stale trust decisions into a changed context is a security risk, so the architecture requires trust to be re-established in each session
Because permissions cannot be serialized to JSON
Because it would require too much disk space

Chapter 8: OpenClaw Contrast

The paper does not just analyze Claude Code. It compares it with OpenClaw, an independent open-source AI agent system that answers the same design questions from a completely different deployment context. OpenClaw is a local-first WebSocket gateway connecting ~24 messaging surfaces (including WhatsApp, Telegram, Slack, Discord, and Signal) to an embedded agent runtime.

The comparison reveals that the design questions are stable — every agent must answer them. But the answers vary with context.

Six Dimensions of Contrast

Dimension	Claude Code	OpenClaw
System scope	Ephemeral CLI process, single repository	Persistent WebSocket gateway daemon, multi-channel control plane
Trust model	Deny-first per-action evaluation + ML classifier; 7 modes	Single trusted operator per gateway; DM pairing + allowlists; opt-in sandboxing
Agent runtime	queryLoop() async generator IS the system center	Agent runner is embedded INSIDE a larger gateway dispatch
Extension arch	4 mechanisms at graduated context costs	Manifest-first plugin system with 12 capability types + central registry
Memory/context	CLAUDE.md 4-level hierarchy; 5-layer compaction	Workspace bootstrap files (AGENTS.md, SOUL.md, etc.); dreaming for long-term memory; hybrid vector+keyword search
Multi-agent	Task-delegating subagents; worktree isolation; summary-only return	Multi-agent routing with isolated agents + sub-agent delegation with depth limits

Opposite bets: Claude Code invests in graduated per-action safety evaluation. OpenClaw invests in perimeter-level identity and access control. Claude Code treats the agent loop as the architectural center. OpenClaw treats the gateway control plane as the center, embedding the agent loop as one component. Claude Code's extensions modify a single context window. OpenClaw's plugins extend a shared gateway surface. These inversions follow from different trust models and deployment topologies.

Three Observations from the Contrast

The questions are universal. Where reasoning lives, what safety posture to adopt, how to manage context, how to structure extensibility — OpenClaw answers all of them, but from the starting point of a multi-channel personal assistant.
The systems make opposite bets on several dimensions. Per-action vs. perimeter safety. Agent loop as center vs. as component. Single-window vs. gateway-wide extensions.
They compose. OpenClaw can host Claude Code as an external coding harness via ACP (Agent Client Protocol). The design space is layered, not flat: gateway-level and task-level systems can stack.

The most fundamental divergence: Claude Code is an ephemeral CLI process that starts and ends with the terminal. OpenClaw is a persistent daemon that owns all messaging connections and coordinates clients, tools, and device nodes. This difference in system scope determines how every other design question is framed.

What is the most important insight from comparing Claude Code with OpenClaw?

That Claude Code is better
That the recurring design questions (safety posture, context management, extensibility, etc.) are universal across agent systems, but the answers vary with deployment context, and the two systems can even compose
That open-source agents are more extensible

Chapter 9: Connections

This paper maps a design space. Let's connect it to the broader landscape and surface the open questions that remain.

Six Open Directions

The Observability-Evaluation Gap. Industry surveys estimate 78% of AI failures are invisible. The architecture gives operators visibility into tool calls, hooks, and transcripts, but visibility is not evaluation: nearly 89% of teams adopt observability while only 52% adopt offline evaluation. Closing this gap likely requires generator-evaluator separation inside the harness, not just model improvements.
Cross-Session Persistence. What belongs between static instructions (CLAUDE.md) and a single session's transcript? Durable state that accumulates across sessions — learned strategies, reusable procedures, relationship evolution. The experiential tier is the natural next step.
Harness Boundary Evolution. The space of interesting harness combinations doesn't shrink as models improve — it moves. Four axes: where (local vs. cloud), when (reactive vs. proactive), what (text vs. multimodal vs. physical), with whom (single agent vs. role-differentiated teams).
Horizon Scaling. The architecture's current units of work are the turn, the session, and the subagent. What happens when autonomous work extends to days or weeks? Multi-session research programs test whether compaction, summary-only return, and append-only persistence remain sufficient.
Governance at Scale. The EU AI Act (fully applicable August 2026) and evolving copyright jurisprudence may impose external constraints on logging, transparency, and human oversight. The deny-first evaluation is internally auditable but not yet externally auditable in the forms emerging frameworks contemplate.
Long-Term Human Capability. The most provocative open question: while the architecture amplifies short-term capabilities, it offers limited mechanisms that explicitly preserve long-term human understanding, codebase coherence, or the developer pipeline. Future systems could treat this sustainability gap as a first-class design problem.

The provocative finding: A randomized controlled trial found AI tools made experienced developers 19% slower despite a perceived 20% improvement. A causal analysis of 807 repositories found code complexity increased by 40.7% after AI adoption. An EEG study found weakened neural connectivity that persisted after AI was removed. Whether architecture can respond to these signals — through comprehension-preserving surfaces, generator-evaluator separation for the human loop, or mechanisms not yet named — is the deepest open question the paper raises.

Relation to Other Architectural Patterns

LangGraph: Routes control flow through developer-defined state graphs. More scaffolding, less model autonomy. Opposite philosophy to Claude Code's minimal-scaffolding approach.
SWE-Agent / OpenHands: Docker container isolation rather than layered policy enforcement. Stronger boundaries but heavier infrastructure.
Aider: Git as the primary safety mechanism. All changes are reversible through version control. Simpler but less fine-grained.
ReAct: Claude Code's loop directly follows this pattern: reason, act, observe, repeat.
Devin: Fully autonomous with explicit planning. Heavier decision scaffolding.
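
The reason-act-observe cycle named in the ReAct entry can be sketched as follows. This is an illustration under stated assumptions, not Claude Code's implementation: `runAgentLoop`, `ModelFn`, `ToolFn`, and the scripted model are hypothetical, and the `maxTurns` bound is added here so the example always terminates.

```typescript
// Hypothetical types and names; none are taken from Claude Code's source.
type ToolRequest = { tool: string; input: string };
type ModelTurn = { text: string; toolCalls: ToolRequest[] };
type AgentMessage = { role: "user" | "assistant" | "tool"; content: string };
type ModelFn = (messages: AgentMessage[]) => ModelTurn;
type ToolFn = (input: string) => string;

function runAgentLoop(
  model: ModelFn,
  tools: Record<string, ToolFn>,
  goal: string,
  maxTurns = 10, // safety bound added for this sketch
): AgentMessage[] {
  const messages: AgentMessage[] = [{ role: "user", content: goal }];
  for (let turn = 0; turn < maxTurns; turn++) {
    const reply = model(messages); // reason
    messages.push({ role: "assistant", content: reply.text });
    if (reply.toolCalls.length === 0) break; // no action requested: the task is done
    for (const call of reply.toolCalls) {
      const tool = tools[call.tool];
      const result = tool ? tool(call.input) : `unknown tool: ${call.tool}`; // act
      messages.push({ role: "tool", content: result }); // observe
    }
  }
  return messages;
}

// A scripted stand-in for the model: request one test run, then stop.
const scriptedModel: ModelFn = (messages) =>
  messages.some((m) => m.role === "tool")
    ? { text: "The test passes now.", toolCalls: [] }
    : { text: "Running the failing test.", toolCalls: [{ tool: "run_test", input: "auth.test.ts" }] };

const transcript = runAgentLoop(
  scriptedModel,
  { run_test: (file) => `PASS ${file}` },
  "Fix the failing test in auth.test.ts",
);
```

With the scripted model, the transcript holds four messages: the user goal, the assistant's tool request, the tool result, and the assistant's final answer. Everything the model learns about the world arrives as an appended tool result, which is why the loop itself can stay this small.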

Cheat Sheet

Aspect	Claude Code
Core loop	while-true: model call → tool dispatch → result append → repeat
Design philosophy	1.6% decision logic, 98.4% deterministic infrastructure
Permission system	7 modes, deny-first, ML classifier, 7 independent safety layers
Context management	5-layer graduated compaction pipeline
Extensibility	4 mechanisms at graduated context costs (hooks → skills → plugins → MCP)
Delegation	Subagents with isolated context, summary-only return, worktree isolation
Persistence	Append-only JSONL; no permission restoration on resume
Safety posture	Deny-first with human escalation; defense in depth
Tool pool	Up to 54 built-in + MCP tools via assembleToolPool()
Key insight	The design questions are universal; the answers vary with deployment context
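
The persistence row is easy to misread, so here is a small sketch of what "append-only JSONL; no permission restoration on resume" could look like. Everything here is assumed for illustration: the record shape, the `SessionLog` class, the `resumeSession` function, and the two permission mode names are hypothetical, and an in-memory array stands in for the file on disk.

```typescript
// Hypothetical record shape and mode names; the real schema is not described in this section.
type TranscriptEntry = { seq: number; role: "user" | "assistant" | "tool"; content: string };
type PermissionMode = "default" | "elevated";
type ResumedSession = { entries: TranscriptEntry[]; permissionMode: PermissionMode };

class SessionLog {
  private lines: string[] = []; // stands in for a .jsonl file on disk

  // Entries are only ever added; earlier lines are never rewritten.
  append(entry: TranscriptEntry): void {
    this.lines.push(JSON.stringify(entry));
  }

  // One JSON object per line, which is the JSONL convention.
  serialize(): string {
    return this.lines.join("\n");
  }
}

function resumeSession(jsonl: string): ResumedSession {
  const entries = jsonl
    .split("\n")
    .filter((line) => line.trim() !== "")
    .map((line) => JSON.parse(line) as TranscriptEntry);
  // Permissions are deliberately not read back: a resumed session starts from the default mode.
  return { entries, permissionMode: "default" };
}

const sessionLog = new SessionLog();
sessionLog.append({ seq: 0, role: "user", content: "Fix the failing test in auth.test.ts" });
sessionLog.append({ seq: 1, role: "assistant", content: "Reading auth.test.ts" });

const resumed = resumeSession(sessionLog.serialize());
```

The design choice the sketch highlights is the asymmetry: conversation history survives a restart, while any permission the user granted during the earlier session has to be granted again.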

The broader lesson: Production coding agents are converging toward operating-system-like abstractions. The core loop is the kernel. The permission system, context management, tool routing, extensibility, and persistence are the OS. As frontier models converge in capability, the quality of this surrounding harness becomes the principal differentiator — validating an architecture that invests in infrastructure over decision scaffolding.

What recurring design pattern does the paper identify across all six subsystems of Claude Code (safety, context, extensibility, delegation, persistence, loop)?

Graduated layering over monolithic mechanisms: safety uses 7 layers, context uses 5 compaction stages, extensibility uses 4 mechanisms at different context costs, always trading simplicity for defense in depth
All subsystems use machine learning classifiers
All subsystems are configurable via CLAUDE.md