Research · May 18 2026 · State-of-the-art reference

The long-running agent stack in 2026.

Four lanes have emerged. Two protocols have settled. The harness moved underneath the model, and the most interesting work in 2026 is the supervision layer above both. This is the map, with the concepts underneath it shown working.

SCOPE: Multi-vendor · HORIZON: 12 months · ENTRIES: 43 platforms · SIMULATIONS: 18 interactive · UPDATED: 2026-05-18

01 · The shape

The category is no longer "AI agents." The category is the harness — the supervision, verification, and durability layer that turns a model into a worker that can be left running.

Three things settled between Anthropic's December 2025 MCP donation to the Linux Foundation and Microsoft's April 2026 unification of AutoGen and Semantic Kernel. First, the model is commodity — Claude Opus 4.7, GPT-5.5, and Gemini 3.1 Pro all perform within a few percentage points of one another on the long-horizon coding benchmarks. Second, the protocol layer went open — MCP for tool access, A2A for agent coordination, both under foundation stewardship, both with thousands of public servers. Third, the operational concerns that used to live inside each agent framework — durable execution, sandboxes, observability, HITL pauses — have separated into their own layer underneath.

10,000+ · Public MCP servers as of April 2026
$5B · Temporal valuation, Feb 2026 raise
1.86T · Temporal executions from AI-native firms
97M · Monthly MCP SDK downloads

What's left is the harness layer — the system prompts, the tool palettes, the verification logic, the decomposition strategies, the trust tiers, the budget enforcement. As Taskade put it bluntly in March: "2025 was the year of agents. 2026 is the year of harnesses." The frontier moved one level up.

LANE 01

Managed harnesses

The vendor runs the loop. You bring prompts, tools, skills. Sessions, sandboxes, middleware are theirs.

Vendor-managed runtime
LANE 02

Frameworks

You run the loop on your own substrate. The framework gives primitives — graphs, crews, handoffs, hooks.

Self-hosted runtime
LANE 03

Specialized agents

Pre-built agents for specific verbs — coding, browsing, computer use. You delegate; they do their job.

Task-specific cloud
LANE 04

Durable execution

The substrate everything else runs on. Journal replay, checkpoints, exactly-once semantics, weeks-long flows.

Workflow runtime
The lanes are not interchangeable. Pick one per concern. A typical long-running production system in 2026 picks one managed harness or framework, runs it on top of one durable-execution runtime, delegates specialized work to one or two cloud agents, and speaks MCP plus A2A. Five components, not five frameworks.

02 · The four lanes

Thirty-three platforms, grouped by the concern they solve. Each card carries an honest "watch for" line.

03 · Protocols deep dive

Two protocols, both under foundation stewardship, both wired into nearly every entry in the four lanes. They are the reason the stack works at all. Architecture, wire format, and a live message simulation for each.

MCP · Model Context Protocol

The tool and data wire.

MCP is a client-server protocol that standardizes how an agent reaches a tool or a data source. Created by Anthropic in late 2024, donated to the Linux Foundation in December 2025. Wire format is JSON-RPC 2.0 over one of three transports: stdio (local subprocess), HTTP with Server-Sent Events (remote), or HTTP (request-response). By April 2026 the public registry passed 10,000 servers.

[Diagram: Host (agent: Claude · GPT · Gemini) → MCP client (one per server) → JSON-RPC 2.0 over stdio · sse · http (request / response) → MCP server (GitHub · Postgres · wall-bridge; exposes tools, resources, prompts) → external system (API · DB · FS). The server is where capability lives.]
A host opens one client per server. Servers expose tools, resources, prompts.

The three primitives.

Tools
Callable functions. The server lists them; the client invokes by name with JSON args. github.create_issue, db.query, wall_bridge.show_card.
Resources
Readable data — files, query results, web pages. Referenced by URI. file:///repo/README.md, postgres://customers/42.
Prompts
Server-defined prompt templates the host can offer to its user. review_pr, summarize_email_thread.

Of the three, tools dominate in practice. Resources are common but underused; prompts are rare in the wild. The 2026 norm is "MCP server" meaning "server that exposes tools, optionally with resources."

Wire format simulation.

Watch the JSON-RPC 2.0 exchange between client and server. The client initializes the connection, discovers tools, invokes one, and reads the response.

[Sim 01 · MCP message flow: JSON-RPC 2.0 over stdio between Claude (MCP client) and the GitHub MCP server]

tool_call.json · JSON-RPC 2.0
// Client → Server: invoke a tool by name
{
  "jsonrpc": "2.0",
  "id": 7,
  "method": "tools/call",
  "params": {
    "name": "github.create_issue",
    "arguments": {
      "repo": "your-org/your-repo",
      "title": "Agent-test: hello-world article",
      "body": "Drafted by an automated coding agent."
    }
  }
}

// Server → Client: structured result
{
  "jsonrpc": "2.0",
  "id": 7,
  "result": {
    "content": [{
      "type": "text",
      "text": "Issue #142 created"
    }],
    "structuredContent": { "issue_number": 142, "url": "https://..." }
  }
}
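
The exchange above pairs each response with its request through the shared id. A minimal sketch of that framing, using only the standard library. The helper names make_request and match_response are hypothetical and not part of any MCP SDK, and the server reply is simulated rather than read from a real transport:

```python
import itertools
import json

_ids = itertools.count(1)  # monotonically increasing request ids


def make_request(method, params=None):
    """Build a JSON-RPC 2.0 request dict (hypothetical helper)."""
    req = {"jsonrpc": "2.0", "id": next(_ids), "method": method}
    if params is not None:
        req["params"] = params
    return req


def match_response(request, raw_response):
    """Parse a raw response and check that it answers `request`."""
    resp = json.loads(raw_response)
    if resp.get("jsonrpc") != "2.0" or resp.get("id") != request["id"]:
        raise ValueError("response does not match request id")
    if "error" in resp:
        raise RuntimeError(resp["error"].get("message", "unknown error"))
    return resp["result"]


call = make_request("tools/call", {
    "name": "github.create_issue",
    "arguments": {"repo": "your-org/your-repo", "title": "Agent-test"},
})
# Simulated server reply; a real one would arrive over stdio or HTTP.
reply = json.dumps({
    "jsonrpc": "2.0",
    "id": call["id"],
    "result": {"content": [{"type": "text", "text": "Issue #142 created"}]},
})
result = match_response(call, reply)
print(result["content"][0]["text"])
```

Matching on id rather than arrival order is what lets a client keep several calls in flight on one connection.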
A2A · Agent-to-Agent

The coordination wire.

A2A is how agents discover and delegate to other agents. Each agent publishes an Agent Card — a small JSON document describing its skills, endpoints, and auth. An orchestrator reads agent cards, picks a capable peer, and submits a Task. Tasks carry an id, a state machine (submitted → working → completed / failed / canceled), and a stream of messages. Created by Google, donated alongside MCP. Production at 150+ organizations.

[Diagram: Orchestrator (A2A client, coordinator / planner) discovers the Agent Card at /.well-known/agent.json (skills · auth · endpoints), then calls tasks/send or tasks/stream on the remote agent (A2A server). The task moves submitted → working → completed while results stream back.]
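
The task state machine described above can be sketched in a few lines. This is an illustration only: the Task class and TRANSITIONS table are hypothetical names, and allowing a task to fail or be canceled straight from submitted is an assumption rather than something the text specifies.

```python
from enum import Enum


class TaskState(Enum):
    SUBMITTED = "submitted"
    WORKING = "working"
    COMPLETED = "completed"
    FAILED = "failed"
    CANCELED = "canceled"


# Allowed transitions. Terminal states have no outgoing edges.
# Assumption: a task may fail or be canceled before work starts.
TRANSITIONS = {
    TaskState.SUBMITTED: {TaskState.WORKING, TaskState.FAILED, TaskState.CANCELED},
    TaskState.WORKING: {TaskState.COMPLETED, TaskState.FAILED, TaskState.CANCELED},
    TaskState.COMPLETED: set(),
    TaskState.FAILED: set(),
    TaskState.CANCELED: set(),
}


class Task:
    def __init__(self, task_id):
        self.id = task_id
        self.state = TaskState.SUBMITTED
        self.history = [TaskState.SUBMITTED]

    def advance(self, new_state):
        """Move to new_state, rejecting transitions the table does not allow."""
        if new_state not in TRANSITIONS[self.state]:
            raise ValueError(
                f"illegal transition {self.state.value} -> {new_state.value}"
            )
        self.state = new_state
        self.history.append(new_state)


task = Task("task-1")
task.advance(TaskState.WORKING)
task.advance(TaskState.COMPLETED)
print([s.value for s in task.history])
```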

Agent Card — the JSON that makes discovery work.

/.well-known/agent.json · A2A v0.2
{
  "name": "code-agent",
  "description": "Coding agent that opens PRs for assigned tasks.",
  "url": "https://code-agent.example.com/a2a",
  "version": "1.0.0",
  "capabilities": {
    "streaming": true,
    "pushNotifications": true,
    "stateTransitionHistory": true
  },
  "skills": [
    {
      "id": "write_article",
      "name": "Write article and open PR",
      "inputModes": ["text"],
      "outputModes": ["application/json"]
    }
  ],
  "authentication": { "schemes": ["bearer"] }
}
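
Discovery then reduces to filtering cards by skill. A small sketch, assuming the orchestrator has already fetched the cards as dicts; find_agents and the research-agent card are hypothetical, and the code-agent card is a trimmed copy of the one above:

```python
def find_agents(cards, skill_id, output_mode=None):
    """Return the endpoint URLs of agents that advertise `skill_id`."""
    urls = []
    for card in cards:
        for skill in card.get("skills", []):
            if skill.get("id") != skill_id:
                continue
            if output_mode and output_mode not in skill.get("outputModes", []):
                continue
            urls.append(card["url"])
            break  # one match per card is enough
    return urls


code_agent = {
    "name": "code-agent",
    "url": "https://code-agent.example.com/a2a",
    "skills": [{
        "id": "write_article",
        "inputModes": ["text"],
        "outputModes": ["application/json"],
    }],
}
# Hypothetical second peer, for contrast.
research_agent = {
    "name": "research-agent",
    "url": "https://research-agent.example.com/a2a",
    "skills": [{
        "id": "summarize_sources",
        "inputModes": ["text"],
        "outputModes": ["text"],
    }],
}

cards = [code_agent, research_agent]
print(find_agents(cards, "write_article", output_mode="application/json"))
```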

Task lifecycle simulation.

The simulation shows an orchestrator discovering a remote code-agent, submitting a task, receiving streaming progress, and collecting the final result.

[Sim 02 · A2A task delegation: orchestrator (A2A client) → code-agent (A2A server)]

The mental model: MCP runs vertically — agent reaches down to a tool. A2A runs horizontally — agent reaches sideways to another agent. The two together let you compose agent systems the way you compose Unix pipelines: small specialized components, wired through standard interfaces, swappable.

04 · Concepts deep dive

Twelve concepts every engineer building long-running agents should be able to draw on a whiteboard. Each one carries a definition, a claim about why it matters, a working simulation, and a real code example.

Category A · LLM concepts

Context window and compaction

LLM · A1

A context window is the bounded buffer of tokens a model can attend to in one call — Claude Opus 4.7 carries 1M, GPT-5.5 carries 1M, Gemini 3.1 Pro carries 2M as of May 2026. Once full, you either truncate, summarize, or compact. Compaction is the standard move: replace older turns with a generated summary, keep recent turns verbatim, preserve key decisions.

Why it matters. Long-running agents that cannot compact will eventually fail. Every harness either compacts automatically or forces you to do it manually.
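[Sim A1 · Context fills, then compacts: 8K window, auto-compact at 80%]

compact.py · Python
def compact_if_needed(messages, summarize, count_tokens, limit=8000, keep=8):
    """Summarize older messages once usage passes 80% of `limit`.

    `summarize` and `count_tokens` are caller-supplied functions that wrap
    a model call and a tokenizer.
    """
    # Nothing to do below the threshold or when there is no older history.
    if count_tokens(messages) < limit * 0.8 or len(messages) <= keep:
        return messages
    # Keep the last 4 turns (8 messages) verbatim; summarize the rest.
    head, tail = messages[:-keep], messages[-keep:]
    summary = summarize(head, style="key decisions + state")
    return [{"role": "system", "content": summary}] + tail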

Tool use (function calling)

LLM · A2

A model emits a structured tool call instead of free text when its training has taught it that a tool can answer the current sub-goal. The host runs the tool, returns the result as a new message, and the model continues. Tool use is the mechanism that makes LLMs agents — without it, they only speak; with it, they act. MCP standardized the wire format for tool discovery and invocation.

Why it matters. Every long-running agent is, mechanically, a loop of model decides → tool runs → model reads result → repeat. The quality of your tool palette determines the agent's competence more than the model picker does.
[Sim A2 · Tool call round-trip: model emits → tool runs → model reads]

tool_definition.py · Anthropic SDK
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

tools = [{
    "name": "get_weather",
    "description": "Get current weather for a city",
    "input_schema": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"]
    }
}]
response = client.messages.create(model="claude-opus-4-7",
                                  max_tokens=1024,  # required by the Messages API
                                  tools=tools, messages=[...])
# response.stop_reason == "tool_use" → run the tool, return tool_result

Streaming and Server-Sent Events

LLM · A3

Long-running agents stream their work back to clients via Server-Sent Events (SSE), a one-way text-over-HTTP channel that delivers events as they're produced. Anthropic, OpenAI, and Google all expose the same framing: data: {json}\n\n per event. Streaming is the difference between a 30-second blank page and a response that feels instant.

Why it matters. A weeks-long agent that buffers everything until done is unusable. SSE turns the durable execution log into a live UX surface.
[Sim A3 · SSE stream vs batched return: same total tokens, different latency profile]

stream.py · SSE consumer
with client.messages.stream(model="claude-opus-4-7",
                            max_tokens=1024,  # required by the Messages API
                            messages=[...]) as stream:
    for event in stream:
        # Only text deltas carry .text; tool-input deltas do not.
        if event.type == "content_block_delta" and event.delta.type == "text_delta":
            print(event.delta.text, end="", flush=True)
        elif event.type == "message_stop":
            break
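
Underneath any SDK, the data: {json}\n\n framing is simple enough to parse by hand. A minimal sketch that handles only data lines and comment lines; parse_sse is a hypothetical helper, and a production client would also handle the event, id, and retry fields:

```python
import json


def parse_sse(lines):
    """Yield decoded JSON payloads from an iterable of SSE text lines."""
    data_parts = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line terminates one event.
            if data_parts:
                yield json.loads("\n".join(data_parts))
                data_parts = []
        elif line.startswith(":"):
            continue  # comment line, often used as a keep-alive
        elif line.startswith("data:"):
            value = line[5:]
            if value.startswith(" "):
                value = value[1:]  # one optional leading space is stripped
            data_parts.append(value)
        # Other fields (event:, id:, retry:) are ignored in this sketch.


stream = [
    'data: {"type": "delta", "text": "Hel"}\n',
    "\n",
    ": ping\n",
    'data: {"type": "delta", "text": "lo"}\n',
    "\n",
]
events = list(parse_sse(stream))
print("".join(e["text"] for e in events))
```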
Category B · Agent concepts

The agent loop — Think · Act · Observe

AGENT · B1

Every agent is the same three-step loop. Think: the model reads context and decides on a next action — answer, or invoke a tool. Act: the host runs the chosen tool. Observe: the tool result becomes a new message in the conversation, the loop repeats. Termination comes from one of three signals: the model emits a final answer, the harness hits a budget cap, or a human intervenes.

Why it matters. Every framework and managed harness is a thin variation on this loop. If you can draw the loop you can debug any harness in the world.
[Sim B1 · The think-act-observe cycle: Think (model decides) → Act (tool runs) → Observe (result appended), auto-running on a 6s cycle]

agent_loop.py · harness pseudocode
def run_agent(task, max_iters=50, max_cost_usd=5):
    messages = [{"role": "user", "content": task}]
    spent = 0.0
    for i in range(max_iters):
        resp = think(messages)                  # model call
        spent += resp.cost                      # track cumulative spend
        if resp.stop_reason == "end_turn":
            return resp.content                 # done
        if spent > max_cost_usd:
            return escalate("budget exceeded")  # control: stop and hand off
        messages.append({"role": "assistant", "content": resp.content})
        for call in resp.tool_calls:
            result = act(call)                  # run tool
            messages.append({"role": "tool", "content": result})  # observe
    return escalate("iteration cap reached")    # budget cap on iterations

Sub-agents and the coordinator pattern

AGENT · B2

When one agent's tool palette grows past ~15 tools, output quality degrades. The fix is delegation: a coordinator agent holds a roster of specialists, picks one for each sub-task, and synthesizes results. Each sub-agent has its own context window — clean of the coordinator's reasoning — so output quality stays sharp. Claude Managed Agents calls these "session threads"; OpenAI Agents SDK calls them "handoffs"; LangGraph models them as sub-graphs.

Why it matters. Single-agent systems plateau at moderate complexity. Multi-agent with a coordinator is the only pattern that has shipped past 100K-line autonomous projects (Claude Agent Teams, May 2026: 16 agents wrote a C compiler in Rust).
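
A toy version of the coordinator pattern, with plain functions standing in for specialists. Everything here is hypothetical: in a real system each specialist would be its own model loop with its own context, and the synthesis step would itself be a model call.

```python
from concurrent.futures import ThreadPoolExecutor


# Stub specialists. Each sees only its own sub-task, never the
# coordinator's reasoning, which mirrors the clean-context property.
def tasks_agent(subtask):
    return f"tasks: {subtask}"


def code_agent(subtask):
    return f"code: {subtask}"


def research_agent(subtask):
    return f"research: {subtask}"


ROSTER = {"tasks": tasks_agent, "code": code_agent, "research": research_agent}


def coordinate(plan):
    """Dispatch (specialist, subtask) pairs in parallel and join the results."""
    with ThreadPoolExecutor() as pool:
        futures = [pool.submit(ROSTER[name], subtask) for name, subtask in plan]
        results = [f.result() for f in futures]  # preserves plan order
    # Synthesis is a plain join here; a real coordinator would summarize.
    return "\n".join(results)


print(coordinate([
    ("research", "find prior art"),
    ("code", "implement parser"),
    ("tasks", "file follow-ups"),
]))
```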
[Sim B2 · Coordinator dispatches three tasks to Tasks-agent, Code-agent, and Research-agent: parallel by default, serial here for clarity]

Done-conditions and the evaluator split

AGENT · B3

A done-condition is a testable completion criterion written before the agent starts. It is the single most cited reliability pattern in 2026: without one, agents stop too early or run forever. A separate evaluator agent (different model, lower temperature) checks the worker's output against the done-condition. Generator and evaluator are the same loop run with different prompts; add a planner and you have the planner-worker-judge triangle.

Why it matters. Self-grading is the dominant failure mode. The cheap insurance is an evaluator agent that cannot mark its own homework.
[Sim B3 · Evaluator checks the worker's output: 3 criteria, pass/fail]

verifier.py · evaluator pattern
import json
import anthropic

client = anthropic.Anthropic()

def verify(worker_output, done_condition):
    judge = client.messages.create(
        model="claude-sonnet-4-6",           # cheaper than the worker model
        max_tokens=1024,
        temperature=0,                       # low temperature for stable verdicts
        system="You judge whether output meets criteria. Return JSON only.",
        # Literal braces are doubled so the f-string leaves them intact.
        messages=[{"role": "user", "content": f"""
            Output: {worker_output}
            Criteria: {done_condition}
            Return: {{"passed": [], "failed": []}}
        """}],
    )
    return json.loads(judge.content[0].text)

HITL gate at irreversible actions

AGENT · B4

A human-in-the-loop gate is a pause inserted at any action the system cannot easily undo — merge to main, payment, deploy, send, delete. The harness intercepts the tool call, emits a requires_action event, and waits for an external tool_confirmation with allow or deny. Trust tiers dial how often the gate fires: low-trust agents trip it on every irreversible action; high-trust agents only on novel ones.

Why it matters. Catastrophic agent failure stories — wiped repos, drained accounts, mass emails — are 80%+ "the gate was not wired." Wire the gate, sleep at night.
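
The gate logic can be sketched as a wrapper around tool execution. This is an assumption-laden illustration: run_tool, the IRREVERSIBLE set, and the confirm callback are hypothetical stand-ins for a harness's requires_action event and the external tool_confirmation it waits on.

```python
IRREVERSIBLE = {"merge_to_main", "payment", "deploy", "send", "delete"}


def run_tool(call, tools, confirm, trust="low", seen=()):
    """Run a tool call, pausing for confirmation on irreversible actions.

    confirm: callable(call) -> "allow" or "deny".
    seen: irreversible tools a high-trust agent has already been approved for.
    """
    name = call["name"]
    # Low trust: gate every irreversible action. High trust: only novel ones.
    gated = name in IRREVERSIBLE and (trust == "low" or name not in seen)
    if gated:
        # A real harness would emit requires_action and block here.
        if confirm(call) != "allow":
            return {"status": "denied", "tool": name}
    result = tools[name](**call["arguments"])
    return {"status": "ok", "tool": name, "result": result}


tools = {
    "deploy": lambda env: f"deployed to {env}",
    "read_file": lambda path: f"contents of {path}",
}
asked = []


def deny_all(call):
    asked.append(call["name"])
    return "deny"


deploy = {"name": "deploy", "arguments": {"env": "prod"}}
read = {"name": "read_file", "arguments": {"path": "README.md"}}
denied = run_tool(deploy, tools, deny_all)
safe = run_tool(read, tools, deny_all)
trusted = run_tool(deploy, tools, deny_all, trust="high", seen={"deploy"})
```

The reversible read never reaches the confirm callback, and the high-trust run skips the gate only because deploy is already in its seen set.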
[Sim B4 · Agent run, gate, resume: the run pauses at the approval gate until a human approves or denies]
Category C · Distributed systems concepts

Journal-based replay (durable execution)

DIST · C1

A durable workflow runtime writes every completed step to an append-only journal before moving on. If the process crashes mid-workflow, on restart it replays the journal — completed steps return cached results without re-execution; the workflow resumes at the first incomplete step. This gives exactly-once semantics across crashes, deploys, and pauses, and it is the reason a Temporal workflow can pause for six months and resume cleanly.

Why it matters. Without durable execution, "long-running" tops out at the lifetime of one process. With it, agents can run for weeks. Temporal raised $300M at $5B in Feb 2026 because this primitive is now load-bearing for AI infrastructure.
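[Sim C1 · Crash mid-workflow, then resume from journal: six steps, crash at step 4]

workflow.py · Temporal-style
from datetime import timedelta
from temporalio import workflow

# get_spec, make_plan, run_agent, apply_outcome are @activity.defn
# functions defined elsewhere.
@workflow.defn
class ProcessTaskWorkflow:
    @workflow.run
    async def run(self, task_id: str):
        opts = {"start_to_close_timeout": timedelta(minutes=10)}  # a timeout is required
        spec = await workflow.execute_activity(get_spec, task_id, **opts)
        plan = await workflow.execute_activity(make_plan, spec, **opts)
        result = await workflow.execute_activity(run_agent, plan, **opts)
        await workflow.execute_activity(apply_outcome, result, **opts)
        # Each activity is journaled. Crash → replay returns cached results
        # for completed activities; resumes at the first incomplete one.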
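
To make the replay mechanism concrete without a workflow runtime, here is a toy journal in plain Python. It is a sketch of the idea only, not how Temporal is implemented: the Journal class, the file format, and the simulated crash are all assumptions.

```python
import json
import os
import tempfile


class Journal:
    """Append-only log of completed steps, reloaded on restart."""

    def __init__(self, path):
        self.path = path
        self.entries = {}
        if os.path.exists(path):
            with open(path) as f:
                for line in f:
                    rec = json.loads(line)
                    self.entries[rec["step"]] = rec["result"]

    def run(self, step, fn, *args):
        if step in self.entries:
            return self.entries[step]  # replay: cached, no re-execution
        result = fn(*args)
        with open(self.path, "a") as f:
            f.write(json.dumps({"step": step, "result": result}) + "\n")
            f.flush()
            os.fsync(f.fileno())  # make the record durable before moving on
        self.entries[step] = result
        return result


calls = []  # records which steps actually executed


def step(name, fail=False):
    if fail:
        raise RuntimeError("simulated crash")
    calls.append(name)
    return name.upper()


def workflow(journal, crash_at=None):
    names = ["get_spec", "make_plan", "run_agent", "apply_outcome"]
    return [journal.run(n, step, n, n == crash_at) for n in names]


path = os.path.join(tempfile.mkdtemp(), "journal.jsonl")
try:
    workflow(Journal(path), crash_at="run_agent")  # first run crashes at step 3
except RuntimeError:
    pass
result = workflow(Journal(path))  # restart: replays 2 steps, executes 2
print(calls)
```

Note the gap this toy leaves open: a crash after a step executes but before its record is written would re-run that step, which is one reason external writes still need idempotency keys.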

Idempotency keys

DIST · C2

An idempotency key is a client-generated unique id attached to a request, used by the server to recognize and dedupe retries. Without one, the network's "I didn't get a response" ambiguity causes double-charges, duplicate emails, duplicate writes. With one, the server can safely return the cached result of the original call instead of re-executing.

Why it matters. Long-running agents retry. Tools they call also retry. Every external write your agent does — payment, email, ticket — must accept an idempotency key, or your weekend has a story in it.
[Sim C2 · Five clicks, with key vs without: same intent, different outcomes. Each click is one retry, and the simulation tallies dollars charged on each side.]

charge.py · Stripe-style
import stripe

stripe.Charge.create(
    amount=5000,
    currency="usd",
    source="tok_visa",
    idempotency_key=f"task-{task_id}-charge",  # client-supplied, unique per intent
)
# Stripe stores the first response by key for 24h.
# Retries with the same key return the cached response — no double charge.
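
The server side of the same contract, as a toy. PaymentServer is a hypothetical in-memory stand-in that shows the dedupe logic only; it omits expiry, persistence, and the check that a reused key carries the same parameters.

```python
class PaymentServer:
    """In-memory server that dedupes charges by idempotency key."""

    def __init__(self):
        self._responses = {}  # idempotency key -> first response
        self._next_id = 1
        self.total_charged = 0

    def charge(self, amount, idempotency_key=None):
        # A known key means this is a retry: return the stored response.
        if idempotency_key is not None and idempotency_key in self._responses:
            return self._responses[idempotency_key]
        self.total_charged += amount
        response = {"charge_id": f"ch_{self._next_id}", "amount": amount}
        self._next_id += 1
        if idempotency_key is not None:
            self._responses[idempotency_key] = response
        return response


with_key, without_key = PaymentServer(), PaymentServer()
for _ in range(5):  # five retries of the same intent
    with_key.charge(5000, idempotency_key="task-42-charge")
    without_key.charge(5000)

print(with_key.total_charged, without_key.total_charged)  # prints 5000 25000
```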

The saga pattern · compensating transactions

DIST · C3

Distributed transactions don't exist in modern systems — there's no two-phase commit across Stripe, GitHub, and your warehouse. The saga pattern replaces them: a workflow of forward steps, each with a paired compensating step that undoes it. If step 4 fails, run compensations for steps 3, 2, 1 in reverse. Saga is how durable execution handles partial failure without locks.

Why it matters. Every agent workflow that touches more than one external system needs saga thinking. The alternative is a corrupt state that nobody notices until the customer calls.
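
A minimal saga runner, sketched under simple assumptions: steps run in sequence, each compensation is itself reliable, and the step names (create order, charge, ship) are illustrative. run_saga is a hypothetical helper, not a library API.

```python
def run_saga(steps):
    """Run (name, action, compensation) steps; undo completed ones on failure.

    Returns (succeeded, log).
    """
    done = []  # completed steps, in order
    log = []
    for name, action, compensate in steps:
        try:
            action()
        except Exception as exc:
            log.append(f"failed {name}: {exc}")
            # Compensate completed steps in reverse order.
            for prev_name, comp in reversed(done):
                comp()
                log.append(f"undid {prev_name}")
            return False, log
        log.append(f"did {name}")
        done.append((name, compensate))
    return True, log


def ship():
    raise RuntimeError("no stock")


ok, log = run_saga([
    ("create_order", lambda: None, lambda: None),  # compensation: cancel order
    ("charge", lambda: None, lambda: None),        # compensation: refund
    ("ship", ship, lambda: None),
])
print(ok, log)
```

The failing step gets no compensation of its own because it never completed; only the steps before it are undone.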
[Sim C3 · Three forward steps, step 3 fails, rollback runs: create order · charge · ship → refund · cancel]
Category D · Cloud concepts

Sandbox isolation

CLOUD · D1

An agent's tools run in a sandbox, typically a container or microVM (Docker, gVisor, Firecracker) with no host access, restricted network, and a clean filesystem destroyed at session end. The sandbox is the only thing standing between a hallucinated rm -rf / and your production. Cursor Background Agents destroy the VM after each session; Anthropic's Managed Agents use ephemeral containers; E2B and Daytona productize the pattern as a service.

Why it matters. Once you let an agent run shell commands, sandbox boundaries are the only safety property you have. Treat the boundary as an invariant, not a configuration.
[Sim D1 · Inside vs outside the sandbox boundary: escape attempts are blocked and the boundary holds]

Inside sandbox (agent has access): /workspace/repo · /workspace/tasks.db · tcp://github.com:443 · pip · npm · git
Outside (blocked): ~/.ssh/id_rsa · localhost:5432 (host db) · /var/log/auth.log · other tenants' files

Vault-mediated credentials

CLOUD · D2

An agent that needs to call an external API gets a short-lived session token, not the real API key. The token authorizes a proxy in front of the secret vault; the proxy injects the actual credential at call time. The agent never sees, prints, or logs the real key — and if its context is leaked, the worst the attacker has is a token already revoked. Anthropic's Managed Agents implement exactly this pattern.

Why it matters. Agent context is leaky — prompt injection, tool result echoes, debug logs. A vault-mediated credential is the difference between a leaked context and a leaked production database.
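
The proxy pattern in miniature. This is a sketch under stated assumptions, not a description of any vendor's implementation: CredentialProxy, the dict-backed vault, and fake_api are hypothetical, and a real deployment would put the proxy in a separate process from the agent.

```python
import secrets
import time


class CredentialProxy:
    """Holds real credentials; hands out short-lived session tokens instead."""

    def __init__(self, vault, upstream):
        self._vault = vault        # secret name -> real credential
        self._upstream = upstream  # callable(api_key, request) -> response
        self._sessions = {}        # token -> (secret name, expiry time)

    def issue_token(self, secret_name, ttl_seconds=300):
        token = secrets.token_urlsafe(16)
        self._sessions[token] = (secret_name, time.monotonic() + ttl_seconds)
        return token

    def revoke(self, token):
        self._sessions.pop(token, None)

    def call(self, token, request):
        entry = self._sessions.get(token)
        if entry is None or time.monotonic() > entry[1]:
            raise PermissionError("invalid, expired, or revoked session token")
        secret_name, _expiry = entry
        # The real credential is injected here and never returned to the agent.
        return self._upstream(self._vault[secret_name], request)


seen_keys = []  # what the external API received


def fake_api(api_key, request):
    seen_keys.append(api_key)
    return {"ok": True, "echo": request}


proxy = CredentialProxy({"github": "real-secret-key"}, fake_api)
token = proxy.issue_token("github")              # all the agent ever holds
response = proxy.call(token, {"path": "/issues"})
proxy.revoke(token)                              # a leaked token is now useless
```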
[Sim D2 · Agent (session token) → Proxy (auth check) → Vault (fetch credential) → External API (call): the agent never sees the credential]

05 · Voice agents

Voice changes one constraint — latency — and that single change ripples through every architectural choice. The lanes from section 02 still apply. The protocols from section 03 still work. But each component now has to be picked against a 500-millisecond budget.

The shape is familiar: a model, tools, a session. The differences are in the I/O. Audio in becomes audio out. The user expects to interrupt, be interrupted, hold pauses naturally, and never wait more than a beat. A weeks-long agent is allowed to think for ten seconds. A voice agent gets half a second before the human notices and gives up.

Cascaded vs end-to-end.

Most voice agents are cascaded — a chain of specialized models. Voice activity detection finds when the user stops talking. Speech-to-text transcribes. The LLM thinks. Text-to-speech synthesizes. Audio plays back. OpenAI's Realtime API is the prominent exception — a single speech-to-speech model that processes audio tokens directly, no separate STT or TTS. Cascaded gives more control and easier debugging; speech-to-speech gives lower latency and more natural prosody. Sierra, ElevenLabs, Vapi, Retell, LiveKit, and Pipecat all cascade. OpenAI Realtime ships end-to-end. Google's Gemini Live and Anthropic's experimental voice mode sit between the two — partially fused stages.

The latency budget, traced.

The classical 2026 cascaded pipeline sits around 690ms time-to-first-audio on a good day. Swapping in semantic VAD and Cartesia Sonic cuts that roughly in half. Watch each stage's contribution.

[Sim E1 · Cascaded pipeline, time to first audio: target under 800ms for a natural-feeling response]

VAD · ~120ms
STT · ~150ms
LLM · ~280ms
TTS · ~100ms
Audio out · ~40ms
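
The budget arithmetic, spelled out with the per-stage figures from this section's simulation. The helper names are hypothetical; the point is only that time-to-first-audio in a cascade is the sum of its stages, so the largest stage is where optimization pays off first.

```python
# Approximate per-stage latencies (ms) for the cascaded pipeline above.
STAGES_MS = {"VAD": 120, "STT": 150, "LLM": 280, "TTS": 100, "Audio out": 40}


def time_to_first_audio(stages):
    """Stages run back to back, so their latencies add."""
    return sum(stages.values())


def shares(stages):
    """Fraction of the total budget spent in each stage."""
    total = time_to_first_audio(stages)
    return {name: ms / total for name, ms in stages.items()}


total = time_to_first_audio(STAGES_MS)
slowest = max(STAGES_MS, key=STAGES_MS.get)
print(total, slowest)  # prints 690 LLM
for name, share in shares(STAGES_MS).items():
    print(f"{name:10s} {share:.0%}")
```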

Turn-taking and barge-in.

The hardest problem in voice is knowing when the user is done talking. Silence-based VAD is fast but wrong on ambiguous pauses ("I think… yeah, let's do that"). Semantic VAD reads the partial transcript and predicts end-of-utterance from meaning — much harder to fool, slightly slower. Barge-in is the inverse problem: the user starts speaking while the agent is mid-response. The agent must immediately stop playing TTS, cancel the in-flight model call, discard any pending audio, and reprocess from the new user input. Both behaviors are table stakes by 2026.

[Sim E2 · Turn-taking, normal vs barge-in: user lane and agent lane over a 4-second window]

The voice stack today.

Ten platforms — speech-to-speech models, cascaded agent frameworks, telephony bridges, low-latency TTS layers, and emotion-aware variants. The choice is part latency, part control over each stage, part which surface you ship to (phone, browser, embedded device).

/01
Sierra

Sierra

Vertical AI for customer service — voice and text agents that handle support, retention, and account management end-to-end. Bret Taylor's company. Ships to ADT, Sonos, Wayfair, Weight Watchers. Custom RLHF on frontier base models for vertical-specific behavior.

Best for: Enterprise customer service automation where consistency and brand voice matter.
Watch for: Enterprise-only contracts. Long sales cycles. Not for self-serve.
RLHF · TWILIO · ANTHROPIC · ENTERPRISE
/02
ElevenLabs

ElevenLabs Conversational AI

Full agent platform on top of ElevenLabs' TTS. 5,000+ voices, 30+ languages, sub-150ms time-to-first-audio on Flash v2.5. Strong DX with one-line agent creation. Used in consumer apps, gaming, and IVR replacement.

Best for: Multilingual or voice-quality-critical agents, especially consumer-facing.
Watch for: TTS is the headline; STT and LLM are partner integrations. Cost scales with audio minutes.
MULTILINGUAL · TTS · WEBRTC · TELEPHONY
/03
OpenAI

Realtime API

End-to-end speech-to-speech via GPT-5o-realtime. No separate STT or TTS — audio tokens pass straight through one model. Sub-300ms total round-trip in good conditions. The lowest-latency option that exists in May 2026.

Best for: Real-time conversational agents where every millisecond matters.
Watch for: OpenAI-only. Less control over intermediate stages, so debugging is harder.
SPEECH-TO-SPEECH · WEBSOCKET · GPT-5O · REALTIME
/04
Vapi

Vapi

Voice agent platform with telephony built in. Pay-per-minute pricing, drag-and-drop agent builder, SIP and PSTN out of the box. Strong fit for outbound and inbound phone agents.

Best for: Telephony-first voice agents (appointment setting, lead qualification, surveys).
Watch for: Less control than rolling your own stack. Per-minute pricing adds up at scale.
SIP · PSTN · WEBRTC · TURN-DETECT
/05
Retell AI

Retell AI

Voice agent platform with strong telephony partnerships and customer-facing voice infrastructure. Competes head-on with Vapi. Lower-latency turn detection is their differentiator.

Best for: High-volume phone agents (call centers, scheduling, reminders).
Watch for: Smaller ecosystem than Vapi. Self-hosting story still maturing.
TELEPHONY · SEMANTIC-VAD · WEBRTC · DEEPGRAM
/06
LiveKit

LiveKit Agents

OSS agent framework on LiveKit's real-time infrastructure — the same backend OpenAI Realtime uses. Pipeline-style composition with VAD, STT, LLM, TTS as swappable components. Self-hostable end-to-end.

Best for: Teams wanting full control over the pipeline, especially with custom models.
Watch for: More setup than turnkey options. You operate the infrastructure.
WEBRTC · OSS · PIPELINE · SELF-HOSTABLE
/07
Cartesia

Cartesia Sonic

Ultra-low-latency TTS — sub-90ms time-to-first-audio. State-space-model architecture instead of transformer. Often plugged in as the TTS layer in custom stacks (LiveKit, Pipecat, Vapi, Retell all support it).

Best for: Any stack where TTS latency is the bottleneck.
Watch for: TTS only, so pair it with STT and LLM separately. Smaller voice library than ElevenLabs.
SSM · SUB-100MS · WEBSOCKET · STREAMING
/08
Deepgram

Deepgram Voice Agents

STT pioneer's full agent platform. Strong telephony partnerships, very fast STT (Nova-3 family), and integrated LLM routing. Used by Bay Area startups and enterprise IVR replacements.

Best for: Stacks where STT accuracy is the priority (accents, noise, multilingual).
Watch for: STT-led. LLM and TTS are integrations, not their core competence.
NOVA-3 · STT · TELEPHONY · TWILIO
/09
Daily.co

Pipecat

OSS voice agent framework from Daily.co. Composable pipelines, plug-in any STT/LLM/TTS, runs over WebRTC or telephony. Strong developer DX and active community. Python-first.

Best for: Engineers building custom voice agents who want OSS and pipeline control.
Watch for: You build and operate. Smaller community than the LangChain ecosystem.
PYTHON · OSS · WEBRTC · PIPELINE
/10
Hume

Hume EVI 3

Emotional voice intelligence — reads emotion in the user's voice, modulates response prosody to match. Useful for mental health, coaching, and any product where emotional resonance matters. EVI 3 launched May 2025.

Best for: Apps where the user's emotional state should influence the response.
Watch for: Niche. Emotional inference is probabilistic and can be wrong in stressful conversations.
EMOTION · PROSODY · MULTIMODAL · WEBSOCKET
The architectural choice is rarely "which voice agent platform." It's which surface, which latency budget, which level of control over each pipeline stage. Consumer apps with brand-voice needs gravitate toward ElevenLabs and OpenAI Realtime. Enterprise customer-experience automation lives at Sierra. Telephony-first workloads fit Vapi or Retell. Custom stacks built for full control compose LiveKit Agents or Pipecat with Cartesia and Deepgram. Almost none of these require you to train a model — they require you to compose the right pipeline against the right budget.

06 · Prompt, fine-tune, or reinforce

Eight ways to change how a model behaves. Most product companies should only reach for the first four. The other four are where vertical AI lives — Sierra, Harvey, Cursor — and where the foundation labs spend their compute budget.

The training pyramid.

Three layers, each doing different work. Foundation labs train the base model. Vertical AI companies fine-tune for their domain. Product companies compose with prompts and tools. Knowing which layer you're on tells you which paradigms are even available to you, and which ones would be a category error to reach for.

Foundation labs
Pretrain · RLHF · Constitutional · RLVR for code & math
Anthropic · OpenAI · Google DeepMind · Meta · DeepSeek
Vertical AI
Fine-tune · DPO · domain-specific RLHF · distillation
Sierra · Harvey · Augment · Cursor · Cognition
Product companies
Prompts · few-shot · RAG · skills · tool composition
Almost every team shipping AI features

The pyramid is also a cost ladder. Composing with prompts and skills costs hundreds of dollars per month in model spend. Fine-tuning a small open model costs on the order of ten thousand dollars for the training run, plus serving. Training a frontier model costs hundreds of millions of dollars and a custom data center. The economics are why almost no one should train.

The spectrum, end to end.

Eight paradigms from "no weight changes" to "full reinforcement learning with an environment simulator." Each differs in what signal it needs, what changes underneath, and when to reach for it.

[Spectrum: no weights change → full RL with environment]

The RL feedback loop.

Reinforcement learning is the agent's natural training paradigm. An agent acts on the environment. The environment responds with a new state and a reward. The agent's policy — its weights — updates to maximize expected future reward. Repeat for millions of episodes. The structure is identical across game-playing AI, robotics, code-writing reasoning models, and Sierra's customer-retention agents. Only the environment and the reward function change.

[Sim F2 · RL feedback loop, 20 training episodes, policy gradient: the agent (policy π_θ) sends action a_t to the environment; the environment returns reward r_t and state s_{t+1}; the policy update is θ ← θ + α ∇_θ J(π_θ). Average reward climbs from 0.10 as the policy moves on from the untrained θ_0.]
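
The loop above can be run end to end on a toy problem. The sketch below applies the REINFORCE update, θ ← θ + α · r · ∇_θ log π_θ(a), to a two-armed bandit with a softmax policy. The environment (arm 1 always pays 1.0, arm 0 pays nothing), the learning rate, and the episode count are all assumptions chosen to keep the example small.

```python
import math
import random


def softmax(prefs):
    m = max(prefs)  # subtract the max for numerical stability
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]


def train(episodes=200, alpha=0.5, seed=0):
    """REINFORCE on a two-armed bandit; returns final action probabilities."""
    rng = random.Random(seed)
    rewards = [0.0, 1.0]  # toy environment: only arm 1 pays
    theta = [0.0, 0.0]    # policy parameters (action preferences)
    for _ in range(episodes):
        probs = softmax(theta)
        action = 0 if rng.random() < probs[0] else 1  # agent acts
        r = rewards[action]                           # environment responds
        for k in range(2):
            # Gradient of log softmax: indicator(k == action) - probs[k]
            grad_log = (1.0 if k == action else 0.0) - probs[k]
            theta[k] += alpha * r * grad_log          # policy update
    return softmax(theta)


probs = train()
print(f"P(arm 0) = {probs[0]:.3f}, P(arm 1) = {probs[1]:.3f}")
```

Only the environment and the reward function would change for a harder task; the act, observe, update cycle stays the same.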

Verifiable reward vs preference reward.

The most important distinction in 2026 RL is where the reward comes from. RLVR is having the better year because verifiable signals — tests passing, programs running, math being correct — give you objective gradients without a learned reward model in the loop. RLHF still rules for open-ended generation where "good" is subjective.

RLVR · Verifiable Rewards

Reward comes from an objective check on the output.

Task: implement add(a, b)
Output: def add(a,b): return a+b
Verifier: run pytest
Reward: +1.0 · all tests pass

Used by: OpenAI o-series, DeepSeek-R1, Claude reasoning, every serious code agent.
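
A verifier of this kind is just a function from output to reward. A minimal sketch, assuming the candidate is already a Python callable and the test cases are (args, expected) pairs; verifiable_reward is a hypothetical name, and a real pipeline would run untrusted model output in a sandbox rather than in-process.

```python
def verifiable_reward(candidate, test_cases):
    """Return 1.0 if the candidate passes every test case, else 0.0."""
    for args, expected in test_cases:
        try:
            if candidate(*args) != expected:
                return 0.0
        except Exception:
            return 0.0  # a crash counts as a failure
    return 1.0


ADD_TESTS = [((1, 2), 3), ((-1, 1), 0), ((0, 0), 0)]


def good_add(a, b):
    return a + b


def bad_add(a, b):
    return a - b


print(verifiable_reward(good_add, ADD_TESTS))  # prints 1.0
print(verifiable_reward(bad_add, ADD_TESTS))   # prints 0.0
```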

RLHF · Preference Rewards

Reward comes from a learned model trained on human preferences.

Task: haiku about debugging
Output A: A blinking cursor weeps…
Output B: my code is broken…
Reward model: picks A
Reward: A: +0.7 · B: −0.3

Used by: base model alignment, Sierra's domain customization, every consumer-facing assistant.

When training actually kicks in.

A practical flow for product companies. Most of the questions resolve to "no" — that's the entire point.

  1. Can prompts + skills + RAG get you 90% of the way?

    If yes: ship. Iterate. Most product agents fit here. The rest of this list is a distraction for you.

    → stay in tier 3, the product layer
  2. Is per-call inference cost or latency a bottleneck?

    If yes: distill. Train a small open model (Llama 4 8B, Qwen 3 7B) to mimic your prompt-engineered behavior. Don't train from scratch — distillation is cheap because the labels come from your large model's outputs.

    → distillation, sometimes SFT
  3. Is behavior brittle across runs even with strong prompts?

    If yes: SFT on 500–1,000 carefully labeled examples. Dramatic consistency gains for a one-time engineering cost in the thousands, not the millions.

    → supervised fine-tuning
  4. Is "good" subjective and only obvious post-hoc?

    If yes: DPO on preference pairs collected from your real traffic — chosen-vs-rejected. Cheaper and simpler than full RLHF, and the 2024-2025 papers showed it's competitive on quality.

    → direct preference optimization
  5. Is the signal automatically verifiable?

    If yes: RLVR with a verifier function. This is how every reasoning model in 2026 was trained — let the model attempt, let the verifier judge, update the policy. Works for code, math, structured-output tasks, anything with a deterministic grader.

    → RL with verifiable rewards
  6. Is it a multi-step task with sparse rewards in a real environment?

    If yes: deep RL with PPO or GRPO. Be warned: you probably don't have an environment simulator, and building one usually costs more than the training run. Reserved for game-playing, robotics, and a handful of specialized agentic benchmarks.

    → full deep RL (rare in product work)
For most product companies: you don't need to train anything. Composition with prompts, skills, MCP tools, and RAG over your private corpus gets you to 90% of useful agent behavior with no machine learning team and no GPU spend. Reach for training only when composition demonstrably can't get you there — when prompt brittleness causes real failures at scale, when latency demands a smaller model, or when the signal you need is verifiable and frontier models still don't reliably hit it. When training does become necessary, the cheapest first move is usually a fine-tuned embedding model, not a fine-tuned generator. Compose first. Train only when you can't compose your way there.