The long-running agent stack

Voice changes one constraint — latency — and that single change ripples through every architectural choice. The lanes from section 02 still apply. The protocols from section 03 still work. But each component now has to be picked against a 500-millisecond budget.

The shape is familiar: a model, tools, a session. The differences are in the I/O: audio goes in and audio comes out. The user expects to interrupt, be interrupted, hold pauses naturally, and never wait more than a beat. A weeks-long agent is allowed to think for ten seconds. A voice agent gets half a second before the human notices and gives up.

Cascaded vs end-to-end.

Most voice agents are cascaded: a chain of specialized models. Voice activity detection (VAD) detects when the user stops talking. Speech-to-text (STT) transcribes. The LLM thinks. Text-to-speech (TTS) synthesizes. Audio plays back. OpenAI's Realtime API is the prominent exception: a single speech-to-speech model that processes audio tokens directly, with no separate STT or TTS. Cascaded gives more control and easier debugging; speech-to-speech gives lower latency and more natural prosody. Sierra, ElevenLabs, Vapi, Retell, LiveKit, and Pipecat all cascade. Google's Gemini Live and Anthropic's experimental voice mode sit between the two, with partially fused stages.
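The cascaded chain described above can be sketched as a list of swappable stages. This is a minimal illustration, not any vendor's API: the `Stage` type, the stub functions, and the idea of treating UTF-8 text frames as "audio" are all assumptions made so the example runs without a model.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Sequence, Tuple


@dataclass(frozen=True)
class Stage:
    """One link in the cascade (hypothetical type, for illustration only)."""
    name: str
    run: Callable[[Any], Any]


def trim_trailing_silence(frames: List[bytes]) -> List[bytes]:
    """Stub VAD: treat empty frames at the end as the user's closing silence."""
    end = len(frames)
    while end > 0 and frames[end - 1] == b"":
        end -= 1
    return frames[:end]


def transcribe(frames: List[bytes]) -> str:
    """Stub STT: the 'audio' here is just UTF-8 text split into frames."""
    return b"".join(frames).decode("utf-8")


def think(transcript: str) -> str:
    """Stub LLM: echo the transcript back."""
    return f"You said: {transcript}"


def synthesize(reply: str) -> bytes:
    """Stub TTS: encode the reply as bytes standing in for audio."""
    return reply.encode("utf-8")


CASCADE: Tuple[Stage, ...] = (
    Stage("vad", trim_trailing_silence),
    Stage("stt", transcribe),
    Stage("llm", think),
    Stage("tts", synthesize),
)


def run_cascade(
    frames: List[bytes], stages: Sequence[Stage] = CASCADE
) -> Tuple[Any, List[str]]:
    """Feed each stage's output into the next and record the order of stages."""
    value: Any = frames
    trace: List[str] = []
    for stage in stages:
        value = stage.run(value)
        trace.append(stage.name)
    return value, trace


if __name__ == "__main__":
    audio, trace = run_cascade([b"hel", b"lo", b"", b""])
    print(trace, audio)
```

Because every stage is a separate entry, any one of them can be replaced or logged on its own, which is the control and debugging advantage the cascaded design trades against latency.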

The latency budget, traced.

The classic cascaded pipeline of 2026 sits around 690ms time-to-first-audio on a good day. Swapping in semantic VAD and Cartesia Sonic cuts that roughly in half. Every stage adds its own share to the total.
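To make the budget concrete, here is a small arithmetic sketch. Only the 690ms total and the "roughly in half" result come from the text; the per-stage numbers below are illustrative assumptions chosen to add up to those figures, not measurements of any product.

```python
from typing import Dict, List

# ASSUMED per-stage latencies in milliseconds. Only the 690 ms total is
# taken from the text; the split across stages is hypothetical.
CLASSIC_MS: Dict[str, int] = {
    "endpointing": 300,      # silence-based VAD waits out a pause
    "stt": 80,
    "llm_first_token": 150,
    "tts_first_audio": 140,
    "transport": 20,
}

# Same pipeline with two stages swapped (again, assumed numbers):
# semantic VAD ends the turn sooner, and a faster TTS starts audio sooner.
OPTIMIZED_MS: Dict[str, int] = {
    **CLASSIC_MS,
    "endpointing": 80,
    "tts_first_audio": 40,
}


def time_to_first_audio(budget: Dict[str, int]) -> int:
    """Stages run in sequence, so the total is the plain sum."""
    return sum(budget.values())


def report(budget: Dict[str, int]) -> List[str]:
    """One line per stage with its share of the total."""
    total = time_to_first_audio(budget)
    return [f"{stage:<16} {ms:>4} ms {ms / total:6.1%}" for stage, ms in budget.items()]


if __name__ == "__main__":
    for label, budget in (("classic", CLASSIC_MS), ("optimized", OPTIMIZED_MS)):
        print(label, time_to_first_audio(budget), "ms")
        print("\n".join(report(budget)))
```

Under these assumed numbers the two swaps remove 320ms, which shows why endpointing and TTS are the first stages to tune: the STT and LLM figures are untouched.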

Turn-taking and barge-in.

The hardest problem in voice is knowing when the user is done talking. Silence-based VAD is fast but wrong on ambiguous pauses ("I think… yeah, let's do that"). Semantic VAD reads the partial transcript and predicts end-of-utterance from meaning — much harder to fool, slightly slower. Barge-in is the inverse problem: the user starts speaking while the agent is mid-response. The agent must immediately stop playing TTS, cancel the in-flight model call, discard any pending audio, and reprocess from the new user input. Both behaviors are table stakes by 2026.
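The barge-in sequence above (stop playback, cancel the in-flight call, discard pending audio, reprocess) can be sketched with `asyncio`. This is a simplified illustration under stated assumptions: the `VoiceAgent` class and its methods are hypothetical, the model and TTS are replaced by a slow generator of text chunks, and playback is modeled as a queue rather than a real audio device.

```python
import asyncio
from typing import List, Optional


class VoiceAgent:
    """Hypothetical agent: a response task fills a queue of pending audio."""

    def __init__(self) -> None:
        self.playback = asyncio.Queue()  # audio chunks waiting to be played
        self.response_task: Optional[asyncio.Task] = None

    async def _respond(self, user_text: str) -> None:
        """Stand-in for the LLM + TTS call: emits one chunk per word."""
        for word in f"reply to {user_text}".split():
            await asyncio.sleep(0.01)  # simulated generation delay
            await self.playback.put(word.encode("utf-8"))

    def start_response(self, user_text: str) -> None:
        self.response_task = asyncio.create_task(self._respond(user_text))

    async def barge_in(self, new_user_text: str) -> None:
        # Steps 1 and 2: stop producing speech by cancelling the in-flight call.
        task = self.response_task
        if task is not None and not task.done():
            task.cancel()
            try:
                await task
            except asyncio.CancelledError:
                pass  # expected: the old response was interrupted
        # Step 3: discard audio that was generated but not yet played.
        while not self.playback.empty():
            self.playback.get_nowait()
        # Step 4: reprocess from the new user input.
        self.start_response(new_user_text)


async def demo() -> List[bytes]:
    agent = VoiceAgent()
    agent.start_response("what is my balance")
    await asyncio.sleep(0.025)  # the agent is now mid-response
    await agent.barge_in("actually, cancel that")
    await agent.response_task  # let the new response finish
    chunks: List[bytes] = []
    while not agent.playback.empty():
        chunks.append(agent.playback.get_nowait())
    return chunks


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

Awaiting the cancelled task before draining the queue matters: it guarantees the old response cannot enqueue another chunk after the flush. A real system would also need to halt audio already handed to the output device, which this sketch does not model.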

The voice stack today.

Ten platforms — speech-to-speech models, cascaded agent frameworks, telephony bridges, low-latency TTS layers, and emotion-aware variants. The choice is part latency, part control over each stage, part which surface you ship to (phone, browser, embedded device).

/01

Sierra

Vertical AI for customer service — voice and text agents that handle support, retention, and account management end-to-end. Bret Taylor's company. Ships to ADT, Sonos, Wayfair, Weight Watchers. Custom RLHF on frontier base models for vertical-specific behavior.

Best for: Enterprise customer service automation where consistency and brand voice matter.

Watch for: Enterprise-only contracts. Long sales cycles. Not for self-serve.

RLHF · TWILIO · ANTHROPIC · ENTERPRISE

/02

ElevenLabs

ElevenLabs Conversational AI

Full agent platform on top of ElevenLabs' TTS. 5,000+ voices, 30+ languages, sub-150ms time-to-first-audio on Flash v2.5. Strong DX with one-line agent creation. Used in consumer apps, gaming, and IVR replacement.

Best for: Multilingual or voice-quality-critical agents, especially consumer-facing.

Watch for: TTS is the headline; STT and LLM are partner integrations. Cost scales with audio minutes.

MULTILINGUAL · TTS · WEBRTC · TELEPHONY

/03

OpenAI

Realtime API

End-to-end speech-to-speech via GPT-5o-realtime. No separate STT or TTS — audio tokens pass straight through one model. Sub-300ms total round-trip in good conditions. The lowest-latency option that exists in May 2026.

Best for: Real-time conversational agents where every millisecond matters.

Watch for: OpenAI-only. Less control over intermediate stages, so debugging is harder.

SPEECH-TO-SPEECH · WEBSOCKET · GPT-5O · REALTIME

/04

Vapi

Voice agent platform with telephony built in. Pay-per-minute pricing, drag-and-drop agent builder, SIP and PSTN out of the box. Strong fit for outbound and inbound phone agents.

Best for: Telephony-first voice agents, such as appointment setting, lead qualification, and surveys.

Watch for: Less control than rolling your own stack. Per-minute pricing adds up at scale.

SIP · PSTN · WEBRTC · TURN-DETECT

/05

Retell AI

Voice agent platform with strong telephony partnerships and customer-facing voice infrastructure. Competes head-on with Vapi. Lower-latency turn detection is their differentiator.

Best for: High-volume phone agents, such as call centers, scheduling, and reminders.

Watch for: Smaller ecosystem than Vapi. Self-hosting story still maturing.

TELEPHONY · SEMANTIC-VAD · WEBRTC · DEEPGRAM

/06

LiveKit

LiveKit Agents

OSS agent framework on LiveKit's real-time infrastructure — the same backend OpenAI Realtime uses. Pipeline-style composition with VAD, STT, LLM, TTS as swappable components. Self-hostable end-to-end.

Best for: Teams wanting full control over the pipeline, especially with custom models.

Watch for: More setup than turnkey options. You operate the infrastructure.

WEBRTC · OSS · PIPELINE · SELF-HOSTABLE

/07

Cartesia

Cartesia Sonic

Ultra-low-latency TTS — sub-90ms time-to-first-audio. State-space-model architecture instead of transformer. Often plugged in as the TTS layer in custom stacks (LiveKit, Pipecat, Vapi, Retell all support it).

Best for: Any stack where TTS latency is the bottleneck.

Watch for: TTS only, so pair it with STT and LLM separately. Smaller voice library than ElevenLabs.

SSM · SUB-100MS · WEBSOCKET · STREAMING

/08

Deepgram

Deepgram Voice Agents

STT pioneer's full agent platform. Strong telephony partnerships, very fast STT (Nova-3 family), and integrated LLM routing. Used by Bay Area startups and enterprise IVR replacements.

Best for: Stacks where STT accuracy is the priority (accents, noise, multilingual).

Watch for: STT-led. LLM and TTS are integrations, not their core competence.

NOVA-3 · STT · TELEPHONY · TWILIO

/09

Daily.co

Pipecat

OSS voice agent framework from Daily.co. Composable pipelines, plug in any STT/LLM/TTS, runs over WebRTC or telephony. Strong developer experience and active community. Python-first.

Best for: Engineers building custom voice agents who want OSS and pipeline control.

Watch for: You build and operate. Smaller community than the LangChain ecosystem.

PYTHON · OSS · WEBRTC · PIPELINE

/10

Hume

Hume EVI 3

Emotional voice intelligence — reads emotion in the user's voice, modulates response prosody to match. Useful for mental health, coaching, and any product where emotional resonance matters. EVI 3 launched May 2025.

Best for: Apps where the user's emotional state should influence the response.

Watch for: Niche. Emotional inference is probabilistic and can be wrong in stressful conversations.

EMOTION · PROSODY · MULTIMODAL · WEBSOCKET
