Interaction Models — Engineermaxxing

00 Concept constellation

Every concept in this lesson and how they connect — the territory before the map.

This lesson unpacks a single blog post into 14 interconnected concepts across four topic clusters. The constellation below shows how they relate. Amber nodes are concepts taught in this lesson. Blue-ringed nodes are covered in other Engineermaxxing lessons. Click any node to jump to the chapter where it's explained.

[Interactive constellation. Legend: Architecture, Inference, Collaboration, Evaluation, Has existing lesson]

00 Concept index

Every concept you'll encounter, sorted by cluster.

Architecture

Micro-Turn Architecture

200ms continuous I/O chunks replace discrete turns. Ch 02.

Architecture

Encoder-Free Early Fusion

dMel + hMLP + Flow head co-trained with one transformer. Ch 03.

Architecture

Streaming Sessions

Persistent GPU memory for append-only inference. Ch 04.

Architecture

dMel Tokenization

Discretized mel-filterbank channels as speech tokens. Ch 03.

Architecture

Flow Matching Decoder

CNF-based audio generation in 200ms chunks. Ch 03.

Inference

Batch Invariance

Same output regardless of other concurrent requests. Ch 05.

Inference

FP Non-Associativity

Why (a+b)+c ≠ a+(b+c) in floating-point. Ch 05.

Inference

Sequence Parallelism

Distributing sequential ops across devices to save memory. Ch 04.

Collaboration

Turn-Based Bottleneck

Why current AI interfaces cripple collaboration. Ch 01.

Collaboration

Interaction + Background Split

Real-time presence + deep async reasoning. Ch 06.

Collaboration

Grounding Principles

Copresence, contemporality, simultaneity. Ch 01.

Evaluation

TimeSpeak & CueSpeak

Benchmarks for time-aware and simultaneous speech. Ch 07.

Evaluation

Visual Proactivity

RepCount, ProactiveVideoQA, Charades benchmarks. Ch 07.

Evaluation

Task-Length Horizon

METR’s measurement of autonomous AI capability. Ch 08.

00 Reading guide

This lesson is structured in layers. You can read it linearly or skip around.

Chapters 01–02: The why — what’s broken about current AI interfaces and what micro-turns change.
Chapters 03–05: The how — the architectural and inference innovations that make 200ms heartbeats possible.
Chapters 06–08: The so what — the two-model design, new benchmarks, and where this is heading.

If you already know transformer architectures and mel spectrograms, skip to Chapter 02. If you only care about the benchmarks, jump to Chapter 07.

01 The interface problem

AI gets smarter every quarter. But talking to it still feels like sending telegrams.

Imagine you’re pair-programming with a colleague. You’re both looking at the same screen. You point at a function and start saying “this part is—” and they immediately see what you’re pointing at, nod, and say “yeah, the edge case with null inputs.” You didn’t finish your sentence. You didn’t need to.

Now imagine the same interaction with an AI. You type a message. You wait. The AI reads your entire message. It generates a response. You read the entire response. You type another message. This is the turn-based bottleneck: the gap between what human collaboration is and what human-AI interaction allows.

Anthropic’s own frontier model card is candid about this: “when used in an interactive, synchronous, ‘hands-on-keyboard’ pattern, the benefits of the model were less clear.”

The problem isn’t intelligence. The models are smart enough. The problem is the channel — a text box that freezes perception during generation and forces strict turn-taking.

01 Grounding principles

In their 1991 work on grounding in communication, Herbert Clark and Susan Brennan catalogued the properties of a medium that make human communication work. Three of them matter most here. These aren’t nice-to-haves. They’re load-bearing:

Principle

Copresence

Both parties have interactive access to the same shared content. You can see what I’m pointing at. I can see your reaction to what I just said.

Principle

Contemporality

Information is received as it’s produced. You hear my words as I speak them, not after I finish my paragraph. Feedback is real-time.

Principle

Simultaneity

Both parties produce and receive information concurrently. I can nod while you talk. I can say “mm-hmm” without interrupting your flow.

Current LLM interfaces violate all three. The model can’t see your screen while it types. It doesn’t receive your input while generating. And it certainly can’t nod along. The Thinking Machines blog frames this as the core bottleneck: humans are pushed out of the loop not because the task demands it, but because the interface can’t keep up.

01 The turn-based trap

Let’s make the problem concrete. Here’s what a timeline looks like for a turn-based model versus a continuously interactive one.

[Interactive timeline: turn-based vs. continuously interactive. Legend: Human speaks/types, AI processes, AI responds, Concurrent I/O]

In the turn-based model, there are dead zones — periods where neither party is doing useful work because the protocol demands one finish before the other starts. In the interactive model, the AI’s perception never stops. It hears you while it talks. It sees your screen while it thinks. The dead zones collapse.

The key insight: making AI interactive isn’t UX polish. It’s an architectural decision that changes what the model can perceive, when it can act, and how humans can collaborate with it. Bolt-on solutions (voice activity detection, external orchestrators) don’t scale with intelligence. For interactivity to scale with intelligence, it must be part of the model itself.

01 The bitter lesson, restated

Rich Sutton’s Bitter Lesson (2019) says that general methods that leverage computation ultimately win over hand-engineered solutions, and by a large margin. Thinking Machines extends this to interactivity: bolting a voice-activity detector onto a text model is the hand-engineered approach. Training a natively multimodal model that processes 200ms chunks is the general approach.

The claim is simple and testable: scaling a model should make it smarter AND a better collaborator. If interactivity lives outside the model (in a VAD module, an orchestrator, a turn-management system), then scaling the model only makes it smarter. The collaboration quality stays flat.

This is the premise behind everything that follows.

02 What is a micro-turn?

The heartbeat of the interaction model: 200 milliseconds of continuous I/O.

In a standard LLM, the model processes your entire input, then generates its entire response. Input and output are discrete, sequential phases. A micro-turn throws this away.

Instead, the model processes the world in 200ms chunks. Every 200 milliseconds, it ingests whatever input has arrived — a fragment of speech, a video frame, a partial text — and decides what to do: continue listening, start speaking, adjust its current output, or stay silent. Both input and output are treated as continuous streams that interleave in a single token sequence.

Think of it like breathing. You don’t stop breathing to talk, and you don’t stop hearing to breathe. The 200ms chunk is the model’s respiratory cycle — a constant rhythm of perceiving and acting that never pauses.

Concretely, the model’s token sequence looks like this:

Single interleaved sequence $$[\text{audio}_{0\text{-}200\text{ms}},\ \text{video}_{t_0},\ \text{response tokens},\ \text{audio}_{200\text{-}400\text{ms}},\ \text{video}_{t_1},\ \dots]$$

Every 200ms chunk produces a “bag of embeddings” from all active modalities
The transformer processes this single interleaved sequence autoregressively
Output tokens (text, audio, tool calls) are generated between input chunks
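The interleaving described above can be sketched as a plain data structure. This is a toy illustration, not the model's actual tokenizer: the `Segment` type, the `interleave` helper, and the string "tokens" are all hypothetical names introduced here. It only shows the ordering rule: inputs for window k, then any output produced before window k+1.

```python
from dataclasses import dataclass
from typing import Dict, List

CHUNK_MS = 200  # micro-turn length from the text


@dataclass
class Segment:
    kind: str        # "audio_in", "video_in", or "output"
    start_ms: int    # start of the 200ms window this segment belongs to
    tokens: List[str]


def interleave(input_chunks: List[Dict[str, List[str]]],
               outputs_per_chunk: List[List[str]]) -> List[Segment]:
    """Flatten per-window inputs and outputs into one ordered sequence."""
    sequence: List[Segment] = []
    for k, (chunk, outputs) in enumerate(zip(input_chunks, outputs_per_chunk)):
        start = k * CHUNK_MS
        sequence.append(Segment("audio_in", start, chunk["audio"]))
        if chunk.get("video"):
            sequence.append(Segment("video_in", start, chunk["video"]))
        if outputs:  # an empty list means the model stayed silent this window
            sequence.append(Segment("output", start, outputs))
    return sequence


# Two windows: the model backchannels in the first and stays silent in the second.
chunks = [
    {"audio": ["a0", "a1"], "video": ["v0"]},
    {"audio": ["a2", "a3"], "video": ["v1"]},
]
outputs = [["mm-hmm"], []]
seq = interleave(chunks, outputs)
kinds = [s.kind for s in seq]
print(kinds)
```

Staying silent is represented here by emitting nothing for that window, which is one simple way to model the "continue listening" decision.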

02 Why 200 milliseconds?

This number isn’t arbitrary. It sits at the intersection of three constraints:

Perceptual latency

Human conversational turn-taking gaps average 200–300ms. Respond faster than 200ms and it feels like interruption. Respond slower than 500ms and it feels laggy. 200ms is the sweet spot for perceived real-time.
Psycholinguistics: Stivers et al. 2009

Audio frame size

Speech at 16kHz means 200ms = 3,200 samples. That’s enough to capture a full phoneme but not so much that latency builds up. Standard audio frames (20ms) are too small for meaningful semantic processing; full utterances are too large for real-time response.
Signal processing: window-length trade-off between context and latency

Compute budget

At 200ms, you have roughly 150–180ms of actual compute time (after network overhead). On modern GPUs with a 12B-active MoE, that’s enough for a meaningful forward pass including both prefill of the new chunk and several decode steps for output generation.
Systems constraint: Thinking Machines inference team
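The numbers in these three constraints are easy to check by hand. The snippet below is only that arithmetic, using the figures quoted above; the variable names are ours.

```python
SAMPLE_RATE_HZ = 16_000   # speech sample rate quoted in the text
CHUNK_MS = 200            # micro-turn length
FRAME_MS = 20             # "standard audio frame" size quoted in the text

samples_per_chunk = SAMPLE_RATE_HZ * CHUNK_MS // 1000
chunks_per_second = 1000 // CHUNK_MS
frames_per_chunk = CHUNK_MS // FRAME_MS

# The text gives 150-180ms of usable compute per chunk, so the implied
# network and transport overhead is whatever remains of the 200ms.
compute_ms_low, compute_ms_high = 150, 180
overhead_ms_range = (CHUNK_MS - compute_ms_high, CHUNK_MS - compute_ms_low)

print(samples_per_chunk, chunks_per_second, frames_per_chunk, overhead_ms_range)
```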

[Interactive chunk-size slider, default 200ms. Legend: Audio input chunk, Video frame, Model output, Dead time]

Drag the slider to see what happens at different chunk sizes. At 50ms, the overhead dominates — the model spends more time managing chunks than processing them. At 500ms, latency is perceptible. 200ms is the Goldilocks zone.

02 Stream interleaving

The critical architectural insight is that input and output share a single sequence. This is fundamentally different from systems that have separate input and output pipelines connected by an orchestrator.

Here’s what this means in practice:

Property	Turn-Based	Micro-Turn
Input processing	Complete utterance, then process	200ms chunks, processed incrementally
Output generation	Full response generated at once	Tokens between input chunks
Interruption	External VAD detects speech, kills generation	Model hears input during generation, decides to yield
Simultaneous speech	Impossible — one direction at a time	Native — input and output streams overlap
Visual awareness	Snapshot at turn start	Continuous video frames every 200ms
Tool use timing	Tools called after response generation	Tools called concurrently during conversation

The simultaneous speech capability is especially striking. Because both audio input and audio output are part of the same token sequence, the model can literally speak while listening. This enables interaction patterns that are impossible with turn-based systems: live translation, real-time commentary, counting reps during exercise.

02 What this replaces

Most “real-time” voice AI systems today use a harness-based architecture:

A Voice Activity Detector (VAD) listens for speech onset/offset
When speech ends, audio is sent to a speech-to-text module
The text goes to the LLM for response generation
The response text goes to a text-to-speech module
Audio plays back to the user

Each component is separately engineered, and the seams show. The VAD can’t understand context — it just detects energy. It can’t tell if you paused to think or finished your sentence. The LLM can’t hear how you said something — only what you said. And none of these components improve when you make the LLM larger.

The micro-turn architecture replaces all five components with a single transformer that natively processes audio, video, and text in an interleaved sequence. No VAD. No STT. No TTS pipeline. The model IS the pipeline.

This is the “scaling with intelligence” argument: every component in the harness is a ceiling. Make the LLM 10x smarter and your VAD is still the same dumb energy detector. With micro-turns, making the model smarter makes every aspect of interaction better — turn-taking, interruption handling, emotional awareness, timing, all of it.

03 Modality fusion overview

Three modalities, three lightweight encoders, one shared transformer. No frozen encoders. No separate decoders. Everything co-trained from scratch.

The dominant approach to multimodal AI uses large pre-trained encoders: a Whisper-class model for audio, a CLIP/SigLIP model for vision, a vocoder for speech output. Each is frozen, and the LLM learns to interface with their embedding spaces.

This approach has a problem for real-time interaction: large encoders add latency. A Whisper encoder needs the full audio segment before producing embeddings. A CLIP encoder processes a 224×224 image through dozens of layers. When your budget is 200ms, these pre-processing steps eat most of it.

Thinking Machines takes a different approach: encoder-free early fusion. Instead of large frozen encoders, they use lightweight embedding layers that convert raw signals into token-like representations, then feed everything into one transformer.

[Diagram. Legend: Audio path (dMel), Video path (hMLP), Text path, Shared transformer]

03 dMel: discretized mel spectrograms

How do you turn speech into tokens? The standard approaches are:

Codec tokenization (like EnCodec, SoundStream): Train a neural audio codec to compress audio into discrete codes. Problem: the codec is domain-specific and fails on out-of-domain audio.
Large encoder (like Whisper): Process audio through a big pre-trained model. Problem: high latency, frozen representations.

dMel (Bai et al., 2024) is a beautifully simple alternative. Take the mel spectrogram — a standard audio representation that’s been used for decades — and discretize each frequency channel into intensity bins.

dMel encoding $$\text{dMel}(t, f) = \text{bin}\left(\frac{\text{mel}(t, f) - \mu_f}{\sigma_f},\ B\right)$$

$\text{mel}(t, f)$ — mel spectrogram value at time $t$, frequency band $f$
$\mu_f, \sigma_f$ — per-channel mean and standard deviation (from training data)
$B$ — number of intensity bins (typically 256)
Output: an integer per frequency band per time step — a discrete token

Worked example: 16kHz audio, 200ms chunk = 3,200 samples. Compute a mel spectrogram with 80 mel bands and a 10ms hop → 20 frames × 80 bands = 1,600 values. Discretize each into one of 256 bins → 1,600 discrete tokens. A lightweight embedding layer maps these to the transformer’s hidden dimension. Total encoder parameter count: essentially just the embedding matrix. Compare this to Whisper small, whose 244M parameters cover encoder and decoder together, or to Whisper large at roughly 1.5B.

Why does this work? Because the mel spectrogram already captures the structure of speech (formants, pitch, energy) in a format that the transformer can learn to interpret. The “encoder” is just a lookup table. All the heavy lifting happens inside the shared transformer.

The key advantage for real-time interaction: dMel is streaming-native. Each 200ms chunk produces its tokens independently. There’s no need to buffer a full utterance before encoding.
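A minimal sketch of the encoding formula above, in NumPy. The formula does not spell out what `bin` does, so this example assumes the normalized value is clipped to ±3 standard deviations and quantized uniformly; that choice, the function name `dmel_tokens`, the random stand-in for a real mel spectrogram, and the hidden size of 16 are all illustrative assumptions rather than details from the dMel paper or from Thinking Machines.

```python
import numpy as np


def dmel_tokens(mel, mu, sigma, num_bins=256, clip=3.0):
    """Quantize a (frames, bands) mel chunk into integer bins.

    Assumption: `bin` is uniform quantization of the z-score over [-clip, clip].
    """
    z = (mel - mu) / sigma                 # per-band normalization
    z = np.clip(z, -clip, clip)
    scaled = (z + clip) / (2 * clip)       # map to [0, 1]
    ids = np.floor(scaled * num_bins).astype(np.int64)
    return np.minimum(ids, num_bins - 1)   # scaled == 1.0 falls in the top bin


rng = np.random.default_rng(0)
mel = rng.normal(size=(20, 80))    # stand-in for 20 frames x 80 bands (one 200ms chunk)
mu = mel.mean(axis=0)              # in practice these statistics come from training data
sigma = mel.std(axis=0)

tokens = dmel_tokens(mel, mu, sigma)
embedding = rng.normal(size=(256, 16))   # the whole "encoder": one lookup table
embedded = embedding[tokens]             # shape (20, 80, 16)
print(tokens.shape, tokens.min(), tokens.max(), embedded.shape)
```

Because each chunk is quantized from its own values plus fixed statistics, nothing here needs to wait for later audio, which is the streaming property the text highlights.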

03 hMLP: vision patches

For video, Thinking Machines uses a similar philosophy: minimal encoding, maximal transformer processing. Each video frame is divided into a 40×40 grid of patches (1,600 per frame), and each patch is encoded by an hMLP (hierarchical MLP) from Touvron et al. (2022).

An hMLP is much simpler than a ViT (Vision Transformer). It has no self-attention layers — it’s a stack of feedforward layers that processes each patch independently. The cross-patch reasoning happens inside the shared transformer, not in a separate vision encoder.

Property	CLIP/SigLIP Encoder	hMLP Patches
Parameters	300M–2B	~10M
Self-attention	Yes (12–48 layers)	None
Cross-patch reasoning	Inside vision encoder	Inside shared transformer
Latency per frame	5–30ms	<1ms
Training	Frozen (pre-trained on 400M–12B images)	Co-trained from scratch
Tokens per frame	256–576 (after resampling)	1,600 (40×40)

The trade-off is clear: more tokens per frame (1,600 vs ~256) but much faster encoding and fully co-trained representations. For a real-time system processing frames every 200ms, the sub-millisecond encoding latency is critical. The extra tokens are manageable because the model has 200ms of compute budget per chunk.
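To make "each patch is processed independently" concrete, here is a simplified sketch. It uses a plain two-layer MLP per patch rather than the hierarchical stem from Touvron et al., and the 320×320 frame size, layer widths, and function names are assumptions chosen so that a 40×40 grid falls out cleanly.

```python
import numpy as np


def patchify(frame, grid=40):
    """Split an (H, W, C) frame into grid*grid flattened patches."""
    h, w, c = frame.shape
    ph, pw = h // grid, w // grid
    patches = frame.reshape(grid, ph, grid, pw, c).transpose(0, 2, 1, 3, 4)
    return patches.reshape(grid * grid, ph * pw * c)


def patch_mlp(patches, w1, b1, w2, b2):
    """Encode every patch with the same MLP; no patch sees any other patch."""
    hidden = np.maximum(patches @ w1 + b1, 0.0)   # ReLU
    return hidden @ w2 + b2


rng = np.random.default_rng(0)
frame = rng.random((320, 320, 3))                  # assumed frame size
patches = patchify(frame)                          # (1600, 8*8*3)

w1, b1 = rng.normal(size=(192, 32)) * 0.1, np.zeros(32)
w2, b2 = rng.normal(size=(32, 16)) * 0.1, np.zeros(16)
video_tokens = patch_mlp(patches, w1, b1, w2, b2)  # (1600, 16)
print(patches.shape, video_tokens.shape)
```

Any cross-patch reasoning would have to happen later, in the shared transformer, since this encoder never mixes information between rows.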

03 Flow matching decoder

The input side is solved: dMel for audio in, hMLP for video in, standard token embedding for text in. But how does the model produce audio?

Thinking Machines uses a flow head based on Flow Matching (Lipman et al., 2022). This is a generative model that learns to transform noise into mel spectrograms by learning a vector field along a probability path.

Flow matching objective $$\mathcal{L}_{\text{FM}} = \mathbb{E}_{t, x_0, x_1}\left[\|v_\theta(t, x_t) - (x_1 - x_0)\|^2\right]$$

$x_0$ — noise sample (from simple prior)
$x_1$ — target mel spectrogram chunk
$x_t = (1-t)x_0 + tx_1$ — linear interpolation at time $t \in [0,1]$
$v_\theta$ — learned vector field that pushes $x_0$ toward $x_1$
For a given pair $(x_0, x_1)$, the regression target is simply $x_1 - x_0$: a straight line from noise to signal

Why flow matching instead of a standard vocoder or diffusion model?

Speed: Flow matching with optimal transport paths converges in fewer ODE steps than diffusion. For 200ms audio chunks, you need fast generation.
Quality: Straighter paths accumulate less discretization error when integrated with only a few ODE steps, which tends to help sample quality at a fixed step budget compared with the curved paths of standard diffusion.
Co-training: The flow head is jointly trained with the transformer, so the model learns audio representations that are easy to decode.

Why the loss works: At training time, you sample a random interpolation point $t$ between noise ($t=0$) and target mel ($t=1$). The model predicts the direction to move from the current interpolated point. The optimal direction is just the difference $x_1 - x_0$ — a straight line. During inference, you start from noise and integrate the learned vector field through a few ODE steps to arrive at the target mel spectrogram.
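The training target and the inference loop can be sketched in a few lines of NumPy. This is not the flow head itself: there is no neural network here. An "oracle" field that already knows $x_1$ stands in for a trained $v_\theta$, purely to show that the loss is zero at the target $x_1 - x_0$ and that a few Euler steps along that field land on $x_1$. The function names and shapes are illustrative.

```python
import numpy as np


def fm_loss(v_pred, x0, x1):
    """Flow matching loss for a batch: mean squared distance to x1 - x0."""
    return float(np.mean(np.sum((v_pred - (x1 - x0)) ** 2, axis=-1)))


def euler_sample(v_fn, x0, steps=4):
    """Integrate dx/dt = v(t, x) from t=0 to t=1 with fixed-step Euler."""
    x, dt = x0.copy(), 1.0 / steps
    for i in range(steps):
        x = x + dt * v_fn(i * dt, x)
    return x


rng = np.random.default_rng(0)
x0 = rng.normal(size=(8, 80))            # noise samples
x1 = rng.normal(size=(8, 80)) + 2.0      # stand-in for target mel frames
t = rng.uniform(size=(8, 1))
x_t = (1 - t) * x0 + t * x1              # the interpolated points a network would see

oracle = x1 - x0                         # the regression target from the objective
loss_oracle = fm_loss(oracle, x0, x1)
loss_zero_field = fm_loss(np.zeros_like(x0), x0, x1)
sample = euler_sample(lambda t_, x_: oracle, x0, steps=4)
print(loss_oracle, loss_zero_field > 0, np.allclose(sample, x1))
```

A real model would predict the field from $(t, x_t)$ and the transformer's conditioning, so its paths would only approximate this straight line.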

03 The bag of embeddings

All three modality paths — dMel audio tokens, hMLP video patches, text tokens — produce embeddings in the same hidden dimension. Every 200ms, the active modality tokens are concatenated into a “bag of embeddings” and fed into the transformer.

[Diagram. Legend: Audio tokens (dMel), Video patches (hMLP), Text tokens, Output: text / mel flow]

The transformer then produces output tokens: text tokens for language generation, and conditioning vectors for the flow head to generate mel spectrograms. All of this happens inside a single model — a 276B-parameter Mixture of Experts with 12B active parameters per forward pass.

The model specification: TML-Interaction-Small is 276B total parameters, 12B active (MoE). For context, GPT-4o is rumored to be ~200B total. The “small” in the name is aspirational — they plan to release larger models as inference speed improves.

04 The latency problem

200ms chunks mean 5 prefill operations per second. Standard inference libraries weren’t built for this.

Here’s the fundamental tension. A standard LLM inference system expects a pattern like:

Receive a prompt (hundreds to thousands of tokens)
Prefill: process all tokens in parallel to build the KV cache
Decode: generate tokens autoregressively, one at a time
Return the response

This happens once per request. The overhead of initializing GPU memory, computing attention masks, and managing metadata is amortized over a long generation.

But a micro-turn model needs to prefill a new chunk every 200 milliseconds. That’s 5 prefill operations per second. Each one only adds ~20–50 new tokens (the 200ms of audio/video). The overhead of standard inference libraries — memory allocation, metadata computation, scheduler interrupts — would eat more time than the actual computation.

[Interactive chart with a chunk-count slider, default 10. Legend: Compute time, Overhead, Idle / network]

04 Persistent GPU sessions

The solution is streaming sessions. Instead of treating each 200ms chunk as a separate request, the client opens a persistent session with the inference server. The session maintains a single growing sequence in GPU memory:

Session open

Client connects. Server allocates GPU memory for the KV cache and initializes the sequence state.
One-time overhead amortized over the entire conversation

Chunk append

Every 200ms, client sends a new audio/video chunk. Server tokenizes it and appends to the existing sequence. Prefill runs only on the new tokens, with no re-processing of the full context.
Incremental prefill: O(new_tokens × total_tokens) attention, not O(total_tokens²)

Decode interleaved

After prefilling the new chunk, the server runs decode steps to generate output tokens (text or mel conditioning). These are streamed back to the client.
Output starts arriving before the next input chunk

Repeat

The sequence grows continuously. KV cache persists across chunks. No re-initialization, no memory re-allocation (pre-allocated with headroom).
Steady-state overhead: essentially zero per chunk

Concrete numbers: Standard inference (per-chunk): ~15ms allocation + ~8ms metadata + ~3ms scheduling + ~25ms compute = ~51ms. With streaming sessions: ~0.1ms append + ~25ms compute = ~25.1ms. That’s a 2x improvement — and 25ms out of the 200ms budget leaves 175ms for decode and network latency.
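The session lifecycle and the cost claim can be mimicked with a toy class. This is a sketch under heavy simplification: a Python list stands in for the pre-allocated GPU KV cache, `StreamingSession` and `append_chunk` are hypothetical names (not SGLang's API), and "cost" is just the count of query-key pairs from the O(new_tokens × total_tokens) expression above.

```python
class StreamingSession:
    """Toy append-only session. A Python list stands in for the GPU KV cache."""

    def __init__(self, capacity_tokens: int):
        self.kv = [None] * capacity_tokens   # allocated once, at session open
        self.length = 0

    def append_chunk(self, new_tokens: list) -> int:
        """Append a chunk and return its attention cost in query-key pairs."""
        n = len(new_tokens)
        if self.length + n > len(self.kv):
            raise RuntimeError("session capacity exceeded")
        self.kv[self.length:self.length + n] = new_tokens
        self.length += n
        return n * self.length               # new_tokens x total_tokens


session = StreamingSession(capacity_tokens=1024)
incremental, recompute = [], []
for _ in range(3):
    incremental.append(session.append_chunk(["tok"] * 40))  # 40 tokens per chunk (assumed)
    recompute.append(session.length ** 2)   # cost if the full context were re-prefilled

# Per-chunk latency figures quoted in the text, in milliseconds.
per_chunk_standard_ms = 15 + 8 + 3 + 25
per_chunk_streaming_ms = 0.1 + 25
print(incremental, recompute, per_chunk_standard_ms, per_chunk_streaming_ms)
```

The gap between the two cost lists widens with every chunk, which is why re-prefilling the whole context is a non-starter for long sessions.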

04 MoE kernel tricks

TML-Interaction-Small is a Mixture of Experts model: 276B total parameters, 12B active per forward pass. MoE models route each token to a subset of “expert” FFN blocks. This is great for capacity vs. compute, but MoE kernels have their own inference challenges.

The standard approach for MoE inference is grouped GEMM: group tokens by their assigned expert, then run one matrix multiply per expert. But grouped GEMM has poor GPU utilization when group sizes are small (few tokens per expert in a 200ms chunk).

Thinking Machines uses a gather+GEMV strategy instead, borrowed from work by PyTorch and Cursor:

Gather: For each expert, gather the tokens assigned to it
GEMV: When tokens-per-expert is small (as in decode), use optimized matrix-vector multiplications instead of general matrix multiplication

This is significant because decode in a streaming session produces 1–5 tokens at a time. With so few tokens per expert, the work is limited by how fast expert weights can be read from memory, not by arithmetic throughput. GEMV kernels are built for exactly that memory-bandwidth-bound regime, whereas grouped GEMM is tuned for the compute-bound one.
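The two steps can be illustrated on the CPU with NumPy. This shows only the data movement pattern, not a GPU kernel, and it is not Thinking Machines' implementation: it assumes top-1 routing with no gate weights, and `moe_gather_gemv` is a hypothetical name.

```python
import numpy as np


def moe_gather_gemv(tokens, expert_ids, expert_weights):
    """Top-1 MoE layer: gather tokens per expert, then one matrix-vector product each.

    tokens:         (T, d_in)
    expert_ids:     (T,) integer expert index per token
    expert_weights: (E, d_out, d_in)
    """
    out = np.empty((tokens.shape[0], expert_weights.shape[1]), dtype=tokens.dtype)
    for e in np.unique(expert_ids):
        idx = np.nonzero(expert_ids == e)[0]          # gather: tokens routed to expert e
        for i in idx:
            out[i] = expert_weights[e] @ tokens[i]    # GEMV: (d_out, d_in) @ (d_in,)
    return out


rng = np.random.default_rng(0)
T, E, d_in, d_out = 5, 8, 16, 16          # a handful of decode tokens, as in the text
tokens = rng.normal(size=(T, d_in))
expert_ids = rng.integers(0, E, size=T)
expert_weights = rng.normal(size=(E, d_out, d_in))

out = moe_gather_gemv(tokens, expert_ids, expert_weights)
reference = np.einsum("toi,ti->to", expert_weights[expert_ids], tokens)
print(np.allclose(out, reference))
```

With five tokens spread over eight experts, most experts see zero or one token, which is the regime where a grouped GEMM has little to group.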

04 SGLang contribution

The streaming session implementation was upstreamed to SGLang (PR #19171), an open-source inference framework. This isn’t proprietary magic — any team building real-time multimodal models can benefit from the same infrastructure.

Key aspects of the contribution:

Persistent sequence state: KV cache survives across multiple append-and-decode cycles without re-allocation
Incremental prefill: New tokens appended to existing KV cache with proper attention masking
Bidirectional serving shapes: Optimized kernels for both the small-prefill (200ms chunks) and decode (1–5 tokens) patterns that alternate rapidly in streaming sessions

The broader lesson: real-time multimodal AI doesn’t just need better models. It needs better systems. Inference frameworks designed for the request-response pattern of chatbots don’t work for streaming interaction. Thinking Machines had to change the serving infrastructure, not just the model.

05 Floating-point sin

The reason you can’t reproduce your LLM outputs — and it has nothing to do with temperature.

Set temperature to 0. Set the seed. Run the same prompt twice on the same GPU. Get different outputs. This happens constantly, and the standard explanation is “floating-point nondeterminism.” But that’s vague. Let’s be precise.

Floating-point addition is non-associative:

The original sin $$(a + b) + c \neq a + (b + c) \quad \text{in floating-point}$$

This isn’t a bug. It’s what makes floating-point useful. Floating-point numbers use a fixed number of bits to represent an enormous range of values. When you add two numbers with very different magnitudes, the smaller one gets rounded. The order of additions determines which roundings happen.

Concrete demo: Sum the array $[10^{-10},\ 10^{-5},\ 10^{-2},\ 1,\ -10^{-10},\ -10^{-5},\ -10^{-2},\ -1]$. The mathematical sum is exactly 0. But shuffle the order and sum left to right in floating-point: you get 102 unique results. Any GPU kernel that sums numbers in a different order (based on thread scheduling, warp layout, batch size) can therefore produce a different answer.
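The demo is easy to try in plain Python. The sketch below uses Python floats (float64) and an explicit left-to-right loop instead of the built-in `sum`, because recent Python versions use compensated summation for floats, which would hide the effect. The exact count of distinct results depends on precision and on how many shuffles you draw, so the code prints it without promising a number.

```python
import random

# Associativity already fails for three ordinary decimals.
left = (0.1 + 0.2) + 0.3
right = 0.1 + (0.2 + 0.3)
print(left, right, left == right)


def naive_sum(values):
    """Plain left-to-right accumulation, so the order of `values` is the order of additions."""
    total = 0.0
    for v in values:
        total += v
    return total


vals = [1e-10, 1e-5, 1e-2, 1.0]
vals = vals + [-v for v in vals]   # mathematically sums to exactly 0

random.seed(0)
distinct = set()
for _ in range(10_000):
    random.shuffle(vals)
    distinct.add(naive_sum(vals))

print(len(distinct), "distinct sums for the same eight numbers")
```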

05 Batch (in)variance

Here’s the insight that changes everything. The common belief is that GPU atomic operations cause nondeterminism (random thread execution order → random summation order). But Horace He at Thinking Machines shows this is wrong for LLM inference.

The forward pass of an LLM involves no operations that require atomic adds.

So where does the nondeterminism come from? From batch size variance. Modern inference servers dynamically batch requests. User A sends a prompt. User B sends a prompt 50ms later. The server batches them together. The batch size changes the internal reduction patterns in GPU kernels, which changes the floating-point accumulation order, which changes the output.

python — batch variance in action

import torch
torch.set_default_device('cuda')

B, D = 2048, 4096
a = torch.linspace(-1000, 1000, B*D).reshape(B, D)
b = torch.linspace(-1000, 1000, D*D).reshape(D, D)

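# Same first row and same weights; only the batch size differs
out1 = torch.mm(a[:1], b)        # batch=1: multiply the first row on its own
out2 = torch.mm(a, b)[:1]        # batch=2048: multiply everything, keep the first row

# Largest elementwise gap between the two "identical" computations.
# The source reports tensor(1669.2500); the exact value depends on GPU and library version.
print((out1 - out2).abs().max())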

The same row, the same weight matrix, yet a 1,669-unit difference in the output. Why? Because cuBLAS uses different tile decompositions for different batch sizes. Different tiles mean different partial-sum accumulation orders. Different orders mean different floating-point results.

The realization: from the perspective of an individual user, other concurrent users are not an “input” to the system — they’re a nondeterministic property. The same prompt sent to the same model can produce different outputs depending on who else is using the server at that moment.

05 Making kernels batch-invariant

The fix is to ensure that every GPU kernel produces the same output for a given input regardless of batch size. This is called batch invariance.

The strategy is consistent across all operations: assign one batch element per core. When each element’s reduction happens entirely within a single core, the accumulation order is fixed regardless of how many other elements are in the batch.

RMSNorm

RMSNorm computes $\text{RMS}(x) = \sqrt{\frac{1}{D}\sum_{i=1}^{D} x_i^2}$ per row. In data-parallel mode, each row is assigned to one core. The reduction over $D$ dimensions always happens in the same order. Batch-invariant by construction.

The only tricky case: when there are fewer rows than cores, the kernel might split a single row across multiple cores (split-reduction). This breaks batch invariance. The Thinking Machines solution: just don’t do it. For the tiny batch sizes where split-reduction kicks in, the performance difference is negligible.
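The one-row-per-core idea can be imitated in plain Python, where a loop plays the role of a core. The point of the sketch is structural: `rms_fixed_order` never looks at the batch, so its result for a row cannot depend on batch size, while `rms_split` models the split-reduction whose result depends on how many parts the row is cut into. Both function names are ours, and Python floats stand in for GPU arithmetic.

```python
import math
import random


def rms_fixed_order(row):
    """One 'core' reduces the whole row, always left to right."""
    acc = 0.0
    for x in row:
        acc += x * x
    return math.sqrt(acc / len(row))


def rms_split(row, parts):
    """Split-reduction: partial sums per part, combined afterwards."""
    size = math.ceil(len(row) / parts)
    total = 0.0
    for start in range(0, len(row), size):
        partial = 0.0
        for x in row[start:start + size]:
            partial += x * x
        total += partial
    return math.sqrt(total / len(row))


def rms_batch(batch):
    """Data-parallel: each row is reduced on its own, independent of batch size."""
    return [rms_fixed_order(row) for row in batch]


random.seed(0)
row = [random.uniform(-1000, 1000) for _ in range(4096)]
others = [[random.uniform(-1000, 1000) for _ in range(4096)] for _ in range(7)]

alone = rms_batch([row])[0]
in_batch = rms_batch([row] + others)[0]
print(alone == in_batch)                       # same bits, whatever the batch size
print(rms_split(row, 2), rms_split(row, 8))    # may differ in the last digits
```

A kernel that picks the number of parts based on how many rows are in flight is exactly how batch size leaks into the result.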

Matrix multiplication

Same principle: one batch element per core. But the standard cuBLAS GEMM doesn’t guarantee this — it uses split-K (parallelizing the inner reduction) for performance. A batch-invariant GEMM avoids split-K, sacrificing about 20% throughput compared to cuBLAS.

Importantly, even within a single element’s GEMM, the tile size must be consistent. Different PTX instructions (the assembly-level instructions for GPU tensor cores) use different internal accumulation orders. The batch-invariant implementation forces consistent tile sizes across all batch elements.

05 Attention: the hard case

Attention is the most complex operation to make batch-invariant because it must handle:

Different sequence lengths in the same batch
KV cache from previous turns (some tokens in cache, some in current input)
Chunked prefill (partially cached sequences)
The split-KV optimization in FlashDecoding

The core problem: FlashDecoding splits the KV sequence across multiple SMs (streaming multiprocessors) for parallelism. When the 1000th query token is processed, the split depends on how many tokens are in the KV cache vs. the current input. In prefill (0 cached), all 1000 tokens are split one way. In decode (999 cached + 1 new), they’re split another way. Different splits → different accumulation orders → different outputs.

The attention invariance requirement $$\text{Attn}(q_{1000}, K_{1:1000}) \text{ must be identical whether } K_{1:999} \in \text{cache or } K_{1:999} \in \text{current input}$$

The solution: fixed split-size instead of fixed number of splits. Rather than saying “always use 8 splits” (which gives different chunk sizes for different sequence lengths), say “always split at 4096-token boundaries.” Now the accumulation order for tokens 1–4096 is always the same, regardless of how many total tokens there are or how many are cached.

Key insight: the accumulation order must be left-aligned. If you split at 4096 boundaries: tokens 1–4096 are always reduced together, 4097–8192 always together, etc. Whether those tokens arrived via cache or via prefill doesn’t matter — the reduction tree is structurally identical.
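The difference between the two splitting rules shows up in the boundaries alone, before any attention math. The sketch below computes only those boundaries; a split size of 256 is used to keep the numbers small (the text's example uses 4096), and both function names are illustrative.

```python
import math


def fixed_count_splits(kv_len, num_splits):
    """Boundaries when the scheduler picks HOW MANY splits to use."""
    size = math.ceil(kv_len / num_splits)
    return [(s, min(s + size, kv_len)) for s in range(0, kv_len, size)]


def fixed_size_splits(kv_len, split_size):
    """Boundaries when the split SIZE is fixed and left-aligned at position 0."""
    return [(s, min(s + split_size, kv_len)) for s in range(0, kv_len, split_size)]


# Fixed count: the same 1000 keys are cut differently if the kernel picks 4 vs 8 splits.
print(fixed_count_splits(1000, 4)[:2], fixed_count_splits(1000, 8)[:2])

# Fixed size: boundaries depend only on the total KV length, never on how many
# of those keys were cached versus newly prefilled.
prefill_case = fixed_size_splits(0 + 1000, 256)   # 0 cached + 1000 new
decode_case = fixed_size_splits(999 + 1, 256)     # 999 cached + 1 new
longer = fixed_size_splits(1500, 256)
print(prefill_case == decode_case, longer[:3] == prefill_case[:3])
```

Because the boundaries are left-aligned, growing the sequence only adds or extends splits at the end; every complete split that already existed keeps the same reduction order.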

05 Why determinism matters

You might think this is academic perfectionism. It’s not. There are two concrete reasons to care:

1. Debugging training

When trainer and sampler produce bitwise-identical outputs, you can replay any training step exactly. If a loss spike happens at step 47,392, you can rerun that exact step, inspect every intermediate value, and find the root cause. Without determinism, debugging is statistical — you can characterize problems but not pinpoint them.

2. True on-policy RL

On-policy reinforcement learning, including the policy-gradient methods used in reinforcement learning from human feedback (RLHF), assumes the model generating training data is the same model being trained. But if inference is nondeterministic, the generating model and the training model diverge. You’re training on slightly off-policy data.

The standard fix is importance weighting: multiply each training example by $\frac{\pi_\theta(a|s)}{\pi_{\text{old}}(a|s)}$ to correct for the policy mismatch. But importance weights add variance. With deterministic inference, the KL divergence between generator and learner is exactly zero. No correction needed. No variance.
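The correction and the zero-KL claim can be checked on a few made-up log-probabilities. The numbers below are invented for illustration; the only things taken from the text are the ratio $\pi_\theta / \pi_{\text{old}}$ and the statement that identical sampler and learner probabilities make the correction unnecessary.

```python
import math


def importance_weights(logp_learner, logp_sampler):
    """Per-token ratio pi_theta(a|s) / pi_old(a|s), computed from log-probs."""
    return [math.exp(a - b) for a, b in zip(logp_learner, logp_sampler)]


def kl_estimate(logp_sampler, logp_learner):
    """Monte Carlo estimate of KL(sampler || learner) on tokens drawn from the sampler."""
    diffs = [b - a for a, b in zip(logp_learner, logp_sampler)]
    return sum(diffs) / len(diffs)


# Log-probs the sampler assigned to the tokens it generated (illustrative values).
logp_sampler = [-1.2, -0.7, -2.3]

# Case 1: the trainer reproduces the sampler's numbers bit for bit.
logp_learner_same = list(logp_sampler)
# Case 2: numerics drifted slightly between sampler and trainer.
logp_learner_drift = [-1.25, -0.69, -2.31]

weights_same = importance_weights(logp_learner_same, logp_sampler)
weights_drift = importance_weights(logp_learner_drift, logp_sampler)
kl_same = kl_estimate(logp_sampler, logp_learner_same)
kl_drift = kl_estimate(logp_sampler, logp_learner_drift)
print(weights_same, kl_same)
print(weights_drift, kl_drift)
```

In the first case every weight is exactly 1, so multiplying by it changes nothing and adds no variance.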

Metric	Nondeterministic	+ Importance Weights	Deterministic
Unique completions (1000 runs, temp=0)	80	80	1
KL divergence (train vs sample)	~0.001 with spikes	~0.001 with spikes	exactly 0
Reward collapse risk	High	Medium	None
Inference overhead	1x (baseline)	1x + sampling overhead	~1.6x

The 1.6x inference overhead is the cost: batch-invariant matmul is ~20% slower than cuBLAS, plus the fixed-split attention overhead. But for training stability, many teams consider this a bargain. The open-source implementation is available at thinking-machines-lab/batch-invariant-ops.

06 Why split?

You can’t be real-time and deeply thoughtful at the same time. So don’t try — use two models.

There’s a fundamental tension in AI interaction. Being responsive means generating output quickly — within 200ms, ideally. Being intelligent means spending time reasoning, searching, using tools, planning multi-step actions. You can’t do both with one forward pass.

Human teams solve this naturally: one person keeps the conversation going (“yeah, let me think about that...”) while their brain works on the hard problem in the background. Thinking Machines formalizes this pattern with two models:

FAST

Interaction model

Always present

Processes every 200ms chunk. Handles conversation flow, interruptions, backchannels, visual awareness. Must respond within budget. This is the micro-turn architecture from Chapter 02.

DEEP

Background model

Asynchronous reasoning

Handles complex questions, tool use, code execution, web search, multi-step reasoning. No latency constraint. Works while the interaction model maintains the conversation.

06 The interaction model

The interaction model is the micro-turn architecture from Chapter 02 — a 276B MoE (12B active) that processes 200ms chunks in a continuous stream. Its responsibilities:

Conversation management: Track who’s speaking, detect interruptions, manage turn-taking
Immediate responses: Quick answers, acknowledgments, clarifications
Visual awareness: Monitor video feed, react to visual changes
Delegation: Recognize when a question requires deep reasoning and hand off to the background model
Result integration: Weave background model results into the conversation naturally

The last two are the interesting ones. The interaction model must know what it doesn’t know — recognize that a coding question, a complex search, or a multi-step plan exceeds its 200ms budget and delegate appropriately.

06 The background model

The background model is a standard reasoning model — think Claude with extended thinking or o1. It receives delegated tasks and works on them asynchronously. It can:

Execute multi-step tool chains
Browse the web and synthesize information
Generate and execute code
Plan complex workflows
Generate UI components

Results stream back to the interaction model as they’re produced, not as a single dump when the task completes. This enables the interaction model to provide progressive updates: “I’m looking that up... found three relevant papers... the key finding is...”
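
The streaming hand-off described above can be sketched with a queue between two coroutines. This is a minimal illustration under stated assumptions, not Thinking Machines' implementation: `background_model`, `interaction_model`, the `None` end-of-stream sentinel, and the scaled-down `CHUNK_S` tick are all hypothetical names and choices.

```python
import asyncio

CHUNK_S = 0.01  # assumption: scaled-down stand-in for the 200ms chunk, to keep the demo fast


async def background_model(task: str, results: asyncio.Queue) -> None:
    """Hypothetical slow worker: pushes each partial result as soon as it exists."""
    for partial in (
        "I'm looking that up...",
        "found three relevant papers...",
        "the key finding is...",
    ):
        await asyncio.sleep(3 * CHUNK_S)  # stand-in for reasoning or tool latency
        await results.put(partial)
    await results.put(None)  # sentinel: the task is finished


async def interaction_model(task: str) -> list:
    """Hypothetical fast loop: ticks once per chunk and relays whatever has arrived."""
    results: asyncio.Queue = asyncio.Queue()
    worker = asyncio.create_task(background_model(task, results))
    spoken = []
    finished = False
    while not finished:
        await asyncio.sleep(CHUNK_S)  # one micro-turn of normal conversation handling
        while not results.empty():
            item = results.get_nowait()
            if item is None:
                finished = True
            else:
                spoken.append(item)  # a real system would voice this update
    await worker
    return spoken
```

Calling `asyncio.run(interaction_model("recent papers"))` returns the three updates in order. The point of the design is that the fast loop never blocks on the slow worker: it keeps ticking and only drains the queue when something is there.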

06 Coordination protocol

The critical detail: when the interaction model delegates, it sends a rich context package — not a standalone query. The background model receives the full conversation history, current user intent, visual context, and any relevant state.

Figure: the user, the interaction model, and the background model, with results streaming back.

This matters because context loss is the failure mode. If the interaction model strips context when delegating (“search for X” instead of “the user is frustrated with their deployment pipeline and specifically asked about X in the context of Y”), the background model produces generic, unhelpful results.
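
The difference between a stripped query and a rich context package can be made concrete with a small data structure. The sketch below is an assumption throughout: the class name `DelegationRequest`, its field names (taken from the four items listed above), and the flat text format produced by `to_prompt` are hypothetical, since the source does not specify a wire format.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class DelegationRequest:
    """Hypothetical context package sent from the interaction model to the background model."""
    query: str
    conversation_history: List[str]
    user_intent: str
    visual_context: str = ""
    state: Dict[str, str] = field(default_factory=dict)

    def to_prompt(self) -> str:
        """Flatten the package into one text prompt (the format is an assumption)."""
        lines = ["Conversation so far:"]
        lines += [f"  {turn}" for turn in self.conversation_history]
        lines.append(f"User intent: {self.user_intent}")
        if self.visual_context:
            lines.append(f"Visual context: {self.visual_context}")
        for key, value in self.state.items():
            lines.append(f"State[{key}]: {value}")
        lines.append(f"Task: {self.query}")
        return "\n".join(lines)


# The context-stripped delegation the text warns against.
stripped = "search for X"

# The same task, delegated with its surrounding context.
rich = DelegationRequest(
    query="search for X",
    conversation_history=[
        "user: my deployment pipeline keeps failing",
        "user: how does X behave in the context of Y?",
    ],
    user_intent="frustrated with their deployment pipeline; asking about X in the context of Y",
).to_prompt()
```

Both versions end in the same task, but only `rich` tells the background model why the user is asking.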

The design principle: the interaction model is always the user’s primary contact. It maintains the conversation thread. The background model is a specialist the interaction model calls on — like a colleague you message in Slack while staying on a call with the user. The user never talks directly to the background model.

Results are woven into the conversation at contextually appropriate moments. If the user is mid-sentence, the interaction model waits. If there’s a natural pause, it smoothly introduces the result. This is far more natural than the current pattern of “please wait while I think...” followed by a wall of text.
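
The wait-for-a-pause behavior can be sketched as a small buffer that is checked once per chunk. This is an illustrative simplification: `ResultWeaver` is a hypothetical name, and reducing "a natural pause" to a single `user_speaking` flag is an assumption, since the real model judges timing from the audio stream itself.

```python
from collections import deque
from typing import Deque, Optional


class ResultWeaver:
    """Hypothetical buffer that holds background results until the user pauses."""

    def __init__(self) -> None:
        self.pending: Deque[str] = deque()

    def on_result(self, text: str) -> None:
        """Called whenever the background model streams back a result."""
        self.pending.append(text)

    def on_chunk(self, user_speaking: bool) -> Optional[str]:
        """Called once per chunk; returns a result to voice, or None to stay quiet."""
        if user_speaking or not self.pending:
            return None
        return self.pending.popleft()


weaver = ResultWeaver()
weaver.on_result("The failing step is the cache restore.")
# Two chunks where the user is mid-sentence, then two chunks of silence.
timeline = [weaver.on_chunk(speaking) for speaking in (True, True, False, False)]
```

Here the result is held for the first two chunks and delivered on the third, the first chunk in which the user is no longer speaking.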

07 Existing benchmarks

Standard benchmarks test intelligence. Thinking Machines needed to also test interactivity.

They start with established benchmarks to verify the model isn’t trading intelligence for interactivity:

Benchmark	What It Tests	TML-Small	GPT-RT-2.0	Gemini Flash Live
FD-bench v1	Turn-taking latency (s)	0.40	1.18	0.57
FD-bench v1.5	Interaction quality avg	77.8	46.8	54.3
Audio MultiChallenge	Intelligence + following	43.4	37.6	26.8
IFEval (VoiceBench)	Instruction following	82.1	81.7	67.6
IFEval (Text)	Text instruction following	89.7	89.6	85.8
HarmBench	Safety refusal rate	99.0%	99.5%	99.0%

The headline: TML-Interaction-Small is competitive on intelligence (89.7% on text IFEval, 82.1% on the VoiceBench IFEval) while leading clearly on interactivity: 0.40s turn-taking latency vs 0.57s and 1.18s for the two baselines in the table, and an FD-bench v1.5 score of 77.8 vs 54.3 and 46.8.

But these benchmarks don’t capture the qualitative jump. Time-awareness, simultaneous speech, and visual proactivity are entirely new capabilities. You can’t measure them on existing tests because no existing model can do them.

07 TimeSpeak & CueSpeak

Thinking Machines created two new benchmarks to test capabilities that require native interactivity:

TimeSpeak

Can the model initiate speech at user-specified times with correct content?

Example task: “I want to practice my breathing. Remind me to breathe in and out every 4 seconds until I ask you to stop.” The model must: (1) understand the timing request, (2) maintain an internal clock, (3) initiate speech at the right moments, (4) stop when asked. This is impossible without native time-awareness.

Results: TML: 64.7% vs GPT-RT-2.0: 4.3%. This isn’t a small gap. GPT-RT essentially can’t do this task at all.
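
To see what the breathing example demands in terms of the 200ms chunk stream, the arithmetic can be written out. This is only a sketch of the timing requirement, not of how the model keeps time internally (that is learned, not an explicit counter); `speak_chunks` and its parameters are hypothetical.

```python
CHUNK_MS = 200  # chunk length used throughout this lesson


def speak_chunks(interval_s: float, stop_at_s: float) -> list:
    """Chunk indices at which speech should start for a reminder every
    interval_s seconds, until the user asks to stop at stop_at_s seconds."""
    step = round(interval_s * 1000 / CHUNK_MS)    # chunks between reminders
    last = int(stop_at_s * 1000 // CHUNK_MS)      # last chunk before the stop request
    return list(range(step, last + 1, step))
```

For a reminder every 4 seconds with a stop request at 13 seconds, `speak_chunks(4, 13)` gives `[20, 40, 60]`: the model has to initiate speech at every 20th chunk with no user turn to trigger it.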

CueSpeak

Can the model speak at contextually appropriate moments with semantically correct content, simultaneously with the user?

Example task: “Every time I code-switch and use another language, give me the correct word in the original language.” The model must: (1) detect code-switching in real-time audio, (2) start speaking while the user is still speaking, (3) produce the correct translation. Graded by LLM judge on both semantic correctness AND timing.

Results: TML: 81.7% vs GPT-RT-2.0: 2.9%. Again, the baseline model essentially can’t do this.

07 Visual proactivity

Three adapted benchmarks test whether the model can see, understand, and proactively react to visual input:

TRACK

RepCount-A

Count my reps

Videos of repeated actions. Model must count out loud in real-time. Tests continuous visual tracking and timely vocalization. TML: 35.4%, GPT-RT: 1.3%.

ANSWER

ProactiveVideoQA

Answer when ready

Video with questions whose answers appear at specific moments. Model must stay silent until the answer is visible, then speak. Tests temporal awareness + visual understanding. TML: 33.5%, GPT-RT: 25.0% (no-response baseline).

LOCATE

Charades

Start/stop actions

Model must say “start” when an action begins and “stop” when it ends. Temporal IoU between predicted and reference intervals. TML: 32.4%, GPT-RT: 0%.
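
Temporal IoU is the standard intersection-over-union measure applied to time intervals. A minimal sketch, assuming each interval is a `(start, end)` pair in seconds; the function name is illustrative and the benchmark's exact scoring script may differ in details such as how multiple intervals are matched.

```python
from typing import Tuple

Interval = Tuple[float, float]  # (start_seconds, end_seconds)


def temporal_iou(pred: Interval, ref: Interval) -> float:
    """Overlap duration divided by the combined duration of both intervals."""
    pred_start, pred_end = pred
    ref_start, ref_end = ref
    intersection = max(0.0, min(pred_end, ref_end) - max(pred_start, ref_start))
    union = (pred_end - pred_start) + (ref_end - ref_start) - intersection
    return intersection / union if union > 0 else 0.0
```

If the model says "start" at 2s and "stop" at 6s while the reference action runs from 4s to 8s, the overlap is 2s and the union is 6s, so the score is 1/3. A prediction that never overlaps the reference scores 0.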

The punchline: “No existing model can meaningfully perform any of these tasks. For the sake of completeness, we report the results of GPT Realtime-2 (minimal), but all models evaluated perform similar or worse on these tasks, including thinking high models. They stay silent or give incorrect answers.”

This is the qualitative jump that existing benchmarks miss. It’s not that TML is slightly better at these tasks. It’s that no other model can do them at all. The micro-turn architecture enables an entirely new category of capability.

07 The results, visualized

08 Task-length horizons

METR’s measurement of autonomous AI capability — and why interactivity changes the equation.

METR (Model Evaluation and Threat Research) introduced a powerful way to measure AI progress: instead of benchmark scores, measure the length of tasks that models can complete autonomously, where length is defined by how long a human expert would take.

Their key finding: the length of tasks that frontier models can complete at 50% reliability has been doubling approximately every 7 months for the last 6 years.
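
The doubling claim is an exponential: a horizon that doubles every 7 months grows by a factor of 2^(t/7) after t months. A small sketch of that arithmetic, where the function names are illustrative and any starting horizon passed in is a hypothetical input, not a METR measurement:

```python
def growth_factor(months_elapsed: float, doubling_months: float = 7.0) -> float:
    """How many times longer the task horizon is after months_elapsed months."""
    return 2.0 ** (months_elapsed / doubling_months)


def horizon_minutes(start_minutes: float, months_elapsed: float,
                    doubling_months: float = 7.0) -> float:
    """Extrapolated horizon, given a (hypothetical) starting horizon in minutes."""
    return start_minutes * growth_factor(months_elapsed, doubling_months)
```

Over the 6 years (72 months) the trend covers, `growth_factor(72)` comes to a little over a thousandfold. As with any extrapolation, the numbers hold only if the trend continues.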

Figure: model time horizons at 50% reliability, with extrapolation.

At the time of METR’s March 2025 report, frontier models (like Claude 3.7 Sonnet) could reliably complete tasks that take humans a few minutes. They could occasionally succeed at tasks taking hours, but reliability dropped sharply.

The key insight for interaction models: METR measures autonomous task completion, but most real work is collaborative rather than autonomous. The argument is that when coordination overhead between human and AI is the limiting cost, halving it roughly doubles the task complexity achievable for the same effort. On that view, interactivity doesn’t just make AI more pleasant; it extends the effective task horizon.

08 Intelligence × interactivity

Traditional AI evaluation is one-dimensional: how smart is the model? Interaction models add a second dimension: how well can you work with it?

Thinking Machines positions this as a frontier — models can be plotted on a 2D space of intelligence vs. interactivity. Most current models live in the high-intelligence-low-interactivity quadrant (text chat) or the low-intelligence-high-interactivity quadrant (basic voice assistants). TML-Interaction-Small aims for the top-right: high intelligence AND high interactivity.

Figure: intelligence vs. interactivity, with TML, OpenAI, Google, and other models plotted.

08 Current limitations

Thinking Machines is honest about what doesn’t work yet:

Long sessions

Continuous audio and video accumulate context quickly. At 200ms chunks with ~30 audio tokens + 1,600 video patches per chunk, context grows by ~8,000 tokens per second. Even with a 128K context window, that’s only ~16 seconds of continuous multimodal input before context fills. The streaming session design handles short/medium interactions well, but very long sessions require careful context management — a KV cache eviction strategy, context compression, or switching to a summary representation.
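
The context arithmetic above is easy to reproduce. The sketch below uses only the figures from the paragraph (30 audio tokens and 1,600 video patches per 200ms chunk) and assumes that "128K" means 128,000 tokens and that each video patch costs one token of context; the function names are illustrative.

```python
AUDIO_TOKENS_PER_CHUNK = 30
VIDEO_PATCHES_PER_CHUNK = 1_600
CHUNK_MS = 200


def tokens_per_second(include_video: bool = True) -> float:
    """Context growth rate for a continuous stream."""
    per_chunk = AUDIO_TOKENS_PER_CHUNK + (VIDEO_PATCHES_PER_CHUNK if include_video else 0)
    return per_chunk * (1000 / CHUNK_MS)  # 5 chunks per second


def seconds_until_full(context_tokens: int = 128_000, include_video: bool = True) -> float:
    """How long a stream can run before the context window is exhausted."""
    return context_tokens / tokens_per_second(include_video)
```

With video, the rate is 8,150 tokens per second and the window fills in just under 16 seconds. Under the same assumptions, audio alone grows at 150 tokens per second and would last about 14 minutes, which shows that the video patches dominate the budget.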

Connectivity

Low-latency streaming requires reliable network. With 200ms chunks and ~25ms compute, you have ~175ms for round-trip network latency. On a poor connection, chunks arrive late, the model’s temporal model breaks down, and the experience degrades significantly. Edge deployment or aggressive chunking strategies could help.
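
The latency budget is a simple subtraction, sketched here with the paragraph's own numbers (200ms chunks, roughly 25ms of compute). The function names and the idea of checking a single measured round-trip time are illustrative assumptions; a real deployment would also have to account for jitter.

```python
def network_budget_ms(chunk_ms: float = 200.0, compute_ms: float = 25.0) -> float:
    """Time left for the network round trip once compute is accounted for."""
    return chunk_ms - compute_ms


def arrives_on_time(round_trip_ms: float, chunk_ms: float = 200.0,
                    compute_ms: float = 25.0) -> bool:
    """True if a chunk with this round-trip time still fits inside the chunk window."""
    return round_trip_ms <= network_budget_ms(chunk_ms, compute_ms)
```

A 120ms round trip fits inside the 175ms budget; a 300ms round trip does not, and every chunk on that connection arrives late.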

Model scale

TML-Interaction-Small is “small” at 276B total (12B active). Larger models could be more capable, but current hardware can’t serve them within the 200ms budget. As inference hardware improves, larger interaction models become feasible.

Background agent sophistication

The background agent is functional but basic. The coordination protocol (rich context delegation + streaming results) works, but there’s enormous room to improve how the interaction and background models collaborate — especially for complex, multi-step tasks.

08 Where this goes

If the interaction model thesis is correct — that interactivity should be part of the model, not bolted on — then several things follow:

Scaling improves everything. A 10x larger interaction model should be better at turn-taking, interruption, timing, AND reasoning. This is the core bet.
New training paradigms. Training on audio/video streams (not transcripts) teaches the model about timing, prosody, visual context. This is a fundamentally different training distribution than text-only.
New safety challenges. Real-time interaction stresses safety differently. The model must refuse harmful requests naturally in speech (not with a robotic “I cannot help with that”). Multi-turn manipulation over long sessions is harder to detect. Automated red-teaming must cover conversational dynamics, not just prompt-response pairs.
New evaluation frameworks. TimeSpeak and CueSpeak are first steps. The community needs benchmarks for: sustained multi-modal awareness, collaborative task completion, graceful degradation under network issues, long-session coherence.

The Thinking Machines team is planning a limited research preview in the coming months, with a wider release later in 2026. They’re also funding research grants for new interactivity benchmarks and evaluation frameworks. The full blog post: thinkingmachines.ai/blog/interaction-models.

08 References

Thinking Machines Lab. “Interaction Models: A Scalable Approach to Human-AI Collaboration.” Connectionism, May 2026. Blog
Bai, R. H. et al. “dMel: Speech Tokenization Made Simple.” 2024. arXiv:2407.15835
Touvron, H. et al. “Three things everyone should know about Vision Transformers.” 2022. arXiv:2203.09795
Lipman, Y. et al. “Flow Matching for Generative Modeling.” 2022. arXiv:2210.02747
Korthikanti, V. et al. “Reducing Activation Recomputation in Large Transformer Models.” 2022. arXiv:2205.05198
He, H. “Defeating Nondeterminism in LLM Inference.” Thinking Machines Lab, Sep 2025. Blog
Kwa, T. et al. “Measuring AI Ability to Complete Long Tasks.” METR, Mar 2025. Blog
Wang, Y. et al. “ProactiveVideoQA.” 2025. arXiv:2507.09313
Hu, H. et al. “TransRAC: Encoding Multi-scale Temporal Correlation (RepCount).” 2022. arXiv:2204.01018
Sigurdsson, G. et al. “Hollywood in Homes: Crowdsourcing Data Collection (Charades).” 2016. arXiv:1604.01753
Clark, H. & Brennan, S. “Grounding in Communication.” Perspectives on Socially Shared Cognition, 1991.
Sutton, R. “The Bitter Lesson.” 2019.
