Based on Thinking Machines Lab · May 2026

Interaction models.

How Thinking Machines built an AI that listens while it talks, sees while it thinks, and stays present in 200-millisecond heartbeats — a teardown of the architecture, inference tricks, and new benchmarks that make real-time human-AI collaboration possible.

SOURCE: Thinking Machines Blog · DEPTH: concept-to-kernel · PAPERS: 11 referenced

00 Concept constellation

Every concept in this lesson and how they connect — the territory before the map.

This lesson unpacks a single blog post into 14 interconnected concepts across four topic clusters. The constellation below shows how they relate. Amber nodes are concepts taught in this lesson. Blue-ringed nodes are covered in other Engineermaxxing lessons.

[Figure: concept constellation. Legend: Architecture, Inference, Collaboration, Evaluation, Has existing lesson]

00 Concept index

Every concept you'll encounter, sorted by cluster.

Architecture

Micro-Turn Architecture

200ms continuous I/O chunks replace discrete turns. Ch 02.

Architecture

Encoder-Free Early Fusion

dMel + hMLP + Flow head co-trained with one transformer. Ch 03.

Architecture

Streaming Sessions

Persistent GPU memory for append-only inference. Ch 04.

Architecture

dMel Tokenization

Discretized mel-filterbank channels as speech tokens. Ch 03.

Architecture

Flow Matching Decoder

CNF-based audio generation in 200ms chunks. Ch 03.

Inference

Batch Invariance

Same output regardless of other concurrent requests. Ch 05.

Inference

FP Non-Associativity

Why (a+b)+c ≠ a+(b+c) in floating-point. Ch 05.

Inference

Sequence Parallelism

Distributing sequential ops across devices to save memory. Ch 04.

Collaboration

Turn-Based Bottleneck

Why current AI interfaces cripple collaboration. Ch 01.

Collaboration

Interaction + Background Split

Real-time presence + deep async reasoning. Ch 06.

Collaboration

Grounding Principles

Copresence, contemporality, simultaneity. Ch 01.

Evaluation

TimeSpeak & CueSpeak

Benchmarks for time-aware and simultaneous speech. Ch 07.

Evaluation

Visual Proactivity

RepCount, ProactiveVideoQA, Charades benchmarks. Ch 07.

Evaluation

Task-Length Horizon

METR’s measurement of autonomous AI capability. Ch 08.

00 Reading guide

This lesson is structured in layers. You can read it linearly or skip around.

  • Chapters 01–02: The why — what’s broken about current AI interfaces and what micro-turns change.
  • Chapters 03–05: The how — the architectural and inference innovations that make 200ms heartbeats possible.
  • Chapters 06–08: The so what — the two-model design, new benchmarks, and where this is heading.

If you already know transformer architectures and mel spectrograms, skip to Chapter 02. If you only care about the benchmarks, jump to Chapter 07.

01 The interface problem

AI gets smarter every quarter. But talking to it still feels like sending telegrams.

Imagine you’re pair-programming with a colleague. You’re both looking at the same screen. You point at a function and start saying “this part is—” and they immediately see what you’re pointing at, nod, and say “yeah, the edge case with null inputs.” You didn’t finish your sentence. You didn’t need to.

Now imagine the same interaction with an AI. You type a message. You wait. The AI reads your entire message. It generates a response. You read the entire response. You type another message. This is the turn-based bottleneck: the gap between what human collaboration is and what human-AI interaction allows.

Anthropic’s own frontier model card is candid about this: “when used in an interactive, synchronous, ‘hands-on-keyboard’ pattern, the benefits of the model were less clear.”

The problem isn’t intelligence. The models are smart enough. The problem is the channel — a text box that freezes perception during generation and forces strict turn-taking.

01 Grounding principles

Herbert Clark and Susan Brennan identified three properties that make human communication work, back in 1991. These aren’t nice-to-haves. They’re load-bearing:

01
Principle

Copresence

Both parties have interactive access to the same shared content. You can see what I’m pointing at. I can see your reaction to what I just said.

02
Principle

Contemporality

Information is received as it’s produced. You hear my words as I speak them, not after I finish my paragraph. Feedback is real-time.

03
Principle

Simultaneity

Both parties produce and receive information concurrently. I can nod while you talk. I can say “mm-hmm” without interrupting your flow.

Current LLM interfaces violate all three. The model can’t see your screen while it types. It doesn’t receive your input while generating. And it certainly can’t nod along. The Thinking Machines blog frames this as the core bottleneck: humans are pushed out of the loop not because the task demands it, but because the interface can’t keep up.

01 The turn-based trap

Let’s make the problem concrete. Here’s what a timeline looks like for a turn-based model versus a continuously interactive one.

[Figure: timeline of turn-based versus continuous interaction. Legend: Human speaks/types, AI processes, AI responds, Concurrent I/O]

In the turn-based model, there are dead zones — periods where neither party is doing useful work because the protocol demands one finish before the other starts. In the interactive model, the AI’s perception never stops. It hears you while it talks. It sees your screen while it thinks. The dead zones collapse.

The key insight: making AI interactive isn’t a UX polish. It’s an architectural decision that changes what the model can perceive, when it can act, and how humans can collaborate with it. Bolt-on solutions (voice activity detection, external orchestrators) don’t scale with intelligence. For interactivity to scale with intelligence, it must be part of the model itself.

01 The bitter lesson, restated

Rich Sutton’s Bitter Lesson (2019) says that general methods leveraging computation ultimately win over hand-engineered solutions. Thinking Machines extends this to interactivity: bolting a voice-activity detector onto a text model is the hand-engineered approach. Training a natively multimodal model that processes 200ms chunks is the general approach.

The claim is simple and testable: scaling a model should make it smarter AND a better collaborator. If interactivity lives outside the model (in a VAD module, an orchestrator, a turn-management system), then scaling the model only makes it smarter. The collaboration quality stays flat.

This is the premise behind everything that follows.

02 What is a micro-turn?

The heartbeat of the interaction model: 200 milliseconds of continuous I/O.

In a standard LLM, the model processes your entire input, then generates its entire response. Input and output are discrete, sequential phases. A micro-turn throws this away.

Instead, the model processes the world in 200ms chunks. Every 200 milliseconds, it ingests whatever input has arrived — a fragment of speech, a video frame, a partial text — and decides what to do: continue listening, start speaking, adjust its current output, or stay silent. Both input and output are treated as continuous streams that interleave in a single token sequence.

Think of it like breathing. You don’t stop breathing to talk, and you don’t stop hearing to breathe. The 200ms chunk is the model’s respiratory cycle — a constant rhythm of perceiving and acting that never pauses.

Concretely, the model’s token sequence looks like this:

Single interleaved sequence $$[\text{audio}_{0\text{-}200\text{ms}},\ \text{video}_{t_0},\ \text{response tokens},\ \text{audio}_{200\text{-}400\text{ms}},\ \text{video}_{t_1},\ \dots]$$
  • Every 200ms chunk produces a “bag of embeddings” from all active modalities
  • The transformer processes this single interleaved sequence autoregressively
  • Output tokens (text, audio, tool calls) are generated between input chunks
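
The interleaving above can be sketched in a few lines. This is a toy illustration, not the actual implementation: the `Token` labels, the `InterleavedSequence` class, and the `toy_policy` function (which stands in for the transformer's decision to speak or stay silent) are all hypothetical names introduced here.

```python
from dataclasses import dataclass, field

CHUNK_MS = 200  # micro-turn length from the text

@dataclass
class Token:
    kind: str      # "audio_in", "video_in", or "out" (hypothetical labels)
    t_ms: int      # start time of the chunk this token belongs to
    payload: str

@dataclass
class InterleavedSequence:
    tokens: list = field(default_factory=list)

    def step(self, t_ms, audio, video, decide):
        """Append one micro-turn: inputs first, then any output tokens."""
        self.tokens.append(Token("audio_in", t_ms, audio))
        if video is not None:
            self.tokens.append(Token("video_in", t_ms, video))
        # `decide` stands in for the transformer: it sees the whole
        # sequence so far and returns zero or more output payloads.
        for out in decide(self.tokens):
            self.tokens.append(Token("out", t_ms, out))

def toy_policy(tokens):
    # Stay silent until two audio chunks have arrived, then respond once.
    heard = sum(1 for tok in tokens if tok.kind == "audio_in")
    return ["mm-hmm"] if heard == 2 else []

seq = InterleavedSequence()
for i, (audio, video) in enumerate([("a0", "v0"), ("a1", "v1"), ("a2", None)]):
    seq.step(i * CHUNK_MS, audio, video, toy_policy)

print([(tok.kind, tok.t_ms) for tok in seq.tokens])
```

The point of the sketch is that input and output live in one list: the output token at 200ms sits between two input chunks, and the next chunk is appended after it without any separate "response phase".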

02 Why 200 milliseconds?

This number isn’t arbitrary. It sits at the intersection of three constraints:

  1. Perceptual latency

    Human conversational turn-taking gaps average 200–300ms. Respond faster than 200ms and it feels like interruption. Respond slower than 500ms and it feels laggy. 200ms is the sweet spot for perceived real-time.

    Psycholinguistics — Stivers et al. 2009
  2. Audio frame size

    Speech at 16kHz means 200ms = 3,200 samples. That’s enough to capture a full phoneme but not so much that latency builds up. Standard audio frames (20ms) are too small for meaningful semantic processing; full utterances are too large for real-time response.

    Signal processing — Nyquist window trade-off
  3. Compute budget

    At 200ms, you have roughly 150–180ms of actual compute time (after network overhead). On modern GPUs with a 12B-active MoE, that’s enough for a meaningful forward pass including both prefill of the new chunk and several decode steps for output generation.

    Systems constraint — Thinking Machines inference team
[Figure: chunk-size timeline. Legend: Audio input chunk, Video frame, Model output, Dead time]

At 50ms, the overhead dominates: the model spends more time managing chunks than processing them. At 500ms, latency is perceptible. 200ms is the Goldilocks zone.
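
The trade-off can be checked with rough arithmetic. The sketch below assumes a fixed 30ms of per-chunk overhead, a value chosen here only because it falls inside the 20–50ms range implied by the "150–180ms of actual compute time" figure above; real overhead will vary with network and hardware.

```python
SAMPLE_RATE_HZ = 16_000        # from the text
FIXED_OVERHEAD_MS = 30         # assumption: within the 20-50ms implied by the text

def chunk_stats(chunk_ms):
    """Return (samples, chunks per second, compute ms, overhead fraction)."""
    samples = SAMPLE_RATE_HZ * chunk_ms // 1000
    chunks_per_second = 1000 / chunk_ms
    compute_ms = max(chunk_ms - FIXED_OVERHEAD_MS, 0)
    overhead_fraction = min(FIXED_OVERHEAD_MS / chunk_ms, 1.0)
    return samples, chunks_per_second, compute_ms, overhead_fraction

for ms in (50, 200, 500):
    samples, rate, compute, frac = chunk_stats(ms)
    print(f"{ms}ms: {samples} samples, {rate:g} chunks/s, "
          f"{compute}ms compute, {frac:.0%} overhead")
```

Under this assumption a 50ms chunk spends more than half its budget on overhead, while a 500ms chunk wastes little but adds half a second before the model can react.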

02 Stream interleaving

The critical architectural insight is that input and output share a single sequence. This is fundamentally different from systems that have separate input and output pipelines connected by an orchestrator.

Here’s what this means in practice:

| Property | Turn-Based | Micro-Turn |
| --- | --- | --- |
| Input processing | Complete utterance, then process | 200ms chunks, processed incrementally |
| Output generation | Full response generated at once | Tokens between input chunks |
| Interruption | External VAD detects speech, kills generation | Model hears input during generation, decides to yield |
| Simultaneous speech | Impossible: one direction at a time | Native: input and output streams overlap |
| Visual awareness | Snapshot at turn start | Continuous video frames every 200ms |
| Tool use timing | Tools called after response generation | Tools called concurrently during conversation |

The simultaneous speech capability is especially striking. Because both audio input and audio output are part of the same token sequence, the model can literally speak while listening. This enables interaction patterns that are impossible with turn-based systems: live translation, real-time commentary, counting reps during exercise.

02 What this replaces

Most “real-time” voice AI systems today use a harness-based architecture:

  1. A Voice Activity Detector (VAD) listens for speech onset/offset
  2. When speech ends, audio is sent to a speech-to-text module
  3. The text goes to the LLM for response generation
  4. The response text goes to a text-to-speech module
  5. Audio plays back to the user

Each component is separately engineered, and the seams show. The VAD can’t understand context — it just detects energy. It can’t tell if you paused to think or finished your sentence. The LLM can’t hear how you said something — only what you said. And none of these components improve when you make the LLM larger.
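
To make the VAD limitation concrete, here is a toy energy-threshold detector. It is a deliberately simple sketch: the threshold, the three-frame hangover, and the synthetic frames are assumptions made for illustration and do not describe any particular production VAD.

```python
def rms(frame):
    """Root-mean-square energy of one frame of samples."""
    return (sum(s * s for s in frame) / len(frame)) ** 0.5

def energy_vad(frames, threshold=0.1, hangover_frames=3):
    """Declare end-of-speech after `hangover_frames` quiet frames in a row.

    Returns the index of the frame where the turn is cut, or None.
    Threshold and hangover are illustrative values, not from the source.
    """
    quiet_run, speaking = 0, False
    for i, frame in enumerate(frames):
        if rms(frame) >= threshold:
            speaking, quiet_run = True, 0
        elif speaking:
            quiet_run += 1
            if quiet_run >= hangover_frames:
                return i
    return None

loud, quiet = [0.5, -0.5] * 4, [0.0] * 8
# "this part is ... (thinking pause) ... broken": the speaker is not done.
thinking_pause = [loud, loud, quiet, quiet, quiet, loud, loud]
# "this part is broken." followed by real silence.
finished = [loud, loud, quiet, quiet, quiet]

print(energy_vad(thinking_pause), energy_vad(finished))
```

Both calls cut the turn at the same frame, because the detector only sees energy. Nothing in its input distinguishes a pause to think from a finished sentence.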

The micro-turn architecture replaces all five components with a single transformer that natively processes audio, video, and text in an interleaved sequence. No VAD. No STT. No TTS pipeline. The model IS the pipeline.

This is the “scaling with intelligence” argument: every component in the harness is a ceiling. Make the LLM 10x smarter and your VAD is still the same dumb energy detector. With micro-turns, making the model smarter makes every aspect of interaction better — turn-taking, interruption handling, emotional awareness, timing, all of it.

03 Modality fusion overview

Three modalities, three lightweight encoders, one shared transformer. No frozen encoders. No separate decoders. Everything co-trained from scratch.

The dominant approach to multimodal AI uses large pre-trained encoders: a Whisper-class model for audio, a CLIP/SigLIP model for vision, a vocoder for speech output. Each is frozen, and the LLM learns to interface with their embedding spaces.

This approach has a problem for real-time interaction: large encoders add latency. A Whisper encoder needs the full audio segment before producing embeddings. A CLIP encoder processes a 224×224 image through dozens of layers. When your budget is 200ms, these pre-processing steps eat most of it.

Thinking Machines takes a different approach: encoder-free early fusion. Instead of large frozen encoders, they use lightweight embedding layers that convert raw signals into token-like representations, then feed everything into one transformer.

[Figure: modality fusion diagram. Legend: Audio path (dMel), Video path (hMLP), Text path, Shared transformer]

03 dMel: discretized mel spectrograms

How do you turn speech into tokens? The standard approaches are:

  • Codec tokenization (like EnCodec, SoundStream): Train a neural audio codec to compress audio into discrete codes. Problem: the codec is domain-specific and fails on out-of-domain audio.
  • Large encoder (like Whisper): Process audio through a big pre-trained model. Problem: high latency, frozen representations.

dMel (Bai et al., 2024) is a beautifully simple alternative. Take the mel spectrogram — a standard audio representation that’s been used for decades — and discretize each frequency channel into intensity bins.

dMel encoding $$\text{dMel}(t, f) = \text{bin}\left(\frac{\text{mel}(t, f) - \mu_f}{\sigma_f},\ B\right)$$
  • $\text{mel}(t, f)$ — mel spectrogram value at time $t$, frequency band $f$
  • $\mu_f, \sigma_f$ — per-channel mean and standard deviation (from training data)
  • $B$ — number of intensity bins (typically 256)
  • Output: an integer per frequency band per time step — a discrete token
Worked example: 16kHz audio, 200ms chunk = 3,200 samples. Compute a mel spectrogram with 80 mel bands and a 10ms hop → 20 frames × 80 bands = 1,600 values. Discretize each into one of 256 bins → 1,600 discrete tokens. A lightweight embedding layer maps these to the transformer’s hidden dimension. Total encoder parameter count: essentially just the embedding matrix. For scale, Whisper small has roughly 244M parameters in total (encoder plus decoder).
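
The worked example can be reproduced with a short sketch. The formula above only specifies $\text{bin}(z, B)$, so the binning rule here (clip the normalized value to ±3 standard deviations, then bin uniformly) is an assumption, and the mel values are synthetic stand-ins rather than a real spectrogram.

```python
SAMPLE_RATE_HZ = 16_000
CHUNK_MS, HOP_MS = 200, 10
N_MEL, N_BINS = 80, 256          # values quoted in the text

def frames_per_chunk():
    return CHUNK_MS // HOP_MS

def dmel_bin(mel_value, mu, sigma, n_bins=N_BINS, clip=3.0):
    """Map one mel value to an integer bin in [0, n_bins - 1].

    Assumption: normalized values are clipped to +/- `clip` standard
    deviations and binned uniformly; the text only specifies bin(z, B).
    """
    z = (mel_value - mu) / sigma
    z = max(-clip, min(clip, z))
    unit = (z + clip) / (2 * clip)            # now in [0, 1]
    return min(int(unit * n_bins), n_bins - 1)

def tokenize_chunk(mel, mu, sigma):
    """mel: list of frames, each a list of N_MEL floats -> same shape of ints."""
    return [[dmel_bin(v, mu[f], sigma[f]) for f, v in enumerate(frame)]
            for frame in mel]

# Synthetic stand-in for the mel spectrogram of one 200ms chunk.
mel = [[0.1 * f - 0.05 * t for f in range(N_MEL)]
       for t in range(frames_per_chunk())]
mu, sigma = [4.0] * N_MEL, [2.0] * N_MEL
tokens = tokenize_chunk(mel, mu, sigma)
print(len(tokens), len(tokens[0]), len(tokens) * len(tokens[0]))
```

Note that nothing in `tokenize_chunk` looks beyond the current chunk, which is the streaming property the section relies on.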

Why does this work? Because the mel spectrogram already captures the structure of speech (formants, pitch, energy) in a format that the transformer can learn to interpret. The “encoder” is just a lookup table. All the heavy lifting happens inside the shared transformer.

The key advantage for real-time interaction: dMel is streaming-native. Each 200ms chunk produces its tokens independently. There’s no need to buffer a full utterance before encoding.

03 hMLP: vision patches

For video, Thinking Machines uses a similar philosophy: minimal encoding, maximal transformer processing. Each video frame is divided into 40×40 patches, and each patch is encoded by an hMLP (hierarchical MLP) from Touvron et al. (2022).

An hMLP is much simpler than a ViT (Vision Transformer). It has no self-attention layers — it’s a stack of feedforward layers that processes each patch independently. The cross-patch reasoning happens inside the shared transformer, not in a separate vision encoder.

| Property | CLIP/SigLIP Encoder | hMLP Patches |
| --- | --- | --- |
| Parameters | 300M–2B | ~10M |
| Self-attention | Yes (12–48 layers) | None |
| Cross-patch reasoning | Inside vision encoder | Inside shared transformer |
| Latency per frame | 5–30ms | <1ms |
| Training | Frozen (pre-trained on 400M–12B images) | Co-trained from scratch |
| Tokens per frame | 256–576 (after resampling) | 1,600 (40×40) |

The trade-off is clear: more tokens per frame (1,600 vs ~256) but much faster encoding and fully co-trained representations. For a real-time system processing frames every 200ms, the sub-millisecond encoding latency is critical. The extra tokens are manageable because the model has 200ms of compute budget per chunk.

03 Flow matching decoder

The input side is solved: dMel for audio in, hMLP for video in, a standard embedding table for text in. But how does the model produce audio?

Thinking Machines uses a flow head based on Flow Matching (Lipman et al., 2022). This is a generative model that learns to transform noise into mel spectrograms by learning a vector field along a probability path.

Flow matching objective $$\mathcal{L}_{\text{FM}} = \mathbb{E}_{t, x_0, x_1}\left[\|v_\theta(t, x_t) - (x_1 - x_0)\|^2\right]$$
  • $x_0$ — noise sample (from simple prior)
  • $x_1$ — target mel spectrogram chunk
  • $x_t = (1-t)x_0 + tx_1$ — linear interpolation at time $t \in [0,1]$
  • $v_\theta$ — learned vector field that pushes $x_0$ toward $x_1$
  • The optimal field is simply $x_1 - x_0$ — a straight line from noise to signal

Why flow matching instead of a standard vocoder or diffusion model?

  1. Speed: Flow matching with optimal transport paths converges in fewer ODE steps than diffusion. For 200ms audio chunks, you need fast generation.
  2. Quality: The straight-line paths (optimal transport) produce higher-quality samples than the curved paths of standard diffusion.
  3. Co-training: The flow head is jointly trained with the transformer, so the model learns audio representations that are easy to decode.
Why the loss works: At training time, you sample a random interpolation point $t$ between noise ($t=0$) and target mel ($t=1$). The model predicts the direction to move from the current interpolated point. The optimal direction is just the difference $x_1 - x_0$ — a straight line. During inference, you start from noise and integrate the learned vector field through a few ODE steps to arrive at the target mel spectrogram.
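
A toy version of the objective and the sampling loop, in plain Python. It assumes a single (noise, target) pair, for which the ideal vector field is the constant $x_1 - x_0$; a trained flow head would approximate this field from data with a neural network, which this sketch does not attempt. The three-element "mel chunk" is a made-up stand-in.

```python
import random

def interpolate(x0, x1, t):
    """x_t = (1 - t) * x0 + t * x1, elementwise."""
    return [(1 - t) * a + t * b for a, b in zip(x0, x1)]

def fm_loss(v_theta, x0, x1, t):
    """Squared error between the predicted field and the target x1 - x0."""
    x_t = interpolate(x0, x1, t)
    pred = v_theta(t, x_t)
    return sum((p - (b - a)) ** 2 for p, a, b in zip(pred, x0, x1))

def euler_sample(v_theta, x0, n_steps=4):
    """Integrate dx/dt = v_theta(t, x) from t=0 to t=1 with Euler steps."""
    x, dt = list(x0), 1.0 / n_steps
    for k in range(n_steps):
        v = v_theta(k * dt, x)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x

# Toy setup: one (noise, target) pair, so the ideal field is constant.
random.seed(0)
x1 = [0.2, -1.0, 0.7]                       # stand-in "mel chunk"
x0 = [random.gauss(0, 1) for _ in x1]       # noise sample from the prior

def ideal_field(t, x):
    return [b - a for a, b in zip(x0, x1)]

print(fm_loss(ideal_field, x0, x1, t=0.3))  # zero loss for the ideal field
print(euler_sample(ideal_field, x0))        # lands on x1 up to rounding
```

Because the path is a straight line, even a handful of Euler steps reaches the target here, which is the intuition behind the "fewer ODE steps" argument.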

03 The bag of embeddings

All three modality paths — dMel audio tokens, hMLP video patches, text tokens — produce embeddings in the same hidden dimension. Every 200ms, the active modality tokens are concatenated into a “bag of embeddings” and fed into the transformer.

[Figure: bag of embeddings. Legend: Audio tokens (dMel), Video patches (hMLP), Text tokens, Output: text / mel flow]

The transformer then produces output tokens: text tokens for language generation, and conditioning vectors for the flow head to generate mel spectrograms. All of this happens inside a single model — a 276B-parameter Mixture of Experts with 12B active parameters per forward pass.

The model specification: TML-Interaction-Small is 276B total parameters, 12B active (MoE). For context, GPT-4o is rumored to be ~200B total. The “small” in the name is aspirational — they plan to release larger models as inference speed improves.

04 The latency problem

200ms chunks mean 5 prefill operations per second. Standard inference libraries weren’t built for this.

Here’s the fundamental tension. A standard LLM inference system expects a pattern like:

  1. Receive a prompt (hundreds to thousands of tokens)
  2. Prefill: process all tokens in parallel to build the KV cache
  3. Decode: generate tokens autoregressively, one at a time
  4. Return the response

This happens once per request. The overhead of initializing GPU memory, computing attention masks, and managing metadata is amortized over a long generation.

But a micro-turn model needs to prefill a new chunk every 200 milliseconds. That’s 5 prefill operations per second. Each one only adds ~20–50 new tokens (the 200ms of audio/video). The overhead of standard inference libraries — memory allocation, metadata computation, scheduler interrupts — would eat more time than the actual computation.

[Figure legend: Compute time, Overhead, Idle / network]

04 Persistent GPU sessions

The solution is streaming sessions. Instead of treating each 200ms chunk as a separate request, the client opens a persistent session with the inference server. The session maintains a single growing sequence in GPU memory:

  1. Session open

    Client connects. Server allocates GPU memory for the KV cache and initializes the sequence state.

    One-time overhead amortized over the entire conversation
  2. Chunk append

    Every 200ms, client sends a new audio/video chunk. Server tokenizes it and appends to the existing sequence. Prefill runs only on the new tokens — no re-processing of the full context.

    Incremental prefill — O(new_tokens × total_tokens) attention, not O(total_tokens²)
  3. Decode interleaved

    After prefilling the new chunk, the server runs decode steps to generate output tokens (text or mel conditioning). These are streamed back to the client.

    Output starts arriving before the next input chunk
  4. Repeat

    The sequence grows continuously. KV cache persists across chunks. No re-initialization, no memory re-allocation (pre-allocated with headroom).

    Steady-state overhead: essentially zero per chunk
Concrete numbers: Standard inference (per-chunk): ~15ms allocation + ~8ms metadata + ~3ms scheduling + ~25ms compute = ~51ms. With streaming sessions: ~0.1ms append + ~25ms compute = ~25.1ms. That’s a 2x improvement — and 25ms out of the 200ms budget leaves 175ms for decode and network latency.
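
The session lifecycle above can be sketched as a small bookkeeping class. This is a toy model that only counts work and runs no model; `StreamingSession`, its fields, and the 40-token chunk size are illustrative assumptions and not the SGLang API.

```python
class StreamingSession:
    """Toy model of an append-only session; counts work, runs no model.

    Names and structure are illustrative, not the SGLang API.
    """
    def __init__(self, max_tokens):
        # One-time allocation with headroom, reused for every chunk.
        self.kv_cache = [None] * max_tokens
        self.length = 0
        self.attention_ops = 0

    def append_chunk(self, chunk_tokens):
        new = len(chunk_tokens)
        if self.length + new > len(self.kv_cache):
            raise MemoryError("session exceeded pre-allocated KV cache")
        # Write new entries in place; earlier entries are never touched.
        self.kv_cache[self.length:self.length + new] = chunk_tokens
        self.length += new
        # Each new token attends to the tokens so far: new x total.
        self.attention_ops += new * self.length
        return self.attention_ops

def full_reprefill_ops(chunk_sizes):
    """Cost if every chunk re-processed the whole context from scratch."""
    total, ops = 0, 0
    for new in chunk_sizes:
        total += new
        ops += total * total
    return ops

chunks = [[f"tok{c}_{i}" for i in range(40)] for c in range(5)]  # one second
session = StreamingSession(max_tokens=1024)
for chunk in chunks:
    session.append_chunk(chunk)

print(session.attention_ops, full_reprefill_ops([len(c) for c in chunks]))
```

After one second of 40-token chunks, the incremental path has done 24,000 attention-pair operations against 88,000 for re-prefilling each time, and the gap widens as the conversation grows.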

04 MoE kernel tricks

TML-Interaction-Small is a Mixture of Experts model: 276B total parameters, 12B active per forward pass. MoE models route each token to a subset of “expert” FFN blocks. This is great for capacity vs. compute, but MoE kernels have their own inference challenges.

The standard approach for MoE inference is grouped GEMM: group tokens by their assigned expert, then run one matrix multiply per expert. But grouped GEMM has poor GPU utilization when group sizes are small (few tokens per expert in a 200ms chunk).

Thinking Machines uses a gather+GEMV strategy instead, borrowed from work by PyTorch and Cursor:

  1. Gather: For each expert, gather the tokens assigned to it
  2. GEMV: When tokens-per-expert is small (as in decode), use optimized matrix-vector multiplications instead of general matrix multiplication

This is significant because decode in a streaming session produces 1–5 tokens at a time. At that size the work is limited by memory bandwidth (streaming expert weights from GPU memory), not by arithmetic throughput, and GEMV kernels are designed to saturate that bandwidth.
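
A plain-Python sketch of the gather-then-GEMV idea, assuming top-1 routing and tiny 2×2 expert matrices chosen so the result is easy to check by hand. The function names are hypothetical; a real kernel would do the same bookkeeping on the GPU with fused memory access.

```python
def gemv(matrix, vector):
    """Matrix-vector product: one dot product per output row."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def moe_gather_gemv(tokens, expert_ids, experts):
    """Apply each token's assigned expert using per-token GEMV.

    tokens:     list of input vectors (a handful during decode)
    expert_ids: expert index chosen by the router for each token (top-1)
    experts:    list of weight matrices, one per expert
    """
    # Gather: bucket token positions by expert.
    buckets = {}
    for pos, e in enumerate(expert_ids):
        buckets.setdefault(e, []).append(pos)

    outputs = [None] * len(tokens)
    for e, positions in buckets.items():
        weights = experts[e]          # look up each expert's weights once
        for pos in positions:         # GEMV per token in the bucket
            outputs[pos] = gemv(weights, tokens[pos])
    return outputs

experts = [
    [[1.0, 0.0], [0.0, 1.0]],   # expert 0: identity
    [[2.0, 0.0], [0.0, 2.0]],   # expert 1: doubles
    [[0.0, 1.0], [1.0, 0.0]],   # expert 2: swaps the two components
]
tokens = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
print(moe_gather_gemv(tokens, expert_ids=[2, 0, 2], experts=experts))
```

Expert 1 is never touched because no token was routed to it, which is the capacity-versus-compute benefit of MoE in miniature.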

04 SGLang contribution

The streaming session implementation was upstreamed to SGLang (PR #19171), an open-source inference framework. This isn’t proprietary magic — any team building real-time multimodal models can benefit from the same infrastructure.

Key aspects of the contribution:

  • Persistent sequence state: KV cache survives across multiple append-and-decode cycles without re-allocation
  • Incremental prefill: New tokens appended to existing KV cache with proper attention masking
  • Bidirectional serving shapes: Optimized kernels for both the small-prefill (200ms chunks) and decode (1–5 tokens) patterns that alternate rapidly in streaming sessions
The broader lesson: real-time multimodal AI doesn’t just need better models. It needs better systems. Inference frameworks designed for the request-response pattern of chatbots don’t work for streaming interaction. Thinking Machines had to change the serving infrastructure, not just the model.

05 Floating-point sin

The reason you can’t reproduce your LLM outputs — and it has nothing to do with temperature.

Set temperature to 0. Set the seed. Run the same prompt twice on the same GPU. Get different outputs. This happens constantly, and the standard explanation is “floating-point nondeterminism.” But that’s vague. Let’s be precise.

Floating-point addition is non-associative:

The original sin $$(a + b) + c \neq a + (b + c) \quad \text{in floating-point}$$

This isn’t a bug. It’s what makes floating-point useful. Floating-point numbers use a fixed number of bits to represent an enormous range of values. When you add two numbers with very different magnitudes, the smaller one gets rounded. The order of additions determines which roundings happen.

Concrete demo: Sum the array $[10^{-10},\ 10^{-5},\ 10^{-2},\ 1,\ -10^{-10},\ -10^{-5},\ -10^{-2},\ -1]$. The mathematical sum is exactly 0. But shuffle the order and sum in floating point, and you get 102 unique results. Any GPU kernel that sums numbers in a different order (based on thread scheduling, warp layout, batch size) can produce a different answer.
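
The demo is easy to try. The sketch below uses Python's built-in floats, which are double precision, with an arbitrary seed and trial count, so the number of distinct sums it reports is not guaranteed to match the figure quoted above. What it does show reliably is that the same eight numbers produce more than one answer.

```python
import random

vals = [1e-10, 1e-5, 1e-2, 1.0]
vals = vals + [-v for v in vals]      # exact mathematical sum is 0

def distinct_sums(values, trials=10_000, seed=42):
    """Sum the same numbers in many random orders; collect distinct results."""
    rng = random.Random(seed)
    order = list(values)
    seen = set()
    for _ in range(trials):
        rng.shuffle(order)
        total = 0.0
        for v in order:               # strict left-to-right accumulation
            total += v
        seen.add(total)
    return sorted(seen)

results = distinct_sums(vals)
print(len(results), "distinct sums, from", results[0], "to", results[-1])
```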

05 Batch (in)variance

Here’s the insight that changes everything. The common belief is that GPU atomic operations cause nondeterminism (random thread execution order → random summation order). But Horace He at Thinking Machines shows this is wrong for LLM inference.

The forward pass of an LLM involves no operations that require atomic adds.

So where does the nondeterminism come from? From batch size variance. Modern inference servers dynamically batch requests. User A sends a prompt. User B sends a prompt 50ms later. The server batches them together. The batch size changes the internal reduction patterns in GPU kernels, which changes the floating-point accumulation order, which changes the output.

python — batch variance in action
import torch
torch.set_default_device('cuda')

B, D = 2048, 4096
a = torch.linspace(-1000, 1000, B*D).reshape(B, D)
b = torch.linspace(-1000, 1000, D*D).reshape(D, D)

# Same first row, different batch size
out1 = torch.mm(a[:1], b)        # batch=1
out2 = torch.mm(a, b)[:1]       # batch=2048, take first row

print((out1 - out2).abs().max())  # e.g. tensor(1669.2500); the exact value depends on GPU and cuBLAS version

The same row, the same weight matrix, yet a 1,669-unit difference in the output. Why? Because cuBLAS uses different tile decompositions for different batch sizes. Different tiles mean different partial-sum accumulation orders. Different orders mean different floating-point results.

The realization: from the perspective of an individual user, other concurrent users are not an “input” to the system — they’re a nondeterministic property. The same prompt sent to the same model can produce different outputs depending on who else is using the server at that moment.

05 Making kernels batch-invariant

The fix is to ensure that every GPU kernel produces the same output for a given input regardless of batch size. This is called batch invariance.

The strategy is consistent across all operations: assign one batch element per core. When each element’s reduction happens entirely within a single core, the accumulation order is fixed regardless of how many other elements are in the batch.

RMSNorm

RMSNorm computes $\text{RMS}(x) = \sqrt{\frac{1}{D}\sum_{i=1}^{D} x_i^2}$ per row. In data-parallel mode, each row is assigned to one core. The reduction over $D$ dimensions always happens in the same order. Batch-invariant by construction.

The only tricky case: when there are fewer rows than cores, the kernel might split a single row across multiple cores (split-reduction). This breaks batch invariance. The Thinking Machines solution: just don’t do it. For the tiny batch sizes where split-reduction kicks in, the performance difference is negligible.
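
The split-reduction problem can be reproduced on a CPU. In this sketch the rule "spread one row over idle cores when the batch is small" is a hypothetical heuristic standing in for a real kernel's scheduling decision, and the four-element row is chosen so that rounding differences are easy to see.

```python
def reduce_sequential(row):
    """Fixed left-to-right sum: the order never depends on the batch."""
    total = 0.0
    for x in row:
        total += x
    return total

def reduce_split(row, n_splits):
    """Split the row into n_splits pieces, sum each, then combine."""
    size = -(-len(row) // n_splits)   # ceiling division
    partials = [reduce_sequential(row[i:i + size])
                for i in range(0, len(row), size)]
    return reduce_sequential(partials)

def batch_dependent_sum(row, batch_size, n_cores=4):
    # Hypothetical heuristic: with few rows, spread one row over idle cores.
    n_splits = max(n_cores // batch_size, 1)
    return reduce_split(row, n_splits)

def batch_invariant_sum(row, batch_size):
    # One row per core, always: batch_size is deliberately ignored.
    return reduce_sequential(row)

row = [1e16, 1.0, -1e16, 1.0]         # exact mathematical sum is 2
for batch_size in (1, 2, 4):
    print(batch_size, batch_dependent_sum(row, batch_size),
          batch_invariant_sum(row, batch_size))
```

With a batch of 2 the row is reduced in two halves and the result changes, while the invariant version returns the same value every time. Neither version returns the exact answer of 2: batch invariance buys reproducibility, not extra precision.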

Matrix multiplication

Same principle: one batch element per core. But the standard cuBLAS GEMM doesn’t guarantee this — it uses split-K (parallelizing the inner reduction) for performance. A batch-invariant GEMM avoids split-K, sacrificing about 20% throughput compared to cuBLAS.

Importantly, even within a single element’s GEMM, the tile size must be consistent. Different PTX instructions (the assembly-level instructions for GPU tensor cores) use different internal accumulation orders. The batch-invariant implementation forces consistent tile sizes across all batch elements.

05 Attention: the hard case

Attention is the most complex operation to make batch-invariant because it must handle:

  • Different sequence lengths in the same batch
  • KV cache from previous turns (some tokens in cache, some in current input)
  • Chunked prefill (partially cached sequences)
  • The split-KV optimization in FlashDecoding

The core problem: FlashDecoding splits the KV sequence across multiple SMs (streaming multiprocessors) for parallelism. When the 1000th query token is processed, the split depends on how many tokens are in the KV cache vs. the current input. In prefill (0 cached), all 1000 tokens are split one way. In decode (999 cached + 1 new), they’re split another way. Different splits → different accumulation orders → different outputs.

The attention invariance requirement $$\text{Attn}(q_{1000}, K_{1:1000}) \text{ must be identical whether } K_{1:999} \in \text{cache or } K_{1:999} \in \text{current input}$$

The solution: fixed split-size instead of fixed number of splits. Rather than saying “always use 8 splits” (which gives different chunk sizes for different sequence lengths), say “always split at 4096-token boundaries.” Now the accumulation order for tokens 1–4096 is always the same, regardless of how many total tokens there are or how many are cached.

Key insight: the accumulation order must be left-aligned. If you split at 4096 boundaries: tokens 1–4096 are always reduced together, 4097–8192 always together, etc. Whether those tokens arrived via cache or via prefill doesn’t matter — the reduction tree is structurally identical.
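
The difference between the two splitting rules shows up directly in the boundaries they produce. The sketch below uses 0-based, half-open ranges (the text counts tokens from 1), and the 8-split and 4096-token values simply mirror the examples in the prose; the function names are made up for illustration.

```python
def splits_fixed_count(kv_len, n_splits=8):
    """Boundaries when the KV range is always cut into n_splits pieces."""
    size = -(-kv_len // n_splits)     # ceiling division
    return [(start, min(start + size, kv_len))
            for start in range(0, kv_len, size)]

def splits_fixed_size(kv_len, split_size=4096):
    """Boundaries when every piece except the last has split_size tokens."""
    return [(start, min(start + split_size, kv_len))
            for start in range(0, kv_len, split_size)]

for kv_len in (9000, 9001):           # one extra token arrives via decode
    print(kv_len, splits_fixed_count(kv_len)[:2], splits_fixed_size(kv_len)[:2])
```

With a fixed count, adding a single token moves every boundary, so even the earliest tokens are reduced in a different grouping. With a fixed size, only the final partial piece changes.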

05 Why determinism matters

You might think this is academic perfectionism. It’s not. There are two concrete reasons to care:

1. Debugging training

When trainer and sampler produce bitwise-identical outputs, you can replay any training step exactly. If a loss spike happens at step 47,392, you can rerun that exact step, inspect every intermediate value, and find the root cause. Without determinism, debugging is statistical — you can characterize problems but not pinpoint them.

2. True on-policy RL

On-policy reinforcement learning, including RLHF-style training, assumes that the policy generating the training data is the policy being trained. But if inference is nondeterministic, the sampler and the trainer assign slightly different probabilities to the same tokens, so the generating policy and the training policy diverge. You’re training on slightly off-policy data.

The standard fix is importance weighting: multiply each training example by $\frac{\pi_\theta(a|s)}{\pi_{\text{old}}(a|s)}$ to correct for the policy mismatch. But importance weights add variance. With deterministic inference, the KL divergence between generator and learner is exactly zero. No correction needed. No variance.
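
A small numeric sketch of the correction. The log-probabilities below are made-up illustrative numbers, not measurements, and the helper names are hypothetical. The only claim is structural: when the trainer reproduces the sampler's log-probabilities bit for bit, every weight is exactly 1 and the estimated divergence is exactly 0.

```python
import math

def importance_weights(logp_learner, logp_sampler):
    """Per-token ratio pi_theta(a|s) / pi_old(a|s), computed from log-probs."""
    return [math.exp(lt - ls) for lt, ls in zip(logp_learner, logp_sampler)]

def mean_log_ratio(logp_learner, logp_sampler):
    """Sample estimate of KL(sampler || learner) on the sampled tokens."""
    diffs = [ls - lt for lt, ls in zip(logp_learner, logp_sampler)]
    return sum(diffs) / len(diffs)

sampler = [-0.105, -2.303, -0.511]          # log-probs recorded at sampling time
learner_drifted = [-0.110, -2.250, -0.520]  # trainer numerics differ slightly
learner_bitwise = list(sampler)             # trainer matches the sampler exactly

print(importance_weights(learner_drifted, sampler))
print(importance_weights(learner_bitwise, sampler),
      mean_log_ratio(learner_bitwise, sampler))
```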

| Metric | Nondeterministic | + Importance Weights | Deterministic |
| --- | --- | --- | --- |
| Unique completions (1000 runs, temp=0) | 80 | 80 | 1 |
| KL divergence (train vs sample) | ~0.001 with spikes | ~0.001 with spikes | exactly 0 |
| Reward collapse risk | High | Medium | None |
| Inference overhead | 1x (baseline) | 1x + sampling overhead | ~1.6x |

The 1.6x inference overhead is the cost: batch-invariant matmul is ~20% slower than cuBLAS, plus the fixed-split attention overhead. But for training stability, many teams consider this a bargain. The open-source implementation is available at thinking-machines-lab/batch-invariant-ops.

06 Why split?

You can’t be real-time and deeply thoughtful at the same time. So don’t try — use two models.

There’s a fundamental tension in AI interaction. Being responsive means generating output quickly — within 200ms, ideally. Being intelligent means spending time reasoning, searching, using tools, planning multi-step actions. You can’t do both with one forward pass.

Human teams solve this naturally: one person keeps the conversation going (“yeah, let me think about that...”) while their brain works on the hard problem in the background. Thinking Machines formalizes this pattern with two models:

FAST
Interaction model

Always present

Processes every 200ms chunk. Handles conversation flow, interruptions, backchannels, visual awareness. Must respond within budget. This is the micro-turn architecture from Chapter 02.

DEEP
Background model

Asynchronous reasoning

Handles complex questions, tool use, code execution, web search, multi-step reasoning. No latency constraint. Works while the interaction model maintains the conversation.

06 The interaction model

The interaction model is the micro-turn architecture from Chapter 02 — a 276B MoE (12B active) that processes 200ms chunks in a continuous stream. Its responsibilities:

  • Conversation management: Track who’s speaking, detect interruptions, manage turn-taking
  • Immediate responses: Quick answers, acknowledgments, clarifications
  • Visual awareness: Monitor video feed, react to visual changes
  • Delegation: Recognize when a question requires deep reasoning and hand off to the background model
  • Result integration: Weave background model results into the conversation naturally

The last two are the interesting ones. The interaction model must know what it doesn’t know — recognize that a coding question, a complex search, or a multi-step plan exceeds its 200ms budget and delegate appropriately.

06 The background model

The background model is a standard reasoning model — think Claude with extended thinking or o1. It receives delegated tasks and works on them asynchronously. It can:

  • Execute multi-step tool chains
  • Browse the web and synthesize information
  • Generate and execute code
  • Plan complex workflows
  • Generate UI components

Results stream back to the interaction model as they’re produced, not as a single dump when the task completes. This enables the interaction model to provide progressive updates: “I’m looking that up... found three relevant papers... the key finding is...”
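A minimal sketch of this streaming pattern, using Python's asyncio. The function names, the queue-based handoff, and the canned partial results are all illustrative assumptions, not Thinking Machines' implementation; only the 200ms chunk length comes from the text.

```python
import asyncio

CHUNK_SECONDS = 0.2  # micro-turn length from the text; everything else is hypothetical


async def background_task(query: str, results: asyncio.Queue) -> None:
    """Stand-in for the background model: emits partial results as they are ready.

    The query is ignored here; a real model would condition on it.
    """
    for partial in ("looking that up", "found three relevant papers", "the key finding is ..."):
        await asyncio.sleep(0.05)  # placeholder for slow reasoning or tool calls
        await results.put(partial)
    await results.put(None)  # sentinel: the task has finished


async def interaction_loop(query: str, tick: float = CHUNK_SECONDS) -> list:
    """Stand-in for the interaction model: keeps ticking while results trickle in."""
    results: asyncio.Queue = asyncio.Queue()
    worker = asyncio.create_task(background_task(query, results))
    spoken = []
    done = False
    while not done:
        await asyncio.sleep(tick)  # one micro-turn of listening and speaking
        # Drain whatever the background model produced during this chunk.
        while not results.empty():
            item = results.get_nowait()
            if item is None:
                done = True
                break
            spoken.append(item)  # a progressive update, not a final dump
    await worker
    return spoken


def run_demo(tick: float = 0.01) -> list:
    """Run the loop with a shortened tick so the demo finishes quickly."""
    return asyncio.run(interaction_loop("find papers on X", tick=tick))
```

Calling `run_demo()` returns the three partial updates in the order they were produced. The point of the queue is that the interaction loop never blocks on the background task: it checks for new results once per chunk and otherwise carries on with the conversation.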

06 Coordination protocol

The critical detail: when the interaction model delegates, it sends a rich context package — not a standalone query. The background model receives the full conversation history, current user intent, visual context, and any relevant state.

[Diagram: the user, the interaction model, and the background model, with results streaming back]

This matters because context loss is the failure mode. If the interaction model strips context when delegating (“search for X” instead of “the user is frustrated with their deployment pipeline and specifically asked about X in the context of Y”), the background model produces generic, unhelpful results.
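One way to picture the "rich context package" is as a small data structure. The sketch below is an assumption for illustration: the class name `DelegationPackage`, its fields, and the `to_prompt` rendering are hypothetical and only mirror the four items the text lists (conversation history, user intent, visual context, relevant state).

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class DelegationPackage:
    """Hypothetical payload sent from the interaction model to the background model."""

    query: str
    conversation_history: List[str]
    user_intent: str
    visual_context: str = ""
    state: Dict[str, str] = field(default_factory=dict)

    def to_prompt(self) -> str:
        """Render the whole package as one prompt so no context is dropped on handoff."""
        lines = [f"Task: {self.query}", f"User intent: {self.user_intent}"]
        if self.visual_context:
            lines.append(f"Visual context: {self.visual_context}")
        for key, value in self.state.items():
            lines.append(f"State[{key}]: {value}")
        lines.append("Conversation so far:")
        lines.extend(f"  {turn}" for turn in self.conversation_history)
        return "\n".join(lines)


# Usage: the stripped query "search for X" becomes one line among several.
package = DelegationPackage(
    query="search for X",
    conversation_history=["user: my deploy keeps failing", "user: what about X?"],
    user_intent="fix a failing deployment pipeline",
    state={"mood": "frustrated"},
)
prompt = package.to_prompt()
```

The design choice being illustrated is that the bare query is never sent on its own; it always travels with the intent and history that give it meaning.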

The design principle: the interaction model is always the user’s primary contact. It maintains the conversation thread. The background model is a specialist the interaction model calls on — like a colleague you message in Slack while staying on a call with the user. The user never talks directly to the background model.

Results are woven into the conversation at contextually appropriate moments. If the user is mid-sentence, the interaction model waits. If there’s a natural pause, it smoothly introduces the result. This is far more natural than the current pattern of “please wait while I think...” followed by a wall of text.
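The wait-for-a-pause behavior can be sketched as a tiny per-chunk policy. This is an assumed heuristic, not the model's learned behavior: the class name `ResultWeaver` and the two-chunk silence threshold are hypothetical, and a real interaction model would decide this from audio context rather than a counter.

```python
from collections import deque
from typing import Optional


class ResultWeaver:
    """Holds background results until the user pauses (hypothetical policy)."""

    def __init__(self, min_pause_chunks: int = 2) -> None:
        # Assumed threshold: 2 chunks of silence = 400 ms at 200 ms per chunk.
        self.min_pause_chunks = min_pause_chunks
        self.pending = deque()
        self.silent_chunks = 0

    def add_result(self, text: str) -> None:
        """Queue a result that arrived from the background model."""
        self.pending.append(text)

    def on_chunk(self, user_is_speaking: bool) -> Optional[str]:
        """Call once per chunk; returns a result to speak, or None to keep waiting."""
        if user_is_speaking:
            self.silent_chunks = 0  # never talk over a mid-sentence user
            return None
        self.silent_chunks += 1
        if self.pending and self.silent_chunks >= self.min_pause_chunks:
            return self.pending.popleft()
        return None
```

With the default threshold, a result queued while the user is talking is held through the first silent chunk and released on the second.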

07 Existing benchmarks

Standard benchmarks test intelligence. Thinking Machines needed to also test interactivity.

They start with established benchmarks to verify the model isn’t trading intelligence for interactivity:

Benchmark | What It Tests | TML-Small | GPT-RT-2.0 | Gemini Flash Live
FD-bench v1 | Turn-taking latency (s) | 0.40 | 1.18 | 0.57
FD-bench v1.5 | Interaction quality avg | 77.8 | 46.8 | 54.3
Audio MultiChallenge | Intelligence + following | 43.4 | 37.6 | 26.8
IFEval (VoiceBench) | Instruction following | 82.1 | 81.7 | 67.6
IFEval (Text) | Text instruction following | 89.7 | 89.6 | 85.8
HarmBench | Safety refusal rate | 99.0% | 99.5% | 99.0%

The headline: TML-Interaction-Small is competitive on intelligence (89.7 on text IFEval, 82.1 on IFEval via VoiceBench) while dominating on interactivity (0.40s turn-taking latency on FD-bench v1 vs 0.57–2.14s for competitors, 77.8 on FD-bench v1.5 vs 39–54 for competitors).

But these benchmarks don’t capture the qualitative jump. Time-awareness, simultaneous speech, and visual proactivity are entirely new capabilities. You can’t measure them on existing tests because no existing model can do them.

07 TimeSpeak & CueSpeak

Thinking Machines created two new benchmarks to test capabilities that require native interactivity:

TimeSpeak

Can the model initiate speech at user-specified times with correct content?

Example task: “I want to practice my breathing. Remind me to breathe in and out every 4 seconds until I ask you to stop.” The model must: (1) understand the timing request, (2) maintain an internal clock, (3) initiate speech at the right moments, (4) stop when asked. This is impossible without native time-awareness.

Results: TML: 64.7% vs GPT-RT-2.0: 4.3%. This isn’t a small gap. GPT-RT essentially can’t do this task at all.

CueSpeak

Can the model speak at contextually appropriate moments with semantically correct content, simultaneously with the user?

Example task: “Every time I code-switch and use another language, give me the correct word in the original language.” The model must: (1) detect code-switching in real-time audio, (2) start speaking while the user is still speaking, (3) produce the correct translation. Graded by LLM judge on both semantic correctness AND timing.

Results: TML: 81.7% vs GPT-RT-2.0: 2.9%. Again, the baseline model essentially can’t do this.

07 Visual proactivity

Three adapted benchmarks test whether the model can see, understand, and proactively react to visual input:

TRACK
RepCount-A

Count my reps

Videos of repeated actions. Model must count out loud in real-time. Tests continuous visual tracking and timely vocalization. TML: 35.4%, GPT-RT: 1.3%.

ANSWER
ProactiveVideoQA

Answer when ready

Video with questions whose answers appear at specific moments. Model must stay silent until the answer is visible, then speak. Tests temporal awareness + visual understanding. TML: 33.5%, GPT-RT: 25.0% (no-response baseline).

LOCATE
Charades

Start/stop actions

Model must say “start” when an action begins and “stop” when it ends. Temporal IoU between predicted and reference intervals. TML: 32.4%, GPT-RT: 0%.
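Temporal IoU is the standard interval-overlap measure: the length of the intersection of two time intervals divided by the length of their union. A minimal sketch follows; the function name and the `(start, end)` tuple convention are choices made here, and the benchmark's exact scoring pipeline is not described in the source.

```python
from typing import Tuple

Interval = Tuple[float, float]  # (start_seconds, end_seconds)


def temporal_iou(pred: Interval, ref: Interval) -> float:
    """Intersection-over-union of two time intervals, in [0, 1]."""
    pred_start, pred_end = pred
    ref_start, ref_end = ref
    # Overlap is zero when the intervals do not touch.
    intersection = max(0.0, min(pred_end, ref_end) - max(pred_start, ref_start))
    union = (pred_end - pred_start) + (ref_end - ref_start) - intersection
    return intersection / union if union > 0 else 0.0


# The model says "start" at 2 s and "stop" at 6 s; the reference action spans 3 s to 7 s.
score = temporal_iou((2.0, 6.0), (3.0, 7.0))  # 3 s overlap / 5 s union = 0.6
```

A model that stays silent produces no interval at all, which is consistent with a score of 0% on this task.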

The punchline: “No existing model can meaningfully perform any of these tasks. For the sake of completeness, we report the results of GPT Realtime-2 (minimal), but all models evaluated perform similar or worse on these tasks, including thinking high models. They stay silent or give incorrect answers.”

This is the qualitative jump that existing benchmarks miss. It’s not that TML is slightly better at these tasks. It’s that no other model can do them at all. The micro-turn architecture enables an entirely new category of capability.

07 The results, visualized

[Chart: benchmark results compared across TML-Interaction-Small, GPT-Realtime-2.0, and Gemini Flash Live]

08 Task-length horizons

METR’s measurement of autonomous AI capability — and why interactivity changes the equation.

METR (Model Evaluation and Threat Research) introduced a powerful way to measure AI progress: instead of benchmark scores, measure the length of tasks that models can complete autonomously, where length is defined by how long a human expert would take.

Their key finding: the length of tasks that frontier models can complete at 50% reliability has been doubling approximately every 7 months for the last 6 years.
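The doubling claim is an exponential: a horizon of h0 grows to h0 · 2^(t / 7) after t months. The sketch below works through that arithmetic. The 7-month doubling time is the figure quoted above; the 60-minute starting horizon in the usage lines is a hypothetical input, not a METR measurement.

```python
import math

DOUBLING_MONTHS = 7.0  # approximate doubling time quoted above


def horizon_after(h0_minutes: float, months: float) -> float:
    """Task horizon after `months`, starting from `h0_minutes`, under steady doubling."""
    return h0_minutes * 2 ** (months / DOUBLING_MONTHS)


def months_to_reach(h0_minutes: float, target_minutes: float) -> float:
    """Months of steady doubling needed to grow from h0 to the target horizon."""
    return DOUBLING_MONTHS * math.log2(target_minutes / h0_minutes)


# Hypothetical starting point: a 60-minute horizon.
after_14_months = horizon_after(60, 14)   # two doublings: 240 minutes
to_eight_hours = months_to_reach(60, 480)  # three doublings: 21 months

# Six years of doubling every 7 months is about 10.3 doublings.
six_year_factor = horizon_after(1, 72)     # roughly a 1,250-fold increase
```

This is a pure extrapolation of the trend; it says nothing about whether the trend continues.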

[Chart: model time horizons at 50% reliability, with extrapolation]

As of METR's March 2025 report, frontier models (like Claude 3.7 Sonnet) could reliably complete tasks that take humans a few minutes. They could occasionally succeed at tasks taking hours, but reliability dropped sharply as task length grew.

The key insight for interaction models: METR measures autonomous task completion. But most real work isn’t autonomous — it’s collaborative. If you can halve the coordination overhead between human and AI, you effectively double the task complexity that’s achievable. Interactivity doesn’t just make AI more pleasant — it extends the effective task horizon.

08 Intelligence × interactivity

Traditional AI evaluation is one-dimensional: how smart is the model? Interaction models add a second dimension: how well can you work with it?

Thinking Machines positions this as a frontier — models can be plotted on a 2D space of intelligence vs. interactivity. Most current models live in the high-intelligence-low-interactivity quadrant (text chat) or the low-intelligence-high-interactivity quadrant (basic voice assistants). TML-Interaction-Small aims for the top-right: high intelligence AND high interactivity.

[Chart: intelligence vs. interactivity, with legend entries for TML models, OpenAI models, Google models, and other models]

08 Current limitations

Thinking Machines is honest about what doesn’t work yet:

Long sessions

Continuous audio and video accumulate context quickly. At 200ms chunks with ~30 audio tokens + 1,600 video patches per chunk, context grows by ~8,000 tokens per second. Even with a 128K context window, that’s only ~16 seconds of continuous multimodal input before context fills. The streaming session design handles short/medium interactions well, but very long sessions require careful context management — a KV cache eviction strategy, context compression, or switching to a summary representation.
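The context-growth arithmetic can be checked directly. The constants below are the figures from the paragraph above; the function names are illustrative, and treating "128K" as 128,000 tokens is an assumption (using 131,072 gives almost the same answer).

```python
CHUNK_MS = 200                   # chunk length from the text
AUDIO_TOKENS_PER_CHUNK = 30      # approximate, from the text
VIDEO_PATCHES_PER_CHUNK = 1_600  # from the text


def tokens_per_second() -> float:
    """Context growth rate for continuous audio plus video input."""
    chunks_per_second = 1000 / CHUNK_MS  # 5 chunks per second
    return chunks_per_second * (AUDIO_TOKENS_PER_CHUNK + VIDEO_PATCHES_PER_CHUNK)


def seconds_until_full(context_window: int = 128_000) -> float:
    """How long continuous input can run before the context window fills."""
    return context_window / tokens_per_second()


rate = tokens_per_second()         # 5 * 1,630 = 8,150 tokens per second
budget = seconds_until_full()      # a little under 16 seconds
audio_only = 128_000 / (5 * AUDIO_TOKENS_PER_CHUNK)  # about 853 seconds without video
```

The audio-only line shows where the pressure comes from: video patches account for almost all of the growth, so any context-management strategy has to deal with video first.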

Connectivity

Low-latency streaming requires reliable network. With 200ms chunks and ~25ms compute, you have ~175ms for round-trip network latency. On a poor connection, chunks arrive late, the model’s temporal model breaks down, and the experience degrades significantly. Edge deployment or aggressive chunking strategies could help.
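The latency budget is a simple subtraction, sketched below with the numbers from the paragraph above. The function names and the sample round-trip times are hypothetical, chosen only to show how a poor connection eats into the budget.

```python
from typing import List


def network_budget_ms(chunk_ms: float = 200.0, compute_ms: float = 25.0) -> float:
    """Time left for the network round trip within one chunk."""
    return chunk_ms - compute_ms


def chunk_is_late(round_trip_ms: float) -> bool:
    """True when a chunk's round trip exceeds the per-chunk network budget."""
    return round_trip_ms > network_budget_ms()


def late_fraction(round_trips_ms: List[float]) -> float:
    """Share of chunks that miss the budget on a given connection."""
    if not round_trips_ms:
        return 0.0
    return sum(chunk_is_late(rtt) for rtt in round_trips_ms) / len(round_trips_ms)


# Hypothetical round-trip samples (ms) from a flaky connection.
samples = [80.0, 120.0, 220.0, 150.0, 300.0]
share_late = late_fraction(samples)  # 2 of 5 chunks exceed 175 ms
```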

Model scale

TML-Interaction-Small is “small” at 276B total (12B active). Larger models could be more capable, but current hardware can’t serve them within the 200ms budget. As inference hardware improves, larger interaction models become feasible.

Background agent sophistication

The background agent is functional but basic. The coordination protocol (rich context delegation + streaming results) works, but there’s enormous room to improve how the interaction and background models collaborate — especially for complex, multi-step tasks.

08 Where this goes

If the interaction model thesis is correct — that interactivity should be part of the model, not bolted on — then several things follow:

  1. Scaling improves everything. A 10x larger interaction model should be better at turn-taking, interruption, timing, AND reasoning. This is the core bet.
  2. New training paradigms. Training on audio/video streams (not transcripts) teaches the model about timing, prosody, visual context. This is a fundamentally different training distribution than text-only.
  3. New safety challenges. Real-time interaction stresses safety differently. The model must refuse harmful requests naturally in speech (not with a robotic “I cannot help with that”). Multi-turn manipulation over long sessions is harder to detect. Automated red-teaming must cover conversational dynamics, not just prompt-response pairs.
  4. New evaluation frameworks. TimeSpeak and CueSpeak are first steps. The community needs benchmarks for: sustained multi-modal awareness, collaborative task completion, graceful degradation under network issues, long-session coherence.

The Thinking Machines team is planning a limited research preview in the coming months, with a wider release later in 2026. They’re also funding research grants for new interactivity benchmarks and evaluation frameworks. The full blog post: thinkingmachines.ai/blog/interaction-models.

08 References

  1. Thinking Machines Lab. “Interaction Models: A Scalable Approach to Human-AI Collaboration.” Connectionism, May 2026. Blog
  2. Bai, R. H. et al. “dMel: Speech Tokenization Made Simple.” 2024. arXiv:2407.15835
  3. Touvron, H. et al. “Three things everyone should know about Vision Transformers.” 2022. arXiv:2203.09795
  4. Lipman, Y. et al. “Flow Matching for Generative Modeling.” 2022. arXiv:2210.02747
  5. Korthikanti, V. et al. “Reducing Activation Recomputation in Large Transformer Models.” 2022. arXiv:2205.05198
  6. He, H. “Defeating Nondeterminism in LLM Inference.” Thinking Machines Lab, Sep 2025. Blog
  7. Kwa, T. et al. “Measuring AI Ability to Complete Long Tasks.” METR, Mar 2025. Blog
  8. Wang, Y. et al. “ProactiveVideoQA.” 2025. arXiv:2507.09313
  9. Hu, H. et al. “TransRAC: Encoding Multi-scale Temporal Correlation (RepCount).” 2022. arXiv:2204.01018
  10. Sigurdsson, G. et al. “Hollywood in Homes: Crowdsourcing Data Collection (Charades).” 2016. arXiv:1604.01753
  11. Clark, H. & Brennan, S. “Grounding in Communication.” Perspectives on Socially Shared Cognition, 1991.
  12. Sutton, R. “The Bitter Lesson.” 2019.