How Thinking Machines built an AI that listens while it talks, sees while it thinks, and stays present in 200-millisecond heartbeats — a teardown of the architecture, inference tricks, and new benchmarks that make real-time human-AI collaboration possible.
Every concept in this lesson and how they connect — the territory before the map.
This lesson unpacks a single blog post into 14 interconnected concepts across four topic clusters.
Every concept you'll encounter, sorted by cluster.
200ms continuous I/O chunks replace discrete turns. Ch 02.
dMel + hMLP + Flow head co-trained with one transformer. Ch 03.
Persistent GPU memory for append-only inference. Ch 04.
Discretized mel-filterbank channels as speech tokens. Ch 03.
CNF-based audio generation in 200ms chunks. Ch 03.
Same output regardless of other concurrent requests. Ch 05.
Why (a+b)+c ≠ a+(b+c) in floating-point. Ch 05.
Distributing sequential ops across devices to save memory. Ch 04.
Why current AI interfaces cripple collaboration. Ch 01.
Real-time presence + deep async reasoning. Ch 06.
Copresence, cotemporality, simultaneity. Ch 01.
Benchmarks for time-aware and simultaneous speech. Ch 07.
RepCount, ProactiveVideoQA, Charades benchmarks. Ch 07.
METR’s measurement of autonomous AI capability. Ch 08.
This lesson is structured in layers. You can read it linearly or skip around.
If you already know transformer architectures and mel spectrograms, skip to Chapter 02. If you only care about the benchmarks, jump to Chapter 07.
AI gets smarter every quarter. But talking to it still feels like sending telegrams.
Imagine you’re pair-programming with a colleague. You’re both looking at the same screen. You point at a function and start saying “this part is—” and they immediately see what you’re pointing at, nod, and say “yeah, the edge case with null inputs.” You didn’t finish your sentence. You didn’t need to.
Now imagine the same interaction with an AI. You type a message. You wait. The AI reads your entire message. It generates a response. You read the entire response. You type another message. This is the turn-based bottleneck: the gap between what human collaboration is and what human-AI interaction allows.
Anthropic’s frontier model card is candid about this: “when used in an interactive, synchronous, ‘hands-on-keyboard’ pattern, the benefits of the model were less clear.”
The problem isn’t intelligence. The models are smart enough. The problem is the channel — a text box that freezes perception during generation and forces strict turn-taking.
Herbert Clark and Susan Brennan identified three properties that make human communication work, back in 1991. These aren’t nice-to-haves. They’re load-bearing:
Copresence. Both parties have interactive access to the same shared content. You can see what I’m pointing at. I can see your reaction to what I just said.
Cotemporality. Information is received as it’s produced. You hear my words as I speak them, not after I finish my paragraph. Feedback is real-time.
Simultaneity. Both parties produce and receive information concurrently. I can nod while you talk. I can say “mm-hmm” without interrupting your flow.
Current LLM interfaces violate all three. The model can’t see your screen while it types. It doesn’t receive your input while generating. And it certainly can’t nod along. The Thinking Machines blog frames this as the core bottleneck: humans are pushed out of the loop not because the task demands it, but because the interface can’t keep up.
Let’s make the problem concrete. Here’s what a timeline looks like for a turn-based model versus a continuously interactive one.
In the turn-based model, there are dead zones — periods where neither party is doing useful work because the protocol demands one finish before the other starts. In the interactive model, the AI’s perception never stops. It hears you while it talks. It sees your screen while it thinks. The dead zones collapse.
Rich Sutton’s Bitter Lesson (2019) argues that general methods that leverage computation ultimately win out over hand-engineered solutions. Thinking Machines extends this to interactivity: bolting a voice-activity detector onto a text model is the hand-engineered approach. Training a natively multimodal model that processes 200ms chunks is the general approach.
The claim is simple and testable: scaling a model should make it smarter AND a better collaborator. If interactivity lives outside the model (in a VAD module, an orchestrator, a turn-management system), then scaling the model only makes it smarter. The collaboration quality stays flat.
This is the premise behind everything that follows.
The heartbeat of the interaction model: 200 milliseconds of continuous I/O.
In a standard LLM, the model processes your entire input, then generates its entire response. Input and output are discrete, sequential phases. A micro-turn throws this away.
Instead, the model processes the world in 200ms chunks. Every 200 milliseconds, it ingests whatever input has arrived — a fragment of speech, a video frame, a partial text — and decides what to do: continue listening, start speaking, adjust its current output, or stay silent. Both input and output are treated as continuous streams that interleave in a single token sequence.
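A minimal sketch of that loop, assuming a hypothetical `decide` function in place of the model's forward pass. The `Chunk` type and the token strings are illustrative names, not anything taken from the Thinking Machines post.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Chunk:
    """Whatever input arrived during one 200 ms window (hypothetical type)."""
    audio: str = ""   # placeholder for audio tokens
    video: str = ""   # placeholder for a video frame

def decide(sequence: List[str], chunk: Chunk) -> List[str]:
    """Toy stand-in for the model: stay silent while the user is speaking."""
    if chunk.audio:
        return ["<silence>"]
    heard = [t for t in sequence if t.startswith("in:audio:")]
    return [f"out:speech:ack({len(heard)} chunks heard)"]

def run_session(chunks: List[Chunk]) -> List[str]:
    """Interleave input and output tokens in one append-only sequence."""
    sequence: List[str] = []
    for chunk in chunks:
        if chunk.audio:
            sequence.append(f"in:audio:{chunk.audio}")
        if chunk.video:
            sequence.append(f"in:video:{chunk.video}")
        # Output tokens land between input chunks, in the same sequence.
        sequence.extend(decide(sequence, chunk))
    return sequence

timeline = run_session([Chunk(audio="this part"), Chunk(audio="is"), Chunk(video="frame3")])
print(timeline)
```

The point of the sketch is the data layout: there is one list, and both directions of the conversation append to it every tick.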
Concretely, the model’s token sequence looks like this:
This number isn’t arbitrary. It sits at the intersection of three constraints:
Human conversational turn-taking gaps average 200–300ms. Respond faster than 200ms and it feels like interruption. Respond slower than 500ms and it feels laggy. 200ms is the sweet spot for perceived real-time.
Psycholinguistics: Stivers et al. 2009
Speech at 16kHz means 200ms = 3,200 samples. That’s enough to capture a full phoneme but not so much that latency builds up. Standard audio frames (20ms) are too small for meaningful semantic processing; full utterances are too large for real-time response.
Signal processing: Nyquist window trade-off
At 200ms, you have roughly 150–180ms of actual compute time (after network overhead). On modern GPUs with a 12B-active MoE, that’s enough for a meaningful forward pass including both prefill of the new chunk and several decode steps for output generation.
Systems constraint: Thinking Machines inference team
At 50ms, the overhead dominates: the model spends more time managing chunks than processing them. At 500ms, latency is perceptible. 200ms is the Goldilocks zone.
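The chunk-size arithmetic above is easy to check. This is a small sketch using only the figures quoted in the lesson (16kHz audio, 20ms frames); the function names are made up for illustration.

```python
SAMPLE_RATE_HZ = 16_000   # sample rate quoted in the lesson
FRAME_MS = 20             # "standard audio frame" size quoted in the lesson

def samples_per_chunk(chunk_ms: int, sample_rate_hz: int = SAMPLE_RATE_HZ) -> int:
    """Number of raw audio samples that arrive in one chunk."""
    return sample_rate_hz * chunk_ms // 1000

def frames_per_chunk(chunk_ms: int, frame_ms: int = FRAME_MS) -> float:
    """How many standard audio frames fit in one chunk."""
    return chunk_ms / frame_ms

def chunks_per_second(chunk_ms: int) -> float:
    """How often the model has to run a prefill step."""
    return 1000 / chunk_ms

for chunk_ms in (50, 200, 500):
    print(chunk_ms, samples_per_chunk(chunk_ms),
          frames_per_chunk(chunk_ms), chunks_per_second(chunk_ms))
```

At 50ms the model would have to run 20 prefill steps per second, which is where the per-chunk overhead starts to dominate.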
The critical architectural insight is that input and output share a single sequence. This is fundamentally different from systems that have separate input and output pipelines connected by an orchestrator.
Here’s what this means in practice:
| Property | Turn-Based | Micro-Turn |
|---|---|---|
| Input processing | Complete utterance, then process | 200ms chunks, processed incrementally |
| Output generation | Full response generated at once | Tokens between input chunks |
| Interruption | External VAD detects speech, kills generation | Model hears input during generation, decides to yield |
| Simultaneous speech | Impossible — one direction at a time | Native — input and output streams overlap |
| Visual awareness | Snapshot at turn start | Continuous video frames every 200ms |
| Tool use timing | Tools called after response generation | Tools called concurrently during conversation |
The simultaneous speech capability is especially striking. Because both audio input and audio output are part of the same token sequence, the model can literally speak while listening. This enables interaction patterns that are impossible with turn-based systems: live translation, real-time commentary, counting reps during exercise.
Most “real-time” voice AI systems today use a harness-based architecture:
Each component is separately engineered, and the seams show. The VAD can’t understand context — it just detects energy. It can’t tell if you paused to think or finished your sentence. The LLM can’t hear how you said something — only what you said. And none of these components improve when you make the LLM larger.
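To see why an energy detector cannot tell a thinking pause from a finished sentence, here is a toy energy-based VAD. The threshold, the `hangover` parameter, and the frame values are assumptions chosen for the example, not settings from any real product.

```python
from typing import List

def frame_energy(frame: List[float]) -> float:
    """Mean squared amplitude of one audio frame."""
    return sum(s * s for s in frame) / len(frame)

def energy_vad(frames: List[List[float]], threshold: float = 0.01, hangover: int = 3) -> int:
    """Return the frame index where end-of-turn is declared, or -1 if never.

    The turn is declared over after `hangover` consecutive low-energy frames.
    """
    quiet = 0
    for i, frame in enumerate(frames):
        quiet = quiet + 1 if frame_energy(frame) < threshold else 0
        if quiet >= hangover:
            return i
    return -1

speech = [0.5, -0.5, 0.5, -0.5]
silence = [0.0, 0.0, 0.0, 0.0]
thinking_pause = [speech, speech, silence, silence, silence, speech]  # user resumes
finished_turn = [speech, speech, silence, silence, silence]           # user is done
print(energy_vad(thinking_pause), energy_vad(finished_turn))
```

Both inputs trigger end-of-turn at the same frame, because the detector only sees energy and has no access to what was said.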
This is the “scaling with intelligence” argument: every component in the harness is a ceiling. Make the LLM 10x smarter and your VAD is still the same dumb energy detector. With micro-turns, making the model smarter makes every aspect of interaction better — turn-taking, interruption handling, emotional awareness, timing, all of it.
Three modalities, three lightweight encoders, one shared transformer. No frozen encoders. No separate decoders. Everything co-trained from scratch.
The dominant approach to multimodal AI uses large pre-trained encoders: a Whisper-class model for audio, a CLIP/SigLIP model for vision, a vocoder for speech output. Each is frozen, and the LLM learns to interface with their embedding spaces.
This approach has a problem for real-time interaction: large encoders add latency. A Whisper encoder needs the full audio segment before producing embeddings. A CLIP encoder processes a 224×224 image through dozens of layers. When your budget is 200ms, these pre-processing steps eat most of it.
Thinking Machines takes a different approach: encoder-free early fusion. Instead of large frozen encoders, they use lightweight embedding layers that convert raw signals into token-like representations, then feed everything into one transformer.
How do you turn speech into tokens? The standard approaches are:
dMel (Bai et al., 2024) is a beautifully simple alternative. Take the mel spectrogram — a standard audio representation that’s been used for decades — and discretize each frequency channel into intensity bins.
Why does this work? Because the mel spectrogram already captures the structure of speech (formants, pitch, energy) in a format that the transformer can learn to interpret. The “encoder” is just a lookup table. All the heavy lifting happens inside the shared transformer.
The key advantage for real-time interaction: dMel is streaming-native. Each 200ms chunk produces its tokens independently. There’s no need to buffer a full utterance before encoding.
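A rough sketch of the discretization step, assuming 16 intensity bins and a fixed clipping range for log-mel values. Both numbers are illustrative assumptions here, and the real pipeline would first compute the mel spectrogram from audio.

```python
from typing import List

NUM_BINS = 16                           # assumption: intensity bins per channel
LOG_MEL_MIN, LOG_MEL_MAX = -7.0, 2.0    # assumption: clipping range for log-mel values

def discretize_frame(log_mel: List[float], num_bins: int = NUM_BINS) -> List[int]:
    """Map each mel channel's log-energy to a bin index in [0, num_bins - 1]."""
    span = LOG_MEL_MAX - LOG_MEL_MIN
    tokens = []
    for value in log_mel:
        clipped = min(max(value, LOG_MEL_MIN), LOG_MEL_MAX)
        index = int((clipped - LOG_MEL_MIN) / span * num_bins)
        tokens.append(min(index, num_bins - 1))  # the top edge falls in the last bin
    return tokens

def tokenize_chunk(frames: List[List[float]]) -> List[List[int]]:
    """Each frame is discretized on its own, so no chunk waits on later audio."""
    return [discretize_frame(frame) for frame in frames]

chunk = [[-7.0, -2.5, 2.0], [-9.0, 0.0, 5.0]]  # two toy frames, three mel channels
print(tokenize_chunk(chunk))
```

Because no frame depends on any other, tokenizing the first frame alone gives the same tokens as tokenizing it as part of a longer chunk, which is the streaming property the text describes.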
For video, Thinking Machines uses a similar philosophy: minimal encoding, maximal transformer processing. Each video frame is divided into 40×40 patches, and each patch is encoded by an hMLP (hierarchical MLP) from Touvron et al. (2022).
An hMLP is much simpler than a ViT (Vision Transformer). It has no self-attention layers — it’s a stack of feedforward layers that processes each patch independently. The cross-patch reasoning happens inside the shared transformer, not in a separate vision encoder.
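The "each patch independently" property can be shown with a tiny per-patch feedforward net. This is a simplified stand-in, not the hierarchical stem from Touvron et al.; the weights and patch sizes are arbitrary toy values.

```python
from typing import List

Matrix = List[List[float]]

def mlp(patch: List[float], w1: Matrix, w2: Matrix) -> List[float]:
    """Two-layer feedforward net with ReLU, applied to one flattened patch."""
    hidden = [max(0.0, sum(w * x for w, x in zip(row, patch))) for row in w1]
    return [sum(w * h for w, h in zip(row, hidden)) for row in w2]

def embed_patches(patches: Matrix, w1: Matrix, w2: Matrix) -> Matrix:
    """Embed every patch on its own: no patch sees any other patch."""
    return [mlp(patch, w1, w2) for patch in patches]

W1 = [[1.0, 0.0, -1.0, 0.5], [0.0, 1.0, 0.5, -1.0]]   # 4 pixels -> 2 hidden units
W2 = [[1.0, 1.0], [1.0, -1.0], [0.5, 0.0]]             # 2 hidden -> 3-dim embedding

frame = [[0.1, 0.2, 0.3, 0.4], [0.9, 0.8, 0.7, 0.6]]
edited = [[0.1, 0.2, 0.3, 0.4], [0.0, 0.0, 0.0, 0.0]]  # only the second patch changed

before = embed_patches(frame, W1, W2)
after = embed_patches(edited, W1, W2)
print(before[0] == after[0], before[1] == after[1])
```

Changing one patch leaves every other patch's embedding untouched, so any reasoning that relates patches to each other has to happen later, in the shared transformer.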
| Property | CLIP/SigLIP Encoder | hMLP Patches |
|---|---|---|
| Parameters | 300M–2B | ~10M |
| Self-attention | Yes (12–48 layers) | None |
| Cross-patch reasoning | Inside vision encoder | Inside shared transformer |
| Latency per frame | 5–30ms | <1ms |
| Training | Frozen (pre-trained on 400M–12B images) | Co-trained from scratch |
| Tokens per frame | 256–576 (after resampling) | 1,600 (40×40) |
The trade-off is clear: more tokens per frame (1,600 vs ~256) but much faster encoding and fully co-trained representations. For a real-time system processing frames every 200ms, the sub-millisecond encoding latency is critical. The extra tokens are manageable because the model has 200ms of compute budget per chunk.
The input side is solved: dMel for audio in, hMLP for video in, standard token embedding for text in. But how does the model produce audio?
Thinking Machines uses a flow head based on Flow Matching (Lipman et al., 2022). This is a generative model that learns to transform noise into mel spectrograms by learning a vector field along a probability path.
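Sampling from a flow model means integrating a vector field from noise at t=0 to data at t=1. The sketch below uses a closed-form toy field that points straight at a known target, purely so it runs without training; a real flow head would be a learned network conditioned on the transformer's output, and the names here are hypothetical.

```python
from typing import List

def toy_vector_field(x: List[float], t: float, target: List[float]) -> List[float]:
    """Velocity that carries x to `target` along a straight path, for t in [0, 1)."""
    return [(tgt - xi) / (1.0 - t) for xi, tgt in zip(x, target)]

def sample(noise: List[float], target: List[float], steps: int = 8) -> List[float]:
    """Integrate dx/dt = v(x, t) from t=0 to t=1 with fixed-size Euler steps."""
    x = list(noise)
    dt = 1.0 / steps
    for k in range(steps):
        t = k * dt
        velocity = toy_vector_field(x, t, target)
        x = [xi + dt * vi for xi, vi in zip(x, velocity)]
    return x

noise = [0.3, -1.2, 0.8]        # stand-in for a Gaussian noise sample
mel_frame = [1.0, 0.5, -0.25]   # stand-in for one mel-spectrogram frame
result = sample(noise, mel_frame)
print(result)
```

The number of integration steps is the knob that trades audio quality against latency, which matters when the whole budget is one 200ms chunk.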
Why flow matching instead of a standard vocoder or diffusion model?
All three modality paths — dMel audio tokens, hMLP video patches, text tokens — produce embeddings in the same hidden dimension. Every 200ms, the active modality tokens are concatenated into a “bag of embeddings” and fed into the transformer.
The transformer then produces output tokens: text tokens for language generation, and conditioning vectors for the flow head to generate mel spectrograms. All of this happens inside a single model — a 276B-parameter Mixture of Experts with 12B active parameters per forward pass.
200ms chunks mean 5 prefill operations per second. Standard inference libraries weren’t built for this.
Here’s the fundamental tension. A standard LLM inference system expects a pattern like:
This happens once per request. The overhead of initializing GPU memory, computing attention masks, and managing metadata is amortized over a long generation.
But a micro-turn model needs to prefill a new chunk every 200 milliseconds. That’s 5 prefill operations per second. Each one adds only a small amount of new input: roughly 20–50 tokens for 200ms of audio, plus patch tokens whenever a video frame is included. The overhead of standard inference libraries (memory allocation, metadata computation, scheduler interrupts) would eat more time than the actual computation.
The solution is streaming sessions. Instead of treating each 200ms chunk as a separate request, the client opens a persistent session with the inference server. The session maintains a single growing sequence in GPU memory:
Client connects. Server allocates GPU memory for the KV cache and initializes the sequence state.
One-time overhead amortized over the entire conversation
Every 200ms, client sends a new audio/video chunk. Server tokenizes it and appends to the existing sequence. Prefill runs only on the new tokens, with no re-processing of the full context.
Incremental prefill: O(new_tokens × total_tokens) attention, not O(total_tokens²)
After prefilling the new chunk, the server runs decode steps to generate output tokens (text or mel conditioning). These are streamed back to the client.
Output starts arriving before the next input chunk
The sequence grows continuously. KV cache persists across chunks. No re-initialization, no memory re-allocation (pre-allocated with headroom).
Steady-state overhead: essentially zero per chunk
TML-Interaction-Small is a Mixture of Experts model: 276B total parameters, 12B active per forward pass. MoE models route each token to a subset of “expert” FFN blocks. This is great for capacity vs. compute, but MoE kernels have their own inference challenges.
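Returning to the streaming session lifecycle for a moment, the cost difference between incremental prefill and re-processing the full context can be sketched in plain Python. `StreamingSession` is a hypothetical class for illustration and does not mirror the SGLang API; "attention ops" here simply counts query-key pairs.

```python
from typing import List

class StreamingSession:
    """Toy append-only session: the KV cache is set up once and only grows."""

    def __init__(self, capacity: int):
        self.kv_cache: List[str] = []   # stand-in for per-token key/value tensors
        self.capacity = capacity        # pre-allocated headroom (assumed size)
        self.attention_ops = 0          # query-key pairs scored so far

    def append_chunk(self, new_tokens: List[str]) -> None:
        """Prefill only the new tokens. Each one attends causally to every
        cached token plus the new tokens up to and including itself."""
        if len(self.kv_cache) + len(new_tokens) > self.capacity:
            raise MemoryError("session exceeded its pre-allocated KV cache")
        for token in new_tokens:
            self.kv_cache.append(token)
            self.attention_ops += len(self.kv_cache)

def full_reprefill_ops(chunk_sizes: List[int]) -> int:
    """Cost if each chunk were a fresh request that re-processes all context."""
    total, ops = 0, 0
    for size in chunk_sizes:
        total += size
        ops += total * (total + 1) // 2   # causal attention over the whole sequence
    return ops

chunk_sizes = [30] * 50                   # 10 seconds of 200 ms chunks, 30 tokens each
session = StreamingSession(capacity=4096)
for size in chunk_sizes:
    session.append_chunk(["tok"] * size)
print(session.attention_ops, full_reprefill_ops(chunk_sizes))
```

The gap between the two counts widens as the conversation gets longer, which is why the session keeps its cache alive instead of treating chunks as separate requests.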
The standard approach for MoE inference is grouped GEMM: group tokens by their assigned expert, then run one matrix multiply per expert. But grouped GEMM has poor GPU utilization when group sizes are small (few tokens per expert in a 200ms chunk).
Thinking Machines uses a gather+GEMV strategy instead, borrowed from work by PyTorch and Cursor:
This is significant because decode in a streaming session produces 1–5 tokens at a time. GEMV kernels are heavily optimized for this case on modern GPUs: with so few tokens the work is bound by memory bandwidth rather than compute, and GEMV kernels are built around that bottleneck.
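As a rough illustration of the two strategies, the sketch below assumes top-1 routing and tiny dense experts, and uses plain Python lists in place of GPU kernels. It shows only the data flow (group by expert versus gather per token); the function names are made up and nothing here reflects the actual kernel code.

```python
from typing import Dict, List

Matrix = List[List[float]]

def gemv(weight: Matrix, x: List[float]) -> List[float]:
    """Matrix-vector product: one output value per weight row."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weight]

def grouped_gemm(tokens: Matrix, assignments: List[int], experts: List[Matrix]) -> Matrix:
    """Group tokens by expert, then process one (possibly tiny) group per expert."""
    groups: Dict[int, List[int]] = {}
    for idx, expert_id in enumerate(assignments):
        groups.setdefault(expert_id, []).append(idx)
    out: Matrix = [[] for _ in tokens]
    for expert_id, indices in groups.items():
        for idx in indices:   # rows of this expert's small matrix multiply
            out[idx] = gemv(experts[expert_id], tokens[idx])
    return out

def gather_gemv(tokens: Matrix, assignments: List[int], experts: List[Matrix]) -> Matrix:
    """Gather each token's expert weights by index, then run one GEMV per token."""
    gathered = [experts[expert_id] for expert_id in assignments]   # the gather
    return [gemv(w, x) for w, x in zip(gathered, tokens)]          # the GEMVs

experts = [[[1.0, 0.0], [0.0, 1.0]], [[2.0, 0.0], [0.0, 2.0]]]
tokens = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
assignments = [1, 0, 1]   # expert chosen for each token (assumed top-1 routing)
print(gather_gemv(tokens, assignments, experts))
```

Both paths compute the same result; the difference on real hardware is how well each maps to the GPU when only a handful of tokens arrive per step.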
The streaming session implementation was upstreamed to SGLang (PR #19171), an open-source inference framework. This isn’t proprietary magic — any team building real-time multimodal models can benefit from the same infrastructure.
Key aspects of the contribution:
The reason you can’t reproduce your LLM outputs — and it has nothing to do with temperature.
Set temperature to 0. Set the seed. Run the same prompt twice on the same GPU. Get different outputs. This happens constantly, and the standard explanation is “floating-point nondeterminism.” But that’s vague. Let’s be precise.
Floating-point addition is non-associative:
This isn’t a bug. It’s what makes floating-point useful. Floating-point numbers use a fixed number of bits to represent an enormous range of values. When you add two numbers with very different magnitudes, the smaller one gets rounded. The order of additions determines which roundings happen.
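Two short cases make the rounding visible in ordinary Python floats (64-bit IEEE 754). The values are chosen only to make the effect easy to see.

```python
# Case 1: the classic decimal fractions.
a, b, c = 0.1, 0.2, 0.3
left = (a + b) + c
right = a + (b + c)
print(left, right, left == right)

# Case 2: very different magnitudes. Adding 1.0 to 1e20 rounds it away,
# so whether the 1.0 survives depends on when it is added.
big, small = 1e20, 1.0
absorbed = (big + small) - big    # small is lost before big cancels
preserved = (big - big) + small   # big cancels first, small survives
print(absorbed, preserved)
```

The same three numbers give two different answers depending only on the grouping, which is all a GPU kernel changes when it reorders a reduction.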
Here’s the insight that changes everything. The common belief is that GPU atomic operations cause nondeterminism (random thread execution order → random summation order). But Horace He at Thinking Machines shows this is wrong for LLM inference.
The forward pass of an LLM involves no operations that require atomic adds.
So where does the nondeterminism come from? From batch size variance. Modern inference servers dynamically batch requests. User A sends a prompt. User B sends a prompt 50ms later. The server batches them together. The batch size changes the internal reduction patterns in GPU kernels, which changes the floating-point accumulation order, which changes the output.
    import torch

    torch.set_default_device('cuda')

    B, D = 2048, 4096
    a = torch.linspace(-1000, 1000, B*D).reshape(B, D)
    b = torch.linspace(-1000, 1000, D*D).reshape(D, D)

    # Same first row, different batch size
    out1 = torch.mm(a[:1], b)      # batch=1
    out2 = torch.mm(a, b)[:1]      # batch=2048, take first row
    print((out1 - out2).abs().max())  # tensor(1669.2500): huge!
The same row, the same weight matrix, yet a 1,669-unit difference in the output. Why? Because cuBLAS uses different tile decompositions for different batch sizes. Different tiles mean different partial-sum accumulation orders. Different orders mean different floating-point results.
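The tile effect can be imitated on the CPU. The sketch below is a pure-Python stand-in that sums the same four values with two different tile sizes; it is not how cuBLAS is implemented, and the values are picked so that the rounding is obvious.

```python
from typing import List

def sequential_sum(values: List[float]) -> float:
    """Plain left-to-right accumulation (explicit loop, no compensated summation)."""
    total = 0.0
    for value in values:
        total += value
    return total

def tiled_sum(values: List[float], tile: int) -> float:
    """Sum each tile left to right, then add the partial sums left to right."""
    partials = [sequential_sum(values[i:i + tile]) for i in range(0, len(values), tile)]
    return sequential_sum(partials)

row = [1e17, 1.0, -1e17, 1.0]        # exact mathematical sum is 2.0
one_tile = tiled_sum(row, tile=4)    # one 1.0 is absorbed, the other survives
two_tiles = tiled_sum(row, tile=2)   # both 1.0s are absorbed inside their tiles
print(one_tile, two_tiles)
```

Neither answer is the exact sum, and they disagree with each other. A kernel that picks its tile size from the batch shape reproduces this effect at scale.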
The fix is to ensure that every GPU kernel produces the same output for a given input regardless of batch size. This is called batch invariance.
The strategy is consistent across all operations: assign one batch element per core. When each element’s reduction happens entirely within a single core, the accumulation order is fixed regardless of how many other elements are in the batch.
RMSNorm computes $\text{RMS}(x) = \sqrt{\frac{1}{D}\sum_{i=1}^{D} x_i^2}$ per row. In data-parallel mode, each row is assigned to one core. The reduction over $D$ dimensions always happens in the same order. Batch-invariant by construction.
The only tricky case: when there are fewer rows than cores, the kernel might split a single row across multiple cores (split-reduction). This breaks batch invariance. The Thinking Machines solution: just don’t do it. For the tiny batch sizes where split-reduction kicks in, the performance difference is negligible.
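The difference between a one-row-per-core reduction and a split-reduction can be simulated with plain loops. In this sketch the "cores" are just Python loops, and the row values are chosen so the rounding difference shows up in 64-bit floats; the function names are illustrative.

```python
import math
from typing import List

def sum_squares(values: List[float]) -> float:
    """Reduce left to right, as a single core would."""
    acc = 0.0
    for x in values:
        acc += x * x
    return acc

def split_sum_squares(row: List[float], num_splits: int) -> float:
    """Split-reduction: each 'core' reduces a slice, then partials are combined."""
    size = math.ceil(len(row) / num_splits)
    total = 0.0
    for start in range(0, len(row), size):
        total += sum_squares(row[start:start + size])
    return total

def rms_batch(batch: List[List[float]]) -> List[float]:
    """Data-parallel RMS: one row per core, so a row never depends on its neighbours."""
    return [math.sqrt(sum_squares(row) / len(row)) for row in batch]

row = [1e9, 6.0, 6.0, 6.0]
alone = rms_batch([row])[0]
in_batch = rms_batch([row, [1.0, 2.0, 3.0, 4.0], [5.0, 5.0, 5.0, 5.0]])[0]
print(alone == in_batch)                              # same row, same result
print(sum_squares(row), split_sum_squares(row, 2))    # split changes the sum
```

With one row per core the result is identical whether the row is alone or in a batch. Once the row is split across two reducers the accumulated sum changes, which is the batch-dependent behaviour the kernel has to avoid.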
Same principle: one batch element per core. But the standard cuBLAS GEMM doesn’t guarantee this — it uses split-K (parallelizing the inner reduction) for performance. A batch-invariant GEMM avoids split-K, sacrificing about 20% throughput compared to cuBLAS.
Importantly, even within a single element’s GEMM, the tile size must be consistent. Different PTX instructions (the assembly-level instructions for GPU tensor cores) use different internal accumulation orders. The batch-invariant implementation forces consistent tile sizes across all batch elements.
Attention is the most complex operation to make batch-invariant because it must handle:
The core problem: FlashDecoding splits the KV sequence across multiple SMs (streaming multiprocessors) for parallelism. When the 1000th query token is processed, the split depends on how many tokens are in the KV cache vs. the current input. In prefill (0 cached), all 1000 tokens are split one way. In decode (999 cached + 1 new), they’re split another way. Different splits → different accumulation orders → different outputs.
The solution: fixed split-size instead of fixed number of splits. Rather than saying “always use 8 splits” (which gives different chunk sizes for different sequence lengths), say “always split at 4096-token boundaries.” Now the accumulation order for tokens 1–4096 is always the same, regardless of how many total tokens there are or how many are cached.
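The two splitting policies can be compared by computing their boundaries directly. This sketch uses the 8-split and 4096-token figures from the paragraph above; the helper names are made up for illustration.

```python
from typing import List, Tuple

Split = Tuple[int, int]  # (start, end) token indices, end exclusive

def splits_fixed_count(kv_len: int, num_splits: int) -> List[Split]:
    """Always use `num_splits` pieces, so the piece size depends on kv_len."""
    size = -(-kv_len // num_splits)  # ceiling division
    return [(s, min(s + size, kv_len)) for s in range(0, kv_len, size)]

def splits_fixed_size(kv_len: int, split_size: int) -> List[Split]:
    """Always cut at multiples of `split_size`, whatever kv_len is."""
    return [(s, min(s + split_size, kv_len)) for s in range(0, kv_len, split_size)]

for kv_len in (10_000, 12_000):
    print(kv_len, splits_fixed_count(kv_len, 8)[:2], splits_fixed_size(kv_len, 4096)[:2])
```

With a fixed count, the first split covers tokens 0 to 1250 in one case and 0 to 1500 in the other, so the same early tokens are reduced in different groupings. With a fixed size, the first two splits are identical for both lengths and only the trailing partial split differs.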
You might think this is academic perfectionism. It’s not. There are two concrete reasons to care:
When trainer and sampler produce bitwise-identical outputs, you can replay any training step exactly. If a loss spike happens at step 47,392, you can rerun that exact step, inspect every intermediate value, and find the root cause. Without determinism, debugging is statistical — you can characterize problems but not pinpoint them.
On-policy reinforcement learning, including the RL stage of RLHF, assumes the model generating training data is the same model being trained. But if inference is nondeterministic, the generating model and the training model diverge. You’re training on slightly off-policy data.
The standard fix is importance weighting: multiply each training example by $\frac{\pi_\theta(a|s)}{\pi_{\text{old}}(a|s)}$ to correct for the policy mismatch. But importance weights add variance. With deterministic inference, the KL divergence between generator and learner is exactly zero. No correction needed. No variance.
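The correction is simple to write down. In this sketch the log-probabilities are invented numbers, and the KL value is a plain sample-average estimate over tokens drawn by the sampler; it is meant to show the mechanics, not any particular training recipe.

```python
import math
from typing import List

def importance_weight(logp_learner: float, logp_sampler: float) -> float:
    """Ratio pi_theta(a|s) / pi_old(a|s), computed from log-probabilities."""
    return math.exp(logp_learner - logp_sampler)

def kl_estimate(logps_sampler: List[float], logps_learner: List[float]) -> float:
    """Sample-average estimate of KL(sampler || learner) on the sampled tokens."""
    diffs = [s - l for s, l in zip(logps_sampler, logps_learner)]
    return sum(diffs) / len(diffs)

sampler = [-0.20, -1.50, -0.75]          # log-probs recorded at sampling time
learner_drifted = [-0.21, -1.46, -0.75]  # trainer's numerics differ slightly
learner_exact = list(sampler)            # bitwise-identical trainer and sampler

drifted_weights = [importance_weight(l, s) for l, s in zip(learner_drifted, sampler)]
exact_weights = [importance_weight(l, s) for l, s in zip(learner_exact, sampler)]
print(drifted_weights, kl_estimate(sampler, learner_drifted))
print(exact_weights, kl_estimate(sampler, learner_exact))
```

When the two sets of log-probabilities match exactly, every weight is 1.0 and the estimate is 0.0, so the correction term disappears along with the variance it would add.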
| Metric | Nondeterministic | + Importance Weights | Deterministic |
|---|---|---|---|
| Unique completions (1000 runs, temp=0) | 80 | 80 | 1 |
| KL divergence (train vs sample) | ~0.001 with spikes | ~0.001 with spikes | exactly 0 |
| Reward collapse risk | High | Medium | None |
| Inference overhead | 1x (baseline) | 1x + sampling overhead | ~1.6x |
The 1.6x inference overhead is the cost: batch-invariant matmul is ~20% slower than cuBLAS, plus the fixed-split attention overhead. But for training stability, many teams consider this a bargain. The open-source implementation is available at thinking-machines-lab/batch-invariant-ops.
You can’t be real-time and deeply thoughtful at the same time. So don’t try — use two models.
There’s a fundamental tension in AI interaction. Being responsive means generating output quickly — within 200ms, ideally. Being intelligent means spending time reasoning, searching, using tools, planning multi-step actions. You can’t do both with one forward pass.
Human teams solve this naturally: one person keeps the conversation going (“yeah, let me think about that...”) while their brain works on the hard problem in the background. Thinking Machines formalizes this pattern with two models:
Processes every 200ms chunk. Handles conversation flow, interruptions, backchannels, visual awareness. Must respond within budget. This is the micro-turn architecture from Chapter 02.
Handles complex questions, tool use, code execution, web search, multi-step reasoning. No latency constraint. Works while the interaction model maintains the conversation.
The interaction model is the micro-turn architecture from Chapter 02 — a 276B MoE (12B active) that processes 200ms chunks in a continuous stream. Its responsibilities:
The last two are the interesting ones. The interaction model must know what it doesn’t know — recognize that a coding question, a complex search, or a multi-step plan exceeds its 200ms budget and delegate appropriately.
The background model is a standard reasoning model — think Claude with extended thinking or o1. It receives delegated tasks and works on them asynchronously. It can:
Results stream back to the interaction model as they’re produced, not as a single dump when the task completes. This enables the interaction model to provide progressive updates: “I’m looking that up... found three relevant papers... the key finding is...”
The critical detail: when the interaction model delegates, it sends a rich context package — not a standalone query. The background model receives the full conversation history, current user intent, visual context, and any relevant state.
This matters because context loss is the failure mode. If the interaction model strips context when delegating (“search for X” instead of “the user is frustrated with their deployment pipeline and specifically asked about X in the context of Y”), the background model produces generic, unhelpful results.
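One way to picture the context package is as a small record that travels with the task. The `DelegationRequest` type and its fields are hypothetical, based only on the items the text lists (conversation history, user intent, visual context, state).

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class DelegationRequest:
    """Hypothetical payload sent from the interaction model to the background model."""
    task: str
    conversation: List[str] = field(default_factory=list)
    user_intent: str = ""
    visual_context: str = ""
    state: Dict[str, str] = field(default_factory=dict)

def render_prompt(request: DelegationRequest) -> str:
    """Flatten the package into text for the background model, skipping empty parts."""
    parts = [f"Task: {request.task}"]
    if request.user_intent:
        parts.append(f"Intent: {request.user_intent}")
    if request.visual_context:
        parts.append(f"On screen: {request.visual_context}")
    if request.conversation:
        parts.append("Conversation so far:\n" + "\n".join(request.conversation))
    return "\n".join(parts)

bare = DelegationRequest(task="search for rollback strategies")
rich = DelegationRequest(
    task="search for rollback strategies",
    conversation=["user: the deploy failed again", "assistant: which stage?"],
    user_intent="fix a failing deployment pipeline, frustrated after repeated failures",
    visual_context="CI dashboard showing a red deploy stage",
)
print(render_prompt(bare))
print(render_prompt(rich))
```

The bare request renders to a single line, which is exactly the stripped-context failure mode described above.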
Results are woven into the conversation at contextually appropriate moments. If the user is mid-sentence, the interaction model waits. If there’s a natural pause, it smoothly introduces the result. This is far more natural than the current pattern of “please wait while I think...” followed by a wall of text.
Standard benchmarks test intelligence. Thinking Machines needed to also test interactivity.
They start with established benchmarks to verify the model isn’t trading intelligence for interactivity:
| Benchmark | What It Tests | TML-Small | GPT-RT-2.0 | Gemini Flash Live |
|---|---|---|---|---|
| FD-bench v1 | Turn-taking latency (s) | 0.40 | 1.18 | 0.57 |
| FD-bench v1.5 | Interaction quality avg | 77.8 | 46.8 | 54.3 |
| Audio MultiChallenge | Intelligence + following | 43.4 | 37.6 | 26.8 |
| IFEval (VoiceBench) | Instruction following | 82.1 | 81.7 | 67.6 |
| IFEval (Text) | Text instruction following | 89.7 | 89.6 | 85.8 |
| HarmBench | Safety refusal rate | 99.0% | 99.5% | 99.0% |
The headline: TML-Interaction-Small is competitive on intelligence (89.7% IFEval, 82.1% VoiceBench) while dominating on interactivity (0.40s turn-taking vs 0.57–2.14s for competitors, 77.8 FD-bench vs 39–54 for competitors).
But these benchmarks don’t capture the qualitative jump. Time-awareness, simultaneous speech, and visual proactivity are entirely new capabilities. You can’t measure them on existing tests because no existing model can do them.
Thinking Machines created two new benchmarks to test capabilities that require native interactivity:
Can the model initiate speech at user-specified times with correct content?
Results: TML: 64.7% vs GPT-RT-2.0: 4.3%. This isn’t a small gap. GPT-RT essentially can’t do this task at all.
Can the model speak at contextually appropriate moments with semantically correct content, simultaneously with the user?
Results: TML: 81.7% vs GPT-RT-2.0: 2.9%. Again, the baseline model essentially can’t do this.
Three adapted benchmarks test whether the model can see, understand, and proactively react to visual input:
RepCount. Videos of repeated actions. Model must count out loud in real-time. Tests continuous visual tracking and timely vocalization. TML: 35.4%, GPT-RT: 1.3%.
ProactiveVideoQA. Video with questions whose answers appear at specific moments. Model must stay silent until the answer is visible, then speak. Tests temporal awareness + visual understanding. TML: 33.5%, GPT-RT: 25.0% (no-response baseline).
Charades. Model must say “start” when an action begins and “stop” when it ends. Scored by temporal IoU between predicted and reference intervals. TML: 32.4%, GPT-RT: 0%.
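Temporal IoU for a single pair of intervals is straightforward to compute. This is the standard intersection-over-union definition; how the benchmark aggregates it across clips is not stated in the lesson and is not assumed here.

```python
from typing import Tuple

Interval = Tuple[float, float]  # (start_seconds, end_seconds)

def temporal_iou(predicted: Interval, reference: Interval) -> float:
    """Overlap duration divided by the duration covered by either interval."""
    overlap = max(0.0, min(predicted[1], reference[1]) - max(predicted[0], reference[0]))
    union = (predicted[1] - predicted[0]) + (reference[1] - reference[0]) - overlap
    return overlap / union if union > 0 else 0.0

# The model said "start" at 2.0s and "stop" at 6.0s; the action ran from 4.0s to 8.0s.
print(temporal_iou((2.0, 6.0), (4.0, 8.0)))
```

A prediction that is two seconds early on both ends overlaps for 2 of the 6 seconds covered, so it scores one third even though the duration was right.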
This is the qualitative jump that existing benchmarks miss. It’s not that TML is slightly better at these tasks. On most of them the baseline scores near zero, and on the video question-answering task its 25.0% is marked as the no-response baseline. The micro-turn architecture enables an entirely new category of capability.
METR’s measurement of autonomous AI capability — and why interactivity changes the equation.
METR (Model Evaluation and Threat Research) introduced a powerful way to measure AI progress: instead of benchmark scores, measure the length of tasks that models can complete autonomously, where length is defined by how long a human expert would take.
Their key finding: the length of tasks that frontier models can complete at 50% reliability has been doubling approximately every 7 months for the last 6 years.
Current frontier models (like Claude 3.7 Sonnet) can reliably complete tasks that take humans a few minutes. They can occasionally succeed at tasks taking hours, but reliability drops sharply.
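The doubling trend is a plain exponential, which can be written out as a quick calculation. The 7-month doubling time comes from the text; the starting horizon in the example is an arbitrary illustrative value, not a METR figure.

```python
DOUBLING_MONTHS = 7  # doubling time quoted in the lesson

def growth_factor(months: float, doubling_months: float = DOUBLING_MONTHS) -> float:
    """How much the task-length horizon multiplies over `months`."""
    return 2 ** (months / doubling_months)

def projected_horizon(current_minutes: float, months_ahead: float) -> float:
    """Extrapolate a horizon forward, assuming the trend simply continues."""
    return current_minutes * growth_factor(months_ahead)

print(growth_factor(72))              # growth over the 6 years of the trend
print(projected_horizon(60.0, 14))    # a 60-minute horizon after two doublings
```

Six years at this rate is a bit more than ten doublings, so the horizon grows by roughly three orders of magnitude. The extrapolation only holds if the trend continues, which is an assumption rather than a result.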
Traditional AI evaluation is one-dimensional: how smart is the model? Interaction models add a second dimension: how well can you work with it?
Thinking Machines positions this as a frontier — models can be plotted on a 2D space of intelligence vs. interactivity. Most current models live in the high-intelligence-low-interactivity quadrant (text chat) or the low-intelligence-high-interactivity quadrant (basic voice assistants). TML-Interaction-Small aims for the top-right: high intelligence AND high interactivity.
Thinking Machines is honest about what doesn’t work yet:
Continuous audio and video accumulate context quickly. At 200ms chunks with ~30 audio tokens + 1,600 video patches per chunk, context grows by ~8,000 tokens per second. Even with a 128K context window, that’s only ~16 seconds of continuous multimodal input before context fills. The streaming session design handles short/medium interactions well, but very long sessions require careful context management — a KV cache eviction strategy, context compression, or switching to a summary representation.
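The context-budget arithmetic can be reproduced from the numbers in the paragraph. The only assumption in this sketch is reading "128K" as 128,000 tokens; reading it as 131,072 changes the result by well under a second.

```python
AUDIO_TOKENS_PER_CHUNK = 30      # figure quoted in the lesson
VIDEO_TOKENS_PER_CHUNK = 1_600   # 40 x 40 patches per frame
CHUNK_MS = 200
CONTEXT_WINDOW = 128_000         # assumption: "128K" read as 128,000 tokens

def tokens_per_second(tokens_per_chunk: int, chunk_ms: int = CHUNK_MS) -> float:
    """Context growth rate for a given per-chunk token count."""
    return tokens_per_chunk * 1000 / chunk_ms

def seconds_until_full(tokens_per_chunk: int, window: int = CONTEXT_WINDOW) -> float:
    """How long continuous input can run before the window is full."""
    return window / tokens_per_second(tokens_per_chunk)

multimodal = AUDIO_TOKENS_PER_CHUNK + VIDEO_TOKENS_PER_CHUNK
print(tokens_per_second(multimodal), seconds_until_full(multimodal))
print(seconds_until_full(AUDIO_TOKENS_PER_CHUNK) / 60)   # audio only, in minutes
```

Video dominates the budget: audio alone would fill the same window in about a quarter of an hour, while audio plus a frame every chunk fills it in about 16 seconds.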
Low-latency streaming requires reliable network. With 200ms chunks and ~25ms compute, you have ~175ms for round-trip network latency. On a poor connection, chunks arrive late, the model’s temporal model breaks down, and the experience degrades significantly. Edge deployment or aggressive chunking strategies could help.
TML-Interaction-Small is “small” at 276B total (12B active). Larger models could be more capable, but current hardware can’t serve them within the 200ms budget. As inference hardware improves, larger interaction models become feasible.
The background agent is functional but basic. The coordination protocol (rich context delegation + streaming results) works, but there’s enormous room to improve how the interaction and background models collaborate — especially for complex, multi-step tasks.
If the interaction model thesis is correct — that interactivity should be part of the model, not bolted on — then several things follow: