Xiaomi Embodied Intelligence Team — 2026

OneVL: One-Step Latent Reasoning

The first latent Chain-of-Thought method to surpass explicit CoT — dual-modal auxiliary decoders compress reasoning into causal dynamics, then prefill inference matches answer-only speed.

Prerequisites: VLMs + Chain-of-Thought reasoning + autonomous driving basics
10 Chapters · 5+ Simulations

Chapter 0: The Problem

You are building an autonomous driving system. Your car has cameras, and you want a Vision-Language Model (VLM) to look at the road, reason about what is happening, and output a trajectory — a sequence of (x, y) waypoints that tell the car where to drive over the next few seconds.

There is a well-known trick to make VLMs smarter: Chain-of-Thought (CoT) reasoning. Instead of asking the model to directly predict the trajectory, you ask it to first explain what it sees. "There is a pedestrian crossing from the left. The light is green. I should slow down and wait." Then it outputs the trajectory. This extra verbalization forces the model to organize its understanding before acting, and it consistently improves accuracy.

But here is the problem. In autonomous driving, you do not have seconds to spare. A typical CoT response generates 200-400 tokens of reasoning before the trajectory. At ~50 tokens per second, that is 4-8 seconds of autoregressive generation. A car moving at 60 km/h covers 67-133 meters in that time. That reasoning, helpful as it is, could get someone killed.
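The arithmetic behind these figures is easy to check. The sketch below assumes a constant decoding rate and a constant vehicle speed; the function name `cot_delay` is illustrative, not something from the paper.

```python
def cot_delay(num_tokens: int, tokens_per_second: float, speed_kmh: float) -> tuple[float, float]:
    """Return (seconds spent generating, meters traveled in that time)."""
    seconds = num_tokens / tokens_per_second
    meters_per_second = speed_kmh * 1000 / 3600
    return seconds, seconds * meters_per_second


# The two ends of the 200-400 token range quoted in the text.
for tokens in (200, 400):
    seconds, meters = cot_delay(tokens, tokens_per_second=50, speed_kmh=60)
    print(f"{tokens} tokens: {seconds:.0f} s of generation, {meters:.0f} m traveled")
```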

The dilemma: Chain-of-Thought reasoning makes Vision-Language-Action models (VLAs) significantly more accurate. But autoregressive generation of CoT tokens is far too slow for real-time deployment. You need the accuracy of reasoning without the latency of verbalization.

The obvious solution: skip the CoT and train the model to output trajectories directly. This is fast — only a handful of tokens — but accuracy drops. The model never gets to "think before it acts." You lose the entire benefit of chain-of-thought.

What if you could keep the thinking but skip the talking? What if the model could reason in some compressed, internal representation — a few latent tokens instead of hundreds of text tokens — and still get the accuracy boost? That is the question OneVL answers.

The CoT Speed Problem

[Interactive simulation: toggle between explicit CoT (slow, accurate) and answer-only (fast, less accurate). A bar chart of the latency and accuracy trade-off shows why neither approach works for real-time driving.]

Why is explicit Chain-of-Thought reasoning impractical for autonomous driving, despite improving accuracy?

Chapter 1: The Key Insight

Previous work on "latent CoT" (methods like COCONUT and CODI) tried to compress text-based reasoning into a smaller set of latent tokens. The idea was simple: instead of generating 200 tokens of text, generate 20 latent tokens that encode the same information. But every single one of these methods underperforms explicit CoT. Compressing language reasoning into fewer tokens consistently loses important information.

OneVL's insight is that language is already an abstraction. When a driver thinks "the pedestrian is crossing from the left," those words are a compressed description of a rich visual scene — the pedestrian's position, velocity, trajectory, the geometry of the crosswalk, the motion of other agents. Language reasoning compresses the world into symbols. Compressing those symbols further loses the causal structure they represent.

The key insight: Instead of compressing language reasoning alone, compress BOTH language reasoning AND visual future prediction into latent tokens. This forces the latent representation to encode causal dynamics — the actual physics of the scene — not just symbolic abstractions. Tighter compression that captures causal structure produces more generalizable representations than verbose token-by-token reasoning.

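Concretely, OneVL introduces two types of latent tokens in the response sequence: visual latent tokens (<vis_latent>), supervised through future-frame prediction, and language latent tokens (<lang_latent>), supervised through CoT text reconstruction.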

Each type of latent token is supervised by its own auxiliary decoder. The language decoder reconstructs CoT text from the language latent tokens. The visual decoder predicts future camera frames from the visual latent tokens. At inference, both decoders are discarded — only the latent tokens remain, prefilled in a single parallel pass.

Explicit CoT: the VLM generates 200+ text tokens of reasoning, then the trajectory. Accurate but slow.
↓ compress
Prior Latent CoT: replace the text with ~20 latent tokens supervised by a language decoder only. Faster, but loses accuracy vs explicit CoT.
↓ add visual supervision
OneVL: 35 visual + 20 language latent tokens, supervised by BOTH language and visual decoders. Surpasses explicit CoT at answer-only speed.

This is the first latent CoT method to surpass explicit CoT on every benchmark tested. The dual-modal supervision is the difference.

Why does compressing language reasoning alone (prior latent CoT) fail to match explicit CoT?

Chapter 2: Three Paradigms

To understand where OneVL sits, you need to see the three paradigms for reasoning in VLAs side by side. Each represents a different trade-off between accuracy and speed.

Paradigm 1: Explicit CoT. The model generates a full text explanation before the answer. Every reasoning step is verbalized: "I see a red light ahead. There is a car stopped in the right lane. I should decelerate and stay in the left lane." This is ~200-400 tokens generated autoregressively, one by one. Maximum accuracy, but 4-8 seconds of latency on a single GPU.

Paradigm 2: Implicit latent CoT (COCONUT, CODI, SIM-CoT). Replace the text reasoning with a fixed number of latent tokens. A language decoder supervises these latent tokens to encode the same reasoning. At inference, no decoder is needed — just the latent tokens. Faster, but every method in this category underperforms explicit CoT. Compressing language alone is a lossy operation.

Paradigm 3: OneVL. Replace text reasoning with two types of latent tokens — visual and language — supervised by two auxiliary decoders. The visual decoder forces the model to predict what the world will look like in the future, grounding the latent representation in physical causality. At inference, both decoders are discarded. Latency matches answer-only prediction. Accuracy surpasses explicit CoT.

Why does dual supervision help? Consider what each decoder forces the latent tokens to encode. The language decoder demands symbolic reasoning ("the car is stopping"). The visual decoder demands spatiotemporal prediction ("these pixels will move here in 0.5 seconds"). Together, they force the latent tokens to capture both the what and the where/when of the scene — a richer representation than either alone.
Three Paradigms Compared

[Interactive animation: the token flow of each paradigm, showing how reasoning tokens are generated, which decoders are involved, and the resulting speed/accuracy trade-off.]

| Paradigm | Reasoning Tokens | Supervision | Latency | vs Explicit CoT |
| --- | --- | --- | --- | --- |
| Explicit CoT | 200-400 text | Cross-entropy on text | ~8 s | Baseline |
| COCONUT / CODI | ~20 latent | Language decoder only | ~4.5 s | Worse |
| OneVL | 35 visual + 20 language | Language + visual decoders | ~4.5 s | Better (first ever) |

What distinguishes OneVL from prior latent CoT methods like COCONUT?

Chapter 3: Architecture

OneVL is built on Qwen3-VL-4B-Instruct, a vision-language model with three components: a Vision Transformer (ViT) that encodes camera images into patch embeddings, an MLP aligner that projects those embeddings into the LLM's token space, and a causal language model that processes the combined text + visual tokens and generates output. All three components are trainable during OneVL's training.

The input sequence looks like this:

[System Prompt, Image Tokens, User Query, <vis_latent> × Cv, <lang_latent> × Cl, Trajectory Tokens]

Where Cv = 4 special visual latent tokens (each expanded to ~9 actual VLM tokens, totaling 35) and Cl = 2 special language latent tokens (each expanded to ~10 actual VLM tokens, totaling 20). These latent tokens sit in the response sequence, between the user query and the trajectory answer.
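To make the layout concrete, here is a minimal sketch that assembles a toy sequence in the order given above and records where the latent positions fall. The `Layout` class, the `build_sequence` helper and the toy token strings are illustrative assumptions. Only the segment order and the expanded counts (35 and 20) come from the text.

```python
from dataclasses import dataclass

# Expanded latent-token counts quoted in the text.
NUM_VIS_LATENT = 35
NUM_LANG_LATENT = 20


@dataclass
class Layout:
    tokens: list[str]
    vis_positions: range   # positions whose hidden states form Hv
    lang_positions: range  # positions whose hidden states form Hl
    traj_start: int        # index of the first trajectory token


def build_sequence(system, image, query, trajectory) -> Layout:
    """Concatenate segments in the documented order and record the latent spans."""
    prefix = list(system) + list(image) + list(query)
    vis_start = len(prefix)
    lang_start = vis_start + NUM_VIS_LATENT
    traj_start = lang_start + NUM_LANG_LATENT
    tokens = (
        prefix
        + ["<vis_latent>"] * NUM_VIS_LATENT
        + ["<lang_latent>"] * NUM_LANG_LATENT
        + list(trajectory)
    )
    return Layout(tokens, range(vis_start, lang_start), range(lang_start, traj_start), traj_start)


layout = build_sequence(
    system=["<sys>"],
    image=[f"<img_{i}>" for i in range(6)],   # toy stand-in for hundreds of image tokens
    query=["plan", "the", "trajectory"],
    trajectory=["x0", "y0", "x1", "y1"],
)
print(len(layout.vis_positions) + len(layout.lang_positions))  # 55 latent positions
print(layout.tokens[layout.traj_start])                        # first trajectory token
```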

During training, the VLM processes this entire sequence. At the positions where the latent tokens sit, the model's hidden states carry compressed reasoning information. These hidden states are extracted and fed to two auxiliary decoders:

ViT Encoder: camera images → patch embeddings $V \in \mathbb{R}^{N \times d}$
↓
MLP Aligner: V → visual tokens in LLM space
↓
Causal LLM: [System, Images, Query, Zv, Zl, Trajectory] → hidden states at every position
↓ extract Hv, Hl
Dual Decoders (training only): Hv → Visual Decoder → future frames | Hl → Language Decoder → CoT text

Key design choice: The latent tokens are placed BEFORE the trajectory tokens in the sequence. This means the trajectory generation can attend to the latent tokens via causal attention. The latent tokens serve as compressed "context" that shapes the trajectory output — they ARE the reasoning, just in compressed form.

The hidden states at the visual latent positions are $H_v \in \mathbb{R}^{C_v \times d}$, and those at the language latent positions are $H_l \in \mathbb{R}^{C_l \times d}$, where d is the model's hidden dimension (2560 for Qwen3-VL-4B). These hidden states are the interface between the main model and the auxiliary decoders.
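The two ideas in this chapter, reading hidden states at the latent positions and letting trajectory tokens attend back to them, can be illustrated with a toy example. This is a sketch under assumed toy sizes, with plain lists standing in for tensors. The helper names are hypothetical.

```python
def causal_mask(seq_len: int) -> list[list[bool]]:
    """mask[q][k] is True when query position q may attend to key position k."""
    return [[k <= q for k in range(seq_len)] for q in range(seq_len)]


def gather_rows(hidden: list[list[float]], positions: range) -> list[list[float]]:
    """Pick the hidden-state rows at the given sequence positions."""
    return [hidden[i] for i in positions]


# Toy sizes (assumptions): 3 prompt tokens, 4 visual latents, 2 language latents, 2 trajectory tokens.
prompt_len, n_vis, n_lang, n_traj, d = 3, 4, 2, 2, 8
vis_pos = range(prompt_len, prompt_len + n_vis)
lang_pos = range(vis_pos.stop, vis_pos.stop + n_lang)
traj_pos = range(lang_pos.stop, lang_pos.stop + n_traj)
seq_len = traj_pos.stop

# Stand-in for the LLM's hidden states, one row per sequence position.
hidden = [[float(pos)] * d for pos in range(seq_len)]

H_v = gather_rows(hidden, vis_pos)    # shape (n_vis, d)
H_l = gather_rows(hidden, lang_pos)   # shape (n_lang, d)

mask = causal_mask(seq_len)
latent_pos = list(vis_pos) + list(lang_pos)
# Every trajectory position can see every latent position, never the reverse.
traj_sees_latents = all(mask[q][k] for q in traj_pos for k in latent_pos)
latents_see_traj = any(mask[q][k] for q in latent_pos for k in traj_pos)
print(len(H_v), len(H_l), traj_sees_latents, latents_see_traj)
```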

Why are latent tokens placed BEFORE the trajectory tokens in the sequence?

Chapter 4: Dual Decoders

The two auxiliary decoders are the heart of OneVL. They exist solely during training to force the latent tokens to encode meaningful information. At inference, they are thrown away. Understanding their architecture and data flow is essential.

Language Auxiliary Decoder (Dl)

This decoder takes two inputs: the hidden states Hl at the language latent positions and the ViT image embeddings V. It must reconstruct the full Chain-of-Thought text that the model would have generated if asked to reason explicitly.

$$Z_l = [\,W_l(V),\; W_l(H_l)\,]$$

Where Wl is a learned MLP projection that maps both inputs to a shared space. The ViT embeddings V provide the raw visual context, and Hl provides the compressed reasoning. These are concatenated and fed through a small autoregressive decoder that predicts CoT tokens via cross-entropy loss.

The decoder's input tensor shape: V is projected to $\mathbb{R}^{N \times d_{dec}}$, Hl is projected to $\mathbb{R}^{C_l \times d_{dec}}$, and the two are concatenated along the sequence dimension to form $Z_l \in \mathbb{R}^{(N + C_l) \times d_{dec}}$.
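The shape bookkeeping can be checked with a small sketch. It assumes toy dimensions and uses a single linear layer as a stand-in for the learned MLP projection. The `linear` and `decoder_input` helpers are hypothetical names, not the paper's code.

```python
import random


def linear(rows: list[list[float]], weight: list[list[float]]) -> list[list[float]]:
    """Apply y = x @ weight to each row; weight has shape (d_in, d_out)."""
    return [
        [sum(x * w for x, w in zip(row, col)) for col in zip(*weight)]
        for row in rows
    ]


def decoder_input(V, H_l, weight):
    """Project both inputs with the same weights, then concatenate along the sequence axis."""
    return linear(V, weight) + linear(H_l, weight)


rng = random.Random(0)
N, C_l, d, d_dec = 6, 2, 8, 4   # toy sizes, not the model's real dimensions
V = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(N)]
H_l = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(C_l)]
W_l = [[rng.gauss(0, 1) for _ in range(d_dec)] for _ in range(d)]  # linear stand-in for the MLP

Z_l = decoder_input(V, H_l, W_l)
print(len(Z_l), len(Z_l[0]))   # (N + C_l, d_dec)
```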

Visual Auxiliary Decoder (Dv)

This decoder takes the hidden states Hv at the visual latent positions and the ViT embeddings V. It must predict what the scene will look like at two future timesteps: +0.5 seconds and +1.0 seconds.

$$Z_v = [\,W_v(V),\; W_v(H_v)\,]$$

The future frames are not predicted as raw pixels. Instead, each frame is first tokenized by an IBQ (Index Backpropagation Quantization) visual tokenizer with a codebook of 131,072 entries. The visual decoder predicts these discrete tokens via cross-entropy loss, the same way a language model predicts word tokens, but over a vocabulary of visual patches.

Why discrete visual tokens? Predicting raw pixels would require regression loss (MSE), which is noisy and hard to optimize. Discrete visual tokens let you use cross-entropy loss — the same stable, well-understood loss used for language. This makes the visual decoder's training consistent with the language decoder's training, and both consistent with the main model's trajectory loss.
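To see what cross-entropy over a visual codebook looks like numerically, the sketch below computes the loss for a single token position. It uses plain Python rather than a deep-learning framework, and the target index is arbitrary. A predictor that assigns equal probability to all 131,072 codes scores ln(131072) ≈ 11.78 nats, which is a useful chance-level reference for the visual loss.

```python
import math


def cross_entropy(logits: list[float], target: int) -> float:
    """Negative log-probability of the target index under softmax(logits)."""
    peak = max(logits)  # subtract the max for numerical stability
    log_sum_exp = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_sum_exp - logits[target]


CODEBOOK_SIZE = 131_072  # codebook size quoted in the text

# A decoder that knows nothing assigns equal probability to every code.
uniform_logits = [0.0] * CODEBOOK_SIZE
chance_loss = cross_entropy(uniform_logits, target=12_345)

# Raising the logit of the correct code lowers the loss.
confident_logits = list(uniform_logits)
confident_logits[12_345] = 20.0
confident_loss = cross_entropy(confident_logits, target=12_345)

print(f"chance-level loss: {chance_loss:.3f} nats")
print(f"confident loss:    {confident_loss:.6f} nats")
```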

Combined Loss

$$L = L_c + \lambda_l \cdot L_l + \lambda_v \cdot L_v$$

Where Lc is the main trajectory cross-entropy loss, Ll is the language decoder's CoT reconstruction loss, and Lv is the visual decoder's future-frame prediction loss. The weights are λl = 1.0 and λv = 0.1. The visual loss is weighted lower because predicting future frames is a harder task — it requires understanding 3D scene dynamics — and early training produces noisier gradients.
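As a quick illustration of the weighting, the combined loss can be written as a one-line function. The per-batch loss values below are made up purely to show the effect of the weights.

```python
def total_loss(l_c: float, l_l: float, l_v: float,
               lambda_l: float = 1.0, lambda_v: float = 0.1) -> float:
    """L = L_c + lambda_l * L_l + lambda_v * L_v, with the weights quoted in the text."""
    return l_c + lambda_l * l_l + lambda_v * l_v


# Hypothetical loss values: the visual term is large in raw terms but is scaled down by 0.1.
loss = total_loss(l_c=0.8, l_l=2.0, l_v=9.0)
print(f"{loss:.2f}")
```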

Dual Decoder Data Flow (SHOWCASE)

[Interactive diagram: in Training mode the decoders are active and gradients flow; in Inference mode the decoders are discarded and the prefill is shown. Tensor shapes are labeled at each stage.]

Why is the visual loss weight (λv = 0.1) lower than the language loss weight (λl = 1.0)?

Chapter 5: Training Pipeline

You cannot train everything at once from scratch. Think about why: the visual decoder needs meaningful hidden states at the latent positions to learn useful gradients. But the main model starts with random latent representations — its hidden states at those positions carry no information. The visual decoder would receive garbage inputs, compute garbage gradients, and send garbage signals back to the main model. Chaos.

OneVL solves this with a carefully staged training pipeline — four phases that progressively warm up each component before coupling them together.

Preliminary: Visual Decoder Pretraining

Before any latent tokens are involved, the visual decoder learns to predict future frames from ViT embeddings alone. Input: V (image patch embeddings from the ViT). Output: discrete visual tokens for frames at +0.5s and +1.0s. This teaches the decoder the basic structure of future prediction — what roads, cars, and pedestrians look like over time — without requiring any latent conditioning. The main model is not involved.

Stage 0: Main Model Warmup

The latent tokens are inserted into the sequence and the main model trains on the trajectory loss Lc only. No decoder gradients. The model learns to produce useful hidden states at the latent positions as a byproduct of minimizing the trajectory loss. By the end of Stage 0, the hidden states Hv and Hl carry some structured information about the scene, even though no decoder has seen them yet.

Stage 1: Auxiliary Decoder Warmup

The main model is frozen. Both decoders train to align with the now-stable latent representations. The language decoder learns to map Hl to CoT text. The visual decoder — already pretrained on ViT embeddings — now also conditions on Hv. Since the main model's weights are frozen, the decoders get consistent input and can converge without chasing a moving target.

Stage 2: Joint End-to-End Fine-Tuning

Everything trains jointly. The main model (ViT, aligner, and LLM) and both decoders update simultaneously. Now decoder gradients flow back into the main model through Hv and Hl, creating a virtuous cycle: better latent representations help the decoders predict better, and better decoder gradients help the main model produce richer latent representations.

Why staged training works: Each stage establishes a stable foundation for the next. The preliminary phase gives the visual decoder basic competence. Stage 0 gives the main model informative latent states. Stage 1 aligns decoders to those states without moving the target. Stage 2 couples everything once all components can give useful gradients. This prevents the chicken-and-egg problem of joint training from scratch.
Training Pipeline Stages

[Interactive stepper: each of the four training stages shows which components are trainable vs frozen and which losses are active.]

| Phase | Main Model | Lang Decoder | Vis Decoder | Active Loss |
| --- | --- | --- | --- | --- |
| Preliminary | N/A | N/A | Train | Lv (ViT only) |
| Stage 0 | Train | N/A | Frozen | Lc |
| Stage 1 | Frozen | Train | Train | Ll + Lv |
| Stage 2 | Train | Train | Train | Lc + Ll + Lv |

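One way to encode this schedule in a training script is as plain data that a loop can consult when freezing parameters and assembling the loss. The sketch below is a hypothetical encoding of the table, not the authors' configuration. The component and loss names are illustrative labels.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str
    trainable: frozenset[str]   # components that receive gradient updates
    losses: tuple[str, ...]     # loss terms that are active


SCHEDULE = (
    Stage("preliminary", frozenset({"vis_decoder"}), ("L_v",)),
    Stage("stage_0", frozenset({"main_model"}), ("L_c",)),
    Stage("stage_1", frozenset({"lang_decoder", "vis_decoder"}), ("L_l", "L_v")),
    Stage("stage_2", frozenset({"main_model", "lang_decoder", "vis_decoder"}), ("L_c", "L_l", "L_v")),
)

COMPONENTS = ("main_model", "lang_decoder", "vis_decoder")


def requires_grad(stage: Stage) -> dict[str, bool]:
    """Per-component flag a training loop could use to freeze or unfreeze parameters."""
    return {name: name in stage.trainable for name in COMPONENTS}


for stage in SCHEDULE:
    print(stage.name, requires_grad(stage), " + ".join(stage.losses))
```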
Why is the main model frozen during Stage 1 (auxiliary decoder warmup)?

Chapter 6: Prefill Inference

Here is where the speed trick becomes concrete. After training, you throw away both decoders entirely. They are never run at inference time. So what do the latent tokens do?

The input sequence at inference looks like this:

[System, Image, User Query, <vis_latent> × 35, <lang_latent> × 20, <trajectory start>]

At inference all of these tokens, including the 55 latent tokens, are supplied as part of the prompt rather than generated as part of the response. They are processed in the prefill phase, where the transformer handles the entire input in a single parallel forward pass. There is no autoregressive generation for these tokens. The KV cache is computed once, in parallel, for the entire prompt.

Why prefill is fast: Autoregressive generation produces tokens one at a time, each requiring a separate forward pass. Prefill processes ALL prompt tokens in one batched forward pass. Adding 55 tokens to the prompt adds negligible time compared to the hundreds of image tokens already there. The latent tokens ride almost for free on the existing prefill computation.
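A rough way to see the difference is to count sequential forward passes, since those cannot be parallelized. The sketch below assumes one prefill pass per request and one pass per generated token. The token counts are midpoints of the ranges quoted in this lesson, not measurements.

```python
def sequential_passes(generated_tokens: int) -> int:
    """One parallel prefill pass for the whole prompt, then one pass per generated token."""
    prefill_passes = 1
    return prefill_passes + generated_tokens


COT_TOKENS = 300        # midpoint of the 200-400 CoT token range
TRAJECTORY_TOKENS = 30  # midpoint of the 20-40 trajectory token range

explicit_cot = sequential_passes(COT_TOKENS + TRAJECTORY_TOKENS)
answer_only = sequential_passes(TRAJECTORY_TOKENS)
# The 55 latent tokens are prompt tokens: they widen the prefill pass but add no decode steps.
onevl = sequential_passes(TRAJECTORY_TOKENS)

print(explicit_cot, answer_only, onevl)
```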

Only the trajectory tokens are generated autoregressively. With a typical trajectory of 10-20 waypoints (20-40 tokens), this takes well under a second. Total inference time is therefore roughly prefill plus trajectory decode: ~3.5 s + ~0.5 s ≈ 4 s.

Compare this to explicit CoT: ~3.5 seconds of prefill plus ~4 seconds of autoregressive CoT generation plus ~0.5 seconds of trajectory decode, about 8 seconds in total. OneVL cuts this roughly in half.

For real deployment, the autoregressive trajectory head is replaced with an MLP trajectory head that predicts all waypoints in a single forward pass. This reduces total inference to 0.24 seconds (4.16 Hz), about 5.4% of the 4.46 seconds OneVL needs on NAVSIM with the autoregressive trajectory head, and under 3% of explicit CoT's 8.73 seconds. That is real-time driving.
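The text does not describe the MLP head's architecture, so the following is only a generic sketch of the idea: a small two-layer MLP maps one hidden-state vector to all waypoint coordinates at once, with no token-by-token loop. The layer sizes, the choice of input vector, and every name here are assumptions.

```python
import random


def mlp_head(hidden: list[float], w1, b1, w2, b2, n_waypoints: int) -> list[tuple[float, float]]:
    """Two-layer MLP: hidden state -> 2 * n_waypoints numbers, regrouped as (x, y) pairs."""
    h = [max(0.0, sum(x * w for x, w in zip(hidden, col)) + b) for col, b in zip(zip(*w1), b1)]
    out = [sum(x * w for x, w in zip(h, col)) + b for col, b in zip(zip(*w2), b2)]
    return [(out[2 * i], out[2 * i + 1]) for i in range(n_waypoints)]


rng = random.Random(0)
d, d_hidden, n_waypoints = 16, 32, 8   # toy sizes (assumptions)


def rand_matrix(rows: int, cols: int) -> list[list[float]]:
    return [[rng.gauss(0, 0.1) for _ in range(cols)] for _ in range(rows)]


w1, b1 = rand_matrix(d, d_hidden), [0.0] * d_hidden
w2, b2 = rand_matrix(d_hidden, 2 * n_waypoints), [0.0] * (2 * n_waypoints)

last_hidden = [rng.gauss(0, 1) for _ in range(d)]   # stand-in for an LLM hidden state
waypoints = mlp_head(last_hidden, w1, b1, w2, b2, n_waypoints)
print(len(waypoints))   # all waypoints come from a single call
```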

Latency Comparison

[Interactive chart: stacked bars of prefill time vs autoregressive decode time for four inference modes. OneVL's latent tokens add almost nothing to prefill, while explicit CoT's text tokens add seconds of decode.]

| Mode | Prefill | CoT Decode | Traj Decode | Total |
| --- | --- | --- | --- | --- |
| Explicit CoT | ~3.5 s | ~4.0 s | ~0.5 s | ~8.0 s |
| Answer-Only | ~3.5 s | 0 s | ~0.5 s | ~4.0 s |
| OneVL (AR traj) | ~3.5 s | 0 s | ~0.5 s | ~4.0 s |
| OneVL (MLP traj) | ~0.2 s | 0 s | ~0.04 s | 0.24 s |

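The totals in the table follow from adding the three phases. A short sketch, using the approximate values from the table as inputs, reproduces them and the resulting speedup over explicit CoT. The helper names are illustrative.

```python
# (prefill, CoT decode, trajectory decode) in seconds, copied from the table above.
BREAKDOWN = {
    "Explicit CoT": (3.5, 4.0, 0.5),
    "Answer-Only": (3.5, 0.0, 0.5),
    "OneVL (AR traj)": (3.5, 0.0, 0.5),
    "OneVL (MLP traj)": (0.2, 0.0, 0.04),
}


def total_latency(parts: tuple[float, float, float]) -> float:
    """Sum of prefill, CoT decode and trajectory decode."""
    return sum(parts)


def speedup(baseline_s: float, candidate_s: float) -> float:
    """How many times faster the candidate is than the baseline."""
    return baseline_s / candidate_s


baseline = total_latency(BREAKDOWN["Explicit CoT"])
for mode, parts in BREAKDOWN.items():
    total = total_latency(parts)
    print(f"{mode:18s} {total:5.2f} s  {speedup(baseline, total):5.1f}x vs explicit CoT")
```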
Why does adding 55 latent tokens to the prompt add almost zero inference latency?

Chapter 7: Results

OneVL is evaluated on four autonomous driving benchmarks, each testing different aspects of trajectory prediction. The results are striking: OneVL is the first latent CoT method to surpass explicit CoT on all four.

NAVSIM

A planning benchmark built on non-reactive simulation, scored by the PDM score (PDMS, Predictive Driver Model Score; higher is better). OneVL achieves 88.84, beating explicit CoT at 88.29 and the previous state-of-the-art LaST-VLA at 87.30. Inference latency: 4.46 seconds vs explicit CoT's 8.73 seconds.

ROADWork

Driving through construction zones with altered road geometry. Average Displacement Error (ADE, lower is better). OneVL: 12.49 vs explicit CoT at 13.18. The previous best non-VLA method (YNet) scored 22.68 — nearly double the error.
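For reference, ADE is the mean Euclidean distance between predicted and ground-truth waypoints at matching timesteps. A minimal sketch of the standard definition follows. The units depend on the benchmark and are not specified here, and the example trajectories are invented.

```python
import math

Point = tuple[float, float]


def average_displacement_error(pred: list[Point], truth: list[Point]) -> float:
    """Mean Euclidean distance between matching predicted and ground-truth waypoints."""
    if len(pred) != len(truth) or not pred:
        raise ValueError("trajectories must be non-empty and of equal length")
    return sum(math.dist(p, t) for p, t in zip(pred, truth)) / len(pred)


truth = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
pred = [(0.0, 0.0), (1.0, 1.0), (5.0, 4.0)]   # per-waypoint errors: 0, 1, 5
print(average_displacement_error(pred, truth))  # 2.0
```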

Impromptu

Natural-language instructions for novel driving scenarios. OneVL: 1.34 ADE vs explicit CoT at 1.42 and the previous best Impromptu VLA at 1.60.

APR1

Long-horizon trajectory prediction. OneVL: 2.77 ADE vs explicit CoT at 2.99 and Cosmos-Reason at 2.86.

The headline result: Prior latent CoT methods (COCONUT, CODI, SIM-CoT) ALL underperform explicit CoT. OneVL is the FIRST to surpass it — and it does so on every benchmark, not just one. This confirms that dual-modal supervision (language + visual) is the critical ingredient.
Results Across 4 Benchmarks

[Interactive bar chart: OneVL vs explicit CoT vs prior SOTA on each benchmark. Green indicates OneVL beats all baselines.]


Chapter 8: Ablations

The paper includes careful ablations that isolate each component's contribution. These answer the "what if we removed X?" questions.

Without Visual Decoder

Remove Dv, keep only Dl. This is equivalent to prior latent CoT methods. Performance drops below explicit CoT on most benchmarks. On NAVSIM: 88.12 (vs 88.84 with both decoders). The visual decoder's spatiotemporal grounding is essential.

Without Language Decoder

Remove Dl, keep only Dv. Performance also drops: NAVSIM 88.47. The language decoder provides complementary symbolic supervision that the visual decoder alone cannot.

Without Staged Training

Train everything jointly from scratch (skip the staged warmup). Performance drops sharply: NAVSIM falls to 87.01, the largest loss of any ablation. This is consistent with the chicken-and-egg problem: random latent states produce garbage decoder gradients that destabilize training.

Number of Latent Tokens

Reducing visual latent tokens from 4 to 2 (17 actual tokens) drops NAVSIM from 88.84 to 88.35. Increasing to 8 (70 actual tokens) gives 88.89 — marginal improvement at double the token count. The 4-token configuration is the sweet spot.

The ablation story: Both decoders are necessary. The visual decoder provides the larger improvement (removing it hurts more than removing the language decoder), but the combination is greater than either alone. And staged training is not optional — it is essential for stable convergence.

| Configuration | NAVSIM PDM ↑ | Δ vs Full |
| --- | --- | --- |
| OneVL (full) | 88.84 | (reference) |
| No visual decoder | 88.12 | -0.72 |
| No language decoder | 88.47 | -0.37 |
| No staged training | 87.01 | -1.83 |
| 2 visual latent tokens | 88.35 | -0.49 |
| 8 visual latent tokens | 88.89 | +0.05 |
| Explicit CoT baseline | 88.29 | -0.55 |

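The Δ column is simply each score minus the full model's score. The sketch below recomputes it from the reported NAVSIM numbers and picks out the configuration with the largest drop. The helper name is illustrative.

```python
NAVSIM_PDM = {
    "OneVL (full)": 88.84,
    "No visual decoder": 88.12,
    "No language decoder": 88.47,
    "No staged training": 87.01,
    "2 visual latent tokens": 88.35,
    "8 visual latent tokens": 88.89,
    "Explicit CoT baseline": 88.29,
}


def deltas_vs_full(scores: dict[str, float], reference: str = "OneVL (full)") -> dict[str, float]:
    """Score difference of every other configuration relative to the reference row."""
    base = scores[reference]
    return {name: round(score - base, 2) for name, score in scores.items() if name != reference}


deltas = deltas_vs_full(NAVSIM_PDM)
worst = min(deltas, key=deltas.get)
print(worst, deltas[worst])
```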
Which ablation causes the largest performance drop on NAVSIM?

Chapter 9: Connections

OneVL in one sentence: Compress CoT reasoning into latent tokens supervised by dual auxiliary decoders (language + visual world model), discard decoders at inference, and prefill latent tokens in one parallel pass to match answer-only speed while surpassing explicit CoT accuracy.

Cheat Sheet

| Component | Detail |
| --- | --- |
| Base Model | Qwen3-VL-4B-Instruct (ViT + MLP + LLM, all trainable) |
| Visual Latent Tokens | 4 tokens → 35 actual VLM tokens |
| Language Latent Tokens | 2 tokens → 20 actual VLM tokens |
| Language Decoder | MLP proj + autoregressive decoder, CE loss on CoT text |
| Visual Decoder | MLP proj + decoder, CE loss on IBQ visual tokens (131K codebook) |
| Loss | L = Lc + 1.0·Ll + 0.1·Lv |
| Training | Preliminary → Stage 0 (warmup) → Stage 1 (freeze main) → Stage 2 (joint E2E) |
| Inference | Decoders discarded, latent tokens prefilled, AR traj or MLP traj head |
| Key Result | First latent CoT to surpass explicit CoT on all benchmarks |

What is the core theoretical reason OneVL surpasses explicit CoT?