The first latent Chain-of-Thought method to surpass explicit CoT — dual-modal auxiliary decoders compress reasoning into causal dynamics, then prefill inference matches answer-only speed.
You are building an autonomous driving system. Your car has cameras, and you want a Vision-Language Model (VLM) to look at the road, reason about what is happening, and output a trajectory — a sequence of (x, y) waypoints that tell the car where to drive over the next few seconds.
There is a well-known trick to make VLMs smarter: Chain-of-Thought (CoT) reasoning. Instead of asking the model to directly predict the trajectory, you ask it to first explain what it sees. "There is a pedestrian crossing from the left. The light is green. I should slow down and wait." Then it outputs the trajectory. This extra verbalization forces the model to organize its understanding before acting, and it consistently improves accuracy.
But here is the problem. In autonomous driving, you do not have seconds to spare. A typical CoT response generates 200-400 tokens of reasoning before the trajectory. At ~50 tokens per second, that is 4-8 seconds of autoregressive generation. A car moving at 60 km/h covers 67-133 meters in that time. That reasoning, helpful as it is, could get someone killed.
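The arithmetic behind those figures is easy to check. The sketch below uses the token counts, decode rate, and vehicle speed quoted above; the function name `distance_during_reasoning` is just an illustrative choice.

```python
def distance_during_reasoning(cot_tokens: int,
                              tokens_per_second: float = 50.0,
                              speed_kmh: float = 60.0) -> tuple[float, float]:
    """Return (seconds spent generating CoT, meters travelled meanwhile)."""
    seconds = cot_tokens / tokens_per_second
    meters_per_second = speed_kmh * 1000.0 / 3600.0
    return seconds, seconds * meters_per_second


for n in (200, 400):
    secs, meters = distance_during_reasoning(n)
    print(f"{n} tokens -> {secs:.1f} s -> {meters:.0f} m")
```

At 60 km/h the car covers about 16.7 meters every second, so each extra 50 tokens of reasoning costs roughly one more second and 17 more meters.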
The obvious solution: skip the CoT and train the model to output trajectories directly. This is fast — only a handful of tokens — but accuracy drops. The model never gets to "think before it acts." You lose the entire benefit of chain-of-thought.
What if you could keep the thinking but skip the talking? What if the model could reason in some compressed, internal representation — a few latent tokens instead of hundreds of text tokens — and still get the accuracy boost? That is the question OneVL answers.
Previous work on "latent CoT" (methods like COCONUT and CODI) tried to compress text-based reasoning into a smaller set of latent tokens. The idea was simple: instead of generating 200 tokens of text, generate 20 latent tokens that encode the same information. But every single one of these methods underperforms explicit CoT. Compressing language reasoning into fewer tokens consistently loses important information.
OneVL's insight is that language is already an abstraction. When a driver thinks "the pedestrian is crossing from the left," those words are a compressed description of a rich visual scene — the pedestrian's position, velocity, trajectory, the geometry of the crosswalk, the motion of other agents. Language reasoning compresses the world into symbols. Compressing those symbols further loses the causal structure they represent.
Concretely, OneVL introduces two types of latent tokens in the response sequence: visual latent tokens and language latent tokens.
Each type of latent token is supervised by its own auxiliary decoder. The language decoder reconstructs CoT text from the language latent tokens. The visual decoder predicts future camera frames from the visual latent tokens. At inference, both decoders are discarded — only the latent tokens remain, prefilled in a single parallel pass.
This is the first latent CoT method to surpass explicit CoT on every benchmark tested. The dual-modal supervision is the difference.
To understand where OneVL sits, you need to see the three paradigms for reasoning in Vision-Language-Action (VLA) models side by side. Each represents a different trade-off between accuracy and speed.
Paradigm 1: Explicit CoT. The model generates a full text explanation before the answer. Every reasoning step is verbalized: "I see a red light ahead. There is a car stopped in the right lane. I should decelerate and stay in the left lane." This is ~200-400 tokens generated autoregressively, one by one. Maximum accuracy, but 4-8 seconds of latency on a single GPU.
Paradigm 2: Implicit latent CoT (COCONUT, CODI, SIM-CoT). Replace the text reasoning with a fixed number of latent tokens. A language decoder supervises these latent tokens to encode the same reasoning. At inference, no decoder is needed — just the latent tokens. Faster, but every method in this category underperforms explicit CoT. Compressing language alone is a lossy operation.
Paradigm 3: OneVL. Replace text reasoning with two types of latent tokens — visual and language — supervised by two auxiliary decoders. The visual decoder forces the model to predict what the world will look like in the future, grounding the latent representation in physical causality. At inference, both decoders are discarded. Latency matches answer-only prediction. Accuracy surpasses explicit CoT.
| Paradigm | Reasoning Tokens | Supervision | Latency | vs Explicit CoT |
|---|---|---|---|---|
| Explicit CoT | 200-400 text | Cross-entropy on text | ~8s | Baseline |
| COCONUT / CODI | ~20 latent | Language decoder only | ~4.5s | Worse |
| OneVL | 35 visual + 20 language | Language + Visual decoders | ~4.5s | Better (first ever) |
OneVL is built on Qwen3-VL-4B-Instruct, a vision-language model with three components: a Vision Transformer (ViT) that encodes camera images into patch embeddings, an MLP aligner that projects those embeddings into the LLM's token space, and a causal language model that processes the combined text + visual tokens and generates output. All three components are trainable during OneVL's training.
The input sequence consists of the image tokens and the user query, followed by the latent tokens, followed by the trajectory answer. There are Cv = 4 special visual latent tokens (each expanded to ~9 actual VLM tokens, totaling 35) and Cl = 2 special language latent tokens (each expanded to ~10 actual VLM tokens, totaling 20). During training, these latent tokens sit in the response sequence, between the user query and the trajectory answer.
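As a rough illustration of the token budget, the sketch below lays out one training sequence as a list of placeholder strings. The placeholder names (`<vis_latent>`, `<lang_latent>`, and so on), the choice to put visual latents before language latents, and the example image, query, and answer lengths are all assumptions made for illustration. Only the counts of 35 and 20 expanded latent tokens come from the text.

```python
VISUAL_LATENT_TOKENS = 35    # 4 special tokens after expansion (from the text)
LANGUAGE_LATENT_TOKENS = 20  # 2 special tokens after expansion (from the text)


def build_layout(n_image: int, n_query: int, n_answer: int) -> list[str]:
    """Return a placeholder token layout: prompt, latents, then answer.

    Token names and the visual-before-language order are hypothetical.
    """
    prompt = ["<image>"] * n_image + ["<query>"] * n_query
    latents = (["<vis_latent>"] * VISUAL_LATENT_TOKENS
               + ["<lang_latent>"] * LANGUAGE_LATENT_TOKENS)
    answer = ["<traj>"] * n_answer
    return prompt + latents + answer


layout = build_layout(n_image=8, n_query=4, n_answer=6)
latent_positions = [i for i, tok in enumerate(layout) if tok.endswith("latent>")]
print(len(latent_positions))  # 55 latent positions in total
```

The indices in `latent_positions` are where hidden states would be read out for the auxiliary decoders during training.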
During training, the VLM processes this entire sequence. At the positions where the latent tokens sit, the model's hidden states carry compressed reasoning information. These hidden states are extracted and fed to the two auxiliary decoders, one per latent type.
The hidden states at visual latent positions are Hv ∈ R^(Cv×d), and at language latent positions Hl ∈ R^(Cl×d), where d is the model's hidden dimension (2560 for Qwen3-VL-4B). These hidden states are the interface between the main model and the auxiliary decoders.
The two auxiliary decoders are the heart of OneVL. They exist solely during training to force the latent tokens to encode meaningful information. At inference, they are thrown away. Understanding their architecture and data flow is essential.
The language decoder takes two inputs: the hidden states Hl at the language latent positions and the ViT image embeddings V. It must reconstruct the full Chain-of-Thought text that the model would have generated if asked to reason explicitly.
Both inputs pass through Wl, a learned MLP projection that maps them to a shared space. The ViT embeddings V provide the raw visual context, and Hl provides the compressed reasoning. These are concatenated and fed through a small autoregressive decoder that predicts CoT tokens via cross-entropy loss.
In terms of shapes, V is projected to R^(N×d_dec) and Hl to R^(Cl×d_dec), where N is the number of ViT patch embeddings and d_dec is the decoder width. The two are concatenated along the sequence dimension to form Zl ∈ R^((N+Cl)×d_dec).
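The shape bookkeeping can be sketched in a few lines of plain Python. This is not the paper's decoder: the toy dimensions, the random weights, and the use of one linear map per input in place of the MLP are assumptions, chosen only to show how the two projected inputs are stacked along the sequence dimension.

```python
import random


def linear(x: list[list[float]], w: list[list[float]]) -> list[list[float]]:
    """Multiply a (rows x d_in) matrix by a (d_in x d_out) weight matrix."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*w)]
            for row in x]


def rand_matrix(rows: int, cols: int, rng: random.Random) -> list[list[float]]:
    """Random matrix standing in for real embeddings or learned weights."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
N, C_l = 6, 2                 # toy patch count; C_l = 2 as in the text
d_vit, d, d_dec = 5, 8, 4     # toy dimensions, not the real model sizes

V = rand_matrix(N, d_vit, rng)       # ViT embeddings, N x d_vit
H_l = rand_matrix(C_l, d, rng)       # language latent hidden states, C_l x d

# Hypothetical projections into the decoder width d_dec.
V_proj = linear(V, rand_matrix(d_vit, d_dec, rng))    # N x d_dec
H_proj = linear(H_l, rand_matrix(d, d_dec, rng))      # C_l x d_dec

Z_l = V_proj + H_proj   # list concatenation stacks rows along the sequence axis
print(len(Z_l), len(Z_l[0]))  # (N + C_l) rows, d_dec columns
```

The decoder then sees a single sequence in which the last C_l rows carry the compressed reasoning and the first N rows carry the raw visual context.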
The visual decoder takes the hidden states Hv at the visual latent positions and the ViT embeddings V. It must predict what the scene will look like at two future timesteps: +0.5 seconds and +1.0 seconds.
The future frames are not predicted as raw pixels. Instead, each frame is first tokenized by an IBQ (Index Backpropagation Quantization) visual tokenizer with a codebook of 131,072 entries. The visual decoder predicts these discrete tokens via cross-entropy loss, the same way a language model predicts word tokens, but over a vocabulary of visual codes.
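To make the "classification over a visual vocabulary" framing concrete, here is a minimal sketch of the per-token cross-entropy in plain Python. The logits and the target index are made up; the only figure taken from the text is the codebook size of 131,072. With uniform logits the loss equals ln(131072), which gives a feel for the loss scale an untrained decoder starts from.

```python
import math

CODEBOOK_SIZE = 131_072  # IBQ codebook entries quoted in the text


def cross_entropy(logits: list[float], target: int) -> float:
    """Negative log-probability of `target` under softmax(logits)."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]


uniform_logits = [0.0] * CODEBOOK_SIZE
loss = cross_entropy(uniform_logits, target=12_345)  # arbitrary target index
print(round(loss, 2))  # about 11.78, i.e. 17 * ln(2)
```

A real decoder would average this quantity over every visual token of both future frames.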
The three losses are combined as L = Lc + λl·Ll + λv·Lv, where Lc is the main trajectory cross-entropy loss, Ll is the language decoder's CoT reconstruction loss, and Lv is the visual decoder's future-frame prediction loss. The weights are λl = 1.0 and λv = 0.1. The visual loss is weighted lower because predicting future frames is a harder task that requires understanding 3D scene dynamics, and early in training it produces noisier gradients.
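The weighted sum of the three losses is simple enough to write down directly. In the sketch below the default weights are the ones stated in the text; the function name and the example loss values are invented for illustration.

```python
def total_loss(l_c: float, l_l: float, l_v: float,
               lambda_l: float = 1.0, lambda_v: float = 0.1) -> float:
    """Weighted sum L = Lc + lambda_l * Ll + lambda_v * Lv."""
    return l_c + lambda_l * l_l + lambda_v * l_v


# Made-up loss values, only to show the relative contribution of each term.
print(total_loss(l_c=0.8, l_l=1.5, l_v=11.8))
```

With these example values the visual term contributes 1.18 to the total despite having by far the largest raw loss, which is the effect the 0.1 weight is meant to have.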
You cannot train everything at once from scratch. Think about why: the visual decoder needs meaningful hidden states at the latent positions to learn useful gradients. But the main model starts with random latent representations — its hidden states at those positions carry no information. The visual decoder would receive garbage inputs, compute garbage gradients, and send garbage signals back to the main model. Chaos.
OneVL solves this with a carefully staged training pipeline — four phases that progressively warm up each component before coupling them together.
Preliminary phase: visual decoder pretraining. Before any latent tokens are involved, the visual decoder learns to predict future frames from ViT embeddings alone. Input: V (image patch embeddings from the ViT). Output: discrete visual tokens for frames at +0.5s and +1.0s. This teaches the decoder the basic structure of future prediction (what roads, cars, and pedestrians look like over time) without requiring any latent conditioning. The main model is not involved.
Stage 0: warmup. The latent tokens are inserted into the sequence and the main model trains on the trajectory loss Lc only. No decoder gradients reach it. The model learns to produce useful hidden states at the latent positions as a byproduct of minimizing the trajectory loss. By the end of Stage 0, the hidden states Hv and Hl carry some structured information about the scene, even though no decoder has seen them yet.
Stage 1: decoder alignment. The main model is frozen. Both decoders train to align with the now-stable latent representations. The language decoder learns to map Hl to CoT text. The visual decoder, already pretrained on ViT embeddings, now also conditions on Hv. Since the main model's weights are frozen, the decoders get consistent input and can converge without chasing a moving target.
Stage 2: joint end-to-end training. Everything trains jointly: the main model (including the ViT and the aligner) and both decoders all update simultaneously. Decoder gradients now flow back into the main model through Hv and Hl, creating a virtuous cycle: better latent representations help the decoders predict better, and better decoder gradients help the main model produce richer latent representations.
The table below summarizes the four phases: which components are trainable or frozen in each, and which losses are active.
| Phase | Main Model | Lang Decoder | Vis Decoder | Active Loss |
|---|---|---|---|---|
| Preliminary | N/A | N/A | Train | Lv (ViT only) |
| Stage 0 | Train | N/A | Frozen | Lc |
| Stage 1 | Frozen | Train | Train | Ll + Lv |
| Stage 2 | Train | Train | Train | Lc + Ll + Lv |
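One way to think about the schedule is as a small piece of configuration data that a training loop consults before each phase. The sketch below encodes the table above; the `Phase` class, the component names, and the helper `is_trainable` are hypothetical and not taken from the OneVL codebase.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Phase:
    name: str
    trainable: frozenset[str]   # components receiving gradient updates
    losses: tuple[str, ...]     # active loss terms


SCHEDULE = (
    Phase("Preliminary", frozenset({"visual_decoder"}), ("Lv",)),
    Phase("Stage 0", frozenset({"main_model"}), ("Lc",)),
    Phase("Stage 1", frozenset({"language_decoder", "visual_decoder"}),
          ("Ll", "Lv")),
    Phase("Stage 2",
          frozenset({"main_model", "language_decoder", "visual_decoder"}),
          ("Lc", "Ll", "Lv")),
)


def is_trainable(phase: Phase, component: str) -> bool:
    """True if `component` is updated during `phase`."""
    return component in phase.trainable


for phase in SCHEDULE:
    status = "train" if is_trainable(phase, "main_model") else "frozen or unused"
    print(f"{phase.name}: main model {status}, losses = {' + '.join(phase.losses)}")
```

In a framework such as PyTorch, a schedule like this would typically be applied by toggling `requires_grad` on each component's parameters at the start of every phase.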
Here is where the speed trick becomes concrete. After training, you throw away both decoders entirely. They are never run at inference time. So what do the latent tokens do?
At inference, the input sequence is the image tokens and the user query followed directly by the latent tokens. All of these tokens, including the 55 latent tokens, are part of the prompt, not the response. They are processed in the prefill phase, where the transformer processes the entire input in a single parallel forward pass. There is no autoregressive generation for these tokens: the KV cache is computed once, in parallel, for the entire prompt.
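A simple way to see why this matters is to count sequential forward passes, since each autoregressively generated token needs its own pass while the whole prompt shares one. The sketch below is a counting model only and ignores that a prefill pass is more expensive than a single decode step. The token counts are the upper ends of the ranges quoted elsewhere in this article.

```python
def sequential_passes(generated_tokens: int) -> int:
    """One parallel prefill pass plus one pass per generated token."""
    prefill_passes = 1
    return prefill_passes + generated_tokens


TRAJECTORY_TOKENS = 40      # upper end of the 20-40 trajectory token range
EXPLICIT_COT_TOKENS = 400   # upper end of the 200-400 CoT token range

explicit_cot = sequential_passes(EXPLICIT_COT_TOKENS + TRAJECTORY_TOKENS)
onevl = sequential_passes(TRAJECTORY_TOKENS)  # 55 latents ride along in prefill
print(explicit_cot, onevl)  # 441 41
```

The 55 latent tokens lengthen the prompt slightly but add no sequential steps, which is why OneVL's pass count is the same as an answer-only model's.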
Only the trajectory tokens are generated autoregressively. With a typical trajectory of 10-20 waypoints (20-40 tokens), this takes well under a second. Total inference time is therefore roughly 4 seconds of prefill plus about 0.5 seconds of trajectory decode, or about 4.5 seconds.
Compare this to explicit CoT: ~4 seconds of prefill plus ~4 seconds of autoregressive CoT generation plus ~0.5 seconds of trajectory decode, or ~8.5 seconds in total. OneVL cuts this roughly in half.
For real deployment, the autoregressive trajectory head is replaced with an MLP trajectory head that predicts all waypoints in a single forward pass. This reduces total inference to 0.24 seconds (4.16 Hz), which is about 5.4% of OneVL's 4.46-second latency with autoregressive trajectory decoding and under 3% of explicit CoT's 8.73 seconds. That is fast enough for real-time driving.
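The rate and the relative savings follow directly from the latencies. The sketch below uses the 0.24-second figure together with the 4.46-second and 8.73-second NAVSIM latencies reported in the results; the helper names are illustrative.

```python
MLP_HEAD_LATENCY_S = 0.24      # OneVL with MLP trajectory head
ONEVL_AR_LATENCY_S = 4.46      # OneVL with autoregressive trajectory decode
EXPLICIT_COT_LATENCY_S = 8.73  # explicit CoT


def rate_hz(latency_s: float) -> float:
    """Planning rate implied by a per-inference latency."""
    return 1.0 / latency_s


def percent_of(latency_s: float, reference_s: float) -> float:
    """Latency expressed as a percentage of a reference latency."""
    return 100.0 * latency_s / reference_s


print(f"{rate_hz(MLP_HEAD_LATENCY_S):.1f} Hz")
print(f"{percent_of(MLP_HEAD_LATENCY_S, ONEVL_AR_LATENCY_S):.1f}% of OneVL AR latency")
print(f"{percent_of(MLP_HEAD_LATENCY_S, EXPLICIT_COT_LATENCY_S):.1f}% of explicit CoT latency")
```

Running it gives a bit over 4 Hz, about 5.4% of the autoregressive OneVL latency, and about 2.7% of the explicit CoT latency.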
The table below compares inference latency across four modes, split into prefill time and autoregressive decode time. OneVL's latent tokens add almost nothing to prefill, while explicit CoT's text tokens add seconds of decode.
| Mode | Prefill | CoT Decode | Traj Decode | Total |
|---|---|---|---|---|
| Explicit CoT | ~3.5s | ~4.0s | ~0.5s | ~8.0s |
| Answer-Only | ~3.5s | 0s | ~0.5s | ~4.0s |
| OneVL (AR traj) | ~3.5s | 0s | ~0.5s | ~4.0s |
| OneVL (MLP traj) | ~0.2s | 0s | ~0.04s | 0.24s |
OneVL is evaluated on four autonomous driving benchmarks, each testing different aspects of trajectory prediction. The results are striking: OneVL is the first latent CoT method to surpass explicit CoT on all four.
NAVSIM is a closed-loop navigation benchmark scored with the PDM (Predictive Driver Model) score, where higher is better. OneVL achieves 88.84, beating explicit CoT at 88.29 and the previous state-of-the-art LaST-VLA at 87.30. Inference latency is 4.46 seconds versus 8.73 seconds for explicit CoT.
The second benchmark covers driving through construction zones with altered road geometry, measured by Average Displacement Error (ADE, lower is better). OneVL scores 12.49 versus 13.18 for explicit CoT. The previous best non-VLA method (YNet) scored 22.68, roughly 1.8 times OneVL's error.
The third benchmark pairs natural-language instructions with novel driving scenarios. OneVL reaches 1.34 ADE versus 1.42 for explicit CoT and 1.60 for the previous best, Impromptu VLA.
The fourth benchmark tests long-horizon trajectory prediction. OneVL reaches 2.77 ADE versus 2.99 for explicit CoT and 2.86 for Cosmos-Reason.
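Since three of the four benchmarks report ADE, it helps to see how the metric is computed. The sketch below implements the usual definition, the mean Euclidean distance between corresponding predicted and ground-truth waypoints. The toy trajectories are invented, and the units depend on the benchmark, which the text does not specify.

```python
import math

Waypoint = tuple[float, float]


def average_displacement_error(pred: list[Waypoint],
                               truth: list[Waypoint]) -> float:
    """Mean Euclidean distance between matching predicted and true waypoints."""
    if len(pred) != len(truth) or not pred:
        raise ValueError("trajectories must be non-empty and equally long")
    return sum(math.dist(p, t) for p, t in zip(pred, truth)) / len(pred)


# Toy example: the prediction drifts sideways by 0, 1 and 2 units.
truth = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
pred = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)]
print(average_displacement_error(pred, truth))  # 1.0
```

Because every waypoint counts equally, a trajectory that starts well but drifts late is penalized as much as one with a constant small offset.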
The paper includes careful ablations that isolate each component's contribution. These answer the "what if we removed X?" questions.
No visual decoder. Remove the visual decoder (Dv) and keep only the language decoder (Dl). This is equivalent to prior latent CoT methods. Performance drops below explicit CoT on most benchmarks; on NAVSIM it falls to 88.12 (vs 88.84 with both decoders and 88.29 for explicit CoT). The visual decoder's spatiotemporal grounding is essential.
No language decoder. Remove Dl and keep only Dv. Performance also drops, to 88.47 on NAVSIM. The language decoder provides complementary symbolic supervision that the visual decoder alone cannot.
No staged training. Train everything jointly from scratch, skipping the staged warmup. NAVSIM drops to 87.01, the largest drop of any ablation and well below the explicit CoT baseline. This is consistent with the chicken-and-egg problem: random latent states produce uninformative decoder gradients that destabilize training.
Number of visual latent tokens. Reducing visual latent tokens from 4 to 2 (17 actual tokens) drops NAVSIM from 88.84 to 88.35. Increasing to 8 (70 actual tokens) gives 88.89, a marginal improvement at double the token count. The 4-token configuration is the sweet spot.
| Configuration | NAVSIM PDM↑ | Δ vs Full |
|---|---|---|
| OneVL (full) | 88.84 | — |
| No visual decoder | 88.12 | -0.72 |
| No language decoder | 88.47 | -0.37 |
| No staged training | 87.01 | -1.83 |
| 2 visual latent tokens | 88.35 | -0.49 |
| 8 visual latent tokens | 88.89 | +0.05 |
| Explicit CoT baseline | 88.29 | -0.55 |
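The Δ column can be reproduced from the scores themselves, which is a quick sanity check on the table. The scores below are copied from the table above; the dictionary and helper name are illustrative.

```python
FULL_PDM = 88.84  # OneVL (full) on NAVSIM

ABLATIONS = {
    "No visual decoder": 88.12,
    "No language decoder": 88.47,
    "No staged training": 87.01,
    "2 visual latent tokens": 88.35,
    "8 visual latent tokens": 88.89,
    "Explicit CoT baseline": 88.29,
}


def delta_vs_full(score: float, full: float = FULL_PDM) -> float:
    """Score difference relative to the full model, rounded to 2 decimals."""
    return round(score - full, 2)


# Print the configurations from worst to best.
for name, score in sorted(ABLATIONS.items(), key=lambda kv: kv[1]):
    print(f"{name:<24} {score:.2f} {delta_vs_full(score):+.2f}")
```

Sorted this way, skipping staged training stands out as the most damaging change, followed by removing the visual decoder.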
| Component | Detail |
|---|---|
| Base Model | Qwen3-VL-4B-Instruct (ViT + MLP + LLM, all trainable) |
| Visual Latent Tokens | 4 tokens → 35 actual VLM tokens |
| Language Latent Tokens | 2 tokens → 20 actual VLM tokens |
| Language Decoder | MLP proj + autoregressive decoder, CE loss on CoT text |
| Visual Decoder | MLP proj + decoder, CE loss on IBQ visual tokens (131K codebook) |
| Loss | L = Lc + 1.0·Ll + 0.1·Lv |
| Training | Preliminary → Stage 0 (warmup) → Stage 1 (freeze main) → Stage 2 (joint E2E) |
| Inference | Decoders discarded, latent tokens prefilled, AR traj or MLP traj head |
| Key Result | First latent CoT to surpass explicit CoT on all benchmarks |