OneVL — Veanors

Chapter 0: The Problem

You are building an autonomous driving system. Your car has cameras, and you want a Vision-Language Model (VLM) to look at the road, reason about what is happening, and output a trajectory — a sequence of (x, y) waypoints that tell the car where to drive over the next few seconds.

There is a well-known trick to make VLMs smarter: Chain-of-Thought (CoT) reasoning. Instead of asking the model to directly predict the trajectory, you ask it to first explain what it sees. "There is a pedestrian crossing from the left. The light is green. I should slow down and wait." Then it outputs the trajectory. This extra verbalization forces the model to organize its understanding before acting, and it consistently improves accuracy.

But here is the problem. In autonomous driving, you do not have seconds to spare. A typical CoT response generates 200-400 tokens of reasoning before the trajectory. At ~50 tokens per second, that is 4-8 seconds of autoregressive generation. A car moving at 60 km/h covers 67-133 meters in that time. That reasoning, helpful as it is, could get someone killed.
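The arithmetic behind those numbers is easy to check. The sketch below assumes a constant decode rate and constant vehicle speed; the function names are illustrative, not from the OneVL paper.

```python
def cot_latency_seconds(num_tokens: int, tokens_per_second: float) -> float:
    """Time to generate num_tokens autoregressively at a fixed decode rate."""
    return num_tokens / tokens_per_second


def distance_travelled_m(speed_kmh: float, seconds: float) -> float:
    """Distance covered at constant speed, converting km/h to m/s."""
    return speed_kmh / 3.6 * seconds


for tokens in (200, 400):
    t = cot_latency_seconds(tokens, tokens_per_second=50.0)
    d = distance_travelled_m(60.0, t)
    print(f"{tokens} tokens -> {t:.1f} s -> {d:.0f} m at 60 km/h")
```

The two printed rows correspond to the low and high ends of the 200-400 token range.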

The dilemma: Chain-of-Thought reasoning makes Vision-Language-Action models (VLAs) significantly more accurate. But autoregressive generation of CoT tokens is far too slow for real-time deployment. You need the accuracy of reasoning without the latency of verbalization.

The obvious solution: skip the CoT and train the model to output trajectories directly. This is fast — only a handful of tokens — but accuracy drops. The model never gets to "think before it acts." You lose the entire benefit of chain-of-thought.

What if you could keep the thinking but skip the talking? What if the model could reason in some compressed, internal representation — a few latent tokens instead of hundreds of text tokens — and still get the accuracy boost? That is the question OneVL answers.

The CoT Speed Problem

Toggle between explicit CoT (slow, accurate) and answer-only (fast, less accurate). Watch the latency and accuracy trade-off. The bar chart shows why neither approach works for real-time driving.

Why is explicit Chain-of-Thought reasoning impractical for autonomous driving, despite improving accuracy?

Because CoT reasoning uses too much GPU memory
Because autoregressively generating hundreds of reasoning tokens adds seconds of latency that a moving vehicle cannot afford
Because language models cannot understand camera images

Chapter 1: The Key Insight

Previous work on "latent CoT" (methods like COCONUT and CODI) tried to compress text-based reasoning into a smaller set of latent tokens. The idea was simple: instead of generating 200 tokens of text, generate 20 latent tokens that encode the same information. But every single one of these methods underperforms explicit CoT. Compressing language reasoning into fewer tokens consistently loses important information.

OneVL's insight is that language is already an abstraction. When a driver thinks "the pedestrian is crossing from the left," those words are a compressed description of a rich visual scene — the pedestrian's position, velocity, trajectory, the geometry of the crosswalk, the motion of other agents. Language reasoning compresses the world into symbols. Compressing those symbols further loses the causal structure they represent.

The key insight: Instead of compressing language reasoning alone, compress BOTH language reasoning AND visual future prediction into latent tokens. This forces the latent representation to encode causal dynamics — the actual physics of the scene — not just symbolic abstractions. Tighter compression that captures causal structure produces more generalizable representations than verbose token-by-token reasoning.

Concretely, OneVL introduces two types of latent tokens in the response sequence:

Visual latent tokens (Z_v): 4 special tokens that expand to 35 actual VLM tokens. These encode spatial-temporal reasoning — what the road and agents will look like in the near future.
Language latent tokens (Z_l): 2 special tokens that expand to 20 actual VLM tokens. These encode the language reasoning that would have been spelled out in a full CoT.

Each type of latent token is supervised by its own auxiliary decoder. The language decoder reconstructs CoT text from the language latent tokens. The visual decoder predicts future camera frames from the visual latent tokens. At inference, both decoders are discarded — only the latent tokens remain, prefilled in a single parallel pass.

Explicit CoT

VLM generates 200+ text tokens of reasoning, then trajectory. Accurate but slow.

↓ compress

Prior Latent CoT

Replace text with ~20 latent tokens supervised by language decoder only. Faster, but loses accuracy vs explicit CoT.

↓ add visual supervision

OneVL

35 visual + 20 language latent tokens, supervised by BOTH language and visual decoders. Surpasses explicit CoT at answer-only speed.

This is the first latent CoT method to surpass explicit CoT on every benchmark tested. The dual-modal supervision is the difference.

Why does compressing language reasoning alone (prior latent CoT) fail to match explicit CoT?

Because language is already an abstraction of the visual world, so compressing an abstraction further loses the causal structure it represents
Because latent tokens cannot store any information
Because the language decoder is not powerful enough

Chapter 2: Three Paradigms

To understand where OneVL sits, you need to see the three paradigms for reasoning in VLAs side by side. Each represents a different trade-off between accuracy and speed.

Paradigm 1: Explicit CoT. The model generates a full text explanation before the answer. Every reasoning step is verbalized: "I see a red light ahead. There is a car stopped in the right lane. I should decelerate and stay in the left lane." This is ~200-400 tokens generated autoregressively, one by one. Maximum accuracy, but 4-8 seconds of latency on a single GPU.

Paradigm 2: Implicit latent CoT (COCONUT, CODI, SIM-CoT). Replace the text reasoning with a fixed number of latent tokens. A language decoder supervises these latent tokens to encode the same reasoning. At inference, no decoder is needed — just the latent tokens. Faster, but every method in this category underperforms explicit CoT. Compressing language alone is a lossy operation.

Paradigm 3: OneVL. Replace text reasoning with two types of latent tokens — visual and language — supervised by two auxiliary decoders. The visual decoder forces the model to predict what the world will look like in the future, grounding the latent representation in physical causality. At inference, both decoders are discarded. Latency matches answer-only prediction. Accuracy surpasses explicit CoT.

Why does dual supervision help? Consider what each decoder forces the latent tokens to encode. The language decoder demands symbolic reasoning ("the car is stopping"). The visual decoder demands spatiotemporal prediction ("these pixels will move here in 0.5 seconds"). Together, they force the latent tokens to capture both the what and the where/when of the scene — a richer representation than either alone.

Three Paradigms Compared

Click each paradigm to see its token flow: how reasoning tokens are generated, what decoders are involved, and the resulting speed/accuracy trade-off. The animation shows the autoregressive generation process.

Paradigm	Reasoning Tokens	Supervision	Latency	vs Explicit CoT
Explicit CoT	200-400 text	Cross-entropy on text	~8s	Baseline
COCONUT / CODI	~20 latent	Language decoder only	~4.5s	Worse
OneVL	35 visual + 20 language	Language + Visual decoders	~4.5s	Better (first ever)

What distinguishes OneVL from prior latent CoT methods like COCONUT?

OneVL uses more latent tokens
OneVL adds a visual world model decoder alongside the language decoder, forcing latent tokens to encode causal dynamics rather than just symbolic abstractions
OneVL uses a larger base model

Chapter 3: Architecture

OneVL is built on Qwen3-VL-4B-Instruct, a vision-language model with three components: a Vision Transformer (ViT) that encodes camera images into patch embeddings, an MLP aligner that projects those embeddings into the LLM's token space, and a causal language model that processes the combined text + visual tokens and generates output. All three components are trainable during OneVL's training.

The input sequence looks like this:

[System Prompt, Image Tokens, User Query, <vis_latent> × C_v, <lang_latent> × C_l, Trajectory Tokens]

Where C_v = 4 special visual latent tokens (each expanded to ~9 actual VLM tokens, totaling 35) and C_l = 2 special language latent tokens (each expanded to ~10 actual VLM tokens, totaling 20). These latent tokens sit in the response sequence, between the user query and the trajectory answer.
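To make the layout concrete, here is a minimal sketch that assembles such a sequence from placeholder strings and records where each segment sits. The helper name `build_sequence`, the stand-in prompt and waypoint tokens, and the choice to work directly with the expanded counts (35 and 20) are assumptions for illustration; the real model operates on tokenizer IDs.

```python
from typing import Dict, List, Tuple

VIS_LATENT = "<vis_latent>"    # placeholder string from the layout above
LANG_LATENT = "<lang_latent>"  # placeholder string from the layout above


def build_sequence(
    prompt: List[str],
    trajectory: List[str],
    n_vis: int = 35,   # expanded visual latent positions
    n_lang: int = 20,  # expanded language latent positions
) -> Tuple[List[str], Dict[str, range]]:
    """Concatenate prompt, latent placeholders, and trajectory tokens.

    Returns the token list and the index range of each response segment.
    """
    tokens = list(prompt)
    vis_start = len(tokens)
    tokens += [VIS_LATENT] * n_vis
    lang_start = len(tokens)
    tokens += [LANG_LATENT] * n_lang
    traj_start = len(tokens)
    tokens += list(trajectory)
    spans = {
        "vis": range(vis_start, lang_start),
        "lang": range(lang_start, traj_start),
        "traj": range(traj_start, len(tokens)),
    }
    return tokens, spans


tokens, spans = build_sequence(
    prompt=["<system>", "<image>", "<query>"],  # stand-ins for real prompt tokens
    trajectory=["x0", "y0", "x1", "y1"],        # stand-ins for waypoint tokens
)
print(len(spans["vis"]), len(spans["lang"]), spans["traj"].start)
```

The recorded index ranges are what you would later use to pick out hidden states at the latent positions.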

During training, the VLM processes this entire sequence. At the positions where the latent tokens sit, the model's hidden states carry compressed reasoning information. These hidden states are extracted and fed to two auxiliary decoders:

ViT Encoder

Camera images → patch embeddings V ∈ R^N×d

↓

MLP Aligner

V → visual tokens in LLM space

↓

Causal LLM

[System, Images, Query, Z_v, Z_l, Trajectory] → hidden states at every position

↓ extract H_v, H_l

Dual Decoders (training only)

H_v → Visual Decoder → future frames | H_l → Language Decoder → CoT text

Key design choice: The latent tokens are placed BEFORE the trajectory tokens in the sequence. This means the trajectory generation can attend to the latent tokens via causal attention. The latent tokens serve as compressed "context" that shapes the trajectory output — they ARE the reasoning, just in compressed form.
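A toy causal mask shows why the ordering matters. This sketch uses a 7-position sequence with made-up segment positions; only the rule "position i attends to positions j <= i" is the standard causal-attention definition.

```python
from typing import List


def causal_mask(seq_len: int) -> List[List[bool]]:
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(seq_len)] for i in range(seq_len)]


# Toy layout: 3 prompt tokens, 2 latent tokens, 2 trajectory tokens.
latent_positions = [3, 4]
trajectory_positions = [5, 6]
mask = causal_mask(7)

# Every trajectory position can read every latent position...
traj_sees_latents = all(
    mask[t][z] for t in trajectory_positions for z in latent_positions
)
# ...but no latent position can read a trajectory position.
latents_see_traj = any(
    mask[z][t] for z in latent_positions for t in trajectory_positions
)
print(traj_sees_latents, latents_see_traj)
```

If the latent tokens came after the trajectory instead, the first check would fail and the trajectory could not use them as context.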

The hidden states at visual latent positions are H_v ∈ R^{C_v×d}, and at language latent positions H_l ∈ R^{C_l×d}, where d is the model's hidden dimension (2560 for the Qwen3-VL-4B language model). These hidden states are the interface between the main model and the auxiliary decoders.

Why are latent tokens placed BEFORE the trajectory tokens in the sequence?

To make the model generate faster
So that trajectory generation can attend to latent tokens via causal attention, using the compressed reasoning as context for the answer
Because the tokenizer requires that order

Chapter 4: Dual Decoders

The two auxiliary decoders are the heart of OneVL. They exist solely during training to force the latent tokens to encode meaningful information. At inference, they are thrown away. Understanding their architecture and data flow is essential.

Language Auxiliary Decoder (D_l)

This decoder takes two inputs: the hidden states H_l at the language latent positions and the ViT image embeddings V. It must reconstruct the full Chain-of-Thought text that the model would have generated if asked to reason explicitly.

Z_l = [W_l(V), W_l(H_l)]

Where W_l is a learned MLP projection that maps both inputs to a shared space. The ViT embeddings V provide the raw visual context, and H_l provides the compressed reasoning. These are concatenated and fed through a small autoregressive decoder that predicts CoT tokens via cross-entropy loss.

The decoder's input tensor shape: V is projected to R^{N×d_dec}, H_l is projected to R^{C_l×d_dec}, and the two are concatenated along the sequence dimension to form Z_l ∈ R^{(N+C_l)×d_dec}.
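The shape bookkeeping can be sketched in a few lines of plain Python. This is an illustration under stated assumptions: a single linear map stands in for the learned MLP projection W_l, the matrices are random, and all sizes are toy values rather than the model's real dimensions.

```python
import random
from typing import List

Matrix = List[List[float]]


def linear(x: Matrix, w: Matrix) -> Matrix:
    """Multiply x (rows x d_in) by w (d_in x d_out)."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*w)] for row in x]


def random_matrix(rows: int, cols: int, rng: random.Random) -> Matrix:
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def decoder_input(v: Matrix, h: Matrix, w: Matrix) -> Matrix:
    """Project V and H with the same weights, then concatenate along the sequence axis."""
    return linear(v, w) + linear(h, w)


rng = random.Random(0)
N, C_l, d, d_dec = 6, 2, 8, 4     # toy sizes, not the model's real dimensions
V = random_matrix(N, d, rng)       # stand-in for ViT patch embeddings
H_l = random_matrix(C_l, d, rng)   # stand-in for hidden states at language latent positions
W_l = random_matrix(d, d_dec, rng) # single linear layer standing in for the MLP

Z_l = decoder_input(V, H_l, W_l)
print(len(Z_l), len(Z_l[0]))  # rows = N + C_l, columns = d_dec
```

The same pattern applies to the visual decoder's input, with H_v and W_v in place of H_l and W_l.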

Visual Auxiliary Decoder (D_v)

This decoder takes the hidden states H_v at the visual latent positions and the ViT embeddings V. It must predict what the scene will look like at two future timesteps: +0.5 seconds and +1.0 seconds.

Z_v = [W_v(V), W_v(H_v)]

The future frames are not predicted as raw pixels. Instead, each frame is first tokenized by an IBQ (Index Backpropagation Quantization) visual tokenizer with a codebook of 131,072 entries. The visual decoder predicts these discrete tokens via cross-entropy loss, the same way a language model predicts word tokens, but over a vocabulary of visual patches.

Why discrete visual tokens? Predicting raw pixels would require regression loss (MSE), which is noisy and hard to optimize. Discrete visual tokens let you use cross-entropy loss — the same stable, well-understood loss used for language. This makes the visual decoder's training consistent with the language decoder's training, and both consistent with the main model's trajectory loss.
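As a reminder of what that loss computes, here is a small sketch of cross-entropy over discrete token targets. It uses a toy 4-entry codebook and made-up logits; nothing here is specific to the IBQ tokenizer or to OneVL's decoder.

```python
import math
from typing import List, Sequence


def cross_entropy(logits: Sequence[float], target: int) -> float:
    """Negative log-probability of `target` under softmax(logits), computed stably."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[target]


def mean_token_loss(all_logits: List[Sequence[float]], targets: List[int]) -> float:
    """Average cross-entropy over a sequence of discrete (visual or text) token targets."""
    losses = [cross_entropy(l, t) for l, t in zip(all_logits, targets)]
    return sum(losses) / len(losses)


# Toy codebook of 4 entries (the tokenizer in the text has 131,072).
logits = [[2.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0]]
targets = [0, 3]
print(round(mean_token_loss(logits, targets), 4))
```

The same function works whether the target index refers to a word piece or a visual codebook entry, which is the consistency the paragraph above describes.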

Combined Loss

L = L_c + λ_l · L_l + λ_v · L_v

Where L_c is the main trajectory cross-entropy loss, L_l is the language decoder's CoT reconstruction loss, and L_v is the visual decoder's future-frame prediction loss. The weights are λ_l = 1.0 and λ_v = 0.1. The visual loss is weighted lower because predicting future frames is a harder task — it requires understanding 3D scene dynamics — and early training produces noisier gradients.
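In code, the combined objective is a one-liner. The sketch below uses the weights quoted above; the loss values passed in are made up purely to show the weighting.

```python
def combined_loss(
    l_c: float,
    l_l: float,
    l_v: float,
    lambda_l: float = 1.0,
    lambda_v: float = 0.1,
) -> float:
    """L = L_c + lambda_l * L_l + lambda_v * L_v with the weights quoted in the text."""
    return l_c + lambda_l * l_l + lambda_v * l_v


# Made-up loss values, chosen only to show the weighting.
print(combined_loss(l_c=0.5, l_l=2.0, l_v=6.0))
```

With these inputs the visual term contributes 0.6 to the total even though its raw value is the largest of the three.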

Dual Decoder Data Flow

Toggle between Training mode (decoders active, gradients flowing) and Inference mode (decoders discarded, prefill shown). The diagram shows tensor shapes at each stage. Click components to highlight the data flow path.

Why is the visual loss weight (λ_v = 0.1) lower than the language loss weight (λ_l = 1.0)?

Because the visual decoder has fewer parameters
Because predicting future frames is harder and produces noisier gradients early in training, so it needs to be weighted down to avoid destabilizing the main model
Because visual tokens are less important than language tokens

Chapter 5: Training Pipeline

You cannot train everything at once from scratch. Think about why: the visual decoder needs meaningful hidden states at the latent positions to learn useful gradients. But the main model starts with random latent representations — its hidden states at those positions carry no information. The visual decoder would receive garbage inputs, compute garbage gradients, and send garbage signals back to the main model. Chaos.

OneVL solves this with a carefully staged training pipeline — four phases that progressively warm up each component before coupling them together.

Preliminary: Visual Decoder Pretraining

Before any latent tokens are involved, the visual decoder learns to predict future frames from ViT embeddings alone. Input: V (image patch embeddings from the ViT). Output: discrete visual tokens for frames at +0.5s and +1.0s. This teaches the decoder the basic structure of future prediction — what roads, cars, and pedestrians look like over time — without requiring any latent conditioning. The main model is not involved.

Stage 0: Main Model Warmup

The latent tokens are inserted into the sequence and the main model trains on the trajectory loss L_c only. No decoder gradients. The model learns to produce useful hidden states at the latent positions as a byproduct of minimizing the trajectory loss. By the end of Stage 0, the hidden states H_v and H_l carry some structured information about the scene, even though no decoder has seen them yet.

Stage 1: Auxiliary Decoder Warmup

The main model is frozen. Both decoders train to align with the now-stable latent representations. The language decoder learns to map H_l to CoT text. The visual decoder — already pretrained on ViT embeddings — now also conditions on H_v. Since the main model's weights are frozen, the decoders get consistent input and can converge without chasing a moving target.

Stage 2: Joint End-to-End Fine-Tuning

Everything trains jointly. The main model, both decoders, the ViT, the aligner — all parameters update simultaneously. Now decoder gradients flow back into the main model through H_v and H_l, creating a virtuous cycle: better latent representations help the decoders predict better, and better decoder gradients help the main model produce richer latent representations.

Why staged training works: Each stage establishes a stable foundation for the next. The preliminary phase gives the visual decoder basic competence. Stage 0 gives the main model informative latent states. Stage 1 aligns decoders to those states without moving the target. Stage 2 couples everything once all components can give useful gradients. This prevents the chicken-and-egg problem of joint training from scratch.

Training Pipeline Stages

Step through the four training stages. Each stage shows which components are trainable (highlighted) vs frozen (dimmed), and what losses are active. Watch how the system progressively warms up.

Phase	Main Model	Lang Decoder	Vis Decoder	Active Loss
Preliminary	N/A	N/A	Train	L_v (ViT only)
Stage 0	Train	N/A	Frozen	L_c
Stage 1	Frozen	Train	Train	L_l + L_v
Stage 2	Train	Train	Train	L_c + L_l + L_v
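The table can be written down as a small schedule object. This is a sketch, not the paper's training code: the component names are illustrative, and applying the same loss weights in every stage is an assumption, since the text only states the weights for the combined loss.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class Stage:
    name: str
    trainable: Tuple[str, ...]      # components that receive gradient updates
    active_losses: Tuple[str, ...]  # losses that contribute to the objective


SCHEDULE = (
    Stage("preliminary", ("vis_decoder",), ("L_v",)),
    Stage("stage_0", ("main_model",), ("L_c",)),
    Stage("stage_1", ("lang_decoder", "vis_decoder"), ("L_l", "L_v")),
    Stage("stage_2", ("main_model", "lang_decoder", "vis_decoder"), ("L_c", "L_l", "L_v")),
)

# Assumption: the combined-loss weights are reused wherever a loss is active.
LOSS_WEIGHTS = {"L_c": 1.0, "L_l": 1.0, "L_v": 0.1}


def stage_loss(stage: Stage, losses: Dict[str, float]) -> float:
    """Weighted sum of only the losses that are active in this stage."""
    return sum(LOSS_WEIGHTS[k] * losses[k] for k in stage.active_losses)


for stage in SCHEDULE:
    print(stage.name, "trains", stage.trainable, "on", stage.active_losses)
```

In a real framework, each stage would also toggle gradient tracking on the parameters of the components not listed as trainable.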

Why is the main model frozen during Stage 1 (auxiliary decoder warmup)?

So the decoders get consistent, stable latent representations to learn from; if the main model were also updating, the decoders would be chasing a moving target and might never converge
To save GPU memory
Because the main model has already converged

Chapter 6: Prefill Inference

Here is where the speed trick becomes concrete. After training, you throw away both decoders entirely. They are never run at inference time. So what do the latent tokens do?

The input sequence at inference looks like this:

[System, User Query, Image, <vis_latent> × 35, <lang_latent> × 20, <trajectory start>]

All of these tokens — including the 55 latent tokens — are part of the prompt, not the response. They are processed in the prefill phase, where modern transformer architectures process the entire input in a single parallel forward pass. There is no autoregressive generation for these tokens. The KV cache is computed once, in parallel, for the entire prompt.

Why prefill is fast: Autoregressive generation produces tokens one at a time, each requiring a separate forward pass. Prefill processes ALL prompt tokens in a single forward pass, with every position batched through the same matrix multiplications. Adding 55 tokens to the prompt adds negligible time compared to the hundreds of image tokens already there. The latent tokens ride for free on the existing prefill computation.
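One way to see the difference is to count sequential forward passes. The sketch below is a deliberately simple counting model: it charges one pass for prefill plus one per generated token (in practice the prefill pass also yields the first generated token), and the token counts are picked from the ranges quoted in this lesson.

```python
def forward_passes(num_generated_tokens: int) -> int:
    """One prefill pass over the whole prompt, then one pass per generated token."""
    return 1 + num_generated_tokens


# Illustrative counts from the ranges in the text; real values vary per sample.
cot_tokens, traj_tokens = 300, 30

explicit_cot = forward_passes(cot_tokens + traj_tokens)
onevl = forward_passes(traj_tokens)  # latents sit in the prompt, so they add no passes
print(explicit_cot, onevl)
```

The 55 latent tokens change the length of the single prefill pass, not the number of sequential passes.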

Only the trajectory tokens are generated autoregressively. With a typical trajectory of 10-20 waypoints (20-40 tokens), this takes well under a second. Total inference time:

Prefill: ~3-4 seconds (dominated by image tokens, latent tokens add nearly zero)
Trajectory decode: ~0.5 seconds (20-40 autoregressive tokens)
Total: ~4.5 seconds — essentially the same as answer-only prediction

Compare this to explicit CoT: ~4 seconds of prefill PLUS ~4 seconds of autoregressive CoT generation PLUS ~0.5 seconds of trajectory decode = ~8.5 seconds total. OneVL cuts this roughly in half.

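For real deployment, the autoregressive trajectory head is replaced with an MLP trajectory head that predicts all waypoints in a single forward pass. This reduces total inference to 0.24 seconds (4.16 Hz). Against the NAVSIM latencies reported in Chapter 7, that is about 5.4% of OneVL's 4.46 seconds with autoregressive trajectory decoding, and under 3% of explicit CoT's 8.73 seconds. That is real-time driving.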

Latency Comparison

Compare inference latency across four modes. The stacked bars show prefill time (blue) vs autoregressive decode time (orange). Notice how OneVL's latent tokens add almost nothing to prefill, while explicit CoT's text tokens add seconds of decode.

Mode	Prefill	CoT Decode	Traj Decode	Total
Explicit CoT	~3.5s	~4.0s	~0.5s	~8.0s
Answer-Only	~3.5s	0s	~0.5s	~4.0s
OneVL (AR traj)	~3.5s	0s	~0.5s	~4.0s
OneVL (MLP traj)	~0.2s	0s	~0.04s	0.24s
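The totals in the table are just row sums, and the planning rate is the reciprocal of the total. A quick check, with the per-phase numbers copied from the table above (the helper names are illustrative):

```python
# (prefill, CoT decode, trajectory decode) in seconds, copied from the table above.
MODES = {
    "Explicit CoT": (3.5, 4.0, 0.5),
    "Answer-Only": (3.5, 0.0, 0.5),
    "OneVL (AR traj)": (3.5, 0.0, 0.5),
    "OneVL (MLP traj)": (0.2, 0.0, 0.04),
}


def total_seconds(mode: str) -> float:
    """Sum of the three latency phases for one mode."""
    return sum(MODES[mode])


def rate_hz(seconds: float) -> float:
    """Planning frequency if one trajectory is produced every `seconds`."""
    return 1.0 / seconds


for mode in MODES:
    t = total_seconds(mode)
    print(f"{mode}: {t:.2f} s, {rate_hz(t):.2f} Hz")
```

Only the MLP-head configuration runs at more than one trajectory per second.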

Why does adding 55 latent tokens to the prompt add almost zero inference latency?

Because latent tokens are smaller than regular tokens
Because they are processed in the prefill phase as a single parallel forward pass, and 55 tokens are negligible compared to the hundreds of image tokens already in the prompt
Because the decoders process them at inference

Chapter 7: Results

OneVL is evaluated on four autonomous driving benchmarks, each testing different aspects of trajectory prediction. The results are striking: OneVL is the first latent CoT method to surpass explicit CoT on all four.

NAVSIM

A planning benchmark built on non-reactive simulation, scored with the PDM score (PDMS, Predictive Driver Model Score; higher is better). OneVL achieves 88.84, beating explicit CoT at 88.29 and the previous state-of-the-art LaST-VLA at 87.30. Inference latency: 4.46 seconds vs explicit CoT's 8.73 seconds.

ROADWork

Driving through construction zones with altered road geometry. Average Displacement Error (ADE, lower is better). OneVL: 12.49 vs explicit CoT at 13.18. The previous best non-VLA method (YNet) scored 22.68 — nearly double the error.
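For reference, ADE is the mean Euclidean distance between predicted and ground-truth waypoints. The sketch below implements that textbook definition on a made-up three-point trajectory; the units, horizon, and any per-benchmark averaging used in the reported numbers are not specified here and are not reproduced.

```python
import math
from typing import List, Tuple

Waypoint = Tuple[float, float]


def average_displacement_error(pred: List[Waypoint], truth: List[Waypoint]) -> float:
    """Mean Euclidean distance between matching predicted and ground-truth waypoints."""
    if len(pred) != len(truth) or not pred:
        raise ValueError("trajectories must be non-empty and the same length")
    dists = [math.hypot(px - tx, py - ty) for (px, py), (tx, ty) in zip(pred, truth)]
    return sum(dists) / len(dists)


# Made-up trajectories: per-waypoint errors are 0, 3, and 4.
truth = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
pred = [(0.0, 0.0), (1.0, 3.0), (2.0, 4.0)]
print(average_displacement_error(pred, truth))
```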

Impromptu

Natural-language instructions for novel driving scenarios. OneVL: 1.34 ADE vs explicit CoT at 1.42 and the previous best Impromptu VLA at 1.60.

APR1

Long-horizon trajectory prediction. OneVL: 2.77 ADE vs explicit CoT at 2.99 and Cosmos-Reason at 2.86.

The headline result: Prior latent CoT methods (COCONUT, CODI, SIM-CoT) ALL underperform explicit CoT. OneVL is the FIRST to surpass it — and it does so on every benchmark, not just one. This confirms that dual-modal supervision (language + visual) is the critical ingredient.

Results Across 4 Benchmarks

Bar chart showing OneVL vs explicit CoT vs prior SOTA on each benchmark. Green indicates OneVL beats all baselines. Hover over bars for exact values.

Which statement about the results is correct?

OneVL surpasses explicit CoT on all four benchmarks while matching answer-only latency, the first latent CoT method to achieve this
Prior latent CoT methods also surpass explicit CoT on some benchmarks
OneVL trades some accuracy for speed, accepting slightly worse scores than explicit CoT

Chapter 8: Ablations

The paper includes careful ablations that isolate each component's contribution. These answer the "what if we removed X?" questions.

Without Visual Decoder

Remove D_v, keep only D_l. This is equivalent to prior latent CoT methods. Performance drops below explicit CoT on most benchmarks. On NAVSIM: 88.12 (vs 88.84 with both decoders). The visual decoder's spatiotemporal grounding is essential.

Without Language Decoder

Remove D_l, keep only D_v. Performance also drops: NAVSIM 88.47. The language decoder provides complementary symbolic supervision that the visual decoder alone cannot.

Without Staged Training

Train everything jointly from scratch (skip the staged warmup). This causes the largest drop of any ablation: NAVSIM falls to 87.01, below the explicit CoT baseline of 88.29. This is consistent with the chicken-and-egg problem described in Chapter 5, where random latent states produce garbage decoder gradients that destabilize training.

Number of Latent Tokens

Reducing visual latent tokens from 4 to 2 (17 actual tokens) drops NAVSIM from 88.84 to 88.35. Increasing to 8 (70 actual tokens) gives 88.89 — marginal improvement at double the token count. The 4-token configuration is the sweet spot.

The ablation story: Both decoders are necessary. The visual decoder provides the larger improvement (removing it hurts more than removing the language decoder), but the combination is greater than either alone. And staged training is not optional — it is essential for stable convergence.

Configuration	NAVSIM PDM↑	Δ vs Full
OneVL (full)	88.84	—
No visual decoder	88.12	-0.72
No language decoder	88.47	-0.37
No staged training	87.01	-1.83
2 visual latent tokens	88.35	-0.49
8 visual latent tokens	88.89	+0.05
Explicit CoT baseline	88.29	-0.55
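The Δ column can be recomputed directly from the scores. The snippet below copies the NAVSIM values from the table and finds the configuration with the largest drop; the helper name is illustrative.

```python
# NAVSIM PDM scores copied from the ablation table above.
NAVSIM_PDM = {
    "OneVL (full)": 88.84,
    "No visual decoder": 88.12,
    "No language decoder": 88.47,
    "No staged training": 87.01,
    "2 visual latent tokens": 88.35,
    "8 visual latent tokens": 88.89,
    "Explicit CoT baseline": 88.29,
}


def delta_vs_full(config: str) -> float:
    """Score difference relative to the full OneVL configuration, rounded to 2 decimals."""
    return round(NAVSIM_PDM[config] - NAVSIM_PDM["OneVL (full)"], 2)


deltas = {name: delta_vs_full(name) for name in NAVSIM_PDM if name != "OneVL (full)"}
worst = min(deltas, key=deltas.get)

for name, delta in deltas.items():
    print(f"{name}: {delta:+.2f}")
print("largest drop:", worst)
```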

Which ablation causes the largest performance drop on NAVSIM?

Removing the visual decoder
Removing the staged training pipeline: joint training from scratch drops NAVSIM by 1.83 points, far more than any decoder ablation
Reducing latent tokens from 4 to 2

Chapter 9: Connections

OneVL in one sentence: Compress CoT reasoning into latent tokens supervised by dual auxiliary decoders (language + visual world model), discard decoders at inference, and prefill latent tokens in one parallel pass to match answer-only speed while surpassing explicit CoT accuracy.

Cheat Sheet

Component	Detail
Base Model	Qwen3-VL-4B-Instruct (ViT + MLP + LLM, all trainable)
Visual Latent Tokens	4 tokens → 35 actual VLM tokens
Language Latent Tokens	2 tokens → 20 actual VLM tokens
Language Decoder	MLP proj + autoregressive decoder, CE loss on CoT text
Visual Decoder	MLP proj + decoder, CE loss on IBQ visual tokens (131K codebook)
Loss	L = L_c + 1.0·L_l + 0.1·L_v
Training	Preliminary → Stage 0 (warmup) → Stage 1 (freeze main) → Stage 2 (joint E2E)
Inference	Decoders discarded, latent tokens prefilled, AR traj or MLP traj head
Key Result	First latent CoT to surpass explicit CoT on all benchmarks

Related Work

COCONUT (Hao et al., 2024): Continuous CoT in latent space with language-only supervision. Underperforms explicit CoT.
CODI (Shen et al., 2025): Continuous CoT via self-distillation. Also underperforms explicit CoT.
SIM-CoT (Ma et al., 2025): Supervised implicit CoT. Improves over no-CoT but not over explicit CoT.
LaST-VLA (Zheng et al., 2025): Prior SOTA on NAVSIM (87.30) using explicit CoT reasoning.

Related Lessons

Scaling Test-Time Compute — how inference-time reasoning scales with computation budget
VLA Foundry — training frameworks for Vision-Language-Action models
π₀ — flow matching for continuous robot control
Diffusion Policy — diffusion-based action generation for robotics

What is the core theoretical reason OneVL surpasses explicit CoT?

It uses a larger model
Dual-modal compression forces latent tokens to encode causal dynamics (scene physics), producing more generalizable representations than verbose symbolic reasoning
It generates more reasoning tokens

OneVL: One-Step Latent Reasoning