Li, Zhang, Yu, Luo, Gao, Yang, Wang, Han, Xue, Zhu, Shen, Xu — 2026

Causal World Modeling for Robot Control

Unify video prediction and action policy in a single autoregressive diffusion framework. The robot imagines the near future, then acts on what it sees — closing the loop between world understanding and motor control.

Prerequisites: Diffusion models + Transformers (attention) + Robotic manipulation basics
10 Chapters, 4 Simulations

Chapter 0: The Problem

You want a robot to make breakfast. It needs to grasp a plate, pick up bread, grab a kettle, pour water, and serve — a long sequence of precise actions, each depending on what happened before.

Today's leading approach is the Vision-Language-Action (VLA) model: take the current camera image, feed it through a big neural network, and output the next action. It's a direct, feedforward mapping: observation in, action out.

This works surprisingly well for simple tasks. But it has a deep flaw: representation entanglement. A single network must simultaneously learn three very different things:

All three are crammed into one supervision signal: "here's what the expert did." The model must compress high-dimensional visual semantics and low-dimensional motor commands into a shared representation. This leads to poor sample efficiency and brittle generalization.

The fundamental limitation: Feedforward VLAs are reactive. They see the current frame and output an action. They have no model of what will happen next. They cannot "imagine" the consequences of their actions before committing to them. When something unexpected happens mid-task, they have no mechanism to re-plan.

Think about how you pour water. You don't just react to the current visual frame. You predict where the water will go, you anticipate when the cup is full, and you adjust your pour angle in real time. You have a world model — an internal simulator of physics — and you use it constantly.

Feedforward VLA vs. World Model

A reactive VLA (left) maps observations directly to actions; a world model approach (right) first predicts the future, then decides what to do.

Why do feedforward VLAs struggle with long-horizon manipulation tasks?

Chapter 1: The Key Insight

LingBot-VA's core idea: combine a video world model and an action policy in one unified framework. Instead of mapping observations directly to actions, decompose the problem into two stages:

Stage 1: Visual Dynamics
Given what you've seen so far, predict what the world will look like next. Generate future video frames.
Stage 2: Inverse Dynamics
Given where you are now and where you want to be, figure out what action gets you there.

In math, a standard VLA learns:

a_t ~ π_θ( · | o_t )

LingBot-VA instead learns two things:

Stage 1:   o_{t+1} ~ p_θ( · | o_{≤t} )     Stage 2:   a_t ~ g_ψ( · | o_t, o_{t+1} )

Stage 1 predicts the next observation given history. Stage 2 infers actions from the desired visual transition. This decomposition is powerful because each stage can leverage different data:

Why this matters: By factoring the problem into "what will happen" and "what should I do," LingBot-VA can inherit rich physical priors from video pre-training. A model that has watched millions of cups being poured already knows how liquids behave — it just needs a small amount of robot data to learn the motor commands.
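
To make the two-stage factorization concrete, here is a minimal toy sketch in a one-dimensional world where an observation is a position and an action is a displacement. The function names, the `goal` argument (a stand-in for task conditioning), and the halfway-to-goal rule are all illustrative assumptions, not LingBot-VA's actual models.

```python
from typing import List

Observation = float  # toy stand-in for a camera frame or video latent
Action = float       # toy stand-in for a motor command


def predict_next_observation(history: List[Observation], goal: Observation) -> Observation:
    """Stage 1 (visual dynamics): imagine the next observation from the history.

    Toy rule: move halfway from the latest observation toward the goal.
    `goal` is a hypothetical stand-in for task conditioning.
    """
    current = history[-1]
    return current + 0.5 * (goal - current)


def infer_action(current: Observation, desired_next: Observation) -> Action:
    """Stage 2 (inverse dynamics): the action that realizes the desired transition.

    In this toy world an action is a displacement, so the answer is a difference.
    """
    return desired_next - current


def run_episode(start: Observation, goal: Observation, num_steps: int) -> List[Observation]:
    history = [start]
    for _ in range(num_steps):
        imagined = predict_next_observation(history, goal)
        action = infer_action(history[-1], imagined)
        history.append(history[-1] + action)  # toy world applies the action exactly
    return history


trajectory = run_episode(start=0.0, goal=1.0, num_steps=10)
```

The point of the split is visible even here: `predict_next_observation` never mentions actions, so it could in principle be learned from action-free video, while `infer_action` only needs pairs of consecutive observations with their action labels.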

But the real magic is that LingBot-VA doesn't actually separate these into two independent models. It interleaves video and action tokens into a single autoregressive sequence, processing both through a shared transformer. The two stages happen jointly, each informing the other. This is what makes it different from prior work that bolted a separate action head onto a video model.

What is the key advantage of decomposing robot control into visual dynamics prediction + inverse dynamics?

Chapter 2: Background

Before diving into LingBot-VA's architecture, we need three pieces of background.

Flow Matching (Continuous Diffusion)

Standard diffusion models add noise to data and learn to reverse the process. Flow matching is a cleaner formulation: it learns a velocity field that transports noise to data along a straight path.

Given a data sample x_1 and noise ε ~ N(0, I), define an interpolation path:

x(s) = (1 - s) ε + s · x_1

The true velocity along this path is simply x_1 - ε. The model learns to predict this velocity:

L_FM = E[ || v_θ(x(s), s) - (x_1 - ε) ||² ]

At inference, start from pure noise (s=0) and integrate the learned velocity field to s=1. The result is a sample from the data distribution. Flow matching is used in LingBot-VA for both video frame generation and action prediction.
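
The interpolation path, the velocity target, and the inference-time integration can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the helper names are made up, Euler integration is an assumed solver, and an oracle that already knows the true velocity stands in for the learned network.

```python
import random
from typing import Callable, List

Vector = List[float]


def interpolate(eps: Vector, x1: Vector, s: float) -> Vector:
    """Point on the straight path from noise (s=0) to data (s=1)."""
    return [(1.0 - s) * e + s * x for e, x in zip(eps, x1)]


def target_velocity(eps: Vector, x1: Vector) -> Vector:
    """True velocity along the path: data minus noise."""
    return [x - e for e, x in zip(eps, x1)]


def flow_matching_loss(v_pred: Vector, eps: Vector, x1: Vector) -> float:
    """Squared error between a predicted velocity and the true velocity."""
    return sum((p - t) ** 2 for p, t in zip(v_pred, target_velocity(eps, x1)))


def euler_integrate(velocity_fn: Callable[[Vector, float], Vector], x_start: Vector,
                    s_start: float, s_end: float, num_steps: int) -> Vector:
    """Integrate dx/ds = v(x, s) with fixed-size Euler steps."""
    x, s = list(x_start), s_start
    ds = (s_end - s_start) / num_steps
    for _ in range(num_steps):
        v = velocity_fn(x, s)
        x = [xi + ds * vi for xi, vi in zip(x, v)]
        s += ds
    return x


rng = random.Random(0)
x1 = [1.0, -2.0]                          # one data sample
eps = [rng.gauss(0.0, 1.0) for _ in x1]   # one noise sample


def oracle(x: Vector, s: float) -> Vector:
    """Stand-in for a perfectly trained network on this single (eps, x1) pair."""
    return target_velocity(eps, x1)


sample = euler_integrate(oracle, eps, 0.0, 1.0, num_steps=10)
halfway = euler_integrate(oracle, eps, 0.0, 0.5, num_steps=5)
```

Because the path is a straight line, integrating the oracle from s=0 to s=1 lands on x_1, and stopping at s=0.5 lands on the midpoint between noise and data.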

Autoregressive vs. Chunk-Based Generation

There are two ways to generate sequences of video frames for robot control:

Why causality matters for robots: In the real physical world, the future doesn't influence the past. If your robot bumps a cup at time t, the cup's new position at t+1 depends on the bump — not on what happens at t+2. Bidirectional models allow t+2 to influence t+1 within a chunk, which can produce physically inconsistent predictions. Causal models enforce the correct temporal direction.

KV Cache

In autoregressive transformers, each new token attends to all previous tokens. Without optimization, this means recomputing the keys and values for the entire history at every step. The KV cache stores the key and value vectors from previous steps, so only the new token's query, key, and value need to be computed; the new query then attends over the cached keys and values. This makes autoregressive generation efficient and gives the model persistent long-term memory: as long as earlier tokens stay in the cache, the model can still attend to what happened 100 steps ago.
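
A small single-head example shows that caching changes the cost but not the result. The projection functions below are fixed toy stand-ins for learned weight matrices (an assumption for brevity), and the tokens are two-dimensional.

```python
import math
from typing import List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def attend(query: Vector, keys: List[Vector], values: List[Vector]) -> Vector:
    """Softmax attention of one query over a list of keys and values."""
    scale = math.sqrt(len(query))
    scores = [dot(query, k) / scale for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(len(values[0]))]


# Toy stand-ins for the learned query, key, and value projections (assumptions).
def to_query(token: Vector) -> Vector:
    return list(token)


def to_key(token: Vector) -> Vector:
    return [0.5 * x for x in token]


def to_value(token: Vector) -> Vector:
    return [2.0 * x for x in token]


def decode_without_cache(tokens: List[Vector]) -> List[Vector]:
    outputs = []
    for t in range(len(tokens)):
        # Keys and values for the whole prefix are recomputed at every step.
        keys = [to_key(tok) for tok in tokens[: t + 1]]
        values = [to_value(tok) for tok in tokens[: t + 1]]
        outputs.append(attend(to_query(tokens[t]), keys, values))
    return outputs


def decode_with_cache(tokens: List[Vector]) -> List[Vector]:
    cached_keys: List[Vector] = []
    cached_values: List[Vector] = []
    outputs = []
    for token in tokens:
        # Only the new token is projected; earlier keys and values are reused.
        cached_keys.append(to_key(token))
        cached_values.append(to_value(token))
        outputs.append(attend(to_query(token), cached_keys, cached_values))
    return outputs


tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
with_cache = decode_with_cache(tokens)
without_cache = decode_without_cache(tokens)
```

Both functions return the same outputs; the cached version performs one key and one value projection per step instead of one per token in the prefix.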

Why is autoregressive (causal) generation preferred over bidirectional chunk generation for robot control?

Chapter 3: Shared Latent Space

LingBot-VA's first key design: unify vision and action tokens in a single sequence, processed by a Mixture-of-Transformers (MoT) architecture.

Token Interleaving

Video frames are encoded into latent tokens using a causal Video VAE (from Wan2.2). Each frame produces N = 192 spatial tokens. Actions are projected into the same embedding space via a small MLP. The tokens are interleaved in temporal order:

[ z_0, a_{0,1}, a_{0,2}, ..., a_{0,τ}, z_1, a_{1,1}, ..., a_{1,τ}, z_2, ... ]

Here τ = 4 is the temporal downsampling factor. For each video frame, there are 4 action tokens, because actions run at higher frequency (50 Hz) than video (12.5 Hz). Predicting K video frames means generating 4K actions.
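
The interleaving order and the resulting token counts can be checked with a short sketch. The string labels and function names are illustrative; the constants (192 video tokens per frame, τ = 4, 50 Hz and 12.5 Hz) are the ones quoted above.

```python
from typing import List

TOKENS_PER_FRAME = 192   # spatial tokens per video frame
TAU = 4                  # action tokens per video frame
ACTION_HZ = 50.0
VIDEO_HZ = 12.5


def interleave(num_frames: int, tau: int = TAU) -> List[str]:
    """Temporal order of the sequence, with one label per frame or action."""
    sequence: List[str] = []
    for t in range(num_frames):
        sequence.append(f"z{t}")                                  # frame t (192 tokens)
        sequence.extend(f"a{t},{i}" for i in range(1, tau + 1))   # its tau actions
    return sequence


def count_tokens(num_frames: int, tau: int = TAU) -> int:
    """Total transformer tokens for a chunk of num_frames frames."""
    return num_frames * (TOKENS_PER_FRAME + tau)


order = interleave(2)
chunk_tokens = count_tokens(4)         # K = 4 frames
actions_per_frame = ACTION_HZ / VIDEO_HZ
```

For a chunk of K = 4 frames this gives 4 × (192 + 4) = 784 tokens, of which only 16 are action tokens, which is why video denoising dominates the cost.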

Mixture-of-Transformers

The key challenge: vision and action are very different modalities. Video tokens are high-dimensional (capturing rich spatial information), while action tokens are low-dimensional (7 DoF per arm). Sharing all parameters would force the model to compress both into the same representation.

MoT solves this with separate expert parameters per modality:

The video stream is large (d_v = 3072, initialized from Wan2.2-5B). The action stream is small (d_a = 768, 4× smaller). This asymmetric design reflects the fact that action distributions are inherently simpler than visual data.

For joint attention, action tokens are projected up to the video dimension, participate in shared self-attention, then projected back down. A residual connection preserves action-specific features.
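
The up-project, attend, down-project, residual pattern can be sketched with tiny dimensions (d_v = 4, d_a = 2). This is a heavily simplified illustration: the attention uses each token directly as query, key, and value instead of per-modality learned projections, the residual on the video stream is an assumption, and all weights are made-up numbers.

```python
import math
from typing import List, Tuple

Vector = List[float]
Matrix = List[Vector]


def matvec(matrix: Matrix, vector: Vector) -> Vector:
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def softmax(scores: List[float]) -> List[float]:
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens: List[Vector]) -> List[Vector]:
    """Single-head attention with Q = K = V = token (a simplification)."""
    dim = len(tokens[0])
    outputs = []
    for q in tokens:
        weights = softmax([sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in tokens])
        outputs.append([sum(w * t[d] for w, t in zip(weights, tokens)) for d in range(dim)])
    return outputs


def mot_layer(video: List[Vector], actions: List[Vector],
              w_up: Matrix, w_down: Matrix) -> Tuple[List[Vector], List[Vector]]:
    lifted = [matvec(w_up, a) for a in actions]        # d_a -> d_v
    mixed = self_attention(video + lifted)             # shared attention over both modalities
    video_mixed, action_mixed = mixed[: len(video)], mixed[len(video):]
    video_out = [[x + y for x, y in zip(v, m)] for v, m in zip(video, video_mixed)]
    action_out = [[x + y for x, y in zip(a, matvec(w_down, m))]   # d_v -> d_a, plus residual
                  for a, m in zip(actions, action_mixed)]
    return video_out, action_out


video = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
actions = [[1.0, -1.0]]
w_up = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [0.0, 0.0]]      # shape d_v x d_a
w_down = [[0.1, 0.0, 0.0, 0.0], [0.0, 0.1, 0.0, 0.0]]        # shape d_a x d_v
video_out, action_out = mot_layer(video, actions, w_up, w_down)
```

The residual is what keeps action-specific features intact: if `w_down` is all zeros, the action tokens pass through unchanged regardless of what the shared attention computed.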

Why MoT, not just a shared transformer? Video tokens need expressive high-dimensional features to capture spatial detail. Action tokens only need to represent 7 numbers per arm. If you force them through the same projection layers, either the action stream wastes capacity or the video stream is constrained. MoT gives each modality its own "expert" while still allowing cross-modal communication through shared attention.

Mixture-of-Transformers Architecture

Vision tokens (blue) and action tokens (orange) pass through separate expert projections, then share attention. The figure traces how information flows through one MoT layer.


Action Network Initialization

Training the action stream from scratch is unstable. The action tokens' initial output distribution diverges from the video distribution, disrupting joint attention. LingBot-VA initializes the action network by interpolating the pretrained video weights to the smaller action dimension, scaled by √(d_v/d_a) to preserve output variance. This ensures both streams start with comparable distributions.

Why does LingBot-VA use separate QKV projections for vision and action tokens instead of sharing all parameters?

Chapter 4: Causal Autoregressive Generation

This is the heart of LingBot-VA. At each autoregressive step, the model generates a chunk of K future video frames via flow matching, then decodes the corresponding actions via inverse dynamics, conditioned on those frames. Let's trace through exactly how this works.

Step-by-Step Generation

Given observation history z_{≤t} and action history a_{<t}:

  1. Sample noise: Draw ε ~ N(0, I) for the next K latent frames
  2. Denoise video chunk: Integrate the learned velocity field from s=0 to s=0.5 (partial denoising; we'll explain why shortly):
    ẑ_{t+1:t+K} = ε + ∫_0^{0.5} v_θ(z(s), s | C) ds
  3. Decode actions: Sample new noise for action tokens, then integrate from s=0 to s=1:
    a_{t:t+K-1} = ε + ∫_0^1 v_ψ(a(s), s | ẑ_{t:t+K}, C) ds
  4. Execute actions and collect new real-world observations
  5. Update KV cache with the new tokens and repeat
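
Steps 1 to 3 above can be sketched as a single function. Everything here is a toy under stated assumptions: each latent frame and each action is one scalar, Euler integration is the assumed solver, and the two velocity functions are hand-written stand-ins for the learned video and action networks (they pull toward a fixed target instead of being trained). Steps 4 and 5 happen outside this function.

```python
import random
from typing import Callable, List, Tuple

Vector = List[float]
VelocityFn = Callable[[Vector, float], Vector]


def integrate(velocity_fn: VelocityFn, x: Vector, s_end: float, num_steps: int) -> Vector:
    """Euler integration of dx/ds = v(x, s) from s = 0 up to s_end."""
    ds = s_end / num_steps
    s = 0.0
    for _ in range(num_steps):
        v = velocity_fn(x, s)
        x = [xi + ds * vi for xi, vi in zip(x, v)]
        s += ds
    return x


def generate_step(video_velocity: VelocityFn,
                  make_action_velocity: Callable[[Vector], VelocityFn],
                  k_frames: int, tau: int, rng: random.Random,
                  video_stop: float = 0.5, num_steps: int = 5) -> Tuple[Vector, Vector]:
    # 1. Sample noise for the next K latent frames.
    video_noise = [rng.gauss(0.0, 1.0) for _ in range(k_frames)]
    # 2. Partially denoise the video chunk (stop at s = 0.5).
    video_estimate = integrate(video_velocity, video_noise, video_stop, num_steps)
    # 3. Fresh noise for tau * K actions, fully denoised, conditioned on the video estimate.
    action_noise = [rng.gauss(0.0, 1.0) for _ in range(tau * k_frames)]
    actions = integrate(make_action_velocity(video_estimate), action_noise, 1.0, num_steps)
    return video_estimate, actions


def toy_video_velocity(z: Vector, s: float) -> Vector:
    """Hypothetical video model: pulls every frame toward the target value 1.0."""
    return [(1.0 - zi) / (1.0 - s) for zi in z]


def toy_action_velocity_factory(video: Vector, tau: int = 4) -> VelocityFn:
    """Hypothetical inverse dynamics: each action targets its frame's value divided by tau."""
    def velocity(a: Vector, s: float) -> Vector:
        return [(video[i // tau] / tau - ai) / (1.0 - s) for i, ai in enumerate(a)]
    return velocity


video_estimate, actions = generate_step(toy_video_velocity, toy_action_velocity_factory,
                                        k_frames=4, tau=4, rng=random.Random(0))
```

With K = 4 and τ = 4 the step returns 4 partially denoised frame latents and 16 actions. The video latents end halfway between their noise and the target, while the actions are integrated all the way to s = 1.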

Causal Attention Masking

Within the interleaved sequence, a block-causal attention mask ensures that each token can only attend to tokens in its own chunk or in earlier chunks. Within a chunk, tokens can attend to each other (bidirectional within the chunk), but across chunks, attention is strictly causal: no token sees a later chunk. This preserves the temporal arrow of physical causality.
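
A mask with this shape is easy to build from per-token chunk indices. The sketch below is illustrative: the function name and the boolean convention (True means "may attend") are assumptions.

```python
from typing import List


def block_causal_mask(chunk_ids: List[int]) -> List[List[bool]]:
    """mask[i][j] is True when token i may attend to token j.

    Tokens see everything in their own chunk and in earlier chunks,
    and nothing in later chunks.
    """
    return [[key_chunk <= query_chunk for key_chunk in chunk_ids]
            for query_chunk in chunk_ids]


# Two tokens in chunk 0 followed by three tokens in chunk 1.
mask = block_causal_mask([0, 0, 1, 1, 1])
```

In this example the first token can attend to the second (same chunk) but not to the third (later chunk), while every token in chunk 1 can attend to all five tokens.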

Teacher Forcing During Training

During training, the model uses ground-truth tokens as context (teacher forcing). This is unusually well-suited for robotics: unlike pure generative modeling where teacher forcing creates train-test mismatch, robot policies naturally receive real observations during deployment. The training and deployment regimes match.

Noisy History Augmentation

The biggest bottleneck is video denoising — video tokens vastly outnumber action tokens, and each requires multiple denoising steps. Key insight: action decoding doesn't need pixel-perfect video reconstruction. The inverse dynamics model can extract action-relevant information from partially noisy video states.

During training, with probability 0.5, the video history is augmented with noise at a random flow time s_aug ∈ [0.5, 1.0]. This trains the action decoder to be robust to partially denoised inputs. At inference, video tokens only need to be denoised to s=0.5 instead of s=1.0, which halves the denoising computation while maintaining action prediction quality.
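
The augmentation itself is a one-line use of the flow-matching interpolation path. The sketch below assumes a single flow time shared by the whole history and treats each latent as one scalar; both choices, and the function name, are illustrative assumptions.

```python
import random
from typing import List, Optional, Tuple


def augment_history(latents: List[float], rng: random.Random,
                    p_aug: float = 0.5, s_min: float = 0.5,
                    s_max: float = 1.0) -> Tuple[List[float], Optional[float]]:
    """With probability p_aug, move clean latents back along the noise-to-data path.

    Returns the (possibly noised) latents and the flow time used,
    or None when the history was left clean.
    """
    if rng.random() >= p_aug:
        return list(latents), None
    s_aug = rng.uniform(s_min, s_max)
    # Same path as flow matching: (1 - s) * noise + s * data.
    noisy = [(1.0 - s_aug) * rng.gauss(0.0, 1.0) + s_aug * z for z in latents]
    return noisy, s_aug


rng = random.Random(0)
clean_history = [0.2, -0.4, 1.0]
draws = [augment_history(clean_history, rng) for _ in range(2000)]
augmented_fraction = sum(1 for _, s in draws if s is not None) / len(draws)
```

Because s_aug never drops below 0.5, the decoder only ever sees histories that are at least half denoised, which matches the s = 0.5 stopping point used at inference.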

The core tradeoff: Full video denoising produces beautiful frames but is computationally expensive. For robot control, we don't need beautiful frames — we need accurate actions. Partial denoising preserves the semantic structure that actions depend on while cutting compute in half.

Autoregressive Rollout

The model generates frame by frame with action conditioning, in either open-loop mode (no feedback) or closed-loop mode (real observations injected). Open-loop predictions drift over time.

Why does LingBot-VA only denoise video tokens to s=0.5 instead of s=1.0 during inference?

Chapter 5: Closed-Loop Rollout

Here's a problem with any world model: predicted frames drift from reality over time. Even small errors compound. After 10 steps of open-loop prediction, the model's imagined world may look nothing like the real one. The robot is acting on a hallucination.

Open-Loop vs. Closed-Loop

Open-loop: Generate an entire trajectory of predicted frames, then execute all the corresponding actions. The model never gets corrected by reality. If the first prediction is slightly off, every subsequent prediction builds on that error.

Closed-loop: After executing each action chunk, replace the model's predicted frames with the actual observation from the robot's camera. The model is continuously re-grounded in reality.

Think of it like driving. Open-loop is like closing your eyes at the start of a road and driving based on your mental model of where the road goes. Closed-loop is normal driving — you keep your eyes open and correct as you go. For long-horizon tasks, closed-loop is essential.

How LingBot-VA Closes the Loop

At each autoregressive step:

  1. The model predicts the next K frames and the corresponding τK actions
  2. The robot executes those actions and captures real observations
  3. The real observations are encoded into latent tokens via the Video VAE
  4. These ground-truth tokens replace the model's predictions in the KV cache
  5. The next prediction step conditions on the corrected history

This is natural for autoregressive models: the KV cache simply gets the real tokens instead of predicted ones. No special mechanism is needed — it's the same teacher-forcing used during training.
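
A toy one-dimensional rollout makes the difference measurable. The world below adds a small disturbance that the model does not know about; the only difference between the two modes is whether the model's history receives the real observation or its own prediction. All names and numbers are illustrative assumptions.

```python
from typing import List


def rollout(num_steps: int, closed_loop: bool,
            bias: float = 0.05, action: float = 0.1) -> List[float]:
    """Return the per-step gap between the predicted and the real next state."""
    real = 0.0
    belief = 0.0   # what the model's cached history says the state is
    errors = []
    for _ in range(num_steps):
        predicted = belief + action    # the model imagines the next state
        real = real + action + bias    # the world includes an unmodeled disturbance
        errors.append(abs(predicted - real))
        # Closed loop: the history gets the real observation.
        # Open loop: the history keeps the model's own prediction.
        belief = real if closed_loop else predicted
    return errors


open_loop_errors = rollout(10, closed_loop=False)
closed_loop_errors = rollout(10, closed_loop=True)
```

In open-loop mode the prediction error grows by one disturbance per step, reaching about 0.5 after 10 steps. In closed-loop mode it stays at one disturbance (0.05) at every step, because each prediction starts from a real observation.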

Why Chunk-Based Methods Struggle Here

Chunk-based bidirectional models generate entire segments at once. Injecting a ground-truth observation mid-chunk would require re-generating the entire chunk, because bidirectional attention means every token in the chunk depends on every other. LingBot-VA's causal structure means you can simply append the real observation and continue generating from there.

Training-deployment alignment: During training, teacher forcing provides ground-truth context. During deployment, closed-loop rollout provides ground-truth observations. The two regimes match perfectly — the model sees the same kind of input distribution in both cases. This is a rare case where teacher forcing doesn't create distribution mismatch.
Why does closed-loop rollout align naturally with LingBot-VA's autoregressive formulation?

Chapter 6: Asynchronous Inference

Even with partial denoising and KV cache, autoregressive video-action generation takes time. If the robot has to wait for the model to finish predicting before it can move, there's a delay between seeing the world and acting on it. For real-time control at 50 Hz, this delay can be catastrophic.

The Synchronous Problem

In synchronous inference: observe → predict → execute → observe → predict → execute. The robot is idle while the model computes, and the model is idle while the robot moves. If prediction and execution take similar amounts of time, roughly half of the wall-clock time is wasted.

The Asynchronous Solution

Pipeline the computation: while the robot executes action chunk a_t, the model simultaneously predicts the next chunk a_{t+1}. When the robot finishes executing, the next actions are already available, so the robot never waits as long as prediction takes no longer than execution.
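
The timing benefit follows from simple arithmetic. The sketch below assumes fixed per-chunk prediction and execution times in milliseconds and that predicting the next chunk can start as soon as the current chunk starts executing; the function names and the 200 ms figures are illustrative.

```python
def synchronous_time(num_chunks: int, predict_ms: int, execute_ms: int) -> int:
    """Predict, then execute, strictly one after the other."""
    return num_chunks * (predict_ms + execute_ms)


def pipelined_time(num_chunks: int, predict_ms: int, execute_ms: int) -> int:
    """Predict chunk i+1 while chunk i is executing."""
    if num_chunks == 0:
        return 0
    # The first prediction cannot overlap with anything, and the last execution
    # has nothing to overlap with. Each stage in between takes as long as the
    # slower of the two activities.
    return predict_ms + (num_chunks - 1) * max(predict_ms, execute_ms) + execute_ms


sync_total = synchronous_time(10, predict_ms=200, execute_ms=200)    # 4000 ms
async_total = pipelined_time(10, predict_ms=200, execute_ms=200)     # 2200 ms
```

When the two times are equal, the speedup approaches 2× as the number of chunks grows. When prediction is slower than execution, the robot still waits, but only for the difference.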

The FDM Grounding Trick

There's a subtle problem with naive asynchronous inference. When predicting a_{t+1}, the model doesn't have the real observation at time t (the robot hasn't finished executing yet). It has to use its own predicted frame ẑ_t. But this predicted frame might be stale or inaccurate.

A naive approach just uses the stale prediction. But the video model tends to "continue" its own hallucinated video rather than staying grounded in reality. Over time, the model drifts into open-loop mode.

LingBot-VA fixes this with a Forward Dynamics Model (FDM) grounding step:

  1. Wait for the most recent real observation z_{t-1} (from the previous execution)
  2. Use the model to "imagine" what z_t looks like after applying action a_t to real observation z_{t-1}
  3. Use this FDM-grounded prediction (instead of the stale one) as context for predicting a_{t+1}

This re-grounds the model in real observations at every step, even though there's a one-step delay.
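
The effect of grounding on control can be seen in a toy one-dimensional task where the robot must reach a goal position but slips slightly on every move. The forward dynamics model ignores the slippage. The proportional policy, the slippage value, and all names are illustrative assumptions, not the paper's setup.

```python
def run_async(num_steps: int, grounded: bool, goal: float = 1.0,
              bias: float = -0.05, gain: float = 0.5) -> float:
    """Return the final distance between the real state and the goal."""
    real = 0.0      # true state z
    context = 0.0   # the model's estimate of the current state
    action = gain * (goal - context)
    for _ in range(num_steps):
        previous_real = real
        real = real + action + bias    # the world: each move slips by `bias`
        if grounded:
            # FDM applied to the last real observation z_{t-1} and action a_t.
            context = previous_real + action
        else:
            # Naive async: FDM applied to the model's own stale estimate.
            context = context + action
        action = gain * (goal - context)   # next action planned from the context
    return abs(goal - real)


naive_error = run_async(20, grounded=False)
grounded_error = run_async(20, grounded=True)
```

The naive variant believes it has reached the goal and stops correcting, while the real state keeps slipping away (final error about 1.0 after 20 steps). The grounded variant is always one step behind reality, so it settles with a small bounded error (about 0.15) instead of drifting.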

The result: Asynchronous inference with FDM grounding achieves comparable success rates to synchronous inference while completing tasks 2× faster. The naive async approach (without FDM grounding) degrades significantly, especially on long-horizon tasks (32.9% vs 93.2% at Horizon=3).
Why does naive asynchronous inference (without FDM grounding) degrade on long-horizon tasks?

Chapter 7: Training

LingBot-VA follows a two-phase training pipeline: massive-scale pre-training on diverse data, then lightweight post-training on specific robot tasks.

Phase 1: Pre-Training

The backbone is Wan2.2-5B, a large-scale pretrained video generation model. The action stream (350M parameters) is added on top, bringing the total to 5.3B parameters.

Pre-training data comes from two sources:

The model is pre-trained for 1.4 trillion tokens using AdamW with cosine annealing, bfloat16 mixed precision, and classifier-free guidance.

Unified Action Representation

Different robots have different action spaces. LingBot-VA defines a universal dual-arm representation: each arm gets 7 end-effector pose dimensions + 7 joint angle dimensions + 1 gripper dimension = 15 per arm, 30 total. Robots with fewer degrees of freedom get zero-padded.
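
The padding scheme can be sketched as follows. The dimension counts come from the text; the ordering of fields inside the vector (pose, then joints, then gripper, left arm before right arm) and the function names are assumptions made for illustration.

```python
from typing import List, Optional, Sequence

EEF_DIMS = 7       # end-effector pose
JOINT_DIMS = 7     # joint angles
GRIPPER_DIMS = 1
PER_ARM = EEF_DIMS + JOINT_DIMS + GRIPPER_DIMS   # 15
NUM_ARMS = 2


def zero_pad(values: Sequence[float], size: int) -> List[float]:
    if len(values) > size:
        raise ValueError(f"expected at most {size} values, got {len(values)}")
    return list(values) + [0.0] * (size - len(values))


def pack_arm(eef_pose: Sequence[float], joints: Sequence[float], gripper: float) -> List[float]:
    """One arm: pose, joints (zero-padded to 7), then gripper."""
    return zero_pad(eef_pose, EEF_DIMS) + zero_pad(joints, JOINT_DIMS) + [gripper]


def pack_action(left: List[float], right: Optional[List[float]] = None) -> List[float]:
    """Two arms side by side; a missing arm is filled with zeros."""
    missing_arm = [0.0] * PER_ARM
    return left + (right if right is not None else missing_arm)


# A hypothetical single-arm robot with only 6 joints.
left_arm = pack_arm(eef_pose=[0.1, 0.2, 0.3, 0.0, 0.0, 0.0, 1.0],
                    joints=[0.0, 0.1, 0.2, 0.3, 0.4, 0.5],
                    gripper=1.0)
action = pack_action(left_arm)
```

The resulting vector always has 30 entries, so the same action projection can be shared across single-arm and dual-arm robots.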

Variable Chunk Size

During training, the chunk size K is randomly sampled from [1, 4]. This teaches the model to generate coherent predictions at different temporal horizons. At inference, K=4 is used as a practical tradeoff between efficiency and responsiveness.

Phase 2: Post-Training

Adapting to a new robot platform requires remarkably little data. With just 50 demonstrations per task, the model is fine-tuned for 3K steps at a reduced learning rate (10⁻⁵). This is sufficient for effective deployment.

Data efficiency: With 50 demonstrations, LingBot-VA achieves 97.0% on LIBERO — competitive with methods trained on far more data. With just 10 demonstrations, it still reaches 81.7%. The video pre-training provides such strong physical priors that very little robot-specific data is needed to ground them.

Training Objective

The total loss combines two flow matching objectives:

L = L_dyn + λ L_inv

L_dyn supervises video velocity field prediction (visual dynamics). L_inv supervises action velocity field prediction (inverse dynamics), conditioned on current and next observations. Both use the same flow matching framework, with λ=1.

Why can LingBot-VA achieve strong performance with only 50 demonstrations per task?

Chapter 8: Results

LingBot-VA is evaluated on both simulation benchmarks and real-world robot tasks. The results are strong across the board.

Real-World Deployment (6 Tasks)

Six manipulation tasks spanning three categories, each with only 50 demonstrations:

Task | Category | LingBot-VA | π0.5
Make Breakfast | Long-horizon | 7 steps | ~5 steps
Unpack Delivery | Long-horizon | Best | Lower
Insert Tubes | Precision | Best | Lower
Pick Screws | Precision | Best | Lower
Fold Clothes | Deformable | Best | Lower
Fold Pants | Deformable | Best | Lower

LingBot-VA outperforms π0.5 on all six tasks and both metrics (success rate and progress score). The strongest gains appear on long-horizon tasks, validating the temporal memory advantage of autoregressive world modeling.

Simulation: RoboTwin 2.0 (50 Bimanual Tasks)

Method | Easy (avg) | Hard (avg)
X-VLA | 72.9 | 72.8
π0 | 65.9 | 58.4
π0.5 | 82.7 | 76.8
Motus | 88.7 | 87.0
LingBot-VA | 92.9 (+4.2) | 91.6 (+4.6)

The improvement grows with task horizon: at Horizon=3 (3-step tasks), LingBot-VA gains +8.2% (Easy) and +9.1% (Hard) over the next best method.

Simulation: LIBERO (4 Suites)

Method | Spatial | Object | Goal | Long | Avg
X-VLA | 98.2 | 98.6 | 97.8 | 97.6 | 98.1
π0 | 97.6 | 98.4 | 97.9 | 85.2 | 97.1
LingBot-VA | 98.5 | 99.6 | 97.2 | 98.5 | 98.5

Data Efficiency

On LIBERO with varying numbers of demonstrations:

Results Comparison: RoboTwin 2.0

Average success rate across 50 bimanual tasks. LingBot-VA (rightmost) consistently outperforms all baselines, with the gap widening on harder settings.

Why long-horizon gains are largest: On short tasks (Horizon=1), all methods perform well — even reactive policies can handle one-step manipulation. But as tasks grow longer, methods without persistent memory and closed-loop correction drift. LingBot-VA's KV cache maintains full context, and its closed-loop rollout continuously re-grounds predictions in reality.
Where does LingBot-VA show the largest improvement over prior methods?

Chapter 9: Connections

LingBot-VA sits at the intersection of several major threads in robotics and generative modeling. Let's map where it fits.

Relation to π0 / π0.5

π0 is a flow-matching VLA that generates action chunks conditioned on visual observations. It's a feedforward policy — no world model, no video prediction. LingBot-VA extends this paradigm by adding an explicit video world model that predicts future visual states, enabling the action decoder to reason about consequences rather than just react.

Relation to UWM (Unified World Models)

UWM uses bidirectional diffusion within chunks to jointly generate video and actions. LingBot-VA's key departure is causal autoregressive generation across chunks, which provides persistent memory (KV cache) and natural closed-loop correction. UWM's bidirectional chunks can't easily incorporate real-time feedback mid-generation.

Relation to Diffusion Policy

Diffusion Policy uses diffusion to generate action sequences, but without any video prediction component. It's purely a policy model. LingBot-VA adds the "imagination" layer — predicting what the world will look like — which provides richer conditioning for action generation.

Relation to Video World Models

Video prediction models like Genie and UniSim can imagine future frames, but they're not designed for real-time robot control. LingBot-VA bridges this gap by making video prediction fast enough (via partial denoising and async inference) and tight enough (via MoT and inverse dynamics) for closed-loop manipulation.

Cheat Sheet

Aspect | LingBot-VA
Core idea | Video world model + action policy in one AR diffusion framework
Architecture | Mixture-of-Transformers (5B video + 350M action)
Generation | Autoregressive chunks, flow matching per chunk
Inference trick | Partial denoising (s=0.5) + async FDM grounding
Loop | Closed-loop: real observations replace predictions each step
Pre-training | Internet video + 16K hrs robot data, 1.4T tokens
Post-training | 50 demos, 3K steps (sufficient for new tasks)
Key result | 92.9% on RoboTwin (50 tasks), 98.5% on LIBERO (avg)
Advantage | Long-horizon: +8-9% over best baseline at Horizon=3

The broader lesson: Combining generation and control in one model — rather than treating them as separate modules — enables the model to use its "imagination" of the future as a rich conditioning signal for action prediction. The causal autoregressive structure is what makes this practical: it provides persistent memory, respects physical causality, and enables efficient closed-loop correction with real-world feedback.
What is LingBot-VA's key architectural departure from chunk-based video-action models like UWM?