Unify video prediction and action policy in a single autoregressive diffusion framework. The robot imagines the near future, then acts on what it sees — closing the loop between world understanding and motor control.
You want a robot to make breakfast. It needs to grasp a plate, pick up bread, grab a kettle, pour water, and serve — a long sequence of precise actions, each depending on what happened before.
Today's leading approach is the Vision-Language-Action (VLA) model: take the current camera image, feed it through a big neural network, and output the next action. It's a direct, feedforward mapping: observation in, action out.
This works surprisingly well for simple tasks. But it has a deep flaw: representation entanglement. A single network must simultaneously learn three very different things:
All three are crammed into one supervision signal: "here's what the expert did." The model must compress high-dimensional visual semantics and low-dimensional motor commands into a shared representation. This leads to poor sample efficiency and brittle generalization.
Think about how you pour water. You don't just react to the current visual frame. You predict where the water will go, you anticipate when the cup is full, and you adjust your pour angle in real time. You have a world model — an internal simulator of physics — and you use it constantly.
Figure: a reactive VLA (left) maps observations directly to actions, while a world model approach (right) first predicts the future, then decides what to do.
LingBot-VA's core idea: combine a video world model and an action policy in one unified framework. Instead of mapping observations directly to actions, decompose the problem into two stages:
In math, a standard VLA learns:
LingBot-VA instead learns two things:
Stage 1 predicts the next observation given history. Stage 2 infers actions from the desired visual transition. This decomposition is powerful because each stage can leverage different data:
But the real magic is that LingBot-VA doesn't actually separate these into two independent models. It interleaves video and action tokens into a single autoregressive sequence, processing both through a shared transformer. The two stages happen jointly, each informing the other. This is what makes it different from prior work that bolted a separate action head onto a video model.
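In symbols, the contrast between the reactive policy and the two-stage decomposition can be sketched as follows. The notation is an assumption made here for illustration: o_t is the observation at step t and a_t the action, and the exact conditioning sets used by LingBot-VA may differ.

$$
\text{Reactive VLA:}\quad a_t \sim \pi_\theta(a_t \mid o_t)
$$

$$
\text{Stage 1:}\quad o_{t+1} \sim p_\theta(o_{t+1} \mid o_{\le t},\, a_{<t})
\qquad
\text{Stage 2:}\quad a_t \sim p_\theta(a_t \mid o_t,\, o_{t+1})
$$

Stage 1 is a prediction about the world and Stage 2 is an inverse dynamics problem, which is why the two can draw on different kinds of data.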
Before diving into LingBot-VA's architecture, we need three pieces of background.
Standard diffusion models add noise to data and learn to reverse the process. Flow matching is a cleaner formulation: it learns a velocity field that transports noise to data along a straight path.
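Given a data sample x_1 and noise ε ~ N(0, I), define an interpolation path over flow time s ∈ [0, 1], with pure noise at s = 0 and data at s = 1:
$$x_s = (1 - s)\,\varepsilon + s\,x_1$$
The true velocity along this path is simply x_1 - ε. The model v_θ learns to predict this velocity:
$$\mathcal{L} = \mathbb{E}_{s,\,\varepsilon,\,x_1}\left[\lVert v_\theta(x_s, s) - (x_1 - \varepsilon) \rVert^2\right]$$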
At inference, start from pure noise (s=0) and integrate the learned velocity field to s=1. The result is a sample from the data distribution. Flow matching is used in LingBot-VA for both video frame generation and action prediction.
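The path, the velocity target, and the inference-time integration can be tried out on a toy example. This is a minimal sketch, not LingBot-VA's implementation: the function names are made up for illustration, and an `oracle` that already knows the true velocity stands in for a trained network.

```python
import numpy as np

def interpolate(x1, eps, s):
    """Point on the straight path from noise (s=0) to data (s=1)."""
    return (1.0 - s) * eps + s * x1

def target_velocity(x1, eps):
    """Velocity of the straight path; it does not depend on s."""
    return x1 - eps

def flow_matching_loss(predict_velocity, x1, eps, s):
    """Mean squared error between predicted and true velocity at x_s."""
    x_s = interpolate(x1, eps, s)
    diff = predict_velocity(x_s, s) - target_velocity(x1, eps)
    return float(np.mean(diff ** 2))

def euler_sample(predict_velocity, eps, num_steps=10):
    """Integrate dx/ds = v(x, s) from s=0 (noise) to s=1 with Euler steps."""
    x = eps.copy()
    ds = 1.0 / num_steps
    for i in range(num_steps):
        x = x + ds * predict_velocity(x, i * ds)
    return x

rng = np.random.default_rng(0)
x1 = np.array([2.0, -1.0, 0.5])   # a single "data" sample
eps = rng.standard_normal(3)       # the noise sample paired with it

# Stand-in for a trained network: an oracle that knows this pair's velocity.
oracle = lambda x, s: x1 - eps

loss = flow_matching_loss(oracle, x1, eps, s=0.3)   # zero for the oracle
sample = euler_sample(oracle, eps)                   # lands on x1
```

Because the path is straight, the velocity is constant along it, so even a handful of Euler steps recovers the data point when the velocity is exact. A learned velocity field is only approximately straight, which is why the number of integration steps matters in practice.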
There are two ways to generate sequences of video frames for robot control:
In autoregressive transformers, each new token attends to all previous tokens. Without optimization, this means recomputing keys and values for the entire history at every step. The KV cache stores the key and value vectors from previous steps, so each step only computes the query, key, and value for the new token and attends over the cached entries. This makes autoregressive generation efficient and gives the model persistent long-term memory: as long as an entry stays in the cache, the model can still attend to what happened 100 steps ago.
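To make the caching idea concrete, here is a minimal single-head sketch in NumPy. It is a generic illustration rather than LingBot-VA's code, and the class and function names are invented for this example. It checks that stepping token by token with a cache gives the same outputs as recomputing full causal attention.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def full_causal_attention(X, Wq, Wk, Wv):
    """Recompute attention over the whole sequence with a causal mask."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    T = X.shape[0]
    mask = np.tril(np.ones((T, T), dtype=bool))   # token i sees tokens 0..i
    scores = np.where(mask, scores, -np.inf)
    return softmax(scores) @ V

class KVCache:
    """Stores keys and values of past tokens so they are never recomputed."""

    def __init__(self):
        self.keys, self.values = [], []

    def step(self, x_new, Wq, Wk, Wv):
        # Only the new token is projected; history comes from the cache.
        q = x_new @ Wq
        self.keys.append(x_new @ Wk)
        self.values.append(x_new @ Wv)
        K = np.stack(self.keys)
        V = np.stack(self.values)
        scores = K @ q / np.sqrt(q.shape[-1])
        return softmax(scores) @ V

rng = np.random.default_rng(0)
d = 8
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))
X = rng.standard_normal((5, d))

cache = KVCache()
incremental = np.stack([cache.step(x, Wq, Wk, Wv) for x in X])
full = full_causal_attention(X, Wq, Wk, Wv)
```

Each call to `step` costs one projection of the new token plus one pass over the cached keys, instead of reprojecting the full history.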
LingBot-VA's first key design: unify vision and action tokens in a single sequence, processed by a Mixture-of-Transformers (MoT) architecture.
Video frames are encoded into latent tokens using a causal Video VAE (from Wan2.2). Each frame produces N = 192 spatial tokens. Actions are projected into the same embedding space via a small MLP. The tokens are interleaved in temporal order:
Here τ = 4 is the temporal downsampling factor. For each video frame, there are 4 action tokens, because actions run at higher frequency (50 Hz) than video (12.5 Hz). Predicting K video frames means generating 4K actions.
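The token bookkeeping is easy to check with a few lines of Python. The counts (192 tokens per frame, 4 actions per frame) come from the text. The ordering in `interleave`, where each frame's video tokens are followed by its actions, is an assumption made for illustration, and the function names are hypothetical.

```python
N_SPATIAL = 192   # latent tokens per video frame (from the text)
TAU = 4           # action tokens per video frame (from the text)

def chunk_token_counts(num_frames):
    """Video and action token counts for a chunk of num_frames frames."""
    return {"video": num_frames * N_SPATIAL, "action": num_frames * TAU}

def interleave(num_frames):
    """Assumed layout: each frame's video tokens, then its TAU action tokens."""
    layout = []
    for t in range(num_frames):
        layout += [("video", t)] * N_SPATIAL
        layout += [("action", t)] * TAU
    return layout

counts = chunk_token_counts(4)      # K = 4 frames
sequence = interleave(4)
rate_ratio = 50 / 12.5              # action rate over video rate
```

For K = 4 this gives 768 video tokens against 16 action tokens, which is the imbalance that makes video denoising the dominant inference cost.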
The key challenge: vision and action are very different modalities. Video tokens are high-dimensional (capturing rich spatial information), while action tokens are low-dimensional (7 DoF per arm). Sharing all parameters would force the model to compress both into the same representation.
MoT solves this with separate expert parameters per modality:
The video stream is large (d_v = 3072, initialized from Wan2.2-5B). The action stream is small (d_a = 768, 4× smaller). This asymmetric design reflects the fact that action distributions are inherently simpler than visual data.
For joint attention, action tokens are projected up to the video dimension, participate in shared self-attention, then projected back down. A residual connection preserves action-specific features.
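The following NumPy sketch shows one way such a layer could be wired. It is an illustration under assumptions, not the actual architecture: the up-projection is folded into the action expert's query, key, and value matrices, and normalization, feed-forward blocks, multiple heads, and attention masking are left out. All parameter names are hypothetical.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def init_params(d_v, d_a, rng):
    """Random weights for one layer; names are illustrative."""
    shapes = {
        "Wq_v": (d_v, d_v), "Wk_v": (d_v, d_v),
        "Wv_v": (d_v, d_v), "Wo_v": (d_v, d_v),
        # Action expert: inputs lift d_a -> d_v, output maps back d_v -> d_a.
        "Wq_a": (d_a, d_v), "Wk_a": (d_a, d_v),
        "Wv_a": (d_a, d_v), "Wo_a": (d_v, d_a),
    }
    return {k: rng.standard_normal(s) / np.sqrt(s[0]) for k, s in shapes.items()}

def mot_attention(h_v, h_a, p):
    """Joint attention over video tokens (n_v, d_v) and action tokens (n_a, d_a)."""
    n_v = h_v.shape[0]
    # Each modality uses its own projections, then all tokens share attention.
    q = np.concatenate([h_v @ p["Wq_v"], h_a @ p["Wq_a"]])
    k = np.concatenate([h_v @ p["Wk_v"], h_a @ p["Wk_a"]])
    v = np.concatenate([h_v @ p["Wv_v"], h_a @ p["Wv_a"]])
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1])) @ v
    out_v = h_v + attn[:n_v] @ p["Wo_v"]   # video output with residual
    out_a = h_a + attn[n_v:] @ p["Wo_a"]   # projected down to d_a, with residual
    return out_v, out_a

rng = np.random.default_rng(0)
d_v, d_a = 16, 4                  # toy sizes; the text gives 3072 and 768
params = init_params(d_v, d_a, rng)
h_v = rng.standard_normal((6, d_v))
h_a = rng.standard_normal((2, d_a))
out_v, out_a = mot_attention(h_v, h_a, params)
```

The action tokens keep their small width outside the attention step, so the action expert stays cheap, while inside attention both modalities live in the same d_v-dimensional space and can read from each other.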
Figure: one MoT layer. Vision tokens (blue) and action tokens (orange) pass through separate expert projections, then share attention.
Training the action stream from scratch is unstable. The action tokens' initial output distribution diverges from the video distribution, disrupting joint attention. LingBot-VA initializes the action network by interpolating the pretrained video weights to the smaller action dimension, scaled by √(d_v/d_a) to preserve output variance. This ensures both streams start with comparable distributions.
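A small experiment shows why a √(d_v/d_a) factor keeps output variance in line. This is a sketch under a loud assumption: nearest-neighbour resampling of the weight matrix stands in for whatever interpolation scheme LingBot-VA actually uses, and `init_action_weight` is a hypothetical helper.

```python
import numpy as np

def init_action_weight(W_video, d_a):
    """Resample a (d_v, d_v) video weight to (d_a, d_a), then rescale.

    Assumption: nearest-neighbour resampling stands in for the
    interpolation mentioned in the text; the real scheme may differ.
    """
    d_v = W_video.shape[0]
    idx = np.round(np.linspace(0, d_v - 1, d_a)).astype(int)
    W_small = W_video[np.ix_(idx, idx)]
    # Fewer input dimensions means fewer terms in each output sum,
    # so the weights are scaled up to compensate.
    return W_small * np.sqrt(d_v / d_a)

rng = np.random.default_rng(0)
d_v, d_a = 256, 64                                      # toy sizes, same 4x ratio
W_v = rng.standard_normal((d_v, d_v)) / np.sqrt(d_v)   # roughly unit output variance
W_a = init_action_weight(W_v, d_a)

var_video = np.var(rng.standard_normal((2000, d_v)) @ W_v)
var_action = np.var(rng.standard_normal((2000, d_a)) @ W_a)
ratio = var_action / var_video
```

Each output is a sum over the input dimension, so shrinking that dimension by a factor of d_v/d_a shrinks the output variance by the same factor. Multiplying the weights by √(d_v/d_a) multiplies the variance by d_v/d_a and cancels the drop.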
This is the heart of LingBot-VA. At each autoregressive step, the model generates a chunk of K future video frames via flow matching, then simultaneously decodes the corresponding actions via inverse dynamics. Let's trace through exactly how this works.
Given observation history z_{≤t} and action history a_{<t}:
Within the interleaved sequence, a chunk-wise causal attention mask controls what each token can see. Within a chunk, tokens can attend to each other (bidirectional within the chunk), but across chunks attention is strictly causal: a token can attend to earlier chunks, never to later ones. This preserves the temporal arrow of physical causality.
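The masking rule, bidirectional inside a chunk and causal across chunks, can be built in two lines of NumPy. This is a generic sketch; `chunk_causal_mask` is a hypothetical helper, and `True` at position (i, j) means token i may attend to token j.

```python
import numpy as np

def chunk_causal_mask(chunk_sizes):
    """Boolean mask: token i may attend to token j iff chunk(j) <= chunk(i)."""
    chunk_id = np.repeat(np.arange(len(chunk_sizes)), chunk_sizes)
    return chunk_id[:, None] >= chunk_id[None, :]

# Two chunks: the first with 2 tokens, the second with 3.
mask = chunk_causal_mask([2, 3])
```

In the 5×5 result, the top-left 2×2 block and the bottom-right 3×3 block are fully `True` (tokens see their whole chunk), the bottom-left block is `True` (later chunks see earlier ones), and the top-right block is `False` (earlier chunks never see later ones).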
During training, the model uses ground-truth tokens as context (teacher forcing). This is unusually well-suited for robotics: unlike pure generative modeling where teacher forcing creates train-test mismatch, robot policies naturally receive real observations during deployment. The training and deployment regimes match.
The biggest bottleneck is video denoising — video tokens vastly outnumber action tokens, and each requires multiple denoising steps. Key insight: action decoding doesn't need pixel-perfect video reconstruction. The inverse dynamics model can extract action-relevant information from partially noisy video states.
During training, with probability 0.5, the video history is augmented with noise at a random flow time s_aug ∈ [0.5, 1.0]. This trains the action decoder to be robust to partially denoised inputs. At inference, video tokens only need to be denoised to s = 0.5 instead of s = 1.0, which roughly halves the denoising computation while maintaining action prediction quality.
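The augmentation and the resulting compute saving can be sketched as follows. The probability and the flow-time range come from the text. The uniform sampling of s_aug, the fixed Euler step size, and both function names are assumptions made for illustration.

```python
import numpy as np

def augment_history(z_clean, rng, p_aug=0.5, s_min=0.5, s_max=1.0):
    """With probability p_aug, move clean latents back to flow time s_aug.

    Returns the (possibly noised) latents and the flow time they sit at;
    s = 1.0 means fully clean.
    """
    if rng.random() >= p_aug:
        return z_clean, 1.0
    s_aug = rng.uniform(s_min, s_max)          # assumed uniform sampling
    eps = rng.standard_normal(z_clean.shape)
    return (1.0 - s_aug) * eps + s_aug * z_clean, s_aug

def denoise_steps(s_stop, step_size=0.1):
    """Euler steps of a fixed size needed to go from s=0 to s_stop."""
    return int(round(s_stop / step_size))

rng = np.random.default_rng(0)
z = np.ones((4, 3))
samples = [augment_history(z, rng) for _ in range(200)]

full_steps = denoise_steps(1.0)      # denoise all the way to clean video
partial_steps = denoise_steps(0.5)   # stop early at s = 0.5
```

With a uniform step size, stopping at s = 0.5 takes half as many network evaluations as running to s = 1.0. The action decoder tolerates the remaining noise because it saw inputs from the same range of flow times during training.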
Figure: frame-by-frame generation with action conditioning, comparing open-loop (no feedback) with closed-loop (real observations injected). Open-loop drifts over time.
Here's a problem with any world model: predicted frames drift from reality over time. Even small errors compound. After 10 steps of open-loop prediction, the model's imagined world may look nothing like the real one. The robot is acting on a hallucination.
Open-loop: Generate an entire trajectory of predicted frames, then execute all the corresponding actions. The model never gets corrected by reality. If the first prediction is slightly off, every subsequent prediction builds on that error.
Closed-loop: After executing each action chunk, replace the model's predicted frames with the actual observation from the robot's camera. The model is continuously re-grounded in reality.
At each autoregressive step:
This is natural for autoregressive models: the KV cache simply gets the real tokens instead of predicted ones. No special mechanism is needed — it's the same teacher-forcing used during training.
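A one-dimensional toy makes the difference between the two modes visible. Nothing here is LingBot-VA code: the "world" advances by 1.0 per step, the "model" believes it advances by 1.1, and `rollout` is a hypothetical helper that tracks how far the imagined state is from the real one.

```python
def rollout(num_steps, closed_loop, true_step=1.0, model_step=1.1):
    """Track |imagined - real| when the model's dynamics are slightly wrong."""
    real = 0.0
    imagined = 0.0
    errors = []
    for _ in range(num_steps):
        # Closed-loop conditions on the real state; open-loop on its own output.
        context = real if closed_loop else imagined
        imagined = context + model_step   # the model's prediction
        real = real + true_step           # what actually happens
        errors.append(abs(imagined - real))
    return errors

open_loop_errors = rollout(10, closed_loop=False)
closed_loop_errors = rollout(10, closed_loop=True)
```

In open-loop mode the error grows by 0.1 every step and reaches about 1.0 after ten steps. In closed-loop mode the model is reset to the real state before each prediction, so the error stays at about 0.1 no matter how long the rollout runs.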
Chunk-based bidirectional models generate entire segments at once. Injecting a ground-truth observation mid-chunk would require re-generating the entire chunk, because bidirectional attention means every token in the chunk depends on every other. LingBot-VA's causal structure means you can simply append the real observation and continue generating from there.
Even with partial denoising and KV cache, autoregressive video-action generation takes time. If the robot has to wait for the model to finish predicting before it can move, there's a delay between seeing the world and acting on it. For real-time control at 50 Hz, this delay can be catastrophic.
In synchronous inference: observe → predict → execute → observe → predict → execute. The robot is idle while the model computes, and the model is idle while the robot moves, so at any moment one of the two is waiting on the other.
Pipeline the computation: while the robot executes action chunk a_t, the model simultaneously predicts the next chunk a_{t+1}. When the robot finishes executing, the next actions are already available, provided prediction takes no longer than execution, and the robot never sits idle.
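A back-of-the-envelope timing model shows what pipelining buys. The execution time per chunk follows from the numbers in the text (K = 4 frames × 4 actions per frame at 50 Hz). The 0.2 s inference time is a made-up value for illustration, and both functions are hypothetical helpers.

```python
CHUNK_EXEC_S = (4 * 4) / 50.0   # 16 actions at 50 Hz = 0.32 s per chunk

def sync_time(num_chunks, t_infer, t_exec):
    """Predict, then execute, strictly one after the other."""
    return num_chunks * (t_infer + t_exec)

def async_time(num_chunks, t_infer, t_exec):
    """Predict chunk k+1 while chunk k executes.

    The first prediction cannot be hidden. After that, each cycle costs
    whichever stage is slower, and the last chunk still has to execute.
    """
    return t_infer + (num_chunks - 1) * max(t_infer, t_exec) + t_exec

t_infer = 0.2                    # assumed inference time per chunk, in seconds
total_sync = sync_time(10, t_infer, CHUNK_EXEC_S)
total_async = async_time(10, t_infer, CHUNK_EXEC_S)
```

Under these assumptions ten chunks take about 5.2 s synchronously and about 3.4 s when pipelined. If inference were slower than execution, the `max` term would be dominated by inference and the robot would still have to wait between chunks.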
There's a subtle problem with naive asynchronous inference. When predicting a_{t+1}, the model doesn't have the real observation at time t (the robot hasn't finished executing yet). It has to use its own predicted frame ẑ_t. But this predicted frame might be stale or inaccurate.
A naive approach just uses the stale prediction. But the video model tends to "continue" its own hallucinated video rather than staying grounded in reality. Over time, the model drifts into open-loop mode.
LingBot-VA fixes this with a Forward Dynamics Model (FDM) grounding step:
This re-grounds the model in real observations at every step, even though there's a one-step delay.
LingBot-VA follows a two-phase training pipeline: massive-scale pre-training on diverse data, then lightweight post-training on specific robot tasks.
The backbone is Wan2.2-5B, a large-scale pretrained video generation model. The action stream (350M parameters) is added on top, bringing the total to 5.3B parameters.
Pre-training data comes from two sources:
The model is pre-trained on 1.4 trillion tokens using AdamW with cosine learning-rate annealing and bfloat16 mixed precision, with training configured to support classifier-free guidance.
Different robots have different action spaces. LingBot-VA defines a universal dual-arm representation: each arm gets 7 end-effector pose dimensions + 7 joint angle dimensions + 1 gripper dimension = 15 per arm, 30 total. Robots with fewer degrees of freedom get zero-padded.
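The padding scheme can be sketched in a few lines. The per-arm sizes (7 + 7 + 1 = 15, 30 total) come from the text. The ordering of the fields inside the vector and the helper names `pack_arm` and `pack_action` are assumptions made for illustration.

```python
import numpy as np

EEF_DIM, JOINT_DIM, GRIPPER_DIM = 7, 7, 1
ARM_DIM = EEF_DIM + JOINT_DIM + GRIPPER_DIM   # 15 per arm

def pack_arm(eef_pose, joints, gripper):
    """Pack one arm into 15 dims; unused joint slots stay zero.

    Assumed field order: end-effector pose, joint angles, gripper.
    """
    joints = np.asarray(joints, dtype=float)
    arm = np.zeros(ARM_DIM)
    arm[:EEF_DIM] = eef_pose
    arm[EEF_DIM:EEF_DIM + len(joints)] = joints
    arm[-1] = gripper
    return arm

def pack_action(left, right=None):
    """Concatenate two arms; a missing arm is zero-padded."""
    right = np.zeros(ARM_DIM) if right is None else right
    return np.concatenate([left, right])

# A single-arm robot with only 6 joints.
left_arm = pack_arm(eef_pose=np.ones(7), joints=np.full(6, 0.5), gripper=1.0)
action = pack_action(left_arm)
```

Here the seventh joint slot and the entire second arm are zeros, so robots with different embodiments all produce vectors of the same length and can share one action stream.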
During training, the chunk size K is randomly sampled from [1, 4]. This teaches the model to generate coherent predictions at different temporal horizons. At inference, K=4 is used as a practical tradeoff between efficiency and responsiveness.
Adapting to a new robot platform requires remarkably little data. With just 50 demonstrations per task, the model is fine-tuned for 3K steps at a reduced learning rate (10⁻⁵). This is sufficient for effective deployment.
The total loss combines two flow matching objectives:
L_dyn supervises video velocity field prediction (visual dynamics). L_inv supervises action velocity field prediction (inverse dynamics), conditioned on current and next observations. Both use the same flow matching framework, with λ = 1.
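Written out, and assuming λ multiplies the action term (the text does not say which term it weights, and with λ = 1 the choice does not change the value), the combined objective would be:

$$
\mathcal{L} = \mathcal{L}_{\text{dyn}} + \lambda\,\mathcal{L}_{\text{inv}}
$$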
LingBot-VA is evaluated on both simulation benchmarks and real-world robot tasks. It comes out ahead of the listed baselines on every benchmark average reported below.
Six manipulation tasks spanning three categories, each with only 50 demonstrations:
| Task | Category | LingBot-VA | π0.5 |
|---|---|---|---|
| Make Breakfast | Long-horizon | 7 steps | ~5 steps |
| Unpack Delivery | Long-horizon | Best | Lower |
| Insert Tubes | Precision | Best | Lower |
| Pick Screws | Precision | Best | Lower |
| Fold Clothes | Deformable | Best | Lower |
| Fold Pants | Deformable | Best | Lower |
LingBot-VA outperforms π0.5 on all six tasks and both metrics (success rate and progress score). The strongest gains appear on long-horizon tasks, validating the temporal memory advantage of autoregressive world modeling.
| Method | Easy (avg) | Hard (avg) |
|---|---|---|
| X-VLA | 72.9 | 72.8 |
| π0 | 65.9 | 58.4 |
| π0.5 | 82.7 | 76.8 |
| Motus | 88.7 | 87.0 |
| LingBot-VA | 92.9 (+4.2) | 91.6 (+4.6) |
The improvement grows with task horizon: at Horizon=3 (3-step tasks), LingBot-VA gains +8.2% (Easy) and +9.1% (Hard) over the next best method.
| Method | Spatial | Object | Goal | Long | Avg |
|---|---|---|---|---|---|
| X-VLA | 98.2 | 98.6 | 97.8 | 97.6 | 98.1 |
| π0 | 97.6 | 98.4 | 97.9 | 85.2 | 97.1 |
| LingBot-VA | 98.5 | 99.6 | 97.2 | 98.5 | 98.5 |
On LIBERO with varying numbers of demonstrations:
Figure: average success rate across 50 bimanual tasks. LingBot-VA consistently outperforms all baselines, with the gap widening on harder settings.
LingBot-VA sits at the intersection of several major threads in robotics and generative modeling. Let's map where it fits.
π0 is a flow-matching VLA that generates action chunks conditioned on visual observations. It's a feedforward policy — no world model, no video prediction. LingBot-VA extends this paradigm by adding an explicit video world model that predicts future visual states, enabling the action decoder to reason about consequences rather than just react.
UWM uses bidirectional diffusion within chunks to jointly generate video and actions. LingBot-VA's key departure is causal autoregressive generation across chunks, which provides persistent memory (KV cache) and natural closed-loop correction. UWM's bidirectional chunks can't easily incorporate real-time feedback mid-generation.
Diffusion Policy uses diffusion to generate action sequences, but without any video prediction component. It's purely a policy model. LingBot-VA adds the "imagination" layer — predicting what the world will look like — which provides richer conditioning for action generation.
Video prediction models like Genie and UniSim can imagine future frames, but they're not designed for real-time robot control. LingBot-VA bridges this gap by making video prediction fast enough (via partial denoising and async inference) and tight enough (via MoT and inverse dynamics) for closed-loop manipulation.
| Aspect | LingBot-VA |
|---|---|
| Core idea | Video world model + action policy in one AR diffusion framework |
| Architecture | Mixture-of-Transformers (5B video + 350M action) |
| Generation | Autoregressive chunks, flow matching per chunk |
| Inference trick | Partial denoising (s=0.5) + async FDM grounding |
| Loop | Closed-loop: real observations replace predictions each step |
| Pre-training | Internet video + 16K hrs robot data, 1.4T tokens |
| Post-training | 50 demos, 3K steps — sufficient for new tasks |
| Key result | 92.9% on RoboTwin (50 tasks), 98.5% on LIBERO (avg) |
| Advantage | Long-horizon: +8-9% over best baseline at Horizon=3 |