LingBot-VA — Veanors

Chapter 0: The Problem

You want a robot to make breakfast. It needs to grasp a plate, pick up bread, grab a kettle, pour water, and serve — a long sequence of precise actions, each depending on what happened before.

Today's leading approach is the Vision-Language-Action (VLA) model: take the current camera image, feed it through a big neural network, and output the next action. It's a direct, feedforward mapping: observation in, action out.

This works surprisingly well for simple tasks. But it has a deep flaw: representation entanglement. A single network must simultaneously learn three very different things:

Visual understanding: what objects are in the scene, where they are, their shapes and poses
Physical dynamics: what will happen if I push this cup? Will the bread slide off?
Motor control: what joint angles and gripper commands will move my arm to the right place?

All three are crammed into one supervision signal: "here's what the expert did." The model must compress high-dimensional visual semantics and low-dimensional motor commands into a shared representation. This leads to poor sample efficiency and brittle generalization.

The fundamental limitation: Feedforward VLAs are reactive. They see the current frame and output an action. They have no model of what will happen next. They cannot "imagine" the consequences of their actions before committing to them. When something unexpected happens mid-task, they have no mechanism to re-plan.

Think about how you pour water. You don't just react to the current visual frame. You predict where the water will go, you anticipate when the cup is full, and you adjust your pour angle in real time. You have a world model — an internal simulator of physics — and you use it constantly.

Feedforward VLA vs. World Model

Figure: a reactive VLA (left) maps observations directly to actions, while a world model approach (right) first predicts the future, then decides what to do.

Why do feedforward VLAs struggle with long-horizon manipulation tasks?

They must compress visual understanding, physical dynamics, and motor control into a single feedforward mapping without any model of future consequences, making them reactive rather than predictive
They run too slowly for real-time control
They require too much GPU memory for long video sequences

Chapter 1: The Key Insight

LingBot-VA's core idea: combine a video world model and an action policy in one unified framework. Instead of mapping observations directly to actions, decompose the problem into two stages:

Stage 1: Visual Dynamics

Given what you've seen so far, predict what the world will look like next. Generate future video frames.

↓

Stage 2: Inverse Dynamics

Given where you are now and where you want to be, figure out what action gets you there.

In math, a standard VLA learns:

a_t ~ π_θ( · | o_t )

LingBot-VA instead learns two things:

Stage 1: o_{t+1} ~ p_θ( · | o_{≤t} )
Stage 2: a_t ~ g_ψ( · | o_t, o_{t+1} )

Stage 1 predicts the next observation given history. Stage 2 infers actions from the desired visual transition. This decomposition is powerful because each stage can leverage different data:

Stage 1 can be pre-trained on billions of internet video frames. No robot data needed. The model learns physics, object dynamics, and scene evolution from the entire visual internet.
Stage 2 only needs robot demonstrations to ground visual predictions in executable actions. Much less data required.
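
As a toy illustration of this factorization (not the actual model), the sketch below uses a 1-D world where an observation is a position and an action is a displacement. The functions `predict_next_observation` and `infer_action` are hypothetical stand-ins for p_θ and g_ψ, and the `goal` argument is an assumption that plays the role of the task instruction.

```python
def predict_next_observation(history, goal, step=1.0):
    """Stage 1 stand-in for p_theta: propose the next observation from history.

    Toy rule: move from the latest observation toward `goal` by at most `step`.
    """
    current = history[-1]
    delta = max(-step, min(step, goal - current))
    return current + delta


def infer_action(obs, next_obs):
    """Stage 2 stand-in for g_psi: the action that turns obs into next_obs.

    In this toy world an action is a displacement, so inverse dynamics is a subtraction.
    """
    return next_obs - obs


def rollout(start, goal, n_steps):
    history, actions = [start], []
    for _ in range(n_steps):
        next_obs = predict_next_observation(history, goal)
        action = infer_action(history[-1], next_obs)
        actions.append(action)
        history.append(history[-1] + action)  # toy environment: position += action
    return history, actions


history, actions = rollout(start=0.0, goal=2.5, n_steps=4)
print(history)  # [0.0, 1.0, 2.0, 2.5, 2.5]
print(actions)  # [1.0, 1.0, 0.5, 0.0]
```

The point of the split is visible even here: the predictor never sees actions, and the inverse dynamics function never sees the goal.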

Why this matters: By factoring the problem into "what will happen" and "what should I do," LingBot-VA can inherit rich physical priors from video pre-training. A model that has watched millions of cups being poured already knows how liquids behave — it just needs a small amount of robot data to learn the motor commands.

But the real magic is that LingBot-VA doesn't actually separate these into two independent models. It interleaves video and action tokens into a single autoregressive sequence, processing both through a shared transformer. The two stages happen jointly, each informing the other. This is what makes it different from prior work that bolted a separate action head onto a video model.

What is the key advantage of decomposing robot control into visual dynamics prediction + inverse dynamics?

It requires fewer GPU hours to train
Visual dynamics can be pre-trained on massive internet video data (no robot needed), while inverse dynamics needs only a small amount of robot demonstrations, enabling data-efficient transfer
It eliminates the need for language instructions

Chapter 2: Background

Before diving into LingBot-VA's architecture, we need three pieces of background.

Flow Matching (Continuous Diffusion)

Standard diffusion models add noise to data and learn to reverse the process. Flow matching is a cleaner formulation: it learns a velocity field that transports noise to data along a straight path.

Given a data sample x₁ and noise ε ~ N(0, I), define an interpolation path:

x(s) = (1 - s)ε + s · x₁

The true velocity along this path is simply x₁ - ε. The model learns to predict this velocity:

L_FM = E[ || v_θ(x(s), s) - (x₁ - ε) ||² ]

At inference, start from pure noise (s=0) and integrate the learned velocity field to s=1. The result is a sample from the data distribution. Flow matching is used in LingBot-VA for both video frame generation and action prediction.
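
The path, target velocity, loss, and inference-time integration above can be sketched in a few lines of NumPy. This is a minimal sketch, not LingBot-VA's code: the `oracle` velocity is a stand-in for a trained v_θ, and the fixed-step Euler integrator is an assumption (the text does not say which ODE solver is used).

```python
import numpy as np


def interpolate(eps, x1, s):
    """Point on the straight path x(s) = (1 - s) * eps + s * x1."""
    return (1.0 - s) * eps + s * x1


def flow_matching_loss(v_pred, eps, x1):
    """Squared error between a predicted velocity and the target x1 - eps."""
    return float(np.mean(np.sum((v_pred - (x1 - eps)) ** 2, axis=-1)))


def euler_integrate(velocity_fn, x_start, s_start=0.0, s_end=1.0, n_steps=10):
    """Integrate dx/ds = velocity_fn(x, s) with fixed-step Euler."""
    x = np.array(x_start, dtype=float)
    ds = (s_end - s_start) / n_steps
    for i in range(n_steps):
        s = s_start + i * ds
        x = x + ds * velocity_fn(x, s)
    return x


rng = np.random.default_rng(0)
x1 = np.array([2.0, -1.0])      # a single "data" sample
eps = rng.standard_normal(2)    # a noise sample

# Oracle velocity for this (eps, x1) pair; a trained v_theta would approximate it.
oracle = lambda x, s: x1 - eps

x_mid = interpolate(eps, x1, 0.5)
loss_at_oracle = flow_matching_loss(oracle(x_mid, 0.5), eps, x1)  # zero by construction
x_final = euler_integrate(oracle, eps)                            # lands on x1
```

Because the oracle velocity is constant along the path, Euler integration from the noise sample reaches x₁ regardless of the number of steps; a learned field is only approximately straight, which is why several steps are used in practice.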

Autoregressive vs. Chunk-Based Generation

There are two ways to generate sequences of video frames for robot control:

Chunk-based (bidirectional): Generate an entire chunk of K frames at once using bidirectional attention within the chunk. Methods like UWM and UVA do this. Problem: future tokens in the chunk can influence past tokens, which violates physical causality. And you can't inject new observations mid-chunk.
Autoregressive (causal): Generate one frame (or one small chunk) at a time, each conditioned on all previous frames. Like how language models generate one token at a time. This respects causality — the present depends only on the past.

Why causality matters for robots: In the real physical world, the future doesn't influence the past. If your robot bumps a cup at time t, the cup's new position at t+1 depends on the bump — not on what happens at t+2. Bidirectional models allow t+2 to influence t+1 within a chunk, which can produce physically inconsistent predictions. Causal models enforce the correct temporal direction.

KV Cache

In autoregressive transformers, each new token attends to all previous tokens. Without optimization, this means recomputing the keys and values for the entire history at every step. The KV cache stores the key and value vectors from previous steps, so only the new token's query, key, and value need to be computed. This makes autoregressive generation efficient and enables persistent long-term memory: as long as earlier tokens stay in the cache, the model never "forgets" what happened 100 steps ago.
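
A minimal single-head sketch of the idea, assuming plain scaled dot-product attention with no positional encoding. The class name `KVCache` and its `step` method are illustrative, not an API from LingBot-VA. The check at the end confirms that stepping through the cache gives the same outputs as recomputing full causal attention.

```python
import numpy as np


def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)


def causal_attention(X, Wq, Wk, Wv):
    """Reference: recompute causal self-attention over the full sequence."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    future = np.triu(np.ones_like(scores, dtype=bool), k=1)  # True above the diagonal
    scores = np.where(future, -np.inf, scores)
    return softmax(scores) @ V


class KVCache:
    """Stores keys/values of past tokens so each step only projects the new token."""

    def __init__(self):
        self.K, self.V = [], []

    def step(self, x_new, Wq, Wk, Wv):
        q = x_new @ Wq
        self.K.append(x_new @ Wk)
        self.V.append(x_new @ Wv)
        K, V = np.stack(self.K), np.stack(self.V)
        weights = softmax(q @ K.T / np.sqrt(K.shape[-1]))
        return weights @ V


rng = np.random.default_rng(0)
d = 8
X = rng.standard_normal((5, d))
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))

cache = KVCache()
incremental = np.stack([cache.step(x, Wq, Wk, Wv) for x in X])
full = causal_attention(X, Wq, Wk, Wv)
print(np.allclose(incremental, full))  # True
```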

Why is autoregressive (causal) generation preferred over bidirectional chunk generation for robot control?

Because causal generation respects the temporal arrow of physics (past influences present, not vice versa) and allows real-time observations to be injected at every step
Because causal generation produces higher-resolution video frames
Because bidirectional attention requires more GPU memory

Chapter 3: Shared Latent Space

LingBot-VA's first key design: unify vision and action tokens in a single sequence, processed by a Mixture-of-Transformers (MoT) architecture.

Token Interleaving

Video frames are encoded into latent tokens using a causal Video VAE (from Wan2.2). Each frame produces N = 192 spatial tokens. Actions are projected into the same embedding space via a small MLP. The tokens are interleaved in temporal order:

[ z₀, a_{0,1}, a_{0,2}, ..., a_{0,τ}, z₁, a_{1,1}, ..., a_{1,τ}, z₂, ... ]

Here τ = 4 is the temporal downsampling factor. For each video frame, there are 4 action tokens, because actions run at higher frequency (50 Hz) than video (12.5 Hz). Predicting K video frames means generating 4K actions.
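
The ordering can be made concrete with a small helper. This is an illustrative sketch: `interleave_tokens` is a hypothetical name, and each z_i label stands for the whole block of N = 192 latent tokens of frame i.

```python
def interleave_tokens(num_frames, tau=4):
    """Return labels for the sequence [z_0, a_{0,1..tau}, z_1, a_{1,1..tau}, ...].

    Each z_i stands for all latent tokens of frame i; it is kept as one label
    here for readability.
    """
    sequence = []
    for i in range(num_frames):
        sequence.append(f"z_{i}")
        sequence.extend(f"a_{i},{j}" for j in range(1, tau + 1))
    return sequence


seq = interleave_tokens(num_frames=2)
print(seq)
# ['z_0', 'a_0,1', 'a_0,2', 'a_0,3', 'a_0,4', 'z_1', 'a_1,1', 'a_1,2', 'a_1,3', 'a_1,4']

video_hz, tau = 12.5, 4
print(video_hz * tau)  # 50.0, the action rate quoted above
```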

Mixture-of-Transformers

The key challenge: vision and action are very different modalities. Video tokens are high-dimensional (capturing rich spatial information), while action tokens are low-dimensional (7 DoF per arm). Sharing all parameters would force the model to compress both into the same representation.

MoT solves this with separate expert parameters per modality:

Each layer has two separate QKV projection matrices — one for video tokens, one for action tokens
Video and action tokens share attention — they can attend to each other
But their features are projected independently, maintaining distinct representational spaces

The video stream is large (d_v = 3072, initialized from Wan2.2-5B). The action stream is small (d_a = 768, 4x smaller). This asymmetric design reflects the fact that action distributions are inherently simpler than visual data.

For joint attention, action tokens are projected up to the video dimension, participate in shared self-attention, and are then projected back down to the action dimension. A residual connection preserves action-specific features.

Why MoT, not just a shared transformer? Video tokens need expressive high-dimensional features to capture spatial detail. Action tokens only need to represent 7 numbers per arm. If you force them through the same projection layers, either the action stream wastes capacity or the video stream is constrained. MoT gives each modality its own "expert" while still allowing cross-modal communication through shared attention.
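
One way to picture a single MoT attention block is the NumPy sketch below. Several details are assumptions made for illustration: the action QKV matrices map d_a directly to d_v (serving as the "project up" step), video tokens also get a residual connection, tokens are concatenated rather than interleaved, and masking, multiple heads, normalization, and feed-forward layers are omitted. The class name `MoTAttentionSketch` is hypothetical.

```python
import numpy as np


def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)


class MoTAttentionSketch:
    """One joint-attention block with per-modality projections (illustrative only)."""

    def __init__(self, d_v, d_a, rng):
        init = lambda m, n: rng.standard_normal((m, n)) / np.sqrt(m)
        # Video expert: QKV and output projections stay at the video width d_v.
        self.video_qkv = [init(d_v, d_v) for _ in range(3)]
        self.video_out = init(d_v, d_v)
        # Action expert: QKV projections lift d_a -> d_v so both streams can share
        # attention; the output projection maps d_v back down to d_a.
        self.action_qkv = [init(d_a, d_v) for _ in range(3)]
        self.action_out = init(d_v, d_a)

    def __call__(self, video, action):
        n_v = video.shape[0]
        q, k, v = (
            np.concatenate([video @ W_vid, action @ W_act], axis=0)
            for W_vid, W_act in zip(self.video_qkv, self.action_qkv)
        )
        # Shared self-attention over the joint sequence (no mask in this sketch).
        attn = softmax(q @ k.T / np.sqrt(k.shape[-1])) @ v
        video_new = video + attn[:n_v] @ self.video_out
        action_new = action + attn[n_v:] @ self.action_out  # residual keeps action features
        return video_new, action_new


rng = np.random.default_rng(0)
layer = MoTAttentionSketch(d_v=32, d_a=8, rng=rng)  # same 4x width ratio as 3072 vs 768
video_tokens = rng.standard_normal((6, 32))
action_tokens = rng.standard_normal((4, 8))
video_out, action_out = layer(video_tokens, action_tokens)
print(video_out.shape, action_out.shape)  # (6, 32) (4, 8)
```

Each stream keeps its own width on the way in and out; only the attention computation itself happens in the shared d_v space.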

Mixture-of-Transformers Architecture

Figure: vision tokens (blue) and action tokens (orange) pass through separate expert projections, then share attention within one MoT layer.

Action Network Initialization

Training the action stream from scratch is unstable. The action tokens' initial output distribution diverges from the video distribution, disrupting joint attention. LingBot-VA initializes the action network by interpolating the pretrained video weights to the smaller action dimension, scaled by √(d_v/d_a) to preserve output variance. This ensures both streams start with comparable distributions.
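
The role of the √(d_v/d_a) factor can be checked numerically. For y = Wx with i.i.d. unit-variance inputs, the variance of each output is the sum of squared weights in that row, so shrinking the input width from d_v to d_a cuts the output variance by d_a/d_v unless the weights are rescaled. In the sketch below, strided subsampling stands in for the interpolation step (the text does not specify the interpolation scheme), and `init_action_weights` is a hypothetical name.

```python
import numpy as np


def init_action_weights(W_video, d_a):
    """Shrink a (d_v, d_v) weight matrix to (d_a, d_a) and rescale by sqrt(d_v / d_a).

    Strided subsampling is used here as a simple stand-in for interpolation.
    """
    d_v = W_video.shape[0]
    stride = d_v // d_a
    W_small = W_video[::stride, ::stride][:d_a, :d_a]
    return W_small * np.sqrt(d_v / d_a)


def output_variance(W):
    """Average variance of y = W @ x when x has i.i.d. unit-variance entries."""
    return float(np.mean(np.sum(W ** 2, axis=1)))


rng = np.random.default_rng(0)
d_v, d_a = 256, 64
W_video = rng.standard_normal((d_v, d_v)) / np.sqrt(d_v)

var_video = output_variance(W_video)                             # close to 1
var_unscaled = output_variance(W_video[::4, ::4])                # close to d_a / d_v = 0.25
var_scaled = output_variance(init_action_weights(W_video, d_a))  # close to 1 again
```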

Why does LingBot-VA use separate QKV projections for vision and action tokens instead of sharing all parameters?

Because vision and action have very different dimensionalities and representational needs: video needs rich spatial features while actions only encode low-dimensional motor commands. Shared projections would force a suboptimal compromise.
Because separate projections allow the model to train faster on multiple GPUs
Because action tokens are always processed after video tokens

Chapter 4: Causal Autoregressive Generation

This is the heart of LingBot-VA. At each autoregressive step, the model generates a chunk of K future video frames via flow matching, then decodes the corresponding actions via inverse dynamics, conditioned on those predicted frames. Let's trace through exactly how this works.

Step-by-Step Generation

Given observation history z_{≤t} and action history a_{<t}:

Sample noise: Draw ε ~ N(0, I) for the next K latent frames
Denoise video chunk: Integrate the learned velocity field from s=0 to s=0.5 (partial denoising — we'll explain why shortly):
ẑ_{t+1:t+K} = ε + ∫₀^{0.5} v_θ(z^(s), s | C) ds
Decode actions: Sample new noise for action tokens, then integrate from s=0 to s=1:
a_{t:t+K-1} = ε + ∫₀¹ v_ψ(a^(s), s | ẑ_{t:t+K}, C) ds
Execute actions and collect new real-world observations
Update KV cache with the new tokens and repeat
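
The sampling part of this loop (noise, partial video denoising, full action denoising) can be sketched as follows. This is a sketch under stated assumptions: the two velocity fields are hypothetical closed-form stand-ins for v_θ and v_ψ, `context` is a placeholder for the KV-cached history C, and fixed-step Euler integration is an assumed solver.

```python
import numpy as np


def euler_integrate(velocity_fn, x, s_start, s_end, n_steps):
    """Fixed-step Euler integration of dx/ds = velocity_fn(x, s)."""
    ds = (s_end - s_start) / n_steps
    for i in range(n_steps):
        x = x + ds * velocity_fn(x, s_start + i * ds)
    return x


def generate_step(video_velocity, action_velocity, context, K, tau, latent_dim,
                  action_dim, rng, video_s_end=0.5, n_steps=5):
    """One autoregressive step: partially denoise K frames, then fully denoise actions."""
    # Sample noise and integrate the video flow from s=0 to s=0.5 only.
    z_noise = rng.standard_normal((K, latent_dim))
    z_hat = euler_integrate(lambda z, s: video_velocity(z, s, context),
                            z_noise, 0.0, video_s_end, n_steps)
    # Fresh noise for the K * tau actions, integrated from s=0 to s=1 given z_hat.
    a_noise = rng.standard_normal((K * tau, action_dim))
    actions = euler_integrate(lambda a, s: action_velocity(a, s, z_hat, context),
                              a_noise, 0.0, 1.0, n_steps)
    return z_hat, actions


# Hypothetical stand-in velocity fields: straight-line flows toward fixed targets.
z_target, a_target = 1.0, -0.5
video_velocity = lambda z, s, ctx: (z_target - z) / (1.0 - s)
action_velocity = lambda a, s, z_hat, ctx: (a_target - a) / (1.0 - s)

rng = np.random.default_rng(0)
z_hat, actions = generate_step(video_velocity, action_velocity, context=None,
                               K=4, tau=4, latent_dim=3, action_dim=2, rng=rng)
print(z_hat.shape, actions.shape)  # (4, 3) (16, 2)
```

With these stand-in fields, the actions end at their target while the video latents stop halfway between noise and target, which is exactly the partially denoised state the action decoder is expected to work from.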

Causal Attention Masking

Within the interleaved sequence, a chunk-level causal attention mask ensures that each token can attend only to tokens in its own chunk or in earlier chunks. Within a chunk, tokens can attend to each other (bidirectional within the chunk), but across chunks, attention is strictly causal. This preserves the temporal arrow of physical causality from one chunk to the next.
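
A mask with this shape (bidirectional inside a chunk, causal across chunks) can be built from per-token chunk indices. The helper name `chunk_causal_mask` is illustrative.

```python
import numpy as np


def chunk_causal_mask(chunk_ids):
    """allowed[i, j] is True when token i may attend to token j.

    A token sees every token in its own chunk and in all earlier chunks.
    """
    ids = np.asarray(chunk_ids)
    return ids[:, None] >= ids[None, :]


# Three chunks of two tokens each.
mask = chunk_causal_mask([0, 0, 1, 1, 2, 2])
print(mask.astype(int))
# [[1 1 0 0 0 0]
#  [1 1 0 0 0 0]
#  [1 1 1 1 0 0]
#  [1 1 1 1 0 0]
#  [1 1 1 1 1 1]
#  [1 1 1 1 1 1]]
```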

Teacher Forcing During Training

During training, the model uses ground-truth tokens as context (teacher forcing). This is unusually well-suited for robotics: unlike pure generative modeling where teacher forcing creates train-test mismatch, robot policies naturally receive real observations during deployment. The training and deployment regimes match.

Noisy History Augmentation

The biggest bottleneck is video denoising — video tokens vastly outnumber action tokens, and each requires multiple denoising steps. Key insight: action decoding doesn't need pixel-perfect video reconstruction. The inverse dynamics model can extract action-relevant information from partially noisy video states.

During training, with probability 0.5, the video history is augmented with noise at a random flow time s_aug ∈ [0.5, 1.0]. This trains the action decoder to be robust to partially denoised inputs. At inference, video tokens only need to be denoised to s=0.5 instead of s=1.0 — halving the denoising computation while maintaining action prediction quality.

The core tradeoff: Full video denoising produces beautiful frames but is computationally expensive. For robot control, we don't need beautiful frames — we need accurate actions. Partial denoising preserves the semantic structure that actions depend on while cutting compute in half.
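
The augmentation itself is a few lines. In this sketch the noisy history is produced with the same interpolation path used for flow matching, so s = 1 is clean and s = 0.5 is an equal mix of noise and signal. Sampling s_aug uniformly and the function name `augment_history` are assumptions; the text only gives the probability and the range.

```python
import numpy as np


def augment_history(z_clean, rng, p_aug=0.5, s_min=0.5, s_max=1.0):
    """With probability p_aug, replace clean history latents by a noisy version.

    Uses z(s) = (1 - s) * eps + s * z_clean with s drawn from [s_min, s_max].
    Returns the (possibly noised) latents and the flow time used (1.0 if clean).
    """
    if rng.random() >= p_aug:
        return z_clean, 1.0
    s_aug = rng.uniform(s_min, s_max)
    eps = rng.standard_normal(z_clean.shape)
    return (1.0 - s_aug) * eps + s_aug * z_clean, s_aug


rng = np.random.default_rng(0)
z_clean = np.ones((4, 8))
flow_times = [augment_history(z_clean, rng)[1] for _ in range(1000)]
frac_augmented = float(np.mean([s < 1.0 for s in flow_times]))  # roughly 0.5
```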

Autoregressive Rollout

Figure: the model generates frame by frame with action conditioning, either open-loop (no feedback) or closed-loop (real observations injected). Open-loop rollouts drift over time.

Why does LingBot-VA only denoise video tokens to s=0.5 instead of s=1.0 during inference?

Because the action decoder is trained (via noisy history augmentation) to extract action-relevant information from partially denoised video, cutting denoising computation in half without sacrificing action accuracy
Because the video VAE decoder can reconstruct full frames from partial latents
Because partial denoising produces more diverse action predictions

Chapter 5: Closed-Loop Rollout

Here's a problem with any world model: predicted frames drift from reality over time. Even small errors compound. After 10 steps of open-loop prediction, the model's imagined world may look nothing like the real one. The robot is acting on a hallucination.

Open-Loop vs. Closed-Loop

Open-loop: Generate an entire trajectory of predicted frames, then execute all the corresponding actions. The model never gets corrected by reality. If the first prediction is slightly off, every subsequent prediction builds on that error.

Closed-loop: After executing each action chunk, replace the model's predicted frames with the actual observation from the robot's camera. The model is continuously re-grounded in reality.

Think of it like driving. Open-loop is like closing your eyes at the start of a road and driving based on your mental model of where the road goes. Closed-loop is normal driving — you keep your eyes open and correct as you go. For long-horizon tasks, closed-loop is essential.

How LingBot-VA Closes the Loop

At each autoregressive step:

The model predicts the next K frames and the corresponding τK actions
The robot executes those actions and captures real observations
The real observations are encoded into latent tokens via the Video VAE
These ground-truth tokens replace the model's predictions in the KV cache
The next prediction step conditions on the corrected history

This is natural for autoregressive models: the KV cache simply gets the real tokens instead of predicted ones. No special mechanism is needed — it's the same teacher-forcing used during training.
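
The difference between the two regimes shows up even in a 1-D toy world where the model's dynamics are slightly wrong. Everything here is hypothetical (the 10% gain error, the scalar state, the function name); the only point is that overwriting the context with the real observation keeps the error at its one-step size, while open-loop error accumulates.

```python
def rollout(n_steps, closed_loop, model_gain=0.9, true_gain=1.0, action=1.0):
    """Track |predicted - real| state in a 1-D toy world with a slightly wrong model.

    Real world:   x <- x + true_gain * action
    World model:  x <- x + model_gain * action   (hypothetical 10% modelling error)
    """
    real, context = 0.0, 0.0  # `context` is what the model conditions on
    errors = []
    for _ in range(n_steps):
        predicted = context + model_gain * action
        real = real + true_gain * action
        errors.append(abs(predicted - real))
        # Closed loop: overwrite the prediction with the real observation.
        context = real if closed_loop else predicted
    return errors


open_loop = rollout(10, closed_loop=False)
closed = rollout(10, closed_loop=True)
print(round(open_loop[-1], 3), round(closed[-1], 3))  # 1.0 0.1
```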

Why Chunk-Based Methods Struggle Here

Chunk-based bidirectional models generate entire segments at once. Injecting a ground-truth observation mid-chunk would require re-generating the entire chunk, because bidirectional attention means every token in the chunk depends on every other. LingBot-VA's causal structure means you can simply append the real observation and continue generating from there.

Training-deployment alignment: During training, teacher forcing provides ground-truth context. During deployment, closed-loop rollout provides ground-truth observations. The two regimes match perfectly — the model sees the same kind of input distribution in both cases. This is a rare case where teacher forcing doesn't create distribution mismatch.

Why does closed-loop rollout align naturally with LingBot-VA's autoregressive formulation?

Because the causal structure lets you simply replace predicted tokens with real observations in the KV cache and continue generating, matching the teacher-forcing regime used during training
Because autoregressive models always produce more accurate predictions than bidirectional models
Because closed-loop rollout requires less GPU memory than open-loop

Chapter 6: Asynchronous Inference

Even with partial denoising and KV cache, autoregressive video-action generation takes time. If the robot has to wait for the model to finish predicting before it can move, there's a delay between seeing the world and acting on it. For real-time control at 50 Hz, this delay can be catastrophic.

The Synchronous Problem

In synchronous inference: observe → predict → execute → observe → predict → execute. The robot is idle while the model computes, and the model is idle while the robot moves. When inference and execution take similar amounts of time, roughly half the wall-clock time is wasted.

The Asynchronous Solution

Pipeline the computation: while the robot executes action chunk a_t, the model simultaneously predicts the next chunk a_{t+1}. When the robot finishes executing, the next actions are already available, so the robot never waits as long as inference is no slower than execution.
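
A back-of-the-envelope timing model makes the benefit concrete. The numbers are hypothetical: a chunk of K = 4 frames with τ = 4 gives 16 actions, which take 0.32 s to execute at 50 Hz, and inference is assumed to take the same 0.32 s.

```python
def time_per_chunk(t_infer, t_exec, asynchronous):
    """Steady-state wall-clock time per action chunk (seconds).

    Synchronous: inference and execution alternate, so their times add.
    Asynchronous: they overlap, so the slower of the two sets the pace.
    """
    return max(t_infer, t_exec) if asynchronous else t_infer + t_exec


t_exec = 16 / 50   # 16 actions at 50 Hz
t_infer = 0.32     # assumed inference time, chosen equal to t_exec
speedup = time_per_chunk(t_infer, t_exec, False) / time_per_chunk(t_infer, t_exec, True)
print(speedup)  # 2.0
```

The speedup is at most 2× and shrinks as the two times become unequal; for example, with inference three times faster than execution it is only 4/3.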

The FDM Grounding Trick

There's a subtle problem with naive asynchronous inference. When predicting a_{t+1}, the model doesn't have the real observation at time t (the robot hasn't finished executing yet). It has to use its own predicted frame ẑ_t. But this predicted frame might be stale or inaccurate.

A naive approach just uses the stale prediction. But the video model tends to "continue" its own hallucinated video rather than staying grounded in reality. Over time, the model drifts into open-loop mode.

LingBot-VA fixes this with a Forward Dynamics Model (FDM) grounding step:

Wait for the most recent real observation z_{t-1} (from the previous execution)
Use the model to "imagine" what z_t looks like after applying action a_t to real observation z_{t-1}
Use this FDM-grounded prediction (instead of the stale one) as context for predicting a_{t+1}

This re-grounds the model in real observations at every step, even though there's a one-step delay.
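
The effect of grounding on a one-step-delayed real observation can be illustrated with a scalar toy model. The setup is entirely hypothetical: the real world has an unmodelled disturbance of 0.2 per step, and the "model" simply adds the action to whatever frame it starts from. Starting from the real z_{t-1} keeps the context error at one step's worth of disturbance, while chaining the model's own frames lets it grow.

```python
def imagined_state_error(n_steps, fdm_grounding, action=1.0, disturbance=0.2):
    """Error of the frame used as context for the next prediction, in a 1-D toy world.

    Real world: z_t = z_{t-1} + action + disturbance (the disturbance is unmodelled).
    Model:      z_t = base + action, where base is real or imagined z_{t-1}.
    """
    real_prev = 0.0      # most recent real observation z_{t-1}
    imagined_prev = 0.0  # the model's own previous imagined frame
    errors = []
    for _ in range(n_steps):
        # Grounded: start from the real z_{t-1}. Naive: start from the model's own frame.
        base = real_prev if fdm_grounding else imagined_prev
        imagined = base + action
        real = real_prev + action + disturbance
        errors.append(abs(imagined - real))
        real_prev, imagined_prev = real, imagined
    return errors


naive = imagined_state_error(5, fdm_grounding=False)
grounded = imagined_state_error(5, fdm_grounding=True)
print([round(e, 1) for e in naive])     # [0.2, 0.4, 0.6, 0.8, 1.0]
print([round(e, 1) for e in grounded])  # [0.2, 0.2, 0.2, 0.2, 0.2]
```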

The result: Asynchronous inference with FDM grounding achieves comparable success rates to synchronous inference while completing tasks 2× faster. The naive async approach (without FDM grounding) degrades significantly, especially on long-horizon tasks (32.9% vs 93.2% at Horizon=3).

Why does naive asynchronous inference (without FDM grounding) degrade on long-horizon tasks?

Because the video model "continues" its own hallucinated predictions rather than staying grounded in real observations, causing progressive drift into open-loop mode
Because the GPU runs out of memory during long sequences
Because the action decoder is too slow for long horizons

Chapter 7: Training

LingBot-VA follows a two-phase training pipeline: massive-scale pre-training on diverse data, then lightweight post-training on specific robot tasks.

Phase 1: Pre-Training

The backbone is Wan2.2-5B, a large-scale pretrained video generation model. The action stream (350M parameters) is added on top, bringing the total to 5.3B parameters.

Pre-training data comes from two sources:

Internet video: Diverse in-the-wild videos that teach the model about physics, object dynamics, and scene evolution. No robot actions — just visual prediction.
Robot manipulation data: ~16K hours from six datasets (Agibot, RoboMind, InternData-A1, OXE, UMI, RoboCOIN) spanning diverse embodiments and tasks.

The model is pre-trained for 1.4 trillion tokens using AdamW with cosine annealing, bfloat16 mixed precision, and classifier-free guidance.

Unified Action Representation

Different robots have different action spaces. LingBot-VA defines a universal dual-arm representation: each arm gets 7 end-effector pose dimensions + 7 joint angle dimensions + 1 gripper dimension = 15 per arm, 30 total. Robots with fewer degrees of freedom get zero-padded.
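
A sketch of packing one robot's command into this 30-dimensional vector is shown below. The per-arm ordering [pose (7), joints (7), gripper (1)], the left-before-right layout, and the function name `to_unified_action` are assumptions made for illustration; the text only fixes the dimension counts and the zero-padding rule.

```python
import numpy as np

ARM_DIMS = 7 + 7 + 1        # end-effector pose + joint angles + gripper
TOTAL_DIMS = 2 * ARM_DIMS   # dual-arm representation


def to_unified_action(eef_pose, joints, gripper, arm="left"):
    """Place one arm's command into the 30-D dual-arm vector, zero-padding the rest.

    The slot ordering inside each arm and the arm order are assumed, not specified.
    """
    arm_vec = np.zeros(ARM_DIMS)
    arm_vec[:len(eef_pose)] = eef_pose
    arm_vec[7:7 + len(joints)] = joints  # fewer than 7 joints leaves trailing zeros
    arm_vec[14] = gripper
    unified = np.zeros(TOTAL_DIMS)
    offset = 0 if arm == "left" else ARM_DIMS
    unified[offset:offset + ARM_DIMS] = arm_vec
    return unified


# A hypothetical single-arm robot with 6 joints.
action = to_unified_action(eef_pose=np.ones(7), joints=np.full(6, 0.5), gripper=1.0)
print(action.shape, int(np.count_nonzero(action)))  # (30,) 14
```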

Variable Chunk Size

During training, the chunk size K is randomly sampled from [1, 4]. This teaches the model to generate coherent predictions at different temporal horizons. At inference, K=4 is used as a practical tradeoff between efficiency and responsiveness.

Phase 2: Post-Training

Adapting to a new robot platform requires remarkably little data. With just 50 demonstrations per task, the model is fine-tuned for 3K steps at a reduced learning rate (10^-5). This is sufficient for effective deployment.

Data efficiency: With 50 demonstrations, LingBot-VA achieves 97.0% on LIBERO — competitive with methods trained on far more data. With just 10 demonstrations, it still reaches 81.7%. The video pre-training provides such strong physical priors that very little robot-specific data is needed to ground them.

Training Objective

The total loss combines two flow matching objectives:

L = L_dyn + λ L_inv

L_dyn supervises video velocity field prediction (visual dynamics). L_inv supervises action velocity field prediction (inverse dynamics), conditioned on current and next observations. Both use the same flow matching framework, with λ=1.
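
Written out as code, the combined objective is just two velocity regression terms added together. The mean reduction over elements and the tensor shapes are assumptions of this sketch, and the predicted velocities are fabricated inputs chosen so the two terms are easy to read off.

```python
import numpy as np


def velocity_mse(v_pred, x1, eps):
    """Flow matching regression loss against the target velocity x1 - eps."""
    return float(np.mean((v_pred - (x1 - eps)) ** 2))


def total_loss(v_video, z1, eps_z, v_action, a1, eps_a, lam=1.0):
    """L = L_dyn + lam * L_inv, with lam = 1 as stated in the text."""
    l_dyn = velocity_mse(v_video, z1, eps_z)
    l_inv = velocity_mse(v_action, a1, eps_a)
    return l_dyn + lam * l_inv, l_dyn, l_inv


rng = np.random.default_rng(0)
z1, eps_z = rng.standard_normal((4, 16)), rng.standard_normal((4, 16))
a1, eps_a = rng.standard_normal((16, 30)), rng.standard_normal((16, 30))

# Hypothetical predictions: exact video velocity, action velocity off by 0.1 everywhere.
loss, l_dyn, l_inv = total_loss(z1 - eps_z, z1, eps_z, (a1 - eps_a) + 0.1, a1, eps_a)
# l_dyn is 0 and l_inv is about 0.01, so the total is about 0.01.
```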

Why can LingBot-VA achieve strong performance with only 50 demonstrations per task?

Because the video pre-training on internet video and robot data provides rich physical priors that transfer to new tasks, so post-training only needs to adapt the motor commands, not relearn physics
Because the model uses data augmentation to generate 1000x more training examples
Because 50 demonstrations contain enough variation for any manipulation task

Chapter 8: Results

LingBot-VA is evaluated on both simulation benchmarks and real-world robot tasks. The results are strong across the board.

Real-World Deployment (6 Tasks)

Six manipulation tasks spanning three categories, each with only 50 demonstrations:

Task	Category	LingBot-VA	π_0.5
Make Breakfast	Long-horizon	7 steps	~5 steps
Unpack Delivery	Long-horizon	Best	Lower
Insert Tubes	Precision	Best	Lower
Pick Screws	Precision	Best	Lower
Fold Clothes	Deformable	Best	Lower
Fold Pants	Deformable	Best	Lower

LingBot-VA outperforms π_0.5 on all six tasks and both metrics (success rate and progress score). The strongest gains appear on long-horizon tasks, validating the temporal memory advantage of autoregressive world modeling.

Simulation: RoboTwin 2.0 (50 Bimanual Tasks)

Method	Easy (avg)	Hard (avg)
X-VLA	72.9	72.8
π₀	65.9	58.4
π_0.5	82.7	76.8
Motus	88.7	87.0
LingBot-VA	92.9 (+4.2)	91.6 (+4.6)

The improvement grows with task horizon: at Horizon=3 (3-step tasks), LingBot-VA gains +8.2% (Easy) and +9.1% (Hard) over the next best method.

Simulation: LIBERO (4 Suites)

Method	Spatial	Object	Goal	Long	Avg
X-VLA	98.2	98.6	97.8	97.6	98.1
π₀	97.6	98.4	97.9	85.2	97.1
LingBot-VA	98.5	99.6	97.2	98.5	98.5

Data Efficiency

On LIBERO with varying numbers of demonstrations:

10 demos: 81.7% average success
25 demos: 92.9% average success
50 demos: 97.0% average success

Results Comparison: RoboTwin 2.0

Average success rate across 50 bimanual tasks. LingBot-VA (rightmost) consistently outperforms all baselines, with the gap widening on harder settings.

Why long-horizon gains are largest: On short tasks (Horizon=1), all methods perform well — even reactive policies can handle one-step manipulation. But as tasks grow longer, methods without persistent memory and closed-loop correction drift. LingBot-VA's KV cache maintains full context, and its closed-loop rollout continuously re-grounds predictions in reality.

Where does LingBot-VA show the largest improvement over prior methods?

On single-step simple manipulation tasks
On long-horizon multi-step tasks (Horizon=3), where it gains +8-9% over the next best method, validating its persistent memory and closed-loop correction
On tasks with many objects in the scene

Chapter 9: Connections

LingBot-VA sits at the intersection of several major threads in robotics and generative modeling. Let's map where it fits.

Relation to π₀ / π_0.5

π₀ is a flow-matching VLA that generates action chunks conditioned on visual observations. It's a feedforward policy — no world model, no video prediction. LingBot-VA extends this paradigm by adding an explicit video world model that predicts future visual states, enabling the action decoder to reason about consequences rather than just react.

Relation to UWM (Unified World Models)

UWM uses bidirectional diffusion within chunks to jointly generate video and actions. LingBot-VA's key departure is causal autoregressive generation across chunks, which provides persistent memory (KV cache) and natural closed-loop correction. UWM's bidirectional chunks can't easily incorporate real-time feedback mid-generation.

Relation to Diffusion Policy

Diffusion Policy uses diffusion to generate action sequences, but without any video prediction component. It's purely a policy model. LingBot-VA adds the "imagination" layer — predicting what the world will look like — which provides richer conditioning for action generation.

Relation to Video World Models

Video prediction models like Genie and UniSim can imagine future frames, but they're not designed for real-time robot control. LingBot-VA bridges this gap by making video prediction fast enough (via partial denoising and async inference) and tight enough (via MoT and inverse dynamics) for closed-loop manipulation.

Cheat Sheet

Aspect	LingBot-VA
Core idea	Video world model + action policy in one AR diffusion framework
Architecture	Mixture-of-Transformers (5B video + 350M action)
Generation	Autoregressive chunks, flow matching per chunk
Inference trick	Partial denoising (s=0.5) + async FDM grounding
Loop	Closed-loop: real observations replace predictions each step
Pre-training	Internet video + 16K hrs robot data, 1.4T tokens
Post-training	50 demos, 3K steps — sufficient for new tasks
Key result	92.9% on RoboTwin (50 tasks), 98.5% on LIBERO (avg)
Advantage	Long-horizon: +8-9% over best baseline at Horizon=3

The broader lesson: Combining generation and control in one model — rather than treating them as separate modules — enables the model to use its "imagination" of the future as a rich conditioning signal for action prediction. The causal autoregressive structure is what makes this practical: it provides persistent memory, respects physical causality, and enables efficient closed-loop correction with real-world feedback.

What is LingBot-VA's key architectural departure from chunk-based video-action models like UWM?

LingBot-VA uses a larger backbone
LingBot-VA uses causal autoregressive generation across chunks (with KV cache for persistent memory and natural closed-loop correction), while UWM uses bidirectional diffusion within independent chunks
LingBot-VA doesn't use diffusion at all

Causal World Modeling for Robot Control