Rhoda AI: Direct Video Action Models

Chapter 0: The Data Problem

Teaching a robot to do useful work is absurdly expensive. Training GPT-4 required text scraped from the whole internet — billions of documents, essentially free. Training a robot policy requires robot demonstrations: a human teleoperating an expensive arm, one task at a time, in the exact environment where the robot will work.

State-of-the-art Vision-Language-Action (VLA) models like RT-2 and OpenVLA need 100,000+ hours of robot demonstrations to reach industrial reliability. At one demo every 2 minutes, that is about 3 million demonstrations, or more than 11 years of round-the-clock collection on a single arm. You can parallelize with many arms and many operators, but the cost stays in the tens of millions of dollars per robot type.
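
A quick back-of-the-envelope check of that scale, using the 100,000-hour and 2-minute figures from the text. The 2,000 working hours per operator per year is an assumption added here for illustration.

```python
# Back-of-the-envelope check of the data-collection scale.
HOURS_OF_DEMOS = 100_000        # figure quoted in the text
MINUTES_PER_DEMO = 2            # figure quoted in the text
HOURS_PER_YEAR = 24 * 365       # one arm collecting around the clock
WORK_HOURS_PER_YEAR = 2_000     # assumption: one full-time operator

num_demos = HOURS_OF_DEMOS * 60 // MINUTES_PER_DEMO
continuous_years = HOURS_OF_DEMOS / HOURS_PER_YEAR
operator_years = HOURS_OF_DEMOS / WORK_HOURS_PER_YEAR

print(f"demos: {num_demos:,}")                        # demos: 3,000,000
print(f"continuous years: {continuous_years:.1f}")    # continuous years: 11.4
print(f"operator-years: {operator_years:.0f}")        # operator-years: 50
```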

The question Rhoda AI asked: What if robot data isn't the bottleneck? What if we could use the trillion hours of video already on the internet — cooking shows, sports, manufacturing footage, everyday human activity — and only need a tiny amount of robot-specific data?

The naive answer: "just pre-train on video, then fine-tune on robot data." Many labs have tried this. It helps, but not enough. The reason: a video model predicts pixels, while controlling a robot requires actions, and those are fundamentally different outputs.

Rhoda AI's answer: make them the same thing. If your robot policy IS a video model, then all that web video is directly usable robot training data. You never need to "transfer" from video to actions: the video model predicts what should happen next, and a separate inverse dynamics step, which is cheap to learn, turns those predictions into actions.

Data Cost Comparison

Training data required for a new robot task, shown as a function of the number of tasks.

Why do traditional VLA models require 100,000+ hours of robot data?

A. Robots are physically harder to train than language models
B. The models are larger so they need more data
C. Every (observation, action) training pair must come from actual robot teleoperation; internet video doesn't provide robot actions

Chapter 1: Video Generation = Robot Control

Here is the core insight. Say your robot is looking at a table with a cup on it. You want it to pick up the cup. Traditional VLA: the model sees the image, reads the instruction, and directly predicts the motor command [Δx=0.05, Δy=0.02, Δz=-0.01, ...].

Direct Video Action (DVA) does something different. The model sees the image, reads the instruction, and predicts future video frames — what the scene would look like if the robot were successfully completing the task. Then a second, simpler model watches those predicted future frames and figures out what motor commands would produce that motion.

Current frame

What the camera sees right now

↓ causal video model

Predicted future frames

What the scene SHOULD look like in 0.5s, 1s, 1.5s...

↓ inverse dynamics model

Robot actions

[Δx, Δy, Δz, Δrx, Δry, Δrz, gripper] per timestep
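
As a rough illustration of the two-stage pipeline above, the sketch below composes a video model and an inverse dynamics model into a single control step. The names `dva_step`, `toy_video_model`, and `toy_inverse_dynamics` are hypothetical, and the "frame" is reduced to a single number (a gripper position) so the example stays runnable. It is not Rhoda AI's implementation.

```python
from typing import Callable, List, Sequence

Frame = List[float]     # stand-in for an image; a real frame is a pixel array
Action = List[float]    # e.g. [dx, dy, dz, drx, dry, drz, gripper]

def dva_step(
    current_frame: Frame,
    instruction: str,
    video_model: Callable[[Frame, str], Sequence[Frame]],
    inverse_dynamics: Callable[[Frame, Frame], Action],
) -> List[Action]:
    """Predict future frames, then recover one action per frame transition."""
    future = list(video_model(current_frame, instruction))
    frames = [current_frame] + future
    return [inverse_dynamics(a, b) for a, b in zip(frames, frames[1:])]

# Toy stand-ins (hypothetical): the "frame" is just a 1-D gripper position.
def toy_video_model(frame: Frame, instruction: str) -> List[Frame]:
    target = 1.0 if "pick" in instruction else 0.0
    pos = frame[0]
    step = (target - pos) / 3
    return [[pos + step * k] for k in (1, 2, 3)]

def toy_inverse_dynamics(frame_t: Frame, frame_next: Frame) -> Action:
    # In this toy world the action is simply the change in position.
    return [frame_next[0] - frame_t[0]]

actions = dva_step([0.25], "pick up the cup", toy_video_model, toy_inverse_dynamics)
print(actions)  # [[0.25], [0.25], [0.25]]
```

The point of the structure is that `video_model` never sees an action and `inverse_dynamics` never sees the instruction: each half can be trained on its own data.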

Why would this work better? Because predicting plausible future video is a well-constrained problem with massive training data. Every YouTube video, every movie, every cooking tutorial is training data for "what does the world look like over time?" The video model learns physics, 3D structure, object permanence, and human behavior — for free, from data that already exists.

The key insight: "What will the scene look like?" is a much easier question to answer from web video than "what motor command should I send?" The hard part of robot control is understanding the world. The video model handles that. The inverse dynamics model only needs to solve the easier problem: given that I want the world to look like this, what motor commands get me there?

The inverse dynamics model is cheap because it only needs to map between two visual states and robot kinematics — a problem solvable from ~10 hours of robot data, because you don't need to learn the hard world-understanding part. That's already baked into the video model.

VLA vs DVA: What the Model Learns

A comparison of what each approach must learn from scratch and what it gets for free.

In DVA, what does the causal video model actually output?

A. Motor commands directly
B. Predicted future video frames
C. A language description of the action

Chapter 2: The Three Components

DVA has three interlocking pieces. Each has a distinct job, trained on different data at different scales.

1. Causal Video Model

Pre-trained on 1M+ internet videos. Takes as input: the current camera frame(s), proprioception (the robot's joint positions and velocities, encoded as a signal overlaid on the video), and a language instruction. Outputs: predicted future frames showing what the scene should look like as the task is completed.

The word causal is important. Many video generation models are bidirectional: during training they see both past and future frames and fill in the middle. Causal means the model can only attend to past frames, not future ones. This is essential for real-time robot control: you can't wait for the future to generate the present.
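
The difference between the two attention patterns can be shown with a small mask. This is a generic sketch of causal versus bidirectional masking over frames, not a description of the actual model's layers. The function name `attention_mask` is made up for this example.

```python
from typing import List

def attention_mask(num_frames: int, causal: bool) -> List[List[bool]]:
    """mask[i][j] is True when frame i may attend to frame j."""
    return [
        [(j <= i) if causal else True for j in range(num_frames)]
        for i in range(num_frames)
    ]

causal_mask = attention_mask(4, causal=True)
bidirectional_mask = attention_mask(4, causal=False)

# Print the causal mask: each frame sees itself and earlier frames only.
for row in causal_mask:
    print("".join("x" if allowed else "." for allowed in row))
# x...
# xx..
# xxx.
# xxxx
```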

Context amortization: DVA's video model predicts future frames at every position in the input sequence, not just at the end. This gives the robot a rich, multi-scale prediction of the future — the immediate next frame, frames 1 second out, frames 3 seconds out — all computed in one forward pass.
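
One way to picture "predicting at every position" is to list the training targets it implies. The sketch below assumes a fixed prediction horizon and uses string labels in place of frames. The helper `per_position_targets` is hypothetical, and how the real model arranges its losses is not specified in the text.

```python
from typing import List, Tuple

def per_position_targets(
    frames: List[str], horizon: int
) -> List[Tuple[List[str], List[str]]]:
    """For every position t, pair the context frames[:t+1] with the next `horizon` frames."""
    pairs = []
    for t in range(len(frames) - horizon):
        context = frames[: t + 1]
        targets = frames[t + 1 : t + 1 + horizon]
        pairs.append((context, targets))
    return pairs

clip = ["f0", "f1", "f2", "f3", "f4"]
pairs = per_position_targets(clip, horizon=2)
for context, targets in pairs:
    print(context, "->", targets)
# ['f0'] -> ['f1', 'f2']
# ['f0', 'f1'] -> ['f2', 'f3']
# ['f0', 'f1', 'f2'] -> ['f3', 'f4']
```

Predicting only at the end of the clip would give one training target per clip; predicting at every position gives three from the same five frames.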

2. Inverse Dynamics Model

Given two consecutive predicted video frames — what the scene looks like at time t and time t+1 — the inverse dynamics model computes: what robot action would cause that transition? It's trained on ~10 hours of robot data per embodiment type.

Why is this cheap to train? Because "figure out what motion produced this visual change" is a much narrower problem than "figure out what to do in the world." The visual model already did the hard reasoning. The inverse dynamics model just needs to know robot kinematics — and 10 hours of data is enough to learn that.
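
To make the "narrow problem" concrete, here is a deliberately tiny stand-in: a one-parameter inverse dynamics model that maps the observed pixel shift of a gripper to the commanded motion. A real inverse dynamics model would be a neural network over image pairs. The data, the 0.002 m/pixel scale, and the function names are all invented for illustration.

```python
from typing import List

def fit_scale(deltas: List[float], actions: List[float]) -> float:
    """Least-squares fit of actions ~ k * deltas (no intercept)."""
    numerator = sum(d * a for d, a in zip(deltas, actions))
    denominator = sum(d * d for d in deltas)
    return numerator / denominator

# Hypothetical logged robot data: pixel shift of the gripper between two
# consecutive frames, and the commanded motion (metres) that produced it.
pixel_shift = [10.0, -20.0, 5.0, 40.0]
commanded = [0.02, -0.04, 0.01, 0.08]   # generated with k = 0.002 m/pixel

k = fit_scale(pixel_shift, commanded)

def inverse_dynamics(pos_t: float, pos_next: float) -> float:
    """Given gripper positions in two frames, return the motion that links them."""
    return k * (pos_next - pos_t)

print(round(k, 6))                               # 0.002
print(round(inverse_dynamics(100.0, 125.0), 6))  # 0.05
```

The fit only has to recover a calibration between what is seen and what was commanded, which is why a small amount of robot data can suffice for this stage.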

3. Leapfrog Inference

Running a large video model takes time — maybe 100ms per step. But robots need commands every 10-30ms for smooth control. Leapfrog inference solves this: the video model generates predictions that extend far enough into the future to cover its own inference latency. While the model is computing the next batch of future frames, the robot executes actions from the previous prediction.
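
The timing constraint can be sketched with a small simulation. The model below is an assumption made for illustration, not the documented scheme: an inference launched at tick s returns `need` ticks later with actions for ticks s through s + horizon - 1, and a new inference is launched the moment the previous one returns. The function names are hypothetical.

```python
import math

def actions_to_cover(latency_ms: float, control_period_ms: float) -> int:
    """Number of control ticks that elapse during one inference call."""
    return math.ceil(latency_ms / control_period_ms)

def never_stalls(num_ticks: int, need: int, horizon: int) -> bool:
    """Return True if the robot has a planned action at every control tick."""
    plan_end = -need + horizon   # plan in hand at tick 0 was launched at tick -need
    pending_launch = 0           # a new inference is launched at tick 0
    for tick in range(num_ticks):
        if tick > 0 and tick % need == 0:
            # The inference launched `need` ticks ago returns now.
            plan_end = pending_launch + horizon
            pending_launch = tick
        if tick >= plan_end:
            return False         # no planned action left: the robot would wait
    return True

need = actions_to_cover(latency_ms=100, control_period_ms=20)
print(need)                                        # 5
print(never_stalls(1000, need, horizon=2 * need))  # True
print(never_stalls(1000, need, horizon=need + 3))  # False
```

Under this particular model the horizon has to be at least twice the latency (in ticks), because the first `need` actions of every plan are already in the past when it arrives. A system that conditions on in-flight actions could relax that bound.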

Leapfrog Inference Timeline

Predictions overlap to cover inference latency, so the robot never waits.

Why it works: The three components form a clean separation of concerns. The video model handles world understanding (hard, needs web-scale data). The inverse dynamics model handles robot grounding (easy, needs 10 hours). Leapfrog handles latency (engineering, handled at inference time). Each part is independently trainable and upgradeable.

Why does the inverse dynamics model only need ~10 hours of robot data?

A. Because it's a very small model
B. Because robot tasks are simple
C. Because it only needs to map visual changes to motor commands; the hard world-understanding is already done by the video model

Chapter 3: Pre-training on Web Video

The causal video model is the foundation of DVA — and its power comes entirely from what it learns before a robot ever enters the picture. Training on 1M+ internet videos, the model develops an internal physics engine: it learns that objects fall, that liquids pour, that cloth folds, that hands grasp.

None of this required labeling or annotation. The training signal is just: "given these past frames, predict the next frame." The model is forced to understand causality, occlusion, 3D structure, and dynamics to do this well. It learns all of this from human video — which is abundant, cheap, and richly varied.

What 1M+ videos buy you: The model has seen more physics than any robot has ever experienced. It's watched things fall, pour, roll, fold, shatter, and assemble in millions of contexts. When it then needs to predict "what happens when the robot arm approaches the cup," it's not guessing — it's pattern-matching against a vast library of visually similar scenes.

The pre-training investment is also amortized across every future task: any task the robot needs to do in the physical world is likely covered, at least approximately, by some video in the training set. A robot manipulating a coffee machine has seen humans use coffee machines. A robot sorting boxes has seen people pack and unpack things. (This is a different sense of amortization from context amortization, which refers to predicting future frames at every position in the input sequence.)

The model also learns long-context visual memory. DVA's video model processes hundreds of frames as context — not just the most recent one. This lets it maintain a coherent model of the scene over time: remembering where objects were placed two minutes ago, tracking the state of a partially-assembled task.

What Web Video Teaches

Each skill comes either from web data or from robot-specific training.

What is "context amortization" in DVA?

A. Compressing the video context to save memory
B. Predicting future frames at every position in the sequence (not just the end), so the model develops a rich multi-scale view of the future in one pass
C. Using a shorter history window to reduce compute

Chapter 4: Post-training on Robot Data

After web video pre-training, the model knows the world. But it doesn't yet know this robot, in this environment, with this proprioception signal. That's what post-training provides — and it's where DVA's efficiency advantage becomes concrete.

Post-training has two stages. First, the causal video model is fine-tuned on robot video: recordings of the robot operating in its target environment. This teaches it the visual appearance of the robot, the specific lighting and camera angles, and what successful task execution looks like from the robot's viewpoint. This requires ~10-20 hours of footage.

10-20 hours vs 100,000+ hours: Traditional VLA methods require the model to learn world physics AND robot control simultaneously from robot data alone. DVA's model already knows physics — it just needs to adapt the visual style and learn the robot's specific proprioception encoding. That's a tiny fraction of the learning problem.

Second, the inverse dynamics model is trained. Given pairs of (frame at time t, frame at time t+1) from robot video, it learns: what action was the robot executing between these two frames? This is pure supervised learning on kinematics data, and it converges fast.
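
The supervised dataset for this stage can be built directly from logged episodes. The sketch below assumes each episode stores one action per frame transition. The helper name `idm_examples` and the string placeholders for frames and actions are hypothetical.

```python
from typing import List, Tuple, TypeVar

F = TypeVar("F")  # frame type
A = TypeVar("A")  # action type

def idm_examples(frames: List[F], actions: List[A]) -> List[Tuple[Tuple[F, F], A]]:
    """Turn one logged episode into ((frame_t, frame_t+1), action_t) examples."""
    if len(actions) != len(frames) - 1:
        raise ValueError("expected exactly one action per frame transition")
    return [((frames[t], frames[t + 1]), actions[t]) for t in range(len(actions))]

episode_frames = ["img0", "img1", "img2", "img3"]
episode_actions = ["a0", "a1", "a2"]
examples = idm_examples(episode_frames, episode_actions)
for (frame_t, frame_next), action in examples:
    print(frame_t, frame_next, "->", action)
# img0 img1 -> a0
# img1 img2 -> a1
# img2 img3 -> a2
```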

One remarkable result: one-shot learning from a single human demo. For some tasks, a single demonstration video of a human performing the task — not even using the robot — is enough. The video model generalizes so well that one example unlocks novel objects, novel arrangements, and novel environments. The inverse dynamics model handles the "translate human motion to robot commands" step automatically.

Stage 1: Web pre-training

1M+ videos · Learns physics, 3D, object behavior

↓

Stage 2: Robot video fine-tune

~10-20h · Adapts visual domain, learns proprioception

↓

Stage 3: Inverse dynamics

Same ~10-20h of robot data · Learns frame pair → action mapping

↓

Deployment

Leapfrog inference · Real-time robot control

Why can DVA achieve one-shot learning from a single human demo?

A. The video model already understands physics and scenes; it just needs to see the target task once to generalize, and the inverse dynamics model handles the human-to-robot translation
B. The robot has a large action vocabulary that covers most tasks
C. Human demos are higher quality than robot teleoperation data

Chapter 5: Results

Rhoda AI tested DVA on three industrial logistics tasks at Decathlon, a large sporting-goods retailer. These aren't toy demonstrations — they're end-to-end warehouse operations with real inventory, real uncertainty, and real time pressure.

Decanting

Taking items out of supplier boxes and placing them into warehouse bins. Requires recognizing each item, choosing the right bin, and handling items of varying size, shape, and weight.

90 minutes autonomous · 11 hours training data

Container Breakdown

Processing incoming freight containers: identifying, sorting, and staging items for warehouse storage. Higher complexity — more item variety, more spatial reasoning.

160 minutes autonomous · 17 hours training data

Returns Processing

Handling returned clothing — inspecting condition, refolding, and re-stocking. This is the hardest task: deformable objects (fabric) with infinite possible configurations.

End-to-end with no scaffolding · Single human demo per garment type

One-Shot Transfer

Novel objects and environments from a single human demonstration. No robot-specific data collected for these scenarios at all.

Succeeds on out-of-distribution objects and layouts

Autonomous Operation Duration — DVA vs Traditional VLA

Time the robot runs autonomously before needing human intervention. Training data shown below each bar.

The real test: 160 minutes of continuous autonomous operation for Container Breakdown is not an academic result — it's a warehouse shift. The robot is operating for nearly 3 hours without a human touching it. That's the bar for industrial deployment, and DVA cleared it with 17 hours of training data.

Which task demonstrated the most challenging scenario for DVA?

A. Decanting, because it has the most items
B. Container breakdown, because it runs the longest
C. Returns processing, because deformable objects like clothing have infinite possible configurations that cannot be exhaustively demonstrated

Chapter 6: DVA vs VLAs

DVA and VLAs solve the same problem — robot control from visual observations and language — but make opposite bets about where the bottleneck is.

Dimension	VLA (RT-2, OpenVLA)	DVA (Rhoda AI)
Training data	100K+ hours robot demos	10–20 hours robot demos
Pre-training signal	Language + robot actions	Web video (1M+ videos)
Output	Action tokens directly	Future frames → actions
Context length	Typically 1–16 frames	Hundreds of frames
Interpretability	Black box output	Predicted video is inspectable
New embodiment cost	100K+ hours new data	~10 hours new data
One-shot learning	Limited	Demonstrated on novel objects
Continuous operation	Minutes (research demos)	160 minutes (industrial)

The interpretability bonus: Because DVA generates future video, you can actually look at what the robot plans to do. If the predicted future frames show the robot approaching the wrong object, you can catch that before it executes. VLAs output an action vector — you can't tell from [0.05, -0.02, 0.01, ...] whether the robot is about to do something reasonable or catastrophically wrong.
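
The text describes a person looking at the predicted frames. The same idea can also be automated as a gate that releases actions only while the predicted frames pass a check. The sketch below is an illustration under that assumption: `gated_actions` and the workspace limit are hypothetical, and the "frame" is reduced to a predicted gripper x-position.

```python
from typing import Callable, List, Sequence, TypeVar

F = TypeVar("F")  # predicted frame type
A = TypeVar("A")  # action type

def gated_actions(
    predicted_frames: Sequence[F],
    actions: Sequence[A],
    frame_ok: Callable[[F], bool],
) -> List[A]:
    """Release actions only up to the first predicted frame that fails the check."""
    released = []
    for frame, action in zip(predicted_frames, actions):
        if not frame_ok(frame):
            break
        released.append(action)
    return released

# Toy example: each "frame" is the predicted gripper x-position in metres.
predicted = [0.10, 0.20, 0.35, 0.50]
planned = ["a0", "a1", "a2", "a3"]

def in_workspace(x: float) -> bool:
    return x <= 0.30   # hypothetical workspace limit

released = gated_actions(predicted, planned, in_workspace)
print(released)  # ['a0', 'a1']
```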

VLAs have one advantage: they're end-to-end differentiable. If the action is wrong, the gradient flows directly back to fix it. DVA has a hard boundary between the video model and the inverse dynamics model — errors in one don't automatically correct the other. In practice, DVA's empirical performance has been strong enough that this hasn't been a limiting factor, but it's a real architectural difference.

The bigger picture: DVA and VLAs represent different hypotheses about the structure of the robot learning problem. VLAs bet that actions are the right abstraction. DVA bets that video is the right abstraction, and actions are a derived quantity. Both approaches may ultimately be necessary — perhaps future systems will combine both.

What is the interpretability advantage of DVA over VLAs?

A. DVA has fewer parameters so it's easier to inspect
B. DVA generates predicted future frames that can be visually inspected, letting you see the robot's "plan" before it executes
C. DVA outputs actions in human-readable format

Chapter 7: Connections

DVA doesn't exist in isolation. It sits at the intersection of several threads in AI and robotics research.

Related Approaches

System	What it shares with DVA	Key difference
pi-0 (Physical Intelligence)	Flow-matching policy, large-scale pre-training, manipulation	Predicts actions directly; flow matching generates actions, not video
OpenVLA	Open weights, industrial ambition	7B LLM backbone outputting discretized action tokens
UniSim (Google)	Video model for robot planning	Uses video as a world model for planning, not as the direct policy
Genie (DeepMind)	Video generation from single frames	Interactive world model for games, not robot control
SWIM / GR00T (NVIDIA)	Video model + robot	Generates video as a planning intermediate, uses imitation learning to extract policy

Explore the building blocks

Vision-Language-Action Models →
The VLA approach DVA is an alternative to. Understand RT-2, OpenVLA, and behavioral cloning.

World Models →
DVA's video model is essentially a world model. The connection to Dreamer and RSSM is deep.

Diffusion Models →
Some video generation approaches use diffusion, and pi-0 uses the closely related flow matching. Understanding diffusion helps contextualize DVA's architectural choices.

Imitation Learning →
DVA's post-training is a form of imitation learning. The challenges of distribution shift and compounding errors still apply.

The bigger question: DVA's success suggests that the hard part of robot learning is world understanding, not robot-specific control. If that's true, the future of robotics is training larger and larger video models on more and more internet data — and robot-specific data becomes a thin adaptation layer. This is the same story as LLMs: the hard part was pre-training on the internet; fine-tuning to specific tasks is relatively cheap.

"The world is mostly video. If we can learn to predict it accurately, we can learn to act in it." — the bet behind DVA