Robot control reduced to video generation: 10 to 20 hours of robot data, up to 160 minutes of autonomous operation.
Teaching a robot to do useful work is absurdly expensive. Training GPT-4 required text scraped from the whole internet — billions of documents, essentially free. Training a robot policy requires robot demonstrations: a human teleoperating an expensive arm, one task at a time, in the exact environment where the robot will work.
State-of-the-art Vision-Language-Action (VLA) models like RT-2 and OpenVLA need 100,000+ hours of robot demonstrations to reach industrial reliability. At one demonstration per 2 minutes, that is about 3 million demonstrations, or more than 11 years of round-the-clock teleoperation for a single operator. You can parallelize with many arms and many operators, but the cost stays in the tens of millions of dollars per robot type.
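A quick check of the arithmetic, taking the 100,000-hour and 2-minute figures above at face value. The 2,000 working hours per year is an assumption added here for one full-time operator; it is not a figure from the source.

```python
HOURS_REQUIRED = 100_000         # demonstration hours quoted in the text
MINUTES_PER_DEMO = 2             # demonstration length quoted in the text
HOURS_PER_YEAR_CONTINUOUS = 24 * 365
WORK_HOURS_PER_YEAR = 2_000      # assumption: one full-time operator

def demo_count(hours, minutes_per_demo):
    """Number of demonstrations that fit into the given number of hours."""
    return hours * 60 // minutes_per_demo

def years_continuous(hours):
    """Years of round-the-clock collection by a single operator."""
    return hours / HOURS_PER_YEAR_CONTINUOUS

def years_full_time(hours):
    """Years of collection by a single operator working full time."""
    return hours / WORK_HOURS_PER_YEAR

demos = demo_count(HOURS_REQUIRED, MINUTES_PER_DEMO)
print(f"{demos:,} demonstrations")
print(f"{years_continuous(HOURS_REQUIRED):.1f} years around the clock")
print(f"{years_full_time(HOURS_REQUIRED):.0f} years at full-time hours")
```

Either way, the total is far beyond what a single operator could collect, which is why the effort has to be spread across many arms and many people.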
The naive answer: "just pre-train on video then fine-tune on robot data." Many labs have tried this. It helps, but not enough. The reason: a video model predicts pixels, while controlling a robot requires actions, and those are fundamentally different outputs.
Rhoda AI's answer: make them the same thing. If your robot policy IS a video model, then all that web video is directly usable robot training data. You never need to "transfer" from video to actions — the video model generates actions through an inverse step that is cheap to learn.
Figure: training data required for a new robot task.
Here is the core insight. Say your robot is looking at a table with a cup on it. You want it to pick up the cup. Traditional VLA: the model sees the image, reads the instruction, and directly predicts the motor command [Δx=0.05, Δy=0.02, Δz=-0.01, ...].
Direct Video Action (DVA) does something different. The model sees the image, reads the instruction, and predicts future video frames — what the scene would look like if the robot were successfully completing the task. Then a second, simpler model watches those predicted future frames and figures out what motor commands would produce that motion.
Why would this work better? Because predicting plausible future video is a well-constrained problem with massive training data. Every YouTube video, every movie, every cooking tutorial is training data for "what does the world look like over time?" The video model learns physics, 3D structure, object permanence, and human behavior — for free, from data that already exists.
The inverse dynamics model is cheap because it only needs to map between two visual states and robot kinematics — a problem solvable from ~10 hours of robot data, because you don't need to learn the hard world-understanding part. That's already baked into the video model.
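The two-stage flow described above can be sketched with toy stand-ins. This is a minimal illustration, not Rhoda AI's implementation: a "frame" is reduced to a single number (a 1-D gripper position), and the function names `predict_future_frames`, `inverse_dynamics`, and `dva_plan` are hypothetical.

```python
def predict_future_frames(current, goal, horizon):
    """Stand-in for the video model.

    Returns `horizon` future 'frames' that move in a straight line from the
    current position to the goal. A real model would generate images.
    """
    step = (goal - current) / horizon
    return [current + step * (i + 1) for i in range(horizon)]

def inverse_dynamics(frame_t, frame_next):
    """Stand-in for the inverse dynamics model.

    In this toy world the action is simply the displacement that explains
    the change between two consecutive frames.
    """
    return frame_next - frame_t

def dva_plan(current, goal, horizon):
    """Predict future frames first, then read actions off consecutive pairs."""
    frames = [current] + predict_future_frames(current, goal, horizon)
    return [inverse_dynamics(a, b) for a, b in zip(frames, frames[1:])]

actions = dva_plan(current=0.0, goal=1.0, horizon=4)
print(actions)
```

The point of the split is visible even in the toy: all task knowledge (where to go) lives in the frame predictor, while the inverse step only has to explain the change between two frames.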
Figure: what each approach must learn from scratch versus what it gets for free.
DVA has three interlocking pieces. Each has a distinct job, trained on different data at different scales.
The first piece is the causal video model, pre-trained on 1M+ internet videos. It takes as input the current camera frame(s), proprioception (the robot's joint positions and velocities, encoded as a signal overlaid on the video), and a language instruction. It outputs predicted future frames showing what the scene should look like as the task is completed.
The word causal is important. Standard video models are bidirectional — they see past and future frames during training and fill in the middle. Causal means the model can only attend to past frames, not future ones. This is essential for real-time robot control: you can't wait for the future to generate the present.
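The difference between causal and bidirectional attention comes down to a mask over frame positions. The sketch below is a generic illustration in plain Python, not the architecture of any specific model; the function names are chosen here for illustration.

```python
import math

def causal_mask(num_frames):
    """Row i lists which frames j frame i may attend to: only j <= i."""
    return [[j <= i for j in range(num_frames)] for i in range(num_frames)]

def bidirectional_mask(num_frames):
    """Every frame may attend to every other frame, past or future."""
    return [[True] * num_frames for _ in range(num_frames)]

def attention_weights(scores, allowed):
    """Softmax over the allowed positions only; blocked positions get weight 0."""
    peak = max(s for s, ok in zip(scores, allowed) if ok)
    exps = [math.exp(s - peak) if ok else 0.0 for s, ok in zip(scores, allowed)]
    total = sum(exps)
    return [e / total for e in exps]

mask = causal_mask(4)
# Frame 1 may look at frames 0 and 1, never at frames 2 and 3.
weights_for_frame_1 = attention_weights([0.0, 0.0, 0.0, 0.0], mask[1])
print(weights_for_frame_1)
```

With equal scores, frame 1 splits its attention evenly between frames 0 and 1 and puts nothing on the future, which is the property that makes frame-by-frame generation possible at run time.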
Given two consecutive predicted video frames — what the scene looks like at time t and time t+1 — the inverse dynamics model computes: what robot action would cause that transition? It's trained on ~10 hours of robot data per embodiment type.
Why is this cheap to train? Because "figure out what motion produced this visual change" is a much narrower problem than "figure out what to do in the world." The visual model already did the hard reasoning. The inverse dynamics model just needs to know robot kinematics — and 10 hours of data is enough to learn that.
Running a large video model takes time, on the order of 100 ms per step, but robots need commands every 10-30 ms for smooth control. Leapfrog inference solves this: the video model generates predictions that extend far enough into the future to cover its own inference latency. While the model is computing the next batch of future frames, the robot executes actions from the previous prediction.
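The timing argument can be checked with a small simulation. This is a sketch under stated assumptions, not the actual scheduler: the first prediction is ready at time zero, the model starts a new inference call the moment the previous one finishes, each new prediction replaces whatever is left of the old one, and the 100 ms and 20 ms figures are illustrative values in the ranges mentioned above.

```python
def min_horizon_steps(latency_ms, control_period_ms):
    """Smallest number of actions per prediction that covers one inference call."""
    return -(-latency_ms // control_period_ms)  # integer ceiling division

def count_stalls(latency_ms, control_period_ms, horizon_steps, total_ticks):
    """Count control ticks on which the robot has no action left to execute."""
    buffered = horizon_steps          # assumption: first prediction ready at t = 0
    next_ready_ms = latency_ms        # the next inference call is already running
    stalls = 0
    for tick in range(total_ticks):
        now_ms = tick * control_period_ms
        while now_ms >= next_ready_ms:
            buffered = horizon_steps  # fresh prediction replaces the old one
            next_ready_ms += latency_ms
        if buffered > 0:
            buffered -= 1             # execute one buffered action
        else:
            stalls += 1               # nothing to execute: the robot waits
    return stalls

needed = min_horizon_steps(latency_ms=100, control_period_ms=20)
print(needed)
print(count_stalls(100, 20, horizon_steps=needed, total_ticks=50))
print(count_stalls(100, 20, horizon_steps=3, total_ticks=50))
```

A prediction that covers at least one full inference call leaves no gaps; a shorter horizon makes the robot idle on part of every cycle.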
Figure: predictions overlap to cover inference latency, so the robot never waits.
The causal video model is the foundation of DVA — and its power comes entirely from what it learns before a robot ever enters the picture. Training on 1M+ internet videos, the model develops an internal physics engine: it learns that objects fall, that liquids pour, that cloth folds, that hands grasp.
None of this required labeling or annotation. The training signal is just: "given these past frames, predict the next frame." The model is forced to understand causality, occlusion, 3D structure, and dynamics to do this well. It learns all of this from human video — which is abundant, cheap, and richly varied.
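The reason no labels are needed is that the video supplies its own targets. A minimal sketch of how training examples fall out of a raw frame sequence, with frames represented as strings for readability and `context_len` as a hypothetical parameter for the maximum number of past frames:

```python
def next_frame_examples(frames, context_len):
    """Turn one unlabeled video into (past frames, next frame) training pairs."""
    examples = []
    for t in range(1, len(frames)):
        start = max(0, t - context_len)
        context = frames[start:t]   # what the model is allowed to see
        target = frames[t]          # what it must predict
        examples.append((context, target))
    return examples

video = ["f0", "f1", "f2", "f3"]
for context, target in next_frame_examples(video, context_len=2):
    print(context, "->", target)
```

Every frame after the first serves as a target once, so a video of N frames yields N - 1 examples without any human annotation.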
This is context amortization: the pre-training investment is amortized across every future task. Any task the robot needs to do in the physical world is likely covered, at least approximately, by some video in the training set. A robot manipulating a coffee machine has seen humans use coffee machines. A robot sorting boxes has seen people pack and unpack things.
The model also learns long-context visual memory. DVA's video model processes hundreds of frames as context — not just the most recent one. This lets it maintain a coherent model of the scene over time: remembering where objects were placed two minutes ago, tracking the state of a partially-assembled task.
Figure: where robots get each skill, from web data or from robot-specific training.
After web video pre-training, the model knows the world. But it doesn't yet know this robot, in this environment, with this proprioception signal. That's what post-training provides — and it's where DVA's efficiency advantage becomes concrete.
Post-training has two stages. First, the causal video model is fine-tuned on robot video: recordings of the robot operating in its target environment. This teaches it the visual appearance of the robot, the specific lighting and camera angles, and what successful task execution looks like from the robot's viewpoint. This requires ~10-20 hours of footage.
Second, the inverse dynamics model is trained. Given pairs of (frame at time t, frame at time t+1) from robot video, it learns: what action was the robot executing between these two frames? This is pure supervised learning on kinematics data, and it converges fast.
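As a toy version of this supervised step, suppose each frame is summarized by one visual feature (the gripper's pixel position) and each logged action is a commanded displacement. The sketch below fits a single scale factor by least squares. The data values and function names are made up for illustration; a real inverse dynamics model would be a neural network over image pairs.

```python
def make_pairs(positions, actions):
    """Pair each visual change with the action logged between the two frames.

    positions[t] is the visual feature at time t; actions[t] is the command
    executed between time t and time t + 1.
    """
    return [(positions[t + 1] - positions[t], actions[t]) for t in range(len(actions))]

def fit_inverse_dynamics(pairs):
    """Least-squares fit of action = w * visual_change (no intercept)."""
    numerator = sum(change * action for change, action in pairs)
    denominator = sum(change * change for change, _ in pairs)
    return numerator / denominator

def predict_action(w, frame_t, frame_next):
    """Recover the action that explains the change between two frames."""
    return w * (frame_next - frame_t)

# Hypothetical log: pixel positions and commanded displacements in metres.
positions = [0.0, 10.0, 30.0, 25.0]
actions = [0.1, 0.2, -0.05]

w = fit_inverse_dynamics(make_pairs(positions, actions))
print(w)
print(predict_action(w, 25.0, 40.0))
```

Three transitions are enough to pin down the scale here because the mapping has one parameter. The real problem has many more, but it is still far narrower than learning what to do in the world.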
One remarkable result: one-shot learning from a single human demo. For some tasks, a single demonstration video of a human performing the task — not even using the robot — is enough. The video model generalizes so well that one example unlocks novel objects, novel arrangements, and novel environments. The inverse dynamics model handles the "translate human motion to robot commands" step automatically.
Rhoda AI tested DVA on three industrial logistics tasks at Decathlon, a large sporting-goods retailer. These aren't toy demonstrations — they're end-to-end warehouse operations with real inventory, real uncertainty, and real time pressure.
Taking items out of supplier boxes and placing them into warehouse bins. Requires recognizing each item, choosing the right bin, and handling items of varying size, shape, and weight.
90 minutes autonomous · 11 hours training data
Processing incoming freight containers: identifying, sorting, and staging items for warehouse storage. Higher complexity — more item variety, more spatial reasoning.
160 minutes autonomous · 17 hours training data
Handling returned clothing — inspecting condition, refolding, and re-stocking. This is the hardest task: deformable objects (fabric) with infinite possible configurations.
End-to-end with no scaffolding · Single human demo per garment type
Novel objects and environments from a single human demonstration. No robot-specific data collected for these scenarios at all.
Succeeds on out-of-distribution objects and layouts
Figure: time the robot runs autonomously before needing human intervention, with training data shown below each bar.
DVA and VLAs solve the same problem — robot control from visual observations and language — but make opposite bets about where the bottleneck is.
| Dimension | VLA (RT-2, OpenVLA) | DVA (Rhoda AI) |
|---|---|---|
| Training data | 100K+ hours robot demos | 10–20 hours robot demos |
| Pre-training signal | Language + robot actions | Web video (1M+ videos) |
| Output | Action tokens directly | Future frames → actions |
| Context length | Typically 1–16 frames | Hundreds of frames |
| Interpretability | Black box output | Predicted video is inspectable |
| New embodiment cost | 100K+ hours new data | ~10 hours new data |
| One-shot learning | Limited | Demonstrated on novel objects |
| Continuous operation | Minutes (research demos) | 160 minutes (industrial) |
VLAs have one advantage: they're end-to-end differentiable. If the action is wrong, the gradient flows directly back to fix it. DVA has a hard boundary between the video model and the inverse dynamics model — errors in one don't automatically correct the other. In practice, DVA's empirical performance has been strong enough that this hasn't been a limiting factor, but it's a real architectural difference.
The bigger picture: DVA and VLAs represent different hypotheses about the structure of the robot learning problem. VLAs bet that actions are the right abstraction. DVA bets that video is the right abstraction, and actions are a derived quantity. Both may ultimately be necessary, and future systems may combine them.
DVA doesn't exist in isolation. It sits at the intersection of several threads in AI and robotics research.
| System | What it shares with DVA | Key difference |
|---|---|---|
| pi-0 (Physical Intelligence) | Flow-matching policy, large-scale pre-training, manipulation | Predicts actions directly; flow matching generates actions, not video |
| OpenVLA | Open weights, industrial ambition | 7B LLM backbone outputting discretized action tokens |
| UniSim (Google) | Video model for robot planning | Uses video as a world model for planning, not as the direct policy |
| Genie (DeepMind) | Video generation from single frames | Interactive world model for games, not robot control |
| SWIM (CMU) / GR00T (NVIDIA) | Video model + robot | Generates video as a planning intermediate, uses imitation learning to extract policy |
"The world is mostly video. If we can learn to predict it accurately, we can learn to act in it." — the bet behind DVA