How foundation models learned to move robots — translating pixels and instructions into physical actions in the real world.
Vision-Language Models (VLMs) can look at an image and answer questions. But what if instead of answering with words, the model answered with motor commands? That's the leap from VLM to VLA (Vision-Language-Action): a model that sees the world, understands a language instruction, and outputs physical actions.
The insight is deceptively simple: if a VLM can generate the token sequence "pick up the red cup," why can't it instead generate the action sequence [move_to(0.3, 0.5, 0.2), close_gripper()]? A VLA treats actions as just another modality — another kind of output token.
Watch how the same backbone produces different outputs: text for VLMs, actions for VLAs.
The simplest way to teach a robot: show it what to do. A human demonstrates the task (teleoperation), recording observations and actions. Then we train a neural network to predict the expert's action given the current observation. This is behavioral cloning (BC) — supervised learning for robotics.
BC is simple but fragile. If the robot drifts slightly off the demonstrated path, it encounters states it has never seen before — the distribution shift problem. Small errors compound, and the robot spirals into failure.
The teal path is the expert demo. The red path is the BC agent. Watch errors compound as it drifts away.
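Behavioral cloning really is just supervised regression on (observation, action) pairs. Here is a minimal sketch with a synthetic linear "expert" standing in for teleoperated demonstrations; a real system would record teleoperation data and fit a neural network instead of a linear map.

```python
import numpy as np

# Hedged sketch: behavioral cloning as plain supervised regression.
# The "expert" is a hypothetical linear controller; real BC would use
# teleoperated (observation, action) pairs and a neural network policy.

rng = np.random.default_rng(0)

# Synthetic expert: action = K_true @ obs  (4-dim obs -> 2-dim action)
K_true = rng.normal(size=(2, 4))
obs = rng.normal(size=(500, 4))        # recorded observations
actions = obs @ K_true.T               # expert's demonstrated actions

# BC: fit policy weights by minimizing mean-squared error on the demos
K_hat = np.zeros((2, 4))
lr = 0.1
for _ in range(200):
    pred = obs @ K_hat.T
    grad = 2 * (pred - actions).T @ obs / len(obs)
    K_hat -= lr * grad

print(np.max(np.abs(K_hat - K_true)))  # tiny -> cloned policy matches expert
```

Note what this sketch cannot show: the loss is only measured on states the expert visited, which is exactly why the trained policy has no idea what to do once it drifts off-distribution.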
How do you represent "move the arm"? This choice is critical. Common representations include:
| Representation | Format | Pros | Cons |
|---|---|---|---|
| Joint angles | [θ1,...,θ7] | Direct motor control | Robot-specific |
| End-effector pose | [x, y, z, rx, ry, rz, grip] | Intuitive, transferable | Needs IK solver |
| Discrete bins | Token index per dim | Works with language models | Loses precision |
| Continuous vectors | Raw floats | Full precision | Harder for LLMs |
Drag the slider to change the number of bins. More bins = more precision but larger vocabulary. The green dot shows the discretized position.
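The precision/vocabulary trade-off from the table can be made concrete: uniformly discretize a value in [-1, 1] into N bins and measure the round-trip error. Bin edges and ranges here are illustrative.

```python
import numpy as np

# Sketch of the precision/vocabulary trade-off: discretize a continuous
# value in [-1, 1] into n_bins uniform bins, then recover the bin center.

def discretize(x, n_bins):
    """Map x in [-1, 1] to a bin index in {0, ..., n_bins - 1}."""
    x = np.clip(x, -1.0, 1.0)
    return np.minimum(((x + 1.0) / 2.0 * n_bins).astype(int), n_bins - 1)

def undiscretize(idx, n_bins):
    """Map a bin index back to the center of its bin."""
    return (idx + 0.5) / n_bins * 2.0 - 1.0

x = 0.374
for n_bins in (16, 64, 256):
    err = abs(undiscretize(discretize(np.array(x), n_bins), n_bins) - x)
    print(n_bins, round(err, 5))  # error shrinks as the bin count grows
```

The worst-case error is half a bin width (1/n_bins here), so doubling the bins halves the precision loss but doubles the vocabulary the model must predict over.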
RT-2 (Robotics Transformer 2) by Google DeepMind is the landmark paper that proved VLMs can directly control robots. The key insight: fine-tune a VLM so that instead of generating text, it generates action tokens.
RT-2 takes a PaLM-E or PaLI-X vision-language model, discretizes robot actions into 256 bins per dimension, maps them to string tokens ("128", "64", "255"), and co-fine-tunes on robot demonstrations alongside web-scale vision-language data.
See how a continuous 7-DOF action is tokenized into text. Each dimension maps to a bin number.
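In code, RT-2-style tokenization is a clip, a binning, and a string join. The dimension ordering and clipping range below are illustrative, not RT-2's exact values.

```python
import numpy as np

# Hedged sketch of RT-2-style action tokenization: each of the 7 action
# dimensions (e.g. Δx, Δy, Δz, Δroll, Δpitch, Δyaw, gripper) is clipped,
# discretized into 256 bins, and emitted as a number token.

def action_to_tokens(action, low=-1.0, high=1.0, n_bins=256):
    a = np.clip(np.asarray(action), low, high)
    idx = np.minimum(((a - low) / (high - low) * n_bins).astype(int), n_bins - 1)
    return " ".join(str(i) for i in idx)

def tokens_to_action(tokens, low=-1.0, high=1.0, n_bins=256):
    idx = np.array([int(t) for t in tokens.split()])
    return (idx + 0.5) / n_bins * (high - low) + low

act = [0.1, -0.4, 0.25, 0.0, 0.0, 0.3, 1.0]
tok = action_to_tokens(act)
print(tok)                               # "140 76 160 128 128 166 255"
recovered = tokens_to_action(tok)
print(np.max(np.abs(recovered - act)))   # at most half a bin width
```

Because the output is an ordinary string of number tokens, the same language-model head that generates words can generate it, which is the whole trick.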
What if instead of predicting a single action, the robot could sample from a distribution over actions? Diffusion Policy applies the same denoising process used in image generation to robot action prediction.
Starting from random noise, the model iteratively refines an action trajectory. This naturally handles multimodal action distributions — when there are multiple valid ways to do something (reach from the left or right), the diffusion model can represent all of them.
Watch random noise get denoised into a clean action trajectory. Each step removes noise. The green path is the final clean trajectory.
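The reverse (denoising) loop has a simple structure. In the sketch below, the learned noise predictor is replaced by an oracle that knows the clean trajectory, so the loop runs end-to-end; the schedule, step count, and trajectory shape are all illustrative.

```python
import numpy as np

# Minimal sketch of the DDPM-style reverse loop used by diffusion policies.
# A trained network eps_theta(x_t, t, obs) would predict the noise; here a
# stand-in "oracle" uses a known clean trajectory so the code is runnable.

rng = np.random.default_rng(0)
T = 50                                   # diffusion steps
betas = np.linspace(1e-4, 0.05, T)       # noise schedule (illustrative)
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)

# "Clean" 16-step, 2-D action trajectory the policy should produce
clean = np.stack([np.linspace(0, 1, 16), np.sin(np.linspace(0, 3, 16))], 1)

def eps_oracle(x_t, t):
    # Stand-in for the learned predictor: with x_t = sqrt(ab)*x0 + sqrt(1-ab)*eps,
    # the noise is recoverable given the known clean trajectory x0.
    return (x_t - np.sqrt(alpha_bar[t]) * clean) / np.sqrt(1 - alpha_bar[t])

x = rng.normal(size=clean.shape)         # start from pure noise
for t in reversed(range(T)):
    eps = eps_oracle(x, t)
    mean = (x - betas[t] / np.sqrt(1 - alpha_bar[t]) * eps) / np.sqrt(alphas[t])
    noise = rng.normal(size=x.shape) if t > 0 else 0.0
    x = mean + np.sqrt(betas[t]) * noise

print(np.max(np.abs(x - clean)))         # converges to the clean trajectory
```

With a real learned predictor, different noise seeds can land in different modes of the action distribution, which is how the model represents "reach from the left or from the right" at once.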
Predicting one action at a time is reactive and jerky. Action Chunking (from ACT — Action Chunking with Transformers) predicts an entire sequence of future actions at once. Instead of "what should I do now?", the model answers "what should I do for the next H steps?"
This turns control into a trajectory prediction problem. The robot executes a chunk of actions, then re-plans. This produces smoother motion and handles temporal correlation in the demonstrations.
Compare single-step (reactive) control with chunked (planned) control. Notice how chunking is smoother.
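The control loop change is small but consequential: query the policy once per chunk, not once per step. The "policy" below is a stand-in that plans equal steps toward a goal; a real chunking policy like ACT predicts the H actions with a transformer.

```python
import numpy as np

# Sketch of chunked execution: predict H actions at once, execute the
# chunk open-loop, then re-plan. The planner here is a hypothetical
# stand-in that takes equal steps toward a fixed goal.

H = 8  # chunk size (ACT used much longer chunks; 8 keeps the demo short)

def plan_chunk(state, goal, horizon=H):
    # Stand-in policy output: horizon future actions (equal steps to goal)
    return np.tile((goal - state) / horizon, (horizon, 1))

state = np.zeros(2)
goal = np.array([1.0, 0.5])
trajectory = [state.copy()]
for _ in range(3):                 # 3 chunks of H steps each
    chunk = plan_chunk(state, goal)
    for action in chunk:           # execute the whole chunk open-loop
        state = state + action
        trajectory.append(state.copy())

print(state)                       # reaches the goal, then holds
```

Because consecutive actions come from one prediction, they are mutually consistent, which is where the smoothness comes from; re-planning at chunk boundaries keeps the policy from drifting too far open-loop.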
OpenVLA is the open-source counterpart to RT-2: a 7B-parameter VLA that pairs a Llama 2 language backbone with fused SigLIP and DINOv2 vision encoders, trained on the Open X-Embodiment dataset. It demonstrated that smaller, open models can rival much larger proprietary VLAs.
Trace the path from image + instruction to robot action. Each color represents a different processing stage.
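The same stages can be traced at the level of array shapes. The random matrices below are placeholders for the pretrained encoders and backbone (real OpenVLA uses SigLIP/DINOv2 features and a Llama 2 transformer); only the data flow is the point.

```python
import numpy as np

# Shape-level sketch of a VLA forward pass. Every learned component is
# replaced by a random placeholder; dimensions are illustrative.

rng = np.random.default_rng(0)
d = 64                                   # model width (illustrative)

# 1. Vision encoder: image -> patch embeddings
image = rng.normal(size=(224, 224, 3))
patches = image.reshape(14, 16, 14, 16, 3).mean((1, 3, 4))     # 14x14 grid
vis_tokens = patches.reshape(-1, 1) @ rng.normal(size=(1, d))  # 196 x d

# 2. Language tokenizer: instruction -> token embeddings
instruction = "pick up the red cup"
txt_tokens = rng.normal(size=(len(instruction.split()), d))    # 5 x d

# 3. Backbone consumes the concatenated multimodal sequence
sequence = np.concatenate([vis_tokens, txt_tokens])            # 201 x d

# 4. The LM head emits 7 action tokens (one bin index per action dim)
logits = sequence.mean(0) @ rng.normal(size=(d, 7 * 256))
action_bins = logits.reshape(7, 256).argmax(-1)

# 5. De-tokenize bins back to continuous actions in [-1, 1]
action = (action_bins + 0.5) / 256 * 2 - 1
print(action.shape)                      # (7,)
```

The takeaway is architectural: nothing after stage 3 knows whether the next tokens are words or actions, which is why a single pretrained backbone can serve both roles.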
A human can watch someone else cook and learn the recipe, even though their arms are different lengths. Can robots do the same? Cross-embodiment learning trains on data from many different robot types, hoping that high-level task knowledge transfers even when the hardware differs.
The Open X-Embodiment (OXE) dataset combined demonstrations from 22+ robot embodiments: single-arm manipulators, dual-arms, quadrupeds, dexterous hands. The finding: a single policy trained on all this data outperforms robot-specific policies on most tasks.
Different robots contribute demonstrations to a shared policy. Toggle embodiments on/off.
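One practical ingredient of mixing embodiments is putting every robot's actions on a shared scale before training. The sketch below normalizes each dataset with robust per-dimension quantile bounds; robot names and statistics are illustrative.

```python
import numpy as np

# Sketch of per-embodiment action normalization: rescale each robot's
# actions into roughly [-1, 1] using 1st/99th-percentile bounds so that
# demonstrations with very different action magnitudes can be mixed.

datasets = {
    "franka_arm": np.random.default_rng(0).normal(0.0, 0.02, size=(100, 7)),
    "widowx_arm": np.random.default_rng(1).normal(0.0, 0.30, size=(100, 7)),
}

mixed, stats = [], {}
for robot, actions in datasets.items():
    lo, hi = np.quantile(actions, [0.01, 0.99], axis=0)   # robust bounds
    stats[robot] = (lo, hi)                # kept to de-normalize at test time
    norm = 2 * (actions - lo) / (hi - lo) - 1
    mixed.append(np.clip(norm, -1, 1))

mixed = np.concatenate(mixed)              # one shared-scale training set
print(mixed.shape, mixed.min(), mixed.max())
```

The saved per-robot statistics are reapplied in reverse at deployment, so the shared policy's normalized outputs become valid commands for whichever embodiment is running it.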
Real robot data is expensive and slow to collect. Simulation is cheap and infinitely scalable. But policies trained in sim often fail in the real world — the reality gap. Visual differences (lighting, textures), physics mismatches (friction, contact), and sensor noise all contribute.
See how domain randomization helps bridge sim and real. Each refresh randomizes the simulated environment.
| Technique | How |
|---|---|
| Domain Randomization | Randomize textures, lighting, physics in sim |
| System Identification | Calibrate sim to match real physics |
| Progressive Nets | Train in sim, fine-tune with little real data |
| Foundation Models | VLMs pretrained on real images reduce the visual gap |
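Domain randomization is straightforward to implement: sample a fresh simulator configuration for every episode so the policy never sees the same rendering or physics twice. Parameter names, ranges, and the `env` API hinted at in the final comment are all illustrative.

```python
import random

# Sketch of domain randomization: draw new simulator parameters each
# episode so the policy cannot overfit to one visual/physics configuration.

def randomize_domain(rng):
    return {
        "light_intensity": rng.uniform(0.4, 1.6),
        "table_texture":   rng.choice(["wood", "metal", "cloth", "marble"]),
        "friction":        rng.uniform(0.5, 1.2),
        "object_mass_kg":  rng.uniform(0.05, 0.5),
        "camera_jitter_m": rng.uniform(0.0, 0.02),
    }

rng = random.Random(42)
for episode in range(3):
    params = randomize_domain(rng)
    print(episode, params)       # each episode trains under new conditions
    # env.reset(**params); run_policy(env)   # hypothetical simulator API
```

If the real world looks like just one more random draw from this distribution, the policy transfers; the cost is that training becomes harder because the task is effectively solved across a family of environments at once.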
We're at the beginning of the VLA era. Current models can follow simple instructions in controlled settings, but the gap to human-level dexterity and generalization remains vast. Key open challenges:
| Challenge | Current State | What's Needed |
|---|---|---|
| Dexterity | Basic grasping | In-hand manipulation, tool use |
| Long-horizon | 1-2 step tasks | Multi-step planning, error recovery |
| Safety | Lab environments | Real-world safety guarantees |
| Data scale | ~1M episodes | Internet-scale robot data |
| Speed | ~3Hz control | Real-time reactivity (100Hz+) |
Key milestones in the journey from pure VLMs to embodied agents.
You now understand the path from seeing to acting. VLAs are teaching machines to reach out and touch the world.