Vision-Language-Action Models
Foundation models that physically act in the world
What Is a VLA?
A Vision-Language-Action model is a VLM that outputs robot actions instead of (or in addition to) text. It takes an image from a robot camera and a language instruction — "pick up the red cup" — and produces motor commands: joint velocities, end-effector deltas, or gripper open/close signals.
The core insight: language models already reason about the world through text. If you can represent actions as tokens (or decode them from the LLM's hidden states), the same model that understands "red cup on the table" can plan how to reach for it. VLAs are the bridge from digital intelligence to physical action.
Architecture
The canonical VLA pipeline: an image and instruction enter a vision encoder and language tokenizer, their embeddings merge inside a pretrained LLM backbone, and an action head decodes the LLM's output into a vector of continuous motor commands.
Action Tokenization
How do you turn continuous 7-DoF actions into something an LLM can predict?
Discretize: bin each dimension into 256 values and treat the bin indices as vocabulary tokens. RT-2 appends these as special tokens to the LLM vocabulary, and the model autoregressively predicts [bin_x, bin_y, bin_z, bin_roll, bin_pitch, bin_yaw, bin_gripper] just like it would predict the next word.
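The binning scheme can be sketched in a few lines. The per-dimension bounds here are illustrative placeholders; real systems typically compute them from training-data statistics (e.g. low/high percentiles per dimension):

```python
import numpy as np

N_BINS = 256
# Illustrative bounds; in practice derived from the demonstration data.
LOW, HIGH = -1.0, 1.0

def tokenize(action):
    """Map a continuous action vector to discrete bin indices."""
    clipped = np.clip(action, LOW, HIGH)
    return ((clipped - LOW) / (HIGH - LOW) * (N_BINS - 1)).round().astype(int)

def detokenize(bins):
    """Map bin indices back to continuous values (bin centers)."""
    return LOW + bins / (N_BINS - 1) * (HIGH - LOW)

action = np.array([0.12, -0.43, 0.88, 0.0, 0.3, -0.9, 1.0])  # 7-DoF action
tokens = tokenize(action)
recovered = detokenize(tokens)
# Round-trip error is bounded by half a bin width: (HIGH-LOW)/(N_BINS-1)/2.
```

The round trip makes the precision trade-off concrete: with 256 bins over a [-1, 1] range, quantization error is at most about 0.004 per dimension, which is why finer control tasks push toward continuous heads.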
Action Head Designs
The action head is where VLAs diverge most. Five major strategies have emerged, each with distinct trade-offs in expressiveness, speed, and multimodality handling.
RT-2 Discrete Tokens
Bin each action dimension into 256 discrete tokens. Autoregressive prediction, same as text. Simple but quantization error limits precision.
Octo Diffusion Head
Condition a small diffusion model on the LLM's hidden state to denoise continuous actions. Captures multimodal action distributions naturally.
π₀ Flow Matching
Replace the diffusion head with flow matching for straighter ODE paths. Fewer denoising steps at inference, better for real-time control.
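With linear (rectified-flow) probability paths, the flow-matching training target reduces to a simple regression: interpolate between a noise sample and an expert action, and train the head to predict the constant velocity x₁ − x₀. A minimal numpy sketch of constructing one training batch (the demonstration actions are synthetic stand-ins):

```python
import numpy as np

rng = np.random.default_rng(0)

def flow_matching_batch(actions, batch=64):
    """Build one flow-matching training batch with linear paths.

    Samples noise x0 and data x1, interpolates x_t = (1-t)*x0 + t*x1,
    and returns the regression target v* = x1 - x0 (the path velocity).
    """
    idx = rng.integers(0, len(actions), size=batch)
    x1 = actions[idx]                      # expert actions (data endpoint)
    x0 = rng.normal(size=x1.shape)         # noise endpoint
    t = rng.uniform(size=(batch, 1))       # random time in [0, 1]
    x_t = (1 - t) * x0 + t * x1            # point on the straight-line path
    v_target = x1 - x0                     # velocity the action head regresses
    return x_t, t, v_target

actions = rng.normal(size=(1000, 7))       # stand-in for demo actions (7-DoF)
x_t, t, v_target = flow_matching_batch(actions)
```

Because the paths are straight, an ODE solver can integrate the learned velocity field in very few steps at inference, which is the source of the real-time advantage over many-step diffusion.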
Direct MLP Regression
A simple MLP maps the LLM's final hidden state to continuous action values. Fast, but assumes unimodal actions — struggles with multimodal demonstrations.
ACT Action Chunking
Predict a chunk of K future actions at once via a CVAE. Temporal consistency across the chunk reduces jitter. Introduced as ACT and demonstrated on the ALOHA bimanual platform.
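Chunked execution pairs naturally with temporal ensembling: query the policy every step, keep each chunk's overlapping predictions, and execute a weighted average for the current timestep with older predictions weighted more heavily. A sketch with a stub policy (the real policy would be the VLA; the ramp function here just makes the example self-contained):

```python
import numpy as np

K, DIM = 4, 7  # chunk size and action dimension

def policy(obs, t):
    """Stand-in for a chunked policy: returns K future actions.
    Ramps linearly with time so the example is deterministic."""
    return np.stack([np.full(DIM, float(t + i)) for i in range(K)])

horizon = 8
preds = {}      # timestep -> predictions from different (overlapping) chunks
executed = []
for t in range(horizon):
    chunk = policy(None, t)                   # re-query every step
    for i in range(K):
        preds.setdefault(t + i, []).append(chunk[i])
    candidates = preds.pop(t)                 # all predictions for step t,
    w = np.exp(-0.1 * np.arange(len(candidates)))  # oldest-first; oldest weighted most
    action = np.average(candidates, axis=0, weights=w)
    executed.append(action)
```

Averaging across overlapping chunks is what smooths out the jitter a naive "execute one chunk, then replan" scheme exhibits at chunk boundaries.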
Key Models
A lineage from task-specific to generalist, from one robot to many.
| Model | Year | Params | Training Data | Action Repr. | Key Idea |
|---|---|---|---|---|---|
| RT-1 | 2022 | 35M | 130k robot episodes | Discrete tokens | EfficientNet + TokenLearner; first large-scale robot transformer |
| RT-2 | 2023 | 55B | Web + robot data | Discrete tokens (256 bins) | VLM backbone (PaLI-X); co-training on web + robot data |
| Octo | 2024 | 93M | 800k episodes (Open X-Embodiment) | Diffusion head | Cross-embodiment generalist; open-source; flexible I/O |
| OpenVLA | 2024 | 7B | 970k episodes (Open X-Embodiment) | Discrete tokens | Llama 2 backbone + DINOv2/SigLIP; open-source 7B VLA |
| π₀ | 2024 | 3B | 10k+ hours multi-robot | Flow matching head | Flow matching action head; dexterous manipulation; multi-embodiment |
Training
Behavioral Cloning
VLAs learn from demonstration data: a human teleoperates the robot (via VR controllers, leader-follower arms, or kinesthetic teaching) while cameras record observations. The model is trained to predict the expert's actions given the same observations. This is behavioral cloning — supervised learning on (observation, action) pairs.
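Stripped of the vision and language machinery, behavioral cloning is ordinary supervised regression on (observation, action) pairs. A toy sketch with a linear policy and synthetic data standing in for (image, action) pairs:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "demonstrations": 16-D observation features, 7-DoF expert
# actions generated by a fixed linear expert (a stand-in for real data).
W_expert = rng.normal(size=(16, 7))
obs = rng.normal(size=(512, 16))
actions = obs @ W_expert

# Behavioral cloning: minimize MSE between predicted and expert actions.
W = np.zeros((16, 7))
lr = 0.05
for _ in range(300):
    pred = obs @ W
    grad = obs.T @ (pred - actions) / len(obs)  # gradient of MSE loss
    W -= lr * grad

mse = np.mean((obs @ W - actions) ** 2)
```

A real VLA replaces the linear map with the encoder–LLM–head stack and the gradient loop with a standard deep-learning optimizer, but the objective is the same: match the expert's action on the expert's observations.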
Cross-Embodiment Training
A breakthrough insight: train on data from many different robots. The Open X-Embodiment dataset aggregates data from 22 robot embodiments. Shared visual and language understanding transfers across robots; only the action space differs.
Co-Training on Web Data
RT-2's key trick: co-train on internet-scale vision-language data alongside robot episodes. The web data preserves the VLM's broad knowledge (object recognition, spatial reasoning, semantic understanding) while robot data teaches physical grounding. Mixing ratio matters — too much robot data degrades language ability; too little yields poor manipulation.
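Operationally, co-training comes down to a sampler that draws each training example from the web or robot corpus with a fixed ratio. The 1:2 robot-to-web mix below is illustrative only; the right ratio is tuned empirically, per the trade-off above:

```python
import random

def make_cotrain_sampler(web_data, robot_data, robot_frac=0.33, seed=0):
    """Yield training examples, drawing robot_frac of them (in expectation)
    from the robot corpus and the rest from web vision-language data.
    robot_frac is an illustrative value, not a published recipe."""
    rng = random.Random(seed)
    while True:
        source = robot_data if rng.random() < robot_frac else web_data
        yield rng.choice(source)

web = [("caption", i) for i in range(100)]     # stand-in web VL examples
robot = [("episode", i) for i in range(10)]    # stand-in robot episodes
sampler = make_cotrain_sampler(web, robot)
batch = [next(sampler) for _ in range(1000)]
robot_count = sum(1 for kind, _ in batch if kind == "episode")
```

Note that the mix is set by the sampling probability, not the corpus sizes: the robot corpus is tiny, but its examples are heavily oversampled relative to their share of the total data.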
Inference: The Control Loop
At deployment, the VLA runs in a closed loop at 3–10 Hz:
- Camera captures current image observation
- Image + language goal are encoded by the vision encoder and tokenizer
- LLM backbone predicts the next action chunk (1–16 timesteps)
- Action head decodes to motor commands (Δx, Δy, Δz, Δroll, Δpitch, Δyaw, gripper)
- Robot executes the action chunk
- Observe again → repeat
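The loop above can be written out directly. Every interface here (camera, model, controller) is a hypothetical stub; the structure is the point:

```python
import numpy as np

CHUNK = 4  # actions per inference call (1-16 in practice)

def get_image():
    """Stand-in for the camera driver: one RGB observation."""
    return np.zeros((224, 224, 3), dtype=np.uint8)

def vla_predict(image, instruction):
    """Stand-in for encoder + LLM backbone + action head.
    Returns a (CHUNK, 7) array: (dx, dy, dz, droll, dpitch, dyaw, gripper)."""
    return np.zeros((CHUNK, 7))

def execute(action):
    """Stand-in for the low-level robot controller."""
    pass

def control_loop(instruction, steps=12):
    executed = 0
    while executed < steps:
        image = get_image()                      # observe
        chunk = vla_predict(image, instruction)  # encode, predict, decode
        for action in chunk:                     # execute the whole chunk
            execute(action)
            executed += 1
        # loop: re-observe and replan
    return executed

n = control_loop("pick up the red cup")
```

The chunk size sets the trade-off visible in the loop: larger chunks amortize slow inference over more control steps, smaller chunks react to new observations sooner.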
Key Challenges
Distribution Shift
Small errors compound over time. The robot visits states never seen in demonstrations, and the policy has no recovery strategy. Mitigations: DAgger-style interactive data collection, action chunking, and closed-loop replanning.
Reality Gap
Sim-to-real transfer remains hard. Simulated physics, rendering, and contact models differ from the real world. Domain randomization helps but doesn't close the gap for dexterous tasks.
Data Scarcity
Robot data is 1000x scarcer than internet data. Teleoperation is slow and expensive. Cross-embodiment pooling and co-training on web data are current mitigations, not solutions.
Action Multimodality
Multiple valid ways to perform a task. Naive MSE regression averages them, producing invalid actions. Diffusion/flow heads and action chunking with CVAEs address this explicitly.
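The averaging failure is easy to see numerically: if half the demonstrations pass left of an obstacle and half pass right, the MSE-optimal prediction is their mean, which drives straight into it. A sampling-based head avoids this by committing to one mode:

```python
import numpy as np

# Demonstrated lateral offsets: half the experts pass left (-1.0),
# half pass right (+1.0) of an obstacle centered at 0.
demos = np.array([-1.0] * 50 + [1.0] * 50)

# The MSE-optimal constant prediction is the mean of the targets...
mse_prediction = demos.mean()   # 0.0 -> collides with the obstacle

# ...whereas a multimodal head (diffusion / flow / CVAE) can commit to
# one mode. Sampling a demonstrated action stands in for that here.
rng = np.random.default_rng(0)
sampled = rng.choice(demos)     # either -1.0 or +1.0, both valid
```

This is the whole case for diffusion, flow, and CVAE heads in one example: they model the distribution over valid actions rather than its (possibly invalid) mean.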