
Vision-Language-Action Models

Foundation models that physically act in the world

Year: 2023–2024
Key Works: RT-2, OpenVLA, π₀
Category: Embodied AI

What Is a VLA?

A Vision-Language-Action model is a VLM that outputs robot actions instead of (or in addition to) text. It takes an image from a robot camera and a language instruction — "pick up the red cup" — and produces motor commands: joint velocities, end-effector deltas, or gripper open/close signals.

The core insight: language models already reason about the world through text. If you can represent actions as tokens (or decode them from the LLM's hidden states), the same model that understands "red cup on the table" can plan how to reach for it. VLAs are the bridge from digital intelligence to physical action.

◆ Key idea
A VLM produces text tokens. A VLA produces action tokens. Same backbone, different output head, radically different capabilities — a robot that can follow open-ended natural language commands.

Architecture

The canonical VLA pipeline: an image and instruction enter a vision encoder and language tokenizer, their embeddings merge inside a pretrained LLM backbone, and an action head decodes the LLM's output into a vector of continuous motor commands.
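The pipeline above can be sketched shape-by-shape in a few lines of numpy. This is a toy stub, not a real implementation: the "encoders" are random linear projections, the token ids and hidden size are made up, and the reshape is a flat stand-in for true patch extraction — but the data flow (image tokens + text tokens → shared backbone → action head) matches the description.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 64  # toy hidden size; real VLA backbones are far larger

# 1. Vision encoder: image -> sequence of patch embeddings.
#    (Stub: flat reshape into 196 "patches" of 768 values, then a linear projection.)
image = rng.standard_normal((224, 224, 3))
patches = image.reshape(196, 768)
W_vis = rng.standard_normal((768, d_model)) * 0.02
vis_tokens = patches @ W_vis                       # (196, d_model)

# 2. Language tokenizer + embedding table (ids are invented for the sketch).
instruction_ids = [3, 17, 42, 7]                   # e.g. "pick up red cup"
embed_table = rng.standard_normal((100, d_model)) * 0.02
txt_tokens = embed_table[instruction_ids]          # (4, d_model)

# 3. Merge modalities and run the LLM backbone (stub: one mixing layer).
seq = np.concatenate([vis_tokens, txt_tokens])     # (200, d_model)
W_llm = rng.standard_normal((d_model, d_model)) * 0.02
hidden = np.tanh(seq @ W_llm)

# 4. Action head: decode the final hidden state into a 7-DoF command.
W_act = rng.standard_normal((d_model, 7)) * 0.02
action = hidden[-1] @ W_act  # Δx, Δy, Δz, Δroll, Δpitch, Δyaw, gripper
```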


Action Tokenization

How do you turn continuous 7-DoF actions into something an LLM can predict? Discretize: bin each dimension into 256 values and treat them as vocabulary tokens. RT-2 appends these as special tokens to the LLM vocabulary. The model autoregressively predicts [bin_x, bin_y, bin_z, bin_roll, bin_pitch, bin_yaw, bin_gripper] just like it would predict the next word.
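A minimal sketch of this 256-bin discretization (the [-1, 1] action range and round-to-nearest binning are assumptions; real systems normalize per-dimension from dataset statistics):

```python
import numpy as np

def tokenize_action(action, low=-1.0, high=1.0, n_bins=256):
    """Map each continuous action dimension to a discrete bin index."""
    clipped = np.clip(action, low, high)
    # Scale to [0, 1], then to integer bin indices 0..n_bins-1.
    return ((clipped - low) / (high - low) * (n_bins - 1)).round().astype(int)

def detokenize_action(bins, low=-1.0, high=1.0, n_bins=256):
    """Invert the binning: bin index -> continuous value at the bin position."""
    return low + bins / (n_bins - 1) * (high - low)

action = np.array([0.12, -0.55, 0.98, 0.0, 0.3, -0.3, 1.0])  # 7-DoF action
tokens = tokenize_action(action)
recovered = detokenize_action(tokens)
# Quantization error is bounded by half a bin width: (high - low) / (2 * (n_bins - 1)).
assert np.max(np.abs(recovered - action)) <= (2.0 / 255) / 2 + 1e-9
```

The half-bin error bound is exactly the "quantization error limits precision" trade-off noted below: with 256 bins over [-1, 1], resolution bottoms out at about 0.008 per dimension.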


Action Head Designs

The action head is where VLAs diverge most. Five major strategies have emerged, each with distinct trade-offs in expressiveness, speed, and multimodality handling.

RT-2 Discrete Tokens

Bin each action dimension into 256 discrete tokens. Autoregressive prediction, same as text. Simple but quantization error limits precision.

Octo Diffusion Head

Condition a small diffusion model on the LLM's hidden state to denoise continuous actions. Captures multimodal action distributions naturally.

π₀ Flow Matching

Replace the diffusion head with flow matching for straighter ODE paths. Fewer denoising steps at inference, better for real-time control.
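A toy illustration of why flow matching needs so few steps: it regresses a velocity field onto the straight-line velocity between noise x₀ and data x₁, so the inference ODE paths are nearly straight. Here the "trained" model is replaced by the analytically optimal field for a single target action (an assumption for the demo), and a few Euler steps transport noise onto it:

```python
import numpy as np

rng = np.random.default_rng(5)

# Training view: points on the straight path x_t = (1-t) x0 + t x1 have
# constant target velocity (x1 - x0); the model v(x, t) is regressed onto it.
def train_target(x0, x1, t):
    return (1 - t) * x0 + t * x1, x1 - x0

# Toy "trained" model: with a single data point a_star, the optimal velocity
# field along the straight path is v(x, t) = (a_star - x) / (1 - t).
a_star = np.array([0.3, -0.2, 0.5])

def v(x, t):
    return (a_star - x) / (1 - t)

# Inference: integrate the ODE from noise to an action with 4 Euler steps.
x = rng.standard_normal(3)
n_steps = 4
for i in range(n_steps):
    t = i / n_steps
    x = x + v(x, t) / n_steps
# x now lands on a_star -- straight paths make coarse Euler steps accurate.
```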

Direct MLP Regression

A simple MLP maps the LLM's final hidden state to continuous action values. Fast, but assumes unimodal actions — struggles with multimodal demonstrations.
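The multimodality failure is easy to see numerically. Suppose demonstrations for "avoid the obstacle" split evenly between steering left (+1) and right (-1); the MSE-optimal constant prediction is their mean, which is an action neither expert ever took:

```python
import numpy as np

# Demonstrations: half the experts go left (+1), half go right (-1).
demo_actions = np.array([+1.0] * 50 + [-1.0] * 50)

# The MSE-optimal point prediction is the mean of the targets...
mse_optimal = demo_actions.mean()
print(mse_optimal)  # 0.0 -> steer straight into the obstacle

# ...which matches neither mode. Distributional heads (diffusion, flow,
# CVAE) instead sample from {+1, -1} and commit to one mode.
```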

ACT Action Chunking

Predict a chunk of K future actions at once via a CVAE. Temporal consistency across the chunk reduces jitter. Introduced by ACT and demonstrated on the ALOHA bimanual teleoperation platform.


Key Models

A lineage from task-specific to generalist, from one robot to many.

| Model | Year | Params | Training Data | Action Repr. | Key Idea |
|---|---|---|---|---|---|
| RT-1 | 2022 | 35M | 130k robot episodes | Discrete tokens | EfficientNet + TokenLearner; first large-scale robot transformer |
| RT-2 | 2023 | 55B | Web + robot data | Discrete tokens (256 bins) | VLM backbone (PaLI-X); co-training on web + robot data |
| Octo | 2024 | 93M | 800k episodes (Open X-Embodiment) | Diffusion head | Cross-embodiment generalist; open-source; flexible I/O |
| OpenVLA | 2024 | 7B | 970k episodes (Open X-Embodiment) | Discrete tokens | Llama 2 backbone + DINOv2/SigLIP; open-source 7B VLA |
| π₀ | 2024 | 3B | 10k+ hours multi-robot | Flow matching head | Flow matching action head; dexterous manipulation; multi-embodiment |

Training

Behavioral Cloning

VLAs learn from demonstration data: a human teleoperates the robot (via VR controllers, leader-follower arms, or kinesthetic teaching) while cameras record observations. The model is trained to predict the expert's actions given the same observations. This is behavioral cloning — supervised learning on (observation, action) pairs.
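Stripped of the neural network, behavioral cloning is just regression on (observation, action) pairs. A minimal sketch with a linear "policy" and a synthetic expert (the linear expert and gradient-descent hyperparameters are assumptions chosen so the toy problem converges):

```python
import numpy as np

rng = np.random.default_rng(3)

# Synthetic demonstration dataset: the "expert" action is a fixed linear
# function of an 8-dim observation (an assumption for the demo).
W_true = rng.standard_normal((8, 7))
obs = rng.standard_normal((512, 8))            # recorded observations
expert_actions = obs @ W_true                  # teleoperator's actions

# Behavioral cloning = supervised regression onto the expert's actions.
W = np.zeros((8, 7))
lr = 0.1
for _ in range(300):
    pred = obs @ W
    grad = obs.T @ (pred - expert_actions) / len(obs)  # d(MSE)/dW
    W -= lr * grad

final_mse = np.mean((obs @ W - expert_actions) ** 2)   # near zero
```

A real VLA replaces the linear map with the full vision-language backbone, but the loss and data format are the same.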

Cross-Embodiment Training

A breakthrough insight: train on data from many different robots. The Open X-Embodiment dataset aggregates data from 22 robot embodiments. Shared visual and language understanding transfers across robots; only the action space differs.
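One common way to realize this: a shared backbone with one small action head per embodiment (the head dimensions below are illustrative, and the "backbone" is a stub linear layer):

```python
import numpy as np

rng = np.random.default_rng(2)
d_model = 64  # toy hidden size

# Shared perception/language backbone (stub: one linear layer), trained on
# data pooled across every robot.
backbone = rng.standard_normal((d_model, d_model)) * 0.02

# Only the action head differs per embodiment.
heads = {
    "arm_7dof": rng.standard_normal((d_model, 7)) * 0.02,    # xyz + rpy + gripper
    "mobile_base": rng.standard_normal((d_model, 2)) * 0.02, # linear + angular velocity
}

obs_embedding = rng.standard_normal(d_model)
hidden = np.tanh(obs_embedding @ backbone)  # same representation for every robot
actions = {name: hidden @ W for name, W in heads.items()}
```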


Co-Training on Web Data

RT-2's key trick: co-train on internet-scale vision-language data alongside robot episodes. The web data preserves the VLM's broad knowledge (object recognition, spatial reasoning, semantic understanding) while robot data teaches physical grounding. Mixing ratio matters — too much robot data degrades language ability; too little yields poor manipulation.
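Mechanically, the mixing ratio is just a sampling weight when batches are drawn. A sketch (the datasets, 50/50 ratio, and batch size are placeholders; RT-2's actual schedule is a tuned hyperparameter):

```python
import random

random.seed(0)

web_data = [("web", i) for i in range(1000)]     # vision-language examples
robot_data = [("robot", i) for i in range(100)]  # robot episodes (far scarcer)

def sample_batch(batch_size=32, robot_fraction=0.5):
    """Co-training batch: draw each example from robot data with
    probability robot_fraction, otherwise from web data."""
    batch = []
    for _ in range(batch_size):
        pool = robot_data if random.random() < robot_fraction else web_data
        batch.append(random.choice(pool))
    return batch

batch = sample_batch()
robot_share = sum(src == "robot" for src, _ in batch) / len(batch)
print(f"robot share of batch: {robot_share:.2f}")
```

Sweeping `robot_fraction` is exactly the trade-off described above: too high and the model forgets its web-scale knowledge, too low and it never grounds in physical action.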


Inference: The Control Loop

At deployment, the VLA runs in a closed loop at 3–10 Hz:

  1. Camera captures current image observation
  2. Image + language goal are encoded by the vision encoder and tokenizer
  3. LLM backbone predicts the next action chunk (1–16 timesteps)
  4. Action head decodes to motor commands (Δx, Δy, Δz, Δroll, Δpitch, Δyaw, gripper)
  5. Robot executes the action chunk
  6. Observe again → repeat
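The six steps above map onto a short loop. Everything here is a stub (random "camera", random "policy", no-op "controller", and an arbitrary 3-iteration cutoff) — the point is the structure: observe, predict a chunk, execute it, replan.

```python
import numpy as np

rng = np.random.default_rng(4)

def get_camera_image():
    """Stub sensor; real code reads the robot's camera."""
    return rng.standard_normal((224, 224, 3))

def vla_predict(image, instruction, chunk_len=8):
    """Stub policy returning a chunk of small 7-DoF actions; a real VLA
    runs the encoder + LLM backbone + action head here."""
    return rng.standard_normal((chunk_len, 7)) * 0.01

def execute(action):
    pass  # send (Δx, Δy, Δz, Δroll, Δpitch, Δyaw, gripper) to the controller

instruction = "pick up the red cup"
for step in range(3):                        # closed loop; real systems run until done
    image = get_camera_image()               # 1. observe
    chunk = vla_predict(image, instruction)  # 2-4. encode, predict, decode
    for action in chunk:                     # 5. execute the chunk open-loop
        execute(action)
    # 6. loop back: observe again and replan
```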
◆ Latency matters
A 7B-parameter LLM takes ~200ms per forward pass on an NVIDIA Jetson. At 5 Hz that leaves zero headroom. Solutions: smaller models, action chunking (predict many steps at once), quantization (INT4/INT8), or offloading to an edge GPU with speculative decoding.
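The budget arithmetic, using the illustrative numbers from the callout, shows why action chunking is the cheapest win: one forward pass amortized over K executed steps.

```python
# Latency budget sketch (numbers are illustrative, from the text above).
forward_pass_ms = 200              # ~7B model on an embedded GPU
control_hz = 5
budget_ms = 1000 / control_hz      # 200 ms per control step -> zero headroom

# Action chunking: one forward pass yields K steps of actions.
K = 8
amortized_ms = forward_pass_ms / K # 25 ms of compute per executed step
assert amortized_ms < budget_ms    # headroom restored
```

The trade-off: within a chunk the robot acts open-loop, so larger K buys latency headroom at the cost of slower reaction to disturbances.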

Key Challenges

Distribution Shift

Small errors compound over time. The robot visits states never seen in demonstrations, and the policy has no recovery strategy. Techniques: DAgger, action chunking, closed-loop replanning.

Reality Gap

Sim-to-real transfer remains hard. Simulated physics, rendering, and contact models differ from the real world. Domain randomization helps but doesn't close the gap for dexterous tasks.

Data Scarcity

Robot data is 1000x scarcer than internet data. Teleoperation is slow and expensive. Cross-embodiment pooling and co-training on web data are current mitigations, not solutions.

Action Multimodality

Multiple valid ways to perform a task. Naive MSE regression averages them, producing invalid actions. Diffusion/flow heads and action chunking with CVAEs address this explicitly.