The VLA Paradigm
A Vision-Language-Action model (VLA) is a single neural network that ingests camera images and a language instruction, then directly outputs motor commands. The idea sounds almost absurdly ambitious: train one model that can see, read, reason, and act. Yet between 2022 and 2024, a succession of systems — RT-1, RT-2, Octo, OpenVLA, and π₀ — demonstrated that this is not only feasible but increasingly practical.
The conceptual arc is straightforward. Large vision-language models (VLMs) like PaLI-X and PaLM-E already encode rich semantic and spatial knowledge from billions of image-text pairs. Robot manipulation requires exactly that knowledge: understanding what objects are, where they are, and what "put the strawberry on the plate" means. The VLA hypothesis is that fine-tuning a VLM on robot demonstrations is far more sample-efficient than training a policy from scratch, because the hardest part — grounding language in visual perception — is largely already solved.
The central engineering challenge is the action representation. Language models output discrete tokens from a vocabulary. Robot arms need continuous 6-DOF poses plus gripper commands, updated at 5–10 Hz. Bridging this gap — turning continuous actions into something a language model can predict, or bolting on a continuous action head — is where the major architectural differences lie.
RT-1: Robotics Transformer
RT-1 (Brohan et al., 2022) was the first large-scale demonstration that a Transformer-based policy could control a real mobile manipulator across hundreds of tasks. Unlike later VLAs that repurpose pretrained language models, RT-1 was designed from scratch for robot control, with every component chosen for inference speed and action precision.
Architecture
The RT-1 architecture has three stages: a pretrained EfficientNet-B3 image encoder, a TokenLearner module for token compression, and a decoder-only Transformer that outputs discretized actions.
The image encoder is an EfficientNet-B3 pretrained on ImageNet. It processes 300×300 RGB images and produces a 9×9×512 feature map. Language conditioning happens via FiLM layers (Feature-wise Linear Modulation): the task instruction is embedded by a pretrained sentence encoder (Universal Sentence Encoder), and the resulting vector e modulates the convolutional features through learned affine transforms:

FiLM(F_c | e) = γ_c(e) · F_c + β_c(e)

where F_c is the c-th channel of the feature map and γ_c(e), β_c(e) are learned linear projections of the instruction embedding e.
This is elegant because language conditioning happens inside the visual backbone rather than at a later fusion point, giving the model early access to task semantics when forming visual features.
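FiLM conditioning can be sketched in a few lines of NumPy. The shapes below follow the text (512-dim sentence embedding, 9×9×512 feature map); the random weights are stand-ins for learned projections, not RT-1's actual parameters.

```python
import numpy as np

rng = np.random.default_rng(0)

# 512-dim sentence embedding, 9x9x512 conv feature map (matching the text above)
EMBED_DIM, CHANNELS, H, W = 512, 512, 9, 9

# Stand-ins for learned projections from the instruction embedding
# to per-channel scale (gamma) and shift (beta)
W_gamma = rng.normal(0, 0.02, (CHANNELS, EMBED_DIM))
W_beta = rng.normal(0, 0.02, (CHANNELS, EMBED_DIM))

def film(features: np.ndarray, instruction_embed: np.ndarray) -> np.ndarray:
    """FiLM: scale and shift each feature channel based on the instruction."""
    gamma = W_gamma @ instruction_embed   # (CHANNELS,)
    beta = W_beta @ instruction_embed     # (CHANNELS,)
    # Broadcast the per-channel affine transform over the 9x9 spatial grid
    return gamma[:, None, None] * features + beta[:, None, None]

features = rng.normal(size=(CHANNELS, H, W))       # conv feature map
instruction_embed = rng.normal(size=(EMBED_DIM,))  # embedded task instruction
conditioned = film(features, instruction_embed)
print(conditioned.shape)  # (512, 9, 9)
```

Because the modulation is per-channel, the cost is negligible relative to the convolutions it conditions.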
The TokenLearner module is critical for speed. The 9×9 = 81 spatial tokens from EfficientNet would be expensive to process with self-attention. TokenLearner learns 8 spatial attention maps that soft-select and compress 81 tokens down to 8. With 6 history frames, this gives 48 tokens total — cheap enough for real-time inference at 3 Hz.
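The soft-selection idea can be sketched as learned spatial attention maps that pool the 81 patch tokens into 8; the attention weights here are random stand-ins for the learned module.

```python
import numpy as np

rng = np.random.default_rng(0)
NUM_TOKENS_IN, DIM, NUM_TOKENS_OUT = 81, 512, 8  # 9x9 patches -> 8 learned tokens

# Stand-in for a learned map from each spatial token to 8 attention logits
W_attn = rng.normal(0, 0.02, (DIM, NUM_TOKENS_OUT))

def token_learner(tokens: np.ndarray) -> np.ndarray:
    """Compress (81, 512) spatial tokens to (8, 512) via soft spatial attention."""
    logits = tokens @ W_attn                           # (81, 8)
    # Softmax over the 81 spatial positions: each output token is a
    # convex combination of the input tokens
    logits = logits - logits.max(axis=0, keepdims=True)
    attn = np.exp(logits) / np.exp(logits).sum(axis=0, keepdims=True)
    return attn.T @ tokens                             # (8, 512)

spatial_tokens = rng.normal(size=(NUM_TOKENS_IN, DIM))
compressed = token_learner(spatial_tokens)
print(compressed.shape)  # (8, 512)
```

Self-attention cost is quadratic in token count, so cutting 81 tokens to 8 per frame is what makes the 48-token context affordable.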
Action Tokenization
RT-1 discretizes each continuous action dimension into 256 bins uniformly distributed over the dimension's range. The action space has 11 dimensions:
| Dimensions | Description | Range |
|---|---|---|
| 1–3 | End-effector displacement (x, y, z) | Bounded per-dim |
| 4–6 | End-effector rotation (roll, pitch, yaw) | [−π, π] |
| 7 | Gripper opening | {open, close} |
| 8–10 | Base movement (x, y, yaw) | Bounded |
| 11 | Termination token | {continue, terminate} |
Each dimension is predicted by its own 256-way softmax head in the Transformer's final layer. With 11 independent 256-way categorical outputs, the model can express 256¹¹ possible discrete actions per timestep. At inference, the argmax of each softmax is taken, then linearly mapped back to the continuous range.
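The argmax-and-remap decode step can be sketched as follows; the (lo, hi) bounds here are illustrative placeholders, not RT-1's actual per-dimension ranges.

```python
import numpy as np

NUM_DIMS, NUM_BINS = 11, 256
# Illustrative (lo, hi) bounds per dimension; RT-1's actual bounds differ
BOUNDS = np.array([[-1.0, 1.0]] * NUM_DIMS)

def decode_action(logits: np.ndarray) -> np.ndarray:
    """Argmax each 256-way head, then map the bin index back into its range."""
    bins = logits.argmax(axis=-1)            # (11,) winning bin per dimension
    lo, hi = BOUNDS[:, 0], BOUNDS[:, 1]
    # Use the bin center so decoding is unbiased within each bin
    return lo + (bins + 0.5) / NUM_BINS * (hi - lo)

rng = np.random.default_rng(0)
logits = rng.normal(size=(NUM_DIMS, NUM_BINS))  # stand-in for the 11 softmax heads
action = decode_action(logits)
print(action.shape)  # (11,)
```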
Scale and Results
Google collected 130,000 episodes over 17 months using a fleet of 13 Everyday Robots mobile manipulators. The dataset spans 700+ task instructions across kitchen manipulation: picking, placing, opening drawers, moving objects near targets.
RT-1 achieved a 97% success rate on seen tasks and 76% on unseen task/object combinations, dramatically outperforming prior methods like Gato (which achieved ~50% on similar benchmarks). The key lesson: data scale and diversity matter as much as architecture. RT-1's 130K episodes were 10–100× larger than any prior real-robot dataset.
RT-2: Vision-Language-Action Transfer
RT-2 (Brohan et al., 2023) represents a conceptual leap: instead of building a robot-specific architecture, take a pretrained vision-language model and fine-tune it to output actions. The result is a model that inherits the VLM's world knowledge — semantic understanding, visual reasoning, even chain-of-thought — and applies it directly to robot control.
The Key Insight
Language models already output sequences of discrete tokens. Robot actions can be discretized into sequences of discrete tokens. Therefore, a language model can output robot actions by treating them as just another kind of text. No architectural modification is needed — only a change in the training data format.
RT-2 fine-tunes two VLMs: PaLI-X 55B (a ViT-22B vision encoder + 32B language model) and PaLM-E 12B (a PaLM language model with ViT-4B embeddings). Both are massive models pretrained on web-scale image-text data.
Actions as Text Tokens
RT-2 discretizes each action dimension into 256 bins (matching RT-1), but represents each bin as a string token rather than a classification head. The tokens are simply the strings "0" through "255". A 7-DoF action (x, y, z, rx, ry, rz, gripper) plus a termination flag becomes an 8-token string of space-separated bin indices.
Each number represents the bin index for one action dimension. The model generates this as a standard text sequence, one token at a time, using its existing vocabulary. This is possible because the strings "0" through "255" are already valid tokens in most LLM tokenizers.
# RT-2 action tokenization (conceptual)
import numpy as np

# ACTION_RANGES holds the per-dimension (lo, hi) bounds; a concrete
# definition appears in the Code Examples section.
def tokenize_action(continuous_action, num_bins=256):
"""Convert continuous action vector to text tokens."""
tokens = []
for dim_val, (lo, hi) in zip(continuous_action, ACTION_RANGES):
# Normalize to [0, 1] then discretize
normalized = (dim_val - lo) / (hi - lo)
bin_idx = int(np.clip(normalized * num_bins, 0, num_bins - 1))
tokens.append(str(bin_idx))
return " ".join(tokens) # e.g., "128 91 241 5 101 127 0"
def detokenize_action(token_string, num_bins=256):
"""Convert text tokens back to continuous action."""
bins = [int(t) for t in token_string.split()]
action = []
for bin_idx, (lo, hi) in zip(bins, ACTION_RANGES):
continuous = lo + (bin_idx + 0.5) / num_bins * (hi - lo)
action.append(continuous)
return np.array(action)
The training data mixes 50% web-scale vision-language data with 50% robot demonstration data. This co-training ratio is critical: too much robot data causes catastrophic forgetting of VLM capabilities; too little yields poor action accuracy. The 50/50 split was found empirically to preserve the VLM's reasoning abilities while learning competent control.
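A co-training batch sampler along these lines just interleaves the two sources at a fixed ratio. This is a sketch; a real pipeline streams shuffled shards rather than sampling from in-memory lists.

```python
import random

def cotrain_batches(web_data, robot_data, robot_fraction=0.5,
                    batch_size=8, seed=0):
    """Yield batches mixing web VQA examples and robot episodes at a fixed ratio."""
    rng = random.Random(seed)
    while True:
        batch = []
        for _ in range(batch_size):
            # Flip a weighted coin per example: robot demo vs. web data
            source = robot_data if rng.random() < robot_fraction else web_data
            batch.append(rng.choice(source))
        yield batch

# Toy datasets standing in for the two training sources
web = [("image_caption", i) for i in range(100)]
robot = [("robot_episode", i) for i in range(100)]

batch = next(cotrain_batches(web, robot))
print(len(batch))  # 8
```

Adjusting `robot_fraction` is the single knob that trades VLM retention against action accuracy.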
Emergent Capabilities
The most striking result from RT-2 is the emergence of capabilities that were never present in the robot training data:
- Symbol understanding: "move the object to the triangle" — the robot understands geometric shapes drawn on the table, despite never being trained on this.
- Reasoning: "pick up the object that is not a fruit" requires understanding categories and negation — capabilities inherited from the VLM.
- Chain-of-thought: RT-2 can be prompted to "think step by step" before outputting actions, improving performance on complex reasoning tasks by 2×.
- Multilingual instructions: commands in languages other than English sometimes work, despite robot data being English-only.
On emergent capability evaluations, RT-2 (PaLI-X 55B) achieved 62% success on tasks requiring semantic reasoning, compared to 32% for RT-1. This 2× improvement comes entirely from the VLM pretraining — no additional robot data was collected.
Octo: Generalist Cross-Embodiment Policy
While RT-1 and RT-2 were tied to Google's Everyday Robots platform, Octo (Ghosh et al., 2024) was designed from the start as a generalist policy that works across different robots, camera configurations, and action spaces. Its key innovation is a modular architecture with swappable observation tokenizers and action heads.
Modular Architecture
Octo's architecture separates three concerns: observation encoding, cross-modal reasoning, and action decoding. Each can be independently configured for a target robot.
Observation tokenizers convert each input modality into a sequence of tokens. The image tokenizer is a ViT encoder that produces patch tokens. A language tokenizer embeds instructions. An optional proprioception tokenizer encodes joint states. These token sequences are concatenated and fed to the backbone.
The Transformer backbone performs cross-modal attention across all observation tokens plus a set of learned readout tokens. These readout tokens serve as queries that aggregate information needed for action prediction — similar to Perceiver's latent tokens. The backbone uses a causal mask within each timestep and can attend to previous timesteps for temporal context.
The diffusion action head takes the readout token representations and uses them to condition a denoising diffusion process. Starting from Gaussian noise, the head iteratively denoises to produce a chunk of future actions (typically 4–16 timesteps). This is critical for multi-modal action distributions: unlike classification heads that produce a single mode, diffusion can represent the full distribution over possible actions.
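The denoising procedure can be sketched as a standard DDPM-style sampling loop, with a random function standing in for the trained noise predictor; Octo's actual schedule and parameterization differ.

```python
import numpy as np

rng = np.random.default_rng(0)
CHUNK, ACTION_DIM, STEPS = 4, 7, 20  # 4 future actions, 7-DoF, 20 denoise steps

# Linear beta schedule and derived quantities (standard DDPM bookkeeping)
betas = np.linspace(1e-4, 0.02, STEPS)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

def predict_noise(x, t, context):
    """Stand-in for the trained denoiser conditioned on readout tokens."""
    return 0.1 * np.tanh(x + context.mean())

def sample_action_chunk(context):
    x = rng.normal(size=(CHUNK, ACTION_DIM))  # start from pure Gaussian noise
    for t in reversed(range(STEPS)):
        eps = predict_noise(x, t, context)
        # DDPM posterior mean update
        x = (x - betas[t] / np.sqrt(1 - alpha_bars[t]) * eps) / np.sqrt(alphas[t])
        if t > 0:
            # Inject noise at every step except the last
            x += np.sqrt(betas[t]) * rng.normal(size=x.shape)
    return x

readout = rng.normal(size=(8, 384))  # stand-in readout-token embeddings
chunk = sample_action_chunk(readout)
print(chunk.shape)  # (4, 7)
```

Because the sampler starts from fresh noise each time, repeated calls can land in different modes of the action distribution, which is exactly the property a classification head lacks.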
Octo was pretrained on 800,000 episodes from the Open X-Embodiment (OXE) dataset, which aggregates demonstrations from 22 different robot embodiments. This cross-embodiment pretraining gives Octo a shared representation of manipulation concepts that transfers across robots.
Fine-Tuning Protocol
Octo's modular design enables a clean fine-tuning story. For a new robot:
- Swap the action head: replace the pretrained diffusion head with a new one matching the target robot's action dimensionality.
- Keep or adapt tokenizers: if the new robot has different camera configurations, add or replace observation tokenizers while keeping the backbone.
- Fine-tune end-to-end: with a small dataset (as few as 100 demonstrations), fine-tune the entire model with a low learning rate on the backbone and higher rate on the new head.
In evaluations, fine-tuned Octo matched or exceeded the performance of policies trained from scratch, while requiring 5–10× fewer demonstrations. The pretrained backbone provides a strong initialization that captures generalizable manipulation primitives.
OpenVLA: The Open-Source Recipe
OpenVLA (Kim et al., 2024) distills the lessons of RT-2 into a fully open-source, reproducible 7B-parameter VLA. Where RT-2 required a 55B proprietary model, OpenVLA shows that a well-designed 7B model with the right vision encoders and training recipe can achieve competitive performance — and be fine-tuned on a single GPU.
Architecture
OpenVLA is built on the Prismatic VLM, which uses a dual vision encoder:
SigLIP (ViT-SO400M)
Trained with sigmoid loss on image-text pairs. Provides strong semantic features — understands what objects are and their relationships to language.
DINOv2 (ViT-L)
Self-supervised vision encoder. Provides strong spatial features — understands where objects are, their shapes, and fine-grained geometry.
The two encoders process the same input image. Their output tokens are concatenated and projected through a 2-layer MLP into the token space of a Llama 2 7B language model backbone. The language instruction is tokenized normally and prepended. The entire sequence — projected vision tokens + language tokens — is processed by the Llama backbone, which autoregressively generates action tokens.
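The fusion step reduces to per-patch concatenation plus a small projector. The sketch below uses commonly cited dimensions (1152 for SigLIP ViT-SO400M, 1024 for DINOv2 ViT-L, 4096 for Llama 2 7B) and random weights in place of the trained 2-layer MLP.

```python
import numpy as np

rng = np.random.default_rng(0)
N_PATCHES = 256  # e.g., a 16x16 patch grid
SIGLIP_DIM, DINO_DIM, LLAMA_DIM = 1152, 1024, 4096

siglip_tokens = rng.normal(size=(N_PATCHES, SIGLIP_DIM))  # semantic features
dino_tokens = rng.normal(size=(N_PATCHES, DINO_DIM))      # spatial features

# Concatenate per-patch features from both encoders
fused = np.concatenate([siglip_tokens, dino_tokens], axis=-1)  # (256, 2176)

# 2-layer MLP projector into the Llama token space (random stand-in weights)
W1 = rng.normal(0, 0.02, (fused.shape[-1], LLAMA_DIM))
W2 = rng.normal(0, 0.02, (LLAMA_DIM, LLAMA_DIM))
hidden = np.maximum(fused @ W1, 0)  # ReLU here; the real projector may use GELU
vision_tokens = hidden @ W2         # ready to concatenate with language tokens

print(vision_tokens.shape)  # (256, 4096)
```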
Like RT-2, action tokenization uses 256 bins per dimension. The continuous action for each dimension is discretized and mapped to one of 256 reserved tokens (OpenVLA overwrites the 256 least-used tokens in the Llama tokenizer's vocabulary rather than expanding it). OpenVLA predicts 7 action dimensions: a 6-DoF end-effector pose delta (x, y, z, roll, pitch, yaw) plus a binary gripper action.
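Concretely, the bin-to-token mapping is just index arithmetic into the tail of the vocabulary. This sketch assumes Llama 2's 32,000-token vocabulary and that the reserved slots are the last 256 positions.

```python
VOCAB_SIZE = 32000  # Llama 2 tokenizer
NUM_BINS = 256

def bin_to_token_id(bin_idx: int) -> int:
    """Map an action bin index to one of the 256 reserved slots at the vocab tail."""
    assert 0 <= bin_idx < NUM_BINS
    return VOCAB_SIZE - NUM_BINS + bin_idx

def token_id_to_bin(token_id: int) -> int:
    """Inverse mapping, used when decoding generated action tokens."""
    return token_id - (VOCAB_SIZE - NUM_BINS)

print(bin_to_token_id(0), bin_to_token_id(255))  # 31744 31999
print(token_id_to_bin(bin_to_token_id(128)))     # 128
```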
Training and LoRA Fine-Tuning
OpenVLA was trained on 970,000 episodes from the Open X-Embodiment dataset. The base model training uses all parameters (full fine-tuning of the Prismatic VLM) on a cluster of 64 A100 GPUs for approximately 14 days.
For downstream adaptation, OpenVLA supports LoRA (Low-Rank Adaptation) fine-tuning. LoRA freezes the base model weights and adds small trainable rank-decomposition matrices to the attention layers. This reduces the number of trainable parameters from 7B to approximately 14M (< 0.2% of total), enabling fine-tuning on a single consumer GPU.
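The mechanism itself is simple: each adapted weight W gets a low-rank update ΔW = BA that is trained while W stays frozen. A NumPy sketch for a single layer, with illustrative dimensions:

```python
import numpy as np

rng = np.random.default_rng(0)
D_IN, D_OUT, RANK, ALPHA = 4096, 4096, 32, 32  # e.g., one q_proj layer, LoRA r=32

W = rng.normal(0, 0.02, (D_OUT, D_IN))  # frozen pretrained weight
A = rng.normal(0, 0.02, (RANK, D_IN))   # trainable down-projection
B = np.zeros((D_OUT, RANK))             # trainable up-projection, zero-init

def lora_forward(x: np.ndarray) -> np.ndarray:
    """y = Wx + (alpha / r) * B A x; only A and B receive gradients."""
    return W @ x + (ALPHA / RANK) * (B @ (A @ x))

x = rng.normal(size=(D_IN,))
# With B initialized to zero, the adapted layer matches the frozen layer exactly
assert np.allclose(lora_forward(x), W @ x)

# Trainable parameters for this one layer: r * (d_in + d_out)
print(A.size + B.size)  # 262144
```

Zero-initializing B means fine-tuning starts from exactly the pretrained behavior and departs from it gradually.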
With LoRA rank 32, OpenVLA can be fine-tuned on a new task with as few as 50–100 demonstrations in under 2 hours on a single A100. On the WidowX BridgeV2 benchmark, LoRA-fine-tuned OpenVLA achieves 82% success, compared to 85% for full fine-tuning and 57% for the base Octo model.
π₀: Flow Matching for Dexterous Control
π₀ (Physical Intelligence, 2024) takes a different approach to the action generation problem. Rather than discretizing actions into tokens, π₀ uses a flow matching action head that generates continuous actions directly. This avoids the quantization errors inherent in bin-based discretization and enables smoother, more precise control — especially important for dexterous manipulation with multi-fingered hands.
Architecture
π₀ uses a 3B-parameter pretrained VLM (PaliGemma) as its backbone. The model processes images and language instructions through the VLM's vision encoder and language model, just like RT-2 and OpenVLA. However, instead of generating discrete action tokens, the final hidden states are passed to a flow matching head.
The model is pre-trained on a diverse mixture of internet-scale vision-language data and robot demonstration data from multiple embodiments, including single-arm manipulators, bimanual robots, and dexterous hands. This diverse pre-training is key to π₀'s generalization.
Flow Matching Action Head
Flow matching is a generative modeling technique that learns to transport samples from a noise distribution to a target distribution via a continuous vector field. For action generation, this means:
- Sample an initial action from Gaussian noise: a_0 ~ N(0, I)
- Integrate the learned vector field v_θ(a_t, t, c), conditioned on context c (vision + language embeddings), from t = 0 to t = 1
- Output the final a_1 as the predicted action
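The three steps above can be sketched with a simple Euler integrator; the vector field here is a random stand-in for the learned v_θ.

```python
import numpy as np

rng = np.random.default_rng(0)
ACTION_DIM, STEPS = 7, 10  # 7-DoF action, 10 Euler steps

def vector_field(a, t, context):
    """Stand-in for the learned v_theta(a_t, t, c)."""
    return np.tanh(a + t + context.mean())

def sample_action(context):
    a = rng.normal(size=(ACTION_DIM,))  # a_0 ~ N(0, I)
    dt = 1.0 / STEPS
    for i in range(STEPS):
        t = i * dt
        a = a + dt * vector_field(a, t, context)  # Euler step along the field
    return a  # a_1: the predicted continuous action

context = rng.normal(size=(16,))  # stand-in vision + language embedding
action = sample_action(context)
print(action.shape)  # (7,)
```

Fewer integration steps trade accuracy for latency; in practice flow models tend to need far fewer steps than diffusion samplers.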
During training, the flow matching objective is simpler than diffusion: it directly regresses the vector field against the straight-line (optimal transport) path between noise and data. The loss is

L(θ) = E_{t, a_0, a_1} ‖ v_θ(a_t, t, c) − (a_1 − a_0) ‖²

where a_t = (1 − t)·a_0 + t·a_1 is the linear interpolation between the noise sample a_0 and the demonstrated action a_1, so the regression target is the constant velocity a_1 − a_0 along that path.
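The training side can be sketched the same way: sample t, interpolate, and regress the field toward the constant velocity. The predictor below is a stand-in; π₀'s actual objective details are in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
ACTION_DIM = 7

def flow_matching_loss(predict_v, a1, context):
    """One-sample flow matching loss for a demonstrated action a1."""
    a0 = rng.normal(size=a1.shape)  # noise sample
    t = rng.uniform()               # random time in [0, 1]
    at = (1 - t) * a0 + t * a1      # linear interpolation along the path
    target = a1 - a0                # constant velocity of the straight-line path
    v = predict_v(at, t, context)
    return float(np.mean((v - target) ** 2))

# Stand-in learned field and data
predict_v = lambda a, t, c: 0.1 * a
a1 = rng.normal(size=(ACTION_DIM,))
context = rng.normal(size=(16,))
loss = flow_matching_loss(predict_v, a1, context)
print(loss >= 0.0)  # True
```

Unlike diffusion training, there is no noise schedule to tune: the target is fully determined by the interpolation path.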
π₀ generates action chunks of 50 future timesteps, allowing it to plan extended manipulation sequences. The model demonstrated remarkable dexterous manipulation capabilities, including folding laundry, busing tables, and assembling boxes — tasks requiring bimanual coordination and contact-rich manipulation that discretized action models struggle with.
Architecture Comparison
The following table summarizes the key architectural differences across all five VLAs:
| Property | RT-1 | RT-2 | Octo | OpenVLA | π₀ |
|---|---|---|---|---|---|
| Year | 2022 | 2023 | 2024 | 2024 | 2024 |
| Parameters | 35M | 12B / 55B | 93M | 7B | 3B |
| Vision Encoder | EfficientNet-B3 | ViT-22B / ViT-4B | ViT (small) | SigLIP + DINOv2 | VLM encoder |
| Language Model | None (USE embed) | PaLI-X / PaLM | None (T5 embed) | Llama 2 7B | 3B LM |
| Action Head | 256-way softmax ×11 | Text token generation | Diffusion | Text token generation | Flow matching |
| Action Type | Discrete (256 bins) | Discrete (256 bins) | Continuous | Discrete (256 bins) | Continuous |
| Training Data | 130K episodes | 130K + web data | 800K (OXE) | 970K (OXE) | Diverse (web + robot) |
| Embodiments | 1 (Everyday Robot) | 1 (Everyday Robot) | 22+ | 22+ (OXE) | Multiple |
| Control Freq | 3 Hz | 1–3 Hz | 5–10 Hz | ~5 Hz | ~10 Hz |
| Open Source | No | No | Yes | Yes | No |
| LoRA Support | N/A | N/A | Partial | Yes | N/A |
The VLM-to-VLA Recipe
Looking across RT-2, OpenVLA, and π₀, a clear recipe has emerged for converting any vision-language model into a vision-language-action model. The recipe has four steps, each with well-understood trade-offs:
Start with a Strong VLM
Choose a pretrained vision-language model with good visual grounding and instruction following. Bigger models transfer better (RT-2's 55B > 12B), but 7B models (OpenVLA) are practical. The vision encoder quality matters enormously — dual encoders (SigLIP + DINOv2) outperform single encoders.
Define the Action Representation
Choose between discrete tokenization (256 bins mapped to text tokens) or a continuous action head (diffusion/flow matching). Discrete is simpler and inherits the LM's generation infrastructure. Continuous is better for high-precision or multi-modal actions.
Co-train on Web + Robot Data
Mix internet-scale vision-language data with robot demonstrations. The ratio matters: RT-2 uses 50/50. Too much robot data causes catastrophic forgetting of VLM capabilities. Too little robot data produces a model that reasons well but can't control a robot.
Fine-Tune for Target Robot
The pretrained VLA serves as a foundation. Fine-tune with a small dataset (50–1000 demonstrations) on the target embodiment. LoRA makes this efficient: ~14M trainable parameters, single GPU, hours not days. The VLM backbone provides generalization; fine-tuning provides precision.
Code Examples
Action Tokenization and Detokenization
The core mechanism shared by RT-2 and OpenVLA: converting between continuous robot actions and discrete token indices.
import numpy as np
# Action space bounds for a 7-DoF manipulator
ACTION_RANGES = [
(-0.05, 0.05), # x displacement (m)
(-0.05, 0.05), # y displacement (m)
(-0.05, 0.05), # z displacement (m)
(-0.25, 0.25), # roll (rad)
(-0.25, 0.25), # pitch (rad)
(-0.25, 0.25), # yaw (rad)
(0.0, 1.0), # gripper (0=closed, 1=open)
]
NUM_BINS = 256
def continuous_to_tokens(action: np.ndarray) -> list[int]:
    """Discretize continuous action into bin indices."""
    tokens = []
    for val, (lo, hi) in zip(action, ACTION_RANGES):
        normalized = np.clip((val - lo) / (hi - lo), 0.0, 1.0)
        # Floor into one of NUM_BINS equal-width bins (the clip handles
        # normalized == 1.0, which would otherwise overflow to bin 256)
        bin_idx = int(np.clip(normalized * NUM_BINS, 0, NUM_BINS - 1))
        tokens.append(bin_idx)
    return tokens
def tokens_to_continuous(tokens: list[int]) -> np.ndarray:
    """Convert bin indices back to continuous action."""
    action = []
    for bin_idx, (lo, hi) in zip(tokens, ACTION_RANGES):
        # Map to bin center for unbiased reconstruction
        continuous = lo + (bin_idx + 0.5) / NUM_BINS * (hi - lo)
        action.append(continuous)
    return np.array(action)
# Example: encode and decode a grasp action
action = np.array([0.02, -0.01, -0.03, 0.0, 0.1, -0.05, 0.0])
tokens = continuous_to_tokens(action)
print(f"Tokens: {tokens}")
# Tokens: [179, 102, 51, 128, 179, 102, 0]
reconstructed = tokens_to_continuous(tokens)
print(f"Max reconstruction error: {np.abs(action - reconstructed).max():.5f}")
# Max reconstruction error: 0.00195 (dominated by the gripper's coarse 0-1 range;
# the position dims are within 0.0002 m, well within robot precision)
VLA Inference Loop
A simplified inference loop showing how a VLA model processes observations and generates actions in a closed-loop control setting:
import torch
from PIL import Image
class VLAInferenceLoop:
"""Simplified VLA control loop (OpenVLA-style)."""
def __init__(self, model, processor, action_ranges, num_bins=256):
self.model = model
self.processor = processor
self.action_ranges = action_ranges
self.num_bins = num_bins
def get_action(self, image: Image.Image, instruction: str):
"""Run single-step VLA inference."""
# 1. Tokenize image and instruction
inputs = self.processor(
images=image,
text=f"In: What action should the robot take to {instruction}?\nOut:",
return_tensors="pt"
).to(self.model.device)
# 2. Generate action tokens autoregressively
with torch.no_grad():
output_ids = self.model.generate(
**inputs,
max_new_tokens=len(self.action_ranges),
do_sample=False, # greedy decoding for actions
)
# 3. Decode token IDs to bin indices
generated_ids = output_ids[0, inputs["input_ids"].shape[1]:]
bin_indices = [
int(self.processor.decode(tok_id).strip())
for tok_id in generated_ids[:len(self.action_ranges)]
]
# 4. Convert bins to continuous action
action = tokens_to_continuous(bin_indices)
return action
def run(self, env, instruction: str, max_steps: int = 300):
"""Closed-loop control."""
obs = env.reset()
for step in range(max_steps):
image = Image.fromarray(obs["image"])
action = self.get_action(image, instruction)
obs, reward, done, info = env.step(action)
if done:
print(f"Task completed in {step + 1} steps")
return True
print("Task timed out")
return False
LoRA Fine-Tuning for OpenVLA
Adapting a pretrained VLA to a new task with parameter-efficient fine-tuning:
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForVision2Seq, AutoProcessor
# Load pretrained OpenVLA
model = AutoModelForVision2Seq.from_pretrained(
"openvla/openvla-7b",
torch_dtype=torch.bfloat16,
device_map="auto",
)
processor = AutoProcessor.from_pretrained("openvla/openvla-7b")
# Apply LoRA to attention layers only
lora_config = LoraConfig(
r=32, # rank
lora_alpha=32, # scaling factor
target_modules=[ # which layers to adapt
"q_proj", "k_proj", "v_proj", "o_proj",
"gate_proj", "up_proj", "down_proj",
],
lora_dropout=0.05,
bias="none",
task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# trainable params: 13,893,632 || all params: 7,615,751,168 || 0.18%
# Fine-tune on your robot's demonstrations
# (dataset loading, training loop, etc.)
References
Seminal papers and key works referenced in this article.
- Brohan et al. "RT-1: Robotics Transformer for Real-World Control at Scale." RSS, 2023. arXiv
- Brohan et al. "RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control." CoRL, 2023. arXiv
- Ghosh et al. "Octo: An Open-Source Generalist Robot Policy." RSS, 2024. arXiv
- Kim et al. "OpenVLA: An Open-Source Vision-Language-Action Model." CoRL, 2024. arXiv
- Black et al. "π₀: A Vision-Language-Action Flow Model for General Robot Control." 2024. arXiv