The VLA Paradigm
A Vision-Language-Action model (VLA) is a single neural network that ingests camera images and a language instruction, then directly outputs motor commands. The idea sounds almost absurdly ambitious: train one model that can see, read, reason, and act. Yet between 2022 and 2024, a succession of systems — RT-1, RT-2, Octo, OpenVLA, and π₀ — demonstrated that this is not only feasible but increasingly practical.
The conceptual arc is straightforward. Large vision-language models (VLMs) like PaLI-X and PaLM-E already encode rich semantic and spatial knowledge from billions of image-text pairs. Robot manipulation requires exactly that knowledge: understanding what objects are, where they are, and what "put the strawberry on the plate" means. The VLA hypothesis is that fine-tuning a VLM on robot demonstrations is far more sample-efficient than training a policy from scratch, because the hardest part — grounding language in visual perception — is largely already solved.
The central engineering challenge is the action representation. Language models output discrete tokens from a vocabulary. Robot arms need continuous 6-DOF poses plus gripper commands, updated at 5–10 Hz. Bridging this gap — turning continuous actions into something a language model can predict, or bolting on a continuous action head — is where the major architectural differences lie.
RT-1: Robotics Transformer
RT-1 (Brohan et al., 2022) was the first large-scale demonstration that a Transformer-based policy could control a real mobile manipulator across hundreds of tasks. Unlike later VLAs that repurpose pretrained language models, RT-1 was designed from scratch for robot control, with every component chosen for inference speed and action precision.
Architecture
The RT-1 architecture has three stages: a pretrained EfficientNet-B3 image encoder, a TokenLearner module for token compression, and a decoder-only Transformer that outputs discretized actions.
The image encoder is an EfficientNet-B3 pretrained on ImageNet. It processes 300×300 RGB images and produces a 9×9×512 feature map. Language conditioning happens via FiLM layers (Feature-wise Linear Modulation): the task instruction is embedded by a pretrained sentence encoder (Universal Sentence Encoder), and the resulting vector e modulates the convolutional features through learned affine transforms:

FiLM(F_c | e) = γ_c(e) · F_c + β_c(e)

where F_c is the c-th channel of the feature map and γ_c(e), β_c(e) are learned linear projections of the instruction embedding e.
This is elegant because language conditioning happens inside the visual backbone rather than at a later fusion point, giving the model early access to task semantics when forming visual features.
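FiLM conditioning can be sketched in a few lines of NumPy. The shapes below follow the text (512-dim sentence embedding, 9×9×512 feature map); the random weights are stand-ins for learned projections, not RT-1's actual parameters.

```python
import numpy as np

rng = np.random.default_rng(0)

# 512-dim sentence embedding, 9x9x512 conv feature map (matching the text above)
EMBED_DIM, CHANNELS, H, W = 512, 512, 9, 9

# Stand-ins for learned projections from the instruction embedding
# to per-channel scale (gamma) and shift (beta)
W_gamma = rng.normal(0, 0.02, (CHANNELS, EMBED_DIM))
W_beta = rng.normal(0, 0.02, (CHANNELS, EMBED_DIM))

def film(features: np.ndarray, instruction_embed: np.ndarray) -> np.ndarray:
    """FiLM: scale and shift each feature channel based on the instruction."""
    gamma = W_gamma @ instruction_embed   # (CHANNELS,)
    beta = W_beta @ instruction_embed     # (CHANNELS,)
    # Broadcast the per-channel affine transform over the 9x9 spatial grid
    return gamma[:, None, None] * features + beta[:, None, None]

features = rng.normal(size=(CHANNELS, H, W))       # conv feature map
instruction_embed = rng.normal(size=(EMBED_DIM,))  # embedded task instruction
conditioned = film(features, instruction_embed)
print(conditioned.shape)  # (512, 9, 9)
```

Because the modulation is per-channel, the cost is negligible relative to the convolutions it conditions.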
The TokenLearner module is critical for speed. The 9×9 = 81 spatial tokens from EfficientNet would be expensive to process with self-attention. TokenLearner learns 8 spatial attention maps that soft-select and compress 81 tokens down to 8. With 6 history frames, this gives 48 tokens total — cheap enough for real-time inference at 3 Hz.
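The soft-selection idea can be sketched as learned spatial attention maps that pool the 81 patch tokens into 8; the attention weights here are random stand-ins for the learned module.

```python
import numpy as np

rng = np.random.default_rng(0)
NUM_TOKENS_IN, DIM, NUM_TOKENS_OUT = 81, 512, 8  # 9x9 patches -> 8 learned tokens

# Stand-in for a learned map from each spatial token to 8 attention logits
W_attn = rng.normal(0, 0.02, (DIM, NUM_TOKENS_OUT))

def token_learner(tokens: np.ndarray) -> np.ndarray:
    """Compress (81, 512) spatial tokens to (8, 512) via soft spatial attention."""
    logits = tokens @ W_attn                           # (81, 8)
    # Softmax over the 81 spatial positions: each output token is a
    # convex combination of the input tokens
    logits = logits - logits.max(axis=0, keepdims=True)
    attn = np.exp(logits) / np.exp(logits).sum(axis=0, keepdims=True)
    return attn.T @ tokens                             # (8, 512)

spatial_tokens = rng.normal(size=(NUM_TOKENS_IN, DIM))
compressed = token_learner(spatial_tokens)
print(compressed.shape)  # (8, 512)
```

Self-attention cost is quadratic in token count, so cutting 81 tokens to 8 per frame is what makes the 48-token context affordable.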
Action Tokenization
RT-1 discretizes each continuous action dimension into 256 bins uniformly distributed over the dimension's range. The action space has 11 dimensions:
| Dimensions | Description | Range |
|---|---|---|
| 1–3 | End-effector displacement (x, y, z) | Bounded per-dim |
| 4–6 | End-effector rotation (roll, pitch, yaw) | [−π, π] |
| 7 | Gripper opening | {open, close} |
| 8–10 | Base movement (x, y, yaw) | Bounded |
| 11 | Termination token | {continue, terminate} |
Each dimension is predicted by its own 256-way softmax head in the Transformer's final layer. With 11 independent 256-way categorical outputs, the model can express 256¹¹ possible discrete actions per timestep. At inference, the argmax of each softmax is taken, then linearly mapped back to the continuous range.
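The argmax-and-remap decode step can be sketched as follows; the (lo, hi) bounds here are illustrative placeholders, not RT-1's actual per-dimension ranges.

```python
import numpy as np

NUM_DIMS, NUM_BINS = 11, 256
# Illustrative (lo, hi) bounds per dimension; RT-1's actual bounds differ
BOUNDS = np.array([[-1.0, 1.0]] * NUM_DIMS)

def decode_action(logits: np.ndarray) -> np.ndarray:
    """Argmax each 256-way head, then map the bin index back into its range."""
    bins = logits.argmax(axis=-1)            # (11,) winning bin per dimension
    lo, hi = BOUNDS[:, 0], BOUNDS[:, 1]
    # Use the bin center so decoding is unbiased within each bin
    return lo + (bins + 0.5) / NUM_BINS * (hi - lo)

rng = np.random.default_rng(0)
logits = rng.normal(size=(NUM_DIMS, NUM_BINS))  # stand-in for the 11 softmax heads
action = decode_action(logits)
print(action.shape)  # (11,)
```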
Scale and Results
Google collected 130,000 episodes over 17 months using a fleet of 13 Everyday Robots mobile manipulators. The dataset spans 700+ task instructions across kitchen manipulation: picking, placing, opening drawers, moving objects near targets.
RT-1 achieved a 97% success rate on seen tasks and 76% on unseen task/object combinations, dramatically outperforming prior methods like Gato (which achieved ~50% on similar benchmarks). The key lesson: data scale and diversity matter as much as architecture. RT-1's 130K episodes were 10–100× larger than any prior real-robot dataset.
RT-2: Vision-Language-Action Transfer
RT-2 (Brohan et al., 2023) represents a conceptual leap: instead of building a robot-specific architecture, take a pretrained vision-language model and fine-tune it to output actions. The result is a model that inherits the VLM's world knowledge — semantic understanding, visual reasoning, even chain-of-thought — and applies it directly to robot control.
The Key Insight
Language models already output sequences of discrete tokens. Robot actions can be discretized into sequences of discrete tokens. Therefore, a language model can output robot actions by treating them as just another kind of text. No architectural modification is needed — only a change in the training data format.
RT-2 fine-tunes two VLMs: PaLI-X 55B (a ViT-22B vision encoder + 32B language model) and PaLM-E 12B (a PaLM language model with ViT-4B embeddings). Both are massive models pretrained on web-scale image-text data.
Actions as Text Tokens
RT-2 discretizes each action dimension into 256 bins (matching RT-1), but represents each bin as a string token rather than a classification head. The tokens are simply the strings "0" through "255". A 7-DoF action (x, y, z, rx, ry, rz, gripper) plus a termination flag becomes an 8-token string of space-separated bin indices.
Each number represents the bin index for one action dimension. The model generates this as a standard text sequence, one token at a time, using its existing vocabulary. This is possible because the strings "0" through "255" are already valid tokens in most LLM tokenizers.
# RT-2 action tokenization (conceptual)
import numpy as np

# ACTION_RANGES holds the per-dimension (lo, hi) bounds; a concrete
# definition appears in the Code Examples section.
def tokenize_action(continuous_action, num_bins=256):
"""Convert continuous action vector to text tokens."""
tokens = []
for dim_val, (lo, hi) in zip(continuous_action, ACTION_RANGES):
# Normalize to [0, 1] then discretize
normalized = (dim_val - lo) / (hi - lo)
bin_idx = int(np.clip(normalized * num_bins, 0, num_bins - 1))
tokens.append(str(bin_idx))
return " ".join(tokens) # e.g., "128 91 241 5 101 127 0"
def detokenize_action(token_string, num_bins=256):
"""Convert text tokens back to continuous action."""
bins = [int(t) for t in token_string.split()]
action = []
for bin_idx, (lo, hi) in zip(bins, ACTION_RANGES):
continuous = lo + (bin_idx + 0.5) / num_bins * (hi - lo)
action.append(continuous)
return np.array(action)
The training data mixes 50% web-scale vision-language data with 50% robot demonstration data. This co-training ratio is critical: too much robot data causes catastrophic forgetting of VLM capabilities; too little yields poor action accuracy. The 50/50 split was found empirically to preserve the VLM's reasoning abilities while learning competent control.
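A co-training batch sampler along these lines just interleaves the two sources at a fixed ratio. This is a sketch; a real pipeline streams shuffled shards rather than sampling from in-memory lists.

```python
import random

def cotrain_batches(web_data, robot_data, robot_fraction=0.5,
                    batch_size=8, seed=0):
    """Yield batches mixing web VQA examples and robot episodes at a fixed ratio."""
    rng = random.Random(seed)
    while True:
        batch = []
        for _ in range(batch_size):
            # Flip a weighted coin per example: robot demo vs. web data
            source = robot_data if rng.random() < robot_fraction else web_data
            batch.append(rng.choice(source))
        yield batch

# Toy datasets standing in for the two training sources
web = [("image_caption", i) for i in range(100)]
robot = [("robot_episode", i) for i in range(100)]

batch = next(cotrain_batches(web, robot))
print(len(batch))  # 8
```

Adjusting `robot_fraction` is the single knob that trades VLM retention against action accuracy.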
Emergent Capabilities
The most striking result from RT-2 is the emergence of capabilities that were never present in the robot training data:
- Symbol understanding: "move the object to the triangle" — the robot understands geometric shapes drawn on the table, despite never being trained on this.
- Reasoning: "pick up the object that is not a fruit" requires understanding categories and negation — capabilities inherited from the VLM.
- Chain-of-thought: RT-2 can be prompted to "think step by step" before outputting actions, improving performance on complex reasoning tasks by 2×.
- Multilingual instructions: commands in languages other than English sometimes work, despite robot data being English-only.
On emergent capability evaluations, RT-2 (PaLI-X 55B) achieved 62% success on tasks requiring semantic reasoning, compared to 32% for RT-1. This 2× improvement comes entirely from the VLM pretraining — no additional robot data was collected.
Octo: Generalist Cross-Embodiment Policy
While RT-1 and RT-2 were tied to Google's Everyday Robots platform, Octo (Ghosh et al., 2024) was designed from the start as a generalist policy that works across different robots, camera configurations, and action spaces. Its key innovation is a modular architecture with swappable observation tokenizers and action heads.
Modular Architecture
Octo's architecture separates three concerns: observation encoding, cross-modal reasoning, and action decoding. Each can be independently configured for a target robot.
Observation tokenizers convert each input modality into a sequence of tokens. The image tokenizer is a ViT encoder that produces patch tokens. A language tokenizer embeds instructions. An optional proprioception tokenizer encodes joint states. These token sequences are concatenated and fed to the backbone.
The Transformer backbone performs cross-modal attention across all observation tokens plus a set of learned readout tokens. These readout tokens serve as queries that aggregate information needed for action prediction — similar to Perceiver's latent tokens. The backbone uses a causal mask within each timestep and can attend to previous timesteps for temporal context.
The diffusion action head takes the readout token representations and uses them to condition a denoising diffusion process. Starting from Gaussian noise, the head iteratively denoises to produce a chunk of future actions (typically 4–16 timesteps). This is critical for multi-modal action distributions: unlike classification heads that produce a single mode, diffusion can represent the full distribution over possible actions.
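The denoising procedure can be sketched as a standard DDPM-style sampling loop, with a random function standing in for the trained noise predictor; Octo's actual schedule and parameterization differ.

```python
import numpy as np

rng = np.random.default_rng(0)
CHUNK, ACTION_DIM, STEPS = 4, 7, 20  # 4 future actions, 7-DoF, 20 denoise steps

# Linear beta schedule and derived quantities (standard DDPM bookkeeping)
betas = np.linspace(1e-4, 0.02, STEPS)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

def predict_noise(x, t, context):
    """Stand-in for the trained denoiser conditioned on readout tokens."""
    return 0.1 * np.tanh(x + context.mean())

def sample_action_chunk(context):
    x = rng.normal(size=(CHUNK, ACTION_DIM))  # start from pure Gaussian noise
    for t in reversed(range(STEPS)):
        eps = predict_noise(x, t, context)
        # DDPM posterior mean update
        x = (x - betas[t] / np.sqrt(1 - alpha_bars[t]) * eps) / np.sqrt(alphas[t])
        if t > 0:
            # Inject noise at every step except the last
            x += np.sqrt(betas[t]) * rng.normal(size=x.shape)
    return x

readout = rng.normal(size=(8, 384))  # stand-in readout-token embeddings
chunk = sample_action_chunk(readout)
print(chunk.shape)  # (4, 7)
```

Because the sampler starts from fresh noise each time, repeated calls can land in different modes of the action distribution, which is exactly the property a classification head lacks.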
Octo was pretrained on 800,000 episodes from the Open X-Embodiment (OXE) dataset, which aggregates demonstrations from 22 different robot embodiments. This cross-embodiment pretraining gives Octo a shared representation of manipulation concepts that transfers across robots.
Fine-Tuning Protocol
Octo's modular design enables a clean fine-tuning story. For a new robot:
- Swap the action head: replace the pretrained diffusion head with a new one matching the target robot's action dimensionality.
- Keep or adapt tokenizers: if the new robot has different camera configurations, add or replace observation tokenizers while keeping the backbone.
- Fine-tune end-to-end: with a small dataset (as few as 100 demonstrations), fine-tune the entire model with a low learning rate on the backbone and higher rate on the new head.
In evaluations, fine-tuned Octo matched or exceeded the performance of policies trained from scratch, while requiring 5–10× fewer demonstrations. The pretrained backbone provides a strong initialization that captures generalizable manipulation primitives.
OpenVLA: The Open-Source Recipe
OpenVLA (Kim et al., 2024) distills the lessons of RT-2 into a fully open-source, reproducible 7B-parameter VLA. Where RT-2 required a 55B proprietary model, OpenVLA shows that a well-designed 7B model with the right vision encoders and training recipe can achieve competitive performance — and be fine-tuned on a single GPU.
Architecture
OpenVLA is built on the Prismatic VLM, which uses a dual vision encoder:
SigLIP (ViT-SO400M)
Trained with sigmoid loss on image-text pairs. Provides strong semantic features — understands what objects are and their relationships to language.
DINOv2 (ViT-L)
Self-supervised vision encoder. Provides strong spatial features — understands where objects are, their shapes, and fine-grained geometry.
The two encoders process the same input image. Their output tokens are concatenated and projected through a 2-layer MLP into the token space of a Llama 2 7B language model backbone. The language instruction is tokenized normally and prepended. The entire sequence — projected vision tokens + language tokens — is processed by the Llama backbone, which autoregressively generates action tokens.
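The fusion step reduces to per-patch concatenation plus a small projector. The sketch below uses commonly cited dimensions (1152 for SigLIP ViT-SO400M, 1024 for DINOv2 ViT-L, 4096 for Llama 2 7B) and random weights in place of the trained 2-layer MLP.

```python
import numpy as np

rng = np.random.default_rng(0)
N_PATCHES = 256  # e.g., a 16x16 patch grid
SIGLIP_DIM, DINO_DIM, LLAMA_DIM = 1152, 1024, 4096

siglip_tokens = rng.normal(size=(N_PATCHES, SIGLIP_DIM))  # semantic features
dino_tokens = rng.normal(size=(N_PATCHES, DINO_DIM))      # spatial features

# Concatenate per-patch features from both encoders
fused = np.concatenate([siglip_tokens, dino_tokens], axis=-1)  # (256, 2176)

# 2-layer MLP projector into the Llama token space (random stand-in weights)
W1 = rng.normal(0, 0.02, (fused.shape[-1], LLAMA_DIM))
W2 = rng.normal(0, 0.02, (LLAMA_DIM, LLAMA_DIM))
hidden = np.maximum(fused @ W1, 0)  # ReLU here; the real projector may use GELU
vision_tokens = hidden @ W2         # ready to concatenate with language tokens

print(vision_tokens.shape)  # (256, 4096)
```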
Like RT-2, action tokenization uses 256 bins per dimension. The continuous action for each dimension is discretized and mapped to one of 256 reserved tokens (OpenVLA overwrites the 256 least-used tokens in the Llama tokenizer's vocabulary rather than expanding it). OpenVLA predicts 7 action dimensions: a 6-DoF end-effector pose delta (x, y, z, roll, pitch, yaw) plus a binary gripper action.
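Concretely, the bin-to-token mapping is just index arithmetic into the tail of the vocabulary. This sketch assumes Llama 2's 32,000-token vocabulary and that the reserved slots are the last 256 positions.

```python
VOCAB_SIZE = 32000  # Llama 2 tokenizer
NUM_BINS = 256

def bin_to_token_id(bin_idx: int) -> int:
    """Map an action bin index to one of the 256 reserved slots at the vocab tail."""
    assert 0 <= bin_idx < NUM_BINS
    return VOCAB_SIZE - NUM_BINS + bin_idx

def token_id_to_bin(token_id: int) -> int:
    """Inverse mapping, used when decoding generated action tokens."""
    return token_id - (VOCAB_SIZE - NUM_BINS)

print(bin_to_token_id(0), bin_to_token_id(255))  # 31744 31999
print(token_id_to_bin(bin_to_token_id(128)))     # 128
```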
Training and LoRA Fine-Tuning
OpenVLA was trained on 970,000 episodes from the Open X-Embodiment dataset. The base model training uses all parameters (full fine-tuning of the Prismatic VLM) on a cluster of 64 A100 GPUs for approximately 14 days.
For downstream adaptation, OpenVLA supports LoRA (Low-Rank Adaptation) fine-tuning. LoRA freezes the base model weights and adds small trainable rank-decomposition matrices to the attention layers. This reduces the number of trainable parameters from 7B to approximately 14M (< 0.2% of total), enabling fine-tuning on a single consumer GPU.
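The mechanism itself is simple: each adapted weight W gets a low-rank update ΔW = BA that is trained while W stays frozen. A NumPy sketch for a single layer, with illustrative dimensions:

```python
import numpy as np

rng = np.random.default_rng(0)
D_IN, D_OUT, RANK, ALPHA = 4096, 4096, 32, 32  # e.g., one q_proj layer, LoRA r=32

W = rng.normal(0, 0.02, (D_OUT, D_IN))  # frozen pretrained weight
A = rng.normal(0, 0.02, (RANK, D_IN))   # trainable down-projection
B = np.zeros((D_OUT, RANK))             # trainable up-projection, zero-init

def lora_forward(x: np.ndarray) -> np.ndarray:
    """y = Wx + (alpha / r) * B A x; only A and B receive gradients."""
    return W @ x + (ALPHA / RANK) * (B @ (A @ x))

x = rng.normal(size=(D_IN,))
# With B initialized to zero, the adapted layer matches the frozen layer exactly
assert np.allclose(lora_forward(x), W @ x)

# Trainable parameters for this one layer: r * (d_in + d_out)
print(A.size + B.size)  # 262144
```

Zero-initializing B means fine-tuning starts from exactly the pretrained behavior and departs from it gradually.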
With LoRA rank 32, OpenVLA can be fine-tuned on a new task with as few as 50–100 demonstrations in under 2 hours on a single A100. On the WidowX BridgeV2 benchmark, LoRA-fine-tuned OpenVLA achieves 82% success, compared to 85% for full fine-tuning and 57% for the base Octo model.
π₀: Flow Matching for Dexterous Control
π₀ (Physical Intelligence, 2024) takes a different approach to the action generation problem. Rather than discretizing actions into tokens, π₀ uses a flow matching action head that generates continuous actions directly. This avoids the quantization errors inherent in bin-based discretization and enables smoother, more precise control — especially important for dexterous manipulation with multi-fingered hands.
Architecture
π₀ uses a 3B-parameter pretrained VLM (PaliGemma) as its backbone. The model processes images and language instructions through the VLM's vision encoder and language model, just like RT-2 and OpenVLA. However, instead of generating discrete action tokens, the final hidden states are passed to a flow matching head.
The model is pre-trained on a diverse mixture of internet-scale vision-language data and robot demonstration data from multiple embodiments, including single-arm manipulators, bimanual robots, and dexterous hands. This diverse pre-training is key to π₀'s generalization.
Flow Matching Action Head
Flow matching is a generative modeling technique that learns to transport samples from a noise distribution to a target distribution via a continuous vector field. For action generation, this means:
- Sample an initial action from Gaussian noise: a_0 ~ N(0, I)
- Integrate the learned vector field v_θ(a_t, t, c), conditioned on context c (vision + language embeddings), from t = 0 to t = 1
- Output the final a_1 as the predicted action
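The three steps above can be sketched with a simple Euler integrator; the vector field here is a random stand-in for the learned v_θ.

```python
import numpy as np

rng = np.random.default_rng(0)
ACTION_DIM, STEPS = 7, 10  # 7-DoF action, 10 Euler steps

def vector_field(a, t, context):
    """Stand-in for the learned v_theta(a_t, t, c)."""
    return np.tanh(a + t + context.mean())

def sample_action(context):
    a = rng.normal(size=(ACTION_DIM,))  # a_0 ~ N(0, I)
    dt = 1.0 / STEPS
    for i in range(STEPS):
        t = i * dt
        a = a + dt * vector_field(a, t, context)  # Euler step along the field
    return a  # a_1: the predicted continuous action

context = rng.normal(size=(16,))  # stand-in vision + language embedding
action = sample_action(context)
print(action.shape)  # (7,)
```

Fewer integration steps trade accuracy for latency; in practice flow models tend to need far fewer steps than diffusion samplers.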
During training, the flow matching objective is simpler than diffusion: it directly regresses the vector field against the straight-line (optimal transport) path between noise and data. The loss is

L(θ) = E_{t, a_0, a_1} ‖ v_θ(a_t, t, c) − (a_1 − a_0) ‖²

where a_t = (1 − t)·a_0 + t·a_1 is the linear interpolation between the noise sample a_0 and the demonstrated action a_1, so the regression target is the constant velocity a_1 − a_0 along that path.
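The training side can be sketched the same way: sample t, interpolate, and regress the field toward the constant velocity. The predictor below is a stand-in; π₀'s actual objective details are in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
ACTION_DIM = 7

def flow_matching_loss(predict_v, a1, context):
    """One-sample flow matching loss for a demonstrated action a1."""
    a0 = rng.normal(size=a1.shape)  # noise sample
    t = rng.uniform()               # random time in [0, 1]
    at = (1 - t) * a0 + t * a1      # linear interpolation along the path
    target = a1 - a0                # constant velocity of the straight-line path
    v = predict_v(at, t, context)
    return float(np.mean((v - target) ** 2))

# Stand-in learned field and data
predict_v = lambda a, t, c: 0.1 * a
a1 = rng.normal(size=(ACTION_DIM,))
context = rng.normal(size=(16,))
loss = flow_matching_loss(predict_v, a1, context)
print(loss >= 0.0)  # True
```

Unlike diffusion training, there is no noise schedule to tune: the target is fully determined by the interpolation path.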
π₀ generates action chunks of 50 future timesteps, allowing it to plan extended manipulation sequences. The model demonstrated remarkable dexterous manipulation capabilities, including folding laundry, busing tables, and assembling boxes — tasks requiring bimanual coordination and contact-rich manipulation that discretized action models struggle with.
Architecture Comparison
The following table summarizes the key architectural differences across all five VLAs:
| Property | RT-1 | RT-2 | Octo | OpenVLA | π₀ |
|---|---|---|---|---|---|
| Year | 2022 | 2023 | 2024 | 2024 | 2024 |
| Parameters | 35M | 12B / 55B | 93M | 7B | 3B |
| Vision Encoder | EfficientNet-B3 | ViT-22B / ViT-4B | ViT (small) | SigLIP + DINOv2 | VLM encoder |
| Language Model | None (USE embed) | PaLI-X / PaLM | None (T5 embed) | Llama 2 7B | 3B LM |
| Action Head | 256-way softmax ×11 | Text token generation | Diffusion | Text token generation | Flow matching |
| Action Type | Discrete (256 bins) | Discrete (256 bins) | Continuous | Discrete (256 bins) | Continuous |
| Training Data | 130K episodes | 130K + web data | 800K (OXE) | 970K (OXE) | Diverse (web + robot) |
| Embodiments | 1 (Everyday Robot) | 1 (Everyday Robot) | 22+ | 22+ (OXE) | Multiple |
| Control Freq | 3 Hz | 1–3 Hz | 5–10 Hz | ~5 Hz | ~10 Hz |
| Open Source | No | No | Yes | Yes | No |
| LoRA Support | N/A | N/A | Partial | Yes | N/A |
The VLM-to-VLA Recipe
Looking across RT-2, OpenVLA, and π₀, a clear recipe has emerged for converting any vision-language model into a vision-language-action model. The recipe has four steps, each with well-understood trade-offs:
Start with a Strong VLM
Choose a pretrained vision-language model with good visual grounding and instruction following. Bigger models transfer better (RT-2's 55B > 12B), but 7B models (OpenVLA) are practical. The vision encoder quality matters enormously — dual encoders (SigLIP + DINOv2) outperform single encoders.
Define the Action Representation
Choose between discrete tokenization (256 bins mapped to text tokens) or a continuous action head (diffusion/flow matching). Discrete is simpler and inherits the LM's generation infrastructure. Continuous is better for high-precision or multi-modal actions.
Co-train on Web + Robot Data
Mix internet-scale vision-language data with robot demonstrations. The ratio matters: RT-2 uses 50/50. Too much robot data causes catastrophic forgetting of VLM capabilities. Too little robot data produces a model that reasons well but can't control a robot.
Fine-Tune for Target Robot
The pretrained VLA serves as a foundation. Fine-tune with a small dataset (50–1000 demonstrations) on the target embodiment. LoRA makes this efficient: ~14M trainable parameters, single GPU, hours not days. The VLM backbone provides generalization; fine-tuning provides precision.
Code Examples
Action Tokenization and Detokenization
The core mechanism shared by RT-2 and OpenVLA: converting between continuous robot actions and discrete token indices.
import numpy as np
# Action space bounds for a 7-DoF manipulator
ACTION_RANGES = [
(-0.05, 0.05), # x displacement (m)
(-0.05, 0.05), # y displacement (m)
(-0.05, 0.05), # z displacement (m)
(-0.25, 0.25), # roll (rad)
(-0.25, 0.25), # pitch (rad)
(-0.25, 0.25), # yaw (rad)
(0.0, 1.0), # gripper (0=closed, 1=open)
]
NUM_BINS = 256
def continuous_to_tokens(action: np.ndarray) -> list[int]:
    """Discretize continuous action into bin indices."""
    tokens = []
    for val, (lo, hi) in zip(action, ACTION_RANGES):
        normalized = np.clip((val - lo) / (hi - lo), 0.0, 1.0)
        # Floor into one of NUM_BINS equal-width bins (the clip handles
        # normalized == 1.0, which would otherwise overflow to bin 256)
        bin_idx = int(np.clip(normalized * NUM_BINS, 0, NUM_BINS - 1))
        tokens.append(bin_idx)
    return tokens
def tokens_to_continuous(tokens: list[int]) -> np.ndarray:
    """Convert bin indices back to continuous action."""
    action = []
    for bin_idx, (lo, hi) in zip(tokens, ACTION_RANGES):
        # Map to bin center for unbiased reconstruction
        continuous = lo + (bin_idx + 0.5) / NUM_BINS * (hi - lo)
        action.append(continuous)
    return np.array(action)
# Example: encode and decode a grasp action
action = np.array([0.02, -0.01, -0.03, 0.0, 0.1, -0.05, 0.0])
tokens = continuous_to_tokens(action)
print(f"Tokens: {tokens}")
# Tokens: [179, 102, 51, 128, 179, 102, 0]
reconstructed = tokens_to_continuous(tokens)
print(f"Max reconstruction error: {np.abs(action - reconstructed).max():.5f}")
# Max reconstruction error: 0.00195 (dominated by the gripper's coarse 0-1 range;
# the position dims are within 0.0002 m, well within robot precision)
VLA Inference Loop
A simplified inference loop showing how a VLA model processes observations and generates actions in a closed-loop control setting:
import torch
from PIL import Image
class VLAInferenceLoop:
"""Simplified VLA control loop (OpenVLA-style)."""
def __init__(self, model, processor, action_ranges, num_bins=256):
self.model = model
self.processor = processor
self.action_ranges = action_ranges
self.num_bins = num_bins
def get_action(self, image: Image.Image, instruction: str):
"""Run single-step VLA inference."""
# 1. Tokenize image and instruction
inputs = self.processor(
images=image,
text=f"In: What action should the robot take to {instruction}?\nOut:",
return_tensors="pt"
).to(self.model.device)
# 2. Generate action tokens autoregressively
with torch.no_grad():
output_ids = self.model.generate(
**inputs,
max_new_tokens=len(self.action_ranges),
do_sample=False, # greedy decoding for actions
)
# 3. Decode token IDs to bin indices
generated_ids = output_ids[0, inputs["input_ids"].shape[1]:]
bin_indices = [
int(self.processor.decode(tok_id).strip())
for tok_id in generated_ids[:len(self.action_ranges)]
]
# 4. Convert bins to continuous action
action = tokens_to_continuous(bin_indices)
return action
def run(self, env, instruction: str, max_steps: int = 300):
"""Closed-loop control."""
obs = env.reset()
for step in range(max_steps):
image = Image.fromarray(obs["image"])
action = self.get_action(image, instruction)
obs, reward, done, info = env.step(action)
if done:
print(f"Task completed in {step + 1} steps")
return True
print("Task timed out")
return False
LoRA Fine-Tuning for OpenVLA
Adapting a pretrained VLA to a new task with parameter-efficient fine-tuning:
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForVision2Seq, AutoProcessor
# Load pretrained OpenVLA
model = AutoModelForVision2Seq.from_pretrained(
"openvla/openvla-7b",
torch_dtype=torch.bfloat16,
device_map="auto",
)
processor = AutoProcessor.from_pretrained("openvla/openvla-7b")
# Apply LoRA to attention layers only
lora_config = LoraConfig(
r=32, # rank
lora_alpha=32, # scaling factor
target_modules=[ # which layers to adapt
"q_proj", "k_proj", "v_proj", "o_proj",
"gate_proj", "up_proj", "down_proj",
],
lora_dropout=0.05,
bias="none",
task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# trainable params: 13,893,632 || all params: 7,615,751,168 || 0.18%
# Fine-tune on your robot's demonstrations
# (dataset loading, training loop, etc.)
References
Seminal papers and key works referenced in this article.
- Brohan et al. "RT-1: Robotics Transformer for Real-World Control at Scale." RSS, 2023. arXiv
- Brohan et al. "RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control." CoRL, 2023. arXiv
- Ghosh et al. "Octo: An Open-Source Generalist Robot Policy." RSS, 2024. arXiv
- Kim et al. "OpenVLA: An Open-Source Vision-Language-Action Model." CoRL, 2024. arXiv
- Black et al. "π₀: A Vision-Language-Action Flow Model for General Robot Control." 2024. arXiv