Capuano et al., Chapter 5

Generalist Robot Policies

From single-task policies to foundation models — language-conditioned VLAs that work across tasks and embodiments.

Prerequisites: Chapters 1–4 and basic familiarity with transformers. That's it.
10 chapters · 5+ simulations

Chapter 0: From Single-Task to Generalist

Everything we have built so far — behavior cloning, diffusion policies, RL fine-tuning, world models — shares one brutal limitation: each policy knows exactly one task on exactly one robot. Train a diffusion policy to fold towels on a Franka arm, and it cannot stack cups. Train it on an ALOHA bimanual system, and it has no idea what to do on a single-arm WidowX.

Think about how absurd this is compared to language. GPT-4 can write poetry, debug code, summarize legal documents, and translate Mandarin — all with a single set of weights. We trained one model, and it generalized across an enormous space of tasks. Why can't we do this for robots?

Three obstacles stand in the way:

  1. Embodiment diversity. A 6-DOF arm, a bimanual torso, and a humanoid with legs have completely different action spaces. "Move the gripper 5cm right" means different joint commands on each body.
  2. Task specification. In language, the "task" is the prompt. In robotics, how do you tell the robot what to do? Hard-coded task IDs don't scale. Natural language does.
  3. Data scarcity. The internet has trillions of tokens of text. Robotics has — until very recently — maybe a few hundred thousand demonstration episodes, scattered across incompatible formats.

The breakthrough idea of 2024–2025: take a vision-language model (VLM) that already understands images and text, and bolt on an action head that outputs motor commands. The VLM provides the "brain" — visual understanding, language grounding, common-sense reasoning. The action head provides the "hands" — continuous, smooth motor outputs. Together, they form a Vision-Language-Action model (VLA).

The generalist bet: Instead of training 1,000 specialist policies for 1,000 tasks, train one foundation model on all the data you can get. Language is the universal task interface — say "pick up the red cup" and the same weights handle perception, planning, and action.

Interactive demo: One Policy, Many Robots. The same generalist policy receives a language instruction, processes the camera image, and outputs actions in each robot's own action space.

What is the fundamental problem that generalist robot policies try to solve?

Chapter 1: VLM Backbones

Where does a generalist robot get its "understanding" of the world? Not from robot data — there isn't enough. Instead, we start with a pre-trained vision-language model (VLM) that has already learned to see and speak from billions of internet image-text pairs.

A VLM provides three superpowers for free:

  1. Visual understanding. It knows what a cup looks like, what "the red object on the left" means, and can distinguish a spoon from a spatula — without ever seeing a robot gripper.
  2. Language grounding. It maps words like "pick up" or "move to the bowl" to visual concepts. This is the bridge that lets natural language be the task interface.
  3. Semantic reasoning. It knows that cups go on saucers, that liquids pour, that drawers slide open. This common-sense knowledge is impossible to learn from a few thousand robot episodes alone.

The key architectural insight: a VLM is already a sequence model. It takes in image tokens and text tokens, processes them through a transformer, and outputs text tokens. To make it a VLA, we need one more output — action tokens.

The VLM → VLA conversion: Take a VLM that outputs text. Add an action head that outputs motor commands. Keep the VLM backbone mostly frozen (or lightly fine-tuned) so you don't destroy its visual and language knowledge. The VLA inherits the VLM's understanding and adds the ability to do things.

Two concrete architectures dominate this space:

| Approach | Action output | Example |
| --- | --- | --- |
| Tokenized actions | Discretize actions into tokens, predict autoregressively like text | RT-2, OpenVLA |
| Continuous action head | Attach a separate network that outputs continuous action vectors via diffusion/flow matching | π0, SmolVLA |

The tokenized approach is simpler — you literally treat actions as extra vocabulary. But discretization throws away precision, and autoregressive decoding is slow (one token at a time). The continuous action head approach is more complex but produces smoother, faster actions. This is what π0 and SmolVLA use, and what we will focus on.
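To make the precision cost of tokenization concrete, here is a minimal sketch of uniform binning. The 256-bin count matches what the RT-2 and OpenVLA papers describe; the ±5 cm action range and the function names are assumptions chosen for illustration, not anyone's actual implementation.

```python
def discretize(value, low, high, n_bins=256):
    """Map a continuous value in [low, high] to an integer bin index (the action 'token')."""
    clipped = min(max(value, low), high)
    width = (high - low) / n_bins
    index = int((clipped - low) / width)
    return min(index, n_bins - 1)  # value == high falls into the last bin


def undiscretize(index, low, high, n_bins=256):
    """Map a bin index back to the centre of its bin."""
    width = (high - low) / n_bins
    return low + (index + 0.5) * width


low, high = -0.05, 0.05   # assumed per-step range: +/- 5 cm along one axis
action = 0.0123           # the continuous command we would like to send
token = discretize(action, low, high)
recovered = undiscretize(token, low, high)
max_error = (high - low) / 256 / 2   # worst case: half a bin width

print(token, recovered, abs(action - recovered) <= max_error)
```

The round trip can lose up to half a bin width per dimension per step. A continuous action head avoids this quantization entirely, which is one reason the text favors it.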

Image + Text: camera observation + language instruction
↓
VLM Backbone: pre-trained transformer that encodes visual and language features into a shared representation
↓
Action Head: continuous output; generates smooth action chunks via flow matching or diffusion
↓
Robot Actions: joint velocities, end-effector poses, gripper open/close

Why use a pre-trained VLM as the backbone for a robot policy instead of training from scratch?

Chapter 2: Data at Scale

Language models became powerful when data went from millions to trillions of tokens. Can the same scaling story work for robotics? The answer, circa 2024, is a cautious "yes" — but the data landscape looks very different.

The landmark effort is Open X-Embodiment (OXE), an aggregation of robot datasets from over 20 research labs. It contains:

| Metric | Value |
| --- | --- |
| Total episodes | 1M+ |
| Robot embodiments | 22 |
| Distinct skills | 500+ (527 reported) |
| Research institutions | 21 |

That sounds like a lot until you compare it to language. One million episodes of roughly 5–30 seconds each add up to about 1,400–8,300 hours of robot experience, so under 10,000 hours in total. Large language models are trained on trillions of tokens, which would take a human reader tens of thousands of years to get through. Robotics data is still many orders of magnitude smaller.
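The back-of-envelope estimate is easy to check. The sketch below only multiplies the episode count by the assumed 5–30 second episode length from the text.

```python
def total_hours(n_episodes, seconds_per_episode):
    """Total recorded experience in hours."""
    return n_episodes * seconds_per_episode / 3600


low = total_hours(1_000_000, 5)    # every episode is short
high = total_hours(1_000_000, 30)  # every episode is long
print(f"{low:,.0f} to {high:,.0f} hours")
```

Even the optimistic end stays in the thousands of hours, which is the point of the comparison with language data.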

A second major effort is DROID (Distributed Robot Interaction Dataset): about 76,000 episodes spanning 564 scenes and 86 tasks, collected at 13 institutions using a standardized setup. What makes DROID special is consistency: the same robot and camera hardware, the same action format, and the same annotation quality at every site.

The data flywheel: More data → better models → models that work on more robots → more users collecting data → more data. This is the same flywheel that powered language model scaling, and it's why HuggingFace invested in LeRobot: an open ecosystem accelerates the flywheel for everyone.

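But raw quantity isn't enough. Robot datasets also pose a challenge of their own: every robot records actions in its own format and joint space, so episodes from different embodiments cannot be pooled directly.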

The practical solution: standardize on end-effector pose deltas in SE(3) (position change + rotation change + gripper state) as the action representation. This is what π0, SmolVLA, and LeRobot all converge on. A 3D position delta and a rotation delta mean the same thing regardless of whether the underlying robot has 6 joints or 16.
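As a sketch of what such an action looks like, the snippet below computes a 7-D delta between two end-effector poses. It is an illustration under stated assumptions: poses are given as a position plus a 3×3 rotation matrix, and the rotation change is encoded as a rotation vector (axis times angle) rather than the roll/pitch/yaw deltas mentioned elsewhere in this chapter. The two agree closely for small rotations. All function names are hypothetical.

```python
import math


def matmul(a, b):
    """Product of two 3x3 matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)] for i in range(3)]


def transpose(m):
    return [list(row) for row in zip(*m)]


def rot_z(theta):
    """Rotation by theta radians about the z axis."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]


def rotation_vector(r):
    """Axis-angle vector (axis * angle, radians) of a rotation matrix; valid for angles below pi."""
    cos_angle = max(-1.0, min(1.0, (r[0][0] + r[1][1] + r[2][2] - 1.0) / 2.0))
    angle = math.acos(cos_angle)
    if angle < 1e-9:
        return [0.0, 0.0, 0.0]
    scale = angle / (2.0 * math.sin(angle))
    return [scale * (r[2][1] - r[1][2]), scale * (r[0][2] - r[2][0]), scale * (r[1][0] - r[0][1])]


def pose_delta(p_now, r_now, p_next, r_next, gripper):
    """7-D action: position change, rotation change (rotation vector), gripper command."""
    dp = [b - a for a, b in zip(p_now, p_next)]
    d_rot = rotation_vector(matmul(transpose(r_now), r_next))
    return dp + d_rot + [gripper]


# Move 2 cm along x, rotate 5 degrees about z, close the gripper (1.0 = closed, by assumption).
action = pose_delta([0.30, 0.00, 0.20], rot_z(0.0),
                    [0.32, 0.00, 0.20], rot_z(math.radians(5)), gripper=1.0)
print(action)
```

Nothing in this vector mentions joints, which is why the same 7 numbers are meaningful on robots with very different kinematics.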

Interactive demo: The Robot Data Landscape. Each block represents episodes; even the largest robot datasets are tiny compared to language data.

What common action representation do modern VLAs standardize on to handle different robot embodiments?

Chapter 3: π0 Architecture

Now we get to the model that put VLAs on the map. π0 (pi-zero), from Physical Intelligence (2024), was among the first VLAs to demonstrate cross-task, cross-embodiment generalization at scale. Let's take it apart piece by piece.

The architecture has two halves, and understanding why they are separate is the key insight:

VLM Backbone

PaliGemma 3B — a pre-trained vision-language model from Google. It takes in images (via a SigLIP vision encoder) and text (via a Gemma language model) and produces rich, semantically meaningful embeddings.

This half understands the what and why: what objects are in the scene, what the instruction means, what a reasonable next step looks like.

Action Expert

A separate transformer that takes the VLM's embeddings and generates continuous action chunks via flow matching (see Chapter 2 on imitation learning).

This half handles the how: the precise, smooth motor commands needed to actually move the gripper 3.2 cm to the right while rotating 15 degrees.

Why two separate networks? VLMs are pre-trained on text output — they think in discrete tokens. Robot actions are continuous and high-frequency. Forcing a VLM to output raw actions through its language head would be like asking a poet to write sheet music — the representations are fundamentally different. The action expert speaks the language of continuous control.

The data flow through π0:

Camera Image(s): RGB frames from wrist + base cameras, 224×224×3 each
↓
SigLIP Vision Encoder: encodes each image into patch tokens, 256 tokens per image
Language Instruction: "Pick up the red block and place it on the plate" → tokenized text
↓
Gemma LM (PaliGemma): attends jointly over the concatenated image and text tokens and outputs a fused representation
↓
Action Expert (Flow Matching Transformer): takes VLM embeddings + noisy action chunk and iteratively denoises to produce a clean action sequence
↓
Action Chunk: 50 timesteps of [Δx, Δy, Δz, Δroll, Δpitch, Δyaw, gripper]

The action expert uses flow matching, not diffusion. Recall from the imitation learning chapter (Chapter 2 of the tutorial) that flow matching learns straight-line paths from noise to data, making denoising faster. At inference, π0 needs only 10 denoising steps (vs 50–100 for DDPM-style diffusion) to generate a full action chunk. Each chunk is 50 timesteps of 7D actions, so a single inference call (all 10 steps) produces roughly 1–2 seconds of future motion, depending on the control rate.

Action chunks, not single actions. Like ACT (Action Chunking with Transformers), π0 predicts an entire sequence of future actions at once. This temporal smoothness prevents the jittery stop-start behavior you get from step-by-step action prediction. The robot executes the first few steps, then re-plans with a new observation.
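The execute-then-re-plan pattern can be sketched in a few lines. Everything here is a toy: `dummy_policy` is a hypothetical stand-in for a VLA that returns a chunk of 1-D position deltas toward a fixed goal, and the choice to execute 10 of the 50 predicted steps is an assumption for illustration.

```python
def dummy_policy(observation, chunk_size=50):
    """Hypothetical stand-in for a VLA: a chunk of equal 1-D deltas that would reach the goal."""
    goal = 1.0
    step = (goal - observation) / chunk_size
    return [step] * chunk_size


def run_episode(n_control_steps=100, execute_per_chunk=10):
    """Execute the first few actions of each chunk, then re-plan from the latest observation."""
    position, replans, trajectory = 0.0, 0, []
    chunk, cursor = [], 0
    for _ in range(n_control_steps):
        if cursor >= min(execute_per_chunk, len(chunk)):
            chunk, cursor = dummy_policy(position), 0  # new chunk from a fresh observation
            replans += 1
        position += chunk[cursor]
        cursor += 1
        trajectory.append(position)
    return trajectory, replans


trajectory, replans = run_episode()
print(replans, round(trajectory[-1], 4))
```

Executing only a prefix of each chunk trades a little extra compute for the ability to react to new observations before the old plan goes stale.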

Figure: π0 Architecture Diagram. The two halves of π0: the VLM backbone (teal) processes vision and language, while the action expert (orange) generates motor commands via flow matching.

Why does π0 use a separate action expert instead of predicting actions directly from the VLM's language head?

Chapter 4: π0 Training

Having a good architecture is half the battle. The other half is training it correctly — and π0's training recipe is surprisingly nuanced.

Phase 1: Pre-training the backbone

PaliGemma 3B comes already pre-trained on internet-scale image-text data. This gives the VLM backbone its visual and language understanding. π0 inherits all of this for free. No robot data needed for this phase.

Phase 2: Robot fine-tuning

Now we fine-tune the full model (VLM + action expert) on robot demonstration data. The training set is massive by robotics standards:

| Property | Value |
| --- | --- |
| Total robot data | 10,000+ hours |
| Robot embodiments | 7 different robots |
| Data sources | OXE + DROID + proprietary data |
| Action format | End-effector Δpose + gripper state |

The mixed-training trick: During robot fine-tuning, π0 doesn't train exclusively on robot data. It mixes in language and vision tasks (image captioning, VQA) to prevent catastrophic forgetting. If you only show the model robot data, it gradually forgets what "red" means or how to parse complex instructions. The mix ratio is typically 70% robot data, 30% vision-language data.
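One simple way to realize such a mix is to pick the data source for each batch at random with the desired probability. The sketch below assumes the 70/30 ratio quoted above; the function name and the per-batch sampling scheme are illustrative assumptions, not the actual π0 data loader.

```python
import random


def sample_sources(n_batches, robot_fraction=0.7, seed=0):
    """Choose a data source per batch so that about robot_fraction of batches are robot data."""
    rng = random.Random(seed)
    return ["robot" if rng.random() < robot_fraction else "vision_language"
            for _ in range(n_batches)]


schedule = sample_sources(10_000)
robot_share = schedule.count("robot") / len(schedule)
print(f"robot batches: {robot_share:.1%}")
```

Interleaving at the batch level keeps vision-language examples present throughout training, so the backbone is never updated on robot data alone for long stretches.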

The action expert's loss

The action expert is trained with a flow matching objective. Given a ground-truth action chunk $a_0$ from a demonstration and a random noise vector $a_1$, we create an interpolated point at time $t$:

$$a_t = (1 - t)\, a_1 + t\, a_0$$

The model predicts the velocity field, the direction from noise to data:

$$L = \left\lVert v_\theta(a_t, t, \text{context}) - (a_0 - a_1) \right\rVert^2$$

At inference time, we start from pure noise $a_1$ and take 10 Euler steps along the learned velocity field to arrive at a clean action chunk. Fewer steps than diffusion, straighter paths, faster inference.
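The interpolation, the regression target, and the 10-step Euler sampler can be sketched without any neural network. The assumption here is a dataset containing a single action chunk, for which the ideal velocity field has a closed form; `oracle_velocity` stands in for the trained network and is not part of π0.

```python
import random


def interpolate(a0, a1, t):
    """Point on the straight path from noise a1 (t = 0) to data a0 (t = 1)."""
    return [(1 - t) * n + t * d for d, n in zip(a0, a1)]


def target_velocity(a0, a1):
    """Regression target of the flow matching loss: the direction from noise to data."""
    return [d - n for d, n in zip(a0, a1)]


def oracle_velocity(a_t, t, a0):
    """Ideal velocity field when the dataset holds the single chunk a0 (replaces the network)."""
    return [(d - x) / (1 - t) for d, x in zip(a0, a_t)]


def euler_sample(a0, n_steps=10, seed=0):
    """Start from Gaussian noise and integrate the velocity field from t = 0 to t = 1."""
    rng = random.Random(seed)
    a = [rng.gauss(0.0, 1.0) for _ in a0]
    dt = 1.0 / n_steps
    for k in range(n_steps):
        t = k * dt  # t never reaches 1, so the oracle's division is safe
        v = oracle_velocity(a, t, a0)
        a = [x + dt * vi for x, vi in zip(a, v)]
    return a


demo_action = [0.02, 0.0, -0.01, 0.0, 0.0, 0.05, 1.0]  # one 7-D action, made-up values
sample = euler_sample(demo_action)
print([round(x, 6) for x in sample])
```

Because the paths are straight lines, a handful of Euler steps is enough to land on the data point in this toy case. A learned field is only approximately straight, which is why real systems still use several steps.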

LeRobot code

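```python
# Using pi0 with LeRobot (module paths as in this tutorial; newer releases may differ)
import torch
from lerobot.common.policies.pi0.modeling_pi0 import PI0Policy

# Load pre-trained pi0 weights from the Hugging Face Hub
policy = PI0Policy.from_pretrained("lerobot/pi0")
policy.eval()

# Placeholder inputs; real key names and shapes must match the checkpoint's config
image_tensor = torch.rand(1, 3, 224, 224)   # [B, C, H, W], values in [0, 1]
proprio_tensor = torch.zeros(1, 7)          # [B, state_dim]

# At inference: pass an observation dict that includes the language instruction
observation = {
    "observation.images.top": image_tensor,
    "observation.state": proprio_tensor,
    "task": ["pick up the red cup"],
}
with torch.no_grad():
    action = policy.select_action(observation)
# select_action returns one action per call, shape [B, action_dim]; the policy
# generates a full chunk (e.g., 50 steps) internally and serves it from a queue
```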
Inference speed: The π0 paper reports about 73 ms to generate a full 50-step action chunk with 10 denoising steps on an NVIDIA RTX 4090 (about 86 ms when inference runs off-board over a network). That's fast enough for real-time control at a 5 Hz re-planning frequency: observe, generate a chunk, execute for 200 ms, re-observe.
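The timing argument reduces to a small budget calculation. In the sketch below, the 80 ms latency is a round illustrative figure in the range discussed above, and the 50 Hz control rate is an assumption; the function name is hypothetical.

```python
def replanning_budget(inference_ms, replan_hz, control_hz=50):
    """Time available per re-planning cycle and how many actions are executed in it."""
    period_ms = 1000.0 / replan_hz
    return {
        "period_ms": period_ms,                   # length of one observe-plan-act cycle
        "slack_ms": period_ms - inference_ms,     # time left after the chunk is generated
        "actions_per_period": int(round(period_ms * control_hz / 1000.0)),
    }


budget = replanning_budget(inference_ms=80, replan_hz=5)
print(budget)
```

As long as the slack stays positive, the next chunk is ready before the robot runs out of actions from the previous one.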
Why does π0 mix vision-language tasks into robot fine-tuning?

Chapter 5: SmolVLA

π0 is impressive, but it has a problem: roughly 3 billion parameters. Inference already needs a high-end GPU, and fine-tuning demands considerably more memory. Most robotics labs, and certainly most hobbyists with a LeRobot SO-100 on their desk, don't have that kind of hardware.

Enter SmolVLA (Small Vision-Language-Action), developed by HuggingFace. The pitch is radical: about 450 million parameters, roughly 7 times smaller than π0, while achieving competitive performance on standard manipulation benchmarks.

The accessibility argument: If only labs with $100K GPU clusters can fine-tune VLAs, the data flywheel stalls. SmolVLA is designed to be fine-tuned on a single consumer GPU (RTX 4090, ~$1,600) in under 24 hours. This is what makes generalist policies practical for the open-source community.
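A rough memory estimate shows why parameter count decides which GPU you need. The sketch assumes 2 bytes per parameter for half-precision weights and about 16 bytes per parameter for mixed-precision Adam training (weights, gradients, and optimizer state). That is a common rule of thumb which ignores activations, so treat the numbers as lower bounds. The model sizes in the loop are illustrative.

```python
def weight_memory_gb(n_params, bytes_per_param=2):
    """Memory for model state alone, in gigabytes (activations not included)."""
    return n_params * bytes_per_param / 1e9


for name, n_params in [("7B model", 7e9), ("3B model", 3e9), ("sub-billion model", 0.45e9)]:
    inference_gb = weight_memory_gb(n_params)        # half-precision weights only
    training_gb = weight_memory_gb(n_params, 16)     # full fine-tuning with Adam
    print(f"{name}: ~{inference_gb:.1f} GB to hold weights, ~{training_gb:.1f} GB to fine-tune")
```

Full fine-tuning of a multi-billion-parameter model overflows a 24 GB consumer card on model state alone, while a sub-billion model fits with room to spare.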

How does SmolVLA get away with being so small? Three design choices:

  1. Efficient backbone. Instead of PaliGemma 3B, SmolVLA uses SmolVLM, a compact VLM designed for edge deployment. It packs visual and language understanding into a few hundred million parameters.
  2. Lightweight flow matching action expert. Like π0, SmolVLA generates action chunks with flow matching, but its action expert is a much smaller transformer (on the order of 100M parameters).
  3. Aggressive image compression. SmolVLA uses fewer image tokens per frame (e.g., 64 vs. 256), reducing the compute bottleneck in the transformer backbone.
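The effect of the third choice can be estimated with simple arithmetic, since the cost of computing self-attention scores grows with the square of the sequence length. The sketch assumes two cameras and a 32-token instruction; both are illustrative values, and the estimate ignores the feed-forward layers, which scale linearly.

```python
def sequence_length(n_cameras, tokens_per_image, n_text_tokens):
    """Number of tokens the backbone sees for one observation."""
    return n_cameras * tokens_per_image + n_text_tokens


def attention_cost(n_tokens):
    """Relative cost of the attention score matrix, which is n_tokens x n_tokens."""
    return n_tokens ** 2


full = sequence_length(2, 256, 32)      # 256 tokens per image
compact = sequence_length(2, 64, 32)    # 64 tokens per image
ratio = attention_cost(full) / attention_cost(compact)
print(full, compact, round(ratio, 2))
```

Under these assumptions, cutting image tokens by 4× shrinks the attention score computation by more than 10×, because images dominate the sequence.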

VLA Parameter Comparison

How SmolVLA compares to other VLAs in model size. Smaller models enable fine-tuning on consumer hardware and faster inference on edge devices.

| Model | Parameters | Backbone | Action head | Hardware for fine-tuning |
| --- | --- | --- | --- | --- |
| RT-2 | 55B | PaLI-X | Tokenized | TPU pod |
| OpenVLA | 7B | Prismatic VLM | Tokenized | 8x A100 |
| π0 | 3B | PaliGemma | Flow matching | 4x A100 |
| SmolVLA | 450M | SmolVLM | Flow matching | 1x RTX 4090 |

The benchmark results are striking. On the SimplerEnv manipulation suite, SmolVLA achieves within 5–10% of π0's success rate on most tasks, at a fraction of the parameter count. On some tasks (like "pick up cube"), it actually outperforms π0, likely because the simpler model overfits less on easy tasks.

What is SmolVLA's main advantage over π0?

Chapter 6: SmolVLA Architecture Details

Let's open the hood and trace exactly how data flows through SmolVLA, from raw pixels and text to motor commands.

Stream 1: Vision (SigLIP)

Camera images enter through a SigLIP vision encoder, from the same family used in PaliGemma but a smaller variant. Each 224×224 image is divided into a 14×14 grid of 16×16-pixel patches, and each patch becomes a token. A learned projection maps these into the model's hidden dimension.

Image (224×224×3) → SigLIP → 196 patch tokens → projection → 64 compressed tokens

The compression from 196 to 64 tokens is critical for efficiency. SmolVLA uses a perceiver-style pooling layer that learns to summarize visual information into fewer tokens without losing task-relevant details.

Stream 2: Language (SmolLM)

The language instruction is tokenized and processed by SmolLM, a compact language model (~125M parameters). This produces contextualized text embeddings that encode what the robot should do.

Stream 3: Actions (Flow Matching Transformer)

The action head is a small transformer trained with flow matching (the "action expert"). It takes three inputs:

  1. VLM embeddings (fused vision + language) as conditioning
  2. Noisy action chunk a_t (Gaussian noise at start, progressively denoised)
  3. Timestep embedding t (tells the model how much noise remains)

The action expert uses cross-attention to attend from action tokens to VLM embeddings. This is how the action head "reads" what the VLM sees and understands.
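The cross-attention step described above can be sketched with plain lists. This is a single attention head without the learned query, key, and value projections a real model would have, and the toy 2-D embeddings are invented for illustration.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def cross_attention(queries, keys, values):
    """Each query (an action token) becomes a weighted mix of the values (context tokens)."""
    scale = 1.0 / math.sqrt(len(keys[0]))
    outputs, weights = [], []
    for q in queries:
        w = softmax([scale * dot(q, k) for k in keys])
        weights.append(w)
        outputs.append([sum(wi * v[d] for wi, v in zip(w, values))
                        for d in range(len(values[0]))])
    return outputs, weights


context = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # 3 fused vision-language tokens
action_tokens = [[2.0, 0.0], [0.0, 2.0]]         # 2 noisy action tokens
out, attn = cross_attention(action_tokens, context, context)
print(attn)
```

Queries come from the action side and keys and values from the VLM side, so information flows one way: the action tokens read the scene and instruction, while the context tokens are left unchanged.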

Image → SigLIP → 64 tokens: visual features compressed via perceiver pooling
↓ concat
Text → SmolLM → N tokens: language instruction embeddings
↓ fused conditioning
Action Expert: cross-attends to VLM features and iteratively denoises the action chunk over ~10 flow matching steps
↓
Action chunk [B, chunk_size, action_dim], e.g., [1, 50, 7]: 50 future steps of [Δx, Δy, Δz, Δr, Δp, Δyaw, grip]

Fine-tuning SmolVLA with LeRobot

```python
# Fine-tuning SmolVLA on your own dataset with LeRobot
# (module paths as in this tutorial; newer LeRobot releases may use different paths)
from lerobot.common.policies.smolvla.modeling_smolvla import SmolVLAPolicy
from lerobot.common.datasets.lerobot_dataset import LeRobotDataset

# Load your custom dataset (collected via teleoperation)
dataset = LeRobotDataset("your_hf_username/pick_place_cups")

# Start from the pretrained base checkpoint rather than random weights.
# Feature shapes (e.g., a [3, H, W] camera image, a 7-D state, a 7-D action)
# are taken from the dataset metadata by LeRobot's training script,
# so they are not hard-coded here.
policy = SmolVLAPolicy.from_pretrained("lerobot/smolvla_base")
policy.train()

# The action chunk length (50 by default) lives in the policy config
chunk_size = policy.config.chunk_size

# Training: policy.forward(batch) computes the flow matching loss;
# LeRobot's training script wraps it in the optimizer loop.
# ~8 hours on a single RTX 4090 for 50K steps
```
Practical tip: When fine-tuning SmolVLA on a custom dataset with limited data (e.g., 100 episodes), freeze the SigLIP vision encoder and SmolLM language model. Only train the projection layers and the action head. This prevents catastrophic forgetting with small datasets and reduces trainable parameters to roughly 100M.
How does SmolVLA's action head receive information about the visual scene and language instruction?

Chapter 7: Cross-Embodiment Transfer

Here's where things get genuinely surprising. You train a policy on data from one robot and deploy it on a completely different robot — and it works. Not perfectly, but far better than random. How is this possible?

The key is the shared action representation. When every robot's actions are expressed as end-effector pose deltas in SE(3) — "move the gripper 2cm right, rotate 5 degrees, close gripper" — the action space is the same regardless of the underlying kinematics.

A concrete example from the LeRobot ecosystem:

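| | SO-100 (Training) | ALOHA-2 (Deployment) |
| --- | --- | --- |
| Type | Single 5-DOF arm + gripper | Bimanual (2 × 6-DOF arm + gripper) |
| Actuators | 6 servos | 14 servos |
| Joint action dim | 6 (5 joints + grip) | 14 (7 per arm: 6 joints + grip) |
| EE delta dim | 7 (Δxyz + Δrpy + grip) | 7 (Δxyz + Δrpy + grip) per arm |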

Even though the raw joint spaces are completely different, the task-level representation — "how should the end-effector move?" — is shared. If the model learned on SO-100 that "pick up" means "move down, close gripper, move up," that same sequence of SE(3) deltas works on ALOHA-2's right arm.
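To see how one end-effector delta turns into different joint commands, consider two hypothetical planar arms (2 links and 3 links). The sketch maps the same "5 cm right" delta to joint changes through each arm's Jacobian, using the minimum-norm pseudo-inverse. The arm geometries and function names are made up for illustration; real robots work in 3D with their own kinematics libraries.

```python
import math


def planar_jacobian(link_lengths, joint_angles):
    """2 x n Jacobian of the end-effector (x, y) position of a planar serial arm."""
    cumulative, total = [], 0.0
    for q in joint_angles:
        total += q
        cumulative.append(total)   # absolute angle of each link
    n = len(link_lengths)
    jx = [-sum(link_lengths[j] * math.sin(cumulative[j]) for j in range(i, n)) for i in range(n)]
    jy = [sum(link_lengths[j] * math.cos(cumulative[j]) for j in range(i, n)) for i in range(n)]
    return [jx, jy]


def joint_delta(jacobian, dx):
    """Minimum-norm joint change for a small end-effector move: dq = J^T (J J^T)^-1 dx."""
    jx, jy = jacobian
    a = sum(v * v for v in jx)
    b = sum(u * v for u, v in zip(jx, jy))
    d = sum(v * v for v in jy)
    det = a * d - b * b            # zero only at a kinematic singularity
    wx = (d * dx[0] - b * dx[1]) / det
    wy = (-b * dx[0] + a * dx[1]) / det
    return [x * wx + y * wy for x, y in zip(jx, jy)]


ee_delta = [0.05, 0.0]   # "move the gripper 5 cm right"
small_arm = planar_jacobian([0.3, 0.3], [0.4, 0.9])
big_arm = planar_jacobian([0.4, 0.3, 0.2], [0.3, 0.6, 0.5])
dq_small = joint_delta(small_arm, ee_delta)
dq_big = joint_delta(big_arm, ee_delta)
print(dq_small, dq_big)
```

The policy only has to produce `ee_delta`. Each robot's own controller does the conversion to joint space, which is why the learned behavior can be shared.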

What transfers: Spatial reasoning (reaching, placing, pushing), visual grounding (identifying objects from language), and task structure (pick-then-place sequences). What doesn't transfer: Gripper-specific grasping strategies (finger shapes differ), bimanual coordination (single-arm data can't teach two-arm dance), and workspace limits (each robot has different reach).

The practical recipe for cross-embodiment transfer:

  1. Pre-train on the union of all available datasets (OXE + DROID + your data), all converted to SE(3) end-effector deltas.
  2. Fine-tune on a small amount of target robot data (even 50–100 episodes can be enough for known tasks).
  3. Zero-shot deployment works for simple tasks (reaching, pushing) but fine-tuning is needed for precise manipulation.

Interactive demo: Cross-Embodiment Transfer Simulator. A policy trained on one robot transfers to another through the shared SE(3) action space; the same action sequence maps across embodiments.

Why does expressing actions as end-effector SE(3) deltas enable cross-embodiment transfer?

Chapter 8: The Frontier

We have built all the pieces: VLM backbones, massive datasets, flow matching action heads, cross-embodiment training. But the field of generalist robot policies is still in its infancy. Here are the open questions that will define the next 3–5 years.

Scaling laws for robot data

Language models follow predictable scaling laws: double the data, performance improves by a known amount. Do robot policies follow similar curves? Early evidence from RT-2 and π0 suggests yes, but the slope is different. Visual-motor tasks saturate faster than language tasks — you don't need a trillion episodes to learn "pick up the cup." The question is where each task's saturation point lies.
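"Improves by a known amount" has a precise meaning under a power law. The sketch below assumes a failure rate of the form scale · N^(−exponent); the scale and exponent are hypothetical placeholders, not fitted to any robot dataset.

```python
def power_law_error(n_episodes, scale=1.0, exponent=0.3):
    """Hypothetical scaling curve: failure rate = scale * N^(-exponent)."""
    return scale * n_episodes ** (-exponent)


def doubling_factor(exponent):
    """Factor by which the error shrinks each time the dataset doubles."""
    return 2.0 ** (-exponent)


before = power_law_error(1_000)
after = power_law_error(2_000)
print(round(after / before, 4), round(doubling_factor(0.3), 4))
```

Under a power law, each doubling of data multiplies the error by the same constant. A task that "saturates" is one where the curve flattens and this constant factor no longer holds.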

How much data is enough? For simple tasks (reach, grasp), perhaps 1,000 episodes suffice. For complex manipulation (folding clothes, cooking), we may need 100,000+. For truly open-ended behavior, we likely need orders of magnitude more than currently exists. Nobody knows the exact numbers yet.

Simulation as data engine

Can simulation data substitute for real data? The answer is nuanced:

| Pro | Con |
| --- | --- |
| Unlimited quantity: generate millions of episodes overnight | Domain gap: simulated images and physics don't perfectly match reality |
| Automatic reset: no human operator needed between episodes | Contact physics is hard: deformable objects, liquids, friction |
| Perfect labels: ground-truth state, segmentation, poses | Sim-trained policies often fail on real textures, lighting, clutter |

The emerging consensus: sim-to-real-to-sim. Train in sim, deploy on real robot, use the real experience to improve the simulator, repeat. Each loop narrows the domain gap. The simulator becomes a better training ground, and the real robot provides grounding.

Open challenges

  1. Long-horizon tasks. Current VLAs handle 10–30 second tasks. Making a sandwich requires 5+ minutes of coordinated manipulation. How do you plan, recover from errors, and maintain context over that timeframe?
  2. Safety and robustness. A language model that hallucinates is embarrassing. A robot that applies 50N of force in the wrong direction is dangerous. How do we guarantee safety bounds?
  3. Emergent capabilities. GPT-3 unexpectedly learned to do arithmetic. What unexpected capabilities might emerge from robot foundation models trained on diverse enough data?
  4. Continual learning. How does a deployed robot improve from its own experience without forgetting what it already knows? This is catastrophic forgetting at deployment scale.
The bet: The teams behind π0, SmolVLA, and RT-2 are all betting that the same recipe that worked for language will work for robotics: scale data, scale models, and emergent generalization will follow. The open-source community (LeRobot, Open X-Embodiment, DROID) is trying to make this happen in the open, not behind closed doors. The next few years will tell us if they're right.
What is the "sim-to-real-to-sim" loop?

Chapter 9: Connections

We've covered the full arc of the Robot Learning Tutorial — from collecting your first demonstrations to training generalist foundation models. Let's pull it all together.

The Book at a Glance

| Chapter | Core idea | Key method | LeRobot module |
| --- | --- | --- | --- |
| 1. Introduction | Data is the foundation | LeRobotDataset, teleoperation | lerobot.common.datasets |
| 2. Imitation Learning | Learn actions from demos | ACT, Diffusion Policy, Flow Matching | lerobot.common.policies.act |
| 3. Reinforcement Learning | Learn from reward signals | SAC, sim-to-real, domain randomization | lerobot.common.policies.sac |
| 4. World Models | Learn physics from video | RSSM, TD-MPC, latent imagination | lerobot.common.policies.tdmpc |
| 5. Generalist Policies | One model, all tasks | VLAs: π0, SmolVLA | lerobot.common.policies.pi0 |

Concept Cheat Sheet

| Concept | What it means | Where it appears |
| --- | --- | --- |
| Action chunk | Predicting a sequence of future actions at once, not one step at a time | ACT (Ch 2), π0 (Ch 5), SmolVLA (Ch 5) |
| Flow matching | Learning straight-line paths from noise to data (faster than diffusion) | Ch 2 (theory), π0 and SmolVLA (Ch 5) |
| Diffusion policy | Generating actions by iteratively denoising Gaussian noise | Ch 2 |
| VLM → VLA | Adding an action output head to a vision-language model | π0, SmolVLA, OpenVLA (Ch 5) |
| SE(3) deltas | End-effector movements in 3D space: position + rotation + gripper | Cross-embodiment (Ch 5), all policies |
| Catastrophic forgetting | Model forgets old skills when trained exclusively on new data | π0 mixed training (Ch 5), fine-tuning (Ch 2) |
| Open X-Embodiment | 1M+ episodes across 22 robots, the ImageNet of robot manipulation | Data (Ch 5), baselines (Ch 2–4) |

The Stack

LeRobotDataset: unified dataset format; episodes of (observation, action) pairs
↓
Policy Architecture: ACT | Diffusion Policy | TD-MPC | π0 | SmolVLA
↓
Training Loop: IL loss (MSE, flow matching, diffusion) or RL loss (SAC, PPO)
↓
Deployment: policy.select_action(obs) → action chunk → robot.send(action)

Where to Go Next

Closing thought: "The best way to predict the future is to invent it." — Alan Kay. Robot foundation models are being invented right now, in the open, by a community that believes the path to general-purpose robots runs through shared data, shared code, and shared knowledge. This tutorial is your on-ramp.
What is the key enabler that makes generalist robot policies possible?