From single-task policies to foundation models — language-conditioned VLAs that work across tasks and embodiments.
Everything we have built so far — behavior cloning, diffusion policies, RL fine-tuning, world models — shares one brutal limitation: each policy knows exactly one task on exactly one robot. Train a diffusion policy to fold towels on a Franka arm, and it cannot stack cups. Train it on an ALOHA bimanual system, and it has no idea what to do on a single-arm WidowX.
Think about how absurd this is compared to language. GPT-4 can write poetry, debug code, summarize legal documents, and translate Mandarin — all with a single set of weights. We trained one model, and it generalized across an enormous space of tasks. Why can't we do this for robots?
Three obstacles stand in the way:
The breakthrough idea of 2023–2025: take a vision-language model (VLM) that already understands images and text, and bolt on an action head that outputs motor commands. The VLM provides the "brain": visual understanding, language grounding, common-sense reasoning. The action head provides the "hands": continuous, smooth motor outputs. Together, they form a Vision-Language-Action model (VLA).
The same generalist policy receives a language instruction, processes the camera image, and outputs actions in each robot's own action space.
Where does a generalist robot get its "understanding" of the world? Not from robot data — there isn't enough. Instead, we start with a pre-trained vision-language model (VLM) that has already learned to see and speak from billions of internet image-text pairs.
A VLM provides three superpowers for free:
The key architectural insight: a VLM is already a sequence model. It takes in image tokens and text tokens, processes them through a transformer, and outputs text tokens. To make it a VLA, we need one more output — action tokens.
Two concrete architectures dominate this space:
| Approach | Action Output | Example |
|---|---|---|
| Tokenized actions | Discretize actions into tokens, predict autoregressively like text | RT-2, OpenVLA |
| Continuous action head | Attach a separate network that outputs continuous action vectors via diffusion/flow matching | π0, SmolVLA |
The tokenized approach is simpler — you literally treat actions as extra vocabulary. But discretization throws away precision, and autoregressive decoding is slow (one token at a time). The continuous action head approach is more complex but produces smoother, faster actions. This is what π0 and SmolVLA use, and what we will focus on.
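To make the precision cost of tokenization concrete, here is a minimal sketch of uniform action binning. The 256-bin vocabulary, the [-1, 1] action range, and the function names are assumptions chosen for illustration, not the exact scheme of RT-2 or OpenVLA.

```python
import numpy as np

N_BINS = 256  # assumed number of action tokens per dimension


def tokenize(action, low=-1.0, high=1.0, n_bins=N_BINS):
    """Map each continuous action dimension to an integer bin index."""
    clipped = np.clip(action, low, high)
    scaled = (clipped - low) / (high - low)  # now in [0, 1]
    return np.minimum((scaled * n_bins).astype(int), n_bins - 1)


def detokenize(tokens, low=-1.0, high=1.0, n_bins=N_BINS):
    """Map bin indices back to the centre of each bin."""
    return low + (tokens + 0.5) * (high - low) / n_bins


# A hypothetical normalized 7D action
action = np.array([0.1234, -0.5678, 0.0, 0.9999, -1.0, 0.3333, 1.0])
tokens = tokenize(action)
recovered = detokenize(tokens)
max_error = np.abs(recovered - action).max()

print("tokens:   ", tokens)
print("max error:", max_error)  # bounded by half a bin width
```

The round trip is lossy: whatever the policy predicts, the executed action can be off by up to half a bin width in every dimension, and each of the 7 tokens costs one autoregressive decoding step.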
Language models became powerful when data went from millions to trillions of tokens. Can the same scaling story work for robotics? The answer, circa 2024, is a cautious "yes" — but the data landscape looks very different.
The landmark effort is Open X-Embodiment (OXE), an aggregation of robot datasets from over 20 research labs. It contains:
| Metric | Value |
|---|---|
| Total episodes | 1M+ |
| Robot embodiments | 22 |
| Unique tasks | 500+ |
| Research labs | 21 |
That sounds like a lot, until you compare it to language. One million episodes, each maybe 5–30 seconds long, gives you roughly 1,400 to 8,300 hours of robot experience, so under 10,000 hours even at the generous end. Frontier language models are trained on trillions of tokens, far more text than a person could read in many lifetimes. Robotics data is still many orders of magnitude smaller.
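The back-of-the-envelope estimate can be checked directly. This sketch only uses the figures quoted above (one million episodes, 5 to 30 seconds each); the variable names are illustrative.

```python
EPISODES = 1_000_000

# Total hours of robot experience at the short and long end of the range
hours_by_length = {}
for seconds_per_episode in (5, 30):
    hours_by_length[seconds_per_episode] = EPISODES * seconds_per_episode / 3600

for seconds, hours in hours_by_length.items():
    print(f"{seconds:>2} s per episode: {hours:,.0f} hours")
```

Even at the long end, the total stays below 10,000 hours.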
A second major effort is DROID (Distributed Robot Interaction Dataset): 76,000 episodes across 564 scenes and 86 tasks, collected at 13 institutions using a standardized setup. What makes DROID special is consistency: the same robot platform, the same action format, and the same annotation quality.
But raw quantity isn't enough. Robot datasets have unique challenges:
The practical solution: standardize on end-effector pose deltas in SE(3) (position change + rotation change + gripper state) as the action representation. This is what π0, SmolVLA, and LeRobot all converge on. A 3D position delta and a rotation delta mean the same thing regardless of whether the underlying robot has 6 joints or 16.
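As a rough illustration of this representation, the sketch below turns two consecutive end-effector poses into a 7D delta action. It assumes poses are given as a position vector plus a rotation matrix and encodes the rotation change as an axis-angle vector; the helper names are hypothetical, and real pipelines may use roll-pitch-yaw or another rotation parameterization instead.

```python
import numpy as np


def rot_z(angle_rad):
    """Rotation matrix for a rotation about the z axis."""
    c, s = np.cos(angle_rad), np.sin(angle_rad)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])


def rotation_to_rotvec(R):
    """Convert a rotation matrix to an axis-angle vector (angles below pi)."""
    angle = np.arccos(np.clip((np.trace(R) - 1.0) / 2.0, -1.0, 1.0))
    if angle < 1e-9:
        return np.zeros(3)
    axis = np.array([R[2, 1] - R[1, 2], R[0, 2] - R[2, 0], R[1, 0] - R[0, 1]])
    return angle * axis / (2.0 * np.sin(angle))


def ee_delta_action(pos_now, R_now, pos_next, R_next, gripper_cmd):
    """7D action: position delta (3) + rotation delta (3) + gripper state (1)."""
    d_pos = pos_next - pos_now
    # Rotation that takes R_now to R_next, expressed in the base frame
    d_rot = rotation_to_rotvec(R_next @ R_now.T)
    return np.concatenate([d_pos, d_rot, [gripper_cmd]])


# Hypothetical consecutive poses: shift 2 cm along -y, rotate 5 degrees about z
action = ee_delta_action(
    pos_now=np.array([0.40, 0.00, 0.30]), R_now=rot_z(np.deg2rad(10.0)),
    pos_next=np.array([0.40, -0.02, 0.30]), R_next=rot_z(np.deg2rad(15.0)),
    gripper_cmd=1.0,  # 1.0 = closed in this sketch
)
print(action)
```

Nothing in the resulting vector refers to joint counts or link lengths, which is what makes it portable across arms.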
The scale of major robot datasets, with each block representing episodes. Even the largest robot datasets are tiny compared to language data.
Now we get to one of the models that put VLAs on the map. π0 (pi-zero), from Physical Intelligence (2024), was among the first VLAs to demonstrate cross-task, cross-embodiment generalization at scale. Let's take it apart piece by piece.
The architecture has two halves, and understanding why they are separate is the key insight:
The first half is the VLM backbone: PaliGemma 3B, a pre-trained vision-language model from Google. It takes in images (via a SigLIP vision encoder) and text (via a Gemma language model) and produces rich, semantically meaningful embeddings.
This half understands the what and why: what objects are in the scene, what the instruction means, what a reasonable next step looks like.
The second half is the action expert: a separate transformer that takes the VLM's embeddings and generates continuous action chunks via flow matching (see Chapter 2 on imitation learning).
This half handles the how: the precise, smooth motor commands needed to actually move the gripper 3.2 cm to the right while rotating 15 degrees.
The data flow through π0:
The action expert uses flow matching, not diffusion. Recall from Chapter 2 that flow matching learns straight-line paths from noise to data, making denoising faster. At inference, π0 needs only 10 denoising steps (vs 50–100 for DDPM-style diffusion) to generate a full action chunk. Each chunk is 50 timesteps of 7D actions, so a single inference call (10 passes through the action expert) produces roughly 1–2 seconds of future motion: 50 steps last 1 second at a 50 Hz control rate and 2 seconds at 25 Hz.
The two halves of π0: the VLM backbone (teal) processes vision and language, while the action expert (orange) generates motor commands via flow matching.
Having a good architecture is half the battle. The other half is training it correctly — and π0's training recipe is surprisingly nuanced.
PaliGemma 3B comes already pre-trained on internet-scale image-text data. This gives the VLM backbone its visual and language understanding. π0 inherits all of this for free. No robot data needed for this phase.
Now we fine-tune the full model (VLM + action expert) on robot demonstration data. The training set is massive by robotics standards:
| Property | Value |
|---|---|
| Total robot data | 10,000+ hours |
| Robot embodiments | 7 different robots |
| Data sources | OXE + DROID + proprietary data |
| Action format | End-effector Δpose + gripper state |
The action expert is trained with a flow matching objective. Given a ground-truth action chunk a0 from a demonstration and a random noise vector a1, we create an interpolated point at time t:
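Assuming the straight-line (linear) interpolation that flow matching is usually described with, and taking $t = 0$ at the demonstration chunk and $t = 1$ at the noise, one consistent way to write the interpolated point is:

$$a_t = (1 - t)\,a_0 + t\,a_1, \qquad t \in [0, 1]$$

This is a reconstruction under those assumptions, not a quotation of π0's exact parameterization.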
The model predicts the velocity field — the direction from noise to data:
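Assuming a linear path $a_t = (1 - t)\,a_0 + t\,a_1$ between the demonstration chunk $a_0$ and the noise $a_1$, the noise-to-data direction is constant along the path, and a standard flow matching loss regresses the network output $v_\theta$ onto it. The symbol $o$ for the observation (images, language, robot state) is introduced here for illustration only:

$$v^\star = a_0 - a_1, \qquad \mathcal{L}(\theta) = \mathbb{E}_{t,\,a_0,\,a_1}\Big[\big\lVert v_\theta(a_t, t, o) - (a_0 - a_1) \big\rVert^2\Big]$$

Following $v^\star$ from $t = 1$ down to $t = 0$ moves a sample from noise to a clean action chunk.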
At inference time, we start from pure noise a1 and take 10 Euler steps along the learned velocity field to arrive at a clean action chunk. Fewer steps than diffusion, straighter paths, faster inference.
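The sampling loop can be sketched in a few lines. This is a toy illustration, not π0's implementation: the learned network is replaced by a hypothetical `toy_velocity` that returns the exact noise-to-data velocity for a single known target chunk, and time runs from $t = 1$ (noise) down toward $t = 0$ (data).

```python
import numpy as np

CHUNK, ACTION_DIM, N_STEPS = 50, 7, 10


def toy_velocity(a_t, t, target):
    """Stand-in for the learned network: the exact noise-to-data velocity
    when the data distribution is a single known chunk `target`."""
    return (target - a_t) / t


def sample_chunk(velocity_fn, rng, n_steps=N_STEPS):
    """Integrate the velocity field from noise to data with Euler steps."""
    a = rng.standard_normal((CHUNK, ACTION_DIM))  # a_1: pure noise at t = 1
    dt = 1.0 / n_steps
    for i in range(n_steps):
        t = 1.0 - i * dt                # t = 1.0, 0.9, ..., 0.1
        a = a + dt * velocity_fn(a, t)  # one Euler step toward the data
    return a


# Hypothetical "demonstration": every action dimension ramps from 0 to 1
target = np.tile(np.linspace(0.0, 1.0, CHUNK)[:, None], (1, ACTION_DIM))
rng = np.random.default_rng(0)
chunk = sample_chunk(lambda a, t: toy_velocity(a, t, target), rng)

print(chunk.shape)
print(np.abs(chunk - target).max())
```

Because the path is a straight line, the Euler steps incur essentially no discretization error in this toy case, which is the intuition behind needing so few steps.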
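```python
# Using pi0 with LeRobot (module paths vary between LeRobot releases)
import torch

from lerobot.common.policies.pi0.configuration_pi0 import PI0Config
from lerobot.common.policies.pi0.modeling_pi0 import PI0Policy

# Build the policy from a default config. This creates the architecture only;
# for real use, load released weights from a checkpoint instead.
config = PI0Config()
policy = PI0Policy(config)
policy.eval()

# Placeholder inputs; replace with real camera frames and proprioception
image_tensor = torch.zeros(1, 3, 224, 224)  # [B, C, H, W]
proprio_tensor = torch.zeros(1, 7)          # [B, state_dim]

# At inference: pass an observation dict
# (a language instruction is also expected; its key depends on the version)
observation = {
    "observation.images.top": image_tensor,
    "observation.state": proprio_tensor,
}
with torch.no_grad():
    action_chunk = policy.select_action(observation)
# Intended output shape: [B, chunk_size, action_dim],
# e.g., [1, 50, 7] = 50 steps of 7D actions.
# Some versions return one action per call from an internally queued chunk.
```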
π0 is impressive, but it has a problem: 3 billion parameters. Inference calls for a high-end GPU with plenty of memory, and fine-tuning demands even more, typically several data-center GPUs such as the A100 or H100. Most robotics labs, and certainly most hobbyists with a LeRobot SO-100 on their desk, don't have that kind of hardware.
Enter SmolVLA (Small Vision-Language-Action), developed by Hugging Face. The pitch is radical: a 250-million-parameter model, 12 times smaller than π0, that achieves competitive performance on standard manipulation benchmarks.
How does SmolVLA get away with being so small? Three design choices:
How SmolVLA compares to other VLAs in model size. Smaller models enable fine-tuning on consumer hardware and faster inference on edge devices.
| Model | Parameters | Backbone | Action Head | Hardware for Fine-tuning |
|---|---|---|---|---|
| RT-2 | 55B | PaLI-X | Tokenized | TPU pod |
| OpenVLA | 7B | Prismatic VLM | Tokenized | 8x A100 |
| π0 | 3B | PaliGemma | Flow matching | 4x A100 |
| SmolVLA | 250M | SmolVLM | Diffusion | 1x RTX 4090 |
The benchmark results are striking. On the SimplerEnv manipulation suite, SmolVLA achieves within 5–10% of π0's success rate on most tasks, at 1/12th the parameter count. On some tasks (like "pick up cube"), it actually outperforms π0 — likely because the simpler model overfits less on easy tasks.
Let's open the hood and trace exactly how data flows through SmolVLA, from raw pixels and text to motor commands.
Camera images enter through a SigLIP vision encoder, from the same family used in PaliGemma but a smaller variant. Each 224×224 image is divided into a 14×14 grid of 16×16-pixel patches, and each of the 196 patches becomes a token. A learned projection maps these tokens into the model's hidden dimension.
The compression from 196 to 64 tokens is critical for efficiency. SmolVLA uses a perceiver-style pooling layer that learns to summarize visual information into fewer tokens without losing task-relevant details.
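A perceiver-style pooling layer can be sketched as cross-attention from a small set of learned query vectors to the full set of patch tokens. The sketch below is an assumption-laden illustration in NumPy: the hidden size of 32, the single attention head, and the random (untrained) weights are placeholders, and the 16×16-pixel patch size is inferred from the 196-token count rather than taken from SmolVLA's code.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def perceiver_pool(patch_tokens, latent_queries, w_k, w_v):
    """Cross-attention from a few learned queries to all patch tokens."""
    keys = patch_tokens @ w_k      # [n_patches, d]
    values = patch_tokens @ w_v    # [n_patches, d]
    scores = latent_queries @ keys.T / np.sqrt(keys.shape[-1])  # [n_latents, n_patches]
    weights = softmax(scores, axis=-1)  # each query distributes attention over patches
    return weights @ values        # [n_latents, d]


rng = np.random.default_rng(0)
d_model = 32                    # hypothetical hidden size
n_patches = (224 // 16) ** 2    # 14 x 14 grid = 196 patch tokens
n_latents = 64                  # compressed token count

patch_tokens = rng.standard_normal((n_patches, d_model))
latent_queries = rng.standard_normal((n_latents, d_model))  # learned in practice
w_k = rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)
w_v = rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)

pooled = perceiver_pool(patch_tokens, latent_queries, w_k, w_v)
print(patch_tokens.shape, "->", pooled.shape)
```

Each of the 64 output tokens is a weighted summary of all 196 patches, so downstream attention cost drops while every patch can still contribute.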
The language instruction is tokenized and processed by SmolLM, a compact language model (~125M parameters). This produces contextualized text embeddings that encode what the robot should do.
The action head is a small Diffusion Transformer (DiT). It takes three inputs:
The DiT uses cross-attention to attend from action tokens to VLM embeddings. This is how the action head "reads" what the VLM sees and understands.
```python
# Fine-tuning SmolVLA on your own dataset with LeRobot
# (module paths and config field names may differ between LeRobot releases)
from lerobot.common.datasets.lerobot_dataset import LeRobotDataset
from lerobot.common.policies.smolvla.configuration_smolvla import SmolVLAConfig
from lerobot.common.policies.smolvla.modeling_smolvla import SmolVLAPolicy

# Load your custom dataset (collected via teleoperation)
dataset = LeRobotDataset("your_hf_username/pick_place_cups")

# Configure SmolVLA for your robot
config = SmolVLAConfig(
    input_shapes={
        "observation.images.top": [3, 224, 224],
        "observation.state": [7],  # 6 joints + gripper
    },
    output_shapes={
        "action": [7],  # same as state
    },
    chunk_size=50,
)

# Initialize and train
policy = SmolVLAPolicy(config)
# Training loop handles flow matching loss internally
# ~8 hours on a single RTX 4090 for 50K steps
```
Here's where things get genuinely surprising. You train a policy on data from one robot and deploy it on a completely different robot — and it works. Not perfectly, but far better than random. How is this possible?
The key is the shared action representation. When every robot's actions are expressed as end-effector pose deltas in SE(3) — "move the gripper 2cm right, rotate 5 degrees, close gripper" — the action space is the same regardless of the underlying kinematics.
A concrete example from the LeRobot ecosystem:
| | SO-100 (Training) | ALOHA-2 (Deployment) |
|---|---|---|
| Type | Single 5-DOF arm + gripper | Bimanual (2 × 6-DOF arm + gripper) |
| Actuators | 6 servos | 14 servos |
| Action dim | 6 (5 joints + grip) | 14 (per arm: 7) |
| EE delta dim | 7 (Δxyz + Δrpy + grip) | 7 (Δxyz + Δrpy + grip) per arm |
Even though the raw joint spaces are completely different, the task-level representation — "how should the end-effector move?" — is shared. If the model learned on SO-100 that "pick up" means "move down, close gripper, move up," that same sequence of SE(3) deltas works on ALOHA-2's right arm.
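The "pick up" example can be written out as a short sequence of 7D delta actions and replayed from two different starting poses. This is a deliberately simplified sketch: the starting positions are made up, rotations are left at zero, and the robot-specific step (inverse kinematics from end-effector targets to joint commands) is omitted.

```python
import numpy as np

# Hypothetical "pick up" sequence: [dx, dy, dz, droll, dpitch, dyaw, gripper]
pick_up = np.array([
    [0.0, 0.0, -0.05, 0.0, 0.0, 0.0, 0.0],  # move down, gripper open
    [0.0, 0.0,  0.00, 0.0, 0.0, 0.0, 1.0],  # close gripper
    [0.0, 0.0,  0.05, 0.0, 0.0, 0.0, 1.0],  # move up, gripper closed
])


def rollout(start_xyz, actions):
    """Integrate position deltas into an end-effector path.

    Returns the path (one xyz row per action) and the final gripper state.
    Each robot would then solve its own inverse kinematics to follow the path.
    """
    path = start_xyz + np.cumsum(actions[:, :3], axis=0)
    return path, actions[-1, 6]


# Made-up end-effector start positions (metres) for the two embodiments
starts = {
    "so100": np.array([0.20, 0.00, 0.10]),
    "aloha2_right_arm": np.array([0.35, -0.15, 0.25]),
}

for name, start in starts.items():
    path, gripper = rollout(start, pick_up)
    print(name, "lowest z:", path[:, 2].min(), "final gripper:", gripper)
```

The action array is identical for both robots; only the starting pose and the downstream kinematics differ.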
The practical recipe for cross-embodiment transfer:
A policy trained on one robot transfers to another through the shared SE(3) action space: the same action sequence maps across embodiments.
We have built all the pieces: VLM backbones, massive datasets, flow matching action heads, cross-embodiment training. But the field of generalist robot policies is still in its infancy. Here are the open questions that will define the next 3–5 years.
Language models follow predictable scaling laws: double the data, performance improves by a known amount. Do robot policies follow similar curves? Early evidence from RT-2 and π0 suggests yes, but the slope is different. Visual-motor tasks saturate faster than language tasks — you don't need a trillion episodes to learn "pick up the cup." The question is where each task's saturation point lies.
Can simulation data substitute for real data? The answer is nuanced:
| Pro | Con |
|---|---|
| Unlimited quantity — generate millions of episodes overnight | Domain gap: simulated images and physics don't perfectly match reality |
| Automatic reset — no human operator needed between episodes | Contact physics is hard — deformable objects, liquids, friction |
| Perfect labels — ground-truth state, segmentation, poses | Sim-trained policies often fail on real textures, lighting, clutter |
The emerging consensus: sim-to-real-to-sim. Train in sim, deploy on real robot, use the real experience to improve the simulator, repeat. Each loop narrows the domain gap. The simulator becomes a better training ground, and the real robot provides grounding.
We've covered the full arc of the Robot Learning Tutorial — from collecting your first demonstrations to training generalist foundation models. Let's pull it all together.
| Chapter | Core Idea | Key Method | LeRobot Module |
|---|---|---|---|
| 1. Introduction | Data is the foundation | LeRobotDataset, teleoperation | lerobot.common.datasets |
| 2. Imitation Learning | Learn actions from demos | ACT, Diffusion Policy, Flow Matching | lerobot.common.policies.act |
| 3. Reinforcement Learning | Learn from reward signals | SAC, sim-to-real, domain randomization | lerobot.common.policies.sac |
| 4. World Models | Learn physics from video | RSSM, TD-MPC, latent imagination | lerobot.common.policies.tdmpc |
| 5. Generalist Policies | One model, all tasks | VLAs: π0, SmolVLA | lerobot.common.policies.pi0 |
| Concept | What It Means | Where It Appears |
|---|---|---|
| Action chunk | Predicting a sequence of future actions at once, not one step at a time | ACT (Ch 2), π0 (Ch 5), SmolVLA (Ch 5) |
| Flow matching | Learning straight-line paths from noise to data (faster than diffusion) | Ch 2 (theory), π0 (Ch 5) |
| Diffusion policy | Generating actions by iteratively denoising Gaussian noise | Ch 2, SmolVLA (Ch 5) |
| VLM → VLA | Adding an action output head to a vision-language model | π0, SmolVLA, OpenVLA (Ch 5) |
| SE(3) deltas | End-effector movements in 3D space: position + rotation + gripper | Cross-embodiment (Ch 5), all policies |
| Catastrophic forgetting | Model forgets old skills when trained exclusively on new data | π0 mixed training (Ch 5), fine-tuning (Ch 2) |
| Open X-Embodiment | 1M+ episodes across 22 robots — the ImageNet of robot manipulation | Data (Ch 5), baselines (Ch 2–4) |