pi-0: VLA Flow Model

Chapter 0: The Problem

Robot learning has a versatility problem. We can train a robot to pick up a specific cup in a specific lab. We can even train it on many cups in many labs. But ask it to fold a shirt — a task that requires understanding fabric dynamics, planning a sequence of folds, adapting to arbitrary initial configurations — and it falls apart.

The issue isn't intelligence. Vision-language models like GPT-4V can describe how to fold a shirt in perfect detail. The issue is that no amount of language understanding translates into the precise, high-frequency motor commands needed to actually manipulate fabric at 50 Hz.

Previous VLAs (like RT-2) approached this by discretizing actions into tokens, binning each action dimension into 256 values. This works for coarse pick-and-place, but discretization destroys the smoothness needed for dexterous manipulation. Try folding a shirt with only 256 possible positions per joint per timestep. You can't.

What makes VLAs different from industrial robots

Industrial robots (in factories) run at >1000 Hz with sub-millimeter precision. They don't need VLMs because their environment is perfectly known — the same part arrives at the same position every time. The challenge VLAs solve is the opposite: handling unpredictable environments where the robot must understand what it sees to decide what to do. This requires a VLM. But the VLM is huge (billions of parameters), which means slower inference, which means lower control frequency.

pi-0 operates at the intersection: fast enough for smooth manipulation (50 Hz), smart enough to understand language and novel objects (VLM backbone). This dual requirement is what makes the engineering so constrained.

What "50 Hz control" actually means

A robot arm with 7 joints needs a new position command for every joint, every 20 milliseconds. That's 7 floating-point numbers x 50 times per second = 350 continuous values per second. Each value needs sub-degree precision — the difference between a clean fold and a crumpled mess is often less than 2 degrees of wrist rotation.

With RT-2's 256-bin discretization, the minimum resolution per joint is approximately 360 degrees / 256 = 1.4 degrees per bin. Sounds fine in isolation. But errors compound across joints and timesteps. Over a 10-second fold sequence (500 timesteps x 7 joints), the accumulated quantization error makes smooth trajectories impossible.
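
The resolution arithmetic can be checked with a short sketch. It assumes a 360-degree joint range split into uniform bins; the function names and the sine-shaped test trajectory are illustrative and not taken from RT-2 or pi-0.

```python
import math

def bin_width_deg(num_bins: int, joint_range_deg: float = 360.0) -> float:
    """Width of one uniform bin, in degrees."""
    return joint_range_deg / num_bins

def quantize(angle_deg: float, num_bins: int, joint_range_deg: float = 360.0) -> float:
    """Snap an angle in [0, joint_range_deg) to the center of its bin."""
    width = bin_width_deg(num_bins, joint_range_deg)
    index = min(int(angle_deg // width), num_bins - 1)
    return (index + 0.5) * width

# Hypothetical smooth 50-step trajectory: a slow sweep between 80 and 100 degrees.
trajectory = [90.0 + 10.0 * math.sin(2 * math.pi * t / 50) for t in range(50)]
quantized = [quantize(a, 256) for a in trajectory]

max_error = max(abs(q - a) for q, a in zip(quantized, trajectory))
print(f"bin width: {bin_width_deg(256):.3f} deg")
print(f"max quantization error: {max_error:.3f} deg")
```

For a single value the worst-case error is half a bin (about 0.7 degrees). The quantized trajectory is also piecewise constant, which is where the jerkiness comes from.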

The fundamental tension: VLMs give you understanding (what to do). But discrete token outputs can't express the precision needed for dexterous manipulation (how to do it smoothly). pi-0 resolves this by keeping the VLM for understanding and adding flow matching for continuous, high-precision action generation.

[Interactive figure: Discrete vs Continuous Actions. A slider sets the number of bins per dimension and shows how discretization makes a smooth trajectory coarse and jerky, compared with the smooth curve produced by flow matching.]

What degrades with discretization

The damage isn't just "slightly less smooth." Discretization causes three specific failure modes in dexterous manipulation:

Temporal jitter: Quantized trajectories have discontinuous derivatives. The robot accelerates and decelerates abruptly at bin boundaries, causing vibration that disrupts delicate grasps.
Multi-modal collapse: When the true action distribution is bimodal (two valid strategies), discretization forces the model to pick bin centers that may fall between the modes — choosing an action that belongs to neither strategy.
Dimensionality curse: With 7 joints and 256 bins each, the discrete action space has 256^7 ≈ 7.2 x 10^16 possible actions per timestep. No single-token vocabulary can cover this combinatorial space, so discrete VLAs emit one token per dimension and pay for it with sequential decoding.

Why can't discrete action tokens (like RT-2 uses) handle dexterous manipulation?

Discretization introduces quantization artifacts: 256 bins per dimension is too coarse for smooth, high-frequency control
Discrete tokens are too slow to generate
The vocabulary size becomes too large

Chapter 1: The Key Insight

pi-0's insight is to treat the robot foundation model exactly like a language foundation model — but with a crucial architectural twist for actions.

In language, the recipe is: (1) pre-train a large model on diverse internet text, (2) post-train (fine-tune) on carefully curated data for the desired behavior. GPT-4 follows this recipe. So does Claude.

pi-0 follows the same recipe for robots: (1) start with a pre-trained VLM (PaliGemma) that already understands images and text, (2) pre-train on diverse robot data from 7 robot types and 68 tasks, (3) post-train on high-quality data for specific dexterous tasks.

The twist: instead of predicting the next discrete token (like language models do), pi-0 uses flow matching to generate continuous action distributions. This gives it the precision to control robots at 50 Hz for tasks like folding laundry — something no discrete-token VLA can do.

Think of it this way: A language model predicts "the next word" from a finite vocabulary. pi-0 predicts "the next 50 motor commands" from a continuous space — like generating a smooth curve rather than choosing from a fixed set of points. Flow matching is the mathematical tool that makes this possible.

The complete data flow

Let's trace a single inference step through pi-0. The robot has two wrist cameras and one base camera. Here's exactly what happens:

Image input: 3 camera images at 224x224 resolution. Each passes through SigLIP's ViT-So400m encoder, producing 256 visual tokens per image = 768 visual tokens total. Each token is a 1152-dimensional vector.
Language input: A task instruction ("fold the towel in thirds") tokenized into ~15-30 language tokens.
Proprioception: Current joint angles + gripper state = 7-8 floating-point values, projected to a single embedding.
VLM processing: All tokens (768 visual + ~25 language + 1 proprioceptive) pass through the Gemma transformer backbone (18 layers). Output: contextualized embeddings for each token position.
Action expert: 50 action tokens, initialized from Gaussian noise, are appended to the sequence. They attend to all VLM tokens through the shared attention layers, then pass through the action expert's feedforward layers.
Flow matching denoising: The action tokens go through 10 denoising steps, progressively refining from noise to a clean action chunk of shape [50, 7]: 50 timesteps x 7 action dimensions (for example, 6 joint angles + gripper).
Output: 50 continuous joint-angle commands at 50 Hz (1 second of motion if the whole chunk is executed open-loop), then re-plan.
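
The token counts in this walk-through can be tallied with a small sketch. The function and its default values simply restate the numbers listed above (with ~25 language tokens taken as a representative instruction length); the name `sequence_length` is illustrative.

```python
def sequence_length(num_cameras=3, tokens_per_image=256, language_tokens=25,
                    proprio_tokens=1, action_tokens=50):
    """Count tokens in one inference step, split into prefix and action parts."""
    visual = num_cameras * tokens_per_image
    prefix = visual + language_tokens + proprio_tokens
    return {"visual": visual, "prefix": prefix, "total": prefix + action_tokens}

counts = sequence_length()
print(counts)  # {'visual': 768, 'prefix': 794, 'total': 844}
```

Visual tokens dominate the sequence, so the number of cameras and the image resolution are the main levers on sequence length.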

Step 1: Start with PaliGemma VLM (understands images + text from web pre-training)
Step 2: Pre-train on diverse robot data: 7 robot types, 68 tasks, OXE dataset
Step 3: Post-train on high-quality curated data for specific dexterous tasks
Result: A robot that folds laundry, busses tables, and packs groceries

Why flow matching instead of DDPM?

Both diffusion (DDPM) and flow matching can generate continuous outputs. But there's a critical engineering difference: DDPM needs 50-1000 denoising steps because its paths are curved (the model must follow a complex SDE). Flow matching uses straight-line paths (optimal transport) from noise to data, requiring only 5-10 steps for comparable quality.

At 50 Hz control: if each denoising step takes ~2ms of GPU time, then 10 steps = 20ms per action chunk (fits in the 20ms budget). 100 steps = 200ms — already too slow for real-time. This is why pi-0 uses flow matching. It's not a minor preference; it's a hard engineering constraint.
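
The budget arithmetic is easy to reproduce. This sketch assumes the ~2ms per denoising step quoted above and ignores every other source of latency (image encoding, data transfer); the helper names are hypothetical.

```python
def chunk_latency_ms(num_steps: int, ms_per_step: float = 2.0) -> float:
    """Time spent on denoising alone for one action chunk."""
    return num_steps * ms_per_step

def fits_budget(num_steps: int, control_hz: float = 50.0, ms_per_step: float = 2.0) -> bool:
    """True if denoising finishes within one control period."""
    budget_ms = 1000.0 / control_hz
    return chunk_latency_ms(num_steps, ms_per_step) <= budget_ms

for steps in (10, 100):
    print(steps, chunk_latency_ms(steps), fits_budget(steps))
```

With 10 steps the chunk lands exactly on the 20ms period, so any extra per-step cost would push it over.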

What is the key architectural difference between pi-0 and earlier VLAs like RT-2?

pi-0 uses a bigger language model
pi-0 uses flow matching for continuous actions instead of discrete tokens
pi-0 uses more training data

Chapter 2: The VLM Backbone

pi-0 is built on PaliGemma, a vision-language model from Google that combines a SigLIP image encoder with a Gemma language model. PaliGemma was pre-trained on billions of image-text pairs from the web, so it already understands visual concepts, spatial relationships, and natural language.

Why not train from scratch? Because a VLM pre-trained on web data gives you an enormous head start. It already knows what a "cup" looks like, that "left of the plate" means a spatial relationship, and how to follow instructions like "pick up the red one." Without this foundation, a robot policy would need to learn all of this from relatively scarce robot demonstration data.

Late fusion architecture

PaliGemma uses late fusion: the image encoder processes images independently, then the resulting visual tokens are concatenated with text tokens and fed to the language model transformer. pi-0 extends this by adding a third modality (robot proprioception and actions) alongside vision and language.

Why late fusion (not early fusion)

There are two ways to combine vision and language in a multimodal model:

Early fusion: Feed image patches and text tokens into the same transformer from layer 1, so the model jointly processes both from the start. Used in models such as Fuyu and Chameleon.
Late fusion: Process images through a separate encoder (SigLIP) first, then inject the resulting visual tokens into the language model. The image encoder is specialized for vision; the language model handles cross-modal reasoning.

PaliGemma uses late fusion because it allows the image encoder to be pre-trained independently (SigLIP on image-text contrastive learning) and the language model to be pre-trained independently (Gemma on text). You get the best of both specialists. The downside: the image encoder can't attend to language, so it processes images context-free. For robot tasks, this is usually fine because the image content (what's on the table) doesn't depend on the language instruction (what to do with it).

For pi-0, late fusion has a bonus advantage: because SigLIP is a self-contained module, it can be kept frozen during post-training (no gradients needed), saving memory and preventing visual representation degradation. In the recipe described in Chapter 6, the Gemma backbone is frozen at that stage as well, so only the action expert receives gradients during post-training.

Concrete dimensions

Let's be precise about what flows through the network:

Component	Architecture	Output Shape
SigLIP encoder	ViT-So400m/14, 400M params	256 tokens x 1152 dims per image
Gemma backbone	2B params, 18 layers, 2048 hidden	Contextualized embeddings
Action tokens	Learned embeddings	50 tokens x 2048 dims (action chunk)
Action head MLP	2-layer MLP per token	7 or 14 dims (joint angles + gripper)

Total model: ~3B parameters. The VLM backbone is ~2.4B, the action expert adds ~600M.

[Figure: Architecture Overview]

Image preprocessing: from raw pixels to tokens

The camera produces 640x480 RGB images. Before reaching SigLIP, each image is:

Resized: 640x480 → 224x224 (bilinear interpolation)
Normalized: Pixel values [0, 255] → [-1, 1] (scale to [0, 1], then subtract 0.5 and divide by 0.5 per channel)
Patched: 224x224 image / 14x14 patch size = 16x16 = 256 patches
Encoded: Each patch → 1152-dim embedding via SigLIP's ViT layers

Why 224x224? This is SigLIP's native resolution from pre-training. Higher resolution (448x448) would give 1024 patches per image — better for small objects but 4x the compute cost. The paper found 224x224 sufficient for manipulation-scale objects.
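
The normalize and patch steps can be sketched in NumPy. This is a minimal illustration, assuming the image has already been resized to 224x224 and that normalization is a plain rescale to [-1, 1]; the function names are hypothetical and the learned embedding that maps each flat patch to 1152 dimensions is left out.

```python
import numpy as np

PATCH = 14    # patch side length in pixels
IMAGE = 224   # input side length in pixels

def normalize(image_uint8: np.ndarray) -> np.ndarray:
    """Map uint8 pixels in [0, 255] to floats in [-1, 1]."""
    return image_uint8.astype(np.float32) / 127.5 - 1.0

def patchify(image: np.ndarray, patch: int = PATCH) -> np.ndarray:
    """Split an (H, W, C) image into (num_patches, patch * patch * C) flat patches."""
    h, w, c = image.shape
    grid_h, grid_w = h // patch, w // patch
    patches = image.reshape(grid_h, patch, grid_w, patch, c)
    patches = patches.transpose(0, 2, 1, 3, 4)  # (grid_h, grid_w, patch, patch, c)
    return patches.reshape(grid_h * grid_w, patch * patch * c)

rng = np.random.default_rng(0)
frame = rng.integers(0, 256, size=(IMAGE, IMAGE, 3), dtype=np.uint8)  # stand-in camera frame
patches = patchify(normalize(frame))
print(patches.shape)  # (256, 588)
```

Each of the 256 rows holds the 14 x 14 x 3 = 588 raw values of one patch, which the encoder then projects and processes as one token.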

What the VLM brings vs. what it lacks

The VLM backbone provides:

Object recognition: "That's a towel" — from web pre-training on millions of labeled images.
Spatial reasoning: "The cup is to the left of the plate" — from image-text alignment training.
Instruction following: "Pick up the red one, not the blue one" — from language model training.
Common sense: "Towels are foldable, plates are rigid" — implicit in web data statistics.

What it does NOT provide (and must be learned during robot pre-training):

How to convert visual understanding into motor commands
Proprioceptive awareness (where are my joints right now?)
Dynamics of physical interaction (force feedback, contact dynamics)
Temporal action coherence (smooth trajectories, not just correct endpoints)

Why PaliGemma specifically? It's relatively small (3B parameters), which matters for real-time robot control: the model needs to run inference fast enough for 50 Hz action generation. Larger VLMs would give better understanding but slower inference. At inference time, pi-0 runs on a single NVIDIA A100 GPU, generating 50-step action chunks in ~20ms.

Why does pi-0 start from a pre-trained VLM rather than training from scratch?

The VLM already understands visual concepts, spatial relations, and language, and this knowledge transfers to robot control
Pre-trained VLMs are faster at inference
It's cheaper to train

Chapter 3: Flow Matching for Actions

This is the core technical contribution. Instead of discretizing actions into bins (like RT-2's 256 tokens per dimension), pi-0 uses conditional flow matching to model the continuous distribution of actions.

What is flow matching?

Flow matching learns a velocity field that transports random noise into a desired distribution — in this case, the distribution of correct robot actions. During inference, you start with random noise and follow the velocity field to produce a clean action.

dx_τ/dτ = v_θ(x_τ, o_t)

where x_τ is the noisy action at flow time τ, o_t is the observation (images + language + proprioception), and v_θ is the learned velocity field.

Why flow matching and not just regression?

You might wonder: why not just predict the action chunk directly via regression (MSE loss)? The answer is multi-modality. Consider "pick up the cup" — there are multiple valid grasps (top-down, side grasp, pinch grasp). A regression model would predict the mean of these grasps: a halfway-between grasp that doesn't work for any approach. Flow matching (like diffusion) can sample from a multi-modal distribution — on each inference call it might produce a top-down grasp OR a side grasp, both valid, but never the mean.

This matters less for simple tasks (there's usually one best approach to "move left"). But for dexterous manipulation with multiple valid strategies, it's the difference between a model that works and one that produces garbage. In the paper's ablation, replacing flow matching with direct regression drops success on multi-strategy tasks by ~25 pp while barely affecting single-strategy tasks.
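
The mean-collapse problem shows up even in one dimension. The sketch below uses made-up data: a hypothetical "grasp angle" with two valid strategies clustered near -1.0 and +1.0. It is not taken from the paper's experiments.

```python
import random
import statistics

random.seed(0)

# Hypothetical demonstrations: top-down grasps near -1.0, side grasps near +1.0.
demos = ([random.gauss(-1.0, 0.05) for _ in range(500)]
         + [random.gauss(+1.0, 0.05) for _ in range(500)])

# The constant prediction that minimizes MSE is the sample mean.
mse_prediction = statistics.fmean(demos)

# Distance from that prediction to the nearest demonstrated action.
gap = min(abs(d - mse_prediction) for d in demos)

print(f"MSE-optimal prediction: {mse_prediction:+.3f}")
print(f"distance to nearest demonstration: {gap:.3f}")
```

The regression answer sits near zero, far from every demonstration, while a sampler that models the full distribution would return a value from one cluster or the other.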

Training: the math made concrete

During training, the process is simple. Let's walk through one training step with actual numbers:

Sample real action A from the dataset: a chunk of shape [50, 7] — 50 timesteps of 7 joint angles. Example: the first joint might be [0.32, 0.33, 0.34, ...] radians over 50 steps.
Sample noise ε from N(0, I): same shape [50, 7], random values.
Sample flow time τ uniformly from [0, 1]. Say τ = 0.7.
Interpolate: A^τ = τ * A + (1 - τ) * ε = 0.7 * real_action + 0.3 * noise. This is a "partially noisy" action chunk.
Predict: Network outputs v_θ(A^τ, observation) — its guess for the velocity direction.
Loss: Compare v_θ to the true velocity (A - ε). L2 distance.

L(θ) = E[ ||v_θ(A^τ, o) - (A - ε)||² ]
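
The six training steps map onto a few lines of NumPy. This is a sketch, not the authors' training code: `velocity_fn` stands in for the network v_θ, passing τ to it as an input is an assumption, and the loss is averaged over elements (the L2 loss up to a constant factor).

```python
import numpy as np

def flow_matching_loss(velocity_fn, actions, observation, rng):
    """One-sample estimate of the flow matching loss for a single action chunk."""
    noise = rng.standard_normal(actions.shape)           # epsilon ~ N(0, I)
    tau = rng.uniform(0.0, 1.0)                           # flow time in [0, 1)
    noisy_actions = tau * actions + (1.0 - tau) * noise   # A^tau
    target_velocity = actions - noise                     # true velocity (A - epsilon)
    predicted = velocity_fn(noisy_actions, tau, observation)
    return float(np.mean((predicted - target_velocity) ** 2))

def zero_velocity(noisy_actions, tau, observation):
    """Placeholder for v_theta: an untrained model that always outputs zeros."""
    return np.zeros_like(noisy_actions)

rng = np.random.default_rng(0)
# Hypothetical chunk of shape [50, 7]: every joint ramps from 0.32 to 0.81 rad.
actions = np.linspace(0.32, 0.81, 50)[:, None] * np.ones((1, 7))
loss = flow_matching_loss(zero_velocity, actions, observation=None, rng=rng)
print(actions.shape, loss)
```

In real training this value would be averaged over a batch and backpropagated through the network; here the zero-output placeholder just shows the shapes and the target.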

Inference: denoising in practice

At inference time, we start with pure noise x₀ ~ N(0, I) of shape [50, 7] and take K = 10 Euler steps:

x_{k+1} = x_k + (1/K) * v_θ(x_k, observation)

After 10 steps, x₁₀ is a clean action chunk. Total time: ~2ms per step x 10 steps = 20ms, which just fits the 20ms budget for 50 Hz control with no headroom.
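
The Euler update can be exercised without a trained network. In this sketch `toy_velocity` is a stand-in: if the data distribution were a single chunk `target`, the exact velocity field would be (target - x) / (1 - τ), so the integration should land on `target`. All names here are illustrative, and passing τ to the velocity function is an assumption.

```python
import numpy as np

def euler_sample(velocity_fn, observation, shape, num_steps=10, rng=None):
    """Integrate dx/dtau = v(x, tau, o) from tau = 0 (noise) to tau = 1 (actions)."""
    if rng is None:
        rng = np.random.default_rng()
    x = rng.standard_normal(shape)   # x_0 ~ N(0, I)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        tau = k * dt
        x = x + dt * velocity_fn(x, tau, observation)
    return x

# Hypothetical target chunk of shape [50, 7]: every joint ramps from 0.5 to 1.2 rad.
target = np.linspace(0.5, 1.2, 50)[:, None] * np.ones((1, 7))

def toy_velocity(x, tau, observation):
    """Exact velocity field for a data distribution concentrated on `target`."""
    return (target - x) / (1.0 - tau)

chunk = euler_sample(toy_velocity, None, target.shape, num_steps=10,
                     rng=np.random.default_rng(0))
print(np.abs(chunk - target).max())
```

A learned field for a real, multi-modal action distribution will not be this clean, but the loop structure is the same: K forward passes, each nudging the whole chunk.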

A worked denoising example

Let's trace the denoising process for a single joint (say, the shoulder pitch). We want the shoulder to move from its current position (0.5 rad) to a target (1.2 rad) over 50 timesteps:

True action trajectory: A smooth sigmoid curve from 0.5 to 1.2 rad over 50 steps.
Start with noise: x₀ = [0.87, -0.32, 1.45, ...] — 50 random values from N(0,1). No structure.
Step 1: Network predicts velocity v₁. x₁ = x₀ + 0.1 * v₁. Still mostly noise, but a faint signal of the trajectory shape appears.
Step 5: The trajectory is recognizable — roughly increasing from ~0.5 to ~1.2, but still jittery.
Step 10: Clean output. A smooth curve matching the training distribution for this observation. Quantization-free — every value is continuous.

The key insight: the velocity field doesn't predict the final trajectory directly. It predicts the direction to move in at each intermediate noise level. This is easier to learn (local corrections vs. global prediction) and produces better results with fewer steps.

[Interactive figure: Flow Matching: Noise → Actions. An animation of noise being transported to a clean action trajectory via the learned velocity field.]

Why flow matching beats DDPM for robotics

Property	Flow Matching	DDPM (Diffusion)
Path shape	Straight lines (OT)	Curved (SDE)
Steps needed	5-10	50-1000
Inference latency	~20ms	~100-2000ms
Real-time capable?	Yes (50 Hz)	Marginal (5-10 Hz)
Training stability	Good (no noise schedule)	Sensitive to schedule

Why flow matching over diffusion? Flow matching uses straight paths from noise to data (optimal transport), requiring fewer denoising steps than diffusion's curved paths. This is critical for real-time control — fewer steps means faster inference. With 10 denoising steps and ~2ms per step, pi-0 generates each action chunk in 20ms — exactly the budget for 50 Hz control.

What does the flow matching loss train the network to predict?

The next action directly
The amount of noise added
The velocity field direction (A - noise) that transports noisy actions toward clean actions

Chapter 4: The Action Expert

Here's a subtle but important design choice. pi-0 doesn't just run everything through the same transformer weights. It uses two sets of weights — inspired by Mixture of Experts architectures.

The VLM backbone processes image and language tokens using the original PaliGemma weights. The action expert is a separate set of transformer weights that processes proprioceptive state and action tokens.

Both experts share the same attention mechanism: action tokens attend to image and language tokens in the same attention layers (the reverse direction is masked, as the attention mask below shows). But the feedforward layers (the "thinking" computation) use different weights for different token types.

Why separate weights?

The VLM backbone has been pre-trained on billions of image-text pairs. Its weights encode visual and linguistic knowledge. If you force action tokens through these same weights, you either degrade the VLM's knowledge (catastrophic forgetting) or constrain the action representation to fit a space designed for language.

The action expert lets the model develop action-specific representations without corrupting the VLM's pre-trained knowledge. It's like having a bilingual translator — one brain for understanding the scene (VLM), a separate specialist for generating motor commands (action expert).

The attention mask: how tokens interact

The attention pattern is carefully designed:

Image/language tokens: Causal attention among themselves (standard autoregressive LM pattern). They do NOT attend to action tokens — preserving the VLM's original computation.
Action tokens: Full bidirectional attention among themselves (all 50 see each other) + cross-attention to all image/language tokens. They can read the VLM's representations but don't write back to them.

This asymmetric mask is crucial. The VLM processes the scene exactly as it would without any action tokens — no interference. The action tokens then "read" the VLM's understanding and use it to generate coordinated motion across all 50 timesteps.
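
The asymmetric pattern described above can be written as a boolean matrix. This is a sketch of the mask as the text describes it, with a tiny sequence for readability; the function name and the row/column convention (row = querying token, column = attended token) are choices made for this example.

```python
import numpy as np

def build_attention_mask(num_prefix: int, num_action: int) -> np.ndarray:
    """Boolean mask where mask[i, j] is True if token i may attend to token j."""
    total = num_prefix + num_action
    mask = np.zeros((total, total), dtype=bool)
    # Prefix (image/language) tokens: causal attention among themselves only.
    mask[:num_prefix, :num_prefix] = np.tril(
        np.ones((num_prefix, num_prefix), dtype=bool))
    # Action tokens: read every prefix token and every action token.
    mask[num_prefix:, :] = True
    return mask

mask = build_attention_mask(num_prefix=4, num_action=3)
print(mask.astype(int))
```

The upper-right block stays all zeros, which is what keeps the prefix computation identical with or without action tokens present.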

Action chunks, not single actions. pi-0 predicts H=50 future actions at once (an "action chunk"). Each of these 50 actions gets its own action token processed by the action expert. The full bidirectional attention mask lets all 50 action tokens attend to each other, enabling temporally coherent motion planning. This means the model can plan a smooth arc that takes 1 full second, rather than myopically choosing the next 20ms.

Why action chunking with H=50?

The choice of chunk size H=50 at 50 Hz = exactly 1 second of motion is deliberate:

Too small (H=1): The model plans only 20ms ahead. It can't anticipate — like driving while only seeing 1 meter of road. Jerky, reactive behavior.
Too large (H=200): 4 seconds ahead. The world changes during execution (object shifts, human intervenes). The long plan becomes stale. Also, 200 tokens is expensive.
H=50 (1 second): Long enough to plan smooth reaching and grasping motions. Short enough that the plan stays valid. Re-plan at least once per second with fresh observations.

In practice, pi-0 uses a sliding execution window: generate 50 actions, execute the first 25 (0.5 seconds), then re-plan with fresh camera images. This gives the smoothness of open-loop chunks with the responsiveness of closed-loop control.
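
The sliding execution window amounts to a simple loop. The sketch below uses hypothetical stand-ins for the policy, the camera read, and the robot driver; only the chunk size of 50 and the 25-step execution window come from the text.

```python
def run_receding_horizon(policy, get_observation, send_command,
                         total_steps, chunk_size=50, execute_steps=25):
    """Generate a chunk, execute its first `execute_steps` actions, then re-plan."""
    steps_done, replans = 0, 0
    while steps_done < total_steps:
        chunk = policy(get_observation())   # sequence of `chunk_size` actions
        if len(chunk) != chunk_size:
            raise ValueError("policy returned a chunk of unexpected length")
        replans += 1
        remaining = total_steps - steps_done
        for action in chunk[:min(execute_steps, remaining)]:
            send_command(action)
            steps_done += 1
    return replans

# Hypothetical stand-ins: the "observation" is just the number of commands sent so far.
sent = []
fake_policy = lambda obs: [obs + i for i in range(50)]
replans = run_receding_horizon(fake_policy, get_observation=lambda: len(sent),
                               send_command=sent.append, total_steps=100)
print(replans, len(sent))  # 4 100
```

Two seconds of motion at 50 Hz therefore cost four policy calls, and the second half of every chunk is discarded in favor of a fresher plan.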

Open-loop vs closed-loop: the engineering tradeoff

Action chunking creates a tension between two control paradigms:

Open-loop (execute full chunk): Smooth, temporally coherent. But if the object moves during execution, the robot continues on the stale plan. Good for predictable motions (reaching toward a stationary object).
Closed-loop (re-plan every step): Reactive, adaptive. But each action is planned independently, causing temporal jitter. Good for dynamic environments (catching a thrown ball).

pi-0's sliding window is a hybrid: execute half the chunk open-loop (smooth), then re-plan with fresh observations (reactive). This works because most manipulation tasks are semi-static — objects don't move during a 0.5-second reach. For truly dynamic tasks (human handover, moving conveyor), the chunk size would need to be smaller.

What happens to the action chunk at the robot

The output of the action expert is a tensor of shape [50, D_action] where D_action depends on the robot. For a 7-DOF arm + gripper: D_action = 8. The values are delta joint angles (how much to move each joint from its current position), normalized to [-1, 1] and then scaled by robot-specific action limits. The gripper dimension is binary (open/close) but represented as a continuous value thresholded at 0.5.
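
Turning one row of the output tensor into robot commands might look like the sketch below. The per-joint limits, the clipping step, and the convention that a gripper value above the threshold means "close" are assumptions made for illustration.

```python
def decode_action(normalized, max_delta_rad, current_joints, gripper_threshold=0.5):
    """Turn one normalized action row into joint targets and a gripper command.

    normalized:     joint deltas in [-1, 1] followed by one gripper value.
    max_delta_rad:  per-joint radians per step that a normalized value of 1 maps to.
    current_joints: current joint angles in radians.
    """
    *joint_part, gripper_value = normalized
    targets = [q + max(-1.0, min(1.0, a)) * limit
               for q, a, limit in zip(current_joints, joint_part, max_delta_rad)]
    gripper_closed = gripper_value > gripper_threshold   # assumed convention
    return targets, gripper_closed

# Hypothetical 7-DOF arm at the zero pose, every joint limited to 0.1 rad per step.
row = [0.2, -0.1, 0.0, 0.5, -1.0, 1.0, 0.0, 0.9]
targets, closed = decode_action(row, [0.1] * 7, [0.0] * 7)
print(targets, closed)
```

The same function would run once per row of the [50, D_action] chunk, feeding each result to the joint controller at 50 Hz.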

Why does pi-0 use separate weights (an "action expert") for action tokens?

To prevent catastrophic forgetting of the VLM's pre-trained visual/language knowledge while learning action-specific representations
To make the model smaller
To process actions faster

Chapter 5: Cross-Embodiment Pre-training

pi-0 is pre-trained on data from 7 different robot configurations spanning 68 tasks. These include single-arm robots (UR5e, Franka), dual-arm systems, and mobile manipulators — each with different joint configurations, gripper types, and action spaces.

Additionally, pi-0 incorporates the entire Open X-Embodiment (OXE) dataset, which adds data from 22 more robot types. This gives the model exposure to an enormous variety of manipulation scenarios.

How cross-embodiment works in practice

Different robots have different action dimensions (a 6-DOF arm vs a 7-DOF arm vs a mobile base + arm). pi-0 handles this by using robot-specific action tokenizers. The proprioceptive state and action dimensions are padded or projected to a common size, and the model learns which dimensions are relevant for which robot.

Concretely, the action space is standardized to a fixed maximum dimensionality (14 dimensions — enough for a bimanual system). Robots with fewer DOF simply zero-pad the unused dimensions. A robot identifier token is prepended to the language instruction so the model knows which dimensions are active.

The actual data mix

Robot Type	Config	Tasks	Hours
UR5e (single arm)	6-DOF + gripper	Bussing, grocery, table setting	~100
Franka (single arm)	7-DOF + gripper	Drawer packing, stacking	~80
ARX dual-arm	2 x 6-DOF + 2 grippers	Laundry folding	~60
Mobile manipulator	Arm + 2D base	Mobile bussing, multi-room	~50
ALOHA bimanual	2 x 6-DOF + 2 grippers	Cooking, cleaning	~40
Kuka arm	7-DOF + gripper	Bin picking	~30
Sawyer	7-DOF + gripper	Object placement	~20
+ OXE dataset	22 more types	Diverse manipulation	~500

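Total pre-training data: approximately 10,000+ hours of robot manipulation across all sources. The hours listed in the table above sum to only about 880, so the table should be read as an illustrative subset, not a full breakdown of the mixture.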

The foundation model analogy: Just as GPT benefits from training on code, poetry, AND scientific papers — even when you only want it to write code — pi-0 benefits from training on single-arm, dual-arm, AND mobile manipulation data. The diversity teaches general manipulation concepts that transfer across embodiments.

Cross-embodiment action tokenization: a worked example

Let's trace how the same training batch handles two different robots:

UR5e: 6-DOF arm + gripper = 7 dims. Padded to 14: [j1,j2,j3,j4,j5,j6,grip,0,0,0,0,0,0,0]
ARX bimanual: 2x6-DOF + 2 grippers = 14 dims. No padding: [L1..L6,Lgrip,R1..R6,Rgrip]
Shared model: Same transformer processes both. Robot ID in language prefix tells model which dims are active.

The language prefix looks like: "[robot=ur5e_single] pick up the cup" vs "[robot=arx_bimanual] fold the towel". The model learns that for ur5e, only dimensions 1-7 are meaningful; for ARX, all 14 are active. Predicted values in padded dimensions are ignored.
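
Zero-padding and ignoring the padded dimensions can be sketched in plain Python. The helper names and the masked error are illustrative; the text only states that actions are padded to 14 dimensions and that predictions in padded dimensions are ignored.

```python
MAX_ACTION_DIM = 14

def pad_action(action, max_dim=MAX_ACTION_DIM):
    """Zero-pad an action vector and return it with a mask of active dimensions."""
    if len(action) > max_dim:
        raise ValueError(f"action has {len(action)} dims, max is {max_dim}")
    padding = max_dim - len(action)
    padded = list(action) + [0.0] * padding
    active = [True] * len(action) + [False] * padding
    return padded, active

def masked_squared_error(predicted, target, active):
    """Mean squared error over active dimensions only."""
    errors = [(p - t) ** 2 for p, t, a in zip(predicted, target, active) if a]
    return sum(errors) / len(errors)

ur5e_action = [0.1, -0.2, 0.05, 0.0, 0.3, -0.1, 1.0]   # 6 joints + gripper
padded, active = pad_action(ur5e_action)
prompt = "[robot=ur5e_single] pick up the cup"

# Whatever the model predicts in dimensions 8-14 does not affect the error.
prediction = ur5e_action + [5.0] * 7
print(len(padded), sum(active), masked_squared_error(prediction, padded, active))
```

A bimanual action with all 14 dimensions passes through `pad_action` unchanged, with every entry of the mask set to True.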

This is elegant but has a limitation: the padded zeros are still processed by the action expert, wasting compute. Future work could mask or skip the inactive dimensions instead.

What transfers across embodiments (and what doesn't)

Cross-embodiment training transfers:

Visual grounding: "That's a cup, it's graspable" — works regardless of robot body.
Task structure: "To clear a table, pick objects then place them" — sequential logic transfers.
Grasp strategies: Top-down grasps, pinch grasps, power grasps — the concept transfers even if exact joint angles don't.

What does NOT transfer and must be learned per-embodiment:

Exact kinematics: A 6-DOF arm can't reach the same configurations as a 7-DOF arm.
Workspace geometry: Where the robot can and can't reach depends on its base position.
Gripper-specific affordances: A parallel-jaw gripper vs. a suction gripper require different approach strategies.

How does pi-0 handle different action spaces across robot types?

It uses a separate model per robot
Robot-specific tokenizers project different action spaces to a common representation
It only supports one action space

Chapter 6: The Training Recipe

As with language models, the training recipe is arguably more important than the architecture. pi-0 uses a two-stage recipe that mirrors the pre-training/post-training split in LLMs:

Stage 1: Pre-training (broad capability)

Train on the full diverse data mixture: all 7 robot types, all 68 tasks, plus OXE data. Use both coarse task-level language labels ("fold the towel") and fine-grained segment annotations (~2-second snippets like "grasp the corner"). This stage runs for 700K gradient steps.

The goal: a base model with broad capabilities and generalization, but not necessarily expert at any one task.

Stage 2: Post-training (specific dexterity)

Fine-tune on high-quality curated data for specific downstream tasks. For complex tasks like laundry folding, this uses larger carefully collected datasets. For simpler tasks, even small amounts of post-training data suffice.

The goal: specialize the base model into an expert at a specific dexterous task.

Training infrastructure: the numbers

Parameter	Pre-training	Post-training
Hardware	64 TPU v5e pods	16 TPU v5e pods
Batch size	2048	256
Learning rate	1e-4 (cosine decay)	1e-5 (constant)
Steps	700K	50K-100K per task
Duration	~7 days	~1 day per task
VLM backbone	Unfrozen (full fine-tune)	Frozen (only action expert trains)
Loss	Flow matching + language modeling	Flow matching only

During pre-training, both the VLM and action expert are trained jointly — the VLM adapts to robot observations while the action expert learns to generate actions. During post-training, the VLM is frozen to prevent catastrophic forgetting, and only the action expert's weights are updated.
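
The "1e-4 (cosine decay)" entry in the table can be made concrete with a small schedule function. The table gives only the peak rate and the step count, so the absence of warmup and the final rate of zero are assumptions here, as is the function name.

```python
import math

def cosine_decay_lr(step, total_steps=700_000, peak_lr=1e-4, final_lr=0.0):
    """Cosine decay from peak_lr at step 0 to final_lr at total_steps."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return final_lr + 0.5 * (peak_lr - final_lr) * (1.0 + math.cos(math.pi * progress))

for step in (0, 350_000, 700_000):
    print(step, cosine_decay_lr(step))
```

Post-training, by contrast, would use a flat 1e-5 per the same table, so no schedule function is needed there.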

The loss function in detail

Pre-training uses a combined loss:

L_total = L_flow + λ * L_language

where L_flow is the flow matching loss (predict the velocity field) and L_language is the standard next-token prediction loss on the language tokens. The language loss keeps the VLM's language understanding sharp during robot fine-tuning. λ = 0.1 in practice — robot learning dominates but language isn't forgotten.
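
Numerically, the combination is a weighted sum. The sketch below assumes the language loss is the mean negative log-likelihood of the reference tokens; the example probabilities and the flow loss value of 0.8 are made up.

```python
import math

def language_loss(token_log_probs):
    """Next-token loss: mean negative log-likelihood of the reference tokens."""
    return -sum(token_log_probs) / len(token_log_probs)

def total_loss(flow_loss, lang_loss, lam=0.1):
    """L_total = L_flow + lambda * L_language."""
    return flow_loss + lam * lang_loss

# Hypothetical values: the model assigns 0.5 and 0.25 to the two reference tokens.
lang = language_loss([math.log(0.5), math.log(0.25)])
print(lang, total_loss(0.8, lang))
```

With λ = 0.1 the language term contributes roughly a tenth of its raw value, so the gradient signal is dominated by the flow matching term.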

Why both stages matter: Pre-training alone gives breadth but not depth: the model can do many things at a rudimentary level. Post-training alone gives depth but not robustness: the model is brittle. Together, the model performs the task well AND recovers gracefully from mistakes, because the pre-training data includes diverse corrections and recoveries.

What freezing the VLM prevents

Without freezing during post-training, the model exhibits catastrophic forgetting within 10K steps: it becomes excellent at the fine-tuned task but loses the ability to generalize, follow novel instructions, or recover from errors. Freezing the VLM means post-training specializes the motor system without degrading the understanding system.

The action representation: joint angles vs end-effector

pi-0 uses delta joint angles as its primary action representation — each action is "move joint 1 by +0.02 radians, joint 2 by -0.01 radians, ..." This is a deliberate choice over end-effector (Cartesian) control:

Joint angles give full expressivity. A 7-DOF arm in joint space can reach the same Cartesian point in multiple configurations (elbow up vs elbow down). Joint-level control lets the model choose the configuration, which matters for obstacle avoidance and dexterous manipulation.
End-effector control hides redundancy. Cartesian control (x, y, z, roll, pitch, yaw) is only 6 dimensions for a 7-DOF arm. The 7th DOF (null-space motion) is lost. This doesn't matter for simple pick-and-place but hurts tasks like fabric folding where arm posture (for example, elbow position) matters independently of the gripper pose.
Joint angles are robot-specific. The same joint command produces different Cartesian motions on different robots. This is the tradeoff: full expressivity but limited cross-embodiment transfer. (pi-0.7 resolves this by offering both modes.)

The delta values are normalized to [-1, 1] where -1 and +1 correspond to the maximum per-step joint velocity (typically 0.1-0.5 radians per step, depending on the joint). The normalization is robot-specific and handled by the action tokenizer.

Comparison: pi-0 vs contemporaries

To understand pi-0's contribution, compare it to models available at the same time:

Property	RT-2 (Google)	OpenVLA (Stanford)	Octo (Berkeley)	pi-0
Params	55B	7B	93M	3B
Action type	Discrete (256 bins)	Discrete (256 bins)	Diffusion	Flow matching
Control freq	3 Hz	5 Hz	10 Hz	50 Hz
Dexterous tasks	No	No	Limited	Yes
Cross-embodiment	1 robot	Multiple (OXE)	9 robots	7+ robots
Open source	No	Yes	Yes	No

pi-0 is the smallest model that achieves 50 Hz control AND dexterous manipulation. RT-2 is too large for real-time inference. OpenVLA runs faster but can't do dexterity (discrete tokens). Octo is small and fast but limited in task complexity. pi-0 threads the needle: small enough for 50 Hz, expressive enough for dexterity.

Why does pi-0 need BOTH pre-training and post-training?

Pre-training gives broad capability and recovery behaviors; post-training gives task-specific dexterity and efficiency
Pre-training is for vision, post-training is for actions
It's just faster to train in two stages

Chapter 7: Dexterous Tasks

This is where pi-0 shines — tasks that no previous VLA could handle. The paper demonstrates three showcase tasks that each require 10+ minutes of continuous manipulation:

Laundry folding

The robot fetches laundry from a dryer, packs it into a hamper, brings the hamper to a folding table, then folds each article of clothing. This requires handling deformable objects (fabric) in arbitrary initial configurations — a fundamentally different challenge from rigid object manipulation. The robot must plan a sequence of folds based on the garment type (shirt vs pants vs towel) and adapt to how the fabric lands after each fold.

Concrete data flow during folding: The wrist cameras see the fabric at 224x224. The VLM identifies the garment type and current configuration ("shirt, spread flat, arms extended"). The action expert generates a 1-second chunk of 50 actions, each 14-dimensional (bimanual: 2 arms x (6 joints + gripper)), that initiates the first fold. After execution, fresh images trigger re-planning. A full fold sequence: ~20-40 action chunks = 20-40 seconds per garment.

Table bussing

The robot must clear a table, sorting items into the correct bins: dishes go in a bus bin, trash goes in a trash bin. This requires recognizing novel objects and deciding their category — is this a used napkin (trash) or a plate (dish)?

Why VLM pre-training is essential here: The model must classify objects it has never manipulated before (a particular brand of takeout container, a specific type of utensil). The VLM's web pre-training provides this classification ability — it has seen millions of images of plates, cups, and napkins. The action expert just needs to execute "pick up and place in bin A vs bin B."

Box assembly

The robot assembles a cardboard box from a flat template — folding flaps, tucking tabs, creating a 3D structure from a 2D object. This requires precise bimanual coordination and understanding of the geometric constraints of folding.

Grocery packing: a multi-category challenge

Less dramatic than folding but equally important: the robot packs groceries into bags. This tests a different skill profile — rapid object classification (is this fragile? heavy? cold?) combined with sequential bin-packing planning (heavy items on the bottom, eggs on top). The VLM's web pre-training is essential here: it knows that eggs are fragile and canned goods are heavy without being told, because this knowledge is embedded in the billions of image-text pairs it was trained on.

Grocery packing also reveals a subtlety about action representation: the robot must vary its grasp force based on object fragility (gentle for bread, firm for cans), but the action space doesn't include explicit force dimensions. Instead, the model learns to use slow, careful motions as a proxy for gentle handling — slower approach = softer contact. This is an emergent behavior, not an engineered feature.

Why these tasks matter: Previous VLAs demonstrated pick-and-place tasks lasting 10-30 seconds. pi-0 demonstrates tasks lasting 10+ minutes with complex multi-stage behaviors. This is a qualitative leap — from robot "tricks" to robot "work."

What degrades gracefully vs. catastrophically

Through deployment experience, pi-0 reveals clear degradation patterns:

Condition	Effect	Severity
Novel object (similar shape)	Works — VLM generalizes	Minimal
Novel object (very different shape)	Grasps fail — action expert hasn't seen this geometry	Moderate
Changed lighting	Minor performance drop — SigLIP is robust to illumination	Low
Changed camera angle	Significant degradation — spatial mapping breaks	High
Ambiguous instruction	Model chooses one valid interpretation — usually reasonable	Low-moderate
Completely new environment layout	Fails — workspace mapping is environment-specific	Critical

The last failure mode — new environments — is exactly what pi-0.5 was designed to solve.

Inference latency breakdown for folding

During a laundry-folding episode (bimanual, ~2 minutes per garment), the inference pipeline runs continuously:

Step	Time	Frequency
2 wrist cameras capture	5ms	Every 20ms (50 Hz)
SigLIP: 2 images → 512 tokens	6ms	Every 500ms (2 Hz re-plan)
Gemma forward pass	8ms	Every 500ms
Flow matching: 10 steps	20ms	Every 500ms
Execute 25 bimanual actions	500ms	At 50 Hz
Total per re-plan cycle	~540ms if run back to back	~500ms when compute overlaps execution

The robot executes the first 25 actions open-loop while the next chunk is being computed. No idle time. For a 2-minute folding sequence, this means ~240 re-plan cycles and ~6000 individual motor commands.
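The cycle and command counts above follow from simple arithmetic. The sketch below reproduces them from the table's figures. The function and key names are made up for illustration.

```python
CONTROL_HZ = 50          # motor commands per second
ACTIONS_PER_REPLAN = 25  # actions executed per re-plan cycle
COMPUTE_MS = {"siglip": 6, "gemma": 8, "flow_matching": 20}  # from the table


def replan_stats(episode_seconds):
    """Derive per-episode counts from the control rate and chunk size."""
    execute_ms = ACTIONS_PER_REPLAN * 1000 // CONTROL_HZ  # time to run 25 actions
    compute_ms = sum(COMPUTE_MS.values())                 # model time per re-plan
    return {
        "execute_ms": execute_ms,
        "compute_ms": compute_ms,
        "cycles": episode_seconds * 1000 // execute_ms,
        "commands": episode_seconds * CONTROL_HZ,
        # Fraction of each execution window spent computing the next chunk.
        "compute_fraction": compute_ms / execute_ms,
    }


if __name__ == "__main__":
    for key, value in replan_stats(120).items():
        print(key, value)
```

With these numbers the model is busy for only a small fraction of each 500ms execution window, which is why the next chunk can always be ready in time.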

What makes laundry folding fundamentally harder than pick-and-place for a VLA?

Fabric is deformable, comes in arbitrary configurations, and requires multi-step planning that adapts to how each fold changes the garment
The robot needs more cameras
It requires a faster GPU

Chapter 8: Results

pi-0 is evaluated in three settings: out-of-the-box (zero-shot), language-conditioned, and fine-tuned to new tasks.

Out-of-the-box performance

Without any task-specific fine-tuning, pi-0 outperforms all baselines (OpenVLA, Octo, pi-0-small) across every task. Even a version trained for only 160K steps (matching the baselines' training budget) still wins, showing the advantage is architectural, not just from more training.

Figure: Performance Comparison

The inference pipeline in deployment

What actually happens on the robot at runtime:

Capture: 3 cameras capture 224x224 images at 10 Hz (every 100ms).
Encode: SigLIP processes all 3 images in parallel — ~8ms on A100.
Plan: Gemma transformer forward pass with all tokens — ~10ms.
Denoise: 10 flow matching steps to generate 50-action chunk — ~20ms.
Execute: Send first 25 actions to robot at 50 Hz — 500ms of open-loop execution.
Re-plan: While executing actions 13-25, start planning the next chunk with fresh images.

Total latency from observation to first action: ~40ms. Effective control loop: closed at 2 Hz (re-plan every 500ms), but actions execute at 50 Hz within each chunk. This pipelined approach means the robot never waits for computation — there's always an action chunk ready to execute.
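The Denoise step is a fixed-step Euler integration of a velocity field from noise (t = 0) to an action chunk (t = 1). The sketch below shows the integration loop only. `toy_velocity` is a hypothetical stand-in for the learned action expert: it points straight at a known target, which a real model does not have access to. The time convention is also an assumption.

```python
import random


def toy_velocity(x, t, target):
    """Hypothetical velocity field: straight-line flow toward `target`.

    A real action expert would predict this from images, language, and state.
    """
    return [(g - xi) / (1.0 - t) for xi, g in zip(x, target)]


def euler_sample(target, num_steps=10, seed=0):
    """Integrate dx/dt = v(x, t) from t = 0 to t = 1 with forward Euler."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in target]  # start from Gaussian noise
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt                             # t stays below 1, so no divide by zero
        v = toy_velocity(x, t, target)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


if __name__ == "__main__":
    target = [0.5, -0.25, 1.0]
    print(euler_sample(target))
```

Each loop iteration is one network forward pass in the real system, so the step count directly sets denoising latency. With this toy field the sample lands on the target after the final step. A learned field only approximates that.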

Language following

When given intermediate language commands from a human expert ("pick up the red plate, put it in the bin"), pi-0 significantly outperforms pi-0-small (which lacks VLM pre-training). This confirms that the VLM backbone's language understanding directly translates to better instruction following.

Fine-tuning to new tasks

pi-0 can be efficiently fine-tuned to entirely new tasks not seen during pre-training. Even "hard" tasks (like paper towel replacement — no similar task in pre-training) achieve strong performance with moderate amounts of fine-tuning data.

Fine-tuning data efficiency: worked example

How much data does post-training actually need? The paper shows a clear data-efficiency curve:

Post-training Data	Task	Success Rate
0 demos (zero-shot)	Simple pick-and-place	~65% (from pre-training alone)
50 demos (~2 hours)	Simple pick-and-place	~85%
200 demos (~8 hours)	Laundry folding	~60%
500 demos (~20 hours)	Laundry folding	~80%
1000 demos (~40 hours)	Laundry folding	~85%

The key insight: pre-training makes each post-training demonstration count for more. Without pre-training, 200 folding demos give ~20% success. With pre-training, the same 200 demos give ~60%. The pre-trained model already knows what fabric looks like, how to plan grasps, and how to maintain smooth trajectories. Post-training just needs to teach the specific fold sequences. Returns do diminish as more data is added: going from 500 to 1000 folding demos adds only about 5 percentage points.
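The laundry-folding rows of the table can be read as a marginal-gain calculation. This small sketch uses only the numbers quoted in this section. The variable and function names are illustrative.

```python
# (demos, success rate in %) for laundry folding, copied from the table above.
FOLDING_CURVE = [(200, 60), (500, 80), (1000, 85)]

# Success at 200 demos with and without pre-training, as quoted in the text.
WITH_PRETRAINING_200 = 60
WITHOUT_PRETRAINING_200 = 20


def marginal_gains(curve):
    """Return (extra demos, gain in pp, pp per extra demo) between adjacent rows."""
    gains = []
    for (d0, s0), (d1, s1) in zip(curve, curve[1:]):
        gains.append((d1 - d0, s1 - s0, (s1 - s0) / (d1 - d0)))
    return gains


if __name__ == "__main__":
    for extra, gain, per_demo in marginal_gains(FOLDING_CURVE):
        print(f"+{extra} demos -> +{gain} pp ({per_demo:.3f} pp per demo)")
    print("pre-training benefit at 200 demos:",
          WITH_PRETRAINING_200 - WITHOUT_PRETRAINING_200, "pp")
```

The first 300 extra demos buy about 0.067 percentage points each, the next 500 only 0.01 each, while pre-training alone is worth 40 points at the 200-demo mark.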

Key finding: The combination of a large VLM backbone + flow matching + diverse pre-training is not just incrementally better; it's qualitatively different. pi-0 performs dexterous tasks that the 256-bin discrete-token baselines compared here could not.

What does the comparison between pi-0 and pi-0-small (no VLM pre-training) reveal?

VLM pre-training is essential for language following ability, which translates to better task performance especially with language guidance
Smaller models are always worse
Pre-training doesn't help much

Chapter 9: Connections

pi-0 is the foundation that spawned an entire family of models from Physical Intelligence:

pi-0 (2024)

VLM + flow matching. The foundation model.

↓

FAST (2025)

Better action tokenization for discrete pre-training.

↓

pi-0.5 (2025)

Open-world generalization via co-training on heterogeneous data.

↓

RTC (2025)

Real-time chunking for practical deployment.

↓

pi*0.6 (2025)

Learning from experience via RL — going beyond imitation.

What would break if you changed one thing

To understand why pi-0 works, it helps to ask what happens if you remove or change each component:

Change	What Happens	Why
Remove VLM pre-training	Language following breaks. Novel object recognition fails. Task success drops ~30 pp.	No visual-semantic grounding. Model must learn "what is a cup" from robot data alone.
Replace flow matching with DDPM	Denoising becomes 5-10x slower. Must reduce to 5 Hz control. Dexterous tasks degrade.	DDPM needs 50-100 steps vs flow matching's 10. Can't meet 50 Hz budget.
Remove action expert (single model)	VLM's language understanding degrades after 50K steps. Catastrophic forgetting.	Action-specific gradients corrupt VLM weights optimized for language.
Single-step prediction (H=1)	Jerky motions. Fabric folding impossible. Temporal coherence lost.	No ability to plan smooth trajectories longer than 20ms.
Remove cross-embodiment data	Performance on trained tasks unchanged. But fine-tuning to new robots requires more data.	Cross-embodiment provides general manipulation priors, not task-specific skills.
Replace PaLiGemma 3B with 7B VLM	Better language understanding, but inference too slow for 50 Hz. Must drop to 10 Hz.	Larger model = more flops per forward pass. Real-time constraint is hard.

This table reveals that pi-0's design is tightly constrained. Every choice is load-bearing. The 50 Hz real-time requirement alone eliminates most alternatives.
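The DDPM row can be checked with a back-of-the-envelope estimate. The sketch below assumes every denoising step costs the same as one flow matching step (20ms for 10 steps, as quoted in the latency figures in this lesson). That assumption is mine, not a measurement. `evals_per_step` is a hypothetical knob for solvers that call the network more than once per step.

```python
FLOW_STEPS = 10
FLOW_DENOISE_MS = 20.0                      # quoted latency for 10 steps
MS_PER_STEP = FLOW_DENOISE_MS / FLOW_STEPS  # ASSUMPTION: constant cost per step


def denoise_ms(num_steps, evals_per_step=1):
    """Estimated denoising latency for a given step count and solver order."""
    return num_steps * evals_per_step * MS_PER_STEP


if __name__ == "__main__":
    print("flow matching, 10 steps:", denoise_ms(10), "ms")
    print("DDPM, 50 steps:", denoise_ms(50), "ms")
    print("DDPM, 100 steps:", denoise_ms(100), "ms")
    print("10 steps, 2 evals per step:", denoise_ms(10, evals_per_step=2), "ms")
```

Under this assumption a 50 to 100 step sampler spends 100 to 200ms on denoising alone, five to ten times the flow matching figure.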

The memory footprint at inference

Running pi-0 on a robot requires careful memory management:

Component	GPU Memory	Notes
SigLIP encoder (400M params)	~0.8 GB (FP16)	Frozen, loaded once
Gemma backbone (2.4B params)	~4.8 GB (FP16)	KV-cache adds ~0.5 GB
Action expert (600M params)	~1.2 GB (FP16)	Active during denoising
Action tokens (50 x 2048)	~0.4 GB	10 copies for denoising steps
Total	~7.7 GB	Fits on A100 (80 GB) with room

The model comfortably fits on a single A100. For edge deployment, INT8 quantization roughly halves the memory to ~3.9 GB, fitting on an NVIDIA Jetson Orin (64 GB shared). However, quantization slightly degrades action precision, which is acceptable for coarse tasks but problematic for fine dexterity. The paper doesn't quantize for dexterous task evaluations.
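Weight memory is parameters times bytes per parameter. The sketch below applies that rule to the parameter counts listed in the table. It is a weights-only estimate that ignores the KV-cache and activation buffers, and it uses 1 GB = 1e9 bytes. The names are illustrative.

```python
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1}

# Parameter counts as listed in the table above.
PARAMS = {"siglip": 400e6, "gemma": 2.4e9, "action_expert": 600e6}


def weight_gb(num_params, dtype):
    """Memory for weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


def total_weight_gb(dtype):
    """Sum of weight memory over all components for one precision."""
    return sum(weight_gb(n, dtype) for n in PARAMS.values())


if __name__ == "__main__":
    for name, n in PARAMS.items():
        print(f"{name}: {weight_gb(n, 'fp16'):.1f} GB in FP16")
    print(f"total FP16 weights: {total_weight_gb('fp16'):.1f} GB")
    print(f"total INT8 weights: {total_weight_gb('int8'):.1f} GB")
```

Halving the bytes per parameter halves the weight footprint exactly. Runtime buffers do not necessarily shrink by the same factor.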

What pi-0 got wrong (and what pi-0.5/pi-0.7 fixed)

With hindsight, pi-0 had several limitations that became clear in deployment:

No planning hierarchy. pi-0 maps directly from observation to action with no intermediate reasoning. This works for atomic tasks but fails for multi-step sequences. pi-0.5 added two-stage inference (predict subtask, then predict actions).
No temporal memory. pi-0 sees only the current frame. It can't remember that it already picked up the first plate. pi-0.7 added MEM for observation history.
Mode averaging on diverse data. pi-0 treats all demonstrations equally, leading to averaged behavior when strategies conflict. pi-0.7 added diversified prompts to disambiguate.
Environment overfitting. pi-0 memorizes training environments rather than generalizing. pi-0.5's co-training recipe addresses this by exposing the model to 50+ environments.

Engineering decisions that defined pi-0

Looking back, the key decisions that made pi-0 work — and that subsequent models inherited or modified:

Decision	Alternative Considered	Why pi-0's Choice Won
Flow matching (not DDPM)	Diffusion Policy uses DDPM	5-10x fewer steps = real-time capable
Action expert (separate weights)	Single unified model	Prevents catastrophic forgetting
Action chunks (H=50)	Single-step prediction	Temporal coherence for smooth motion
Joint angles (not end-effector)	Cartesian control	Full expressivity for dexterous tasks
PaLiGemma 3B (not 7B+)	Larger VLMs	Real-time inference budget
Freeze VLM in post-training	Full fine-tuning	Retains generalization

The 50 Hz constraint: what it rules out

The real-time requirement of 50 Hz control (20ms per inference cycle) is the single most constraining design decision in pi-0. Here's what it rules out at inference time:

Chain-of-thought reasoning: Generating intermediate text tokens before actions would add ~5ms per token. Even 10 tokens of "reasoning" costs 50ms — exceeding the budget. pi-0 must think and act in a single forward pass. (pi-0.5 relaxes this by running at 10 Hz.)
Test-time compute scaling: No best-of-N sampling, no tree search, no iterative refinement. You get one shot per cycle.
Multiple denoising step schedules: Must use exactly 10 Euler steps. Higher-order solvers (Heun, RK4) would give better quality but cost 2-4x more compute per step.
Image augmentation at inference: No random crops, no multi-scale processing. Single resize and forward pass.

These constraints are why pi-0's architecture is so carefully optimized. Every millisecond matters. The architecture isn't just a good design — it's the only design that meets the real-time budget while maintaining VLM-level understanding.

In the broader field

Model	Action Repr.	Key Difference from pi-0
RT-2	Discrete tokens (256 bins)	Coarse actions, no dexterity
OpenVLA	Discrete tokens	Open-source but limited precision
Octo	Diffusion	Diffusion head but smaller model
pi-0	Flow matching	VLM + flow matching + action expert


"A human being should be able to change a diaper, plan an invasion, butcher a hog... Specialization is for insects."

— Robert A. Heinlein (quoted in the paper)

pi-0: A Vision-Language-Action Flow Model