Physical Intelligence, 2024

pi-0: A Vision-Language-Action Flow Model

The robot foundation model that folds laundry, busses tables, and packs groceries — by combining a vision-language model with flow matching to generate smooth, continuous robot actions at 50 Hz.

Prerequisites: Transformers + Basic robotics
10 Chapters, 5+ Simulations

Chapter 0: The Problem

Robot learning has a versatility problem. We can train a robot to pick up a specific cup in a specific lab. We can even train it on many cups in many labs. But ask it to fold a shirt — a task that requires understanding fabric dynamics, planning a sequence of folds, adapting to arbitrary initial configurations — and it falls apart.

The issue isn't intelligence. Vision-language models like GPT-4V can describe how to fold a shirt in perfect detail. The issue is that no amount of language understanding translates into the precise, high-frequency motor commands needed to actually manipulate fabric at 50 Hz.

Previous VLAs (like RT-2) solved this by discretizing actions into tokens — binning each action dimension into 256 values. This works for coarse pick-and-place, but discretization destroys the smoothness needed for dexterous manipulation. Try folding a shirt with only 256 possible positions per joint per timestep. You can't.

What makes VLAs different from industrial robots

Industrial robots (in factories) run at >1000 Hz with sub-millimeter precision. They don't need VLMs because their environment is perfectly known — the same part arrives at the same position every time. The challenge VLAs solve is the opposite: handling unpredictable environments where the robot must understand what it sees to decide what to do. This requires a VLM. But the VLM is huge (billions of parameters), which means slower inference, which means lower control frequency.

pi-0 operates at the intersection: fast enough for smooth manipulation (50 Hz), smart enough to understand language and novel objects (VLM backbone). This dual requirement is what makes the engineering so constrained.

What "50 Hz control" actually means

A robot arm with 7 joints needs a new position command for every joint, every 20 milliseconds. That's 7 floating-point numbers x 50 times per second = 350 continuous values per second. Each value needs sub-degree precision — the difference between a clean fold and a crumpled mess is often less than 2 degrees of wrist rotation.

With RT-2's 256-bin discretization, the minimum resolution per joint is approximately 360 degrees / 256 = 1.4 degrees per bin. Sounds fine in isolation. But errors compound across joints and timesteps. Over a 10-second fold sequence (500 timesteps x 7 joints), the accumulated quantization error makes smooth trajectories impossible.
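
The bin-width estimate above can be checked with a small sketch. It assumes a 360-degree joint range split into uniform bins, as in the back-of-the-envelope figure; the helper names `bin_width`, `quantize`, and `dequantize` are illustrative and not taken from RT-2.

```python
import math

JOINT_RANGE_DEG = 360.0  # assumed full joint range, as in the estimate above
NUM_BINS = 256

def bin_width(num_bins: int, joint_range: float = JOINT_RANGE_DEG) -> float:
    """Width of one uniform bin in degrees."""
    return joint_range / num_bins

def quantize(angle_deg: float, num_bins: int = NUM_BINS) -> int:
    """Map an angle in [0, range) to a bin index."""
    idx = int(angle_deg / bin_width(num_bins))
    return min(max(idx, 0), num_bins - 1)

def dequantize(idx: int, num_bins: int = NUM_BINS) -> float:
    """Map a bin index back to the angle at the bin center."""
    return (idx + 0.5) * bin_width(num_bins)

# A smooth 50-step trajectory easing from 90 to 100 degrees.
trajectory = [90.0 + 10.0 * 0.5 * (1 - math.cos(math.pi * t / 49)) for t in range(50)]
reconstructed = [dequantize(quantize(a)) for a in trajectory]

max_error = max(abs(a - b) for a, b in zip(trajectory, reconstructed))
distinct_levels = len(set(reconstructed))

print(f"bin width: {bin_width(NUM_BINS):.2f} deg")
print(f"max round-trip error: {max_error:.2f} deg")
print(f"distinct output levels over 50 steps: {distinct_levels}")
```

The smooth 50-step curve collapses onto a handful of discrete levels, so the executed trajectory becomes a staircase even though each individual error stays below one bin width.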

The fundamental tension: VLMs give you understanding (what to do). But discrete token outputs can't express the precision needed for dexterous manipulation (how to do it smoothly). pi-0 resolves this by keeping the VLM for understanding and adding flow matching for continuous, high-precision action generation.

Discrete vs Continuous Actions

[Interactive figure: a resolution slider (bins per dimension) shows how discretization affects a smooth trajectory. At 256 bins the trajectory is coarse and jerky, while flow matching produces the smooth blue curve.]

What degrades with discretization

The damage isn't just "slightly less smooth." Discretization causes three specific failure modes in dexterous manipulation:

Why can't discrete action tokens (like RT-2 uses) handle dexterous manipulation?

Chapter 1: The Key Insight

pi-0's insight is to treat the robot foundation model exactly like a language foundation model — but with a crucial architectural twist for actions.

In language, the recipe is: (1) pre-train a large model on diverse internet text, (2) post-train (fine-tune) on carefully curated data for the desired behavior. GPT-4 follows this recipe. So does Claude.

pi-0 follows the same recipe for robots: (1) start with a pre-trained VLM (PaLiGemma) that already understands images and text, (2) pre-train on diverse robot data from 7 robot types and 68 tasks, (3) post-train on high-quality data for specific dexterous tasks.

The twist: instead of predicting the next discrete token (like language models do), pi-0 uses flow matching to generate continuous action distributions. This gives it the precision to control robots at 50 Hz for tasks like folding laundry — something no discrete-token VLA can do.

Think of it this way: A language model predicts "the next word" from a finite vocabulary. pi-0 predicts "the next 50 motor commands" from a continuous space — like generating a smooth curve rather than choosing from a fixed set of points. Flow matching is the mathematical tool that makes this possible.

The complete data flow

Let's trace a single inference step through pi-0. The robot has two wrist cameras and one base camera. Here's exactly what happens:

  1. Image input: 3 camera images at 224x224 resolution. Each passes through SigLIP's ViT-So400m encoder, producing 256 visual tokens per image = 768 visual tokens total. Each token is a 1152-dimensional vector.
  2. Language input: A task instruction ("fold the towel in thirds") tokenized into ~15-30 language tokens.
  3. Proprioception: Current joint angles + gripper state = 7-8 floating-point values, projected to a single embedding.
  4. VLM processing: All tokens (768 visual + ~25 language + 1 proprioceptive) pass through the Gemma transformer backbone (18 layers). Output: contextualized embeddings for each token position.
  5. Action expert: 50 action tokens, initialized from noise, are appended to the sequence. They attend to all VLM tokens through the attention layers they share with the backbone (see Chapter 4), then pass through the action expert's own feedforward layers.
  6. Flow matching denoising: The action tokens go through 10 denoising steps, progressively refining from noise to a clean action chunk of shape [50, 7]: 50 timesteps, each with 7 action dimensions (for example, 6 joint angles + gripper on a UR5e).
  7. Output: 50 continuous joint-angle commands at 50 Hz (1 second of motion). In practice the first 25 are executed open-loop before re-planning with fresh images (see Chapter 4).
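
As a quick sanity check on the sequence length implied by the steps above, here is a minimal token-count sketch. The language-token count of 25 is an assumed midpoint of the ~15-30 range, and all constant names are illustrative.

```python
NUM_CAMERAS = 3
IMAGE_SIZE = 224
PATCH_SIZE = 14
LANGUAGE_TOKENS = 25   # assumed midpoint of the ~15-30 range
PROPRIO_TOKENS = 1
ACTION_TOKENS = 50

# Each image is cut into (224 / 14)^2 patches, one visual token per patch.
tokens_per_image = (IMAGE_SIZE // PATCH_SIZE) ** 2
visual_tokens = NUM_CAMERAS * tokens_per_image

# Tokens that describe the observation, then the action chunk on top.
prefix_tokens = visual_tokens + LANGUAGE_TOKENS + PROPRIO_TOKENS
total_tokens = prefix_tokens + ACTION_TOKENS

print(f"visual tokens: {visual_tokens}")
print(f"observation tokens: {prefix_tokens}")
print(f"total sequence length: {total_tokens}")
```

Visual tokens dominate the sequence, which is why image resolution and camera count are the main levers on inference cost.
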
Step 1: Start with PaLiGemma VLM (understands images + text from web pre-training)
Step 2: Pre-train on diverse robot data: 7 robot types, 68 tasks, OXE dataset
Step 3: Post-train on high-quality curated data for specific dexterous tasks
Result: A robot that folds laundry, busses tables, and packs groceries

Why flow matching instead of DDPM?

Both diffusion (DDPM) and flow matching can generate continuous outputs. But there's a critical engineering difference: DDPM needs 50-1000 denoising steps because its paths are curved (the model must follow a complex SDE). Flow matching uses straight-line paths (optimal transport) from noise to data, requiring only 5-10 steps for comparable quality.

At 50 Hz control: if each denoising step takes ~2ms of GPU time, then 10 steps = 20ms per action chunk (fits in the 20ms budget). 100 steps = 200ms — already too slow for real-time. This is why pi-0 uses flow matching. It's not a minor preference; it's a hard engineering constraint.
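
The budget arithmetic can be written out as a small sketch. The ~2ms per-step cost is the assumed figure from the paragraph above, not a measured value, and the names are illustrative.

```python
CONTROL_HZ = 50
MS_PER_STEP = 2.0  # assumed per-step GPU time from the estimate above

def chunk_latency_ms(num_steps: int, ms_per_step: float = MS_PER_STEP) -> float:
    """Time to denoise one action chunk with the given number of steps."""
    return num_steps * ms_per_step

period_ms = 1000.0 / CONTROL_HZ            # one control tick at 50 Hz
max_steps = int(period_ms // MS_PER_STEP)  # most steps that fit in one tick

for steps in (10, 100):
    latency = chunk_latency_ms(steps)
    verdict = "fits" if latency <= period_ms else "too slow"
    print(f"{steps} steps -> {latency:.0f} ms ({verdict} for a {period_ms:.0f} ms period)")
```

Under these assumptions 10 steps is the ceiling, not a comfortable margin: any extra per-step cost pushes the chunk past one control period.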

What is the key architectural difference between pi-0 and earlier VLAs like RT-2?

Chapter 2: The VLM Backbone

pi-0 is built on PaLiGemma — a vision-language model from Google that combines a SigLIP image encoder with a Gemma language model. PaLiGemma was pre-trained on billions of image-text pairs from the web, so it already understands visual concepts, spatial relationships, and natural language.

Why not train from scratch? Because a VLM pre-trained on web data gives you an enormous head start. It already knows what a "cup" looks like, that "left of the plate" means a spatial relationship, and how to follow instructions like "pick up the red one." Without this foundation, a robot policy would need to learn all of this from relatively scarce robot demonstration data.

Late fusion architecture

PaLiGemma uses late fusion: the image encoder processes images independently, then the resulting visual tokens are concatenated with text tokens and fed to the language model transformer. pi-0 extends this by adding a third modality — robot proprioception and actions — alongside vision and language.

Why late fusion (not early fusion)

There are two ways to combine vision and language in a multimodal model:

PaLiGemma uses late fusion because it allows the image encoder to be pre-trained independently (SigLIP on image-text contrastive learning) and the language model to be pre-trained independently (Gemma on text). You get the best of both specialists. The downside: the image encoder can't attend to language — it processes images context-free. For robot tasks, this is usually fine because the image content (what's on the table) doesn't depend on the language instruction (what to do with it).

For pi-0, late fusion has a bonus advantage: the SigLIP encoder can be kept frozen during post-training (no gradients needed), saving memory and preventing visual representation degradation. In the recipe described in Chapter 6, the Gemma backbone is frozen at that stage as well, so only the action expert receives gradients during post-training.

Concrete dimensions

Let's be precise about what flows through the network:

Component | Architecture | Output Shape
SigLIP encoder | ViT-So400m/14, 400M params | 256 tokens x 1152 dims per image
Gemma backbone | 2B params, 18 layers, 2048 hidden | Contextualized embeddings
Action tokens | Learned embeddings | 50 tokens x 2048 dims (action chunk)
Action head MLP | 2-layer MLP per token | 7 or 14 dims (joint angles + gripper)

Total model: ~3.3B parameters. The PaLiGemma VLM accounts for ~3B, and the action expert adds ~300M.

[Figure: Architecture Overview]

Image preprocessing: from raw pixels to tokens

The camera produces 640x480 RGB images. Before reaching SigLIP, each image is:

  1. Resized: 640x480 → 224x224 (bilinear interpolation)
  2. Normalized: Pixel values [0, 255] → [-1, 1] (scale to [0, 1], then subtract a mean of 0.5 and divide by a standard deviation of 0.5 per channel)
  3. Patched: 224x224 image / 14x14 patch size = 16x16 = 256 patches
  4. Encoded: Each patch → 1152-dim embedding via SigLIP's ViT layers

Why 224x224? This is SigLIP's native resolution from pre-training. Higher resolution (448x448) would give 1024 patches per image — better for small objects but 4x the compute cost. The paper found 224x224 sufficient for manipulation-scale objects.
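
The patch arithmetic and the pixel rescaling can be sketched in a few lines. The linear map to [-1, 1] is an assumption about the preprocessing, and the function names are illustrative.

```python
def num_patches(image_size: int, patch_size: int = 14) -> int:
    """Number of non-overlapping square patches in a square image."""
    side = image_size // patch_size
    return side * side

def normalize_pixel(value: int) -> float:
    """Map an 8-bit pixel value in [0, 255] to [-1, 1] (assumed linear rescale)."""
    return value / 127.5 - 1.0

for size in (224, 448):
    print(f"{size}x{size} -> {num_patches(size)} patches")

print(normalize_pixel(0), normalize_pixel(255))
```

Doubling the resolution quadruples the token count per image. Since self-attention cost grows faster than linearly in sequence length, the 4x figure is a lower bound on the extra compute.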

What the VLM brings vs. what it lacks

The VLM backbone provides:

What it does NOT provide (and must be learned during robot pre-training):

Why PaLiGemma specifically? It's relatively small (3B parameters), which matters for real-time robot control: the model needs to run inference fast enough for 50 Hz action generation. Larger VLMs would give better understanding but slower inference. At inference time, pi-0 runs on a single NVIDIA A100 GPU, generating 50-step action chunks in ~20ms.

Why does pi-0 start from a pre-trained VLM rather than training from scratch?

Chapter 3: Flow Matching for Actions

This is the core technical contribution. Instead of discretizing actions into bins (like RT-2's 256 tokens per dimension), pi-0 uses conditional flow matching to model the continuous distribution of actions.

What is flow matching?

Flow matching learns a velocity field that transports random noise into a desired distribution — in this case, the distribution of correct robot actions. During inference, you start with random noise and follow the velocity field to produce a clean action.

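dx_τ/dτ = v_θ(x_τ, o_t)

where x_τ is the noisy action at flow time τ, o_t is the observation (images + language + proprioception) at control step t, and v_θ is the learned velocity field. The derivative is taken with respect to the flow time τ, not the robot's wall-clock time t.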

Why flow matching and not just regression?

You might wonder: why not just predict the action chunk directly via regression (MSE loss)? The answer is multi-modality. Consider "pick up the cup" — there are multiple valid grasps (top-down, side grasp, pinch grasp). A regression model would predict the mean of these grasps: a halfway-between grasp that doesn't work for any approach. Flow matching (like diffusion) can sample from a multi-modal distribution — on each inference call it might produce a top-down grasp OR a side grasp, both valid, but never the mean.

This matters less for simple tasks (there's usually one best approach to "move left"). But for dexterous manipulation with multiple valid strategies, it's the difference between a model that works and one that produces garbage. In the paper's ablation, replacing flow matching with direct regression drops success on multi-strategy tasks by ~25 pp while barely affecting single-strategy tasks.
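
A toy one-dimensional illustration of the mode-averaging problem follows. The two "grasp angles" at -1.0 and +1.0 are made up, and resampling the demonstrations is only a stand-in for a generative model that reproduces the data distribution.

```python
import random

random.seed(0)

# Toy 1-D "grasp angle" demos: two valid strategies near -1.0 and +1.0.
demos = [random.choice([-1.0, 1.0]) + random.gauss(0.0, 0.05) for _ in range(1000)]

# The constant prediction that minimizes MSE is the sample mean.
mse_optimal = sum(demos) / len(demos)

# Stand-in for a generative policy: draw samples from the data distribution.
samples = [random.choice(demos) for _ in range(5)]

def distance_to_nearest_mode(x: float) -> float:
    """Distance from x to the closer of the two valid grasps."""
    return min(abs(x - 1.0), abs(x + 1.0))

print(f"regression output: {mse_optimal:+.3f} "
      f"(distance to nearest valid grasp: {distance_to_nearest_mode(mse_optimal):.3f})")
for s in samples:
    print(f"sampled action:    {s:+.3f} "
          f"(distance to nearest valid grasp: {distance_to_nearest_mode(s):.3f})")
```

The regression output lands between the two strategies, far from either, while every sample sits close to one valid grasp.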

Training: the math made concrete

During training, the process is simple. Let's walk through one training step with actual numbers:

  1. Sample real action A from the dataset: a chunk of shape [50, 7] — 50 timesteps of 7 joint angles. Example: the first joint might be [0.32, 0.33, 0.34, ...] radians over 50 steps.
  2. Sample noise ε from N(0, I): same shape [50, 7], random values.
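  3. Sample flow time τ from [0, 1]. A uniform draw is the simplest choice (the paper instead samples from a beta distribution that puts more weight on noisier timesteps). Say τ = 0.7.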
  4. Interpolate: Aτ = τ * A + (1 - τ) * ε = 0.7 * real_action + 0.3 * noise. This is a "partially noisy" action chunk.
  5. Predict: Network outputs vθ(Aτ, observation) — its guess for the velocity direction.
  6. Loss: Compare v_θ to the true velocity (A - ε) using the squared L2 distance.

L(θ) = E[ ||v_θ(A_τ, o) - (A - ε)||² ]
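
The training step above can be sketched in plain Python. This is a minimal illustration, not the paper's implementation: the demonstration chunk is synthetic, the "network" is replaced by fixed stand-in predictions, and the loss is averaged over elements (a rescaled version of the squared norm).

```python
import random

H, D = 50, 7  # chunk length and action dimension used in the walkthrough

def interpolate(action, noise, tau):
    """A_tau = tau * A + (1 - tau) * eps, elementwise."""
    return [[tau * a + (1 - tau) * e for a, e in zip(row_a, row_e)]
            for row_a, row_e in zip(action, noise)]

def target_velocity(action, noise):
    """True velocity along the straight path from noise to data: A - eps."""
    return [[a - e for a, e in zip(row_a, row_e)]
            for row_a, row_e in zip(action, noise)]

def flow_matching_loss(predicted, target):
    """Mean squared error between predicted and target velocity."""
    total = sum((p - t) ** 2
                for row_p, row_t in zip(predicted, target)
                for p, t in zip(row_p, row_t))
    return total / (len(predicted) * len(predicted[0]))

random.seed(0)
action = [[0.32 + 0.01 * t for _ in range(D)] for t in range(H)]        # toy demo chunk
noise = [[random.gauss(0.0, 1.0) for _ in range(D)] for _ in range(H)]  # eps ~ N(0, I)
tau = 0.7

noisy_action = interpolate(action, noise, tau)  # input the network would see
target = target_velocity(action, noise)         # what it should output

# Stand-in predictions: an untrained guess of zero velocity, and a perfect guess.
zero_prediction = [[0.0] * D for _ in range(H)]
print(flow_matching_loss(zero_prediction, target))
print(flow_matching_loss(target, target))
```

Note that the target does not depend on τ: along a straight path the velocity is constant, and τ only changes where on the path the network is queried.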

Inference: denoising in practice

At inference time, we start with pure noise x0 ~ N(0, I) of shape [50, 7] and take K = 10 Euler steps:

x_{k+1} = x_k + (1/K) * v_θ(x_k, observation)

After 10 steps, x_10 is a clean action chunk. Total time: ~2ms per step x 10 steps = 20ms, which just fits the 20ms period of a 50 Hz control loop.
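
Here is a minimal sketch of the Euler sampler, using a closed-form velocity field in place of the trained network. It assumes the data distribution is a single known trajectory, in which case the ideal velocity on the path x_τ = τA + (1 - τ)ε is (A - x_τ) / (1 - τ). The function names are illustrative.

```python
import random

def toy_velocity(x, tau, target):
    """Ideal velocity field when the data distribution is one trajectory.

    On the straight path x_tau = tau * A + (1 - tau) * eps, the velocity
    A - eps can be rewritten as (A - x_tau) / (1 - tau).
    """
    return [(a - xi) / (1.0 - tau) for a, xi in zip(target, x)]

def euler_sample(velocity_fn, x0, num_steps=10):
    """Integrate dx/dtau = v(x, tau) from tau = 0 to 1 with fixed Euler steps."""
    x = list(x0)
    for k in range(num_steps):
        tau = k / num_steps  # flow time at the start of this step
        v = velocity_fn(x, tau)
        x = [xi + vi / num_steps for xi, vi in zip(x, v)]
    return x

random.seed(0)
target = [0.5 + 0.7 * t / 49 for t in range(50)]  # toy ramp from 0.5 to 1.2 rad
x0 = [random.gauss(0.0, 1.0) for _ in range(50)]  # pure noise

result = euler_sample(lambda x, tau: toy_velocity(x, tau, target), x0)
max_error = max(abs(r - a) for r, a in zip(result, target))
print(f"max deviation from target after 10 steps: {max_error:.2e}")
```

Because the path is a straight line, Euler integration is exact here regardless of step count. A learned field over a multi-modal dataset is only approximately straight, which is why a handful of steps is still needed in practice.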

A worked denoising example

Let's trace the denoising process for a single joint (say, the shoulder pitch). We want the shoulder to move from its current position (0.5 rad) to a target (1.2 rad) over 50 timesteps:

  1. True action trajectory: A smooth sigmoid curve from 0.5 to 1.2 rad over 50 steps.
  2. Start with noise: x0 = [0.87, -0.32, 1.45, ...] — 50 random values from N(0,1). No structure.
  3. Step 1: Network predicts velocity v1. x1 = x0 + 0.1 * v1. Still mostly noise, but a faint signal of the trajectory shape appears.
  4. Step 5: The trajectory is recognizable — roughly increasing from ~0.5 to ~1.2, but still jittery.
  5. Step 10: Clean output. A smooth curve matching the training distribution for this observation. Quantization-free — every value is continuous.

The key insight: the velocity field doesn't predict the final trajectory directly. It predicts the direction to move in at each intermediate noise level. This is easier to learn (local corrections vs. global prediction) and produces better results with fewer steps.

[Interactive figure: Flow Matching, Noise → Actions. An animation of noise being transported to a clean action trajectory by the learned velocity field.]

Why flow matching beats DDPM for robotics

Property | Flow Matching | DDPM (Diffusion)
Path shape | Straight lines (OT) | Curved (SDE)
Steps needed | 5-10 | 50-1000
Inference latency | ~20ms | ~100-2000ms
Real-time capable? | Yes (50 Hz) | Marginal (5-10 Hz)
Training stability | Good (no noise schedule) | Sensitive to schedule

Why flow matching over diffusion? Flow matching uses straight paths from noise to data (optimal transport), requiring fewer denoising steps than diffusion's curved paths. This is critical for real-time control: fewer steps means faster inference. With 10 denoising steps and ~2ms per step, pi-0 generates each action chunk in 20ms, exactly the budget for 50 Hz control.

What does the flow matching loss train the network to predict?

Chapter 4: The Action Expert

Here's a subtle but important design choice. pi-0 doesn't just run everything through the same transformer weights. It uses two sets of weights — inspired by Mixture of Experts architectures.

The VLM backbone processes image and language tokens using the original PaLiGemma weights. The action expert is a separate set of transformer weights that processes proprioceptive state and action tokens.

Both experts share the same attention mechanism: action tokens attend to image and language tokens, but not the other way around (see the attention mask below). The feedforward layers (the "thinking" computation) use different weights for different token types.

Why separate weights?

The VLM backbone has been pre-trained on billions of image-text pairs. Its weights encode visual and linguistic knowledge. If you force action tokens through these same weights, you either degrade the VLM's knowledge (catastrophic forgetting) or constrain the action representation to fit a space designed for language.

The action expert lets the model develop action-specific representations without corrupting the VLM's pre-trained knowledge. It's like having a bilingual translator — one brain for understanding the scene (VLM), a separate specialist for generating motor commands (action expert).

The attention mask: how tokens interact

The attention pattern is carefully designed:

This asymmetric mask is crucial. The VLM processes the scene exactly as it would without any action tokens — no interference. The action tokens then "read" the VLM's understanding and use it to generate coordinated motion across all 50 timesteps.

Action chunks, not single actions. pi-0 predicts H=50 future actions at once (an "action chunk"). Each of these 50 actions gets its own action token processed by the action expert. The full bidirectional attention mask lets all 50 action tokens attend to each other, enabling temporally coherent motion planning. This means the model can plan a smooth arc that takes 1 full second, rather than myopically choosing the next 20ms.
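
Based on the description in this chapter (VLM tokens are unaffected by action tokens, while action tokens attend to the VLM tokens and to each other), a simplified two-block mask could be sketched as follows. The function name and the tiny token counts are illustrative, and the proprioceptive token is left out for brevity.

```python
def build_attention_mask(num_prefix: int, num_action: int):
    """Return mask[i][j] = True if token i may attend to token j.

    Tokens 0..num_prefix-1 are VLM (image + language) tokens;
    the remaining tokens are action tokens.
    """
    n = num_prefix + num_action
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            query_is_action = i >= num_prefix
            key_is_action = j >= num_prefix
            # Prefix tokens see only the prefix; action tokens see everything.
            mask[i][j] = query_is_action or not key_is_action
    return mask

mask = build_attention_mask(num_prefix=4, num_action=3)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

The empty upper-right block is what keeps the VLM's computation identical with or without action tokens. One practical consequence is that the prefix computation could be cached and reused across the denoising steps, since it does not depend on the noisy actions.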

Why action chunking with H=50?

The choice of chunk size H=50 at 50 Hz = exactly 1 second of motion is deliberate:

In practice, pi-0 uses a sliding execution window: generate 50 actions, execute the first 25 (0.5 seconds), then re-plan with fresh camera images. This gives the smoothness of open-loop chunks with the responsiveness of closed-loop control.
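
The sliding execution window amounts to a receding-horizon loop. The sketch below uses a hypothetical `plan_chunk` callback in place of the real policy, with "actions" that are just integers so the execution order can be checked.

```python
CHUNK_SIZE = 50      # actions generated per planning call
EXECUTE_STEPS = 25   # actions executed before re-planning
CONTROL_HZ = 50

def run_episode(plan_chunk, duration_s: float):
    """Receding-horizon loop: plan 50 actions, execute 25, re-plan.

    `plan_chunk(step)` is a hypothetical stand-in for a policy call that
    returns CHUNK_SIZE actions starting at control step `step`.
    """
    total_steps = int(duration_s * CONTROL_HZ)
    executed, num_plans, step = [], 0, 0
    while step < total_steps:
        chunk = plan_chunk(step)
        num_plans += 1
        n = min(EXECUTE_STEPS, total_steps - step)
        executed.extend(chunk[:n])  # the second half of each chunk is discarded
        step += n
    return executed, num_plans

# Dummy planner: each "action" is the control step it was planned for.
executed, num_plans = run_episode(lambda s: list(range(s, s + CHUNK_SIZE)), duration_s=2.0)
print(f"{len(executed)} commands sent, {num_plans} planning calls")
```

Half of every chunk is thrown away. That is the price of reactivity: the discarded actions were planned from observations that are already 0.5 seconds old.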

Open-loop vs closed-loop: the engineering tradeoff

Action chunking creates a tension between two control paradigms:

pi-0's sliding window is a hybrid: execute half the chunk open-loop (smooth), then re-plan with fresh observations (reactive). This works because most manipulation tasks are semi-static — objects don't move during a 0.5-second reach. For truly dynamic tasks (human handover, moving conveyor), the chunk size would need to be smaller.

What happens to the action chunk at the robot

The output of the action expert is a tensor of shape [50, D_action] where D_action depends on the robot. For a 7-DOF arm + gripper: D_action = 8. The values are delta joint angles (how much to move each joint from its current position), normalized to [-1, 1] and then scaled by robot-specific action limits. The gripper dimension is binary (open/close) but represented as a continuous value thresholded at 0.5.
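
A minimal sketch of the post-processing described above, for one row of the action chunk. The per-joint limits, the clamping step, and the convention that a gripper value above 0.5 means "closed" are assumptions made for illustration.

```python
def to_robot_command(normalized_action, joint_limits, gripper_threshold=0.5):
    """Convert one normalized action (joints..., gripper) to robot units.

    joint_limits[i] is the assumed maximum per-step delta (radians) for joint i.
    Returns (delta_joint_angles, gripper_closed).
    """
    *joints, gripper = normalized_action
    if len(joints) != len(joint_limits):
        raise ValueError("expected one limit per joint dimension")
    # Clamp to [-1, 1] before scaling so an outlier cannot exceed the limit.
    deltas = [max(-1.0, min(1.0, a)) * limit for a, limit in zip(joints, joint_limits)]
    return deltas, gripper > gripper_threshold

limits = [0.1] * 7  # hypothetical per-step limits for a 7-DOF arm
deltas, closed = to_robot_command([0.5, -1.0, 2.0, 0.0, 0.0, 0.0, 0.0, 0.9], limits)
print(deltas, closed)
```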

Why does pi-0 use separate weights (an "action expert") for action tokens?

Chapter 5: Cross-Embodiment Pre-training

pi-0 is pre-trained on data from 7 different robot configurations spanning 68 tasks. These include single-arm robots (UR5e, Franka), dual-arm systems, and mobile manipulators — each with different joint configurations, gripper types, and action spaces.

Additionally, pi-0 incorporates the entire Open X-Embodiment (OXE) dataset, which adds data from 22 more robot types. This gives the model exposure to an enormous variety of manipulation scenarios.

How cross-embodiment works in practice

Different robots have different action dimensions (a 6-DOF arm vs a 7-DOF arm vs a mobile base + arm). pi-0 handles this by using robot-specific action tokenizers. The proprioceptive state and action dimensions are padded or projected to a common size, and the model learns which dimensions are relevant for which robot.

Concretely, the action space is standardized to a fixed maximum dimensionality (14 dimensions — enough for a bimanual system). Robots with fewer DOF simply zero-pad the unused dimensions. A robot identifier token is prepended to the language instruction so the model knows which dimensions are active.

The actual data mix

Robot Type | Config | Tasks | Hours
UR5e (single arm) | 6-DOF + gripper | Bussing, grocery, table setting | ~100
Franka (single arm) | 7-DOF + gripper | Drawer packing, stacking | ~80
ARX dual-arm | 2 x 6-DOF + 2 grippers | Laundry folding | ~60
Mobile manipulator | Arm + 2D base | Mobile bussing, multi-room | ~50
ALOHA bimanual | 2 x 6-DOF + 2 grippers | Cooking, cleaning | ~40
Kuka arm | 7-DOF + gripper | Bin picking | ~30
Sawyer | 7-DOF + gripper | Object placement | ~20
+ OXE dataset | 22 more types | Diverse manipulation | ~500

Total pre-training data: more than 10,000 hours of robot manipulation across all sources.

The foundation model analogy: Just as GPT benefits from training on code, poetry, AND scientific papers — even when you only want it to write code — pi-0 benefits from training on single-arm, dual-arm, AND mobile manipulation data. The diversity teaches general manipulation concepts that transfer across embodiments.

Cross-embodiment action tokenization: a worked example

Let's trace how the same training batch handles two different robots:

UR5e: 6-DOF arm + gripper = 7 dims. Padded to 14: [j1,j2,j3,j4,j5,j6,grip,0,0,0,0,0,0,0]
ARX bimanual: 2x6-DOF + 2 grippers = 14 dims. No padding: [L1..L6,Lgrip,R1..R6,Rgrip]
Shared model: Same transformer processes both. Robot ID in language prefix tells model which dims are active.

The language prefix looks like: "[robot=ur5e_single] pick up the cup" vs "[robot=arx_bimanual] fold the towel". The model learns that for ur5e, only dimensions 1-7 are meaningful; for ARX, all 14 are active. Predicted values in padded dimensions are ignored.
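
The padding scheme and language prefix described above can be sketched as follows. The helper names are hypothetical; the prefix format and the 14-dimension maximum are taken from the text.

```python
MAX_ACTION_DIM = 14  # fixed maximum dimensionality described above

def pad_action(action, max_dim: int = MAX_ACTION_DIM):
    """Zero-pad a robot-specific action vector to the shared dimensionality."""
    if len(action) > max_dim:
        raise ValueError(f"action has {len(action)} dims, max is {max_dim}")
    return list(action) + [0.0] * (max_dim - len(action))

def make_prompt(robot_id: str, instruction: str) -> str:
    """Prepend the robot identifier to the language instruction."""
    return f"[robot={robot_id}] {instruction}"

def active_part(predicted, num_active: int):
    """Keep only the dimensions that are meaningful for this robot."""
    return predicted[:num_active]

ur5e_action = [0.01, -0.02, 0.0, 0.03, 0.0, -0.01, 1.0]  # 6 joints + gripper
padded = pad_action(ur5e_action)
prompt = make_prompt("ur5e_single", "pick up the cup")

print(prompt)
print(padded)
print(active_part(padded, num_active=7))
```

At training time the padded vector is the regression target. At inference time only the first `num_active` predicted values would be sent to the robot.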

This is elegant but has a limitation: the padded zeros are still processed by the action expert, wasting compute. Future work could use sparse attention to skip inactive dimensions.

What transfers across embodiments (and what doesn't)

Cross-embodiment training transfers:

What does NOT transfer and must be learned per-embodiment:

How does pi-0 handle different action spaces across robot types?

Chapter 6: The Training Recipe

Like language models, the training recipe is arguably more important than the architecture. pi-0 uses a two-stage recipe that mirrors the pre-training/post-training split in LLMs:

Stage 1: Pre-training (broad capability)

Train on the full diverse data mixture: all 7 robot types, all 68 tasks, plus OXE data. Use both coarse task-level language labels ("fold the towel") and fine-grained segment annotations (~2-second snippets like "grasp the corner"). This stage runs for 700K gradient steps.

The goal: a base model with broad capabilities and generalization, but not necessarily expert at any one task.

Stage 2: Post-training (specific dexterity)

Fine-tune on high-quality curated data for specific downstream tasks. For complex tasks like laundry folding, this uses larger carefully collected datasets. For simpler tasks, even small amounts of post-training data suffice.

The goal: specialize the base model into an expert at a specific dexterous task.

Training infrastructure: the numbers

Parameter | Pre-training | Post-training
Hardware | 64 TPU v5e pods | 16 TPU v5e pods
Batch size | 2048 | 256
Learning rate | 1e-4 (cosine decay) | 1e-5 (constant)
Steps | 700K | 50K-100K per task
Duration | ~7 days | ~1 day per task
VLM backbone | Unfrozen (full fine-tune) | Frozen (only action expert trains)
Loss | Flow matching + language modeling | Flow matching only

During pre-training, both the VLM and action expert are trained jointly — the VLM adapts to robot observations while the action expert learns to generate actions. During post-training, the VLM is frozen to prevent catastrophic forgetting, and only the action expert's weights are updated.

The loss function in detail

Pre-training uses a combined loss:

L_total = L_flow + λ * L_language

where L_flow is the flow matching loss (predict the velocity field) and L_language is the standard next-token prediction loss on the language tokens. The language loss keeps the VLM's language understanding sharp during robot fine-tuning. λ = 0.1 in practice: robot learning dominates but language isn't forgotten.

Why both stages matter: Pre-training alone gives breadth but not depth: the model can do many things at a rudimentary level. Post-training alone gives depth but not robustness: the model is brittle. Together, the model performs the task well AND recovers gracefully from mistakes, because the pre-training data includes diverse corrections and recoveries.

What freezing the VLM prevents

Without freezing during post-training, the model exhibits catastrophic forgetting within 10K steps: it becomes excellent at the fine-tuned task but loses the ability to generalize, follow novel instructions, or recover from errors. Freezing the VLM means post-training specializes the motor system without degrading the understanding system.

The action representation: joint angles vs end-effector

pi-0 uses delta joint angles as its primary action representation — each action is "move joint 1 by +0.02 radians, joint 2 by -0.01 radians, ..." This is a deliberate choice over end-effector (Cartesian) control:

The delta values are normalized to [-1, 1] where -1 and +1 correspond to the maximum per-step joint velocity (typically 0.1-0.5 radians per step, depending on the joint). The normalization is robot-specific and handled by the action tokenizer.

Comparison: pi-0 vs contemporaries

To understand pi-0's contribution, compare it to models available at the same time:

Property | RT-2 (Google) | OpenVLA (Stanford) | Octo (Berkeley) | pi-0
Params | 55B | 7B | 93M | 3B
Action type | Discrete (256 bins) | Discrete (256 bins) | Diffusion | Flow matching
Control freq | 3 Hz | 5 Hz | 10 Hz | 50 Hz
Dexterous tasks | No | No | Limited | Yes
Cross-embodiment | 1 robot | 1 robot | 9 robots | 7+ robots
Open source | No | Yes | Yes | No

pi-0 is the smallest model that achieves 50 Hz control AND dexterous manipulation. RT-2 is too large for real-time inference. OpenVLA runs faster but can't do dexterity (discrete tokens). Octo is small and fast but limited in task complexity. pi-0 threads the needle: small enough for 50 Hz, expressive enough for dexterity.

Why does pi-0 need BOTH pre-training and post-training?

Chapter 7: Dexterous Tasks

This is where pi-0 shines — tasks that no previous VLA could handle. The paper demonstrates three showcase tasks that each require 10+ minutes of continuous manipulation:

Laundry folding

The robot fetches laundry from a dryer, packs it into a hamper, brings the hamper to a folding table, then folds each article of clothing. This requires handling deformable objects (fabric) in arbitrary initial configurations — a fundamentally different challenge from rigid object manipulation. The robot must plan a sequence of folds based on the garment type (shirt vs pants vs towel) and adapt to how the fabric lands after each fold.

Concrete data flow during folding: The wrist cameras see the fabric at 224x224. The VLM identifies the garment type and current configuration ("shirt, spread flat, arms extended"). The action expert generates a 1-second chunk of 50 actions, each 14-dimensional (bimanual: 2 arms x (6 joints + gripper)), that initiates the first fold. After execution, fresh images trigger re-planning. A full fold sequence: ~20-40 action chunks = 20-40 seconds per garment.

Table bussing

The robot must clear a table, sorting items into the correct bins: dishes go in a bus bin, trash goes in a trash bin. This requires recognizing novel objects and deciding their category — is this a used napkin (trash) or a plate (dish)?

Why VLM pre-training is essential here: The model must classify objects it has never manipulated before (a particular brand of takeout container, a specific type of utensil). The VLM's web pre-training provides this classification ability — it has seen millions of images of plates, cups, and napkins. The action expert just needs to execute "pick up and place in bin A vs bin B."

Box assembly

The robot assembles a cardboard box from a flat template — folding flaps, tucking tabs, creating a 3D structure from a 2D object. This requires precise bimanual coordination and understanding of the geometric constraints of folding.

Grocery packing: a multi-category challenge

Less dramatic than folding but equally important: the robot packs groceries into bags. This tests a different skill profile — rapid object classification (is this fragile? heavy? cold?) combined with sequential bin-packing planning (heavy items on the bottom, eggs on top). The VLM's web pre-training is essential here: it knows that eggs are fragile and canned goods are heavy without being told, because this knowledge is embedded in the billions of image-text pairs it was trained on.

Grocery packing also reveals a subtlety about action representation: the robot must vary its grasp force based on object fragility (gentle for bread, firm for cans), but the action space doesn't include explicit force dimensions. Instead, the model learns to use slow, careful motions as a proxy for gentle handling — slower approach = softer contact. This is an emergent behavior, not an engineered feature.

Why these tasks matter: Previous VLAs demonstrated pick-and-place tasks lasting 10-30 seconds. pi-0 demonstrates tasks lasting 10+ minutes with complex multi-stage behaviors. This is a qualitative leap — from robot "tricks" to robot "work."

What degrades gracefully vs. catastrophically

Through deployment experience, pi-0 reveals clear degradation patterns:

Condition | Effect | Severity
Novel object (similar shape) | Works: VLM generalizes | Minimal
Novel object (very different shape) | Grasps fail: action expert hasn't seen this geometry | Moderate
Changed lighting | Minor performance drop: SigLIP is robust to illumination | Low
Changed camera angle | Significant degradation: spatial mapping breaks | High
Ambiguous instruction | Model chooses one valid interpretation, usually reasonable | Low-moderate
Completely new environment layout | Fails: workspace mapping is environment-specific | Critical

The last failure mode — new environments — is exactly what pi-0.5 was designed to solve.

Inference latency breakdown for folding

During a laundry-folding episode (bimanual, ~2 minutes per garment), the inference pipeline runs continuously:

Step | Time | Frequency
2 wrist cameras capture | 5ms | Every 20ms (50 Hz)
SigLIP: 2 images → 512 tokens | 6ms | Every 500ms (2 Hz re-plan)
Gemma forward pass | 8ms | Every 500ms
Flow matching: 10 steps | 20ms | Every 500ms
Execute 25 bimanual actions | 500ms | At 50 Hz
Total per re-plan cycle | ~540ms |

The robot executes the first 25 actions open-loop while the next chunk is being computed. No idle time. For a 2-minute folding sequence, this means ~240 re-plan cycles and ~6000 individual motor commands.
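
The cycle counts above follow from simple arithmetic, written out here with the stage timings from the table. The dictionary keys are illustrative names.

```python
# Stage timings in milliseconds, taken from the latency table above.
stage_ms = {"capture": 5, "siglip": 6, "gemma": 8, "flow_matching": 20, "execute": 500}

serial_total_ms = sum(stage_ms.values())             # if nothing overlapped
compute_ms = serial_total_ms - stage_ms["execute"]   # perception + planning only

REPLAN_PERIOD_S = 0.5    # one re-plan per 25 executed actions at 50 Hz
ACTIONS_PER_CYCLE = 25
EPISODE_S = 120          # a 2-minute folding sequence

cycles = int(EPISODE_S / REPLAN_PERIOD_S)
motor_commands = cycles * ACTIONS_PER_CYCLE

print(f"serial total: {serial_total_ms} ms, compute only: {compute_ms} ms")
print(f"re-plan cycles: {cycles}, motor commands: {motor_commands}")
```

The ~540ms total is the serial sum of the stages. Because the ~39ms of compute overlaps with execution, the effective cycle time is the 500ms execution window, which is what yields 240 cycles in two minutes.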

What makes laundry folding fundamentally harder than pick-and-place for a VLA?

Chapter 8: Results

pi-0 is evaluated in three settings: out-of-the-box (zero-shot), language-conditioned, and fine-tuned to new tasks.

Out-of-the-box performance

Without any task-specific fine-tuning, pi-0 outperforms all baselines (OpenVLA, Octo, pi-0-small) across every task. Even a version trained for only 160K steps (matching the baselines' training budget) still wins, showing the advantage is architectural, not just from more training.

[Figure: Performance Comparison]

The inference pipeline in deployment

What actually happens on the robot at runtime:

  1. Capture: 3 cameras capture 224x224 images at 10 Hz (every 100ms).
  2. Encode: SigLIP processes all 3 images in parallel — ~8ms on A100.
  3. Plan: Gemma transformer forward pass with all tokens — ~10ms.
  4. Denoise: 10 flow matching steps to generate 50-action chunk — ~20ms.
  5. Execute: Send first 25 actions to robot at 50 Hz — 500ms of open-loop execution.
  6. Re-plan: While executing actions 13-25, start planning the next chunk with fresh images.

Total latency from observation to first action: ~40ms. Effective control loop: closed at 2 Hz (re-plan every 500ms), but actions execute at 50 Hz within each chunk. This pipelined approach means the robot never waits for computation — there's always an action chunk ready to execute.

Language following

When given intermediate language commands from a human expert ("pick up the red plate, put it in the bin"), pi-0 significantly outperforms pi-0-small (which lacks VLM pre-training). This confirms that the VLM backbone's language understanding directly translates to better instruction following.

Fine-tuning to new tasks

pi-0 can be efficiently fine-tuned to entirely new tasks not seen during pre-training. Even "hard" tasks (like paper towel replacement — no similar task in pre-training) achieve strong performance with moderate amounts of fine-tuning data.

Fine-tuning data efficiency: worked example

How much data does post-training actually need? The paper shows a clear data-efficiency curve:

| Post-training Data | Task | Success Rate |
|---|---|---|
| 0 demos (zero-shot) | Simple pick-and-place | ~65% (from pre-training alone) |
| 50 demos (~2 hours) | Simple pick-and-place | ~85% |
| 200 demos (~8 hours) | Laundry folding | ~60% |
| 500 demos (~20 hours) | Laundry folding | ~80% |
| 1000 demos (~40 hours) | Laundry folding | ~85% |

The key insight: pre-training makes each post-training demo go much further, and the returns from additional demos diminish quickly. Without pre-training, 200 folding demos give ~20% success. With pre-training, the same 200 demos give ~60%, and going from 500 to 1000 demos adds only about 5 points. The pre-trained model already knows what fabric looks like, how to plan grasps, and how to maintain smooth trajectories. Post-training just needs to teach the specific fold sequences.
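
To make the diminishing returns concrete, the sketch below computes the gain per 100 additional demos between consecutive laundry-folding rows. The (demos, success) pairs are copied from the table above; nothing here is measured, and the variable names are illustrative.

```python
# Marginal gain per additional demo, from the laundry-folding rows above.
# (demos, success rate in %) pairs are copied from the table, not measured here.
folding_curve = [(200, 60), (500, 80), (1000, 85)]

gains = []
for (d0, s0), (d1, s1) in zip(folding_curve, folding_curve[1:]):
    gain_per_100 = (s1 - s0) / (d1 - d0) * 100  # percentage points per 100 demos
    gains.append(gain_per_100)
    print(f"{d0} -> {d1} demos: +{gain_per_100:.1f} pp per 100 demos")
```

By this measure, the second interval yields far less per demo than the first, which is the usual shape of a saturating learning curve.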

Key finding: The combination of a large VLM backbone + flow matching + diverse pre-training is not just incrementally better; it's qualitatively different. pi-0 performs dexterous tasks that the discrete-token baselines compared here could not.

What does the comparison between pi-0 and pi-0-small (no VLM pre-training) reveal?

Chapter 9: Connections

pi-0 is the foundation that spawned an entire family of models from Physical Intelligence:

pi-0 (2024)
VLM + flow matching. The foundation model.
FAST (2025)
Better action tokenization for discrete pre-training.
pi-0.5 (2025)
Open-world generalization via co-training on heterogeneous data.
RTC (2025)
Real-time chunking for practical deployment.
pi*0.6 (2025)
Learning from experience via RL — going beyond imitation.

What would break if you changed one thing

To understand why pi-0 works, it helps to ask what happens if you remove or change each component:

| Change | What Happens | Why |
|---|---|---|
| Remove VLM pre-training | Language following breaks. Novel object recognition fails. Task success drops ~30 pp. | No visual-semantic grounding. Model must learn "what is a cup" from robot data alone. |
| Replace flow matching with DDPM | Inference becomes 5-10x slower. Must reduce to 5-10 Hz control. Dexterous tasks degrade. | DDPM needs 50-100 steps vs flow matching's 10. Can't meet 50 Hz budget. |
| Remove action expert (single model) | VLM's language understanding degrades after 50K steps. Catastrophic forgetting. | Action-specific gradients corrupt VLM weights optimized for language. |
| Single-step prediction (H=1) | Jerky motions. Fabric folding impossible. Temporal coherence lost. | No ability to plan smooth trajectories longer than 20ms. |
| Remove cross-embodiment data | Performance on trained tasks unchanged. But fine-tuning to new robots requires more data. | Cross-embodiment provides general manipulation priors, not task-specific skills. |
| Replace PaliGemma 3B with 7B VLM | Better language understanding, but inference too slow for 50 Hz. Must drop to 10 Hz. | Larger model = more FLOPs per forward pass. Real-time constraint is hard. |

This table reveals that pi-0's design is tightly constrained. Every choice is load-bearing. The 50 Hz real-time requirement alone eliminates most alternatives.
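
The step counts in the table refer to how many times the denoising network is evaluated per action chunk. The toy sketch below shows flow-matching sampling as fixed-step Euler integration of a velocity field. It is not pi-0's network: the learned field is replaced by a closed-form stand-in, `toy_velocity`, that points along the straight line to a known target, and the function names and the single-joint chunk are assumptions made for illustration.

```python
import random

def toy_velocity(x, tau, target):
    """Stand-in for a learned velocity field (hypothetical).

    For a straight-line path from noise (tau=0) to `target` (tau=1), the
    velocity that keeps a point on that line is (target - x) / (1 - tau).
    """
    return [(t - xi) / (1.0 - tau) for xi, t in zip(x, target)]

def sample_chunk(target, num_steps=10, seed=0):
    """Integrate from Gaussian noise to an action chunk with fixed-step Euler."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in target]   # start from pure noise
    delta = 1.0 / num_steps
    for k in range(num_steps):
        tau = k / num_steps                      # tau runs 0.0, 0.1, ..., 0.9
        v = toy_velocity(x, tau, target)         # one "network" call per step
        x = [xi + delta * vi for xi, vi in zip(x, v)]
    return x

# A 50-action chunk for a single joint: a smooth ramp from 0.0 to 0.49 rad.
target_chunk = [0.01 * i for i in range(50)]
chunk = sample_chunk(target_chunk, num_steps=10)
max_err = max(abs(c - t) for c, t in zip(chunk, target_chunk))
print(f"max error after 10 Euler steps: {max_err:.2e}")
```

Because the toy field is exactly linear, Euler integration lands on the target for any step count. A real learned field is only approximately straight, so the number of steps trades accuracy against compute, with each step costing one forward pass of the denoising network.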

The memory footprint at inference

Running pi-0 on a robot requires careful memory management:

| Component | GPU Memory | Notes |
|---|---|---|
| SigLIP encoder (400M params) | ~0.8 GB (FP16) | Frozen, loaded once |
| Gemma backbone (2.4B params) | ~4.8 GB (FP16) | KV-cache adds ~0.5 GB |
| Action expert (600M params) | ~1.2 GB (FP16) | Active during denoising |
| Action tokens (50 x 2048) | ~2 MB (FP16) | 10 copies for denoising steps |
| Total | ~7.3 GB | Fits on A100 (80 GB) with room |

The model comfortably fits on a single A100. For edge deployment, INT8 quantization halves the weight memory to ~3.4 GB, fitting on an NVIDIA Jetson Orin (64 GB shared). However, quantization slightly degrades action precision: acceptable for coarse tasks but problematic for fine dexterity. The paper doesn't quantize for dexterous task evaluations.
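
A rough way to sanity-check memory figures like these is to multiply parameter count by bytes per parameter. The sketch below is a weights-only estimate using the parameter counts quoted in the table; activations, the KV-cache, and framework overhead are not included, so real usage will be higher. The helper name `weights_gb` is illustrative.

```python
# Weights-only memory estimate: parameters x bytes per parameter.
# Parameter counts are the ones quoted in the table; activations, KV-cache,
# and framework overhead are NOT included, so real usage will be higher.
PARAMS = {
    "SigLIP encoder": 400e6,
    "Gemma backbone": 2.4e9,
    "Action expert": 600e6,
}
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1}

def weights_gb(num_params, dtype):
    """Memory for the weights alone, in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9

totals = {}
for dtype in ("fp16", "int8"):
    totals[dtype] = 0.0
    for name, n in PARAMS.items():
        gb = weights_gb(n, dtype)
        totals[dtype] += gb
        print(f"{dtype} {name}: {gb:.1f} GB")
    print(f"{dtype} total weights: {totals[dtype]:.1f} GB")
```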

What pi-0 got wrong (and what pi-0.5/pi-0.7 fixed)

With hindsight, pi-0 had several limitations that became clear in deployment:

Engineering decisions that defined pi-0

Looking back, the key decisions that made pi-0 work — and that subsequent models inherited or modified:

| Decision | Alternative Considered | Why pi-0's Choice Won |
|---|---|---|
| Flow matching (not DDPM) | Diffusion Policy uses DDPM | 5-10x fewer steps = real-time capable |
| Action expert (separate weights) | Single unified model | Prevents catastrophic forgetting |
| Action chunks (H=50) | Single-step prediction | Temporal coherence for smooth motion |
| Joint angles (not end-effector) | Cartesian control | Full expressivity for dexterous tasks |
| PaliGemma 3B (not 7B+) | Larger VLMs | Real-time inference budget |
| Freeze VLM in post-training | Full fine-tuning | Retains generalization |

The 50 Hz constraint: what it rules out

The real-time requirement of 50 Hz control (one action every 20ms) is the single most constraining design decision in pi-0. Here's what it rules out at inference time:

These constraints are why pi-0's architecture is so carefully optimized. Every millisecond matters. Of the alternatives considered above, pi-0's design is the one that meets the real-time budget while maintaining VLM-level understanding.

In the broader field

| Model | Action Repr. | Key Difference from pi-0 |
|---|---|---|
| RT-2 | Discrete tokens (256 bins) | Coarse actions, no dexterity |
| OpenVLA | Discrete tokens | Open-source but limited precision |
| Octo | Diffusion | Diffusion head but smaller model |
| pi-0 | Flow matching | VLM + flow matching + action expert |
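
To see why 256 bins per dimension limits precision, the sketch below applies uniform binning to a slow, smooth joint motion. The joint range of plus or minus pi radians and the helper `quantize` are assumptions for illustration; real discrete-token models may choose bin edges differently, for example per dimension from the data range.

```python
import math

def quantize(value, low, high, bins=256):
    """Snap a continuous value to the center of one of `bins` uniform bins."""
    width = (high - low) / bins
    index = min(int((value - low) / width), bins - 1)  # clamp the top edge
    return low + (index + 0.5) * width

# Assumed joint range for illustration only: +/- pi radians.
LOW, HIGH = -math.pi, math.pi
bin_width = (HIGH - LOW) / 256
worst_case_error = bin_width / 2

print(f"bin width: {math.degrees(bin_width):.2f} deg")
print(f"worst-case rounding error: {math.degrees(worst_case_error):.2f} deg")

# A slow, smooth motion collapses into a staircase after binning.
smooth = [0.001 * i for i in range(50)]            # 1 mrad per step
stepped = [quantize(v, LOW, HIGH) for v in smooth]
print(f"distinct values: {len(set(smooth))} smooth vs {len(set(stepped))} binned")
```

Under these assumptions, fifty distinct commanded positions map onto only a couple of bin centers, so the motion that reaches the robot is a staircase rather than a ramp.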
"A human being should be able to change a diaper, plan an invasion, butcher a hog... Specialization is for insects."
— Robert A. Heinlein (quoted in the paper)