pi-0.5 — Veanors

Chapter 0: The Problem

Robots work beautifully in labs. They pick objects, stack blocks, pour liquids — as long as the environment matches what they trained on. Take that same robot to a new kitchen — different layout, different objects, different lighting — and it fails catastrophically.

This is the open-world generalization problem: the gap between controlled lab demos and useful real-world deployment. It is arguably the biggest unsolved challenge in robotics.

The core question: Can a robot learn to clean a kitchen it has never seen before? Not just pick up a single known object — but perform 10-15 minute multi-stage tasks (close cabinets, put items away, wipe spills, load the sink) in a completely new home?

Previous VLAs (RT-2, OpenVLA, even pi-0) generalize to new objects and minor scene variations. But they still fail in truly new environments — new room layouts, new furniture, new spatial arrangements. pi-0.5 is the first system to demonstrate this level of generalization.

pi-0's specific failure: a worked example

Consider pi-0 deployed in Kitchen A (trained) vs Kitchen B (new). Suppose that in Kitchen A it achieves 85% success on "put the mug in the sink." In Kitchen B, three things change:

The sink is 30cm to the right compared to Kitchen A. pi-0's action expert has learned a specific reaching trajectory that targets Kitchen A's sink position. In Kitchen B, the mug lands on the counter next to the sink.
The mug is a different shape (travel thermos vs ceramic mug). SigLIP recognizes both as "drinkware," but the grasp strategy (top-down vs side grasp) needs to be different. pi-0 attempts Kitchen A's grasp and drops the thermos.
The lighting is warmer (tungsten vs LED). The mug's color shifts in the image. Minor effect — SigLIP is robust to this.

The core failure: pi-0 has memorized where things are in its training environments rather than learning a general understanding of spatial relationships. pi-0.5 fixes this by training on enough environments that the model can't memorize any single layout.

What "new environment" actually means

Let's be precise about what changes when the robot enters an unseen kitchen:

Spatial layout: Cabinets are in different positions. The sink is on a different wall. Counter depth varies. The robot's workspace geometry is completely different.
Object instances: Different brands, shapes, materials. A ceramic mug vs. a metal thermos. A speckled countertop vs. white marble.
Lighting conditions: Different color temperature, shadow directions, reflections off surfaces.
Camera viewpoints: The robot's base position differs, meaning identical objects appear at different pixel coordinates.

Each of these changes is individually manageable. But all of them happening simultaneously creates a combinatorial explosion that overwhelms models trained on a small number of environments. pi-0 was trained in ~7 lab setups. The real world has millions of kitchens.

A concrete measurement of the generalization gap

The paper quantifies this gap precisely. For "pick up mug and place in sink":

Setting	pi-0 Success Rate	pi-0.5 Success Rate
Same kitchen as training	92%	90%
Same kitchen, new objects	75%	82%
New kitchen, familiar objects	25%	72%
New kitchen, new objects	12%	58%

pi-0 drops from 92% to 12% when both environment and objects change — an 80 pp drop. pi-0.5 drops only from 90% to 58% — a 32 pp drop. The gap between "trained environment" and "new everything" shrinks from 80 pp to 32 pp. Still far from solved, but a qualitative improvement.

The combinatorics of environment variation

Let's quantify why "collect more data" fails. A kitchen varies along at least these dimensions:

Dimension	Typical Variation	Approximate States
Counter height	75-95cm	~20 discrete levels
Sink position (left/center/right)	3 common layouts	3
Cabinet types	Hinged, sliding, open shelving	5+
Lighting color temp	2700K-6500K	~10 perceptual categories
Object set (mugs, plates, etc)	Hundreds of brands/styles	~100 common types
Floor/counter material	Tile, wood, granite, laminate	~10

Combinatorial space: 20 x 3 x 5 x 10 x 100 x 10 = 3,000,000 unique kitchen configurations. Even visiting 1000 kitchens covers only 0.03% of this space. Generalization must come from understanding, not memorization.
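
The arithmetic above is easy to check in a few lines. This is a minimal sketch that assumes the dimensions vary independently and that every visited kitchen is a distinct configuration (the best case for coverage). The dictionary keys are illustrative names, not identifiers from the paper.

```python
from math import prod

# Approximate state counts copied from the table above.
dimension_states = {
    "counter_height": 20,
    "sink_position": 3,
    "cabinet_type": 5,
    "lighting_color_temp": 10,
    "object_set": 100,
    "floor_counter_material": 10,
}

total_configs = prod(dimension_states.values())
kitchens_visited = 1000
coverage = kitchens_visited / total_configs  # best case: no repeated configurations

print(f"configurations: {total_configs:,}")
print(f"coverage from {kitchens_visited} kitchens: {coverage:.4%}")
```

Adding one more dimension with even a handful of states multiplies the total again, which is why coverage by enumeration never catches up.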

Why this isn't just "more data"

You might think: just collect data in 1000 kitchens. But even 1000 is insufficient — the space of possible environments is effectively infinite. pi-0.5's insight is that you don't need to visit every possible kitchen. You need to understand what kitchens are, which comes from web data showing millions of kitchen images, combined with robot data teaching manipulation skills that transfer across spaces.

Why do current VLAs fail in new environments?

They can't process images
They overfit to training environments: new layouts and objects are out of distribution
They lack language understanding

Chapter 1: The Key Insight

pi-0.5's breakthrough is simple to state: co-train on everything. Don't just train on your target robot's data. Train on data from other robots, web images, language instructions, high-level task plans, and more — all in one model.

The insight: Knowledge transfers across heterogeneous data sources. A model that has seen thousands of kitchens in web images understands kitchen layouts. A model trained on other robots understands manipulation. Combining these in one VLA gives open-world generalization that no single data source can provide.

Five data sources, one model

Mobile Manipulator data (~400 hours in 50+ environments)

↓

Multi-Environment non-mobile robot data (diverse scenes)

↓

Cross-Embodiment data (different robot types)

↓

High-Level subtask prediction data

↓

Web/VLM data (COCO, QA datasets, interleaved image-text)

↓ all fed into one VLA

What changes from pi-0

pi-0.5 is not a fundamentally different architecture from pi-0. The key differences are:

Aspect	pi-0	pi-0.5
Training data	7 robot types, 68 tasks, OXE	Same + 50+ environments + web/VLM data + HL labels
Inference	Single-stage: image+language → actions	Two-stage: first predict subtask, then predict actions
Target robot	Multiple lab robots	Mobile manipulator in real homes
Task horizon	~30s-2min per task	10-15 minutes per task
Environment diversity	~7 lab setups	50+ real homes and offices
Co-training sources	Robot data only	Robot + web + language + subtask prediction

The architecture is the same PaLiGemma backbone + action expert. What's new is the recipe — the data mixture and the two-stage inference that together enable open-world generalization.

What is the key mechanism that enables pi-0.5's generalization?

Co-training on heterogeneous data sources (other robots, web data, subtask prediction)
A larger neural network
More training time

Chapter 2: VLA Background

A Vision-Language-Action model takes an image and a language instruction as input, and outputs robot motor commands. It's a VLM (like GPT-4V) that has been fine-tuned to produce actions instead of (or alongside) text.

pi-0 recap: the data flow

pi-0 (the predecessor) was a 3B-parameter VLA built on PaLiGemma. Its key innovation was using flow matching as the action head — instead of discretizing actions into tokens (like RT-2), pi-0 generates continuous action trajectories by learning a velocity field that transports noise to actions.

dx/dt = v_θ(x_t, t, observation, instruction)

pi-0.5 builds on pi-0 but adds the co-training recipe and hybrid inference that enable open-world generalization.
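
To make the velocity-field equation concrete, here is a minimal sketch of sampling by Euler integration. The function `toy_velocity` is a hypothetical stand-in for the learned v_θ: it is the velocity of a straight-line path toward one fixed target chunk, and it ignores the observation and instruction that the real model conditions on. The 10 integration steps and the [50, 10] chunk shape follow the figures quoted elsewhere in this guide.

```python
import numpy as np

def toy_velocity(x_t, t, target):
    """Hypothetical stand-in for v_theta: velocity of a straight-line path to `target`."""
    return (target - x_t) / (1.0 - t)

def sample_action_chunk(target, num_steps=10, seed=0):
    """Integrate dx/dt = v(x_t, t) from noise at t=0 toward an action chunk at t=1."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(target.shape)  # start from Gaussian noise
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt  # t never reaches 1, so the division above is safe
        x = x + dt * toy_velocity(x, t, target)  # one Euler step
    return x

# Placeholder "expert" chunk: the first action dimension ramps from 0 to 1.
target_chunk = np.zeros((50, 10))
target_chunk[:, 0] = np.linspace(0.0, 1.0, 50)

chunk = sample_action_chunk(target_chunk)
print(chunk.shape, float(np.abs(chunk - target_chunk).max()))
```

With a single fixed target the straight-line field lands on it after the final step. A trained network instead represents a whole distribution of valid trajectories, which is where the "expressive" in the comparison table comes from.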

Action representations compared

Method	Action Repr.	Pros	Cons
RT-2	Discrete tokens (256 bins)	Simple, uses LLM vocabulary	Coarse, multimodality issues
Diffusion Policy	Denoising diffusion	Handles multimodal actions	Slow inference (many steps)
pi-0 / pi-0.5	Flow matching	Fast, continuous, expressive	Requires post-training stage

The concrete action space for pi-0.5

pi-0.5's target platform is a mobile manipulator with:

7-DOF arm: 7 joint angles (shoulder, elbow, wrist rotations)
Gripper: 1 dimension (open/close, continuous)
Mobile base: 2 dimensions (x velocity, rotation velocity)
Total action per timestep: 10 continuous values
Action chunk size: 50 timesteps at 10 Hz = 5 seconds of motion
Full action tensor: [50, 10] — 500 floating-point numbers per inference call

Note: pi-0.5 runs at 10 Hz (not 50 Hz like pi-0) because mobile manipulation is slower and the higher-level planning stage needs more time. Each 5-second chunk covers one atomic manipulation (reach, grasp, lift, place).
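
The layout above can be written down as array slices. This is only a sketch of the bookkeeping; the slice names and the ordering of dimensions inside the vector are assumptions made for illustration.

```python
import numpy as np

# Layout of one action vector, following the list above (ordering is assumed).
ARM = slice(0, 7)       # 7 joint angles
GRIPPER = slice(7, 8)   # 1 continuous open/close value
BASE = slice(8, 10)     # x velocity, rotation velocity

CHUNK_STEPS = 50
CONTROL_HZ = 10
ACTION_DIM = BASE.stop

chunk = np.zeros((CHUNK_STEPS, ACTION_DIM), dtype=np.float32)
chunk[:, BASE] = [0.5, 0.0]  # drive forward at 0.5 m/s with no rotation

print("chunk shape:", chunk.shape)
print("values per inference call:", chunk.size)
print("seconds of motion:", CHUNK_STEPS / CONTROL_HZ)
```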

Why 10 Hz instead of 50 Hz?

pi-0 runs at 50 Hz because dexterous manipulation (folding fabric, assembling boxes) requires precise, high-frequency control. pi-0.5 runs at 10 Hz because its tasks are coarser:

Mobile base: Moves at ~0.5 m/s. At 10 Hz, each command covers 5cm of motion — sufficient granularity for navigation.
Arm reach-and-grasp: Typical reach takes ~3 seconds. At 10 Hz, that's 30 timesteps — enough to plan a smooth approach trajectory.
Inference budget: At 10 Hz, the model has 100ms per control step (vs 20ms at 50 Hz). This extra time makes room for the more expensive two-stage inference pipeline.

The tradeoff is clear: pi-0.5 sacrifices the fine-grained dexterity of pi-0 in exchange for the two-stage reasoning that enables open-world generalization. You can clean a kitchen at 10 Hz; you can't fold laundry at 10 Hz.
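
The numbers in the list above all fall out of the control rate. A small sketch, using the base speed and reach time quoted there as defaults (the function name is hypothetical):

```python
def control_budget(rate_hz, base_speed_m_s=0.5, reach_seconds=3.0):
    """Per-step quantities implied by a given control rate."""
    return {
        "ms_per_step": 1000.0 / rate_hz,
        "cm_per_base_command": 100.0 * base_speed_m_s / rate_hz,
        "steps_per_reach": reach_seconds * rate_hz,
    }

for rate in (10, 50):
    print(rate, "Hz:", control_budget(rate))
```

At 50 Hz the same reach would take 150 steps and leave only 20ms of compute per step, which is the budget the two-stage pipeline would have to fit into.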

What advantage does flow matching have over discrete action tokens?

It produces continuous actions directly, avoiding discretization artifacts
It uses fewer parameters
It trains faster

Chapter 3: The Architecture

pi-0.5 is built on PaLiGemma, a vision-language model that processes images with a SigLIP vision encoder and generates text with a Gemma language model. The VLA extends this by adding action tokens to the output vocabulary.

Architecture Pipeline

Key design choices

Vision encoder: SigLIP ViT-So400m (frozen during post-training), 400M params
Language model: Gemma-based transformer backbone, ~2.4B params
Action head: Flow matching (continuous) or FAST tokenizer (discrete, for pre-training)
Context: Multiple camera views (2 wrist cameras + forward-facing = 3 x 256 tokens = 768 visual tokens)

The full token budget

Every inference call processes this token sequence through the Gemma transformer:

Token Type	Count	Source
Visual tokens (3 cameras)	768	SigLIP encoder (256 per image)
Language instruction	~20-50	Gemma tokenizer
Subtask prediction (Stage 1 output)	~10-20	Autoregressive generation
Proprioceptive state	1	Projected joint angles
Action tokens (flow matching)	50	Learned embeddings + denoising
Total context	~850-900

This fits comfortably in PaLiGemma's 1024-token context window, though it's tight. Longer instructions or more cameras would require context expansion.
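
Here is the same budget as a quick calculation, including what happens if a fourth camera is added. The row names are illustrative, and the 1024-token window is the figure quoted in the text rather than something checked against the model configuration.

```python
# (min, max) token counts per row of the table above.
token_budget = {
    "visual": (768, 768),
    "language_instruction": (20, 50),
    "subtask_prediction": (10, 20),
    "proprioceptive_state": (1, 1),
    "action_tokens": (50, 50),
}
CONTEXT_WINDOW = 1024       # figure quoted in the text
TOKENS_PER_CAMERA = 256

low = sum(lo for lo, _ in token_budget.values())
high = sum(hi for _, hi in token_budget.values())
with_extra_camera = high + TOKENS_PER_CAMERA

print(f"total: {low}-{high} tokens")
print(f"headroom: {CONTEXT_WINDOW - high}-{CONTEXT_WINDOW - low} tokens")
print(f"with a 4th camera: up to {with_extra_camera} tokens")
```

The worst case leaves a bit over a hundred tokens of headroom, and one extra camera view alone would push the total past the window.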

The mobile manipulator platform

pi-0.5's target robot is fundamentally different from pi-0's tabletop arms. It's a mobile manipulator with:

Mobile base: Omni-directional wheels. Can translate (x, y) and rotate (θ). Max speed: ~0.5 m/s.
7-DOF arm: Mounted on the base. Reaches ~1m from base center. Standard parallel-jaw gripper.
3 cameras: 1 forward-facing (navigation), 2 wrist-mounted (manipulation). All 640x480 RGB, no depth.
Onboard compute: NVIDIA Jetson for image preprocessing. Streams to nearby A100 server via WiFi.

The mobile base adds 2 commanded action dimensions (x velocity and rotation velocity) on top of the arm's 7+1 (joints + gripper). Total: 10 action dimensions per timestep, matching the action space listed in Chapter 2. This is larger than pi-0's typical 7-8 dimensions, which is why the action chunk is [50, 10] instead of [50, 7].

The mobile base also creates a coordination challenge: the robot must simultaneously navigate (base) and manipulate (arm). The model must learn when to move the base (approach a new cabinet) vs. when to keep the base still and move the arm (pick up an object in reach). This base-arm coordination emerges naturally from training data — no explicit planning hierarchy.

What the VLM backbone provides vs. learns

From web pre-training (already knows):

Object recognition: "that's a mug" — even novel mug designs
Scene understanding: "this is a kitchen with a sink on the left"
Language grounding: "put it in the drawer" vs "put it on the counter"

From robot co-training (must learn):

Proprioceptive-visual alignment: where is my arm relative to what I see?
Action semantics: what motor commands correspond to "pick up"?
Subtask decomposition: "clean the kitchen" → sequence of atomic actions

Why PaLiGemma? It provides strong vision-language understanding from web pre-training. pi-0.5 inherits this understanding — it knows what a kitchen looks like, what plates are, how cabinets open — before seeing a single robot demonstration. This is the foundation of open-world generalization: the model has "seen" millions of kitchens through web images.

What is the backbone VLM for pi-0.5?

LLaMA
PaLiGemma (SigLIP + Gemma)
CLIP + GPT-2

Chapter 4: Hybrid Inference

Here's the clever part. At inference time, pi-0.5 runs in two stages using the same model:

Two-Stage Inference

Same model, two roles: The SAME pi-0.5 model first acts as a planner (predicting "pick up the plate") then acts as a controller (predicting motor commands). This is like chain-of-thought reasoning but for robots — think first, then act.

Stage 1: The "Understanding Expert" (slow, ~every 5-10 seconds)

Given the current image and high-level task ("clean the kitchen"), the model autoregressively generates a subtask in natural language ("pick up the plate and put it in the sink").

Timing: This runs every 5-10 seconds — once per atomic manipulation. The model processes all 768 visual tokens + language instruction + history, then generates ~10-20 text tokens describing the next subtask. Latency: ~200-500ms for the autoregressive generation.

Why autoregressive here: Subtask prediction IS a language task. The model leverages its VLM pre-training to reason about what to do next. "I see dirty plates on the counter and an empty sink → the next subtask is to move plates to the sink." This is pure VLM reasoning, no action generation needed.

Stage 2: The "Action Expert" (fast, ~10 Hz)

Given the image + subtask from Stage 1, predict continuous low-level actions via flow matching. This is the same action generation as pi-0: noise → 10 denoising steps → clean action chunk [50, 10].

Timing: Runs at 10 Hz. Each call generates a 5-second action chunk, but only the first 0.5-1 second is executed before re-planning with fresh observations. This gives closed-loop behavior despite the chunk-based architecture.

What if Stage 1 predicts the wrong subtask? The action expert will execute it faithfully — leading to incorrect behavior. But because Stage 1 re-runs every 5-10 seconds, the model self-corrects: it sees the new scene state, recognizes the situation, and predicts a better subtask. This is analogous to a human realizing "wait, I should've grabbed the sponge first" and changing plans.
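
The two-rate loop described above can be sketched as follows. Everything here is a stand-in: `predict_subtask` and `predict_action_chunk` are hypothetical stubs for the two roles of the model, the camera is faked, and the loop executes only the first action of each chunk as a simplification of "execute the start of the chunk, then re-plan."

```python
import numpy as np

def predict_subtask(image, task):
    """Hypothetical Stage 1 stub: returns the next subtask as text."""
    return "pick up the plate" if image.mean() > 0.5 else "wipe the counter"

def predict_action_chunk(image, subtask, steps=50, dims=10):
    """Hypothetical Stage 2 stub: returns a [steps, dims] action chunk."""
    return np.zeros((steps, dims))

def run_episode(get_image, task, duration_s=20.0, control_hz=10, replan_s=5.0):
    """Re-predict the subtask every `replan_s` seconds; query actions every tick."""
    log = []
    subtask = None
    ticks_per_replan = int(replan_s * control_hz)
    for tick in range(int(duration_s * control_hz)):
        image = get_image(tick)
        if tick % ticks_per_replan == 0:
            subtask = predict_subtask(image, task)    # slow path
            log.append((tick / control_hz, subtask))
        chunk = predict_action_chunk(image, subtask)  # fast path
        _command = chunk[0]  # would be sent to the robot before re-observing
    return log

def fake_camera(tick):
    """Bright scene (plates present) for the first 10 s, then a bare counter."""
    return np.full((224, 224, 3), 0.9 if tick < 100 else 0.1)

log = run_episode(fake_camera, "clean the kitchen")
for t, subtask in log:
    print(f"{t:5.1f}s  {subtask}")
```

The self-correction described above shows up at the 10-second mark: the scene changed, so the next Stage 1 call returns a different subtask without any explicit failure detector.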

The inference pipeline on hardware

Stage	Computation	Hardware	Latency
Image capture	3 cameras @ 640x480, resize to 224x224	USB cameras	~10ms
SigLIP encoding	3 images → 768 visual tokens	A100 GPU	~8ms
Stage 1 (subtask)	Autoregressive text generation	A100 GPU	~300ms (every 5-10s)
Stage 2 (actions)	10 flow matching denoising steps	A100 GPU	~30ms (every 100ms)
Motor execution	Send joint commands to robot	Robot controller	~1ms

Total first-action latency: ~50ms when the subtask is already predicted (10 + 8 + 30 + 1 = 49ms from the table above). The robot's onboard NVIDIA Jetson handles image preprocessing and streams the frames to a nearby A100 server, which runs SigLIP encoding and transformer inference.
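
A quick budget check using the table's latencies shows why Stage 1 has to stay off the per-action path. The dictionary keys are illustrative, and network transfer time between robot and server is not listed in the table, so it is left out here too.

```python
# Latencies in milliseconds, copied from the table above.
per_action_path = {
    "image_capture": 10,
    "siglip_encoding": 8,
    "stage2_actions": 30,
    "motor_execution": 1,
}
STAGE1_MS = 300          # runs once per subtask, not once per action
CONTROL_PERIOD_MS = 100  # 10 Hz

first_action_ms = sum(per_action_path.values())
with_stage1_ms = first_action_ms + STAGE1_MS

print("first-action latency:", first_action_ms, "ms")
print("fits in one control period:", first_action_ms <= CONTROL_PERIOD_MS)
print("if Stage 1 ran on every action:", with_stage1_ms, "ms")
```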

A complete kitchen cleaning trace

Here's an actual execution trace from the paper's kitchen experiment (simplified):

Time	Stage 1 Output	Stage 2 Actions	Outcome
0:00	"Close the open cabinet door"	Navigate to cabinet, extend arm, push door	Success
0:35	"Pick up the mug on the counter"	Approach counter, grasp mug, lift	Success
1:10	"Place mug in the sink"	Navigate to sink, lower arm, release	Success
1:45	"Pick up the plate on the counter"	Approach plate, grasp edge, lift	FAIL (plate slipped)
2:00	"Pick up the plate on the counter" (retry)	Re-approach, adjust grasp angle, lift	Success
2:30	"Place plate in the sink"	Navigate to sink, lower, release	Success
3:05	"Wipe the counter with the sponge"	Grasp sponge, wipe in sweeping motion	Partial (missed a spot)

Total: 7 subtasks over ~3.5 minutes (partial trace of a 10-15 minute episode). Note the recovery at 2:00 — the model re-observed the scene, detected the failed grasp, and autonomously re-attempted with a different approach angle. This recovery capability comes from pre-training on diverse data that includes many partial failures and corrections.

How many separate models does pi-0.5 use for planning and control?

One: the same model does both high-level and low-level inference
Two: a separate planner and controller
Three: planner, controller, and vision module

Chapter 5: The Co-Training Recipe

The magic of pi-0.5 is in the data mixture. Five sources of supervision, each contributing something unique:

Data Sources

Source	What It Provides	Volume	Format
MM	Target robot manipulation in 50+ environments	~400 hours	Images + actions + language labels
ME	Diverse scenes from non-mobile robots	~200 hours	Images + actions (different action space)
CE	Different robot types (cross-embodiment)	~500 hours (OXE)	Images + actions (heterogeneous)
HL	Subtask prediction (what to do next)	~50K episodes	Images + text labels (no actions)
Web	Visual understanding (COCO, QA, image-text)	~10M examples	Images + text (no actions, no robot)

How heterogeneous data is unified

These five sources have completely different formats. How do you train one model on all of them? The key: different loss functions applied to different token positions, all in the same forward pass.

MM/ME/CE data: Flow matching loss on action tokens + language modeling loss on text tokens.
HL data: Language modeling loss only (predict the next subtask as text). No action tokens.
Web data: Standard VLM loss (image captioning, VQA). No action tokens, no robot context.

The model sees all data types interleaved in each batch. A single batch might contain: 25% MM episodes, 15% ME, 20% CE, 15% HL, 25% web. The loss computation only applies to the relevant token positions for each example.
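
One way to picture "loss only on the relevant token positions" is a per-source mask over token types, plus a sampler that draws each example's source from the mixture. This is a simplified sketch: it treats the per-token losses as already computed numbers, whereas in the setup described above the action positions would use a flow matching loss and the text positions a language modeling loss. All names are hypothetical.

```python
import numpy as np

# Token types that are supervised for each source, per the list above.
LOSS_TYPES = {
    "MM": {"text", "action"},
    "ME": {"text", "action"},
    "CE": {"text", "action"},
    "HL": {"text"},
    "Web": {"text"},
}
MIXTURE = {"MM": 0.25, "ME": 0.15, "CE": 0.20, "HL": 0.15, "Web": 0.25}

def masked_loss(per_token_loss, token_types, source):
    """Average per-token losses over the positions supervised for `source`."""
    mask = np.array([t in LOSS_TYPES[source] for t in token_types])
    return float(per_token_loss[mask].mean()) if mask.any() else 0.0

def sample_sources(batch_size, seed=0):
    """Draw the data source of each example in a batch according to MIXTURE."""
    rng = np.random.default_rng(seed)
    names = list(MIXTURE)
    return rng.choice(names, size=batch_size, p=[MIXTURE[n] for n in names])

token_types = ["image", "image", "text", "text", "action", "action"]
per_token_loss = np.array([9.0, 9.0, 1.0, 3.0, 5.0, 7.0])

mm_loss = masked_loss(per_token_loss, token_types, "MM")  # text + action positions
hl_loss = masked_loss(per_token_loss, token_types, "HL")  # text positions only
sources = sample_sources(2048)

print("MM loss:", mm_loss)
print("HL loss:", hl_loss)
print({str(k): int(v) for k, v in zip(*np.unique(sources, return_counts=True))})
```

Image positions never contribute in this sketch, and an HL or web example simply has no action positions to supervise.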

Why each source matters — a worked example

Consider the robot encountering a new kitchen with a never-seen coffee mug on the counter:

Web data taught: "That cylindrical object with a handle is a mug. Mugs go in sinks or cabinets."
MM data taught: "To pick up objects from a counter at this height, approach from above with this grasp strategy."
ME data taught: "Countertops have this visual texture. Objects rest stably on flat surfaces."
CE data taught: "Grasping cylindrical objects requires wrapping the fingers — this works regardless of robot type."
HL data taught: "After picking up a dirty mug, the next step is to place it in the sink."

No single source provides all this knowledge. Together, they let the robot successfully handle a novel mug in a novel kitchen.

Why each source matters: Remove any one and performance drops significantly. Web data gives visual understanding. CE gives manipulation skills. HL gives planning ability. They're complementary — the whole is greater than the sum of parts.

What does the Cross-Embodiment (CE) data contribute?

Manipulation skills from different robot types that transfer to the target robot
Only visual understanding
Language understanding

Chapter 6: Two-Stage Training

Training happens in two phases, each with a different action representation:

Stage 1: Pre-training (280K steps)

All data sources. Discrete action tokens via FAST tokenizer. Broad knowledge acquisition.

↓

Stage 2: Post-training

Task-specific data. Flow matching for continuous actions. Fine-grained control.

Why two stages with different action representations?

Pre-training with discrete tokens lets the model learn from ALL data sources in a unified format — robot data, web data, language, subtask predictions — all as token sequences. This is where the broad knowledge comes from.

Post-training with flow matching replaces the discrete action head with a continuous one. Flow matching produces smoother, more precise motor commands than discretized tokens. This is where fine-grained control comes from.

The FAST tokenizer: bridging discrete and continuous

FAST (Frequency-space Action Sequence Tokenization) is a key engineering innovation that makes the two-stage recipe possible. It uses the DCT (Discrete Cosine Transform) to compress action chunks:

Take an action chunk: [50, 10] = 500 continuous values.
Apply DCT per dimension: Transform to frequency domain. Low frequencies capture the trajectory shape; high frequencies capture noise.
Truncate high frequencies: Keep only the top-K DCT coefficients per action dimension (K=8-16). Across 10 dimensions, this reduces 500 values to ~80-160 values.
Quantize: Map each coefficient to one of 1024 discrete tokens.
Result: A 50-timestep, 10-dimensional action chunk is represented as ~80-160 tokens from a 1024-token vocabulary.

This is dramatically more efficient than RT-2's approach (which would need 50 x 10 x 1 = 500 tokens for the same chunk). FAST makes it possible to pre-train on actions using the same token-prediction machinery as language.
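
The steps listed above can be sketched in plain NumPy. This is an illustration of the DCT, truncate, quantize pipeline as described here, not the released FAST tokenizer. The function names, the fixed quantization range `scale`, and the toy sine trajectories are all assumptions, so the error values it prints are not comparable to figures reported for real robot data.

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II matrix of size n x n (rows are basis functions)."""
    k = np.arange(n)[:, None]
    i = np.arange(n)[None, :]
    m = np.sqrt(2.0 / n) * np.cos(np.pi * (i + 0.5) * k / n)
    m[0] /= np.sqrt(2.0)
    return m

def encode(chunk, keep=8, bins=1024, scale=8.0):
    """DCT per dimension, keep the `keep` lowest frequencies, quantize to `bins` levels."""
    coeffs = dct_matrix(chunk.shape[0]) @ chunk   # [T, D] in the frequency domain
    low = np.clip(coeffs[:keep], -scale, scale)   # drop the high frequencies
    return np.round((low + scale) / (2 * scale) * (bins - 1)).astype(int)

def decode(tokens, steps=50, bins=1024, scale=8.0):
    """Undo the quantization and the DCT; dropped frequencies are treated as zero."""
    low = tokens / (bins - 1) * (2 * scale) - scale
    coeffs = np.zeros((steps, tokens.shape[1]))
    coeffs[: tokens.shape[0]] = low
    return dct_matrix(steps).T @ coeffs           # transpose inverts an orthonormal DCT

# Smooth toy trajectories: one sine per action dimension, values in [-1, 1].
t = np.linspace(0.0, 1.0, 50)[:, None]
chunk = np.sin(np.pi * t * np.arange(1, 11)[None, :] / 4)

results = {}
for keep in (4, 8, 16):
    tokens = encode(chunk, keep=keep)
    rmse = float(np.sqrt(np.mean((decode(tokens) - chunk) ** 2)))
    results[keep] = (tokens.size, rmse)
    print(f"K={keep:2d}: {tokens.size} tokens, RMSE={rmse:.4f}")
```

The token count is K times the number of action dimensions, so the choice of K trades reconstruction error directly against context-window space.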

Reconstruction quality of FAST

How much information is lost in the DCT truncation? For typical manipulation trajectories:

K=8 coefficients: Reconstructs trajectories with <2% RMSE. Smooth motions captured perfectly. Fine jitter lost — but jitter is harmful anyway.
K=4 coefficients: ~5% RMSE. Adequate for mobile base navigation but loses dexterous precision.
K=16 coefficients: Near-perfect (<0.5% RMSE) but 160 tokens per chunk — too many for the context window.

K=8 is the sweet spot for pre-training. The model learns general trajectory shapes while leaving room for language and visual tokens. Post-training with flow matching then recovers full continuous precision that FAST's quantization loses.

Training infrastructure

Parameter	Pre-training (Stage 1)	Post-training (Stage 2)
Hardware	64 TPU v5e pods	16 TPU v5e pods
Steps	280K	50K-100K
Batch size	2048 (mixed sources)	256 (target robot only)
Learning rate	1e-4 → 1e-5 (cosine)	5e-6 (constant)
Duration	~5 days	~1-2 days
Action representation	FAST discrete tokens	Flow matching (continuous)
VLM backbone	Unfrozen (adapts)	Frozen (preserved)
Data sources	All 5 (MM+ME+CE+HL+Web)	MM only (target robot)

Think of it this way: Pre-training teaches the model what to do (broad understanding from diverse sources). Post-training teaches it how to do it precisely (fine motor control on the target platform). The discrete-to-continuous transition is the bridge between understanding and execution.
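
The pre-training learning-rate row in the table (1e-4 → 1e-5, cosine) can be written as a one-line schedule. This sketch assumes a plain cosine decay over all 280K steps with no warmup; the table does not say whether warmup or a different decay horizon was used, and the function name is hypothetical.

```python
import math

def cosine_lr(step, total_steps=280_000, lr_start=1e-4, lr_end=1e-5):
    """Cosine decay from lr_start at step 0 to lr_end at total_steps."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return lr_end + 0.5 * (lr_start - lr_end) * (1.0 + math.cos(math.pi * progress))

for step in (0, 70_000, 140_000, 210_000, 280_000):
    print(f"step {step:>7,}: lr = {cosine_lr(step):.2e}")
```

Post-training needs no schedule function at all under the table's settings, since its learning rate is a constant 5e-6.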

What the VLM forgets without web co-training

An ablation study showed that removing web data during pre-training causes a 15-20% drop in novel object recognition. The model can still manipulate objects it saw during robot training, but it fails to generalize to new objects, precisely because it has forgotten the visual recognition capabilities it acquired from web pre-training. The co-training recipe prevents this forgetting.

With vs without web co-training: qualitative examples

Scenario	Without Web Data	With Web Data
Novel mug (travel thermos)	Doesn't recognize as drinkware. Ignores it.	Recognizes as mug variant. Adapts grasp.
Spill on dark granite counter	Can't segment spill. Doesn't attempt to clean.	Detects via texture/reflection. Initiates wipe.
Open cabinet (glass-front style)	Fails — never seen this cabinet type in robot data.	Understands "cabinet" as a category. Adapts approach.
"Put away the utensils"	Can't identify the utensil drawer in a new kitchen.	Recognizes utensil tray pattern from web images. Opens correct drawer.

The web data doesn't teach the robot to move — it teaches it to see. For open-world generalization, seeing correctly is half the battle.

Why use discrete tokens in pre-training but flow matching in post-training?

Discrete tokens enable unified training on all data types; flow matching gives precise continuous control
Discrete tokens are faster
Flow matching can't handle language data

Chapter 7: Experiments

The headline result: pi-0.5 can clean kitchens and bedrooms in entirely new homes not seen during training. These are 10-15 minute tasks involving multiple stages.

Success Rates in New Homes

What the robot does in a new kitchen

Given "clean the kitchen," pi-0.5 autonomously:

Navigates to the kitchen area (mobile base)
Surveys the scene (rotates, captures images from multiple angles)
Predicts first subtask: "close the open cabinet door"
Executes: approaches cabinet, grasps door edge, pushes closed
Re-observes: "good, cabinet is closed. Next: pick up the mug on the counter"
Executes: approaches mug, grasps, lifts, navigates to sink, places
Continues through 8-12 subtasks over 10-15 minutes

Each step is predicted by the model itself (high-level subtask), then executed with flow matching actions. The full sequence takes 10-15 minutes.

What degrades in new environments

Even pi-0.5 has limits. The degradation patterns reveal what the model has truly learned vs. what it has memorized:

Condition	Success Rate Impact	Explanation
New kitchen, familiar objects	~75% (mild drop)	Layout understanding generalizes from web data
New kitchen, novel objects	~55% (moderate drop)	VLM recognizes most objects but grasping novel shapes is harder
Very cluttered scenes	~40% (significant drop)	Visual segmentation struggles with many overlapping objects
Glass/transparent objects	~30% (large drop)	Depth estimation fails — cameras can't detect transparent surfaces well
Narrow spaces (under cabinets)	~35% (large drop)	Mobile base can't position arm correctly — workspace constraint
Ambiguous instructions	~50% (moderate)	Model picks a reasonable interpretation but may not match user intent

The mobile base challenge

Mobile manipulation adds a dimension of difficulty absent in tabletop robotics: the robot must decide when and where to move its base. This decision is implicit in the action space, since the model outputs base velocities alongside arm joint and gripper commands. But the consequences are very different:

Moving the base changes the entire visual scene. All spatial relationships between the arm and objects change. The VLM must re-parse the scene after every base movement.
Base odometry drifts. Wheel slip on different floor surfaces (tile vs carpet) causes position errors that accumulate. After 10 minutes of movement, the robot may be 5-10cm from where it thinks it is. The vision system must handle this gracefully.
Navigation near obstacles. The robot must avoid furniture, walls, and people while navigating to the next manipulation target. There is no explicit obstacle map — the model learns collision avoidance from training data where teleoperators navigated around obstacles.

These challenges explain why pi-0.5 runs at 10 Hz (not 50 Hz) and uses 5-second action chunks (not 1-second): mobile tasks require more planning horizon per chunk, and the base can't respond to new observations as quickly as a stationary arm.

The 10-minute failure modes

Over 10-15 minute tasks, errors compound. The most common failure patterns:

Subtask prediction loops: The model predicts "pick up the plate" → fails → sees the same scene → predicts "pick up the plate" again. No fallback strategy.
Drift accumulation: Small positioning errors from the mobile base compound. After 10 subtasks, the robot may be 5-10cm off from where it thinks it is.
Recovery from drops: If an object is dropped, the model must re-observe, recognize the new state, and re-plan. This works ~60% of the time.

First of its kind: To the authors' knowledge, this is the first demonstration of an end-to-end learning-enabled robotic system performing long-horizon, dexterous manipulation in entirely new real-world environments.

How long are the kitchen cleaning tasks pi-0.5 performs?

10-30 seconds
1-2 minutes
10-15 minutes with multiple stages

Chapter 8: Scaling

How does generalization scale with the number of training environments? The paper shows a clear trend: more diverse training scenes = better generalization to new scenes.

Generalization vs Training Scenes

The scaling curves with actual numbers

The paper tests performance as the number of unique training environments increases:

Training Environments	Success in New Homes	Gain vs Previous Row	Gain per Added Environment
5 environments	~15%	Baseline	Baseline
10 environments	~30%	+15 pp	~3 pp
25 environments	~50%	+20 pp	~1.3 pp
50+ environments	~70%	+20 pp	~0.8 pp (diminishing)

The scaling is roughly logarithmic: each doubling of the environment count adds about 15-20 pp, so each additional environment contributes less than the one before it. But crucially, performance hasn't plateaued at 50 environments. More environments would likely continue improving performance, just at a lower rate per environment added.

But the scaling isn't just about quantity — it's about diversity. Adding more data from the same kitchen helps less than adding data from a different kitchen. And adding data from a completely different robot or from the web helps in ways that more same-robot data cannot.

Data collection cost

A practical concern: how expensive is this data to collect?

Data Source	Cost per Hour	Hours Needed	Total Cost
MM (teleoperated)	~$50/hr (operator + robot time)	~400	~$20K
ME (multi-environment)	~$50/hr	~200	~$10K
CE (cross-embodiment, OXE)	Pre-existing public data	~500	$0 (already collected)
HL (subtask labels)	~$5/episode annotation	50K episodes	~$250K
Web data	Free (public datasets)	N/A	$0
Total estimated			~$280K

The most expensive component is the human annotation (HL labels), not the robot data collection. This suggests that auto-labeling (using VLMs to generate subtask labels from video) could dramatically reduce the cost of scaling. pi-0.7 explores this direction with automated subtask annotation.
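
The cost table reduces to a few multiplications, which makes the annotation share explicit. The unit costs and quantities are the table's estimates; the dictionary layout is just for illustration.

```python
# (unit cost in USD, quantity) per source, copied from the table above.
costs = {
    "MM": (50, 400),     # $/hour, hours
    "ME": (50, 200),     # $/hour, hours
    "CE": (0, 500),      # pre-existing public data
    "HL": (5, 50_000),   # $/episode, episodes
    "Web": (0, 0),       # public datasets
}

totals = {name: unit * qty for name, (unit, qty) in costs.items()}
grand_total = sum(totals.values())
hl_share = totals["HL"] / grand_total

for name, usd in totals.items():
    print(f"{name:>3}: ${usd:>9,}")
print(f"total: ${grand_total:,} (HL share: {hl_share:.0%})")
```

Under these estimates, subtask annotation accounts for roughly nine tenths of the total, so halving the per-episode labeling cost would save far more than halving robot hours.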

Compute scaling

The relationship between compute and performance:

Pre-training steps: Performance improves log-linearly from 50K to 280K steps. Returns diminish after 200K.
Model size: Going from 1B to 3B parameters gives meaningful gains. The paper doesn't test larger models (GPU memory constraints at inference).
Batch size: Larger batches (more diverse examples per step) help more than equivalent extra steps with smaller batches. This confirms that exposure to diversity is the key factor.

The scaling insight: Generalization scales with the diversity of the data mixture, not just its volume. This is why co-training on heterogeneous sources (robots + web + language) outperforms simply collecting more robot data. 10M web images provide more "kitchen understanding" than 1000 more hours of robot data in existing kitchens.

What matters more for generalization — data volume or data diversity?

Diversity of data sources and environments
Total number of training examples
Training time

Chapter 9: Ablations

The paper systematically removes each data source to measure its contribution. The finding: every source matters.

Ablation	Effect on New Homes	Why
Remove ME (Multi-Env)	-18 pp	Scene diversity is critical — fewer environments = more overfitting to layouts
Remove CE (Cross-Embodiment)	-12 pp	Manipulation knowledge from other robots provides grasp primitives
Remove HL (High-Level)	-15 pp on long tasks	Model loses planning ability — gets stuck after 2-3 subtasks
Remove Web data	-20 pp on novel objects	Visual understanding suffers — can't recognize unfamiliar items
Remove verbal instructions	-10 pp	Task specification becomes ambiguous — model guesses what to do

The ablation that surprised the authors

The most surprising ablation result: removing verbal instructions (just providing images, no language) only drops performance by 10 pp. This suggests that pi-0.5 has learned to infer task intent from visual context alone — seeing a messy kitchen is enough to trigger cleaning behavior without being told "clean the kitchen." The VLM's web pre-training includes millions of before/after images of clean vs messy spaces, implicitly teaching the concept of "this needs cleaning."

However, language becomes critical when the task is ambiguous ("put the mug in the cabinet" vs "put the mug in the sink" — visually, both cabinets and sinks are present). Without language, the model defaults to the statistically most common action for each object, which is often but not always correct.

The interaction effects

More interesting than individual ablations are the interaction effects. Removing two sources simultaneously hurts more than the sum of removing each individually:

Remove CE + Web together: -40 pp (vs -12 + -20 = -32 pp individually). They're synergistic — web data helps the model understand what CE robots are doing.
Remove HL + ME together: -38 pp (vs -15 + -18 = -33 pp individually). Planning ability requires scene diversity to generalize.

This super-additive degradation indicates that the data sources do not just add independent knowledge; they amplify each other. The whole is greater than the sum of its parts.
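
The comparison above is simple arithmetic, and a minimal sketch makes it explicit. The drops are copied from the ablation table and the two joint ablations quoted in the text; the dictionary and function names are illustrative assumptions, not anything from the paper. Note that the HL and Web figures were measured on long tasks and novel objects respectively, so summing them against the other drops is only a rough comparison.

```python
# Individual ablation drops in percentage points (pp), from the table above.
single_drop = {"ME": 18, "CE": 12, "HL": 15, "Web": 20}

# Joint ablation drops (pp) quoted in the text.
joint_drop = {("CE", "Web"): 40, ("HL", "ME"): 38}

def interaction_excess(pair):
    """Joint drop minus the sum of the two individual drops (pp)."""
    additive = sum(single_drop[source] for source in pair)
    return joint_drop[pair] - additive

for pair in joint_drop:
    print(pair, "excess:", interaction_excess(pair), "pp")
```

A positive excess means the pair loses more together than an additive model predicts: 8 pp for CE + Web and 5 pp for HL + ME.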

What about depth cameras?

An interesting negative result: adding depth cameras (RGBD instead of RGB) did NOT significantly help. The model learns depth understanding implicitly from stereo cues across its three cameras. This is important because depth cameras are fragile, expensive, and fail on transparent/reflective surfaces. pi-0.5 works with cheap RGB cameras only.

The FAST tokenizer ablation

The paper also ablates the two-stage training approach (FAST discrete → flow matching) vs single-stage alternatives:

Approach	Pre-training Data	Post-training	Result
FAST → Flow (pi-0.5)	All 5 sources	Flow matching	Best overall
Flow only	Robot data only (can't use web/HL)	Flow matching	-20 pp (no co-training)
FAST only	All 5 sources	FAST discrete	-8 pp (quantization artifacts)
RT-2 style tokens	All 5 sources	256-bin tokens	-15 pp (coarse + no fine control)

The two-stage approach wins because it gets the best of both worlds: FAST enables co-training on heterogeneous data (you need tokens to train alongside language), and flow matching provides the continuous precision that discrete tokens can't match.
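
To make the "coarse" label on 256-bin tokens concrete, here is a minimal sketch of uniform per-dimension binning. It illustrates only that baseline, not FAST and not the paper's code. The [-1, 1] action range and all function names are assumptions made for illustration.

```python
def make_binner(n_bins=256, low=-1.0, high=1.0):
    """Uniform binning of a scalar action; the [-1, 1] range is an assumption."""
    width = (high - low) / n_bins

    def encode(a):
        a = min(max(a, low), high)                      # clip to the valid range
        return min(int((a - low) / width), n_bins - 1)  # bin index in [0, n_bins - 1]

    def decode(idx):
        return low + (idx + 0.5) * width                # reconstruct at the bin centre

    return encode, decode, width

encode, decode, width = make_binner()

# Worst round-trip error over a grid of actions in [-1, 1].
grid = [x / 1000.0 for x in range(-1000, 1001)]
worst = max(abs(decode(encode(a)) - a) for a in grid)
print(f"bin width = {width:.6f}, worst round-trip error = {worst:.6f}")
```

Under these assumptions the round-trip error is bounded by half a bin width, a floor that no amount of training removes. A continuous flow-matching head has no such floor.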

No single source is dispensable. The co-training recipe works because each source provides something the others cannot. This is the paper's strongest empirical finding — and the key engineering lesson for anyone building a VLA.

What happens when you remove the high-level subtask prediction data?

Long-horizon tasks degrade because the model loses its planning ability
No effect: low-level control is sufficient
Only language understanding is affected

Chapter 10: Connections

pi-0.5 represents a major step in the VLA lineage:

2022: RT-1, the first large-scale robot transformer (130K episodes)
↓
2023: RT-2, the first VLA (VLM fine-tuned for actions)
↓
2024: OpenVLA / Octo, open-source VLAs with cross-embodiment training
↓
2024: pi-0, flow matching action head and dexterous manipulation
↓
2025: pi-0.5, open-world generalization via heterogeneous co-training

What pi-0.5 means for deployment economics

pi-0.5 changes the economics of robot deployment. Before pi-0.5, deploying a robot in a new home required:

Send robot to new home
Collect 50-100 hours of demonstration data in that specific home
Fine-tune the model on that data (~2 days of compute)
Deploy the fine-tuned model

Cost per home: ~$5,000-10,000 in human labor + compute. Time to deploy: ~2 weeks.

With pi-0.5's open-world generalization:

Send robot to new home
Deploy the pre-trained model immediately (zero-shot)

Cost per home: ~$0 marginal (the training cost is amortized across all deployments). Time to deploy: immediate. This is the difference between a research prototype and a viable product.
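
The amortization argument can be sketched in a few lines. The per-home fine-tuning range is the one quoted above. The one-off training cost and the fleet size are hypothetical placeholders chosen only to show the shape of the calculation, not figures from the paper.

```python
def per_home_cost(n_homes, fixed_training_cost, marginal_cost_per_home):
    """Average cost per home when a one-off training cost is spread over n_homes."""
    return fixed_training_cost / n_homes + marginal_cost_per_home

N_HOMES = 1000                       # HYPOTHETICAL fleet size
TRAINING_COST = 1_000_000.0          # HYPOTHETICAL one-off training cost in dollars

# Per-home fine-tuning: no shared cost, $5,000-10,000 per home (from the text).
fine_tune_low = per_home_cost(N_HOMES, 0.0, 5_000.0)
fine_tune_high = per_home_cost(N_HOMES, 0.0, 10_000.0)

# Zero-shot deployment: shared training cost, roughly $0 marginal per home.
zero_shot = per_home_cost(N_HOMES, TRAINING_COST, 0.0)

print(fine_tune_low, fine_tune_high, zero_shot)
```

The fine-tuning cost per home stays flat no matter how many homes are served, while the zero-shot figure keeps shrinking as the fleet grows. The crossover point depends entirely on the assumed training cost.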

Remaining deployment barriers

Even with zero-shot generalization, real-world deployment faces non-ML challenges:

WiFi latency: The robot streams observations to an A100 server. WiFi jitter (10-50ms spikes) causes action delays. In-home WiFi is less reliable than lab networks.
Battery life: The mobile base runs for ~2-3 hours. A full kitchen cleaning takes 15 minutes. Multiple cleanings per charge are possible but battery monitoring is needed.
Safety certification: No robot with a learned policy has been safety-certified for unsupervised home use. The model can produce unexpected motions (high-velocity arm swings during error recovery). Mechanical speed limits and collision detection are external safety layers.
User interaction: How does the homeowner specify tasks? A chat interface? Voice commands? The paper uses text input, but natural language understanding for ambiguous household instructions remains an open problem.
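
Two of the barriers above reduce to back-of-the-envelope arithmetic, sketched here. The ~10 Hz action rate is taken from the Hybrid Inference chapter, and the jitter, battery, and task-duration figures are the ones quoted above. The function names are illustrative, and the battery estimate ignores idle and travel power draw.

```python
CONTROL_HZ = 10                      # action rate (~10 Hz, from the Hybrid Inference chapter)
PERIOD_MS = 1000 / CONTROL_HZ        # time budget per control step, in milliseconds

def jitter_share(spike_ms):
    """Fraction of one control period consumed by a WiFi jitter spike."""
    return spike_ms / PERIOD_MS

def cleanings_per_charge(battery_hours, task_minutes=15):
    """Whole cleaning runs that fit in one charge, ignoring idle and travel draw."""
    return int(battery_hours * 60 // task_minutes)

print(jitter_share(10), jitter_share(50))
print(cleanings_per_charge(2), cleanings_per_charge(3))
```

A 50 ms spike eats half of a 100 ms control period before any inference time is counted, which is why jitter matters even at a modest control rate. The battery figure works out to between 8 and 12 cleanings per charge as an upper bound.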

Key engineering lessons from pi-0.5

Lesson	Implication
Co-training beats single-source	Build pipelines that ingest heterogeneous data, not just robot demos
Web data prevents forgetting	Always include VLM-style data to maintain visual understanding
Two-stage inference works	Same model can plan AND execute — no separate planner needed
Diversity > Volume	10 new environments beat 1000 hours in existing ones
Depth cameras are optional	RGB-only systems are viable for deployment (cheaper, more robust)
FAST enables discrete pre-training	You can pre-train with tokens then switch to flow matching

The open-world gap: what remains unsolved

Despite pi-0.5's achievements, the gap between lab success and true household deployment remains large:

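Safety: The model has no concept of "this action could break something expensive" or "this action could hurt a person." Beyond joint limits, the policy itself enforces no safety constraints; the mechanical speed limits and collision detection mentioned earlier sit outside the model.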
Long-tail objects: Rare objects (unusual kitchen gadgets, foreign utensils) still cause failures. The web data provides recognition but not manipulation strategies.
Multi-room navigation: Moving between rooms requires spatial memory that the current architecture doesn't support. The model can clean one room but struggles with "bring the plate from the kitchen to the dining room."
Human interaction: If a person is in the kitchen, the robot has no model of personal space, social norms, or how to coordinate (e.g., "excuse me, I need to reach that cabinet").

What's next?

pi-0.5 shows that co-training on diverse data sources is the path to generalization. The open questions are: how far can this scale? Can we add simulation data, human video, and internet-scale interaction data to push generalization even further? pi-0.7 answers part of this — adding structured prompt conditioning to absorb even more diverse data without mode averaging.

Related lessons: Gleams: VLA • Gleams: Flow Matching • Gleams: RL Algorithms

"Stuff your eyes with wonder... See the world."

— Ray Bradbury (quoted in the paper's introduction)
