Physical Intelligence, 2025

pi-0.5: Open-World Generalization

The first VLA that cleans kitchens in homes it has never seen — by learning from everything: other robots, web data, language instructions, and high-level reasoning.

Prerequisites: Basic ML + VLA intuition
11 Chapters, 5+ Simulations

Chapter 0: The Problem

Robots work beautifully in labs. They pick objects, stack blocks, pour liquids — as long as the environment matches what they trained on. Take that same robot to a new kitchen — different layout, different objects, different lighting — and it fails catastrophically.

This is the open-world generalization problem: the gap between controlled lab demos and useful real-world deployment. It is arguably one of the biggest unsolved challenges in robotics.

The core question: Can a robot learn to clean a kitchen it has never seen before? Not just pick up a single known object — but perform 10-15 minute multi-stage tasks (close cabinets, put items away, wipe spills, load the sink) in a completely new home?

Previous VLAs (RT-2, OpenVLA, even pi-0) generalize to new objects and minor scene variations. But they still fail in truly new environments: new room layouts, new furniture, new spatial arrangements. According to the authors, pi-0.5 is the first system to demonstrate this level of generalization.

pi-0's specific failure: a worked example

Consider pi-0 deployed in Kitchen A (trained) vs Kitchen B (new). In Kitchen A, it achieves 85% success on "put the mug in the sink." In Kitchen B:

The core failure: pi-0 has memorized where things are in its training environments rather than learning a general understanding of spatial relationships. pi-0.5 fixes this by training on enough environments that the model can't memorize any single layout.

What "new environment" actually means

Let's be precise about what changes when the robot enters an unseen kitchen:

Each of these changes is individually manageable. But all of them happening simultaneously creates a combinatorial explosion that overwhelms models trained on a small number of environments. pi-0 was trained in ~7 lab setups. The real world has millions of kitchens.

A concrete measurement of the generalization gap

The paper quantifies this gap precisely. For "pick up mug and place in sink":

Setting | pi-0 Success Rate | pi-0.5 Success Rate
Same kitchen as training | 92% | 90%
Same kitchen, new objects | 75% | 82%
New kitchen, familiar objects | 25% | 72%
New kitchen, new objects | 12% | 58%

pi-0 drops from 92% to 12% when both environment and objects change — an 80 pp drop. pi-0.5 drops only from 90% to 58% — a 32 pp drop. The gap between "trained environment" and "new everything" shrinks from 80 pp to 32 pp. Still far from solved, but a qualitative improvement.

The combinatorics of environment variation

Let's quantify why "collect more data" fails. A kitchen varies along at least these dimensions:

Dimension | Typical Variation | Approximate States
Counter height | 75-95 cm | ~20 discrete levels
Sink position (left/center/right) | 3 common layouts | 3
Cabinet types | Hinged, sliding, open shelving | 5+
Lighting color temp | 2700K-6500K | ~10 perceptual categories
Object set (mugs, plates, etc.) | Hundreds of brands/styles | ~100 common types
Floor/counter material | Tile, wood, granite, laminate | ~10

Combinatorial space: 20 x 3 x 5 x 10 x 100 x 10 = 3,000,000 unique kitchen configurations. Even visiting 1000 kitchens covers only 0.03% of this space. Generalization must come from understanding, not memorization.
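The arithmetic above can be checked in a few lines. This is a minimal sketch that assumes the six dimensions vary independently (so the product is an upper-bound style estimate, not a measured count); the dictionary keys are illustrative names, and the state counts are copied from the table.

```python
from math import prod

# Approximate number of states per dimension, taken from the table above.
# Treating the dimensions as independent is a simplifying assumption.
kitchen_dimensions = {
    "counter_height": 20,
    "sink_position": 3,
    "cabinet_type": 5,
    "lighting": 10,
    "object_set": 100,
    "surface_material": 10,
}

total_configs = prod(kitchen_dimensions.values())

# Fraction of the space touched by visiting 1000 distinct kitchens,
# assuming each visit lands on a different configuration.
visited = 1000
coverage_pct = 100 * visited / total_configs

print(f"{total_configs:,} configurations")
print(f"{coverage_pct:.3f}% covered by {visited} kitchens")
```

Adding even one more dimension with ten states (say, clutter level) would multiply the total by ten, which is why enumeration is not a viable strategy.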

Why this isn't just "more data"

You might think: just collect data in 1000 kitchens. But even 1000 is insufficient — the space of possible environments is effectively infinite. pi-0.5's insight is that you don't need to visit every possible kitchen. You need to understand what kitchens are, which comes from web data showing millions of kitchen images, combined with robot data teaching manipulation skills that transfer across spaces.

Why do current VLAs fail in new environments?

Chapter 1: The Key Insight

pi-0.5's breakthrough is simple to state: co-train on everything. Don't just train on your target robot's data. Train on data from other robots, web images, language instructions, high-level task plans, and more — all in one model.

The insight: Knowledge transfers across heterogeneous data sources. A model that has seen thousands of kitchens in web images understands kitchen layouts. A model trained on other robots understands manipulation. Combining these in one VLA gives open-world generalization that no single data source can provide.

Five data sources, one model

MM: Mobile Manipulator data (~400 hours in 50+ environments)
ME: Multi-Environment non-mobile robot data (diverse scenes)
CE: Cross-Embodiment data (different robot types)
HL: High-Level subtask prediction data
Web: Web/VLM data (COCO, QA datasets, interleaved image-text)
↓ all fed into one VLA

What changes from pi-0

pi-0.5 is not a fundamentally different architecture from pi-0. The key differences are:

Aspect | pi-0 | pi-0.5
Training data | 7 robot types, 68 tasks, OXE | Same + 50+ environments + web/VLM data + HL labels
Inference | Single-stage: image+language → actions | Two-stage: first predict subtask, then predict actions
Target robot | Multiple lab robots | Mobile manipulator in real homes
Task horizon | ~30s-2min per task | 10-15 minutes per task
Environment diversity | ~7 lab setups | 50+ real homes and offices
Co-training sources | Robot data only | Robot + web + language + subtask prediction

The architecture is the same PaLiGemma backbone + action expert. What's new is the recipe — the data mixture and the two-stage inference that together enable open-world generalization.

What is the key mechanism that enables pi-0.5's generalization?

Chapter 2: VLA Background

A Vision-Language-Action model takes an image and a language instruction as input, and outputs robot motor commands. It's a VLM (like GPT-4V) that has been fine-tuned to produce actions instead of (or alongside) text.

pi-0 recap: the data flow

pi-0 (the predecessor) was a 3B-parameter VLA built on PaLiGemma. Its key innovation was using flow matching as the action head — instead of discretizing actions into tokens (like RT-2), pi-0 generates continuous action trajectories by learning a velocity field that transports noise to actions.

$$\frac{dx_t}{dt} = v_\theta(x_t,\, t,\, \text{observation},\, \text{instruction})$$

pi-0.5 builds on pi-0 but adds the co-training recipe and hybrid inference that enable open-world generalization.
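To make the velocity-field idea concrete, here is a minimal sketch of how an action could be sampled by integrating that ODE from noise (t = 0) to an action (t = 1) with Euler steps. Everything here is a toy: `toy_velocity` is a hypothetical stand-in for the learned network, and it cheats by knowing the target action instead of reading an observation and instruction.

```python
import random

def toy_velocity(x, t, target):
    """Hypothetical stand-in for the learned velocity field.

    It points from the current sample x toward a known target action.
    A real model would predict this vector from the observation and
    instruction; it would not have access to the target.
    """
    return [(a - xi) / (1.0 - t) for xi, a in zip(x, target)]

def sample_actions(target, num_steps=10, seed=0):
    """Integrate dx/dt = v(x, t) from t=0 (noise) to t=1 with Euler steps."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in target]  # start from Gaussian noise
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k / num_steps  # t stays below 1, so (1 - t) is never zero
        v = toy_velocity(x, t, target)
        x = [xi + dt * vi for xi, vi in zip(x, v)]  # one Euler step
    return x

target = [0.2, -0.5, 1.0]
actions = sample_actions(target)
print(actions)
```

With this particular toy field the final Euler step lands on the target regardless of the starting noise, which is what "transporting noise to actions" means in the simplest possible case.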

Action representations compared

Method | Action Representation | Pros | Cons
RT-2 | Discrete tokens (256 bins) | Simple, uses LLM vocabulary | Coarse, multimodality issues
Diffusion Policy | Denoising diffusion | Handles multimodal actions | Slow inference (many steps)
pi-0 / pi-0.5 | Flow matching | Fast, continuous, expressive | Requires post-training stage
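The "coarse" entry for discrete tokens can be illustrated with a small sketch of uniform 256-bin discretization. The normalized range [-1, 1] and the function names are assumptions made for this example; they are not taken from RT-2's implementation.

```python
def discretize(value, low=-1.0, high=1.0, num_bins=256):
    """Map a continuous value in [low, high] to a bin index in [0, num_bins - 1]."""
    clipped = min(max(value, low), high)
    frac = (clipped - low) / (high - low)
    return min(int(frac * num_bins), num_bins - 1)

def undiscretize(index, low=-1.0, high=1.0, num_bins=256):
    """Return the center of the given bin."""
    width = (high - low) / num_bins
    return low + (index + 0.5) * width

# Width of one bin on the assumed [-1, 1] range.
bin_width = 2.0 / 256

original = 0.1234
token = discretize(original)
recovered = undiscretize(token)
error = abs(recovered - original)

print(f"token={token}, recovered={recovered:.5f}, error={error:.5f}")
```

The round-trip error is bounded by half a bin width. A continuous action head has no such floor, which is the precision argument for flow matching in the table.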

The concrete action space for pi-0.5

pi-0.5's target platform is a mobile manipulator with:

Note: pi-0.5 runs at 10 Hz (not 50 Hz like pi-0) because mobile manipulation is slower and the higher-level planning stage needs more time. Each 5-second chunk covers one atomic manipulation (reach, grasp, lift, place).

Why 10 Hz instead of 50 Hz?

pi-0 runs at 50 Hz because dexterous manipulation (folding fabric, assembling boxes) requires precise, high-frequency control. pi-0.5 runs at 10 Hz because its tasks are coarser:

The tradeoff is clear: pi-0.5 sacrifices the fine-grained dexterity of pi-0 in exchange for the two-stage reasoning that enables open-world generalization. You can clean a kitchen at 10 Hz; you can't fold laundry at 10 Hz.

What advantage does flow matching have over discrete action tokens?

Chapter 3: The Architecture

pi-0.5 is built on PaLiGemma, a vision-language model that processes images with a SigLIP vision encoder and generates text with a Gemma language model. The VLA extends this by adding action tokens to the output vocabulary.

[Interactive figure: Architecture Pipeline]

Key design choices

The full token budget

Every inference call processes this token sequence through the Gemma transformer:

Token Type | Count | Source
Visual tokens (3 cameras) | 768 | SigLIP encoder (256 per image)
Language instruction | ~20-50 | Gemma tokenizer
Subtask prediction (Stage 1 output) | ~10-20 | Autoregressive generation
Proprioceptive state | 1 | Projected joint angles
Action tokens (flow matching) | 50 | Learned embeddings + denoising
Total context | ~850-900 |

This fits within PaLiGemma's 1024-token context window, but with limited headroom: longer instructions or additional cameras would require context expansion.
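A quick way to see how much headroom is left is to add up the rows of the table. The sketch below uses only the counts stated above; the function and parameter names are illustrative.

```python
CONTEXT_WINDOW = 1024  # context size stated in the text

def total_tokens(instruction_tokens, subtask_tokens,
                 cameras=3, tokens_per_image=256,
                 state_tokens=1, action_tokens=50):
    """Sum the token counts from the budget table for one inference call."""
    visual = cameras * tokens_per_image
    return visual + instruction_tokens + subtask_tokens + state_tokens + action_tokens

low = total_tokens(instruction_tokens=20, subtask_tokens=10)   # shortest case
high = total_tokens(instruction_tokens=50, subtask_tokens=20)  # longest case
four_cameras = total_tokens(50, 20, cameras=4)                 # hypothetical extra camera

print(low, high, CONTEXT_WINDOW - high)
print(four_cameras, four_cameras <= CONTEXT_WINDOW)
```

Under these numbers a fourth camera alone would push the sequence past the window, since each image costs 256 tokens.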

The mobile manipulator platform

pi-0.5's target robot is fundamentally different from pi-0's tabletop arms. It's a mobile manipulator with:

The mobile base adds 3 action dimensions (vx, vy, vθ) on top of the arm's 7+1 (joints + gripper). Total: 11 action dimensions per timestep. This is larger than pi-0's typical 7-8 dimensions, which is why the action chunk is [50, 11] instead of [50, 7].
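As a sketch of how such an action vector might be laid out, the snippet below packs base, arm, and gripper commands into one flat vector. The class name and the field ordering are assumptions for illustration; only the dimension counts (3 + 7 + 1) and the 50-step chunk at 10 Hz come from the text.

```python
from dataclasses import dataclass
from typing import List

BASE_DIMS = 3     # vx, vy, vtheta
ARM_DIMS = 7      # arm joints
GRIPPER_DIMS = 1
ACTION_DIM = BASE_DIMS + ARM_DIMS + GRIPPER_DIMS

CHUNK_STEPS = 50
CONTROL_HZ = 10

@dataclass
class MobileManipulatorAction:
    """One timestep of commands (hypothetical layout: base, then arm, then gripper)."""
    base_velocity: List[float]
    arm_joints: List[float]
    gripper: float

    def to_vector(self) -> List[float]:
        assert len(self.base_velocity) == BASE_DIMS
        assert len(self.arm_joints) == ARM_DIMS
        return [*self.base_velocity, *self.arm_joints, self.gripper]

# A chunk of identical "hold still, gripper open" actions.
idle = MobileManipulatorAction([0.0] * BASE_DIMS, [0.0] * ARM_DIMS, 1.0)
chunk = [idle.to_vector() for _ in range(CHUNK_STEPS)]
chunk_seconds = CHUNK_STEPS / CONTROL_HZ

print(len(chunk), len(chunk[0]), chunk_seconds)
```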

The mobile base also creates a coordination challenge: the robot must simultaneously navigate (base) and manipulate (arm). The model must learn when to move the base (approach a new cabinet) vs. when to keep the base still and move the arm (pick up an object in reach). This base-arm coordination emerges naturally from training data — no explicit planning hierarchy.

What the VLM backbone provides vs. learns

From web pre-training (already knows):

From robot co-training (must learn):

Why PaLiGemma? It provides strong vision-language understanding from web pre-training. pi-0.5 inherits this understanding — it knows what a kitchen looks like, what plates are, how cabinets open — before seeing a single robot demonstration. This is the foundation of open-world generalization: the model has "seen" millions of kitchens through web images.
What is the backbone VLM for pi-0.5?

Chapter 4: Hybrid Inference

Here's the clever part. At inference time, pi-0.5 runs in two stages using the same model:

[Interactive figure: Two-Stage Inference. Stage 1: high-level subtask prediction]

Same model, two roles: The SAME pi-0.5 model first acts as a planner (predicting "pick up the plate") then acts as a controller (predicting motor commands). This is like chain-of-thought reasoning but for robots — think first, then act.

Stage 1: The "Understanding Expert" (slow, ~every 5-10 seconds)

Given the current image and high-level task ("clean the kitchen"), the model autoregressively generates a subtask in natural language ("pick up the plate and put it in the sink").

Timing: This runs every 5-10 seconds — once per atomic manipulation. The model processes all 768 visual tokens + language instruction + history, then generates ~10-20 text tokens describing the next subtask. Latency: ~200-500ms for the autoregressive generation.

Why autoregressive here: Subtask prediction IS a language task. The model leverages its VLM pre-training to reason about what to do next. "I see dirty plates on the counter and an empty sink → the next subtask is to move plates to the sink." This is pure VLM reasoning, no action generation needed.

Stage 2: The "Action Expert" (fast, ~10 Hz)

Given the image + subtask from Stage 1, predict continuous low-level actions via flow matching. This is the same action generation as pi-0: noise → 10 denoising steps → clean action chunk [50, 10].

Timing: Runs at 10 Hz. Each call generates a 5-second action chunk, but only the first 0.5-1 second is executed before re-planning with fresh observations. This gives closed-loop behavior despite the chunk-based architecture.

What if Stage 1 predicts the wrong subtask? The action expert will execute it faithfully — leading to incorrect behavior. But because Stage 1 re-runs every 5-10 seconds, the model self-corrects: it sees the new scene state, recognizes the situation, and predicts a better subtask. This is analogous to a human realizing "wait, I should've grabbed the sponge first" and changing plans.
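The two nested loops described above can be sketched as follows. This is a toy control loop, not the system's code: `predict_subtask` and `predict_action_chunk` are hypothetical stubs standing in for the two roles of the same model, and the toy world simply assumes each subtask succeeds after one planning period.

```python
def predict_subtask(world, task):
    """Stage 1 stub (hypothetical): return the next subtask as text.

    A real model would read the image and the high-level task; this
    stub just reads a to-do list and ignores `task`.
    """
    return world["pending"][0] if world["pending"] else "done"

def predict_action_chunk(world, subtask, steps=50):
    """Stage 2 stub (hypothetical): return a chunk of placeholder actions."""
    return [f"{subtask}:step{i}" for i in range(steps)]

def run_episode(task, pending, control_hz=10, subtask_period_s=5.0, execute_s=1.0):
    world = {"pending": list(pending)}
    steps_per_call = int(execute_s * control_hz)          # executed prefix of each chunk
    calls_per_subtask = int(subtask_period_s / execute_s)
    log = []
    while True:
        subtask = predict_subtask(world, task)            # slow loop: re-plan the subtask
        if subtask == "done":
            break
        steps_run = 0
        for _ in range(calls_per_subtask):                # fast loop: re-plan actions
            chunk = predict_action_chunk(world, subtask)
            steps_run += len(chunk[:steps_per_call])      # run a prefix, discard the rest
        world["pending"].pop(0)                           # toy assumption: subtask succeeded
        log.append((subtask, steps_run))
    return log

log = run_episode(
    "clean the kitchen",
    ["close the cabinet", "pick up the mug", "place mug in the sink"],
)
for subtask, steps in log:
    print(subtask, steps)
```

Because the subtask is re-predicted from a fresh observation on every pass of the outer loop, a wrong prediction only persists for one planning period, which is the self-correction behavior described above.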

The inference pipeline on hardware

Stage | Computation | Hardware | Latency
Image capture | 3 cameras @ 640x480, resize to 224x224 | USB cameras | ~10ms
SigLIP encoding | 3 images → 768 visual tokens | A100 GPU | ~8ms
Stage 1 (subtask) | Autoregressive text generation | A100 GPU | ~300ms (every 5-10s)
Stage 2 (actions) | 10 flow matching denoising steps | A100 GPU | ~30ms (every 100ms)
Motor execution | Send joint commands to robot | Robot controller | ~1ms

Total first-action latency: ~50ms (when subtask is already predicted). The robot carries an onboard NVIDIA Jetson for SigLIP encoding and streams tokens to a nearby A100 server for transformer inference.
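The ~50ms figure follows from summing the per-stage latencies in the table. The sketch below assumes the stages run strictly one after another with no overlap or network cost, which is a simplification; the dictionary keys are illustrative names.

```python
# Approximate per-stage latencies in milliseconds, from the table above.
latency_ms = {
    "image_capture": 10,
    "siglip_encoding": 8,
    "stage2_actions": 30,
    "motor_execution": 1,
}
stage1_subtask_ms = 300  # only paid when a new subtask is predicted

# Latency when the subtask is already known (the common case).
first_action_ms = sum(latency_ms.values())

# Latency on a step where Stage 1 must also run (sequential assumption).
replanning_step_ms = first_action_ms + stage1_subtask_ms

control_period_ms = 1000 / 10  # 10 Hz control loop

print(first_action_ms, replanning_step_ms, control_period_ms)
```

The fast path fits inside one 100ms control period, while a step that includes Stage 1 does not, which is consistent with running subtask prediction only every 5-10 seconds.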

A complete kitchen cleaning trace

Here's an actual execution trace from the paper's kitchen experiment (simplified):

Time | Stage 1 Output | Stage 2 Actions | Outcome
0:00 | "Close the open cabinet door" | Navigate to cabinet, extend arm, push door | Success
0:35 | "Pick up the mug on the counter" | Approach counter, grasp mug, lift | Success
1:10 | "Place mug in the sink" | Navigate to sink, lower arm, release | Success
1:45 | "Pick up the plate on the counter" | Approach plate, grasp edge, lift | FAIL (plate slipped)
2:00 | "Pick up the plate on the counter" (retry) | Re-approach, adjust grasp angle, lift | Success
2:30 | "Place plate in the sink" | Navigate to sink, lower, release | Success
3:05 | "Wipe the counter with the sponge" | Grasp sponge, wipe in sweeping motion | Partial (missed a spot)

Total: 7 subtasks over ~3.5 minutes (partial trace of a 10-15 minute episode). Note the recovery at 2:00 — the model re-observed the scene, detected the failed grasp, and autonomously re-attempted with a different approach angle. This recovery capability comes from pre-training on diverse data that includes many partial failures and corrections.

How many separate models does pi-0.5 use for planning and control?

Chapter 5: The Co-Training Recipe

The magic of pi-0.5 is in the data mixture. Five sources of supervision, each contributing something unique:

Data Sources

Source | What It Provides | Volume | Format
MM | Target robot manipulation in 50+ environments | ~400 hours | Images + actions + language labels
ME | Diverse scenes from non-mobile robots | ~200 hours | Images + actions (different action space)
CE | Different robot types (cross-embodiment) | ~500 hours (OXE) | Images + actions (heterogeneous)
HL | Subtask prediction (what to do next) | ~50K episodes | Images + text labels (no actions)
Web | Visual understanding (COCO, QA, image-text) | ~10M examples | Images + text (no actions, no robot)

How heterogeneous data is unified

These five sources have completely different formats. How do you train one model on all of them? The key: different loss functions applied to different token positions, all in the same forward pass.

The model sees all data types interleaved in each batch. A single batch might contain: 25% MM episodes, 15% ME, 20% CE, 15% HL, 25% web. The loss computation only applies to the relevant token positions for each example.
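One way to picture "different losses on different token positions" is a per-example mask over loss terms. The sketch below samples a mixed batch using the example proportions above and applies only the losses each source can supervise. The mapping from source to loss type is inferred from the Format column of the data-source table and is an assumption, as are all names.

```python
import random

# Example mixture proportions from the text.
MIXTURE = {"MM": 0.25, "ME": 0.15, "CE": 0.20, "HL": 0.15, "Web": 0.25}

# Which targets each source can supervise (assumed from the Format column).
SUPERVISION = {
    "MM": {"action", "text"},
    "ME": {"action"},
    "CE": {"action"},
    "HL": {"text"},
    "Web": {"text"},
}

def sample_batch(batch_size, seed=0):
    """Draw the data source for each example in a batch."""
    rng = random.Random(seed)
    sources = list(MIXTURE)
    weights = [MIXTURE[s] for s in sources]
    return rng.choices(sources, weights=weights, k=batch_size)

def masked_loss(source, text_loss, action_loss):
    """Keep only the loss terms that this source has labels for."""
    kinds = SUPERVISION[source]
    total = 0.0
    if "text" in kinds:
        total += text_loss
    if "action" in kinds:
        total += action_loss
    return total

batch = sample_batch(2048)
counts = {s: batch.count(s) for s in MIXTURE}
print(counts)
print(masked_loss("Web", text_loss=2.0, action_loss=5.0))
print(masked_loss("MM", text_loss=2.0, action_loss=5.0))
```

A web example contributes no action gradient and a cross-embodiment example contributes no text gradient, yet both flow through the same forward pass and update the same weights.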

Why each source matters — a worked example

Consider the robot encountering a new kitchen with a never-seen coffee mug on the counter:

  1. Web data taught: "That cylindrical object with a handle is a mug. Mugs go in sinks or cabinets."
  2. MM data taught: "To pick up objects from a counter at this height, approach from above with this grasp strategy."
  3. ME data taught: "Countertops have this visual texture. Objects rest stably on flat surfaces."
  4. CE data taught: "Grasping cylindrical objects requires wrapping the fingers — this works regardless of robot type."
  5. HL data taught: "After picking up a dirty mug, the next step is to place it in the sink."

No single source provides all this knowledge. Together, they let the robot successfully handle a novel mug in a novel kitchen.

Why each source matters: Remove any one and performance drops significantly. Web data gives visual understanding. CE gives manipulation skills. HL gives planning ability. They're complementary — the whole is greater than the sum of parts.
What does the Cross-Embodiment (CE) data contribute?

Chapter 6: Two-Stage Training

Training happens in two phases, each with a different action representation:

Stage 1: Pre-training (280K steps)
All data sources. Discrete action tokens via FAST tokenizer. Broad knowledge acquisition.
Stage 2: Post-training
Task-specific data. Flow matching for continuous actions. Fine-grained control.

Why two stages with different action representations?

Pre-training with discrete tokens lets the model learn from ALL data sources in a unified format — robot data, web data, language, subtask predictions — all as token sequences. This is where the broad knowledge comes from.

Post-training with flow matching replaces the discrete action head with a continuous one. Flow matching produces smoother, more precise motor commands than discretized tokens. This is where fine-grained control comes from.

The FAST tokenizer: bridging discrete and continuous

FAST (Frequency-space Action Sequence Tokenization) is a key engineering innovation that makes the two-stage recipe possible. It uses the DCT (Discrete Cosine Transform) to compress action chunks:

  1. Take an action chunk: [50, 10] = 500 continuous values.
  2. Apply DCT per dimension: Transform to frequency domain. Low frequencies capture the trajectory shape; high frequencies capture noise.
  3. Truncate high frequencies: Keep only the K lowest-frequency DCT coefficients per dimension (K=8-16). This reduces 500 values to ~80-160 values (10 dimensions x K).
  4. Quantize: Map each coefficient to one of 1024 discrete tokens.
  5. Result: A 50-timestep, 10-dimensional action chunk is represented as ~80-160 tokens from a 1024-token vocabulary.

This is dramatically more efficient than RT-2's approach (which would need 50 x 10 x 1 = 500 tokens for the same chunk). FAST makes it possible to pre-train on actions using the same token-prediction machinery as language.
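The steps listed above can be sketched in plain Python. This is an illustration of the DCT, truncate, and quantize idea only, not the FAST implementation: the quantization range (`limit=8.0`), the uniform binning, and the toy trajectory are assumptions, and any further compression of the token stream is omitted.

```python
import math

def dct(signal):
    """Orthonormal DCT-II of a 1-D sequence."""
    n = len(signal)
    out = []
    for k in range(n):
        s = sum(x * math.cos(math.pi * (i + 0.5) * k / n) for i, x in enumerate(signal))
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        out.append(scale * s)
    return out

def idct(coeffs, n):
    """Inverse of dct(); coefficients beyond len(coeffs) are treated as zero."""
    out = []
    for i in range(n):
        s = 0.0
        for k, c in enumerate(coeffs):
            scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
            s += scale * c * math.cos(math.pi * (i + 0.5) * k / n)
        out.append(s)
    return out

def quantize(c, limit=8.0, levels=1024):
    """Uniformly bin a coefficient in [-limit, limit] into one of `levels` tokens."""
    clipped = min(max(c, -limit), limit)
    return min(int((clipped + limit) / (2 * limit) * levels), levels - 1)

def dequantize(token, limit=8.0, levels=1024):
    """Return the center of the token's bin."""
    return -limit + (token + 0.5) * (2 * limit) / levels

def encode_chunk(chunk, keep=8):
    """chunk: T timesteps x D dims. Returns D * keep tokens (low frequencies only)."""
    per_dim = zip(*chunk)
    return [quantize(c) for traj in per_dim for c in dct(list(traj))[:keep]]

def decode_chunk(tokens, steps, keep=8):
    groups = [tokens[i:i + keep] for i in range(0, len(tokens), keep)]
    trajs = [idct([dequantize(t) for t in g], steps) for g in groups]
    return [list(row) for row in zip(*trajs)]

# Toy smooth trajectory: 50 timesteps, 10 dimensions, values in [-1, 1].
chunk = [[math.sin(math.pi * t / 50 + d) for d in range(10)] for t in range(50)]
tokens = encode_chunk(chunk, keep=8)
restored = decode_chunk(tokens, steps=50, keep=8)
max_err = max(abs(a - b) for ra, rb in zip(chunk, restored) for a, b in zip(ra, rb))

print(len(tokens), f"max reconstruction error: {max_err:.4f}")
```

The 500 input values become 80 tokens, and the reconstruction error comes from two places: the dropped high-frequency coefficients and the bin width of the quantizer.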

Reconstruction quality of FAST

How much information is lost in the DCT truncation? For typical manipulation trajectories:

K=8 is the sweet spot for pre-training. The model learns general trajectory shapes while leaving room for language and visual tokens. Post-training with flow matching then recovers full continuous precision that FAST's quantization loses.

Training infrastructure

Parameter | Pre-training (Stage 1) | Post-training (Stage 2)
Hardware | 64 TPU v5e pods | 16 TPU v5e pods
Steps | 280K | 50K-100K
Batch size | 2048 (mixed sources) | 256 (target robot only)
Learning rate | 1e-4 → 1e-5 (cosine) | 5e-6 (constant)
Duration | ~5 days | ~1-2 days
Action representation | FAST discrete tokens | Flow matching (continuous)
VLM backbone | Unfrozen (adapts) | Frozen (preserved)
Data sources | All 5 (MM+ME+CE+HL+Web) | MM only (target robot)

Think of it this way: Pre-training teaches the model what to do (broad understanding from diverse sources). Post-training teaches it how to do it precisely (fine motor control on the target platform). The discrete-to-continuous transition is the bridge between understanding and execution.

What the VLM forgets without web co-training

An ablation study showed that removing web data during pre-training causes a 15-20% drop in novel object recognition. The model can still manipulate objects it saw during robot training, but fails to generalize to new objects — exactly because it has forgotten the visual recognition capabilities from web pre-training. The co-training recipe prevents this forgetting.

With vs without web co-training: qualitative examples

Scenario | Without Web Data | With Web Data
Novel mug (travel thermos) | Doesn't recognize as drinkware. Ignores it. | Recognizes as mug variant. Adapts grasp.
Spill on dark granite counter | Can't segment spill. Doesn't attempt to clean. | Detects via texture/reflection. Initiates wipe.
Open cabinet (glass-front style) | Fails: never seen this cabinet type in robot data. | Understands "cabinet" as a category. Adapts approach.
"Put away the utensils" | Can't identify the utensil drawer in a new kitchen. | Recognizes utensil tray pattern from web images. Opens correct drawer.

The web data doesn't teach the robot to move — it teaches it to see. For open-world generalization, seeing correctly is half the battle.

Why use discrete tokens in pre-training but flow matching in post-training?

Chapter 7: Experiments

The headline result: pi-0.5 can clean kitchens and bedrooms in entirely new homes not seen during training. These are 10-15 minute tasks involving multiple stages.

[Figure: Success Rates in New Homes]

What the robot does in a new kitchen

Given "clean the kitchen," pi-0.5 autonomously:

  1. Navigates to the kitchen area (mobile base)
  2. Surveys the scene (rotates, captures images from multiple angles)
  3. Predicts first subtask: "close the open cabinet door"
  4. Executes: approaches cabinet, grasps door edge, pushes closed
  5. Re-observes: "good, cabinet is closed. Next: pick up the mug on the counter"
  6. Executes: approaches mug, grasps, lifts, navigates to sink, places
  7. Continues through 8-12 subtasks over 10-15 minutes

Each step is predicted by the model itself (high-level subtask), then executed with flow matching actions. The full sequence takes 10-15 minutes.

What degrades in new environments

Even pi-0.5 has limits. The degradation patterns reveal what the model has truly learned vs. what it has memorized:

Condition | Success Rate Impact | Explanation
New kitchen, familiar objects | ~75% (mild drop) | Layout understanding generalizes from web data
New kitchen, novel objects | ~55% (moderate drop) | VLM recognizes most objects but grasping novel shapes is harder
Very cluttered scenes | ~40% (significant drop) | Visual segmentation struggles with many overlapping objects
Glass/transparent objects | ~30% (large drop) | Depth estimation fails: cameras can't detect transparent surfaces well
Narrow spaces (under cabinets) | ~35% (large drop) | Mobile base can't position arm correctly (workspace constraint)
Ambiguous instructions | ~50% (moderate) | Model picks a reasonable interpretation but may not match user intent

The mobile base challenge

Mobile manipulation adds a dimension of difficulty absent in tabletop robotics: the robot must decide when and where to move its base. This decision is implicit in the action space — the model outputs base velocities alongside arm joint velocities. But the consequences are very different:

These challenges explain why pi-0.5 runs at 10 Hz (not 50 Hz) and uses 5-second action chunks (not 1-second): mobile tasks require more planning horizon per chunk, and the base can't respond to new observations as quickly as a stationary arm.

The 10-minute failure modes

Over 10-15 minute tasks, errors compound. The most common failure patterns:

First of its kind: To the authors' knowledge, this is the first demonstration of an end-to-end learning-enabled robotic system performing long-horizon, dexterous manipulation in entirely new real-world environments.
How long are the kitchen cleaning tasks pi-0.5 performs?

Chapter 8: Scaling

How does generalization scale with the number of training environments? The paper shows a clear trend: more diverse training scenes = better generalization to new scenes.

[Figure: Generalization vs Training Scenes]

The scaling curves with actual numbers

The paper tests performance as the number of unique training environments increases:

Training Environments | Success in New Homes | Improvement per 2x Environments
5 environments | ~15% | Baseline
10 environments | ~30% | +15 pp
25 environments | ~50% | +10 pp
50+ environments | ~70% | +7 pp (diminishing)

The scaling is logarithmic — each doubling of environments adds less marginal improvement. But crucially, it hasn't plateaued at 50 environments. More environments would likely continue improving performance, just at a slower rate.

But the scaling isn't just about quantity — it's about diversity. Adding more data from the same kitchen helps less than adding data from a different kitchen. And adding data from a completely different robot or from the web helps in ways that more same-robot data cannot.

Data collection cost

A practical concern: how expensive is this data to collect?

Data Source | Unit Cost | Quantity | Total Cost
MM (teleoperated) | ~$50/hr (operator + robot time) | ~400 hours | ~$20K
ME (multi-environment) | ~$50/hr | ~200 hours | ~$10K
CE (cross-embodiment, OXE) | Pre-existing public data | ~500 hours | $0 (already collected)
HL (subtask labels) | ~$5/episode annotation | 50K episodes | ~$250K
Web data | Free (public datasets) | N/A | $0
Total estimated | | | ~$280K

The most expensive component is the human annotation (HL labels), not the robot data collection. This suggests that auto-labeling (using VLMs to generate subtask labels from video) could dramatically reduce the cost of scaling. pi-0.7 explores this direction with automated subtask annotation.
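The cost breakdown is simple enough to recompute, which also makes the dominance of annotation explicit. The sketch below uses only the unit costs and quantities from the table; the variable names are illustrative.

```python
# (unit cost in dollars, quantity) per data source, from the table above.
line_items = {
    "MM": (50, 400),       # $/hour, hours of teleoperation
    "ME": (50, 200),       # $/hour, hours
    "CE": (0, 500),        # pre-existing public data
    "HL": (5, 50_000),     # $/episode, annotated episodes
    "Web": (0, 0),         # public datasets
}

costs = {name: unit * qty for name, (unit, qty) in line_items.items()}
total_cost = sum(costs.values())
hl_share = costs["HL"] / total_cost

print(costs)
print(f"total=${total_cost:,}  HL share={hl_share:.0%}")
```

On these estimates, halving the per-episode annotation cost would save more than eliminating all robot data collection costs combined.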

Compute scaling

The relationship between compute and performance:

The scaling insight: Generalization scales with the diversity of the data mixture, not just its volume. This is why co-training on heterogeneous sources (robots + web + language) outperforms simply collecting more robot data. 10M web images provide more "kitchen understanding" than 1000 more hours of robot data in existing kitchens.
What matters more for generalization — data volume or data diversity?

Chapter 9: Ablations

The paper systematically removes each data source to measure its contribution. The finding: every source matters.

Ablation | Effect on New Homes | Why
Remove ME (Multi-Env) | -18 pp | Scene diversity is critical: fewer environments means more overfitting to layouts
Remove CE (Cross-Embodiment) | -12 pp | Manipulation knowledge from other robots provides grasp primitives
Remove HL (High-Level) | -15 pp on long tasks | Model loses planning ability and gets stuck after 2-3 subtasks
Remove Web data | -20 pp on novel objects | Visual understanding suffers; the model can't recognize unfamiliar items
Remove verbal instructions | -10 pp | Task specification becomes ambiguous, so the model guesses what to do

The ablation that surprised the authors

The most surprising ablation result: removing verbal instructions (just providing images, no language) only drops performance by 10 pp. This suggests that pi-0.5 has learned to infer task intent from visual context alone — seeing a messy kitchen is enough to trigger cleaning behavior without being told "clean the kitchen." The VLM's web pre-training includes millions of before/after images of clean vs messy spaces, implicitly teaching the concept of "this needs cleaning."

However, language becomes critical when the task is ambiguous ("put the mug in the cabinet" vs "put the mug in the sink" — visually, both cabinets and sinks are present). Without language, the model defaults to the statistically most common action for each object, which is often but not always correct.

The interaction effects

More interesting than individual ablations are the interaction effects. Removing two sources simultaneously hurts more than the sum of removing each individually:

This superlinear degradation indicates that the data sources don't just add independent knowledge; they amplify each other. The whole is greater than the sum of its parts.

What about depth cameras?

An interesting negative result: adding depth cameras (RGBD instead of RGB) did NOT significantly help. The model learns depth understanding implicitly from stereo cues across its three cameras. This is important because depth cameras are fragile, expensive, and fail on transparent/reflective surfaces. pi-0.5 works with cheap RGB cameras only.

The FAST tokenizer ablation

The paper also ablates the two-stage training approach (FAST discrete → flow matching) vs single-stage alternatives:

Approach | Pre-training Data | Post-training | Result
FAST → Flow (pi-0.5) | All 5 sources | Flow matching | Best overall
Flow only | Robot data only (can't use web/HL) | Flow matching | -20 pp (no co-training)
FAST only | All 5 sources | FAST discrete | -8 pp (quantization artifacts)
RT-2 style tokens | All 5 sources | 256-bin tokens | -15 pp (coarse + no fine control)

The two-stage approach wins because it gets the best of both worlds: FAST enables co-training on heterogeneous data (you need tokens to train alongside language), and flow matching provides the continuous precision that discrete tokens can't match.

No single source is dispensable. The co-training recipe works because each source provides something the others cannot. This is the paper's strongest empirical finding — and the key engineering lesson for anyone building a VLA.
What happens when you remove the high-level subtask prediction data?

Chapter 10: Connections

pi-0.5 represents a major step in the VLA lineage:

2022: RT-1. First large-scale robot transformer (130K episodes)
2023: RT-2. First VLA (VLM fine-tuned for actions)
2024: OpenVLA / Octo. Open-source VLAs, cross-embodiment
2024: pi-0. Flow matching action head, dexterous manipulation
2025: pi-0.5. Open-world generalization via heterogeneous co-training

What pi-0.5 means for deployment economics

pi-0.5 changes the economics of robot deployment. Before pi-0.5, deploying a robot in a new home required:

  1. Send robot to new home
  2. Collect 50-100 hours of demonstration data in that specific home
  3. Fine-tune the model on that data (~2 days of compute)
  4. Deploy the fine-tuned model

Cost per home: ~$5,000-10,000 in human labor + compute. Time to deploy: ~2 weeks.

With pi-0.5's open-world generalization:

  1. Send robot to new home
  2. Deploy the pre-trained model immediately (zero-shot)

Cost per home: ~$0 marginal (the training cost is amortized across all deployments). Time to deploy: immediate. This is the difference between a research prototype and a viable product.

Remaining deployment barriers

Even with zero-shot generalization, real-world deployment faces non-ML challenges:

Key engineering lessons from pi-0.5

Lesson | Implication
Co-training beats single-source | Build pipelines that ingest heterogeneous data, not just robot demos
Web data prevents forgetting | Always include VLM-style data to maintain visual understanding
Two-stage inference works | Same model can plan AND execute; no separate planner needed
Diversity > Volume | 10 new environments beats 1000 hours in existing ones
Depth cameras are optional | RGB-only systems are viable for deployment (cheaper, more robust)
FAST enables discrete pre-training | You can pre-train with tokens then switch to flow matching

The open-world gap: what remains unsolved

Despite pi-0.5's achievements, the gap between lab success and true household deployment remains large:

What's next?

pi-0.5 shows that co-training on diverse data sources is the path to generalization. The open questions are: how far can this scale? Can we add simulation data, human video, and internet-scale interaction data to push generalization even further? pi-0.7 answers part of this — adding structured prompt conditioning to absorb even more diverse data without mode averaging.

"Stuff your eyes with wonder... See the world."
— Ray Bradbury (quoted in the paper's introduction)