The first VLA that cleans kitchens in homes it has never seen — by learning from everything: other robots, web data, language instructions, and high-level reasoning.
Robots work beautifully in labs. They pick objects, stack blocks, pour liquids — as long as the environment matches what they trained on. Take that same robot to a new kitchen — different layout, different objects, different lighting — and it fails catastrophically.
This is the open-world generalization problem: the gap between controlled lab demos and useful real-world deployment. It is arguably the biggest unsolved challenge in robotics.
Previous VLAs (RT-2, OpenVLA, even pi-0) generalize to new objects and minor scene variations, but they still fail in truly new environments: new room layouts, new furniture, new spatial arrangements. pi-0.5 is the first system to demonstrate generalization to entirely new environments.
Consider pi-0 deployed in Kitchen A (trained) vs Kitchen B (new). In Kitchen A, it achieves 85% success on "put the mug in the sink." In Kitchen B:
The core failure: pi-0 has memorized where things are in its training environments rather than learning a general understanding of spatial relationships. pi-0.5 fixes this by training on enough environments that the model can't memorize any single layout.
Let's be precise about what changes when the robot enters an unseen kitchen:
Each of these changes is individually manageable. But all of them happening simultaneously creates a combinatorial explosion that overwhelms models trained on a small number of environments. pi-0 was trained in ~7 lab setups. The real world has millions of kitchens.
The paper quantifies this gap precisely. For "pick up mug and place in sink":
| Setting | pi-0 Success Rate | pi-0.5 Success Rate |
|---|---|---|
| Same kitchen as training | 92% | 90% |
| Same kitchen, new objects | 75% | 82% |
| New kitchen, familiar objects | 25% | 72% |
| New kitchen, new objects | 12% | 58% |
pi-0 drops from 92% to 12% when both environment and objects change — an 80 pp drop. pi-0.5 drops only from 90% to 58% — a 32 pp drop. The gap between "trained environment" and "new everything" shrinks from 80 pp to 32 pp. Still far from solved, but a qualitative improvement.
Let's quantify why "collect more data" fails. A kitchen varies along at least these dimensions:
| Dimension | Typical Variation | Approximate States |
|---|---|---|
| Counter height | 75-95cm | ~20 discrete levels |
| Sink position (left/center/right) | 3 common layouts | 3 |
| Cabinet types | Hinged, sliding, open shelving | 5+ |
| Lighting color temp | 2700K-6500K | ~10 perceptual categories |
| Object set (mugs, plates, etc) | Hundreds of brands/styles | ~100 common types |
| Floor/counter material | Tile, wood, granite, laminate | ~10 |
Combinatorial space: 20 x 3 x 5 x 10 x 100 x 10 = 3,000,000 unique kitchen configurations. Even visiting 1000 kitchens covers only 0.03% of this space. Generalization must come from understanding, not memorization.
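The arithmetic can be reproduced in a few lines. This is a minimal sketch that copies the approximate state counts from the table and, as the text does, treats the dimensions as independent; the dictionary keys are illustrative names, not terms from the paper.

```python
from math import prod

# Approximate number of states per dimension, copied from the table.
# Treating the dimensions as independent is a simplifying assumption.
kitchen_dimensions = {
    "counter_height": 20,
    "sink_position": 3,
    "cabinet_type": 5,
    "lighting": 10,
    "object_set": 100,
    "surface_material": 10,
}

total_configs = prod(kitchen_dimensions.values())
kitchens_visited = 1000
coverage_pct = 100 * kitchens_visited / total_configs

print(f"{total_configs:,} configurations")
print(f"{kitchens_visited} kitchens cover {coverage_pct:.2f}% of the space")
```

Adding even one more dimension with 10 states (say, clutter level) would multiply the space by 10 again, which is why coverage by enumeration does not scale.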
You might think: just collect data in 1000 kitchens. But even 1000 is insufficient — the space of possible environments is effectively infinite. pi-0.5's insight is that you don't need to visit every possible kitchen. You need to understand what kitchens are, which comes from web data showing millions of kitchen images, combined with robot data teaching manipulation skills that transfer across spaces.
pi-0.5's breakthrough is simple to state: co-train on everything. Don't just train on your target robot's data. Train on data from other robots, web images, language instructions, high-level task plans, and more — all in one model.
pi-0.5 is not a fundamentally different architecture from pi-0. The key differences are:
| Aspect | pi-0 | pi-0.5 |
|---|---|---|
| Training data | 7 robot types, 68 tasks, OXE | Same + 50+ environments + web/VLM data + HL labels |
| Inference | Single-stage: image+language → actions | Two-stage: first predict subtask, then predict actions |
| Target robot | Multiple lab robots | Mobile manipulator in real homes |
| Task horizon | ~30s-2min per task | 10-15 minutes per task |
| Environment diversity | ~7 lab setups | 50+ real homes and offices |
| Co-training sources | Robot data only | Robot + web + language + subtask prediction |
The architecture is the same PaliGemma backbone + action expert. What's new is the recipe: the data mixture and the two-stage inference that together enable open-world generalization.
A Vision-Language-Action model takes an image and a language instruction as input, and outputs robot motor commands. It's a VLM (like GPT-4V) that has been fine-tuned to produce actions instead of (or alongside) text.
pi-0 (the predecessor) was a 3B-parameter VLA built on PaliGemma. Its key innovation was using flow matching as the action head: instead of discretizing actions into tokens (like RT-2), pi-0 generates continuous action trajectories by learning a velocity field that transports noise to actions.
pi-0.5 builds on pi-0 but adds the co-training recipe and hybrid inference that enable open-world generalization.
| Method | Action Repr. | Pros | Cons |
|---|---|---|---|
| RT-2 | Discrete tokens (256 bins) | Simple, uses LLM vocabulary | Coarse, multimodality issues |
| Diffusion Policy | Denoising diffusion | Handles multimodal actions | Slow inference (many steps) |
| pi-0 / pi-0.5 | Flow matching | Fast, continuous, expressive | Requires post-training stage |
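To make "a velocity field that transports noise to actions" concrete, here is a toy sketch of flow-matching sampling. It is not the model's action head: the learned network is replaced by a hand-written velocity (`toy_velocity`) that already knows the target, and the chunk shape `(50, 8)` and the function names are assumptions chosen for illustration. What it does show is the sampling loop itself: start from Gaussian noise and take a fixed number of Euler steps along the velocity field.

```python
import numpy as np

def toy_velocity(x, t, target):
    """Velocity along the straight path x_t = (1 - t) * noise + t * target.

    A trained model would predict this from images and language. Here the
    target is passed in directly so the sketch stays self-contained.
    """
    return (target - x) / (1.0 - t)

def sample_actions(target, num_steps=10, seed=0):
    """Euler-integrate the velocity field from t=0 (noise) to t=1 (actions)."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(target.shape)  # start from pure Gaussian noise
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k / num_steps  # t stays below 1, so the division is safe
        x = x + dt * toy_velocity(x, t, target)
    return x

# Hypothetical target chunk: 50 timesteps x 8 action dimensions.
target_chunk = np.linspace(0.0, 1.0, 50)[:, None] * np.ones((1, 8))
actions = sample_actions(target_chunk)
print(np.abs(actions - target_chunk).max())  # close to zero for this toy field
```

Because the whole chunk is denoised jointly in a handful of steps, the cost per chunk is a small fixed number of forward passes rather than one pass per action token.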
pi-0.5's target platform is a mobile manipulator with:
Note: pi-0.5 runs at 10 Hz (not 50 Hz like pi-0) because mobile manipulation is slower and the higher-level planning stage needs more time. Each 5-second chunk covers one atomic manipulation (reach, grasp, lift, place).
pi-0 runs at 50 Hz because dexterous manipulation (folding fabric, assembling boxes) requires precise, high-frequency control. pi-0.5 runs at 10 Hz because its tasks are coarser:
The tradeoff is clear: pi-0.5 sacrifices the fine-grained dexterity of pi-0 in exchange for the two-stage reasoning that enables open-world generalization. You can clean a kitchen at 10 Hz; you can't fold laundry at 10 Hz.
pi-0.5 is built on PaliGemma, a vision-language model that processes images with a SigLIP vision encoder and generates text with a Gemma language model. The VLA extends this by adding action tokens to the output vocabulary.
Every inference call processes this token sequence through the Gemma transformer:
| Token Type | Count | Source |
|---|---|---|
| Visual tokens (3 cameras) | 768 | SigLIP encoder (256 per image) |
| Language instruction | ~20-50 | Gemma tokenizer |
| Subtask prediction (Stage 1 output) | ~10-20 | Autoregressive generation |
| Proprioceptive state | 1 | Projected joint angles |
| Action tokens (flow matching) | 50 | Learned embeddings + denoising |
| Total context | ~850-900 | Sum of the rows above |
This fits within PaliGemma's 1024-token context window, but the margin is tight: only about 125-175 tokens of headroom remain. Longer instructions or more cameras would require context expansion.
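The token budget can be checked with a short script. The counts come from the table above; the dictionary layout and the "fourth camera" scenario are illustrative assumptions.

```python
# Token counts per inference call, taken from the table.
# Each entry is a (low, high) range; fixed counts repeat the same value.
token_budget = {
    "visual (3 cameras x 256)": (3 * 256, 3 * 256),
    "language instruction": (20, 50),
    "subtask prediction": (10, 20),
    "proprioceptive state": (1, 1),
    "action tokens": (50, 50),
}
CONTEXT_WINDOW = 1024  # context size quoted in the text

low = sum(lo for lo, _ in token_budget.values())
high = sum(hi for _, hi in token_budget.values())
headroom = CONTEXT_WINDOW - high

# Hypothetical scenario: a fourth camera adds another 256 visual tokens.
fits_with_fourth_camera = high + 256 <= CONTEXT_WINDOW

print(f"context used: {low}-{high} tokens, worst-case headroom: {headroom}")
print(f"fourth camera fits: {fits_with_fourth_camera}")
```

The visual tokens dominate the budget, so camera count is the first thing to reconsider if the context limit becomes a constraint.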
pi-0.5's target robot is fundamentally different from pi-0's tabletop arms. It's a mobile manipulator with:
The mobile base adds 3 action dimensions (vx, vy, vθ) on top of the arm's 7+1 (joints + gripper). Total: 11 action dimensions per timestep. This is larger than pi-0's typical 7-8 dimensions, which is why the action chunk is [50, 11] instead of [50, 7].
The mobile base also creates a coordination challenge: the robot must simultaneously navigate (base) and manipulate (arm). The model must learn when to move the base (approach a new cabinet) vs. when to keep the base still and move the arm (pick up an object in reach). This base-arm coordination emerges from the training data; there is no separate planner that arbitrates between base and arm motion.
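One way to picture the 11-dimensional action is as a single vector with named slices. The ordering below (base first, then arm, then gripper) is an assumption made for illustration; the text only fixes the counts (3 + 7 + 1) and the chunk shape.

```python
import numpy as np

# Hypothetical layout of one action vector. Only the sizes come from the
# text; the ordering of the slices is an assumption.
BASE = slice(0, 3)        # vx, vy, v_theta
ARM = slice(3, 10)        # 7 arm joints
GRIPPER = slice(10, 11)   # 1 gripper command
ACTION_DIM = 11
HORIZON = 50              # actions per chunk
CONTROL_HZ = 10

def split_action(action):
    """Split an action (or a whole chunk) into base, arm, and gripper parts."""
    return {
        "base": action[..., BASE],
        "arm": action[..., ARM],
        "gripper": action[..., GRIPPER],
    }

def base_is_stationary(action, tol=1e-3):
    """True when every base velocity in the action or chunk is near zero."""
    return bool(np.all(np.abs(action[..., BASE]) < tol))

chunk = np.zeros((HORIZON, ACTION_DIM))
parts = split_action(chunk)
chunk_seconds = HORIZON / CONTROL_HZ

print({name: part.shape for name, part in parts.items()})
print(f"chunk covers {chunk_seconds} s, base stationary: {base_is_stationary(chunk)}")
```

Since base and arm commands live in the same vector, "arm-only" behavior is just a chunk whose base slice stays near zero, which is how the coordination can be learned without a separate mode switch.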
From web pre-training (already knows):
From robot co-training (must learn):
Here's the clever part. At inference time, pi-0.5 runs in two stages using the same model:
Stage 1 (subtask prediction): Given the current image and high-level task ("clean the kitchen"), the model autoregressively generates a subtask in natural language ("pick up the plate and put it in the sink").
Timing: This runs every 5-10 seconds — once per atomic manipulation. The model processes all 768 visual tokens + language instruction + history, then generates ~10-20 text tokens describing the next subtask. Latency: ~200-500ms for the autoregressive generation.
Why autoregressive here: Subtask prediction IS a language task. The model leverages its VLM pre-training to reason about what to do next. "I see dirty plates on the counter and an empty sink → the next subtask is to move plates to the sink." This is pure VLM reasoning, no action generation needed.
Stage 2 (action generation): Given the image and the subtask from Stage 1, the model predicts continuous low-level actions via flow matching. This is the same action generation as pi-0: noise → 10 denoising steps → clean action chunk [50, 11] (50 timesteps x 11 action dimensions).
Timing: Runs at 10 Hz. Each call generates a 5-second action chunk, but only the first 0.5-1 second is executed before re-planning with fresh observations. This gives closed-loop behavior despite the chunk-based architecture.
What if Stage 1 predicts the wrong subtask? The action expert will execute it faithfully — leading to incorrect behavior. But because Stage 1 re-runs every 5-10 seconds, the model self-corrects: it sees the new scene state, recognizes the situation, and predicts a better subtask. This is analogous to a human realizing "wait, I should've grabbed the sponge first" and changing plans.
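The interplay of the two stages can be sketched as a control loop. Everything model-related here is a stand-in: `predict_subtask` and `predict_action_chunk` are hypothetical placeholders for the two inference modes, and the loop counts steps instead of reading a clock. The timing constants follow the text (10 Hz control, 5-second chunks, roughly 0.5 s executed per chunk, subtask refreshed every 5 s).

```python
CONTROL_HZ = 10
CHUNK_LEN = 50             # 5 s of actions at 10 Hz
EXECUTE_STEPS = 5          # execute 0.5 s of each chunk before re-planning
SUBTASK_PERIOD_STEPS = 50  # refresh the subtask every 5 s

def predict_subtask(observation, task):
    """Hypothetical Stage 1 stand-in: returns a text subtask."""
    return f"subtask for '{task}' at step {observation['step']}"

def predict_action_chunk(observation, subtask):
    """Hypothetical Stage 2 stand-in: returns CHUNK_LEN placeholder actions."""
    return [(subtask, observation["step"] + i) for i in range(CHUNK_LEN)]

def run_episode(task, total_steps):
    """Run the two-stage loop and count how often each stage is called."""
    log = {"subtask_calls": 0, "chunk_calls": 0, "executed": 0}
    subtask = None
    step = 0
    while step < total_steps:
        observation = {"step": step}  # a real system would read cameras here
        if step % SUBTASK_PERIOD_STEPS == 0:
            subtask = predict_subtask(observation, task)
            log["subtask_calls"] += 1
        chunk = predict_action_chunk(observation, subtask)
        log["chunk_calls"] += 1
        for _action in chunk[:EXECUTE_STEPS]:  # discard the rest of the chunk
            if step >= total_steps:
                break
            step += 1
            log["executed"] += 1
    return log

log = run_episode("clean the kitchen", total_steps=150)
print(log)
```

A wrong subtask costs at most one subtask period in this scheme, because the next Stage 1 call sees the updated scene and can issue a different instruction.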
| Stage | Computation | Hardware | Latency |
|---|---|---|---|
| Image capture | 3 cameras @ 640x480, resize to 224x224 | USB cameras | ~10ms |
| SigLIP encoding | 3 images → 768 visual tokens | A100 GPU | ~8ms |
| Stage 1 (subtask) | Autoregressive text generation | A100 GPU | ~300ms (every 5-10s) |
| Stage 2 (actions) | 10 flow matching denoising steps | A100 GPU | ~30ms (every 100ms) |
| Motor execution | Send joint commands to robot | Robot controller | ~1ms |
Total first-action latency: ~50ms (when subtask is already predicted). The robot carries an onboard NVIDIA Jetson for SigLIP encoding and streams tokens to a nearby A100 server for transformer inference.
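The latency figures add up as follows. The per-stage numbers are the approximate values from the table; treating the stages as strictly sequential, and the `cold_start_ms` case where Stage 1 must run first, are assumptions of this sketch.

```python
# Approximate per-stage latencies in milliseconds, from the table.
latency_ms = {
    "image_capture": 10,
    "siglip_encoding": 8,
    "stage2_actions": 30,
    "motor_execution": 1,
}
STAGE1_SUBTASK_MS = 300  # only paid when a new subtask is generated
CONTROL_HZ = 10

# Assumes the stages run one after another with no overlap.
first_action_ms = sum(latency_ms.values())
cold_start_ms = first_action_ms + STAGE1_SUBTASK_MS
control_period_ms = 1000 / CONTROL_HZ
stage2_fits_in_period = latency_ms["stage2_actions"] < control_period_ms

print(f"first action (subtask cached): ~{first_action_ms} ms")
print(f"first action (subtask needed): ~{cold_start_ms} ms")
print(f"Stage 2 fits in one {control_period_ms:.0f} ms control period: {stage2_fits_in_period}")
```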
Here's an actual execution trace from the paper's kitchen experiment (simplified):
| Time | Stage 1 Output | Stage 2 Actions | Outcome |
|---|---|---|---|
| 0:00 | "Close the open cabinet door" | Navigate to cabinet, extend arm, push door | Success |
| 0:35 | "Pick up the mug on the counter" | Approach counter, grasp mug, lift | Success |
| 1:10 | "Place mug in the sink" | Navigate to sink, lower arm, release | Success |
| 1:45 | "Pick up the plate on the counter" | Approach plate, grasp edge, lift | FAIL (plate slipped) |
| 2:00 | "Pick up the plate on the counter" (retry) | Re-approach, adjust grasp angle, lift | Success |
| 2:30 | "Place plate in the sink" | Navigate to sink, lower, release | Success |
| 3:05 | "Wipe the counter with the sponge" | Grasp sponge, wipe in sweeping motion | Partial (missed a spot) |
Total: 7 subtasks over ~3.5 minutes (partial trace of a 10-15 minute episode). Note the recovery at 2:00 — the model re-observed the scene, detected the failed grasp, and autonomously re-attempted with a different approach angle. This recovery capability comes from pre-training on diverse data that includes many partial failures and corrections.
The magic of pi-0.5 is in the data mixture. Five sources of supervision, each contributing something unique:
| Source | What It Provides | Volume | Format |
|---|---|---|---|
| MM | Target robot manipulation in 50+ environments | ~400 hours | Images + actions + language labels |
| ME | Diverse scenes from non-mobile robots | ~200 hours | Images + actions (different action space) |
| CE | Different robot types (cross-embodiment) | ~500 hours (OXE) | Images + actions (heterogeneous) |
| HL | Subtask prediction (what to do next) | ~50K episodes | Images + text labels (no actions) |
| Web | Visual understanding (COCO, QA, image-text) | ~10M examples | Images + text (no actions, no robot) |
These five sources have completely different formats. How do you train one model on all of them? The key: different loss functions applied to different token positions, all in the same forward pass.
The model sees all data types interleaved in each batch. A single batch might contain: 25% MM episodes, 15% ME, 20% CE, 15% HL, 25% web. The loss computation only applies to the relevant token positions for each example.
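A minimal sketch of the idea, under stated assumptions: the mixture weights are the example percentages from the text, and which loss terms apply to each source is read off the Format column of the table (sources with actions get an action loss, sources with text labels get a text loss). The function names and the scalar "losses" are placeholders, not the training code.

```python
import random
from collections import Counter

# Example batch mixture from the text.
MIXTURE = {"MM": 0.25, "ME": 0.15, "CE": 0.20, "HL": 0.15, "Web": 0.25}

# Assumed mapping from source to supervised token types, inferred from the
# Format column: text labels -> text loss, actions -> action loss.
LOSS_TERMS = {
    "MM": ("text", "action"),
    "ME": ("action",),
    "CE": ("action",),
    "HL": ("text",),
    "Web": ("text",),
}

def sample_sources(batch_size, seed=0):
    """Draw the data source for each example in a batch."""
    rng = random.Random(seed)
    names = list(MIXTURE)
    weights = [MIXTURE[name] for name in names]
    return rng.choices(names, weights=weights, k=batch_size)

def example_loss(source, text_loss, action_loss):
    """Sum only the loss terms that apply to this example's source."""
    terms = {"text": text_loss, "action": action_loss}
    return sum(terms[name] for name in LOSS_TERMS[source])

batch = sample_sources(2048)
counts = Counter(batch)
print({name: round(counts[name] / len(batch), 3) for name in MIXTURE})
print(example_loss("Web", text_loss=1.0, action_loss=5.0))  # text term only
```

In a real implementation the masking would happen per token position inside one forward pass, but the effect is the same: a web example never contributes an action gradient, and an action-only example never contributes a text gradient.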
Consider the robot encountering a new kitchen with a never-seen coffee mug on the counter:
No single source provides all this knowledge. Together, they let the robot successfully handle a novel mug in a novel kitchen.
Training happens in two phases, each with a different action representation:
Pre-training with discrete tokens lets the model learn from ALL data sources in a unified format — robot data, web data, language, subtask predictions — all as token sequences. This is where the broad knowledge comes from.
Post-training with flow matching replaces the discrete action head with a continuous one. Flow matching produces smoother, more precise motor commands than discretized tokens. This is where fine-grained control comes from.
FAST (Frequency-space Action Sequence Tokenization) is a key engineering innovation that makes the two-stage recipe possible. It uses DCT (Discrete Cosine Transform) to compress action chunks:
This is dramatically more efficient than RT-2-style binning, which spends one token per action dimension per timestep (50 x 11 = 550 tokens for the same [50, 11] chunk). FAST makes it possible to pre-train on actions using the same token-prediction machinery as language.
How much information is lost in the DCT truncation? For typical manipulation trajectories:
K=8 is the sweet spot for pre-training. The model learns general trajectory shapes while leaving room for language and visual tokens. Post-training with flow matching then recovers full continuous precision that FAST's quantization loses.
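The truncation step can be illustrated with a small NumPy sketch. This is not the FAST tokenizer: the real tokenizer does more than this (the coefficients are quantized and further compressed into tokens), while the sketch only keeps the K lowest-frequency DCT coefficients per action dimension and reconstructs from them. The synthetic trajectory and all function names are assumptions for illustration.

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II basis as an (n, n) matrix; its transpose inverts it."""
    k = np.arange(n)[:, None]
    i = np.arange(n)[None, :]
    basis = np.sqrt(2.0 / n) * np.cos(np.pi * (i + 0.5) * k / n)
    basis[0, :] = np.sqrt(1.0 / n)
    return basis

def compress_chunk(chunk, keep):
    """Keep the `keep` lowest-frequency coefficients per action dimension."""
    coeffs = dct_matrix(chunk.shape[0]) @ chunk  # (T, D) coefficients
    return coeffs[:keep]                         # (keep, D)

def reconstruct_chunk(coeffs, horizon):
    """Zero-pad the dropped coefficients and apply the inverse transform."""
    padded = np.zeros((horizon, coeffs.shape[1]))
    padded[: coeffs.shape[0]] = coeffs
    return dct_matrix(horizon).T @ padded

# Synthetic smooth [50, 11] trajectory (an assumption, not robot data).
t = np.linspace(0.0, 1.0, 50)[:, None]
chunk = np.sin(np.pi * t * np.arange(1, 12)[None, :] / 4.0)

coeffs = compress_chunk(chunk, keep=8)
error_k8 = np.abs(reconstruct_chunk(coeffs, 50) - chunk).max()
error_k2 = np.abs(reconstruct_chunk(compress_chunk(chunk, 2), 50) - chunk).max()

print(f"values kept: {coeffs.size} of {chunk.size}")
print(f"max error with K=8: {error_k8:.4f}, with K=2: {error_k2:.4f}")
```

Smooth trajectories concentrate their energy in the low frequencies, which is why a handful of coefficients per dimension can describe most of a 50-step chunk.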
| Parameter | Pre-training (Phase 1) | Post-training (Phase 2) |
|---|---|---|
| Hardware | 64 TPU v5e pods | 16 TPU v5e pods |
| Steps | 280K | 50K-100K |
| Batch size | 2048 (mixed sources) | 256 (target robot only) |
| Learning rate | 1e-4 → 1e-5 (cosine) | 5e-6 (constant) |
| Duration | ~5 days | ~1-2 days |
| Action representation | FAST discrete tokens | Flow matching (continuous) |
| VLM backbone | Unfrozen (adapts) | Frozen (preserved) |
| Data sources | All 5 (MM+ME+CE+HL+Web) | MM only (target robot) |
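For reference, the pre-training learning-rate entry ("1e-4 → 1e-5 (cosine)" over 280K steps) corresponds to a schedule like the following. This is a sketch of a plain cosine decay; whether the actual run used warmup or a different end point is not stated in the table, so the function below assumes none.

```python
import math

def cosine_lr(step, total_steps=280_000, lr_start=1e-4, lr_end=1e-5):
    """Cosine decay from lr_start to lr_end, with no warmup (an assumption)."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return lr_end + 0.5 * (lr_start - lr_end) * (1.0 + math.cos(math.pi * progress))

for step in (0, 70_000, 140_000, 210_000, 280_000):
    print(f"step {step:>7}: lr = {cosine_lr(step):.2e}")
```

Post-training, by contrast, uses a constant 5e-6, which is below even the final pre-training rate.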
An ablation study showed that removing web data during pre-training causes a 15-20% drop in novel object recognition. The model can still manipulate objects it saw during robot training, but fails to generalize to new objects — exactly because it has forgotten the visual recognition capabilities from web pre-training. The co-training recipe prevents this forgetting.
| Scenario | Without Web Data | With Web Data |
|---|---|---|
| Novel mug (travel thermos) | Doesn't recognize as drinkware. Ignores it. | Recognizes as mug variant. Adapts grasp. |
| Spill on dark granite counter | Can't segment spill. Doesn't attempt to clean. | Detects via texture/reflection. Initiates wipe. |
| Open cabinet (glass-front style) | Fails — never seen this cabinet type in robot data. | Understands "cabinet" as a category. Adapts approach. |
| "Put away the utensils" | Can't identify the utensil drawer in a new kitchen. | Recognizes utensil tray pattern from web images. Opens correct drawer. |
The web data doesn't teach the robot to move — it teaches it to see. For open-world generalization, seeing correctly is half the battle.
The headline result: pi-0.5 can clean kitchens and bedrooms in entirely new homes not seen during training. These are 10-15 minute tasks involving multiple stages.
Given "clean the kitchen," pi-0.5 autonomously:
Each step is predicted by the model itself (high-level subtask), then executed with flow matching actions. The full sequence takes 10-15 minutes.
Even pi-0.5 has limits. The degradation patterns reveal what the model has truly learned vs. what it has memorized:
| Condition | Approx. Success Rate (Impact) | Explanation |
|---|---|---|
| New kitchen, familiar objects | ~75% (mild drop) | Layout understanding generalizes from web data |
| New kitchen, novel objects | ~55% (moderate drop) | VLM recognizes most objects but grasping novel shapes is harder |
| Very cluttered scenes | ~40% (significant drop) | Visual segmentation struggles with many overlapping objects |
| Glass/transparent objects | ~30% (large drop) | Depth estimation fails — cameras can't detect transparent surfaces well |
| Narrow spaces (under cabinets) | ~35% (large drop) | Mobile base can't position arm correctly — workspace constraint |
| Ambiguous instructions | ~50% (moderate) | Model picks a reasonable interpretation but may not match user intent |
Mobile manipulation adds a dimension of difficulty absent in tabletop robotics: the robot must decide when and where to move its base. This decision is implicit in the action space — the model outputs base velocities alongside arm joint velocities. But the consequences are very different:
These challenges explain why pi-0.5 runs at 10 Hz (not 50 Hz) and uses 5-second action chunks (not 1-second): mobile tasks require more planning horizon per chunk, and the base can't respond to new observations as quickly as a stationary arm.
Over 10-15 minute tasks, errors compound. The most common failure patterns:
How does generalization scale with the number of training environments? The paper shows a clear trend: more diverse training scenes = better generalization to new scenes.
The paper tests performance as the number of unique training environments increases:
| Training Environments | Success in New Homes | Gain per 2x Environments |
|---|---|---|
| 5 environments | ~15% | Baseline |
| 10 environments | ~30% | +15 pp |
| 25 environments | ~50% | ~+15 pp (+20 pp over a 2.5x increase) |
| 50+ environments | ~70% | ~+20 pp |
The scaling is roughly logarithmic: each doubling of environments adds a similar 15-20 pp, which means each individual new environment contributes less than the one before it. Crucially, performance hasn't plateaued at 50 environments. More environments would likely continue improving performance, but at a growing data-collection cost per percentage point.
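One way to check how the trend behaves is to recompute the gain per doubling directly from the success rates in the table. The sketch takes the approximate percentages at face value and treats "50+" as exactly 50, both of which are simplifications.

```python
import math

# (training environments, approximate success % in new homes), from the table.
# "50+" is treated as exactly 50, which is a simplification.
scaling = [(5, 15), (10, 30), (25, 50), (50, 70)]

gains_per_doubling = []
for (env_prev, succ_prev), (env_next, succ_next) in zip(scaling, scaling[1:]):
    doublings = math.log2(env_next / env_prev)
    gains_per_doubling.append((succ_next - succ_prev) / doublings)

gains_per_environment = [
    (succ_next - succ_prev) / (env_next - env_prev)
    for (env_prev, succ_prev), (env_next, succ_next) in zip(scaling, scaling[1:])
]

print([round(g, 1) for g in gains_per_doubling])     # pp per 2x environments
print([round(g, 2) for g in gains_per_environment])  # pp per added environment
```

The per-environment gain shrinks steadily even when the per-doubling gain holds up, which is the practical meaning of logarithmic scaling: each fixed improvement costs about twice as many environments as the last.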
But the scaling isn't just about quantity — it's about diversity. Adding more data from the same kitchen helps less than adding data from a different kitchen. And adding data from a completely different robot or from the web helps in ways that more same-robot data cannot.
A practical concern: how expensive is this data to collect?
| Data Source | Cost per Hour | Hours Needed | Total Cost |
|---|---|---|---|
| MM (teleoperated) | ~$50/hr (operator + robot time) | ~400 | ~$20K |
| ME (multi-environment) | ~$50/hr | ~200 | ~$10K |
| CE (cross-embodiment, OXE) | Pre-existing public data | ~500 | $0 (already collected) |
| HL (subtask labels) | ~$5/episode annotation | 50K episodes | ~$250K |
| Web data | Free (public datasets) | N/A | $0 |
| Total estimated | | | ~$280K |
The most expensive component is the human annotation (HL labels), not the robot data collection. This suggests that auto-labeling (using VLMs to generate subtask labels from video) could dramatically reduce the cost of scaling. pi-0.7 explores this direction with automated subtask annotation.
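The cost estimate is straightforward to reproduce. The rates and volumes are the approximate figures from the table; the dictionary keys are just labels for this sketch.

```python
# Approximate unit costs and volumes from the table (USD).
costs = {
    "MM (teleoperated)": 50 * 400,          # $/hr x hours
    "ME (multi-environment)": 50 * 200,     # $/hr x hours
    "CE (public OXE data)": 0,              # already collected
    "HL (subtask labels)": 5 * 50_000,      # $/episode x episodes
    "Web data": 0,                          # public datasets
}

total_cost = sum(costs.values())
hl_share_pct = 100 * costs["HL (subtask labels)"] / total_cost

print(f"total: ${total_cost:,}")
print(f"annotation share: {hl_share_pct:.0f}% of the total")
```

With annotation at roughly nine tenths of the budget, even a partial switch to automated labeling would change the total far more than cheaper teleoperation would.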
The relationship between compute and performance:
The paper systematically removes each data source to measure its contribution. The finding: every source matters.
| Ablation | Effect on New Homes | Why |
|---|---|---|
| Remove ME (Multi-Env) | -18 pp | Scene diversity is critical — fewer environments = more overfitting to layouts |
| Remove CE (Cross-Embodiment) | -12 pp | Manipulation knowledge from other robots provides grasp primitives |
| Remove HL (High-Level) | -15 pp on long tasks | Model loses planning ability — gets stuck after 2-3 subtasks |
| Remove Web data | -20 pp on novel objects | Visual understanding suffers — can't recognize unfamiliar items |
| Remove verbal instructions | -10 pp | Task specification becomes ambiguous — model guesses what to do |
The most surprising ablation result: removing verbal instructions (providing only images, no language) drops performance by only 10 pp. This suggests that pi-0.5 has learned to infer task intent from visual context alone: seeing a messy kitchen is enough to trigger cleaning behavior without being told "clean the kitchen." A plausible explanation is that the VLM's web pre-training includes many images of clean and messy spaces, implicitly teaching the concept of "this needs cleaning."
However, language becomes critical when the task is ambiguous ("put the mug in the cabinet" vs "put the mug in the sink" — visually, both cabinets and sinks are present). Without language, the model defaults to the statistically most common action for each object, which is often but not always correct.
More interesting than individual ablations are the interaction effects. Removing two sources simultaneously hurts more than the sum of removing each individually:
This superlinear degradation indicates that the data sources don't just add independent knowledge; they amplify each other. The whole is greater than the sum of its parts.
An interesting negative result: adding depth cameras (RGBD instead of RGB) did NOT significantly help. The model learns depth understanding implicitly from stereo cues across its three cameras. This is important because depth cameras are fragile, expensive, and fail on transparent/reflective surfaces. pi-0.5 works with cheap RGB cameras only.
The paper also ablates the two-stage training approach (FAST discrete → flow matching) vs single-stage alternatives:
| Approach | Pre-training Data | Post-training | Result |
|---|---|---|---|
| FAST → Flow (pi-0.5) | All 5 sources | Flow matching | Best overall |
| Flow only | Robot data only (can't use web/HL) | Flow matching | -20 pp (no co-training) |
| FAST only | All 5 sources | FAST discrete | -8 pp (quantization artifacts) |
| RT-2 style tokens | All 5 sources | 256-bin tokens | -15 pp (coarse + no fine control) |
The two-stage approach wins because it gets the best of both worlds: FAST enables co-training on heterogeneous data (you need tokens to train alongside language), and flow matching provides the continuous precision that discrete tokens can't match.
pi-0.5 represents a major step in the VLA lineage:
pi-0.5 changes the economics of robot deployment. Before pi-0.5, deploying a robot in a new home required:
Cost per home: ~$5,000-10,000 in human labor + compute. Time to deploy: ~2 weeks.
With pi-0.5's open-world generalization:
Cost per home: ~$0 marginal (the training cost is amortized across all deployments). Time to deploy: immediate. This is the difference between a research prototype and a viable product.
Even with zero-shot generalization, real-world deployment faces non-ML challenges:
| Lesson | Implication |
|---|---|
| Co-training beats single-source | Build pipelines that ingest heterogeneous data, not just robot demos |
| Web data prevents forgetting | Always include VLM-style data to maintain visual understanding |
| Two-stage inference works | Same model can plan AND execute — no separate planner needed |
| Diversity > Volume | 10 new environments beats 1000 hours in existing ones |
| Depth cameras are optional | RGB-only systems are viable for deployment (cheaper, more robust) |
| FAST enables discrete pre-training | You can pre-train with tokens then switch to flow matching |
Despite pi-0.5's achievements, the gap between lab success and true household deployment remains large:
pi-0.5 shows that co-training on diverse data sources is the path to generalization. The open questions are: how far can this scale? Can we add simulation data, human video, and internet-scale interaction data to push generalization even further? pi-0.7 answers part of this — adding structured prompt conditioning to absorb even more diverse data without mode averaging.