A 5B-parameter VLA with emergent compositional generalization — steered by diversified prompts that specify not just WHAT to do, but HOW. Makes espresso, folds laundry, peels vegetables, and transfers zero-shot across robot bodies.
You’re training a robot to clean a kitchen. You have 10,000 demonstrations — some fast, some slow. Some follow one strategy (wipe counter then clear dishes), others follow the opposite order. Some come from one robot arm, others from a completely different platform. Some are great; some made mistakes along the way but eventually succeeded.
A naïve model trained on all this data will average the strategies together. Fast and slow get blended into medium-speed. Left-first and right-first get blended into indecisive hesitation. Good and bad demonstrations get blended into mediocre behavior. The richer and more diverse your dataset, the worse this averaging problem becomes.
This is the curse of multi-modality in behavior cloning. The model sees many valid ways to do a task, can’t distinguish between them at inference time, and produces an incoherent mixture that corresponds to none of them.
Previous VLAs like RT-2 and even π0 take a single language instruction (“clean the kitchen”) and map it directly to actions. This works for simple, atomic tasks. But for complex multi-step tasks, a single instruction is hopelessly underspecified:
Consider a concrete example. Two valid strategies for “clear the table”:
If the model averages: joint 1 = 0 degrees. The arm reaches straight ahead — toward neither target. It hovers indecisively in the center of the table. This isn’t a hypothetical failure mode; it’s the dominant failure pattern in behavior cloning with diverse data.
Flow matching helps (it can represent multi-modal distributions), but only for low-level action diversity. It can’t resolve strategic ambiguity: which side of the table to start with is not a continuous distribution to sample from — it’s a discrete choice that should be made once and committed to.
π0.7’s insight is deceptively simple: if your data is diverse, your prompt must be equally diverse. Instead of conditioning on a single language instruction, π0.7 conditions on a rich, multi-faceted prompt that specifies not just what to do, but how to do it.
Every training episode in π0.7 is annotated with four types of context:
This is an instance of a deeper principle: conditional models can represent multi-modal distributions without averaging, as long as the conditioning variable disambiguates the modes. Gaussian mixture models work the same way — each component has a different mean, but only one fires for each value of the latent variable. π0.7’s prompt plays the role of the latent variable.
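A toy numerical sketch of this principle, assuming a single joint angle with two demonstrated modes at -30 and +30 degrees. These values, and the binary mode label, are hypothetical and chosen only for illustration; they are not taken from π0.7's data. Without conditioning, the squared-error optimum is the average of the modes. With the mode label as a conditioning variable, each mode is recovered.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical demonstrations: half reach left (-30 deg), half reach right (+30 deg).
mode = rng.integers(0, 2, size=1000)               # 0 = left-first, 1 = right-first
target = np.where(mode == 1, 30.0, -30.0)
angle = target + rng.normal(0.0, 1.0, size=1000)   # small teleoperation noise

# Unconditional MSE-optimal prediction: the mean over all demonstrations.
unconditional = angle.mean()

# Conditioning on the mode label gives one MSE-optimal prediction per mode.
conditional = {m: angle[mode == m].mean() for m in (0, 1)}

print(f"unconditional: {unconditional:+.1f} deg")
print(f"mode 0: {conditional[0]:+.1f} deg, mode 1: {conditional[1]:+.1f} deg")
```

The unconditional estimate lands near zero, which matches neither strategy, while the conditional estimates sit on the two demonstrated modes.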
| Aspect | π0 | π0.5 | π0.7 |
|---|---|---|---|
| Conditioning | Language instruction only | Language + subtask (from same model) | Language + subtask + subgoal image + metadata + control mode |
| Model size | 3B (PaLiGemma) | 3B (PaLiGemma) | 5B (Gemma 3 + 860M action expert) |
| Video history | Current frame only | Current frame + recent frames | MEM encoder compresses full history |
| Data quality | All treated equally | All treated equally | Quality/speed/mistake metadata per episode |
| Cross-embodiment | Robot ID token | Robot ID token | Control mode + robot metadata |
The diversified prompt isn’t just a training trick — it makes the model steerable at deployment. You can:
π0.7 is a 5-billion parameter Vision-Language-Action model built from three major components, each serving a distinct role. Let’s walk through the full architecture.
The backbone is Gemma 3, a 4-billion parameter vision-language model pre-trained on internet-scale text and image data. It processes the language instruction, subtask instructions, episode metadata, and images through its standard multimodal transformer architecture.
The VLM’s job is understanding: parsing the instruction, recognizing objects in the scene, grounding language to visual features. It outputs rich contextual representations that tell the action expert what the robot should do.
Raw observation history is expensive — sending all camera frames through the full VLM would be prohibitively slow. π0.7 uses Multi-Scale Embodied Memory (MEM), a specialized video encoder that compresses the robot’s observation history into a compact set of memory tokens.
MEM processes frames at multiple temporal scales: recent frames at high resolution, older frames at lower resolution. This gives the model a sense of what has happened without overwhelming the context window. The compressed memory tokens are injected into the VLM’s context alongside the language and image tokens.
Concrete numbers: MEM compresses the last 10 seconds of observation history (3 cameras x 10 fps x 10 seconds = 300 frames) into just 64 memory tokens. Without MEM, representing 300 frames at 256 tokens each would require 300 x 256 = 76,800 visual tokens, far too many to push through the VLM backbone at control rates. MEM achieves a 1200x compression ratio.
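The arithmetic behind these figures can be checked in a few lines. Every number below is one quoted in the text; only the variable names are mine.

```python
# Figures quoted in the text.
cameras, fps, seconds = 3, 10, 10
tokens_per_frame = 256      # visual tokens per camera frame
memory_tokens = 64          # size of MEM's compressed output

frames = cameras * fps * seconds
raw_tokens = frames * tokens_per_frame
compression = raw_tokens / memory_tokens

print(frames, raw_tokens, compression)  # 300 76800 1200.0
```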
Why does temporal history matter? Several manipulation scenarios require it:
In the ablation, removing MEM drops performance by ~10 percentage points (pp) on multi-step tasks but barely affects single-step tasks. History only matters when the task requires memory of what happened before.
The action expert is an 860-million parameter transformer that generates robot actions via flow matching. It takes the VLM’s output representations as conditioning and iteratively denoises a random noise vector into a sequence of precise robot joint positions or end-effector poses.
Flow matching is chosen over discrete tokenization because robot actions are continuous — joint angles and Cartesian coordinates live in smooth, high-dimensional spaces where discretization introduces unnecessary quantization error.
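To make "iteratively denoises" concrete, here is a minimal sketch of flow-matching sampling with Euler integration. It assumes the convention that noise sits at t=0 and the action chunk at t=1, and it uses a toy closed-form velocity field in place of the learned action expert. The function names, the `target_chunk` array, and the toy field are all hypothetical; only the chunk shape [50, 14] and the 10 integration steps come from the article.

```python
import numpy as np

def sample_action_chunk(velocity_fn, horizon=50, action_dim=14, steps=10, seed=0):
    """Integrate dx/dt = v(x, t) from noise (t=0) to an action chunk (t=1)."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((horizon, action_dim))   # start from pure noise
    dt = 1.0 / steps
    for i in range(steps):
        t = i * dt
        x = x + dt * velocity_fn(x, t)               # one Euler denoising step
    return x

# Toy stand-in for the learned network: the velocity field of a straight-line
# path from the current sample to a fixed target chunk (an assumption).
target_chunk = np.linspace(0.0, 1.0, 50)[:, None] * np.ones((1, 14))

def toy_velocity(x, t):
    return (target_chunk - x) / (1.0 - t)

chunk = sample_action_chunk(toy_velocity)
print(chunk.shape)  # (50, 14)
```

In the real model the velocity would be predicted by the action expert, conditioned on the VLM's representations. The toy field is chosen so that the sampler provably lands on `target_chunk`, which makes the loop easy to verify.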
| Token Type | Count | Source | Processed By |
|---|---|---|---|
| Visual tokens (current frame, 3 cams) | 768 | SigLIP encoder | VLM backbone |
| MEM history tokens | 64 | MEM encoder | VLM backbone |
| Language instruction | ~20-50 | Gemma tokenizer | VLM backbone |
| Subtask instruction | ~10-20 | Gemma tokenizer | VLM backbone |
| Subgoal image | 256 | SigLIP encoder | VLM backbone |
| Metadata (speed/quality/mode) | ~5-10 | Numerical encoding | VLM backbone |
| Action tokens (flow matching) | 50 | Denoising | Action expert |
| Total | ~1200 | | |
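
As a sanity check on the ~1200 total, the per-row ranges in the table can be summed directly. The counts are copied from the table; the dictionary layout is just for illustration.

```python
# (low, high) token counts per prompt component, copied from the table.
budget = {
    "visual (3 cams)": (768, 768),
    "MEM history": (64, 64),
    "language instruction": (20, 50),
    "subtask instruction": (10, 20),
    "subgoal image": (256, 256),
    "metadata": (5, 10),
    "action tokens": (50, 50),
}

low = sum(lo for lo, _ in budget.values())
high = sum(hi for _, hi in budget.values())
print(low, high)  # 1173 1218
```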
At inference, the pipeline flows like this:
Total latency: ~50ms from observation to first action. The model runs at roughly 2 Hz re-planning frequency with action chunking, generating 1-second action chunks that overlap for smooth execution.
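The timing relationship between chunk length and re-planning rate can be sketched as follows. The 50 Hz control rate, 50-step chunks, and 2 Hz re-planning are from the article. How overlapping chunks are merged is not specified there, so this sketch assumes the simplest policy: switch to the newest chunk as soon as it arrives.

```python
control_hz = 50     # motor command rate
chunk_steps = 50    # actions per chunk
replan_hz = 2       # how often a new chunk is generated

chunk_seconds = chunk_steps / control_hz              # 1.0 s per chunk
steps_between_replans = control_hz // replan_hz       # 25 steps executed per chunk
overlap_steps = chunk_steps - steps_between_replans   # 25 steps superseded

def executed_actions(chunks):
    """Assumed policy: execute only the head of each chunk, then switch to the next."""
    out = []
    for chunk in chunks:
        out.extend(chunk[:steps_between_replans])
    return out

# Four re-plans cover two seconds of execution; actions are (chunk_id, step) tags.
chunks = [[(k, i) for i in range(chunk_steps)] for k in range(4)]
print(chunk_seconds, overlap_steps, len(executed_actions(chunks)))  # 1.0 25 100
```

Under this assumption, half of every chunk is never executed; it exists as a buffer in case the next chunk arrives late.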
The first and most important component of the diversified prompt is the subtask instruction: a short language description of the immediate next step the robot should take.
Consider the task “make espresso.” This is a high-level instruction that encompasses dozens of individual manipulations:
Each subtask is an atomic manipulation that the model can execute with a single action chunk (or a few chunks). The subtask instruction tells the model which atomic step to execute right now.
π0.7 gets subtask labels from three sources:
The training dataset contains approximately 200 unique subtask types. The distribution is long-tailed:
| Subtask Category | Examples | Frequency |
|---|---|---|
| Pick/place | “Pick up the mug”, “Place in sink” | ~35% of segments |
| Open/close | “Open cabinet”, “Close drawer” | ~15% |
| Tool use | “Press button”, “Turn knob” | ~12% |
| Bimanual | “Fold in half”, “Hold and pour” | ~10% |
| Navigation | “Move to counter”, “Approach fridge” | ~8% |
| Cleaning | “Wipe surface”, “Scrub plate” | ~8% |
| Other | “Peel vegetable”, “Tamp grounds” | ~12% |
The long tail matters: rare subtasks like “tamp espresso grounds” appear in only a few dozen episodes. But because the model learns compositional representations, it can execute these rare subtasks by combining patterns from more common ones (pressing downward + maintaining contact = tamping).
This is where π0.7’s emergent compositional generalization comes from. The model learns to execute individual atomic subtasks (“open door,” “pick up object,” “place in container”). At inference, you can compose these subtasks in novel sequences to achieve tasks never seen in training.
Worked example: The model has never been trained on “make a peanut butter sandwich.” But it has learned:
A human coach provides: “open the peanut butter jar” → “scoop peanut butter with the knife” → “spread on the bread” → “close the jar.” Each subtask maps to a known skill. The novel composition produces a novel task.
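The coaching loop itself is simple, and a sketch makes that clear. Everything here is hypothetical: `run_coached_task` and `stub_policy` are illustrative names, and the stub stands in for the real policy, which would run the robot until the subtask completes or fails.

```python
from typing import Callable, List

def run_coached_task(subtasks: List[str],
                     execute_subtask: Callable[[str], bool]) -> List[str]:
    """Execute coached subtasks in order; stop at the first one that fails."""
    completed = []
    for subtask in subtasks:
        if not execute_subtask(subtask):
            break
        completed.append(subtask)
    return completed

# Hypothetical stand-in for the policy: succeeds only on verbs it "knows".
KNOWN_VERBS = {"open", "scoop", "spread", "close"}

def stub_policy(subtask: str) -> bool:
    return subtask.split()[0] in KNOWN_VERBS

sandwich = [
    "open the peanut butter jar",
    "scoop peanut butter with the knife",
    "spread on the bread",
    "close the jar",
]
completed = run_coached_task(sandwich, stub_policy)
print(len(completed), "of", len(sandwich), "subtasks completed")
```

The point of the sketch is that nothing in the loop knows about sandwiches: the novelty lives entirely in the sequence the coach supplies.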
Language subtask instructions tell the robot what to do semantically. But language is inherently imprecise about spatial details. “Place the cup on the counter” doesn’t specify where on the counter, at what angle, or how far from the edge. Subgoal images fill this gap.
A subgoal image is a generated image of what the world should look like after the current subtask is completed. If the subtask is “close the fridge door,” the subgoal image shows the fridge with the door closed. If the subtask is “place portafilter in group head,” the subgoal image shows the portafilter locked in position.
These images are not real photographs — they are synthesized by a world model. π0.7 uses BAGEL, an image generation model, conditioned on the current observation and the subtask instruction, to imagine the near-future visual state.
BAGEL generates surprisingly accurate near-future predictions for rigid scenes (fridge doors, drawers, objects on counters). But quality degrades predictably:
The model learns to close the gap between current and subgoal images through action generation. This is effectively visual servoing with a learned world model as the reference generator.
Consider two scenarios where language fails:
Subgoal images provide pixel-level specificity about the desired world state, so each subtask comes with a concrete visual target rather than a verbal approximation.
The subgoal image system has specific failure modes:
Here’s a problem that every robotics lab faces: not all demonstrations are equally good. Some are fast and smooth. Some are slow and careful. Some contain mistakes — the robot dropped the object, picked it up again, and eventually succeeded. Do you throw away the imperfect data?
Throwing it away is wasteful. But including it without annotation causes the model to learn the mistakes along with the successes. π0.7’s solution: label each episode with metadata so the model can learn from all data while understanding what makes each episode different.
Speed — the episode length in timesteps. A fast demonstration of “pick up the cup” might be 30 steps; a slow, careful one might be 120 steps. By conditioning on speed, the model learns the relationship between pace and behavior without averaging fast and slow together.
Quality — a score from 1 to 5 indicating how well the demonstration was executed. Quality 5 means clean, efficient execution. Quality 1 means the robot struggled, made errors, but eventually completed the task. At inference, you simply set quality=5 to get the best behavior.
Mistake labels — binary flags on each segment indicating whether it contains an error (dropped object, collision, wrong grasp). This lets the model learn what not to do from the mistake segments while still learning useful recovery behaviors.
The metadata is converted to text tokens and prepended to the language instruction:
This is simple but effective. The VLM backbone processes these tokens alongside the instruction, learning to condition its representations on the metadata. During training, the actual metadata values are used. During inference, you set desired values (quality=5, speed=fast).
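A minimal sketch of what such a text prefix could look like. The field names, separators, and the `build_prompt` helper are assumptions made for illustration; the article does not specify the exact token format π0.7 uses.

```python
def build_prompt(instruction, subtask=None, speed=None, quality=None,
                 mistake=None, control_mode=None):
    """Prepend metadata fields to the instruction as plain text (hypothetical format)."""
    fields = []
    if speed is not None:
        fields.append(f"speed={speed}")
    if quality is not None:
        fields.append(f"quality={quality}")
    if mistake is not None:
        fields.append(f"mistake={str(mistake).lower()}")
    if control_mode is not None:
        fields.append(f"mode={control_mode}")
    parts = [" ".join(fields), instruction]
    if subtask:
        parts.append(f"subtask: {subtask}")
    return " | ".join(p for p in parts if p)

# Training: use the values actually recorded for the episode.
train_prompt = build_prompt("clean the kitchen", subtask="wipe the counter",
                            speed=120, quality=3, mistake=False, control_mode="joint")

# Deployment: ask for the behavior you want.
deploy_prompt = build_prompt("clean the kitchen", subtask="wipe the counter",
                             speed=40, quality=5, mistake=False, control_mode="joint")
print(deploy_prompt)
```

The same function serves both phases; the only difference is whether the metadata values are observed or requested.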
Speed is encoded as the episode length in timesteps, normalized to a human-readable range. The model sees speed values from ~20 (very fast, aggressive motion) to ~200 (very slow, careful manipulation). At inference:
The model doesn’t just scale its velocities linearly. At low speed values (short, fast episodes), it takes fundamentally different trajectories: more direct, less cautious. At high speed values (long, slow episodes), it adds clearance motions (lifting higher before placing) and approach-from-above strategies. The model learned that slow demonstrations tend to be more careful because the human teleoperator chose to be careful, not just slow.
Mistake-labeled segments are powerful but tricky. Consider a demonstration where the robot drops an object at timestep 150, then recovers by re-grasping at timestep 200:
During training, the model sees all three segments. When conditioned on mistake=false, it learns the successful approach and the recovery re-grasp. When conditioned on mistake=true, it learns what a drop looks like (useful for knowing what to avoid). At inference with mistake=false, the model skips the dropping behavior entirely while still knowing how to recover if something goes wrong — because it saw the recovery segment labeled mistake=false.
This is genuinely clever: the mistake labels don’t just filter out bad data. They decompose imperfect episodes into useful components. Every demonstration, no matter how messy, contributes something.
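The decomposition can be sketched as a small segmentation routine. The drop at timestep 150 and the re-grasp at 200 come from the example above; the episode length of 300 and the `Segment` / `split_episode` names are assumptions for illustration.

```python
from dataclasses import dataclass

@dataclass
class Segment:
    start: int      # inclusive timestep
    end: int        # exclusive timestep
    mistake: bool

def split_episode(length, mistake_spans):
    """Split [0, length) into clean and mistake segments given annotated error spans."""
    segments, cursor = [], 0
    for start, end in sorted(mistake_spans):
        if start > cursor:
            segments.append(Segment(cursor, start, False))   # clean stretch before the error
        segments.append(Segment(start, end, True))           # the error itself
        cursor = end
    if cursor < length:
        segments.append(Segment(cursor, length, False))      # recovery and completion
    return segments

# Drop at t=150, recovery re-grasp begins at t=200; episode length is hypothetical.
segments = split_episode(300, [(150, 200)])
for seg in segments:
    print(seg)
```

Only the middle segment carries mistake=true, so the approach and the recovery both remain available as positive training signal.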
Consider 100 demonstrations of “pick up the mug”:
| Quality | Count | Behavior Pattern | Useful For |
|---|---|---|---|
| 5 | 20 | Clean top-down grasp, smooth lift, no hesitation | Deployment (set quality=5) |
| 4 | 30 | Correct grasp but slightly slow approach | Robust grasping strategies |
| 3 | 25 | Slightly misaligned grasp, minor correction needed | Recovery behaviors |
| 2 | 15 | First grasp failed, re-approached, eventually succeeded | Failure recovery, retry strategies |
| 1 | 10 | Multiple failures, eventually succeeded after 3+ attempts | Extreme recovery, workspace exploration |
Without metadata: the model would average all 100 demonstrations, producing a hesitant, mediocre policy. With metadata: the model learns 5 different quality levels. At quality=5, it produces the clean execution from the top 20 demos. But the lower-quality data isn’t wasted — it teaches the model about object properties, workspace geometry, and what recovery looks like.
The diversified prompt isn’t just about making a single dataset work better. It unlocks the ability to co-train on fundamentally different data sources that would be impossible to combine without context conditioning.
1. Teleoperated demonstrations — the core dataset. Humans teleoperating various robot platforms to perform household tasks. Each episode is labeled with subtask instructions, quality scores, and metadata. This is the highest-quality, most expensive data source.
2. Autonomous robot data — episodes collected by the robot practicing on its own, using earlier policy checkpoints. This data is cheaper but noisier. Quality and mistake labels distinguish it from clean demonstrations.
3. Human egocentric video — recordings of humans performing tasks from a head-mounted camera. No robot actions, no joint angles. But the visual sequences contain rich information about task structure, object affordances, and manipulation strategies. The model learns what things look like when done correctly, even without learning how to move the joints.
4. Web data — internet-scale text and images used to maintain the VLM backbone’s language understanding and visual recognition capabilities. Without this, the VLM’s pre-trained knowledge degrades during robot fine-tuning.
| Source | Volume | Actions? | Loss Applied | Contribution |
|---|---|---|---|---|
| Teleoperated demos | ~10K hours | Yes (joint angles) | Flow matching + subtask prediction | Core manipulation skills |
| Autonomous practice | ~5K hours | Yes (noisier) | Flow matching (quality-conditioned) | Recovery, exploration |
| Human egocentric video | ~2K hours | No | VLM understanding only | Task structure, affordances |
| Web data | ~50M image-text pairs | No | Standard VLM loss | Prevents catastrophic forgetting |
| Parameter | Value |
|---|---|
| Hardware | 128 TPU v5e pods |
| Pre-training steps | ~400K (FAST discrete tokens) |
| Post-training steps | ~100K (flow matching) |
| Batch size | 4096 (mixed sources) |
| Learning rate | 1e-4 (pre-training) → 5e-6 (post-training) |
| Total training duration | ~10 days |
| VLM backbone (Gemma 3) | Frozen during post-training (Knowledge Insulation, KI) |
| Action expert (860M) | Trained throughout |
During training, π0.7 randomly drops each prompt component with probability p=0.1: 10% of the time subtask instructions are replaced with empty strings, and 10% of the time subgoal images are replaced with blank images. This is directly analogous to the conditioning dropout used when training diffusion models for classifier-free guidance.
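A minimal sketch of this kind of prompt dropout, assuming each optional component is dropped independently and the top-level instruction is always kept. Both of those choices, the dictionary keys, and the use of `None` as a stand-in for a blank image are assumptions for illustration.

```python
import random

# Value that replaces each droppable component (None stands in for a blank image).
DROPPABLE = {"subtask": "", "subgoal_image": None, "metadata": ""}

def drop_prompt_components(prompt, p=0.1, rng=random):
    """Independently blank each optional prompt component with probability p."""
    out = dict(prompt)
    for key, blank in DROPPABLE.items():
        if key in out and rng.random() < p:
            out[key] = blank
    return out

rng = random.Random(0)
prompt = {
    "instruction": "make espresso",
    "subtask": "tamp the grounds",
    "subgoal_image": "<image>",
    "metadata": "quality=5",
}
samples = [drop_prompt_components(prompt, rng=rng) for _ in range(10_000)]
subtask_drop_rate = sum(s["subtask"] == "" for s in samples) / len(samples)
print(f"subtask dropped in {subtask_drop_rate:.1%} of samples")
```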
Why? Two reasons:
A critical finding: prompt diversity and data diversity have a synergistic relationship. Adding more diverse data without prompt diversity doesn’t help (the model averages). Adding prompt diversity without more data doesn’t help (the model overfits to prompt variations). But adding both together produces superlinear improvement — the prompt conditioning enables the model to absorb data diversity, and the data diversity gives the model more to learn from.
The most striking result of π0.7 isn’t any single task — it’s the emergent capabilities that arise from the combination of diversified prompting, diverse data, and scale. These are behaviors that were never explicitly trained for but emerge naturally from the model’s learned representations.
Without any task-specific fine-tuning, π0.7 can perform remarkably dexterous tasks:
π0.7 demonstrates zero-shot cross-embodiment transfer: it can transfer a learned skill from one robot body to a completely different one without any additional training. For example, folding learned on a bimanual platform transfers to a different bimanual robot with different kinematics, different cameras, and different workspace geometry.
This works because the diversified prompt disambiguates the embodiment. The control mode (joint vs. end-effector) and the robot-specific metadata tell the model which body it’s controlling, allowing a single set of weights to drive multiple platforms.
π0.7 supports two control modes, switchable via prompt:
When transferring a skill from Robot A to Robot B, π0.7 uses Cartesian mode. The VLM understands the task (“fold this towel”), the model generates Cartesian trajectories (“move both grippers inward”), and each robot’s local IK controller translates them into its own joint space. The prompt specifies which mode to use, and because the model saw both modes during training, it can switch between them at deployment.
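Why Cartesian targets transfer across bodies is easiest to see on a toy case. The sketch below uses two planar 2-link arms with different link lengths (hypothetical robots, not the platforms in the article) and the standard closed-form inverse kinematics. The same Cartesian target yields different joint angles on each arm, yet both reach the same point.

```python
import math

def ik_2link(x, y, l1, l2):
    """Closed-form IK for a planar 2-link arm (the solution with q2 >= 0)."""
    c2 = (x * x + y * y - l1 * l1 - l2 * l2) / (2 * l1 * l2)
    if not -1.0 <= c2 <= 1.0:
        raise ValueError("target out of reach")
    q2 = math.acos(c2)
    q1 = math.atan2(y, x) - math.atan2(l2 * math.sin(q2), l1 + l2 * math.cos(q2))
    return q1, q2

def fk_2link(q1, q2, l1, l2):
    """Forward kinematics: joint angles to end-effector position."""
    x = l1 * math.cos(q1) + l2 * math.cos(q1 + q2)
    y = l1 * math.sin(q1) + l2 * math.sin(q1 + q2)
    return x, y

target = (0.5, 0.3)                                        # one shared Cartesian command
robots = {"robot_a": (0.4, 0.3), "robot_b": (0.5, 0.5)}    # hypothetical link lengths (m)

joints = {name: ik_2link(*target, *links) for name, links in robots.items()}
for name, (q1, q2) in joints.items():
    reached = fk_2link(q1, q2, *robots[name])
    print(name, f"q1={q1:.3f} q2={q2:.3f}", f"reaches ({reached[0]:.3f}, {reached[1]:.3f})")
```

The policy only has to get the Cartesian command right; the per-robot IK absorbs the kinematic differences.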
Perhaps the most impressive emergent capability: π0.7 can be coached to perform entirely novel tasks by composing known subtasks in new sequences. A human types step-by-step instructions, and the robot executes each one, even though the overall task was never seen in training.
Let’s walk through the full data flow for making espresso — one of π0.7’s showcase demonstrations:
| Step | Subtask Instruction | Subgoal Image | Actions Generated | Duration |
|---|---|---|---|---|
| 1 | “Unlock portafilter latch” | Latch in open position | [50, 14]: bimanual reach + twist motion | ~2s |
| 2 | “Remove portafilter” | Hand holding portafilter, machine empty | [50, 14]: pull-away trajectory | ~3s |
| 3 | “Place under grinder” | Portafilter below grinder spout | [50, 14]: navigate + position | ~4s |
| 4 | “Press grind button” | Button depressed, grounds falling | [50, 14]: reach + press | ~1s |
| 5 | “Tamp the grounds” | Flat, compressed coffee surface | [50, 14]: hold filter + press tamper | ~3s |
| 6 | “Lock portafilter in group head” | Portafilter locked in position | [50, 14]: insert + rotate motion | ~4s |
| 7 | “Place cup under spout” | Cup positioned below spout | [50, 14]: grasp cup + navigate + place | ~3s |
| 8 | “Press brew button” | Button depressed, espresso flowing | [50, 14]: reach + press | ~1s |
Total: 8 subtasks, ~21 seconds of manipulation. Each subtask generates 1-4 action chunks of 50 timesteps. The bimanual actions are 14-dimensional (7 per arm, for example 6 joints plus a gripper). The most challenging step is #6 (portafilter locking): it requires precise force along a curved insertion path, coordinated with a 90-degree rotation. This is the step that fails most often (~30% failure rate).
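These totals follow from the table. The sketch below sums the per-step durations and counts chunks, assuming 1-second chunks (50 timesteps at 50 Hz) executed back to back. With overlapping re-plans the number of generated chunks would be higher, so treat the chunk counts as a lower bound.

```python
import math

durations_s = [2, 3, 4, 1, 3, 4, 3, 1]   # per-subtask durations from the table
chunk_seconds = 1.0                       # 50 timesteps at 50 Hz

# Assumption: chunks are executed back to back with no overlap.
chunks_per_subtask = [math.ceil(d / chunk_seconds) for d in durations_s]

print(len(durations_s), sum(durations_s))                  # 8 21
print(min(chunks_per_subtask), max(chunks_per_subtask))    # 1 4
```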
Even π0.7 has clear limits. Understanding what breaks helps clarify what the model actually learned:
| Condition | Effect | Root Cause |
|---|---|---|
| Out-of-distribution object shape | Grasp fails ~40% of the time | Action expert hasn’t seen similar geometry; VLM recognizes object but action policy can’t grip it |
| Novel robot morphology (not in training) | Complete failure | Control mode metadata can’t bridge to truly unseen kinematics |
| Ambiguous language instruction | Model picks one interpretation, sometimes wrong | Prompt disambiguation helps but can’t resolve genuine ambiguity |
| Completely dark room | Complete failure | VLM is vision-based; no visual input = no understanding |
| Very fast required movements | Quality degrades (overshoots) | Action chunking at 50 Hz has latency; truly reactive motions need >100 Hz |
| Force-sensitive tasks (e.g., egg handling) | Inconsistent | No force/torque feedback in the observation space; relies on visual cues only |
Let’s look at the quantitative evidence for π0.7’s claims. The evaluations span multiple dimensions: task performance, ablations of the diversified prompt, scaling studies, and comparisons to prior work.
π0.7 achieves strong zero-shot performance on dexterous tasks without task-specific fine-tuning. On a suite of 8 challenging manipulation tasks (espresso, folding, peeling, box assembly, etc.), the model achieves an average success rate significantly above prior VLAs that require per-task fine-tuning.
The ablation studies reveal the critical importance of each prompt component:
| Step | Computation | Frequency | Latency |
|---|---|---|---|
| Camera capture | 3 cameras @ 224x224 | 10 Hz | ~5ms |
| SigLIP encoding | 3 images → 768 tokens | 10 Hz | ~8ms |
| MEM history compression | 300 frames → 64 tokens | 2 Hz | ~5ms |
| Subtask prediction (HL policy) | Autoregressive text | 0.1-0.2 Hz | ~400ms |
| Subgoal generation (BAGEL) | Image generation | 0.1-0.2 Hz | ~500ms |
| Flow matching (action expert) | 10 denoising steps | 2 Hz | ~30ms |
| Motor command execution | 50 joint commands | 50 Hz | ~1ms |
The pipeline is asynchronous: subtask prediction and subgoal generation run in the background while the action expert continuously generates motor commands. The action expert never waits for the slow stages — it uses the most recent subtask and subgoal until they’re updated.
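A deterministic sketch of that "use the latest available result" pattern. It assumes subtask prediction and subgoal generation run back to back (~400 ms + ~500 ms = 0.9 s) once every 5 seconds (0.2 Hz), with the action expert called at 2 Hz. The scheduling details and function names are assumptions; a real system would use threads or separate processes rather than a closed-form lookup.

```python
def latest_available(t, period_s, latency_s):
    """Index of the newest planner output finished by time t, or None if none yet."""
    if t < latency_s:
        return None
    return int((t - latency_s) // period_s)

def simulate(duration_s=12.0, control_hz=2.0, period_s=5.0, latency_s=0.9):
    """Return (time, planner_output_index) for every action-expert call."""
    n = int(duration_s * control_hz)
    return [(i / control_hz, latest_available(i / control_hz, period_s, latency_s))
            for i in range(n)]

log = simulate()
for t, idx in log[:4] + log[11:13]:
    print(f"t={t:4.1f}s  action expert uses planner output {idx}")
```

The action expert keeps running on output 0 from t=1.0 s until output 1 becomes available at t=5.9 s; it never blocks on the slow stages. What it should do before the first output arrives (here `None`) is a design decision the article does not cover.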
π0.7 continues to improve with more data and larger models. The scaling curves show no sign of saturating, suggesting that the diversified prompt approach has room for further gains as robotics datasets grow.
Concrete scaling numbers from the paper:
| Data Scale | Success Rate (dexterous suite) | vs Baseline Improvement |
|---|---|---|
| 1K hours, no prompt diversity | 38% | Baseline |
| 1K hours, with prompt diversity | 55% | +17 pp (prompt helps at all scales) |
| 5K hours, no prompt diversity | 48% | +10 pp (data alone has limits) |
| 5K hours, with prompt diversity | 72% | +34 pp (synergy kicks in) |
| 10K hours, with prompt diversity | 84% | +46 pp (best result) |
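The key observation: going from 1K to 10K hours with prompt diversity gives +29 pp. Going from no-prompt to full-prompt at 5K hours gives +24 pp. But the interaction is what matters. Relative to the 1K-hour baseline, more data alone gives +10 pp and prompt diversity alone gives +17 pp, so purely additive effects would predict +27 pp; the observed gain with both is +34 pp. This super-additive (superlinear) effect is the central quantitative finding of the paper.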
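The interaction effect can be read off the 1K/5K rows of the table as a simple two-by-two comparison. The success rates are from the table; the decomposition into main effects plus an interaction term is my framing.

```python
# Success rates (%) from the scaling table, keyed by (data scale, prompt diversity).
rate = {
    ("1K", False): 38, ("1K", True): 55,
    ("5K", False): 48, ("5K", True): 72,
}

base = rate[("1K", False)]
prompt_gain = rate[("1K", True)] - base    # prompt diversity alone
data_gain = rate[("5K", False)] - base     # more data alone
joint_gain = rate[("5K", True)] - base     # both together
interaction = joint_gain - (prompt_gain + data_gain)

print(prompt_gain, data_gain, joint_gain, interaction)  # 17 10 34 7
```

A positive interaction term (+7 pp here) is what "synergy" means quantitatively: the combined gain exceeds the sum of the separate gains.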
On cross-embodiment transfer tasks (folding trained on Robot A, evaluated on Robot B), π0.7 achieves success rates comparable to models trained directly on Robot B’s data — without ever having seen Robot B during training for that specific task.
π0.7 is the latest in Physical Intelligence’s line of VLA models, each building on the last:
| Decision | Alternative | Why π0.7’s Choice |
|---|---|---|
| Diversified prompt (4 types) | Language-only conditioning | Resolves mode averaging, enables steerability |
| Gemma 3 (4B) + action expert (860M) | Single unified model | Knowledge Insulation prevents forgetting |
| MEM for history (64 tokens) | Raw frame stacking | 1200x compression makes history tractable |
| BAGEL subgoal images | Language-only subtasks | Pixel-level precision for spatial tasks |
| Quality/speed metadata | Filter bad data out | Turns all data into an asset, 2-3x more usable data |
| Joint + Cartesian modes | Joint-only | Different tasks benefit from different control spaces |
Despite its capabilities, π0.7 has fundamental limitations that define the next generation of challenges:
π0.7 represents a shift in how we think about robot learning. Instead of training specialized models for each task, embodiment, and quality level, you train one model on everything and let the prompt select the desired behavior. This is the same insight that made language models powerful: a single model, conditioned on diverse prompts, can perform any task in its training distribution — and many tasks outside it.