Physical Intelligence, 2025

π0.7: Steerable Generalist Foundation Model

A 5B-parameter VLA with emergent compositional generalization — steered by diversified prompts that specify not just WHAT to do, but HOW. Makes espresso, folds laundry, peels vegetables, and transfers zero-shot across robot bodies.

Prerequisites: Flow matching + VLMs + Basic robotics

Chapter 0: The Problem

You’re training a robot to clean a kitchen. You have 10,000 demonstrations — some fast, some slow. Some follow one strategy (wipe counter then clear dishes), others follow the opposite order. Some come from one robot arm, others from a completely different platform. Some are great; some made mistakes along the way but eventually succeeded.

A naïve model trained on all this data will average the strategies together. Fast and slow get blended into medium-speed. Left-first and right-first get blended into indecisive hesitation. Good and bad demonstrations get blended into mediocre behavior. The richer and more diverse your dataset, the worse this averaging problem becomes.

This is the curse of multi-modality in behavior cloning. The model sees many valid ways to do a task, can’t distinguish between them at inference time, and produces an incoherent mixture that corresponds to none of them.

Why existing VLAs can’t compose

Previous VLAs like RT-2 and even π0 take a single language instruction (“clean the kitchen”) and map it directly to actions. This works for simple, atomic tasks. But for complex multi-step tasks, a single instruction is hopelessly underspecified: it does not say which strategy to follow, in what order to do things, or how fast to move.

The mode averaging problem in numbers

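Consider a concrete example. Suppose there are two valid strategies for “clear the table”: start on the left side, which puts joint 1 at some positive angle, or start on the right side, which puts it at the mirrored negative angle.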

If the model averages: joint 1 = 0 degrees. The arm reaches straight ahead — toward neither target. It hovers indecisively in the center of the table. This isn’t a hypothetical failure mode; it’s the dominant failure pattern in behavior cloning with diverse data.
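The averaging effect can be reproduced in a few lines. This is a minimal sketch, assuming hypothetical joint-1 targets of +30 and -30 degrees for the two strategies (these values are illustrative and do not come from the paper). A policy fit with squared error and no conditioning lands near the mean, while a policy conditioned on the strategy label recovers each mode.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical bimodal demonstrations: half reach left (+30 deg), half reach right (-30 deg).
n = 1000
strategy = np.arange(n) % 2                      # 0 = left-first, 1 = right-first
target = np.where(strategy == 0, 30.0, -30.0)
joint1 = target + rng.normal(0.0, 1.0, size=n)   # small teleoperation noise

# The best constant prediction under squared error is the mean of all demonstrations.
unconditioned = joint1.mean()

# Conditioning on the strategy label gives one mean per mode instead.
conditioned = {s: joint1[strategy == s].mean() for s in (0, 1)}

print(f"unconditioned: {unconditioned:+.1f} deg")
print(f"left-first:    {conditioned[0]:+.1f} deg")
print(f"right-first:   {conditioned[1]:+.1f} deg")
```

The unconditioned estimate sits close to 0 degrees, which matches neither strategy; the conditioned estimates sit close to the two original targets.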

Flow matching helps (it can represent multi-modal distributions), but only for low-level action diversity. It can’t resolve strategic ambiguity: which side of the table to start with is not a continuous distribution to sample from — it’s a discrete choice that should be made once and committed to.

The core tension: Data diversity is essential for generalization — you need demonstrations from many robots, many strategies, many quality levels. But data diversity without context conditioning leads to mode averaging, where the model produces the mean of all strategies instead of any single coherent one. π0.7 resolves this tension by making the prompt as diverse as the data.

Chapter 1: The Key Insight

π0.7’s insight is deceptively simple: if your data is diverse, your prompt must be equally diverse. Instead of conditioning on a single language instruction, π0.7 conditions on a rich, multi-faceted prompt that specifies not just what to do, but how to do it.

The diversified prompt

Every training episode in π0.7 is annotated with four types of context:

  1. Subtask instructions — intermediate semantic steps like “open the fridge door” or “pick up the sponge” that decompose a high-level task into a sequence of atomic actions.
  2. Subgoal images — machine-generated images showing what the world should look like after each subtask completes. These are near-future visual targets.
  3. Episode metadata — numerical descriptors including speed (how many timesteps the episode takes), quality (1–5 score), and mistake labels (which segments contain errors).
  4. Control mode — whether the robot should produce joint-level actions or end-effector Cartesian actions.
Why this works: The diversified prompt resolves the ambiguity in the data. Two demonstrations that look identical at the task level (“clean the kitchen”) but differ in strategy, speed, or quality now have different prompts. The model can learn a single coherent policy for each prompt configuration, instead of averaging across all of them. More prompt diversity → more data diversity the model can absorb → better generalization.
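One way to picture the four context types is as fields of a single record attached to each episode segment. The sketch below is illustrative only: the class name, field names, and validation rules are assumptions, not the format used by π0.7.

```python
from dataclasses import dataclass
from typing import Optional

import numpy as np


@dataclass
class DiversifiedPrompt:
    """Hypothetical container for the four context types (names are illustrative)."""
    task: str                                    # high-level instruction
    subtask: Optional[str] = None                # 1. subtask instruction
    subgoal_image: Optional[np.ndarray] = None   # 2. generated subgoal image, (H, W, 3)
    speed: Optional[int] = None                  # 3. episode length in timesteps
    quality: Optional[int] = None                # 3. score from 1 to 5
    mistake: Optional[bool] = None               # 3. does this segment contain an error?
    control_mode: Optional[str] = None           # 4. "joint" or "cartesian"

    def __post_init__(self):
        if self.quality is not None and not 1 <= self.quality <= 5:
            raise ValueError("quality must be between 1 and 5")
        if self.control_mode not in (None, "joint", "cartesian"):
            raise ValueError("control_mode must be 'joint' or 'cartesian'")


# Two demonstrations of the same task that differ only in how it was done.
fast_clean = DiversifiedPrompt("clean the kitchen", subtask="pick up the sponge",
                               speed=40, quality=5, mistake=False, control_mode="joint")
slow_messy = DiversifiedPrompt("clean the kitchen", subtask="pick up the sponge",
                               speed=160, quality=2, mistake=True, control_mode="joint")

print(fast_clean.task == slow_messy.task, fast_clean == slow_messy)
```

The two records share a task string but are not equal, which is the property that lets a conditional model keep the two behaviors apart.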

This is an instance of a deeper principle: conditional models can represent multi-modal distributions without averaging, as long as the conditioning variable disambiguates the modes. Gaussian mixture models work the same way — each component has a different mean, but only one fires for each value of the latent variable. π0.7’s prompt plays the role of the latent variable.

What changes from π0 and π0.5

| Aspect | π0 | π0.5 | π0.7 |
| --- | --- | --- | --- |
| Conditioning | Language instruction only | Language + subtask (from same model) | Language + subtask + subgoal image + metadata + control mode |
| Model size | 3B (PaliGemma) | 3B (PaliGemma) | 5B (Gemma 3 + 860M action expert) |
| Video history | Current frame only | Current frame + recent frames | MEM encoder compresses full history |
| Data quality | All treated equally | All treated equally | Quality/speed/mistake metadata per episode |
| Cross-embodiment | Robot ID token | Robot ID token | Control mode + robot metadata |

Steerability at inference time

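The diversified prompt isn’t just a training trick. It also makes the model steerable at deployment: each prompt component (subtask instruction, subgoal image, episode metadata, control mode) is an input you can set at inference time.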


Chapter 2: Architecture

π0.7 is a 5-billion parameter Vision-Language-Action model built from three major components, each serving a distinct role. Let’s walk through the full architecture.

Component 1: VLM backbone (4B)

The backbone is Gemma 3, a 4-billion parameter vision-language model pre-trained on internet-scale text and image data. It processes the language instruction, subtask instructions, episode metadata, and images through its standard multimodal transformer architecture.

The VLM’s job is understanding: parsing the instruction, recognizing objects in the scene, grounding language to visual features. It outputs rich contextual representations that tell the action expert what the robot should do.

Component 2: MEM video encoder

Raw observation history is expensive — sending all camera frames through the full VLM would be prohibitively slow. π0.7 uses Multi-Scale Embodied Memory (MEM), a specialized video encoder that compresses the robot’s observation history into a compact set of memory tokens.

MEM processes frames at multiple temporal scales: recent frames at high resolution, older frames at lower resolution. This gives the model a sense of what has happened without overwhelming the context window. The compressed memory tokens are injected into the VLM’s context alongside the language and image tokens.

Concrete numbers: MEM compresses the last 10 seconds of observation history (3 cameras x 10 fps x 10 seconds = 300 frames) into just 64 memory tokens. Without MEM, representing 300 frames at 256 tokens each would require 300 x 256 = 76,800 visual tokens, far too many to push through the VLM at every control step. Going from 76,800 tokens to 64 is a 1200x compression ratio.
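The compression figure follows directly from the numbers quoted above. A quick check, with variable names chosen here for illustration:

```python
cameras, fps, seconds = 3, 10, 10
tokens_per_frame = 256   # visual tokens per encoded frame
memory_tokens = 64       # size of the compressed history

frames = cameras * fps * seconds          # frames in the history window
raw_tokens = frames * tokens_per_frame    # tokens if every frame were encoded in full
compression = raw_tokens / memory_tokens  # ratio of raw tokens to memory tokens

print(frames, raw_tokens, compression)
```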

What MEM captures that single frames miss

Why does temporal history matter? Several manipulation scenarios require it:

In the ablation, removing MEM drops performance by ~10 pp on multi-step tasks but barely affects single-step tasks. History only matters when the task requires memory of what happened before.

Component 3: Flow matching action expert (860M)

The action expert is an 860-million parameter transformer that generates robot actions via flow matching. It takes the VLM’s output representations as conditioning and iteratively denoises a random noise vector into a sequence of precise robot joint positions or end-effector poses.

Flow matching is chosen over discrete tokenization because robot actions are continuous — joint angles and Cartesian coordinates live in smooth, high-dimensional spaces where discretization introduces unnecessary quantization error.
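The iterative denoising can be sketched as Euler integration of a velocity field from noise to an action chunk. This is a toy illustration, not the π0.7 action expert. The learned, prompt-conditioned network is replaced by the exact velocity field of a straight-line path toward a single known target, and the time convention (noise at t=0, actions at t=1) is an assumption; some formulations run time in the opposite direction.

```python
import numpy as np


def sample_action_chunk(velocity_fn, shape=(50, 14), num_steps=10, seed=0):
    """Integrate da/dt = v(a, t) from noise (t=0) to actions (t=1) with Euler steps."""
    rng = np.random.default_rng(seed)
    a = rng.standard_normal(shape)       # start from Gaussian noise
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt
        a = a + dt * velocity_fn(a, t)   # one denoising step
    return a


# Hypothetical target chunk: 50 timesteps x 14 action dimensions.
target = np.zeros((50, 14))
target[:, 0] = np.linspace(0.0, 0.5, 50)   # a ramp on the first dimension


def toy_velocity(a, t):
    """Stand-in for the learned network: points straight at the target."""
    return (target - a) / (1.0 - t)


chunk = sample_action_chunk(toy_velocity)
print(chunk.shape, float(np.abs(chunk - target).max()))
```

In the real model the velocity function would be a transformer conditioned on the VLM's representations, and different noise samples would land on different valid action chunks rather than on one fixed target.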

Why a separate action expert? This is the Knowledge Insulation (KI) design. By keeping the VLM backbone frozen (or lightly adapted) and training a separate action expert, π0.7 preserves the VLM’s pre-trained knowledge — language understanding, visual recognition, world knowledge — while learning robot-specific action generation in the expert. Without KI, fine-tuning the full VLM on robot data causes catastrophic forgetting of the very capabilities that make it useful.

The full token budget

| Token Type | Count | Source | Processed By |
| --- | --- | --- | --- |
| Visual tokens (current frame, 3 cams) | 768 | SigLIP encoder | VLM backbone |
| MEM history tokens | 64 | MEM encoder | Action expert |
| Language instruction | ~20-50 | Gemma tokenizer | VLM backbone |
| Subtask instruction | ~10-20 | Gemma tokenizer | VLM backbone |
| Subgoal image | 256 | SigLIP encoder | VLM backbone |
| Metadata (speed/quality/mode) | ~5-10 | Numerical encoding | VLM backbone |
| Action tokens (flow matching) | 50 | Denoising | Action expert |
| Total | ~1200 | | |

Putting it together: the inference pipeline

At inference, the pipeline flows like this:

  1. Encode context: The VLM processes the diversified prompt (language instruction, subtask, metadata, subgoal image) plus the current camera image. ~15ms on A100.
  2. Compress history: MEM encodes the recent observation history into 64 compact memory tokens. ~5ms.
  3. Generate actions: The flow matching expert takes the VLM’s output + MEM tokens and iteratively denoises noise into a chunk of 50 future robot actions (50 steps at 50 Hz = 1 second of motion). 10 denoising steps x ~3ms = ~30ms.
  4. Execute: First 25 actions execute at 50 Hz (500ms), then re-plan with fresh observations.

Total latency: ~50ms from observation to first action. The model runs at roughly 2 Hz re-planning frequency with action chunking, generating 1-second action chunks that overlap for smooth execution.
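The timing figures above fit together as simple arithmetic. The sketch below recomputes them and lays out a receding-horizon schedule; the `schedule` helper is a hypothetical illustration of which chunk each control tick draws from, not part of any released code.

```python
vlm_ms, mem_ms = 15, 5
denoise_steps, ms_per_step = 10, 3
latency_ms = vlm_ms + mem_ms + denoise_steps * ms_per_step   # observation to first action

control_hz = 50
chunk_len = 50            # actions generated per chunk
executed_per_chunk = 25   # actions executed before re-planning

chunk_seconds = chunk_len / control_hz             # motion covered by one chunk
replan_period_s = executed_per_chunk / control_hz  # time between re-plans
replan_hz = 1.0 / replan_period_s


def schedule(total_ticks):
    """Yield (tick, chunk_id, index_in_chunk) for a receding-horizon rollout."""
    for tick in range(total_ticks):
        yield tick, tick // executed_per_chunk, tick % executed_per_chunk


print(latency_ms, chunk_seconds, replan_period_s, replan_hz)
print(list(schedule(51))[-1])
```

Because only the first half of each chunk is executed, the second half acts as a buffer: the ~50 ms needed to compute the next chunk is small compared with the 500 ms of actions still queued.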


Chapter 3: Subtask Instructions

The first and most important component of the diversified prompt is the subtask instruction: a short language description of the immediate next step the robot should take.

From high-level to atomic

Consider the task “make espresso.” This is a high-level instruction that encompasses dozens of individual manipulations:

  1. Open the portafilter latch
  2. Remove the portafilter from the machine
  3. Place portafilter under the grinder
  4. Press the grind button
  5. Tamp the coffee grounds
  6. Insert portafilter into the group head
  7. Lock the portafilter
  8. Place cup under the spout
  9. Press the brew button

Each subtask is an atomic manipulation that the model can execute with a single action chunk (or a few chunks). The subtask instruction tells the model which atomic step to execute right now.

Where do subtask labels come from?

π0.7 gets subtask labels from three sources:

Subtask label statistics

The training dataset contains approximately 200 unique subtask types. The distribution is long-tailed:

| Subtask Category | Examples | Frequency |
| --- | --- | --- |
| Pick/place | “Pick up the mug”, “Place in sink” | ~35% of segments |
| Open/close | “Open cabinet”, “Close drawer” | ~15% |
| Tool use | “Press button”, “Turn knob” | ~12% |
| Bimanual | “Fold in half”, “Hold and pour” | ~10% |
| Navigation | “Move to counter”, “Approach fridge” | ~8% |
| Cleaning | “Wipe surface”, “Scrub plate” | ~8% |
| Other | “Peel vegetable”, “Tamp grounds” | ~12% |

The long tail matters: rare subtasks like “tamp espresso grounds” appear in only a few dozen episodes. But because the model learns compositional representations, it can execute these rare subtasks by combining patterns from more common ones (pressing downward + maintaining contact = tamping).

The compositionality mechanism

This is where π0.7’s emergent compositional generalization comes from. The model learns to execute individual atomic subtasks (“open door,” “pick up object,” “place in container”). At inference, you can compose these subtasks in novel sequences to achieve tasks never seen in training.

Worked example: The model has never been trained on “make a peanut butter sandwich.” But it has learned each of the component skills the task requires.

A human coach provides: “open the peanut butter jar” → “scoop peanut butter with the knife” → “spread on the bread” → “close the jar.” Each subtask maps to a known skill. The novel composition produces a novel task.

Compositionality from subtasks: The whole is greater than the sum of parts. The model learns ~200 atomic subtasks during training. But the number of possible sequences of these subtasks is combinatorial — effectively infinite. Subtask conditioning gives π0.7 access to this combinatorial space at inference time.
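To get a feel for how quickly the space grows, the count of ordered subtask sequences can be computed directly. This is a rough upper bound that assumes any subtask may follow any other, which overstates the number of physically sensible plans.

```python
num_subtasks = 200   # approximate number of subtask types in the training data


def num_sequences(length, vocabulary=num_subtasks):
    """Ordered sequences of `length` subtasks, repetition allowed."""
    return vocabulary ** length


for length in (1, 2, 4, 8):
    print(length, f"{num_sequences(length):.2e}")
```

Even at four steps there are over a billion candidate sequences, far more than any dataset could demonstrate one by one.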

Chapter 4: Subgoal Images

Language subtask instructions tell the robot what to do semantically. But language is inherently imprecise about spatial details. “Place the cup on the counter” doesn’t specify where on the counter, at what angle, or how far from the edge. Subgoal images fill this gap.

What is a subgoal image?

A subgoal image is a generated photograph of what the world should look like after the current subtask is completed. If the subtask is “close the fridge door,” the subgoal image shows the fridge with the door closed. If the subtask is “place portafilter in group head,” the subgoal image shows the portafilter locked in position.

These images are not real photographs — they are synthesized by a world model. π0.7 uses BAGEL, an image generation model, conditioned on the current observation and the subtask instruction, to imagine the near-future visual state.

How good are the subgoal images?

BAGEL generates surprisingly accurate near-future predictions for rigid scenes (fridge doors, drawers, objects on counters). But quality degrades predictably:

The data flow for subgoal images

  1. Input to BAGEL: Current camera image (224x224) + subtask text (“close the fridge door”)
  2. BAGEL generates: A 224x224 image of the expected post-subtask state
  3. SigLIP encodes: The generated image → 256 visual tokens (same as a real camera image)
  4. Injected into prompt: These 256 tokens are concatenated with the current observation tokens
  5. Action expert sees: Both “where I am” (current image) and “where I should be” (subgoal image)

The model learns to close the gap between current and subgoal images through action generation. This is effectively visual servoing with a learned world model as the reference generator.
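The data flow above is mostly shape bookkeeping, which the following sketch makes explicit. Both `generate_subgoal` and `encode_image` are hypothetical stand-ins (a brightness shift and a patch-averaging encoder) for BAGEL and SigLIP; only the image size and token counts are taken from the text.

```python
import numpy as np


def generate_subgoal(current_image, subtask):
    """Stand-in for the image generator: returns an image of the same size.

    A real generator would use `subtask`; this placeholder only brightens the input.
    """
    return np.clip(current_image + 0.1, 0.0, 1.0)


def encode_image(image):
    """Stand-in for the vision encoder: 224x224 image -> 256 tokens (16x16 patches)."""
    assert image.shape == (224, 224, 3)
    patches = image.reshape(16, 14, 16, 14, 3).mean(axis=(1, 3))   # (16, 16, 3)
    return patches.reshape(256, 3)   # toy 3-dimensional "embedding" per patch


rng = np.random.default_rng(0)
cameras = [rng.random((224, 224, 3)) for _ in range(3)]   # current observation

subgoal = generate_subgoal(cameras[0], "close the fridge door")
obs_tokens = np.concatenate([encode_image(c) for c in cameras])   # "where I am"
goal_tokens = encode_image(subgoal)                               # "where I should be"
prompt_tokens = np.concatenate([obs_tokens, goal_tokens])

print(obs_tokens.shape, goal_tokens.shape, prompt_tokens.shape)
```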

Why not just use language?

Consider two scenarios where language fails:

Subgoal images provide pixel-level specificity about the desired world state, which a language instruction alone cannot.

What degrades with subgoal images

The subgoal image system has specific failure modes:

World model as a planner: The BAGEL-based subgoal generation isn’t just visualization. It is a form of planning. By generating a sequence of subgoal images (one per subtask), the system creates a visual plan: a movie of desired future states. The robot then executes this plan one frame at a time, checking its progress against each subgoal. This resembles model-predictive control with a learned world model.

Chapter 5: Episode Metadata

Here’s a problem that every robotics lab faces: not all demonstrations are equally good. Some are fast and smooth. Some are slow and careful. Some contain mistakes — the robot dropped the object, picked it up again, and eventually succeeded. Do you throw away the imperfect data?

Throwing it away is wasteful. But including it without annotation causes the model to learn the mistakes along with the successes. π0.7’s solution: label each episode with metadata so the model can learn from all data while understanding what makes each episode different.

Three types of metadata

Speed — the episode length in timesteps. A fast demonstration of “pick up the cup” might be 30 steps; a slow, careful one might be 120 steps. By conditioning on speed, the model learns the relationship between pace and behavior without averaging fast and slow together.

Quality — a score from 1 to 5 indicating how well the demonstration was executed. Quality 5 means clean, efficient execution. Quality 1 means the robot struggled, made errors, but eventually completed the task. At inference, you simply set quality=5 to get the best behavior.

Mistake labels — binary flags on each segment indicating whether it contains an error (dropped object, collision, wrong grasp). This lets the model learn what not to do from the mistake segments while still learning useful recovery behaviors.

How metadata is encoded

The metadata is converted to text tokens and prepended to the language instruction:

“[speed=42] [quality=5] [mistakes=none] [mode=joint] Pick up the plate”

This is simple but effective. The VLM backbone processes these tokens alongside the instruction, learning to condition its representations on the metadata. During training, the actual metadata values are used. During inference, you set the values you want (for example quality=5 and a low speed value for fast execution).
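A small helper that produces the bracketed string shown above might look like this. The function name, argument names, and defaults are assumptions; only the output format is taken from the example.

```python
def format_prompt(instruction, speed, quality, mistakes="none", mode="joint"):
    """Prepend metadata tags to a language instruction (hypothetical helper)."""
    if not 1 <= quality <= 5:
        raise ValueError("quality must be between 1 and 5")
    return (f"[speed={speed}] [quality={quality}] "
            f"[mistakes={mistakes}] [mode={mode}] {instruction}")


# Training: use the values recorded for the episode.
print(format_prompt("Pick up the plate", speed=42, quality=5))

# Inference: choose the values you want, e.g. a slower, more careful pace.
print(format_prompt("Pick up the plate", speed=150, quality=5))
```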

Speed conditioning: the numbers

Speed is encoded as the episode length in timesteps, normalized to a human-readable range. The model sees speed values from ~20 (very fast, aggressive motion) to ~200 (very slow, careful manipulation), so a lower value means a faster episode. At inference, you pick the value that matches the pace you want.

The model doesn’t just scale its velocities linearly. At low speed values (short, fast episodes), it takes fundamentally different trajectories that are more direct and less cautious. At high speed values (long, slow episodes), it adds clearance motions (lifting higher before placing) and approach-from-above strategies. The model learned that slow demonstrations tend to be more careful because the human teleoperator chose to be careful, not just slow.

Mistake labels: learning what NOT to do

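Mistake-labeled segments are powerful but tricky. Consider a demonstration where the robot drops an object at timestep 150, then recovers by re-grasping at timestep 200. The episode splits into three segments: the approach before the drop (mistake=false), the drop itself (mistake=true), and the recovery re-grasp (mistake=false).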

During training, the model sees all three segments. When conditioned on mistake=false, it learns the successful approach and the recovery re-grasp. When conditioned on mistake=true, it learns what a drop looks like (useful for knowing what to avoid). At inference with mistake=false, the model skips the dropping behavior entirely while still knowing how to recover if something goes wrong — because it saw the recovery segment labeled mistake=false.

This is genuinely clever: the mistake labels don’t just filter out bad data. They decompose imperfect episodes into useful components. Every demonstration, no matter how messy, contributes something.
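The decomposition of one messy episode into labeled segments can be sketched as follows. The drop and recovery timesteps come from the example above; the total episode length of 300 and the `Segment` structure are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    start: int      # first timestep (inclusive)
    end: int        # last timestep (exclusive)
    label: str
    mistake: bool


def split_episode(length, drop_at, recover_at):
    """Split an episode with one drop into approach, drop, and recovery segments."""
    return [
        Segment(0, drop_at, "approach and grasp", mistake=False),
        Segment(drop_at, recover_at, "drop", mistake=True),
        Segment(recover_at, length, "recovery re-grasp", mistake=False),
    ]


segments = split_episode(length=300, drop_at=150, recover_at=200)

# Conditioning on mistake=false keeps the approach and the recovery, not the drop.
clean = [s for s in segments if not s.mistake]
for s in segments:
    print(f"{s.start:3d}-{s.end:3d}  mistake={s.mistake}  {s.label}")
```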

A worked example of metadata in action

Consider 100 demonstrations of “pick up the mug”:

| Quality | Count | Behavior Pattern | Useful For |
| --- | --- | --- | --- |
| 5 | 20 | Clean top-down grasp, smooth lift, no hesitation | Deployment (set quality=5) |
| 4 | 30 | Correct grasp but slightly slow approach | Robust grasping strategies |
| 3 | 25 | Slightly misaligned grasp, minor correction needed | Recovery behaviors |
| 2 | 15 | First grasp failed, re-approached, eventually succeeded | Failure recovery, retry strategies |
| 1 | 10 | Multiple failures, eventually succeeded after 3+ attempts | Extreme recovery, workspace exploration |

Without metadata: the model would average all 100 demonstrations, producing a hesitant, mediocre policy. With metadata: the model learns 5 different quality levels. At quality=5, it produces the clean execution from the top 20 demos. But the lower-quality data isn’t wasted — it teaches the model about object properties, workspace geometry, and what recovery looks like.

The key insight: Episode metadata turns suboptimal data from a liability into an asset. Without metadata, including a quality-2 demonstration hurts performance — the model copies the mistakes. With metadata, the model learns “when quality=2, this is what happens; when quality=5, this is what happens.” Set quality=5 at inference and you get clean execution, but the quality-2 data still contributed useful information about object properties, workspace geometry, and recovery strategies.

Chapter 6: Training Data

The diversified prompt isn’t just about making a single dataset work better. It unlocks the ability to co-train on fundamentally different data sources that would be impossible to combine without context conditioning.

Four data sources

1. Teleoperated demonstrations — the core dataset. Humans teleoperating various robot platforms to perform household tasks. Each episode is labeled with subtask instructions, quality scores, and metadata. This is the highest-quality, most expensive data source.

2. Autonomous robot data — episodes collected by the robot practicing on its own, using earlier policy checkpoints. This data is cheaper but noisier. Quality and mistake labels distinguish it from clean demonstrations.

3. Human egocentric video — recordings of humans performing tasks from a head-mounted camera. No robot actions, no joint angles. But the visual sequences contain rich information about task structure, object affordances, and manipulation strategies. The model learns what things look like when done correctly, even without learning how to move the joints.

4. Web data — internet-scale text and images used to maintain the VLM backbone’s language understanding and visual recognition capabilities. Without this, the VLM’s pre-trained knowledge degrades during robot fine-tuning.

The training data by the numbers

| Source | Volume | Actions? | Loss Applied | Contribution |
| --- | --- | --- | --- | --- |
| Teleoperated demos | ~10K hours | Yes (joint angles) | Flow matching + subtask prediction | Core manipulation skills |
| Autonomous practice | ~5K hours | Yes (noisier) | Flow matching (quality-conditioned) | Recovery, exploration |
| Human egocentric video | ~2K hours | No | VLM understanding only | Task structure, affordances |
| Web data | ~50M image-text pairs | No | Standard VLM loss | Prevents catastrophic forgetting |

How non-robot data helps: Human video and web data don’t contain robot actions, so they can’t directly teach the robot to move. But they teach the VLM backbone to understand the world: recognizing objects, predicting consequences of actions, understanding spatial relationships. The action expert then translates this understanding into motor commands. This is why Knowledge Insulation matters: the VLM trains on vision-language data, the action expert trains on robot data, and neither corrupts the other.

Training infrastructure

| Parameter | Value |
| --- | --- |
| Hardware | 128 TPU v5e pods |
| Pre-training steps | ~400K (FAST discrete tokens) |
| Post-training steps | ~100K (flow matching) |
| Batch size | 4096 (mixed sources) |
| Learning rate | 1e-4 (pre-training) → 5e-6 (post-training) |
| Total training duration | ~10 days |
| VLM backbone (Gemma 3) | Frozen during post-training (KI) |
| Action expert (860M) | Trained throughout |

Classifier-free guidance for robotics

During training, π0.7 randomly drops each prompt component with probability p=0.1. So 10% of the time, subtask instructions are replaced with empty strings. 10% of the time, subgoal images are replaced with blank images. This is directly analogous to classifier-free guidance in diffusion models.

Why? Two reasons:

  1. Robustness: At inference, some prompt components may be unavailable (no subgoal image for a novel task, no quality label for a live scenario). The model must work with partial information.
  2. Guidance strength: At inference, you can combine the conditioned and unconditioned outputs and extrapolate past the conditioned one, amplifying the effect of each prompt component. Setting guidance scale > 1.0 for quality metadata makes the model MORE quality-sensitive than the training data distribution. This is free steerability.
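Both halves of this recipe are short enough to sketch. The code below assumes the standard classifier-free guidance combination applied to the action expert's velocity output, and uses `None` as a placeholder for a dropped component (the text describes empty strings and blank images). Function names and the dictionary layout are hypothetical.

```python
import numpy as np


def drop_components(prompt, rng, p=0.1):
    """Training: independently blank each prompt component with probability p."""
    return {key: (None if rng.random() < p else value) for key, value in prompt.items()}


def guided_velocity(v_cond, v_uncond, scale):
    """Inference: scale=0 gives unconditioned, 1 gives conditioned, >1 extrapolates."""
    return v_uncond + scale * (v_cond - v_uncond)


rng = np.random.default_rng(0)
prompt = {"subtask": "open the fridge door", "subgoal_image": "img", "quality": 5}

dropped = [drop_components(prompt, rng) for _ in range(10_000)]
rate = np.mean([d["subtask"] is None for d in dropped])
print(f"subtask dropped in {rate:.1%} of samples")

v_cond, v_uncond = np.array([1.0, 0.0]), np.array([0.0, 0.0])
print(guided_velocity(v_cond, v_uncond, scale=2.0))
```

With a scale of 2.0 the guided output moves twice as far from the unconditioned prediction as the conditioned one does, which is what "amplifying the effect" of a prompt component means here.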

Scaling laws: prompt diversity x data diversity

A critical finding: prompt diversity and data diversity have a synergistic relationship. Adding more diverse data without prompt diversity helps only a little (the model averages). Adding prompt diversity without more data also has limited effect (the model overfits to prompt variations). But adding both together produces superlinear improvement: the prompt conditioning enables the model to absorb data diversity, and the data diversity gives the model more to learn from.


Chapter 7: Emergent Capabilities

The most striking result of π0.7 isn’t any single task — it’s the emergent capabilities that arise from the combination of diversified prompting, diverse data, and scale. These are behaviors that were never explicitly trained for but emerge naturally from the model’s learned representations.

Out-of-the-box dexterity

Without any task-specific fine-tuning, π0.7 can perform remarkably dexterous tasks such as making espresso, folding laundry, and peeling vegetables.

Cross-embodiment transfer

π0.7 demonstrates zero-shot cross-embodiment transfer: it can transfer a learned skill from one robot body to a completely different one without any additional training. For example, folding learned on a bimanual platform transfers to a different bimanual robot with different kinematics, different cameras, and different workspace geometry.

This works because the diversified prompt disambiguates the embodiment. The control mode (joint vs. end-effector) and the robot-specific metadata tell the model which body it’s controlling, allowing a single set of weights to drive multiple platforms.

How cross-embodiment actually works: joint vs Cartesian

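π0.7 supports two control modes, switchable via prompt: joint mode, where the model outputs joint-level actions, and Cartesian mode, where it outputs end-effector poses.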

When transferring a skill from Robot A to Robot B, π0.7 uses Cartesian mode. The VLM understands the task (“fold this towel”), the model generates Cartesian trajectories (“move both grippers inward”), and each robot’s local IK controller translates them into its own joint space. The diversified prompt specifies which mode to use, and the model saw both modes during training.
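The reason a Cartesian target transfers across bodies can be seen with a toy planar arm. In this sketch two hypothetical 2-link robots with different link lengths receive the same end-effector target; each runs its own inverse kinematics and arrives at different joint angles that reach the same point. Real platforms have more joints and use their own IK solvers; the link lengths and target here are made up.

```python
import numpy as np


def ik_2link(x, y, l1, l2):
    """Inverse kinematics for a planar 2-link arm (one of the two elbow solutions)."""
    c2 = (x**2 + y**2 - l1**2 - l2**2) / (2 * l1 * l2)
    if abs(c2) > 1.0:
        raise ValueError("target out of reach")
    q2 = np.arccos(c2)
    q1 = np.arctan2(y, x) - np.arctan2(l2 * np.sin(q2), l1 + l2 * np.cos(q2))
    return q1, q2


def fk_2link(q1, q2, l1, l2):
    """Forward kinematics: joint angles -> end-effector position."""
    return (l1 * np.cos(q1) + l2 * np.cos(q1 + q2),
            l1 * np.sin(q1) + l2 * np.sin(q1 + q2))


target = (0.35, 0.20)                      # shared Cartesian command, in meters
robot_a = {"l1": 0.30, "l2": 0.25}         # hypothetical Robot A link lengths
robot_b = {"l1": 0.40, "l2": 0.35}         # hypothetical Robot B link lengths

joints_a = ik_2link(*target, **robot_a)
joints_b = ik_2link(*target, **robot_b)
reached_a = fk_2link(*joints_a, **robot_a)
reached_b = fk_2link(*joints_b, **robot_b)

print("Robot A joints:", np.round(joints_a, 3), "reaches", np.round(reached_a, 3))
print("Robot B joints:", np.round(joints_b, 3), "reaches", np.round(reached_b, 3))
```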

Compositional generalization

Perhaps the most impressive emergent capability: π0.7 can be coached to perform entirely novel tasks by composing known subtasks in new sequences. A human types step-by-step instructions, and the robot executes each one, even though the overall task was never seen in training.

Espresso-making: a full trace

Let’s walk through the full data flow for making espresso — one of π0.7’s showcase demonstrations:

| Step | Subtask Instruction | Subgoal Image | Actions Generated | Duration |
| --- | --- | --- | --- | --- |
| 1 | “Unlock portafilter latch” | Latch in open position | [50, 14]: bimanual reach + twist motion | ~2s |
| 2 | “Remove portafilter” | Hand holding portafilter, machine empty | [50, 14]: pull-away trajectory | ~3s |
| 3 | “Place under grinder” | Portafilter below grinder spout | [50, 14]: navigate + position | ~4s |
| 4 | “Press grind button” | Button depressed, grounds falling | [50, 14]: reach + press | ~1s |
| 5 | “Tamp the grounds” | Flat, compressed coffee surface | [50, 14]: hold filter + press tamper | ~3s |
| 6 | “Lock portafilter in group head” | Portafilter locked in position | [50, 14]: insert + rotate motion | ~4s |
| 7 | “Place cup under spout” | Cup positioned below spout | [50, 14]: grasp cup + navigate + place | ~3s |
| 8 | “Press brew button” | Button depressed, espresso flowing | [50, 14]: reach + press | ~1s |

Total: 8 subtasks, ~21 seconds of manipulation. Each subtask generates 1-4 action chunks of 50 timesteps. The bimanual actions are 14-dimensional (7 per arm, covering the arm joints and the gripper). The most challenging step is #6 (portafilter locking): it requires precise force along a curved insertion path, coordinated with a 90-degree rotation. This is the step that fails most often (~30% failure rate).
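The totals can be checked from the per-step durations in the trace. This sketch assumes, for simplicity, that each 1-second chunk is executed in full with no overlap, so chunk counts are a rough estimate.

```python
import math

durations_s = [2, 3, 4, 1, 3, 4, 3, 1]   # per-subtask durations from the trace
control_hz = 50
chunk_len = 50                            # timesteps per action chunk
action_dim = 14                           # bimanual action dimension

total_seconds = sum(durations_s)
chunk_seconds = chunk_len / control_hz
chunks_per_subtask = [math.ceil(d / chunk_seconds) for d in durations_s]
total_timesteps = total_seconds * control_hz

print(total_seconds, chunks_per_subtask, (total_timesteps, action_dim))
```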

What degrades and why

Even π0.7 has clear limits. Understanding what breaks helps clarify what the model actually learned:

| Condition | Effect | Root Cause |
| --- | --- | --- |
| Out-of-distribution object shape | Grasp fails ~40% of the time | Action expert hasn’t seen similar geometry; VLM recognizes object but action policy can’t grip it |
| Novel robot morphology (not in training) | Complete failure | Control mode metadata can’t bridge to truly unseen kinematics |
| Ambiguous language instruction | Model picks one interpretation, sometimes wrong | Prompt disambiguation helps but can’t resolve genuine ambiguity |
| Completely dark room | Complete failure | VLM is vision-based; no visual input = no understanding |
| Very fast required movements | Quality degrades (overshoots) | Action chunking at 50 Hz has latency; truly reactive motions need >100 Hz |
| Force-sensitive tasks (e.g., egg handling) | Inconsistent | No force/torque feedback in the observation space; relies on visual cues only |

Why “emergent”? These capabilities weren’t the result of specific design choices for each task. Nobody trained a “peeling module” or an “espresso module.” They emerged from three ingredients at scale: (1) enough diverse data to cover the space of manipulations, (2) diversified prompts to prevent mode averaging, and (3) a model large enough to represent the full complexity. The whole is genuinely greater than the sum of parts.

Chapter 8: Results

Let’s look at the quantitative evidence for π0.7’s claims. The evaluations span multiple dimensions: task performance, ablations of the diversified prompt, scaling studies, and comparisons to prior work.

Out-of-the-box performance

π0.7 achieves strong zero-shot performance on dexterous tasks without task-specific fine-tuning. On a suite of 8 challenging manipulation tasks (espresso, folding, peeling, box assembly, etc.), the model achieves an average success rate significantly above prior VLAs that require per-task fine-tuning.

Ablation: prompt diversity matters

The ablation studies reveal the critical importance of each prompt component:

The inference pipeline in real-world deployment

| Step | Computation | Frequency | Latency |
| --- | --- | --- | --- |
| Camera capture | 3 cameras @ 224x224 | 10 Hz | ~5ms |
| SigLIP encoding | 3 images → 768 tokens | 10 Hz | ~8ms |
| MEM history compression | 300 frames → 64 tokens | 2 Hz | ~5ms |
| Subtask prediction (HL policy) | Autoregressive text | 0.1-0.2 Hz | ~400ms |
| Subgoal generation (BAGEL) | Image generation | 0.1-0.2 Hz | ~500ms |
| Flow matching (action expert) | 10 denoising steps | 2 Hz | ~30ms |
| Motor command execution | 50 joint commands | 50 Hz | ~1ms |

The pipeline is asynchronous: subtask prediction and subgoal generation run in the background while the action expert continuously generates motor commands. The action expert never waits for the slow stages — it uses the most recent subtask and subgoal until they’re updated.

The synergy finding: The most important result in the ablations is that prompt diversity and data diversity are synergistic. Removing either one hurts performance, but removing both hurts more than the sum of removing each individually. The diversified prompt enables the model to absorb more diverse data, and diverse data gives the diversified prompt more to condition on. This virtuous cycle is the engine of π0.7’s generalization.

Scaling studies

π0.7 continues to improve with more data and larger models. The scaling curves show no sign of saturating, suggesting that the diversified prompt approach has room for further gains as robotics datasets grow.

Concrete scaling numbers from the paper:

| Data Scale | Success Rate (dexterous suite) | Improvement vs Baseline |
| --- | --- | --- |
| 1K hours, no prompt diversity | 38% | Baseline |
| 1K hours, with prompt diversity | 55% | +17 pp (prompt helps at all scales) |
| 5K hours, no prompt diversity | 48% | +10 pp (data alone has limits) |
| 5K hours, with prompt diversity | 72% | +34 pp (synergy kicks in) |
| 10K hours, with prompt diversity | 84% | +46 pp (best result) |

The key observation: going from 1K to 10K hours with prompt diversity gives +29 pp. Going from no-prompt to full-prompt at 5K hours gives +24 pp. The interaction shows up at 5K hours: prompt diversity alone adds +17 pp and 5x more data alone adds +10 pp, yet together they add +34 pp, which is 7 pp more than the +27 pp sum. This superlinear scaling is the central quantitative finding of the paper.
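The synergy claim can be checked against the success rates in the table. The sketch below recomputes the individual gains and the interaction term, meaning the part of the combined gain not explained by adding the two individual gains; the dictionary layout is just a convenient way to hold the reported numbers.

```python
# Success rates (%) from the scaling table: (data scale, prompt diversity) -> rate
success = {
    ("1K", False): 38,
    ("1K", True): 55,
    ("5K", False): 48,
    ("5K", True): 72,
    ("10K", True): 84,
}

baseline = success[("1K", False)]
prompt_only = success[("1K", True)] - baseline    # add prompt diversity at 1K hours
data_only = success[("5K", False)] - baseline     # add data without prompt diversity
both = success[("5K", True)] - baseline           # add both
interaction = both - (prompt_only + data_only)    # gain beyond the additive prediction

data_gain_with_prompt = success[("10K", True)] - success[("1K", True)]
prompt_gain_at_5k = success[("5K", True)] - success[("5K", False)]

print(prompt_only, data_only, both, interaction)
print(data_gain_with_prompt, prompt_gain_at_5k)
```

A positive interaction term is what "synergistic" means quantitatively: the two ingredients together deliver more than their separate contributions would predict.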

Cross-embodiment results

On cross-embodiment transfer tasks (folding trained on Robot A, evaluated on Robot B), π0.7 achieves success rates comparable to models trained directly on Robot B’s data — without ever having seen Robot B during training for that specific task.


Chapter 9: Connections

The π-series lineage

π0.7 is the latest in Physical Intelligence’s line of VLA models, each building on the last:

Key technical components

The engineering decisions that define π0.7

| Decision | Alternative | Why π0.7’s Choice |
| --- | --- | --- |
| Diversified prompt (4 types) | Language-only conditioning | Resolves mode averaging, enables steerability |
| Gemma 3 (4B) + action expert (860M) | Single unified model | Knowledge Insulation prevents forgetting |
| MEM for history (64 tokens) | Raw frame stacking | 1200x compression makes history tractable |
| BAGEL subgoal images | Language-only subtasks | Pixel-level precision for spatial tasks |
| Quality/speed metadata | Filter bad data out | Turns all data into an asset, 2-3x more usable data |
| Joint + Cartesian modes | Joint-only | Different tasks benefit from different control spaces |

Broader context

What π0.7 doesn’t solve

Despite its capabilities, π0.7 has fundamental limitations that define the next generation of challenges:

The bigger picture

π0.7 represents a shift in how we think about robot learning. Instead of training specialized models for each task, embodiment, and quality level, you train one model on everything and let the prompt select the desired behavior. This is the same insight that made language models powerful: a single model, conditioned on diverse prompts, can perform any task in its training distribution — and many tasks outside it.

The lesson of π0.7: Data diversity alone isn’t enough. Prompt diversity alone isn’t enough. You need both. The prompt must be at least as expressive as the data is varied. When it is, the model can absorb arbitrarily diverse data without mode averaging, and the resulting policy is steerable, compositional, and generalizable.