pi-0.7: Steerable Generalist Robot Foundation Model

Chapter 0: The Problem

You’re training a robot to clean a kitchen. You have 10,000 demonstrations — some fast, some slow. Some follow one strategy (wipe counter then clear dishes), others follow the opposite order. Some come from one robot arm, others from a completely different platform. Some are great; some made mistakes along the way but eventually succeeded.

A naïve model trained on all this data will average the strategies together. Fast and slow get blended into medium-speed. Counter-first and dishes-first get blended into indecisive hesitation. Good and bad demonstrations get blended into mediocre behavior. The richer and more diverse your dataset, the worse this averaging problem becomes.

This is the curse of multi-modality in behavior cloning. The model sees many valid ways to do a task, can’t distinguish between them at inference time, and produces an incoherent mixture that corresponds to none of them.

Why existing VLAs can’t compose

Previous VLAs like RT-2 and even π0 take a single language instruction (“clean the kitchen”) and map it directly to actions. This works for simple, atomic tasks. But for complex multi-step tasks, a single instruction is hopelessly underspecified:

Which subtask first? “Clean the kitchen” doesn’t say whether to start with the counter, the dishes, or the trash.
How fast? The same task can be done carefully or quickly — the model can’t know which you want.
What quality level? A demonstration where the robot dropped something and recovered is still useful data — but you don’t want the model to copy the dropping part.
Which body? A folding demonstration from a bimanual robot can’t be naively transferred to a single-arm platform without additional context.

The mode averaging problem in numbers

Consider a concrete example. Two valid strategies for “clear the table”:

Strategy A: Start with the left side. Reach left (joint 1 = -30 degrees). Clear items left to right.
Strategy B: Start with the right side. Reach right (joint 1 = +30 degrees). Clear items right to left.

If the model averages: joint 1 = 0 degrees. The arm reaches straight ahead — toward neither target. It hovers indecisively in the center of the table. This isn’t a hypothetical failure mode; it’s the dominant failure pattern in behavior cloning with diverse data.
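The averaging can be reproduced in a few lines. This is a minimal sketch, assuming an unconditioned policy trained with squared-error loss and an even split between the two strategies; the variable names are illustrative and not taken from π0.7.

```python
# Toy illustration: joint-1 targets (degrees) from the two strategies above.
strategy_a = [-30.0] * 50   # demonstrations that reach left first
strategy_b = [+30.0] * 50   # demonstrations that reach right first
targets = strategy_a + strategy_b


def mse(prediction, data):
    """Mean squared error of a single constant prediction."""
    return sum((prediction - y) ** 2 for y in data) / len(data)


# The constant that minimizes squared error is the mean of the targets.
averaged = sum(targets) / len(targets)

print(averaged)                 # 0.0
print(mse(averaged, targets))   # 900.0
print(mse(-30.0, targets))      # 1800.0
```

The averaged prediction has a lower loss than committing to either strategy, which is why an unconditioned regressor prefers it even though it matches no demonstration.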

Flow matching helps (it can represent multi-modal distributions), but only for low-level action diversity. It can’t resolve strategic ambiguity: which side of the table to start with is not a continuous distribution to sample from — it’s a discrete choice that should be made once and committed to.

The core tension: Data diversity is essential for generalization — you need demonstrations from many robots, many strategies, many quality levels. But data diversity without context conditioning leads to mode averaging, where the model produces the mean of all strategies instead of any single coherent one. π0.7 resolves this tension by making the prompt as diverse as the data.

Why does training on diverse demonstrations with a single language instruction produce suboptimal behavior?

The model averages over multiple valid strategies, producing an incoherent mixture that corresponds to no single coherent strategy
The model overfits to the most common demonstration
The language instruction is too long for the model to process

Chapter 1: The Key Insight

π0.7’s insight is deceptively simple: if your data is diverse, your prompt must be equally diverse. Instead of conditioning on a single language instruction, π0.7 conditions on a rich, multi-faceted prompt that specifies not just what to do, but how to do it.

The diversified prompt

Every training episode in π0.7 is annotated with four types of context:

Subtask instructions — intermediate semantic steps like “open the fridge door” or “pick up the sponge” that decompose a high-level task into a sequence of atomic actions.
Subgoal images — machine-generated images showing what the world should look like after each subtask completes. These are near-future visual targets.
Episode metadata — numerical descriptors including speed (how many timesteps the episode takes), quality (1–5 score), and mistake labels (which segments contain errors).
Control mode — whether the robot should produce joint-level actions or end-effector Cartesian actions.

Why this works: The diversified prompt resolves the ambiguity in the data. Two demonstrations that look identical at the task level (“clean the kitchen”) but differ in strategy, speed, or quality now have different prompts. The model can learn a single coherent policy for each prompt configuration, instead of averaging across all of them. More prompt diversity → more data diversity the model can absorb → better generalization.

This is an instance of a deeper principle: conditional models can represent multi-modal distributions without averaging, as long as the conditioning variable disambiguates the modes. Gaussian mixture models work the same way — each component has a different mean, but only one fires for each value of the latent variable. π0.7’s prompt plays the role of the latent variable.
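A small sketch of this principle, assuming a hypothetical strategy label stands in for the prompt and the targets are made-up joint-1 angles: once the targets are grouped by the conditioning variable, each group keeps its own mean instead of collapsing to the global one.

```python
from collections import defaultdict

# (strategy label, joint-1 target in degrees). The label plays the role
# of the prompt; labels and values are illustrative assumptions.
demos = [
    ("left_first", -30.0), ("left_first", -28.0),
    ("right_first", 30.0), ("right_first", 32.0),
]


def conditional_means(pairs):
    """Average the targets separately for each conditioning label."""
    grouped = defaultdict(list)
    for label, target in pairs:
        grouped[label].append(target)
    return {label: sum(v) / len(v) for label, v in grouped.items()}


unconditional = sum(t for _, t in demos) / len(demos)
per_prompt = conditional_means(demos)

print(unconditional)               # 1.0
print(per_prompt["left_first"])    # -29.0
print(per_prompt["right_first"])   # 31.0
```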

What changes from π0 and π0.5

Aspect	π0	π0.5	π0.7
Conditioning	Language instruction only	Language + subtask (from same model)	Language + subtask + subgoal image + metadata + control mode
Model size	3B (PaliGemma)	3B (PaliGemma)	5B (Gemma 3 + 860M action expert)
Video history	Current frame only	Current frame + recent frames	MEM encoder compresses full history
Data quality	All treated equally	All treated equally	Quality/speed/mistake metadata per episode
Cross-embodiment	Robot ID token	Robot ID token	Control mode + robot metadata

Steerability at inference time

The diversified prompt isn’t just a training trick — it makes the model steerable at deployment. You can:

Set quality=5, speed=fast for high-performance execution
Provide subtask instructions step-by-step to coach the robot through novel tasks
Supply a subgoal image showing the desired world state
Switch between joint and Cartesian control modes on the fly

What role does the diversified prompt play in resolving mode averaging?

It disambiguates the different strategies in the data so the model can learn each one separately instead of averaging them together
It filters out bad demonstrations before training
It increases the model’s parameter count

Chapter 2: Architecture

π0.7 is a 5-billion parameter Vision-Language-Action model built from three major components, each serving a distinct role. Let’s walk through the full architecture.

Component 1: VLM backbone (4B)

The backbone is Gemma 3, a 4-billion parameter vision-language model pre-trained on internet-scale text and image data. It processes the language instruction, subtask instructions, episode metadata, and images through its standard multimodal transformer architecture.

The VLM’s job is understanding: parsing the instruction, recognizing objects in the scene, grounding language to visual features. It outputs rich contextual representations that tell the action expert what the robot should do.

Component 2: MEM video encoder

Raw observation history is expensive — sending all camera frames through the full VLM would be prohibitively slow. π0.7 uses Multi-Scale Embodied Memory (MEM), a specialized video encoder that compresses the robot’s observation history into a compact set of memory tokens.

MEM processes frames at multiple temporal scales: recent frames at high resolution, older frames at lower resolution. This gives the model a sense of what has happened without overwhelming the context window. The compressed memory tokens are injected into the VLM’s context alongside the language and image tokens.

Concrete numbers: MEM compresses the last 10 seconds of observation history (3 cameras x 10 fps x 10 seconds = 300 frames) into just 64 memory tokens. Without MEM, representing 300 frames at 256 tokens each would require 300 x 256 = 76,800 visual tokens, far too many to push through a 4B backbone at real-time control rates. MEM achieves a 1200x compression ratio (76,800 / 64).
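The arithmetic behind these figures, written out as a quick check. All constants are the ones quoted in the text; the variable names are illustrative.

```python
cameras, fps, seconds = 3, 10, 10
tokens_per_frame = 256     # visual tokens per frame
memory_tokens = 64         # size of the compressed MEM output

frames = cameras * fps * seconds
raw_tokens = frames * tokens_per_frame
compression = raw_tokens / memory_tokens

print(frames, raw_tokens, compression)  # 300 76800 1200.0
```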

What MEM captures that single frames miss

Why does temporal history matter? Several manipulation scenarios require it:

Object tracking: The robot pushed an object behind another object 3 seconds ago. The current frame doesn’t show it, but MEM remembers it was there.
Motion detection: A human walked through the workspace 5 seconds ago. MEM encodes “something moved recently” which triggers cautious behavior.
Progress tracking: The robot has been folding a shirt for 8 seconds. MEM encodes the fold history: “first fold done, second fold in progress.” Without this, a single frame of a partially folded shirt is ambiguous: the model can’t distinguish “I just folded this” from “someone left a half-folded shirt here.”
Velocity estimation: Is the robot’s arm currently moving or stationary? A single frame can’t tell. MEM encodes the motion trajectory from recent frames.

In the ablation, removing MEM drops performance by ~10 pp on multi-step tasks but barely affects single-step tasks. History only matters when the task requires memory of what happened before.

Component 3: Flow matching action expert (860M)

The action expert is an 860-million parameter transformer that generates robot actions via flow matching. It takes the VLM’s output representations as conditioning and iteratively denoises a random noise vector into a sequence of precise robot joint positions or end-effector poses.

Flow matching is chosen over discrete tokenization because robot actions are continuous — joint angles and Cartesian coordinates live in smooth, high-dimensional spaces where discretization introduces unnecessary quantization error.
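The iterative denoising can be sketched as Euler integration of a velocity field from noise (t=0) to an action (t=1). This is a toy under stated assumptions: in the real model the velocity comes from the action expert conditioned on the VLM’s representations, whereas here a closed-form straight-line velocity toward a known target stands in for it. The 3-dimensional action, the time convention, and all function names are hypothetical.

```python
import random


def velocity(x, t, target):
    """Toy stand-in for the learned velocity field.

    For a straight-line path from noise (t=0) to the target (t=1),
    the velocity that lands on the target is (target - x) / (1 - t).
    """
    return [(g - xi) / (1.0 - t) for xi, g in zip(x, target)]


def denoise(noise, target, num_steps=10):
    """Integrate the velocity field from t=0 to t=1 with Euler steps."""
    x = list(noise)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k / num_steps
        v = velocity(x, t, target)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


rng = random.Random(0)
target_action = [0.25, -0.5, 0.1]   # hypothetical 3-D action
noise = [rng.gauss(0.0, 1.0) for _ in target_action]
action = denoise(noise, target_action)
```

With 10 steps the sample ends at the target up to floating-point error; a learned field would instead land on whichever action mode the conditioning selects.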

Why a separate action expert? This is the Knowledge Insulation (KI) design. By keeping the VLM backbone frozen (or lightly adapted) and training a separate action expert, π0.7 preserves the VLM’s pre-trained knowledge — language understanding, visual recognition, world knowledge — while learning robot-specific action generation in the expert. Without KI, fine-tuning the full VLM on robot data causes catastrophic forgetting of the very capabilities that make it useful.

The full token budget

Token Type	Count	Source	Processed By
Visual tokens (current frame, 3 cams)	768	SigLIP encoder	VLM backbone
MEM history tokens	64	MEM encoder	Action expert
Language instruction	~20-50	Gemma tokenizer	VLM backbone
Subtask instruction	~10-20	Gemma tokenizer	VLM backbone
Subgoal image	256	SigLIP encoder	VLM backbone
Metadata (speed/quality/mode)	~5-10	Numerical encoding	VLM backbone
Action tokens (flow matching)	50	Denoising	Action expert
Total	~1200
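As a sanity check on the “~1200” total, the rows can be summed using the low and high ends of each approximate range. The counts come from the table; the dictionary keys are illustrative names.

```python
# (low, high) token counts per row of the table above.
budget = {
    "visual_current": (768, 768),
    "mem_history": (64, 64),
    "language": (20, 50),
    "subtask": (10, 20),
    "subgoal_image": (256, 256),
    "metadata": (5, 10),
    "action": (50, 50),
}

low = sum(lo for lo, _ in budget.values())
high = sum(hi for _, hi in budget.values())

print(low, high)  # 1173 1218
```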

Putting it together: the inference pipeline

At inference, the pipeline flows like this:

Encode context: The VLM processes the diversified prompt (language instruction, subtask, metadata, subgoal image) plus the current camera image. ~15ms on A100.
Compress history: MEM encodes the recent observation history into 64 compact memory tokens. ~5ms.
Generate actions: The flow matching expert takes the VLM’s output + MEM tokens and iteratively denoises noise into a chunk of 50 future robot actions (50 steps at 50 Hz = 1 second of motion). 10 denoising steps x ~3ms = ~30ms.
Execute: First 25 actions execute at 50 Hz (500ms), then re-plan with fresh observations.

Total latency: ~50ms from observation to first action. The model runs at roughly 2 Hz re-planning frequency with action chunking, generating 1-second action chunks that overlap for smooth execution.
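The timing figures above fit together as follows. This is a sketch using the numbers quoted in the pipeline; `plan_chunk` is a hypothetical stand-in for a policy call, and the re-planning rate ignores inference latency.

```python
vlm_ms = 15                        # context encoding
mem_ms = 5                         # history compression
denoise_steps, ms_per_step = 10, 3
control_hz = 50
chunk_len, executed_per_chunk = 50, 25

latency_ms = vlm_ms + mem_ms + denoise_steps * ms_per_step
chunk_seconds = chunk_len / control_hz
replan_hz = control_hz / executed_per_chunk

print(latency_ms, chunk_seconds, replan_hz)  # 50 1.0 2.0


def receding_horizon(plan_chunk, num_replans):
    """Execute the first half of each chunk, then re-plan."""
    executed = []
    for step in range(num_replans):
        chunk = plan_chunk(step)   # hypothetical policy call
        executed.extend(chunk[:executed_per_chunk])
    return executed


# Usage: four re-plans cover 100 control steps (2 seconds at 50 Hz).
trace = receding_horizon(lambda step: [step] * chunk_len, num_replans=4)
```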

Why does π0.7 use a separate 860M action expert instead of having the VLM directly output actions?

Knowledge Insulation: training the VLM on robot data would cause catastrophic forgetting of its pre-trained language/vision capabilities
The VLM is too slow to generate actions at real-time rates
Flow matching requires a specific architecture that VLMs can’t support

Chapter 3: Subtask Instructions

The first and most important component of the diversified prompt is the subtask instruction: a short language description of the immediate next step the robot should take.

From high-level to atomic

Consider the task “make espresso.” This is a high-level instruction that encompasses dozens of individual manipulations:

Open the portafilter latch
Remove the portafilter from the machine
Place portafilter under the grinder
Press the grind button
Tamp the coffee grounds
Insert portafilter into the group head
Lock the portafilter
Place cup under the spout
Press the brew button

Each subtask is an atomic manipulation that the model can execute with a single action chunk (or a few chunks). The subtask instruction tells the model which atomic step to execute right now.

Where do subtask labels come from?

π0.7 gets subtask labels from three sources:

Human annotation: Teleoperators label each segment of a demonstration with a short description. This is the highest quality but most expensive source (~$5-10 per episode in annotation cost).
High-level policy: A language model (the “high-level policy”) watches the current scene and generates the next subtask instruction. This enables autonomous execution of long-horizon tasks without human intervention at every step. Latency: ~300-500ms per subtask generation.
Human coaching: At deployment, a human can type subtask instructions in real-time to guide the robot through a novel task it has never seen. This is how π0.7 can do completely new tasks like “put the sweet potato in the air fryer” — the human decomposes the novel task, and the robot executes each subtask from its repertoire.

Subtask label statistics

The training dataset contains approximately 200 unique subtask types. The distribution is long-tailed:

Subtask Category	Examples	Frequency
Pick/place	“Pick up the mug”, “Place in sink”	~35% of segments
Open/close	“Open cabinet”, “Close drawer”	~15%
Tool use	“Press button”, “Turn knob”	~12%
Bimanual	“Fold in half”, “Hold and pour”	~10%
Navigation	“Move to counter”, “Approach fridge”	~8%
Cleaning	“Wipe surface”, “Scrub plate”	~8%
Other	“Peel vegetable”, “Tamp grounds”	~12%

The long tail matters: rare subtasks like “tamp espresso grounds” appear in only a few dozen episodes. But because the model learns compositional representations, it can execute these rare subtasks by combining patterns from more common ones (pressing downward + maintaining contact = tamping).

The compositionality mechanism

This is where π0.7’s emergent compositional generalization comes from. The model learns to execute individual atomic subtasks (“open door,” “pick up object,” “place in container”). At inference, you can compose these subtasks in novel sequences to achieve tasks never seen in training.

Worked example: The model has never been trained on “make a peanut butter sandwich.” But it has learned:

“Open the jar lid” (from jam-opening demonstrations)
“Scoop with knife” (from butter-spreading demonstrations)
“Spread on bread” (from toast-preparation demonstrations)
“Close the jar” (from cleanup demonstrations)

A human coach provides: “open the peanut butter jar” → “scoop peanut butter with the knife” → “spread on the bread” → “close the jar.” Each subtask maps to a known skill. The novel composition produces a novel task.

Compositionality from subtasks: The whole is greater than the sum of parts. The model learns ~200 atomic subtasks during training. But the number of possible sequences of these subtasks is combinatorial — effectively infinite. Subtask conditioning gives π0.7 access to this combinatorial space at inference time.
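To put a number on “combinatorial”: a quick count of ordered subtask sequences, assuming repeats are allowed and ignoring whether a given sequence is physically meaningful (most are not). The helper name is illustrative.

```python
num_subtasks = 200   # approximate number of unique subtask types


def num_sequences(length, vocabulary=num_subtasks):
    """Ordered sequences of a given length, allowing repeats."""
    return vocabulary ** length


for length in (2, 4, 9):
    print(length, num_sequences(length))
```

Four-step sequences, the length of the peanut butter sandwich example, already number 1.6 billion; nine-step sequences like the espresso task exceed 5 x 10^20.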

How does π0.7 perform completely new tasks like “put the sweet potato in the air fryer” without any training data for that task?

A human (or high-level policy) decomposes the novel task into known subtasks, and the model composes its learned atomic skills in a new sequence
The model uses few-shot learning from similar tasks
The model is fine-tuned on the new task with a handful of demonstrations

Chapter 4: Subgoal Images

Language subtask instructions tell the robot what to do semantically. But language is inherently imprecise about spatial details. “Place the cup on the counter” doesn’t specify where on the counter, at what angle, or how far from the edge. Subgoal images fill this gap.

What is a subgoal image?

A subgoal image is a generated photograph of what the world should look like after the current subtask is completed. If the subtask is “close the fridge door,” the subgoal image shows the fridge with the door closed. If the subtask is “place portafilter in group head,” the subgoal image shows the portafilter locked in position.

These images are not real photographs — they are synthesized by a world model. π0.7 uses BAGEL, an image generation model, conditioned on the current observation and the subtask instruction, to imagine the near-future visual state.

How good are the subgoal images?

BAGEL generates surprisingly accurate near-future predictions for rigid scenes (fridge doors, drawers, objects on counters). But quality degrades predictably:

Rigid objects: Excellent (fridge open → fridge closed). BAGEL just needs to “erase” the door gap. Pixel accuracy: ~90%.
Object placement: Good (mug on counter → mug in sink). Position is approximate but sufficient for grasping. Pixel accuracy: ~70%.
Deformable objects: Poor (unfolded shirt → folded shirt). BAGEL can’t predict exact fold lines. Pixel accuracy: ~30%. The model relies more on language subtasks here.
Novel objects: Mixed. BAGEL has seen many objects through web training but hallucinates unfamiliar ones.

The data flow for subgoal images

Input to BAGEL: Current camera image (224x224) + subtask text (“close the fridge door”)
BAGEL generates: A 224x224 image of the expected post-subtask state
SigLIP encodes: The generated image → 256 visual tokens (same as a real camera image)
Injected into prompt: These 256 tokens are concatenated with the current observation tokens
Action expert sees: Both “where I am” (current image) and “where I should be” (subgoal image)

The model learns to close the gap between current and subgoal images through action generation. This is effectively visual servoing with a learned world model as the reference generator.
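The token counts in this data flow can be checked with a short sketch. The patch size of 14 is an assumption (the text gives only the 224x224 resolution and the 256-token result), and the placeholder token lists and function names are illustrative.

```python
IMAGE_SIZE = 224
PATCH_SIZE = 14    # assumption: not stated in the text
NUM_CAMERAS = 3


def tokens_per_image(image_size=IMAGE_SIZE, patch_size=PATCH_SIZE):
    """Number of non-overlapping square patches in a square image."""
    return (image_size // patch_size) ** 2


def build_visual_prompt(observation_tokens, subgoal_tokens):
    """Concatenate current-observation tokens with subgoal tokens."""
    return list(observation_tokens) + list(subgoal_tokens)


per_image = tokens_per_image()
observation = ["obs"] * (NUM_CAMERAS * per_image)   # placeholder tokens
subgoal = ["goal"] * per_image                      # placeholder tokens
prompt = build_visual_prompt(observation, subgoal)

print(per_image, len(observation), len(prompt))  # 256 768 1024
```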

Why not just use language?

Consider two scenarios where language fails:

Spatial precision: “Fold the shirt in thirds” — language can describe this, but an image showing the exact fold lines is far more precise.
Novel objects: “Pick up the blue thing” — if the object has never been seen, a generated image showing the robot holding it is more informative than any description.

Subgoal images provide pixel-level specificity about the desired world state: exact positions, orientations, and configurations that a short language instruction leaves open.

What degrades with subgoal images

The subgoal image system has specific failure modes:

BAGEL hallucination: The world model sometimes generates physically impossible scenes (object floating, incorrect spatial relationships). The robot then tries to achieve an impossible state. Mitigation: the model learns to weight subgoal images less when they conflict with physical constraints.
Camera viewpoint mismatch: The subgoal is generated from the current camera viewpoint. If the robot moves (mobile base), the viewpoint changes and the subgoal becomes misaligned. Solution: re-generate subgoal images after significant movement.
Deformable objects: Predicting the exact configuration of a folded shirt is extremely difficult for any generative model. The subgoal images for deformable object tasks are approximate at best.

World model as a planner: The BAGEL-based subgoal generation isn’t just visualization; it’s a form of planning. By generating a sequence of subgoal images (one per subtask), the system creates a visual plan: a movie of desired future states. The robot then executes this plan one subgoal at a time, checking its progress against each subgoal. This resembles model-predictive control with a learned world model.

What advantage do subgoal images provide over language-only subtask instructions?

Pixel-level spatial precision about the desired world state that language cannot express: exact positions, orientations, and configurations
They are faster for the model to process than language tokens
They eliminate the need for a camera at deployment

Chapter 5: Episode Metadata

Here’s a problem that every robotics lab faces: not all demonstrations are equally good. Some are fast and smooth. Some are slow and careful. Some contain mistakes — the robot dropped the object, picked it up again, and eventually succeeded. Do you throw away the imperfect data?

Throwing it away is wasteful. But including it without annotation causes the model to learn the mistakes along with the successes. π0.7’s solution: label each episode with metadata so the model can learn from all data while understanding what makes each episode different.

Three types of metadata

Speed — the episode length in timesteps. A fast demonstration of “pick up the cup” might be 30 steps; a slow, careful one might be 120 steps. By conditioning on speed, the model learns the relationship between pace and behavior without averaging fast and slow together.

Quality — a score from 1 to 5 indicating how well the demonstration was executed. Quality 5 means clean, efficient execution. Quality 1 means the robot struggled, made errors, but eventually completed the task. At inference, you simply set quality=5 to get the best behavior.

Mistake labels — binary flags on each segment indicating whether it contains an error (dropped object, collision, wrong grasp). This lets the model learn what not to do from the mistake segments while still learning useful recovery behaviors.

How metadata is encoded

The metadata is converted to text tokens and prepended to the language instruction:

“[speed=42] [quality=5] [mistakes=none] [mode=joint] Pick up the plate”

This is simple but effective. The VLM backbone processes these tokens alongside the instruction, learning to condition its representations on the metadata. During training, the actual metadata values are used. During inference, you set the values you want (for example, quality=5 and a low speed value for fast execution).
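A minimal sketch of this encoding, assuming the bracketed tag format shown in the example string. The function name and its defaults are hypothetical, not part of any published π0.7 interface.

```python
def format_prompt(instruction, speed, quality, mistakes="none", mode="joint"):
    """Prepend bracketed metadata tags to a language instruction."""
    tags = [
        f"[speed={speed}]",
        f"[quality={quality}]",
        f"[mistakes={mistakes}]",
        f"[mode={mode}]",
    ]
    return " ".join(tags + [instruction])


prompt = format_prompt("Pick up the plate", speed=42, quality=5)
print(prompt)
# [speed=42] [quality=5] [mistakes=none] [mode=joint] Pick up the plate
```

At training time the arguments would come from the episode’s labels; at inference they are whatever values the operator wants to steer toward.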

Speed conditioning: the numbers

Speed is encoded as the episode length in timesteps, normalized to a human-readable range. The model sees speed values from ~20 (very fast, aggressive motion) to ~200 (very slow, careful manipulation). At inference:

speed=30: Fast, aggressive motions. Good for simple pick-and-place where speed matters. Risk: overshooting, collisions.
speed=80: Medium pace. Balanced between speed and precision. Default for most deployments.
speed=150: Slow, deliberate. Good for tasks requiring precision (threading, insertion). Downside: tasks take roughly 2x as long as at speed=80 and 5x as long as at speed=30.

The model doesn’t just scale its velocities linearly. At low speed values (short episodes, fast execution), it takes fundamentally different trajectories: more direct, less cautious. At high speed values (long episodes, slow execution), it adds clearance motions (lifting higher before placing) and approach-from-above strategies. The model learned that slow demonstrations tend to be more careful because the human teleoperator chose to be careful, not just slow.

Mistake labels: learning what NOT to do

Mistake-labeled segments are powerful but tricky. Consider a demonstration where the robot drops an object at timestep 150, then recovers by re-grasping at timestep 200:

Timesteps 1-149: mistake=false. Normal approach and grasp attempt.
Timesteps 150-199: mistake=true. The drop, the confusion, the recovery approach.
Timesteps 200-250: mistake=false. Successful re-grasp and completion.

During training, the model sees all three segments. When conditioned on mistake=false, it learns the successful approach and the recovery re-grasp. When conditioned on mistake=true, it learns what a drop looks like (useful for knowing what to avoid). At inference with mistake=false, the model skips the dropping behavior entirely while still knowing how to recover if something goes wrong — because it saw the recovery segment labeled mistake=false.
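The segmentation in this example can be written as a per-timestep lookup. This is a sketch of the labeling only, using the segment boundaries given above; the data structure and function name are illustrative.

```python
# (first timestep, last timestep, mistake flag), inclusive bounds.
segments = [(1, 149, False), (150, 199, True), (200, 250, False)]


def mistake_flag(timestep):
    """Return the mistake label of the segment containing a timestep."""
    for first, last, flag in segments:
        if first <= timestep <= last:
            return flag
    raise ValueError(f"timestep {timestep} is outside the episode")


labels = {t: mistake_flag(t) for t in range(1, 251)}
mistake_steps = sum(labels.values())
clean_steps = len(labels) - mistake_steps

print(mistake_steps, clean_steps)  # 50 200
```

Only 50 of the 250 timesteps carry mistake=true, so the remaining 200, including the recovery re-grasp, remain available as clean training signal.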

This is genuinely clever: the mistake labels don’t just filter out bad data. They decompose imperfect episodes into useful components. Every demonstration, no matter how messy, contributes something.

A worked example of metadata in action

Consider 100 demonstrations of “pick up the mug”:

Quality	Count	Behavior Pattern	Useful For
5	20	Clean top-down grasp, smooth lift, no hesitation	Deployment (set quality=5)
4	30	Correct grasp but slightly slow approach	Robust grasping strategies
3	25	Slightly misaligned grasp, minor correction needed	Recovery behaviors
2	15	First grasp failed, re-approached, eventually succeeded	Failure recovery, retry strategies
1	10	Multiple failures, eventually succeeded after 3+ attempts	Extreme recovery, workspace exploration

Without metadata: the model would average all 100 demonstrations, producing a hesitant, mediocre policy. With metadata: the model learns 5 different quality levels. At quality=5, it produces the clean execution from the top 20 demos. But the lower-quality data isn’t wasted — it teaches the model about object properties, workspace geometry, and what recovery looks like.

The key insight: Episode metadata turns suboptimal data from a liability into an asset. Without metadata, including a quality-2 demonstration hurts performance — the model copies the mistakes. With metadata, the model learns “when quality=2, this is what happens; when quality=5, this is what happens.” Set quality=5 at inference and you get clean execution, but the quality-2 data still contributed useful information about object properties, workspace geometry, and recovery strategies.

How does quality metadata allow π0.7 to learn from suboptimal demonstrations without degrading performance?

The model learns to associate quality scores with execution patterns. At inference, setting quality=5 selects for clean behavior while the suboptimal data still provides useful environmental information
The metadata is used to filter out bad demonstrations before training
The model only trains on quality-5 data and ignores the rest

Chapter 6: Training Data

The diversified prompt isn’t just about making a single dataset work better. It unlocks the ability to co-train on fundamentally different data sources that would be impossible to combine without context conditioning.

Four data sources

1. Teleoperated demonstrations — the core dataset. Humans teleoperating various robot platforms to perform household tasks. Each episode is labeled with subtask instructions, quality scores, and metadata. This is the highest-quality, most expensive data source.

2. Autonomous robot data — episodes collected by the robot practicing on its own, using earlier policy checkpoints. This data is cheaper but noisier. Quality and mistake labels distinguish it from clean demonstrations.

3. Human egocentric video — recordings of humans performing tasks from a head-mounted camera. No robot actions, no joint angles. But the visual sequences contain rich information about task structure, object affordances, and manipulation strategies. The model learns what things look like when done correctly, even without learning how to move the joints.

4. Web data — internet-scale text and images used to maintain the VLM backbone’s language understanding and visual recognition capabilities. Without this, the VLM’s pre-trained knowledge degrades during robot fine-tuning.

The training data by the numbers

Source	Volume	Actions?	Loss Applied	Contribution
Teleoperated demos	~10K hours	Yes (joint angles)	Flow matching + subtask prediction	Core manipulation skills
Autonomous practice	~5K hours	Yes (noisier)	Flow matching (quality-conditioned)	Recovery, exploration
Human egocentric video	~2K hours	No	VLM understanding only	Task structure, affordances
Web data	~50M image-text pairs	No	Standard VLM loss	Prevents catastrophic forgetting

How non-robot data helps: Human video and web data don’t contain robot actions, so they can’t directly teach the robot to move. But they teach the VLM backbone to understand the world — recognizing objects, predicting consequences of actions, understanding spatial relationships. The action expert then translates this understanding into motor commands. This is why Knowledge Insulation matters: the VLM trains on vision-language data, the action expert trains on robot data, and neither corrupts the other.

Training infrastructure

Parameter	Value
Hardware	128 TPU v5e pods
Pre-training steps	~400K (FAST discrete tokens)
Post-training steps	~100K (flow matching)
Batch size	4096 (mixed sources)
Learning rate	1e-4 (pre-training) → 5e-6 (post-training)
Total training duration	~10 days
VLM backbone (Gemma 3)	Frozen during post-training (KI)
Action expert (860M)	Trained throughout

Classifier-free guidance for robotics

During training, π0.7 randomly drops each prompt component with probability p=0.1. So 10% of the time, subtask instructions are replaced with empty strings, and 10% of the time, subgoal images are replaced with blank images. This mirrors the conditioning dropout used to train diffusion models for classifier-free guidance.

Why? Two reasons:

Robustness: At inference, some prompt components may be unavailable (no subgoal image for a novel task, no quality label for a live scenario). The model must work with partial information.
Guidance strength: At inference, you can combine the conditioned and unconditioned outputs and push the result further toward the conditioned one, amplifying the effect of each prompt component. Setting guidance scale > 1.0 for quality metadata extrapolates past the conditioned output, making the model more quality-sensitive than the training data distribution. This is free steerability.
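Both halves of the mechanism can be sketched briefly: component dropout at training time and the standard classifier-free guidance combination at inference. The toy velocity vectors, the use of `None` as a stand-in for an empty string or blank image, and the function names are assumptions for illustration.

```python
import random


def drop_components(prompt, p=0.1, rng=random):
    """Independently blank each prompt component with probability p.

    None stands in for an empty string or a blank image.
    """
    return {k: (None if rng.random() < p else v) for k, v in prompt.items()}


def guided(v_uncond, v_cond, scale):
    """Classifier-free guidance: v_uncond + scale * (v_cond - v_uncond)."""
    return [u + scale * (c - u) for u, c in zip(v_uncond, v_cond)]


v_uncond = [0.0, 1.0]   # toy output with the component dropped
v_cond = [1.0, 3.0]     # toy output with the component present

print(guided(v_uncond, v_cond, 1.0))  # [1.0, 3.0]
print(guided(v_uncond, v_cond, 2.0))  # [2.0, 5.0]
```

A scale of 1.0 reproduces the conditioned output; a scale of 2.0 moves twice as far from the unconditioned output in the same direction.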

Scaling laws: prompt diversity x data diversity

A critical finding: prompt diversity and data diversity have a synergistic relationship. Adding more diverse data without prompt diversity doesn’t help (the model averages). Adding prompt diversity without more data doesn’t help (the model overfits to prompt variations). But adding both together produces superlinear improvement — the prompt conditioning enables the model to absorb data diversity, and the data diversity gives the model more to learn from.

Why can π0.7 learn from human egocentric video even though it contains no robot actions?

The VLM backbone learns visual understanding (objects, affordances, task structure) from the video, while the separate action expert learns motor commands from robot data. Knowledge Insulation keeps them from interfering
The model converts human hand movements into robot joint angles
Human video is only used for pre-training, not for robot fine-tuning

Chapter 7: Emergent Capabilities

The most striking result of π0.7 isn’t any single task — it’s the emergent capabilities that arise from the combination of diversified prompting, diverse data, and scale. These are behaviors that were never explicitly trained for but emerge naturally from the model’s learned representations.

Out-of-the-box dexterity

Without any task-specific fine-tuning, π0.7 can perform remarkably dexterous tasks:

Espresso machine operation — manipulating the portafilter, tamping, locking into the group head, pressing buttons. This requires precise force control and understanding of mechanical constraints.
Laundry folding — handling deformable fabrics, planning fold sequences, adapting to arbitrary initial configurations. This requires understanding of fabric dynamics that rigid-body physics can’t capture.
Trash bag manipulation — opening bags, holding them open while filling, tying them shut. Extremely deformable, contact-rich manipulation.
Box folding — creasing cardboard along scored lines, tucking flaps, creating a rigid structure from a flat sheet.
Vegetable peeling — holding a vegetable in one hand and peeling with the other, maintaining consistent pressure and angle throughout.

Cross-embodiment transfer

π0.7 demonstrates zero-shot cross-embodiment transfer: it can transfer a learned skill from one robot body to a completely different one without any additional training. For example, folding learned on a bimanual platform transfers to a different bimanual robot with different kinematics, different cameras, and different workspace geometry.

This works because the diversified prompt disambiguates the embodiment. The control mode (joint vs. end-effector) and the robot-specific metadata tell the model which body it’s controlling, allowing a single set of weights to drive multiple platforms.

How cross-embodiment actually works: joint vs Cartesian

π0.7 supports two control modes, switchable via prompt:

Joint mode: Output = 7 joint angle deltas (shoulder, elbow, wrist). Platform-specific — the same joint angles produce different end-effector motions on different robots. Used when fine joint control matters (portafilter locking, fabric folding).
Cartesian mode: Output = 6 end-effector pose deltas (x, y, z, roll, pitch, yaw) + gripper. Platform-agnostic — “move 5cm right” means the same thing regardless of robot kinematics. The robot’s inverse kinematics controller handles the joint-level execution. Used for cross-embodiment transfer.

When transferring a skill from Robot A to Robot B, π0.7 uses Cartesian mode. The VLM understands the task (“fold this towel”), the model generates Cartesian trajectories (“move both grippers inward”), and each robot’s local IK controller translates them into its own joint space. The diversified prompt specifies which mode to use; because both modes appear in the training data, the model can produce either one on request.
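To make the Cartesian-mode argument concrete, here is a minimal sketch using a planar two-link arm. The two robots, their link lengths, and the helper names (`two_link_ik`, `forward`, `apply_cartesian_delta`) are hypothetical stand-ins, not anything from π0.7 itself. The point is only that one shared Cartesian delta resolves to different joint angles on each body.

```python
import math

def two_link_ik(x, y, l1, l2):
    """Analytic IK for a planar 2-link arm (returns one of the two solutions)."""
    d2 = x * x + y * y
    cos_q2 = (d2 - l1 * l1 - l2 * l2) / (2 * l1 * l2)
    if not -1.0 <= cos_q2 <= 1.0:
        raise ValueError("target out of reach")
    q2 = math.acos(cos_q2)
    q1 = math.atan2(y, x) - math.atan2(l2 * math.sin(q2), l1 + l2 * math.cos(q2))
    return q1, q2

def forward(q1, q2, l1, l2):
    """Forward kinematics, used here to check the IK solution."""
    x = l1 * math.cos(q1) + l2 * math.cos(q1 + q2)
    y = l1 * math.sin(q1) + l2 * math.sin(q1 + q2)
    return x, y

# Two hypothetical robots that differ only in link lengths (metres).
ROBOTS = {"robot_a": (0.30, 0.25), "robot_b": (0.45, 0.35)}

def apply_cartesian_delta(pose, delta, links):
    """Add a Cartesian delta to the current pose and solve IK for this robot."""
    target = (pose[0] + delta[0], pose[1] + delta[1])
    return target, two_link_ik(target[0], target[1], *links)

start = (0.35, 0.10)   # shared end-effector position
delta = (0.05, 0.0)    # "move 5 cm right", identical for both robots

results = {name: apply_cartesian_delta(start, delta, links)
           for name, links in ROBOTS.items()}

for name, (target, joints) in results.items():
    print(name, "target:", target, "joints (rad):", joints)
```

Both robots reach the same target, but the joint angles differ. This is why a joint-space policy is platform-specific while a Cartesian one can be shared.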

Compositional generalization

Perhaps the most impressive emergent capability: π0.7 can be coached to perform entirely novel tasks by composing known subtasks in new sequences. A human types step-by-step instructions, and the robot executes each one, even though the overall task was never seen in training.

Espresso-making: a full trace

Let’s walk through the full data flow for making espresso — one of π0.7’s showcase demonstrations:

Step	Subtask Instruction	Subgoal Image	Actions Generated	Duration
1	“Unlock portafilter latch”	Latch in open position	[50, 14]: bimanual reach + twist motion	~2s
2	“Remove portafilter”	Hand holding portafilter, machine empty	[50, 14]: pull-away trajectory	~3s
3	“Place under grinder”	Portafilter below grinder spout	[50, 14]: navigate + position	~4s
4	“Press grind button”	Button depressed, grounds falling	[50, 14]: reach + press	~1s
5	“Tamp the grounds”	Flat, compressed coffee surface	[50, 14]: hold filter + press tamper	~3s
6	“Lock portafilter in group head”	Portafilter locked in position	[50, 14]: insert + rotate motion	~4s
7	“Place cup under spout”	Cup positioned below spout	[50, 14]: grasp cup + navigate + place	~3s
8	“Press brew button”	Button depressed, espresso flowing	[50, 14]: reach + press	~1s

Total: 8 subtasks, ~21 seconds of manipulation. Each subtask generates 1-4 action chunks of 50 timesteps. The bimanual actions are 14-dimensional (7 per arm). The most challenging step is #6 (portafilter locking): it requires precise force along a curved insertion path, coordinated with a 90-degree rotation. This is the step that fails most often (~30% failure rate).
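The totals above can be checked with a short calculation. This sketch assumes each 50-step chunk is executed in full at the 50 Hz control rate quoted elsewhere in this article, so one chunk covers one second. Overlapping or re-planned chunks would change the counts. The names `SUBTASKS` and `chunks_needed` are illustrative.

```python
import math

CONTROL_HZ = 50    # motor command rate quoted in the article
CHUNK_STEPS = 50   # timesteps per action chunk
ACTION_DIM = 14    # bimanual action dimension

# (subtask instruction, approximate duration in seconds) from the trace table.
SUBTASKS = [
    ("Unlock portafilter latch", 2),
    ("Remove portafilter", 3),
    ("Place under grinder", 4),
    ("Press grind button", 1),
    ("Tamp the grounds", 3),
    ("Lock portafilter in group head", 4),
    ("Place cup under spout", 3),
    ("Press brew button", 1),
]

def chunks_needed(duration_s):
    """Chunks required to cover a subtask if every chunk is executed in full."""
    return math.ceil(duration_s * CONTROL_HZ / CHUNK_STEPS)

total_seconds = sum(duration for _, duration in SUBTASKS)
chunk_counts = [chunks_needed(duration) for _, duration in SUBTASKS]
total_action_values = sum(chunk_counts) * CHUNK_STEPS * ACTION_DIM

print("total seconds:", total_seconds)
print("chunks per subtask:", chunk_counts)
print("scalar action values generated:", total_action_values)
```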

What degrades and why

Even π0.7 has clear limits. Understanding what breaks helps clarify what the model actually learned:

Condition	Effect	Root Cause
Out-of-distribution object shape	Grasp fails ~40% of the time	Action expert hasn’t seen similar geometry; VLM recognizes object but action policy can’t grip it
Novel robot morphology (not in training)	Complete failure	Control mode metadata can’t bridge to truly unseen kinematics
Ambiguous language instruction	Model picks one interpretation, sometimes wrong	Prompt disambiguation helps but can’t resolve genuine ambiguity
Completely dark room	Complete failure	VLM is vision-based; no visual input = no understanding
Very fast required movements	Quality degrades (overshoots)	Action chunking at 50 Hz has latency; truly reactive motions need >100 Hz
Force-sensitive tasks (e.g., egg handling)	Inconsistent	No force/torque feedback in the observation space; relies on visual cues only

Why “emergent”? These capabilities weren’t the result of specific design choices for each task. Nobody trained a “peeling module” or an “espresso module.” They emerged from three ingredients at scale: (1) enough diverse data to cover the space of manipulations, (2) diversified prompts to prevent mode averaging, and (3) a model large enough to represent the full complexity. The whole is genuinely greater than the sum of parts.

How does π0.7 achieve zero-shot cross-embodiment transfer between different robot platforms?

The diversified prompt (control mode, robot metadata) disambiguates which body is being controlled, allowing one model to drive multiple platforms without additional training
The model learns a universal action space that is the same for all robots
A separate adapter module is trained for each robot body

Chapter 8: Results

Let’s look at the quantitative evidence for π0.7’s claims. The evaluations span multiple dimensions: task performance, ablations of the diversified prompt, scaling studies, and comparisons to prior work.

Out-of-the-box performance

π0.7 achieves strong zero-shot performance on dexterous tasks without task-specific fine-tuning. On a suite of 8 challenging manipulation tasks (espresso, folding, peeling, box assembly, etc.), the model achieves an average success rate significantly above prior VLAs that require per-task fine-tuning.

Ablation: prompt diversity matters

The ablation studies reveal the critical importance of each prompt component:

No subtask instructions: Performance drops sharply on multi-step tasks (−25 pp). The model can’t resolve which step to execute.
No quality metadata: Including suboptimal data hurts performance (−12 pp). Without quality labels, the model copies mistakes.
No subgoal images: Spatial precision decreases (−8 pp), especially on tasks requiring exact placement.
No speed metadata: The model produces inconsistent execution speeds (−5 pp) — sometimes too fast for precision tasks, sometimes too slow for simple ones.

The inference pipeline in real-world deployment

Step	Computation	Frequency	Latency
Camera capture	3 cameras @ 224x224	10 Hz	~5ms
SigLIP encoding	3 images → 768 tokens	10 Hz	~8ms
MEM history compression	300 frames → 64 tokens	2 Hz	~5ms
Subtask prediction (HL policy)	Autoregressive text	0.1-0.2 Hz	~400ms
Subgoal generation (BAGEL)	Image generation	0.1-0.2 Hz	~500ms
Flow matching (action expert)	10 denoising steps	2 Hz	~30ms
Motor command execution	50 joint commands	50 Hz	~1ms

The pipeline is asynchronous: subtask prediction and subgoal generation run in the background while the action expert continuously generates motor commands. The action expert never waits for the slow stages — it uses the most recent subtask and subgoal until they’re updated.
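The asynchronous behaviour can be illustrated with a small deterministic simulation. This is a sketch, not the deployed scheduler. It assumes the slow stages together take about 900 ms (400 ms + 500 ms from the table), start a new plan every 5 s (0.2 Hz), and that the action expert ticks every 500 ms (2 Hz). The function name `simulate` and the `initial` instruction are hypothetical.

```python
def simulate(subtasks, initial="make espresso", plan_period_ms=5000,
             plan_latency_ms=900, act_period_ms=500, horizon_ms=12000):
    """Log which subtask the action expert conditions on at each of its ticks.

    Plan i starts at i * plan_period_ms and becomes visible plan_latency_ms
    later. The action expert never blocks: it reads the newest finished plan,
    or falls back to the initial instruction if none has finished yet.
    """
    ready_times = [(i * plan_period_ms + plan_latency_ms, subtask)
                   for i, subtask in enumerate(subtasks)]
    log = []
    for t in range(0, horizon_ms, act_period_ms):
        current = initial
        for ready_at, subtask in ready_times:
            if ready_at <= t:
                current = subtask   # later plans overwrite earlier ones
        log.append((t, current))
    return log

log = simulate(["Remove portafilter", "Place under grinder", "Press grind button"])
for t, subtask in log:
    print(f"t={t:5d} ms  action expert conditions on: {subtask}")
```

Every tick has a subtask to condition on, including the ticks that fall while a new plan is still being computed. That is the property the paragraph above describes.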

The synergy finding: The most important result in the ablations is that prompt diversity and data diversity are synergistic. Removing either one hurts performance, but removing both hurts more than the sum of removing each individually. The diversified prompt enables the model to absorb more diverse data, and diverse data gives the diversified prompt more to condition on. This virtuous cycle is the engine of π0.7’s generalization.

Scaling studies

π0.7 continues to improve with more data and larger models. The scaling curves show no sign of saturating, suggesting that the diversified prompt approach has room for further gains as robotics datasets grow.

Concrete scaling numbers from the paper:

Data Scale	Success Rate (dexterous suite)	Improvement vs. Baseline
1K hours, no prompt diversity	38%	Baseline
1K hours, with prompt diversity	55%	+17 pp (prompt helps at all scales)
5K hours, no prompt diversity	48%	+10 pp (data alone has limits)
5K hours, with prompt diversity	72%	+34 pp (synergy kicks in)
10K hours, with prompt diversity	84%	+46 pp (best result)

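The key observation: going from 1K to 10K hours with prompt diversity gives +29 pp (55% to 84%). Going from no prompt diversity to full prompt diversity at 5K hours gives +24 pp (48% to 72%). The two effects are more than additive: over the 1K-hour baseline, prompt diversity alone adds +17 pp and 5x more data alone adds +10 pp, yet applying both adds +34 pp, about 7 pp more than the +27 pp their sum would predict. This super-additive interaction is the central quantitative finding of the paper.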
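The interaction can be computed directly from the table. The sketch below uses only the success rates listed above. The dictionary layout and variable names are illustrative.

```python
# Success rates (%) keyed by (data scale in thousands of hours, prompt diversity).
SUCCESS = {
    (1, False): 38,
    (1, True): 55,
    (5, False): 48,
    (5, True): 72,
    (10, True): 84,
}

baseline = SUCCESS[(1, False)]
prompt_gain = SUCCESS[(1, True)] - baseline      # prompt diversity alone
data_gain = SUCCESS[(5, False)] - baseline       # 5x data alone
combined_gain = SUCCESS[(5, True)] - baseline    # both together

# A purely additive model would predict prompt_gain + data_gain.
interaction = combined_gain - (prompt_gain + data_gain)

print(f"prompt alone: +{prompt_gain} pp, data alone: +{data_gain} pp")
print(f"both: +{combined_gain} pp, interaction: +{interaction} pp")
```

A positive interaction term means the two ingredients amplify each other rather than contributing independently.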

Cross-embodiment results

On cross-embodiment transfer tasks (folding trained on Robot A, evaluated on Robot B), π0.7 achieves success rates comparable to models trained directly on Robot B’s data — without ever having seen Robot B during training for that specific task.

What does the ablation study reveal about the relationship between prompt diversity and data diversity?

They are synergistic: removing either one hurts, but removing both hurts more than the sum of individual removals, showing they amplify each other
Prompt diversity matters more than data diversity
They are independent: each contributes a fixed improvement regardless of the other

Chapter 9: Connections

The π-series lineage

π0.7 is the latest in Physical Intelligence’s line of VLA models, each building on the last:

π0 (2024): The foundation — first VLA with flow matching for continuous actions. Proved that a single model can control 7 different robot types. π0.7 inherits the flow matching action expert but adds the diversified prompt.
π0.5 (2025): Open-world generalization via heterogeneous co-training. First VLA to clean kitchens in new homes. π0.7 extends this with structured prompt conditioning instead of relying on data volume alone.
π*0.6 (2025): Learning from experience via RL. The first VLA that improves from its own real-world failures. π0.7’s quality metadata shares the spirit of learning from imperfect data, but through supervised conditioning rather than RL.
Helix (2025): Training-time action conditioning for real-time control. Helix’s efficient chunking enables the fast inference that π0.7 relies on for real-time execution.

Key technical components

FAST (2025): Efficient action tokenization via DCT compression. Reduces action token count 5–10x, enabling the large context windows π0.7 needs for its diversified prompt.
Knowledge Insulation (KI, 2025): The training recipe that keeps VLM knowledge intact during robot fine-tuning. Essential for π0.7’s ability to leverage Gemma 3’s pre-trained capabilities.
MEM (2025): Multi-scale embodied memory for compressing observation history. Gives π0.7 temporal context without overwhelming the transformer’s context window.
Flow matching: The continuous action generation framework (from Lipman et al., 2023) that avoids the quantization errors of discrete tokenization. Core to π0’s original design and inherited by π0.7.
BAGEL: The image generation model used to synthesize subgoal images. Turns language subtask instructions into pixel-precise visual targets for the robot.
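As a rough illustration of how flow matching turns noise into an action chunk, the sketch below runs 10 Euler steps over a [50, 14] chunk. The velocity function is a toy stand-in for the learned action expert: it is the exact velocity of a straight-line path to a known target, which a real model would have to predict from observations and the prompt. All names here are hypothetical.

```python
import random

CHUNK_STEPS, ACTION_DIM, NUM_STEPS = 50, 14, 10

def toy_velocity(x, t, target):
    """Stand-in for the learned network: straight-line velocity toward target."""
    return [[(tg - xv) / (1.0 - t) for xv, tg in zip(x_row, t_row)]
            for x_row, t_row in zip(x, target)]

def sample_chunk(target, num_steps=NUM_STEPS, seed=0):
    """Integrate dx/dt = v(x, t) from t=0 (Gaussian noise) to t=1 with Euler steps."""
    rng = random.Random(seed)
    x = [[rng.gauss(0.0, 1.0) for _ in range(ACTION_DIM)]
         for _ in range(CHUNK_STEPS)]
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt
        v = toy_velocity(x, t, target)
        x = [[xv + dt * vv for xv, vv in zip(x_row, v_row)]
             for x_row, v_row in zip(x, v)]
    return x

# A made-up target chunk: every action dimension ramps linearly over time.
target = [[0.01 * step for _ in range(ACTION_DIM)] for step in range(CHUNK_STEPS)]
chunk = sample_chunk(target)
max_error = max(abs(a - b) for row_a, row_b in zip(chunk, target)
                for a, b in zip(row_a, row_b))
print("chunk shape:", (len(chunk), len(chunk[0])), "max error:", max_error)
```

Because the outputs stay continuous throughout, no quantization step is involved, which is the advantage over discrete tokenization noted in the flow matching entry above.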

The engineering decisions that define π0.7

Decision	Alternative	Why π0.7’s Choice
Diversified prompt (4 types)	Language-only conditioning	Resolves mode averaging, enables steerability
Gemma 3 (4B) + action expert (860M)	Single unified model	Knowledge Insulation prevents forgetting
MEM for history (64 tokens)	Raw frame stacking	1200x compression makes history tractable
BAGEL subgoal images	Language-only subtasks	Pixel-level precision for spatial tasks
Quality/speed metadata	Filter bad data out	Turns all data into an asset, 2-3x more usable data
Joint + Cartesian modes	Joint-only	Different tasks benefit from different control spaces
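The 1200x figure in the table can be reproduced from numbers quoted in the inference pipeline (224x224 images, 3 images → 768 tokens, 300 frames → 64 tokens). The patch size of 14 pixels is an assumption, chosen because it is the value that makes 224x224 images yield 256 tokens each. The article does not state it.

```python
IMAGE_SIZE = 224       # pixels per side, from the pipeline table
PATCH_SIZE = 14        # assumption: not stated in the article
CAMERAS = 3
HISTORY_FRAMES = 300   # frames compressed by MEM
MEM_TOKENS = 64        # tokens after compression

tokens_per_image = (IMAGE_SIZE // PATCH_SIZE) ** 2
tokens_per_observation = CAMERAS * tokens_per_image
raw_history_tokens = HISTORY_FRAMES * tokens_per_image
compression_ratio = raw_history_tokens / MEM_TOKENS

print("tokens per image:", tokens_per_image)
print("tokens per 3-camera observation:", tokens_per_observation)
print("raw history tokens:", raw_history_tokens)
print("compression ratio:", compression_ratio)
```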

Broader context

Diffusion Policy (Chi et al., 2023): Pioneered diffusion/flow-based action generation for visuomotor policies. π0.7’s action expert is a descendant of this line of work.
Classifier-free guidance (Ho & Salimans, 2022): The technique of conditioning on or dropping conditioning signals during training. π0.7’s diversified prompt is a robotics analogue — rich conditioning during training enables steerable generation at inference.
Scaling laws (Kaplan et al., 2020): π0.7’s scaling studies confirm that robot foundation models follow predictable improvement curves with more data and compute, mirroring language model scaling.
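The classifier-free guidance analogy can be sketched as conditioning dropout at training time: each optional prompt field is randomly blanked so the model learns to act with or without it. This is an illustration of the general technique only. The article does not say π0.7 trains this way, and the field names, drop probability, and function name are all hypothetical.

```python
import random

def drop_conditioning(prompt, drop_prob, rng, always_keep=("task",)):
    """Return a copy of prompt with optional fields independently set to None."""
    dropped = {}
    for key, value in prompt.items():
        if key in always_keep:
            dropped[key] = value            # the top-level task is never dropped
        elif rng.random() < drop_prob:
            dropped[key] = None             # field hidden for this training sample
        else:
            dropped[key] = value
    return dropped

example_prompt = {
    "task": "make espresso",
    "subtask": "Tamp the grounds",
    "quality": "high",
    "speed": "slow",
    "control_mode": "joint",
}

rng = random.Random(0)
kept_all = drop_conditioning(example_prompt, 0.0, rng)
dropped_all = drop_conditioning(example_prompt, 1.0, rng)
mixed = drop_conditioning(example_prompt, 0.3, rng)
print(mixed)
```

At inference time a user would then be free to supply any subset of the fields, since the model has seen each combination during training.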

What π0.7 doesn’t solve

Despite its capabilities, π0.7 has fundamental limitations that define the next generation of challenges:

No force feedback: The model relies entirely on vision. Tasks requiring precise force control (e.g., not crushing a fragile object) are inconsistent because the model can’t feel contact forces.
No long-term memory: MEM compresses 10 seconds of history. But multi-room tasks require remembering “I already cleaned the kitchen counter” from 5 minutes ago. The model lacks this episodic memory.
No self-improvement: π0.7 is a static model — it doesn’t learn from its deployment experience. π*0.6 addresses this with RL from real-world failures, but combining RL with diversified prompts is an open problem.
Subgoal generation is slow: BAGEL takes ~500ms to generate a subgoal image. For fast tasks or dynamic environments, this latency is too high. Amortized generation or faster world models are needed.
Data annotation is expensive: Labeling every episode with subtask instructions, quality scores, and mistake labels costs $5-15 per episode. At 10K+ hours of training data, this is a significant expense. Auto-labeling (using VLMs to annotate robot data) is a partial solution but introduces noise.

The bigger picture

π0.7 represents a shift in how we think about robot learning. Instead of training specialized models for each task, embodiment, and quality level, you train one model on everything and let the prompt select the desired behavior. This is the same insight that made language models powerful: a single model, conditioned on diverse prompts, can perform any task in its training distribution — and many tasks outside it.

The lesson of π0.7: Data diversity alone isn’t enough. Prompt diversity alone isn’t enough. You need both. The prompt must be at least as expressive as the data is varied. When it is, the model can absorb arbitrarily diverse data without mode averaging, and the resulting policy is steerable, compositional, and generalizable.
