Physical Intelligence, 2025

pi*0.6: A VLA That Learns From Experience

The first VLA that improves beyond its demonstrations through reinforcement learning. It ran an espresso machine for 13 hours straight and folded laundry in new homes for 2+ hours — by learning from its own mistakes.

Prerequisites: VLA basics (pi-0) + RL fundamentals

Chapter 0: The Imitation Ceiling

Every VLA before pi*0.6 — including pi-0, pi-0.5, RT-2, OpenVLA — learns purely from imitation. A human demonstrates the task, the model copies the demonstration. This is behavioral cloning, and it has a hard limit: the policy can never exceed the demonstrator.

In practice, it's even worse. The policy makes small errors that accumulate over time (distribution shift). It encounters situations the demonstrator never showed it. It can't adapt to new environments without new demonstrations. And for hard tasks like making espresso — which involves pouring liquids, operating buttons, and handling hot equipment — even expert demonstrations contain suboptimal moments that the policy copies faithfully.
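A back-of-the-envelope sketch of why small errors compound. It assumes, as a simplification that is not from the source, that the policy makes an unrecoverable mistake independently at each step with probability `eps`. Under that assumption the chance of finishing a `horizon`-step episode without ever leaving the demonstrated states is (1 - eps)^horizon.

```python
def on_distribution_probability(eps: float, horizon: int) -> float:
    """Chance of completing `horizon` steps with no unrecoverable error,
    assuming an independent per-step error probability `eps` (toy assumption)."""
    return (1.0 - eps) ** horizon


# Illustrative numbers only: a 1% per-step error rate at three episode lengths.
for horizon in (10, 100, 1000):
    p = on_distribution_probability(0.01, horizon)
    print(f"horizon={horizon:5d}  P(no error)={p:.3f}")
```

With a 1% per-step error rate, a 100-step episode stays on track only about 37% of the time in this toy model, which is why long tasks are so unforgiving for pure imitation.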

Humans don't learn this way. We practice. We try, fail, understand what went wrong, and adjust. We get better with repetition. pi*0.6 gives VLAs this same ability.

The fundamental question: Can a VLA learn from its own experience — not just from human demonstrations? Can it try a task, receive reward feedback ("that worked" or "that didn't"), and improve its policy? This is what reinforcement learning promises, but making it work with large VLA models at scale is a major engineering challenge.

[Interactive figure: Imitation Ceiling vs RL Improvement]

Why is imitation learning fundamentally limited for robot policies?

The policy can never exceed the demonstrator, and small errors compound via distribution shift; it can't recover from mistakes it was never shown
Imitation learning requires too much data
The model architecture is too small

Chapter 1: The Key Insight

pi*0.6 uses RECAP — RL with Experience and Corrections via Advantage-conditioned Policies. The core idea: condition the VLA on a binary signal that says "this is a good action" vs "this is a bad action," then train it to prefer good actions.

This differs from standard RL methods such as PPO or SAC, which improve the policy by following policy-gradient-style updates. RECAP instead treats RL as a conditional generation problem: generate actions conditioned on "be good." The VLA's existing generative capability does the heavy lifting; you just need to tell it what "good" means.

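An analogy from language: RLHF for LLMs has a similar goal. It does not rebuild the model's ability to generate text; it steers generation toward text that humans prefer. The analogy is loose, because common RLHF pipelines do update the model with policy-gradient methods such as PPO, whereas RECAP steers through a conditioning input. The shared idea is that RECAP conditions the VLA to generate actions that lead to task success.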

Step 1: Pre-train VLA with offline RL on diverse data (advantage conditioning built in)
Step 2: Deploy on target task, collect autonomous rollouts + human interventions
Step 3: Train value function on collected data, estimate advantages
Step 4: Fine-tune VLA conditioned on improved advantages → better policy
Repeat Steps 2-4.

How does RECAP differ from standard policy gradient RL methods?

It treats RL as conditional generation, conditioning the VLA on advantage estimates rather than directly updating weights via policy gradients
It uses a bigger model
It doesn't use a reward signal

Chapter 2: The RECAP Method

RECAP has three key components working together:

1. Advantage conditioning

The VLA receives an extra input: a binary token indicating whether the current trajectory segment is "above average" (advantage > 0) or "below average" (advantage ≤ 0). During training, this is computed from a learned value function. At inference, the model is always conditioned on "above average" — asking it to generate its best behavior.

2. Learned value function

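A separate network V(s) estimates the expected future reward from each state. The advantage of an action is A(s,a) = R(s,a) + γV(s') - V(s), where s' is the next state and γ is the discount factor. It measures how much better the action turned out than the value function expected from state s.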
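As a minimal sketch, the one-step advantage can be computed for a whole episode as follows. It uses the discounted form A(s,a) = R(s,a) + γV(s') - V(s); setting `gamma=1.0` gives the undiscounted expression. The function name, the episode, and the value estimates are made up for illustration. In the real system V would come from the learned value network.

```python
from typing import List, Sequence


def td_advantages(
    rewards: Sequence[float], values: Sequence[float], gamma: float = 1.0
) -> List[float]:
    """One-step TD advantages for a single episode.

    rewards[t] is the reward for the action taken at step t.
    values[t] is V(s_t); `values` has one extra entry for the state reached
    after the last action (use 0.0 there if the episode terminated).
    """
    if len(values) != len(rewards) + 1:
        raise ValueError("values must have exactly one more entry than rewards")
    return [
        rewards[t] + gamma * values[t + 1] - values[t]
        for t in range(len(rewards))
    ]


# Hypothetical 3-step episode with a sparse success reward at the end.
rewards = [0.0, 0.0, 1.0]
values = [0.5, 0.4, 0.8, 0.0]  # made-up estimates; terminal value is 0
print(td_advantages(rewards, values))
```

In this toy episode the first action has a negative advantage (about -0.1, the value dropped), while the second and third are positive (about 0.4 and 0.2).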

3. Heterogeneous data

RECAP uses three types of data simultaneously: (a) expert demonstrations (high quality, but limited), (b) autonomous rollouts (the robot's own attempts — includes failures!), and (c) human interventions during execution (an operator corrects the robot when it makes a mistake).

Why interventions matter: When a human takes over to correct a mistake, the system records the transition from "bad behavior" to "good behavior" at exactly the failure point. This is the most informative training signal possible — it shows both what went wrong AND what the correct recovery looks like, precisely where the policy needs to improve.

What are the three types of data RECAP combines?

Expert demonstrations + autonomous rollouts (including failures) + human interventions during execution
Simulation data + real data + web data
Only successful demonstrations

Chapter 3: Advantage Conditioning

The key mechanism. During training, each action in the dataset gets labeled with a binary advantage value:

advantage token = { "+" if A(s,a) > 0, "−" otherwise }

where A(s,a) = R(s,a) + γV(s') - V(s) is the temporal-difference advantage estimated with the value function: R(s,a) is the immediate reward, s' is the next state, and γ is the discount factor.
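The labeling rule above is a one-liner in code. This sketch uses an ASCII "-" in place of the "−" token, and the advantage values are made up for illustration.

```python
def advantage_token(advantage: float) -> str:
    """Binarize an advantage estimate: '+' if strictly positive, '-' otherwise."""
    return "+" if advantage > 0 else "-"


advantages = [-0.1, 0.4, 0.0, 0.2]  # made-up values
tokens = [advantage_token(a) for a in advantages]
print(tokens)  # ['-', '+', '-', '+']
```

Note that an advantage of exactly zero falls on the "-" side, matching the "otherwise" branch of the rule.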

The VLA learns to generate different behaviors depending on this conditioning token. When conditioned on "+", it generates actions that led to good outcomes. When conditioned on "−", it generates actions that led to poor outcomes.

At inference time, we always condition on "+". The model generates its best behavior — actions associated with positive advantages throughout training.

This is elegant because it's simple. No policy gradients, no importance sampling, no trust regions. Just: label each training action as good or bad, train the VLA to distinguish them, then ask for good actions at test time. The VLA's existing conditional generation capability handles the rest.
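To make the "label, train, then ask for good actions" recipe concrete, here is a deliberately tiny tabular sketch. It is not the paper's implementation: the real policy is a neural VLA, whereas this toy simply counts actions per (state, token) pair. The state and action names are hypothetical.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Tuple

# Each labeled sample: (state, advantage token, action). Names are illustrative.
Sample = Tuple[str, str, str]
Table = Dict[Tuple[str, str], Counter]


def fit_conditional_policy(samples: List[Sample]) -> Table:
    """Count how often each action appears for every (state, token) pair.
    This is plain supervised learning on labeled data: no policy gradients."""
    table: Table = defaultdict(Counter)
    for state, token, action in samples:
        table[(state, token)][action] += 1
    return table


def act(table: Table, state: str, token: str = "+") -> str:
    """Return the most frequent action for this state under the given token.
    At test time the token defaults to '+', i.e. 'behave like the good data'."""
    counts = table.get((state, token))
    if not counts:
        raise KeyError(f"no data for state={state!r}, token={token!r}")
    return counts.most_common(1)[0][0]


data = [
    ("cup_under_spout", "+", "press_brew"),
    ("cup_under_spout", "+", "press_brew"),
    ("cup_under_spout", "-", "bump_cup"),
    ("cup_under_spout", "-", "wait"),
    ("cup_under_spout", "-", "bump_cup"),
]
policy = fit_conditional_policy(data)
print(act(policy, "cup_under_spout"))       # press_brew
print(act(policy, "cup_under_spout", "-"))  # bump_cup
```

The same data trains both behaviors; only the conditioning token chosen at inference decides which one the policy reproduces.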

At inference time, what advantage token is used to condition pi*0.6?

Always "+" (positive), asking the model to generate its best behavior
A mix of "+" and "−"
No conditioning is used

Chapter 4: The Value Function

The value function V(s) is the critic that evaluates how well the robot is doing. It takes the current observation (images + proprioception) and predicts the expected future reward.

The value function is trained separately from the VLA policy on the collected data. It uses a simpler architecture and is faster to update. After each round of data collection, the value function is re-trained on all accumulated data, then used to re-label the advantages for policy training.

Sparse rewards in the real world

In the real world, rewards are typically sparse: +1 if the task succeeded, 0 if it didn't. There's no dense reward signal telling the robot "you're getting closer to success." The value function must learn to bridge this gap — estimating which states are likely to lead to eventual success based on the sparse outcome signal.
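One common way to turn a sparse outcome into per-state training targets is the discounted Monte Carlo return-to-go. The source does not say which target RECAP's value function regresses to, so treat this as an assumption-laden sketch. The function name and the discount value are illustrative.

```python
from typing import List


def return_to_go(num_steps: int, success: bool, gamma: float = 0.99) -> List[float]:
    """Discounted return from each step when the only reward is +1 on success
    (0 on failure), delivered at the final step of the episode."""
    final_reward = 1.0 if success else 0.0
    return [final_reward * gamma ** (num_steps - 1 - t) for t in range(num_steps)]


print(return_to_go(4, success=True, gamma=0.9))
print(return_to_go(4, success=False, gamma=0.9))
```

For the successful episode the targets rise toward 1.0 as the episode nears completion; for the failed episode every target is 0. A value function fit to many such episodes learns which states tend to precede success.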

The role of interventions: Human interventions provide a richer signal than just success/failure. When a human takes over, the moment of intervention is a clear signal: "something was going wrong HERE." This gives the value function more precise information about which states are bad, even within otherwise successful trajectories.
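A small sketch of how the "something was going wrong HERE" moments could be located in logged data. It assumes each episode carries a per-step boolean flag saying whether the human operator was in control. That logging format is hypothetical and not described in the source.

```python
from typing import List


def intervention_points(human_in_control: List[bool]) -> List[int]:
    """Indices where control switches from the robot to the human operator.
    An episode that starts under human control does not count as a switch."""
    return [
        t
        for t in range(1, len(human_in_control))
        if human_in_control[t] and not human_in_control[t - 1]
    ]


flags = [False, False, True, True, False, False, True]
print(intervention_points(flags))  # [2, 6]
```

The states just before each returned index are the ones the operator judged to be going wrong, which is the extra state-level signal the paragraph above describes.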

Why are human interventions during deployment valuable for training the value function?

They provide precise state-level feedback about WHERE things went wrong, supplementing the sparse task-level success/failure signal
They provide more demonstration data
They speed up data collection

Chapter 5: The Data Mix

RECAP doesn't just use one type of data — it strategically mixes multiple sources:

Data Source	What It Provides	Advantage Labels
Expert demos	High-quality task execution	Mostly "+" (expert behavior is above average)
Autonomous rollouts (success)	Robot's own successful attempts	"+" for good actions, "−" for suboptimal ones
Autonomous rollouts (failure)	What went wrong and where	Mostly "−" (the value function identifies failure points)
Human interventions	Corrections at failure points	"−" before intervention, "+" after (recovery behavior)

This mixture is critical. Without failures, the model can't learn what to avoid. Without demonstrations, it has no target behavior. Without interventions, it doesn't know how to recover. Each source fills a gap the others can't.
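One way to sanity-check such a mixture is to tally the advantage tokens per data source after labeling. The record layout, source names, and advantage values below are all hypothetical; the sketch only illustrates the bookkeeping.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Step:
    source: str       # "demo", "rollout_success", "rollout_failure", "intervention"
    advantage: float  # estimate from the value function (made-up numbers below)


def token_counts_by_source(steps: List[Step]) -> Dict[str, Counter]:
    """Tally '+' / '-' labels per data source after binarizing advantages."""
    counts: Dict[str, Counter] = {}
    for step in steps:
        token = "+" if step.advantage > 0 else "-"
        counts.setdefault(step.source, Counter())[token] += 1
    return counts


dataset = [
    Step("demo", 0.3), Step("demo", 0.1),
    Step("rollout_success", 0.2), Step("rollout_success", -0.1),
    Step("rollout_failure", -0.4), Step("rollout_failure", -0.2),
    Step("intervention", 0.5),
]
for source, counter in token_counts_by_source(dataset).items():
    print(source, dict(counter))
```

If a source that should be mostly "+" (such as expert demos) shows up as mostly "-", that points to a problem with the value function rather than with the data.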

The self-improvement loop: Deploy → collect autonomous experience → human intervenes when needed → train value function → re-estimate advantages → fine-tune policy → deploy improved policy → repeat. Each iteration produces a better policy that makes fewer mistakes, requiring fewer interventions, generating more useful autonomous data.

Why is failed autonomous experience valuable for RECAP training?

Failures with negative advantage labels teach the model what to AVOID; the value function identifies exactly where things went wrong
Failed data is discarded
Failures are converted to successes

Chapter 6: Self-Improvement Loop

The full RECAP pipeline is iterative. Each iteration:

1. Deploy the current policy on the target task
2. Collect autonomous rollouts (successes and failures) with occasional human interventions
3. Train the value function on all accumulated data (old + new)
4. Re-label all data with updated advantage estimates
5. Fine-tune the VLA policy conditioned on the new advantages
6. Repeat from step 1 with the improved policy
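The steps above can be sketched as a short control loop. Every callable here (`collect`, `fit_value`, `relabel`, `finetune`) is a hypothetical placeholder for a large real-world component. The toy stand-ins at the bottom exist only so the skeleton runs end to end.

```python
from typing import Any, Callable, List, Tuple


def recap_style_loop(
    policy: Any,
    collect: Callable[[Any], List[Any]],         # policy -> new episodes
    fit_value: Callable[[List[Any]], Any],       # all episodes -> value function
    relabel: Callable[[List[Any], Any], List[Any]],  # episodes, V -> labeled data
    finetune: Callable[[Any, List[Any]], Any],   # policy, labeled data -> new policy
    iterations: int = 2,
) -> Tuple[Any, List[Any]]:
    """Iterate deploy -> collect -> fit value -> relabel -> fine-tune."""
    dataset: List[Any] = []
    for _ in range(iterations):
        dataset.extend(collect(policy))        # rollouts plus interventions
        value_fn = fit_value(dataset)          # retrain on old + new data
        labeled = relabel(dataset, value_fn)   # refresh advantage tokens
        policy = finetune(policy, labeled)     # offline update, then redeploy
    return policy, dataset


# Toy stand-ins so the sketch runs: the "policy" is just a version counter.
final_policy, data = recap_style_loop(
    policy=0,
    collect=lambda p: [f"episode_from_policy_v{p}"],
    fit_value=lambda episodes: len(episodes),
    relabel=lambda episodes, v: [(e, "+") for e in episodes],
    finetune=lambda p, labeled: p + 1,
    iterations=2,
)
print(final_policy, data)
```

Note that the dataset only grows: each iteration retrains the value function and re-labels everything collected so far, not just the newest episodes.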

Each iteration produces a better policy. The paper shows that after just 1-2 iterations of this loop, performance improves dramatically — doubling throughput and halving failure rates on the hardest tasks.

Practice makes perfect: Just like a human learning a new skill, the robot gets better with practice. The first few attempts are clumsy. With each iteration, it learns from its mistakes and becomes more reliable. The key difference from pure RL: the human interventions provide a safety net and targeted feedback that accelerates learning.

How many iterations of the self-improvement loop does RECAP typically need to see significant improvement?

Just 1-2 iterations, enough to roughly double throughput and halve failure rates on the hardest tasks
Hundreds of iterations
It never converges

Chapter 7: Results

The paper demonstrates RECAP on three challenging real-world tasks:

Espresso making

Operating a professional espresso machine: grinding beans, tamping, locking the portafilter, starting the extraction, steaming milk, pouring. The robot ran continuously for 13 hours making espresso drinks. This is the kind of endurance test that only matters for real-world deployment — and pi*0.6 passed it.

Box assembly

Assembling cardboard boxes from flat templates. This requires bimanual coordination, understanding of 3D folding geometry, and handling deformable materials that stick together unpredictably. RECAP more than doubled throughput on this task.

Laundry folding in new homes

Folding diverse laundry items in homes the robot had never been in. The robot ran for over 2 hours without interruption in a completely new environment — demonstrating both the generalization from pi-0.5 and the reliability improvements from RECAP.

[Interactive figure: RECAP Improvement: Before vs After]

The headline numbers: On the hardest tasks, RECAP more than doubles throughput and roughly halves the failure rate. These aren't incremental improvements — they represent the difference between "lab demo" and "practically useful."

How long did pi*0.6 run continuously making espresso drinks?

30 minutes
2 hours
13 hours, a real endurance test for deployment

Chapter 8: Real Deployment

What makes pi*0.6 different from academic RL papers is that it's designed for real-world deployment, not simulation. Several design choices reflect this:

Human-in-the-loop: Operators can intervene at any time. The intervention data becomes training signal — safety and learning are aligned.
Sparse rewards: No reward engineering. Just "did the task succeed?" This is all you can reliably measure in the real world.
Offline RL: All training happens offline on collected data. No risky online policy updates during execution.
Factory testing: The box assembly task was tested with boxes actually used for packaging in a factory — not simplified lab versions.

The path to commercial robotics: pi*0.6 shows that VLAs can reach the reliability levels needed for real-world deployment. 13 hours of continuous espresso making, 2+ hours of laundry folding — these are the metrics that matter for products, not benchmark scores.

Why does RECAP use offline RL instead of online RL?

Offline RL is safer: all policy updates happen on collected data, not during live execution where risky exploration could damage hardware or the environment
Offline RL is faster
Online RL requires simulation

Chapter 9: Connections

pi*0.6 represents a critical shift in the VLA paradigm:

Evolution	What Changed
pi-0 (2024)	Foundation model — flow matching + VLM
pi-0.5 (2025)	Open-world generalization — co-training on heterogeneous data
pi*0.6 (2025)	Self-improvement — RL from real-world experience

The progression is: learn from data (pi-0) → generalize to new environments (pi-0.5) → improve from experience (pi*0.6). Each step is necessary: you can't improve from experience if you don't generalize (pi-0.5), and you can't generalize if you don't have a strong foundation (pi-0).

Related lessons: pi-0 • pi-0.5 • Gleams: RL Algorithms • Gleams: VLA

"It's amazing what you can learn if you're not afraid to try."

— Robert A. Heinlein (quoted in the paper)
