The first VLA that improves beyond its demonstrations through reinforcement learning. It ran an espresso machine for 13 hours straight and folded laundry in new homes for 2+ hours — by learning from its own mistakes.
Every VLA before pi*0.6, including pi-0, pi-0.5, RT-2, and OpenVLA, learns purely from imitation. A human demonstrates the task, and the model copies the demonstration. This is behavioral cloning, and it has a hard limit: a policy trained only to copy the demonstrator cannot reliably outperform the demonstrator.
In practice, it's even worse. The policy makes small errors that compound over time, pushing it into states the demonstrations never covered (distribution shift). It can't adapt to new environments without new demonstrations. And for hard tasks like making espresso, which involves pouring liquids, operating buttons, and handling hot equipment, even expert demonstrations contain suboptimal moments that the policy copies faithfully.
Humans don't learn this way. We practice. We try, fail, understand what went wrong, and adjust. We get better with repetition. pi*0.6 gives VLAs this same ability.
pi*0.6 uses RECAP — RL with Experience and Corrections via Advantage-conditioned Policies. The core idea: condition the VLA on a binary signal that says "this is a good action" vs "this is a bad action," then train it to prefer good actions.
This differs from standard RL methods such as PPO or SAC, which update the policy weights by following gradient estimates of expected return. Instead, RECAP treats RL as a conditional generation problem: generate actions conditioned on "be good." The VLA's existing generative capability does the heavy lifting; you just need to tell it what "good" means.
RECAP has three key components working together:
The VLA receives an extra input: a binary token indicating whether the current trajectory segment is "above average" (advantage > 0) or "below average" (advantage ≤ 0). During training, this is computed from a learned value function. At inference, the model is always conditioned on "above average" — asking it to generate its best behavior.
A separate network V(s) estimates the expected future reward from each state. The advantage of an action is A(s,a) = R(s,a) + γV(s') - V(s), where s' is the next state and γ is the discount factor. It answers the question: how much better was this action than average?
RECAP uses three types of data simultaneously: (a) expert demonstrations (high quality, but limited), (b) autonomous rollouts (the robot's own attempts — includes failures!), and (c) human interventions during execution (an operator corrects the robot when it makes a mistake).
The key mechanism. During training, each action in the dataset gets labeled with a binary advantage value: "+" if A(s,a) > 0, and "−" if A(s,a) ≤ 0,
where A(s,a) = R + γV(s') - V(s) is the temporal difference advantage estimated by the value function.
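As a minimal sketch of the labeling step, the snippet below computes the one-step temporal difference advantage and turns it into a binary label. The value estimates, the rewards, and the function names `td_advantage` and `advantage_label` are made up for illustration; in RECAP the values would come from the learned value function, and the ASCII "-" stands in for the "−" label.

```python
def td_advantage(reward, value_s, value_next, gamma=0.99, terminal=False):
    """One-step TD advantage: A(s,a) = R + gamma * V(s') - V(s)."""
    # No future value is bootstrapped after the last step of an episode.
    bootstrap = 0.0 if terminal else gamma * value_next
    return reward + bootstrap - value_s


def advantage_label(advantage):
    """Binary conditioning label: '+' if advantage > 0, otherwise '-'."""
    return "+" if advantage > 0 else "-"


# Hypothetical value estimates along a 4-step episode that ends in success.
values = [0.2, 0.5, 0.4, 0.9]
rewards = [0.0, 0.0, 0.0, 1.0]  # sparse: reward only at the final step

labels = []
for t, (r, v) in enumerate(zip(rewards, values)):
    last = t == len(values) - 1
    v_next = 0.0 if last else values[t + 1]
    adv = td_advantage(r, v, v_next, gamma=1.0, terminal=last)
    labels.append(advantage_label(adv))

print(labels)
```

In this toy episode the second step is labeled "-" even though the episode succeeds, because the estimated value dropped from 0.5 to 0.4 after that action.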
The VLA learns to generate different behaviors depending on this conditioning token. When conditioned on "+", it generates actions that led to good outcomes. When conditioned on "−", it generates actions that led to poor outcomes.
At inference time, we always condition on "+". The model generates its best behavior — actions associated with positive advantages throughout training.
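To make the conditioning idea concrete, here is a deliberately tiny stand-in: a count-based "policy" for a single state with two discrete actions. It is not how a VLA is implemented (the real policy is a neural generative model over continuous actions); the action names and advantage values are invented purely to show how conditioning on the label changes the action distribution.

```python
from collections import Counter, defaultdict


def advantage_label(advantage):
    """Binary conditioning label: '+' if advantage > 0, otherwise '-'."""
    return "+" if advantage > 0 else "-"


# Hypothetical (action, advantage) pairs observed in one state.
data = [
    ("grasp_handle", 0.4),
    ("grasp_handle", 0.2),
    ("grasp_rim", -0.3),
    ("grasp_rim", -0.1),
    ("grasp_handle", -0.2),
    ("grasp_rim", 0.1),
]

# "Training": count how often each action appears under each label.
counts = defaultdict(Counter)
for action, adv in data:
    counts[advantage_label(adv)][action] += 1


def action_distribution(label):
    """Empirical p(action | label) for the single state in this toy example."""
    total = sum(counts[label].values())
    return {action: n / total for action, n in counts[label].items()}


# "Inference": always condition on '+'.
best = action_distribution("+")
print(best)
```

The same dataset yields two different distributions: conditioning on "+" favors `grasp_handle`, while conditioning on "-" favors `grasp_rim`. Nothing is discarded during training; the label only tells the model which behavior to reproduce.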
The value function V(s) is the critic that evaluates how well the robot is doing. It takes the current observation (images + proprioception) and predicts the expected future reward.
The value function is trained separately from the VLA policy on the collected data. It uses a simpler architecture and is faster to update. After each round of data collection, the value function is re-trained on all accumulated data, then used to re-label the advantages for policy training.
In the real world, rewards are typically sparse: +1 if the task succeeded, 0 if it didn't. There's no dense reward signal telling the robot "you're getting closer to success." The value function must learn to bridge this gap — estimating which states are likely to lead to eventual success based on the sparse outcome signal.
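One common way to turn a sparse outcome into per-state training targets is to use the discounted Monte Carlo return. The sketch below assumes this approach for illustration only; the text above does not say which target the paper's value function actually uses, and `value_targets` is a hypothetical helper.

```python
def value_targets(num_steps, success, gamma=0.99):
    """Discounted return for each step when the only reward arrives at the end.

    The final step gets the outcome itself (1.0 or 0.0); earlier steps get the
    outcome discounted by the number of steps still remaining.
    """
    outcome = 1.0 if success else 0.0
    return [outcome * gamma ** (num_steps - 1 - t) for t in range(num_steps)]


print(value_targets(3, success=True, gamma=0.5))
print(value_targets(3, success=False, gamma=0.5))
```

A value network regressed onto targets like these learns higher values for states that tend to precede success, which is the "bridge" between the sparse outcome and individual states.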
RECAP doesn't just use one type of data — it strategically mixes multiple sources:
| Data Source | What It Provides | Advantage Labels |
|---|---|---|
| Expert demos | High-quality task execution | Mostly "+" (expert behavior is above average) |
| Autonomous rollouts (success) | Robot's own successful attempts | "+" for good actions, "−" for suboptimal ones |
| Autonomous rollouts (failure) | What went wrong and where | Mostly "−" (the value function identifies failure points) |
| Human interventions | Corrections at failure points | "−" before intervention, "+" after (recovery behavior) |
This mixture is critical. Without failures, the model can't learn what to avoid. Without demonstrations, it has no target behavior. Without interventions, it doesn't know how to recover. Each source fills a gap the others can't.
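The sketch below illustrates two ideas from the table under simplifying assumptions: labeling an intervention episode as "-" before the operator takes over and "+" afterwards, and drawing a training batch from several sources at once. The mixing weights, source names, and helper functions are hypothetical, and in practice most labels would come from the value function rather than a fixed rule.

```python
import random


def label_intervention_episode(num_steps, intervention_step):
    """Label robot steps before the intervention '-' and corrected steps '+'."""
    return ["-" if t < intervention_step else "+" for t in range(num_steps)]


def sample_mixed_batch(sources, weights, batch_size, seed=0):
    """Draw (source_name, item) pairs, picking sources in proportion to weights."""
    rng = random.Random(seed)
    names = list(sources)
    picked = rng.choices(names, weights=[weights[n] for n in names], k=batch_size)
    return [(name, rng.choice(sources[name])) for name in picked]


# Placeholder datasets: each item stands for one stored trajectory segment.
sources = {
    "demos": ["demo_0", "demo_1"],
    "rollouts": ["rollout_0", "rollout_1", "rollout_2"],
    "interventions": ["fix_0"],
}
weights = {"demos": 0.4, "rollouts": 0.4, "interventions": 0.2}  # assumed values

batch = sample_mixed_batch(sources, weights, batch_size=8)
print(label_intervention_episode(5, intervention_step=3))
```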
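The full RECAP pipeline is iterative. Each iteration collects new rollouts and interventions with the current policy, retrains the value function on all accumulated data, re-labels the advantages, and trains the advantage-conditioned policy on the result.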
Each iteration is meant to produce a better policy. The paper reports that after just 1-2 iterations of this loop, performance improves dramatically, with throughput roughly doubling and failure rates roughly halving on the hardest tasks.
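Going by the earlier description of the value function (retrain on all accumulated data after each round of collection, then re-label advantages for policy training), one iteration might be organized as in the skeleton below. Every callable is a hypothetical placeholder, and the toy stand-ins exist only so the skeleton runs; none of this is the paper's code.

```python
def recap_iteration(policy, dataset, collect, fit_value, label, train_policy):
    """One round of the loop; every callable argument is a placeholder."""
    dataset = dataset + collect(policy)      # 1. gather new experience
    value_fn = fit_value(dataset)            # 2. retrain the critic on all data
    labeled = [(ep, label(value_fn, ep)) for ep in dataset]  # 3. re-label
    policy = train_policy(policy, labeled)   # 4. advantage-conditioned training
    return policy, dataset


# Toy stand-ins: an "episode" is just its success flag (1 or 0), and the
# "policy" is a counter of how many training rounds it has been through.
def collect(policy):
    return [1, 0, 1]


def fit_value(dataset):
    mean = sum(dataset) / len(dataset)
    return lambda episode: mean  # constant baseline: average success rate


def label(value_fn, episode):
    return "+" if episode - value_fn(episode) > 0 else "-"


def train_policy(policy, labeled):
    return policy + 1


policy, dataset = 0, [1]
for _ in range(2):
    policy, dataset = recap_iteration(
        policy, dataset, collect, fit_value, label, train_policy
    )
print(policy, len(dataset))
```

Note that the dataset only grows: older rollouts are kept and re-labeled with the newest value function rather than thrown away.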
The paper demonstrates RECAP on three challenging real-world tasks:
Operating a professional espresso machine: grinding beans, tamping, locking the portafilter, starting the extraction, steaming milk, pouring. The robot ran continuously for 13 hours making espresso drinks. This is the kind of endurance test that only matters for real-world deployment — and pi*0.6 passed it.
Assembling cardboard boxes from flat templates. This requires bimanual coordination, understanding of 3D folding geometry, and handling deformable materials that stick together unpredictably. RECAP more than doubled throughput on this task.
Folding diverse laundry items in homes the robot had never been in. The robot ran for over 2 hours without interruption in a completely new environment — demonstrating both the generalization from pi-0.5 and the reliability improvements from RECAP.
What makes pi*0.6 different from academic RL papers is that it's designed for real-world deployment, not simulation. Several design choices reflect this:
pi*0.6 represents a critical shift in the VLA paradigm:
| Evolution | What Changed |
|---|---|
| pi-0 (2024) | Foundation model — flow matching + VLM |
| pi-0.5 (2025) | Open-world generalization — co-training on heterogeneous data |
| pi*0.6 (2025) | Self-improvement — RL from real-world experience |
The progression is: learn from data (pi-0) → generalize to new environments (pi-0.5) → improve from experience (pi*0.6). Each step is necessary: you can't improve from experience if you don't generalize (pi-0.5), and you can't generalize if you don't have a strong foundation (pi-0).