Physical Intelligence + Stanford, 2025

Human to Robot Transfer

Can watching humans teach robots to act? Yes -- but ONLY when the vision-language-action model (VLA) is pre-trained on enough diverse robot data first. Below a threshold of robot data diversity, human data doesn't help at all. Above the threshold, it provides a large boost. Transfer emerges at scale.

Prerequisites: VLA basics + Pre-training concepts
8 Chapters, 3+ Simulations

Chapter 0: The Data Problem

Robot learning is starved for data. Training a foundation model requires millions of diverse examples. Language models train on trillions of words from the internet. Image models train on billions of images. But robot manipulation data? We have maybe hundreds of thousands of demonstrations, collected painfully by teleoperating robots in labs.

Meanwhile, there are billions of hours of human manipulation video on the internet -- cooking tutorials, assembly guides, craft videos, surgery recordings. Every YouTube video of someone folding laundry contains useful information about how fabric behaves, what sequence of folds to use, where to grasp. If we could use this data to train robots, the data scarcity problem would evaporate.

The problem: humans and robots have different bodies. A human hand has 27 degrees of freedom and compliant skin. A robot gripper has 1-2 DOF and rigid fingers. Watching a human fold a shirt doesn't directly tell a robot gripper how to fold a shirt. The visual appearance is different, the kinematics are different, the contact mechanics are different.

The gap seems unbridgeable: Previous attempts to use human video data for robot learning gave inconsistent results. Sometimes the data helped a little, sometimes it hurt, and no method worked reliably. This paper offers an explanation: transfer depends less on better algorithms than on having enough diverse robot data as a foundation. Below a diversity threshold, human data does not help. Above it, human data gives a large, consistent gain.
Why can't robots directly learn manipulation from human videos?

Chapter 1: The Emergence

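The paper's central finding is a phase transition. The authors train VLAs with varying diversity of robot pre-training data, then add human video data and measure whether it helps. The result: below a threshold of robot data diversity, human data gives no measurable benefit; above the threshold, the benefit is large.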

This is an emergent capability -- it appears suddenly as the base of robot data diversity crosses a threshold. Below the threshold, the model lacks the embodiment-agnostic representations needed to bridge the human-robot gap. Above it, those representations emerge and human data becomes usable.

The Emergence Curve

Drag the robot data diversity slider. Below the threshold, human data doesn't help. Above it, a large benefit emerges.

Robot data diversity: 20%
This mirrors emergence in language models: work on models at GPT-3 scale and larger reported that certain capabilities (such as multi-digit arithmetic and chain-of-thought reasoning) don't improve gradually with scale -- they appear abruptly above a threshold. Human-to-robot transfer follows the same pattern: the capability is absent, absent, absent, then suddenly present. The substrate must be rich enough before the emergent capability can manifest.
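One way to picture the "absent, absent, absent, then present" shape is a logistic curve. The sketch below is a toy model only: the function name `human_data_benefit` and the `threshold`, `sharpness`, and `max_benefit` values are illustrative assumptions, not numbers from the paper.

```python
import math

def human_data_benefit(diversity, threshold=0.5, sharpness=20.0, max_benefit=1.0):
    """Toy logistic curve: benefit of adding human data vs. robot data diversity.

    diversity is a fraction in [0, 1]. threshold, sharpness, and max_benefit
    are illustrative assumptions, not values reported in the paper.
    """
    return max_benefit / (1.0 + math.exp(-sharpness * (diversity - threshold)))

# Benefit stays near zero below the threshold, then rises steeply past it.
for d in (0.2, 0.4, 0.5, 0.6, 0.8):
    print(f"diversity={d:.1f}  benefit={human_data_benefit(d):.3f}")
```

A larger `sharpness` makes the transition look more like a step; a smaller one makes it look like gradual improvement, which is the pattern the text says does not occur.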
What determines whether human video data will benefit robot learning?

Chapter 2: What Transfers

What exactly does the robot learn from watching humans? Not motor commands -- a human wrist rotation doesn't map to a robot joint angle. Instead, higher-level knowledge transfers:

Task structure and sequencing

Watching someone make coffee teaches the order of operations: grind beans, boil water, add grounds to filter, pour water. This task structure is embodiment-independent. The robot learns WHAT to do and in what order, even though HOW to do each step differs between human and robot.

Object affordances

Human videos demonstrate how objects behave: cups have handles that afford grasping, drawers slide out, lids twist off, buttons push in. These affordances are properties of the objects, not the manipulator. A robot that has seen humans interact with thousands of objects learns what each object "wants" -- where to grasp, how to manipulate, what motions are effective.

Spatial relationships and goals

Videos show where objects end up: the plate goes on the table, the cup goes under the spout, the cloth covers the surface. These spatial goals are embodiment-independent -- the target configuration of objects is the same whether a human or robot arranges them.

Think of it as semantic transfer: Human data teaches the "what" and "where" of manipulation. Robot data teaches the "how." The VLA needs both: human data for breadth of understanding (thousands of objects, tasks, and configurations) and robot data for grounding (translating understanding into motor commands for a specific embodiment).
What specific knowledge transfers from human videos to robot control?

Chapter 3: What Doesn't Transfer

Knowing what doesn't transfer is equally important. The paper identifies clear boundaries:

Fine motor control

The precise forces, velocities, and trajectories needed to execute a manipulation step are embodiment-specific. A human hand grasps a cup differently than a parallel-jaw gripper. The approach angle, grip width, contact points, and forces are completely different. This low-level motor knowledge must come from robot-specific data.

Contact mechanics

How objects respond to manipulation depends on the manipulator. A human finger can apply gentle rolling friction; a rigid gripper cannot. The physics of contact are different, and no amount of watching humans will teach a robot the contact dynamics of its own hardware.

Proprioceptive feedback

Humans use rich tactile and proprioceptive feedback (Am I gripping too hard? Is the object slipping?). Robots have different sensors with different characteristics. The feedback loop is fundamentally different.

The clear boundary: Everything above the "action interface" transfers -- understanding of tasks, objects, and goals. Everything below the action interface -- the mapping from goals to motor commands on specific hardware -- does not. This boundary is clean and predictable, which is useful for practitioners deciding what data to collect.
Why can't fine motor control transfer from human to robot demonstrations?

Chapter 4: Bridging the Embodiment Gap

The key question: how does the model bridge the gap between human and robot embodiments? The answer lies in embodiment-agnostic representations.

When a VLA is trained on sufficiently diverse robot data -- many different robots with different morphologies, joint configurations, and grippers -- it is forced to learn representations that abstract away the specific hardware. A "grasp" becomes a concept that is independent of whether it's executed by a 6-DOF arm or a 7-DOF arm, a parallel gripper or a suction cup.

Once the model has these embodiment-agnostic representations, adding human data is just adding another "embodiment" to the mix. The model already knows how to extract embodiment-independent knowledge (task structure, affordances, spatial goals) from diverse data sources. Human data is simply a very rich source of this knowledge.
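Treating human video as "one more embodiment" can be sketched as a co-training mixture in which each batch draws from several sources by weight. This is a minimal illustration, not the paper's training recipe: the source names and weights in `MIXTURE` and the helper `sample_sources` are hypothetical.

```python
import random
from collections import Counter

# Hypothetical data sources and sampling weights. "human_video" is treated as
# just one more embodiment alongside the robot sources.
MIXTURE = {
    "robot_arm_a": 0.3,
    "robot_arm_b": 0.3,
    "mobile_manipulator": 0.2,
    "human_video": 0.2,
}

def sample_sources(mixture, batch_size, rng):
    """Draw the source of each example in a batch according to the mixture weights."""
    names = list(mixture)
    weights = [mixture[n] for n in names]
    return rng.choices(names, weights=weights, k=batch_size)

rng = random.Random(0)  # fixed seed so the sketch is reproducible
batch = sample_sources(MIXTURE, batch_size=10_000, rng=rng)
counts = Counter(batch)
for name in MIXTURE:
    print(f"{name}: {counts[name] / len(batch):.2f}")
```

The printed fractions should land close to the configured weights. Nothing in the sampler distinguishes human data from robot data, which is the point of the "another embodiment" framing.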

The diversity hypothesis: It's not the AMOUNT of robot data that enables transfer -- it's the DIVERSITY. A model trained on millions of demonstrations from a single robot type cannot bridge to human data. A model trained on fewer demonstrations but from 20 different robot types can. Diversity forces the model to learn abstract, transferable representations.
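To make "diversity, not amount" concrete, one possible measure is the effective number of embodiment types: the exponential of the Shannon entropy of each type's share of the demonstrations. This metric is an assumption chosen for illustration; the text does not say how the paper quantifies diversity, and the dataset sizes below are made up.

```python
import math

def effective_embodiments(demo_counts):
    """Effective number of embodiment types: exp of the Shannon entropy of the
    demonstration shares. Equals the number of types when demos are spread evenly."""
    total = sum(demo_counts)
    shares = [c / total for c in demo_counts if c > 0]
    entropy = -sum(p * math.log(p) for p in shares)
    return math.exp(entropy)

single_robot = [1_000_000]     # hypothetical: lots of data, one robot type
many_robots = [10_000] * 20    # hypothetical: less data, 20 robot types

print(effective_embodiments(single_robot))  # about 1
print(effective_embodiments(many_robots))   # about 20
```

Under this measure the million-demonstration dataset scores about 1 and the smaller 20-robot dataset scores about 20, even though it holds a fifth as many demonstrations.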
Embodiment-Agnostic Representations

Different embodiments performing the same task converge to similar internal representations when diversity is high enough. Drag to adjust diversity.

Robot type diversity: 2
What enables the model to bridge the human-robot embodiment gap?

Chapter 5: The Scaling Curve

The paper maps out the full scaling curve, revealing a clear picture of how human data benefit changes with robot data diversity.

Phase 1: No transfer (low diversity)

With 1-3 robot types in pre-training, adding human data provides zero benefit. The model's representations are too specific to the few robots it has seen. It cannot generalize to the radically different "embodiment" of a human hand.

Phase 2: Weak transfer (medium diversity)

With 5-10 robot types, the model begins to develop some embodiment-agnostic representations. Human data provides small, inconsistent gains. Transfer is fragile -- it works for some tasks but not others, and the effect is hard to distinguish from noise.

Phase 3: Strong transfer (high diversity)

With 15+ robot types, the model has robust embodiment-agnostic representations. Human data provides a large, consistent improvement across all evaluated tasks. The transfer is reliable and significant.

The practical takeaway: Don't add human data to a VLA trained on one or two robot types -- it won't help and may hurt. First ensure sufficient diversity in robot pre-training (ideally 15+ embodiment types). Then human data becomes a powerful augmentation that improves performance on tasks the robot has never seen demonstrated on its own hardware.
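The three phases and the takeaway can be written as a small decision helper. The ranges are the ones stated above (1-3, 5-10, 15+); the text does not assign a phase to counts in between, so this sketch reports those as unspecified rather than guessing. The function names are hypothetical.

```python
def transfer_phase(num_robot_types):
    """Map a count of robot types to the phase described in the text.

    The text gives ranges 1-3, 5-10, and 15+. Counts in between are not
    assigned a phase there, so this sketch returns "unspecified" for them.
    """
    if num_robot_types < 1:
        raise ValueError("need at least one robot type")
    if num_robot_types <= 3:
        return "no transfer"
    if 5 <= num_robot_types <= 10:
        return "weak transfer"
    if num_robot_types >= 15:
        return "strong transfer"
    return "unspecified"

def should_add_human_data(num_robot_types):
    """Follow the practical takeaway: recommend human data only in the strong phase."""
    return transfer_phase(num_robot_types) == "strong transfer"

for n in (2, 7, 12, 20):
    print(n, transfer_phase(n), should_add_human_data(n))
```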
Scaling Curve: Human Data Benefit vs Robot Diversity

The three phases of transfer. The dashed line shows performance without human data; the solid line shows performance with it.

At approximately how many robot types does human data transfer become reliable?

Chapter 6: Implications

This finding has profound implications for the future of robot learning.

The data flywheel accelerates

Once a VLA crosses the diversity threshold, the internet becomes a training data source. Billions of hours of human manipulation video are suddenly usable. This creates a flywheel: more diverse robot data enables human data use, which improves the model, which motivates collecting more diverse robot data.

Collection strategy shifts

The traditional approach was to collect lots of data from one robot type for one task. This paper argues for the opposite: collect a little data from MANY robot types across MANY tasks. Diversity trumps volume. For the same budget of 1000 demonstrations, ten from each of 100 robot types are more valuable than 1000 from a single robot type.
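The comparison holds the total budget fixed and changes only how it is spread. A short sketch of that allocation, with `allocate_budget` as a hypothetical helper:

```python
def allocate_budget(total_demos, num_robot_types):
    """Spread a fixed demonstration budget as evenly as possible across robot types."""
    base, extra = divmod(total_demos, num_robot_types)
    # The first `extra` types get one additional demonstration each.
    return [base + 1 if i < extra else base for i in range(num_robot_types)]

focused = allocate_budget(1000, 1)     # 1000 demos on one robot type
diverse = allocate_budget(1000, 100)   # 10 demos on each of 100 robot types

print(len(focused), focused[0])
print(len(diverse), diverse[0])
```

Both plans cost the same number of demonstrations. Only the number of robot types covered differs, and that is the quantity the text argues matters.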

New tasks become possible

Some tasks have extensive human video demonstrations but zero robot demonstrations (e.g., cooking complex recipes, intricate crafts, surgical procedures). With sufficient pre-training diversity, a VLA could learn these tasks primarily from human video, with only a small amount of robot-specific fine-tuning for the motor control layer.

The analogy to language models: Early language models were trained on curated, domain-specific text. GPT showed that training on diverse internet text -- even noisy, multi-domain, multi-language text -- produces far superior models. This paper makes the same point for robot learning: diverse, multi-embodiment data (including human video) is the path to capable robot foundation models.
Why does this finding change the optimal data collection strategy for robot learning?

Chapter 7: Connections

Human-to-robot transfer connects to several broader themes in machine learning:

| Theme | This paper's parallel |
| --- | --- |
| Emergent capabilities | Transfer appears suddenly above a diversity threshold, mirroring how reasoning emerges in language models above a scale threshold |
| Cross-domain transfer | Like how ImageNet pre-training helps medical imaging (different "embodiments" of visual understanding), robot diversity pre-training helps transfer from human "embodiments" |
| Foundation model recipe | Diverse pre-training + task-specific fine-tuning = the universal recipe. This paper shows it applies across embodiment boundaries |
| Scaling laws | Just as language model capabilities follow scaling laws in compute and data, human-to-robot transfer follows a scaling law in diversity |

The Transfer Landscape

Visualizing what crosses the embodiment boundary. Green = transfers. Red = doesn't transfer. The boundary sits at the "action interface."

Related lessons: pi-0, pi-0.5, MEM, Gleams: VLA
"The key to artificial intelligence has always been the representation."
-- Jeff Hawkins. This paper supports the same principle for embodied AI: the right representations emerge from diverse data and enable cross-embodiment transfer.