In-the-wild robot teaching without in-the-wild robots — collect manipulation demos anywhere with a handheld gripper, then transfer to any robot via relative actions and diffusion policy.
Robot learning needs data. Lots of it. And diverse data at that — different kitchens, different objects, different lighting, different backgrounds. But there's a fundamental bottleneck: robots are stuck in labs.
Teleoperation (controlling a robot remotely to collect demonstrations) is the standard approach. But it requires the actual robot hardware, an expert operator, and a controlled lab environment. You can't wheel a UR5e into your kitchen and start collecting data while cooking dinner.
The result? Robot learning datasets are collected in a handful of identical lab setups. The policies trained on this data work beautifully — in that same lab, with that same lighting, on that same table. Move to a different kitchen? They fail.
By 2024, the field had tried two approaches, both insufficient:
Teleoperation gives good actions but limited environments. Internet video gives diverse visuals but no actions. UMI bridges the gap.
UMI's insight is almost embarrassingly simple: decouple data collection from robot hardware.
Instead of collecting demos on a robot, build a handheld gripper that looks and acts like a robot gripper but is operated by a human hand. Take it to any kitchen, any office, any outdoor table. Collect demonstrations by just... doing the task. Then train a policy on that data and deploy it on any robot.
The key is that UMI's handheld gripper is designed so that the observation space (what the camera sees) and the action space (how the gripper moves) are nearly identical between human demonstration and robot deployment:
Previous handheld gripper approaches (like Dobb-E) were limited to quasi-static pick-and-place. They used monocular structure-from-motion (SfM), which suffers from scale ambiguity and can't track fast motions. They ignored latency differences between collection and deployment. And they used simple MLP policies that can't capture multimodal action distributions.
UMI solves each of these with careful interface design: IMU-aware SLAM for robust tracking, inference-time latency matching for dynamic tasks, and Diffusion Policy for multimodal actions.
The UMI gripper is a 3D-printed handheld parallel-jaw gripper with soft fingers, weighing 780g. Its only sensor is a GoPro camera. Total cost: ~$370 ($73 for the gripper, $298 for the GoPro and accessories). Let's break down every design decision.
The GoPro is mounted at the wrist, in the same position relative to the fingers as it will be on the robot. This is critical: the images the policy sees during training (from the handheld gripper) look nearly identical to what it sees during deployment (from the robot gripper), so the visual domain gap is minimal.
A side benefit: because the camera moves with the gripper, the policy learns to focus on task-relevant objects rather than background structure. This is like free data augmentation — similar to random cropping.
A standard camera lens gives ~69 degrees field of view. UMI uses a 155-degree fisheye lens. Why? Because when your camera is inches from the object you're manipulating, a narrow field of view means you can't see the table, the saucer, or the context needed to plan actions.
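To make the field-of-view argument concrete, here is a minimal sketch of how much of a flat scene each lens covers at close range. The 15 cm working distance is an illustrative assumption (the text only says "inches"), and the formula treats the lens as ideal, so the numbers are a rough guide rather than a measurement of the actual GoPro.

```python
import math

def visible_width(distance_m: float, fov_deg: float) -> float:
    """Width of a fronto-parallel plane covered by an ideal lens with the given FOV."""
    return 2.0 * distance_m * math.tan(math.radians(fov_deg) / 2.0)

WORKING_DISTANCE_M = 0.15  # assumed camera-to-object distance, not from the source

narrow = visible_width(WORKING_DISTANCE_M, 69.0)
wide = visible_width(WORKING_DISTANCE_M, 155.0)

print(f"69 deg lens:  {narrow:.2f} m of scene visible")
print(f"155 deg lens: {wide:.2f} m of scene visible")
```

Under these assumptions the narrow lens sees roughly a 20 cm strip, while the fisheye covers well over a meter: enough to keep the table and the saucer in view.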
UMI uses the raw fisheye image without undistortion. Rectifying a 155-degree image to a standard pinhole model stretches the periphery grotesquely while compressing the center (where the action happens) to a tiny area. Raw fisheye naturally preserves resolution where it matters most.
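The stretching can be quantified with a short sketch. It assumes an ideal equidistant fisheye model (image radius r = f * theta), which is a common approximation and not necessarily the GoPro's exact lens model. Rectifying to a pinhole model (r = f * tan(theta)) scales the image radially by d(tan theta)/d(theta) = 1 / cos^2(theta).

```python
import math

def radial_stretch(theta_deg: float) -> float:
    """Radial magnification when remapping an equidistant fisheye to a pinhole image.

    theta_deg is the angle between the viewing ray and the optical axis.
    """
    return 1.0 / math.cos(math.radians(theta_deg)) ** 2

# 77.5 degrees off-axis is the edge of a 155-degree field of view.
for theta in (0.0, 30.0, 60.0, 77.5):
    print(f"{theta:5.1f} deg off-axis: stretched x{radial_stretch(theta):.1f}")
```

At the edge of the 155-degree view the stretch factor exceeds 20, so a rectified image of fixed size spends most of its pixels on the periphery and leaves few for the center.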
A monocular camera can't directly perceive depth. UMI's solution is elegant: place physical mirrors on either side of the gripper, angled so they appear in the camera's peripheral fisheye view. Each mirror creates a virtual camera — effectively giving you three viewpoints (center + two sides) in a single image.
The raw mirror images are digitally reflected before being fed to the policy. Without reflection, the orientation of objects in the mirrors is flipped relative to the main view, which confuses vision encoders. With reflection, all three views are consistent.
The GoPro records IMU data (accelerometer + gyroscope) alongside video. UMI feeds both into ORB-SLAM3, a visual-inertial SLAM system. The IMU provides absolute scale (resolving the scale ambiguity of monocular structure-from-motion) and maintains tracking during fast motions when visual features blur.
Previous handheld grippers used binary open/close. UMI tracks finger width continuously via fiducial markers on the fingers. This is essential for tasks like tossing, where the exact moment and width of release determine success.
Diagram of the UMI gripper showing all hardware components.
The most critical technical challenge: how do you extract precise 6-DOF end-effector trajectories from a handheld gripper with just a GoPro camera? No robot encoders, no motion capture, no external tracking system.
UMI uses ORB-SLAM3, a state-of-the-art visual-inertial SLAM system. Here's what each component contributes:
The result: accurate 6-DOF pose trajectories at real-world scale, even during fast tossing motions that would defeat pure visual tracking.
For each new scene (kitchen, table, etc.), UMI uses a two-phase process:
Visualization of how SLAM extracts a 6-DOF trajectory from camera motion. The gripper moves through space and SLAM recovers its path.
Previous handheld gripper approaches used monocular SfM (structure-from-motion). SfM has three fatal problems for manipulation: scale ambiguity, tracking loss under motion blur, and accumulated drift.
Visual-inertial SLAM solves all three: the IMU provides metric scale and maintains tracking through blur, and loop closure corrects drift.
UMI uses Diffusion Policy as its backbone — the same framework from Chi et al. 2023, but with critical adaptations for the handheld-to-robot transfer problem.
Human demonstrations are inherently multimodal. When a cup handle faces away from you, you can rotate it clockwise or counter-clockwise — both are correct. Simple regression policies (MSE loss) average these modes, producing a motion that does neither correctly. Diffusion Policy models the full distribution of valid actions, naturally handling multimodality.
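The averaging failure can be shown with a toy example. This is only an illustration of why MSE regression struggles with two valid modes; it is not a diffusion model, and the demonstration data is made up.

```python
import statistics

# Hypothetical demos: half rotate the cup +90 degrees, half rotate it -90 degrees.
demo_rotations = [90.0] * 50 + [-90.0] * 50

def mse(prediction: float, targets: list) -> float:
    """Mean squared error of a single constant prediction against all targets."""
    return sum((prediction - t) ** 2 for t in targets) / len(targets)

# The constant prediction that minimizes MSE is the mean of the targets.
best_regression = statistics.fmean(demo_rotations)

print("MSE-optimal rotation:", best_regression)
print("MSE of the averaged answer:", mse(best_regression, demo_rotations))
print("MSE of a valid demo mode:  ", mse(90.0, demo_rotations))
```

The loss-minimizing output is a rotation of zero degrees, which matches neither demonstrated behavior, yet it scores a lower MSE than either valid mode. A policy that models the distribution can instead commit to one of the two modes.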
The policy observes a sequence of synchronized inputs at each timestep:
This is UMI's most important design decision. Instead of predicting actions in absolute coordinates (which require knowing the robot's base frame) or delta actions (which accumulate error), UMI uses relative trajectories:
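Each action in the predicted sequence is the desired pose relative to the current pose at the start of prediction. At the next inference step, the whole trajectory is re-anchored to the new current pose. This means the policy never needs the robot's base frame, and errors do not accumulate from one prediction to the next.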
Compare how absolute, delta, and relative trajectory representations handle a curved path. Observe how delta accumulates error while relative re-anchors each step.
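A minimal sketch of the relative representation, assuming poses are 4x4 homogeneous transforms and that each relative action is computed as inverse(current) * target. The helper names (`pose`, `to_relative`, `to_absolute`) are hypothetical, and the example poses are made up. The last lines check the key property: re-expressing everything in a different world or base frame leaves the relative actions unchanged.

```python
import math

def matmul(a, b):
    """Multiply two 4x4 matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)] for i in range(4)]

def inverse_se3(t):
    """Invert a rigid transform [R p; 0 1] as [R^T  -R^T p; 0 1]."""
    r_t = [[t[j][i] for j in range(3)] for i in range(3)]
    p = [t[i][3] for i in range(3)]
    new_p = [-sum(r_t[i][k] * p[k] for k in range(3)) for i in range(3)]
    return [r_t[i] + [new_p[i]] for i in range(3)] + [[0.0, 0.0, 0.0, 1.0]]

def pose(yaw_deg, x, y, z):
    """Build a pose with rotation about z only (enough for this illustration)."""
    c, s = math.cos(math.radians(yaw_deg)), math.sin(math.radians(yaw_deg))
    return [[c, -s, 0.0, x], [s, c, 0.0, y], [0.0, 0.0, 1.0, z], [0.0, 0.0, 0.0, 1.0]]

def to_relative(current, targets):
    """Express each target pose in the frame of the current end-effector pose."""
    inv = inverse_se3(current)
    return [matmul(inv, t) for t in targets]

def to_absolute(current, relative):
    """Re-anchor relative actions to whatever the current pose is at execution time."""
    return [matmul(current, r) for r in relative]

current = pose(30.0, 0.4, 0.1, 0.2)
targets = [pose(30.0 + 5 * k, 0.4 + 0.01 * k, 0.1, 0.2) for k in range(1, 4)]
rel = to_relative(current, targets)

# Same motion seen from an arbitrary other base frame.
shift = pose(-75.0, 2.0, -1.0, 0.5)
rel_shifted = to_relative(matmul(shift, current), [matmul(shift, t) for t in targets])
recovered = to_absolute(current, rel)
```

Because `rel` and `rel_shifted` agree, the handheld gripper's SLAM frame and the robot's base frame never need to be related to each other.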
For simpler tasks, UMI trains a ResNet vision encoder from scratch. For complex tasks (dish washing, in-the-wild generalization), it fine-tunes a CLIP-pretrained ViT. The pretrained features provide the semantic understanding needed for tasks with diverse objects and environments.
The journey from raw GoPro footage to a deployable robot policy has several careful stages. Each stage addresses a specific challenge in bridging the handheld-to-robot gap.
The demonstrator performs the task while holding the UMI gripper. The GoPro records video at high resolution with synchronized IMU data embedded in the MP4 file. No external sensors, no wires, no setup beyond picking up the gripper.
Each video is processed through ORB-SLAM3 (visual-inertial mode). Output: a 6-DOF camera pose for every frame, in metric scale. The gripper width at each frame is extracted from the fiducial markers on the fingers.
The raw SLAM output is at video framerate (e.g., 60 Hz). The policy operates at a lower frequency (10-20 Hz for images, higher for proprioception). The trajectories are temporally resampled and interpolated to the policy's operating frequency.
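As a rough sketch of the resampling step, the function below linearly interpolates one scalar channel (for example an x-position or the gripper width) onto a uniform lower-rate grid. The function name and the sample data are assumptions for illustration; orientations would need spherical interpolation instead, which is omitted here.

```python
from bisect import bisect_right

def resample(times, values, rate_hz):
    """Linearly interpolate scalar samples onto a uniform grid at rate_hz."""
    step = 1.0 / rate_hz
    n = int((times[-1] - times[0]) * rate_hz + 1e-9) + 1
    out = []
    for i in range(n):
        t = times[0] + i * step
        # Index of the source sample at or just before t, kept inside the valid range.
        j = min(max(bisect_right(times, t) - 1, 0), len(times) - 2)
        w = (t - times[j]) / (times[j + 1] - times[j])
        out.append(values[j] + w * (values[j + 1] - values[j]))
    return out

# One second of made-up gripper x-position at 60 Hz, moving at 0.5 m/s.
times = [k / 60.0 for k in range(61)]
xs = [0.5 * t for t in times]

xs_10hz = resample(times, xs, 10.0)
print(len(xs), "samples at 60 Hz ->", len(xs_10hz), "samples at 10 Hz")
```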
Optionally, trajectories are checked against the target robot's kinematic model. Demonstrations where the gripper moves beyond the robot's reach or requires impossible joint configurations are filtered out. This ensures the policy only trains on executable motions.
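A deliberately simplified sketch of such a filter is shown below. It checks only whether every end-effector position stays within a maximum reach of the robot base; a real check would also need inverse kinematics and joint limits. The function names, the base position, and the demo data are hypothetical, and 0.85 m is used as an illustrative reach value.

```python
import math

def within_reach(trajectory, base_xyz, max_reach_m):
    """True if every position in the trajectory lies within max_reach_m of the base."""
    return all(math.dist(p, base_xyz) <= max_reach_m for p in trajectory)

def filter_demos(demos, base_xyz, max_reach_m):
    """Keep only demonstrations that pass the reach check."""
    return [d for d in demos if within_reach(d, base_xyz, max_reach_m)]

BASE = (0.0, 0.0, 0.0)   # assumed robot base position in the demo's frame
MAX_REACH_M = 0.85       # illustrative value only

demos = [
    [(0.3, 0.0, 0.2), (0.5, 0.1, 0.2)],   # stays close to the base
    [(0.3, 0.0, 0.2), (1.2, 0.0, 0.2)],   # second waypoint is too far away
]
kept = filter_demos(demos, BASE, MAX_REACH_M)
print(f"kept {len(kept)} of {len(demos)} demonstrations")
```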
The side mirror regions in each fisheye image are detected, cropped, digitally reflected (flipped), and swapped (left mirror becomes right view). This creates consistent multi-view observations from a single camera.
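The crop, flip, and swap steps can be sketched on a toy image stored as a list of pixel rows. This assumes the mirrors occupy fixed-width vertical strips at the left and right image edges, which is a simplification: in a real fisheye frame the mirror regions are not rectangular and have to be detected. The function names are hypothetical.

```python
def crop(image, x0, x1):
    """Return columns x0..x1-1 of every row."""
    return [row[x0:x1] for row in image]

def hflip(image):
    """Mirror an image left-to-right."""
    return [row[::-1] for row in image]

def split_views(image, mirror_width):
    """Return (left_view, center_view, right_view) from one frame."""
    width = len(image[0])
    left_mirror = crop(image, 0, mirror_width)
    right_mirror = crop(image, width - mirror_width, width)
    center = crop(image, mirror_width, width - mirror_width)
    # Undo the mirror reflection, then swap sides: the left mirror becomes the right view.
    return hflip(right_mirror), center, hflip(left_mirror)

frame = [
    [1, 2, 3, 4, 5, 6],
    [7, 8, 9, 10, 11, 12],
]
left_view, center_view, right_view = split_views(frame, mirror_width=2)
```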
The processed data (images, relative EE trajectories, gripper widths) is used to train Diffusion Policy via standard behavior cloning. The observation horizon is typically 2 (current + previous frame), and the action prediction horizon is 16 steps.
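To illustrate what these horizons mean for the training data, the sketch below slices one episode into (observation window, action chunk) pairs. The slicing convention is an assumption: windows that would run past either end of the episode are simply dropped, whereas an implementation could pad them instead. Frame names and actions are placeholders.

```python
OBS_HORIZON = 2      # current + previous frame, as stated in the text
ACTION_HORIZON = 16  # number of future actions predicted per inference

def make_samples(frames, actions, obs_horizon=OBS_HORIZON, action_horizon=ACTION_HORIZON):
    """Slice one episode into (observation window, action chunk) training pairs."""
    samples = []
    for t in range(obs_horizon - 1, len(frames) - action_horizon + 1):
        obs = frames[t - obs_horizon + 1 : t + 1]   # the last obs_horizon frames up to t
        chunk = actions[t : t + action_horizon]     # the next action_horizon actions from t
        samples.append((obs, chunk))
    return samples

episode_len = 40
frames = [f"img_{i}" for i in range(episode_len)]
actions = list(range(episode_len))

samples = make_samples(frames, actions)
print(len(samples), "training samples from a", episode_len, "step episode")
```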
End-to-end flow from raw handheld demonstration to deployable robot policy.
Training is done on handheld data. Now the policy needs to run on an actual robot. This is where UMI's careful interface design pays off — but two critical timing issues must be addressed.
During handheld data collection, there's essentially zero latency between observation and action: the human sees the scene and moves their hand in the same instant. But on a robot, latency enters at every stage: sensing, policy inference, and action execution.
UMI's solution is elegant: measure each latency source physically, then compensate at inference time:
Observation side: Different sensor streams (camera, robot joint encoders, gripper encoder) have different latencies. UMI measures each one and time-aligns all observations to the highest-latency stream (usually the camera). Faster streams are interpolated backwards to match the camera's timestamp.
Action side: The policy predicts a sequence of future actions. But by the time the robot starts executing, the first several actions are already stale (due to observation + inference + execution latency). UMI simply skips those stale actions and only executes actions with timestamps after the expected execution time.
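Both compensation steps can be sketched in a few lines. This is an illustration under stated assumptions, not UMI's implementation: the latency values are made up, streams are scalar, the first predicted action is assumed to be timestamped at the observation time, and all function names are hypothetical.

```python
from bisect import bisect_right

def interpolate_at(times, values, t):
    """Linearly interpolate a scalar stream at time t (extrapolates from the end segments)."""
    j = min(max(bisect_right(times, t) - 1, 0), len(times) - 2)
    w = (t - times[j]) / (times[j + 1] - times[j])
    return values[j] + w * (values[j + 1] - values[j])

def align_to_camera(cam_receive_t, cam_latency, prop_receive_ts, prop_latency, prop_values):
    """Sample a faster stream at the instant the camera frame was actually captured."""
    cam_capture_t = cam_receive_t - cam_latency
    prop_capture_ts = [t - prop_latency for t in prop_receive_ts]
    return interpolate_at(prop_capture_ts, prop_values, cam_capture_t)

def executable_actions(actions, t_obs, dt, total_latency):
    """Drop predicted actions whose target time passes before execution can begin."""
    t_exec = t_obs + total_latency
    timed = [(t_obs + i * dt, a) for i, a in enumerate(actions)]
    return [(t, a) for t, a in timed if t >= t_exec]

# Observation side: camera latency 100 ms, gripper encoder latency 10 ms (made up).
capture_ts = [0.05 * k for k in range(25)]
widths = list(capture_ts)                    # width equals capture time, easy to verify
receive_ts = [t + 0.01 for t in capture_ts]
aligned_width = align_to_camera(1.0, 0.10, receive_ts, 0.01, widths)

# Action side: 16 actions at 10 Hz, 250 ms total latency (made up).
kept = executable_actions(list(range(16)), t_obs=0.0, dt=0.1, total_latency=0.25)
print("aligned width:", round(aligned_width, 3), "| actions kept:", len(kept))
```

In this toy setup the gripper reading is taken from the moment the image was captured rather than the moment it arrived, and the first three predicted actions are discarded as stale.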
Deploying on a new robot requires minimal setup:
The same policy checkpoint trained on handheld data has been deployed on both a UR5e (6-DOF) and a Franka FR2 (7-DOF) without retraining. The cup arrangement task achieved 100% on UR5e and 90% on Franka (2 failures were joint limit violations from the robot's mounting position, not policy errors).
UMI is evaluated on four tasks that push the boundaries of what behavior cloning can achieve. Each task tests a different capability.
Place an espresso cup on a saucer with the handle facing left. This tests precision (placement), multimodality (clockwise vs counter-clockwise rotation), and depth perception (via mirrors). 305 episodes from 2 demonstrators.
UMI achieves 20/20 = 100% on UR5e, and the same policy deployed on Franka FR2 achieves 18/20 = 90%.
Sort 6 YCB objects by tossing them to two bins placed beyond the robot's reach — spherical objects to a round bin, Lego pieces to a rectangular bin. This tests fast motions, precise release timing, and hand-eye coordination. 280 episodes.
UMI achieves 105/120 = 87.5%. Without latency matching: 57.5%.
Two robot arms fold a sweater: fold sleeves inward, fold bottom up, rotate 90 degrees, fold in half. Requires tight bimanual coordination — if one arm grabs slightly too early, the fold fails. 250 episodes from 2 demonstrators.
UMI achieves 14/20 = 70%. Without inter-gripper proprioception: 30%.
A 7-step long-horizon task: turn on faucet, grab plate, pick up sponge, wash ketchup off, place plate on rack, return sponge, turn off faucet. Tests deformable manipulation, fluid handling, and semantic understanding of "clean." 258 episodes.
UMI with ViT encoder achieves 14/20 = 70%. With ResNet: 0%.
Success rates across all four benchmark tasks. Bars show UMI vs key ablations.
This is UMI's headline result. Can data collected in the wild — across diverse kitchens, offices, and outdoor locations — produce policies that generalize to entirely new environments?
3 demonstrators collected 1,400 cup arrangement demonstrations across 30 diverse physical locations: homes, offices, restaurants, outdoor tables. They used 15 different espresso cups (ceramic, glass, metal; cylindrical and tapered; various colors). Total collection time: 12 person-hours.
The policy was trained with a CLIP-pretrained ViT-L/14 vision encoder to handle the visual diversity.
The trained policy was tested in two completely unseen environments:
Combined success rate: 43/60 = 71.7%. Remarkably, performance on unseen cups (75%) is actually slightly better than on training cups (70%). The policy has learned a generalizable notion of "espresso cup" rather than memorizing specific objects.
When trained on data from 30 locations, the policy encounters: wooden tables, metal tables, glass tables, stone surfaces, paper tablecloths, different lighting conditions, different backgrounds, and different distractors. It's forced to learn which visual features actually matter for the task (the cup, the saucer, the spatial relationship) and which are irrelevant (the table material, the background, the lighting). This is the same principle behind ImageNet-scale pretraining, but applied at the robotics task level.
Diffusion Policy (Chi et al., 2023): UMI's policy backbone. Diffusion Policy models multimodal action distributions via iterative denoising — essential for handling the natural diversity of human demonstrations.
ACT / ALOHA (Zhao et al., 2023): Demonstrated that low-cost bimanual teleoperation is possible. But ALOHA still requires the robot during data collection. UMI removes that requirement entirely.
ORB-SLAM3 (Campos et al., 2021): The visual-inertial SLAM system that enables UMI's precise action extraction. Without IMU fusion, handheld data collection can't support dynamic tasks.
Dobb-E (Shafiullah et al., 2023): A reacher-grabber tool with an iPhone for data collection. It is limited to quasi-static tasks and requires environment-specific fine-tuning. UMI supports dynamic and bimanual tasks as well as zero-shot transfer.
DROID (Khazatsky et al., 2024): A large-scale robot manipulation dataset. UMI argues that diverse data collection is a key bottleneck, a motivation shared by large-scale distributed data collection efforts such as this one.
pi-0 / Physical Intelligence (2024): Foundation models for robot manipulation. UMI demonstrated that handheld data can be a scalable data source for training general-purpose robot policies.
MimicGen (Mandlekar et al., 2023): Generates demonstration data synthetically. UMI provides an orthogonal approach: collect real diverse data cheaply instead of generating synthetic data.