UMI — Veanors

Chapter 0: The Problem

Robot learning needs data. Lots of it. And diverse data at that — different kitchens, different objects, different lighting, different backgrounds. But there's a fundamental bottleneck: robots are stuck in labs.

Teleoperation (controlling a robot remotely to collect demonstrations) is the standard approach. But it requires the actual robot hardware, an expert operator, and a controlled lab environment. You can't wheel a UR5e into your kitchen and start collecting data while cooking dinner.

The result? Robot learning datasets are collected in a handful of identical lab setups. The policies trained on this data work beautifully — in that same lab, with that same lighting, on that same table. Move to a different kitchen? They fail.

The two extremes

By 2024, the field had tried two approaches, both insufficient:

Teleoperation (ALOHA, spacemouse, VR controllers): High-quality action data, but chained to expensive robot hardware. Limited to lab environments. Embodiment-specific — data collected on a Franka can't be used on a UR5.
Internet video (YouTube, Ego4D): Massive visual diversity, but no action labels. The embodiment gap between human hands and robot grippers is enormous. Extracting usable actions from passive video remains an open problem.

The wish list: We want a data collection system that is (1) portable — take it anywhere, (2) captures precise robot-compatible actions, (3) produces data that transfers across different robot embodiments, and (4) costs less than a semester's coffee budget. UMI delivers all four for ~$370.

The Data Collection Dilemma

Teleoperation gives good actions but limited environments. Internet video gives diverse visuals but no actions. UMI bridges the gap.

Why can't we simply collect more teleoperation data in more environments to get visual diversity?

Teleoperation requires the physical robot hardware on-site; you can't easily move a robot arm to 30 different kitchens, offices, and outdoor locations
Teleoperation data is too noisy
Teleoperation only works for pick-and-place tasks

Chapter 1: The Key Insight

UMI's insight is almost embarrassingly simple: decouple data collection from robot hardware.

Instead of collecting demos on a robot, build a handheld gripper that looks and acts like a robot gripper but is operated by a human hand. Take it to any kitchen, any office, any outdoor table. Collect demonstrations by just... doing the task. Then train a policy on that data and deploy it on any robot.

Why this works

The key is that UMI's handheld gripper is designed so that the observation space (what the camera sees) and the action space (how the gripper moves) are nearly identical between human demonstration and robot deployment:

Same camera: A wrist-mounted GoPro with fisheye lens, positioned identically relative to the fingers on both the handheld gripper and the robot gripper
Same gripper geometry: The 3D-printed fingers are the same shape, so the gripper-object interactions look the same
Same action representation: 6-DOF end-effector poses extracted via visual SLAM, represented as relative trajectories — no robot-specific coordinates needed

The "universal" in Universal Manipulation Interface: Universal means three things simultaneously. (1) Any environment — collect data in kitchens, offices, cafes, outdoors. (2) Any action — dynamic tossing, bimanual folding, not just pick-and-place. (3) Any robot — the same policy deploys on UR5e, Franka, and potentially mobile manipulators. All three are enabled by the same design choice: relative action representations.

Previous handheld gripper approaches (like Dobb-E) were limited to quasi-static pick-and-place. They used monocular structure-from-motion (SfM) which suffers from scale ambiguity and can't track fast motions. They ignored latency differences between collection and deployment. And they used simple MLP policies that can't capture multimodal action distributions.

UMI solves each of these with careful interface design: IMU-aware SLAM for robust tracking, inference-time latency matching for dynamic tasks, and Diffusion Policy for multimodal actions.

What is UMI's core design insight?

Decouple data collection from robot hardware: use a handheld gripper that matches the robot's observation and action spaces, enabling data collection anywhere
Use a more powerful neural network
Collect data from internet videos

Chapter 2: The UMI Gripper

The UMI gripper is a 3D-printed handheld parallel-jaw gripper with soft fingers, weighing 780g. Its only sensor is a GoPro camera. Total cost: ~$370 ($73 for the gripper, $298 for the GoPro and accessories). Let's break down every design decision.

HD1: Wrist-mounted camera

The GoPro is mounted at the wrist — the same position relative to the fingers as it will be on the robot. This is critical: the images the policy sees during training (from the handheld gripper) look nearly identical to what it sees during deployment (from the robot gripper). No domain gap.

A side benefit: because the camera moves with the gripper, the policy learns to focus on task-relevant objects rather than background structure. This is like free data augmentation — similar to random cropping.

HD2: Fisheye lens (155 degrees)

A standard camera lens gives ~69 degrees field of view. UMI uses a 155-degree fisheye lens. Why? Because when your camera is inches from the object you're manipulating, a narrow field of view means you can't see the table, the saucer, or the context needed to plan actions.

UMI uses the raw fisheye image without undistortion. Rectifying a 155-degree image to a standard pinhole model stretches the periphery grotesquely while compressing the center (where the action happens) to a tiny area. Raw fisheye naturally preserves resolution where it matters most.
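A rough calculation makes the trade-off concrete. The sketch below assumes an ideal equidistant fisheye model (image radius proportional to the angle from the optical axis) and an ideal pinhole model (image radius proportional to the tangent of that angle). The GoPro's real lens model is not specified here, and the 60-degree "central region" is an arbitrary value chosen only for illustration.

```python
import math

FULL_FOV_DEG = 155.0     # fisheye field of view quoted in the text
CENTRAL_FOV_DEG = 60.0   # assumption: the region where most manipulation happens


def central_fraction_pinhole(central_deg: float, full_deg: float) -> float:
    """Fraction of image width taken by the central region after pinhole rectification.

    Pinhole projection maps an off-axis angle theta to radius f * tan(theta).
    """
    return math.tan(math.radians(central_deg / 2)) / math.tan(math.radians(full_deg / 2))


def central_fraction_equidistant(central_deg: float, full_deg: float) -> float:
    """Same fraction for an ideal equidistant fisheye, where radius is f * theta."""
    return central_deg / full_deg


pinhole_fraction = central_fraction_pinhole(CENTRAL_FOV_DEG, FULL_FOV_DEG)
fisheye_fraction = central_fraction_equidistant(CENTRAL_FOV_DEG, FULL_FOV_DEG)

print(f"central region, rectified pinhole: {pinhole_fraction:.1%} of image width")
print(f"central region, raw fisheye:       {fisheye_fraction:.1%} of image width")
```

Under these assumptions the central 60 degrees occupies roughly 13% of the rectified image width but about 39% of the raw fisheye width, which is the compression effect described above.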

HD3: Side mirrors for implicit stereo

A monocular camera can't directly perceive depth. UMI's solution is elegant: place physical mirrors on either side of the gripper, angled so they appear in the camera's peripheral fisheye view. Each mirror creates a virtual camera — effectively giving you three viewpoints (center + two sides) in a single image.

The raw mirror images are digitally reflected before being fed to the policy. Without reflection, the orientation of objects in the mirrors is flipped relative to the main view, which confuses vision encoders. With reflection, all three views are consistent.

HD4: IMU-aware SLAM tracking

The GoPro records IMU data (accelerometer + gyroscope) alongside video. UMI feeds both into ORB-SLAM3, a visual-inertial SLAM system. The IMU provides absolute scale (solving structure-from-motion's scale ambiguity) and maintains tracking during fast motions when visual features blur.

HD5: Continuous gripper control

Previous handheld grippers used binary open/close. UMI tracks finger width continuously via fiducial markers on the fingers. This is essential for tasks like tossing, where the exact moment and width of release determines success.
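One way to picture continuous width tracking is a calibrated mapping from the pixel distance between the two finger markers to a width in meters. The sketch below is purely illustrative: the function name, the linear calibration, and all numeric values are assumptions, and UMI's actual marker detection and calibration procedure is not described in this text.

```python
import numpy as np


def gripper_width_from_markers(left_px, right_px, px_closed, px_open, max_width_m):
    """Map the pixel distance between two finger markers to a gripper width.

    Assumes (hypothetically) that width varies linearly with pixel distance
    between a calibrated fully-closed and fully-open reading.
    """
    left = np.asarray(left_px, dtype=float)
    right = np.asarray(right_px, dtype=float)
    pixel_distance = float(np.linalg.norm(right - left))
    fraction_open = (pixel_distance - px_closed) / (px_open - px_closed)
    # Clamp so detection noise cannot produce a negative or over-range width.
    return float(np.clip(fraction_open, 0.0, 1.0)) * max_width_m


# Placeholder marker centers (pixels) and calibration values.
width = gripper_width_from_markers(
    left_px=(400, 900), right_px=(700, 900),
    px_closed=100.0, px_open=500.0, max_width_m=0.08,
)
print(f"estimated width: {width:.3f} m")
```

The point of the sketch is only that the output is a continuous value per frame rather than an open/close flag, so the policy can learn when and how far to release.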

UMI Gripper Design

Diagram of the UMI gripper showing all hardware components.

HD6: Kinematic data filtering

Not every human demonstration is physically feasible for a robot. UMI optionally filters collected trajectories against a specific robot's kinematic limits (joint ranges, velocities). This ensures the policy only trains on motions the target robot can actually execute, a simple but effective way to handle the remaining embodiment gap.

Why does UMI use raw fisheye images instead of rectifying them to a standard pinhole model?

Rectifying a 155-degree image severely stretches the periphery and compresses the center where task-relevant information is densest; raw fisheye naturally preserves center resolution
Fisheye images are smaller in file size
Neural networks can't process rectified images

Chapter 3: SLAM-based Action Extraction

The most critical technical challenge: how do you extract precise 6-DOF end-effector trajectories from a handheld gripper with just a GoPro camera? No robot encoders, no motion capture, no external tracking system.

ORB-SLAM3 with visual-inertial fusion

UMI uses ORB-SLAM3, a state-of-the-art visual-inertial SLAM system. Here's what each component contributes:

Visual tracking (ORB features): Detects and matches visual features across frames to estimate camera motion. Works well when the scene has texture and motion is slow.
Inertial tracking (GoPro IMU): The accelerometer and gyroscope provide high-frequency motion estimates (hundreds of Hz) that don't depend on visual features. Critical during fast motions when images blur.
Joint optimization: The visual and inertial signals are fused in a factor graph, where each constrains the other. The IMU provides absolute scale (meters, not arbitrary units) and bridges visual tracking failures.

The result: accurate 6-DOF pose trajectories at real-world scale, even during fast tossing motions that would defeat pure visual tracking.

Map-then-localize scheme

For each new scene (kitchen, table, etc.), UMI uses a two-phase process:

Mapping: Collect one "mapping video" that surveys the scene. SLAM builds a 3D map of feature points with a global coordinate system.
Localization: All subsequent demonstrations in that scene are re-localized against this map. This means all demos share the same coordinate frame — essential for computing inter-gripper poses in bimanual setups.
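Once both grippers are localized against the same map, the inter-gripper pose is a single matrix product. The sketch below assumes poses are 4x4 homogeneous transforms from the map frame to each gripper; `make_pose` and `inter_gripper_pose` are illustrative helper names, not part of UMI's code.

```python
import numpy as np


def make_pose(yaw_rad, xyz):
    """Build a 4x4 homogeneous transform from a yaw angle and a translation."""
    c, s = np.cos(yaw_rad), np.sin(yaw_rad)
    T = np.eye(4)
    T[:3, :3] = [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]
    T[:3, 3] = xyz
    return T


def inter_gripper_pose(T_map_left, T_map_right):
    """Pose of the right gripper expressed in the left gripper's frame."""
    return np.linalg.inv(T_map_left) @ T_map_right


# Both poses are expressed in the shared map frame built during the mapping pass.
T_map_left = make_pose(0.0, [0.0, 0.3, 0.1])
T_map_right = make_pose(np.pi, [0.0, -0.3, 0.1])

T_left_right = inter_gripper_pose(T_map_left, T_map_right)
print(np.round(T_left_right[:3, 3], 3))  # translation of right gripper seen from left
```

If the two demonstrations were tracked in separate, unrelated coordinate frames, this product would be meaningless, which is why the shared map matters for bimanual tasks.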

SLAM Trajectory Extraction

Visualization of how SLAM extracts a 6-DOF trajectory from camera motion. The gripper moves through space and SLAM recovers its path.

Why not structure-from-motion?

Previous handheld gripper approaches used monocular SfM (structure-from-motion). SfM has three fatal problems for manipulation:

Scale ambiguity: SfM recovers camera motion up to an unknown scale factor. Is that motion 5cm or 50cm? Without IMU, you can't tell.
Motion blur: Fast motions (tossing, dynamic manipulation) blur the image, causing SfM to lose tracking entirely.
Drift: SfM accumulates error over time. For multi-second demonstrations, the end-of-trajectory pose can be centimeters off — unacceptable for precise manipulation.

Visual-inertial SLAM addresses all three: the IMU provides metric scale and maintains tracking through blur, and loop closure corrects drift.
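The scale argument can be illustrated with a toy calculation. This is not how ORB-SLAM3 works internally (it jointly optimizes visual and inertial terms and also estimates gravity and sensor biases); the sketch only shows why metric accelerometer data pins down the unknown scale of a visual-only estimate. The sample rate, acceleration, and the visual displacement are made-up values.

```python
import numpy as np


def integrate_displacement(accel, dt):
    """Double-integrate acceleration into displacement.

    Assumes gravity is already removed, sensor bias is zero, and the device
    starts at rest. Real IMU data satisfies none of these without estimation.
    """
    velocity = np.cumsum(accel * dt, axis=0)
    position = np.cumsum(velocity * dt, axis=0)
    return position[-1]


dt = 0.005                      # assumed 200 Hz IMU
accel = np.zeros((200, 3))      # one second of samples
accel[:, 0] = 0.2               # constant 0.2 m/s^2 along x

imu_displacement = integrate_displacement(accel, dt)   # metric, roughly 0.1 m
visual_displacement = np.array([1.0, 0.0, 0.0])        # same motion, arbitrary units

scale = np.linalg.norm(imu_displacement) / np.linalg.norm(visual_displacement)
print(f"meters per visual unit: {scale:.3f}")
```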

What critical advantage does the GoPro's IMU provide to the SLAM system beyond visual tracking alone?

Absolute metric scale (meters, not arbitrary units), tracking during motion blur, and high-frequency pose estimates that bridge visual tracking failures
Better image quality
Faster processing speed

Chapter 4: Policy Architecture

UMI uses Diffusion Policy as its backbone — the same framework from Chi et al. 2023, but with critical adaptations for the handheld-to-robot transfer problem.

Why Diffusion Policy?

Human demonstrations are inherently multimodal. When a cup handle faces away from you, you can rotate it clockwise or counter-clockwise — both are correct. Simple regression policies (MSE loss) average these modes, producing a motion that does neither correctly. Diffusion Policy models the full distribution of valid actions, naturally handling multimodality.
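The averaging failure is easy to reproduce numerically. In the toy example below, demonstrations rotate the cup at either -1 or +1 (arbitrary units) with a little noise. The constant that minimizes mean squared error is the sample mean, which lands near zero, an action no demonstrator ever took. Drawing a sample from the demonstrated actions stands in, very loosely, for what a generative policy aims to do; it is not an implementation of Diffusion Policy.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two equally valid modes: clockwise (-1) and counter-clockwise (+1).
modes = rng.choice([-1.0, 1.0], size=1000)
actions = modes + rng.normal(0.0, 0.05, size=1000)

# The MSE-optimal constant prediction is the mean of the targets.
mse_prediction = actions.mean()

# A sample from the empirical distribution always lands on one of the modes.
sampled_action = rng.choice(actions)

print(f"MSE-optimal prediction: {mse_prediction:+.3f}")
print(f"sampled action:         {sampled_action:+.3f}")
```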

Observation space

The policy observes a sequence of synchronized inputs at each timestep:

RGB image: Raw fisheye image from the wrist-mounted GoPro (with digitally reflected mirror crops)
Relative EE pose: The recent history of end-effector poses, represented as relative trajectories (see below)
Gripper width: Continuous finger opening distance
Inter-gripper pose (bimanual only): Relative transform between the two grippers

Action space: relative trajectories

This is UMI's most important design decision. Instead of predicting actions in absolute coordinates (which require knowing the robot's base frame) or delta actions (which accumulate error), UMI uses relative trajectories:

$$a_{t:t+H} = \{\, T_t^{-1} T_{t+1},\; T_t^{-1} T_{t+2},\; \ldots,\; T_t^{-1} T_{t+H} \,\}$$

Each action in the predicted sequence is the desired pose relative to the current pose at the start of prediction. At the next inference step, the whole trajectory is re-anchored to the new current pose. This means:

No global coordinate frame needed (works across robots and environments)
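No error accumulation (unlike chained delta actions, every step is expressed directly relative to the pose at the start of prediction, so an error in one step does not compound into the next)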
Robust to tracking errors and camera displacement
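The relative-trajectory formula can be written in a few lines. This is a minimal sketch assuming poses are 4x4 homogeneous transforms stored in a Python list; `make_pose` and `relative_trajectory` are illustrative names rather than UMI's actual API.

```python
import numpy as np


def make_pose(yaw_rad, xyz):
    """Build a 4x4 homogeneous transform from a yaw angle and a translation."""
    c, s = np.cos(yaw_rad), np.sin(yaw_rad)
    T = np.eye(4)
    T[:3, :3] = [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]
    T[:3, 3] = xyz
    return T


def relative_trajectory(poses, t, horizon):
    """Return [T_t^-1 T_{t+1}, ..., T_t^-1 T_{t+H}] for the pose list."""
    T_t_inv = np.linalg.inv(poses[t])
    return [T_t_inv @ poses[t + k] for k in range(1, horizon + 1)]


# A short demonstration: the gripper advances along x while slowly yawing.
poses = [make_pose(0.1 * i, [0.1 * i, 0.0, 0.5]) for i in range(6)]

actions = relative_trajectory(poses, t=1, horizon=3)

# At deployment, absolute targets are recovered by anchoring to the current pose.
targets = [poses[1] @ action for action in actions]
print(np.round(targets[0][:3, 3], 3))
```

Because every action is anchored to the same starting pose, recovering a target never requires chaining earlier actions together.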

Action Representations Compared

Compare how absolute, delta, and relative trajectory representations handle a curved path. Observe how delta accumulates error while relative re-anchors each step.

Vision encoder

For simpler tasks, UMI trains a ResNet vision encoder from scratch. For complex tasks (dish washing, in-the-wild generalization), it fine-tunes a CLIP-pretrained ViT. The pretrained features provide the semantic understanding needed for tasks with diverse objects and environments.

Why relative trajectories are "universal": Because every action is defined relative to the current gripper pose, the policy never needs to know where the robot base is, what coordinate frame the lab uses, or which robot it's running on. You can literally move the robot's base during execution and the policy still works — as long as the objects are within reach. This is what makes the same policy deployable on UR5e, Franka, and potentially mobile manipulators.
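The base-independence claim follows from a one-line identity: for any rigid transform G applied to the world or base frame, (G T_t)^-1 (G T_{t+k}) = T_t^-1 T_{t+k}. The sketch below checks this numerically with made-up poses; the helper names are illustrative.

```python
import numpy as np


def make_pose(yaw_rad, xyz):
    """Build a 4x4 homogeneous transform from a yaw angle and a translation."""
    c, s = np.cos(yaw_rad), np.sin(yaw_rad)
    T = np.eye(4)
    T[:3, :3] = [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]
    T[:3, 3] = xyz
    return T


def relative_action(T_now, T_future):
    """Future pose expressed in the frame of the current pose."""
    return np.linalg.inv(T_now) @ T_future


T_now = make_pose(0.2, [0.4, 0.1, 0.3])
T_future = make_pose(0.5, [0.5, 0.2, 0.3])

# Simulate a different base placement or a different lab coordinate frame.
T_shift = make_pose(1.0, [2.0, -1.0, 0.0])

action_original = relative_action(T_now, T_future)
action_shifted = relative_action(T_shift @ T_now, T_shift @ T_future)

absolute_pose_changed = not np.allclose(T_future, T_shift @ T_future)
print("relative action unchanged:", np.allclose(action_original, action_shifted))
print("absolute pose changed:    ", absolute_pose_changed)
```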

Why does UMI use relative trajectory actions instead of absolute or delta actions?

Relative trajectories need no global coordinate frame (hardware-agnostic), don't accumulate error (unlike delta), and re-anchor at each step, enabling cross-robot transfer
Relative actions are computationally cheaper
Absolute actions don't work with neural networks

Chapter 5: Data Pipeline

The journey from raw GoPro footage to a deployable robot policy has several careful stages. Each stage addresses a specific challenge in bridging the handheld-to-robot gap.

Stage 1: Raw video recording

The demonstrator performs the task while holding the UMI gripper. The GoPro records video at high resolution with synchronized IMU data embedded in the MP4 file. No external sensors, no wires, no setup beyond picking up the gripper.

Stage 2: SLAM processing

Each video is processed through ORB-SLAM3 (visual-inertial mode). Output: a 6-DOF camera pose for every frame, in metric scale. The gripper width at each frame is extracted from the fiducial markers on the fingers.

Stage 3: Trajectory extraction and interpolation

The raw SLAM output is at video framerate (e.g., 60 Hz). The policy operates at a lower frequency (10-20 Hz for images, higher for proprioception). The trajectories are temporally resampled and interpolated to the policy's operating frequency.
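A minimal resampling sketch, assuming positions and gripper width are stored as columns of an array with per-frame timestamps. It uses plain linear interpolation, which is reasonable for positions and width but not for rotations; orientation would need spherical interpolation, which is omitted here. The function name and the sample data are illustrative.

```python
import numpy as np


def resample(timestamps, values, target_hz):
    """Linearly interpolate each column of `values` onto a uniform time grid."""
    timestamps = np.asarray(timestamps, dtype=float)
    values = np.asarray(values, dtype=float)
    duration = timestamps[-1] - timestamps[0]
    n_steps = int(np.floor(duration * target_hz)) + 1
    new_times = timestamps[0] + np.arange(n_steps) / target_hz
    columns = [np.interp(new_times, timestamps, values[:, j]) for j in range(values.shape[1])]
    return new_times, np.stack(columns, axis=1)


# One second of 60 Hz data: x, y, z position and gripper width.
t60 = np.arange(0, 61) / 60.0
xyz_width = np.stack(
    [0.3 * t60, np.zeros_like(t60), 0.5 - 0.1 * t60, 0.08 * (1.0 - t60)],
    axis=1,
)

t10, resampled = resample(t60, xyz_width, target_hz=10.0)
print(resampled.shape)  # rows are 10 Hz samples, columns are x, y, z, width
```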

Stage 4: Kinematic filtering

Optionally, trajectories are checked against the target robot's kinematic model. Demonstrations where the gripper moves beyond the robot's reach or requires impossible joint configurations are filtered out. This ensures the policy only trains on executable motions.
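As a rough illustration of the filtering idea, the sketch below rejects demonstrations that leave a spherical workspace or exceed a Cartesian speed limit. This is a deliberately simplified proxy: a check against an actual kinematic model would solve inverse kinematics and test joint ranges and joint velocities. The reach and speed limits are placeholder values, and `is_feasible` is a hypothetical helper.

```python
import numpy as np


def is_feasible(positions, dt, base_xyz, max_reach, max_speed):
    """Accept a trajectory only if every point is reachable and slow enough."""
    positions = np.asarray(positions, dtype=float)
    distance_from_base = np.linalg.norm(positions - np.asarray(base_xyz, dtype=float), axis=1)
    step_speed = np.linalg.norm(np.diff(positions, axis=0), axis=1) / dt
    return bool(np.all(distance_from_base <= max_reach) and np.all(step_speed <= max_speed))


demos = {
    "near_table": np.linspace([0.3, 0.0, 0.2], [0.5, 0.1, 0.3], 20),
    "far_shelf": np.linspace([0.3, 0.0, 0.2], [1.5, 0.0, 0.2], 20),
}

kept = [
    name
    for name, trajectory in demos.items()
    if is_feasible(trajectory, dt=0.1, base_xyz=[0.0, 0.0, 0.0], max_reach=0.85, max_speed=1.0)
]
print(kept)
```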

Stage 5: Mirror image processing

The side mirror regions in each fisheye image are detected, cropped, digitally reflected (flipped), and swapped (left mirror becomes right view). This creates consistent multi-view observations from a single camera.
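The crop, flip, and swap steps can be sketched with array slicing. The example assumes the mirror regions are axis-aligned boxes of equal size at known pixel coordinates and that "reflected" means a horizontal flip; the detection step is skipped and the box coordinates are placeholders. Whether the processed crops are written back into the image, as here, or passed to the policy separately is also an assumption.

```python
import numpy as np


def process_mirrors(image, left_box, right_box):
    """Flip both mirror crops horizontally and swap their locations.

    Boxes are (row_start, row_end, col_start, col_end) and must be equal in size.
    """
    ly0, ly1, lx0, lx1 = left_box
    ry0, ry1, rx0, rx1 = right_box
    left_crop = image[ly0:ly1, lx0:lx1]
    right_crop = image[ry0:ry1, rx0:rx1]
    if left_crop.shape != right_crop.shape:
        raise ValueError("mirror boxes must have the same shape to be swapped")

    out = image.copy()
    out[ly0:ly1, lx0:lx1] = right_crop[:, ::-1]  # right mirror, flipped, goes left
    out[ry0:ry1, rx0:rx1] = left_crop[:, ::-1]   # left mirror, flipped, goes right
    return out


# Tiny stand-in "image" so the effect is easy to inspect.
image = np.arange(4 * 8).reshape(4, 8)
processed = process_mirrors(image, left_box=(0, 4, 0, 2), right_box=(0, 4, 6, 8))
print(processed[0])
```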

Stage 6: Policy training

The processed data (images, relative EE trajectories, gripper widths) is used to train Diffusion Policy via standard behavior cloning. The observation horizon is typically 2 (current + previous frame), and the action prediction horizon is 16 steps.
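The two horizons translate into a sliding-window slicing of each demonstration. The sketch below assumes actions cover steps t+1 through t+16, matching the relative-trajectory formula, and simply drops timesteps near the episode boundaries; padding strategies and exact index conventions in a real training pipeline may differ.

```python
import numpy as np

OBS_HORIZON = 2       # current frame plus one previous frame
ACTION_HORIZON = 16   # number of future steps predicted per inference


def make_windows(num_steps, obs_horizon=OBS_HORIZON, action_horizon=ACTION_HORIZON):
    """Return (observation_indices, action_indices) pairs for one episode."""
    windows = []
    for t in range(obs_horizon - 1, num_steps - action_horizon):
        obs_idx = np.arange(t - obs_horizon + 1, t + 1)
        act_idx = np.arange(t + 1, t + 1 + action_horizon)
        windows.append((obs_idx, act_idx))
    return windows


windows = make_windows(num_steps=30)
first_obs, first_act = windows[0]
print(len(windows), first_obs.tolist(), first_act[0], first_act[-1])
```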

UMI Data Pipeline

End-to-end flow from raw handheld demonstration to deployable robot policy.

Throughput comparison: Measured over 15 minutes of collection and reported as hourly rates, a human collects 231 demos per hour with bare hands (no actions), 111 with the UMI gripper, and only 35 with spacemouse teleoperation. UMI is about 3x faster than teleop while producing robot-deployable action data. For dynamic tossing, UMI collects 149 demos/hour while spacemouse teleoperation scores 0: it simply cannot perform the tossing motion.
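The ratios follow directly from the quoted hourly rates for the cup arrangement comparison, as this short check shows.

```python
demos_per_hour = {
    "bare hands": 231,
    "UMI gripper": 111,
    "spacemouse teleop": 35,
}

speedup_vs_teleop = demos_per_hour["UMI gripper"] / demos_per_hour["spacemouse teleop"]
fraction_of_bare_hand = demos_per_hour["UMI gripper"] / demos_per_hour["bare hands"]

print(f"UMI vs teleop:     {speedup_vs_teleop:.2f}x")
print(f"UMI vs bare hands: {fraction_of_bare_hand:.0%}")
```

So the handheld gripper runs at a little over three times the teleoperation rate and a bit under half the speed of an unencumbered hand.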

Why must side mirror images be digitally reflected before feeding to the policy?

Without reflection, objects in mirrors appear with flipped orientation relative to the main view, confusing vision encoders; digital reflection makes all three views consistent
To increase image resolution
To match the robot camera's field of view

Chapter 6: Transfer to Robots

Training is done on handheld data. Now the policy needs to run on an actual robot. This is where UMI's careful interface design pays off — but two critical timing issues must be addressed.

The latency problem

During handheld data collection, there's essentially zero latency between observation and action — the human sees the scene and moves their hand in the same instant. But on a robot, latency is everywhere:

Camera latency (~100ms): Time from photons hitting the sensor to the image being available for processing
Inference latency (~50ms): Time for the neural network to process the observation and predict actions
Arm execution latency (~100ms): Time for the robot arm to begin moving after receiving a command
Gripper execution latency (~120ms): Time for the gripper to respond (often different from the arm)

PD1: Inference-time latency matching

UMI's solution is elegant: measure each latency source physically, then compensate at inference time:

Observation side: Different sensor streams (camera, robot joint encoders, gripper encoder) have different latencies. UMI measures each one and time-aligns all observations to the highest-latency stream (usually the camera). Faster streams are interpolated backwards to match the camera's timestamp.

Action side: The policy predicts a sequence of future actions. But by the time the robot starts executing, the first several actions are already stale (due to observation + inference + execution latency). UMI simply skips those stale actions and only executes actions with timestamps after the expected execution time.

This matters enormously for dynamic tasks. Without latency matching, the dynamic tossing task drops from 87.5% to 57.5% success rate. The robot's motions become jittery, and the precise timing needed for releasing objects during a toss is completely disrupted. The elbow joint velocity curves show the difference clearly: with latency matching, motions are smooth; without, they oscillate.
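Both halves of latency matching can be sketched in a few lines. The latency values reuse the approximate figures listed above; the 10 Hz action spacing, the function names, and the use of only the arm latency on the action side are assumptions for illustration, not details taken from UMI's implementation.

```python
import numpy as np

CAMERA_LATENCY = 0.10     # seconds, approximate figures from the list above
INFERENCE_LATENCY = 0.05
ARM_LATENCY = 0.10


def align_to_camera(capture_time, stream_times, stream_values):
    """Interpolate a low-latency stream at the moment the camera frame was captured."""
    return float(np.interp(capture_time, stream_times, stream_values))


def drop_stale_actions(action_times, actions, now, execution_latency):
    """Keep only actions whose target time the robot can still reach."""
    action_times = np.asarray(action_times, dtype=float)
    keep = action_times >= now + execution_latency
    return action_times[keep], [a for a, k in zip(actions, keep) if k]


# Observation side: gripper width sampled faster than the camera.
width_at_capture = align_to_camera(
    capture_time=0.015,
    stream_times=[0.00, 0.01, 0.02],
    stream_values=[0.080, 0.078, 0.076],
)

# Action side: the frame was captured at t = 0, and the policy predicts
# 16 actions spaced 0.1 s apart (assumed 10 Hz), starting at t = 0.1.
capture_time = 0.0
action_times = capture_time + 0.1 * np.arange(1, 17)
actions = list(range(16))                     # stand-ins for predicted poses
now = capture_time + CAMERA_LATENCY + INFERENCE_LATENCY

kept_times, kept_actions = drop_stale_actions(action_times, actions, now, ARM_LATENCY)
print(f"executing {len(kept_actions)} of {len(actions)} predicted actions")
```

With these numbers the first two predicted actions are already in the past by the time the arm could act on them, so execution starts from the third.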

PD2: Robot setup

Deploying on a new robot requires minimal setup:

Mount UMI-compatible gripper fingers and GoPro camera at the same position as on the handheld device
Place ArUco fiducial markers for initial calibration (one-time)
Measure hardware-specific latencies (one-time)
Load the trained policy and run

The same policy checkpoint trained on handheld data has been deployed on both a UR5e (6-DOF) and a Franka FR2 (7-DOF) without retraining. The cup arrangement task achieved 100% on UR5e and 90% on Franka (2 failures were joint limit violations from the robot's mounting position, not policy errors).

How does UMI handle the latency mismatch between zero-latency data collection and the robot's multi-source latencies?

Measure each latency source physically, time-align observations to the highest-latency stream, and skip stale predicted actions to only execute future-valid ones
Slow down robot execution to match human speed
Add artificial latency during data collection

Chapter 7: Results

UMI is evaluated on four tasks that push the boundaries of what behavior cloning can achieve. Each task tests a different capability.

Task 1: Cup arrangement (100% / 90% cross-robot)

Place an espresso cup on a saucer with the handle facing left. This tests precision (placement), multimodality (clockwise vs counter-clockwise rotation), and depth perception (via mirrors). 305 episodes from 2 demonstrators.

UMI achieves 20/20 = 100% on UR5e, and the same policy deployed on Franka FR2 achieves 18/20 = 90%.

Task 2: Dynamic tossing (87.5%)

Sort 6 YCB objects by tossing them to two bins placed beyond the robot's reach — spherical objects to a round bin, Lego pieces to a rectangular bin. This tests fast motions, precise release timing, and hand-eye coordination. 280 episodes.

UMI achieves 105/120 = 87.5%. Without latency matching: 57.5%.

Task 3: Bimanual cloth folding (70%)

Two robot arms fold a sweater: fold sleeves inward, fold bottom up, rotate 90 degrees, fold in half. Requires tight bimanual coordination — if one arm grabs slightly too early, the fold fails. 250 episodes from 2 demonstrators.

UMI achieves 14/20 = 70%. Without inter-gripper proprioception: 30%.

Task 4: Dish washing (70%)

A 7-step long-horizon task: turn on faucet, grab plate, pick up sponge, wash ketchup off, place plate on rack, return sponge, turn off faucet. Tests deformable manipulation, fluid handling, and semantic understanding of "clean." 258 episodes.

UMI with ViT encoder achieves 14/20 = 70%. With ResNet: 0%.

Task Performance Summary

Success rates across all four benchmark tasks. Bars show UMI vs key ablations.

The ablation that matters most: For each task, a different design decision is critical. Cup arrangement: fisheye lens (55% without vs 100% with). Dynamic tossing: latency matching (57.5% vs 87.5%). Bimanual folding: inter-gripper proprioception (30% vs 70%). Dish washing: pretrained ViT encoder (0% vs 70%). UMI's strength is that it identifies and solves all of these failure modes simultaneously.

Which design decision has the largest impact on the dynamic tossing task, and why?

Inference-time latency matching: without it, out-of-sync observations and actions cause jittery motions and mistimed object release, dropping success from 87.5% to 57.5%
The fisheye lens
The CLIP-pretrained ViT encoder

Chapter 8: In-The-Wild Diversity

This is UMI's headline result. Can data collected in the wild — across diverse kitchens, offices, and outdoor locations — produce policies that generalize to entirely new environments?

The experiment

3 demonstrators collected 1,400 cup arrangement demonstrations across 30 diverse physical locations: homes, offices, restaurants, outdoor tables. They used 15 different espresso cups (ceramic, glass, metal; cylindrical and tapered; various colors). Total collection time: 12 person-hours.

The policy was trained with a CLIP-pretrained ViT-L/14 vision encoder to handle the visual diversity.

Generalization tests

The trained policy was tested in two completely unseen environments:

Cafe table: Outdoor metal table at a busy cafe with pedestrian distractors. 5 training cups + 2 unseen cups, 35 total trials.
Water fountain: A black cubic fountain with water constantly flowing over the surface. Extremely out-of-distribution: all training data was on non-black, dry tables. 3 training + 2 unseen cups, 25 total trials.

The results

Combined success rate: 43/60 = 71.7%. Remarkably, performance on unseen cups (75%) is actually slightly better than on training cups (70%). The policy has learned a generalizable notion of "espresso cup" rather than memorizing specific objects.
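The reported percentages are mutually consistent, which a quick tally confirms. The figure of 5 trials per cup is inferred from the totals (35 trials over 7 cups, 25 trials over 5 cups) rather than stated in the text, and the per-group success counts are back-computed from the rounded percentages.

```python
TRIALS_PER_CUP = 5  # inferred, not stated: 35 / (5 + 2) and 25 / (3 + 2)

environments = {
    "cafe table": {"training_cups": 5, "unseen_cups": 2},
    "water fountain": {"training_cups": 3, "unseen_cups": 2},
}

training_trials = sum(e["training_cups"] for e in environments.values()) * TRIALS_PER_CUP
unseen_trials = sum(e["unseen_cups"] for e in environments.values()) * TRIALS_PER_CUP
total_trials = training_trials + unseen_trials

# Success counts implied by the reported 70% and 75% rates.
training_successes = round(0.70 * training_trials)
unseen_successes = round(0.75 * unseen_trials)
overall_rate = (training_successes + unseen_successes) / total_trials

print(f"training cups: {training_successes}/{training_trials}")
print(f"unseen cups:   {unseen_successes}/{unseen_trials}")
print(f"overall:       {overall_rate:.1%}")
```

Note that the unseen-cup rate rests on only 20 trials under this reading, so the gap between 70% and 75% amounts to a single success.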

The control experiment that isolates the cause: Training the same ViT architecture with only narrow-domain lab data produces a policy that scores 0% in the same unseen environments. The robot doesn't even move toward the cup. This indicates that the generalization comes from data diversity rather than from the model architecture alone. A large pretrained vision encoder is necessary but not sufficient; you also need diverse in-the-wild data.

Why visual diversity matters

When trained on data from 30 locations, the policy encounters: wooden tables, metal tables, glass tables, stone surfaces, paper tablecloths, different lighting conditions, different backgrounds, and different distractors. It's forced to learn which visual features actually matter for the task (the cup, the saucer, the spatial relationship) and which are irrelevant (the table material, the background, the lighting). This is the same principle behind ImageNet-scale pretraining, but applied at the robotics task level.

What happens when UMI's ViT architecture is trained with only narrow-domain lab data instead of diverse in-the-wild data?

0% success rate in unseen environments: the robot doesn't even move toward the cup, proving that generalization requires data diversity, not just model capacity
Slightly worse performance (about 50%)
The same performance since ViT already has pretrained features

Chapter 9: Connections

What UMI builds on

Diffusion Policy (Chi et al., 2023): UMI's policy backbone. Diffusion Policy models multimodal action distributions via iterative denoising — essential for handling the natural diversity of human demonstrations.

ACT / ALOHA (Zhao et al., 2023): Demonstrated that low-cost bimanual teleoperation is possible. But ALOHA still requires the robot during data collection. UMI removes that requirement entirely.

ORB-SLAM3 (Campos et al., 2021): The visual-inertial SLAM system that enables UMI's precise action extraction. Without IMU fusion, handheld data collection can't support dynamic tasks.

Dobb-E (Shafiullah et al., 2023): A reacher-grabber tool with iPhone for data collection. But limited to quasi-static tasks and requires environment-specific fine-tuning. UMI enables dynamic, bimanual, and zero-shot transfer.

What UMI enabled

DROID (Khazatsky et al., 2024): A large-scale robot manipulation dataset. UMI showed that diverse data collection is the key bottleneck, motivating large-scale distributed data collection efforts.

pi-0 / Physical Intelligence (2024): Foundation models for robot manipulation. UMI demonstrated that handheld data can be a scalable data source for training general-purpose robot policies.

MimicGen (Mandlekar et al., 2023): Generates demonstration data synthetically. It predates UMI and is better read as a complementary line of work than as a descendant: UMI collects real, diverse data cheaply instead of generating synthetic data.

UMI's legacy: UMI showed that the data collection problem in robotics isn't a hardware problem — it's an interface design problem. By carefully matching observation and action spaces between a cheap handheld device and expensive robot hardware, you can collect diverse manipulation data anywhere, train once, and deploy on any robot. The open-sourced design (https://umi-gripper.github.io) has been adopted by labs worldwide, making it one of the most impactful practical contributions to robot learning.

Cheat sheet

Core idea

Decouple data collection from robot hardware via a handheld gripper interface

Key hardware

3D-printed gripper + GoPro (fisheye, IMU, mirrors) = ~$370

Action extraction

ORB-SLAM3 (visual-inertial) → 6-DOF trajectories at metric scale

Action representation

Relative trajectories — hardware-agnostic, no error accumulation

Transfer mechanism

Inference-time latency matching + relative actions → cross-robot deployment

How does UMI differ from ALOHA as a data collection system for robot learning?

ALOHA requires the physical robot during data collection (leader-follower), limiting data to lab environments. UMI uses a handheld gripper, enabling data collection anywhere without a robot, and the resulting policies transfer across embodiments.
ALOHA uses a different neural network architecture
UMI uses more cameras than ALOHA