The open-source stack for end-to-end robotics — from data collection to deployment.
Imagine you want a robot to fold a shirt. Sounds simple — you've done it a thousand times without thinking. But watch what actually happens: you perceive the shirt's shape through vision, estimate its crumpled 3D geometry, plan a sequence of grasps and folds, then execute motions that involve constant contact with a deformable surface whose physics are nearly impossible to model analytically.
Traditional robotics would try to solve each of these sub-problems independently. A perception module segments the shirt. A planning module computes a fold sequence. A dynamics model predicts how the fabric deforms. A controller tracks the planned trajectory. Each module is hand-engineered, and each module can fail.
The shirt wrinkles slightly differently than the simulator predicted? The planner's assumption breaks. The lighting changes? The perception module loses the shirt outline. A human bumps the table? The controller doesn't know how to recover. These aren't edge cases — they're the norm in unstructured environments.
Meanwhile, something remarkable happened in NLP and vision: researchers stopped engineering features and started learning them from data. The same neural network architecture (the Transformer) now handles text, images, audio, and video. The secret ingredient wasn't a better algorithm — it was more data and a simpler pipeline.
Robot learning asks: can we do the same for physical manipulation? Can we replace the brittle stack of perception + planning + dynamics + control with a single neural network that maps observations directly to actions?
The classical pipeline chains six modules together, and any single failure kills the whole chain. If each module, starting with perception, succeeds with probability 0.95, the overall success rate is 0.95^6 ≈ 0.74. Roughly one in four attempts fails. And that's being generous: the calculation assumes independent failures, while real modules often fail in correlated ways.
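The arithmetic behind that figure can be checked in a few lines. This is a minimal sketch that assumes the six stages fail independently; the function name `chain_success` is illustrative and not part of any library.

```python
def chain_success(per_stage: float, n_stages: int) -> float:
    """Probability that every stage of a serial pipeline succeeds,
    assuming stage failures are independent."""
    return per_stage ** n_stages


classical = chain_success(0.95, 6)
print(f"success: {classical:.3f}, failure: {1 - classical:.3f}")
# success: 0.735, failure: 0.265
```

Adding stages makes the product shrink quickly: at the same per-stage rate, ten stages would succeed only about 60% of the time.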
The learned pipeline has one box: observation in, action out. It can still fail, but its failure mode is different: it tends to degrade gracefully rather than catastrophically, because the network learns to be robust to the kinds of perturbations it saw during training.
The transition from classical to learned robotics rests on four pillars. Understanding these pillars explains why this shift is happening now and not ten years ago.
Classical systems decompose manipulation into separate stages: perceive, plan, control. Each stage is developed by a different specialist, using different representations. The perception engineer outputs a point cloud. The planner expects a mesh. The controller needs joint-space trajectories. Every interface is a potential point of failure.
Monolithic learned policies replace this entire stack with a single neural network. The input is raw sensor data (images, joint positions). The output is motor commands (joint velocities or torques). No hand-designed interfaces, no representation mismatches.
A robot folding a shirt uses vision (where is the shirt?), proprioception (where are my joints?), and possibly tactile sensing (how hard am I gripping?). Classical pipelines process each modality in a separate module and then try to fuse them. Learned policies consume all modalities as a single input vector and let the network figure out what to attend to.
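The idea of one input vector feeding one function can be sketched in plain Python. Everything here is a toy assumption: `flatten_observation` and `LinearPolicy` are hypothetical names, the 2x2 "image" is made up, and a single linear map stands in for a real neural network.

```python
import random


def flatten_observation(image, joint_positions, grip_force):
    """Concatenate all modalities into one flat input vector."""
    pixels = [p for row in image for p in row]
    return pixels + list(joint_positions) + [grip_force]


class LinearPolicy:
    """Toy stand-in for a neural network: one linear map, observation -> action."""

    def __init__(self, obs_dim, action_dim, seed=0):
        rng = random.Random(seed)
        self.weights = [
            [rng.uniform(-0.1, 0.1) for _ in range(obs_dim)]
            for _ in range(action_dim)
        ]

    def select_action(self, obs):
        return [sum(w * x for w, x in zip(row, obs)) for row in self.weights]


image = [[0.0, 0.5], [1.0, 0.25]]  # hypothetical 2x2 grayscale image
joints = [0.1, -0.3, 0.7]          # hypothetical joint positions (rad)
obs = flatten_observation(image, joints, grip_force=0.2)

policy = LinearPolicy(obs_dim=len(obs), action_dim=3)
action = policy.select_action(obs)
print(len(obs), len(action))  # 8 3
```

There is no separate perception, fusion, or control stage to keep in sync; the only interface is the shape of the observation and the shape of the action.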
This is the same insight that made large multimodal models (GPT-4V, Gemini) work: don't engineer separate encoders and a fusion layer — just tokenize everything and let attention sort it out.
Modeling the physics of a crumpled shirt is a research-level problem in computational mechanics. Learned policies sidestep this entirely. They don't need a dynamics model because they learn the input-output mapping directly from demonstrations. The implicit dynamics are baked into the training data.
The final pillar is the one that makes everything else possible. Vision-Language Models work because they were trained on billions of image-text pairs scraped from the internet. Robot learning has historically had no equivalent data source. But this is changing:
| Data Source | Scale | Diversity |
|---|---|---|
| Human teleoperation | ~1K demos/week | High (real-world) |
| Simulation (IsaacGym, MuJoCo) | ~1M demos/day | Medium (sim-to-real gap) |
| Internet video | Billions of clips | Very high (no actions) |
| Cross-embodiment (Open X-Embodiment) | ~1M episodes | High (many robots) |
LeRobot is an open-source platform by Hugging Face that provides the full pipeline for robot learning: collect data, train policies, and deploy them on real hardware. Think of it as the Hugging Face Transformers library, but for robots.
The stack has three layers: datasets, policies, and robot deployment. They are designed to work together, but each can also be used independently.
What makes LeRobot different from other robotics frameworks is its data-first philosophy. The dataset format is standardized across all robots and all tasks. Any policy can train on any dataset. Any trained policy can deploy on any compatible robot. This interoperability is the key design choice.
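```python
# The entire LeRobot workflow in a few lines
import torch
from lerobot.common.datasets.lerobot_dataset import LeRobotDataset
from lerobot.common.policies.act.modeling_act import ACTPolicy

# 1. Load a dataset (local or from the Hub)
dataset = LeRobotDataset("lerobot/aloha_sim_transfer_cube_human")

# 2. Create a policy (schematic: released versions build the policy from a
#    policy config, and the exact constructor signature varies by release)
policy = ACTPolicy(dataset.meta)

# 3. Train (simplified; real training uses Hydra configs)
loader = torch.utils.data.DataLoader(dataset, batch_size=8, shuffle=True)
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-5)
for batch in loader:
    loss = policy.forward(batch)  # schematic: some releases return (loss, info)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```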
Each layer is a Python module you can use independently. Want to just use the dataset format without training? Import LeRobotDataset. Want to train a policy on your own data format? Wrap your data to match the expected dictionary structure. Want to deploy a pretrained policy without touching the training code? Load a checkpoint and call policy.select_action().
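Wrapping your own data usually comes down to renaming fields. The sketch below is an assumption-heavy illustration: the input field names (`camera_rgb`, `joint_angles`, `commanded_joints`) and the helper `to_lerobot_style_frame` are hypothetical, the output keys follow the frame dictionary used in this chapter, and plain lists stand in for the tensors a real policy would expect.

```python
def to_lerobot_style_frame(record, episode_index, frame_index, fps):
    """Rename the fields of a custom record to LeRobot-style frame keys."""
    return {
        "observation.images.top": record["camera_rgb"],
        "observation.state": record["joint_angles"],
        "action": record["commanded_joints"],
        "episode_index": episode_index,
        "frame_index": frame_index,
        "timestamp": frame_index / fps,  # seconds since the episode started
    }


# Hypothetical record from a custom logging format
record = {
    "camera_rgb": [[[0, 0, 0]]],          # placeholder for an image
    "joint_angles": [0.0] * 14,
    "commanded_joints": [0.1] * 14,
}
frame = to_lerobot_style_frame(record, episode_index=0, frame_index=10, fps=50)
print(sorted(frame.keys()))
```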
A LeRobotDataset has three storage layers, each optimized for a different kind of data. Understanding this format is essential because every policy in the library expects data in exactly this structure.
Actions, joint positions, and other numerical time series are stored in Apache Parquet files. Parquet is a columnar format — reading a single column (say, just the gripper state) is fast because you don't have to scan through the image columns. Each row corresponds to one timestep.
Camera images are stored as MP4 video files, one per camera per episode. Why not PNG frames? Because video compression reduces storage by 10-100x. A 50-episode dataset with two cameras would be ~50 GB as PNGs but ~500 MB as MP4. The dataset loader handles frame extraction transparently — you request frame 42, it seeks to the right position in the video.
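Two small calculations make this concrete: where frame 42 sits in the video, and the compression ratio implied by the figures above. This is a sketch, not the loader's actual code; `frame_to_seek_time` is a hypothetical helper, and real decoders typically seek to a nearby keyframe and decode forward from there.

```python
def frame_to_seek_time(frame_index: int, fps: float) -> float:
    """Timestamp (in seconds) at which a given frame appears in the video."""
    return frame_index / fps


print(frame_to_seek_time(42, 50))  # 0.84

# Compression ratio implied by the example sizes (~50 GB vs ~500 MB)
png_mb, mp4_mb = 50_000, 500
print(png_mb / mp4_mb)  # 100.0
```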
The info.json file describes everything about the dataset: what robot was used, how many episodes, the FPS, which columns exist, and the shape/dtype of every feature. This is what makes datasets self-describing — a policy can inspect the metadata and automatically configure its input/output dimensions.
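To illustrate how self-describing metadata lets a policy size itself, the sketch below parses a small metadata document and derives input and output dimensions. The JSON shown is a simplified, hypothetical stand-in for info.json; the real file has more fields and its exact schema may differ.

```python
import json

# Hypothetical, simplified metadata in the spirit of info.json
info_json = """
{
  "fps": 50,
  "total_episodes": 50,
  "features": {
    "observation.state": {"dtype": "float32", "shape": [14]},
    "action": {"dtype": "float32", "shape": [14]},
    "observation.images.top": {"dtype": "video", "shape": [3, 480, 640]}
  }
}
"""
info = json.loads(info_json)


def feature_dim(info, key):
    """Total number of scalar values in one sample of a feature."""
    dim = 1
    for size in info["features"][key]["shape"]:
        dim *= size
    return dim


state_dim = feature_dim(info, "observation.state")
action_dim = feature_dim(info, "action")
print(state_dim, action_dim)  # 14 14
```

A policy built this way never hard-codes the robot's joint count; switching to a robot with a different number of joints only changes the metadata.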
```python
from lerobot.common.datasets.lerobot_dataset import LeRobotDataset

# Load dataset (downloads from the Hub on first call)
ds = LeRobotDataset("lerobot/aloha_sim_transfer_cube_human")

# Inspect metadata
print(ds.meta.fps)             # 50
print(ds.meta.total_episodes)  # 50
print(ds.meta.total_frames)    # 20000

# Access a single frame
frame = ds[0]
print(frame.keys())
# dict_keys(['observation.images.top', 'observation.state',
#            'action', 'episode_index', 'frame_index',
#            'timestamp', 'index'])

# Shapes
print(frame['observation.images.top'].shape)  # torch.Size([3, 480, 640])
print(frame['observation.state'].shape)       # torch.Size([14])
print(frame['action'].shape)                  # torch.Size([14])
```
Every policy's forward() method expects this exact dictionary structure. When you create a new dataset for a different robot, you match these keys (with different shapes), and every policy just works.
A policy doesn't just see the current frame. It needs context — what happened a moment ago, and what actions to predict a moment from now. This is where delta timestamps come in, and understanding them is crucial for building intuition about how policies consume data.
A delta timestamp is an offset in seconds from the current frame. Negative values look backward in time (past observations). Zero is the present. Positive values look forward (future actions to predict).
At each training step, the dataset loader takes the current frame index and computes which actual frames correspond to each delta. The formula is simple: frame_index = current_index + round(delta × fps), where delta is the offset in seconds.
Why not just use consecutive frames? Because the relevant timescale varies by task. A fast pick-and-place might need 50ms windows. A slow assembly task might need 2-second windows. Delta timestamps let you decouple the observation window from the recording FPS.
For actions, positive deltas specify the action chunk — the sequence of future actions the policy must predict. If action deltas are [0.0, 0.033, 0.066, 0.1] at 30 FPS, the policy predicts the next 4 actions (current + 3 future frames). This is called action chunking, and it's a key technique in ACT and Diffusion Policy.
```python
# Delta timestamps in practice
from lerobot.common.datasets.lerobot_dataset import LeRobotDataset

ds = LeRobotDataset(
    "lerobot/aloha_sim_transfer_cube_human",
    delta_timestamps={
        "observation.images.top": [-0.1, 0.0],  # 2 frames: 100ms ago + now
        "observation.state": [-0.1, 0.0],       # same window for state
        "action": [0.0, 0.02, 0.04, 0.06],      # 4-step action chunk
    }
)

frame = ds[100]
print(frame['observation.state'].shape)  # torch.Size([2, 14]): 2 timesteps
print(frame['action'].shape)             # torch.Size([4, 14]): 4-step chunk
```
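The mapping from time offsets to frame indices can be sketched without the library. This is an illustration under stated assumptions rather than the loader's actual implementation: `delta_to_indices` is a hypothetical helper, and clamping out-of-range indices to the episode boundaries is one plausible way to handle the first and last frames.

```python
def delta_to_indices(current, deltas, fps, episode_start, episode_end):
    """Map time offsets in seconds to frame indices, clamped to the episode.

    episode_end is exclusive, so valid indices are episode_start..episode_end-1.
    """
    indices = []
    for delta in deltas:
        idx = current + round(delta * fps)
        idx = max(episode_start, min(idx, episode_end - 1))
        indices.append(idx)
    return indices


# 50 FPS dataset, 400-frame episode, current frame 100
obs_idx = delta_to_indices(100, [-0.1, 0.0], fps=50,
                           episode_start=0, episode_end=400)
act_idx = delta_to_indices(100, [0.0, 0.02, 0.04, 0.06], fps=50,
                           episode_start=0, episode_end=400)
print(obs_idx)  # [95, 100]
print(act_idx)  # [100, 101, 102, 103]
```

The same deltas land on different frames at a different recording rate: at 30 FPS, an offset of -0.1 s is 3 frames back instead of 5.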
Robot datasets are large. The ALOHA transfer cube dataset is modest at ~2 GB, but cross-embodiment datasets like Open X-Embodiment exceed 1 TB. Downloading everything before you can even inspect it is impractical. LeRobot solves this with streaming.
When you enable streaming, the dataset doesn't download files to disk. Instead, it fetches data on-the-fly from the Hugging Face Hub, caching only what's needed. The Parquet files are read in chunks via HTTP range requests, and video frames are decoded from streamed MP4 segments.
```python
from lerobot.common.datasets.lerobot_dataset import LeRobotDataset

# Local loading: downloads everything first
ds_local = LeRobotDataset(
    "lerobot/aloha_sim_transfer_cube_human",
    local_files_only=False,  # default: download to cache
)
# Takes minutes on first run, instant after

# Streaming: fetches on demand
ds_stream = LeRobotDataset(
    "lerobot/aloha_sim_transfer_cube_human",
    streaming=True,
)
# Ready in seconds, but slower per-batch
```
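Reading a remote file in chunks relies on the HTTP `Range` header, whose byte offsets are inclusive at both ends. The sketch below only builds the header values and makes no network calls; `range_headers` and the chunk size are illustrative assumptions, not what the Hub client actually uses.

```python
def range_headers(total_bytes: int, chunk_bytes: int):
    """Yield HTTP Range header values that cover a file in fixed-size chunks."""
    start = 0
    while start < total_bytes:
        # Range ends are inclusive, hence the -1
        end = min(start + chunk_bytes, total_bytes) - 1
        yield f"bytes={start}-{end}"
        start = end + 1


headers = list(range_headers(total_bytes=2500, chunk_bytes=1000))
print(headers)
# ['bytes=0-999', 'bytes=1000-1999', 'bytes=2000-2499']
```

Each header would accompany one GET request, so a reader that needs only the first chunk never pays for the rest of the file.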
The Hub also provides a dataset viewer in the browser. Navigate to any LeRobot dataset on huggingface.co and you'll see the videos playing alongside the numerical data. This is incredibly useful for quality control: before training, scrub through a few episodes to check for bad demonstrations (the teleoperator sneezed, the object fell off the table, etc.).
| Property | Local | Streaming |
|---|---|---|
| Setup time | Minutes (first download) | Seconds |
| Per-batch latency | ~1ms (disk) | ~100ms (network) |
| Disk usage | Full dataset size | ~0 (cache only) |
| Use case | Training | Exploration, debugging |
| Random access | Instant | Requires seek in video stream |
The most natural way to teach a robot is to show it what to do. You drive the robot through the desired motion, typically by moving a second arm or controller that the robot mirrors, while recording everything: joint positions, camera images, gripper state. This is teleoperation, and it's how most robot learning datasets are collected today.
LeRobot supports several teleoperation setups, but the most accessible is the SO-100 with a leader-follower arm. You move the leader arm (the one you hold), and the follower arm (the one doing the task) mirrors your movements in real time. Both arms' joint positions are recorded, along with camera feeds.
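The leader-follower loop can be sketched with simulated arms. Everything here is hypothetical: `SimulatedArm` and `teleop_step` are stand-ins for real motor drivers, the arm is assumed to have 6 joints, and the logging convention (follower positions as the observation, leader positions as the action) is an assumption about how such recordings are commonly laid out.

```python
class SimulatedArm:
    """Stand-in for a real arm: stores and returns joint positions."""

    def __init__(self, n_joints):
        self.positions = [0.0] * n_joints

    def read_positions(self):
        return list(self.positions)

    def write_positions(self, targets):
        self.positions = list(targets)


def teleop_step(leader, follower, log, timestamp):
    """Mirror the leader onto the follower and record one timestep."""
    target = leader.read_positions()   # where the operator moved the leader
    state = follower.read_positions()  # follower state before the command
    follower.write_positions(target)
    log.append({"timestamp": timestamp,
                "observation.state": state,
                "action": target})


leader, follower = SimulatedArm(6), SimulatedArm(6)
log, fps = [], 30
for step in range(3):
    # Pretend the operator nudges the leader's first joint each step
    leader.positions[0] = 0.1 * (step + 1)
    teleop_step(leader, follower, log, timestamp=step / fps)

print(len(log), follower.read_positions()[0])
```

Because the follower's state is read before the new command is applied, each logged observation equals the action of the previous step in this idealized simulation; a real arm would lag behind its target.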
```bash
# Full recording pipeline for SO-100

# 1. Calibrate (only needed once per robot)
python -m lerobot.calibrate \
    --robot.type=so100 \
    --robot.port=/dev/ttyACM0 \
    --robot.leader_port=/dev/ttyACM1

# 2. Quick teleoperation test (no recording)
python -m lerobot.teleoperate \
    --robot.type=so100

# 3. Record 50 episodes of a pick-and-place task
python -m lerobot.record \
    --robot.type=so100 \
    --repo-id=your_name/pick_place_cube \
    --num-episodes=50 \
    --fps=30 \
    --warmup-time-s=5 \
    --episode-time-s=30 \
    --reset-time-s=10

# 4. Replay episode 0 to check quality
python -m lerobot.replay \
    --robot.type=so100 \
    --repo-id=your_name/pick_place_cube \
    --episode=0
```
The --reset-time-s=10 flag gives you 10 seconds between episodes to reset the scene (put the cube back, straighten the cloth, etc.). The --warmup-time-s=5 flag gives 5 seconds before recording starts so you can get your hands in position.
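Those flags also determine how much data a session produces and roughly how long it takes. The arithmetic below is a sketch that assumes every episode runs for its full duration, with a single warmup at the start and a reset period after every episode; the actual schedule may differ.

```python
num_episodes, fps = 50, 30
warmup_s, episode_s, reset_s = 5, 30, 10

frames_per_episode = fps * episode_s
total_frames = num_episodes * frames_per_episode

# Assumption: one warmup, then (episode + reset) repeated for each episode
session_s = warmup_s + num_episodes * (episode_s + reset_s)

print(frames_per_episode)  # 900
print(total_frames)        # 45000
print(f"{session_s / 60:.1f} minutes")
```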
After recording, the dataset is automatically formatted as a LeRobotDataset with Parquet, MP4, and metadata. You can push it to the Hub with a single command:
```python
from lerobot.common.datasets.lerobot_dataset import LeRobotDataset

# Push your dataset to the Hub
ds = LeRobotDataset("your_name/pick_place_cube")
ds.push_to_hub()
```
This chapter is the trailhead. You now understand the motivation for robot learning and the infrastructure that makes it possible. Here's where each subsequent chapter takes you:
| Chapter | What You'll Learn | Key Idea |
|---|---|---|
| Ch 2: Classical Robotics | Forward/inverse kinematics, Jacobians, feedback control | What works, and why it breaks |
| Ch 3: RL Foundations | MDPs, policy gradients, value functions | Learning from rewards, not demonstrations |
| Ch 4: Imitation Learning | Behavior cloning, DAgger, action chunking | Learning from expert demonstrations |
| Ch 5: Policies | ACT, Diffusion Policy, TDMPC, VLA | The specific architectures that work |
| Ch 6: Vision-Language-Action | Foundation models for robotics | One model for any task, any robot |
The thread connecting all chapters is a single question: how do we get data, and what do we do with it? Classical robotics needs no data but breaks on hard tasks. RL generates its own data but needs millions of trials. Imitation learning uses human demonstrations but needs hundreds per task. VLAs leverage internet-scale pretraining so they need only a handful of demonstrations for a new task.
Each approach trades off data efficiency against generality. The trend in the field is unmistakable: toward more data, simpler pipelines, and larger models. LeRobot is the infrastructure that makes this practical.