Capuano, Pascal, Zouitine, Wolf & Aractingi, Chapter 1

Introduction to Robot Learning

The open-source stack for end-to-end robotics — from data collection to deployment.

Prerequisites: Basic Python + Curiosity about robots. That's it.
8 Chapters · 4+ Simulations · 0 Assumed Knowledge

Chapter 0: Why Robot Learning?

Imagine you want a robot to fold a shirt. Sounds simple — you've done it a thousand times without thinking. But watch what actually happens: you perceive the shirt's shape through vision, estimate its crumpled 3D geometry, plan a sequence of grasps and folds, then execute motions that involve constant contact with a deformable surface whose physics are nearly impossible to model analytically.

Traditional robotics would try to solve each of these sub-problems independently. A perception module segments the shirt. A planning module computes a fold sequence. A dynamics model predicts how the fabric deforms. A controller tracks the planned trajectory. Each module is hand-engineered, and each module can fail.

The shirt wrinkles slightly differently than the simulator predicted? The planner's assumption breaks. The lighting changes? The perception module loses the shirt outline. A human bumps the table? The controller doesn't know how to recover. These aren't edge cases — they're the norm in unstructured environments.

The fundamental gap: Large language models can write poetry and pass bar exams because text data is abundant and easy to collect. Robot data is expensive, dangerous to collect, and tied to specific hardware. This gap — not compute, not algorithms — is the primary bottleneck in robot intelligence.

Meanwhile, something remarkable happened in NLP and vision: researchers stopped engineering features and started learning them from data. The same neural network architecture (the Transformer) now handles text, images, audio, and video. The secret ingredient wasn't a better algorithm — it was more data and a simpler pipeline.

Robot learning asks: can we do the same for physical manipulation? Can we replace the brittle stack of perception + planning + dynamics + control with a single neural network that maps observations directly to actions?

Classical vs Learned Pipelines

[Interactive figure: each pipeline shows how many components must succeed for the robot to fold a shirt. In the classical pipeline, any single failure breaks the whole chain.]

The classical pipeline has six boxes chained together. If each module succeeds independently with probability 0.95, the overall success rate is 0.95^6 ≈ 0.74. Roughly one in four attempts fails. And that's being generous: real modules often have correlated failures.
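
The arithmetic above can be checked with a short sketch. It assumes, as the text does, that the modules fail independently; the function name `chain_success` is illustrative only.

```python
def chain_success(per_module_success: float, num_modules: int) -> float:
    """Probability that every module in a serial pipeline succeeds,
    assuming independent failures (an idealization)."""
    return per_module_success ** num_modules


# Compare a few per-module reliabilities for a six-module chain.
for p in (0.99, 0.95, 0.90):
    print(f"p={p:.2f}: six-module chain succeeds {chain_success(p, 6):.1%} of the time")
```

Even small per-module error rates compound quickly once several stages are chained.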

The learned pipeline has one box: observation in, action out. It can still fail, but its failure mode is different: it tends to degrade gracefully rather than catastrophically, because the network can learn to be robust to the kinds of perturbations it saw during training.

Why is the classical robotics pipeline brittle for tasks like shirt folding?

Chapter 1: The Data-Driven Shift

The transition from classical to learned robotics rests on four pillars. Understanding these pillars explains why this shift is happening now and not ten years ago.

Pillar 1: Monolithic Pipelines

Classical systems decompose manipulation into separate stages: perceive, plan, control. Each stage is developed by a different specialist, using different representations. The perception engineer outputs a point cloud. The planner expects a mesh. The controller needs joint-space trajectories. Every interface is a potential point of failure.

Monolithic learned policies replace this entire stack with a single neural network. The input is raw sensor data (images, joint positions). The output is motor commands (joint velocities or torques). No hand-designed interfaces, no representation mismatches.

Key insight: End-to-end learning doesn't mean the network ignores structure. It means the network discovers whatever internal representations are useful for the task, rather than having them imposed by a human designer. Often, the learned representations look nothing like what an engineer would have designed.

Pillar 2: Multimodal Features

A robot folding a shirt uses vision (where is the shirt?), proprioception (where are my joints?), and possibly tactile sensing (how hard am I gripping?). Classical pipelines process each modality in a separate module and then try to fuse them. Learned policies consume all modalities as a single input vector and let the network figure out what to attend to.

This is the same insight that made large multimodal models (GPT-4V, Gemini) work: don't engineer separate encoders and a fusion layer — just tokenize everything and let attention sort it out.

Pillar 3: No Explicit Dynamics

Modeling the physics of a crumpled shirt is a research-level problem in computational mechanics. Learned policies sidestep this entirely. They don't need a dynamics model because they learn the input-output mapping directly from demonstrations. The implicit dynamics are baked into the training data.

Pillar 4: Data Scaling

The final pillar is the one that makes everything else possible. Vision-Language Models work because they were trained on billions of image-text pairs scraped from the internet. Robot learning has historically had no equivalent data source. But this is changing:

| Data Source | Scale | Diversity |
|---|---|---|
| Human teleoperation | ~1K demos/week | High (real-world) |
| Simulation (IsaacGym, MuJoCo) | ~1M demos/day | Medium (sim-to-real gap) |
| Internet video | Billions of clips | Very high (no actions) |
| Cross-embodiment (Open X-Embodiment) | ~1M episodes | High (many robots) |

The data flywheel: Better policies → safer autonomous data collection → more data → even better policies. This is the same flywheel that drove self-driving car development, and it's now spinning up for general manipulation.

The Four Pillars

[Interactive figure: shows how each pillar contributes to the data-driven shift, with arrows marking the dependencies between pillars.]

Which pillar explains why learned policies don't need accurate physics simulators for deformable objects?

Chapter 2: The LeRobot Stack

LeRobot is an open-source platform by Hugging Face that provides the full pipeline for robot learning: collect data, train policies, and deploy them on real hardware. Think of it as the Hugging Face Transformers library, but for robots.

The stack has three layers, and they're designed to work together but also independently:

• Data Layer: LeRobotDataset (collect, store, stream demonstrations)
• Training Layer: Policies (ACT, Diffusion Policy, TDMPC, VLA), all in one API
• Deployment Layer: Real-time inference on real robots (SO-100, ALOHA, etc.)

What makes LeRobot different from other robotics frameworks is its data-first philosophy. The dataset format is standardized across all robots and all tasks. Any policy can train on any dataset. Any trained policy can deploy on any compatible robot. This interoperability is the key design choice.

Why this matters: Before LeRobot, every robotics lab had its own data format, its own training scripts, and its own deployment code. Sharing was nearly impossible. With a standardized format, a dataset collected on an SO-100 in Paris can be loaded and used for training by a lab in San Francisco through the same code path. Moving a trained policy between robots, say from an SO-100 to an ALOHA, still requires compatible observation and action spaces.
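python
# The LeRobot workflow in outline (a simplified sketch: import paths and
# constructor signatures differ between LeRobot releases)
import torch
from lerobot.common.datasets.lerobot_dataset import LeRobotDataset
from lerobot.common.policies.act.configuration_act import ACTConfig
from lerobot.common.policies.act.modeling_act import ACTPolicy

# 1. Load a dataset (local or from Hub) and wrap it in a DataLoader for batching
dataset = LeRobotDataset("lerobot/aloha_sim_transfer_cube_human")
loader = torch.utils.data.DataLoader(dataset, batch_size=8, shuffle=True)

# 2. Create a policy from a config plus the dataset's normalization statistics
#    (assumption: the config's input and output features match the dataset)
policy = ACTPolicy(ACTConfig(), dataset_stats=dataset.meta.stats)
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-5)

# 3. Train (simplified: real training also requests action chunks via
#    delta_timestamps, uses a GPU, and is driven by config files)
for batch in loader:
    out = policy.forward(batch)  # a loss dict or a (loss, info) tuple, depending on release
    loss = out["loss"] if isinstance(out, dict) else out[0]
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()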

LeRobot Pipeline

[Interactive figure: shows what flows between the stages, with data shapes and types on the connecting arrows.]

Each layer is a Python module you can use independently. Want to just use the dataset format without training? Import LeRobotDataset. Want to train a policy on your own data format? Wrap your data to match the expected dictionary structure. Want to deploy a pretrained policy without touching the training code? Load a checkpoint and call policy.select_action().

What is LeRobot's key design philosophy that differentiates it from other robotics frameworks?

Chapter 3: LeRobotDataset Format

A LeRobotDataset has three storage layers, each optimized for a different kind of data. Understanding this format is essential because every policy in the library expects data in exactly this structure.

Layer 1: Tabular Data (Parquet)

Actions, joint positions, and other numerical time series are stored in Apache Parquet files. Parquet is a columnar format — reading a single column (say, just the gripper state) is fast because you don't have to scan through the image columns. Each row corresponds to one timestep.

Each row: { timestamp, episode_index, frame_index, action[6], state[6] }
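
To see why a columnar layout makes single-column reads cheap, here is a toy illustration in plain Python. It does not use Parquet itself; the `rows` and `columns` structures and their values are made up for the example.

```python
# Row-oriented: each timestep is one record holding every field.
rows = [
    {"timestamp": 0.00, "episode_index": 0, "frame_index": 0, "gripper": 0.10},
    {"timestamp": 0.02, "episode_index": 0, "frame_index": 1, "gripper": 0.15},
    {"timestamp": 0.04, "episode_index": 0, "frame_index": 2, "gripper": 0.30},
]

# Column-oriented: each field is stored contiguously, which is the idea behind Parquet.
columns = {key: [row[key] for row in rows] for key in rows[0]}

# Row layout: every record must be visited to collect one field.
gripper_from_rows = [row["gripper"] for row in rows]

# Columnar layout: the field is already a single contiguous list.
gripper_from_columns = columns["gripper"]

print(gripper_from_columns)
```

Both reads return the same values; the difference is how much unrelated data has to be touched to get them.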

Layer 2: Visual Data (MP4)

Camera images are stored as MP4 video files, one per camera per episode. Why not PNG frames? Because video compression reduces storage by 10-100x. A 50-episode dataset with two cameras would be ~50 GB as PNGs but ~500 MB as MP4. The dataset loader handles frame extraction transparently — you request frame 42, it seeks to the right position in the video.

Layer 3: Metadata (JSON)

The info.json file describes everything about the dataset: what robot was used, how many episodes, the FPS, which columns exist, and the shape/dtype of every feature. This is what makes datasets self-describing — a policy can inspect the metadata and automatically configure its input/output dimensions.
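
As a rough illustration of what "self-describing" buys you, the sketch below reads feature shapes from a metadata document and derives a policy's input and output sizes. The JSON is a hand-written stand-in loosely modeled on the description above, not a copy of a real info.json, and `policy_dims` is a hypothetical helper.

```python
import json

# Hypothetical metadata in the spirit of info.json (field names are assumptions).
info_json = """
{
  "fps": 50,
  "total_episodes": 50,
  "features": {
    "observation.images.top": {"dtype": "video", "shape": [3, 480, 640]},
    "observation.state": {"dtype": "float32", "shape": [14]},
    "action": {"dtype": "float32", "shape": [14]}
  }
}
"""


def policy_dims(info: dict) -> dict:
    """Derive input/output sizes from feature metadata (illustrative helper)."""
    features = info["features"]
    return {
        "state_dim": features["observation.state"]["shape"][0],
        "action_dim": features["action"]["shape"][0],
        "cameras": [k for k in features if k.startswith("observation.images.")],
    }


info = json.loads(info_json)
print(policy_dims(info))
```

A policy constructor could consume a dictionary like this instead of hard-coded dimensions, which is the kind of automatic configuration the metadata is meant to enable.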

python
from lerobot.common.datasets.lerobot_dataset import LeRobotDataset

# Load dataset — downloads from Hub on first call
ds = LeRobotDataset("lerobot/aloha_sim_transfer_cube_human")

# Inspect metadata
print(ds.meta.fps)           # 50
print(ds.meta.total_episodes) # 50
print(ds.meta.total_frames)   # 20000

# Access a single frame
frame = ds[0]
print(frame.keys())
# dict_keys(['observation.images.top', 'observation.state',
#            'action', 'episode_index', 'frame_index',
#            'timestamp', 'index'])

# Shapes
print(frame['observation.images.top'].shape)  # torch.Size([3, 480, 640])
print(frame['observation.state'].shape)       # torch.Size([14])
print(frame['action'].shape)                  # torch.Size([14])

Key insight: The frame dictionary is the universal interface. Every policy's forward() method expects this exact dictionary structure. When you create a new dataset for a different robot, you match these keys (with different shapes), and every policy just works.

Dataset Structure

[Interactive figure: shows what each storage layer contains and how it connects to the frame dictionary that policies consume.]

Why does LeRobot store camera data as MP4 videos rather than individual PNG frames?

Chapter 4: Delta Timestamps

A policy doesn't just see the current frame. It needs context — what happened a moment ago, and what actions to predict a moment from now. This is where delta timestamps come in, and understanding them is crucial for building intuition about how policies consume data.

A delta timestamp is an offset in seconds from the current frame. Negative values look backward in time (past observations). Zero is the present. Positive values look forward (future actions to predict).

delta_timestamps = {
  "observation.state": [-0.2, -0.1, 0.0],  # past + present states
  "action": [0.0, 0.033, 0.066, 0.1]     # present + future actions
}

At each training step, the dataset loader takes the current frame index and computes which actual frames correspond to each delta. The formula is simple:

frame_index = current_index + round(delta × fps)

Hand calculation: Suppose fps = 30 and we're at frame 100. With delta_timestamps for observations = [-0.2, -0.1, 0.0]:

• delta = -0.2 → frame = 100 + round(-0.2 × 30) = 100 + round(-6) = 94
• delta = -0.1 → frame = 100 + round(-0.1 × 30) = 100 + round(-3) = 97
• delta = 0.0 → frame = 100 + round(0.0 × 30) = 100

The policy sees frames [94, 97, 100] — a 200ms observation window sampled at 100ms intervals.
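
The hand calculation can be reproduced in a few lines. This is a minimal sketch of the formula as stated above, not LeRobot's loader code; `delta_to_frame_index` is an illustrative name.

```python
def delta_to_frame_index(current_index: int, delta_s: float, fps: int) -> int:
    """Apply frame_index = current_index + round(delta * fps)."""
    return current_index + round(delta_s * fps)


observation_deltas = [-0.2, -0.1, 0.0]
frames = [delta_to_frame_index(100, d, fps=30) for d in observation_deltas]
print(frames)  # [94, 97, 100]
```

Python's built-in round sends exact .5 ties to the nearest even integer, which only matters if a delta falls exactly halfway between two frames.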

Why not just use consecutive frames? Because the relevant timescale varies by task. A fast pick-and-place might need 50ms windows. A slow assembly task might need 2-second windows. Delta timestamps let you decouple the observation window from the recording FPS.

For actions, positive deltas specify the action chunk — the sequence of future actions the policy must predict. If action deltas are [0.0, 0.033, 0.066, 0.1] at 30 FPS, the policy predicts the next 4 actions (current + 3 future frames). This is called action chunking, and it's a key technique in ACT and Diffusion Policy.
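
One way to build delta lists that line up exactly with recorded frames is to generate them from frame offsets rather than typing decimals by hand. The helpers below are illustrative names, not part of LeRobot.

```python
def action_chunk_deltas(fps: int, chunk_size: int) -> list[float]:
    """Offsets in seconds for the current action plus chunk_size - 1 future ones."""
    return [i / fps for i in range(chunk_size)]


def observation_deltas(fps: int, n_obs_steps: int, stride: int = 1) -> list[float]:
    """Offsets in seconds for n_obs_steps observations ending at the present,
    spaced `stride` frames apart."""
    return [-i * stride / fps for i in reversed(range(n_obs_steps))]


print(action_chunk_deltas(fps=50, chunk_size=4))
print(observation_deltas(fps=30, n_obs_steps=3, stride=3))
```

Generating deltas this way keeps the chunk length and the observation window tied to frame counts, so changing the recording FPS does not silently shift which frames are selected.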

python
# Delta timestamps in practice
from lerobot.common.datasets.lerobot_dataset import LeRobotDataset

ds = LeRobotDataset(
    "lerobot/aloha_sim_transfer_cube_human",
    delta_timestamps={
        "observation.images.top": [-0.1, 0.0],    # 2 frames: 100ms ago + now
        "observation.state": [-0.1, 0.0],          # same window for state
        "action": [0.0, 0.02, 0.04, 0.06],      # 4-step action chunk
    }
)

frame = ds[100]
print(frame['observation.state'].shape)  # torch.Size([2, 14]) — 2 timesteps
print(frame['action'].shape)              # torch.Size([4, 14]) — 4-step chunk

Delta Timestamp Visualizer

[Interactive figure: a timeline showing the observation window (orange) and the action chunk (teal) around the current frame, and how the same deltas map to different frame indices as FPS changes.]

At fps=50, current frame=200, and delta=-0.1, what frame index does the loader retrieve?

Chapter 5: Streaming Datasets

Robot datasets are large. The ALOHA transfer cube dataset is modest at ~2 GB, but cross-embodiment datasets like Open X-Embodiment exceed 1 TB. Downloading everything before you can even inspect it is impractical. LeRobot solves this with streaming.

When you enable streaming, the dataset doesn't download files to disk. Instead, it fetches data on-the-fly from the Hugging Face Hub, caching only what's needed. The Parquet files are read in chunks via HTTP range requests, and video frames are decoded from streamed MP4 segments.

python
from lerobot.common.datasets.lerobot_dataset import LeRobotDataset

# Local loading: downloads everything first
ds_local = LeRobotDataset(
    "lerobot/aloha_sim_transfer_cube_human",
    local_files_only=False,  # default: download to cache
)
# Takes minutes on first run, instant after

# Streaming: fetches on demand
# (how streaming is enabled varies between LeRobot releases; check your installed version)
ds_stream = LeRobotDataset(
    "lerobot/aloha_sim_transfer_cube_human",
    streaming=True,
)
# Ready in seconds, but slower per-batch

When to use which? Use local for training: you'll iterate over the dataset hundreds of times, and disk I/O is orders of magnitude faster than HTTP. Use streaming for exploration: browsing a few episodes to decide if a dataset is useful before committing to a full download.

The Hub also provides a dataset viewer in the browser. Navigate to any LeRobot dataset on huggingface.co and you'll see the videos playing alongside the numerical data. This is incredibly useful for quality control: before training, scrub through a few episodes to check for bad demonstrations (the teleoperator sneezed, the object fell off the table, etc.).

| Property | Local | Streaming |
|---|---|---|
| Setup time | Minutes (first download) | Seconds |
| Per-batch latency | ~1ms (disk) | ~100ms (network) |
| Disk usage | Full dataset size | ~0 (cache only) |
| Use case | Training | Exploration, debugging |
| Random access | Instant | Requires seek in video stream |
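
To put the latency figures in the table in perspective, the sketch below estimates time spent purely waiting on data under each mode. It takes the table's order-of-magnitude latencies at face value and assumes no prefetching or parallel workers, so treat the result as a rough upper bound; the dataset size, batch size, and epoch count are made-up inputs.

```python
import math


def data_wait_seconds(num_frames: int, batch_size: int, epochs: int,
                      per_batch_latency_s: float) -> float:
    """Total time spent fetching batches, assuming strictly sequential loading."""
    batches_per_epoch = math.ceil(num_frames / batch_size)
    return batches_per_epoch * epochs * per_batch_latency_s


local = data_wait_seconds(20_000, batch_size=8, epochs=100, per_batch_latency_s=0.001)
stream = data_wait_seconds(20_000, batch_size=8, epochs=100, per_batch_latency_s=0.100)
print(f"local:     {local / 60:.1f} min waiting on data")
print(f"streaming: {stream / 3600:.1f} h waiting on data")
```

Under these assumptions the gap is minutes versus hours, which is why repeated passes over the same data favor a local copy.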

Local vs Streaming Data Flow

[Interactive figure: shows how data flows from the Hub to the training loop in each mode, with a cache layer in local mode and a direct fetch in streaming mode.]

Why should you use local loading instead of streaming for training?

Chapter 6: Teleop Data Collection

The most natural way to teach a robot is to show it what to do. You drive the robot through the desired motion yourself while recording everything: joint positions, camera images, gripper state. This is teleoperation, and it's how most robot learning datasets are collected today.

LeRobot supports several teleoperation setups, but the most accessible is the SO-100 with a leader-follower arm. You move the leader arm (the one you hold), and the follower arm (the one doing the task) mirrors your movements in real time. Both arms' joint positions are recorded, along with camera feeds.
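
The leader-follower loop can be pictured with a small simulation. Everything here is a stand-in: `FakeArm` is a hypothetical class that stores joint positions in memory, and real hardware would add timing, safety limits, and camera capture. Logging the leader's positions as the action and the follower's as the state is one common convention, assumed here for illustration.

```python
class FakeArm:
    """In-memory stand-in for a robot arm (hypothetical, no hardware involved)."""

    def __init__(self, num_joints: int = 6):
        self.joints = [0.0] * num_joints

    def read(self) -> list[float]:
        return list(self.joints)

    def write(self, targets: list[float]) -> None:
        self.joints = list(targets)


def record_episode(leader: FakeArm, follower: FakeArm, leader_motion, fps: int = 30):
    """Mirror the leader onto the follower and log one frame per step."""
    frames = []
    for frame_index, pose in enumerate(leader_motion):
        leader.write(pose)        # the operator moves the leader arm
        state = follower.read()   # follower pose before the new command
        action = leader.read()    # leader pose becomes the commanded target
        follower.write(action)    # follower mirrors the leader
        frames.append({
            "frame_index": frame_index,
            "timestamp": frame_index / fps,
            "observation.state": state,
            "action": action,
        })
    return frames


# A made-up motion: joint 0 sweeps upward over three steps.
motion = [[0.1 * i] + [0.0] * 5 for i in range(3)]
episode = record_episode(FakeArm(), FakeArm(), motion)
print(episode[-1])
```

Each logged frame pairs what the robot observed with what it was told to do next, which is the supervision signal that imitation learning consumes.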

• Step 1: Calibrate. python -m lerobot.calibrate (set joint zero positions)
• Step 2: Teleop Test. python -m lerobot.teleoperate (verify follower mirrors leader)
• Step 3: Record. python -m lerobot.record (collect episodes with camera + joints)
• Step 4: Replay. python -m lerobot.replay (play back episodes to verify quality)

bash
# Full recording pipeline for SO-100
# (flag names vary between LeRobot releases; run a command with --help to confirm)

# 1. Calibrate (only needed once per robot)
python -m lerobot.calibrate \
  --robot.type=so100 \
  --robot.port=/dev/ttyACM0 \
  --robot.leader_port=/dev/ttyACM1

# 2. Quick teleoperation test (no recording)
python -m lerobot.teleoperate \
  --robot.type=so100

# 3. Record 50 episodes of a pick-and-place task
python -m lerobot.record \
  --robot.type=so100 \
  --repo-id=your_name/pick_place_cube \
  --num-episodes=50 \
  --fps=30 \
  --warmup-time-s=5 \
  --episode-time-s=30 \
  --reset-time-s=10

# 4. Replay episode 0 to check quality
python -m lerobot.replay \
  --robot.type=so100 \
  --repo-id=your_name/pick_place_cube \
  --episode=0

Quality over quantity: 50 high-quality demonstrations beat 500 sloppy ones. During recording, focus on smooth, consistent motions. If you make a mistake, discard the episode and re-record. LeRobot's replay command lets you inspect each episode before uploading to the Hub.

The --reset-time-s=10 flag gives you 10 seconds between episodes to reset the scene (put the cube back, straighten the cloth, etc.). The --warmup-time-s=5 flag gives 5 seconds before recording starts so you can get your hands in position.
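
The timing flags translate directly into a session budget. The sketch below assumes the warmup happens once at the start and a reset follows every episode; how a given LeRobot release schedules these is not confirmed here, and `session_budget` is an illustrative helper.

```python
def session_budget(num_episodes: int, fps: int, warmup_s: float,
                   episode_s: float, reset_s: float) -> dict:
    """Estimate wall-clock time and frame count for a recording session."""
    total_s = warmup_s + num_episodes * (episode_s + reset_s)
    return {
        "total_minutes": total_s / 60,
        "frames_per_episode": int(episode_s * fps),
        "total_frames": int(num_episodes * episode_s * fps),
    }


budget = session_budget(num_episodes=50, fps=30, warmup_s=5, episode_s=30, reset_s=10)
print(budget)
```

With the values from the recording command, that comes to roughly half an hour at the robot and 45,000 recorded frames per camera.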

After recording, the dataset is automatically formatted as a LeRobotDataset with Parquet, MP4, and metadata. You can push it to the Hub with a single command:

python
from lerobot.common.datasets.lerobot_dataset import LeRobotDataset

# Push your dataset to the Hub
ds = LeRobotDataset("your_name/pick_place_cube")
ds.push_to_hub()

Teleop Data Flow

[Interactive figure: traces data from the leader arm through recording to the final dataset.]

What is the purpose of the replay command after recording episodes?

Chapter 7: Connections

This chapter is the trailhead. You now understand the motivation for robot learning and the infrastructure that makes it possible. Here's where each subsequent chapter takes you:

| Chapter | What You'll Learn | Key Idea |
|---|---|---|
| Ch 2: Classical Robotics | Forward/inverse kinematics, Jacobians, feedback control | What works, and why it breaks |
| Ch 3: RL Foundations | MDPs, policy gradients, value functions | Learning from rewards, not demonstrations |
| Ch 4: Imitation Learning | Behavior cloning, DAgger, action chunking | Learning from expert demonstrations |
| Ch 5: Policies | ACT, Diffusion Policy, TDMPC, VLA | The specific architectures that work |
| Ch 6: Vision-Language-Action | Foundation models for robotics | One model for any task, any robot |

The learning progression: Classical robotics (Ch 2) teaches you what hand-engineered solutions look like and where they fail. RL (Ch 3) shows how a robot can learn from trial-and-error. Imitation learning (Ch 4) shows how it can learn from watching an expert. Policies (Ch 5) are the specific neural network architectures. VLAs (Ch 6) are the frontier — one model that can do any task described in natural language.

The thread connecting all chapters is a single question: how do we get data, and what do we do with it? Classical robotics needs no data but breaks on hard tasks. RL generates its own data but needs millions of trials. Imitation learning uses human demonstrations but needs hundreds per task. VLAs leverage internet-scale pretraining so they need only a handful of demonstrations for a new task.

Each approach trades off data efficiency against generality. The trend in the field is unmistakable: toward more data, simpler pipelines, and larger models. LeRobot is the infrastructure that makes this practical.

Where to go next: If you want to understand what LeRobot is replacing, read Chapter 2: Classical Robotics. It covers forward kinematics, inverse kinematics, and Jacobian control — the fundamentals that every roboticist should know, even if they ultimately use learned policies.
What is the fundamental tradeoff across robot learning paradigms (classical, RL, imitation, VLA)?