Stanford CS 231n · Lecture 17 · Robot Learning

Teaching Machines to Touch the World

Vision got us eyes. Language got us words. But robots need hands — and hands that learn from experience, not just instructions.

Perception-Action Loops · Sim-to-Real Transfer · Imitation Learning · Robotic Foundation Models
Roadmap

What You'll Master

Chapter 01

Why Robots Need Learning

You've built a convolutional network that classifies images with superhuman accuracy. You've trained a transformer that writes coherent essays. Now imagine bolting either onto a robot arm and asking it to fold a towel. What goes wrong?

Everything. The towel deforms unpredictably. The robot's gripper slips. The camera sees a slightly different angle than training. The table has a crease. A single pixel-level change cascades into a completely different physical outcome. This is the chasm between perception (recognizing the world) and action (changing it).

The Four Curses of Robot Learning

Robot learning differs from standard supervised learning in four fundamental ways. Each is a genuine obstacle, not a minor annoyance:

Curse 1
Stochasticity

The same action in the same state can produce different outcomes. You push a block and it slides 3 cm or 5 cm depending on friction, contact angle, surface dust. In supervised learning, the label for an image is fixed. In robotics, the "label" (next state) is a probability distribution.

Curse 2
Credit Assignment

A robot stacks 10 blocks and the tower falls on block 7. Which action caused the failure? Block 3 was placed 1mm off-center, creating a lean that compounded. The bad action happened 4 steps before the consequence. In supervised learning, the loss is immediate. In robotics, consequences are delayed.

Curse 3
Non-Differentiability

You can backpropagate through a neural network because every operation is differentiable. You cannot backpropagate through physics. When a robot's gripper contacts a table, there's a discontinuous collision event. The gradient of "did the cup break?" with respect to gripper force is either zero or undefined.

Curse 4
Nonstationarity

In image classification, the data distribution is fixed: cats look like cats tomorrow. For a robot, each action changes the world, which changes what the robot sees, which changes what it should do next. The "training distribution" is constantly shifting because the robot is part of it.

The MDP Framework

Despite these curses, we need a formal framework. Robotics uses the Markov Decision Process (MDP) — the same framework from RL, but now the states are physical and the actions move real actuators.

MDP Tuple M = (S, A, P, R, γ)

S = state space     joint angles, object positions, camera pixels
A = action space     motor torques, gripper commands, velocity targets
P(s'|s, a) = transition     physics: what happens when you act
R(s, a) = reward     did the task succeed? how efficiently?
γ ∈ [0, 1) = discount     prefer sooner rewards to later ones
Example — Cart-Pole as MDP

State: (cart position x, cart velocity ẋ, pole angle θ, angular velocity θ̇). Four numbers.

Action: Push left or push right. Two choices.

Transition: Newtonian mechanics — gravity pulls the pole, force accelerates the cart. Deterministic but nonlinear.

Reward: +1 for every timestep the pole stays upright. Episode ends when it falls.
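
To make the tuple concrete, here is a minimal sketch of the cart-pole MDP as code. The physical constants, the Euler integration step, and the failure thresholds are assumptions borrowed from the classic textbook setup, not values given in this lecture.

```python
import math

# Assumed constants (classic cart-pole benchmark values, not from the lecture).
GRAVITY = 9.8          # m/s^2
CART_MASS = 1.0        # kg
POLE_MASS = 0.1        # kg
POLE_HALF_LEN = 0.5    # m, half the pole length
FORCE_MAG = 10.0       # N applied per push
DT = 0.02              # s per timestep
X_LIMIT = 2.4          # m, cart leaves the track beyond this
ANGLE_LIMIT = math.radians(12)  # pole counts as fallen beyond this

def step(state, action):
    """Transition P(s'|s,a): deterministic, nonlinear. action 0 = left, 1 = right."""
    x, x_dot, theta, theta_dot = state
    force = FORCE_MAG if action == 1 else -FORCE_MAG
    total_mass = CART_MASS + POLE_MASS
    pole_ml = POLE_MASS * POLE_HALF_LEN
    sin_t, cos_t = math.sin(theta), math.cos(theta)

    temp = (force + pole_ml * theta_dot ** 2 * sin_t) / total_mass
    theta_acc = (GRAVITY * sin_t - cos_t * temp) / (
        POLE_HALF_LEN * (4.0 / 3.0 - POLE_MASS * cos_t ** 2 / total_mass)
    )
    x_acc = temp - pole_ml * theta_acc * cos_t / total_mass

    # Explicit Euler integration of the four state variables.
    next_state = (
        x + DT * x_dot,
        x_dot + DT * x_acc,
        theta + DT * theta_dot,
        theta_dot + DT * theta_acc,
    )
    done = abs(next_state[0]) > X_LIMIT or abs(next_state[2]) > ANGLE_LIMIT
    reward = 1.0  # +1 for every timestep taken
    return next_state, reward, done

def rollout(policy, state=(0.0, 0.0, 0.02, 0.0), max_steps=500):
    """Run one episode and return the undiscounted return."""
    total = 0.0
    for _ in range(max_steps):
        state, reward, done = step(state, policy(state))
        total += reward
        if done:
            break
    return total
```

For example, `rollout(lambda s: 1)` (always push right) ends after a handful of steps, while `rollout(lambda s: 1 if s[2] > 0 else 0)` (push toward the lean) survives longer.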

Example — Cloth Folding as MDP

State: Point cloud of cloth surface (thousands of 3D points) + robot arm joint angles.

Action: Pick point (x, y) on cloth, place at target (x', y'). Continuous, high-dimensional.

Transition: Deformable body physics — wildly stochastic. Same pick can fold or crumple depending on initial wrinkles.

Reward: IoU (intersection over union) between current cloth shape and target folded shape. Hard to compute in real time.

Why Reward Design is the Hidden Bottleneck

In Atari, the reward is the game score — given to you for free. In robotics, you have to invent the reward. Want a robot to pour water into a glass? You need a sensor that measures water level. Want it to fold a shirt neatly? You need to define "neat" mathematically. This is reward engineering, and it's often harder than the learning itself.

Reward Hacking

A robot told to "move the cup to the target" with reward = −(distance to target) learns to fling the cup at maximum speed. It reaches the target — and shatters the cup. The reward function was technically satisfied. This failure mode, called reward hacking, is ubiquitous in robotics. The robot optimizes exactly what you told it, not what you meant.

The Sim-to-Real Gap

Training in the real world is slow (one trial per second) and dangerous (robots break things). So we simulate. But simulation is imperfect: friction coefficients are wrong, lighting is too uniform, contact dynamics are simplified. A policy trained in simulation may fail completely on the real robot. This sim-to-real gap is the central practical challenge of robot learning.

The Core Tension

Simulation gives you unlimited data but imperfect physics. The real world gives you perfect physics but limited data. Every robot learning method is, at its core, a strategy for navigating this tradeoff.

Chapter 02

Robot Perception

Computer vision asks: "What's in this image?" Robot vision asks: "What's in this image and what should I do about it?" The difference is profound. A vision model can take 100ms to classify an image. A robot reaching toward a moving object needs perception at 30 Hz with sub-centimeter accuracy. Perception is not a preprocessing step — it's part of the control loop.

The Perception-Action Loop

Every robot, from a Roomba to a surgical arm, runs the same loop:

The Perception-Action Loop
  1. Sense: Read cameras, LiDAR, joint encoders, force sensors, tactile arrays.
  2. Perceive: Extract a state representation from raw sensor data.
  3. Decide: Feed the state to a policy π(a|s) to choose an action.
  4. Act: Send motor commands to actuators.
  5. Observe: Measure the outcome. Did the world change as expected?
  6. Repeat at 10–1000 Hz.
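
The six steps above can be sketched as a control loop. This is a toy under stated assumptions: `ToyRobot`, its noisy sensor, the smoothing filter, and the proportional policy are all hypothetical stand-ins, chosen only to show where each step of the loop lives in code.

```python
import random

class ToyRobot:
    """Hypothetical 1-D robot: the true position is hidden behind a noisy sensor."""
    def __init__(self, position=0.0, seed=0):
        self.position = position
        self._rng = random.Random(seed)

    def read_sensor(self):
        return self.position + self._rng.gauss(0.0, 0.01)

    def apply(self, velocity, dt):
        self.position += velocity * dt

def perceive(raw, prev_estimate, alpha=0.5):
    """Turn a raw reading into a state estimate (exponential smoothing)."""
    return alpha * raw + (1.0 - alpha) * prev_estimate

def policy(state, goal, gain=2.0, max_speed=1.0):
    """Proportional controller standing in for pi(a|s)."""
    velocity = gain * (goal - state)
    return max(-max_speed, min(max_speed, velocity))

def run_loop(robot, goal, hz=50, steps=500):
    dt = 1.0 / hz
    estimate = robot.read_sensor()
    for _ in range(steps):                   # 6. repeat at `hz`
        raw = robot.read_sensor()            # 1. sense
        estimate = perceive(raw, estimate)   # 2. perceive
        action = policy(estimate, goal)      # 3. decide
        robot.apply(action, dt)              # 4. act
        # 5. observe: the outcome shows up in the next sensor reading
    return robot.position
```

Note that perception sits inside the loop: any latency or noise in `perceive` feeds directly into the next action.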
[Interactive demo: Perception-Action Loop. Steps through one cycle of the loop for a pick-and-place task, showing how state, action, and reward change at each step.]

Sensor Modalities

Sensor | What It Measures | Strengths | Weaknesses
RGB Camera | Color images (pixels) | Rich, cheap, abundant data | Depth-ambiguous, lighting-sensitive
Depth / RGBD | Per-pixel distance | 3D geometry directly | Noisy on reflective/transparent surfaces
LiDAR | 3D point clouds | Precise, long-range | Sparse, expensive, no color
Tactile | Contact pressure maps | Crucial for manipulation | Local (only where touching)
Force/Torque | Forces at wrist/joints | Detects contact events | No spatial information
Proprioception | Joint angles, velocities | Always available, precise | Only knows robot, not world

The Representation Bottleneck

Raw sensor data is massive — a single RGB image is 640×480×3 = 921,600 numbers. A policy network that takes raw pixels as input must learn both perception and control simultaneously. This is like asking someone to learn chess while blindfolded, interpreting the board from descriptions of pixel colors.

The choice of state representation — what you extract from sensors before feeding to the policy — is perhaps the most consequential design decision in robot learning. Here are the main options:

Representation
Raw Pixels

Feed the image directly to a CNN. Maximum information, but requires learning visual features and control jointly. Works when you have millions of training frames (simulated Atari). Struggles with real-robot data budgets of thousands of episodes.

Representation
Learned Features

Use a pretrained vision encoder (ResNet, ViT, CLIP) to extract a compact feature vector. The policy sees a 512-dimensional vector instead of a 921K-dimensional image. Faster to train, more sample-efficient, but the features may miss task-relevant details (like the exact angle of a screw).

Representation
Keypoints

Detect a sparse set of 2D or 3D points on objects of interest — corners, handles, edges. The state becomes ~10–50 (x, y, z) coordinates. Extremely compact and interpretable. But keypoint detectors can fail on novel objects, and you lose shape information between keypoints.

Representation
Point Clouds

A set of 3D points from depth sensors or LiDAR. Richer than keypoints (captures full geometry), sparser than pixels. Works well with PointNet-style architectures. The gold standard for manipulation tasks involving complex 3D shapes.

The Representation Tradeoff

More compressed representations (keypoints) need less data but lose information. Less compressed representations (pixels) preserve information but need orders of magnitude more training. The sweet spot depends on your data budget: 100 real demos → use keypoints. 100,000 sim episodes → use pixels. Millions of internet images → use pretrained features.

Chapter 03

Reinforcement Learning for Robots

You know RL from Atari: train a DQN on millions of frames, achieve superhuman Breakout scores. Now try to apply the same recipe to a real robot arm. The robot gets one attempt per 5 seconds. At Atari's 200 million frame budget, training would take 31 years. Clearly, we need different strategies.

DQN Recap (the Atari Success Story)

Let's recall why DQN worked so spectacularly for games. The architecture is simple: a CNN takes 4 stacked grayscale frames (84×84 each) as input, passes them through three conv layers and two fully-connected layers, and outputs one Q-value per possible action (4–18 joystick directions). The agent picks $\arg\max_a Q(s, a)$ with ε-greedy exploration.

Three innovations made this work:

Innovation 1
Experience Replay

Store every transition (s, a, r, s') in a buffer. Sample mini-batches randomly for training. This breaks temporal correlations (consecutive game frames are nearly identical) and reuses each transition many times. One real interaction generates dozens of gradient updates.
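
A replay buffer is only a few lines of code. The sketch below is a minimal version under simple assumptions: uniform sampling and a fixed capacity with oldest-first eviction. The class and method names are illustrative, not taken from any particular library.

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-capacity store of (s, a, r, s', done) transitions."""

    def __init__(self, capacity, seed=None):
        self._buffer = deque(maxlen=capacity)  # oldest transitions drop off first
        self._rng = random.Random(seed)

    def add(self, state, action, reward, next_state, done):
        self._buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        """Uniform random mini-batch, which breaks temporal correlation."""
        batch = self._rng.sample(list(self._buffer), batch_size)
        states, actions, rewards, next_states, dones = zip(*batch)
        return states, actions, rewards, next_states, dones

    def __len__(self):
        return len(self._buffer)
```

Because sampling is independent of insertion order, the same transition can appear in many mini-batches, which is where the reuse comes from.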

Innovation 2
Target Networks

The Bellman target $y = r + \gamma \max_{a'} Q(s', a')$ uses Q itself, so the target moves as Q improves. This is like trying to hit a moving goalpost. Solution: freeze a copy $Q_{\text{target}}$ and update it only every 10,000 steps. The target is now locally stable.
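
A minimal NumPy sketch of the two pieces: computing the Bellman target from the frozen network's Q-values, and syncing the frozen copy on a fixed schedule. The `TargetSync` helper and the dictionary-of-arrays parameter format are assumptions made for illustration; a real implementation would use its framework's own parameter containers.

```python
import numpy as np

def td_targets(rewards, next_q_target, dones, gamma=0.99):
    """y = r + gamma * max_a' Q_target(s', a'); no bootstrapping past episode end."""
    rewards = np.asarray(rewards, dtype=float)
    dones = np.asarray(dones, dtype=float)
    max_next = np.max(np.asarray(next_q_target, dtype=float), axis=1)
    return rewards + gamma * (1.0 - dones) * max_next

class TargetSync:
    """Keeps a frozen copy of the online parameters, refreshed every `period` steps."""

    def __init__(self, online_params, period=10_000):
        self.period = period
        self.steps = 0
        self.target_params = {k: v.copy() for k, v in online_params.items()}

    def step(self, online_params):
        self.steps += 1
        if self.steps % self.period == 0:
            self.target_params = {k: v.copy() for k, v in online_params.items()}
        return self.target_params
```

Between syncs the targets are computed from stale parameters on purpose: the regression problem stays fixed long enough for the online network to make progress on it.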

Innovation 3
Frame Stacking

A single frame is ambiguous: is the ball moving left or right? Stacking 4 consecutive frames gives the CNN access to velocity information without explicitly computing it.

From Games to Real Robots

DQN's three innovations are necessary but not sufficient for robotics. Here's what else you need:

Sample Efficiency Crisis

DQN used 200 million Atari frames. A real robot arm doing bin-picking at 5 Hz gets 18,000 interactions per hour. At DQN's data appetite, training takes 11,111 hours = 1.3 years of continuous operation. And that's assuming no hardware failures, no resets, no human supervision. Model-free RL on real robots requires fundamentally better sample efficiency.

Domain Randomization

If sim-to-real is the problem, make simulation harder than reality. Domain randomization trains the policy across thousands of simulated environments with randomly varied physics: friction ∈ [0.1, 1.0], gravity ∈ [9.5, 10.1] m/s², object mass ∈ [0.5, 2.0]× nominal, camera position jittered by ±5 cm, light sources randomly placed. If the policy works across all these variations, it will likely work in the one real-world setting too.

Domain Randomization: $\pi^* = \arg\max_\pi \, \mathbb{E}_{\xi \sim p(\xi)} \left[ \sum_t \gamma^t R(s_t, a_t; \xi) \right]$

ξ = randomized environment parameters (friction, mass, lighting, etc.)
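
In code, the outer expectation over ξ is typically approximated by sampling a fresh set of parameters per episode. The sketch below assumes uniform sampling over the ranges quoted above; `EnvParams` and the `evaluate_return` callback (which would run the policy in a simulator configured with those parameters) are hypothetical names.

```python
import random
from dataclasses import dataclass

@dataclass(frozen=True)
class EnvParams:
    friction: float
    gravity: float            # m/s^2
    mass_scale: float         # multiplier on nominal object mass
    camera_offset_cm: tuple   # (dx, dy, dz) jitter in centimeters

def sample_env_params(rng):
    """Draw one xi ~ p(xi), assuming independent uniform ranges."""
    return EnvParams(
        friction=rng.uniform(0.1, 1.0),
        gravity=rng.uniform(9.5, 10.1),
        mass_scale=rng.uniform(0.5, 2.0),
        camera_offset_cm=tuple(rng.uniform(-5.0, 5.0) for _ in range(3)),
    )

def randomized_objective(evaluate_return, n_envs=1000, seed=0):
    """Monte Carlo estimate of E_xi[ return of the policy under xi ]."""
    rng = random.Random(seed)
    returns = [evaluate_return(sample_env_params(rng)) for _ in range(n_envs)]
    return sum(returns) / len(returns)
```

The choice of ranges is the real design decision: too narrow and the real world falls outside the envelope, too wide and the policy becomes overly conservative.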

Landmark Results

AlphaGo (2016) → AlphaGo Zero (2017) → AlphaZero (2017) → MuZero (2020): This progression illustrates a key theme. AlphaGo used human expert games. AlphaGo Zero eliminated human data entirely, learning from self-play alone — and became stronger. AlphaZero generalized to chess and shogi. MuZero learned without even knowing the rules of the game, learning its own world model internally. Each step reduced the human knowledge required.

Quadrupedal Locomotion: Policies trained in simulation with domain randomization transfer to real quadruped robots walking over stairs, rubble, and ice. The key insight: locomotion is a relatively low-dimensional control problem (12 joint angles) with fast feedback (IMU at 200Hz), making sim-to-real tractable.

Dexterous Manipulation (OpenAI, 2019): A simulated Shadow Hand learned to solve a Rubik's cube using RL with massive domain randomization: 6144 parallel simulations, randomizing 37 physics parameters. The policy transferred to the real hand — but required billions of simulated episodes and months of compute.

The Real Lesson

Model-free RL works for robots when: (1) you can simulate the task cheaply, (2) the action space is moderate-dimensional, and (3) you can afford enormous compute. For tasks where simulation is unreliable (deformable objects, liquids, soft contacts) or data is limited, we need other approaches: model-based RL, imitation learning, or foundation models.

🔨 Derivation: Derive the Sim-to-Real Transfer Bound (Domain Adaptation)

Let π be a policy trained in simulation. The sim-to-real gap can be formalized as a domain adaptation problem. Let $d_{\text{sim}}$ denote the state distribution under the simulator dynamics, and $d_{\text{real}}$ the state distribution under the real dynamics. The expected return gap satisfies:

$|J_{\text{real}}(\pi) - J_{\text{sim}}(\pi)| \le \frac{2\gamma}{(1-\gamma)^2} \cdot \varepsilon_{\text{model}}$

where $\varepsilon_{\text{model}} = \max_{s,a} \mathrm{TV}(P_{\text{sim}}(s'|s,a), P_{\text{real}}(s'|s,a))$.

Your task: (1) Derive this bound using the simulation lemma (relate $J_{\text{real}} - J_{\text{sim}}$ to the per-step transition error). (2) Show why the $(1-\gamma)^2$ denominator means even small per-step errors compound catastrophically for long horizons. (3) Explain how domain randomization reduces the effective $\varepsilon_{\text{model}}$.

Hint 1: The performance difference between two MDPs with the same reward but different dynamics can be written $J_{\text{real}}(\pi) - J_{\text{sim}}(\pi) = \sum_t \gamma^t \, \mathbb{E}_{s \sim d^t_{\text{real}}}[A_{\text{sim}}(s, \pi(s))]$, where $A_{\text{sim}}(s, a) = R(s, a) + \gamma \, \mathbb{E}_{s' \sim P_{\text{real}}}[V_{\text{sim}}(s')] - V_{\text{sim}}(s)$ measures how much a real transition changes the value predicted under sim dynamics. The key is bounding how the state distributions diverge over time.
Hint 2: After T steps, the total variation between sim and real state distributions is bounded by $\mathrm{TV}(d^T_{\text{sim}}, d^T_{\text{real}}) \le T \cdot \varepsilon_{\text{model}}$. This is a union bound: at each step, the distributions can diverge by at most $\varepsilon_{\text{model}}$. Then sum the series $\sum_t \gamma^t \cdot t \cdot \varepsilon_{\text{model}} = \gamma \varepsilon_{\text{model}} / (1-\gamma)^2$.
Hint 3: Domain randomization trains π to succeed under many dynamics: $\pi^* = \arg\max_\pi \mathbb{E}_\xi[J_\xi(\pi)]$. If the real dynamics fall within the randomization envelope, then the policy's robustness effectively reduces the sensitivity to any single $\varepsilon_{\text{model}}$. Informally, the policy's return becomes smooth (Lipschitz) with respect to the dynamics parameters, so small mismatches produce small performance drops.

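Step 1 (Simulation Lemma): For two MDPs $M_1$, $M_2$ with the same reward (bounded by $|R| \le R_{\max}$) but different transitions, and policy π, write $J_i(\pi) = \sum_{t=0}^{\infty} \gamma^t \, \mathbb{E}_{s \sim d^t_i}[R(s, \pi(s))]$. Two expectations of a function bounded by $R_{\max}$ differ by at most $2 R_{\max}$ times the total variation between the distributions, so:

$|J_1(\pi) - J_2(\pi)| \le 2 R_{\max} \sum_{t=0}^{\infty} \gamma^t \cdot \mathrm{TV}(d^t_1, d^t_2)$

Step 2 (Bound per-step divergence): At each step, the new state distributions satisfy $\mathrm{TV}(d^{t+1}_1, d^{t+1}_2) \le \mathrm{TV}(d^t_1, d^t_2) + \varepsilon_{\text{model}}$. Both MDPs start from the same initial state distribution, so by induction: $\mathrm{TV}(d^t_1, d^t_2) \le t \cdot \varepsilon_{\text{model}}$.

Step 3 (Sum the series): $\sum_{t=0}^{\infty} \gamma^t \cdot t = \gamma/(1-\gamma)^2$. So: $|J_{\text{real}} - J_{\text{sim}}| \le 2 R_{\max} \cdot \gamma \cdot \varepsilon_{\text{model}} / (1-\gamma)^2$. With rewards normalized so that $R_{\max} = 1$, this is exactly the stated $(1-\gamma)^2$ bound.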

Step 4 (Implication for long horizons): With γ = 0.99 (effective horizon ~100), the bound is roughly 100× worse per unit $\varepsilon_{\text{model}}$ than with γ = 0.9 (effective horizon ~10). This is why MPC (re-plan every step, effectively γ ≈ 0) transfers so much better than long-horizon RL policies.

The key insight: The sim-to-real gap isn't just about how accurate your simulator is ($\varepsilon_{\text{model}}$); it's about how that error compounds over time. Short-horizon methods (MPC, reactive policies) are inherently more robust to sim-to-real gaps than long-horizon planning.
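
A quick numerical check of the horizon dependence, as a sketch: it verifies the series identity $\sum_t t\gamma^t = \gamma/(1-\gamma)^2$ by brute force and evaluates the stated bound $2\gamma\,\varepsilon_{\text{model}}/(1-\gamma)^2$ for a given discount. The function names are illustrative.

```python
def horizon_factor(gamma, n_terms=10_000):
    """Brute-force partial sum of sum_t t * gamma^t."""
    return sum(t * gamma ** t for t in range(n_terms))

def closed_form(gamma):
    """Closed form of the same series: gamma / (1 - gamma)^2."""
    return gamma / (1.0 - gamma) ** 2

def return_gap_bound(gamma, eps_model, r_max=1.0):
    """Upper bound on |J_real - J_sim| for per-step TV error eps_model."""
    return 2.0 * r_max * closed_form(gamma) * eps_model

for gamma in (0.9, 0.99):
    print(gamma, closed_form(gamma), return_gap_bound(gamma, eps_model=0.01))
```

The closed form evaluates to 90 at γ = 0.9 and 9,900 at γ = 0.99, a ratio of 110, which is the "roughly 100×" gap between the two horizons.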

Checkpoint — Before you move on
Model-free RL solved Rubik's Cube in simulation using 6144 parallel envs. But it took billions of episodes. Explain WHY model-free RL is so sample-inefficient for robotics, and what specific property of model-based methods addresses this.
Model Answer

Model-free RL discards the transition structure: it only remembers (s, a, r, s') tuples and uses them to update Q-values or policy gradients. It doesn't try to understand WHY s led to s' after action a. This means every new task needs fresh exploration from scratch.

Model-based methods learn the dynamics f(s,a) → s'. Once learned, this model can be queried for any action sequence — generating unlimited imaginary rollouts. One real transition teaches the model about physics that generalizes to many hypothetical actions. The sample efficiency gap is roughly the branching factor: if there are 10 possible actions, one real transition informs the model about all 10 next states, while model-free RL only learns about the one action taken.

The tradeoff: model-based methods are only as good as their model. For contact-rich tasks where the model is inaccurate, model-free methods avoid compounding model errors.

Chapter 04

Model-Based Learning — Learning the World

Model-free RL treats the environment as a black box: act, observe reward, update policy. Model-based RL opens the box. Instead of learning a policy directly, first learn a world model — a neural network that predicts what happens next:

World Model: $\hat{s}_{t+1} = f_\phi(s_t, a_t)$

Given current state and action, predict the next state

Once you have a world model, you can plan: simulate thousands of action sequences in your head (in the model), pick the one with the best predicted outcome, execute only the first action, then re-plan. This is called Model Predictive Control (MPC).

Model Predictive Control (MPC)
  1. Observe current state $s_t$ from sensors.
  2. Sample N random action sequences $\{a_{t:t+H}\}^{(1)}, \dots, \{a_{t:t+H}\}^{(N)}$, each of horizon H steps.
  3. Roll out each sequence through the learned model, $\hat{s}_{t+1} = f_\phi(\hat{s}_t, a_t)$, accumulating predicted rewards.
  4. Select the sequence with the highest total predicted reward.
  5. Execute only the first action $a_t$.
  6. Re-plan from the new observed state $s_{t+1}$. Go to step 1.
Why Re-plan?

The model is imperfect. After one real step, the actual state $s_{t+1}$ differs from the predicted $\hat{s}_{t+1}$. By re-planning from the real state every step, MPC self-corrects. It plans H steps ahead but commits to only one of them before checking the model against reality. This is robustness through humility.
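
The MPC loop can be sketched with random shooting on a toy problem. Everything here is an assumption made for illustration: a 2-D point robot whose true dynamics are "state plus action", a deliberately biased learned model, and a reward equal to the negative distance to the goal.

```python
import numpy as np

def true_step(state, action):
    """Real dynamics of the toy robot (unknown to the planner)."""
    return state + action

def learned_model(state, action, gain_error=0.2):
    """Imperfect model: overestimates how far each action moves the robot."""
    return state + (1.0 + gain_error) * action

def plan(state, goal, rng, n_samples=512, horizon=5, max_step=0.2):
    """Random shooting: sample N sequences, roll out in the model, keep the best."""
    seqs = rng.uniform(-max_step, max_step, size=(n_samples, horizon, 2))
    states = np.repeat(state[None, :], n_samples, axis=0)
    total_reward = np.zeros(n_samples)
    for t in range(horizon):
        states = learned_model(states, seqs[:, t])
        total_reward += -np.linalg.norm(states - goal, axis=1)
    best = int(np.argmax(total_reward))
    return seqs[best, 0]  # execute only the first action of the best sequence

def run_mpc(start, goal, steps=40, seed=0):
    rng = np.random.default_rng(seed)
    state = np.asarray(start, dtype=float)
    goal = np.asarray(goal, dtype=float)
    for _ in range(steps):
        action = plan(state, goal, rng)     # plan inside the learned model
        state = true_step(state, action)    # act in the real world, then re-plan
    return state
```

Even with a 20% model bias, `run_mpc((0.0, 0.0), (1.0, 1.0))` ends near the goal, because every plan starts from the true observed state rather than from the model's own prediction.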

State Representations for Dynamics

What should the world model predict? The choice of state representation completely determines the model's difficulty:

Approach
Pixel Dynamics (Deep Visual Foresight)

Finn & Levine (2017) trained a model to predict future video frames given actions. Input: current frame + robot action. Output: predicted next frame. This is maximally general — it works for any task visible in a camera — but generating photorealistic futures is extremely hard. Predicted frames get blurry after 5–10 steps as uncertainty compounds.

Approach
Keypoint Dynamics

Manuelli et al. (2020) detected sparse keypoints on objects and predicted their future positions. Instead of predicting 307,200 pixel values, you predict ~30 (x, y, z) coordinates. Dramatically easier to learn, dramatically more sample-efficient. But keypoint detection must generalize to novel objects, and you lose shape information.

Approach
Particle Dynamics

Wang et al. (2023) represented objects as sets of particles (like a point cloud) and used graph neural networks (GNNs) to predict particle interactions. Each particle is a node; nearby particles are connected by edges. The GNN learns local physics: when two particles are close, they repel (rigidity) or attract (cohesion). This captures deformable objects, fluids, and granular materials that keypoints and pixels struggle with.

[Interactive demo: Model-Based Planning (MPC). A 2D robot (blue circle) must reach the goal (gold star) while the learned model predicts future states (translucent trails). Model accuracy and planning horizon are adjustable to show how they affect performance.]

The Model Accuracy vs. Planning Horizon Tradeoff

A perfect model lets you plan arbitrarily far ahead. An imperfect model accumulates errors with each predicted step. After H steps, the error grows roughly as:

Error Compounding: $\text{error}(H) \approx (1 + \varepsilon)^H - 1$

ε = single-step model error. Grows exponentially with horizon H.

This means a 5% per-step error becomes a 34% error at H = 6, and a 108% error at H = 15. Long-horizon planning requires either very accurate models or short re-planning intervals.
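
The percentages quoted above correspond to compounding a relative error of ε at every predicted step, that is $(1+\varepsilon)^H - 1$. A short check, with an illustrative function name:

```python
def compounded_error(eps, horizon):
    """Relative error after `horizon` model steps if each step is off by `eps`."""
    return (1.0 + eps) ** horizon - 1.0

for h in (1, 6, 15, 30):
    print(f"H={h:2d}: {compounded_error(0.05, h):.0%}")
```

The loop makes the trend visible: the error roughly doubles the true state by H = 15 and keeps growing geometrically beyond that.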

When Models Fail

Model-based methods struggle with contact-rich tasks: inserting a peg into a hole, turning a key in a lock, tying a knot. These involve sudden discontinuous dynamics (contact/no-contact transitions) that smooth neural network models approximate poorly. For such tasks, model-free methods or imitation learning often work better.

When Models Win

Model-based methods excel when: (1) the dynamics are relatively smooth (pushing, reaching, locomotion), (2) data is scarce (tens of episodes, not millions), and (3) the task requires long-horizon reasoning (multi-step assembly). With 100 real-world episodes, a learned model + MPC can solve tasks that model-free RL needs 100,000 episodes for.

Chapter 05

Imitation Learning — Learning from Demonstrations

What if you skip reward engineering, skip simulation, and just show the robot what to do? A human teleoperates the arm, picks up the cup, and places it on the shelf. Record the observation-action pairs. Train a policy to mimic them. This is imitation learning — and it's the fastest path from zero to a working robot policy.

Behavioral Cloning (BC)

The simplest form: treat it as supervised learning. You have a dataset of expert demonstrations $D = \{(o_1, a_1), (o_2, a_2), \dots\}$. Train a neural network $\pi_\theta(a \mid o)$ to predict the expert's action from the observation:

Behavioral Cloning Loss: $L(\theta) = \mathbb{E}_{(o, a) \sim D} \left[ \| \pi_\theta(o) - a \|^2 \right]$

MSE between predicted and expert action. That's it.
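
As a sketch of how little machinery this needs, the example below fits a linear policy by least squares, which minimizes the same MSE objective in closed form. The linear policy class is an assumption made to keep the example short; the lecture's setting uses a neural network trained by gradient descent.

```python
import numpy as np

def _with_bias(observations):
    """Append a constant 1 feature so the policy can learn an offset."""
    obs = np.asarray(observations, dtype=float)
    return np.hstack([obs, np.ones((obs.shape[0], 1))])

def fit_bc_policy(observations, actions):
    """Least-squares weights W minimizing || [o, 1] W - a ||^2 over the dataset."""
    features = _with_bias(observations)
    weights, *_ = np.linalg.lstsq(features, np.asarray(actions, dtype=float), rcond=None)
    return weights

def bc_policy(weights, observation):
    """Predict the action for a single observation."""
    o = np.append(np.asarray(observation, dtype=float), 1.0)
    return o @ weights

def bc_loss(weights, observations, actions):
    """Mean squared error between predicted and expert actions."""
    residual = _with_bias(observations) @ weights - np.asarray(actions, dtype=float)
    return float(np.mean(np.sum(residual ** 2, axis=1)))
```

On demonstrations from a linear expert the fit is exact; the training loss says nothing, however, about states outside the demonstrated distribution.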

This is literally just regression. And it works — for about three seconds. Then the robot drifts off the expert's trajectory, encounters states the expert never visited, and has no idea what to do. This failure mode has a name.

Distribution Shift: The Compounding Error Problem

During training, the policy sees states from the expert's trajectory. During execution, the policy sees states from its own trajectory. Even a tiny per-step error ε causes the robot to visit states the expert never demonstrated. In those states, the policy's prediction is garbage, causing more drift, causing worse predictions.

Compounding Error: Total error $\le \varepsilon \cdot T^2$

ε = per-step error, T = episode length. Quadratic growth!
Why Quadratic?

At step 1, you make error ε. At step 2, you're in a slightly wrong state, so your error is ε + ε · δ, where δ reflects the distributional shift. At step t, you've accumulated O(t) drift, and each drifted step adds O(ε) error. Total: $\sum_{t=1}^{T} O(t\varepsilon) = O(\varepsilon T^2)$. For a 100-step episode with ε = 0.01, the bound on BC's error grows to 0.01 × 10,000 = 100. Completely diverged.

[Interactive demo: Behavioral Cloning vs DAgger. Left: the BC policy (red) drifts from the expert trajectory (gold). Right: DAgger (green) corrects by querying the expert in visited states. Both policies attempt the same curved path, and a noise setting shows how BC degrades faster.]

DAgger: Dataset Aggregation

DAgger (Ross et al., 2011) fixes distribution shift by collecting data on the policy's own trajectory. The idea is elegant:

DAgger Algorithm
  1. Collect initial expert demonstrations $D_0$.
  2. Train policy $\pi_\theta$ on $D_0$.
  3. Execute $\pi_\theta$ in the environment, visiting states $s_1, s_2, \dots$
  4. Query the expert: "What would you do in states $s_1, s_2, \dots$?" Record labels.
  5. Aggregate: $D_1 = D_0 \cup \{(s_i, a_i^{\text{expert}})\}$.
  6. Retrain on $D_1$. Go to step 3.

DAgger reduces the error from $O(\varepsilon T^2)$ to $O(\varepsilon T)$: linear instead of quadratic. The key: by training on states the policy actually visits, you close the distribution gap. The price: you need an expert available to label on-policy states, which is expensive.
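
The loop structure can be sketched on a deliberately trivial 1-D system. All specifics are assumptions: dynamics "next state = state + action + noise", a linear expert with gain −0.5, and a one-parameter linear policy. Because the expert is linear, this toy only exercises the mechanics of the algorithm (roll out the learner, label with the expert, aggregate, retrain); it does not demonstrate the error reduction itself.

```python
import numpy as np

def expert_action(state):
    """Hypothetical expert: drive the state toward zero."""
    return -0.5 * state

def fit_policy(states, actions):
    """Least-squares gain g for the policy a = g * s."""
    s = np.asarray(states, dtype=float)
    a = np.asarray(actions, dtype=float)
    return float(s @ a / (s @ s))

def rollout(gain, start, steps, rng, noise=0.05):
    """Run the policy a = gain * s and return the states it visits."""
    visited, s = [], start
    for _ in range(steps):
        visited.append(s)
        s = s + gain * s + rng.normal(0.0, noise)
    return visited

def dagger(n_rounds=5, steps=20, seed=0):
    rng = np.random.default_rng(seed)
    states = rollout(-0.5, 1.0, steps, rng)            # 1. expert demonstrations D0
    actions = [expert_action(s) for s in states]
    gain = fit_policy(states, actions)                 # 2. train on D0
    for _ in range(n_rounds):
        visited = rollout(gain, 1.0, steps, rng)       # 3. run the learner
        labels = [expert_action(s) for s in visited]   # 4. expert labels its states
        states = states + visited                      # 5. aggregate
        actions = actions + labels
        gain = fit_policy(states, actions)             # 6. retrain
    return gain, len(states)
```

The important detail is step 3: the states come from the learner's rollouts, while the labels in step 4 still come from the expert.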

Beyond Gaussian: Multimodal Action Distributions

Consider a T-intersection. The expert sometimes turns left, sometimes right. BC with a Gaussian output averages them: go straight into the wall. The Gaussian policy outputs the mean of the expert's actions, which is a terrible action that no expert ever took.
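
The averaging effect is easy to reproduce numerically. In this sketch the steering command is a single number (−1 for left, +1 for right, an encoding assumed for illustration), and the MSE-optimal constant prediction is simply the sample mean.

```python
import numpy as np

def mode_averaging_demo(n=1000, seed=0):
    """Bimodal expert actions and the single action that minimizes MSE."""
    rng = np.random.default_rng(seed)
    turns = rng.choice([-1.0, 1.0], size=n)              # left or right, 50/50
    expert_actions = turns + rng.normal(0.0, 0.05, n)    # small execution noise
    mse_optimal_action = expert_actions.mean()           # argmin of mean squared error
    return expert_actions, mse_optimal_action

actions, mean_action = mode_averaging_demo()
print(f"MSE-optimal action: {mean_action:+.3f}")
print(f"closest expert action to zero: {np.min(np.abs(actions)):.3f}")
```

The predicted action sits near 0 (straight ahead), while every expert action is close to −1 or +1.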

Solution 1
Implicit Behavioral Cloning (IBC)

Florence et al. (2022) used energy-based models: instead of predicting a single action, learn an energy function E(o, a) that assigns low energy to good actions and high energy to bad ones. At test time, find $\arg\min_a E(o, a)$ via gradient descent in action space. This naturally captures multimodality, because several distinct action modes can all have low energy.

Solution 2
Diffusion Policy

Chi et al. (2023) used diffusion models to generate action sequences. Start with Gaussian noise, iteratively denoise into a coherent action trajectory. Because the diffusion model learns the full distribution (not just the mean), it can produce diverse, multimodal action sequences. It also generates action chunks — entire trajectories of 8–16 future actions at once, providing temporal consistency.

Why Action Chunking Matters

Single-step action prediction is jittery: the robot reconsiders its plan every 50ms. Action chunking commits to a short trajectory (0.5–1.0 seconds of future actions), producing smoother, more human-like motion. It also reduces the effective horizon from T to T/chunk_size, mitigating compounding error.

Inverse Reinforcement Learning (IRL)

Instead of cloning actions, infer the reward function the expert must be optimizing. Then use RL with that learned reward. This is more robust than BC because the reward function transfers across different dynamics (a reward for "cup on shelf" works even if the robot arm is different from the demonstrator's).

Maximum Entropy IRL: $R^* = \arg\max_R \left( \mathbb{E}_{\tau \sim \pi_{\text{expert}}}[R(\tau)] - \max_\pi \left( H(\pi) + \mathbb{E}_{\tau \sim \pi}[R(\tau)] \right) \right)$

Find R that makes the expert look optimal. H(π) = policy entropy.

The intuition: find a reward function under which the expert's behavior has higher value than any other policy's. The entropy term H(π) prevents degenerate solutions where R = 0 everywhere (trivially making everyone "optimal").

🔨 Derivation: Derive BC Compounding Error (Why $T^2$ and Not T?)

Behavioral cloning has a famous failure mode: error compounds quadratically with horizon. The bound is: Total error $\le \varepsilon \cdot T^2$, where ε is the per-step policy error and T is the trajectory length.

Your task: (1) Derive this $T^2$ bound by showing how distribution shift causes errors to compound. (2) Show that DAgger achieves $O(\varepsilon \cdot T)$ instead, and explain what breaks the quadratic compounding. (3) For a contact-rich task like peg-in-hole with T = 200 steps, how many demonstrations do you need to keep the total error below 0.1 (a 10% degradation in success)?

Hint 1: At training time, BC sees states from the expert's distribution $d_{\pi^*}$. At test time, after one mistake, the policy visits a state s' that the expert would never visit. Now the policy must predict from an out-of-distribution state, likely making another error. This pushes it further off-distribution. After t steps, the deviation from the expert's distribution grows as t · ε (each step contributes ε deviation). The per-step error at time t is thus proportional to t · ε, and summing over T steps gives $\sum_t t\varepsilon = \varepsilon \cdot T(T+1)/2 \approx \varepsilon T^2 / 2$.
Hint 2: DAgger (Dataset Aggregation) collects data under the LEARNER's distribution, then asks the expert to label those states. After round i, the training set covers $d_{\pi_i}$. This means the policy is trained on states it will actually encounter, eliminating the distribution shift. Without distribution shift, errors don't compound: you just pay the per-step error ε at each of T steps, so total error = ε · T.
Hint 3: For peg-in-hole (T = 200, continuous 6D action), the required accuracy per step is $\varepsilon \le \text{target}/T^2 = 0.1/40000 = 2.5 \times 10^{-6}$. With MSE loss over 6D actions and the rough scaling $N \ge d/\varepsilon$, this requires about $6/(2.5 \times 10^{-6}) \approx 2.4$ million demonstrations, which is absurd. This is why BC alone fails for long-horizon contact tasks and why Diffusion Policy (multi-modal, handles temporal correlations) and DAgger (breaks the quadratic compounding) are necessary.

Part 1 ($T^2$ bound): Let $\varepsilon = \mathbb{E}_{s \sim d_{\pi^*}}[\mathrm{TV}(\pi_\theta(\cdot \mid s), \pi^*(\cdot \mid s))]$. At time t, the policy has accumulated drift from the expert. The state distribution at time t satisfies $\mathrm{TV}(d^t_{\pi_\theta}, d^t_{\pi^*}) \le t\varepsilon$ (each step adds at most ε divergence). The cost at time t under the wrong distribution is bounded by tε. Total cost: $\sum_{t=1}^{T} t\varepsilon = \varepsilon T(T+1)/2 = O(\varepsilon T^2)$. ■

Part 2 (DAgger's linear bound): DAgger trains on $d_{\pi_\theta}$ (the policy's own distribution). No distribution shift means the per-step error is just ε everywhere, regardless of timestep. Total cost: $T\varepsilon = O(\varepsilon T)$. ■

Part 3 — Why contact tasks are hard: Contact introduces discontinuities. A 1mm error in position can mean the difference between "peg enters hole" and "peg hits surface and jams." The effective ε for contact tasks is much larger than for free-space motion because the loss landscape is discontinuous. This is why diffusion policies (which model multi-modal action distributions) outperform MSE-based BC: they capture the "insert from left OR insert from right" multi-modality that MSE averages to "insert from straight ahead" (which fails).

The key insight: BC's $T^2$ bound means the per-step accuracy required, and with it the amount of demonstration data, grows quadratically with the horizon, so long-horizon contact tasks are far more data-hungry than short-horizon ones. The field's response: (1) reduce T via hierarchical policies, (2) use DAgger-style online methods, or (3) use expressive policies (diffusion, flow matching) that represent multi-modality so a single demonstration teaches more.

🔨 Derivation: System Identification (Fitting Simulator Parameters from Real Data)

System identification finds simulator parameters φ (friction, mass, damping) that minimize the trajectory mismatch between sim and reality. Given N real trajectories $\tau_{\text{real}} = \{(s_0, a_0, s_1, \dots, s_T)^{(i)}\}_{i=1}^{N}$:

$\phi^* = \arg\min_\phi \sum_{i=1}^{N} \sum_{t=0}^{T-1} \| f_\phi(s_t^{(i)}, a_t^{(i)}) - s_{t+1}^{(i)} \|^2$

Your task: (1) Derive the gradient ∇φ of this objective when fφ is a differentiable simulator (like Brax or MuJoCo MJX). (2) Explain why you need multi-step rollout loss (not single-step) for accurate identification. (3) Show how CMA-ES can solve this when the simulator is NOT differentiable.

Hint 1: $\nabla_\phi L = \sum 2 (f_\phi(s_t, a_t) - s_{t+1}) \cdot \nabla_\phi f_\phi(s_t, a_t)$. This requires the Jacobian of the simulator w.r.t. its parameters, which is available in differentiable simulators. For MuJoCo MJX: backprop through the physics step. For classic MuJoCo: finite differences or adjoint methods.
Hint 2: Single-step fitting finds φ that matches one-step predictions. But errors compound: a simulator that's 1% wrong per step can be 50% wrong after 50 steps if errors correlate. Multi-step loss: $L = \sum \| \text{rollout}_\phi(s_0, a_{0:T}) - \tau_{\text{real}} \|^2$. This penalizes compounding errors directly. But the gradient requires backprop through T simulator steps, which is expensive and potentially unstable (exploding gradients through contact).
Hint 3: CMA-ES (Covariance Matrix Adaptation Evolution Strategy) maintains a Gaussian distribution over φ. Sample candidates, evaluate each by running the full sim and measuring trajectory RMSE, then update the distribution toward low-RMSE regions. No gradients needed, so it works with any simulator. Typically converges in ~100-1000 evaluations for 10-30 parameters. Key: the evaluation function is "rollout sim with these params, compare to real trajectory."

Differentiable case: With autodiff through the simulator: $\nabla_\phi L = \sum_t 2 (\hat{s}_{t+1} - s_{t+1}) \cdot \left( \frac{\partial f}{\partial \phi} + \frac{\partial f}{\partial s} \cdot \frac{d\hat{s}_t}{d\phi} \right)$. The second term is the "chain through time": errors in earlier predictions affect later ones through the state. This is BPTT through the physics engine.

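Multi-step is crucial: Consider fitting mass when the real mass is 2 kg and the sim mass is 1.8 kg. The parameter error is only 10%, so the single-step prediction error is small. But for an object pushed by a constant force F, the acceleration is F/m, so the position error grows as $\frac{1}{2} F \left| \frac{1}{m_{\text{sim}}} - \frac{1}{m_{\text{real}}} \right| t^2$, quadratically in rollout time. (Free fall would not expose the mismatch: without drag, gravitational acceleration does not depend on mass.) Multi-step loss catches this compounding and drives φ toward the true value.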

CMA-ES procedure: (1) Initialize μ = default params, Σ = 0.1·I. (2) Sample λ = 50 candidates $\phi_i \sim \mathcal{N}(\mu, \Sigma)$. (3) For each $\phi_i$: run sim with those params, compute RMSE vs real. (4) Rank candidates by fitness. (5) Update μ, Σ toward top-ranked candidates. (6) Repeat until convergence (~100-500 generations).
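
The sample, evaluate, rank, update loop can be sketched with a much simpler evolution strategy. To be clear about the substitution: this is a cross-entropy-method style update (refit a 1-D Gaussian to the elite candidates), not full CMA-ES, which additionally adapts a covariance matrix and step size. The toy "simulator" is an assumed point mass pushed by a constant force, and the fitness is the multi-step trajectory RMSE.

```python
import numpy as np

def rollout_positions(mass, force=2.0, dt=0.05, steps=50):
    """Toy simulator: point mass under constant force, returns the position trajectory."""
    pos, vel, out = 0.0, 0.0, []
    for _ in range(steps):
        vel += (force / mass) * dt
        pos += vel * dt
        out.append(pos)
    return np.array(out)

def trajectory_rmse(mass, real_traj):
    """Fitness: multi-step rollout error against the real trajectory."""
    return float(np.sqrt(np.mean((rollout_positions(mass) - real_traj) ** 2)))

def identify_mass(real_traj, init_mean=1.0, init_std=1.0,
                  pop=50, elite=10, generations=30, seed=0):
    rng = np.random.default_rng(seed)
    mean, std = init_mean, init_std
    for _ in range(generations):
        # Sample candidates; keep masses strictly positive.
        candidates = np.abs(rng.normal(mean, std, size=pop)) + 1e-3
        rmse = np.array([trajectory_rmse(m, real_traj) for m in candidates])
        elites = candidates[np.argsort(rmse)[:elite]]   # rank by fitness
        mean = float(elites.mean())                     # move toward the best
        std = max(float(elites.std()), 1e-3)            # keep some exploration
    return mean

real = rollout_positions(2.0)          # stands in for a measured real trajectory
print(identify_mass(real))             # estimate of the true mass (2.0)
```

No gradient of the simulator is used anywhere, so the same loop works when `rollout_positions` is replaced by a black-box physics engine.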

The key insight: System identification is an optimization problem where the objective is "make simulation match reality." The choice of optimization method (gradient vs black-box) depends on whether your simulator is differentiable. The choice of loss (single-step vs multi-step) determines whether you catch compounding errors.

Chapter 06

Robotic Foundation Models

The pattern in NLP was clear: pretrain a giant model on diverse data, then fine-tune for specific tasks. GPT proved it for text. CLIP proved it for vision+language. Can the same pattern work for robots? Train one model on data from many robots doing many tasks, then fine-tune it for your specific robot and task?

Vision-Language-Action Models (VLAs)

A VLA takes two inputs, what the robot sees (camera image) and what it should do (a language instruction like "pick up the red cup"), and outputs motor commands. Conceptually, it is a VLM whose output is actions rather than text.

VLA Architecture: a_t = VLA(image_t, "pick up the red cup")

Maps vision + language to continuous robot actions
VLA vs VLM: The Key Difference

A VLM (like GPT-4V) outputs text tokens. A VLA outputs continuous motor commands (typically 7D: xyz translation + 3D rotation such as roll/pitch/yaw + gripper open/close). This seems like a small change, but it's fundamental. Text tokens are discrete and language-universal. Motor commands are continuous, robot-specific (a 6-DOF arm and a quadruped have different action spaces), and must be physically executable at 10–50 Hz.

The Timeline

| Model | Date | Key Innovation | Scale |
|---|---|---|---|
| RT-1 | Dec 2022 | First large-scale robot transformer. FiLM-conditioned EfficientNet + TokenLearner. 130K real episodes from 13 robots. | 35M params |
| RT-2 | Jul 2023 | Repurposed a VLM (PaLI-X) as robot controller. Actions tokenized as text: "1 128 91 241 5 101 127". Emergent reasoning: "pick up something that's NOT a banana" works without being trained on negation. | 55B params |
| RT-X | Oct 2023 | Cross-embodiment dataset: 22 robot types, 160,000 tasks, 1M+ episodes. Showed that training on diverse robots helps every individual robot. | Multi-dataset |
| OpenVLA | Jun 2024 | Open-source VLA. Llama 2 backbone + DINOv2 and SigLIP vision encoders. First VLA the community could actually use and fine-tune. | 7B params |
| π-Zero | Oct 2024 | Flow matching for action generation. Pre-train on diverse robot data, then post-train on curated task data; flow matching yields smooth, multimodal actions. Cross-embodiment + cross-task. | 3B params |

RT-2: How a VLM Becomes a Robot

RT-2's insight is almost absurdly simple. Take PaLI-X, a 55-billion parameter VLM trained on web images and text. It already understands visual concepts ("red cup," "left side of table"). Now fine-tune it on robot data where actions are encoded as text tokens:

RT-2 Action Tokenization

Input: Camera image + "move the apple to the blue bowl"

Output text: "1 128 91 241 5 101 127"

Each number is a discretized action dimension: [terminate, x, y, z, roll, pitch, yaw, gripper]. The VLM generates these as regular text tokens — it just happens that these tokens are motor commands.
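A minimal sketch of this kind of action tokenization follows. It assumes uniform binning into 256 values per dimension, as reported for RT-1 and RT-2, and a hypothetical normalized action range of [-1, 1]. The function names and the range are illustrative and are not taken from the RT-2 codebase.

```python
import numpy as np

N_BINS = 256  # bins per action dimension (assumed uniform)


def actions_to_text(action, low, high, n_bins=N_BINS):
    """Discretize a continuous action vector and emit it as a token string."""
    action = np.clip(action, low, high)
    bins = np.round((action - low) / (high - low) * (n_bins - 1)).astype(int)
    return " ".join(str(b) for b in bins)


def text_to_actions(text, low, high, n_bins=N_BINS):
    """Decode a token string back to bin-centre continuous values."""
    bins = np.array([int(tok) for tok in text.split()], dtype=float)
    return low + bins / (n_bins - 1) * (high - low)


# Hypothetical normalized 7-D action; the range is an assumption.
low, high = -np.ones(7), np.ones(7)
action = np.array([0.0, 0.5, -0.25, 1.0, -1.0, 0.1, 0.9])

text = actions_to_text(action, low, high)
decoded = text_to_actions(text, low, high)
max_error = float(np.max(np.abs(decoded - action)))
```

The round trip is lossy: the worst-case error is half a bin width, which is the "lossy discretization" that continuous action decoders try to avoid.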

Emergent capability: RT-2 can follow instructions it was never trained on ("pick up the extinct animal" → picks up the plastic dinosaur) because PaLI-X learned these concepts from the internet.

π-Zero: Cross-Embodiment + Flow Matching

Physical Intelligence's π-Zero takes the foundation model idea further with a two-phase training recipe and a continuous action decoder:

Pre-training: Train on diverse robot interaction data from multiple robot types (single-arm, bimanual, and mobile manipulators). Actions from different robots are normalized into a common representation. The model learns general physical manipulation concepts: grasping, pushing, inserting, wiping.

Post-training: Fine-tune on smaller, curated, task-specific data to reach the fluency and precision that dexterous tasks demand.

Flow matching for actions: Instead of tokenizing actions as text (lossy discretization), π-Zero uses flow matching to generate continuous action trajectories, in both phases. Flow matching learns a vector field that transports a simple distribution (Gaussian noise) to the data distribution (expert actions). This produces smooth, multimodal action distributions, which is exactly what you need for dexterous manipulation.
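The regression target behind flow matching fits in a few lines. The sketch below uses one common formulation: a straight-line path between a noise sample and a data sample, with the path's constant velocity as the target. π-Zero's actual parameterization, observation conditioning, and network are not reproduced here. As a sanity check, it uses a hypothetical dataset containing a single expert action, for which the ideal vector field has a closed form.

```python
import numpy as np


def cfm_training_batch(expert_actions, rng):
    """Build (x_t, t, target velocity) triples for conditional flow matching."""
    x1 = expert_actions                       # data samples (expert actions)
    x0 = rng.standard_normal(x1.shape)        # Gaussian noise samples
    t = rng.random((x1.shape[0], 1))          # one time in [0, 1) per sample
    x_t = (1.0 - t) * x0 + t * x1             # point on the straight-line path
    target = x1 - x0                          # velocity along that path
    return x_t, t, target


def flow_matching_loss(velocity_fn, x_t, t, target):
    """Mean squared error between predicted and target velocities."""
    return float(np.mean((velocity_fn(x_t, t) - target) ** 2))


def euler_sample(velocity_fn, x0, steps=10):
    """Generate actions by integrating the vector field from noise (t=0) to t=1."""
    x = x0.copy()
    for k in range(steps):
        t = np.full((x.shape[0], 1), k / steps)
        x = x + velocity_fn(x, t) / steps
    return x


# Hypothetical single expert action (7-D); a real policy would use a learned
# network conditioned on images and language instead of this oracle.
expert = np.array([0.1, -0.2, 0.05, 0.0, 0.3, -0.1, 1.0])


def oracle_velocity(x, t):
    return (expert - x) / (1.0 - t)


rng = np.random.default_rng(0)
x_t, t, target = cfm_training_batch(np.tile(expert, (256, 1)), rng)
loss = flow_matching_loss(oracle_velocity, x_t, t, target)
samples = euler_sample(oracle_velocity, rng.standard_normal((4, 7)))
```

In training, `velocity_fn` would be the network and `flow_matching_loss` the objective. At inference, `euler_sample` turns fresh noise into an action in a handful of integration steps.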

Cross-Embodiment Training

Training on data from 22 different robots sounds like it would confuse the model. The opposite happens: shared manipulation concepts transfer across embodiments. A 7-DOF arm and a 16-DOF hand both need to approach objects, align grippers, and apply appropriate force. The foundation model learns these abstractions, and each robot benefits from the others' data.

Chapter 07

World Models & Foundation World Models

In Chapter 4, we learned world models that predict state transitions. Now scale that idea up: what if the world model is a video generation model that predicts entire future videos conditioned on actions? This is the frontier of robot learning.

Action-Conditioned Video Prediction

Given the current camera frame and a proposed action sequence, predict what the camera will see in the next 1–5 seconds. This is literally a video generation problem, but conditioned on robot actions:

Video World Model: v̂_{t+1:t+H} = G_φ(v_{1:t}, a_{t:t+H})

Predict future video frames from past video + planned actions

DayDreamer: Training Entirely in Dreams

Wu et al. (2022) showed something remarkable: you can train a robot policy on trajectories imagined inside a learned world model, using only a small amount of real-world interaction. The recipe:

DayDreamer Pipeline
  1. Collect a small amount of real-world data (initial exploration on the robot).
  2. Train a world model on this real data.
  3. Dream: Generate thousands of imagined trajectories inside the world model.
  4. Train policy on imagined trajectories using RL (no real interaction!).
  5. Deploy the policy on the real robot. Collect more real data. Go to step 2.
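One pass through the five steps above can be sketched on a toy problem. This is a deliberately small stand-in and not DayDreamer's method: a linear least-squares model replaces the learned latent world model, a grid search over a feedback gain replaces actor-critic RL, and the "real robot" is a hypothetical scalar system that is unstable without control.

```python
import numpy as np

A_TRUE, B_TRUE = 1.1, 0.5  # hidden "real robot" dynamics (toy assumption)


def real_step(s, u, rng):
    """One noisy step on the real system."""
    return A_TRUE * s + B_TRUE * u + 0.01 * rng.standard_normal()


def fit_world_model(transitions):
    """Step 2: least-squares fit of s' = a*s + b*u from (s, u, s') tuples."""
    data = np.array(transitions)
    coef, *_ = np.linalg.lstsq(data[:, :2], data[:, 2], rcond=None)
    return coef  # (a_hat, b_hat)


def imagined_cost(model, gain, s0=1.0, horizon=30):
    """Steps 3-4: score the policy u = -gain*s using only the learned model."""
    a_hat, b_hat = model
    s, cost = s0, 0.0
    for _ in range(horizon):
        u = -gain * s
        cost += s ** 2 + 0.1 * u ** 2
        s = a_hat * s + b_hat * u  # dream step: no real interaction
    return cost


rng = np.random.default_rng(0)

# Step 1: a small batch of real transitions from random exploration.
transitions = []
for _ in range(50):
    s, u = rng.uniform(-1.0, 1.0, size=2)
    transitions.append((s, u, real_step(s, u, rng)))

model = fit_world_model(transitions)

# Steps 3-4: choose the policy entirely in imagination.
gains = np.linspace(0.0, 4.0, 41)
best_gain = float(gains[np.argmin([imagined_cost(model, k) for k in gains])])

# Step 5: deploy on the real system.
s = 1.0
for _ in range(30):
    s = real_step(s, -best_gain * s, rng)
final_state = float(s)
```

The loop back to step 2 is omitted. In a full pipeline the deployment transitions would be appended to `transitions` and the model refit.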

DayDreamer trained a quadruped to walk in just 1 hour of real-world interaction — orders of magnitude less than model-free RL. The world model serves as a "dream simulator" that amplifies scarce real data into abundant imagined experience.

Foundation World Models

NVIDIA Cosmos (2025) pushes this further: train a world model on internet-scale video data (driving videos, manipulation clips, nature documentaries). The resulting model understands basic physics: objects fall, liquids flow, hands grasp. Fine-tune on your specific robot's data, and you get a world model that already knows about the physical world.

UniSim (2023) by Google DeepMind is a universal simulator that can generate realistic video of unseen actions in unseen environments. It's trained on diverse video data and can simulate what happens if you push an object, open a drawer, or rearrange items — all in environments it's never seen before.

Genie (2024) by Google DeepMind learned to generate interactive 2D worlds from a single image. Show it a screenshot of a platformer game, and it generates a playable world where actions have consistent consequences. This hints at a future where world models become general-purpose simulators, bootstrapped from video.

The 3D Question

Current video world models reason in 2D pixel space. But the physical world is 3D. Two active research directions: (1) Structural priors — build 3D inductive biases (voxels, NeRFs, Gaussian splats) into the world model architecture. (2) Emergent 3D — train on enough 2D video and hope 3D understanding emerges, the way language models seem to learn world knowledge from text. The jury is out on which approach will win.

Connection
Physics-Informed Neural Networks (PINNs)

Instead of learning physics from data alone, encode known physical laws (Newton's equations, conservation of energy) as loss terms. The network's predictions must satisfy F = ma as a soft constraint. This can substantially reduce the data needed for accurate models, especially for rigid-body dynamics where the physics is well understood.
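A minimal sketch of such a loss, assuming a 1-D point mass with a known applied force and a predicted position trajectory sampled at a fixed time step. The function name and the finite-difference residual are illustrative choices. A real PINN would differentiate the network output with autodiff instead.

```python
import numpy as np


def physics_informed_loss(pred_x, obs_idx, obs_x, force, mass, dt, weight=1.0):
    """Data-fit term plus a soft F = m*a penalty on the predicted trajectory."""
    data_loss = np.mean((pred_x[obs_idx] - obs_x) ** 2)
    # Central second difference approximates acceleration at interior points.
    accel = (pred_x[2:] - 2.0 * pred_x[1:-1] + pred_x[:-2]) / dt ** 2
    residual = mass * accel - force[1:-1]
    return float(data_loss + weight * np.mean(residual ** 2))


dt, mass = 0.01, 2.0
t = np.arange(100) * dt
force = np.full_like(t, 4.0)             # constant known force (toy setup)
true_x = 0.5 * (force / mass) * t ** 2   # exact solution starting from rest

# Only three sparse position measurements are available.
obs_idx = np.array([0, 50, 99])
obs_x = true_x[obs_idx]

# A candidate that fits every measurement but violates Newton's law in between.
wiggly_x = true_x.copy()
wiggly_x[25] += 0.01

loss_consistent = physics_informed_loss(true_x, obs_idx, obs_x, force, mass, dt)
loss_wiggly = physics_informed_loss(wiggly_x, obs_idx, obs_x, force, mass, dt)
```

Both candidates have zero data loss on the three measurements. Only the physics term separates them, which is how the constraint stands in for missing data.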

Chapter 08

Open Challenges — What's Left?

Robot learning has achieved remarkable results in controlled settings. But deploying robots reliably in unstructured human environments remains stubbornly hard. Here are the open problems the field is wrestling with.

Evaluation: How Do You Know It Works?

In NLP, evaluation is cheap: run the model on a test set. In robotics, evaluation means physical experiments. Each experiment takes minutes, requires human supervision, risks breaking hardware, and produces noisy results (the same policy might succeed 7/10 or 9/10 times depending on initial conditions). This makes it nearly impossible to do rigorous ablation studies.
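The 7/10 versus 9/10 example can be made quantitative with a confidence interval on the success rate. The sketch below uses the Wilson score interval at 95% confidence. The choice of interval is an assumption made here for illustration and is not prescribed by the lecture.

```python
import math


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial success rate (z=1.96 for ~95%)."""
    p = successes / trials
    denom = 1.0 + z ** 2 / trials
    center = (p + z ** 2 / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z ** 2 / (4 * trials ** 2)) / denom
    return center - half, center + half


lo_a, hi_a = wilson_interval(7, 10)   # policy A: 7 successes in 10 trials
lo_b, hi_b = wilson_interval(9, 10)   # policy B: 9 successes in 10 trials
intervals_overlap = lo_b < hi_a
```

With 10 trials each, the intervals are roughly (0.40, 0.89) and (0.60, 0.98). They overlap heavily, so ten physical rollouts cannot distinguish a 70% policy from a 90% one.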

Safety and Robustness

A language model that generates a wrong answer wastes your time. A robot that executes a wrong action breaks a $50,000 arm, injures a human, or destroys the object it's manipulating. Safety isn't an afterthought — it's a prerequisite.

The Long Tail

Robot policies work 95% of the time in the lab. In deployment, the 5% failure cases are catastrophic: an unseen object shape, an unusual lighting condition, a human reaching into the workspace. Moving from 95% to 99.99% reliability is qualitatively different from moving from 50% to 95%. It requires handling a combinatorial explosion of edge cases that no training set can exhaustively cover.

Long-Horizon Planning

Current policies handle 10–30 second tasks: pick, place, push. Real-world useful tasks are minutes to hours: cook a meal, clean a room, assemble furniture. Long-horizon tasks require hierarchical planning (decompose "make coffee" into "grind beans," "boil water," "pour," "wait," "pour again"), error recovery (if the mug tips over, re-plan), and memory (where did I put the sugar?).

Adaptation and Lifelong Learning

A robot deployed in a home must adapt to that home's specific quirks: the sticky drawer, the heavy fridge door, the cat that knocks things off tables. Current models are trained once and deployed frozen. Continual learning without catastrophic forgetting (adapting to the drawer without forgetting how to pour water) is an unsolved problem.

Multi-Robot Coordination

Two robot arms collaboratively folding a sheet. A fleet of warehouse robots navigating without collisions. Multi-agent coordination multiplies the state space exponentially and introduces communication, synchronization, and credit-assignment challenges that single-robot methods don't face.

The Role of Language

Language is increasingly the interface between humans and robots. But mapping language to action is fraught: "put it over there" requires grounding "it" (what object?), "over there" (where exactly?), and "put" (place? throw? slide?). Foundation models bring linguistic understanding, but grounding that understanding in physical reality — connecting words to forces and positions — remains a fundamental challenge.

The Bigger Picture

Current foundation models for robots are pretrained on internet data and fine-tuned on robot data. But internet data is overwhelmingly about describing the world (text, images), not acting in it. The embodied data that robots need — cause-and-effect relationships, physical consequences of actions, contact dynamics — is vastly underrepresented. The field needs either much more embodied data (expensive) or architectures that can learn physics from vision alone (hard).

💥 Break-It Lab What Dies When You Remove Sim-to-Real Components?
A robot arm policy is trained in MuJoCo with domain randomization and transferred to a real Franka Panda. The system works at 87% success. Toggle components OFF to see what breaks.
Remove Domain Randomization
Failure mode: Success drops to 15-25%. The policy overfits to exact sim parameters (friction=0.8, mass=1.2kg). The real robot has friction=0.65 and slightly different masses. Without DR, the policy has zero robustness to parameter mismatch. It's like training a vision model on one background color — instant failure on any other.
Wrong Action Space (Joint vs Task Space)
Failure mode: Policy trained in joint space (7 joint velocities) but deployed expecting task space (6D end-effector velocity + gripper). The robot moves in completely wrong directions. Joint-space policies are harder to transfer because joint dynamics vary more between robots than task-space geometry. Task-space policies transfer better across embodiments because "move end-effector 5cm left" means the same thing regardless of the specific kinematic chain.
Remove Safety Constraints
Failure mode: Without joint limits, velocity caps, and force monitoring: the policy commands 200% of max torque (clipped silently on real hardware, causing unexpected behavior). Without collision checking, the arm drives into the table at full speed. Without watchdog timeout, a crashed policy leaves the arm in motion. In production, the safety layer ISN'T optional — it's the difference between a $500 experiment and a $50,000 accident.
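A minimal sketch of such a safety layer, sitting between the policy and a joint-velocity controller. The class and method names are hypothetical. It covers only velocity caps, joint limits, and a watchdog timeout. Force monitoring and collision checking need robot-specific sensing and are left out.

```python
import time

import numpy as np


class SafetyLayer:
    """Hypothetical filter between a policy and a joint-velocity controller."""

    def __init__(self, q_min, q_max, v_max, timeout_s=0.1, clock=time.monotonic):
        self.q_min = np.asarray(q_min, dtype=float)
        self.q_max = np.asarray(q_max, dtype=float)
        self.v_max = np.asarray(v_max, dtype=float)
        self.timeout_s = timeout_s
        self.clock = clock
        self.last_policy_time = clock()

    def policy_heartbeat(self):
        """Call whenever the policy produces a fresh command."""
        self.last_policy_time = self.clock()

    def filter(self, q, v_cmd, dt):
        """Return a joint-velocity command that respects all limits."""
        # Watchdog: a stalled or crashed policy must not leave the arm moving.
        if self.clock() - self.last_policy_time > self.timeout_s:
            return np.zeros_like(self.v_max)
        # Velocity cap: clip explicitly instead of relying on the hardware.
        v = np.clip(v_cmd, -self.v_max, self.v_max)
        # Joint limits: stop any joint whose next position would leave its range.
        q_next = np.asarray(q, dtype=float) + v * dt
        blocked = (q_next < self.q_min) | (q_next > self.q_max)
        return np.where(blocked, 0.0, v)


layer = SafetyLayer(q_min=[-1.0, -1.0], q_max=[1.0, 1.0], v_max=[0.5, 0.5])
layer.policy_heartbeat()
safe_v = layer.filter(q=[0.0, 0.99], v_cmd=[2.0, 0.5], dt=0.1)
```

In the usage example the first joint's command is capped at its velocity limit, and the second joint is stopped because one more step would carry it past its upper limit.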
🏗 Design Challenge You're the Architect: Bimanual Laundry-Folding Robot
Your team at a robotics startup must build a bimanual robot that folds laundry. It must handle deformable objects (cloth, towels, shirts) and diverse garments (small socks to large bedsheets), and hit a throughput target of 100 folds/hour. The system ships in 12 months.
Throughput: 100 folds/hour (36 sec/fold avg)
Object Types: T-shirts, towels, pants, socks, sheets
Hardware: 2x 7-DOF arms + parallel grippers
Perception: 2x RGB-D cameras (overhead + front)
Training Budget: 500 human demos + unlimited sim
Deformable Sim Fidelity: Low (cloth sim is inaccurate)
1. Learning approach: BC from human demos? RL in sim? Hybrid? Given that cloth simulation is inaccurate, how do you handle the sim-to-real gap for deformable objects?
2. State representation: how do you represent a crumpled cloth to a policy? Keypoints? Full mesh? Learned embeddings from RGB?
3. Action primitives: do you learn end-to-end or define primitives (pick-point, fold-along-axis, flatten) and learn when to apply them?
4. Bimanual coordination: two independent policies? One joint policy? Leader-follower?
5. Garment classification: how do you handle a never-seen garment shape? Do you need explicit classification, or can the policy generalize?

Reference point: SpeedFolding (UC Berkeley, 2022) reports 30-40 folds/hour, well short of the 100 folds/hour target. A reasonable design in the same spirit:

Perception: RGBD overhead camera + learned keypoint detection on cloth. A neural network predicts "grasp points" (where to pick up) and "fold lines" (where to create the fold). This avoids needing a full cloth mesh — just key geometric features.

Learning: Imitation learning from human demos (not RL, because cloth sim is too inaccurate for sim-to-real). ~100 demos per garment category, which fits the 500-demo budget across 5 categories. Diffusion policy for multi-modal actions (there are multiple valid fold sequences for any garment).

Action space: Hybrid primitives. High-level: classify garment → select fold sequence template. Low-level: learned pick-and-place within each primitive. This decomposes the 36-second task into 3-4 primitives of roughly 9-12 seconds each.

Bimanual: Leader-follower for symmetric folds (both arms do the same motion mirrored). Joint policy for asymmetric tasks (one arm holds, other folds). Coordination via shared state + synchronization primitives.

Key insight: For deformable objects, imitation learning + primitives beats end-to-end RL because cloth sim can't generate useful training data. The 500-demo budget is spent wisely: 100 demos per category (5 categories), with data augmentation (random start poses) expanding to 5000+ effective demos.

💻 Build It Implement Domain Randomization Parameter Sampling
Implement a domain randomizer that samples physics parameters for each training episode. The randomizer should support: (1) uniform ranges for each parameter, (2) correlated parameters (mass and inertia must be physically consistent), (3) adaptive randomization (widen ranges if the policy is too confident, narrow if it can't learn).
python
class DomainRandomizer:
    def __init__(self, param_ranges: dict[str, tuple[float, float]]):
        """
        param_ranges: {'friction': (0.3, 1.2), 'mass': (0.5, 3.0), ...}
        """
        ...

    def sample(self) -> dict[str, float]:
        """Sample one set of randomized parameters."""
        ...

    def adapt(self, success_rate: float, target: float = 0.7):
        """
        Widen ranges if success_rate > target (too easy).
        Narrow ranges if success_rate < target * 0.5 (too hard).
        """
        ...

    def make_consistent(self, params: dict) -> dict:
        """Enforce physical consistency: I = m * r^2 for inertia."""
        ...
Test case
dr = DomainRandomizer({'friction': (0.3, 1.2), 'mass': (0.5, 3.0)})
params = dr.sample() # {'friction': 0.74, 'mass': 1.83, ...}
assert 0.3 <= params['friction'] <= 1.2
dr.adapt(success_rate=0.95) # too easy, widen
params2 = dr.sample() # now samples from wider range
Parameters like mass and friction are scale parameters — the ratio matters more than the absolute value. Mass=0.5 vs mass=1.0 is as significant as mass=2.0 vs mass=4.0. Use log-uniform sampling: log(x) ~ U(log(lo), log(hi)). This gives equal probability to each multiplicative factor. For additive parameters (like bias offsets), regular uniform is fine.
python
import numpy as np

class DomainRandomizer:
    def __init__(self, param_ranges, log_params=None):
        self.ranges = param_ranges  # {name: (lo, hi)}
        self.log_params = log_params or {'mass', 'friction', 'inertia'}
        self.width_scale = {k: 1.0 for k in param_ranges}

    def sample(self):
        params = {}
        for name, (lo, hi) in self.ranges.items():
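            # Apply adaptive width scaling around the range midpoint
            mid = (lo + hi) / 2
            half_w = (hi - lo) / 2 * self.width_scale[name]
            eff_lo = mid - half_w
            if lo >= 0:
                # Non-negative parameters: floor at 10% of the original lo so
                # widening never pushes them below zero. Ranges that start
                # below zero (e.g. bias offsets) keep their full lower extent.
                eff_lo = max(lo * 0.1, eff_lo)
            eff_hi = mid + half_w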

            if name in self.log_params and eff_lo > 0:
                # Log-uniform: equal probability per multiplicative factor
                params[name] = np.exp(np.random.uniform(
                    np.log(eff_lo), np.log(eff_hi)))
            else:
                params[name] = np.random.uniform(eff_lo, eff_hi)
        return self.make_consistent(params)

    def adapt(self, success_rate, target=0.7):
        for name in self.width_scale:
            if success_rate > target:
                # Too easy: widen by 10%
                self.width_scale[name] = min(3.0,
                    self.width_scale[name] * 1.1)
            elif success_rate < target * 0.5:
                # Too hard: narrow by 10%
                self.width_scale[name] = max(0.3,
                    self.width_scale[name] * 0.9)

    def make_consistent(self, params):
        # Physical consistency: inertia = mass * radius^2
        if 'mass' in params and 'inertia' in params:
            r_sq = params.get('radius', 0.05) ** 2
            params['inertia'] = params['mass'] * r_sq
        return params
Bonus challenge: Extend this to support correlated parameter groups (e.g., all joint dampings scale together) and curriculum scheduling (start narrow, widen over training).
🔗 Pattern Recognition
End-to-End vs Modular: The Same Tradeoff Everywhere
Robot Learning (This Lesson)
Modular: perception → planning → control.
End-to-end: VLA maps pixels directly to actions.
Tradeoff: modularity = interpretable + debuggable. End-to-end = no information bottleneck but opaque.
VLA Architectures
Same tension: RT-2 is end-to-end (VLM → actions).
π-Zero adds explicit flow matching (structured action generation).
VLA Lesson

In both settings, the field oscillates between "let the model figure it out" (end-to-end) and "inject structure" (modular). The pattern: end-to-end wins when data is abundant and the task is complex enough that hand-designed interfaces lose information. Modular wins when data is scarce or you need guarantees (safety, interpretability). The frontier combines both: structured end-to-end (like diffusion policy with explicit action head but learned perception backbone).

Can you identify this same modular vs. end-to-end tension in autonomous driving? What's the equivalent of the "perception → planning → control" stack, and what's the end-to-end alternative?

🔗 Pattern Recognition
Simulation IS the Cheap Data Source
Robot Learning (This Lesson)
RL needs millions of episodes.
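Real robot: 1 episode per 5 sec, so 200M episodes ≈ 1 billion seconds ≈ 31.7 years.
Sim: 4096 parallel envs × 1000 steps/sec each ≈ 4M steps/sec, so 200M episodes of ~100 steps (an assumed 5 sec at 20 Hz) take about 1.4 hours.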
Robotics Simulation
MuJoCo MJX: 10,000+ envs on 1 GPU.
Isaac Lab: 100,000 parallel envs.
The simulator IS the training infrastructure.
Simulation Lesson

Every robot learning method (RL, BC from sim demos, world model training) bottlenecks on data. Simulation is the universal solution: unlimited, parallelizable, free of safety constraints. The quality of your simulator determines the ceiling of your policy. This is why simulation engineering (choosing the right engine, tuning contact models, system identification) is arguably more important than the learning algorithm itself.

If simulation is so powerful, why can't we just simulate everything? What class of tasks remains fundamentally hard to simulate accurately? (Hint: think about deformable objects and fluid dynamics.)

Chapter 09

Showcase — Build a Robot Learning Pipeline

Time to put it all together. The interactive canvas below lets you design a robot learning pipeline from scratch. Choose your learning paradigm, adjust the knobs, and watch how data flows from collection through training to deployment.

Each approach has different data requirements, compute costs, and failure modes. Toggle between them to build intuition for when to use each one.

Interactive: Robot Learning Pipeline Builder
Select a learning approach to see its training pipeline. Adjust the sim-to-real noise slider to see how each approach degrades. Toggle state representations to see how they affect data requirements.
⚔ Adversarial: The Domain Randomization Paradox
You train a manipulation policy with very wide domain randomization: friction ∈ [0.1, 2.0], mass ∈ [0.1, 10.0], actuator delay ∈ [0, 50ms]. Your sim success rate is 45% (the task is "too hard" under many randomizations), but you hope real performance will be high because real params are within range. What actually happens on the real robot?
⚔ Adversarial: Absolute vs Delta Actions
You're deploying a behavioral cloning policy trained on teleoperation demos. The policy outputs absolute end-effector poses (x, y, z, qx, qy, qz, qw). During deployment, you notice the robot's starting position is 3cm different from the demo starting positions (the robot was bumped). What happens?
Checkpoint — Before you move on
You're advising a startup that wants to build a general-purpose manipulation robot. They ask: "Should we use model-free RL, model-based RL, imitation learning, or a foundation model?" Your answer should be "it depends." Write down the 3 key factors that determine the right choice, and for each factor, name which approach wins.
Model Answer

Factor 1: Simulation fidelity. If you can simulate the task accurately (rigid objects, known dynamics): model-free RL or model-based RL. If sim is unreliable (deformables, fluids, contacts): imitation learning from real demos.

Factor 2: Data budget. 10 real demos: imitation learning (BC). 100 real episodes: model-based RL. 10M sim episodes: model-free RL. Zero task-specific data: foundation model (VLA, zero-shot).

Factor 3: Generalization scope. One specific task on one robot: BC is fastest. Many tasks on one robot: RL (amortizes exploration). Many tasks on many robots: foundation model (amortizes across embodiments).

The meta-insight: most real systems use a HYBRID. Foundation model for coarse behavior + IL fine-tuning for precision + RL for self-improvement in deployment. The question isn't "which one" but "how to combine them."

Chapter 10

Summary & Connections

The Grand Comparison

| Approach | Data Source | Sample Efficiency | Strengths | Weaknesses |
|---|---|---|---|---|
| Model-Free RL | Self-generated (exploration) | Very low (millions of episodes) | No assumptions about dynamics. Provably optimal in limit. | Impractical data requirements for real robots. |
| Model-Based RL | Self-generated + model rollouts | Medium (hundreds of episodes) | Data-efficient. Can plan ahead. | Model errors compound. Struggles with contacts. |
| Imitation Learning | Expert demonstrations | High (tens of demonstrations) | Fast to deploy. No reward design. | Needs expert. Distribution shift. Ceiling = expert. |
| Foundation Models | Internet + multi-robot data | Highest (zero/few-shot) | Generalization. Language-conditioned. Cross-embodiment. | Compute-hungry. Still needs fine-tuning for precision. |

Decision Guide: When to Use What

Model-Free RL

Use when: you have a fast, accurate simulator; the task has a clear reward signal; and you can afford massive compute. Best for: locomotion, game-like tasks, sim-only settings.

Model-Based RL

Use when: real-world data is expensive; dynamics are relatively smooth; you need to plan multi-step ahead. Best for: pushing, reaching, navigation, any task where physics is approximately known.

Imitation Learning

Use when: you have access to an expert (human teleoperator); the task is hard to define with a reward function; you need a working policy quickly. Best for: manipulation, assembly, any task where "just show me" is easier than "just describe the reward."

Foundation Models

Use when: you need generalization across tasks and objects; language instructions are natural; you can fine-tune a large pretrained model. Best for: open-vocabulary manipulation, multi-task deployment, embodied assistants.

Key Equations Cheat Sheet

MDP: M = (S, A, P, R, γ)
Behavioral Cloning: L(θ) = 𝔼_{(o,a)~D} [ ‖π_θ(o) − a‖² ]
BC Compounding Error: Total error ≤ ε · T²
World Model: ŝ_{t+1} = f_φ(s_t, a_t)
VLA: a_t = VLA(image_t, instruction)

Connections to Other Lessons

This lecture draws from nearly every topic in the CS 231n curriculum. Convolutional networks provide the visual backbone for every robot perception system. Transformers power the foundation models (RT-2, OpenVLA). Generative models (diffusion, flow matching) underlie diffusion policies and world models. Self-supervised learning (CLIP, DINOv2) provides the pretrained representations that make robot learning sample-efficient.

From the RL side: the Q-learning and actor-critic foundations from CS 224R are exactly the model-free methods that roboticists tried first. Imitation learning from that course maps directly to behavioral cloning and DAgger here. And model-based RL is the bridge between learned dynamics and classical control theory.

References

| Paper | Year | Contribution |
|---|---|---|
| Mnih et al., "Human-level control through deep RL" (Nature) | 2015 | DQN for Atari |
| Silver et al., "Mastering the game of Go" (Nature) | 2016 | AlphaGo |
| Silver et al., "Mastering Go without human knowledge" | 2017 | AlphaGo Zero |
| Schrittwieser et al., "Mastering Atari...by planning" (Nature) | 2020 | MuZero |
| Finn & Levine, "Deep Visual Foresight for Planning" | 2017 | Pixel dynamics for robots |
| Manuelli et al., "Keypoints into the Future" | 2020 | Keypoint dynamics |
| Ross et al., "A Reduction of Imitation Learning" (AISTATS) | 2011 | DAgger algorithm |
| Florence et al., "Implicit Behavioral Cloning" | 2022 | Energy-based BC |
| Chi et al., "Diffusion Policy" | 2023 | Diffusion models for robot actions |
| Brohan et al., "RT-1: Robotics Transformer" | 2022 | Large-scale robot transformer |
| Brohan et al., "RT-2: Vision-Language-Action Models" | 2023 | VLM as robot controller |
| Open X-Embodiment Collaboration, "RT-X" | 2023 | Cross-embodiment dataset |
| Kim et al., "OpenVLA" | 2024 | Open-source VLA |
| Black et al., "π0: A Vision-Language-Action Flow Model" | 2024 | Flow matching for robots |
| Wu et al., "DayDreamer: World Models for Physical Robot Learning" | 2022 | Policy training in imagination |
| Yang et al., "UniSim: Learning Interactive Real-World Simulators" | 2023 | Universal video simulator |
The One Sentence

Robot learning is the art of closing the loop between seeing and doing — bridging the gap between understanding the world (vision) and changing it (action), using data instead of hand-crafted rules.