Vision got us eyes. Language got us words. But robots need hands — and hands that learn from experience, not just instructions.
You've built a convolutional network that classifies images with superhuman accuracy. You've trained a transformer that writes coherent essays. Now imagine bolting either onto a robot arm and asking it to fold a towel. What goes wrong?
Everything. The towel deforms unpredictably. The robot's gripper slips. The camera sees a slightly different angle than training. The table has a crease. A single pixel-level change cascades into a completely different physical outcome. This is the chasm between perception (recognizing the world) and action (changing it).
Robot learning differs from standard supervised learning in four fundamental ways. Each is a genuine obstacle, not a minor annoyance:
The same action in the same state can produce different outcomes. You push a block and it slides 3 cm or 5 cm depending on friction, contact angle, surface dust. In supervised learning, the label for an image is fixed. In robotics, the "label" (next state) is a probability distribution.
A robot stacks 10 blocks and the tower falls on block 7. Which action caused the failure? Block 3 was placed 1mm off-center, creating a lean that compounded. The bad action happened 4 steps before the consequence. In supervised learning, the loss is immediate. In robotics, consequences are delayed.
You can backpropagate through a neural network because every operation is differentiable. You cannot backpropagate through physics. When a robot's gripper contacts a table, there's a discontinuous collision event. The gradient of "did the cup break?" with respect to gripper force is either zero or undefined.
In image classification, the data distribution is fixed: cats look like cats tomorrow. For a robot, each action changes the world, which changes what the robot sees, which changes what it should do next. The "training distribution" is constantly shifting because the robot is part of it.
Despite these curses, we need a formal framework. Robotics uses the Markov Decision Process (MDP) — the same framework from RL, but now the states are physical and the actions move real actuators.
State: (cart position x, cart velocity ẋ, pole angle θ, angular velocity θ̇). Four numbers.
Action: Push left or push right. Two choices.
Transition: Newtonian mechanics — gravity pulls the pole, force accelerates the cart. Deterministic but nonlinear.
Reward: +1 for every timestep the pole stays upright. Episode ends when it falls.
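The four MDP ingredients above can be written down directly. The following is a minimal sketch, assuming the classic cart-pole equations of motion; the physical constants, the 12° and 2.4 m termination thresholds, and the two example policies are illustrative assumptions, not values taken from the text.

```python
import math
import random

# Physical constants and thresholds are illustrative assumptions.
GRAVITY, CART_MASS, POLE_MASS = 9.8, 1.0, 0.1
POLE_HALF_LENGTH, FORCE, DT = 0.5, 10.0, 0.02
MAX_ANGLE, MAX_POSITION = math.radians(12), 2.4

def step(state, action):
    """One MDP transition: returns (next_state, reward, done)."""
    x, x_dot, theta, theta_dot = state
    force = FORCE if action == 1 else -FORCE  # action: 0 = push left, 1 = push right
    total_mass = CART_MASS + POLE_MASS
    pole_ml = POLE_MASS * POLE_HALF_LENGTH
    sin_t, cos_t = math.sin(theta), math.cos(theta)
    temp = (force + pole_ml * theta_dot ** 2 * sin_t) / total_mass
    theta_acc = (GRAVITY * sin_t - cos_t * temp) / (
        POLE_HALF_LENGTH * (4.0 / 3.0 - POLE_MASS * cos_t ** 2 / total_mass))
    x_acc = temp - pole_ml * theta_acc * cos_t / total_mass
    # Deterministic but nonlinear transition, integrated with explicit Euler
    next_state = (x + DT * x_dot, x_dot + DT * x_acc,
                  theta + DT * theta_dot, theta_dot + DT * theta_acc)
    done = abs(next_state[0]) > MAX_POSITION or abs(next_state[2]) > MAX_ANGLE
    return next_state, 1.0, done  # reward of +1 for every step taken

def run_episode(policy, max_steps=500, seed=0):
    """Return the total reward collected before the pole falls."""
    rng = random.Random(seed)
    state = tuple(rng.uniform(-0.05, 0.05) for _ in range(4))
    total = 0.0
    for _ in range(max_steps):
        state, reward, done = step(state, policy(state))
        total += reward
        if done:
            break
    return total

always_right = lambda s: 1
lean_follow = lambda s: 1 if s[2] + 0.5 * s[3] > 0 else 0  # push toward the lean

print(run_episode(always_right), run_episode(lean_follow))
```

Even the hand-written `lean_follow` heuristic keeps the pole up far longer than a constant push, which is a useful baseline to compare a learned policy against.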
State: Point cloud of cloth surface (thousands of 3D points) + robot arm joint angles.
Action: Pick point (x, y) on cloth, place at target (x', y'). Continuous, high-dimensional.
Transition: Deformable body physics — wildly stochastic. Same pick can fold or crumple depending on initial wrinkles.
Reward: IoU (intersection over union) between current cloth shape and target folded shape. Hard to compute in real time.
In Atari, the reward is the game score — given to you for free. In robotics, you have to invent the reward. Want a robot to pour water into a glass? You need a sensor that measures water level. Want it to fold a shirt neatly? You need to define "neat" mathematically. This is reward engineering, and it's often harder than the learning itself.
A robot told to "move the cup to the target" with reward = −(distance to target) learns to fling the cup at maximum speed. It reaches the target — and shatters the cup. The reward function was technically satisfied. This failure mode, called reward hacking, is ubiquitous in robotics. The robot optimizes exactly what you told it, not what you meant.
Training in the real world is slow (one trial per second) and dangerous (robots break things). So we simulate. But simulation is imperfect: friction coefficients are wrong, lighting is too uniform, contact dynamics are simplified. A policy trained in simulation may fail completely on the real robot. This sim-to-real gap is the central practical challenge of robot learning.
Simulation gives you unlimited data but imperfect physics. The real world gives you perfect physics but limited data. Every robot learning method is, at its core, a strategy for navigating this tradeoff.
Computer vision asks: "What's in this image?" Robot vision asks: "What's in this image and what should I do about it?" The difference is profound. A vision model can take 100ms to classify an image. A robot reaching toward a moving object needs perception at 30 Hz with sub-centimeter accuracy. Perception is not a preprocessing step — it's part of the control loop.
Every robot, from a Roomba to a surgical arm, runs the same loop:
| Sensor | What It Measures | Strengths | Weaknesses |
|---|---|---|---|
| RGB Camera | Color images (pixels) | Rich, cheap, abundant data | Depth-ambiguous, lighting-sensitive |
| Depth / RGBD | Per-pixel distance | 3D geometry directly | Noisy on reflective/transparent surfaces |
| LiDAR | 3D point clouds | Precise, long-range | Sparse, expensive, no color |
| Tactile | Contact pressure maps | Crucial for manipulation | Local (only where touching) |
| Force/Torque | Forces at wrist/joints | Detects contact events | No spatial information |
| Proprioception | Joint angles, velocities | Always available, precise | Only knows robot, not world |
Raw sensor data is massive — a single RGB image is 640×480×3 = 921,600 numbers. A policy network that takes raw pixels as input must learn both perception and control simultaneously. This is like asking someone to learn chess while blindfolded, interpreting the board from descriptions of pixel colors.
The choice of state representation — what you extract from sensors before feeding to the policy — is perhaps the most consequential design decision in robot learning. Here are the main options:
Feed the image directly to a CNN. Maximum information, but requires learning visual features and control jointly. Works when you have millions of training frames (simulated Atari). Struggles with real-robot data budgets of thousands of episodes.
Use a pretrained vision encoder (ResNet, ViT, CLIP) to extract a compact feature vector. The policy sees a 512-dimensional vector instead of a 921K-dimensional image. Faster to train, more sample-efficient, but the features may miss task-relevant details (like the exact angle of a screw).
Detect a sparse set of 2D or 3D points on objects of interest — corners, handles, edges. The state becomes ~10–50 (x, y, z) coordinates. Extremely compact and interpretable. But keypoint detectors can fail on novel objects, and you lose shape information between keypoints.
A set of 3D points from depth sensors or LiDAR. Richer than keypoints (captures full geometry), sparser than pixels. Works well with PointNet-style architectures. The gold standard for manipulation tasks involving complex 3D shapes.
More compressed representations (keypoints) need less data but lose information. Less compressed representations (pixels) preserve information but need orders of magnitude more training. The sweet spot depends on your data budget: 100 real demos → use keypoints. 100,000 sim episodes → use pixels. Millions of internet images → use pretrained features.
You know RL from Atari: train a DQN on millions of frames, achieve superhuman Breakout scores. Now try to apply the same recipe to a real robot arm. The robot gets one attempt per 5 seconds. At Atari's 200 million frame budget, training would take 31 years. Clearly, we need different strategies.
Let's recall why DQN worked so spectacularly for games. The architecture is simple: a CNN takes 4 stacked grayscale frames (84×84 each) as input, passes them through three conv layers and two fully-connected layers, and outputs one Q-value per possible action (4–18 joystick actions). The agent picks $\arg\max_a Q(s, a)$ with ε-greedy exploration.
Three innovations made this work:
Store every transition (s, a, r, s') in a buffer. Sample mini-batches randomly for training. This breaks temporal correlations (consecutive game frames are nearly identical) and reuses each transition many times. One real interaction generates dozens of gradient updates.
The Bellman target $y = r + \gamma \max_{a'} Q(s', a')$ uses Q itself, so the target moves as Q improves. This is like trying to hit a moving goalpost. Solution: freeze a copy $Q_{\text{target}}$ and update it only every 10,000 steps. The target is now locally stable.
A single frame is ambiguous: is the ball moving left or right? Stacking 4 consecutive frames gives the CNN access to velocity information without explicitly computing it.
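The three ingredients can be sketched in a few lines. This is a minimal illustration with a dictionary standing in for the Q-network; the class and function names, the discount factor, and the padding rule in `stack_frames` are assumptions made for the example.

```python
import random
from collections import deque

GAMMA = 0.99          # discount factor (assumed)
TARGET_SYNC = 10_000  # steps between target updates, as quoted in the text

class ReplayBuffer:
    def __init__(self, capacity):
        self.data = deque(maxlen=capacity)  # oldest transitions are dropped first

    def add(self, s, a, r, s_next, done):
        self.data.append((s, a, r, s_next, done))

    def sample(self, batch_size, rng):
        # Uniform sampling breaks the correlation between consecutive frames
        return rng.sample(list(self.data), batch_size)

def bellman_target(r, s_next, done, q_target, n_actions):
    """y = r + gamma * max_a' Q_target(s', a'); no bootstrap at episode end."""
    if done:
        return r
    return r + GAMMA * max(q_target.get((s_next, a), 0.0) for a in range(n_actions))

def stack_frames(frames, k=4):
    """Return the last k frames as one observation, repeating the oldest to pad."""
    frames = list(frames)[-k:]
    return tuple([frames[0]] * (k - len(frames)) + frames)

buffer = ReplayBuffer(capacity=3)
for t in range(5):
    buffer.add(t, 0, 1.0, t + 1, False)
batch = buffer.sample(2, random.Random(0))
frozen_q = {("s1", 0): 2.0, ("s1", 1): 5.0}   # frozen copy of the Q-table
print(bellman_target(1.0, "s1", False, frozen_q, n_actions=2))
```

In a full agent the frozen table (or network) would be overwritten with the online one every `TARGET_SYNC` steps; that loop is omitted here.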
DQN's three innovations are necessary but not sufficient for robotics. Here's what else you need:
DQN used 200 million Atari frames. A real robot arm doing bin-picking at 5 Hz gets 18,000 interactions per hour. At DQN's data appetite, training takes 11,111 hours = 1.3 years of continuous operation. And that's assuming no hardware failures, no resets, no human supervision. Model-free RL on real robots requires fundamentally better sample efficiency.
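The two back-of-the-envelope figures quoted here (about 31 years at one attempt per 5 seconds, and about 1.3 years at 5 Hz) can be reproduced with a short calculation. The only assumption is that one Atari frame is counted as one robot interaction.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600
DQN_FRAMES = 200_000_000

# Case 1: one attempt every 5 seconds
years_at_one_per_5s = DQN_FRAMES * 5 / SECONDS_PER_YEAR

# Case 2: bin-picking at 5 Hz, running around the clock
interactions_per_hour = 5 * 3600
hours_needed = DQN_FRAMES / interactions_per_hour
years_at_5hz = hours_needed / (24 * 365.25)

print(f"{years_at_one_per_5s:.1f} years at one attempt per 5 s")
print(f"{interactions_per_hour} interactions/hour, {hours_needed:,.0f} hours, "
      f"{years_at_5hz:.2f} years at 5 Hz")
```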
If sim-to-real is the problem, make simulation harder than reality. Domain randomization trains the policy across thousands of simulated environments with randomly varied physics: friction ∈ [0.1, 1.0], gravity ∈ [9.5, 10.1], object mass ∈ [0.5, 2.0]×, camera position jittered by ±5cm, lighting randomly placed. If the policy works across all these variations, it will likely work in the one real-world setting too.
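In practice, domain randomization amounts to drawing a fresh set of simulator parameters at the start of every episode. A minimal sketch, using the ranges listed above; the parameter names and the uniform sampling are assumptions for illustration, and the dictionary would be handed to whatever simulator you use.

```python
import random

# Ranges copied from the text; the parameter names are illustrative.
RANGES = {
    "friction": (0.1, 1.0),
    "gravity": (9.5, 10.1),
    "mass_scale": (0.5, 2.0),
}
CAMERA_JITTER_M = 0.05  # +/- 5 cm per axis

def sample_sim_params(rng):
    """Draw one randomized environment configuration."""
    params = {name: rng.uniform(lo, hi) for name, (lo, hi) in RANGES.items()}
    params["camera_offset"] = tuple(
        rng.uniform(-CAMERA_JITTER_M, CAMERA_JITTER_M) for _ in range(3))
    return params

rng = random.Random(0)
episodes = [sample_sim_params(rng) for _ in range(1000)]  # one config per episode
print(episodes[0])
```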
AlphaGo (2016) → AlphaGo Zero (2017) → AlphaZero (2017) → MuZero (2020): This progression illustrates a key theme. AlphaGo used human expert games. AlphaGo Zero eliminated human data entirely, learning from self-play alone — and became stronger. AlphaZero generalized to chess and shogi. MuZero learned without even knowing the rules of the game, learning its own world model internally. Each step reduced the human knowledge required.
Quadrupedal Locomotion: Policies trained in simulation with domain randomization transfer to real quadruped robots walking over stairs, rubble, and ice. The key insight: locomotion is a relatively low-dimensional control problem (12 joint angles) with fast feedback (IMU at 200Hz), making sim-to-real tractable.
Dexterous Manipulation (OpenAI, 2019): A simulated Shadow Hand learned to solve a Rubik's cube using RL with massive domain randomization: 6144 parallel simulations, randomizing 37 physics parameters. The policy transferred to the real hand — but required billions of simulated episodes and months of compute.
Model-free RL works for robots when: (1) you can simulate the task cheaply, (2) the action space is moderate-dimensional, and (3) you can afford enormous compute. For tasks where simulation is unreliable (deformable objects, liquids, soft contacts) or data is limited, we need other approaches: model-based RL, imitation learning, or foundation models.
Let π be a policy trained in simulation. The sim-to-real gap can be formalized as a domain adaptation problem. Let $d_{\text{sim}}$ denote the state distribution under the simulator dynamics, and $d_{\text{real}}$ the state distribution under the real dynamics. The expected return gap satisfies:
$$|J_{\text{real}}(\pi) - J_{\text{sim}}(\pi)| \le \frac{2\gamma}{(1-\gamma)^2}\,\varepsilon_{\text{model}}$$
where $\varepsilon_{\text{model}} = \max_{s,a} \mathrm{TV}\big(P_{\text{sim}}(s' \mid s,a),\, P_{\text{real}}(s' \mid s,a)\big)$.
Your task: (1) Derive this bound using the simulation lemma (relate $J_{\text{real}} - J_{\text{sim}}$ to the per-step transition error). (2) Show why the $(1-\gamma)^2$ denominator means even small per-step errors compound catastrophically for long horizons. (3) Explain how domain randomization reduces the effective $\varepsilon_{\text{model}}$.
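Step 1 (Simulation Lemma): For two MDPs $M_1$, $M_2$ with the same reward but different transitions, and policy π:
$$|J_1(\pi) - J_2(\pi)| \le \frac{2R_{\max}}{1-\gamma} \sum_{t=0}^{\infty} \gamma^t \, \mathrm{TV}(d_t^1, d_t^2)$$
Step 2 (Bound per-step divergence): At each step, the new state distributions satisfy $\mathrm{TV}(d_{t+1}^1, d_{t+1}^2) \le \mathrm{TV}(d_t^1, d_t^2) + \varepsilon_{\text{model}}$. By induction: $\mathrm{TV}(d_t^1, d_t^2) \le t \cdot \varepsilon_{\text{model}}$.
Step 3 (Sum the series): $\sum_{t=0}^{\infty} \gamma^t \, t = \gamma/(1-\gamma)^2$. So: $|J_{\text{real}} - J_{\text{sim}}| \le 2R_{\max}\,\gamma\,\varepsilon_{\text{model}} / (1-\gamma)^3$. (With $R_{\max}=1$ normalization and tighter analysis, you get the $(1-\gamma)^2$ version.)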
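Step 4 (Implication for long horizons): With γ = 0.99 (effective horizon ~100), the $(1-\gamma)^2$ bound is roughly 100× worse per unit $\varepsilon_{\text{model}}$ than with γ = 0.9 (effective horizon ~10): the coefficient $2\gamma/(1-\gamma)^2$ is 19,800 versus 180, a factor of 110. This is why MPC, which re-plans every step and therefore only relies on the model over a short horizon, transfers so much better than long-horizon RL policies.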
The key insight: The sim-to-real gap isn't just about how accurate your simulator is ($\varepsilon_{\text{model}}$); it's about how that error compounds over time. Short-horizon methods (MPC, reactive policies) are inherently more robust to sim-to-real gaps than long-horizon planning.
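To see how sharply the horizon matters, the coefficient in front of the model error can be evaluated numerically. This sketch simply evaluates the $(1-\gamma)^2$ form of the bound and checks the series identity used in Step 3 with a truncated sum; the function name is made up for the example.

```python
def bound_factor(gamma):
    """Coefficient 2*gamma / (1 - gamma)^2 that multiplies eps_model in the bound."""
    return 2 * gamma / (1 - gamma) ** 2

for gamma in (0.9, 0.99):
    print(f"gamma={gamma}: effective horizon ~{round(1 / (1 - gamma))}, "
          f"coefficient ~{round(bound_factor(gamma))}")

ratio = bound_factor(0.99) / bound_factor(0.9)

# Numerical check of sum_t gamma^t * t = gamma / (1 - gamma)^2 for gamma = 0.9
series = sum(0.9 ** t * t for t in range(2000))
closed_form = 0.9 / (1 - 0.9) ** 2
print(f"ratio ~{ratio:.0f}, series {series:.4f} vs closed form {closed_form:.4f}")
```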
Model-free RL discards the transition structure: it only remembers (s, a, r, s') tuples and uses them to update Q-values or policy gradients. It doesn't try to understand WHY s led to s' after action a. This means every new task needs fresh exploration from scratch.
Model-based methods learn the dynamics $f(s, a) \to s'$. Once learned, this model can be queried for any action sequence, generating unlimited imaginary rollouts. One real transition teaches the model about physics that generalizes to many hypothetical actions. A rough way to picture the sample-efficiency gain is the branching factor: with 10 possible actions, a model that generalizes well can use one real transition to improve its predictions for all 10 next states, while model-free RL only learns about the one action taken. This is an intuition rather than a guarantee, since it depends on how well the model generalizes across actions.
The tradeoff: model-based methods are only as good as their model. For contact-rich tasks where the model is inaccurate, model-free methods avoid compounding model errors.
Model-free RL treats the environment as a black box: act, observe reward, update policy. Model-based RL opens the box. Instead of learning a policy directly, first learn a world model — a neural network that predicts what happens next:
Once you have a world model, you can plan: simulate thousands of action sequences in your head (in the model), pick the one with the best predicted outcome, execute only the first action, then re-plan. This is called Model Predictive Control (MPC).
The model is imperfect. After one real step, the actual state $s_{t+1}$ differs from the predicted $\hat{s}_{t+1}$. By re-planning from the real state every step, MPC self-corrects. It only trusts the model for one step at a time. This is robustness through humility.
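The plan, execute-one-action, re-plan loop can be illustrated on a toy problem. The sketch below assumes a 1D point that must reach a goal, a "learned" model whose actuator gain is deliberately wrong, and random-shooting as the planner; all names, gains, and hyperparameters are made up for the example.

```python
import random

GOAL = 5.0

def real_step(x, u):
    return x + 0.8 * u      # the "real world": actuator is weaker than modeled

def model_step(x, u):
    return x + 1.0 * u      # imperfect world model (assumed gain error)

def plan(x, rng, horizon=3, n_candidates=300):
    """Random-shooting MPC: sample action sequences and score them in the model."""
    best_cost, best_first = float("inf"), 0.0
    for _ in range(n_candidates):
        seq = [rng.uniform(-1.0, 1.0) for _ in range(horizon)]
        x_hat, cost = x, 0.0
        for u in seq:
            x_hat = model_step(x_hat, u)
            cost += abs(x_hat - GOAL)
        if cost < best_cost:
            best_cost, best_first = cost, seq[0]
    return best_first        # only the first action is executed

def run_mpc(steps=30, seed=0):
    rng, x = random.Random(seed), 0.0
    for _ in range(steps):
        x = real_step(x, plan(x, rng))   # re-plan from the real state every step
    return x

def run_open_loop(steps=30):
    # Plan once in the model (five unit pushes reach the goal there), never correct.
    x = 0.0
    for t in range(steps):
        x = real_step(x, 1.0 if t < 5 else 0.0)
    return x

print("open-loop final position:", run_open_loop())
print("MPC final position:      ", run_mpc())
```

The open-loop plan inherits the full model error and stops short of the goal, while the re-planning loop keeps correcting from the observed state.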
What should the world model predict? The choice of state representation completely determines the model's difficulty:
Finn & Levine (2017) trained a model to predict future video frames given actions. Input: current frame + robot action. Output: predicted next frame. This is maximally general — it works for any task visible in a camera — but generating photorealistic futures is extremely hard. Predicted frames get blurry after 5–10 steps as uncertainty compounds.
Manuelli et al. (2020) detected sparse keypoints on objects and predicted their future positions. Instead of predicting 307,200 pixel values, you predict ~30 (x, y, z) coordinates. Dramatically easier to learn, dramatically more sample-efficient. But keypoint detection must generalize to novel objects, and you lose shape information.
Wang et al. (2023) represented objects as sets of particles (like a point cloud) and used graph neural networks (GNNs) to predict particle interactions. Each particle is a node; nearby particles are connected by edges. The GNN learns local physics: when two particles are close, they repel (rigidity) or attract (cohesion). This captures deformable objects, fluids, and granular materials that keypoints and pixels struggle with.
A perfect model lets you plan arbitrarily far ahead. An imperfect model accumulates errors with each predicted step. If each step adds a relative error ε that compounds multiplicatively, then after H steps the error grows roughly as:
$$\text{error}(H) \approx (1+\varepsilon)^H - 1$$
This means a 5% per-step error becomes a 34% error at H = 6, and a 108% error at H = 15. Long-horizon planning requires either very accurate models or short re-planning intervals.
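The 34% and 108% figures are consistent with multiplicative compounding, $(1+\varepsilon)^H - 1$. A quick check under that assumption (the function name is made up for the example):

```python
def compounded_error(per_step_error, horizon):
    """Relative error after `horizon` steps if each step multiplies the error in."""
    return (1 + per_step_error) ** horizon - 1

for h in (1, 6, 15):
    print(f"H = {h:2d}: {compounded_error(0.05, h):.0%}")
```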
Model-based methods struggle with contact-rich tasks: inserting a peg into a hole, turning a key in a lock, tying a knot. These involve sudden discontinuous dynamics (contact/no-contact transitions) that smooth neural network models approximate poorly. For such tasks, model-free methods or imitation learning often work better.
Model-based methods excel when: (1) the dynamics are relatively smooth (pushing, reaching, locomotion), (2) data is scarce (tens of episodes, not millions), and (3) the task requires long-horizon reasoning (multi-step assembly). With 100 real-world episodes, a learned model + MPC can solve tasks that model-free RL needs 100,000 episodes for.
What if you skip reward engineering, skip simulation, and just show the robot what to do? A human teleoperates the arm, picks up the cup, and places it on the shelf. Record the observation-action pairs. Train a policy to mimic them. This is imitation learning — and it's the fastest path from zero to a working robot policy.
The simplest form: treat it as supervised learning. You have a dataset of expert demonstrations $D = \{(o_1, a_1), (o_2, a_2), \ldots\}$. Train a neural network $\pi_\theta(a \mid o)$ to predict the expert's action from the observation:
This is literally just regression. And it works — for about three seconds. Then the robot drifts off the expert's trajectory, encounters states the expert never visited, and has no idea what to do. This failure mode has a name.
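To make "just regression" concrete, here is a minimal sketch of behavioral cloning with a one-parameter linear policy trained by gradient descent on mean squared error. The expert (a proportional controller), the noise level, and the learning rate are all assumptions for the example; a real policy would be a neural network over images or state vectors.

```python
import random

def expert_policy(obs):
    return -2.0 * obs            # hypothetical expert: proportional controller

def collect_demos(n, rng):
    """Record (observation, action) pairs from the expert, with small label noise."""
    demos = []
    for _ in range(n):
        obs = rng.uniform(-1.0, 1.0)
        demos.append((obs, expert_policy(obs) + rng.gauss(0.0, 0.05)))
    return demos

def fit_bc(demos, lr=0.1, epochs=200):
    """Minimize the mean squared error between predicted and expert actions."""
    k = 0.0                      # policy: a = k * o
    for _ in range(epochs):
        grad = sum(2 * (k * o - a) * o for o, a in demos) / len(demos)
        k -= lr * grad
    return k

k = fit_bc(collect_demos(200, random.Random(0)))
print(f"learned gain: {k:.3f} (expert gain: -2.0)")
```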
During training, the policy sees states from the expert's trajectory. During execution, the policy sees states from its own trajectory. Even a tiny per-step error ε causes the robot to visit states the expert never demonstrated. In those states, the policy's prediction is garbage, causing more drift, causing worse predictions.
At step 1, you make error ε. At step 2, you're in a slightly wrong state, so your error is ε + ε · δ where δ reflects the distributional shift. At step t, you've accumulated O(t) drift, and each drifted step adds O(ε) error. Total: $\sum_{t=1}^{T} O(t\varepsilon) = O(\varepsilon T^2)$. For a 100-step episode with ε = 0.01, BC's error grows to 0.01 × 10,000 = 100. Completely diverged.
DAgger (Ross et al., 2011) fixes distribution shift by collecting data on the policy's own trajectory. The idea is elegant:
DAgger reduces the error from $O(\varepsilon T^2)$ to $O(\varepsilon T)$: linear instead of quadratic. The key: by training on states the policy actually visits, you close the distribution gap. The price: you need an expert available to label on-policy states, which is expensive.
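The DAgger loop (run the learner, have the expert label the states it visited, aggregate, retrain) can be sketched on a toy lane-keeping problem. Everything here is an assumption made for illustration: the integer lane offset, the random wind, the lookup-table policy that does nothing in unseen states, and the omission of DAgger's expert/learner mixing schedule.

```python
import random

MAX_X = 10

def expert(x):
    return 1 if x > 0 else 0          # steer back whenever off-center

def env_step(x, a, rng):
    wind = rng.choices([0, 1, 2], weights=[0.6, 0.3, 0.1])[0]
    return min(MAX_X, max(0, x - a + wind))

def rollout(policy, steps, rng):
    """Run `policy` from the lane center and return the visited states."""
    x, visited = 0, []
    for _ in range(steps):
        visited.append(x)
        x = env_step(x, policy(x), rng)
    return visited

def train(dataset):
    table = dict(dataset)             # "supervised learning" for a lookup-table policy
    return lambda x: table.get(x, 0)  # unseen state: the policy does nothing

def mean_offset(policy, seed=123, episodes=20, steps=500):
    rng = random.Random(seed)
    states = [x for _ in range(episodes) for x in rollout(policy, steps, rng)]
    return sum(states) / len(states)

rng = random.Random(0)
# Behavioral cloning on one short expert demonstration
dataset = [(x, expert(x)) for x in rollout(expert, 10, rng)]
bc_policy = train(dataset)

# DAgger: roll out the learner, query the expert on its states, aggregate, retrain
policy = bc_policy
for _ in range(5):
    visited = rollout(policy, 500, rng)
    dataset += [(x, expert(x)) for x in visited]
    policy = train(dataset)

print("BC mean offset:    ", mean_offset(bc_policy))
print("DAgger mean offset:", mean_offset(policy))
```

The cloned policy is fine until the wind pushes it into a state the short demonstration never covered, after which it never recovers. The DAgger policy has expert labels for exactly those states, because it collected them on its own trajectories.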
Consider a T-intersection. The expert sometimes turns left, sometimes right. BC with a Gaussian output averages them: go straight into the wall. The Gaussian policy outputs the mean of the expert's actions, which is a terrible action that no expert ever took.
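The averaging failure is easy to reproduce numerically. This sketch assumes hypothetical steering demonstrations clustered around −1 (left) and +1 (right); the best single prediction under squared error is their mean, which lands near 0 where no demonstration lies.

```python
import random
import statistics

rng = random.Random(0)
# Hypothetical steering demos at a T-intersection: -1 = hard left, +1 = hard right
expert_actions = [rng.choice([-1.0, 1.0]) + rng.gauss(0.0, 0.05) for _ in range(1000)]

# The MSE-optimal constant prediction (and the mean of a fitted Gaussian) is the sample mean
mse_action = statistics.fmean(expert_actions)
closest_demo = min(abs(a - mse_action) for a in expert_actions)

print(f"predicted steering: {mse_action:+.3f}")
print(f"distance to the nearest expert action: {closest_demo:.3f}")
```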
Florence et al. (2022) used energy-based models: instead of predicting a single action, learn an energy function E(o, a) that assigns low energy to good actions and high energy to bad ones. At test time, find $\arg\min_a E(o, a)$ via gradient descent in action space. This naturally captures multimodality: multiple action modes can all have low energy.
Chi et al. (2023) used diffusion models to generate action sequences. Start with Gaussian noise, iteratively denoise into a coherent action trajectory. Because the diffusion model learns the full distribution (not just the mean), it can produce diverse, multimodal action sequences. It also generates action chunks — entire trajectories of 8–16 future actions at once, providing temporal consistency.
Single-step action prediction is jittery: the robot reconsiders its plan every 50ms. Action chunking commits to a short trajectory (0.5–1.0 seconds of future actions), producing smoother, more human-like motion. It also reduces the effective horizon from T to T/chunk_size, mitigating compounding error.
Instead of cloning actions, infer the reward function the expert must be optimizing. Then use RL with that learned reward. This is more robust than BC because the reward function transfers across different dynamics (a reward for "cup on shelf" works even if the robot arm is different from the demonstrator's).
The intuition: find a reward function under which the expert's behavior has higher value than any other policy's. The entropy term H(π) prevents degenerate solutions where R = 0 everywhere (trivially making everyone "optimal").
Behavioral cloning has a famous failure mode: error compounds quadratically with horizon. The bound is: Total error ≤ $\varepsilon \cdot T^2$, where ε is the per-step policy error and T is the trajectory length.
Your task: (1) Derive this $T^2$ bound by showing how distribution shift causes errors to compound. (2) Show that DAgger achieves $O(\varepsilon \cdot T)$ instead, and explain what breaks the quadratic compounding. (3) For a contact-rich task like peg-in-hole with T=200 steps, how many demonstrations do you need to keep total error below 10% success degradation?
Part 1 ($T^2$ bound): Let $\varepsilon = \mathbb{E}_{s \sim d_{\pi^*}}[\mathrm{TV}(\pi_\theta(s), \pi^*(s))]$. At time t, the policy has accumulated drift from the expert. The state distribution at time t satisfies $\mathrm{TV}(d_t^{\pi_\theta}, d_t^{\pi^*}) \le t\varepsilon$ (each step adds at most ε divergence). The cost at time t under the wrong distribution is bounded by tε. Total cost: $\sum_{t=1}^{T} t\varepsilon = \varepsilon T(T+1)/2 = O(\varepsilon T^2)$. ■
Part 2 (DAgger's linear bound): DAgger trains on $d_{\pi_\theta}$ (the policy's own distribution). No distribution shift means the per-step error is just ε everywhere, regardless of timestep. Total cost: $T\varepsilon = O(\varepsilon T)$. ■
Part 3 — Why contact tasks are hard: Contact introduces discontinuities. A 1mm error in position can mean the difference between "peg enters hole" and "peg hits surface and jams." The effective ε for contact tasks is much larger than for free-space motion because the loss landscape is discontinuous. This is why diffusion policies (which model multi-modal action distributions) outperform MSE-based BC: they capture the "insert from left OR insert from right" multi-modality that MSE averages to "insert from straight ahead" (which fails).
The key insight: BC's $T^2$ bound means the tolerable per-step error shrinks quadratically as the horizon grows, so long-horizon contact tasks need far more data than short-horizon ones. The field's response: (1) reduce T via hierarchical policies, (2) use DAgger-style online methods, or (3) use expressive policies (diffusion, flow matching) that represent multi-modality so a single demonstration teaches more.
System identification finds simulator parameters φ (friction, mass, damping) that minimize the trajectory mismatch between sim and reality. Given N real trajectories $\tau_{\text{real}} = \{(s_0, a_0, s_1, \ldots, s_T)\}_{1:N}$:
$$\phi^* = \arg\min_{\phi} \sum_{i=1}^{N} \sum_{t=0}^{T-1} \big\| f_\phi(s_t^{(i)}, a_t^{(i)}) - s_{t+1}^{(i)} \big\|^2$$
Your task: (1) Derive the gradient $\nabla_\phi$ of this objective when $f_\phi$ is a differentiable simulator (like Brax or MuJoCo MJX). (2) Explain why you need multi-step rollout loss (not single-step) for accurate identification. (3) Show how CMA-ES can solve this when the simulator is NOT differentiable.
Differentiable case: With autodiff through the simulator: $\nabla_\phi L = \sum_t 2(\hat{s}_{t+1} - s_{t+1}) \cdot \left(\frac{\partial f}{\partial \phi} + \frac{\partial f}{\partial s} \cdot \frac{d\hat{s}_t}{d\phi}\right)$. The second term is the "chain through time": errors in earlier predictions affect later ones through the state. This is BPTT through the physics engine.
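Multi-step is crucial: Consider fitting mass when the real mass is 2 kg and the sim mass is 1.8 kg (a 10% error). For a block pushed with a known force F, the single-step prediction error is tiny and easily lost in sensor noise. Over a 50-step rollout the velocity error is 50× larger, and the position error, which grows as $\tfrac{1}{2} F \left|\tfrac{1}{m_{\text{sim}}} - \tfrac{1}{m_{\text{real}}}\right| t^2$, is about 2,500× larger than after one step. Mass has to be identified from forced motion like this; free fall carries no information about it, because gravitational acceleration is independent of mass. Multi-step loss catches this compounding and drives φ toward the true value.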
CMA-ES procedure: (1) Initialize μ = default params, Σ = 0.1·I. (2) Sample λ=50 candidates φi ~ N(μ, Σ). (3) For each φi: run sim with those params, compute RMSE vs real. (4) Rank candidates by fitness. (5) Update μ, Σ toward top-ranked candidates. (6) Repeat until convergence (~100-500 generations).
The key insight: System identification is an optimization problem where the objective is "make simulation match reality." The choice of optimization method (gradient vs black-box) depends on whether your simulator is differentiable. The choice of loss (single-step vs multi-step) determines whether you catch compounding errors.
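The black-box route can be sketched with a much simpler relative of CMA-ES. The code below is a one-parameter, cross-entropy-style evolution strategy (sample, rank, refit mean and standard deviation to the elite); it omits the covariance adaptation and step-size control that define real CMA-ES. The pushed-block simulator, the known force and damping, and all hyperparameters are assumptions for the example.

```python
import math
import random

DT, STEPS, FORCE, DAMPING = 0.05, 100, 4.0, 0.5   # assumed known quantities

def simulate(mass):
    """Multi-step rollout of a pushed block; returns the position trajectory."""
    x, v, xs = 0.0, 0.0, []
    for _ in range(STEPS):
        v += DT * (FORCE - DAMPING * v) / mass
        x += DT * v
        xs.append(x)
    return xs

def rmse(sim_xs, real_xs):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(sim_xs, real_xs)) / len(real_xs))

def identify_mass(real_xs, mean=1.0, std=0.5, pop=50, elite=10, generations=30, seed=0):
    """Sample candidates, rank by rollout RMSE, refit the sampling distribution."""
    rng = random.Random(seed)
    for _ in range(generations):
        candidates = [max(0.1, rng.gauss(mean, std)) for _ in range(pop)]
        ranked = sorted(candidates, key=lambda m: rmse(simulate(m), real_xs))
        best = ranked[:elite]
        mean = sum(best) / elite
        std = max(1e-4, math.sqrt(sum((m - mean) ** 2 for m in best) / elite))
    return mean

real_trajectory = simulate(2.0)     # stand-in for logged real data (true mass = 2 kg)
estimated_mass = identify_mass(real_trajectory)
print(f"estimated mass: {estimated_mass:.3f} kg")
```

Note that the fitness function is the multi-step rollout error, not a one-step prediction error, in line with the argument above. With several parameters that interact (mass and damping, say), the full covariance update of CMA-ES becomes worthwhile.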
The pattern in NLP was clear: pretrain a giant model on diverse data, then fine-tune for specific tasks. GPT proved it for text. CLIP proved it for vision+language. Can the same pattern work for robots? Train one model on data from many robots doing many tasks, then fine-tune it for your specific robot and task?
A VLA takes three inputs: what the robot sees (camera image), what it should do (language instruction like "pick up the red cup"), and outputs motor commands. It's a VLM with an action head instead of a text decoder.
A VLM (like GPT-4V) outputs text tokens. A VLA outputs continuous motor commands (7D: xyz position + roll/pitch/yaw rotation + gripper open/close). This seems like a small change, but it's fundamental. Text tokens are discrete and language-universal. Motor commands are continuous, robot-specific (a 6-DOF arm and a quadruped have different action spaces), and must be physically executable at 10–50 Hz.
| Model | Date | Key Innovation | Scale |
|---|---|---|---|
| RT-1 | Dec 2022 | First large-scale robot transformer. FiLM-conditioned EfficientNet + TokenLearner. 130K real episodes from 13 robots. | 35M params |
| RT-2 | Jul 2023 | Repurposed a VLM (PaLI-X) as robot controller. Actions tokenized as text: "1 128 91 241 5 101 127". Emergent reasoning — "pick up something that's NOT a banana" works without being trained on negation. | 55B params |
| RT-X | Oct 2023 | Cross-embodiment dataset: 22 robot types, 160,000 tasks, 1M+ episodes. Showed that training on diverse robots helps every individual robot. | Multi-dataset |
| OpenVLA | Jun 2024 | Open-source VLA. Llama 2 backbone + DINOv2 vision. First VLA the community could actually use and fine-tune. | 7B params |
| π-Zero | Oct 2024 | Flow matching for action generation. Pre-train on diverse robot data, post-train with flow matching for smooth, multimodal actions. Cross-embodiment + cross-task. | 3B params |
RT-2's insight is almost absurdly simple. Take PaLI-X, a 55-billion parameter VLM trained on web images and text. It already understands visual concepts ("red cup," "left side of table"). Now fine-tune it on robot data where actions are encoded as text tokens:
Input: Camera image + "move the apple to the blue bowl"
Output text: "1 128 91 241 5 101 127"
Each number is a discretized action dimension: [terminate, x, y, z, roll, pitch, yaw, gripper]. The VLM generates these as regular text tokens — it just happens that these tokens are motor commands.
Emergent capability: RT-2 can follow instructions it was never trained on ("pick up the extinct animal" → picks up the plastic dinosaur) because PaLI-X learned these concepts from the internet.
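The action-as-text idea boils down to quantizing each continuous dimension into an integer bin and printing the integers. A minimal sketch follows; the bin count of 256 and the normalized range of [-1, 1] are assumptions here (the text above only shows the resulting token string), and the function names are made up.

```python
N_BINS = 256                      # assumed bin count
LOW, HIGH = -1.0, 1.0             # assumed normalized range for each dimension

def encode_action(terminate, continuous):
    """Map continuous action dimensions to integer bins and join them as text."""
    tokens = [int(terminate)]
    for value in continuous:
        clipped = min(HIGH, max(LOW, value))
        tokens.append(round((clipped - LOW) / (HIGH - LOW) * (N_BINS - 1)))
    return " ".join(str(t) for t in tokens)

def decode_action(text):
    """Inverse mapping: text tokens back to (terminate, continuous values)."""
    tokens = [int(t) for t in text.split()]
    values = [LOW + t / (N_BINS - 1) * (HIGH - LOW) for t in tokens[1:]]
    return bool(tokens[0]), values

action = [0.1, -0.5, 0.9, 0.0, 0.25, -1.0, 1.0]   # x, y, z, roll, pitch, yaw, gripper
text = encode_action(False, action)
terminate, recovered = decode_action(text)
print(text)
print([round(v, 3) for v in recovered])
```

The round trip is lossy: each recovered value can differ from the original by up to half a bin width, which is the discretization cost that continuous action heads try to avoid.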
Physical Intelligence's π-Zero takes the foundation model idea further with two phases:
Pre-training: Train on diverse robot interaction data from multiple robot types (arms, hands, bipeds). Actions from different robots are normalized into a common representation. The model learns general physical manipulation concepts: grasping, pushing, inserting, wiping.
Post-training with flow matching: Instead of tokenizing actions as text (lossy discretization), π-Zero uses flow matching to generate continuous action trajectories. Flow matching learns a vector field that transports a simple distribution (Gaussian noise) to the data distribution (expert actions). This produces smooth, multimodal action distributions — exactly what you need for dexterous manipulation.
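One common way to write a flow matching objective is shown below. It assumes a straight-line path between a noise sample $a_0$ and an expert action chunk $a_1$, with a network $v_\theta$ conditioned on the observation $o$; these symbols and the linear path are assumptions for illustration and not necessarily π-Zero's exact formulation.

```latex
a_t = (1 - t)\,a_0 + t\,a_1, \qquad
a_0 \sim \mathcal{N}(0, I), \quad a_1 \sim \text{expert actions}, \quad t \sim \mathcal{U}[0, 1]

\mathcal{L}(\theta) = \mathbb{E}_{t,\,a_0,\,a_1}
  \left\| v_\theta(a_t, t, o) - (a_1 - a_0) \right\|^2
```

At inference time, an action is generated by drawing $a_0$ from the Gaussian and integrating $\frac{da}{dt} = v_\theta(a, t, o)$ from $t = 0$ to $t = 1$, for example with a handful of Euler steps. No discretization into bins is involved, which is why the resulting actions stay continuous.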
Training on data from 22 different robots sounds like it would confuse the model. The opposite happens: shared manipulation concepts transfer across embodiments. A 7-DOF arm and a 16-DOF hand both need to approach objects, align grippers, and apply appropriate force. The foundation model learns these abstractions, and each robot benefits from the others' data.
In Chapter 4, we learned world models that predict state transitions. Now scale that idea up: what if the world model is a video generation model that predicts entire future videos conditioned on actions? This is the frontier of robot learning.
Given the current camera frame and a proposed action sequence, predict what the camera will see in the next 1–5 seconds. This is literally a video generation problem, but conditioned on robot actions:
Hafner et al. (2023) showed something remarkable: you can train a robot policy entirely inside a learned world model, with near-zero real-world interaction. The recipe:
DayDreamer trained a quadruped to walk in just 1 hour of real-world interaction — orders of magnitude less than model-free RL. The world model serves as a "dream simulator" that amplifies scarce real data into abundant imagined experience.
NVIDIA Cosmos (2025) pushes this further: train a world model on internet-scale video data (driving videos, manipulation clips, nature documentaries). The resulting model understands basic physics: objects fall, liquids flow, hands grasp. Fine-tune on your specific robot's data, and you get a world model that already knows about the physical world.
UniSim (2023) by Google DeepMind is a universal simulator that can generate realistic video of unseen actions in unseen environments. It's trained on diverse video data and can simulate what happens if you push an object, open a drawer, or rearrange items — all in environments it's never seen before.
Genie (2024) by Google DeepMind learned to generate interactive 2D worlds from a single image. Show it a screenshot of a platformer game, and it generates a playable world where actions have consistent consequences. This hints at a future where world models become general-purpose simulators, bootstrapped from video.
Current video world models reason in 2D pixel space. But the physical world is 3D. Two active research directions: (1) Structural priors — build 3D inductive biases (voxels, NeRFs, Gaussian splats) into the world model architecture. (2) Emergent 3D — train on enough 2D video and hope 3D understanding emerges, the way language models seem to learn world knowledge from text. The jury is out on which approach will win.
Instead of learning physics from data alone, encode known physical laws (Newton's equations, conservation of energy) as loss terms. The network's predictions must satisfy F = ma as a soft constraint. This dramatically reduces the data needed for accurate models, especially for rigid-body dynamics where the physics is well-understood.
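As a sketch of what "physics as a loss term" means, the function below adds a soft penalty on the residual of F = ma to an ordinary data-fitting loss. The function name, the scalar one-dimensional setting, and the weight of 0.1 are assumptions for illustration.

```python
def physics_informed_loss(pred_accels, observed_accels, forces, mass, weight=0.1):
    """Data-fitting MSE plus a soft penalty on the residual of F = m * a."""
    n = len(pred_accels)
    data_term = sum((p - o) ** 2 for p, o in zip(pred_accels, observed_accels)) / n
    physics_term = sum((f - mass * p) ** 2 for p, f in zip(pred_accels, forces)) / n
    return data_term + weight * physics_term

# Predictions that match the data AND obey F = m * a incur no loss
consistent = physics_informed_loss([1.0, 2.0], [1.0, 2.0], forces=[2.0, 4.0], mass=2.0)

# Predictions that fit (noisy) data but violate F = m * a are still penalized
violating = physics_informed_loss([1.5, 2.0], [1.5, 2.0], forces=[2.0, 4.0], mass=2.0)

print(consistent, violating)
```

The weight trades off trust in the data against trust in the physical prior; choosing it is a design decision that depends on how noisy the measurements are.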
Robot learning has achieved remarkable results in controlled settings. But deploying robots reliably in unstructured human environments remains stubbornly hard. Here are the open problems the field is wrestling with.
In NLP, evaluation is cheap: run the model on a test set. In robotics, evaluation means physical experiments. Each experiment takes minutes, requires human supervision, risks breaking hardware, and produces noisy results (the same policy might succeed 7/10 or 9/10 times depending on initial conditions). This makes it nearly impossible to do rigorous ablation studies.
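The noise in small physical evaluations can be quantified with the binomial distribution. The sketch below assumes a policy whose true success rate is 80% and asks how often a 10-trial evaluation would report 7/10 or worse, or 9/10 or better; the 80% figure is an assumption chosen for illustration.

```python
from math import comb

def binomial_pmf(k, n, p):
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

def prob_at_most(k, n, p):
    return sum(binomial_pmf(i, n, p) for i in range(k + 1))

TRUE_SUCCESS_RATE, TRIALS = 0.8, 10   # assumed for illustration
p_seven_or_fewer = prob_at_most(7, TRIALS, TRUE_SUCCESS_RATE)
p_nine_or_more = 1 - prob_at_most(8, TRIALS, TRUE_SUCCESS_RATE)

print(f"P(7/10 or worse)  = {p_seven_or_fewer:.2f}")
print(f"P(9/10 or better) = {p_nine_or_more:.2f}")
```

Under that assumption, roughly a third of 10-trial evaluations land at 7/10 or below and more than a third at 9/10 or above, so a two-success difference between ablations says very little on its own.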
A language model that generates a wrong answer wastes your time. A robot that executes a wrong action breaks a $50,000 arm, injures a human, or destroys the object it's manipulating. Safety isn't an afterthought — it's a prerequisite.
Robot policies work 95% of the time in the lab. In deployment, the 5% failure cases are catastrophic: an unseen object shape, an unusual lighting condition, a human reaching into the workspace. Moving from 95% to 99.99% reliability is qualitatively different from moving from 50% to 95%. It requires handling a combinatorial explosion of edge cases that no training set can exhaustively cover.
Current policies handle 10–30 second tasks: pick, place, push. Real-world useful tasks are minutes to hours: cook a meal, clean a room, assemble furniture. Long-horizon tasks require hierarchical planning (decompose "make coffee" into "grind beans," "boil water," "pour," "wait," "pour again"), error recovery (if the mug tips over, re-plan), and memory (where did I put the sugar?).
A robot deployed in a home must adapt to that home's specific quirks: the sticky drawer, the heavy fridge door, the cat that knocks things off tables. Current models are trained once and deployed frozen. Continual learning without catastrophic forgetting (adapting to the drawer without forgetting how to pour water) is an unsolved problem.
Two robot arms collaboratively folding a sheet. A fleet of warehouse robots navigating without collisions. In multi-agent coordination the joint state space grows exponentially with the number of agents, and the setting introduces communication, synchronization, and credit-assignment challenges that single-robot methods don't face.
Language is increasingly the interface between humans and robots. But mapping language to action is fraught: "put it over there" requires grounding "it" (what object?), "over there" (where exactly?), and "put" (place? throw? slide?). Foundation models bring linguistic understanding, but grounding that understanding in physical reality — connecting words to forces and positions — remains a fundamental challenge.
Current foundation models for robots are pretrained on internet data and fine-tuned on robot data. But internet data is overwhelmingly about describing the world (text, images), not acting in it. The embodied data that robots need — cause-and-effect relationships, physical consequences of actions, contact dynamics — is vastly underrepresented. The field needs either much more embodied data (expensive) or architectures that can learn physics from vision alone (hard).
As of 2024, a strong reference point is SpeedFolding (UC Berkeley, published 2022), which reports 30–40 folds per hour. A design along those lines uses:
Perception: RGBD overhead camera + learned keypoint detection on cloth. A neural network predicts "grasp points" (where to pick up) and "fold lines" (where to create the fold). This avoids needing a full cloth mesh — just key geometric features.
Learning: Imitation learning from human demos (not RL, because cloth sim is too inaccurate for sim-to-real). ~100 demos per garment category. Diffusion policy for multi-modal actions (there are multiple valid fold sequences for any garment).
Action space: Hybrid primitives. High-level: classify garment → select fold sequence template. Low-level: learned pick-and-place within each primitive. This decomposes the 36-second task into 3-4 primitives of roughly 9-12 seconds each.
Bimanual: Leader-follower for symmetric folds (both arms do the same motion mirrored). Joint policy for asymmetric tasks (one arm holds, other folds). Coordination via shared state + synchronization primitives.
Key insight: For deformable objects, imitation learning + primitives beats end-to-end RL because cloth sim can't generate useful training data. The 500-demo budget is spent wisely: 100 demos per category (5 categories), with data augmentation (random start poses) expanding to 5000+ effective demos.
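The reason a generative policy is preferred for multi-modal actions can be shown with a toy example. This sketch assumes a 1-D action where half the demonstrators fold left (-1) and half fold right (+1). Resampling the demos stands in for a trained diffusion sampler and is not an actual diffusion model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy demos from one state: two equally valid modes, -1 (left) and +1 (right).
demo_actions = rng.choice([-1.0, 1.0], size=1000)

# A deterministic regressor trained with mean-squared error converges to the
# mean of the demonstrated actions, which lies between the two modes.
mse_optimal_action = demo_actions.mean()

# A generative policy samples from the demonstrated distribution instead.
# Resampling the demos is a stand-in for a learned sampler.
sampled_actions = rng.choice(demo_actions, size=5)
```

The averaged action sits near zero, a fold direction no demonstrator ever used, while every sampled action is one of the two valid modes.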
```python
import numpy as np


class DomainRandomizer:
    """Samples simulator parameters from ranges whose width adapts to success rate."""

    def __init__(self, param_ranges, log_params=None):
        self.ranges = param_ranges  # {name: (lo, hi)}
        # Parameters sampled log-uniformly; an explicit empty set is respected.
        if log_params is None:
            log_params = {'mass', 'friction', 'inertia'}
        self.log_params = log_params
        self.width_scale = {k: 1.0 for k in param_ranges}

    def sample(self):
        """Draw one physically consistent parameter set."""
        params = {}
        for name, (lo, hi) in self.ranges.items():
            # Apply adaptive width scaling around the midpoint of the range
            mid = (lo + hi) / 2
            half_w = (hi - lo) / 2 * self.width_scale[name]
            eff_lo = max(lo * 0.1, mid - half_w)  # floor at 10% of original lo
            eff_hi = mid + half_w
            if name in self.log_params and eff_lo > 0:
                # Log-uniform: equal probability per multiplicative factor
                params[name] = np.exp(np.random.uniform(
                    np.log(eff_lo), np.log(eff_hi)))
            else:
                params[name] = np.random.uniform(eff_lo, eff_hi)
        return self.make_consistent(params)

    def adapt(self, success_rate, target=0.7):
        """Widen ranges when the policy succeeds often, narrow them when it struggles.

        Success rates between target * 0.5 and target leave the widths unchanged.
        """
        for name in self.width_scale:
            if success_rate > target:
                # Too easy: widen by 10% (capped at 3x the original width)
                self.width_scale[name] = min(3.0,
                                             self.width_scale[name] * 1.1)
            elif success_rate < target * 0.5:
                # Too hard: narrow by 10% (floored at 0.3x the original width)
                self.width_scale[name] = max(0.3,
                                             self.width_scale[name] * 0.9)

    def make_consistent(self, params):
        """Overwrite sampled inertia so that inertia = mass * radius^2."""
        if 'mass' in params and 'inertia' in params:
            r_sq = params.get('radius', 0.05) ** 2  # default radius 0.05
            params['inertia'] = params['mass'] * r_sq
        return params
```
In both settings, the field oscillates between "let the model figure it out" (end-to-end) and "inject structure" (modular). The pattern: end-to-end wins when data is abundant and the task is complex enough that hand-designed interfaces lose information. Modular wins when data is scarce or you need guarantees (safety, interpretability). The frontier combines both: structured end-to-end (like a diffusion policy with an explicit action head but a learned perception backbone).
Can you identify this same modular vs. end-to-end tension in autonomous driving? What's the equivalent of the "perception → planning → control" stack, and what's the end-to-end alternative?
Every robot learning method (RL, BC from sim demos, world model training) bottlenecks on data. Simulation is the most scalable remedy: effectively unlimited, parallelizable, and free of safety constraints. For a policy trained in simulation, the quality of the simulator sets the ceiling on the policy. This is why simulation engineering (choosing the right engine, tuning contact models, system identification) is arguably more important than the learning algorithm itself.
If simulation is so powerful, why can't we just simulate everything? What class of tasks remains fundamentally hard to simulate accurately? (Hint: think about deformable objects and fluid dynamics.)
Time to put it all together and design a robot learning pipeline from scratch: choose a learning paradigm, then follow how data flows from collection through training to deployment.
Each approach has different data requirements, compute costs, and failure modes. Comparing them side by side builds intuition for when to use each one.
Factor 1: Simulation fidelity. If you can simulate the task accurately (rigid objects, known dynamics): model-free RL or model-based RL. If sim is unreliable (deformables, fluids, contacts): imitation learning from real demos.
Factor 2: Data budget. 10 real demos: imitation learning (BC). 100 real episodes: model-based RL. 10M sim episodes: model-free RL. Zero task-specific data: foundation model (VLA, zero-shot).
Factor 3: Generalization scope. One specific task on one robot: BC is fastest. Many tasks on one robot: RL (amortizes exploration). Many tasks on many robots: foundation model (amortizes across embodiments).
The meta-insight: most real systems use a HYBRID. Foundation model for coarse behavior + IL fine-tuning for precision + RL for self-improvement in deployment. The question isn't "which one" but "how to combine them."
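The three factors can be written down as a rough decision function. This is only a sketch: the function name, argument names, rule ordering, and the choice to treat the numbers above as hard cutoffs are all assumptions made for illustration, and real projects usually blend approaches.

```python
def suggest_paradigm(sim_reliable, real_demos=0, real_episodes=0,
                     sim_episodes=0, many_robots=False):
    """Map the three factors to a starting paradigm (thresholds are illustrative)."""
    no_task_data = real_demos == 0 and real_episodes == 0 and sim_episodes == 0
    # Factor 3: cross-embodiment scope, or no task-specific data at all.
    if many_robots or no_task_data:
        return "foundation model"
    # Factors 1 and 2: a trustworthy simulator plus a very large sim budget.
    if sim_reliable and sim_episodes >= 10_000_000:
        return "model-free RL"
    # Smooth, simulable dynamics and around a hundred real episodes.
    if sim_reliable and real_episodes >= 100:
        return "model-based RL"
    # Unreliable sim (deformables, fluids, contacts) or only a few demos.
    if real_demos >= 10:
        return "imitation learning"
    # Too little data for anything else: lean on a pretrained model.
    return "foundation model"

cloth_folding = suggest_paradigm(sim_reliable=False, real_demos=100)
locomotion = suggest_paradigm(sim_reliable=True, sim_episodes=10_000_000)
pushing = suggest_paradigm(sim_reliable=True, real_episodes=100)
new_task = suggest_paradigm(sim_reliable=True)
```

A hybrid system would call on several of these branches at once rather than picking a single return value.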
| Approach | Data Source | Sample Efficiency | Strengths | Weaknesses |
|---|---|---|---|---|
| Model-Free RL | Self-generated (exploration) | Very low (millions of episodes) | No assumptions about dynamics. Converges to an optimal policy in the limit under idealized assumptions (e.g., tabular Q-learning). | Impractical data requirements for real robots. |
| Model-Based RL | Self-generated + model rollouts | Medium (hundreds of episodes) | Data-efficient. Can plan ahead. | Model errors compound. Struggles with contacts. |
| Imitation Learning | Expert demonstrations | High (tens of demonstrations) | Fast to deploy. No reward design. | Needs expert. Distribution shift. Ceiling = expert. |
| Foundation Models | Internet + multi-robot data | Highest (zero/few-shot) | Generalization. Language-conditioned. Cross-embodiment. | Compute-hungry. Still needs fine-tuning for precision. |
Model-Free RL. Use when: you have a fast, accurate simulator; the task has a clear reward signal; and you can afford massive compute. Best for: locomotion, game-like tasks, sim-only settings.
Model-Based RL. Use when: real-world data is expensive; dynamics are relatively smooth; you need to plan multiple steps ahead. Best for: pushing, reaching, navigation, any task where physics is approximately known.
Imitation Learning. Use when: you have access to an expert (human teleoperator); the task is hard to define with a reward function; you need a working policy quickly. Best for: manipulation, assembly, any task where "just show me" is easier than "just describe the reward."
Foundation Models. Use when: you need generalization across tasks and objects; language instructions are natural; you can fine-tune a large pretrained model. Best for: open-vocabulary manipulation, multi-task deployment, embodied assistants.
This lecture draws from nearly every topic in the CS 231n curriculum. Convolutional networks provide the visual backbone for every robot perception system. Transformers power the foundation models (RT-2, OpenVLA). Generative models (diffusion, flow matching) underlie diffusion policies and world models. Self-supervised learning (CLIP, DINOv2) provides the pretrained representations that make robot learning sample-efficient.
From the RL side: the Q-learning and actor-critic foundations from CS 224R are exactly the model-free methods that roboticists tried first. Imitation learning from that course maps directly to behavioral cloning and DAgger here. And model-based RL is the bridge between learned dynamics and classical control theory.
| Paper | Year | Contribution |
|---|---|---|
| Mnih et al., "Human-level control through deep RL" (Nature) | 2015 | DQN for Atari |
| Silver et al., "Mastering the game of Go" (Nature) | 2016 | AlphaGo |
| Silver et al., "Mastering Go without human knowledge" | 2017 | AlphaGo Zero |
| Schrittwieser et al., "Mastering Atari...by planning" (Nature) | 2020 | MuZero |
| Finn & Levine, "Deep Visual Foresight for Planning" | 2017 | Pixel dynamics for robots |
| Manuelli et al., "Keypoints into the Future" | 2020 | Keypoint dynamics |
| Ross et al., "A Reduction of Imitation Learning" (AISTATS) | 2011 | DAgger algorithm |
| Florence et al., "Implicit Behavioral Cloning" | 2022 | Energy-based BC |
| Chi et al., "Diffusion Policy" | 2023 | Diffusion models for robot actions |
| Brohan et al., "RT-1: Robotics Transformer" | 2022 | Large-scale robot transformer |
| Brohan et al., "RT-2: Vision-Language-Action Models" | 2023 | VLM as robot controller |
| Open X-Embodiment Collaboration, "RT-X" | 2023 | Cross-embodiment dataset |
| Kim et al., "OpenVLA" | 2024 | Open-source VLA |
| Black et al., "π0: A Vision-Language-Action Flow Model" | 2024 | Flow matching for robots |
| Wu et al., "DayDreamer: World Models for Physical Robot Learning" | 2023 | Policy training in imagination |
| Yang et al., "UniSim: Learning Interactive Real-World Simulators" | 2023 | Universal video simulator |
Robot learning is the art of closing the loop between seeing and doing — bridging the gap between understanding the world (vision) and changing it (action), using data instead of hand-crafted rules.