Learn a world model, plan in imagination, act efficiently in reality. From random shooting to MBPO.
You're training a robot arm to stack wooden blocks. Each attempt takes about 10 seconds: reach, grasp, lift, place, check, reset. Your model-free RL algorithm — say, SAC — needs around one million environment steps. That's 116 days of non-stop robot operation for a single stacking skill.
In simulation, a million steps is a coffee break. On a real robot, it's a death sentence for your project timeline, your motors, and your grad student's sanity.
What if the robot could practice in its imagination?
Think about how a chess grandmaster plays. They don't physically move pieces to evaluate a position. They simulate moves mentally — "if I take the bishop, they'll push the pawn, I'll castle..." — and only execute moves they've already vetted in their head. The physical board sees maybe 40 moves. The mental board sees thousands.
That's the core idea. Instead of learning purely from real interactions (expensive), you learn a dynamics model — a learned simulator of how the world works — and then plan or practice inside that simulator.
Model-based RL (MBRL) is a family of RL methods that explicitly learn a model of the environment's transition dynamics, then use that model for planning, data augmentation, or policy optimization. Real data teaches the model. The model generates synthetic experience. The agent learns from both.
| Method | HalfCheetah Steps | Ant Steps | Real Robot Time |
|---|---|---|---|
| PPO (model-free) | 1M | 3M | ~100 days |
| SAC (model-free) | 300K | 1M | ~35 days |
| MBPO (model-based) | 30K | 100K | ~3 days |
| PETS (model-based) | 10K | — | ~20 minutes |
10-100x less data. That's not a minor improvement. It's the difference between "works in simulation only" and "works on a real robot in your lab."
Model-free: simple but data-hungry. Model-based: data-efficient but introduces model bias — if your learned model is wrong, you'll plan confidently toward disaster. The entire field of MBRL is about managing this trade-off.
The core building block. We want to learn a function that predicts what happens next:
$$\hat{s}_{t+1} = f_\theta(s_t, a_t)$$
This is supervised learning. We collect transitions $(s_t, a_t, s_{t+1})$ from the real environment, then train a neural network to predict $s_{t+1}$ from $(s_t, a_t)$.
A dynamics model is a parametric function $f_\theta : S \times A \to S$ that approximates the environment's true transition function $p(s_{t+1} \mid s_t, a_t)$. It is usually a neural network trained with mean squared error on collected transitions:
$$\mathcal{L}(\theta) = \frac{1}{N} \sum_{i=1}^{N} \left\| f_\theta(s_i, a_i) - s_i' \right\|^2$$
Symbol by symbol:
• $N$: number of transitions in our dataset (replay buffer).
• $f_\theta(s_i, a_i)$: the model's predicted next state. This is a d-dimensional vector (same dimension as the state space).
• $s_i'$: the actual next state observed from the real environment.
• $\|\cdot\|^2$: squared L2 norm. Sum of squared differences across all state dimensions.
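As a minimal sketch of the loss described above, the function below averages the squared L2 prediction error over a batch of transitions. It uses plain Python lists rather than tensors, and the names `dynamics_mse` and `imperfect_model`, as well as the `(s, a, s_next)` tuple layout, are illustrative assumptions.

```python
def dynamics_mse(model, transitions):
    """Mean over transitions of the squared L2 error ||f(s, a) - s'||^2.

    `model(s, a)` returns a predicted next state as a list of floats;
    `transitions` is a list of (s, a, s_next) tuples (hypothetical layout).
    """
    total = 0.0
    for s, a, s_next in transitions:
        pred = model(s, a)
        # Squared L2 norm: sum of squared differences over state dimensions.
        total += sum((p - t) ** 2 for p, t in zip(pred, s_next))
    return total / len(transitions)


# Toy 2-D example: the true dynamics add the action to the state,
# while this deliberately imperfect model only adds 90% of it.
def imperfect_model(s, a):
    return [si + 0.9 * ai for si, ai in zip(s, a)]


data = [
    ([0.0, 0.0], [1.0, 0.0], [1.0, 0.0]),
    ([1.0, 2.0], [0.0, -1.0], [1.0, 1.0]),
]
loss = dynamics_mse(imperfect_model, data)
print(loss)
```

Each transition contributes a squared error of about 0.01 here, so the mean is about 0.01 as well. With a neural network the same quantity would be computed on batched tensors and minimized by gradient descent.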
A standard dynamics model is a feedforward neural network:
```python
import torch
import torch.nn as nn


class DynamicsModel(nn.Module):
    """Feedforward model that predicts the next state from (state, action)."""

    def __init__(self, state_dim, action_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, state_dim),  # predict next state
        )

    def forward(self, state, action):
        # Concatenate along the last (feature) dimension so batches work.
        x = torch.cat([state, action], dim=-1)
        return self.net(x)  # predicted s_{t+1}
```
In practice, predicting the change in state works better than predicting the absolute next state:
$$\hat{s}_{t+1} = s_t + f_\theta(s_t, a_t)$$
Most of the state doesn't change much between timesteps. A robot arm at position [1.2, 0.5, 0.8] will be at [1.21, 0.49, 0.81] next step. Predicting the absolute value means the network wastes capacity learning the identity function. Predicting the small delta focuses learning on what actually changes.
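To make the delta idea concrete, here is a small sketch that builds delta targets and fits the simplest possible delta model by least squares. The 1-D system `s' = s + 0.1 * a` and the names `true_step` and `predict_next` are assumptions chosen for illustration, not something from a specific library.

```python
import random

rng = random.Random(0)


# Hypothetical 1-D system (an assumption for illustration): s' = s + 0.1 * a.
def true_step(s, a):
    return s + 0.1 * a


# Collect transitions with random states and actions.
data = []
for _ in range(200):
    s, a = rng.uniform(0.0, 10.0), rng.uniform(-1.0, 1.0)
    data.append((s, a, true_step(s, a)))

# Delta targets are small and centred near zero; absolute targets are not.
abs_targets = [s_next for _, _, s_next in data]
delta_targets = [s_next - s for s, _, s_next in data]
print(max(abs(t) for t in abs_targets), max(abs(t) for t in delta_targets))

# Fit the simplest possible delta model, delta = w * a, by least squares.
w = sum(a * (s_next - s) for s, a, s_next in data) / sum(a * a for _, a, _ in data)


def predict_next(s, a):
    # The model outputs a change in state; add it back to get s_{t+1}.
    return s + w * a
```

The absolute targets span the whole state range while the delta targets stay within about 0.1 of zero, which is the scale difference the paragraph above describes.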
Where do we get (s, a, s') transitions? Three options, often used in combination:
• Random policy: Execute random actions, collect transitions. Cheap but only covers states near the initial distribution.
• Current policy: Roll out whatever policy we currently have. Covers relevant states but is expensive.
• Replay buffer: Store ALL transitions ever collected. Train model on full history. Gets richer over time.
If you train the model only on random-policy data but then use it to plan at states a good policy visits, the model is making predictions in regions it never saw during training. This is the fundamental challenge of MBRL — the model must be accurate where the current policy goes, not where past data lives.
The simplest possible use of a dynamics model. No policy network, no gradients, no value function. Just brute-force search:
That's it. Sample random plans, simulate them in your head, pick the best first action. Re-plan every step.
Setup: 1D environment. State = position on number line [0, 10]. Action = velocity in [-1, 1]. Reward = -|state - 7| (want to reach position 7). Current state s0 = 3. Horizon H = 3. N = 4 random sequences.
Sequence 1: actions = [0.8, 0.5, -0.2]
s1 = 3 + 0.8 = 3.8, s2 = 3.8 + 0.5 = 4.3, s3 = 4.3 + (-0.2) = 4.1
R = -|3-7| + -|3.8-7| + -|4.3-7| = -4 + -3.2 + -2.7 = -9.9
Sequence 2: actions = [1.0, 1.0, 1.0]
s1 = 4, s2 = 5, s3 = 6
R = -4 + -3 + -2 = -9.0
Sequence 3: actions = [1.0, 0.9, 0.8]
s1 = 4, s2 = 4.9, s3 = 5.7
R = -4 + -3 + -2.1 = -9.1
Sequence 4: actions = [-0.5, 0.3, 0.1]
s1 = 2.5, s2 = 2.8, s3 = 2.9
R = -4 + -4.5 + -4.2 = -12.7
Winner: Sequence 2 (R = -9.0). Execute first action: a0 = 1.0 (move right as fast as possible toward target 7).
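The worked example can be sketched in a few lines. The code below assumes the 1-D dynamics `s' = s + a` stand in for the learned model, and it follows the example's convention of summing rewards at the states s0 through s(H-1). The function names are illustrative.

```python
import random


def model_step(s, a):
    # Stand-in for the learned model f_theta: the 1-D example's dynamics.
    return s + a


def reward(s):
    return -abs(s - 7.0)


def rollout_return(s0, actions):
    """Sum rewards at s_0 .. s_{H-1}, matching the worked example."""
    s, total = s0, 0.0
    for a in actions:
        total += reward(s)
        s = model_step(s, a)
    return total


def random_shooting(s0, horizon, n_samples, rng):
    """Sample random action sequences and return the best one's first action."""
    candidates = [
        [rng.uniform(-1.0, 1.0) for _ in range(horizon)] for _ in range(n_samples)
    ]
    best = max(candidates, key=lambda seq: rollout_return(s0, seq))
    return best[0]


# Reproduce the four hand-picked sequences from the example.
sequences = [[0.8, 0.5, -0.2], [1.0, 1.0, 1.0], [1.0, 0.9, 0.8], [-0.5, 0.3, 0.1]]
returns = [rollout_return(3.0, seq) for seq in sequences]
print([round(r, 1) for r in returns])  # [-9.9, -9.0, -9.1, -12.7]

first_action = random_shooting(3.0, horizon=3, n_samples=1000, rng=random.Random(0))
```

In a real planner, `random_shooting` would be called again at every environment step from the newly observed state.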
With N = 1000 random sequences of length H = 10, you're evaluating 1000 possible futures in parallel — all in your learned model, zero real environment steps. Even random search finds decent plans when you evaluate enough candidates. And you replan every step, so early mistakes get corrected.
High-dimensional action spaces (e.g., 20-DOF robot) make random sampling hopeless. With H=10 steps and 20-dimensional continuous actions, you need astronomically many samples to cover the space. We need smarter search — see CEM in Section 06.
Random shooting has no policy — it replans from scratch every step. What if we used the model to generate extra training data for a policy?
Dyna (Sutton, 1991) is the foundational idea: mix real transitions with imagined transitions, train a policy on both.
Dyna is a framework combining direct RL (learning from real experience) with indirect RL (learning from model-generated synthetic experience). The model serves as a data amplifier: one real transition can spawn many imagined ones.
The key insight: every real transition is used twice. Once directly for Q-learning. Once to improve the model, which then generates K additional training transitions. With K=10, each real environment step effectively gives you 11 learning updates.
Setup: 5x5 grid. Agent starts at (0,0), goal at (4,4). Reward = +1 at goal, 0 elsewhere. 4 actions: up/down/left/right. K = 5 model rollouts per step.
Real step: Agent at (2,3), goes right, arrives at (3,3), r=0. Q(2,3,right) updated.
Model training: fθ now knows (2,3)+right → (3,3).
5 imagined steps:
1. Sample (1,1) from buffer, action=up, model predicts (1,2), r=0. Update Q(1,1,up).
2. Sample (3,4) from buffer, action=right, model predicts (4,4), r=1! Update Q(3,4,right).
3. Sample (0,2), action=right, model predicts (1,2), r=0. Update Q(0,2,right).
4. Sample (4,3), action=up, model predicts (4,4), r=1! Update Q(4,3,up).
5. Sample (2,2), action=down, model predicts (2,1), r=0. Update Q(2,2,down).
Result: From one real step, the agent updated Q-values for two moves that reach the goal (via the model) without executing them in the real environment. Q-learning with K=0 would need to experience those transitions for real.
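Below is a minimal tabular Dyna-Q sketch on the same 5x5 grid. It is one possible implementation under stated assumptions: the "model" is a lookup table of observed transitions rather than a learned network, so planning only replays (state, action) pairs that were seen before, and the step size, discount, and exploration rate are arbitrary choices.

```python
import random

ACTIONS = {"up": (0, 1), "down": (0, -1), "left": (-1, 0), "right": (1, 0)}
GOAL, SIZE = (4, 4), 5


def env_step(state, action):
    """True 5x5 grid: moves are clipped at the walls, reward +1 on reaching the goal."""
    dx, dy = ACTIONS[action]
    nxt = (min(max(state[0] + dx, 0), SIZE - 1), min(max(state[1] + dy, 0), SIZE - 1))
    return nxt, (1.0 if nxt == GOAL else 0.0)


def q_update(Q, s, a, r, s2, alpha=0.5, gamma=0.95):
    # Standard Q-learning target; the goal is terminal, so no bootstrap there.
    target = r if s2 == GOAL else r + gamma * max(Q.get((s2, b), 0.0) for b in ACTIONS)
    Q[(s, a)] = Q.get((s, a), 0.0) + alpha * (target - Q.get((s, a), 0.0))


def dyna_q(episodes, k, seed=0, epsilon=0.1, max_steps=500):
    rng = random.Random(seed)
    Q, model = {}, {}  # model: (s, a) -> (r, s2), a tabular stand-in for f_theta
    for _ in range(episodes):
        s = (0, 0)
        for _ in range(max_steps):
            if rng.random() < epsilon:
                a = rng.choice(list(ACTIONS))
            else:  # greedy with random tie-breaking
                best = max(Q.get((s, b), 0.0) for b in ACTIONS)
                a = rng.choice([b for b in ACTIONS if Q.get((s, b), 0.0) == best])
            s2, r = env_step(s, a)
            q_update(Q, s, a, r, s2)   # direct RL from the real transition
            model[(s, a)] = (r, s2)    # model learning
            for _ in range(k):         # planning: k imagined updates
                ps, pa = rng.choice(list(model))
                pr, ps2 = model[(ps, pa)]
                q_update(Q, ps, pa, pr, ps2)
            s = s2
            if s == GOAL:
                break
    return Q


def greedy_path_length(Q, limit=50):
    """Follow the greedy policy from (0, 0); returns `limit` if the goal is not reached."""
    s, steps = (0, 0), 0
    while s != GOAL and steps < limit:
        a = max(ACTIONS, key=lambda b: Q.get((s, b), 0.0))
        s, _ = env_step(s, a)
        steps += 1
    return steps


Q = dyna_q(episodes=50, k=10)
print(greedy_path_length(Q))
```

Setting `k=0` turns this into plain Q-learning, which makes it easy to compare how many real steps each variant needs before the greedy path reaches the goal.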
If you roll out from an arbitrary state, the model might be inaccurate there (it's never seen that region). Starting from states in the buffer ensures the model is making predictions from regions where it has data — at least for the first step.
Here's the fundamental problem with MBRL. Your model is never perfect. And when you roll out multiple steps, errors compound exponentially.
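Let the true dynamics be $p(s' \mid s, a)$ and the learned model be $f_\theta(s, a)$. For this derivation, treat the true dynamics as a deterministic function $f(s, a)$, and suppose the model has per-step error bounded by $\epsilon$:
$$\| f_\theta(s, a) - f(s, a) \| \le \epsilon \quad \text{for all } (s, a)$$
And suppose the true dynamics has Lipschitz constant $L \ge 1$ (small state changes lead to proportionally bounded next-state changes):
$$\| f(s_1, a) - f(s_2, a) \| \le L \, \| s_1 - s_2 \|$$
Write $e_k$ for the gap between the model's state and the true state after $k$ steps, with $e_0 = 0$.
Step 1: After 1 model step, $e_1 \le \epsilon$ (by assumption).
Step 2: After 2 steps. The model starts from its own predicted state (with up to $\epsilon$ error), so:
$$e_2 \le L \, e_1 + \epsilon \le \epsilon (1 + L)$$
Step H: By induction:
$$e_H \le \epsilon \sum_{k=0}^{H-1} L^k = \epsilon \, \frac{L^H - 1}{L - 1}$$
Inductive step: $e_{k+1} \le L \, e_k + \epsilon$. Unrolling the recurrence gives the geometric sum above. The closed form needs $L \ne 1$; for $L = 1$ the sum is simply $H \epsilon$.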
Setup: Per-step error ε = 0.01 (1% relative error). Lipschitz constant L = 1.5 (mildly chaotic system).
• H = 1: error ≤ 0.01 (fine)
• H = 5: error ≤ 0.01 × ($1.5^{5}$ - 1)/0.5 = 0.01 × 6.59/0.5 ≈ 0.13 (13% error)
• H = 10: error ≤ 0.01 × ($1.5^{10}$ - 1)/0.5 = 0.01 × 56.67/0.5 ≈ 1.13 (113% error!)
• H = 20: error ≤ 0.01 × ($1.5^{20}$ - 1)/0.5 ≈ 66.5 (completely useless)
A model with 1% error per step is worthless after 20 steps if L = 1.5.
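The bound is easy to evaluate numerically. The sketch below computes it in closed form and, as a cross-check, by iterating the recurrence directly. The function names are illustrative.

```python
def compounding_error_bound(eps, lipschitz, horizon):
    """Worst-case state error after `horizon` model steps.

    Implements eps * (L^H - 1) / (L - 1), falling back to eps * H when L == 1.
    """
    if lipschitz == 1.0:
        return eps * horizon
    return eps * (lipschitz ** horizon - 1.0) / (lipschitz - 1.0)


def bound_by_recurrence(eps, lipschitz, horizon):
    # Same quantity via the recurrence e_{k+1} = L * e_k + eps, with e_0 = 0.
    e = 0.0
    for _ in range(horizon):
        e = lipschitz * e + eps
    return e


for h in (1, 5, 10, 20):
    print(h, round(compounding_error_bound(0.01, 1.5, h), 3))
# 1 0.01
# 5 0.132
# 10 1.133
# 20 66.485
```

Trying a smaller Lipschitz constant, such as 1.1, shows how much longer a rollout stays usable when the system is less sensitive to small state differences.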
Long model rollouts are dangerous. Even an excellent model (99% accurate per step) produces garbage after enough steps. This is why every serious MBRL algorithm limits rollout horizon. The rest of this lesson is about managing this compounding.
Weather forecasting is exactly this problem. The atmosphere is a chaotic dynamical system (L >> 1). Models are quite good at 1-day predictions (~ε small), but 14-day forecasts are barely better than climatological averages. Same math, same exponential.
Extend the error bound to cumulative reward. Suppose the reward function is $L_r$-Lipschitz: $|r(s_1, a) - r(s_2, a)| \le L_r \, \|s_1 - s_2\|$. Show that the total reward error over $H$ steps satisfies:
$$|R_{\text{true}} - R_{\text{model}}| \le \frac{L_r \, \epsilon \, H \, L^H}{L - 1}$$
Step 1: At model step $k$, the state error satisfies $e_k \le \epsilon \, (L^k - 1)/(L - 1)$.
Step 2: Reward error at step $k$: $|r(s_k^{\text{true}}, a) - r(s_k^{\text{model}}, a)| \le L_r \, e_k$.
Step 3: Total reward error: $|R_{\text{true}} - R_{\text{model}}| \le \sum_{k=0}^{H-1} L_r \, e_k \le L_r \, \epsilon \sum_{k=0}^{H-1} (L^k - 1)/(L - 1)$.
Step 4: Bounding: since $L^k - 1 \le L^k$ and $\sum_{k=0}^{H-1} L^k \le H \, L^{H-1} \le H \, L^H$, the total is at most $L_r \, \epsilon \, H \, L^H / (L - 1)$. ■
Key insight: The reward error grows as H × exponential, not just exponential. Longer horizons are doubly penalized: more steps AND larger per-step errors at later steps.
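As a quick numerical sanity check of the exercise, the sketch below compares the exact sum from Step 3 with the looser closed form from Step 4. The function names and the choice L_r = 1 are assumptions for illustration.

```python
def state_error_bound(eps, lipschitz, k):
    """State error bound after k model steps: eps * (L^k - 1) / (L - 1)."""
    return eps * (lipschitz ** k - 1.0) / (lipschitz - 1.0)


def reward_error_sum(eps, lipschitz, reward_lipschitz, horizon):
    """Sum from Step 3: sum over k = 0..H-1 of L_r * e_k."""
    return sum(
        reward_lipschitz * state_error_bound(eps, lipschitz, k) for k in range(horizon)
    )


def reward_error_loose(eps, lipschitz, reward_lipschitz, horizon):
    """Closed-form bound from Step 4: L_r * eps * H * L^H / (L - 1)."""
    return reward_lipschitz * eps * horizon * lipschitz ** horizon / (lipschitz - 1.0)


for h in (5, 10, 20):
    tight = reward_error_sum(0.01, 1.5, 1.0, h)
    loose = reward_error_loose(0.01, 1.5, 1.0, h)
    print(h, round(tight, 3), round(loose, 3))
```

The closed form is deliberately loose, so the second column should always be at least as large as the first.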
The error bound tells us: keep rollouts short. But short horizons mean myopic plans. The solution: Model Predictive Control (MPC) — plan short, replan often.
MPC is a planning strategy that optimizes actions over a short horizon H, executes only the first action, observes the real next state, then replans. The planning horizon is a sliding window that moves forward with each real step.
MPC limits error propagation to H steps. If H = 5, the worst-case error is bounded by our H=5 calculation (manageable), regardless of the total episode length.
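The receding-horizon loop can be sketched as follows. Everything here is an illustrative assumption: the 1-D environment from the earlier example, a deliberately biased model that underestimates motion by 20%, and an exhaustive search over a coarse action grid standing in for the optimizer.

```python
from itertools import product

TARGET = 7.0
ACTION_GRID = [i / 10 for i in range(-10, 11)]  # candidate actions in [-1, 1]


def real_step(s, a):
    return s + a  # true environment from the 1-D example


def model_step(s, a):
    # Hypothetical biased model (an assumption): underestimates motion by 20%.
    return s + 0.8 * a


def plan(s, horizon):
    """Exhaustive search over the action grid stands in for the optimizer."""

    def model_return(actions):
        state, total = s, 0.0
        for a in actions:
            state = model_step(state, a)
            total -= abs(state - TARGET)
        return total

    return max(product(ACTION_GRID, repeat=horizon), key=model_return)


def run_mpc(s0, horizon=2, n_steps=8):
    s, trajectory = s0, [s0]
    for _ in range(n_steps):
        actions = plan(s, horizon)    # optimize a short plan inside the model
        s = real_step(s, actions[0])  # execute only the first action for real
        trajectory.append(s)          # the next plan starts from the observed state
    return trajectory


trajectory = run_mpc(3.0)
print(trajectory)
```

Even though the model is wrong at every step, each plan starts from a freshly observed state, so the model's bias never accumulates beyond the planning horizon.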
Random shooting wastes samples. CEM iteratively narrows the search toward good action sequences:
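Setup: Same 1D environment as before. State = 3, target = 7. H = 2, N = 6, M = 2 (top 2). For simplicity, samples are not clipped to the [-1, 1] action range, and each return sums the rewards at the two states reached (s1 and s2).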
Iteration 1: μ = [0, 0], σ = [1, 1]. Sample 6 sequences:
[0.8, 0.3] → states [3.8, 4.1], R = -3.2 + -2.9 = -6.1
[1.2, 0.9] → [4.2, 5.1], R = -2.8 + -1.9 = -4.7 (elite)
[-0.5, 0.2] → [2.5, 2.7], R = -4.5 + -4.3 = -8.8
[0.9, 1.1] → [3.9, 5.0], R = -3.1 + -2.0 = -5.1 (elite)
[-0.3, -0.8] → [2.7, 1.9], R = -4.3 + -5.1 = -9.4
[0.4, 0.6] → [3.4, 4.0], R = -3.6 + -3.0 = -6.6
Elite: [1.2, 0.9] and [0.9, 1.1]. New μ = [1.05, 1.0], σ = [0.15, 0.1].
Iteration 2: Sample around [1.05, 1.0] with tight std. Concentrates on the best region.
Result: CEM converged to "go right fast" — same answer as random shooting but with 6 samples instead of needing hundreds.
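Here is a minimal CEM sketch for the same setup. It assumes a diagonal Gaussian over action sequences, refits it with the population standard deviation of the elites, and uses the 1-D dynamics as a stand-in for the learned model. Function names are illustrative.

```python
import random
import statistics

TARGET = 7.0


def model_return(s0, actions):
    """Sum of -|s - 7| over the states reached after each action (s_1 .. s_H)."""
    s, total = s0, 0.0
    for a in actions:
        s = s + a  # stand-in for the learned model in the 1-D example
        total -= abs(s - TARGET)
    return total


def refit(elites):
    """Refit a diagonal Gaussian to the elite sequences (population std)."""
    columns = list(zip(*elites))
    mu = [statistics.fmean(c) for c in columns]
    sigma = [statistics.pstdev(c) for c in columns]
    return mu, sigma


def cem(s0, horizon, n_samples, n_elite, n_iters, rng):
    mu, sigma = [0.0] * horizon, [1.0] * horizon
    for _ in range(n_iters):
        samples = [
            [rng.gauss(m, sd) for m, sd in zip(mu, sigma)] for _ in range(n_samples)
        ]
        samples.sort(key=lambda seq: model_return(s0, seq), reverse=True)
        mu, sigma = refit(samples[:n_elite])
    return mu, sigma


# Iteration 1 of the worked example, using its six hand-written samples.
samples = [[0.8, 0.3], [1.2, 0.9], [-0.5, 0.2], [0.9, 1.1], [-0.3, -0.8], [0.4, 0.6]]
ranked = sorted(samples, key=lambda seq: model_return(3.0, seq), reverse=True)
mu, sigma = refit(ranked[:2])
print([round(m, 2) for m in mu], [round(s, 2) for s in sigma])  # [1.05, 1.0] [0.15, 0.1]
```

Practical implementations usually clip samples to the action bounds and keep a floor on `sigma` so the distribution does not collapse too early. Both are omitted here to keep the sketch short.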
CEM is a special case of evolutionary optimization. The "population" is action sequences. "Fitness" is model-predicted reward. "Selection" keeps the elite. "Mutation" is sampling from the updated Gaussian. Simple, derivative-free, parallelizable.
MPC replans at every step. Each replan runs CEM for I iterations of N rollouts, each H steps long. For I=5, N=500, H=10, that's 5 × 500 × 10 = 25,000 model forward passes per real step. The model must be fast to evaluate. Deep ensemble models (7 networks) push this further.
A single model gives you a point prediction. It can't tell you "I'm not sure about this region." Ensembles fix this by training multiple models and measuring their disagreement.
Train K models {fθ1, ..., fθK} with different random initializations and/or different data subsets (bootstrap). At test time, run all K models. Their disagreement (variance of predictions) estimates epistemic uncertainty — uncertainty due to lack of data.
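$$\mu(s, a) = \frac{1}{K} \sum_{k=1}^{K} f_{\theta_k}(s, a), \qquad \sigma^2(s, a) = \frac{1}{K} \sum_{k=1}^{K} \left( f_{\theta_k}(s, a) - \mu(s, a) \right)^2$$
The square is taken elementwise, giving one variance per state dimension.
Symbol by symbol:
• $K$: number of models in the ensemble (typically 5-7).
• $f_{\theta_k}$: the k-th model, trained independently.
• $\mu(s, a)$: average prediction. Use this as the "best guess" next state.
• $\sigma^2(s, a)$: variance across models. High = "we haven't seen this region" = don't trust the prediction.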
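A small sketch of the ensemble statistics: the function below returns the per-dimension mean and population variance of the members' predictions. The toy ensemble of five 1-D models with slightly different gains is an assumption chosen so that disagreement grows with the size of the action.

```python
import statistics


def ensemble_predict(models, s, a):
    """Return per-dimension mean and population variance across ensemble members."""
    preds = [m(s, a) for m in models]  # K predictions, each a list of floats
    per_dim = list(zip(*preds))        # regroup values by state dimension
    mean = [statistics.fmean(d) for d in per_dim]
    var = [statistics.pvariance(d) for d in per_dim]
    return mean, var


# Toy ensemble (an assumption for illustration): five 1-D models that agree
# on small actions but extrapolate differently as |a| grows.
gains = [0.9, 0.95, 1.0, 1.05, 1.1]
models = [lambda s, a, g=g: [s[0] + g * a[0]] for g in gains]

mean_small, var_small = ensemble_predict(models, [3.0], [0.1])
mean_large, var_large = ensemble_predict(models, [3.0], [1.0])
print(var_small[0] < var_large[0])  # True: disagreement grows with the action
```

With neural networks the same computation is a mean and variance over the stacked outputs of the K members.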
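PETS (Probabilistic Ensembles with Trajectory Sampling; Chua et al., 2018) plans with this ensemble by propagating each imagined trajectory, or particle, through a randomly chosen ensemble member. This samples from the full distribution of possible dynamics, not just the mean. If models disagree about what happens after action a, some particles go one way, others go another. The planning algorithm sees high variance in returns → avoids that action. It's naturally pessimistic in uncertain regions.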
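When models disagree, the agent should be conservative. This emerges automatically from PETS (high-disagreement actions have high-variance returns, which CEM avoids), but can be made explicit by subtracting an uncertainty penalty from the reward:
$$r_{\text{eff}}(s, a) = r(s, a) - \lambda \, \sigma(s, a)$$
Here $\sigma(s, a)$ is a scalar summary of ensemble disagreement and $\lambda \ge 0$ sets how pessimistic the agent is.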
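One way to implement an explicit uncertainty penalty is sketched below. It assumes the disagreement is summarized as the largest per-dimension standard deviation across ensemble members; other summaries (for example, the norm of the standard-deviation vector) are equally plausible, and `lam` is a hypothetical tuning parameter.

```python
import math


def penalized_reward(reward, member_predictions, lam=1.0):
    """r_eff = r - lam * sigma, with sigma the largest per-dimension std.

    `member_predictions` holds one predicted next state per ensemble member.
    Using the max over dimensions is one possible scalar summary (an assumption).
    """
    k = len(member_predictions)
    sigmas = []
    for dim_values in zip(*member_predictions):
        mean = sum(dim_values) / k
        sigmas.append(math.sqrt(sum((v - mean) ** 2 for v in dim_values) / k))
    return reward - lam * max(sigmas)


agree = [[4.0, 0.0], [4.0, 0.0], [4.0, 0.0]]
disagree = [[3.0, 0.0], [4.0, 0.0], [5.0, 0.0]]
print(penalized_reward(-3.0, agree))     # -3.0
print(penalized_reward(-3.0, disagree))  # lower, because the members disagree
```

A larger `lam` makes the planner avoid uncertain regions more aggressively, at the cost of exploring less.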
Setup: 5 models, each 4-layer MLP (200 hidden), outputs Gaussian (μ, σ2) per state dim. Trained on 10,000 real transitions. CEM with N=500, M=50, I=5, H=30.
Result: Achieves near-SAC performance with 10x fewer environment steps. The ensemble's uncertainty estimate prevents the MPC from exploiting model errors.
MPC has a fatal limitation: no policy. It replans from scratch every step, which is computationally expensive at deployment. What if we used the model to train a reusable policy?
MBPO (Janner et al., 2019) is the state-of-the-art answer. Use short model rollouts to generate data, then train a model-free algorithm (SAC) on that data.
MBPO (Model-Based Policy Optimization) is an algorithm that: (1) learns an ensemble dynamics model from real data, (2) generates short synthetic rollouts (1-5 steps) starting from real states, and (3) trains a model-free policy (SAC) on the combined real + synthetic dataset. It achieves model-based sample efficiency with model-free asymptotic performance.
The error propagation bound from Section 05 tells us: with L=1.5 and ε=0.01, a 5-step rollout has ~13% state error. That's borderline acceptable for training a policy. A 20-step rollout? Roughly 6,600% error, which is pure noise. MBPO explicitly bounds rollout length.
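MBPO comes with a theoretical guarantee: under certain conditions, the policy improves monotonically (never gets worse). The key bound has the form:
$$\eta[\pi] \ge \hat{\eta}[\pi] - C(\epsilon_m, \epsilon_\pi)$$
Here $\eta[\pi]$ is the policy's return in the real environment, $\hat{\eta}[\pi]$ is its return under the learned model, and $C$ is a penalty that grows with the model error $\epsilon_m$ and with the policy shift $\epsilon_\pi$ since the model's training data was collected.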
In English: the true policy performance is at least as good as the model-predicted performance minus a term proportional to the maximum model error. Keep model error small → model-predicted improvement transfers to reality.
Model-based sample efficiency (learns model from few real transitions). Model-free asymptotic performance (SAC eventually converges to optimal given enough data). The model bootstraps early learning; SAC takes over as real data accumulates.
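The data-generation step of MBPO can be sketched as short "branched" rollouts that start from real states. Everything below is a toy stand-in under stated assumptions: 1-D dynamics, a random placeholder instead of the SAC actor, and one uniformly sampled ensemble member per step. The function name `branched_rollouts` is illustrative.

```python
import random


def branched_rollouts(real_states, ensemble, policy, reward_fn, k, rng):
    """Generate k-step synthetic rollouts that each start from a real state."""
    synthetic = []
    for s in real_states:
        for _ in range(k):
            a = policy(s, rng)
            model = rng.choice(ensemble)  # pick one ensemble member per step
            s_next = model(s, a)
            synthetic.append((s, a, reward_fn(s, a), s_next))
            s = s_next                    # continue the rollout inside the model
    return synthetic


# Toy stand-ins (all hypothetical): 1-D dynamics, slightly different per member.
ensemble = [lambda s, a, g=g: s + g * a for g in (0.95, 1.0, 1.05)]
policy = lambda s, rng: rng.uniform(-1.0, 1.0)  # placeholder for the SAC actor
reward_fn = lambda s, a: -abs(s - 7.0)

rng = random.Random(0)
real_buffer = [3.0, 4.2, 5.5, 6.1]              # states observed in the real env
model_buffer = branched_rollouts(real_buffer, ensemble, policy, reward_fn, k=3, rng=rng)
print(len(model_buffer))  # 12 synthetic transitions from 4 real states
```

The policy would then be trained on batches drawn from both the real buffer and `model_buffer`. Keeping `k` small is what limits how far each synthetic transition can drift from reality.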
| Property | MPC (PETS) | MBPO |
|---|---|---|
| Has a policy? | No (replans each step) | Yes (SAC) |
| Compute at deploy | High (CEM per step) | Low (one forward pass) |
| Rollout length | 10-30 steps | 1-5 steps |
| Handles model error | Replanning corrects | Short rollouts limit it |
| Sample efficiency | Best (10K steps) | Great (30K steps) |
| Asymptotic perf. | Limited by model | Matches model-free |
MPC replans every step. If a 30-step plan has garbage predictions at step 20, it doesn't matter — you only execute step 1, observe reality, and start fresh. The real observation "resets" the error accumulation every single step.
MBPO bakes errors into training data. If a 20-step rollout drifts far from reality, those fake transitions become permanent training data for SAC. The policy learns from this corrupted data, potentially converging to a bad solution. There's no "reality check" mid-rollout. Short rollouts (1-5) keep each synthetic transition close enough to truth that SAC can learn correctly from it.
Three distinct failure modes plague MBRL. Understanding them is essential for debugging real systems.
The model is trained to minimize average prediction error over its training data (MSE loss), with every transition weighted equally. But the policy only visits a tiny fraction of states. The model might be excellent at predicting states the policy never visits, while being poor at the specific states the policy cares about.
Think of it this way: you train a weather model to predict temperature across the entire Earth equally well. But your user only cares about whether it'll rain in San Francisco tomorrow. Global MSE might be low, but SF predictions could be terrible.
Robot arm trained to reach objects. Model learned from random exploration data — mostly middle-of-workspace positions. The optimal policy reaches to far corners (where objects actually are). Model accuracy in corners is terrible because corners were rarely visited during random data collection. Policy exploits corner predictions, gets weird results in reality.
The policy optimization process (SAC, CEM) is an adversarial optimizer against the model. It will find and exploit any systematic errors. If the model overpredicts reward for some (s,a) pair, the policy will steer toward that pair — even though real reward is low.
This is the same phenomenon as overfitting in supervised learning, but more dangerous: the policy actively searches for model weaknesses.
We derived this in Section 05. But here's the insidious part: it's not just that predictions get noisy. The errors are correlated. If the model consistently underpredicts friction, it'll predict the robot slides further than reality at every step. After 10 steps, the model thinks the robot is in a completely different place — systematically wrong, not randomly wrong.
These three failure modes compound each other: (1) Model is trained on wrong distribution (mismatch). (2) Policy exploits model's weaknesses in that distribution (exploitation). (3) Multi-step rollouts amplify the exploitation exponentially (compounding). The policy learns to "hack" the model, finding imaginary reward that doesn't exist in reality.
| Defense | Addresses | Mechanism |
|---|---|---|
| Short rollouts | Compounding | Limits error accumulation to 1-5 steps |
| Ensemble disagreement | Exploitation | Penalize regions where models disagree |
| Online model updates | Mismatch | Retrain model as policy visits new states |
| DAgger-style collection | Mismatch | Collect data using current policy, not random |
| Reward penalty | Exploitation | $r_{\text{eff}} = r - \lambda\sigma$ (pessimistic) |
Let's put it all together. Real robot manipulation with MBRL: pushing objects to target locations using a 7-DOF arm.
• State (17D): 7 joint positions + 7 joint velocities + 3D object position.
• Action (7D): Desired joint velocity for each motor.
• Reward: -||object_pos - target_pos|| (negative distance to target).
• Horizon: 100 steps per episode (5 seconds at 20Hz control).
| Method | Real Robot Time | Success Rate | Real Transitions |
|---|---|---|---|
| MBRL (ensemble + MPC) | 20 min | 85% | 5,000 |
| SAC (model-free) | 4 hours | 90% | 50,000 |
| PPO (model-free) | 20+ hours | 80% | 500,000+ |
| Random shooting (no CEM) | 20 min | 60% | 5,000 |
10x less data than SAC. 100x less than PPO. Slightly lower success rate, but achievable in a single lab session rather than days of continuous robot operation.
The robot's dynamics are relatively smooth: pushing involves no hard contacts or sudden discontinuities. The 17D state is low-dimensional enough that 5,000 transitions cover the relevant region well. The ensemble catches regions of uncertainty (near table edges, unusual object orientations) and avoids them.
Contact-rich tasks (grasping, insertion) have discontinuous dynamics that neural networks struggle with. High-dimensional visual observations (images) need different architectures (latent space models). Multi-object manipulation has combinatorial complexity that 5,000 transitions can't cover.
| Version | Algorithm | Planning | Policy? | Key Idea |
|---|---|---|---|---|
| 0.5 | Random Shooting | Sample & pick best | No | Brute force in model |
| 1.0 | Dyna | Model rollouts → buffer | Yes (Q) | Data augmentation |
| 1.5 | MPC + CEM | Iterative optimization | No | Short horizon + replan |
| 1.5+ | PETS | CEM + particles | No | Ensemble uncertainty |
| 2.0 | MBPO | Short rollouts → SAC | Yes (SAC) | Model-free with model data |
Use MBRL when:
• Data is expensive (real robots, expensive simulations, clinical trials).
• Dynamics are relatively smooth (continuous state, no hard contacts).
• State dimension is moderate (<100D from sensors; for images, use latent models).
• You need fast adaptation (new task, change in dynamics).
Prefer model-free RL when:
• Simulation is free (video games, MuJoCo if you don't care about wall time).
• Dynamics are chaotic (high Lipschitz constant, contacts everywhere).
• You need maximum asymptotic performance and can afford the data.
• Observation space is very high-dimensional (raw pixels without compression).
| This Lesson | Connection | Related Topic |
|---|---|---|
| Dynamics model | Same idea: predict the future from present | World Models |
| Ensemble uncertainty | Bayesian approach to model uncertainty | Bayesian Neural Networks |
| MPC replanning | Kalman filter = model-based state estimation | Kalman Filter |
| Model for data augmentation | Offline RL uses model to fill gaps | Offline RL (MOPO, MOReL) |
| Policy gradients on model data | MBPO uses SAC = policy gradient + Q | Policy Gradients |
| Error compounding | Same issue in imitation learning (DAgger) | Behavioral Cloning |
MBRL's natural evolution: instead of predicting the raw state $s_{t+1}$, learn a latent dynamics model that predicts in a compressed representation. This handles high-dimensional observations (images) and enables longer-horizon planning by operating in a smoother space. Key systems: Dreamer (Hafner et al.), IRIS, and foundation world models.
• Contact-rich manipulation: Discontinuous dynamics break smooth model assumptions.
• Multi-task transfer: Can one world model serve many tasks?
• Online adaptation: How fast can models adapt when dynamics change (e.g., robot picks up heavy object)?
• Scaling to vision: Latent models work, but sample efficiency gains diminish at scale.
Model-based RL is about building a mental model of the world and using it to think ahead. Every intelligent system — humans, animals, chess engines — does some form of this. The question is never "should we model?" but "how wrong can our model be before planning in it hurts more than it helps?" The entire MBRL literature is an answer to that question.
"The map is not the territory — but a good map saves you from walking off cliffs." — Adapted from Korzybski