Stanford CS 224R · Lecture 11 · Model-Based RL

The Complete Guide to Model-Based RL

Learn a world model, plan in imagination, act efficiently in reality. From random shooting to MBPO.

Full derivations · 6 algorithm versions · Error propagation proofs · Real robot case study

Chapter 01

Why Model-Based?

You're training a robot arm to stack wooden blocks. Each attempt takes about 10 seconds: reach, grasp, lift, place, check, reset. Your model-free RL algorithm — say, SAC — needs around one million environment steps. That's 116 days of non-stop robot operation for a single stacking skill.

In simulation, a million steps is a coffee break. On a real robot, it's a death sentence for your project timeline, your motors, and your grad student's sanity.

What if the robot could practice in its imagination?

The Chess Grandmaster Analogy

Think about how a chess grandmaster plays. They don't physically move pieces to evaluate a position. They simulate moves mentally — "if I take the bishop, they'll push the pawn, I'll castle..." — and only execute moves they've already vetted in their head. The physical board sees maybe 40 moves. The mental board sees thousands.

That's the core idea. Instead of learning purely from real interactions (expensive), you learn a dynamics model — a learned simulator of how the world works — and then plan or practice inside that simulator.

Definition
Model-Based Reinforcement Learning (MBRL)

A family of RL methods that explicitly learn a model of the environment's transition dynamics, then use that model for planning, data augmentation, or policy optimization. Real data teaches the model. The model generates synthetic experience. The agent learns from both.

Sample Efficiency: The Numbers

| Method | HalfCheetah Steps | Ant Steps | Real Robot Time |
| --- | --- | --- | --- |
| PPO (model-free) | 1M | 3M | ~100 days |
| SAC (model-free) | 300K | 1M | ~35 days |
| MBPO (model-based) | 30K | 100K | ~3 days |
| PETS (model-based) | 10K | n/a | ~20 minutes |

10-100x less data. That's not a minor improvement. It's the difference between "works in simulation only" and "works on a real robot in your lab."

The Core Trade-Off

Model-free: simple but data-hungry. Model-based: data-efficient but introduces model bias — if your learned model is wrong, you'll plan confidently toward disaster. The entire field of MBRL is about managing this trade-off.

Chapter 02

The Dynamics Model

The core building block. We want to learn a function that predicts what happens next:

Learned Dynamics: $f_\theta(s_t, a_t) \rightarrow s_{t+1}$

Given current state + action, predict next state

This is supervised learning. We collect transitions $(s_t, a_t, s_{t+1})$ from the real environment, then train a neural network to predict $s_{t+1}$ from $(s_t, a_t)$.

Definition
Dynamics Model (Transition Model)

A parametric function $f_\theta : S \times A \to S$ that approximates the environment's true transition function $p(s_{t+1} \mid s_t, a_t)$. Usually a neural network trained with mean squared error on collected transitions.

Training Objective

MSE Loss for Dynamics: $L(\theta) = \frac{1}{N} \sum_{i=1}^{N} \lVert f_\theta(s_i, a_i) - s_i' \rVert^2$

$s_i'$ is the observed next state from real data

Symbol by symbol:

$N$: number of transitions in our dataset (replay buffer).

$f_\theta(s_i, a_i)$: the model's predicted next state. This is a d-dimensional vector (same dimension as the state space).

$s_i'$: the actual next state observed from the real environment.

$\lVert \cdot \rVert^2$: squared L2 norm. Sum of squared differences across all state dimensions.
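To make the objective concrete, the sketch below computes the MSE loss in plain Python and fits a two-weight linear model by gradient descent. The toy system s' = s + 0.5a, the helper names (`mse_loss`, `make_linear_model`), the learning rate, and the step count are all assumptions chosen for illustration; a real dynamics model would be a neural network trained with an autodiff library.

```python
import random

def mse_loss(model, transitions):
    """Mean over transitions of the squared L2 norm of the prediction error."""
    total = 0.0
    for s, a, s_next in transitions:
        pred = model(s, a)
        total += sum((p - t) ** 2 for p, t in zip(pred, s_next))
    return total / len(transitions)

def make_linear_model(w_s, w_a):
    """Hypothetical 1D model: predicted next state = w_s * s + w_a * a."""
    return lambda s, a: [w_s * s[0] + w_a * a[0]]

# Assumed toy system for illustration only: s' = s + 0.5 * a
rng = random.Random(0)
data = []
for _ in range(200):
    s = [rng.uniform(0.0, 10.0)]
    a = [rng.uniform(-1.0, 1.0)]
    data.append((s, a, [s[0] + 0.5 * a[0]]))

initial_loss = mse_loss(make_linear_model(0.0, 0.0), data)

# Plain gradient descent on the two weights (gradient of the MSE above).
w_s, w_a, lr = 0.0, 0.0, 0.01
for _ in range(2000):
    g_s = g_a = 0.0
    for s, a, s_next in data:
        err = w_s * s[0] + w_a * a[0] - s_next[0]
        g_s += 2.0 * err * s[0] / len(data)
        g_a += 2.0 * err * a[0] / len(data)
    w_s -= lr * g_s
    w_a -= lr * g_a

final_loss = mse_loss(make_linear_model(w_s, w_a), data)
print(f"w_s={w_s:.4f} w_a={w_a:.4f} loss {initial_loss:.3f} -> {final_loss:.2e}")
```

Because the toy data is noise-free and linear, the fitted weights should approach (1, 0.5) and the loss should approach zero; with real data the loss plateaus at the noise floor instead.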

Architecture

A standard dynamics model is a feedforward neural network:

```python
import torch
import torch.nn as nn

class DynamicsModel(nn.Module):
    def __init__(self, state_dim, action_dim, hidden=256):
        super().__init__()
        # Input is the concatenated (state, action) vector.
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, state_dim)  # predict next state
        )

    def forward(self, state, action):
        x = torch.cat([state, action], dim=-1)
        return self.net(x)  # predicted s_{t+1}
```

Delta vs. Absolute Prediction

In practice, predicting the change in state works better than predicting the absolute next state:

Delta Prediction (Standard Practice): $f_\theta(s_t, a_t) = \Delta s$, then $s_{t+1} = s_t + \Delta s$

Why deltas?

Most of the state doesn't change much between timesteps. A robot arm at position [1.2, 0.5, 0.8] will be at [1.21, 0.49, 0.81] next step. Predicting the absolute value means the network wastes capacity learning the identity function. Predicting the small delta focuses learning on what actually changes.
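A minimal sketch of the delta convention, using the robot-arm numbers from the paragraph above. The helper names `to_delta_targets` and `predict_next_state`, and the single placeholder action value, are hypothetical; the point is only that training targets become s' - s and predictions are added back onto the current state.

```python
def to_delta_targets(transitions):
    """Convert (s, a, s') tuples into (s, a, delta) with delta = s' - s."""
    return [
        (s, a, [nxt - cur for cur, nxt in zip(s, s_next)])
        for s, a, s_next in transitions
    ]

def predict_next_state(delta_model, s, a):
    """Reconstruct s_{t+1} = s_t + delta from a model that outputs deltas."""
    delta = delta_model(s, a)
    return [cur + d for cur, d in zip(s, delta)]

# One transition from the text: [1.2, 0.5, 0.8] -> [1.21, 0.49, 0.81].
# The action value 0.1 is a placeholder, not taken from the source.
transitions = [([1.2, 0.5, 0.8], [0.1], [1.21, 0.49, 0.81])]
s, a, delta = to_delta_targets(transitions)[0]

# A "perfect" delta model for this single transition, for demonstration only.
reconstructed = predict_next_state(lambda state, action: delta, s, a)
print(delta, reconstructed)
```

The targets are on the order of 0.01 while the raw states are on the order of 1, which is why delta targets are usually standardized before training.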

Data Collection

Where do we get (s, a, s') transitions? Three options, often used in combination:

Random policy: Execute random actions, collect transitions. Cheap but only covers states near the initial distribution.

Current policy: Roll out whatever policy we currently have. Covers relevant states but is expensive.

Replay buffer: Store ALL transitions ever collected. Train model on full history. Gets richer over time.

Warning: Distribution Shift

If you train the model only on random-policy data but then use it to plan at states a good policy visits, the model is making predictions in regions it never saw during training. This is the fundamental challenge of MBRL — the model must be accurate where the current policy goes, not where past data lives.

Chapter 03

Version 0.5: Random Shooting MPC

The simplest possible use of a dynamics model. No policy network, no gradients, no value function. Just brute-force search:

Random Shooting MPC
  1. Observe current state $s_t$ from real environment.
  2. Sample N random action sequences: $A^{(j)} = (a_t^{(j)}, a_{t+1}^{(j)}, \ldots, a_{t+H-1}^{(j)})$ for $j = 1, \ldots, N$.
  3. Simulate each sequence through the learned model:
      $s_{t+k+1}^{(j)} = f_\theta(s_{t+k}^{(j)}, a_{t+k}^{(j)})$ for $k = 0, \ldots, H-1$.
  4. Score each sequence: $R^{(j)} = \sum_{k=0}^{H-1} r(s_{t+k}^{(j)}, a_{t+k}^{(j)})$.
  5. Execute only the first action of the best sequence: $a_t^* = a_t^{(j^*)}$ with $j^* = \arg\max_j R^{(j)}$.
  6. Repeat from step 1 at next time step.

That's it. Sample random plans, simulate them in your head, pick the best first action. Re-plan every step.

Hand Calculation Example

Worked Example

Setup: 1D environment. State = position on number line [0, 10]. Action = velocity in [-1, 1]. Reward = -|state - 7| (want to reach position 7). Current state s0 = 3. Horizon H = 3. N = 4 random sequences.

Sequence 1: actions = [0.8, 0.5, -0.2]

  s1 = 3 + 0.8 = 3.8, s2 = 3.8 + 0.5 = 4.3, s3 = 4.3 + (-0.2) = 4.1

  R = -|3-7| + -|3.8-7| + -|4.3-7| = -4 + -3.2 + -2.7 = -9.9

Sequence 2: actions = [1.0, 1.0, 1.0]

  s1 = 4, s2 = 5, s3 = 6

  R = -4 + -3 + -2 = -9.0

Sequence 3: actions = [1.0, 0.9, 0.8]

  s1 = 4, s2 = 4.9, s3 = 5.7

  R = -4 + -3 + -2.1 = -9.1

Sequence 4: actions = [-0.5, 0.3, 0.1]

  s1 = 2.5, s2 = 2.8, s3 = 2.9

  R = -4 + -4.5 + -4.2 = -12.7

Winner: Sequence 2 (R = -9.0). Execute first action: a0 = 1.0 (move right as fast as possible toward target 7).
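The hand calculation can be checked with a few lines of Python. This is a sketch under the example's own assumptions: the "learned" model is replaced by the toy dynamics s' = s + a, and the function names (`model`, `reward`, `rollout_return`, `random_shooting`) are illustrative rather than part of any library.

```python
import random

def model(s, a):
    # Stand-in for the learned model f_theta: toy dynamics s' = s + a.
    return s + a

def reward(s, a):
    # Negative distance to the target position 7.
    return -abs(s - 7.0)

def rollout_return(s0, actions):
    """Sum of r(s_k, a_k) for k = 0..H-1, simulated through the model."""
    s, total = s0, 0.0
    for a in actions:
        total += reward(s, a)
        s = model(s, a)
    return total

def random_shooting(s0, horizon, n_samples, rng):
    """Sample random plans, score them in the model, return the best first action."""
    candidates = [
        [rng.uniform(-1.0, 1.0) for _ in range(horizon)]
        for _ in range(n_samples)
    ]
    best = max(candidates, key=lambda seq: rollout_return(s0, seq))
    return best[0]

# The four sequences from the worked example.
hand_sequences = [
    [0.8, 0.5, -0.2],
    [1.0, 1.0, 1.0],
    [1.0, 0.9, 0.8],
    [-0.5, 0.3, 0.1],
]
returns = [rollout_return(3.0, seq) for seq in hand_sequences]
winner = max(range(len(returns)), key=lambda j: returns[j])
print([round(r, 1) for r in returns], "winner: sequence", winner + 1)

# With many random candidates the chosen first action also points right.
first_action = random_shooting(3.0, horizon=3, n_samples=1000, rng=random.Random(0))
print("first action from 1000 samples:", round(first_action, 2))
```

Note that the last action of each sequence never affects the score here, because the reward is evaluated on the state before each action is applied.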

Why this works surprisingly well

With N = 1000 random sequences of length H = 10, you're evaluating 1000 possible futures in parallel — all in your learned model, zero real environment steps. Even random search finds decent plans when you evaluate enough candidates. And you replan every step, so early mistakes get corrected.

Limitations

High-dimensional action spaces (e.g., a 20-DOF robot) make random sampling hopeless. With H=10 steps and 20-dimensional continuous actions, each plan is a point in a 200-dimensional space, so you need astronomically many samples to cover it. We need smarter search: see CEM in Chapter 06.

Chapter 04

Version 1.0: Dyna

Random shooting has no policy — it replans from scratch every step. What if we used the model to generate extra training data for a policy?

Dyna (Sutton, 1991) is the foundational idea: mix real transitions with imagined transitions, train a policy on both.

Definition
Dyna Architecture

A framework combining direct RL (learning from real experience) with indirect RL (learning from model-generated synthetic experience). The model serves as a data amplifier: one real transition can spawn many imagined ones.

Dyna-Q (Full Algorithm)
  1. Initialize: action-value function $Q(s,a)$ (the policy $\pi$ is derived from it, e.g. ε-greedy), model $f_\theta$, replay buffer $D = \{\}$.
  2. Collect real transition: Execute $a_t \sim \pi(s_t)$, observe $s_{t+1}$, $r_t$. Add $(s_t, a_t, r_t, s_{t+1})$ to $D$.
  3. Direct RL update: Update $Q(s_t, a_t)$ using the real transition (standard Q-learning).
  4. Model learning: Train $f_\theta$ on all data in $D$ to minimize MSE.
  5. Model rollouts (K imagined steps): For k = 1 to K:
      Sample s from previously visited states in D.
      Sample a randomly (or from current policy).
      Simulate: $s' = f_\theta(s, a)$, $r = \text{reward}(s, a)$.
      Indirect RL update: Update $Q(s, a)$ using imagined $(s, a, r, s')$.
  6. Repeat from step 2.

The key insight: every real transition is used twice. Once directly for Q-learning. Once to improve the model, which then generates K additional training transitions. With K=10, each real environment step effectively gives you 11 learning updates.

Data Amplification Factor

Effective Data: effective transitions per real step = 1 + K

K = number of model rollouts per real step. Typically K = 5-50.

Worked Example: Dyna-Q Grid World

Setup: 5x5 grid. Agent starts at (0,0), goal at (4,4). Reward = +1 at goal, 0 elsewhere. 4 actions: up/down/left/right. K = 5 model rollouts per step.

Real step: Agent at (2,3), goes right, arrives at (3,3), r=0. Q(2,3,right) updated.

Model training: fθ now knows (2,3)+right → (3,3).

5 imagined steps:

  1. Sample (1,1) from buffer, action=up, model predicts (1,2), r=0. Update Q(1,1,up).

  2. Sample (3,4) from buffer, action=right, model predicts (4,4), r=1! Update Q(3,4,right).

  3. Sample (0,2), action=right, model predicts (1,2), r=0. Update Q(0,2,right).

  4. Sample (4,3), action=up, model predicts (4,4), r=1! Update Q(4,3,up).

  5. Sample (2,2), action=down, model predicts (2,1), r=0. Update Q(2,2,down).

Result: In one real step, the agent discovered two paths to the goal (via model) that it hasn't physically tried yet. Q-learning with K=0 would need those actual experiences.
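A tabular sketch of Dyna-Q on the same 5x5 grid. Several choices here are assumptions rather than part of the source: the learned model is a lookup table of observed transitions (a stand-in for the neural network), imagined steps replay previously seen state-action pairs, and the hyperparameters (`alpha`, `gamma`, `eps`, episode count, step cap) are arbitrary.

```python
import random

ACTIONS = {"up": (0, 1), "down": (0, -1), "left": (-1, 0), "right": (1, 0)}
GOAL, SIZE = (4, 4), 5

def env_step(s, a):
    """Real environment: move within the grid, reward +1 on reaching the goal."""
    dx, dy = ACTIONS[a]
    nxt = (min(max(s[0] + dx, 0), SIZE - 1), min(max(s[1] + dy, 0), SIZE - 1))
    return nxt, (1.0 if nxt == GOAL else 0.0)

def dyna_q(episodes=50, k_planning=5, alpha=0.5, gamma=0.95, eps=0.1, seed=0):
    rng = random.Random(seed)
    q = {}      # Q[(state, action)]; missing entries read as 0
    model = {}  # tabular stand-in for the learned model: (s, a) -> (s', r)

    def q_update(s, a, r, s2):
        bootstrap = 0.0 if s2 == GOAL else gamma * max(q.get((s2, b), 0.0) for b in ACTIONS)
        old = q.get((s, a), 0.0)
        q[(s, a)] = old + alpha * (r + bootstrap - old)

    for _ in range(episodes):
        s = (0, 0)
        for _ in range(200):  # step cap per episode
            if rng.random() < eps:
                a = rng.choice(list(ACTIONS))
            else:
                best = max(q.get((s, b), 0.0) for b in ACTIONS)
                a = rng.choice([b for b in ACTIONS if q.get((s, b), 0.0) == best])
            s2, r = env_step(s, a)
            q_update(s, a, r, s2)        # direct RL from the real transition
            model[(s, a)] = (s2, r)      # model learning
            for _ in range(k_planning):  # indirect RL from imagined transitions
                ps, pa = rng.choice(list(model))
                ps2, pr = model[(ps, pa)]
                q_update(ps, pa, pr, ps2)
            s = s2
            if s == GOAL:
                break
    return q

q_table = dyna_q()
print("largest learned Q-value:", round(max(q_table.values()), 3))
```

Setting `k_planning=0` turns this into plain Q-learning, which makes it easy to compare how quickly value spreads backward from the goal with and without imagined updates.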

Why start rollouts from visited states?

If you roll out from an arbitrary state, the model might be inaccurate there (it's never seen that region). Starting from states in the buffer ensures the model is making predictions from regions where it has data — at least for the first step.

Chapter 05

Model Error Propagation

Here's the fundamental problem with MBRL. Your model is never perfect. And when you roll out multiple steps, errors compound exponentially.

The Compounding Error Theorem

Setup

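Let the true dynamics be p(s'|s,a) and the learned model be $f_\theta(s,a)$. To keep the argument simple, treat the true dynamics as deterministic and write $f^*(s,a)$ for the true next state. Suppose the model has per-step error bounded by ε:

$\lVert f_\theta(s, a) - f^*(s, a) \rVert \le \varepsilon \quad \forall\, s, a$

And suppose the true dynamics has Lipschitz constant L ≥ 1 (small state changes lead to proportionally bounded next-state changes):

$\lVert f^*(s_1, a) - f^*(s_2, a) \rVert \le L \cdot \lVert s_1 - s_2 \rVert$

Derivation: H-step error bound

Step 1: After 1 model step, $\text{error}_1 \le \varepsilon$ (by assumption).

Step 2: After 2 steps. The model starts from its own predicted state (with up to ε error). By the triangle inequality, the new error is at most the model's one-step error at that predicted state (at most ε) plus the true dynamics' amplification of the error already present (at most $L \cdot \text{error}_1$):

$\text{error}_2 \le L \cdot \text{error}_1 + \varepsilon \le L\varepsilon + \varepsilon = \varepsilon(L + 1)$

Step H: By induction:

H-step Error Bound: $\text{error}_H \le \varepsilon \cdot \frac{L^H - 1}{L - 1}$

For L > 1, this grows exponentially: roughly $\varepsilon \cdot L^H / (L - 1)$. For L = 1, the sum below gives linear growth, $\text{error}_H \le \varepsilon H$.

Inductive step: $\text{error}_{k+1} \le L \cdot \text{error}_k + \varepsilon$. Solving the recurrence: $\text{error}_H \le \varepsilon \sum_{k=0}^{H-1} L^k = \varepsilon \frac{L^H - 1}{L - 1}$.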

What the Numbers Mean

Numerical Example

Setup: Per-step error ε = 0.01 (1% relative error). Lipschitz constant L = 1.5 (mildly chaotic system).

• H = 1: error ≤ 0.01 (fine)

• H = 5: error ≤ $0.01 \times (1.5^{5} - 1)/0.5 = 0.01 \times 6.59/0.5 \approx 0.13$ (13% error)

• H = 10: error ≤ $0.01 \times (1.5^{10} - 1)/0.5 = 0.01 \times 56.67/0.5 \approx 1.13$ (113% error!)

• H = 20: error ≤ $0.01 \times (1.5^{20} - 1)/0.5 = 0.01 \times 3324.26/0.5 \approx 66.5$ (completely useless)

A model with 1% error per step is worthless after 20 steps if L = 1.5.
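The bound is easy to evaluate numerically. The sketch below iterates the recurrence error ≤ L · error + ε directly and compares it with the geometric-series closed form; the function names are illustrative, and ε = 0.01, L = 1.5 are the values from the example.

```python
def error_bound_recurrence(eps, L, H):
    """Iterate error_{k+1} = L * error_k + eps, starting from zero error."""
    err = 0.0
    for _ in range(H):
        err = L * err + eps
    return err

def error_bound_closed_form(eps, L, H):
    """Geometric-series form eps * (L^H - 1) / (L - 1); linear in H when L == 1."""
    if L == 1:
        return eps * H
    return eps * (L ** H - 1) / (L - 1)

for H in (1, 5, 10, 20):
    rec = error_bound_recurrence(0.01, 1.5, H)
    closed = error_bound_closed_form(0.01, 1.5, H)
    print(f"H={H:2d}  recurrence={rec:.4f}  closed form={closed:.4f}")
```

For these values the bound comes out at about 0.13 for H = 5, 1.13 for H = 10, and 66.5 for H = 20.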

The Takeaway

Long model rollouts are dangerous. Even an excellent model (99% accurate per step) produces garbage after enough steps. This is why every serious MBRL algorithm limits rollout horizon. The rest of this lesson is about managing this compounding.

Physical Intuition

Weather forecasting is exactly this problem. The atmosphere is a chaotic dynamical system (L >> 1). Models are quite good at 1-day predictions (~ε small), but 14-day forecasts are barely better than climatological averages. Same math, same exponential.

Derivation: Error Bound with Reward Lipschitz

Extend the error bound to cumulative reward. Suppose the reward function is $L_r$-Lipschitz: $|r(s_1,a) - r(s_2,a)| \le L_r \lVert s_1 - s_2 \rVert$. Show that the total reward error over H steps satisfies:

$|R_{\text{true}} - R_{\text{model}}| \le L_r \cdot \varepsilon \cdot H \cdot L^H / (L - 1)$

At step k, the state error is at most $\varepsilon (L^k - 1)/(L - 1)$. The reward error at step k is at most $L_r$ × (state error at step k). Sum over all H steps.

$\sum_{k=0}^{H-1} \frac{L^k - 1}{L - 1} = \frac{\sum_{k=0}^{H-1} L^k - H}{L - 1} = \frac{(L^H - 1)/(L - 1) - H}{L - 1}$. For a loose bound, use $\sum_{k=0}^{H-1} L^k \le H \cdot L^H$.

Step 1: At model step k, state error: $e_k \le \varepsilon (L^k - 1)/(L - 1)$.

Step 2: Reward error at step k: $|r(s_k^{\text{true}}, a) - r(s_k^{\text{model}}, a)| \le L_r \cdot e_k$.

Step 3: Total reward error: $|R_{\text{true}} - R_{\text{model}}| \le \sum_{k=0}^{H-1} L_r \cdot e_k \le L_r \cdot \varepsilon \cdot \sum_{k=0}^{H-1} \frac{L^k - 1}{L - 1}$.

Step 4: Bounding: since $L^k - 1 \le L^k$ and $\sum_{k=0}^{H-1} L^k \le H \cdot L^{H-1} \le H \cdot L^H$, the total is at most $L_r \cdot \varepsilon \cdot H \cdot L^H / (L - 1)$. ■

Key insight: The reward error grows as H × exponential, not just exponential. Longer horizons are doubly penalized: more steps AND larger per-step errors at later steps.
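To see how loose the final bound is, the sketch below compares the exact sum of per-step reward-error bounds with the simplified H · L^H expression. The values ε = 0.01 and L = 1.5 come from the earlier numerical example; L_r = 1 is an assumption made here purely for illustration.

```python
def state_error_bound(eps, L, k):
    """Bound on state error after k model steps (0 at k = 0, the real start state)."""
    return eps * (L ** k - 1) / (L - 1)

def reward_error_sum(eps, L, L_r, H):
    """Sum over k = 0..H-1 of L_r times the state-error bound at step k."""
    return sum(L_r * state_error_bound(eps, L, k) for k in range(H))

def reward_error_loose(eps, L, L_r, H):
    """Simplified bound L_r * eps * H * L^H / (L - 1)."""
    return L_r * eps * H * L ** H / (L - 1)

for H in (5, 10, 20):
    summed = reward_error_sum(0.01, 1.5, 1.0, H)
    loose = reward_error_loose(0.01, 1.5, 1.0, H)
    print(f"H={H:2d}  summed={summed:.3f}  loose={loose:.3f}")
```

The simplified expression overestimates the summed bound by a sizable factor, but both grow at the same exponential rate in H, which is the property the argument relies on.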

Chapter 06

Version 1.5: Short Horizon + Replan

The error bound tells us: keep rollouts short. But short horizons mean myopic plans. The solution: Model Predictive Control (MPC) — plan short, replan often.

Definition
Model Predictive Control (MPC)

A planning strategy that optimizes actions over a short horizon H, executes only the first action, observes the real next state, then replans. The planning horizon is a sliding window that moves forward with each real step.

MPC limits error propagation to H steps. If H = 5, the worst-case error is bounded by our H=5 calculation (manageable), regardless of the total episode length.

Cross-Entropy Method (CEM)

Random shooting wastes samples. CEM iteratively narrows the search toward good action sequences:

CEM for Action Optimization
  1. Initialize distribution: μ = 0, σ = 1 (per action dimension, per time step).
  2. Sample N action sequences from $\mathcal{N}(\mu, \sigma^2)$.
  3. Evaluate each sequence by rolling out through model, compute total reward.
  4. Select elite: Keep top M sequences (M << N, typically top 10%).
  5. Update μ, σ to fit the elite set (mean and std of top-M sequences).
  6. Repeat steps 2-5 for I iterations (typically I = 3-5).
  7. Execute first action of final μ sequence.

Hand Calculation: CEM

Worked Example: CEM (1D, H=2)

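Setup: Same 1D environment as before. State = 3, target = 7. H = 2, N = 6, M = 2 (top 2). Two simplifications keep the arithmetic short: each sequence is scored on the two states it reaches, and samples from the Gaussian are not clipped to the action range [-1, 1].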

Iteration 1: μ = [0, 0], σ = [1, 1]. Sample 6 sequences:

  [0.8, 0.3] → states [3.8, 4.1], R = -3.2 + -2.9 = -6.1

  [1.2, 0.9] → [4.2, 5.1], R = -2.8 + -1.9 = -4.7 (elite)

  [-0.5, 0.2] → [2.5, 2.7], R = -4.5 + -4.3 = -8.8

  [0.9, 1.1] → [3.9, 5.0], R = -3.1 + -2.0 = -5.1 (elite)

  [-0.3, -0.8] → [2.7, 1.9], R = -4.3 + -5.1 = -9.4

  [0.4, 0.6] → [3.4, 4.0], R = -3.6 + -3.0 = -6.6

Elite: [1.2, 0.9] and [0.9, 1.1]. New μ = [1.05, 1.0], σ = [0.15, 0.1].

Iteration 2: Sample around [1.05, 1.0] with tight std. Concentrates on the best region.

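Result: After one iteration, CEM is already concentrating on "go right fast", the same answer random shooting found. Each further iteration spends its samples near the current best plan instead of spreading them across the whole action space.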
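The elite update in the worked example can be reproduced in a few lines, followed by a small general CEM loop. This is a sketch under stated assumptions: the model is the toy dynamics s' = s + a, sequences are scored on the states they reach (as in the example), the standard deviation is the population standard deviation of the elite set, and the general `cem` function clips samples to [-1, 1], which the hand example does not. All function names are illustrative.

```python
import random
import statistics

def score(s0, actions):
    """Sum of -|s - 7| over the states reached, using toy dynamics s' = s + a."""
    s, total = s0, 0.0
    for a in actions:
        s = s + a
        total += -abs(s - 7.0)
    return total

def refit(elites):
    """Per-time-step mean and population std of the elite sequences."""
    horizon = len(elites[0])
    mu = [statistics.fmean([seq[t] for seq in elites]) for t in range(horizon)]
    sigma = [statistics.pstdev([seq[t] for seq in elites]) for t in range(horizon)]
    return mu, sigma

def cem(s0, horizon, n_samples, n_elite, iterations, rng):
    mu, sigma = [0.0] * horizon, [1.0] * horizon
    for _ in range(iterations):
        samples = [
            [min(1.0, max(-1.0, rng.gauss(mu[t], sigma[t]))) for t in range(horizon)]
            for _ in range(n_samples)
        ]
        samples.sort(key=lambda seq: score(s0, seq), reverse=True)
        mu, sigma = refit(samples[:n_elite])
    return mu  # execute mu[0] under MPC

# Iteration 1 of the hand calculation: six sampled sequences, keep the top two.
hand_samples = [[0.8, 0.3], [1.2, 0.9], [-0.5, 0.2], [0.9, 1.1], [-0.3, -0.8], [0.4, 0.6]]
ranked = sorted(hand_samples, key=lambda seq: score(3.0, seq), reverse=True)
hand_mu, hand_sigma = refit(ranked[:2])
print("elites:", ranked[:2], "mu:", hand_mu, "sigma:", hand_sigma)

plan = cem(3.0, horizon=2, n_samples=100, n_elite=10, iterations=4, rng=random.Random(0))
print("CEM mean plan:", [round(a, 2) for a in plan])
```

In practice a small floor on `sigma` is often added so the distribution does not collapse to a point after a few iterations; that refinement is omitted here.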

CEM = Evolution Strategy for Planning

CEM is a special case of evolutionary optimization. The "population" is action sequences. "Fitness" is model-predicted reward. "Selection" keeps the elite. "Mutation" is sampling from the updated Gaussian. Simple, derivative-free, parallelizable.

The Full MPC Loop

MPC with CEM
For each real time step t:
  1. Run CEM($s_t$, H, N, M, I) → optimal action sequence $a^*_{t:t+H-1}$
  2. Execute $a^*_t$ in real environment
  3. Observe $s_{t+1}$, store $(s_t, a^*_t, s_{t+1})$ in D
  4. (Optional) Retrain $f_\theta$ on D

Computational Cost

MPC replans at every step. Each replan runs CEM for I iterations of N candidate rollouts, each H model steps long, so it costs I × N × H forward passes. For I=5, N=500, H=10, that's 25,000 model forward passes per real step. The model must be fast to evaluate. Deep ensemble models (7 networks) push this further.
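The cost arithmetic can be wrapped in a small helper for budgeting. The function name and the `rollouts_per_candidate` parameter are hypothetical; the latter stands for however many model rollouts are spent on each candidate sequence (one for a single deterministic model, more when several particles or ensemble members are used), and the value 20 below is an arbitrary illustration.

```python
def forward_passes_per_replan(iterations, n_samples, horizon, rollouts_per_candidate=1):
    """Model forward passes needed for one CEM replan."""
    return iterations * n_samples * horizon * rollouts_per_candidate

# The configuration from the text: I=5, N=500, H=10 with a single model.
single_model = forward_passes_per_replan(5, 500, 10)

# Hypothetical: 20 rollouts per candidate (an assumed value, not from the source).
with_particles = forward_passes_per_replan(5, 500, 10, rollouts_per_candidate=20)

print(single_model, with_particles)
```

Since the candidates within one CEM iteration are independent, these forward passes are normally batched into a few large tensor operations rather than executed one at a time.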

Chapter 07

Ensemble Models & Uncertainty

A single model gives you a point prediction. It can't tell you "I'm not sure about this region." Ensembles fix this by training multiple models and measuring their disagreement.

Definition
Epistemic Uncertainty via Ensembles

Train K models $\{f_{\theta_1}, \ldots, f_{\theta_K}\}$ with different random initializations and/or different data subsets (bootstrap). At test time, run all K models. Their disagreement (variance of predictions) estimates epistemic uncertainty: uncertainty due to lack of data.

Ensemble Prediction & Uncertainty
mean prediction: $\mu(s,a) = \frac{1}{K} \sum_{k=1}^{K} f_{\theta_k}(s, a)$
uncertainty: $\sigma^2(s,a) = \frac{1}{K} \sum_{k=1}^{K} \lVert f_{\theta_k}(s,a) - \mu(s,a) \rVert^2$

Symbol by symbol:

$K$: number of models in ensemble (typically 5-7).

$f_{\theta_k}$: the k-th model, trained independently.

$\mu(s,a)$: average prediction. Use this as the "best guess" next state.

$\sigma^2(s,a)$: variance across models. High = "we haven't seen this region" = don't trust the prediction.
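A minimal sketch of the two ensemble formulas, taking the K model predictions as plain lists. The function name `ensemble_stats` and the example prediction values are made up for illustration.

```python
def ensemble_stats(predictions):
    """Return (mean prediction, disagreement) for K next-state predictions.

    predictions: list of K vectors, one predicted next state per ensemble member.
    Disagreement is the average squared L2 distance from the mean prediction.
    """
    k = len(predictions)
    dim = len(predictions[0])
    mean = [sum(p[d] for p in predictions) / k for d in range(dim)]
    variance = sum(
        sum((p[d] - mean[d]) ** 2 for d in range(dim)) for p in predictions
    ) / k
    return mean, variance

# Models agree: a region that is well covered by data.
mean_seen, var_seen = ensemble_stats([[1.0, 2.0], [1.0, 2.0], [1.0, 2.0]])

# Models disagree on the first state dimension: a region with little data.
mean_new, var_new = ensemble_stats([[1.0, 2.0], [1.2, 2.0], [0.8, 2.0]])

print(mean_seen, var_seen)
print(mean_new, round(var_new, 4))
```

Both cases have the same mean prediction, so only the variance term distinguishes the trustworthy prediction from the uncertain one.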

PETS: Probabilistic Ensembles with Trajectory Sampling

PETS Algorithm (Chua et al., 2018)
  1. Train ensemble of K probabilistic models. Each $f_{\theta_k}$ outputs $(\mu_k, \Sigma_k)$: mean AND variance of predicted next state.
  2. For planning (CEM): To evaluate an action sequence, propagate particles:
      For each particle p and each step h:
       Randomly assign particle to model $k \sim \text{Uniform}(1..K)$.
       Sample $s_{h+1} \sim \mathcal{N}(\mu_k(s_h, a_h), \Sigma_k(s_h, a_h))$.
  3. Score action sequence by average reward across particles.
  4. MPC: Execute first action, replan each step.

Why random model assignment per particle per step?

This samples from the full distribution of possible dynamics, not just the mean. If models disagree about what happens after action a, some particles go one way, others go another. The planning algorithm sees high variance in returns → avoids that action. It's naturally pessimistic in uncertain regions.
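The particle propagation step can be sketched as follows for a 1D state. Everything specific here is an assumption for illustration: each "probabilistic model" is a hand-written function returning a mean and standard deviation, the models differ only in a hypothetical `gain` on the action, and the reward is the -|s - 7| objective from the earlier toy examples.

```python
import random

def make_model(gain, std):
    # Hypothetical probabilistic model: next state ~ N(s + gain * a, std^2).
    return lambda s, a: (s + gain * a, std)

def particle_returns(models, s0, actions, n_particles, rng):
    """Return one sampled return per particle for a fixed action sequence."""
    returns = []
    for _ in range(n_particles):
        s, total = s0, 0.0
        for a in actions:
            total += -abs(s - 7.0)
            mean, std = rng.choice(models)(s, a)  # random ensemble member per step
            s = rng.gauss(mean, std)              # sample the next state
        returns.append(total)
    return returns

actions = [1.0, 1.0, 1.0]

# Ensemble members that agree exactly (and predict no noise).
agree = [make_model(1.0, 0.0) for _ in range(5)]
r_agree = particle_returns(agree, 3.0, actions, 20, random.Random(0))

# Ensemble members that disagree about the effect of the action.
disagree = [make_model(g, 0.05) for g in (0.5, 1.0, 1.5)]
r_disagree = particle_returns(disagree, 3.0, actions, 20, random.Random(0))

score_agree = sum(r_agree) / len(r_agree)
score_disagree = sum(r_disagree) / len(r_disagree)
spread_agree = max(r_agree) - min(r_agree)
spread_disagree = max(r_disagree) - min(r_disagree)
print(score_agree, round(score_disagree, 2), spread_agree, round(spread_disagree, 2))
```

The spread of particle returns is zero when the models agree and positive when they do not, which is the signal a planner can use to tell well-understood actions from uncertain ones.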

Pessimism Principle

When models disagree, the agent should be conservative. This emerges automatically from PETS (high-disagreement actions have high-variance returns, which CEM avoids), but can be made explicit:

Pessimistic Reward Penalty: $r_{\text{eff}}(s, a) = r(s, a) - \lambda \cdot \sigma(s, a)$

λ controls exploration-exploitation. Higher λ = more conservative.
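A small sketch of how the penalty changes which action looks best. The two candidate actions and their (reward, disagreement) numbers are invented for illustration, as are the function names.

```python
def pessimistic_reward(r, sigma, lam):
    """Effective reward: predicted reward minus lam times ensemble disagreement."""
    return r - lam * sigma

# Hypothetical candidates: action name -> (predicted reward, ensemble disagreement).
candidates = {"risky": (1.0, 0.8), "safe": (0.7, 0.1)}

def best_action(lam):
    return max(candidates, key=lambda name: pessimistic_reward(*candidates[name], lam))

print(best_action(0.0))  # no penalty: the higher predicted reward wins
print(best_action(1.0))  # with penalty: the low-disagreement action wins
```

For these numbers the preference flips at λ = 3/7, where 1.0 - 0.8λ equals 0.7 - 0.1λ.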
PETS on HalfCheetah

Setup: 5 models, each 4-layer MLP (200 hidden), outputs Gaussian ($\mu$, $\sigma^2$) per state dim. Trained on 10,000 real transitions. CEM with N=500, M=50, I=5, H=30.

Result: Achieves near-SAC performance with roughly 30x fewer environment steps (10K versus SAC's 300K in the Chapter 01 table). The ensemble's uncertainty estimate prevents the MPC from exploiting model errors.

Chapter 08

Version 2.0: Model-Based Policy Optimization

MPC has a fatal limitation: no policy. It replans from scratch every step, which is computationally expensive at deployment. What if we used the model to train a reusable policy?

MBPO (Janner et al., 2019) is a widely used answer. Use short model rollouts to generate data, then train a model-free algorithm (SAC) on that data.

Definition
Model-Based Policy Optimization (MBPO)

An algorithm that: (1) learns an ensemble dynamics model from real data, (2) generates short synthetic rollouts (1-5 steps) starting from real states, and (3) trains a model-free policy (SAC) on the combined real + synthetic dataset. Achieves model-based sample efficiency with model-free asymptotic performance.

MBPO (Full Algorithm)
  1. Initialize: policy πφ, ensemble model {fθk}, real buffer Dreal, model buffer Dmodel.
  2. Collect real data: Execute πφ in environment for E steps. Add to Dreal.
  3. Train model: Update ensemble on Dreal (MSE loss).
  4. Generate model data: For M rollouts:
      Sample start state s0 uniformly from Dreal.
      Roll out using πφ and random model fθk for h steps (h = 1 to 5).
      Add all (s, a, r, s') to Dmodel.
  5. Train policy: Run G gradient steps of SAC on Dreal ∪ Dmodel.
  6. Repeat from step 2.
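Step 4 is the part that distinguishes MBPO, so here is a sketch of just the synthetic-data generation. The SAC update and model training are out of scope; the 1D state, the hand-written ensemble members, the bang-bang policy, and all names (`generate_model_data`, `make_model`) are assumptions made for illustration.

```python
import random

def make_model(gain):
    # Hypothetical ensemble member for a 1D state: s' = s + gain * a.
    return lambda s, a: s + gain * a

def generate_model_data(real_buffer, models, policy, reward_fn, n_rollouts, h, rng):
    """Branch short rollouts of length h from states sampled out of the real buffer."""
    model_buffer = []
    for _ in range(n_rollouts):
        s = rng.choice(real_buffer)[0]         # start from a real state
        for _ in range(h):
            a = policy(s)
            s_next = rng.choice(models)(s, a)  # random ensemble member per step
            model_buffer.append((s, a, reward_fn(s, a), s_next))
            s = s_next
    return model_buffer

# Real buffer of (s, a, r, s') tuples, invented for the example.
real_buffer = [(float(i), 1.0, -abs(float(i) - 7.0), float(i) + 1.0) for i in range(5)]
models = [make_model(g) for g in (0.9, 1.0, 1.1)]
policy = lambda s: 1.0 if s < 7.0 else -1.0
reward_fn = lambda s, a: -abs(s - 7.0)

model_buffer = generate_model_data(
    real_buffer, models, policy, reward_fn, n_rollouts=10, h=3, rng=random.Random(0)
)
print(len(real_buffer), "real transitions ->", len(model_buffer), "synthetic transitions")
```

Every rollout begins at a state the real system has actually visited, so only the last h - 1 transitions of each rollout depend on states the model itself predicted.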

Why Short Rollouts (h = 1 to 5)?

The error propagation bound from Chapter 05 tells us: with L=1.5 and ε=0.01, a 5-step rollout has ~13% state error. That's borderline acceptable for training a policy. A 20-step rollout? Over 6,600% error, which is pure noise. MBPO explicitly bounds rollout length.

MBPO Rollout Length Selection: $h^* = \arg\max_h \{\, \text{model data benefit} - \text{model error cost} \,\}$

Typically $h^* \in \{1, 2, 3, 4, 5\}$ depending on model quality

Monotonic Improvement Guarantee

MBPO comes with a theoretical guarantee: under certain conditions, the policy improves monotonically (never gets worse). The key bound:

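MBPO Lower Bound on Policy Improvement: $\eta[\pi'] \ge \hat{\eta}_{\text{model}}[\pi'] - C \cdot \max_s \mathbb{E}_{a \sim \pi'}\left[ D_{TV}\big(p(s' \mid s,a) \,\|\, f_\theta(s' \mid s,a)\big) \right]$

$\eta$ = true return, $\hat{\eta}$ = model return, C = constant depending on H and γ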

In English: the true policy performance is at least as good as the model-predicted performance minus a term proportional to the maximum model error. Keep model error small → model-predicted improvement transfers to reality.

MBPO = Best of Both Worlds

Model-based sample efficiency (learns model from few real transitions). Model-free asymptotic performance (SAC eventually converges to optimal given enough data). The model bootstraps early learning; SAC takes over as real data accumulates.

Comparison: MPC vs. MBPO

| Property | MPC (PETS) | MBPO |
| --- | --- | --- |
| Has a policy? | No (replans each step) | Yes (SAC) |
| Compute at deploy | High (CEM per step) | Low (one forward pass) |
| Rollout length | 10-30 steps | 1-5 steps |
| Handles model error | Replanning corrects | Short rollouts limit it |
| Sample efficiency | Best (10K steps) | Great (30K steps) |
| Asymptotic perf. | Limited by model | Matches model-free |

Checkpoint: Before you move on
MBPO uses 1-5 step rollouts while MPC uses 10-30 steps. Both use the same model. Why can MPC afford longer rollouts without catastrophic failure, while MBPO cannot?
Model Answer

MPC replans every step. If a 30-step plan has garbage predictions at step 20, it doesn't matter — you only execute step 1, observe reality, and start fresh. The real observation "resets" the error accumulation every single step.

MBPO bakes errors into training data. If a 20-step rollout drifts far from reality, those fake transitions become permanent training data for SAC. The policy learns from this corrupted data, potentially converging to a bad solution. There's no "reality check" mid-rollout. Short rollouts (1-5) keep each synthetic transition close enough to truth that SAC can learn correctly from it.

Chapter 09

When Models Fail

Three distinct failure modes plague MBRL. Understanding them is essential for debugging real systems.

Failure Mode 1: Objective Mismatch

Definition
Objective Mismatch

The model is trained to minimize prediction error uniformly across the state space (MSE loss). But the policy only visits a tiny fraction of states. The model might be excellent at predicting states the policy never visits, while being poor at the specific states the policy cares about.

Think of it this way: you train a weather model to predict temperature across the entire Earth equally well. But your user only cares about whether it'll rain in San Francisco tomorrow. Global MSE might be low, but SF predictions could be terrible.

Concrete Example

Robot arm trained to reach objects. Model learned from random exploration data — mostly middle-of-workspace positions. The optimal policy reaches to far corners (where objects actually are). Model accuracy in corners is terrible because corners were rarely visited during random data collection. Policy exploits corner predictions, gets weird results in reality.

Failure Mode 2: Model Exploitation

Definition
Model Exploitation

The policy optimization process (SAC, CEM) is an adversarial optimizer against the model. It will find and exploit any systematic errors. If the model overpredicts reward for some (s,a) pair, the policy will steer toward that pair — even though real reward is low.

Model Exploitation: $\pi^* = \arg\max_\pi J_{\text{model}}(\pi) \ne \arg\max_\pi J_{\text{real}}(\pi)$

The model-optimal policy is NOT the real-optimal policy when the model has systematic errors

This is the same phenomenon as overfitting in supervised learning, but more dangerous: the policy actively searches for model weaknesses.

Failure Mode 3: Compounding Errors in Multi-Step

We derived this in Chapter 05. But here's the insidious part: it's not just that predictions get noisy. The errors are correlated. If the model consistently underpredicts friction, it'll predict the robot slides further than reality at every step. After 10 steps, the model thinks the robot is in a completely different place: systematically wrong, not randomly wrong.

The Triad of Death

These three failure modes compound each other: (1) Model is trained on wrong distribution (mismatch). (2) Policy exploits model's weaknesses in that distribution (exploitation). (3) Multi-step rollouts amplify the exploitation exponentially (compounding). The policy learns to "hack" the model, finding imaginary reward that doesn't exist in reality.

Defenses

| Defense | Addresses | Mechanism |
| --- | --- | --- |
| Short rollouts | Compounding | Limits error accumulation to 1-5 steps |
| Ensemble disagreement | Exploitation | Penalize regions where models disagree |
| Online model updates | Mismatch | Retrain model as policy visits new states |
| DAgger-style collection | Mismatch | Collect data using current policy, not random |
| Reward penalty | Exploitation | $r_{\text{eff}} = r - \lambda\sigma$ (pessimistic) |

Chapter 10

Case Study: Real Robot Learning

Let's put it all together. Real robot manipulation with MBRL: pushing objects to target locations using a 7-DOF arm.

The Setup

State (17D): 7 joint positions + 7 joint velocities + 3D object position.

Action (7D): Desired joint velocity for each motor.

Reward: -||object_pos - target_pos|| (negative distance to target).

Horizon: 100 steps per episode (5 seconds at 20Hz control).

The Recipe

Real Robot MBRL Pipeline
  1. Initial data (2 min): Random actions for 25 episodes (2500 transitions). Motor commands in [-1,1], clipped.
  2. Train ensemble: 5 models, each 4-layer MLP (500 hidden), predicting Δs. Train until validation loss plateaus (~2 min on GPU).
  3. MPC with CEM: H=25 steps, N=500 samples, M=50 elites, I=5 iterations. ~150ms per planning step (fast enough for 20Hz control with parallelization).
  4. Execute + collect: Run MPC policy for 5 episodes. Add to buffer.
  5. Retrain model: Full retrain on expanded buffer.
  6. Repeat 4-5: 4 iterations total. ~20 minutes wall clock time.

Results Comparison

| Method | Real Robot Time | Success Rate | Real Transitions |
| --- | --- | --- | --- |
| MBRL (ensemble + MPC) | 20 min | 85% | 5,000 |
| SAC (model-free) | 4 hours | 90% | 50,000 |
| PPO (model-free) | 20+ hours | 80% | 500,000+ |
| Random shooting (no CEM) | 20 min | 60% | 5,000 |

10x less data than SAC. 100x less than PPO. Slightly lower success rate, but achievable in a single lab session rather than days of continuous robot operation.

Why 20 Minutes Works

The robot's dynamics are relatively smooth (no contacts, no sudden discontinuities for pushing). The 17D state is low-dimensional enough that 5,000 transitions cover the relevant region well. The ensemble catches regions of uncertainty (near table edges, unusual object orientations) and avoids them.

When This Recipe Fails

Contact-rich tasks (grasping, insertion) have discontinuous dynamics that neural networks struggle with. High-dimensional visual observations (images) need different architectures (latent space models). Multi-object manipulation has combinatorial complexity that 5,000 transitions can't cover.

Chapter 11

Summary & Cheat Sheet

Version Comparison

| Version | Algorithm | Planning | Policy? | Key Idea |
| --- | --- | --- | --- | --- |
| 0.5 | Random Shooting | Sample & pick best | No | Brute force in model |
| 1.0 | Dyna | Model rollouts → buffer | Yes (Q) | Data augmentation |
| 1.5 | MPC + CEM | Iterative optimization | No | Short horizon + replan |
| 1.5+ | PETS | CEM + particles | No | Ensemble uncertainty |
| 2.0 | MBPO | Short rollouts → SAC | Yes (SAC) | Model-free with model data |

Decision Tree: When to Use Model-Based RL

Use Model-Based When...

Data is expensive (real robots, expensive simulations, clinical trials).

Dynamics are relatively smooth (continuous state, no hard contacts).

State dimension is moderate (<100D from sensors; for images, use latent models).

You need fast adaptation (new task, change in dynamics).

Stick with Model-Free When...

Simulation is free (video games, MuJoCo if you don't care about wall time).

Dynamics are chaotic (high Lipschitz constant, contacts everywhere).

You need maximum asymptotic performance and can afford the data.

Observation space is very high-dimensional (raw pixels without compression).

Key Equations Reference

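Dynamics Learning: $L(\theta) = \frac{1}{N} \sum_{i=1}^{N} \lVert f_\theta(s_i, a_i) - s_i' \rVert^2$
Error Propagation: $\text{error}_H \le \varepsilon \cdot \frac{L^H - 1}{L - 1}$
Ensemble Uncertainty: $\sigma^2(s,a) = \frac{1}{K} \sum_{k=1}^{K} \lVert f_{\theta_k}(s,a) - \mu(s,a) \rVert^2$
Pessimistic Reward: $r_{\text{eff}} = r - \lambda\sigma$

Chapter 12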

Connections & What's Next

Where MBRL Lives in the RL Landscape

| This Lesson | Connection | Related Topic |
| --- | --- | --- |
| Dynamics model | Same idea: predict the future from present | World Models |
| Ensemble uncertainty | Bayesian approach to model uncertainty | Bayesian Neural Networks |
| MPC replanning | Kalman filter = model-based state estimation | Kalman Filter |
| Model for data augmentation | Offline RL uses model to fill gaps | Offline RL (MOPO, MOReL) |
| Policy gradients on model data | MBPO uses SAC = policy gradient + Q | Policy Gradients |
| Error compounding | Same issue in imitation learning (DAgger) | Behavioral Cloning |

The World Models Frontier

MBRL's natural evolution: instead of predicting raw state $s_{t+1}$, learn a latent dynamics model that predicts in a compressed representation. This handles high-dimensional observations (images) and enables longer-horizon planning by operating in a smoother space. Key systems: Dreamer (Hafner et al.), IRIS, and foundation world models.

Open Problems

Contact-rich manipulation: Discontinuous dynamics break smooth model assumptions.

Multi-task transfer: Can one world model serve many tasks?

Online adaptation: How fast can models adapt when dynamics change (e.g., robot picks up heavy object)?

Scaling to vision: Latent models work, but sample efficiency gains diminish at scale.

The Big Picture

Model-based RL is about building a mental model of the world and using it to think ahead. Every intelligent system — humans, animals, chess engines — does some form of this. The question is never "should we model?" but "how wrong can our model be before planning in it hurts more than it helps?" The entire MBRL literature is an answer to that question.

"The map is not the territory — but a good map saves you from walking off cliffs." — Adapted from Korzybski