CS224R — Model-Based RL: The Complete Guide

Roadmap

What You'll Master

01 Why Model-Based?
02 The Dynamics Model
03 Random Shooting MPC
04 Dyna: Real + Imagined
05 Error Propagation
06 Short Horizon + Replan
07 Ensemble Models
08 Model-Based Policy Opt.
09 When Models Fail
10 Real Robot Learning
11 Summary & Cheat Sheet
12 Connections

Chapter 01

Why Model-Based?

You're training a robot arm to stack wooden blocks. Each attempt takes about 10 seconds: reach, grasp, lift, place, check, reset. Your model-free RL algorithm (say, SAC) needs around one million such interactions. At 10 seconds each, that's 10 million seconds, or roughly 116 days of non-stop robot operation for a single stacking skill.

In simulation, a million steps is a coffee break. On a real robot, it's a death sentence for your project timeline, your motors, and your grad student's sanity.

What if the robot could practice in its imagination?

The Chess Grandmaster Analogy

Think about how a chess grandmaster plays. They don't physically move pieces to evaluate a position. They simulate moves mentally — "if I take the bishop, they'll push the pawn, I'll castle..." — and only execute moves they've already vetted in their head. The physical board sees maybe 40 moves. The mental board sees thousands.

That's the core idea. Instead of learning purely from real interactions (expensive), you learn a dynamics model — a learned simulator of how the world works — and then plan or practice inside that simulator.

Definition

Model-Based Reinforcement Learning (MBRL)

A family of RL methods that explicitly learn a model of the environment's transition dynamics, then use that model for planning, data augmentation, or policy optimization. Real data teaches the model. The model generates synthetic experience. The agent learns from both.

Sample Efficiency: The Numbers

Method	HalfCheetah Steps	Ant Steps	Real Robot Time
PPO (model-free)	1M	3M	~100 days
SAC (model-free)	300K	1M	~35 days
MBPO (model-based)	30K	100K	~3 days
PETS (model-based)	10K	n/a	~1 day

10-100x less data. That's not a minor improvement. It's the difference between "works in simulation only" and "works on a real robot in your lab."

The Core Trade-Off

Model-free: simple but data-hungry. Model-based: data-efficient but introduces model bias — if your learned model is wrong, you'll plan confidently toward disaster. The entire field of MBRL is about managing this trade-off.

Chapter 02

The Dynamics Model

The core building block. We want to learn a function that predicts what happens next:

Learned Dynamics

f_θ(s_t, a_t) → s_{t+1}

Given current state + action, predict next state

This is supervised learning. We collect transitions (s_t, a_t, s_t+1) from the real environment, then train a neural network to predict s_t+1 from (s_t, a_t).

Definition

Dynamics Model (Transition Model)

A parametric function f_θ : S × A → S that approximates the environment's true transition function p(s_t+1 | s_t, a_t). Usually a neural network trained with mean squared error on collected transitions.

Training Objective

MSE Loss for Dynamics

L(θ) = (1/N) Σ_{i=1}^{N} || f_θ(s_i, a_i) − s_i' ||²

s_i' is the observed next state from real data

Symbol by symbol:

• N — number of transitions in our dataset (replay buffer).

• f_θ(s_i, a_i) — model's predicted next state. This is a d-dimensional vector (same dimension as the state space).

• s_i' — the actual next state observed from the real environment.

• || · ||² — squared L2 norm. Sum of squared differences across all state dimensions.
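As a quick sanity check of the objective, the sketch below evaluates L(θ) on three made-up 2D transitions. The `toy_model` function and the data are illustrative assumptions standing in for a trained f_θ and a replay buffer; nothing here is tied to a particular library.

```python
def toy_model(state, action):
    """Hypothetical stand-in for f_theta: assumes s' = s + 0.1 * a per dimension."""
    return [s + 0.1 * a for s, a in zip(state, action)]

def dynamics_mse(model, transitions):
    """Mean over transitions of the squared L2 norm of the prediction error."""
    total = 0.0
    for state, action, next_state in transitions:
        pred = model(state, action)
        total += sum((p - t) ** 2 for p, t in zip(pred, next_state))
    return total / len(transitions)

# Made-up (s, a, s') tuples; the action has the same dimension as the state here.
transitions = [
    ([0.0, 0.0], [1.0, 0.0], [0.1, 0.0]),   # predicted exactly
    ([1.0, 2.0], [0.0, 1.0], [1.0, 2.3]),   # off by 0.2 in the second dimension
    ([0.5, 0.5], [1.0, 1.0], [0.5, 0.5]),   # off by 0.1 in both dimensions
]
loss = dynamics_mse(toy_model, transitions)
print(round(loss, 4))
```

The squared errors are 0, 0.04 and 0.02, so the mean over N = 3 transitions is about 0.02.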

Architecture

A standard dynamics model is a feedforward neural network:

python
import torch  # needed for torch.cat in forward()
import torch.nn as nn

class DynamicsModel(nn.Module):
    def __init__(self, state_dim, action_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, state_dim)  # predict next state
        )

    def forward(self, state, action):
        x = torch.cat([state, action], dim=-1)
        return self.net(x)  # predicted s_{t+1}

Delta vs. Absolute Prediction

In practice, predicting the change in state works better than predicting the absolute next state:

Delta Prediction (Standard Practice)

f_θ(s_t, a_t) = Δs
s_{t+1} = s_t + Δs

Why deltas?

Most of the state doesn't change much between timesteps. A robot arm at position [1.2, 0.5, 0.8] will be at [1.21, 0.49, 0.81] next step. Predicting the absolute value means the network wastes capacity learning the identity function. Predicting the small delta focuses learning on what actually changes.
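A minimal sketch of how delta targets might be built and used, assuming states are plain lists of floats. The helper names (`to_delta_targets`, `predict_next_state`, `memorised_delta_model`) are hypothetical and only illustrate the bookkeeping.

```python
def to_delta_targets(transitions):
    """Turn (s, a, s') tuples into (s, a, delta) examples with delta = s' - s."""
    return [(s, a, [n - c for n, c in zip(s_next, s)]) for s, a, s_next in transitions]

def predict_next_state(delta_model, state, action):
    """Recover the absolute next state from a model that predicts the change."""
    delta = delta_model(state, action)
    return [c + d for c, d in zip(state, delta)]

# The arm position from the text, with a placeholder one-dimensional action.
transitions = [([1.2, 0.5, 0.8], [0.0], [1.21, 0.49, 0.81])]
_, _, delta = to_delta_targets(transitions)[0]
print([round(d, 2) for d in delta])  # regression targets are small numbers near zero

def memorised_delta_model(state, action):
    """Hypothetical delta model that has memorised this single transition."""
    return delta

s_next = predict_next_state(memorised_delta_model, [1.2, 0.5, 0.8], [0.0])
```

The regression targets are on the order of 0.01 rather than 1, which is the point of the reparameterisation: the network output only has to represent the change.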

Data Collection

Where do we get (s, a, s') transitions? Three options, often used in combination:

• Random policy: Execute random actions, collect transitions. Cheap but only covers states near the initial distribution.

• Current policy: Roll out whatever policy we currently have. Covers relevant states but is expensive.

• Replay buffer: Store ALL transitions ever collected. Train model on full history. Gets richer over time.

Warning: Distribution Shift

If you train the model only on random-policy data but then use it to plan at states a good policy visits, the model is making predictions in regions it never saw during training. This is the fundamental challenge of MBRL — the model must be accurate where the current policy goes, not where past data lives.

Chapter 03

Version 0.5: Random Shooting MPC

The simplest possible use of a dynamics model. No policy network, no gradients, no value function. Just brute-force search:

Random Shooting MPC

1. Observe current state s_t from real environment.
2. Sample N random action sequences: A^(j) = (a_t^(j), a_{t+1}^(j), ..., a_{t+H-1}^(j)) for j = 1..N.
3. Simulate each sequence through the learned model, starting from s_t^(j) = s_t: s_{t+k+1}^(j) = f_θ(s_{t+k}^(j), a_{t+k}^(j)) for k = 0..H-1.
4. Score each sequence: R^(j) = Σ_{k=0}^{H-1} r(s_{t+k}^(j), a_{t+k}^(j)).
5. Execute only the first action of the best sequence: a_t* = a_t^(j*), where j* = argmax_j R^(j).
6. Repeat from step 1 at next time step.

That's it. Sample random plans, simulate them in your head, pick the best first action. Re-plan every step.
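The loop above can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not a reference implementation: the 1D dynamics `model`, the reward, and the choice to let the "real" environment equal the model are all made up to keep the example self-contained.

```python
import random

def model(state, action):
    """Assumed toy dynamics for illustration: position moves by the chosen velocity."""
    return state + action

def reward(state, action):
    """Negative distance to the target position 7."""
    return -abs(state - 7.0)

def random_shooting(state, horizon, n_samples, rng):
    """Return the first action of the best of n_samples random action sequences."""
    best_return, best_first_action = float("-inf"), None
    for _ in range(n_samples):
        actions = [rng.uniform(-1.0, 1.0) for _ in range(horizon)]
        s, total = state, 0.0
        for a in actions:            # imagined rollout through the model
            total += reward(s, a)
            s = model(s, a)
        if total > best_return:
            best_return, best_first_action = total, actions[0]
    return best_first_action

rng = random.Random(0)
state = 3.0
for _ in range(8):                   # replan at every real step
    action = random_shooting(state, horizon=3, n_samples=200, rng=rng)
    state = model(state, action)     # here the "real" environment equals the model
print(round(state, 2))
```

Only the first action of each plan is ever executed; the remaining H-1 actions exist solely to score that first action.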

Hand Calculation Example

Worked Example

Setup: 1D environment. State = position on number line [0, 10]. Action = velocity in [-1, 1]. Reward = -|state - 7| (want to reach position 7). Current state s₀ = 3. Horizon H = 3. N = 4 random sequences.

Sequence 1: actions = [0.8, 0.5, -0.2]

s₁ = 3 + 0.8 = 3.8, s₂ = 3.8 + 0.5 = 4.3, s₃ = 4.3 + (-0.2) = 4.1

R = -|3-7| + -|3.8-7| + -|4.3-7| = -4 + -3.2 + -2.7 = -9.9

Sequence 2: actions = [1.0, 1.0, 1.0]

s₁ = 4, s₂ = 5, s₃ = 6

R = -4 + -3 + -2 = -9.0

Sequence 3: actions = [1.0, 0.9, 0.8]

s₁ = 4, s₂ = 4.9, s₃ = 5.7

R = -4 + -3 + -2.1 = -9.1

Sequence 4: actions = [-0.5, 0.3, 0.1]

s₁ = 2.5, s₂ = 2.8, s₃ = 2.9

R = -4 + -4.5 + -4.2 = -12.7

Winner: Sequence 2 (R = -9.0). Execute first action: a₀ = 1.0 (move right as fast as possible toward target 7).
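The hand calculation can be reproduced with a short script. It assumes the same conventions as the example: dynamics s_{k+1} = s_k + a_k, and a return that sums the reward at s₀, s₁ and s₂ (the states at which actions are taken).

```python
def rollout_return(s0, actions, target=7.0):
    """Sum r(s_k) = -|s_k - target| over k = 0..H-1, with s_{k+1} = s_k + a_k."""
    s, total = s0, 0.0
    for a in actions:
        total += -abs(s - target)
        s += a
    return total

sequences = {
    1: [0.8, 0.5, -0.2],
    2: [1.0, 1.0, 1.0],
    3: [1.0, 0.9, 0.8],
    4: [-0.5, 0.3, 0.1],
}
returns = {j: round(rollout_return(3.0, acts), 1) for j, acts in sequences.items()}
best = max(returns, key=returns.get)
print(returns, "best:", best)
```

Note that under this convention the last action of each sequence never influences the score, because the state it leads to is not rewarded.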

Why this works surprisingly well

With N = 1000 random sequences of length H = 10, you're evaluating 1000 possible futures in parallel — all in your learned model, zero real environment steps. Even random search finds decent plans when you evaluate enough candidates. And you replan every step, so early mistakes get corrected.

Limitations

High-dimensional action spaces (e.g., 20-DOF robot) make random sampling hopeless. With H=10 steps and 20-dimensional continuous actions, you need astronomically many samples to cover the space. We need smarter search — see CEM in Section 06.

Chapter 04

Version 1.0: Dyna

Random shooting has no policy — it replans from scratch every step. What if we used the model to generate extra training data for a policy?

Dyna (Sutton, 1991) is the foundational idea: mix real transitions with imagined transitions, train a policy on both.

Definition

Dyna Architecture

A framework combining direct RL (learning from real experience) with indirect RL (learning from model-generated synthetic experience). The model serves as a data amplifier: one real transition can spawn many imagined ones.

Dyna-Q (Full Algorithm)

1. Initialize: Q-function Q(s,a), model f_θ, replay buffer D = {}.
2. Collect real transition: Execute a_t ~ π(s_t), where π is derived from Q (e.g. ε-greedy), observe s_{t+1}, r_t. Add (s_t, a_t, r_t, s_{t+1}) to D.
3. Direct RL update: Update Q(s_t, a_t) using real transition (standard Q-learning).
4. Model learning: Train f_θ on all data in D to minimize MSE.
5. Model rollouts (K imagined steps): For k = 1 to K:
   a. Sample s from previously visited states in D.
   b. Sample a randomly (or from current policy).
   c. Simulate: s' = f_θ(s, a), r = reward(s, a).
   d. Indirect RL update: Update Q(s, a) using imagined (s, a, r, s').
6. Repeat from step 2.

The key insight: every real transition is used twice. Once directly for Q-learning. Once to improve the model, which then generates K additional training transitions. With K=10, each real environment step effectively gives you 11 learning updates.

Data Amplification Factor

Effective Data

effective transitions per real step = 1 + K

K = number of model rollouts per real step. Typically K = 5-50.

Worked Example: Dyna-Q Grid World

Setup: 5x5 grid. Agent starts at (0,0), goal at (4,4). Reward = +1 at goal, 0 elsewhere. 4 actions: up/down/left/right. K = 5 model rollouts per step.

Real step: Agent at (2,3), goes right, arrives at (3,3), r=0. Q(2,3,right) updated.

Model training: f_θ now knows (2,3)+right → (3,3).

5 imagined steps:

1. Sample (1,1) from buffer, action=up, model predicts (1,2), r=0. Update Q(1,1,up).

2. Sample (3,4) from buffer, action=right, model predicts (4,4), r=1! Update Q(3,4,right).

3. Sample (0,2), action=right, model predicts (1,2), r=0. Update Q(0,2,right).

4. Sample (4,3), action=up, model predicts (4,4), r=1! Update Q(4,3,up).

5. Sample (2,2), action=down, model predicts (2,1), r=0. Update Q(2,2,down).

Result: In one real step, the agent discovered two paths to the goal (via model) that it hasn't physically tried yet. Q-learning with K=0 would need those actual experiences.
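A compact tabular sketch of Dyna-Q on this grid, under a few assumptions that are not in the text: the model is a lookup table of observed (s, a) pairs rather than a neural network f_θ, imagined steps sample only pairs the table has seen, and the values of α, γ and ε are arbitrary.

```python
import random

ACTIONS = {"up": (0, 1), "down": (0, -1), "left": (-1, 0), "right": (1, 0)}
GOAL, SIZE = (4, 4), 5

def env_step(state, action):
    """True grid dynamics: move one cell, staying inside the 5x5 grid."""
    dx, dy = ACTIONS[action]
    nxt = (min(max(state[0] + dx, 0), SIZE - 1), min(max(state[1] + dy, 0), SIZE - 1))
    return nxt, (1.0 if nxt == GOAL else 0.0)

def q_update(Q, s, a, r, s2, alpha=0.5, gamma=0.95):
    """One Q-learning backup; the goal is terminal, so it has no future value."""
    future = 0.0 if s2 == GOAL else max(Q.get((s2, b), 0.0) for b in ACTIONS)
    Q[(s, a)] = Q.get((s, a), 0.0) + alpha * (r + gamma * future - Q.get((s, a), 0.0))

def dyna_q(episodes, k_imagined, rng, epsilon=0.1, max_steps=200):
    Q, model = {}, {}                       # model: (s, a) -> (r, s'), a lookup table
    for _ in range(episodes):
        s = (0, 0)
        for _ in range(max_steps):
            if rng.random() < epsilon:
                a = rng.choice(list(ACTIONS))
            else:                           # greedy with random tie-breaking
                a = max(ACTIONS, key=lambda b: (Q.get((s, b), 0.0), rng.random()))
            s2, r = env_step(s, a)          # real transition
            q_update(Q, s, a, r, s2)        # direct RL
            model[(s, a)] = (r, s2)         # model learning
            for _ in range(k_imagined):     # indirect RL on imagined transitions
                (ps, pa), (pr, ps2) = rng.choice(list(model.items()))
                q_update(Q, ps, pa, pr, ps2)
            s = s2
            if s == GOAL:
                break
    return Q

Q = dyna_q(episodes=30, k_imagined=5, rng=random.Random(0))
value_at_start = max(Q.get(((0, 0), a), 0.0) for a in ACTIONS)
print(round(value_at_start, 3))
```

Setting `k_imagined=0` turns this into plain Q-learning, which makes it easy to compare how quickly value reaches the start state with and without imagined updates.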

Why start rollouts from visited states?

If you roll out from an arbitrary state, the model might be inaccurate there (it's never seen that region). Starting from states in the buffer ensures the model is making predictions from regions where it has data — at least for the first step.

Chapter 05

Model Error Propagation

Here's the fundamental problem with MBRL. Your model is never perfect. And when you roll out multiple steps, errors compound exponentially.

The Compounding Error Theorem

Setup

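Let the true dynamics be p(s'|s,a) and the learned model be f_θ(s,a). For this analysis the true dynamics are treated as deterministic, so p(s'|s,a) stands for the true next state. Suppose the model has per-step error bounded by ε: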

|| f_θ(s, a) − p(s'|s,a) || ≤ ε ∀ s, a

And suppose the true dynamics has Lipschitz constant L ≥ 1 (small state changes lead to proportionally bounded next-state changes):

|| p(s'|s₁,a) − p(s'|s₂,a) || ≤ L · || s₁ − s₂ ||

Derivation — H-step error bound

Step 1: After 1 model step, error ≤ ε (by assumption).

Step 2: After 2 steps. The model starts from its own predicted state (with up to ε error), so:

error₂ ≤ L · error₁ + ε = Lε + ε = ε(L + 1)

Step H: By induction:

H-step Error Bound

error_H ≤ ε · (L^H − 1) / (L − 1)

For L > 1, this grows exponentially: ~ ε · L^H / (L − 1). For L = 1 the sum reduces to H · ε, which is linear growth.

Inductive step: error_{k+1} ≤ L · error_k + ε. Solving the recurrence: error_H ≤ ε · Σ_{k=0}^{H-1} L^k = ε(L^H − 1)/(L − 1).
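The inductive step rests on a single triangle inequality, spelled out here. The notation is an assumption made for readability: f^* denotes the true (deterministic) next-state map, s_k the true state, \hat{s}_k the model's state after k imagined steps, and e_k = ||\hat{s}_k − s_k||.

```latex
\begin{aligned}
e_{k+1} &= \lVert f_\theta(\hat{s}_k, a_k) - f^*(s_k, a_k) \rVert \\
        &\le \lVert f_\theta(\hat{s}_k, a_k) - f^*(\hat{s}_k, a_k) \rVert
           + \lVert f^*(\hat{s}_k, a_k) - f^*(s_k, a_k) \rVert \\
        &\le \varepsilon + L\, e_k, \qquad e_0 = 0 \\[4pt]
\Rightarrow\quad e_H &\le \varepsilon \sum_{k=0}^{H-1} L^k
        = \varepsilon\,\frac{L^H - 1}{L - 1} \quad (L \neq 1),
        \qquad e_H \le H\,\varepsilon \quad (L = 1).
\end{aligned}
```

The first term is the model's one-step error at the imagined state; the second is how much the true dynamics stretch the error already accumulated.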

What the Numbers Mean

Numerical Example

Setup: Per-step error ε = 0.01 (1% relative error). Lipschitz constant L = 1.5 (mildly chaotic system).

• H = 1: error ≤ 0.01 (fine)

• H = 5: error ≤ 0.01 × (1.5⁵ - 1)/0.5 = 0.01 × 6.59/0.5 = 0.13 (13% error)

• H = 10: error ≤ 0.01 × (1.5¹⁰ - 1)/0.5 = 0.01 × 56.67/0.5 = 1.13 (113% error!)

• H = 20: error ≤ 0.01 × (1.5²⁰ - 1)/0.5 = 66.5 (completely useless)

A model with 1% error per step is worthless after 20 steps if L = 1.5.
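These numbers are easy to recompute. The sketch below evaluates the closed-form bound and, as a cross-check, unrolls the recurrence error_{k+1} = L · error_k + ε directly; the function names are illustrative.

```python
def error_bound(eps, lipschitz, horizon):
    """Closed-form H-step bound eps * (L^H - 1) / (L - 1), or H * eps when L == 1."""
    if lipschitz == 1.0:
        return horizon * eps
    return eps * (lipschitz ** horizon - 1.0) / (lipschitz - 1.0)

def error_bound_recursive(eps, lipschitz, horizon):
    """Same bound obtained by unrolling e_{k+1} = L * e_k + eps from e_0 = 0."""
    e = 0.0
    for _ in range(horizon):
        e = lipschitz * e + eps
    return e

bounds = {h: error_bound(0.01, 1.5, h) for h in (1, 5, 10, 20)}
for h, b in bounds.items():
    print(f"H={h:2d}: error <= {b:.3f}")
```

With ε = 0.01 and L = 1.5 the bound is about 0.13 at H = 5, 1.13 at H = 10, and 66.5 at H = 20.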

The Takeaway

Long model rollouts are dangerous. Even an excellent model (99% accurate per step) produces garbage after enough steps. This is why every serious MBRL algorithm limits rollout horizon. The rest of this lesson is about managing this compounding.

Physical Intuition

Weather forecasting is exactly this problem. The atmosphere is a chaotic dynamical system (L >> 1). Models are quite good at 1-day predictions (~ε small), but 14-day forecasts are barely better than climatological averages. Same math, same exponential.

Derivation: Tight Error Bound with Reward Lipschitz

Extend the error bound to cumulative reward. Suppose the reward function is L_r-Lipschitz: |r(s₁,a) - r(s₂,a)| ≤ L_r ||s₁ - s₂||. Show that the total reward error over H steps satisfies:

|R_true - R_model| ≤ L_r · ε · H · L^H / (L - 1)

At step k, the state error is at most ε(L^k-1)/(L-1). The reward error at step k is at most L_r × (state error at step k). Sum over all H steps.

Σ_{k=0}^{H-1} (L^k − 1)/(L − 1) = [Σ L^k − H]/(L − 1) = [(L^H − 1)/(L − 1) − H]/(L − 1). For a loose bound, use L^k − 1 ≤ L^k and Σ L^k ≤ H · L^H.

Step 1: At model step k, state error: e_k ≤ ε(L^k − 1)/(L − 1).

Step 2: Reward error at step k: |r(s_k^true, a) − r(s_k^model, a)| ≤ L_r · e_k.

Step 3: Total reward error: |R_true − R_model| ≤ Σ_{k=0}^{H-1} L_r · e_k ≤ L_r · ε · Σ_{k=0}^{H-1} (L^k − 1)/(L − 1).

Step 4: Bounding: since L^k − 1 ≤ L^k and each of the H terms satisfies L^k ≤ L^{H-1}, we get Σ_{k=0}^{H-1} (L^k − 1) ≤ H · L^{H-1} ≤ H · L^H. So total ≤ L_r · ε · H · L^H / (L − 1). ■

Key insight: The reward error grows as H × exponential, not just exponential. Longer horizons are doubly penalized: more steps AND larger per-step errors at later steps.
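To see how loose the closed form is, the sketch below compares it with the direct sum of per-step bounds. The value L_r = 1 is an assumption chosen only for illustration; ε and L reuse the numbers from the earlier numerical example.

```python
def state_error_bound(eps, L, k):
    """Per-step state error bound eps * (L^k - 1) / (L - 1), for L != 1."""
    return eps * (L ** k - 1.0) / (L - 1.0)

def reward_error_summed(eps, L, L_r, H):
    """L_r times the sum of per-step state error bounds for k = 0..H-1."""
    return L_r * sum(state_error_bound(eps, L, k) for k in range(H))

def reward_error_loose(eps, L, L_r, H):
    """Loose closed form from the derivation: L_r * eps * H * L^H / (L - 1)."""
    return L_r * eps * H * L ** H / (L - 1.0)

eps, L, L_r = 0.01, 1.5, 1.0   # L_r = 1 is an assumed value for illustration
for H in (5, 10, 20):
    summed = reward_error_summed(eps, L, L_r, H)
    loose = reward_error_loose(eps, L, L_r, H)
    print(f"H={H:2d}: summed bound {summed:.3f} <= loose bound {loose:.3f}")
```

The loose form overshoots the summed bound by a growing factor, but both are dominated by the same L^H term, so the qualitative conclusion about long horizons is unchanged.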

Chapter 06

Version 1.5: Short Horizon + Replan

The error bound tells us: keep rollouts short. But short horizons mean myopic plans. The solution: Model Predictive Control (MPC) — plan short, replan often.

Definition

Model Predictive Control (MPC)

A planning strategy that optimizes actions over a short horizon H, executes only the first action, observes the real next state, then replans. The planning horizon is a sliding window that moves forward with each real step.

MPC limits error propagation to H steps. If H = 5, the worst-case error is bounded by our H=5 calculation (manageable), regardless of the total episode length.

Cross-Entropy Method (CEM)

Random shooting wastes samples. CEM iteratively narrows the search toward good action sequences:

CEM for Action Optimization

1. Initialize distribution: μ = 0, σ = 1 (per action dimension, per time step).
2. Sample N action sequences from 𝒩(μ, σ²).
3. Evaluate each sequence by rolling out through model, compute total reward.
4. Select elite: Keep top M sequences (M << N, typically top 10%).
5. Update μ, σ to fit the elite set (mean and std of top-M sequences).
6. Repeat steps 2-5 for I iterations (typically I = 3-5).
7. Execute first action of final μ sequence.
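The steps above can be sketched with the standard library alone. Several details are assumptions rather than part of the algorithm as stated: the toy dynamics s' = s + a, rewarding the predicted states, clipping samples to [-1, 1], and the small floor added to σ so it never collapses to exactly zero.

```python
import random
import statistics

def rollout_return(state, actions, target=7.0):
    """Model rollout under assumed toy dynamics s' = s + a; reward is -|s' - target|."""
    total = 0.0
    for a in actions:
        state = state + a
        total += -abs(state - target)
    return total

def cem_plan(state, horizon, n_samples, n_elite, n_iters, rng):
    """Cross-entropy method over action sequences with a per-step Gaussian."""
    mu = [0.0] * horizon
    sigma = [1.0] * horizon
    for _ in range(n_iters):
        samples = [
            [min(1.0, max(-1.0, rng.gauss(m, s))) for m, s in zip(mu, sigma)]
            for _ in range(n_samples)
        ]
        samples.sort(key=lambda seq: rollout_return(state, seq), reverse=True)
        elite = samples[:n_elite]
        # Refit mean and std per time step to the elite set.
        mu = [statistics.fmean([seq[t] for seq in elite]) for t in range(horizon)]
        sigma = [statistics.pstdev([seq[t] for seq in elite]) + 1e-6 for t in range(horizon)]
    return mu

plan = cem_plan(state=3.0, horizon=2, n_samples=60, n_elite=6, n_iters=4, rng=random.Random(0))
print([round(a, 2) for a in plan])
```

In an MPC loop only `plan[0]` would be executed before planning again from the next observed state.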

Hand Calculation: CEM

Worked Example: CEM (1D, H=2)

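Setup: Same 1D environment as before. State = 3, target = 7. H = 2, N = 6, M = 2 (top 2). For simplicity the Gaussian samples are not clipped to [-1, 1], and each return sums the reward at the two predicted states.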

Iteration 1: μ = [0, 0], σ = [1, 1]. Sample 6 sequences:

[0.8, 0.3] → states [3.8, 4.1], R = -3.2 + -2.9 = -6.1

[1.2, 0.9] → [4.2, 5.1], R = -2.8 + -1.9 = -4.7 (elite)

[-0.5, 0.2] → [2.5, 2.7], R = -4.5 + -4.3 = -8.8

[0.9, 1.1] → [3.9, 5.0], R = -3.1 + -2.0 = -5.1 (elite)

[-0.3, -0.8] → [2.7, 1.9], R = -4.3 + -5.1 = -9.4

[0.4, 0.6] → [3.4, 4.0], R = -3.6 + -3.0 = -6.6

Elite: [1.2, 0.9] and [0.9, 1.1]. New μ = [1.05, 1.0], σ = [0.15, 0.1].

Iteration 2: Sample around [1.05, 1.0] with tight std. Concentrates on the best region.

Result: CEM converges to "go right fast", the same answer as random shooting, but it spends later samples around the promising region instead of spreading them uniformly over the action space.
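The elite refit in iteration 1 can be checked directly. The sketch assumes the population standard deviation (dividing by M), which is what reproduces the σ values quoted in the example.

```python
import statistics

elite = [[1.2, 0.9], [0.9, 1.1]]            # the two elite action sequences
per_step = list(zip(*elite))                # group the values by time step
mu = [statistics.fmean(col) for col in per_step]
sigma = [statistics.pstdev(col) for col in per_step]
print([round(m, 2) for m in mu], [round(s, 2) for s in sigma])
```

Using the sample standard deviation (`statistics.stdev`, dividing by M-1) would give larger values with only two elites, so the choice affects how quickly the search narrows.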

CEM = Evolution Strategy for Planning

CEM is a special case of evolutionary optimization. The "population" is action sequences. "Fitness" is model-predicted reward. "Selection" keeps the elite. "Mutation" is sampling from the updated Gaussian. Simple, derivative-free, parallelizable.

The Full MPC Loop

MPC with CEM

For each real time step t:
1. Run CEM(s_t, H, N, M, I) → optimal action sequence a*_{t:t+H-1}
2. Execute a*_t in real environment
3. Observe s_{t+1}, store (s_t, a*_t, s_{t+1}) in D
4. (Optional) Retrain f_θ on D
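A sketch of the loop with a deliberately wrong model shows why replanning matters. Everything here is assumed for illustration: the "real" environment applies only 80% of the commanded velocity, the model believes 100% is applied, and a tiny grid-search planner stands in for the CEM call. Step 4 (retraining) is omitted.

```python
def true_step(state, action):
    """Assumed 'real' environment: only 80% of the commanded velocity is delivered."""
    return state + 0.8 * action

def model_step(state, action):
    """Imperfect learned model (assumed): believes the full command is applied."""
    return state + action

def plan(state, horizon, candidates=(-1.0, -0.5, 0.0, 0.5, 1.0), target=7.0):
    """Stand-in planner: hold one candidate action for `horizon` model steps, keep the best."""
    def score(a):
        s, total = state, 0.0
        for _ in range(horizon):
            s = model_step(s, a)
            total += -abs(s - target)
        return total
    return max(candidates, key=score)

state, buffer = 3.0, []
for t in range(10):
    action = plan(state, horizon=3)             # 1. plan in the model
    next_state = true_step(state, action)       # 2. execute only the first action
    buffer.append((state, action, next_state))  # 3. store the real transition
    state = next_state                          # next plan starts from the observed state
print(round(state, 2), len(buffer))
```

Even though the model is biased at every step, the state ends up close to the target because each new plan starts from an observed state rather than from the model's own prediction.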

Computational Cost

MPC replans at every step. Each replan runs CEM with I×N model rollouts of H steps each, i.e. I×N×H forward passes. For I=5, N=500, H=10, that's 25,000 model forward passes per real step. The model must be fast to evaluate. Deep ensemble models (7 networks) push this further.

Chapter 07

Ensemble Models & Uncertainty

A single model gives you a point prediction. It can't tell you "I'm not sure about this region." Ensembles fix this by training multiple models and measuring their disagreement.

Definition

Epistemic Uncertainty via Ensembles

Train K models {f_θ₁, ..., f_{θ_K}} with different random initializations and/or different data subsets (bootstrap). At test time, run all K models. Their disagreement (variance of predictions) estimates epistemic uncertainty — uncertainty due to lack of data.

Ensemble Prediction & Uncertainty

mean prediction: μ(s,a) = (1/K) Σ_{k=1}^{K} f_{θ_k}(s, a)
uncertainty: σ²(s,a) = (1/K) Σ_{k=1}^{K} || f_{θ_k}(s,a) − μ(s,a) ||²

Symbol by symbol:

• K — number of models in ensemble (typically 5-7).

• f_{θ_k} — the k-th model, trained independently.

• μ(s,a) — average prediction. Use this as the "best guess" next state.

• σ²(s,a) — variance across models. High = "we haven't seen this region" = don't trust the prediction.
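The two formulas translate directly into code. In this sketch the ensemble members are hypothetical closures whose predictions agree near s = 0 and drift apart far from it, mimicking models that only saw data near the origin.

```python
def ensemble_predict(models, state, action):
    """Return the ensemble mean prediction and the disagreement sigma^2(s, a)."""
    preds = [m(state, action) for m in models]
    k, dim = len(preds), len(preds[0])
    mean = [sum(p[d] for p in preds) / k for d in range(dim)]
    # Average squared L2 distance of each member's prediction from the mean.
    var = sum(sum((p[d] - mean[d]) ** 2 for d in range(dim)) for p in preds) / k
    return mean, var

def make_model(slope_error):
    """Hypothetical ensemble member: agrees with the others near s = 0, drifts apart far away."""
    return lambda state, action: [s + a + slope_error * s for s, a in zip(state, action)]

models = [make_model(e) for e in (-0.02, -0.01, 0.0, 0.01, 0.02)]
_, var_near = ensemble_predict(models, [0.1], [0.5])
_, var_far = ensemble_predict(models, [10.0], [0.5])
print(var_near < var_far)
```

The disagreement is tiny at s = 0.1 and four orders of magnitude larger at s = 10, which is the signal a planner can use to distrust predictions there.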

PETS: Probabilistic Ensembles with Trajectory Sampling

PETS Algorithm (Chua et al., 2018)

1. Train ensemble of K probabilistic models. Each f_{θ_k} outputs (μ_k, Σ_k): the mean AND variance of the predicted next state.
2. For planning (CEM): To evaluate an action sequence, propagate particles. For each particle p and each step h:
   a. Randomly assign particle to model k ~ Uniform(1..K).
   b. Sample s_{h+1} ~ 𝒩(μ_k(s_h,a_h), Σ_k(s_h,a_h)).
3. Score action sequence by average reward across particles.
4. MPC: Execute first action, replan each step.

Why random model assignment per particle per step?

This samples from the full distribution of possible dynamics, not just the mean. If models disagree about what happens after action a, some particles go one way, others go another. The planning algorithm sees high variance in returns → avoids that action. It's naturally pessimistic in uncertain regions.
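A minimal sketch of the particle propagation step, assuming a 1D state and ensemble members that return a (mean, std) pair. The member models, their biases and the reward are made up; only the structure (re-draw a model per particle per step, sample the next state, average returns) follows the description above.

```python
import random

def make_prob_model(bias, noise_std):
    """Hypothetical probabilistic member: returns (mean, std) of the next 1D state."""
    return lambda s, a: (s + a + bias, noise_std)

def evaluate_sequence(models, s0, actions, reward_fn, n_particles, rng):
    """Score an action sequence by its average return over particles."""
    total = 0.0
    for _ in range(n_particles):
        s, ret = s0, 0.0
        for a in actions:
            mean, std = rng.choice(models)(s, a)   # re-draw the model at every step
            s = rng.gauss(mean, std)               # sample the next state
            ret += reward_fn(s, a)
        total += ret
    return total / n_particles

rng = random.Random(0)
models = [make_prob_model(b, 0.05) for b in (-0.1, -0.05, 0.0, 0.05, 0.1)]
reward_fn = lambda s, a: -abs(s - 7.0)
toward = evaluate_sequence(models, 3.0, [1.0, 1.0, 1.0], reward_fn, 20, rng)
away = evaluate_sequence(models, 3.0, [-1.0, -1.0, -1.0], reward_fn, 20, rng)
print(round(toward, 2), round(away, 2))
```

A CEM planner would call `evaluate_sequence` in place of a single deterministic rollout when ranking candidate action sequences.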

Pessimism Principle

When models disagree, the agent should be conservative. This emerges automatically from PETS (high-disagreement actions have high-variance returns, which CEM avoids), but can be made explicit:

Pessimistic Reward Penalty

r_eff(s, a) = r(s, a) − λ · σ(s, a)

λ controls exploration-exploitation. Higher λ = more conservative.
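The effect of λ is easiest to see on two candidates. The rewards and disagreement values below are invented for illustration: one action looks better to the model but lies in a region where the ensemble disagrees strongly.

```python
def pessimistic_reward(reward, sigma, lam):
    """r_eff = r - lambda * sigma, with sigma the ensemble disagreement (a std)."""
    return reward - lam * sigma

# Two hypothetical candidate actions: (model-predicted reward, ensemble std).
candidates = {"well_explored": (1.0, 0.1), "barely_seen": (1.5, 2.0)}

def best_action(lam):
    """Pick the candidate with the highest penalised reward."""
    return max(candidates, key=lambda name: pessimistic_reward(*candidates[name], lam))

print(best_action(lam=0.0), best_action(lam=1.0))
```

With λ = 0 the planner chases the higher predicted reward; with λ = 1 the penalty flips the choice toward the action the ensemble agrees on.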

PETS on HalfCheetah

Setup: 5 models, each 4-layer MLP (200 hidden), outputs Gaussian (μ, σ²) per state dim. Trained on 10,000 real transitions. CEM with N=500, M=50, I=5, H=30.

Result: Achieves near-SAC performance with 10x fewer environment steps. The ensemble's uncertainty estimate prevents the MPC from exploiting model errors.

Chapter 08

Version 2.0: Model-Based Policy Optimization

MPC has a fatal limitation: no policy. It replans from scratch every step, which is computationally expensive at deployment. What if we used the model to train a reusable policy?

MBPO (Janner et al., 2019) is a widely used answer. Use short model rollouts to generate data, then train a model-free algorithm (SAC) on that data.

Definition

Model-Based Policy Optimization (MBPO)

An algorithm that: (1) learns an ensemble dynamics model from real data, (2) generates short synthetic rollouts (1-5 steps) starting from real states, and (3) trains a model-free policy (SAC) on the combined real + synthetic dataset. Achieves model-based sample efficiency with model-free asymptotic performance.

MBPO (Full Algorithm)

1. Initialize: policy π_φ, ensemble model {f_{θ_k}}, real buffer D_real, model buffer D_model.
2. Collect real data: Execute π_φ in environment for E steps. Add to D_real.
3. Train model: Update ensemble on D_real (MSE loss).
4. Generate model data: For M rollouts:
   a. Sample start state s₀ uniformly from D_real.
   b. Roll out using π_φ and random model f_{θ_k} for h steps (h = 1 to 5).
   c. Add all (s, a, r, s') to D_model.
5. Train policy: Run G gradient steps of SAC on D_real ∪ D_model.
6. Repeat from step 2.
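The model-data generation step (branching short rollouts from real states) might look like the sketch below. The 1D models, policy and reward are toy stand-ins assumed for illustration, and the SAC update itself is not shown; the final list only indicates what that update would sample from.

```python
import random

def generate_model_data(d_real, models, policy, reward_fn, n_rollouts, h, rng):
    """Branch short rollouts of length h from real states; return imagined transitions."""
    d_model = []
    for _ in range(n_rollouts):
        s = rng.choice(d_real)[0]                 # start state from a real transition
        for _ in range(h):
            a = policy(s)
            s_next = rng.choice(models)(s, a)     # pick a random ensemble member
            d_model.append((s, a, reward_fn(s, a), s_next))
            s = s_next
    return d_model

# Toy 1D stand-ins (all assumed for illustration).
rng = random.Random(0)
d_real = [(float(x), 1.0, -abs(x - 7.0), x + 1.0) for x in range(6)]   # (s, a, r, s')
models = [lambda s, a, b=b: s + a + b for b in (-0.05, 0.0, 0.05)]
policy = lambda s: 1.0 if s < 7.0 else -1.0
reward_fn = lambda s, a: -abs(s - 7.0)

d_model = generate_model_data(d_real, models, policy, reward_fn, n_rollouts=100, h=3, rng=rng)
sac_training_set = d_real + d_model               # what the SAC update would sample from
print(len(d_model), len(sac_training_set))
```

With 100 rollouts of length 3, six real transitions turn into 300 imagined ones, and every imagined state is at most three model steps away from a state that was actually visited.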

Why Short Rollouts (h = 1 to 5)?

The error propagation bound from Section 05 tells us: with L=1.5 and ε=0.01, a 5-step rollout has ~13% state error. That's borderline acceptable for training a policy. A 20-step rollout? Roughly 6600% error, which is pure noise. MBPO explicitly bounds rollout length.

MBPO Rollout Length Selection

h* = argmax_h { model data benefit − model error cost }

Typically h* ∈ {1, 2, 3, 4, 5} depending on model quality

Monotonic Improvement Guarantee

MBPO comes with a theoretical guarantee: under certain conditions, the policy improves monotonically (never gets worse). The key bound:

MBPO Lower Bound on Policy Improvement

η[π'] ≥ η̂_model[π'] − C · max_s 𝔼_{a~π'}[ D_TV( p(s'|s,a) || f_θ(s'|s,a) ) ]

η = true return, η̂ = model return, C = constant depending on H and γ

In English: the true policy performance is at least as good as the model-predicted performance minus a term proportional to the maximum model error. Keep model error small → model-predicted improvement transfers to reality.

MBPO = Best of Both Worlds

Model-based sample efficiency (learns model from few real transitions). Model-free asymptotic performance (SAC eventually converges to optimal given enough data). The model bootstraps early learning; SAC takes over as real data accumulates.

Comparison: MPC vs. MBPO

Property	MPC (PETS)	MBPO
Has a policy?	No (replans each step)	Yes (SAC)
Compute at deploy	High (CEM per step)	Low (one forward pass)
Rollout length	10-30 steps	1-5 steps
Handles model error	Replanning corrects	Short rollouts limit it
Sample efficiency	Best (10K steps)	Great (30K steps)
Asymptotic perf.	Limited by model	Matches model-free

Checkpoint — Before you move on

MBPO uses 1-5 step rollouts while MPC uses 10-30 steps. Both use the same model. Why can MPC afford longer rollouts without catastrophic failure, while MBPO cannot?

Model Answer

MPC replans every step. If a 30-step plan has garbage predictions at step 20, it doesn't matter — you only execute step 1, observe reality, and start fresh. The real observation "resets" the error accumulation every single step.

MBPO bakes errors into training data. If a 20-step rollout drifts far from reality, those fake transitions become permanent training data for SAC. The policy learns from this corrupted data, potentially converging to a bad solution. There's no "reality check" mid-rollout. Short rollouts (1-5) keep each synthetic transition close enough to truth that SAC can learn correctly from it.

Chapter 09

When Models Fail

Three distinct failure modes plague MBRL. Understanding them is essential for debugging real systems.

Failure Mode 1: Objective Mismatch

Definition

Objective Mismatch

The model is trained to minimize prediction error uniformly across the state space (MSE loss). But the policy only visits a tiny fraction of states. The model might be excellent at predicting states the policy never visits, while being poor at the specific states the policy cares about.

Think of it this way: you train a weather model to predict temperature across the entire Earth equally well. But your user only cares about whether it'll rain in San Francisco tomorrow. Global MSE might be low, but SF predictions could be terrible.

Concrete Example

Robot arm trained to reach objects. Model learned from random exploration data — mostly middle-of-workspace positions. The optimal policy reaches to far corners (where objects actually are). Model accuracy in corners is terrible because corners were rarely visited during random data collection. Policy exploits corner predictions, gets weird results in reality.

Failure Mode 2: Model Exploitation

Definition

Model Exploitation

The policy optimization process (SAC, CEM) is an adversarial optimizer against the model. It will find and exploit any systematic errors. If the model overpredicts reward for some (s,a) pair, the policy will steer toward that pair — even though real reward is low.

Model Exploitation π* = argmax_π J_model(π) ≠ argmax_π J_real(π)

The model-optimal policy is NOT the real-optimal policy when the model has systematic errors

This is the same phenomenon as overfitting in supervised learning, but more dangerous: the policy actively searches for model weaknesses.

Failure Mode 3: Compounding Errors in Multi-Step

We derived this in Section 05. But here's the insidious part: it's not just that predictions get noisy. The errors are correlated. If the model consistently underpredicts friction, it'll predict the robot slides further than reality at every step. After 10 steps, the model thinks the robot is in a completely different place — systematically wrong, not randomly wrong.

The Triad of Death

These three failure modes compound each other: (1) Model is trained on wrong distribution (mismatch). (2) Policy exploits model's weaknesses in that distribution (exploitation). (3) Multi-step rollouts amplify the exploitation exponentially (compounding). The policy learns to "hack" the model, finding imaginary reward that doesn't exist in reality.

Defenses

Defense	Addresses	Mechanism
Short rollouts	Compounding	Limits error accumulation to 1-5 steps
Ensemble disagreement	Exploitation	Penalize regions where models disagree
Online model updates	Mismatch	Retrain model as policy visits new states
DAgger-style collection	Mismatch	Collect data using current policy, not random
Reward penalty	Exploitation	r_eff = r - λσ (pessimistic)

Chapter 10

Case Study: Real Robot Learning

Let's put it all together. Real robot manipulation with MBRL: pushing objects to target locations using a 7-DOF arm.

The Setup

• State (17D): 7 joint positions + 7 joint velocities + 3D object position.

• Action (7D): Desired joint velocity for each motor.

• Reward: -||object_pos - target_pos|| (negative distance to target).

• Horizon: 100 steps per episode (5 seconds at 20Hz control).

The Recipe

Real Robot MBRL Pipeline

Initial data (2 min): Random actions for 25 episodes (2500 transitions). Motor commands in [-1,1], clipped.
Train ensemble: 5 models, each 4-layer MLP (500 hidden), predicting Δs. Train until validation loss plateaus (~2 min on GPU).
MPC with CEM: H=25 steps, N=500 samples, M=50 elites, I=5 iterations. ~150ms per planning step, which exceeds the 50ms control period at 20Hz, so keeping up with 20Hz control relies on parallelization.
Execute + collect: Run MPC policy for 5 episodes. Add to buffer.
Retrain model: Full retrain on expanded buffer.
Repeat 4-5: 4 iterations total. ~20 minutes wall clock time.

Results Comparison

Method	Real Robot Time	Success Rate	Real Transitions
MBRL (ensemble + MPC)	20 min	85%	5,000
SAC (model-free)	4 hours	90%	50,000
PPO (model-free)	20+ hours	80%	500,000+
Random shooting (no CEM)	20 min	60%	5,000

10x less data than SAC. 100x less than PPO. Slightly lower success rate, but achievable in a single lab session rather than days of continuous robot operation.

Why 20 Minutes Works

The robot's dynamics are relatively smooth (pushing keeps the object in sustained contact, without the sudden discontinuities of making and breaking contact). The 17D state is low-dimensional enough that 5,000 transitions cover the relevant region well. The ensemble catches regions of uncertainty (near table edges, unusual object positions) and avoids them.

When This Recipe Fails

Contact-rich tasks (grasping, insertion) have discontinuous dynamics that neural networks struggle with. High-dimensional visual observations (images) need different architectures (latent space models). Multi-object manipulation has combinatorial complexity that 5,000 transitions can't cover.

Chapter 11

Summary & Cheat Sheet

Version Comparison

Version	Algorithm	Planning	Policy?	Key Idea
0.5	Random Shooting	Sample & pick best	No	Brute force in model
1.0	Dyna	Model rollouts → buffer	Yes (Q)	Data augmentation
1.5	MPC + CEM	Iterative optimization	No	Short horizon + replan
1.5+	PETS	CEM + particles	No	Ensemble uncertainty
2.0	MBPO	Short rollouts → SAC	Yes (SAC)	Model-free with model data

Decision Tree: When to Use Model-Based RL

Use Model-Based When...

• Data is expensive (real robots, expensive simulations, clinical trials).

• Dynamics are relatively smooth (continuous state, no hard contacts).

• State dimension is moderate (<100D from sensors; for images, use latent models).

• You need fast adaptation (new task, change in dynamics).

Stick with Model-Free When...

• Simulation is free (video games, MuJoCo if you don't care about wall time).

• Dynamics are chaotic (high Lipschitz constant, contacts everywhere).

• You need maximum asymptotic performance and can afford the data.

• Observation space is very high-dimensional (raw pixels without compression).

Key Equations Reference

Dynamics Learning: L(θ) = (1/N) Σ_i || f_θ(s_i, a_i) − s_i' ||²

Error Propagation: error_H ≤ ε · (L^H − 1) / (L − 1)

Ensemble Uncertainty: σ²(s,a) = (1/K) Σ_k || f_{θ_k}(s,a) − μ(s,a) ||²

Pessimistic Reward: r_eff = r − λσ

Chapter 12

Connections & What's Next

Where MBRL Lives in the RL Landscape

This Lesson	Connection	Related Topic
Dynamics model	Same idea: predict the future from present	World Models
Ensemble uncertainty	Bayesian approach to model uncertainty	Bayesian Neural Networks
MPC replanning	Kalman filter = model-based state estimation	Kalman Filter
Model for data augmentation	Offline RL uses model to fill gaps	Offline RL (MOPO, MOReL)
Policy gradients on model data	MBPO uses SAC = policy gradient + Q	Policy Gradients
Error compounding	Same issue in imitation learning (DAgger)	Behavioral Cloning

The World Models Frontier

MBRL's natural evolution: instead of predicting raw state s_t+1, learn a latent dynamics model that predicts in a compressed representation. This handles high-dimensional observations (images) and enables longer-horizon planning by operating in a smoother space. Key systems: Dreamer (Hafner et al.), IRIS, and foundation world models.

Open Problems

• Contact-rich manipulation: Discontinuous dynamics break smooth model assumptions.

• Multi-task transfer: Can one world model serve many tasks?

• Online adaptation: How fast can models adapt when dynamics change (e.g., robot picks up heavy object)?

• Scaling to vision: Latent models work, but sample efficiency gains diminish at scale.

The Big Picture

Model-based RL is about building a mental model of the world and using it to think ahead. Every intelligent system — humans, animals, chess engines — does some form of this. The question is never "should we model?" but "how wrong can our model be before planning in it hurts more than it helps?" The entire MBRL literature is an answer to that question.

"The map is not the territory — but a good map saves you from walking off cliffs." — Adapted from Korzybski
