Janner, Fu, Zhang, Levine — NeurIPS 2019

When to Trust Your Model

Model-Based Policy Optimization — short model rollouts branched from real data achieve the sample efficiency of model-based methods and the asymptotic performance of model-free methods.

Prerequisites: MDPs & Bellman equations + SAC or any actor-critic

Chapter 0: The Problem

Reinforcement learning has two families. Model-free methods (SAC, PPO) learn by trial and error in the real environment. They converge to great policies, but they need millions of environment steps to get there. Model-based methods learn a dynamics model s_{t+1} = f(s_t, a_t) and plan through it. They learn much faster because they can generate unlimited synthetic experience, but they have a fatal flaw.

The flaw is compounding model error. A learned model is never perfect. When you roll it forward for one step, the prediction is slightly off. Two steps? The error grows. Ten steps? The model hallucinates physics that doesn't exist. A hundred steps? The agent has learned to exploit glitches in the model rather than solve the real task.
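
A rough way to see the compounding is sketched below. It assumes, as a simplification that is not taken from the paper, that every model step inflates the accumulated relative error by the same factor (1 + per-step error). The function name and the 5% rate are illustrative.

```python
def compounded_error(per_step_error: float, steps: int) -> float:
    """Relative error after `steps` model steps, assuming each step
    multiplies the accumulated deviation by (1 + per_step_error)."""
    return (1.0 + per_step_error) ** steps - 1.0


# Illustrative 5% per-step error at the rollout lengths mentioned above.
for steps in (1, 2, 10, 100):
    print(steps, compounded_error(0.05, steps))
```

Under this toy assumption a 5% per-step error grows to roughly 0.63 after 10 steps and to more than 100 times the original scale after 100 steps.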

By 2019, this compounding error problem had forced model-based methods into a corner. They either worked only on short-horizon tasks (truncated to 200 steps instead of the standard 1000), or they collapsed entirely on complex environments like Ant. Meanwhile, model-free SAC was crushing these benchmarks, just very slowly.

The dilemma: Model-based RL gives you unlimited cheap data but that data is biased. Model-free RL gives you unbiased data but it's expensive. How do you get the best of both? MBPO's answer: use the model, but only for very short rollouts, and always start those rollouts from real states.
Compounding Model Error

Interactive figure: model prediction error versus rollout length. Each step multiplies the previous error. The per-step error rate is adjustable (default 5%).
Why do model-based RL methods struggle with long-horizon tasks?

Chapter 1: The Key Insight

Previous model-based methods rolled out the model from the initial state distribution for the full task horizon. If your task is 1000 steps, you'd generate 1000-step model trajectories. This entangles the model horizon with the task horizon, and since model error compounds exponentially, it breaks on long tasks.

MBPO's insight is radical in its simplicity: decouple the model horizon from the task horizon.

Instead of rolling out the model from scratch for 1000 steps, you:

  1. Collect real environment transitions into a replay buffer
  2. Sample a real state from that buffer
  3. Branch a short model rollout (1 to k steps) starting from that real state
  4. Add the model-generated transitions to a separate model buffer
  5. Train your policy (SAC) on data from both buffers

That's it. Short rollouts from real states. The model never has to predict more than a few steps into the future, so compounding error stays small. But you generate many such short rollouts, so you still get the data amplification that makes model-based methods fast.
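
The five steps above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: `toy_model`, `toy_policy`, and the scalar state are hypothetical stand-ins for a learned dynamics model, a SAC policy, and a real environment state.

```python
import random


def toy_model(state, action):
    """Hypothetical learned model: returns (next_state, reward)."""
    next_state = state + action + random.gauss(0.0, 0.01)
    return next_state, -abs(next_state)


def toy_policy(state):
    """Hypothetical policy: push the state toward zero."""
    return -0.1 * state


def branched_rollouts(env_buffer, model, policy, num_rollouts, k):
    """Run num_rollouts k-step model rollouts, each starting from a real state."""
    model_buffer = []
    for _ in range(num_rollouts):
        state = random.choice(env_buffer)[0]      # step 2: sample a real state
        for _ in range(k):                        # step 3: short model rollout
            action = policy(state)
            next_state, reward = model(state, action)
            model_buffer.append((state, action, reward, next_state))  # step 4
            state = next_state
    return model_buffer


# Step 1 stand-in: pretend these are real transitions (s, a, r, s').
env_buffer = [(s, 0.0, -abs(s), s) for s in (-1.0, -0.5, 0.5, 1.0)]
model_buffer = branched_rollouts(env_buffer, toy_model, toy_policy,
                                 num_rollouts=400, k=1)
print(len(env_buffer), "real ->", len(model_buffer), "model transitions")
```

Step 5 (training the policy on both buffers) is left out here. The point of the sketch is only that every rollout begins at a state that was actually visited.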

Why starting from real states matters: If you start from an initial state and roll out 1000 steps with the model, by step 500 you're in a part of state space that has nothing to do with reality. But if you start from a real state and roll out 1 step, your prediction is grounded. The model only needs to be locally accurate, not globally accurate. This is the Dyna idea (Sutton, 1990) rediscovered and justified with modern theory.
What is MBPO's key insight for avoiding compounding model error?

Chapter 2: Model-Based RL Background

We operate in a Markov Decision Process (S, A, p, r, γ, ρ_0). The dynamics p(s'|s, a) are unknown. The goal is to find π* = argmax_π η[π], where:

$$\eta[\pi] = \mathbb{E}_{\pi}\left[\sum_{t=0}^{\infty} \gamma^{t}\, r(s_t, a_t)\right]$$
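
For a single finite trajectory, the sum inside the expectation can be computed as below. This is a small sketch; the function name is illustrative, and truncating at the end of the reward list is an assumption made for the example.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**t * rewards[t], accumulated from the last step backwards."""
    total = 0.0
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


print(discounted_return([1.0, 1.0, 1.0], gamma=0.5))
print(discounted_return([1.0] * 1000, gamma=0.99))
```

With a constant reward of 1 and γ = 0.99, a 1000-step trajectory comes out just under 1/(1−γ) = 100, which is why 1/(1−γ) is often read as the effective horizon.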

Model-based RL learns a model pθ(s'|s, a) from data, then uses it to improve the policy.

The Dyna architecture

The intellectual ancestor of MBPO is Sutton's Dyna (1990). The Dyna loop is:

Act: take an action in the real environment, observe (s, a, r, s')
Learn model: update pθ(s'|s,a) with the new transition
Dream: sample a state from replay, generate synthetic transitions with the model
Plan: update the policy/Q-function using both real and synthetic transitions

Dyna uses 1-step model rollouts. MBPO generalizes this to k-step rollouts and provides theoretical justification for choosing k.
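
A tabular version of the Dyna loop helps make the four stages concrete. The sketch below is a Dyna-Q style illustration on a hypothetical 5-state chain. The environment, the hyperparameters, and the deterministic lookup-table model are all assumptions made for the example, not part of MBPO.

```python
import random

N_STATES, GOAL = 5, 4
ACTIONS = (0, 1)  # 0 = left, 1 = right


def env_step(s, a):
    """Toy chain environment (hypothetical): reward 1 on reaching GOAL."""
    s2 = max(0, s - 1) if a == 0 else min(GOAL, s + 1)
    return (1.0 if s2 == GOAL else 0.0), s2, s2 == GOAL


def q_update(Q, s, a, r, s2, done, alpha=0.5, gamma=0.9):
    """One Q-learning backup on a single transition."""
    target = r if done else r + gamma * max(Q[(s2, b)] for b in ACTIONS)
    Q[(s, a)] += alpha * (target - Q[(s, a)])


def dyna_q(real_steps=300, dream_steps=10, epsilon=0.2, seed=0):
    rng = random.Random(seed)
    Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
    model = {}  # learned deterministic model: (s, a) -> (r, s', done)
    s = 0
    for _ in range(real_steps):
        # Act: epsilon-greedy in the real environment, ties broken at random.
        if rng.random() < epsilon:
            a = rng.choice(ACTIONS)
        else:
            a = max(ACTIONS, key=lambda b: (Q[(s, b)], rng.random()))
        r, s2, done = env_step(s, a)
        # Learn model: remember what this (s, a) pair led to.
        model[(s, a)] = (r, s2, done)
        # Plan on the real transition.
        q_update(Q, s, a, r, s2, done)
        # Dream + plan: 1-step model transitions from previously seen pairs.
        for _ in range(dream_steps):
            (ms, ma), (mr, ms2, mdone) = rng.choice(list(model.items()))
            q_update(Q, ms, ma, mr, ms2, mdone)
        s = 0 if done else s2
    return Q


Q = dyna_q()
print([max(ACTIONS, key=lambda b: Q[(s, b)]) for s in range(GOAL)])
```

Each real step is followed by ten model-generated backups, which is the same data amplification idea that MBPO scales up with neural network models.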

The model learning problem

We train pθ via maximum likelihood on data D collected from the real environment:

$$\theta \leftarrow \arg\max_{\theta}\; \mathbb{E}_{\mathcal{D}}\left[\log p_{\theta}(s', r \mid s, a)\right]$$

The model is a neural network that takes (s, a) and outputs a distribution over (s', r). In MBPO, this is a Gaussian with learned mean and diagonal covariance: pθ(s', r | s, a) = N(μθ(s, a), Σθ(s, a)).
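
Maximizing the likelihood of a diagonal Gaussian is the same as minimizing the negative log-likelihood sketched below. The sketch assumes the network predicts a mean and a log-variance per output dimension (a common parameterization, assumed here rather than stated in the text) and that the target vector concatenates s' and r. The function name is illustrative.

```python
import math


def gaussian_nll(target, mean, log_var):
    """Negative log-likelihood of `target` under a diagonal Gaussian.

    All three arguments are equal-length sequences of floats, one entry
    per output dimension (next-state components followed by the reward).
    """
    nll = 0.0
    for y, mu, lv in zip(target, mean, log_var):
        nll += 0.5 * (math.log(2.0 * math.pi) + lv + (y - mu) ** 2 / math.exp(lv))
    return nll


# A prediction closer to the target gets a lower loss at equal variance.
print(gaussian_nll([1.0, 0.5], mean=[1.0, 0.5], log_var=[0.0, 0.0]))
print(gaussian_nll([1.0, 0.5], mean=[0.0, 0.0], log_var=[0.0, 0.0]))
```

The log-variance term penalizes a model that hedges with a large variance everywhere, while the squared-error term penalizes overconfident wrong means.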

Key distinction from model-free: Model-free methods (SAC, PPO) learn a policy directly from environment interactions. They need many real samples but never build an explicit dynamics model. Model-based methods build a model and can generate unlimited synthetic data, but that data is only as good as the model. MBPO bridges the two: it uses a model-based data augmentation procedure on top of a model-free policy optimizer (SAC).
What is the core idea of the Dyna architecture?

Chapter 3: Monotonic Improvement

Can we guarantee that using the model actually helps? The paper derives a bound of the form:

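$$\eta[\pi] \ge \hat{\eta}[\pi] - C(\epsilon_m, \epsilon_\pi)$$

Here η[π] is the true return, η̂[π] is the return under the model, and C is a penalty that depends on two error sources: the model error εm and the policy shift επ between the current policy and the policy that collected the data.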

The full-rollout bound (Theorem 4.1)

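$$\eta[\pi] \ge \hat{\eta}[\pi] - \left[\frac{2\gamma\, r_{\max}(\epsilon_m + 2\epsilon_\pi)}{(1-\gamma)^2} + \frac{4\, r_{\max}\,\epsilon_\pi}{1-\gamma}\right]$$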

This says: as long as you improve by more than C under the model, you're guaranteed to improve in the real environment too. But notice the (1−γ)² in the denominator. With γ = 0.99, (1−γ)² = 0.0001, so dividing by it inflates that term by a factor of 10,000. The penalty C is enormous, and the bound is too pessimistic to be useful.
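
To get a feel for the size of the penalty, the sketch below evaluates the full-rollout bound's penalty, 2γ·r_max·(εm + 2επ)/(1−γ)² + 4·r_max·επ/(1−γ), for illustrative numbers. The values εm = επ = 0.05 and r_max = 1 are assumptions chosen for the example, not measurements from the paper.

```python
def full_rollout_penalty(eps_m, eps_pi, gamma=0.99, r_max=1.0):
    """Penalty C of the full-rollout bound for given error levels."""
    horizon = 1.0 / (1.0 - gamma)
    return (2.0 * gamma * r_max * (eps_m + 2.0 * eps_pi) * horizon ** 2
            + 4.0 * r_max * eps_pi * horizon)


penalty = full_rollout_penalty(eps_m=0.05, eps_pi=0.05)
max_return = 1.0 / (1.0 - 0.99)  # r_max / (1 - gamma) with r_max = 1
print(penalty, max_return)
```

With these illustrative values the penalty is about 2990, while the largest possible discounted return is only 100, so the bound cannot certify any improvement.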

The branched rollout bound (Theorem 4.2)

Now instead of full rollouts, branch k-step rollouts from real states:

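$$\eta[\pi] \ge \eta^{\text{branch}}[\pi] - 2\, r_{\max}\left[\frac{\gamma^{k+1}\epsilon_\pi}{(1-\gamma)^2} + \frac{\gamma^{k}\epsilon_\pi}{1-\gamma} + \frac{k\,(\epsilon_m + 2\epsilon_\pi)}{1-\gamma}\right]$$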

The critical change: the model error εm now enters through a term that scales linearly with k (the model rollout length) instead of quadratically with the effective horizon 1/(1−γ). Short rollouts = small penalty.

The theoretical tension: When you plug in pessimistic worst-case values for εm, the bound says k* = 0 — don't use the model at all! But this is overly conservative. In practice, models generalize much better than worst-case bounds suggest. The paper's key contribution is showing that empirically-measured model error justifies k > 0.
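
The k* = 0 conclusion can be checked numerically. The sketch below evaluates the branched penalty, using the three terms as written in this section, for k = 0 to 25. The error values εm = επ = 0.05 and r_max = 1 are illustrative assumptions.

```python
def branched_penalty(k, eps_m, eps_pi, gamma=0.99, r_max=1.0):
    """Penalty of the branched-rollout bound for rollout length k."""
    h = 1.0 / (1.0 - gamma)
    return 2.0 * r_max * (gamma ** (k + 1) * eps_pi * h ** 2
                          + gamma ** k * eps_pi * h
                          + k * (eps_m + 2.0 * eps_pi) * h)


penalties = [branched_penalty(k, eps_m=0.05, eps_pi=0.05) for k in range(26)]
best_k = min(range(26), key=lambda k: penalties[k])
print(best_k, penalties[0], penalties[1])
```

Under these assumptions the penalty is about 1000 at k = 0 and about 1020 at k = 1, and it keeps growing with k: the savings in the policy-shift terms never outweigh the added k(εm + 2επ)/(1−γ) term, so the worst-case bound prefers not using the model at all.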
Monotonic Improvement Bound

Interactive figure: the bound penalty C as a function of rollout length k, with adjustable model error εm (default 0.05) and discount γ (default 0.99). The penalty grows with k, so short rollouts keep C small.
Why does the branched rollout bound improve over the full-rollout bound?

Chapter 4: The MBPO Algorithm

MBPO is built from three components: (1) an ensemble of probabilistic dynamics models, (2) SAC as the policy optimizer, and (3) the branched rollout strategy. Here is the full algorithm:

Initialize: policy πφ, model ensemble pθ, empty Denv, empty Dmodel
1. Train model: fit pθ to Denv via maximum likelihood
2. Act in env: take an action with πφ, add (s, a, r, s') to Denv
3. Branch rollouts: sample M states from Denv, run k-step rollouts with the model, add the results to Dmodel
4. Update policy: G gradient steps of SAC on data from Dmodel (and Denv)
Repeat: steps 2-4 for E environment steps, then retrain the model (step 1)

The crucial detail: even when k is small (even k=1), you perform M model rollouts at each environment step. This amplifies each real sample by a factor of M, enabling 20-40 gradient steps per environment step instead of the 1-4 that's typical in model-free methods. That's where the sample efficiency comes from.

Two replay buffers: Denv stores real transitions (kept forever). Dmodel stores model-generated transitions (flushed periodically as the model improves). SAC trains mostly on data from Dmodel, optionally mixed with real data from Denv. This separation lets you control exactly how much model data versus real data feeds into policy learning.
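
One way to control the real-versus-model mix is to fix the fraction of each training batch drawn from Denv. The sketch below is an assumption about how such a sampler might look. The `real_ratio` parameter, its value of 0.05, and the batch size are hypothetical and are not taken from this section.

```python
import random


def sample_mixed_batch(env_buffer, model_buffer, batch_size, real_ratio, rng=random):
    """Draw a batch with a fixed fraction of real transitions (with replacement)."""
    n_real = min(int(round(batch_size * real_ratio)), batch_size)
    batch = [rng.choice(env_buffer) for _ in range(n_real)]
    batch += [rng.choice(model_buffer) for _ in range(batch_size - n_real)]
    rng.shuffle(batch)
    return batch


# Tagged placeholder transitions so the mix is easy to inspect.
env_buffer = [("real", i) for i in range(100)]
model_buffer = [("model", i) for i in range(4000)]
batch = sample_mixed_batch(env_buffer, model_buffer, batch_size=256, real_ratio=0.05)
print(sum(1 for tag, _ in batch if tag == "real"), "real of", len(batch))
```

Keeping the two buffers separate also makes it cheap to discard stale model data after the model is retrained, without touching the real transitions.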
MBPO Data Flow

Interactive figure: the MBPO loop. Real environment steps (blue) generate states, short model rollouts (teal) branch from those states to fill the model buffer, and SAC (orange) trains on the augmented data.

Hyperparameters

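MBPO has a few key hyperparameters beyond those of SAC: the rollout length k, the number of model rollouts per environment step M, the number of policy gradient updates per environment step G, the number of environment steps between model retrainings E, and the ensemble size.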

How does MBPO achieve high sample efficiency despite using only short model rollouts?

Chapter 5: Rollout Length Analysis

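How do you choose k, the rollout length? This is the central design decision in MBPO. Two forces are in tension: longer rollouts generate more synthetic experience from each real state, but every additional model step compounds the model's prediction error.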

The surprising finding: k=1 is hard to beat

The paper's ablation study reveals something surprising: on the Hopper task, fixing k=1 for the entire training run captures most of MBPO's benefit. A single model step from a real state is accurate enough to be useful, and you can compensate for the short horizon by doing many such 1-step rollouts.

Scheduled rollout lengths

In practice, MBPO schedules k to increase during training. Early on, the model is trained on little data and is inaccurate, so k should be small. As training progresses and the model sees more data, it becomes more accurate and can support longer rollouts.

For Hopper: k increases linearly from 1 to 15 over training. For Ant: k=1 throughout (the dynamics are harder to model). For HalfCheetah: k=1 (already works well).
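
A linear schedule of this kind can be written as a small function of the training epoch. In the sketch below, the epoch boundaries (20 and 100) are hypothetical values chosen for illustration. Only the 1 to 15 range for Hopper comes from the text above.

```python
def rollout_length(epoch, start_epoch, end_epoch, k_min, k_max):
    """Linearly interpolate k between k_min and k_max over [start_epoch, end_epoch]."""
    if epoch <= start_epoch:
        return k_min
    if epoch >= end_epoch:
        return k_max
    frac = (epoch - start_epoch) / (end_epoch - start_epoch)
    return int(k_min + frac * (k_max - k_min))


# Hopper-style schedule (1 -> 15) and a constant k=1 schedule.
for epoch in (0, 20, 60, 100, 150):
    print(epoch, rollout_length(epoch, 20, 100, 1, 15), rollout_length(epoch, 20, 100, 1, 1))
```

A constant schedule is just the special case k_min = k_max, so the same function covers tasks that stay at k=1 throughout.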

What happens with long rollouts?

The paper shows that 200-step model rollouts still produce reasonable-looking trajectories, but they perform worse for policy optimization than short rollouts. Why? Because even if individual trajectories look plausible, the distribution of states they visit is subtly wrong. The policy can learn to exploit these distributional inaccuracies. At 500 steps, rollouts are too inaccurate for effective learning.

Entangling model horizon and task horizon: Prior model-based methods (PETS, SLBO, MB-MPO) rolled out the model from the initial state for the full task horizon. This meant they could only work on short tasks (200-step versions of benchmarks). MBPO decouples these: it uses the model for k=1 to 25 steps regardless of whether the task is 200 or 1000 steps long. This is why MBPO scales to the standard 1000-step MuJoCo benchmarks where others fail.
Why does MBPO schedule the rollout length k to increase during training?

Chapter 6: Model Ensemble

MBPO doesn't use a single dynamics model. It uses a bootstrap ensemble of B probabilistic neural networks. Together, the networks capture two types of uncertainty:

Aleatoric uncertainty (noise in the data)

Each model outputs a Gaussian distribution: p^i_θ(s', r | s, a) = N(μ^i_θ(s, a), Σ^i_θ(s, a)). The learned variance Σ captures inherent stochasticity in the dynamics. Even a perfect model would have this uncertainty.

Epistemic uncertainty (uncertainty about the model itself)

The bootstrap ensemble captures model uncertainty. Each network is trained on a different bootstrap sample of the data. In regions with lots of data, all networks agree (low epistemic uncertainty). In regions with little data, they disagree (high epistemic uncertainty).
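
"Disagreement" can be made concrete as the spread of the ensemble members' mean predictions for the same input. The sketch below uses made-up scalar predictions from five models. It only illustrates the concept and is not a quantity the section says MBPO computes during training.

```python
import statistics


def ensemble_disagreement(predicted_means):
    """Population standard deviation of one scalar prediction across models."""
    return statistics.pstdev(predicted_means)


# Hypothetical next-state predictions from 5 models for two query states.
well_covered = [1.00, 1.01, 0.99, 1.00, 1.00]   # region with lots of data
rarely_seen = [0.2, 1.5, 0.9, 2.1, -0.3]        # region with little data
print(ensemble_disagreement(well_covered), ensemble_disagreement(rarely_seen))
```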

Making predictions

To generate a single model transition, MBPO samples a model uniformly at random from the ensemble. Crucially, different steps of a single rollout can use different models. This prevents the policy from exploiting the specific biases of any single model.

Practical details

MBPO trains 7 models but uses only the best 5 (by validation loss). Each model is a 4-layer MLP with 200 hidden units. The ensemble adds modest compute: training 7 small MLPs is much cheaper than the policy optimization steps.
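
Elite selection and per-step model sampling can be sketched as follows. The validation losses and the toy models (simple functions with slightly different biases) are hypothetical. Only the 7-train, 5-keep scheme and the uniform per-step sampling come from the text above.

```python
import random


def select_elites(validation_losses, num_elites=5):
    """Indices of the num_elites models with the lowest validation loss."""
    ranked = sorted(range(len(validation_losses)), key=lambda i: validation_losses[i])
    return ranked[:num_elites]


def ensemble_rollout(state, policy, models, elite_ids, k, rng=random):
    """k-step rollout that draws a fresh elite model at every step."""
    transitions = []
    for _ in range(k):
        model = models[rng.choice(elite_ids)]
        action = policy(state)
        next_state, reward = model(state, action)
        transitions.append((state, action, reward, next_state))
        state = next_state
    return transitions


# Seven toy models that differ only by a small bias (illustrative).
models = [lambda s, a, b=b: (s + a + 0.01 * b, 0.0) for b in range(7)]
val_losses = [0.30, 0.10, 0.25, 0.50, 0.12, 0.40, 0.15]
elites = select_elites(val_losses)
rollout = ensemble_rollout(0.0, lambda s: 0.1, models, elites, k=3)
print(elites, len(rollout))
```

Because the model index is redrawn inside the loop, consecutive steps of one rollout can come from different ensemble members.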

Why ensembles prevent exploitation: If you use a single model, the policy can find states where that specific model is wrong and exploit them. With an ensemble, exploiting one model doesn't help because the next step might use a different model. The policy is forced to find strategies that work under all models — which means strategies that work in reality.
Ensemble Predictions

Interactive figure: 5 models (colored lines) predict the next state from the same current state. Where data is abundant they agree. Where data is scarce, they diverge.

Why does MBPO use an ensemble of models instead of a single model?

Chapter 7: Results

MBPO is evaluated on the standard 1000-step MuJoCo continuous control benchmarks: Hopper, Walker2d, HalfCheetah, Ant, and Humanoid. The baselines include model-free SAC and the model-based methods PETS, SLBO, and STEVE.

The headline results

MBPO achieves the sample efficiency of model-based methods with the asymptotic performance of model-free methods. On the Ant task, MBPO at 300K steps matches SAC at 3M steps — a 10x improvement in sample efficiency. On Hopper and Walker2d, it reaches near-optimal performance in the equivalent of 14 and 40 minutes of real-time simulation.

MBPO vs Baselines: Sample Efficiency

Steps to reach a target return on the Ant task. MBPO reaches SAC's 3M-step performance in just 300K steps.

Key findings

No model: Running SAC with MBPO's high gradient-to-data ratio (G=20-40) but without model data marginally speeds up learning, but cannot match MBPO's sample efficiency. The model data genuinely helps.

k=1 baseline: Even single-step model rollouts provide a surprisingly strong baseline, outperforming prior model-based methods that use long rollouts from initial states.

No model exploitation: On Hopper, the policy's returns under the model closely match its returns in the real environment. The short rollouts prevent the policy from exploiting model inaccuracies. In fact, model returns tend to underestimate real returns.

The scaling result: PETS and SLBO fail entirely on Ant and Humanoid because they roll out the model for the full task horizon. MBPO works on all benchmarks because it only needs the model to be accurate for a few steps. This is the practical payoff of decoupling model horizon from task horizon.
On the Ant task, how does MBPO's sample efficiency compare to SAC?

Chapter 8: When to Trust the Model

The paper's title asks "When to Trust Your Model?" — and the answer is nuanced. The theoretical bounds say: never, unless you can empirically estimate model generalization.

Pessimistic bounds vs empirical reality

The worst-case bound (Theorem 4.2) says the optimal rollout length k* = 0. Don't use the model. This is because the bound assumes the worst case for model generalization: that model error on new policy distributions equals model error on the training distribution plus a maximum distribution shift penalty.

But empirically, models generalize much better than worst case. The paper measures how model error grows as the policy drifts from the data-collecting policy (Figure 1 in the paper). Two key findings:

  1. Models trained with more data have lower error on their training distribution (expected)
  2. Models trained with more data also generalize better — the rate of error growth with policy shift decreases (surprising)

The practical model error estimate

Instead of using worst-case εm, MBPO uses an empirical estimate ε'm that accounts for generalization:

$$\epsilon'_m(\epsilon_\pi) \approx \epsilon_m + \epsilon_\pi \frac{d\epsilon'_m}{d\epsilon_\pi}$$

Where dε'm/dεπ is the empirically measured rate of model error growth with policy shift. Plugging this into Theorem 4.3 gives k* > 0 for sufficiently accurate models — the model is worth using.
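
The linear estimate can be obtained by fitting a line to measured (policy shift, model error) pairs. The sketch below uses ordinary least squares on made-up measurements. The data points are hypothetical and the helper names are illustrative.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit y = intercept + slope * x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return mean_y - slope * mean_x, slope


# Hypothetical measurements: model error observed at increasing policy shift.
shifts = [0.0, 0.1, 0.2, 0.3]
errors = [0.02, 0.03, 0.04, 0.05]
eps_m, growth_rate = fit_line(shifts, errors)


def estimated_model_error(eps_pi):
    """First-order estimate: error on the data plus shift times growth rate."""
    return eps_m + eps_pi * growth_rate


print(eps_m, growth_rate, estimated_model_error(0.5))
```

The intercept plays the role of εm (error on the training distribution) and the slope plays the role of the measured growth rate, so a model that generalizes better shows up as a flatter line.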

The title, answered: Trust your model when (1) you have enough data that it generalizes well, and (2) you only ask it to predict a few steps ahead. Don't trust it for long-horizon predictions. Don't trust it early in training when data is scarce. MBPO operationalizes this by scheduling k to increase as the model improves.
What empirical finding allows MBPO to justify using the model despite pessimistic theoretical bounds?

Chapter 9: Connections

What MBPO built on

Dyna (Sutton, 1990): The original model-based + model-free hybrid. MBPO is a deep RL instantiation of Dyna's core idea — 1-step model rollouts from replay buffer states — with theoretical justification and modern components (ensembles, SAC).

PETS (Chua et al., 2018): Introduced the probabilistic ensemble for model-based RL. MBPO borrows the ensemble architecture but uses it for data generation rather than trajectory optimization.

SAC (Haarnoja et al., 2018): The policy optimizer inside MBPO. SAC's off-policy nature is essential — it can learn from the mix of real and model-generated data in the replay buffers.

STEVE (Buckman et al., 2018): Used short model rollouts for value expansion rather than data augmentation. MBPO shows data augmentation outperforms value expansion.

What MBPO inspired

Dreamer (Hafner et al., 2020): Learns a world model in latent space and imagines entire trajectories. While Dreamer uses longer rollouts in latent space, MBPO's analysis of compounding error informed the design.

TD-MPC / TD-MPC2 (Hansen et al., 2022, 2024): Combines model learning with temporal difference learning and model-predictive control. Uses learned latent dynamics for short-horizon planning.

DreamerV3 (Hafner et al., 2023): Scales world-model RL to diverse domains. The idea of managing model error through careful rollout strategies traces back to MBPO's analysis.

MBPO's legacy: MBPO demonstrated that you don't need a perfect model or clever planning, just a decent model, short rollouts, and a good model-free algorithm. This "model as data augmenter" philosophy has been highly influential in model-based deep RL, and the paper's theoretical framework for analyzing model usage is widely cited in later model-based RL work.

Cheat sheet

Core idea
Short model rollouts (k=1 to 25) branched from real replay buffer states
Components
Ensemble of 7 probabilistic NNs + SAC + two replay buffers (real + model)
Key numbers
k=1-25, M=400 rollouts/step, G=20-40 gradient updates/step, 10x sample efficiency
Key result
Matches SAC asymptotic performance at 10x fewer samples, scales to 1000-step tasks
Legacy
Proved "model as data augmenter" works; inspired Dreamer, TD-MPC, modern MBRL
What is MBPO's relationship to the Dyna architecture?