MBPO — Veanors

Chapter 0: The Problem

Reinforcement learning has two families. Model-free methods (SAC, PPO) learn by trial and error in the real environment. They converge to great policies, but they need millions of environment steps to get there. Model-based methods learn a dynamics model s_{t+1} = f(s_t, a_t) and plan through it. They learn much faster because they can generate unlimited synthetic experience, but they have a fatal flaw.

The flaw is compounding model error. A learned model is never perfect. When you roll it forward for one step, the prediction is slightly off. Two steps? The error grows. Ten steps? The model hallucinates physics that doesn't exist. A hundred steps? The agent has learned to exploit glitches in the model rather than solve the real task.

By 2019, this compounding error problem had forced model-based methods into a corner. They either worked only on short-horizon tasks (truncated to 200 steps instead of the standard 1000), or they collapsed entirely on complex environments like Ant. Meanwhile, model-free SAC was crushing these benchmarks, just very slowly.

The dilemma: Model-based RL gives you unlimited cheap data but that data is biased. Model-free RL gives you unbiased data but it's expensive. How do you get the best of both? MBPO's answer: use the model, but only for very short rollouts, and always start those rollouts from real states.

Compounding Model Error

Figure (interactive in the original): model prediction error grows with rollout length, with each step multiplying the previous error. The default per-step error rate is 5%.
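To put rough numbers on this, here is a minimal sketch under a deliberately simple toy assumption: every model step inflates the accumulated error factor by the same amount (1 + e). Real learned models do not follow this law exactly, and the function name `compounded_error` is made up for this illustration.

```python
def compounded_error(per_step_error: float, steps: int) -> float:
    """Accumulated relative error after `steps` model steps.

    Toy assumption: each step multiplies the accumulated error factor by
    (1 + per_step_error), so after n steps the factor is (1 + e) ** n.
    """
    return (1.0 + per_step_error) ** steps - 1.0


# Rollout lengths mentioned in the text, with a 5% per-step error.
for n in (1, 2, 10, 100):
    print(f"{n:>3} steps -> accumulated error {compounded_error(0.05, n):.1%}")
```

Under this toy model, ten steps at 5% per step already accumulate roughly 63% error, and a hundred steps produce an error more than a hundred times larger than the signal itself.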

Why do model-based RL methods struggle with long-horizon tasks?

Small per-step model errors compound over long rollouts, causing the model to hallucinate unrealistic dynamics that the policy exploits
The model takes too long to train
Long horizons require more parameters

Chapter 1: The Key Insight

Previous model-based methods rolled out the model from the initial state distribution for the full task horizon. If your task is 1000 steps, you'd generate 1000-step model trajectories. This entangles the model horizon with the task horizon, and since model error compounds exponentially, it breaks on long tasks.

MBPO's insight is radical in its simplicity: decouple the model horizon from the task horizon.

Instead of rolling out the model from scratch for 1000 steps, you:

Collect real environment transitions into a replay buffer
Sample a real state from that buffer
Branch a short model rollout (1 to k steps) starting from that real state
Add the model-generated transitions to a separate model buffer
Train your policy (SAC) on data from both buffers

That's it. Short rollouts from real states. The model never has to predict more than a few steps into the future, so compounding error stays small. But you generate many such short rollouts, so you still get the data amplification that makes model-based methods fast.
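The branching steps above can be sketched in a few lines. This is only an illustration, not the paper's implementation: the 1-D linear "model", the linear "policy", and every function name here are hypothetical stand-ins, and the learned model is treated as a plain deterministic function.

```python
import random
from typing import Callable, List, Tuple

Transition = Tuple[float, float, float, float]  # (s, a, r, s')


def branched_rollouts(
    env_buffer: List[Transition],
    model_step: Callable[[float, float], Tuple[float, float]],
    policy: Callable[[float], float],
    k: int,
    num_rollouts: int,
    rng: random.Random,
) -> List[Transition]:
    """Generate k-step model rollouts, each branched from a real state."""
    model_buffer: List[Transition] = []
    for _ in range(num_rollouts):
        # Branch point: a state that was actually visited in the environment.
        state = rng.choice(env_buffer)[0]
        for _ in range(k):
            action = policy(state)
            next_state, reward = model_step(state, action)
            model_buffer.append((state, action, reward, next_state))
            state = next_state
    return model_buffer


# Toy stand-ins (assumptions): linear "learned" dynamics and a linear policy.
def toy_model_step(state: float, action: float) -> Tuple[float, float]:
    return 0.9 * state + action, -(state ** 2)


def toy_policy(state: float) -> float:
    return -0.5 * state


rng = random.Random(0)
real_states = [0.5, -1.0, 2.0]
env_buffer = [(s, 0.0, -(s ** 2), 0.9 * s) for s in real_states]
model_buffer = branched_rollouts(
    env_buffer, toy_model_step, toy_policy, k=3, num_rollouts=4, rng=rng
)
print(len(model_buffer))  # 4 rollouts x 3 steps = 12 synthetic transitions
```

Note that only the first state of each rollout comes from the real buffer; every later state is the model's own prediction, which is why k is kept small.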

Why starting from real states matters: If you start from an initial state and roll out 1000 steps with the model, by step 500 you're in a part of state space that has nothing to do with reality. But if you start from a real state and roll out 1 step, your prediction is grounded. The model only needs to be locally accurate, not globally accurate. This is the Dyna idea (Sutton, 1990) rediscovered and justified with modern theory.

What is MBPO's key insight for avoiding compounding model error?

Use short model rollouts (1 to k steps) branched from real states in the replay buffer, instead of long rollouts from the initial state distribution
Use a more accurate model
Use model-free methods instead

Chapter 2: Model-Based RL Background

We operate in a Markov Decision Process (S, A, p, r, γ, ρ₀). The dynamics p(s'|s, a) are unknown. The goal is to find π* = argmax_π η[π], where:

η[π] = E_π[ ∑_{t=0}^{∞} γ^t r(s_t, a_t) ]
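For a single finite trajectory, the sum inside the expectation can be computed directly. The sketch below is a minimal illustration (the helper name `discounted_return` is an assumption); η[π] itself would be the average of this quantity over many trajectories sampled from π.

```python
from typing import Sequence


def discounted_return(rewards: Sequence[float], gamma: float) -> float:
    """Sum of gamma**t * r_t for one finite reward sequence."""
    total = 0.0
    # Work backwards so each step is a single multiply-add.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


print(discounted_return([1.0, 1.0, 1.0], gamma=0.5))  # 1 + 0.5 + 0.25 = 1.75
```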

Model-based RL learns a model p_θ(s'|s, a) from data, then uses it to improve the policy.

The Dyna architecture

The intellectual ancestor of MBPO is Sutton's Dyna (1990). The Dyna loop is:

Act

Take action in real environment, observe (s, a, r, s')

↓

Learn model

Update p_θ(s'|s,a) with new transition

↓

Dream

Sample a state from replay, generate synthetic transitions with model

↓

Plan

Update policy/Q-function using both real and synthetic transitions

Dyna uses 1-step model rollouts. MBPO generalizes this to k-step rollouts and provides theoretical justification for choosing k.
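To make the Act / Learn model / Dream / Plan loop concrete, here is a minimal tabular Dyna-Q sketch on a made-up 5-state chain. Everything about the environment (the chain, the reward of 1 at the goal, the hyperparameter values) is an assumption chosen for illustration; the "model" is simply a table of observed transitions, which is exact here because the toy environment is deterministic.

```python
import random
from typing import Dict, List, Tuple

N_STATES, GOAL = 5, 4  # chain 0..4; entering state 4 ends the episode
LEFT, RIGHT = 0, 1


def env_step(state: int, action: int) -> Tuple[int, float, bool]:
    """Deterministic toy chain: reward 1 only on entering the goal state."""
    next_state = min(state + 1, GOAL) if action == RIGHT else max(state - 1, 0)
    done = next_state == GOAL
    return next_state, (1.0 if done else 0.0), done


def dyna_q(episodes: int, planning_steps: int, rng: random.Random,
           alpha: float = 0.5, gamma: float = 0.9,
           epsilon: float = 0.1) -> List[List[float]]:
    q = [[0.0, 0.0] for _ in range(N_STATES)]
    model: Dict[Tuple[int, int], Tuple[int, float, bool]] = {}

    def update(s: int, a: int, r: float, s2: int, done: bool) -> None:
        target = r if done else r + gamma * max(q[s2])
        q[s][a] += alpha * (target - q[s][a])

    for _ in range(episodes):
        state, done = 0, False
        while not done:
            # Act: epsilon-greedy with random tie-breaking.
            if rng.random() < epsilon or q[state][LEFT] == q[state][RIGHT]:
                action = rng.choice((LEFT, RIGHT))
            else:
                action = LEFT if q[state][LEFT] > q[state][RIGHT] else RIGHT
            next_state, reward, done = env_step(state, action)
            # Learn: update Q from the real transition and store it in the model.
            update(state, action, reward, next_state, done)
            model[(state, action)] = (next_state, reward, done)
            # Dream + plan: replay 1-step transitions predicted by the model.
            for _ in range(planning_steps):
                (s, a), (s2, r, d) = rng.choice(list(model.items()))
                update(s, a, r, s2, d)
            state = next_state
    return q


q_values = dyna_q(episodes=20, planning_steps=10, rng=random.Random(0))
print([round(max(row), 3) for row in q_values])
```

Each real step triggers `planning_steps` extra updates from the model, which is the same data-amplification idea that MBPO scales up with neural network models and SAC.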

The model learning problem

We train p_θ via maximum likelihood on data D collected from the real environment:

θ ← argmax_θ E_D[log p_θ(s', r | s, a)]

The model is a neural network that takes (s, a) and outputs a distribution over (s', r). In MBPO, this is a Gaussian with learned mean and diagonal covariance: p_θ(s', r | s, a) = N(μ_θ(s, a), Σ_θ(s, a)).
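With a diagonal Gaussian, maximizing the log-likelihood is the same as minimizing the per-dimension negative log-likelihood below. This is a minimal sketch for one (s, a) pair: in practice the mean and variance come from the network, whereas here they are passed in directly. Parameterizing the variance by its logarithm is an assumption (a common choice for numerical stability), not something stated above.

```python
import math
from typing import Sequence


def diag_gaussian_nll(target: Sequence[float], mean: Sequence[float],
                      log_var: Sequence[float]) -> float:
    """Negative log-likelihood of `target` under N(mean, diag(exp(log_var)))."""
    nll = 0.0
    for y, mu, lv in zip(target, mean, log_var):
        nll += 0.5 * (math.log(2.0 * math.pi) + lv + (y - mu) ** 2 / math.exp(lv))
    return nll


target = [1.0, 0.0]  # stands in for the observed (s', r) vector
good = diag_gaussian_nll(target, mean=[1.0, 0.0], log_var=[0.0, 0.0])
bad = diag_gaussian_nll(target, mean=[0.0, 0.0], log_var=[0.0, 0.0])
print(good < bad)  # a mean that misses the target is penalized
```

The `log_var` term keeps the model from claiming arbitrarily large variance to hide its errors, while the squared-error term is scaled down wherever the model admits uncertainty.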

Key distinction from model-free: Model-free methods (SAC, PPO) learn a policy directly from environment interactions. They need many real samples but never build an explicit dynamics model. Model-based methods build a model and can generate unlimited synthetic data, but that data is only as good as the model. MBPO bridges the two: it uses a model-based data augmentation procedure on top of a model-free policy optimizer (SAC).

What is the core idea of the Dyna architecture?

Interleave real environment interaction with synthetic model-generated transitions to update the policy, getting the best of both data sources
Only use the model for planning
Only use real data

Chapter 3: Monotonic Improvement

Can we guarantee that using the model actually helps? The paper derives a bound of the form:

η[π] ≥ η̂[π] − C(ε_m, ε_π)

Where η[π] is the true return, η̂[π] is the model return, and C is a penalty that depends on two error sources:

ε_m — model generalization error (how wrong the model is on the data distribution)
ε_π — policy distribution shift (how far the current policy has moved from the data-collecting policy)

The full-rollout bound (Theorem 4.1)

η[π] ≥ η̂[π] − [2γr_max(ε_m + 2ε_π) / (1−γ)² + 4r_maxε_π / (1−γ)]

This says: as long as you improve by more than C under the model, you're guaranteed to improve in the real environment too. But notice the (1−γ)² in the denominator. With γ = 0.99, (1−γ)² = 1/10000, so the first term of the penalty is multiplied by 10,000. The penalty C is enormous, and the bound is too pessimistic to be useful.

The branched rollout bound (Theorem 4.2)

Now instead of full rollouts, branch k-step rollouts from real states:

η[π] ≥ η^branch[π] − 2r_max [ γ^{k+1} ε_π / (1−γ)² + γ^k ε_π / (1−γ) + k (ε_m + 2ε_π) / (1−γ) ]

The critical change: the model error ε_m is now multiplied by k/(1−γ), which grows linearly with the model rollout length k, instead of by γ/(1−γ)², which grows quadratically with the effective task horizon 1/(1−γ). Short rollouts = small penalty.

The theoretical tension: When you plug in pessimistic worst-case values for ε_m, the bound says k* = 0 — don't use the model at all! But this is overly conservative. In practice, models generalize much better than worst-case bounds suggest. The paper's key contribution is showing that empirically-measured model error justifies k > 0.
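The two penalties can be evaluated numerically. The sketch below transcribes the two bounds exactly as written above; the function names and the value ε_π = 0.05 are assumptions picked for illustration (ε_m = 0.05 and γ = 0.99 are just convenient example values, and r_max is set to 1).

```python
def full_rollout_penalty(r_max: float, gamma: float,
                         eps_m: float, eps_pi: float) -> float:
    """Penalty term of the full-rollout bound, as written above."""
    return (2.0 * gamma * r_max * (eps_m + 2.0 * eps_pi) / (1.0 - gamma) ** 2
            + 4.0 * r_max * eps_pi / (1.0 - gamma))


def branched_penalty(r_max: float, gamma: float, eps_m: float,
                     eps_pi: float, k: int) -> float:
    """Penalty term of the k-step branched rollout bound, as written above."""
    return 2.0 * r_max * (
        gamma ** (k + 1) * eps_pi / (1.0 - gamma) ** 2
        + gamma ** k * eps_pi / (1.0 - gamma)
        + k * (eps_m + 2.0 * eps_pi) / (1.0 - gamma)
    )


r_max, gamma, eps_m, eps_pi = 1.0, 0.99, 0.05, 0.05  # illustrative values
full = full_rollout_penalty(r_max, gamma, eps_m, eps_pi)
penalties = {k: branched_penalty(r_max, gamma, eps_m, eps_pi, k)
             for k in range(0, 26)}
best_k = min(penalties, key=penalties.get)
print(f"full-rollout penalty: {full:.0f}")
print(f"branched penalty at k=0: {penalties[0]:.0f}, at k=25: {penalties[25]:.0f}")
print(f"k minimizing the worst-case penalty: {best_k}")
```

With these values the full-rollout penalty is roughly 2990, the branched penalty starts near 1000 at k = 0 and rises with every extra step, so the minimizer is k = 0. That is the pessimistic conclusion described above, reproduced numerically.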

Monotonic Improvement Bound

Figure (interactive in the original): the bound penalty C as a function of rollout length k. C grows with k, so short rollouts keep it small, and the optimal k depends on the model error. Default settings: model error ε_m = 0.05, discount γ = 0.99.

Why does the branched rollout bound improve over the full-rollout bound?

The penalty scales linearly with rollout length k instead of quadratically with the full horizon, so short rollouts from real states have a much smaller penalty
It uses a better model
It collects more real data

Chapter 4: The MBPO Algorithm

MBPO is built from three components: (1) an ensemble of probabilistic dynamics models, (2) SAC as the policy optimizer, and (3) the branched rollout strategy. Here is the full algorithm:

Initialize

Policy π_φ, model ensemble p_θ, empty D_env, empty D_model

↓

1. Train model

Fit p_θ to D_env via maximum likelihood

↓

2. Act in env

Take action with π_φ, add (s, a, r, s') to D_env

↓

3. Branch rollouts

Sample M states from D_env, run k-step rollouts with model → add to D_model

↓

4. Update policy

G gradient steps of SAC on data from D_model (and D_env)

↓

Repeat

Steps 2-4 for E environment steps, then retrain model (step 1)

The crucial detail: even when k is as small as 1, you perform M model rollouts at each environment step. This amplifies each real sample by a factor of M, which makes 20-40 gradient steps per environment step feasible instead of the 1-4 typical of model-free methods. That's where the sample efficiency comes from.
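A quick way to see the amplification is to run the loop structure with the learning components replaced by counters. This is only a bookkeeping sketch: all names are hypothetical, no model or policy is actually trained, and M·k is an upper bound on synthetic transitions because real implementations may stop a rollout early when the model predicts termination.

```python
from dataclasses import dataclass


@dataclass
class LoopCounts:
    model_fits: int = 0
    env_steps: int = 0
    model_transitions: int = 0
    gradient_steps: int = 0


def mbpo_loop_counts(total_env_steps: int, retrain_every: int,
                     rollouts_per_step: int, rollout_length: int,
                     grad_steps_per_env_step: int) -> LoopCounts:
    """Walk through the MBPO loop, counting work instead of doing it."""
    counts = LoopCounts()
    for step in range(total_env_steps):
        if step % retrain_every == 0:
            counts.model_fits += 1  # 1. fit the model ensemble to D_env
        counts.env_steps += 1  # 2. act in the environment, store in D_env
        # 3. branch M rollouts of length k from real states into D_model
        counts.model_transitions += rollouts_per_step * rollout_length
        # 4. G SAC gradient steps
        counts.gradient_steps += grad_steps_per_env_step
    return counts


counts = mbpo_loop_counts(total_env_steps=1000, retrain_every=250,
                          rollouts_per_step=400, rollout_length=1,
                          grad_steps_per_env_step=20)
print(counts)
```

Even with k = 1, a thousand real steps produce up to 400,000 synthetic transitions and 20,000 gradient steps in this accounting.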

Two replay buffers: D_env stores real transitions (kept forever). D_model stores model-generated transitions (flushed periodically as the model improves). SAC trains on batches drawn from D_model, optionally mixed with real transitions from D_env. This separation lets you control exactly how much model data versus real data feeds into policy learning.
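One simple way to control the mix is to fix the fraction of each training batch that comes from real data. The sketch below assumes such a knob, called `real_ratio` here; the name, the value 0.25, and the tagged toy buffers are all illustrative assumptions rather than settings taken from the text.

```python
import random
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def sample_mixed_batch(env_buffer: Sequence[T], model_buffer: Sequence[T],
                       batch_size: int, real_ratio: float,
                       rng: random.Random) -> List[T]:
    """Draw a batch with a fixed share of real versus model-generated data."""
    n_real = int(round(batch_size * real_ratio))
    n_model = batch_size - n_real
    batch = [rng.choice(env_buffer) for _ in range(n_real)]
    batch += [rng.choice(model_buffer) for _ in range(n_model)]
    rng.shuffle(batch)  # avoid a fixed real-then-model ordering
    return batch


rng = random.Random(0)
env_buffer = [("real", i) for i in range(10)]       # stands in for D_env
model_buffer = [("model", i) for i in range(100)]   # stands in for D_model
batch = sample_mixed_batch(env_buffer, model_buffer,
                           batch_size=20, real_ratio=0.25, rng=rng)
print(sum(1 for tag, _ in batch if tag == "real"))  # 5 of the 20 samples are real
```

Setting `real_ratio` to 0 trains purely on model data, and setting it to 1 recovers ordinary model-free training on the real buffer.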

MBPO Data Flow

Figure (interactive in the original): the MBPO loop. Real environment steps (blue) generate states, short model rollouts (teal) branch from those states to fill the model buffer, and SAC (orange) trains on the augmented data.

Hyperparameters

MBPO has a few key hyperparameters beyond those of SAC:

k — rollout length (1 to 25, often scheduled to increase during training)
M — number of model rollouts per environment step (400)
G — gradient updates per environment step (20-40, enabled by model data)
Model retrain frequency — every 250 environment steps
Ensemble size — 7 models, pick best 5 (via validation loss)
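Collected in one place, these settings might look like the configuration object below. The field names are hypothetical (they do not come from any particular codebase), and the default of 20 gradient updates is simply the low end of the 20-40 range listed above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MBPOConfig:
    rollout_length_min: int = 1           # k at the start of training
    rollout_length_max: int = 25          # largest k reached by a schedule
    rollouts_per_env_step: int = 400      # M
    grad_updates_per_env_step: int = 20   # G (20-40)
    model_retrain_every: int = 250        # environment steps between model fits
    ensemble_size: int = 7                # models trained
    num_elites: int = 5                   # models kept, by validation loss

    def __post_init__(self) -> None:
        if not 1 <= self.rollout_length_min <= self.rollout_length_max:
            raise ValueError("need 1 <= rollout_length_min <= rollout_length_max")
        if not 1 <= self.num_elites <= self.ensemble_size:
            raise ValueError("need 1 <= num_elites <= ensemble_size")


config = MBPOConfig()
print(config)
```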

How does MBPO achieve high sample efficiency despite using only short model rollouts?

It performs many (M=400) short rollouts per environment step, amplifying each real sample and enabling 20-40 gradient updates instead of the typical 1-4
It uses a larger neural network
It trains for more epochs

Chapter 5: Rollout Length Analysis

How do you choose k, the rollout length? This is the central design decision in MBPO. Two forces are in tension:

Longer rollouts generate more diverse data and cover more of the state space → better policy learning
Shorter rollouts have less compounding error → more accurate data

The surprising finding: k=1 is hard to beat

The paper's ablation study reveals something surprising: on the Hopper task, fixing k=1 for the entire training run captures most of MBPO's benefit. A single model step from a real state is accurate enough to be useful, and you can compensate for the short horizon by doing many such 1-step rollouts.

Scheduled rollout lengths

In practice, MBPO schedules k to increase during training. Early on, the model is trained on little data and is inaccurate, so k should be small. As training progresses and the model sees more data, it becomes more accurate and can support longer rollouts.

For Hopper: k increases linearly from 1 to 15 over training. For Ant: k increases from 1 to 25 over training. For HalfCheetah: k=1 throughout (already works well).
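A linear schedule like the Hopper one (k from 1 to 15) can be sketched as a clamped interpolation over training epochs. The start and end epochs used below (20 and 100) and the choice to truncate to an integer are assumptions for illustration; the text only states the endpoints of k.

```python
def rollout_length(epoch: int, start_epoch: int, end_epoch: int,
                   k_min: int, k_max: int) -> int:
    """Linearly interpolate k between k_min and k_max, clamped at both ends."""
    if epoch <= start_epoch:
        return k_min
    if epoch >= end_epoch:
        return k_max
    fraction = (epoch - start_epoch) / (end_epoch - start_epoch)
    return int(k_min + fraction * (k_max - k_min))  # truncate to a whole step count


for epoch in (0, 20, 60, 100, 150):
    print(epoch, rollout_length(epoch, start_epoch=20, end_epoch=100,
                                k_min=1, k_max=15))
```

Setting `k_min` equal to `k_max` gives the constant k = 1 configuration as a special case of the same function.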

What happens with long rollouts?

The paper shows that 200-step model rollouts still produce reasonable-looking trajectories, but they perform worse for policy optimization than short rollouts. Why? Because even if individual trajectories look plausible, the distribution of states they visit is subtly wrong. The policy can learn to exploit these distributional inaccuracies. At 500 steps, rollouts are too inaccurate for effective learning.

Entangling model horizon and task horizon: Prior model-based methods (PETS, SLBO, MB-MPO) rolled out the model from the initial state for the full task horizon. This meant they could only work on short tasks (200-step versions of benchmarks). MBPO decouples these: it uses the model for k=1 to 25 steps regardless of whether the task is 200 or 1000 steps long. This is why MBPO scales to the standard 1000-step MuJoCo benchmarks where others fail.

Why does MBPO schedule the rollout length k to increase during training?

Early in training the model has little data and is inaccurate, so short rollouts are safer. As the model improves with more data, it can support longer rollouts.
Longer rollouts are always better
It's purely for computational efficiency

Chapter 6: Model Ensemble

MBPO doesn't use a single dynamics model. It uses a bootstrap ensemble of B probabilistic neural networks. Each network captures two types of uncertainty:

Aleatoric uncertainty (noise in the data)

Each model outputs a Gaussian distribution: pⁱ_θ(s', r | s, a) = N(μⁱ_θ(s,a), Σⁱ_θ(s,a)). The learned variance Σ captures inherent stochasticity in the dynamics. Even a perfect model would have this uncertainty.

Epistemic uncertainty (uncertainty about the model itself)

The bootstrap ensemble captures model uncertainty. Each network is trained on a different bootstrap sample of the data. In regions with lots of data, all networks agree (low epistemic uncertainty). In regions with little data, they disagree (high epistemic uncertainty).

Making predictions

To generate a single model transition, MBPO samples a model uniformly at random from the ensemble. Crucially, different steps of a single rollout can use different models. This prevents the policy from exploiting the specific biases of any single model.
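The per-step sampling can be sketched as follows. This is a simplified illustration: the ensemble members are hypothetical deterministic linear functions with slightly different gains, whereas the models described here output Gaussian distributions that would also be sampled from.

```python
import random
from typing import Callable, List, Sequence, Tuple

Model = Callable[[float, float], float]  # (state, action) -> next state


def ensemble_rollout(models: Sequence[Model], policy: Callable[[float], float],
                     start_state: float, k: int,
                     rng: random.Random) -> Tuple[List[float], List[int]]:
    """Roll out k steps, drawing an ensemble member uniformly at every step."""
    states, used = [start_state], []
    state = start_state
    for _ in range(k):
        index = rng.randrange(len(models))  # fresh uniform draw per step
        state = models[index](state, policy(state))
        states.append(state)
        used.append(index)
    return states, used


# Toy ensemble (assumption): same dynamics, slightly different learned gains.
models = [lambda s, a, g=g: g * s + a for g in (0.88, 0.90, 0.92, 0.89, 0.91)]
states, used = ensemble_rollout(models, policy=lambda s: -0.1 * s,
                                start_state=1.0, k=5, rng=random.Random(0))
print(used)    # which ensemble member produced each step
print(states)  # start state followed by 5 predicted states
```

Because the member index is redrawn inside the loop, no single model's quirks persist across a whole rollout.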

Practical details

MBPO trains 7 models but uses only the best 5 (by validation loss). Each model is a 4-layer MLP with 200 hidden units. The ensemble adds modest compute: training 7 small MLPs is much cheaper than the policy optimization steps.
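Picking the best 5 of 7 amounts to ranking models by held-out loss. A minimal sketch, with made-up validation losses and a hypothetical helper name:

```python
from typing import List, Sequence


def select_elites(validation_losses: Sequence[float], num_elites: int) -> List[int]:
    """Return indices of the `num_elites` models with the lowest validation loss."""
    ranked = sorted(range(len(validation_losses)),
                    key=lambda i: validation_losses[i])
    return ranked[:num_elites]


losses = [0.31, 0.12, 0.45, 0.09, 0.27, 0.52, 0.18]  # made-up values, 7 models
elites = select_elites(losses, num_elites=5)
print(elites)  # [3, 1, 6, 4, 0]
```

Rollouts would then sample only from the models whose indices are in `elites`, so the two worst fits (indices 2 and 5 here) never generate training data.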

Why ensembles prevent exploitation: If you use a single model, the policy can find states where that specific model is wrong and exploit them. With an ensemble, exploiting one model doesn't help because the next step might use a different model. The policy is forced to find strategies that work under all models — which means strategies that work in reality.

Ensemble Predictions

Figure (interactive in the original): 5 models (colored lines) predict the next state from the same current state. Where data is abundant they agree; where data is scarce they diverge.

Why does MBPO use an ensemble of models instead of a single model?

The ensemble captures epistemic uncertainty and prevents the policy from exploiting biases of any single model, since different rollout steps use different models
To train faster
Because a single model can't represent the dynamics

Chapter 7: Results

MBPO is evaluated on the standard 1000-step MuJoCo continuous control benchmarks: Hopper, Walker2d, HalfCheetah, Ant, and Humanoid. The baselines include:

SAC (model-free, state-of-the-art asymptotic performance)
PPO (model-free, popular on-policy method)
PETS (model-based, trajectory optimization with ensembles)
SLBO (model-based, monotonic improvement from initial states)
STEVE (model-based, stochastic ensemble value expansion)

The headline results

MBPO achieves the sample efficiency of model-based methods with the asymptotic performance of model-free methods. On the Ant task, MBPO at 300K steps matches SAC at 3M steps — a 10x improvement in sample efficiency. On Hopper and Walker2d, it reaches near-optimal performance in the equivalent of 14 and 40 minutes of real-time simulation.

MBPO vs Baselines: Sample Efficiency

Steps to reach a target return on the Ant task. MBPO reaches SAC's 3M-step performance in just 300K steps.

Key findings

No model: Running SAC with MBPO's high gradient-to-data ratio (G=20-40) but without model data marginally speeds up learning, but cannot match MBPO's sample efficiency. The model data genuinely helps.

k=1 baseline: Even single-step model rollouts provide a surprisingly strong baseline, outperforming prior model-based methods that use long rollouts from initial states.

No model exploitation: On Hopper, the policy's returns under the model closely match its returns in the real environment. The short rollouts prevent the policy from exploiting model inaccuracies. In fact, model returns tend to underestimate real returns.

The scaling result: PETS and SLBO fail entirely on Ant and Humanoid because they roll out the model for the full task horizon. MBPO works on all benchmarks because it only needs the model to be accurate for a few steps. This is the practical payoff of decoupling model horizon from task horizon.

On the Ant task, how does MBPO's sample efficiency compare to SAC?

MBPO at 300K steps matches SAC at 3M steps, a 10x improvement in sample efficiency while matching asymptotic performance
They perform about the same
MBPO is twice as fast

Chapter 8: When to Trust the Model

The paper's title, "When to Trust Your Model: Model-Based Policy Optimization", poses a question, and the answer is nuanced. The theoretical bounds say: never, unless you can empirically estimate model generalization.

Pessimistic bounds vs empirical reality

The worst-case bound (Theorem 4.2) says the optimal rollout length k* = 0. Don't use the model. This is because the bound assumes the worst case for model generalization: that model error on new policy distributions equals model error on the training distribution plus a maximum distribution shift penalty.

But empirically, models generalize much better than worst case. The paper measures how model error grows as the policy drifts from the data-collecting policy (Figure 1 in the paper). Two key findings:

Models trained with more data have lower error on their training distribution (expected)
Models trained with more data also generalize better — the rate of error growth with policy shift decreases (surprising)

The practical model error estimate

Instead of using worst-case ε_m, MBPO uses an empirical estimate ε'_m that accounts for generalization:

ε'_m(ε_π) ≈ ε_m + ε_π · dε'_m/dε_π

Where dε'_m/dε_π is the empirically measured rate of model error growth with policy shift. Plugging this estimate into the branched-rollout bound in place of the worst-case model error (Theorem 4.3 in the paper) gives k* > 0 for sufficiently accurate models: the model is worth using.
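The effect on the optimal rollout length can be illustrated numerically. The sketch below rests on a loudly stated assumption: that the bound using ε'_m has the same form as the branched rollout bound of Chapter 3, with the worst-case factor (ε_m + 2ε_π) replaced by ε'_m. All numeric values and function names are made up for illustration.

```python
def empirical_model_error(eps_m: float, eps_pi: float,
                          error_growth_rate: float) -> float:
    """Linear estimate: eps'_m = eps_m + eps_pi * (d eps'_m / d eps_pi)."""
    return eps_m + eps_pi * error_growth_rate


def branched_penalty_empirical(r_max: float, gamma: float, eps_m_prime: float,
                               eps_pi: float, k: int) -> float:
    # ASSUMPTION: same shape as the branched bound in Chapter 3, with
    # (eps_m + 2 * eps_pi) replaced by the empirical estimate eps_m_prime.
    return 2.0 * r_max * (
        gamma ** (k + 1) * eps_pi / (1.0 - gamma) ** 2
        + gamma ** k * eps_pi / (1.0 - gamma)
        + k * eps_m_prime / (1.0 - gamma)
    )


def best_rollout_length(r_max: float, gamma: float, eps_m_prime: float,
                        eps_pi: float, k_max: int) -> int:
    """Rollout length in [0, k_max] that minimizes the penalty."""
    return min(range(k_max + 1),
               key=lambda k: branched_penalty_empirical(
                   r_max, gamma, eps_m_prime, eps_pi, k))


eps_pi = 0.1
eps_good = empirical_model_error(0.01, eps_pi, error_growth_rate=0.5)  # about 0.06
eps_bad = empirical_model_error(0.10, eps_pi, error_growth_rate=1.0)   # about 0.20
k_good = best_rollout_length(1.0, 0.99, eps_good, eps_pi, k_max=200)
k_bad = best_rollout_length(1.0, 0.99, eps_bad, eps_pi, k_max=200)
print(k_good, k_bad)
```

Under this assumed form, the accurate, well-generalizing model yields a positive optimal rollout length, while the inaccurate one falls back to k* = 0. The crossover happens when the estimated model error drops below the policy shift ε_π.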

The title, answered: Trust your model when (1) you have enough data that it generalizes well, and (2) you only ask it to predict a few steps ahead. Don't trust it for long-horizon predictions. Don't trust it early in training when data is scarce. MBPO operationalizes this by scheduling k to increase as the model improves.

What empirical finding allows MBPO to justify using the model despite pessimistic theoretical bounds?

Models trained with more data not only have lower training error but also generalize better: their error grows more slowly as the policy shifts, making the effective model error much smaller than worst-case bounds suggest
The model is always accurate
The theoretical bounds are wrong

Chapter 9: Connections

What MBPO built on

Dyna (Sutton, 1990): The original model-based + model-free hybrid. MBPO is a deep RL instantiation of Dyna's core idea — 1-step model rollouts from replay buffer states — with theoretical justification and modern components (ensembles, SAC).

PETS (Chua et al., 2018): Introduced the probabilistic ensemble for model-based RL. MBPO borrows the ensemble architecture but uses it for data generation rather than trajectory optimization.

SAC (Haarnoja et al., 2018): The policy optimizer inside MBPO. SAC's off-policy nature is essential — it can learn from the mix of real and model-generated data in the replay buffers.

STEVE (Buckman et al., 2018): Used short model rollouts for value expansion rather than data augmentation. MBPO shows data augmentation outperforms value expansion.

What MBPO inspired

Dreamer (Hafner et al., 2020): Learns a world model in latent space and imagines entire trajectories. While Dreamer uses longer rollouts in latent space, MBPO's analysis of compounding error informed the design.

TD-MPC / TD-MPC2 (Hansen et al., 2022, 2024): Combines model learning with temporal difference learning and model-predictive control. Uses learned latent dynamics for short-horizon planning.

DreamerV3 (Hafner et al., 2023): Scales world-model RL to diverse domains. The idea of managing model error through careful rollout strategies traces back to MBPO's analysis.

MBPO's legacy: MBPO demonstrated that you don't need a perfect model or clever planning, just a decent model, short rollouts, and a good model-free algorithm. This "model as data augmenter" philosophy has become a widely used paradigm in model-based deep RL, and the paper's theoretical framework for analyzing model usage is widely cited in subsequent model-based RL work.

Cheat sheet

Core idea

Short model rollouts (k=1 to 25) branched from real replay buffer states

Components

Ensemble of 7 probabilistic NNs + SAC + two replay buffers (real + model)

Key numbers

k=1-25, M=400 rollouts/step, G=20-40 gradient updates/step, 10x sample efficiency

Key result

Matches SAC asymptotic performance at 10x fewer samples, scales to 1000-step tasks

Legacy

Proved "model as data augmenter" works; inspired Dreamer, TD-MPC, modern MBRL

What is MBPO's relationship to the Dyna architecture?

MBPO is a deep RL instantiation of Dyna's core idea (using model-generated transitions from replay buffer states to augment policy learning) with theoretical justification, ensembles, and SAC
They are unrelated
Dyna is a special case of MBPO