Model-Based Policy Optimization — short model rollouts branched from real data achieve the sample efficiency of model-based methods and the asymptotic performance of model-free methods.
Reinforcement learning has two families. Model-free methods (SAC, PPO) learn by trial and error in the real environment. They converge to great policies, but they need millions of environment steps to get there. Model-based methods learn a dynamics model s_{t+1} = f(s_t, a_t) and plan through it. They learn much faster because they can generate unlimited synthetic experience, but they have a fatal flaw.
The flaw is compounding model error. A learned model is never perfect. When you roll it forward for one step, the prediction is slightly off. Two steps? The error grows. Ten steps? The model hallucinates physics that doesn't exist. A hundred steps? The agent has learned to exploit glitches in the model rather than solve the real task.
By 2019, this compounding error problem had forced model-based methods into a corner. They either worked only on short-horizon tasks (truncated to 200 steps instead of the standard 1000), or they collapsed entirely on complex environments like Ant. Meanwhile, model-free SAC was crushing these benchmarks, just very slowly.
Model prediction error grows with rollout length: each step multiplies the previous error.
Previous model-based methods rolled out the model from the initial state distribution for the full task horizon. If your task is 1000 steps, you'd generate 1000-step model trajectories. This entangles the model horizon with the task horizon, and since model error compounds exponentially, it breaks on long tasks.
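A rough illustration of that exponential compounding, assuming every model step inflates the relative error by the same fixed factor. The function name `compounded_error` and the 1% per-step rate are illustrative assumptions, not figures from the paper.

```python
def compounded_error(per_step_rate: float, horizon: int) -> float:
    """Relative error after `horizon` steps if each step multiplies it by (1 + per_step_rate)."""
    return (1.0 + per_step_rate) ** horizon - 1.0


# Compare a short branched rollout with truncated and full task horizons.
for horizon in (1, 5, 200, 1000):
    print(f"{horizon:>4} steps -> relative error {compounded_error(0.01, horizon):.3g}")
```

Under this toy assumption the error at 5 steps stays close to 5 times the one-step error, while at 1000 steps it is orders of magnitude larger, which is the gap that short rollouts exploit.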
MBPO's insight is radical in its simplicity: decouple the model horizon from the task horizon.
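Instead of rolling out the model from scratch for 1000 steps, you sample states that the agent actually visited in the real environment, roll the model forward only a few steps from each one, and train the policy on the resulting transitions.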
That's it. Short rollouts from real states. The model never has to predict more than a few steps into the future, so compounding error stays small. But you generate many such short rollouts, so you still get the data amplification that makes model-based methods fast.
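The branching idea can be sketched in a few lines. This is a minimal sketch, not the paper's implementation: `branched_rollouts`, `toy_policy`, and `toy_model_step` are hypothetical names, the toy dynamics are made up so the example runs, and episode termination is ignored.

```python
import numpy as np


def branched_rollouts(real_states, policy, model_step, k, rng):
    """Roll a learned model forward k steps from each real state.

    policy(states, rng) -> actions
    model_step(states, actions, rng) -> (next_states, rewards)
    Returns one (s, a, r, s_next) batch per model step.
    """
    transitions = []
    states = np.asarray(real_states, dtype=float)
    for _ in range(k):
        actions = policy(states, rng)
        next_states, rewards = model_step(states, actions, rng)
        transitions.append((states, actions, rewards, next_states))
        states = next_states  # continue the rollout from the model's prediction
    return transitions


# Hypothetical stand-ins for the policy and the learned model.
def toy_policy(states, rng):
    return -0.1 * states + 0.01 * rng.standard_normal(states.shape)


def toy_model_step(states, actions, rng):
    next_states = states + actions
    rewards = -np.sum(states ** 2, axis=1)
    return next_states, rewards


rng = np.random.default_rng(0)
real_states = rng.standard_normal((4, 3))  # pretend these came from the real replay buffer
batches = branched_rollouts(real_states, toy_policy, toy_model_step, k=2, rng=rng)
print(len(batches), batches[0][0].shape)
```

Only the first step of each rollout starts from a real state; later steps start from model predictions, which is why k is kept small.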
We operate in a Markov Decision Process (S, A, p, r, γ, ρ_0). The dynamics p(s'|s, a) are unknown. The goal is to find π* = argmax_π η[π], where η[π] = E_π[Σ_{t=0}^∞ γ^t r(s_t, a_t)] is the expected discounted return.
Model-based RL learns a model p_θ(s'|s, a) from data, then uses it to improve the policy.
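The intellectual ancestor of MBPO is Sutton's Dyna (1990). The Dyna loop alternates between acting in the real environment, fitting a model to the collected transitions, and updating the policy or value function on transitions simulated by that model.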
Dyna uses 1-step model rollouts. MBPO generalizes this to k-step rollouts and provides theoretical justification for choosing k.
We train p_θ via maximum likelihood on data D collected from the real environment: θ* = argmax_θ Σ_{(s, a, s', r) ∈ D} log p_θ(s', r | s, a).
The model is a neural network that takes (s, a) and outputs a distribution over (s', r). In MBPO, this is a Gaussian with learned mean and diagonal covariance: p_θ(s', r | s, a) = N(μ_θ(s, a), Σ_θ(s, a)).
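For a diagonal Gaussian, maximum likelihood amounts to minimizing a negative log-likelihood with a closed form. A minimal NumPy sketch follows; the name `gaussian_nll` and the choice to parameterize the variance by its logarithm are assumptions made for illustration, not details taken from the text.

```python
import numpy as np


def gaussian_nll(mean, log_var, target):
    """Mean negative log-likelihood of `target` under N(mean, diag(exp(log_var))).

    All arguments have shape (batch, dim); `target` stacks next state and reward.
    """
    inv_var = np.exp(-log_var)
    per_dim = 0.5 * (log_var + (target - mean) ** 2 * inv_var + np.log(2.0 * np.pi))
    return float(np.mean(np.sum(per_dim, axis=-1)))


mean = np.zeros((3, 2))
log_var = np.zeros((3, 2))
print(gaussian_nll(mean, log_var, target=np.zeros((3, 2))))  # perfect prediction
print(gaussian_nll(mean, log_var, target=np.ones((3, 2))))   # worse prediction, larger loss
```

The `log_var` term penalizes inflated variances, and the squared-error term is weighted by the inverse variance, so the network cannot lower the loss just by claiming high uncertainty everywhere.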
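Can we guarantee that using the model actually helps? The paper derives a bound of the form η[π] ≥ η̂[π] − C.
Here η[π] is the true return, η̂[π] is the return under the model, and C is a penalty that depends on two error sources: the model's generalization error ε_m and the distribution shift ε_π between the current policy and the data-collecting policy.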
This says: as long as you improve by more than C under the model, you're guaranteed to improve in the real environment too. But notice the (1−γ)^2 in the denominator. With γ = 0.99, (1−γ)^2 = 1/10000, so the error terms are multiplied by 10,000. The penalty C is enormous. This bound is too pessimistic to be useful.
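To see how quickly that denominator blows up, here is a quick numeric check. `horizon_factor` is a hypothetical helper that isolates only the 1/(1−γ)^2 factor, not the full penalty C.

```python
def horizon_factor(gamma: float) -> float:
    """Multiplier contributed by a (1 - gamma)^2 denominator."""
    return 1.0 / (1.0 - gamma) ** 2


for gamma in (0.9, 0.99, 0.999):
    print(f"gamma={gamma}: factor ~ {horizon_factor(gamma):,.0f}")
```

Each extra nine in γ multiplies the factor by roughly 100, so even a small per-step error yields a vacuous bound at typical discount factors.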
Now, instead of full rollouts, branch k-step rollouts from real states and bound the return in the same way.
The critical change: the penalty now scales linearly with k (the model rollout length) instead of quadratically with the full horizon 1/(1−γ). Short rollouts = small penalty.
The bound penalty C grows with rollout length k. Short rollouts keep C small, guaranteeing improvement.
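MBPO is built from three components: (1) an ensemble of probabilistic dynamics models, (2) SAC as the policy optimizer, and (3) the branched rollout strategy. Each iteration fits the model ensemble to the real data, takes a step in the real environment and stores it, branches M k-step model rollouts from states sampled out of the real buffer into a model buffer, and then runs G SAC updates on the resulting data.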
The crucial detail: even when k is small (even k=1), you perform M model rollouts at each environment step. This amplifies each real sample by a factor of M, enabling 20-40 gradient steps per environment step instead of the 1-4 that's typical in model-free methods. That's where the sample efficiency comes from.
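The amplification is easiest to see as bookkeeping. The skeleton below is a sketch under heavy assumptions: the environment, policy, learned model, and SAC update are all replaced by trivial stand-ins, model retraining is omitted, and `mbpo_loop` and its argument names are hypothetical. It only tracks how many real transitions, model transitions, and gradient updates the loop produces.

```python
import numpy as np


def mbpo_loop(env_steps, rollouts_per_step, k, grad_steps, rng):
    """Bookkeeping-only skeleton: returns (#real transitions, #model transitions, #updates)."""
    real_buffer, model_buffer, updates = [], [], 0
    state = rng.standard_normal(3)                  # stand-in for env.reset()
    for _ in range(env_steps):
        action = -0.1 * state                       # stand-in for the SAC policy
        next_state = state + action                 # stand-in for env.step()
        real_buffer.append((state, action, next_state))
        state = next_state

        # Branch M short rollouts from states sampled out of the real buffer.
        starts = rng.integers(len(real_buffer), size=rollouts_per_step)
        for idx in starts:
            s = real_buffer[idx][0]
            for _ in range(k):
                a = -0.1 * s
                s_next = s + a                      # stand-in for the learned model
                model_buffer.append((s, a, s_next))
                s = s_next

        updates += grad_steps                       # stand-in for G SAC gradient updates
    return len(real_buffer), len(model_buffer), updates


print(mbpo_loop(env_steps=5, rollouts_per_step=10, k=1, grad_steps=20,
                rng=np.random.default_rng(0)))
```

With these illustrative settings, 5 real steps yield 50 model transitions and 100 gradient updates; the ratio of model to real data is M times k regardless of how short each rollout is.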
The MBPO loop: real environment steps generate states, short model rollouts branch from those states to fill the model buffer, and SAC trains on the augmented data.
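MBPO has a few key hyperparameters beyond those of SAC: the rollout length k, the number of model rollouts per environment step M, the number of gradient updates per environment step G, and the ensemble size B.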
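How do you choose k, the rollout length? This is the central design decision in MBPO. Two forces are in tension: longer rollouts carry the policy further from the real data and extract more from the model, but every additional step compounds model error.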
The paper's ablation study reveals something surprising: on the Hopper task, fixing k=1 for the entire training run captures most of MBPO's benefit. A single model step from a real state is accurate enough to be useful, and you can compensate for the short horizon by doing many such 1-step rollouts.
In practice, MBPO schedules k to increase during training. Early on, the model is trained on little data and is inaccurate, so k should be small. As training progresses and the model sees more data, it becomes more accurate and can support longer rollouts.
For Hopper, k increases linearly from 1 to 15 over training. For Ant, k is also scheduled, increasing from 1 to 25. For HalfCheetah, k = 1 throughout, which already works well.
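A linear schedule of this kind can be written as a clamped interpolation. In the sketch below, `rollout_length` is a hypothetical helper, and the start and end epochs (20 and 100) are placeholder assumptions chosen for the example; only the 1 to 15 range for Hopper comes from the text.

```python
def rollout_length(epoch, start_epoch, end_epoch, k_min, k_max):
    """Linearly interpolate k from k_min to k_max, clamped outside [start_epoch, end_epoch]."""
    if end_epoch <= start_epoch:
        return k_min
    frac = (epoch - start_epoch) / (end_epoch - start_epoch)
    frac = min(max(frac, 0.0), 1.0)
    return int(k_min + frac * (k_max - k_min))


# Hopper-style schedule: k goes from 1 to 15 (epoch boundaries are assumed).
for epoch in (0, 20, 60, 100, 150):
    print(epoch, rollout_length(epoch, start_epoch=20, end_epoch=100, k_min=1, k_max=15))
```

Setting `k_min == k_max == 1` recovers the fixed single-step setting, so the same function covers both kinds of task.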
The paper shows that 200-step model rollouts still produce reasonable-looking trajectories, but they perform worse for policy optimization than short rollouts. Why? Because even if individual trajectories look plausible, the distribution of states they visit is subtly wrong. The policy can learn to exploit these distributional inaccuracies. At 500 steps, rollouts are too inaccurate for effective learning.
MBPO doesn't use a single dynamics model. It uses a bootstrap ensemble of B probabilistic neural networks, which together capture two types of uncertainty: aleatoric (noise inherent in the dynamics) and epistemic (uncertainty about the model itself).
Each model outputs a Gaussian distribution: p_θ^i(s', r | s, a) = N(μ_θ^i(s, a), Σ_θ^i(s, a)). The learned variance Σ captures aleatoric uncertainty, the inherent stochasticity in the dynamics. Even a perfect model would have this uncertainty.
The bootstrap ensemble captures model uncertainty. Each network is trained on a different bootstrap sample of the data. In regions with lots of data, all networks agree (low epistemic uncertainty). In regions with little data, they disagree (high epistemic uncertainty).
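Both halves of that idea fit in a few lines: drawing a with-replacement sample per member, and measuring disagreement between members' predictions. The helper names `bootstrap_indices` and `disagreement` are hypothetical, and using the standard deviation of predicted means as the disagreement measure is one common choice assumed here, not something the text specifies.

```python
import numpy as np


def bootstrap_indices(num_points, num_members, rng):
    """One row of with-replacement sample indices per ensemble member."""
    return rng.integers(num_points, size=(num_members, num_points))


def disagreement(member_means):
    """Per-dimension standard deviation across members' predicted means."""
    return np.std(np.asarray(member_means, dtype=float), axis=0)


rng = np.random.default_rng(0)
idx = bootstrap_indices(num_points=100, num_members=7, rng=rng)
print(idx.shape)

# Members that agree give zero disagreement; members that differ do not.
print(disagreement([[1.0, 2.0], [1.0, 2.0], [1.0, 2.0]]))
print(disagreement([[0.5, 2.0], [1.0, 2.0], [1.5, 2.0]]))
```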
To generate a single model transition, MBPO samples a model uniformly at random from the ensemble. Crucially, different steps of a single rollout can use different models. This prevents the policy from exploiting the specific biases of any single model.
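A sketch of that per-step sampling, with a toy ensemble standing in for trained networks. `ensemble_step` and `make_member` are hypothetical names, each member is represented as a pair of mean and standard-deviation functions, and the reward output is left out to keep the example short.

```python
import numpy as np


def ensemble_step(models, state, action, rng):
    """One synthetic transition: pick a member uniformly, then sample from its Gaussian."""
    mean_fn, std_fn = models[rng.integers(len(models))]
    mean = mean_fn(state, action)
    std = std_fn(state, action)
    return mean + std * rng.standard_normal(mean.shape)


def make_member(gain):
    """Toy member: slightly different linear dynamics per member, small fixed noise."""
    return (lambda s, a: s + gain * a, lambda s, a: 0.01 * np.ones_like(s))


models = [make_member(g) for g in (0.8, 0.9, 1.0, 1.1, 1.2)]
rng = np.random.default_rng(0)
state = np.zeros(2)
for _ in range(3):  # a new member is drawn at every step of the rollout
    state = ensemble_step(models, state, np.ones(2), rng)
print(state)
```

Because the member is redrawn inside the loop, no single member's bias accumulates along the whole rollout.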
MBPO trains 7 models but uses only the best 5 (by validation loss). Each model is an MLP with four hidden layers of 200 units each. The ensemble adds modest compute: training 7 small MLPs is much cheaper than the policy optimization steps.
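Choosing the best members by validation loss is a sort and a slice. In this sketch `select_elites` is a hypothetical helper and the loss values are made up for the example.

```python
import numpy as np


def select_elites(validation_losses, num_elites):
    """Indices of the `num_elites` members with the lowest held-out loss, best first."""
    order = np.argsort(np.asarray(validation_losses))
    return order[:num_elites].tolist()


losses = [0.31, 0.12, 0.45, 0.09, 0.27, 0.18, 0.52]  # made-up losses for seven members
elites = select_elites(losses, num_elites=5)
print(elites)  # [3, 1, 5, 4, 0]
```

Rollouts would then sample only from the members whose indices appear in `elites`.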
Five models predict the next state from the same current state. Where data is abundant, they agree; where data is scarce, they diverge.
MBPO is evaluated on the standard 1000-step MuJoCo continuous control benchmarks: Hopper, Walker2d, HalfCheetah, Ant, and Humanoid. The baselines include model-free methods such as SAC and model-based methods such as PETS and STEVE.
MBPO achieves the sample efficiency of model-based methods with the asymptotic performance of model-free methods. On the Ant task, MBPO at 300K steps matches SAC at 3M steps — a 10x improvement in sample efficiency. On Hopper and Walker2d, it reaches near-optimal performance in the equivalent of 14 and 40 minutes of real-time simulation.
Steps to reach a target return on the Ant task. MBPO reaches SAC's 3M-step performance in just 300K steps.
No model: Running SAC with MBPO's high gradient-to-data ratio (G=20-40) but without model data marginally speeds up learning, but cannot match MBPO's sample efficiency. The model data genuinely helps.
k=1 baseline: Even single-step model rollouts provide a surprisingly strong baseline, outperforming prior model-based methods that use long rollouts from initial states.
No model exploitation: On Hopper, the policy's returns under the model closely match its returns in the real environment. The short rollouts prevent the policy from exploiting model inaccuracies. In fact, model returns tend to underestimate real returns.
The paper's title asks "When to Trust Your Model?" — and the answer is nuanced. The theoretical bounds say: never, unless you can empirically estimate model generalization.
The worst-case bound (Theorem 4.2) says the optimal rollout length k* = 0. Don't use the model. This is because the bound assumes the worst case for model generalization: that model error on new policy distributions equals model error on the training distribution plus a maximum distribution shift penalty.
But empirically, models generalize much better than worst case. The paper measures how model error grows as the policy drifts from the data-collecting policy (Figure 1 in the paper). Two key findings:
Instead of using the worst-case ε_m, MBPO uses an empirical estimate ε'_m that accounts for generalization: ε'_m(ε_π) ≈ ε_m + ε_π · dε'_m/dε_π.
Here dε'_m/dε_π is the empirically measured rate of model error growth with policy shift. Plugging this into Theorem 4.3 gives k* > 0 for sufficiently accurate models, so the model is worth using.
Dyna (Sutton, 1990): The original model-based + model-free hybrid. MBPO is a deep RL instantiation of Dyna's core idea — 1-step model rollouts from replay buffer states — with theoretical justification and modern components (ensembles, SAC).
PETS (Chua et al., 2018): Introduced the probabilistic ensemble for model-based RL. MBPO borrows the ensemble architecture but uses it for data generation rather than trajectory optimization.
SAC (Haarnoja et al., 2018): The policy optimizer inside MBPO. SAC's off-policy nature is essential — it can learn from the mix of real and model-generated data in the replay buffers.
STEVE (Buckman et al., 2018): Used short model rollouts for value expansion rather than data augmentation. MBPO shows data augmentation outperforms value expansion.
Dreamer (Hafner et al., 2020): Learns a world model in latent space and imagines entire trajectories. While Dreamer uses longer rollouts in latent space, MBPO's analysis of compounding error informed the design.
TD-MPC / TD-MPC2 (Hansen et al., 2022, 2024): Combines model learning with temporal difference learning and model-predictive control. Uses learned latent dynamics for short-horizon planning.
DreamerV3 (Hafner et al., 2023): Scales world-model RL to diverse domains. The idea of managing model error through careful rollout strategies traces back to MBPO's analysis.