Using models of the world to think ahead — and what to do when the world changes.
Everything we've done since Chapter 5 has been model-free: the agent interacts with the environment and learns directly from experience, never trying to understand how the environment works. But what if we did build a model? What if the agent could say "if I take this action, I think the environment will do that"?
A model is anything that allows the agent to predict the environment's response. Given a state and action, a model produces (or simulates) a next state and reward. With a model, the agent can think — it can mentally simulate experience and learn from those simulations without ever touching the real environment.
There are two types of models:
Distribution model
Produces all possible next states and their probabilities: p(s',r|s,a). This is what DP uses. It gives the full picture but is expensive to store and use.
Sample model
Produces a single sampled transition: one (s',r) drawn from the distribution. Cheaper, easier to build, and sufficient for most planning methods.
Think of a weather model. A distribution model would say "40% chance of rain, 30% cloudy, 30% sunny." A sample model would just say "today it will rain" — one sample from the distribution. Sample models are easier to build (you can learn them from experience) and easier to use (you just sample and apply standard RL updates).
Planning means using a model to improve a policy. The simplest form: sample a (state, action) pair, ask the model what would happen, and use the result to update value estimates. Each simulated transition is as useful for learning as a real one — the Q-learning or Sarsa update is identical. The only difference is where the transition came from.
The Dyna architecture (Sutton, 1990) is the elegant idea at the heart of this chapter: combine direct RL and planning in a single loop. After each real step in the environment, the agent does three things: updates Q from the real transition (direct RL), updates the model (model learning), and then performs n additional Q updates using simulated transitions from the model (planning).
The model is simple: a lookup table. For every (state, action) pair the agent has actually experienced, the model records what happened — the resulting next state and reward. When planning, the agent picks a random previously-seen (S,A) pair, asks the model for (S',R), and applies a Q-learning update. This is called random-sample one-step tabular Q-planning.
The parameter n controls how much planning per real step. With n=0, Dyna reduces to ordinary Q-learning. With n=50, the agent does 50 simulated updates per real step. More planning means faster learning, at the cost of computation.
pseudocode # Tabular Dyna-Q Initialize Q(s,a), Model(s,a) for all s, a for each episode: S ← start state while S is not terminal: A ← ε-greedy from Q(S, ·) take A, observe R, S' # Direct RL Q(S,A) += α[R + γ maxa Q(S',a) − Q(S,A)] # Model learning Model(S,A) ← (R, S') # Planning: n simulated updates repeat n times: S̃ ← random previously observed state à ← random action previously taken in S̃ R̃, S̃' ← Model(S̃, Ã) Q(S̃,Ã) += α[R̃ + γ maxa Q(S̃',a) − Q(S̃,Ã)] S ← S'
This is where it clicks. Below is a maze. The agent (orange) starts at the bottom-left and must reach the goal (green) at the top-right. Walls block the path. The agent earns reward 0 on every step and +1 on reaching the goal, so it wants to find the shortest path as fast as possible.
The slider controls n — the number of planning steps per real step. With n=0 (pure Q-learning), the agent stumbles for many episodes. With n=5 or n=50, watch how much faster it learns. Planning multiplies every real step by a factor of n, letting value information flood through the Q-table much faster.
Set planning steps (n) and click Run Episode to watch the agent navigate, or Train 50 to train for 50 episodes and see the learning curve. The right panel shows the model being built.
The model panel on the right side of the canvas shows which (state, action) pairs the model has learned. Dark cells are unexplored. Colored cells have been visited. As the agent explores, the model fills in — and planning exploits that model immediately.
Models are learned from experience, so they can be wrong. More importantly, the environment can change. When it does, the model reflects the old environment, and planning from it can be actively harmful. This chapter examines two classic scenarios.
The agent learns a short path through a gap. Then the gap is blocked and a new gap opens elsewhere. Dyna-Q keeps planning from the old model and clings to the now-blocked path. It takes many real steps to discover the change and relearn.
The opposite: a wall is removed, opening a shorter path. Dyna-Q never discovers the shortcut because it never explores there — the old (longer) path still works fine. The model says "wall," so the agent never even tries.
The solution is Dyna-Q+. It adds an exploration bonus: for each (s,a) pair, it tracks how many time steps have elapsed since the pair was last tried in the real environment. The bonus is κ√τ, where τ is the time since last visit and κ is a small constant (e.g., 0.001). This extra "reward" for visiting stale state-action pairs encourages the agent to check whether its model is still correct.
Each cell represents a (state, action) pair. Color shows recency: teal = recently visited, red = stale. Dyna-Q+ adds a bonus proportional to √τ (square root of staleness), shown as brightness.
In Dyna-Q, planning steps are chosen randomly: we pick a random previously-seen (S,A) pair and update it. But not all updates are equally useful. If a state's value just changed a lot, then its predecessors (states that lead to it) should be updated next, because their values depend on the changed state.
Prioritized sweeping maintains a priority queue of (state, action) pairs, ordered by the urgency of their update. Urgency is measured by how much the value would change if we applied the update. The bigger the expected change, the higher the priority.
pseudocode # Prioritized Sweeping (simplified) Initialize Q(s,a), Model(s,a), PriorityQueue PQ for each step: take action A in state S, observe R, S' # Direct RL update δ ← |R + γ maxa Q(S',a) − Q(S,A)| if δ > θ: insert (S,A) into PQ with priority δ # Planning: update highest-priority pairs repeat n times while PQ not empty: (S̃,Ã) ← pop top of PQ R̃, S̃' ← Model(S̃,Ã) Q(S̃,Ã) += α[R̃ + γ maxa Q(S̃',a) − Q(S̃,Ã)] # Propagate to predecessors for each (S̄,Ā) predicted to lead to S̃: R̄ ← predicted reward δ ← |R̄ + γ maxa Q(S̃,a) − Q(S̄,Ā)| if δ > θ: insert (S̄,Ā) into PQ with priority δ
In the maze task from Chapter 2, prioritized sweeping can solve the maze in 2-3 episodes where random Dyna-Q takes 20+. The improvement is most dramatic in large state spaces where most states are far from the reward, because random updates waste effort on states whose values haven't changed.
Every RL update can be categorized along three dimensions: (1) state values vs. action values, (2) one-step vs. multi-step, and (3) expected vs. sample. This chapter focuses on the third dimension.
An expected update considers all possible next states, weighted by their probabilities (like DP). A sample update uses a single sampled next state (like Q-learning or Sarsa). Expected updates are more accurate per update, but sample updates are cheaper per update.
At low branching factor, expected updates win (precise, not too costly). At high branching factor, sample updates win (they get b updates in the time of one expected update). Drag the slider to see the crossover.
Sutton and Barto show that for large enough b, sample updates are more efficient in terms of total computation. One expected update costs b times as much as one sample update. If b sample updates reduce the error by more than one expected update, sampling wins. The crossover happens around b ≈ 10 for typical problems.
| Update Type | Cost per update | Accuracy per update | Best when |
|---|---|---|---|
| Expected | O(b) | No sampling noise | Small branching factor |
| Sample | O(1) | Noisy (single sample) | Large branching factor |
In classical DP, every sweep visits every state. But most real problems have enormous state spaces where the agent only visits a tiny fraction. Why waste computation on states the agent will never reach?
Trajectory sampling is the idea of focusing updates on states the agent actually encounters under its current policy. Instead of exhaustive sweeps, the agent simulates trajectories from the model (following the current policy) and updates only the visited states.
Consider a game of chess. The state space is roughly 1047. Full DP sweeps are inconceivable. But from any position, the agent follows a plausible line of play and only updates the positions it actually reaches. This is trajectory sampling in action.
Exhaustive sweeps (DP):
• Visit every state every sweep
• Uniform coverage
• O(|S|) per sweep
• Impractical for large |S|
Trajectory sampling:
• Visit only on-policy states
• Focused coverage
• O(episode length) per trajectory
• Scales to huge |S|
Sutton and Barto show that trajectory sampling initially converges faster than exhaustive sweeps, because it focuses on relevant states. However, over very long training, exhaustive sweeps eventually catch up and may produce more uniform accuracy. For practical purposes, trajectory sampling dominates — you get good performance on the states that matter long before you'd finish a single exhaustive sweep.
Real-time dynamic programming (RTDP) applies the trajectory sampling idea to value iteration. Instead of sweeping all states, the agent performs value iteration updates only on states it actually visits while interacting with the environment (or a model of it).
The algorithm is simple: the agent follows its current greedy policy, and at each state visited, it performs a single Bellman optimality backup. Over time, the values of on-policy states converge to optimal, and the policy converges to the optimal policy for the relevant part of the state space.
RTDP is particularly effective in stochastic shortest path problems, where the agent wants to reach a goal as quickly as possible. It's also a building block for more sophisticated algorithms like LRTDP (Labeled RTDP) which adds convergence checks to identify states whose values have converged, skipping them in future trajectories.
Everything so far has been background planning: the agent gradually improves its policy/value function in the background, and when it needs to act, it consults the already-computed Q values. But there's another approach: decision-time planning, where all the computational effort is spent at the moment a decision is needed.
Think of a chess player. A background planner would have a pre-computed evaluation for every possible position. A decision-time planner looks at the current board and searches forward from this specific position, exploring possible move sequences to decide the best one. The search tree is rooted at the current state and goes only as deep as time allows.
Heuristic search is the classic decision-time planning method. The agent builds a search tree from the current state, using the model to expand possible futures. At the leaves of the tree, it uses a heuristic value estimate (or the current V/Q table). By searching deeper, it gets a more accurate value for the current state's actions.
Rollout algorithms are a simpler variant: from the current state, simulate complete trajectories using a rollout policy (often just the current policy), average the returns, and pick the action with the highest average return. This is a Monte Carlo estimate of the action values at the current state. The rollout policy provides the baseline, and the one-step lookahead improves upon it.
Monte Carlo Tree Search (MCTS) is the most successful decision-time planning algorithm. It's what powered AlphaGo's victory over Lee Sedol in 2016. The idea: grow a search tree from the current state by repeatedly running four phases — selection, expansion, simulation, and backpropagation.
In the selection phase, we traverse the tree from the root, choosing child nodes that balance exploitation (high value) and exploration (low visit count). The standard choice is UCB1: pick the child maximizing Q(s,a) + c √(ln N(s) / N(s,a)), where N is the visit count and c is an exploration constant.
In expansion, we add one new node at the frontier of the tree. In simulation (also called "playout" or "rollout"), we play out from the new node to a terminal state using a fast, simple policy (often random). The resulting return is the evaluation of the new node.
In backpropagation, we walk back up the tree and update the visit counts and average values at each ancestor node.
Watch the MCTS tree grow. Each click runs one iteration (select, expand, simulate, backpropagate). Node size = visit count. Color = average value (green = high, red = low).
After many iterations, the root action with the highest visit count (not highest value — visit count is more robust) is selected as the agent's action. The tree is then usually discarded (or the subtree under the chosen action is kept as the starting point for the next decision).
This chapter unified two seemingly different ideas — model-free learning (Chapters 5-7) and model-based planning (Chapter 4) — into a single framework. The Dyna architecture showed that learning and planning are the same computation, applied to real vs. simulated experience.
| Method | Core Idea | Key Property |
|---|---|---|
| Dyna-Q | Plan with learned model | n planning steps per real step |
| Dyna-Q+ | Exploration bonus for staleness | Detects model changes |
| Prioritized Sweeping | Update in order of urgency | Value propagates along useful paths |
| Trajectory Sampling | Focus on on-policy states | Scales to huge state spaces |
| RTDP | Value iteration on trajectories | Convergence for relevant states |
| MCTS | Decision-time tree search | Asymmetric, focused search |
Strengths of model-based RL:
• Much more sample-efficient
• Can plan ahead before acting
• Simulated experience is free
• Can detect and adapt to changes
Limitations:
• Model learning is hard in practice
• Wrong models can be worse than no model
• Computational cost of planning
• Distribution models often infeasible
What comes next: Chapter 7 showed how to tune the bootstrapping horizon. This chapter showed how to use models for planning. Both are tabular methods. Chapter 9 begins Part II of the book: what happens when the state space is too large for tables? We'll need function approximation — using parameterized functions (linear models, neural networks) to generalize across states.