Ch 8: Planning and Learning — Sutton & Barto RL

Chapter 0: Models and Planning

Everything we've done since Chapter 5 has been model-free: the agent interacts with the environment and learns directly from experience, never trying to understand how the environment works. But what if we did build a model? What if the agent could say "if I take this action, I think the environment will do that"?

A model is anything that allows the agent to predict the environment's response. Given a state and action, a model produces (or simulates) a next state and reward. With a model, the agent can think — it can mentally simulate experience and learn from those simulations without ever touching the real environment.

The core idea: Real experience is expensive. A robot might need hours to try an action. A model lets the agent rehearse thousands of actions in its head, for free. Planning is learning from simulated experience.

There are two types of models:

Distribution model

Produces all possible next states and their probabilities: p(s',r|s,a). This is what DP uses. It gives the full picture but is expensive to store and use.

Sample model

Produces a single sampled transition: one (s',r) drawn from the distribution. Cheaper, easier to build, and sufficient for most planning methods.

Think of a weather model. A distribution model would say "40% chance of rain, 30% cloudy, 30% sunny." A sample model would just say "today it will rain" — one sample from the distribution. Sample models are easier to build (you can learn them from experience) and easier to use (you just sample and apply standard RL updates).

Planning means using a model to improve a policy. The simplest form: sample a (state, action) pair, ask the model what would happen, and use the result to update value estimates. Each simulated transition is as useful for learning as a real one — the Q-learning or Sarsa update is identical. The only difference is where the transition came from.

Model Learning

Agent observes real transitions, builds model

↓

Planning

Agent simulates transitions from model, updates Q

↓

Acting

Agent uses updated Q to choose real actions

↻ repeat

Check: What is the key advantage of having a model of the environment?

The agent can simulate experience and learn from it without interacting with the real environment It eliminates the need for exploration It guarantees the optimal policy immediately

Chapter 1: The Dyna Architecture

The Dyna architecture (Sutton, 1990) is the elegant idea at the heart of this chapter: combine direct RL and planning in a single loop. After each real step in the environment, the agent does three things: updates Q from the real transition (direct RL), updates the model (model learning), and then performs n additional Q updates using simulated transitions from the model (planning).

1. Real Experience

Take action, observe (S', R)

↓

2. Direct RL Update

Q(S,A) += alpha [R + gamma max Q(S',a) - Q(S,A)]

↓

3. Model Update

Model(S,A) := (S', R) — record what happened

↓

4. Planning (n steps)

Repeat n times: pick random previously-seen (S,A), simulate, update Q

↻ back to step 1

The model is simple: a lookup table. For every (state, action) pair the agent has actually experienced, the model records what happened — the resulting next state and reward. When planning, the agent picks a random previously-seen (S,A) pair, asks the model for (S',R), and applies a Q-learning update. This is called random-sample one-step tabular Q-planning.

Key insight: From the perspective of the Q-learning update rule, it doesn't matter whether the transition (S, A, R, S') came from the real environment or from the model. The update is identical. Planning is just "extra practice" using simulated data.

The parameter n controls how much planning per real step. With n=0, Dyna reduces to ordinary Q-learning. With n=50, the agent does 50 simulated updates per real step. More planning means faster learning, at the cost of computation.

pseudocode
# Tabular Dyna-Q
Initialize Q(s,a), Model(s,a) for all s, a

for each episode:
  S ← start state
  while S is not terminal:
    A ← ε-greedy from Q(S, ·)
    take A, observe R, S'

    # Direct RL
    Q(S,A) += α[R + γ max_a Q(S',a) − Q(S,A)]

    # Model learning
    Model(S,A) ← (R, S')

    # Planning: n simulated updates
    repeat n times:
      S̃ ← random previously observed state
      Ã ← random action previously taken in S̃
      R̃, S̃' ← Model(S̃, Ã)
      Q(S̃,Ã) += α[R̃ + γ max_a Q(S̃',a) − Q(S̃,Ã)]

    S ← S'

Check: In Dyna-Q, where do the simulated transitions for planning come from?

From random initialization From the learned model, using previously observed (state, action) pairs From a neural network that generates plausible transitions

Chapter 2: Dyna-Q in Action

This is where it clicks. Below is a maze. The agent (orange) starts at the bottom-left and must reach the goal (green) at the top-right. Walls block the path. The agent earns reward 0 on every step and +1 on reaching the goal, so it wants to find the shortest path as fast as possible.

The slider controls n — the number of planning steps per real step. With n=0 (pure Q-learning), the agent stumbles for many episodes. With n=5 or n=50, watch how much faster it learns. Planning multiplies every real step by a factor of n, letting value information flood through the Q-table much faster.

Showcase: Dyna-Q Maze

Set planning steps (n) and click Run Episode to watch the agent navigate, or Train 50 to train for 50 episodes and see the learning curve. The right panel shows the model being built.

Planning steps n = 10

ε = 0.10

Episode 0 | Steps: --

What to notice: With n=0 (no planning), the first several episodes take hundreds of steps — the agent wanders aimlessly. With n=50, even the second episode is dramatically shorter, because after the first episode's real transitions, the model has been queried 50 times per step, spreading value information throughout the Q-table. The agent essentially "thinks about" the maze between real steps.

The model panel on the right side of the canvas shows which (state, action) pairs the model has learned. Dark cells are unexplored. Colored cells have been visited. As the agent explores, the model fills in — and planning exploits that model immediately.

Check: Why does increasing the planning steps (n) in Dyna-Q reduce the number of real steps needed?

It makes the agent explore more randomly Each real step triggers n simulated updates, spreading value information faster through Q It changes the discount factor

Chapter 3: When the Model is Wrong

Models are learned from experience, so they can be wrong. More importantly, the environment can change. When it does, the model reflects the old environment, and planning from it can be actively harmful. This chapter examines two classic scenarios.

The Blocking Maze

The agent learns a short path through a gap. Then the gap is blocked and a new gap opens elsewhere. Dyna-Q keeps planning from the old model and clings to the now-blocked path. It takes many real steps to discover the change and relearn.

The Shortcut Maze

The opposite: a wall is removed, opening a shorter path. Dyna-Q never discovers the shortcut because it never explores there — the old (longer) path still works fine. The model says "wall," so the agent never even tries.

The problem: The model only reflects states the agent has visited. If the environment changes in places the agent hasn't visited recently, the model stays stale. The agent becomes trapped by its own outdated beliefs.

The solution is Dyna-Q+. It adds an exploration bonus: for each (s,a) pair, it tracks how many time steps have elapsed since the pair was last tried in the real environment. The bonus is κ√τ, where τ is the time since last visit and κ is a small constant (e.g., 0.001). This extra "reward" for visiting stale state-action pairs encourages the agent to check whether its model is still correct.

R_bonus = R_model + κ √τ

Model Staleness Visualization

Each cell represents a (state, action) pair. Color shows recency: teal = recently visited, red = stale. Dyna-Q+ adds a bonus proportional to √τ (square root of staleness), shown as brightness.

κ = 0.001

The trade-off: Dyna-Q+ is slower than Dyna-Q in stationary environments (the exploration bonus wastes some effort on unnecessary re-exploration). But in changing environments, it detects changes faster because it periodically re-checks old transitions. The bonus decays relative to recency, so frequently-visited pairs get almost no bonus.

Check: In Dyna-Q+, what does the exploration bonus kappa * sqrt(tau) encourage the agent to do?

Revisit state-action pairs it hasn't tried recently, to check if the model is still correct Always take the greedy action Increase the step size over time

Chapter 4: Prioritized Sweeping

In Dyna-Q, planning steps are chosen randomly: we pick a random previously-seen (S,A) pair and update it. But not all updates are equally useful. If a state's value just changed a lot, then its predecessors (states that lead to it) should be updated next, because their values depend on the changed state.

Prioritized sweeping maintains a priority queue of (state, action) pairs, ordered by the urgency of their update. Urgency is measured by how much the value would change if we applied the update. The bigger the expected change, the higher the priority.

Value Change

Q(S,A) changes by a large delta

↓ predecessors affected

Insert Predecessors

All (S̄,Ā) that lead to S get priority |delta|

↓ pop highest priority

Update

Apply Q-learning update to top-priority pair

↻ repeat n times

Why it works: Think of value information as water flowing backward from the goal. Random updates splash water everywhere, hoping some lands on useful states. Prioritized sweeping channels the water upstream along the exact path it needs to travel. After finding the goal once, the value information floods backward along the best route far faster than random planning.

pseudocode
# Prioritized Sweeping (simplified)
Initialize Q(s,a), Model(s,a), PriorityQueue PQ

for each step:
  take action A in state S, observe R, S'

  # Direct RL update
  δ ← |R + γ max_a Q(S',a) − Q(S,A)|
  if δ > θ: insert (S,A) into PQ with priority δ

  # Planning: update highest-priority pairs
  repeat n times while PQ not empty:
    (S̃,Ã) ← pop top of PQ
    R̃, S̃' ← Model(S̃,Ã)
    Q(S̃,Ã) += α[R̃ + γ max_a Q(S̃',a) − Q(S̃,Ã)]

    # Propagate to predecessors
    for each (S̄,Ā) predicted to lead to S̃:
      R̄ ← predicted reward
      δ ← |R̄ + γ max_a Q(S̃,a) − Q(S̄,Ā)|
      if δ > θ: insert (S̄,Ā) into PQ with priority δ

In the maze task from Chapter 2, prioritized sweeping can solve the maze in 2-3 episodes where random Dyna-Q takes 20+. The improvement is most dramatic in large state spaces where most states are far from the reward, because random updates waste effort on states whose values haven't changed.

Check: What determines the priority of a (state, action) pair in prioritized sweeping?

How recently the pair was visited The magnitude of the expected value change (TD error) The distance from the start state

Chapter 5: Expected vs Sample Updates

Every RL update can be categorized along three dimensions: (1) state values vs. action values, (2) one-step vs. multi-step, and (3) expected vs. sample. This chapter focuses on the third dimension.

An expected update considers all possible next states, weighted by their probabilities (like DP). A sample update uses a single sampled next state (like Q-learning or Sarsa). Expected updates are more accurate per update, but sample updates are cheaper per update.

Expected: Q(s,a) ← ∑_s',r p(s',r|s,a) [r + γ max_a' Q(s',a')]

Sample: Q(s,a) ← Q(s,a) + α[r + γ max_a' Q(s',a') − Q(s,a)]

The tradeoff: Expected updates use all the information in the model (no sampling noise) but cost O(b) per update, where b is the branching factor (number of possible next states). Sample updates cost O(1) but are noisy. Which is better depends on b and how many updates you can afford.

Expected vs Sample: Branching Factor

At low branching factor, expected updates win (precise, not too costly). At high branching factor, sample updates win (they get b updates in the time of one expected update). Drag the slider to see the crossover.

Branching factor b = 10

Sutton and Barto show that for large enough b, sample updates are more efficient in terms of total computation. One expected update costs b times as much as one sample update. If b sample updates reduce the error by more than one expected update, sampling wins. The crossover happens around b ≈ 10 for typical problems.

Update Type	Cost per update	Accuracy per update	Best when
Expected	O(b)	No sampling noise	Small branching factor
Sample	O(1)	Noisy (single sample)	Large branching factor

Check: When does a sample update become more efficient than an expected update (per unit of computation)?

When the branching factor is large, so you can do many sample updates in the time of one expected update When the state space is small When gamma is close to 1

Chapter 6: Trajectory Sampling

In classical DP, every sweep visits every state. But most real problems have enormous state spaces where the agent only visits a tiny fraction. Why waste computation on states the agent will never reach?

Trajectory sampling is the idea of focusing updates on states the agent actually encounters under its current policy. Instead of exhaustive sweeps, the agent simulates trajectories from the model (following the current policy) and updates only the visited states.

The key insight: Under the current policy, some states are visited frequently and some are never reached. Trajectory sampling naturally concentrates computation on the on-policy distribution — the states that matter most for the current policy. This is far more efficient than updating states the agent will never see.

Consider a game of chess. The state space is roughly 10⁴⁷. Full DP sweeps are inconceivable. But from any position, the agent follows a plausible line of play and only updates the positions it actually reaches. This is trajectory sampling in action.

Exhaustive sweeps (DP):

• Visit every state every sweep
• Uniform coverage
• O(|S|) per sweep
• Impractical for large |S|

Trajectory sampling:

• Visit only on-policy states
• Focused coverage
• O(episode length) per trajectory
• Scales to huge |S|

Sutton and Barto show that trajectory sampling initially converges faster than exhaustive sweeps, because it focuses on relevant states. However, over very long training, exhaustive sweeps eventually catch up and may produce more uniform accuracy. For practical purposes, trajectory sampling dominates — you get good performance on the states that matter long before you'd finish a single exhaustive sweep.

Check: Why is trajectory sampling more efficient than exhaustive sweeps in large state spaces?

It focuses computation on states the agent actually visits under its current policy It uses a larger step size It doesn't require a model

Chapter 7: Real-Time DP

Real-time dynamic programming (RTDP) applies the trajectory sampling idea to value iteration. Instead of sweeping all states, the agent performs value iteration updates only on states it actually visits while interacting with the environment (or a model of it).

The algorithm is simple: the agent follows its current greedy policy, and at each state visited, it performs a single Bellman optimality backup. Over time, the values of on-policy states converge to optimal, and the policy converges to the optimal policy for the relevant part of the state space.

V(S) ← max_a ∑_s',r p(s',r|S,a) [r + γ V(s')]

RTDP convergence: Under certain conditions (ergodicity or proper policies), RTDP converges to the optimal value function for all states that are relevant under the optimal policy. It may never converge for states that are unreachable under the optimal policy — but that's fine, because we don't care about those states.

RTDP is particularly effective in stochastic shortest path problems, where the agent wants to reach a goal as quickly as possible. It's also a building block for more sophisticated algorithms like LRTDP (Labeled RTDP) which adds convergence checks to identify states whose values have converged, skipping them in future trajectories.

Standard Value Iteration

Sweep ALL states every iteration

↓ restrict to on-policy states

RTDP

Update only states visited on current trajectory

↓ add convergence checks

LRTDP

Label converged states, skip them, converge faster

Check: How does RTDP differ from standard value iteration?

It updates only states visited along the agent's actual trajectory, not all states It uses a different Bellman equation It eliminates the discount factor

Chapter 8: Decision-Time Planning

Everything so far has been background planning: the agent gradually improves its policy/value function in the background, and when it needs to act, it consults the already-computed Q values. But there's another approach: decision-time planning, where all the computational effort is spent at the moment a decision is needed.

Think of a chess player. A background planner would have a pre-computed evaluation for every possible position. A decision-time planner looks at the current board and searches forward from this specific position, exploring possible move sequences to decide the best one. The search tree is rooted at the current state and goes only as deep as time allows.

Background vs Decision-Time: Background planning builds a general-purpose policy. Decision-time planning builds a specific-purpose plan for the current decision. Background planning is amortized over many decisions. Decision-time planning concentrates all effort on the one decision that matters right now.

Heuristic search is the classic decision-time planning method. The agent builds a search tree from the current state, using the model to expand possible futures. At the leaves of the tree, it uses a heuristic value estimate (or the current V/Q table). By searching deeper, it gets a more accurate value for the current state's actions.

Rollout algorithms are a simpler variant: from the current state, simulate complete trajectories using a rollout policy (often just the current policy), average the returns, and pick the action with the highest average return. This is a Monte Carlo estimate of the action values at the current state. The rollout policy provides the baseline, and the one-step lookahead improves upon it.

Rollout guarantee: A rollout algorithm using policy π as its rollout policy is guaranteed to produce a policy that is at least as good as π (for the current state). It's policy improvement applied in real time, one state at a time. If the rollout policy is already decent, the rollout algorithm can be very effective.

Check: What is the key difference between background planning and decision-time planning?

Background planning uses a model; decision-time planning does not Background planning builds a general policy in advance; decision-time planning searches forward from the current state when a decision is needed Decision-time planning is always faster

Chapter 9: Monte Carlo Tree Search

Monte Carlo Tree Search (MCTS) is the most successful decision-time planning algorithm. It's what powered AlphaGo's victory over Lee Sedol in 2016. The idea: grow a search tree from the current state by repeatedly running four phases — selection, expansion, simulation, and backpropagation.

1. Selection

Walk down the tree using UCB to balance explore/exploit

↓

2. Expansion

Add one new child node to the tree

↓

3. Simulation

Play out a random rollout to the end

↓

4. Backpropagation

Update visit counts and values up the tree

↻ repeat many times, then pick best root action

In the selection phase, we traverse the tree from the root, choosing child nodes that balance exploitation (high value) and exploration (low visit count). The standard choice is UCB1: pick the child maximizing Q(s,a) + c √(ln N(s) / N(s,a)), where N is the visit count and c is an exploration constant.

In expansion, we add one new node at the frontier of the tree. In simulation (also called "playout" or "rollout"), we play out from the new node to a terminal state using a fast, simple policy (often random). The resulting return is the evaluation of the new node.

In backpropagation, we walk back up the tree and update the visit counts and average values at each ancestor node.

MCTS Tree Growth

Watch the MCTS tree grow. Each click runs one iteration (select, expand, simulate, backpropagate). Node size = visit count. Color = average value (green = high, red = low).

Iterations: 0

Why MCTS works so well: It focuses computation on the most promising parts of the search tree (via UCB exploration). Early iterations explore broadly; later iterations zoom in on the best moves. The tree grows asymmetrically — deep in promising directions, shallow in unpromising ones. This is far more efficient than searching to a fixed depth everywhere.

After many iterations, the root action with the highest visit count (not highest value — visit count is more robust) is selected as the agent's action. The tree is then usually discarded (or the subtree under the chosen action is kept as the starting point for the next decision).

AlphaGo and beyond: AlphaGo combined MCTS with deep neural networks. The policy network guided the selection phase (replacing UCB's random exploration with learned intuition). The value network replaced the random rollout with a learned evaluation. The combination was superhuman. AlphaZero went further, learning entirely from self-play with MCTS — no human data at all.

Check: In MCTS, what is the purpose of the simulation (rollout) phase?

To estimate the value of a newly expanded node by playing out to a terminal state To train the model To update the policy network

Chapter 10: Summary

This chapter unified two seemingly different ideas — model-free learning (Chapters 5-7) and model-based planning (Chapter 4) — into a single framework. The Dyna architecture showed that learning and planning are the same computation, applied to real vs. simulated experience.

Method	Core Idea	Key Property
Dyna-Q	Plan with learned model	n planning steps per real step
Dyna-Q+	Exploration bonus for staleness	Detects model changes
Prioritized Sweeping	Update in order of urgency	Value propagates along useful paths
Trajectory Sampling	Focus on on-policy states	Scales to huge state spaces
RTDP	Value iteration on trajectories	Convergence for relevant states
MCTS	Decision-time tree search	Asymmetric, focused search

The dimensions of planning: Every planning method can be characterized by: (1) whether it uses a distribution or sample model, (2) whether updates are expected or sampled, (3) whether planning is background or decision-time, (4) whether state coverage is exhaustive or on-policy. Different combinations of these choices produce all the methods in this chapter.

Strengths of model-based RL:

• Much more sample-efficient
• Can plan ahead before acting
• Simulated experience is free
• Can detect and adapt to changes

Limitations:

• Model learning is hard in practice
• Wrong models can be worse than no model
• Computational cost of planning
• Distribution models often infeasible

What comes next: Chapter 7 showed how to tune the bootstrapping horizon. This chapter showed how to use models for planning. Both are tabular methods. Chapter 9 begins Part II of the book: what happens when the state space is too large for tables? We'll need function approximation — using parameterized functions (linear models, neural networks) to generalize across states.

The big picture: Part I gave us all the fundamental ideas: value functions, Bellman equations, DP, MC, TD, n-step, and planning. These are the building blocks. Part II will scale these ideas to problems too large for lookup tables. Every algorithm in Part II is a variation on Part I's ideas, combined with function approximation. The concepts you've learned transfer directly.

"Planning and learning are two sides of the same coin.
Both use value updates; they differ only in the source of experience."
— Richard S. Sutton

Check: What is the fundamental insight of the Dyna architecture?

Real experience and simulated experience can use the same learning update, so planning and learning are the same computation Models are always better than model-free learning Planning should only happen at decision time