Sutton & Barto, Chapter 4

Dynamic Programming

Computing optimal policies when you have a perfect model of the environment.

Prerequisites: Chapter 3 (MDPs & Bellman equations). That's it.
11
Chapters
2
Simulations
11
Quizzes

Chapter 0: What is DP?

In Chapter 3 we wrote down the Bellman equations — elegant recursive relationships that the value function must satisfy. But writing down an equation and solving it are very different things. How do you actually compute the optimal value function for an MDP with thousands of states?

Dynamic programming (DP) answers this: take the Bellman equations and turn them into update rules. Start with any guess for the values, then iteratively apply the Bellman equation at every state, sweeping through the whole state space. Each sweep pushes your values closer to the truth.

There is one crucial requirement: DP needs a complete model of the environment — the full transition dynamics p(s',r|s,a). You must know every possible next state, its probability, and its reward. This is a strong assumption. Later chapters will relax it.

The core idea: DP turns the Bellman equation from a statement about values ("if the values are correct, they satisfy this relationship") into an algorithm ("repeatedly apply this rule and the values will become correct"). It is the computational backbone of everything that follows.
Bellman Equation
A system of |S| simultaneous equations
↓ turn into update rule
DP Update
Sweep through all states, apply the equation
↻ repeat until convergence
Optimal Values
v* or vπ to any desired precision

The term "dynamic programming" was coined by Richard Bellman in the 1950s. The "programming" refers to optimization (as in "linear programming"), not computer code. The "dynamic" refers to problems that unfold over time — sequential decision making.

Think of DP as a planning method. You sit down with the complete blueprints of the environment (all transitions, all rewards) and compute the best strategy before you ever take a single action. It is the gold standard — if you have a model, DP gives you the answer. The rest of the book asks: what if you don't have a model?

What DP needsWhat it produces
Complete model p(s',r|s,a)Optimal value function v*
Reward function (embedded in model)Optimal policy π*
Discount factor γAction-value function q*

The chapter ahead covers four tightly linked ideas. Policy evaluation computes how good a given policy is. Policy improvement makes the policy better. Policy iteration alternates the two until optimal. Value iteration combines them into one tight loop. And GPI reveals that all RL algorithms are variations on this pattern.

DP in the real world: Despite the model requirement, DP is not just a theoretical curiosity. Many practical problems do have known models: inventory management, network routing, elevator scheduling, game AIs (chess, Go endgames), and financial derivatives pricing. DP also serves as the inner loop of model-based RL — you learn a model, then plan with DP. Chapter 8 (Dyna) will explore this idea.
Check: What does DP require that later methods (MC, TD) will not?

Chapter 1: Policy Evaluation

Given a policy π, how do we compute its value function vπ? This is the prediction problem. We know from Chapter 3 that vπ satisfies the Bellman expectation equation:

vπ(s) = ∑a π(a|s) ∑s',r p(s',r|s,a) [ r + γ vπ(s') ]

If we know p and π, this is a system of |S| linear equations in |S| unknowns. We could solve it directly with linear algebra, but that costs O(|S|3) and blows up for large state spaces. Instead, DP uses iterative policy evaluation.

Start with an arbitrary initial guess v0(s) for all states (zeroes work fine). Then apply the Bellman equation as an update rule:

vk+1(s) = ∑a π(a|s) ∑s',r p(s',r|s,a) [ r + γ vk(s') ]

Each application of this update to every state is called a sweep. After enough sweeps, vk converges to vπ. The Bellman operator is a contraction mapping — each sweep brings us strictly closer to the true values. The maximum error across all states shrinks by a factor of γ with every sweep.

Key insight: We replaced "solve this system of equations" with "iterate this update rule." The update uses the old values vk on the right side to compute new values vk+1 on the left. Convergence is guaranteed as long as γ < 1 or the episode terminates.

In practice, we typically use in-place updates: instead of maintaining two arrays (old and new), we update values one at a time in a single array. Each state immediately sees the freshest values of its neighbors. This converges faster on average, though the order of updates matters.

How do we know when to stop? We track the maximum change across all states during each sweep: Δ = maxs |vk+1(s) − vk(s)|. When Δ drops below a small threshold θ, we declare convergence. The smaller θ, the more precise our answer, but the more sweeps we need.

Convergence speed: With discount factor γ < 1, the error contracts by γ per sweep, so log(1/ε) / log(1/γ) sweeps suffice for precision ε. With γ = 0.9, that's about 22 sweeps for 10% accuracy, 44 for 1%. With γ = 1 (undiscounted, episodic), convergence is slower but still guaranteed because terminal states anchor the computation.
SymbolMeaning
π(a|s)Probability of action a in state s under policy π
p(s',r|s,a)Transition dynamics (the model)
γDiscount factor (0 ≤ γ ≤ 1)
vkValue estimate after k sweeps
θConvergence threshold (stop when max change < θ)
pseudocode
# Iterative Policy Evaluation
Input: policy π, threshold θ
Initialize V(s) = 0 for all s

repeat:
  Δ ← 0
  for each s:
    v ← V(s)
    V(s) ← ∑a π(a|s) ∑s',r p(s',r|s,a)[r + γ V(s')]
    Δ ← max(Δ, |v − V(s)|)
until Δ < θ

return V ≈ vπ

Here is a concrete Python implementation for the 4×4 gridworld:

python
import numpy as np

def policy_evaluation(grid=4, gamma=1.0, theta=1e-6):
    V = np.zeros(grid * grid)
    terminals = {0, grid * grid - 1}
    actions = [(-1,0), (1,0), (0,-1), (0,1)]  # up, down, left, right

    while True:
        delta = 0
        for s in range(grid * grid):
            if s in terminals: continue
            v = V[s]
            new_v = 0
            for dr, dc in actions:
                r, c = s // grid + dr, s % grid + dc
                sp = s if (r < 0 or r >= grid or
                         c < 0 or c >= grid) else r*grid+c
                new_v += 0.25 * (-1 + gamma * V[sp])
            V[s] = new_v
            delta = max(delta, abs(v - V[s]))
        if delta < theta: break
    return V.reshape(grid, grid)
Check: In iterative policy evaluation, what does one "sweep" consist of?

Chapter 2: Iterative Sweeps

Let's watch policy evaluation in action. Below is a 4×4 gridworld. The agent can move up, down, left, or right. Moving off the grid keeps you in place. Every move costs −1 reward. The top-left and bottom-right corners are terminal states (shaded in teal). We evaluate the equiprobable random policy — each action has probability 0.25.

All values start at zero. Each time you click Sweep, the algorithm applies one full Bellman update across all 14 non-terminal states. Watch the values evolve. States near the terminals settle first; states in the center, far from any exit, take the longest to converge to their (very negative) true values.

Policy Evaluation: 4×4 Gridworld

Random policy (π(a|s) = 0.25 for all a). Discount γ = 1. Terminal states at top-left and bottom-right. Click Sweep to advance one iteration, or Auto-run to animate.

k = 0
Speed300ms
What to notice: After just a few sweeps, values near the terminal states settle first. Values far from terminals take longer — information has to propagate across the grid one cell per sweep. By k ≈ 150, the values have converged: the center states (around −20) are the most negative because a random walk from there takes the longest to reach any terminal.

Try pausing at k=1 and inspecting the values. Only states adjacent to a terminal have changed from zero (to −1), because only they have a neighbor with a non-zero value. At k=2, the effect spreads one cell further. This wavefront of information is characteristic of synchronous sweeps.

The convergence chart at the bottom of the canvas tracks Δ — the maximum change across all states in each sweep. Watch how Δ drops rapidly at first (the easy states near terminals) then slows down (the center states are still adjusting). Convergence is geometric: each sweep reduces the maximum error by at most a factor of γ. With γ = 1, convergence is slower but still guaranteed because the terminal states anchor the values.

Sweep count vs. accuracy: You don't always need full convergence. Even 10-20 sweeps give values accurate enough for policy improvement to produce the optimal policy. This observation motivates truncated evaluation and ultimately value iteration (Chapter 7).
Check: Under the random policy on this grid, which states converge to the most negative values?

Chapter 3: The Gridworld Example

This is Example 4.1 from Sutton & Barto. The 4×4 grid has states numbered 1–14 (state 0 and state 15 are terminal). Under the equiprobable random policy with γ = 1, the converged values are:

0.0−14−20−22
−14−18−20−20
−20−20−18−14
−22−20−140.0

The pattern is symmetric about the diagonal connecting the two terminal corners. States equidistant from both terminals have equal values. The center states (around −20) are the "worst" — under random walk, it takes the longest to stumble into a terminal from there.

These values have a natural interpretation: vπ(s) = −(expected number of steps to termination from s under π). Since each step costs −1 with γ = 1, the value is exactly the negative expected hitting time. State 12 (bottom-left corner, value −22) is the worst — a random walker from there bounces around for about 22 steps before stumbling into a terminal corner.

Why this matters: These values tell you something useful even though the random policy is terrible. They quantify how bad each state is under this policy. That information is exactly what we need to improve the policy — which is the next step.

Let's verify the Bellman equation at one state. Take state 1 (top row, second cell, value −14). Its four neighbors under the random policy are:

Up: hits wall, stays at state 1. But wait — the problem states actions that go off the grid leave you in the same state. So the successor is state 1 itself: reward −1, then value −14.

Right: moves to state 2, value −20.

Down: moves to state 5, value −18.

Left: moves to state 0 (terminal), value 0.

v(1) = 0.25[−1 + (−14)] + 0.25[−1 + (−20)] + 0.25[−1 + (−18)] + 0.25[−1 + 0] = −14

Each term is: probability × [immediate reward + discounted successor value]. The four terms sum to exactly −14 — the Bellman equation is satisfied. You can verify this at any state in the converged grid.

Exercise: Try verifying state 5 (second row, second cell, value −18). Its neighbors are: up to state 1 (−14), right to state 6 (−20), down to state 9 (−20), left to state 4 (−14). Check: 0.25[−1 + (−14)] + 0.25[−1 + (−20)] + 0.25[−1 + (−20)] + 0.25[−1 + (−14)] = −18.
Check: Why is the converged value grid symmetric about the diagonal?

Chapter 4: Policy Improvement

We now know how to evaluate a policy. But knowing vπ is only half the story. The real goal is to find a better policy. Here is the idea: once you have the value function vπ, you can ask a simple question at each state: "What if I took a different action here, then followed π after that?"

That question is answered by the action-value function:

qπ(s, a) = ∑s',r p(s',r|s,a) [ r + γ vπ(s') ]

This is the value of taking action a in state s, and then following π forever after. We can compute this for every action using vπ and the model — no new evaluation needed. If qπ(s, a) > vπ(s) for some action a, then switching to action a in state s (and following π elsewhere) would be strictly better than always following π.

The greedy policy takes this to the extreme — at every state, it picks the best action:

π'(s) = argmaxa qπ(s, a) = argmaxas',r p(s',r|s,a) [ r + γ vπ(s') ]
Key insight: You don't need to search over all possible policies. You just compute values for the current policy, then act greedily with respect to those values. This "one-step lookahead" is enough to guarantee improvement. Greedy in the short term, optimal in the long run.

Going back to our gridworld example: the random policy gives value −14 at state 1. Let's compute qπ(1, a) for each action:

ActionSuccessorqπ(1, a) = −1 + vπ(s')
Up (hit wall)s1 (itself)−1 + (−14) = −15
Rights2−1 + (−20) = −21
Downs5−1 + (−18) = −19
Lefts0 (terminal)−1 + 0 = −1

The greedy choice is clear: go left (q = −1) instead of averaging all four (v = −14). That's a dramatic improvement — from an expected 14 more steps of random wandering to a single step to the terminal.

In fact, for this particular gridworld, the greedy policy with respect to the random policy's values is already optimal. One round of improvement is enough. This is not always the case, but it shows how powerful even a single improvement step can be.

One-step lookahead: Policy improvement doesn't plan far ahead. It just asks, "given where each action leads me, and the values I have, which single action is best?" Yet this myopic reasoning, repeated at every state, produces a globally better policy. The magic is in the values — they already encode the long-term consequences.
Check: How is the improved policy π' constructed from vπ?

Chapter 5: The Policy Improvement Theorem

Why does acting greedily actually produce a better policy? This is guaranteed by the policy improvement theorem, one of the most important results in RL theory.

Policy Improvement Theorem: Let π and π' be two policies. If for all states s:

qπ(s, π'(s)) ≥ vπ(s)

then π' is at least as good as π. That is, vπ'(s) ≥ vπ(s) for all s.

The condition says: at every state, the action chosen by π' must be at least as valuable (under the old policy's q-values) as what π would do. If π' is greedy with respect to vπ, this is automatically satisfied — by definition, the greedy action has the highest q-value.

The proof chains inequalities forward through time. Start at state s:

vπ(s) ≤ qπ(s, π'(s))

Now expand qπ using the Bellman equation:

= Eπ'[Rt+1 + γ vπ(St+1)] ≤ Eπ'[Rt+1 + γ qπ(St+1, π'(St+1))]

We applied the same inequality at St+1. Continuing to expand gives:

≤ Eπ'[Rt+1 + γ Rt+2 + γ2 Rt+3 + …] = vπ'(s)

The chain telescopes: each step replaces vπ(St+k) with qπ(St+k, π'(St+k)), and the final result is the total return under π', which is vπ'(s).

When does equality hold? When π' equals π everywhere — meaning the greedy policy is the same as the original. In that case, vπ already satisfies the Bellman optimality equation: vπ(s) = maxa qπ(s, a). So π is already optimal. This is how we know policy iteration has converged.

The theorem also works for stochastic policies. If π' is not deterministic but instead mixes over actions, the condition qπ(s, π'(s)) ≥ vπ(s) generalizes to the expected q-value under π' being at least as large. This flexibility is important for methods like ε-greedy improvement, where we mostly act greedy but occasionally explore.

Why this theorem is so powerful: It means we never need to search over the entire policy space. We evaluate once, compute one greedy step, and we are guaranteed progress. Repeated application converges to the global optimum — not a local one. This is rare in optimization: most iterative methods only guarantee local optima. The special structure of MDPs (the Bellman equation) gives us this global guarantee for free.
Check: What does the policy improvement theorem guarantee?

Chapter 6: Policy Iteration

Now we combine the two pieces. Policy iteration alternates between evaluation and improvement in a loop:

1. Evaluate
Compute vπ by iterative sweeps (until convergence)
2. Improve
π' = greedy(vπ) at every state
3. Check
If π' = π, stop — π is optimal. Otherwise set π ← π' and go to step 1.

Each cycle is guaranteed to strictly improve the policy (by the policy improvement theorem) unless we are already optimal. Since a finite MDP has a finite number of deterministic policies, the algorithm must terminate in a finite number of iterations. No infinite loops, no oscillation — strict monotonic improvement until optimality.

In practice, policy iteration is remarkably fast. For the 4×4 gridworld, the random policy improves to optimal in just one improvement step. For larger problems, convergence typically takes far fewer iterations than there are policies — often single digits, even for MDPs with millions of states.

How fast is "fast"? Jack's car rental problem (Example 4.2 in the book) has over 441 states and 21 possible actions per state. Policy iteration finds the optimal strategy in just 4 iterations. Each iteration does many evaluation sweeps (the inner loop), but only 4 policy changes are needed. That's 4 improvements out of 21441 possible policies — an astronomically efficient search.
Key insight: Policy iteration works because evaluation and improvement bootstrap off each other. Evaluation says "here's how good your current plan is." Improvement says "here's a better plan given those values." Repeat. This dance converges to optimality.
pseudocode
Initialize V(s) arbitrarily, π(s) arbitrarily

loop:
  # 1. Policy Evaluation
  repeat:
    Δ ← 0
    for each s:
      v ← V(s)
      V(s) ← ∑s',r p(s',r|s,π(s))[r + γ V(s')]
      Δ ← max(Δ, |v − V(s)|)
  until Δ < θ

  # 2. Policy Improvement
  stable ← true
  for each s:
    old ← π(s)
    π(s) ← argmaxas',r p(s',r|s,a)[r + γ V(s')]
    if old ≠ π(s): stable ← false
  if stable: return V, π

Notice that the evaluation step for a deterministic policy is simpler than for a stochastic one: there is no sum over actions, just a single backup for π(s). This makes each evaluation sweep cheaper. The bottleneck is the number of sweeps needed for evaluation to converge before each improvement step.

Counting policies: A deterministic policy on our 4×4 grid (14 non-terminal states, 4 actions each) has 414 ≈ 268 million possibilities. Brute-force search would evaluate each one. Policy iteration typically finds the optimum in 2–3 iterations — an astronomically faster path to the answer. This gap only widens with larger state spaces.

Here is the full cycle for our gridworld:

Start
π0 = random policy (each action prob 0.25)
Evaluate π0
~150 sweeps → vπ0 (values from −14 to −22)
Improve
π1 = greedy(vπ0): each state points to nearest terminal
Check
π1 is already optimal — evaluating it gives the same greedy policy. Done!

Total: one evaluation, one improvement, done. The optimal policy is to always move toward the nearest terminal corner. For states equidistant from both corners (the anti-diagonal), any direction toward either terminal is optimal.

Check: Why must policy iteration terminate on a finite MDP?

Chapter 7: Value Iteration

Policy iteration has an inner loop (many sweeps for evaluation) inside an outer loop (policy improvement). Do we really need to wait for evaluation to fully converge? What if we do just one sweep of evaluation and then immediately improve?

It turns out this works perfectly. Value iteration combines one sweep of the Bellman optimality equation into a single update:

vk+1(s) = maxas',r p(s',r|s,a) [ r + γ vk(s') ]

Notice the max over actions — this simultaneously evaluates and improves. Each sweep backs up the best action's value. You don't need to maintain a separate policy. After convergence, the optimal policy is simply the greedy policy with respect to v*.

Value iteration = policy evaluation where you take the max. Compare the two updates side by side. Policy evaluation averages over π(a|s). Value iteration takes the max. That single change — replacing a weighted sum with a max — turns evaluation into optimization.
Value Iteration: 4×4 Gridworld

Watch the optimal values and policy arrows emerge sweep by sweep. Terminal states (shaded) at top-left and bottom-right. Every move costs −1. Click Step to advance one Bellman optimality backup, or Auto-run to animate. Policy arrows show the greedy action(s) at each state.

k = 0
Speed300ms

The arrows tell the story. At k=1, every state adjacent to a terminal already points toward it. By k=3 or 4, most arrows have stabilized into the optimal policy: each state's arrow points in the direction that gets it to a terminal in the fewest steps. Because both corners are terminal, states along the anti-diagonal have ties — going toward either corner is equally good, so they show two arrows.

Compare speeds: Policy evaluation of the random policy takes ~150 sweeps to converge. Value iteration converges in ~7 sweeps (for the same grid). The max operator is much more aggressive — it immediately locks in the best direction, and values converge to v* directly rather than first computing vπ for a bad policy. The trade-off: each value iteration sweep does more work per state (computing max over all actions instead of a single action).

Policy Evaluation update:

v(s) ← ∑a π(a|s) · [r + γ v(s')]

Weighted average over all actions

Value Iteration update:

v(s) ← maxa [r + γ v(s')]

Best action only

pseudocode
# Value Iteration
Initialize V(s) = 0 for all s

repeat:
  Δ ← 0
  for each s:
    v ← V(s)
    V(s) ← maxas',r p(s',r|s,a)[r + γ V(s')]
    Δ ← max(Δ, |v − V(s)|)
until Δ < θ

# Extract policy
π(s) ← argmaxas',r p(s',r|s,a)[r + γ V(s')]

Let's trace one sweep of value iteration at k=0 (all values start at 0). Consider state 5 (second row, second column):

ActionLands onr + γ v0(s')
Ups1−1 + 0 = −1
Downs9−1 + 0 = −1
Lefts4−1 + 0 = −1
Rights6−1 + 0 = −1

All actions tie at −1, so v1(5) = −1. Now at k=1, state 1's neighbors include state 0 (terminal, value 0), so v2(1) = max(−1+0, −1+(−1), ...) = −1. But state 5 now sees state 1 at −1, making the "up" direction lead to −1 + (−1) = −2, while "left" to state 4 is −1 + (−1) = −2. The values deepen each sweep until they stabilize at the true optimum — each state's value equals the negative of its shortest distance to the nearest terminal.

Optimal values for this grid: Under value iteration with γ = 1, the converged v* values equal the negative Manhattan distance to the nearest terminal. State 1 (one step from s0) converges to −1. State 5 (two steps) converges to −2. State 10 (two steps from s15) converges to −2. The center state s5 is 2 steps away, so v*(5) = −2. Much better than −18 under the random policy!
Check: How does the value iteration update differ from policy evaluation?

Chapter 8: Generalized Policy Iteration

Policy iteration and value iteration look like very different algorithms, but they are two extremes of the same spectrum. Generalized Policy Iteration (GPI) is the unifying framework that encompasses both — and nearly every RL algorithm you will ever encounter.

Policy iteration: evaluate to full convergence (many sweeps), then improve once. The inner loop runs until Δ < θ.

Value iteration: evaluate for exactly one sweep, then improve immediately (implicitly, via max). No inner loop at all.

GPI says: any interleaving of evaluation and improvement converges, as long as both processes continue to operate. You can do 3 sweeps of evaluation then improve. You can evaluate one state and improve one state. You can even update a single state's value and immediately act greedily at that state. It all works.

The continuum between the two extremes is a trade-off. More evaluation sweeps before improving means more accurate values, so improvement steps are better informed. Fewer evaluation sweeps means faster iteration but potentially less efficient improvement steps. In practice, truncating evaluation early (as value iteration does) is almost always preferred — the extra accuracy rarely justifies the cost.

The GPI dance: Think of two competing forces. Evaluation adjusts values to be consistent with the current policy (pushing V toward vπ). Improvement changes the policy to be greedy with respect to current values (pushing π toward greedy(V)). Each force makes the other "wrong" — changing π invalidates V, and changing V changes the greedy policy. But together they spiral toward the fixed point where both are satisfied: v* and π*.
Evaluation
Push V toward vπ (make values consistent with policy)
↓ interact ↓
Improvement
Push π toward greedy(V) (make policy consistent with values)
↻ converge to v*, π*

This pattern appears everywhere in RL:

AlgorithmEvaluation methodImprovement method
Policy IterationFull iterative sweepsGreedy
Value IterationOne Bellman optimality sweepImplicit (max)
Monte Carlo (Ch 5)Episode returnsε-greedy
SARSA (Ch 6)TD(0) on-policyε-greedy
Q-learning (Ch 6)TD(0) off-policyGreedy (max)
Actor-Critic (Ch 13)TD criticPolicy gradient actor

The lesson: don't memorize individual algorithms. Understand GPI, and every algorithm becomes a specific instantiation of "how do I evaluate?" and "how do I improve?" The rest of the book fills in this table row by row.

The fixed point: At the optimum, both evaluation and improvement are satisfied simultaneously: V = v* (values are correct for the current policy), and π = greedy(V) (the policy is greedy with respect to current values). Since v* satisfies the Bellman optimality equation, the greedy policy with respect to v* is π*. The two forces are in balance.
Check: What does GPI say about the relationship between evaluation and improvement?

Chapter 9: Bootstrapping

There is a subtle but profound idea running through all of DP: bootstrapping. When we update the value of state s, we use the estimated values of successor states s'. We are updating estimates from estimates — not from complete, observed returns.

v(s) ← … r + γ v(s') …     ← this is an estimate, not a known fact

This is powerful because it lets us update values after considering just one transition, without waiting for the episode to end. Information propagates quickly: a change to v(s') immediately influences the next update to v(s). But it introduces a dependency — if v(s') is wrong, the update to v(s) inherits that error.

The errors correct themselves through repeated sweeps. Each sweep reduces the maximum error by a factor of γ. But the initial estimates influence the path of convergence, even if the destination is the same.

DP bootstraps. MC does not. Monte Carlo methods (Chapter 5) wait until the episode ends, then use the actual total return Gt to update values. No estimates of estimates — just cold, hard data. This avoids the bias that bootstrapping introduces, but it is slower: you learn nothing until the episode finishes.

TD methods (Chapter 6) will be the synthesis: they bootstrap like DP (using estimated successor values) but learn from experience like MC (no model required). This is why DP is so important to understand — it introduces the bootstrapping idea that TD learning inherits.

There is a subtlety worth noting. In our simulations above, we use synchronous updates: we compute all new values from the old array, then swap. This is "two-array" style. Alternatively, in-place (one-array) updates modify V immediately, so later states in the sweep see partially updated values. Both converge, but in-place is typically faster because information from already-updated states propagates within the same sweep.

MethodBootstraps?Needs model?Updates from
DPYesYesExpected values via model
MCNoNoComplete episode returns
TDYesNoOne-step sample + bootstrap
A useful analogy: Bootstrapping is like building a bridge from both sides of a river. You trust that the other side is being built correctly even though you can't see it yet. DP's guarantee is that repeated sweeps ensure both sides meet in the middle. MC, by contrast, builds the entire bridge from one end to the other — slower, but no risk of misalignment.

To see bootstrapping concretely, consider our gridworld at k=1 of policy evaluation. State 1 computes its value using v(s0) = 0 (terminal, correct) and v(s2) = 0 (not correct — it should be −20). The update gives v(1) = −1, which is wrong (the true answer is −14). But it is less wrong than v(1) = 0, and at the next sweep it will get even closer. This self-correcting behavior is the hallmark of bootstrapping methods.

Looking ahead: The distinction between bootstrapping and non-bootstrapping methods is one of the three fundamental dimensions Sutton and Barto use to organize all of RL. The other two are: model-based vs. model-free, and sample updates vs. expected updates. DP sits at one extreme (bootstrapping, model-based, expected). Pure MC sits at the opposite (non-bootstrapping, model-free, sample). Understanding where each method sits on these axes is the roadmap for the rest of the book.
Check: What does "bootstrapping" mean in the context of DP?

Chapter 10: Summary

Dynamic programming gives us the theoretical backbone of reinforcement learning. The two key operations — evaluation and improvement — and their interplay through GPI, set the pattern for every algorithm that follows.

AlgorithmWhat it doesKey property
Policy EvaluationComputes vπ for a given πIterative Bellman expectation backup
Policy ImprovementMakes π greedy w.r.t. vπGuaranteed no worse (theorem)
Policy IterationAlternates eval & improveConverges to π* in finite steps
Value IterationOne-sweep eval + improveBellman optimality backup
GPIAny eval + any improveThe universal RL pattern
Efficiency: DP is polynomial in |S| and |A|, which is exponentially faster than brute-force search over all |A||S| deterministic policies. But it still requires full sweeps over the entire state space, which becomes impractical for large or continuous problems. This is the curse of dimensionality — a term, fittingly, also coined by Bellman.

Asynchronous DP: We don't have to sweep states in a fixed order. We can update states in any order, prioritizing those whose values are changing the most (prioritized sweeping). We can even run DP in real time, updating the states the agent actually visits. These variants can dramatically speed up convergence in practice and are explored further in Chapter 8 (Planning and Learning with Tabular Methods).

Strengths of DP:

• Guaranteed optimal solution
• Polynomial complexity
• Well-understood convergence
• Foundation for all RL algorithms

Limitations of DP:

• Requires a complete model
• Full sweeps over state space
• Curse of dimensionality
• Cannot handle continuous states directly

What comes next: DP requires a perfect model. What if you don't have one? Chapter 5 introduces Monte Carlo methods, which learn value functions directly from episodes of experience — no model needed. The price is that you must wait until an episode ends to learn. Then Chapter 6 (TD learning) will combine the best of both worlds: learning from experience like MC, but bootstrapping like DP, updating after every single step.

ChapterMethodWhat it adds
4 (this one)DPThe foundational algorithms (with a model)
5Monte CarloLearning from episodes (no model)
6TD LearningOne-step learning + bootstrapping (no model)
7n-stepBridging MC and TD with n-step returns
8PlanningUsing learned models + DP (Dyna)
The big picture: DP establishes what the correct updates look like. MC and TD will show how to perform those updates without a model, from real or simulated experience. But the GPI pattern — evaluate, improve, repeat — remains the beating heart throughout. Master GPI, and the rest of the book is variations on a theme.
"The word 'dynamic' was chosen to give the impression that we were
thinking about something that changes over time."
— Richard Bellman, on naming Dynamic Programming
Check: What is the main limitation of DP that motivates Monte Carlo methods?