Ch 4: Dynamic Programming — Sutton & Barto RL

Chapter 0: What is DP?

In Chapter 3 we wrote down the Bellman equations — elegant recursive relationships that the value function must satisfy. But writing down an equation and solving it are very different things. How do you actually compute the optimal value function for an MDP with thousands of states?

Dynamic programming (DP) answers this: take the Bellman equations and turn them into update rules. Start with any guess for the values, then iteratively apply the Bellman equation at every state, sweeping through the whole state space. Each sweep pushes your values closer to the truth.

There is one crucial requirement: DP needs a complete model of the environment — the full transition dynamics p(s',r|s,a). You must know every possible next state, its probability, and its reward. This is a strong assumption. Later chapters will relax it.

The core idea: DP turns the Bellman equation from a statement about values ("if the values are correct, they satisfy this relationship") into an algorithm ("repeatedly apply this rule and the values will become correct"). It is the computational backbone of everything that follows.

Bellman Equation

A system of |S| simultaneous equations

↓ turn into update rule

DP Update

Sweep through all states, apply the equation

↻ repeat until convergence

Optimal Values

v_* or v_π to any desired precision

The term "dynamic programming" was coined by Richard Bellman in the 1950s. The "programming" refers to optimization (as in "linear programming"), not computer code. The "dynamic" refers to problems that unfold over time — sequential decision making.

Think of DP as a planning method. You sit down with the complete blueprints of the environment (all transitions, all rewards) and compute the best strategy before you ever take a single action. It is the gold standard — if you have a model, DP gives you the answer. The rest of the book asks: what if you don't have a model?

What DP needs	What it produces
Complete model p(s',r\|s,a)	Optimal value function v_*
Reward function (embedded in model)	Optimal policy π_*
Discount factor γ	Action-value function q_*

The chapter ahead covers four tightly linked ideas. Policy evaluation computes how good a given policy is. Policy improvement makes the policy better. Policy iteration alternates the two until optimal. Value iteration combines them into one tight loop. And GPI reveals that all RL algorithms are variations on this pattern.

DP in the real world: Despite the model requirement, DP is not just a theoretical curiosity. Many practical problems do have known models: inventory management, network routing, elevator scheduling, game AIs (chess, Go endgames), and financial derivatives pricing. DP also serves as the inner loop of model-based RL — you learn a model, then plan with DP. Chapter 8 (Dyna) will explore this idea.

Check: What does DP require that later methods (MC, TD) will not?

A large number of episodes A complete model p(s',r|s,a) of the environment A neural network to approximate values

Chapter 1: Policy Evaluation

Given a policy π, how do we compute its value function v_π? This is the prediction problem. We know from Chapter 3 that v_π satisfies the Bellman expectation equation:

v_π(s) = ∑_a π(a|s) ∑_s',r p(s',r|s,a) [ r + γ v_π(s') ]

If we know p and π, this is a system of |S| linear equations in |S| unknowns. We could solve it directly with linear algebra, but that costs O(|S|³) and blows up for large state spaces. Instead, DP uses iterative policy evaluation.

Start with an arbitrary initial guess v₀(s) for all states (zeroes work fine). Then apply the Bellman equation as an update rule:

v_k+1(s) = ∑_a π(a|s) ∑_s',r p(s',r|s,a) [ r + γ v_k(s') ]

Each application of this update to every state is called a sweep. After enough sweeps, v_k converges to v_π. The Bellman operator is a contraction mapping — each sweep brings us strictly closer to the true values. The maximum error across all states shrinks by a factor of γ with every sweep.

Key insight: We replaced "solve this system of equations" with "iterate this update rule." The update uses the old values v_k on the right side to compute new values v_k+1 on the left. Convergence is guaranteed as long as γ < 1 or the episode terminates.

In practice, we typically use in-place updates: instead of maintaining two arrays (old and new), we update values one at a time in a single array. Each state immediately sees the freshest values of its neighbors. This converges faster on average, though the order of updates matters.

How do we know when to stop? We track the maximum change across all states during each sweep: Δ = max_s |v_k+1(s) − v_k(s)|. When Δ drops below a small threshold θ, we declare convergence. The smaller θ, the more precise our answer, but the more sweeps we need.

Convergence speed: With discount factor γ < 1, the error contracts by γ per sweep, so log(1/ε) / log(1/γ) sweeps suffice for precision ε. With γ = 0.9, that's about 22 sweeps for 10% accuracy, 44 for 1%. With γ = 1 (undiscounted, episodic), convergence is slower but still guaranteed because terminal states anchor the computation.

Symbol	Meaning
π(a\|s)	Probability of action a in state s under policy π
p(s',r\|s,a)	Transition dynamics (the model)
γ	Discount factor (0 ≤ γ ≤ 1)
v_k	Value estimate after k sweeps
θ	Convergence threshold (stop when max change < θ)

pseudocode
# Iterative Policy Evaluation
Input: policy π, threshold θ
Initialize V(s) = 0 for all s

repeat:
  Δ ← 0
  for each s:
    v ← V(s)
    V(s) ← ∑_a π(a|s) ∑_s',r p(s',r|s,a)[r + γ V(s')]
    Δ ← max(Δ, |v − V(s)|)
until Δ < θ

return V ≈ v_π

Here is a concrete Python implementation for the 4×4 gridworld:

python
import numpy as np

def policy_evaluation(grid=4, gamma=1.0, theta=1e-6):
    V = np.zeros(grid * grid)
    terminals = {0, grid * grid - 1}
    actions = [(-1,0), (1,0), (0,-1), (0,1)]  # up, down, left, right

    while True:
        delta = 0
        for s in range(grid * grid):
            if s in terminals: continue
            v = V[s]
            new_v = 0
            for dr, dc in actions:
                r, c = s // grid + dr, s % grid + dc
                sp = s if (r < 0 or r >= grid or
                         c < 0 or c >= grid) else r*grid+c
                new_v += 0.25 * (-1 + gamma * V[sp])
            V[s] = new_v
            delta = max(delta, abs(v - V[s]))
        if delta < theta: break
    return V.reshape(grid, grid)

Check: In iterative policy evaluation, what does one "sweep" consist of?

Applying the Bellman update to every state once Running one episode of the policy Changing the policy to be greedy

Chapter 2: Iterative Sweeps

Let's watch policy evaluation in action. Below is a 4×4 gridworld. The agent can move up, down, left, or right. Moving off the grid keeps you in place. Every move costs −1 reward. The top-left and bottom-right corners are terminal states (shaded in teal). We evaluate the equiprobable random policy — each action has probability 0.25.

All values start at zero. Each time you click Sweep, the algorithm applies one full Bellman update across all 14 non-terminal states. Watch the values evolve. States near the terminals settle first; states in the center, far from any exit, take the longest to converge to their (very negative) true values.

Policy Evaluation: 4×4 Gridworld

Random policy (π(a|s) = 0.25 for all a). Discount γ = 1. Terminal states at top-left and bottom-right. Click Sweep to advance one iteration, or Auto-run to animate.

k = 0

Speed300ms

What to notice: After just a few sweeps, values near the terminal states settle first. Values far from terminals take longer — information has to propagate across the grid one cell per sweep. By k ≈ 150, the values have converged: the center states (around −20) are the most negative because a random walk from there takes the longest to reach any terminal.

Try pausing at k=1 and inspecting the values. Only states adjacent to a terminal have changed from zero (to −1), because only they have a neighbor with a non-zero value. At k=2, the effect spreads one cell further. This wavefront of information is characteristic of synchronous sweeps.

The convergence chart at the bottom of the canvas tracks Δ — the maximum change across all states in each sweep. Watch how Δ drops rapidly at first (the easy states near terminals) then slows down (the center states are still adjusting). Convergence is geometric: each sweep reduces the maximum error by at most a factor of γ. With γ = 1, convergence is slower but still guaranteed because the terminal states anchor the values.

Sweep count vs. accuracy: You don't always need full convergence. Even 10-20 sweeps give values accurate enough for policy improvement to produce the optimal policy. This observation motivates truncated evaluation and ultimately value iteration (Chapter 7).

Check: Under the random policy on this grid, which states converge to the most negative values?

States adjacent to the terminals States farthest from both terminals All non-terminal states have equal values

Chapter 3: The Gridworld Example

This is Example 4.1 from Sutton & Barto. The 4×4 grid has states numbered 1–14 (state 0 and state 15 are terminal). Under the equiprobable random policy with γ = 1, the converged values are:

0.0	−14	−20	−22
−14	−18	−20	−20
−20	−20	−18	−14
−22	−20	−14	0.0

The pattern is symmetric about the diagonal connecting the two terminal corners. States equidistant from both terminals have equal values. The center states (around −20) are the "worst" — under random walk, it takes the longest to stumble into a terminal from there.

These values have a natural interpretation: v_π(s) = −(expected number of steps to termination from s under π). Since each step costs −1 with γ = 1, the value is exactly the negative expected hitting time. State 12 (bottom-left corner, value −22) is the worst — a random walker from there bounces around for about 22 steps before stumbling into a terminal corner.

Why this matters: These values tell you something useful even though the random policy is terrible. They quantify how bad each state is under this policy. That information is exactly what we need to improve the policy — which is the next step.

Let's verify the Bellman equation at one state. Take state 1 (top row, second cell, value −14). Its four neighbors under the random policy are:

Up: hits wall, stays at state 1. But wait — the problem states actions that go off the grid leave you in the same state. So the successor is state 1 itself: reward −1, then value −14.

Right: moves to state 2, value −20.

Down: moves to state 5, value −18.

Left: moves to state 0 (terminal), value 0.

v(1) = 0.25[−1 + (−14)] + 0.25[−1 + (−20)] + 0.25[−1 + (−18)] + 0.25[−1 + 0] = −14

Each term is: probability × [immediate reward + discounted successor value]. The four terms sum to exactly −14 — the Bellman equation is satisfied. You can verify this at any state in the converged grid.

Exercise: Try verifying state 5 (second row, second cell, value −18). Its neighbors are: up to state 1 (−14), right to state 6 (−20), down to state 9 (−20), left to state 4 (−14). Check: 0.25[−1 + (−14)] + 0.25[−1 + (−20)] + 0.25[−1 + (−20)] + 0.25[−1 + (−14)] = −18.

Check: Why is the converged value grid symmetric about the diagonal?

The two terminal states and random policy create a symmetric structure All grids are symmetric under policy evaluation It is a coincidence of this particular grid size

Chapter 4: Policy Improvement

We now know how to evaluate a policy. But knowing v_π is only half the story. The real goal is to find a better policy. Here is the idea: once you have the value function v_π, you can ask a simple question at each state: "What if I took a different action here, then followed π after that?"

That question is answered by the action-value function:

q_π(s, a) = ∑_s',r p(s',r|s,a) [ r + γ v_π(s') ]

This is the value of taking action a in state s, and then following π forever after. We can compute this for every action using v_π and the model — no new evaluation needed. If q_π(s, a) > v_π(s) for some action a, then switching to action a in state s (and following π elsewhere) would be strictly better than always following π.

The greedy policy takes this to the extreme — at every state, it picks the best action:

π'(s) = argmax_a q_π(s, a) = argmax_a ∑_s',r p(s',r|s,a) [ r + γ v_π(s') ]

Key insight: You don't need to search over all possible policies. You just compute values for the current policy, then act greedily with respect to those values. This "one-step lookahead" is enough to guarantee improvement. Greedy in the short term, optimal in the long run.

Going back to our gridworld example: the random policy gives value −14 at state 1. Let's compute q_π(1, a) for each action:

Action	Successor	q_π(1, a) = −1 + v_π(s')
Up (hit wall)	s1 (itself)	−1 + (−14) = −15
Right	s2	−1 + (−20) = −21
Down	s5	−1 + (−18) = −19
Left	s0 (terminal)	−1 + 0 = −1

The greedy choice is clear: go left (q = −1) instead of averaging all four (v = −14). That's a dramatic improvement — from an expected 14 more steps of random wandering to a single step to the terminal.

In fact, for this particular gridworld, the greedy policy with respect to the random policy's values is already optimal. One round of improvement is enough. This is not always the case, but it shows how powerful even a single improvement step can be.

One-step lookahead: Policy improvement doesn't plan far ahead. It just asks, "given where each action leads me, and the values I have, which single action is best?" Yet this myopic reasoning, repeated at every state, produces a globally better policy. The magic is in the values — they already encode the long-term consequences.

Check: How is the improved policy π' constructed from v_π?

By randomly perturbing the current policy By running more episodes By acting greedily — picking the action that maximizes the one-step lookahead

Chapter 5: The Policy Improvement Theorem

Why does acting greedily actually produce a better policy? This is guaranteed by the policy improvement theorem, one of the most important results in RL theory.

Policy Improvement Theorem: Let π and π' be two policies. If for all states s:

q_π(s, π'(s)) ≥ v_π(s)

then π' is at least as good as π. That is, v_π'(s) ≥ v_π(s) for all s.

The condition says: at every state, the action chosen by π' must be at least as valuable (under the old policy's q-values) as what π would do. If π' is greedy with respect to v_π, this is automatically satisfied — by definition, the greedy action has the highest q-value.

The proof chains inequalities forward through time. Start at state s:

v_π(s) ≤ q_π(s, π'(s))

Now expand q_π using the Bellman equation:

= E_π'[R_t+1 + γ v_π(S_t+1)] ≤ E_π'[R_t+1 + γ q_π(S_t+1, π'(S_t+1))]

We applied the same inequality at S_t+1. Continuing to expand gives:

≤ E_π'[R_t+1 + γ R_t+2 + γ² R_t+3 + …] = v_π'(s)

The chain telescopes: each step replaces v_π(S_t+k) with q_π(S_t+k, π'(S_t+k)), and the final result is the total return under π', which is v_π'(s).

When does equality hold? When π' equals π everywhere — meaning the greedy policy is the same as the original. In that case, v_π already satisfies the Bellman optimality equation: v_π(s) = max_a q_π(s, a). So π is already optimal. This is how we know policy iteration has converged.

The theorem also works for stochastic policies. If π' is not deterministic but instead mixes over actions, the condition q_π(s, π'(s)) ≥ v_π(s) generalizes to the expected q-value under π' being at least as large. This flexibility is important for methods like ε-greedy improvement, where we mostly act greedy but occasionally explore.

Why this theorem is so powerful: It means we never need to search over the entire policy space. We evaluate once, compute one greedy step, and we are guaranteed progress. Repeated application converges to the global optimum — not a local one. This is rare in optimization: most iterative methods only guarantee local optima. The special structure of MDPs (the Bellman equation) gives us this global guarantee for free.

Check: What does the policy improvement theorem guarantee?

If π' is greedy w.r.t. v_π, then v_π'(s) ≥ v_π(s) for all s The greedy policy is always the globally optimal policy Policy evaluation always converges in 10 sweeps

Chapter 6: Policy Iteration

Now we combine the two pieces. Policy iteration alternates between evaluation and improvement in a loop:

1. Evaluate

Compute v_π by iterative sweeps (until convergence)

↓

2. Improve

π' = greedy(v_π) at every state

↓

3. Check

If π' = π, stop — π is optimal. Otherwise set π ← π' and go to step 1.

Each cycle is guaranteed to strictly improve the policy (by the policy improvement theorem) unless we are already optimal. Since a finite MDP has a finite number of deterministic policies, the algorithm must terminate in a finite number of iterations. No infinite loops, no oscillation — strict monotonic improvement until optimality.

In practice, policy iteration is remarkably fast. For the 4×4 gridworld, the random policy improves to optimal in just one improvement step. For larger problems, convergence typically takes far fewer iterations than there are policies — often single digits, even for MDPs with millions of states.

How fast is "fast"? Jack's car rental problem (Example 4.2 in the book) has over 441 states and 21 possible actions per state. Policy iteration finds the optimal strategy in just 4 iterations. Each iteration does many evaluation sweeps (the inner loop), but only 4 policy changes are needed. That's 4 improvements out of 21⁴⁴¹ possible policies — an astronomically efficient search.

Key insight: Policy iteration works because evaluation and improvement bootstrap off each other. Evaluation says "here's how good your current plan is." Improvement says "here's a better plan given those values." Repeat. This dance converges to optimality.

pseudocode
Initialize V(s) arbitrarily, π(s) arbitrarily

loop:
  # 1. Policy Evaluation
  repeat:
    Δ ← 0
    for each s:
      v ← V(s)
      V(s) ← ∑_s',r p(s',r|s,π(s))[r + γ V(s')]
      Δ ← max(Δ, |v − V(s)|)
  until Δ < θ

  # 2. Policy Improvement
  stable ← true
  for each s:
    old ← π(s)
    π(s) ← argmax_a ∑_s',r p(s',r|s,a)[r + γ V(s')]
    if old ≠ π(s): stable ← false
  if stable: return V, π

Notice that the evaluation step for a deterministic policy is simpler than for a stochastic one: there is no sum over actions, just a single backup for π(s). This makes each evaluation sweep cheaper. The bottleneck is the number of sweeps needed for evaluation to converge before each improvement step.

Counting policies: A deterministic policy on our 4×4 grid (14 non-terminal states, 4 actions each) has 4¹⁴ ≈ 268 million possibilities. Brute-force search would evaluate each one. Policy iteration typically finds the optimum in 2–3 iterations — an astronomically faster path to the answer. This gap only widens with larger state spaces.

Here is the full cycle for our gridworld:

Start

π₀ = random policy (each action prob 0.25)

↓

Evaluate π₀

~150 sweeps → v_π0 (values from −14 to −22)

↓

Improve

π₁ = greedy(v_π0): each state points to nearest terminal

↓

Check

π₁ is already optimal — evaluating it gives the same greedy policy. Done!

Total: one evaluation, one improvement, done. The optimal policy is to always move toward the nearest terminal corner. For states equidistant from both corners (the anti-diagonal), any direction toward either terminal is optimal.

Check: Why must policy iteration terminate on a finite MDP?

Because the discount factor γ is less than 1 Because there are finitely many policies and each step strictly improves Because the evaluation step always converges in one sweep

Chapter 7: Value Iteration

Policy iteration has an inner loop (many sweeps for evaluation) inside an outer loop (policy improvement). Do we really need to wait for evaluation to fully converge? What if we do just one sweep of evaluation and then immediately improve?

It turns out this works perfectly. Value iteration combines one sweep of the Bellman optimality equation into a single update:

v_k+1(s) = max_a ∑_s',r p(s',r|s,a) [ r + γ v_k(s') ]

Notice the max over actions — this simultaneously evaluates and improves. Each sweep backs up the best action's value. You don't need to maintain a separate policy. After convergence, the optimal policy is simply the greedy policy with respect to v_*.

Value iteration = policy evaluation where you take the max. Compare the two updates side by side. Policy evaluation averages over π(a|s). Value iteration takes the max. That single change — replacing a weighted sum with a max — turns evaluation into optimization.

Value Iteration: 4×4 Gridworld

Watch the optimal values and policy arrows emerge sweep by sweep. Terminal states (shaded) at top-left and bottom-right. Every move costs −1. Click Step to advance one Bellman optimality backup, or Auto-run to animate. Policy arrows show the greedy action(s) at each state.

k = 0

Speed300ms

The arrows tell the story. At k=1, every state adjacent to a terminal already points toward it. By k=3 or 4, most arrows have stabilized into the optimal policy: each state's arrow points in the direction that gets it to a terminal in the fewest steps. Because both corners are terminal, states along the anti-diagonal have ties — going toward either corner is equally good, so they show two arrows.

Compare speeds: Policy evaluation of the random policy takes ~150 sweeps to converge. Value iteration converges in ~7 sweeps (for the same grid). The max operator is much more aggressive — it immediately locks in the best direction, and values converge to v_* directly rather than first computing v_π for a bad policy. The trade-off: each value iteration sweep does more work per state (computing max over all actions instead of a single action).

Policy Evaluation update:

v(s) ← ∑_a π(a|s) · [r + γ v(s')]

Weighted average over all actions

Value Iteration update:

v(s) ← max_a [r + γ v(s')]

Best action only

pseudocode
# Value Iteration
Initialize V(s) = 0 for all s

repeat:
  Δ ← 0
  for each s:
    v ← V(s)
    V(s) ← max_a ∑_s',r p(s',r|s,a)[r + γ V(s')]
    Δ ← max(Δ, |v − V(s)|)
until Δ < θ

# Extract policy
π(s) ← argmax_a ∑_s',r p(s',r|s,a)[r + γ V(s')]

Let's trace one sweep of value iteration at k=0 (all values start at 0). Consider state 5 (second row, second column):

Action	Lands on	r + γ v₀(s')
Up	s1	−1 + 0 = −1
Down	s9	−1 + 0 = −1
Left	s4	−1 + 0 = −1
Right	s6	−1 + 0 = −1

All actions tie at −1, so v₁(5) = −1. Now at k=1, state 1's neighbors include state 0 (terminal, value 0), so v₂(1) = max(−1+0, −1+(−1), ...) = −1. But state 5 now sees state 1 at −1, making the "up" direction lead to −1 + (−1) = −2, while "left" to state 4 is −1 + (−1) = −2. The values deepen each sweep until they stabilize at the true optimum — each state's value equals the negative of its shortest distance to the nearest terminal.

Optimal values for this grid: Under value iteration with γ = 1, the converged v_* values equal the negative Manhattan distance to the nearest terminal. State 1 (one step from s0) converges to −1. State 5 (two steps) converges to −2. State 10 (two steps from s15) converges to −2. The center state s5 is 2 steps away, so v_*(5) = −2. Much better than −18 under the random policy!

Check: How does the value iteration update differ from policy evaluation?

It uses max over actions instead of averaging under a fixed policy It uses a smaller discount factor It only updates terminal states

Chapter 8: Generalized Policy Iteration

Policy iteration and value iteration look like very different algorithms, but they are two extremes of the same spectrum. Generalized Policy Iteration (GPI) is the unifying framework that encompasses both — and nearly every RL algorithm you will ever encounter.

Policy iteration: evaluate to full convergence (many sweeps), then improve once. The inner loop runs until Δ < θ.

Value iteration: evaluate for exactly one sweep, then improve immediately (implicitly, via max). No inner loop at all.

GPI says: any interleaving of evaluation and improvement converges, as long as both processes continue to operate. You can do 3 sweeps of evaluation then improve. You can evaluate one state and improve one state. You can even update a single state's value and immediately act greedily at that state. It all works.

The continuum between the two extremes is a trade-off. More evaluation sweeps before improving means more accurate values, so improvement steps are better informed. Fewer evaluation sweeps means faster iteration but potentially less efficient improvement steps. In practice, truncating evaluation early (as value iteration does) is almost always preferred — the extra accuracy rarely justifies the cost.

The GPI dance: Think of two competing forces. Evaluation adjusts values to be consistent with the current policy (pushing V toward v_π). Improvement changes the policy to be greedy with respect to current values (pushing π toward greedy(V)). Each force makes the other "wrong" — changing π invalidates V, and changing V changes the greedy policy. But together they spiral toward the fixed point where both are satisfied: v_* and π_*.

Evaluation

Push V toward v_π (make values consistent with policy)

↓ interact ↓

Improvement

Push π toward greedy(V) (make policy consistent with values)

↻ converge to v_*, π_*

This pattern appears everywhere in RL:

Algorithm	Evaluation method	Improvement method
Policy Iteration	Full iterative sweeps	Greedy
Value Iteration	One Bellman optimality sweep	Implicit (max)
Monte Carlo (Ch 5)	Episode returns	ε-greedy
SARSA (Ch 6)	TD(0) on-policy	ε-greedy
Q-learning (Ch 6)	TD(0) off-policy	Greedy (max)
Actor-Critic (Ch 13)	TD critic	Policy gradient actor

The lesson: don't memorize individual algorithms. Understand GPI, and every algorithm becomes a specific instantiation of "how do I evaluate?" and "how do I improve?" The rest of the book fills in this table row by row.

The fixed point: At the optimum, both evaluation and improvement are satisfied simultaneously: V = v_* (values are correct for the current policy), and π = greedy(V) (the policy is greedy with respect to current values). Since v_* satisfies the Bellman optimality equation, the greedy policy with respect to v_* is π_*. The two forces are in balance.

Check: What does GPI say about the relationship between evaluation and improvement?

Any interleaving of evaluation and improvement converges to optimality Evaluation must fully converge before any improvement step Improvement must be done before evaluation

Chapter 9: Bootstrapping

There is a subtle but profound idea running through all of DP: bootstrapping. When we update the value of state s, we use the estimated values of successor states s'. We are updating estimates from estimates — not from complete, observed returns.

v(s) ← … r + γ v(s') … ← this is an estimate, not a known fact

This is powerful because it lets us update values after considering just one transition, without waiting for the episode to end. Information propagates quickly: a change to v(s') immediately influences the next update to v(s). But it introduces a dependency — if v(s') is wrong, the update to v(s) inherits that error.

The errors correct themselves through repeated sweeps. Each sweep reduces the maximum error by a factor of γ. But the initial estimates influence the path of convergence, even if the destination is the same.

DP bootstraps. MC does not. Monte Carlo methods (Chapter 5) wait until the episode ends, then use the actual total return G_t to update values. No estimates of estimates — just cold, hard data. This avoids the bias that bootstrapping introduces, but it is slower: you learn nothing until the episode finishes.

TD methods (Chapter 6) will be the synthesis: they bootstrap like DP (using estimated successor values) but learn from experience like MC (no model required). This is why DP is so important to understand — it introduces the bootstrapping idea that TD learning inherits.

There is a subtlety worth noting. In our simulations above, we use synchronous updates: we compute all new values from the old array, then swap. This is "two-array" style. Alternatively, in-place (one-array) updates modify V immediately, so later states in the sweep see partially updated values. Both converge, but in-place is typically faster because information from already-updated states propagates within the same sweep.

Method	Bootstraps?	Needs model?	Updates from
DP	Yes	Yes	Expected values via model
MC	No	No	Complete episode returns
TD	Yes	No	One-step sample + bootstrap

A useful analogy: Bootstrapping is like building a bridge from both sides of a river. You trust that the other side is being built correctly even though you can't see it yet. DP's guarantee is that repeated sweeps ensure both sides meet in the middle. MC, by contrast, builds the entire bridge from one end to the other — slower, but no risk of misalignment.

To see bootstrapping concretely, consider our gridworld at k=1 of policy evaluation. State 1 computes its value using v(s0) = 0 (terminal, correct) and v(s2) = 0 (not correct — it should be −20). The update gives v(1) = −1, which is wrong (the true answer is −14). But it is less wrong than v(1) = 0, and at the next sweep it will get even closer. This self-correcting behavior is the hallmark of bootstrapping methods.

Looking ahead: The distinction between bootstrapping and non-bootstrapping methods is one of the three fundamental dimensions Sutton and Barto use to organize all of RL. The other two are: model-based vs. model-free, and sample updates vs. expected updates. DP sits at one extreme (bootstrapping, model-based, expected). Pure MC sits at the opposite (non-bootstrapping, model-free, sample). Understanding where each method sits on these axes is the roadmap for the rest of the book.

Check: What does "bootstrapping" mean in the context of DP?

Initializing values from a previous run Updating value estimates using other value estimates Sampling episodes and computing returns

Chapter 10: Summary

Dynamic programming gives us the theoretical backbone of reinforcement learning. The two key operations — evaluation and improvement — and their interplay through GPI, set the pattern for every algorithm that follows.

Algorithm	What it does	Key property
Policy Evaluation	Computes v_π for a given π	Iterative Bellman expectation backup
Policy Improvement	Makes π greedy w.r.t. v_π	Guaranteed no worse (theorem)
Policy Iteration	Alternates eval & improve	Converges to π_* in finite steps
Value Iteration	One-sweep eval + improve	Bellman optimality backup
GPI	Any eval + any improve	The universal RL pattern

Efficiency: DP is polynomial in |S| and |A|, which is exponentially faster than brute-force search over all |A|^|S| deterministic policies. But it still requires full sweeps over the entire state space, which becomes impractical for large or continuous problems. This is the curse of dimensionality — a term, fittingly, also coined by Bellman.

Asynchronous DP: We don't have to sweep states in a fixed order. We can update states in any order, prioritizing those whose values are changing the most (prioritized sweeping). We can even run DP in real time, updating the states the agent actually visits. These variants can dramatically speed up convergence in practice and are explored further in Chapter 8 (Planning and Learning with Tabular Methods).

Strengths of DP:

• Guaranteed optimal solution
• Polynomial complexity
• Well-understood convergence
• Foundation for all RL algorithms

Limitations of DP:

• Requires a complete model
• Full sweeps over state space
• Curse of dimensionality
• Cannot handle continuous states directly

What comes next: DP requires a perfect model. What if you don't have one? Chapter 5 introduces Monte Carlo methods, which learn value functions directly from episodes of experience — no model needed. The price is that you must wait until an episode ends to learn. Then Chapter 6 (TD learning) will combine the best of both worlds: learning from experience like MC, but bootstrapping like DP, updating after every single step.

Chapter	Method	What it adds
4 (this one)	DP	The foundational algorithms (with a model)
5	Monte Carlo	Learning from episodes (no model)
6	TD Learning	One-step learning + bootstrapping (no model)
7	n-step	Bridging MC and TD with n-step returns
8	Planning	Using learned models + DP (Dyna)

The big picture: DP establishes what the correct updates look like. MC and TD will show how to perform those updates without a model, from real or simulated experience. But the GPI pattern — evaluate, improve, repeat — remains the beating heart throughout. Master GPI, and the rest of the book is variations on a theme.

"The word 'dynamic' was chosen to give the impression that we were
thinking about something that changes over time."
— Richard Bellman, on naming Dynamic Programming

Check: What is the main limitation of DP that motivates Monte Carlo methods?

DP requires a complete model of the environment DP is too slow to ever converge DP only works for episodic tasks