Computing optimal policies when you have a perfect model of the environment.
In Chapter 3 we wrote down the Bellman equations — elegant recursive relationships that the value function must satisfy. But writing down an equation and solving it are very different things. How do you actually compute the optimal value function for an MDP with thousands of states?
Dynamic programming (DP) answers this: take the Bellman equations and turn them into update rules. Start with any guess for the values, then iteratively apply the Bellman equation at every state, sweeping through the whole state space. Each sweep pushes your values closer to the truth.
There is one crucial requirement: DP needs a complete model of the environment — the full transition dynamics p(s',r|s,a). You must know every possible next state, its probability, and its reward. This is a strong assumption. Later chapters will relax it.
The term "dynamic programming" was coined by Richard Bellman in the 1950s. The "programming" refers to optimization (as in "linear programming"), not computer code. The "dynamic" refers to problems that unfold over time — sequential decision making.
Think of DP as a planning method. You sit down with the complete blueprints of the environment (all transitions, all rewards) and compute the best strategy before you ever take a single action. It is the gold standard — if you have a model, DP gives you the answer. The rest of the book asks: what if you don't have a model?
| What DP needs | What it produces |
|---|---|
| Complete model p(s',r|s,a) | Optimal value function v* |
| Reward function (embedded in model) | Optimal policy π* |
| Discount factor γ | Action-value function q* |
The chapter ahead covers four tightly linked ideas. Policy evaluation computes how good a given policy is. Policy improvement makes the policy better. Policy iteration alternates the two until optimal. Value iteration combines them into one tight loop. And GPI reveals that all RL algorithms are variations on this pattern.
Given a policy π, how do we compute its value function vπ? This is the prediction problem. We know from Chapter 3 that vπ satisfies the Bellman expectation equation:
If we know p and π, this is a system of |S| linear equations in |S| unknowns. We could solve it directly with linear algebra, but that costs O(|S|3) and blows up for large state spaces. Instead, DP uses iterative policy evaluation.
Start with an arbitrary initial guess v0(s) for all states (zeroes work fine). Then apply the Bellman equation as an update rule:
Each application of this update to every state is called a sweep. After enough sweeps, vk converges to vπ. The Bellman operator is a contraction mapping — each sweep brings us strictly closer to the true values. The maximum error across all states shrinks by a factor of γ with every sweep.
In practice, we typically use in-place updates: instead of maintaining two arrays (old and new), we update values one at a time in a single array. Each state immediately sees the freshest values of its neighbors. This converges faster on average, though the order of updates matters.
How do we know when to stop? We track the maximum change across all states during each sweep: Δ = maxs |vk+1(s) − vk(s)|. When Δ drops below a small threshold θ, we declare convergence. The smaller θ, the more precise our answer, but the more sweeps we need.
| Symbol | Meaning |
|---|---|
| π(a|s) | Probability of action a in state s under policy π |
| p(s',r|s,a) | Transition dynamics (the model) |
| γ | Discount factor (0 ≤ γ ≤ 1) |
| vk | Value estimate after k sweeps |
| θ | Convergence threshold (stop when max change < θ) |
pseudocode # Iterative Policy Evaluation Input: policy π, threshold θ Initialize V(s) = 0 for all s repeat: Δ ← 0 for each s: v ← V(s) V(s) ← ∑a π(a|s) ∑s',r p(s',r|s,a)[r + γ V(s')] Δ ← max(Δ, |v − V(s)|) until Δ < θ return V ≈ vπ
Here is a concrete Python implementation for the 4×4 gridworld:
python import numpy as np def policy_evaluation(grid=4, gamma=1.0, theta=1e-6): V = np.zeros(grid * grid) terminals = {0, grid * grid - 1} actions = [(-1,0), (1,0), (0,-1), (0,1)] # up, down, left, right while True: delta = 0 for s in range(grid * grid): if s in terminals: continue v = V[s] new_v = 0 for dr, dc in actions: r, c = s // grid + dr, s % grid + dc sp = s if (r < 0 or r >= grid or c < 0 or c >= grid) else r*grid+c new_v += 0.25 * (-1 + gamma * V[sp]) V[s] = new_v delta = max(delta, abs(v - V[s])) if delta < theta: break return V.reshape(grid, grid)
Let's watch policy evaluation in action. Below is a 4×4 gridworld. The agent can move up, down, left, or right. Moving off the grid keeps you in place. Every move costs −1 reward. The top-left and bottom-right corners are terminal states (shaded in teal). We evaluate the equiprobable random policy — each action has probability 0.25.
All values start at zero. Each time you click Sweep, the algorithm applies one full Bellman update across all 14 non-terminal states. Watch the values evolve. States near the terminals settle first; states in the center, far from any exit, take the longest to converge to their (very negative) true values.
Random policy (π(a|s) = 0.25 for all a). Discount γ = 1. Terminal states at top-left and bottom-right. Click Sweep to advance one iteration, or Auto-run to animate.
Try pausing at k=1 and inspecting the values. Only states adjacent to a terminal have changed from zero (to −1), because only they have a neighbor with a non-zero value. At k=2, the effect spreads one cell further. This wavefront of information is characteristic of synchronous sweeps.
The convergence chart at the bottom of the canvas tracks Δ — the maximum change across all states in each sweep. Watch how Δ drops rapidly at first (the easy states near terminals) then slows down (the center states are still adjusting). Convergence is geometric: each sweep reduces the maximum error by at most a factor of γ. With γ = 1, convergence is slower but still guaranteed because the terminal states anchor the values.
This is Example 4.1 from Sutton & Barto. The 4×4 grid has states numbered 1–14 (state 0 and state 15 are terminal). Under the equiprobable random policy with γ = 1, the converged values are:
| 0.0 | −14 | −20 | −22 |
| −14 | −18 | −20 | −20 |
| −20 | −20 | −18 | −14 |
| −22 | −20 | −14 | 0.0 |
The pattern is symmetric about the diagonal connecting the two terminal corners. States equidistant from both terminals have equal values. The center states (around −20) are the "worst" — under random walk, it takes the longest to stumble into a terminal from there.
These values have a natural interpretation: vπ(s) = −(expected number of steps to termination from s under π). Since each step costs −1 with γ = 1, the value is exactly the negative expected hitting time. State 12 (bottom-left corner, value −22) is the worst — a random walker from there bounces around for about 22 steps before stumbling into a terminal corner.
Let's verify the Bellman equation at one state. Take state 1 (top row, second cell, value −14). Its four neighbors under the random policy are:
Up: hits wall, stays at state 1. But wait — the problem states actions that go off the grid leave you in the same state. So the successor is state 1 itself: reward −1, then value −14.
Right: moves to state 2, value −20.
Down: moves to state 5, value −18.
Left: moves to state 0 (terminal), value 0.
Each term is: probability × [immediate reward + discounted successor value]. The four terms sum to exactly −14 — the Bellman equation is satisfied. You can verify this at any state in the converged grid.
We now know how to evaluate a policy. But knowing vπ is only half the story. The real goal is to find a better policy. Here is the idea: once you have the value function vπ, you can ask a simple question at each state: "What if I took a different action here, then followed π after that?"
That question is answered by the action-value function:
This is the value of taking action a in state s, and then following π forever after. We can compute this for every action using vπ and the model — no new evaluation needed. If qπ(s, a) > vπ(s) for some action a, then switching to action a in state s (and following π elsewhere) would be strictly better than always following π.
The greedy policy takes this to the extreme — at every state, it picks the best action:
Going back to our gridworld example: the random policy gives value −14 at state 1. Let's compute qπ(1, a) for each action:
| Action | Successor | qπ(1, a) = −1 + vπ(s') |
|---|---|---|
| Up (hit wall) | s1 (itself) | −1 + (−14) = −15 |
| Right | s2 | −1 + (−20) = −21 |
| Down | s5 | −1 + (−18) = −19 |
| Left | s0 (terminal) | −1 + 0 = −1 |
The greedy choice is clear: go left (q = −1) instead of averaging all four (v = −14). That's a dramatic improvement — from an expected 14 more steps of random wandering to a single step to the terminal.
In fact, for this particular gridworld, the greedy policy with respect to the random policy's values is already optimal. One round of improvement is enough. This is not always the case, but it shows how powerful even a single improvement step can be.
Why does acting greedily actually produce a better policy? This is guaranteed by the policy improvement theorem, one of the most important results in RL theory.
The condition says: at every state, the action chosen by π' must be at least as valuable (under the old policy's q-values) as what π would do. If π' is greedy with respect to vπ, this is automatically satisfied — by definition, the greedy action has the highest q-value.
The proof chains inequalities forward through time. Start at state s:
Now expand qπ using the Bellman equation:
We applied the same inequality at St+1. Continuing to expand gives:
The chain telescopes: each step replaces vπ(St+k) with qπ(St+k, π'(St+k)), and the final result is the total return under π', which is vπ'(s).
The theorem also works for stochastic policies. If π' is not deterministic but instead mixes over actions, the condition qπ(s, π'(s)) ≥ vπ(s) generalizes to the expected q-value under π' being at least as large. This flexibility is important for methods like ε-greedy improvement, where we mostly act greedy but occasionally explore.
Now we combine the two pieces. Policy iteration alternates between evaluation and improvement in a loop:
Each cycle is guaranteed to strictly improve the policy (by the policy improvement theorem) unless we are already optimal. Since a finite MDP has a finite number of deterministic policies, the algorithm must terminate in a finite number of iterations. No infinite loops, no oscillation — strict monotonic improvement until optimality.
In practice, policy iteration is remarkably fast. For the 4×4 gridworld, the random policy improves to optimal in just one improvement step. For larger problems, convergence typically takes far fewer iterations than there are policies — often single digits, even for MDPs with millions of states.
pseudocode Initialize V(s) arbitrarily, π(s) arbitrarily loop: # 1. Policy Evaluation repeat: Δ ← 0 for each s: v ← V(s) V(s) ← ∑s',r p(s',r|s,π(s))[r + γ V(s')] Δ ← max(Δ, |v − V(s)|) until Δ < θ # 2. Policy Improvement stable ← true for each s: old ← π(s) π(s) ← argmaxa ∑s',r p(s',r|s,a)[r + γ V(s')] if old ≠ π(s): stable ← false if stable: return V, π
Notice that the evaluation step for a deterministic policy is simpler than for a stochastic one: there is no sum over actions, just a single backup for π(s). This makes each evaluation sweep cheaper. The bottleneck is the number of sweeps needed for evaluation to converge before each improvement step.
Here is the full cycle for our gridworld:
Total: one evaluation, one improvement, done. The optimal policy is to always move toward the nearest terminal corner. For states equidistant from both corners (the anti-diagonal), any direction toward either terminal is optimal.
Policy iteration has an inner loop (many sweeps for evaluation) inside an outer loop (policy improvement). Do we really need to wait for evaluation to fully converge? What if we do just one sweep of evaluation and then immediately improve?
It turns out this works perfectly. Value iteration combines one sweep of the Bellman optimality equation into a single update:
Notice the max over actions — this simultaneously evaluates and improves. Each sweep backs up the best action's value. You don't need to maintain a separate policy. After convergence, the optimal policy is simply the greedy policy with respect to v*.
Watch the optimal values and policy arrows emerge sweep by sweep. Terminal states (shaded) at top-left and bottom-right. Every move costs −1. Click Step to advance one Bellman optimality backup, or Auto-run to animate. Policy arrows show the greedy action(s) at each state.
The arrows tell the story. At k=1, every state adjacent to a terminal already points toward it. By k=3 or 4, most arrows have stabilized into the optimal policy: each state's arrow points in the direction that gets it to a terminal in the fewest steps. Because both corners are terminal, states along the anti-diagonal have ties — going toward either corner is equally good, so they show two arrows.
Policy Evaluation update:
Weighted average over all actions
Value Iteration update:
Best action only
pseudocode # Value Iteration Initialize V(s) = 0 for all s repeat: Δ ← 0 for each s: v ← V(s) V(s) ← maxa ∑s',r p(s',r|s,a)[r + γ V(s')] Δ ← max(Δ, |v − V(s)|) until Δ < θ # Extract policy π(s) ← argmaxa ∑s',r p(s',r|s,a)[r + γ V(s')]
Let's trace one sweep of value iteration at k=0 (all values start at 0). Consider state 5 (second row, second column):
| Action | Lands on | r + γ v0(s') |
|---|---|---|
| Up | s1 | −1 + 0 = −1 |
| Down | s9 | −1 + 0 = −1 |
| Left | s4 | −1 + 0 = −1 |
| Right | s6 | −1 + 0 = −1 |
All actions tie at −1, so v1(5) = −1. Now at k=1, state 1's neighbors include state 0 (terminal, value 0), so v2(1) = max(−1+0, −1+(−1), ...) = −1. But state 5 now sees state 1 at −1, making the "up" direction lead to −1 + (−1) = −2, while "left" to state 4 is −1 + (−1) = −2. The values deepen each sweep until they stabilize at the true optimum — each state's value equals the negative of its shortest distance to the nearest terminal.
Policy iteration and value iteration look like very different algorithms, but they are two extremes of the same spectrum. Generalized Policy Iteration (GPI) is the unifying framework that encompasses both — and nearly every RL algorithm you will ever encounter.
Policy iteration: evaluate to full convergence (many sweeps), then improve once. The inner loop runs until Δ < θ.
Value iteration: evaluate for exactly one sweep, then improve immediately (implicitly, via max). No inner loop at all.
GPI says: any interleaving of evaluation and improvement converges, as long as both processes continue to operate. You can do 3 sweeps of evaluation then improve. You can evaluate one state and improve one state. You can even update a single state's value and immediately act greedily at that state. It all works.
The continuum between the two extremes is a trade-off. More evaluation sweeps before improving means more accurate values, so improvement steps are better informed. Fewer evaluation sweeps means faster iteration but potentially less efficient improvement steps. In practice, truncating evaluation early (as value iteration does) is almost always preferred — the extra accuracy rarely justifies the cost.
This pattern appears everywhere in RL:
| Algorithm | Evaluation method | Improvement method |
|---|---|---|
| Policy Iteration | Full iterative sweeps | Greedy |
| Value Iteration | One Bellman optimality sweep | Implicit (max) |
| Monte Carlo (Ch 5) | Episode returns | ε-greedy |
| SARSA (Ch 6) | TD(0) on-policy | ε-greedy |
| Q-learning (Ch 6) | TD(0) off-policy | Greedy (max) |
| Actor-Critic (Ch 13) | TD critic | Policy gradient actor |
The lesson: don't memorize individual algorithms. Understand GPI, and every algorithm becomes a specific instantiation of "how do I evaluate?" and "how do I improve?" The rest of the book fills in this table row by row.
There is a subtle but profound idea running through all of DP: bootstrapping. When we update the value of state s, we use the estimated values of successor states s'. We are updating estimates from estimates — not from complete, observed returns.
This is powerful because it lets us update values after considering just one transition, without waiting for the episode to end. Information propagates quickly: a change to v(s') immediately influences the next update to v(s). But it introduces a dependency — if v(s') is wrong, the update to v(s) inherits that error.
The errors correct themselves through repeated sweeps. Each sweep reduces the maximum error by a factor of γ. But the initial estimates influence the path of convergence, even if the destination is the same.
TD methods (Chapter 6) will be the synthesis: they bootstrap like DP (using estimated successor values) but learn from experience like MC (no model required). This is why DP is so important to understand — it introduces the bootstrapping idea that TD learning inherits.
There is a subtlety worth noting. In our simulations above, we use synchronous updates: we compute all new values from the old array, then swap. This is "two-array" style. Alternatively, in-place (one-array) updates modify V immediately, so later states in the sweep see partially updated values. Both converge, but in-place is typically faster because information from already-updated states propagates within the same sweep.
| Method | Bootstraps? | Needs model? | Updates from |
|---|---|---|---|
| DP | Yes | Yes | Expected values via model |
| MC | No | No | Complete episode returns |
| TD | Yes | No | One-step sample + bootstrap |
To see bootstrapping concretely, consider our gridworld at k=1 of policy evaluation. State 1 computes its value using v(s0) = 0 (terminal, correct) and v(s2) = 0 (not correct — it should be −20). The update gives v(1) = −1, which is wrong (the true answer is −14). But it is less wrong than v(1) = 0, and at the next sweep it will get even closer. This self-correcting behavior is the hallmark of bootstrapping methods.
Dynamic programming gives us the theoretical backbone of reinforcement learning. The two key operations — evaluation and improvement — and their interplay through GPI, set the pattern for every algorithm that follows.
| Algorithm | What it does | Key property |
|---|---|---|
| Policy Evaluation | Computes vπ for a given π | Iterative Bellman expectation backup |
| Policy Improvement | Makes π greedy w.r.t. vπ | Guaranteed no worse (theorem) |
| Policy Iteration | Alternates eval & improve | Converges to π* in finite steps |
| Value Iteration | One-sweep eval + improve | Bellman optimality backup |
| GPI | Any eval + any improve | The universal RL pattern |
Asynchronous DP: We don't have to sweep states in a fixed order. We can update states in any order, prioritizing those whose values are changing the most (prioritized sweeping). We can even run DP in real time, updating the states the agent actually visits. These variants can dramatically speed up convergence in practice and are explored further in Chapter 8 (Planning and Learning with Tabular Methods).
Strengths of DP:
• Guaranteed optimal solution
• Polynomial complexity
• Well-understood convergence
• Foundation for all RL algorithms
Limitations of DP:
• Requires a complete model
• Full sweeps over state space
• Curse of dimensionality
• Cannot handle continuous states directly
What comes next: DP requires a perfect model. What if you don't have one? Chapter 5 introduces Monte Carlo methods, which learn value functions directly from episodes of experience — no model needed. The price is that you must wait until an episode ends to learn. Then Chapter 6 (TD learning) will combine the best of both worlds: learning from experience like MC, but bootstrapping like DP, updating after every single step.
| Chapter | Method | What it adds |
|---|---|---|
| 4 (this one) | DP | The foundational algorithms (with a model) |
| 5 | Monte Carlo | Learning from episodes (no model) |
| 6 | TD Learning | One-step learning + bootstrapping (no model) |
| 7 | n-step | Bridging MC and TD with n-step returns |
| 8 | Planning | Using learned models + DP (Dyna) |