Kochenderfer, Wheeler & Wray, Chapter 7

Exact Solution Methods

MDPs, policy evaluation, value iteration, policy iteration, LQR — computing optimal decisions under uncertainty.

Prerequisites: Probability + Basic linear algebra. That’s it.
12
Chapters
8+
Simulations
12
Quizzes

Chapter 0: Why Sequential Decisions Are Hard

A robot is navigating a grid to reach a goal. It has a map, it has motors, and it knows its target. Easy enough. But every time it sends a “move north” command, there is a 20% chance the wheels slip and it drifts east instead. Now what?

Unlike deterministic planning — where you compute one path and execute it — stochastic environments force you to reason about sequences of decisions under uncertainty. The same action from the same state might lead to different outcomes. You cannot plan a fixed path; you need a policy that tells you what to do in any situation you might land in.

The key shift: We’re not planning a sequence of actions. We’re computing a function π(s) that maps every state to the best action. That function — the policy — is the solution to a stochastic sequential decision problem.

The standard mathematical model for this class of problems is a Markov decision process (MDP). An MDP packages five ingredients:

ComponentSymbolWhat it says
State spaceSThe finite set of all possible situations
Action spaceAThe finite set of all possible moves
Transition functionT(s′|s,a)Prob. of landing in s′ after doing a in s
Reward functionR(s,a)Expected reward for taking action a in state s
Discount factorγHow much we value future rewards (0 ≤ γ < 1)

The Markov property is the defining assumption: the distribution over next states T(s′|s,a) depends only on the current state and action — not on how we got there. No memory of past states is needed. This is what makes the math tractable.

Utility: What Are We Maximizing?

For a finite horizon of T steps, the total utility is just the sum of rewards: U = r1 + r2 + … + rT. But for an infinite horizon problem, this sum can blow up. We need to discount future rewards.

The discounted return is:

U = r1 + γr2 + γ2r3 + … = ∑t=1 γt−1 rt

If γ = 0.9, a reward 10 steps away is worth 0.910 ≈ 0.35 of its face value. A reward 20 steps away is worth only 0.920 ≈ 0.12. Discounting makes far-future rewards matter less — and crucially, it guarantees the infinite sum converges as long as rewards are bounded.

Exercise 7.1 worked out: Suppose you get the same reward r every step forever. What is the total discounted utility? Let S = r + γr + γ2r + … . Multiply both sides by γ: γS = γr + γ2r + … . Subtract: S − γS = r, so S(1−γ) = r, giving S = r/(1−γ). For r = 1 and γ = 0.9, the sum is 10. The discount factor acts as a geometric series multiplier.

A policy π(s) maps every state to an action. For infinite-horizon MDPs, the optimal policy is stationary (doesn’t change with time) and deterministic (picks one action per state). The value function Uπ(s) is the expected discounted return from starting in state s and following policy π forever:

Uπ(s) = 𝔼π[ r1 + γr2 + γ2r3 + … | s1 = s ]

The optimal value function U*(s) = maxπ Uπ(s) gives the best achievable expected return from each state. Finding U* — and the policy that achieves it — is what the rest of this chapter is about.

A Worked Example: Exercise 7.2

Consider a 5-state MDP: states s1…s5. Two actions: stay (remain in current state) and continue (move to the next state). The only nonzero reward is R(s4, continue) = 10, which takes you to s5.

If we follow the optimal policy from s1: always continue. We pass through s1 → s2 → s3 → s4, collecting zero reward at each step, then at step 4 we take the reward. The discounted value is:

U*(s1) = γ3 × 10 = 1

Solving for γ: γ3 = 0.1, so γ = 0.11/3 ≈ 0.464. A discount factor of about 46% makes a reward 3 steps away worth exactly 1 unit today.

MDP Grid World: Discount Factor Reach

A 6×6 grid. The goal (green) is at bottom-right with reward +20. All other cells show the discounted value of the goal reward as a function of shortest-path distance. Drag γ to see how discounting “reaches” across the grid.

γ 0.80

High γ: distant rewards stay large (blue). Low γ: only nearby cells have significant value (dark).

Real-world example: Aircraft Collision Avoidance

ACAS X, the FAA’s collision avoidance system, is literally solved as an MDP. States encode positions and velocities of both aircraft. Actions are climb, descend, or level. Collision gives a large negative reward (−10000). Unnecessary maneuvering gives a small negative reward (−10). The system pre-computes the optimal policy over millions of states offline, then looks up the right action in real time. It outperforms the previous rule-based system both in safety and pilot annoyance metrics.

python
from dataclasses import dataclass
from typing import List, Callable

@dataclass
class MDP:
    S: List          # state space (list of states)
    A: List          # action space (list of actions)
    T: Callable      # T(s, a, sp) -> transition probability
    R: Callable      # R(s, a) -> expected immediate reward
    gamma: float     # discount factor in [0, 1)

# Example: 3-state chain. Two actions: left (0), right (1).
# Right from state 2 gives reward 10. All else gives -1.
def chain_T(s, a, sp):
    if a == 1:  # right: move to s+1 (stay if at edge)
        target = min(s + 1, 2)
    else:      # left: move to s-1 (stay if at edge)
        target = max(s - 1, 0)
    return 1.0 if sp == target else 0.0

def chain_R(s, a):
    return 10.0 if (s == 1 and a == 1) else -1.0

mdp = MDP(S=[0,1,2], A=[0,1], T=chain_T, R=chain_R, gamma=0.9)
A robot gets reward 100 at a goal exactly 5 steps away, with γ = 0.9. What is the discounted value of this reward from the robot’s current state?

Chapter 1: Policy Evaluation — What Is a Policy Worth?

Before we can optimize, we need to evaluate. Given a fixed policy π, how good is it? The answer is the value function Uπ(s): the expected discounted return starting from s and following π forever.

Computing Uπ is called the prediction problem. It has a clean recursive structure. From any state s, following policy π gives immediate reward R(s, π(s)), then we land in state s′ with probability T(s′|s, π(s)), and get value Uπ(s′) from there. Putting it together, the Bellman expectation equation (eq. 7.6) must hold at every state simultaneously:

Uπ(s) = R(s, π(s)) + γ ∑s′ T(s′|s, π(s)) Uπ(s′)

This is a system of |S| equations in |S| unknowns. Two ways to solve it:

Iterative method (Algorithm 7.3): Start with U(s) = 0 for all s. Repeatedly apply the lookahead equation (eq. 7.5):

Uk+1π(s) = R(s,π(s)) + γ∑s′ T(s′|s,π(s)) Ukπ(s′)

Repeat until values stop changing. Cheap per iteration; needs many iterations.

Exact method (Algorithm 7.4): Write in matrix form:

Uπ = Rπ + γTπUπ

Solve: Uπ = (I − γTπ)−1 Rπ. Exact in one shot; costs O(|S|3) for matrix inversion.

Why Does the Iterative Method Converge?

The update is a contraction mapping. Define the Bellman operator Bπ as “apply one round of the update to every state.” For any two value functions U1 and U2:

||BπU1 − BπU2|| ≤ γ ||U1 − U2||

After one application, the gap between any two guesses shrinks by at least a factor of γ. Because γ < 1, repeated application squeezes the gap to zero — regardless of starting point. The fixed point they converge to is exactly Uπ.

Why the factor is γ exactly: In the update Uk+1(s) = R(s,π(s)) + γ∑T(s′|s,π(s))Uk(s′), the reward R cancels in the difference between two guesses (it’s the same for both). What’s left is γ times the sum of transition-weighted differences in Uk values. Since probabilities sum to 1, the weighted sum can’t exceed the max difference. So the gap shrinks by exactly γ. Not γ2, not something clever — just γ, directly from the formula.

Iterative Evaluation: Worked Trace

Consider the 3-state chain MDP from Chapter 0 (states 0, 1, 2; actions left/right). Policy π: always go right. γ = 0.9. R(1, right) = 10, all else −1.

Iteration 0: U = [0, 0, 0].

Iteration 1: State 0: R(0,right)=−1, T(1|0,right)=1, so U(0)=−1+0.9×0=−1. State 1: R(1,right)=10, T(2|1,right)=1, so U(1)=10+0.9×0=10. State 2: stays at 2 (wall), R=−1, U(2)=−1.

Iteration 2: U(0)=−1+0.9×10=8. U(1)=10+0.9×(−1)=9.1. U(2)=−1+0.9×(−1)=−1.9.

Iteration 3: U(0)=−1+0.9×9.1=7.19. U(1)=10+0.9×(−1.9)=8.29. U(2)=−1+0.9×(−1.9)=−2.71.

Values are propagating backward from the reward. State 0 is slowly learning it’s one step away from a big payoff. The exact fixed point (via matrix inversion) is Uπ = [−3.45, 6.36, −5.91] — which is what the iteration converges to over many steps. (The negative values at states 2 reflect the “dead end” of hitting the wall repeatedly.)

Convergence rate: After k iterations, the error is at most γk × ||U0 − Uπ||. For γ = 0.9, reducing error by 100× requires k ≥ log(0.01)/log(0.9) ≈ 44 iterations. For γ = 0.99, the same accuracy needs ~459 iterations. High discount factors make convergence painfully slow — the fundamental speed-accuracy tradeoff in all DP methods.
Iterative Policy Evaluation: Step by Step

A 4×4 grid. Policy: always move right (or stay if at wall). Goal: bottom-right cell (+10). All else: −1 reward. Click Step to apply one round of the Bellman update. Watch values propagate from right to left.

k = 0 δ = —
python
import numpy as np

def iterative_policy_evaluation(mdp, pi, k_max=100, eps=1e-6):
    """Evaluate policy pi by repeated Bellman updates (Algorithm 7.3)."""
    U = {s: 0.0 for s in mdp.S}
    for _ in range(k_max):
        U_new = {}
        for s in mdp.S:
            a = pi[s]
            U_new[s] = mdp.R(s, a) + mdp.gamma * sum(
                mdp.T(s, a, sp) * U[sp] for sp in mdp.S
            )
        delta = max(abs(U_new[s] - U[s]) for s in mdp.S)
        U = U_new
        if delta < eps:
            break
    return U

def exact_policy_evaluation(mdp, pi):
    """Exact matrix solve: U = (I - gamma*T_pi)^{-1} R_pi (Algorithm 7.4)."""
    n = len(mdp.S)
    idx = {s: i for i, s in enumerate(mdp.S)}
    R_pi = np.array([mdp.R(s, pi[s]) for s in mdp.S])
    T_pi = np.array([[mdp.T(s, pi[s], sp) for sp in mdp.S] for s in mdp.S])
    U_vec = np.linalg.solve(np.eye(n) - mdp.gamma * T_pi, R_pi)
    return {s: U_vec[idx[s]] for s in mdp.S}
Policy evaluation converges because the Bellman update is a contraction mapping. What property of γ is essential for this?

Chapter 2: Greedy Policies and Q-Functions

Suppose someone hands you a value function U. Maybe it’s an estimate, maybe it’s exact. Either way, you want to extract the best policy from it. The natural idea: at each state, pick the action that looks best over the next step.

Formally, the greedy policy with respect to U (eq. 7.11) is:

π(s) = arg maxa [ R(s,a) + γ ∑s′ T(s′|s,a) U(s′) ]

If U = U* (optimal), the greedy policy is optimal. If U is approximate, the greedy policy is approximately optimal — and we can bound exactly how suboptimal it is.

The Q-Function

The expression inside the argmax has a name: the action value function (or Q-function), defined in eq. 7.12:

Q(s,a) = R(s,a) + γ ∑s′ T(s′|s,a) U(s′)

Q(s,a) is the expected return of: (1) taking action a in state s, then (2) acting greedily forever after (using value function U to decide). From Q, we get everything for free:

FunctionFormulaPolicy extraction
Value U(s)Expected return under optimal actionsNeed T, R to compute lookahead first
Q(s,a)R(s,a) + γ∑T(s′|s,a)U(s′)Just argmaxa Q(s,a) — no T, R needed
U(s) from Qmaxa Q(s,a)Trivial once Q is known
Why Q matters in RL: In reinforcement learning, we often don’t know T and R explicitly. We only observe samples. The Q-function lets us extract the greedy policy from data alone — just take argmax over actions. This is why Q-learning and DQN operate on Q rather than U. The model-free advantage of Q is real: you bypass the transition model entirely at decision time.

The Advantage Function

The advantage function A(s,a) (eq. 7.15) measures how much better action a is compared to the best action:

A(s,a) = Q(s,a) − U(s)

Since U(s) = maxa Q(s,a), the greedy action has A(s, π(s)) = 0. Every other action has A(s,a) ≤ 0 — a negative advantage tells you exactly how much you lose by choosing that action.

Verified fact: A(s,a) ≤ 0 for all s, a. Equality holds (A=0) only for greedy actions. This means you can safely ignore any action with A << 0 — it’s far worse than the best. Advantage estimates are used in actor-critic algorithms (Ch. 13) to tell the actor which actions to reinforce.

Worked Example: Q-Table From U

Here is a toy 2×3 grid MDP with γ = 0.9. Goal state (bottom-right) has been solved; its value is U*(goal) = 0. All move costs are −1. Suppose the value function at state (0,0) is U(0,0) = −5, and from there you can move right to (0,1) with U(0,1) = −4, or down to (1,0) with U(1,0) = −4.

Q(s, right) = −1 + 0.9×(−4) = −1 − 3.6 = −4.6.

Q(s, down) = −1 + 0.9×(−4) = −4.6 (tie; both neighbors equally far from goal).

Q(s, left) = −1 + 0.9×U(wall=stay) = −1 + 0.9×(−5) = −5.5 (wall, loops to same state).

Q(s, up) = −1 + 0.9×U(wall=stay) = −5.5 (same; another wall).

So U(s) = max Q = −4.6. A(s, right) = −4.6−(−4.6) = 0. A(s, left) = −5.5−(−4.6) = −0.9. Going left wastes 0.9 units of expected return.

Interactive Q-Table Viewer

A 4×4 grid with precomputed Q-values. Each cell shows four arrows (up/down/left/right) sized by Q(s,a). The brightest arrow is the greedy action. Toggle to display U(s) or A(s,a) at each cell.

python
def lookahead(mdp, U, s, a):
    """Q(s,a): one-step lookahead value (Algorithm 7.2)."""
    return mdp.R(s, a) + mdp.gamma * sum(
        mdp.T(s, a, sp) * U[sp] for sp in mdp.S
    )

def greedy(mdp, U, s):
    """Extract greedy action at state s."""
    return max(mdp.A, key=lambda a: lookahead(mdp, U, s, a))

def Q_table(mdp, U):
    """Full Q-table: Q[s][a] for all states and actions."""
    return {s: {a: lookahead(mdp, U, s, a) for a in mdp.A} for s in mdp.S}

def advantage(mdp, U, s, a):
    """A(s,a) = Q(s,a) - U(s). Always <= 0."""
    return lookahead(mdp, U, s, a) - U[s]
If Q(s,up)=5, Q(s,down)=3, Q(s,left)=7, Q(s,right)=2, what is U(s) and A(s,down)?

Chapter 3: Policy Iteration

We know how to evaluate a policy (Chapter 1) and how to improve it (Chapter 2). The obvious question: what if we alternated? Evaluate, then improve, then evaluate again — until the policy stops changing?

That’s exactly policy iteration (Algorithm 7.6). Starting from any policy π, it cycles:

1. Policy Evaluation
Compute Uπ exactly or iteratively — the true value of following π from every state
2. Policy Improvement
π′(s) = argmaxa[R(s,a) + γ∑T(s′|s,a)Uπ(s′)] — be greedy everywhere
3. Check
If π′ = π, stop. Otherwise set π ← π′ and go back to step 1.

Why Does It Converge?

Three facts lock together to guarantee convergence:

(1) Monotonic improvement. The policy improvement theorem says: if π′ is greedy with respect to Uπ, then Uπ′(s) ≥ Uπ(s) for all states. The new policy is never worse. Proof: at each state, π′ picks the best one-step lookahead, which is at least as large as what π would have gotten. Chain this inequality forward through all future states (just like we did in Chapter 2) and you get Uπ′(s) ≥ Uπ(s) globally.

(2) Finiteness. A finite MDP with |S| states and |A| actions has at most |A||S| deterministic policies (each state independently chooses one of |A| actions). Each improvement step either finds a strictly better policy or terminates. No cycles are possible — we can’t visit the same policy twice, because each visit strictly improves it.

(3) Optimality at termination. If π′ = π, then the greedy policy equals the current one, which means Uπ(s) = maxa[R(s,a) + γ∑T(s′|s,a)Uπ(s′)] for every state. That IS the Bellman optimality equation. So Uπ = U* and π = π*.

Speed in practice: In principle there are up to 425 ≈ 1015 policies for a 5×5 grid with 4 actions. In practice, policy iteration on that problem converges in 2–4 outer iterations. The exponential bound is catastrophically pessimistic. Policy iteration typically solves small-to-medium MDPs in single-digit outer iterations — even though each iteration costs O(|S|3) for the exact solve.

The Modified Policy Iteration Spectrum

Using full matrix evaluation at each step is expensive. What if we only run a few sweeps of iterative evaluation before improving? This gives modified policy iteration:

Evaluation sweeps per improvementAlgorithmPer-iteration cost
Until convergence (∞ sweeps)Standard policy iterationO(|S|3) (matrix solve)
m sweeps (m > 1)Modified policy iterationO(m×|S|2×|A|)
1 sweepValue iterationO(|S|2×|A|)
Generalized Policy Iteration (GPI): Any interleaving of partial evaluation and improvement converges to the optimal policy and value function. This insight — formalized by Sutton & Barto as GPI — unifies all DP methods. PI and VI are just two extremes of the same family. Most practical algorithms (actor-critic, Q-learning, TD(λ)) are GPI in disguise.

Hex World Example (Figure 7.5 from the book)

Start with the policy “always go east.” Iteration 1: evaluate (values propagate from reward state), improve (greedy extraction changes some arrows near the reward). Iteration 2: values update again, more arrows change. Iteration 3: converged. Only 3 full iterations for the exact optimal policy.

The early iterations change many arrows. Later iterations change only a few. Convergence is front-loaded — the policy improves fastest at the start, then makes tiny refinements near the end.

Policy Iteration: Step by Step on a 5×5 Grid

Goal (green, bottom-right): +10. All moves: −1. γ = 0.9. Initial policy: always right. Click Evaluate to compute Uπ, then Improve to extract the greedy policy. Watch arrows change.

Step 1: click Evaluate
python
def policy_iteration(mdp, pi_init=None, k_eval=50):
    """Algorithm 7.6: Policy iteration."""
    # Initialize to first action if not provided
    pi = pi_init or {s: mdp.A[0] for s in mdp.S}
    while True:
        # 1. Evaluate current policy
        U = iterative_policy_evaluation(mdp, pi, k_max=k_eval)
        # 2. Improve greedily
        pi_new = {s: greedy(mdp, U, s) for s in mdp.S}
        # 3. Check convergence
        if all(pi_new[s] == pi[s] for s in mdp.S):
            return pi, U
        pi = pi_new
Policy iteration is guaranteed to converge in at most how many outer iterations for an MDP with |S| states and |A| actions?

Chapter 4: Value Iteration (Showcase)

What if we skipped the separate evaluation step entirely? Instead of first computing Uπ and then extracting the greedy policy, we combine them into a single update. At every state, we immediately apply the optimal Bellman equation — the max over all actions — and use that as the new value estimate.

This is the Bellman backup (eq. 7.16), the engine of value iteration:

Uk+1(s) = maxa [ R(s,a) + γ ∑s′ T(s′|s,a) Uk(s′) ]

Starting from U(s) = 0 everywhere and applying this update to every state repeatedly, we converge to the optimal value function U*, which satisfies the Bellman optimality equation (eq. 7.17):

U*(s) = maxa [ R(s,a) + γ ∑s′ T(s′|s,a) U*(s′) ]

Why Does It Converge?

The max operator does not break the contraction. For any two value functions U1 and U2, applying the Bellman optimality operator T*:

||T*U1 − T*U2|| ≤ γ ||U1 − U2||

The key identity is |maxa f(a) − maxa g(a)| ≤ maxa |f(a) − g(a)|. Both players pick their best action; the gap between their best values can’t exceed the worst-case gap across all actions. And that worst-case gap is bounded by γ times the original difference in U1, U2. Since γ < 1, repeated application squeezes any initial error to zero.

Stopping and Error Bounds

We can’t measure ||Uk − U*|| directly (we don’t know U*). Instead, we track the Bellman residual:

δ = maxs |Uk+1(s) − Uk(s)|

The true error is bounded by: ||Uk − U*|| ≤ δγ/(1−γ). And the policy loss — how much worse the extracted policy is than optimal — is bounded by: ||Uπ − U*|| ≤ 2εγ/(1−γ) where ε is the value error.

The (1−γ) problem: For γ = 0.99, the bound blows up by factor 99. A residual of δ = 0.01 only guarantees true error ≤ 0.01 × 0.99/0.01 = 0.99. You need δ ≈ 0.0001 to get true error below 0.01. This is why high discount factors make value iteration slow: the algorithm needs to run much longer before the residual criterion is tight.

Worked Example: 3-State MDP by Hand

States: s0, s1, s2 (chain). Actions: stay (s), advance (a). R(s1, advance) = 10; all else −1. Advance from s0 goes to s1; advance from s1 goes to s2; s2 is absorbing (loops). Discount γ = 0.9.

k=0: U = [0, 0, 0].

k=1:
 s0: max(stay: −1+0.9×0=−1, advance: −1+0.9×0=−1) = −1
 s1: max(stay: −1+0.9×0=−1, advance: 10+0.9×0=10) = 10
 s2: max(stay: −1+0.9×0=−1, advance: −1+0.9×0=−1) = −1

k=2:
 s0: max(−1+0.9×(−1)=−1.9, −1+0.9×10=8) = 8 ← advance is now clearly better!
 s1: max(−1+0.9×10=8, 10+0.9×(−1)=9.1) = 9.1
 s2: max(−1+0.9×(−1)=−1.9, −1+0.9×(−1)=−1.9) = −1.9

k=3:
 s0: advance: −1+0.9×9.1=7.19
 s1: advance: 10+0.9×(−1.9)=8.29
 s2: — = −2.71

Already at k=2, value iteration “learned” that s0 should advance (reaching s1 gives 10). The backup propagates optimism: once s1’s value is updated to 10, that knowledge immediately ripples to s0 on the next step.

Value Iteration: Full Interactive Showcase

5×5 grid world. Drag sliders to change γ (discount) and slip probability (stochastic transitions). Step advances one Bellman backup. Run animates. Watch the Bellman residual decay and optimal policy arrows emerge.

k = 0 δ = —
γ0.90
slip %10%

Bellman residual δ over iterations (log scale). Convergence criterion: δ < 0.001.

python
def bellman_backup(mdp, U, s):
    """One Bellman backup at state s (Algorithm 7.7)."""
    return max(
        mdp.R(s, a) + mdp.gamma * sum(mdp.T(s, a, sp) * U[sp] for sp in mdp.S)
        for a in mdp.A
    )

def value_iteration(mdp, eps=1e-6, k_max=1000):
    """Algorithm 7.8: Value iteration with Bellman residual stopping."""
    U = {s: 0.0 for s in mdp.S}
    for k in range(k_max):
        U_new = {s: bellman_backup(mdp, U, s) for s in mdp.S}
        delta = max(abs(U_new[s] - U[s]) for s in mdp.S)
        U = U_new
        if delta < eps * (1 - mdp.gamma) / mdp.gamma:
            # Tighter stop: guarantees true error < eps
            break
    policy = {s: greedy(mdp, U, s) for s in mdp.S}
    return policy, U
Value iteration with γ = 0.99 takes many more iterations than γ = 0.5. Why?

Chapter 5: Asynchronous Value Iteration

Standard value iteration is synchronous: in each iteration, we compute new values for ALL states simultaneously, using only the old values. Then we swap in the new table and repeat. Clean, correct — but it ignores useful information. When we compute U(s5), we already have a fresh U(s1). Why not use it?

Asynchronous value iteration (Algorithm 7.9) updates states one at a time, in place, using the most recent value of every state already updated this sweep:

U(s) ← maxa [ R(s,a) + γ ∑s′ T(s′|s,a) U(s′) ]

The difference: when we update U(s5), we use the already-fresh U(s1), U(s2), etc. This is the Gauss-Seidel method, adapted from linear algebra to dynamic programming.

Why this helps: Imagine a linear chain 0 → 1 → 2 → 3 → [reward]. In synchronous VI, after iteration 1 only state 3 has a nonzero value. After iteration 2, only states 2 and 3. The reward propagates one step per iteration. In Gauss-Seidel VI sweeping left-to-right (3, 2, 1, 0), the same sweep propagates the reward from 3 all the way back to 0 in a single pass. It converges in 1 sweep instead of 4.

State Ordering Matters

The order in which we update states controls how fast information flows. There are three natural orderings:

OrderingBehaviorBest for
Toward reward (backward)Values propagate in one passChain MDPs with known reward location
Away from reward (forward)Values propagate slowly, one step per sweepNo benefit over synchronous VI
RandomIntermediate; converges but unpredictablyUnknown structure; still guaranteed to converge
Backward induction as a special case: If the state includes a time step t = 1, 2, …, H (finite horizon), then we can sweep backward from t=H to t=1. At t=H the values are just immediate rewards (no future). At t=H−1 we look one step ahead to the fresh t=H values. This propagates optimally in exactly H sweeps — the classic finite-horizon DP. Backward induction IS Gauss-Seidel VI with the perfect ordering.

One more key fact: even with a bad ordering, asynchronous VI cannot converge slower than synchronous VI. It can only be the same or faster. The only requirement for convergence: every state must be updated infinitely often. Any ordering that visits every state repeatedly works.

Worked Comparison: Same Problem, Two Orderings

Linear chain of 5 states (s0→s1→s2→s3→s4). Reward at s4: +10. γ = 0.9. All transitions deterministic (always advance).

Left-to-right ordering (s0, s1, s2, s3, s4): Sweep 1 updates s0 first — it sees U(s1)=0, stays at 0. Then s1 sees U(s2)=0 → stays 0. … Only s4 has the reward. After 1 sweep: same as synchronous VI. Needs ~5 sweeps to fully propagate.

Right-to-left ordering (s4, s3, s2, s1, s0): Sweep 1 updates s4 first: U(s4) = 10. Then s3 sees fresh U(s4)=10: U(s3) = −1 + 0.9×10 = 8. Then s2 sees fresh U(s3)=8: U(s2) = 6.2. s1: 4.58. s0: 3.12. Full propagation in 1 sweep!

Synchronous vs. Gauss-Seidel Value Iteration

Same 6-cell chain. Left half: synchronous VI (updates all cells from old table each step). Right half: Gauss-Seidel VI. Choose sweep direction. Watch Gauss-Seidel converge faster with right-to-left ordering.

Sweep 0
python
def gauss_seidel_vi(mdp, ordering=None, k_max=1000, eps=1e-6):
    """Algorithm 7.9: Gauss-Seidel (asynchronous) value iteration."""
    # ordering: list of states in update order. Default: mdp.S as-is.
    ordering = ordering or mdp.S
    U = {s: 0.0 for s in mdp.S}
    for _ in range(k_max):
        delta = 0.0
        for s in ordering:           # update in-place, use fresh values immediately
            u_new = bellman_backup(mdp, U, s)
            delta = max(delta, abs(u_new - U[s]))
            U[s] = u_new              # write back immediately (in-place)
        if delta < eps:
            break
    return {s: greedy(mdp, U, s) for s in mdp.S}, U

# Backward induction: sweep from last time step to first
def backward_induction(mdp_h, H):
    """Finite-horizon DP: sweep backward from t=H. Returns policy for each t.
    mdp_h.S assumed to contain (state, time) tuples.
    """
    U = {(s, H): mdp_h.terminal_reward(s) for s in mdp_h.base_states}
    policies = {}
    for t in range(H - 1, -1, -1):    # sweep backward
        for s in mdp_h.base_states:
            best_a, best_v = None, -float('inf')
            for a in mdp_h.A:
                v = mdp_h.R(s, a) + mdp_h.gamma * sum(
                    mdp_h.T(s, a, sp) * U.get((sp, t+1), 0)
                    for sp in mdp_h.base_states
                )
                if v > best_v: best_a, best_v = a, v
            U[(s, t)] = best_v
            policies[(s, t)] = best_a
    return policies, U
In Gauss-Seidel value iteration, why does the order of state updates affect convergence speed?

Chapter 6: Linear Program Formulation

Value iteration and policy iteration are iterative algorithms — they converge asymptotically. Is there a way to compute U* in one shot, as a direct optimization? Yes: rewrite the Bellman optimality equation as a linear program.

The Bellman optimality equation requires U*(s) = maxa[R(s,a) + γ∑T(s′|s,a)U*(s′)] for every state s. The “max” is the hard part — it makes the equation nonlinear. The LP trick: replace the max with a constraint. Instead of saying “U(s) equals the best lookahead,” we say “U(s) is at least as large as every lookahead”:

U(s) ≥ R(s,a) + γ ∑s′ T(s′|s,a) U(s′)     ∀s, a

This is a system of |S|×|A| linear constraints in |S| variables. The feasible set is a convex polyhedron. But which point in this polyhedron should we pick? Any U satisfying the constraints is an overestimate of U*. The smallest feasible U is exactly U*.

So the LP is (eq. 7.19–7.20):

minimize   ∑s U(s)
subject to   U(s) ≥ R(s,a) + γ ∑s′ T(s′|s,a) U(s′)   ∀ s ∈ S, a ∈ A
Why minimize the sum? Any solution satisfying all the constraints is an overestimate of U*. The objective “minimize ∑U(s)” pushes U as far down as possible while still satisfying every constraint. When U can’t be pushed any lower without violating some constraint, the active constraint at each state is U(s) = maxa[…] — the Bellman optimality equation. So the LP optimum IS U*.

Complexity: MDPs Are in P

The LP has |S| variables and |S|×|A| constraints. LP is solvable in polynomial time (via interior-point methods). Therefore, MDPs with finite state and action spaces are solvable in polynomial time — they lie in the complexity class P.

PropertyValue
Variables|S|
Constraints|S| × |A|
ComplexityPolynomial (LP is in P)
Practical vs. VIVI is usually faster in practice; LP matters theoretically
Why LP matters theoretically: Before this result it was conceivable that MDPs might require exponential time — there are |A||S| policies to search. The LP formulation proves that’s not necessary. The optimal value function is the vertex of a convex polytope that LP can find efficiently. This places exact MDP solution in a very different class from, say, general game-playing, which is PSPACE-hard.

A 2-State Example by Hand

Two states s0, s1. Two actions: left (L), right (R). γ = 0.9. Transitions: L always stays; R from s0 goes to s1; R from s1 stays. Rewards: R(s0,L)=−1, R(s0,R)=−1, R(s1,L)=−1, R(s1,R)=10.

LP constraints (variables: U0, U1):

From s0, action L: U0 ≥ −1 + 0.9 U0 → 0.1 U0 ≥ −1 → U0 ≥ −10

From s0, action R: U0 ≥ −1 + 0.9 U1

From s1, action L: U1 ≥ −1 + 0.9 U1 → 0.1 U1 ≥ −1 → U1 ≥ −10

From s1, action R: U1 ≥ 10 + 0.9 U1 → 0.1 U1 ≥ 10 → U1 ≥ 100

The binding constraint on U1 is U1 = 100. Then U0 ≥ −1 + 0.9×100 = 89. Minimize: U0 = 89, U1 = 100. The optimal policy: in s0, go right (worth 89); in s1, go right (keeps collecting reward 10, discounted sum = 100).

LP Feasible Region: 2-State MDP

The two LP variables are U0 and U1. Each constraint is a half-plane. The feasible region (shaded) is their intersection. The objective “minimize U0+U1” sweeps a line; the LP optimum is where it first enters the feasible region. Drag γ to see constraints shift.

γ 0.80
python
from scipy.optimize import linprog
import numpy as np

def lp_solve(mdp):
    """Algorithm 7.10: Solve MDP via linear programming."""
    S, A, gamma = mdp.S, mdp.A, mdp.gamma
    n = len(S)
    idx = {s: i for i, s in enumerate(S)}

    # Objective: minimize sum of U(s)
    c = np.ones(n)

    # Constraints: U(s) >= R(s,a) + gamma * sum_s' T(s'|s,a) U(s')
    # Rearranged: -U(s) + gamma * sum_s' T(s'|s,a) U(s') <= -R(s,a)
    rows = []
    rhs = []
    for s in S:
        for a in A:
            row = np.zeros(n)
            row[idx[s]] = -1.0
            for sp in S:
                row[idx[sp]] += gamma * mdp.T(s, a, sp)
            rows.append(row)
            rhs.append(-mdp.R(s, a))

    res = linprog(c, A_ub=np.array(rows), b_ub=np.array(rhs), method='highs')
    U = {s: res.x[idx[s]] for s in S}
    return {s: greedy(mdp, U, s) for s in S}, U
The LP formulation minimizes ∑U(s). Why minimize rather than maximize?

Chapter 7: Linear Quadratic Regulator

Every method so far assumed a finite state and action space. But most real physical systems — a pendulum, a drone, a robotic arm — have continuous states and actions. You can’t enumerate them. Can we still compute an exact optimal policy?

Yes — for one important special case: when dynamics are linear and rewards are quadratic. This is the Linear Quadratic Regulator (LQR), and it has a closed-form solution.

The Setup

Linear dynamics (eq. 7.22): The next state is a linear function of the current state and action, plus Gaussian noise:

s′ = Ts · s + Ta · a + w,    w ∼ N(0, Σw)

Here Ts is the state transition matrix (n×n), Ta is the action transition matrix (n×m), and w is zero-mean Gaussian noise.

Quadratic reward (eq. 7.23): The reward penalizes both large states and large actions:

R(s,a) = sT Rs s + aT Ra a

Rs is negative semidefinite (penalizes large states); Ra is negative definite (penalizes large actions). The goal: keep the system near zero with minimal control effort.

Why quadratic? Any smooth reward can be approximated as quadratic near its optimum. LQR is the second-order Taylor expansion of general optimal control. When you linearize a nonlinear system and quadratically approximate its cost, you get LQR — which is why it’s the foundation of iLQR, DDP, and most trajectory optimization methods used in robotics today.

The Solution

The optimal value function turns out to be quadratic: Uh(s) = sTVhs + qh. The optimal policy turns out to be linear: πh(s) = Lh · s.

The gain matrix Lh (eq. 7.29):

Lh = −(TaT Vh−1 Ta + Ra)−1 TaT Vh−1 Ts

Vh satisfies the discrete-time Riccati equation (eq. 7.30):

Vh+1 = Rs + TsTVhTs − TsTVhTa(TaTVhTa + Ra)−1TaTVhTs
Certainty equivalence: Lh and Vh don’t depend on the noise covariance Σw at all. The optimal control policy for a stochastic LQR problem is identical to the optimal policy for the deterministic version (w = 0). You can design the controller as if you had no noise, and it remains optimal. The noise affects qh (the expected cost), but not the gains.

Worked Example 7.4: Position-Velocity Control

State: s = [x, v]. Action: scalar acceleration a. Timestep Δt = 0.1. Dynamics: x′ = x + vΔt + ½aΔt2, v′ = v + aΔt. Reward: R(s,a) = −x2 − v2 − 0.5a2.

In matrix form: Ts = [[1, 0.1],[0, 1]], Ta = [[0.005],[0.1]], Rs = diag(−1,−1), Ra = [−0.5].

Starting at s0 = [−10, 0], the Riccati recursion gives gains:

Horizon hGain Lh
h = 1[0, 0] — last step, zero gain
h = 2[−0.286, −0.857]
h = 3[−0.462, −1.077]
h = 4[−0.496, −1.112]
h = 5[−0.504, −1.124] ← essentially converged

With a longer horizon the controller pushes harder early: at x=−10, h=5 gives initial action ≈ +5.04, rapidly accelerating toward the origin, while h=2 gives only +2.86.

LQR Phase Plane Simulation

Position-velocity phase plane. The trajectory starts at the dot and converges to the origin under LQR control. Drag the starting dot. Change horizon h and watch the gain matrix update and the trajectory reshape.

horizon h 5
L = [−0.504, −1.124]
python
import numpy as np

def lqr(Ts, Ta, Rs, Ra, h_max=50):
    """Algorithm 7.11: LQR via backward Riccati iteration."""
    n = Ts.shape[0]
    V = np.zeros((n, n))    # V_0 = 0 at terminal step
    gains = []
    for _ in range(h_max):
        M = Ta.T @ V @ Ta + Ra
        L = -np.linalg.solve(M, Ta.T @ V @ Ts)   # eq 7.29
        gains.append(L)
        V = Rs + Ts.T @ V @ Ts + Ts.T @ V @ Ta @ L  # Riccati, eq 7.30
    return gains   # gains[h-1] = L_h

# Example 7.4 setup
dt = 0.1
Ts = np.array([[1.0, dt], [0.0, 1.0]])
Ta = np.array([[0.5*dt**2], [dt]])
Rs = np.diag([-1.0, -1.0])
Ra = np.array([[-0.5]])
gains = lqr(Ts, Ta, Rs, Ra, h_max=10)
print("L_5 =", gains[4])    # [[-0.504, -1.124]]

# Simulate trajectory from s0 = [-10, 0]
s = np.array([-10.0, 0.0])
L = gains[-1]   # use converged gain
trajectory = [s.copy()]
for _ in range(50):
    a = L @ s        # linear policy: a = L*s
    s = Ts @ s + Ta.reshape(2) * a[0]
    trajectory.append(s.copy())
In LQR, the optimal gain matrix Lh does not depend on the noise covariance Σw. This result is called:

Chapter 8: Value Iteration Sandbox

You’ve seen value iteration on fixed examples. Now you’re in control. Below is a fully configurable grid world: choose the size, place rewards and walls by clicking, set the discount factor and stochastic slip probability. Run value iteration step by step or let it animate to convergence. Watch the Bellman residual decay. Extract and display optimal policy arrows.

This is THE showcase. Every concept from Chapters 1–5 is visible here simultaneously: value propagation from rewards backward, the effect of γ on how far values reach, stochasticity spreading value to neighbors, the residual plot showing contraction, policy arrows emerging from the value function.
Value Iteration: Full Interactive Sandbox

Left-click a cell to cycle: empty → wall → reward (+10) → penalty (−5) → empty. Then run value iteration and watch optimal arrows emerge. The residual plot (below) shows δ on a log scale.

Bellman residual δk = maxs|Uk+1(s)−Uk(s)| (log scale). Algorithm stops when δ < 0.01.

k = 0 δ = —
grid
γ0.90
slip %10%
Experiments to try: (1) Set γ = 0.5 and place a reward far from center — watch values fade to near zero before they reach the start. (2) Set slip = 40% and watch how values “spread” laterally. (3) Place a −5 penalty adjacent to the +10 reward — watch the optimal policy route around it. (4) Create a bottleneck corridor — all paths must squeeze through — and watch values drop as you force longer detours.
You run value iteration on a 6×6 grid with γ = 0.9 and a single reward at the goal. After 1 step, which cells have nonzero values?

Chapter 9: Policy vs. Value Iteration Side by Side

Both methods solve the same problem and both produce the same answer: the optimal policy. But they take very different paths to get there. The comparison below runs them simultaneously on an identical MDP so you can see the tradeoff directly.

Policy Iteration: Each outer iteration is expensive — a full policy evaluation before each improvement. But the number of outer iterations is tiny (often 2–5 even for large grids).

Value Iteration: Each iteration is cheap — just one Bellman backup per state. But many iterations are needed for the Bellman residual to shrink below the stopping threshold.

The fundamental tradeoff: Policy iteration does more work per step (policy evaluation = solving a linear system) but fewer steps. Value iteration does less work per step (one backup) but more steps. For small state spaces where O(|S|3) matrix solves are cheap, PI wins. For larger spaces where each linear solve is expensive, VI wins. The crossover depends on the specific MDP.
PI vs VI: Head-to-Head on the Same Grid

Both algorithms run on the same 5×5 grid. Left: policy iteration (shows current policy arrows). Right: value iteration (shows current value heatmap). Step both simultaneously and compare iteration counts.

PI outer iters: 0 VI total iters: 0

Per-Iteration Cost Breakdown

For a grid with N = |S| states and A = |A| actions:

AlgorithmCost per outer iterationNumber of outer iterations
Policy IterationO(N3) for linear solve + O(N2A) for improvementOften 2–10 in practice
Modified PI (m sweeps)O(m · N2A)More than PI, fewer than VI
Value IterationO(N2A)O(log(1/ε) / log(1/γ))

For the 5×5 grid above (N=25, A=4): PI costs 253 ≈ 15,000 operations per outer iter but typically finishes in 3 outer iters. VI costs 252×4 = 2,500 operations per iter but needs ∼50–200 iters for γ=0.9. Total: PI ≈ 45,000 vs. VI ≈ 125,000–500,000. PI wins for small grids.

For a very large MDP with |S| = 10,000 states, why is value iteration usually preferred over policy iteration?

Chapter 10: Method Comparison

Five exact methods, each with its own niche. Here’s the map:

MethodPer iterationOuter iterationsMemoryBest for
Policy IterationO(|S|3) + O(|S|2|A|)2–10 in practiceO(|S|2)Small state spaces (<1,000 states)
Value IterationO(|S|2|A|)O(log(1/ε)/log(1/γ))O(|S|)Medium state spaces
Async VI (Gauss-Seidel)O(|S||A|) per sweepDepends on orderingO(|S|)When good ordering is known
Linear ProgramPolynomial (interior point)One solveO(|S|2)Theoretical guarantee; rarely practical
LQRO(n3) per Riccati steph steps (horizon)O(n2)Linear dynamics, quadratic cost

When Each Method Breaks

Policy Iteration breaks when |S| is large (thousands+). The O(|S|3) linear solve per iteration becomes prohibitive. Even sparse matrix solves have poor scaling for grid worlds beyond ∼100×100.

Value Iteration breaks when γ is close to 1. The convergence rate is γk, so γ = 0.999 needs ∼6,900 iterations to reduce error by 1000×. Every digit of γ closer to 1 roughly doubles the iteration count.

Async VI breaks when there’s no good state ordering. For ergodic MDPs without obvious structure, random ordering may not be much faster than synchronous VI.

LQR breaks when dynamics are nonlinear or the reward is non-quadratic. Even small nonlinearities can make the true optimal policy dramatically different from the LQR policy. Extensions like iLQR and DDP address this by iterating LQR around a nominal trajectory.

The Curse of Dimensionality

All exact methods share a fatal flaw: they require enumerating all states. A robot with 10 joints, each discretized to 100 positions, has 10010 = 1020 states. No computer can enumerate them. This is the curse of dimensionality — the fundamental barrier that drives the entire field of approximate dynamic programming (Chapter 8) and model-based RL (Chapter 9).

ACAS X scale: The aircraft collision avoidance system has ~500 million states. It was solved by value iteration on a supercomputer, exploiting the structured (grid-like) nature of the state space. That’s near the practical limit for exact methods. For reference, a full humanoid robot has roughly 1050 states when discretized at millimeter resolution — exact methods are irrelevant at that scale.
A continuous state space robot arm has 6 joints, each ranging 0–360°. If you discretize each joint to 1° resolution, how many states are there?

Chapter 11: Connections and What’s Next

Chapter 7 is the foundation. Everything that follows in the book is an answer to one of three problems that exact methods cannot handle: state spaces that are too large, models that are unknown, and partial observability.

Ch 7: Exact Methods (this chapter)
Value iteration, policy iteration, LQR — optimal when state space is small and model is known
↓ when |S| is too large ↓
Ch 8: Approximate Value Functions
Parametric approximators (linear, neural networks) replace the full value table. LSTD, DQN, fitted VI.
↓ when the model is unknown ↓
Ch 9: Online Planning
MCTS, rollout policy, sparse sampling — plan at decision time using Monte Carlo estimates instead of precomputed U*.
↓ when we want to learn the policy directly ↓
Ch 10–13: Policy Gradient Methods
REINFORCE, actor-critic, PPO — directly optimize the policy parameters πθ by gradient ascent on expected return. No value function required (or use one as a baseline).
↓ when state is partially observed ↓
Ch 6: POMDPs (partially observable MDPs)
The agent can’t see s directly. Ch 6 covers belief state MDPs, point-based solvers — all built on the same Bellman structure as Ch 7.

Equation Cheat Sheet

NameEquation
Bellman ExpectationUπ(s) = R(s,π(s)) + γ∑s′T(s′|s,π(s))Uπ(s′)
Bellman OptimalityU*(s) = maxa[R(s,a) + γ∑s′T(s′|s,a)U*(s′)]
Q-FunctionQ(s,a) = R(s,a) + γ∑s′T(s′|s,a)U(s′)
Greedy Policyπ(s) = arg maxa Q(s,a)
Policy IterationEvaluate Uπ exactly, then π ← greedy(Uπ), repeat
Value IterationU(s) ← maxa[R(s,a)+γ∑T(s′|s,a)U(s′)], repeat
VI Error Bound||Uk−U*|| ≤ δγ/(1−γ) where δ = max|Uk+1−Uk|
LQR GainLh = −(TaTVh−1Ta+Ra)−1TaTVh−1Ts
Riccati EquationVh+1 = Rs + TsTVhTs − TsTVhTa(TaTVhTa+Ra)−1TaTVhTs

Related Lessons

These micro-lessons complement Chapter 7:

The key insight of this chapter: The Bellman equation — “the value of a state equals the immediate reward plus the discounted expected value of the next state” — is a fixed-point equation. All exact algorithms (PI, VI, async VI, LP) are different strategies for finding that fixed point. LQR is the special case where the fixed point has a closed form. Everything downstream is an approximation of this fixed point when it can’t be computed exactly.
“An optimal policy has the property that whatever the initial state and initial decision are, the remaining decisions must constitute an optimal policy with regard to the state resulting from the first decision.”
— Richard Bellman, Dynamic Programming, 1957
Why can’t any of the exact methods in Chapter 7 solve a humanoid robot’s locomotion problem directly?