MDPs, policy evaluation, value iteration, policy iteration, LQR — computing optimal decisions under uncertainty.
A robot is navigating a grid to reach a goal. It has a map, it has motors, and it knows its target. Easy enough. But every time it sends a “move north” command, there is a 20% chance the wheels slip and it drifts east instead. Now what?
Unlike deterministic planning — where you compute one path and execute it — stochastic environments force you to reason about sequences of decisions under uncertainty. The same action from the same state might lead to different outcomes. You cannot plan a fixed path; you need a policy that tells you what to do in any situation you might land in.
The standard mathematical model for this class of problems is a Markov decision process (MDP). An MDP packages five ingredients:
| Component | Symbol | What it says |
|---|---|---|
| State space | S | The finite set of all possible situations |
| Action space | A | The finite set of all possible moves |
| Transition function | T(s′|s,a) | Prob. of landing in s′ after doing a in s |
| Reward function | R(s,a) | Expected reward for taking action a in state s |
| Discount factor | γ | How much we value future rewards (0 ≤ γ < 1) |
The Markov property is the defining assumption: the distribution over next states T(s′|s,a) depends only on the current state and action — not on how we got there. No memory of past states is needed. This is what makes the math tractable.
For a finite horizon of T steps, the total utility is just the sum of rewards: U = r1 + r2 + … + rT. But for an infinite horizon problem, this sum can blow up. We need to discount future rewards.
The discounted return is:
If γ = 0.9, a reward 10 steps away is worth 0.910 ≈ 0.35 of its face value. A reward 20 steps away is worth only 0.920 ≈ 0.12. Discounting makes far-future rewards matter less — and crucially, it guarantees the infinite sum converges as long as rewards are bounded.
A policy π(s) maps every state to an action. For infinite-horizon MDPs, the optimal policy is stationary (doesn’t change with time) and deterministic (picks one action per state). The value function Uπ(s) is the expected discounted return from starting in state s and following policy π forever:
The optimal value function U*(s) = maxπ Uπ(s) gives the best achievable expected return from each state. Finding U* — and the policy that achieves it — is what the rest of this chapter is about.
Consider a 5-state MDP: states s1…s5. Two actions: stay (remain in current state) and continue (move to the next state). The only nonzero reward is R(s4, continue) = 10, which takes you to s5.
If we follow the optimal policy from s1: always continue. We pass through s1 → s2 → s3 → s4, collecting zero reward at each step, then at step 4 we take the reward. The discounted value is:
Solving for γ: γ3 = 0.1, so γ = 0.11/3 ≈ 0.464. A discount factor of about 46% makes a reward 3 steps away worth exactly 1 unit today.
A 6×6 grid. The goal (green) is at bottom-right with reward +20. All other cells show the discounted value of the goal reward as a function of shortest-path distance. Drag γ to see how discounting “reaches” across the grid.
High γ: distant rewards stay large (blue). Low γ: only nearby cells have significant value (dark).
ACAS X, the FAA’s collision avoidance system, is literally solved as an MDP. States encode positions and velocities of both aircraft. Actions are climb, descend, or level. Collision gives a large negative reward (−10000). Unnecessary maneuvering gives a small negative reward (−10). The system pre-computes the optimal policy over millions of states offline, then looks up the right action in real time. It outperforms the previous rule-based system both in safety and pilot annoyance metrics.
python from dataclasses import dataclass from typing import List, Callable @dataclass class MDP: S: List # state space (list of states) A: List # action space (list of actions) T: Callable # T(s, a, sp) -> transition probability R: Callable # R(s, a) -> expected immediate reward gamma: float # discount factor in [0, 1) # Example: 3-state chain. Two actions: left (0), right (1). # Right from state 2 gives reward 10. All else gives -1. def chain_T(s, a, sp): if a == 1: # right: move to s+1 (stay if at edge) target = min(s + 1, 2) else: # left: move to s-1 (stay if at edge) target = max(s - 1, 0) return 1.0 if sp == target else 0.0 def chain_R(s, a): return 10.0 if (s == 1 and a == 1) else -1.0 mdp = MDP(S=[0,1,2], A=[0,1], T=chain_T, R=chain_R, gamma=0.9)
Before we can optimize, we need to evaluate. Given a fixed policy π, how good is it? The answer is the value function Uπ(s): the expected discounted return starting from s and following π forever.
Computing Uπ is called the prediction problem. It has a clean recursive structure. From any state s, following policy π gives immediate reward R(s, π(s)), then we land in state s′ with probability T(s′|s, π(s)), and get value Uπ(s′) from there. Putting it together, the Bellman expectation equation (eq. 7.6) must hold at every state simultaneously:
This is a system of |S| equations in |S| unknowns. Two ways to solve it:
Iterative method (Algorithm 7.3): Start with U(s) = 0 for all s. Repeatedly apply the lookahead equation (eq. 7.5):
Repeat until values stop changing. Cheap per iteration; needs many iterations.
Exact method (Algorithm 7.4): Write in matrix form:
Solve: Uπ = (I − γTπ)−1 Rπ. Exact in one shot; costs O(|S|3) for matrix inversion.
The update is a contraction mapping. Define the Bellman operator Bπ as “apply one round of the update to every state.” For any two value functions U1 and U2:
After one application, the gap between any two guesses shrinks by at least a factor of γ. Because γ < 1, repeated application squeezes the gap to zero — regardless of starting point. The fixed point they converge to is exactly Uπ.
Consider the 3-state chain MDP from Chapter 0 (states 0, 1, 2; actions left/right). Policy π: always go right. γ = 0.9. R(1, right) = 10, all else −1.
Iteration 0: U = [0, 0, 0].
Iteration 1: State 0: R(0,right)=−1, T(1|0,right)=1, so U(0)=−1+0.9×0=−1. State 1: R(1,right)=10, T(2|1,right)=1, so U(1)=10+0.9×0=10. State 2: stays at 2 (wall), R=−1, U(2)=−1.
Iteration 2: U(0)=−1+0.9×10=8. U(1)=10+0.9×(−1)=9.1. U(2)=−1+0.9×(−1)=−1.9.
Iteration 3: U(0)=−1+0.9×9.1=7.19. U(1)=10+0.9×(−1.9)=8.29. U(2)=−1+0.9×(−1.9)=−2.71.
Values are propagating backward from the reward. State 0 is slowly learning it’s one step away from a big payoff. The exact fixed point (via matrix inversion) is Uπ = [−3.45, 6.36, −5.91] — which is what the iteration converges to over many steps. (The negative values at states 2 reflect the “dead end” of hitting the wall repeatedly.)
A 4×4 grid. Policy: always move right (or stay if at wall). Goal: bottom-right cell (+10). All else: −1 reward. Click Step to apply one round of the Bellman update. Watch values propagate from right to left.
python import numpy as np def iterative_policy_evaluation(mdp, pi, k_max=100, eps=1e-6): """Evaluate policy pi by repeated Bellman updates (Algorithm 7.3).""" U = {s: 0.0 for s in mdp.S} for _ in range(k_max): U_new = {} for s in mdp.S: a = pi[s] U_new[s] = mdp.R(s, a) + mdp.gamma * sum( mdp.T(s, a, sp) * U[sp] for sp in mdp.S ) delta = max(abs(U_new[s] - U[s]) for s in mdp.S) U = U_new if delta < eps: break return U def exact_policy_evaluation(mdp, pi): """Exact matrix solve: U = (I - gamma*T_pi)^{-1} R_pi (Algorithm 7.4).""" n = len(mdp.S) idx = {s: i for i, s in enumerate(mdp.S)} R_pi = np.array([mdp.R(s, pi[s]) for s in mdp.S]) T_pi = np.array([[mdp.T(s, pi[s], sp) for sp in mdp.S] for s in mdp.S]) U_vec = np.linalg.solve(np.eye(n) - mdp.gamma * T_pi, R_pi) return {s: U_vec[idx[s]] for s in mdp.S}
Suppose someone hands you a value function U. Maybe it’s an estimate, maybe it’s exact. Either way, you want to extract the best policy from it. The natural idea: at each state, pick the action that looks best over the next step.
Formally, the greedy policy with respect to U (eq. 7.11) is:
If U = U* (optimal), the greedy policy is optimal. If U is approximate, the greedy policy is approximately optimal — and we can bound exactly how suboptimal it is.
The expression inside the argmax has a name: the action value function (or Q-function), defined in eq. 7.12:
Q(s,a) is the expected return of: (1) taking action a in state s, then (2) acting greedily forever after (using value function U to decide). From Q, we get everything for free:
| Function | Formula | Policy extraction |
|---|---|---|
| Value U(s) | Expected return under optimal actions | Need T, R to compute lookahead first |
| Q(s,a) | R(s,a) + γ∑T(s′|s,a)U(s′) | Just argmaxa Q(s,a) — no T, R needed |
| U(s) from Q | maxa Q(s,a) | Trivial once Q is known |
The advantage function A(s,a) (eq. 7.15) measures how much better action a is compared to the best action:
Since U(s) = maxa Q(s,a), the greedy action has A(s, π(s)) = 0. Every other action has A(s,a) ≤ 0 — a negative advantage tells you exactly how much you lose by choosing that action.
Here is a toy 2×3 grid MDP with γ = 0.9. Goal state (bottom-right) has been solved; its value is U*(goal) = 0. All move costs are −1. Suppose the value function at state (0,0) is U(0,0) = −5, and from there you can move right to (0,1) with U(0,1) = −4, or down to (1,0) with U(1,0) = −4.
Q(s, right) = −1 + 0.9×(−4) = −1 − 3.6 = −4.6.
Q(s, down) = −1 + 0.9×(−4) = −4.6 (tie; both neighbors equally far from goal).
Q(s, left) = −1 + 0.9×U(wall=stay) = −1 + 0.9×(−5) = −5.5 (wall, loops to same state).
Q(s, up) = −1 + 0.9×U(wall=stay) = −5.5 (same; another wall).
So U(s) = max Q = −4.6. A(s, right) = −4.6−(−4.6) = 0. A(s, left) = −5.5−(−4.6) = −0.9. Going left wastes 0.9 units of expected return.
A 4×4 grid with precomputed Q-values. Each cell shows four arrows (up/down/left/right) sized by Q(s,a). The brightest arrow is the greedy action. Toggle to display U(s) or A(s,a) at each cell.
python def lookahead(mdp, U, s, a): """Q(s,a): one-step lookahead value (Algorithm 7.2).""" return mdp.R(s, a) + mdp.gamma * sum( mdp.T(s, a, sp) * U[sp] for sp in mdp.S ) def greedy(mdp, U, s): """Extract greedy action at state s.""" return max(mdp.A, key=lambda a: lookahead(mdp, U, s, a)) def Q_table(mdp, U): """Full Q-table: Q[s][a] for all states and actions.""" return {s: {a: lookahead(mdp, U, s, a) for a in mdp.A} for s in mdp.S} def advantage(mdp, U, s, a): """A(s,a) = Q(s,a) - U(s). Always <= 0.""" return lookahead(mdp, U, s, a) - U[s]
We know how to evaluate a policy (Chapter 1) and how to improve it (Chapter 2). The obvious question: what if we alternated? Evaluate, then improve, then evaluate again — until the policy stops changing?
That’s exactly policy iteration (Algorithm 7.6). Starting from any policy π, it cycles:
Three facts lock together to guarantee convergence:
(1) Monotonic improvement. The policy improvement theorem says: if π′ is greedy with respect to Uπ, then Uπ′(s) ≥ Uπ(s) for all states. The new policy is never worse. Proof: at each state, π′ picks the best one-step lookahead, which is at least as large as what π would have gotten. Chain this inequality forward through all future states (just like we did in Chapter 2) and you get Uπ′(s) ≥ Uπ(s) globally.
(2) Finiteness. A finite MDP with |S| states and |A| actions has at most |A||S| deterministic policies (each state independently chooses one of |A| actions). Each improvement step either finds a strictly better policy or terminates. No cycles are possible — we can’t visit the same policy twice, because each visit strictly improves it.
(3) Optimality at termination. If π′ = π, then the greedy policy equals the current one, which means Uπ(s) = maxa[R(s,a) + γ∑T(s′|s,a)Uπ(s′)] for every state. That IS the Bellman optimality equation. So Uπ = U* and π = π*.
Using full matrix evaluation at each step is expensive. What if we only run a few sweeps of iterative evaluation before improving? This gives modified policy iteration:
| Evaluation sweeps per improvement | Algorithm | Per-iteration cost |
|---|---|---|
| Until convergence (∞ sweeps) | Standard policy iteration | O(|S|3) (matrix solve) |
| m sweeps (m > 1) | Modified policy iteration | O(m×|S|2×|A|) |
| 1 sweep | Value iteration | O(|S|2×|A|) |
Start with the policy “always go east.” Iteration 1: evaluate (values propagate from reward state), improve (greedy extraction changes some arrows near the reward). Iteration 2: values update again, more arrows change. Iteration 3: converged. Only 3 full iterations for the exact optimal policy.
The early iterations change many arrows. Later iterations change only a few. Convergence is front-loaded — the policy improves fastest at the start, then makes tiny refinements near the end.
Goal (green, bottom-right): +10. All moves: −1. γ = 0.9. Initial policy: always right. Click Evaluate to compute Uπ, then Improve to extract the greedy policy. Watch arrows change.
python def policy_iteration(mdp, pi_init=None, k_eval=50): """Algorithm 7.6: Policy iteration.""" # Initialize to first action if not provided pi = pi_init or {s: mdp.A[0] for s in mdp.S} while True: # 1. Evaluate current policy U = iterative_policy_evaluation(mdp, pi, k_max=k_eval) # 2. Improve greedily pi_new = {s: greedy(mdp, U, s) for s in mdp.S} # 3. Check convergence if all(pi_new[s] == pi[s] for s in mdp.S): return pi, U pi = pi_new
What if we skipped the separate evaluation step entirely? Instead of first computing Uπ and then extracting the greedy policy, we combine them into a single update. At every state, we immediately apply the optimal Bellman equation — the max over all actions — and use that as the new value estimate.
This is the Bellman backup (eq. 7.16), the engine of value iteration:
Starting from U(s) = 0 everywhere and applying this update to every state repeatedly, we converge to the optimal value function U*, which satisfies the Bellman optimality equation (eq. 7.17):
The max operator does not break the contraction. For any two value functions U1 and U2, applying the Bellman optimality operator T*:
The key identity is |maxa f(a) − maxa g(a)| ≤ maxa |f(a) − g(a)|. Both players pick their best action; the gap between their best values can’t exceed the worst-case gap across all actions. And that worst-case gap is bounded by γ times the original difference in U1, U2. Since γ < 1, repeated application squeezes any initial error to zero.
We can’t measure ||Uk − U*|| directly (we don’t know U*). Instead, we track the Bellman residual:
The true error is bounded by: ||Uk − U*||∞ ≤ δγ/(1−γ). And the policy loss — how much worse the extracted policy is than optimal — is bounded by: ||Uπ − U*||∞ ≤ 2εγ/(1−γ) where ε is the value error.
States: s0, s1, s2 (chain). Actions: stay (s), advance (a). R(s1, advance) = 10; all else −1. Advance from s0 goes to s1; advance from s1 goes to s2; s2 is absorbing (loops). Discount γ = 0.9.
k=0: U = [0, 0, 0].
k=1:
s0: max(stay: −1+0.9×0=−1, advance: −1+0.9×0=−1) = −1
s1: max(stay: −1+0.9×0=−1, advance: 10+0.9×0=10) = 10
s2: max(stay: −1+0.9×0=−1, advance: −1+0.9×0=−1) = −1
k=2:
s0: max(−1+0.9×(−1)=−1.9, −1+0.9×10=8) = 8 ← advance is now clearly better!
s1: max(−1+0.9×10=8, 10+0.9×(−1)=9.1) = 9.1
s2: max(−1+0.9×(−1)=−1.9, −1+0.9×(−1)=−1.9) = −1.9
k=3:
s0: advance: −1+0.9×9.1=7.19
s1: advance: 10+0.9×(−1.9)=8.29
s2: — = −2.71
Already at k=2, value iteration “learned” that s0 should advance (reaching s1 gives 10). The backup propagates optimism: once s1’s value is updated to 10, that knowledge immediately ripples to s0 on the next step.
5×5 grid world. Drag sliders to change γ (discount) and slip probability (stochastic transitions). Step advances one Bellman backup. Run animates. Watch the Bellman residual decay and optimal policy arrows emerge.
Bellman residual δ over iterations (log scale). Convergence criterion: δ < 0.001.
python def bellman_backup(mdp, U, s): """One Bellman backup at state s (Algorithm 7.7).""" return max( mdp.R(s, a) + mdp.gamma * sum(mdp.T(s, a, sp) * U[sp] for sp in mdp.S) for a in mdp.A ) def value_iteration(mdp, eps=1e-6, k_max=1000): """Algorithm 7.8: Value iteration with Bellman residual stopping.""" U = {s: 0.0 for s in mdp.S} for k in range(k_max): U_new = {s: bellman_backup(mdp, U, s) for s in mdp.S} delta = max(abs(U_new[s] - U[s]) for s in mdp.S) U = U_new if delta < eps * (1 - mdp.gamma) / mdp.gamma: # Tighter stop: guarantees true error < eps break policy = {s: greedy(mdp, U, s) for s in mdp.S} return policy, U
Standard value iteration is synchronous: in each iteration, we compute new values for ALL states simultaneously, using only the old values. Then we swap in the new table and repeat. Clean, correct — but it ignores useful information. When we compute U(s5), we already have a fresh U(s1). Why not use it?
Asynchronous value iteration (Algorithm 7.9) updates states one at a time, in place, using the most recent value of every state already updated this sweep:
The difference: when we update U(s5), we use the already-fresh U(s1), U(s2), etc. This is the Gauss-Seidel method, adapted from linear algebra to dynamic programming.
The order in which we update states controls how fast information flows. There are three natural orderings:
| Ordering | Behavior | Best for |
|---|---|---|
| Toward reward (backward) | Values propagate in one pass | Chain MDPs with known reward location |
| Away from reward (forward) | Values propagate slowly, one step per sweep | No benefit over synchronous VI |
| Random | Intermediate; converges but unpredictably | Unknown structure; still guaranteed to converge |
One more key fact: even with a bad ordering, asynchronous VI cannot converge slower than synchronous VI. It can only be the same or faster. The only requirement for convergence: every state must be updated infinitely often. Any ordering that visits every state repeatedly works.
Linear chain of 5 states (s0→s1→s2→s3→s4). Reward at s4: +10. γ = 0.9. All transitions deterministic (always advance).
Left-to-right ordering (s0, s1, s2, s3, s4): Sweep 1 updates s0 first — it sees U(s1)=0, stays at 0. Then s1 sees U(s2)=0 → stays 0. … Only s4 has the reward. After 1 sweep: same as synchronous VI. Needs ~5 sweeps to fully propagate.
Right-to-left ordering (s4, s3, s2, s1, s0): Sweep 1 updates s4 first: U(s4) = 10. Then s3 sees fresh U(s4)=10: U(s3) = −1 + 0.9×10 = 8. Then s2 sees fresh U(s3)=8: U(s2) = 6.2. s1: 4.58. s0: 3.12. Full propagation in 1 sweep!
Same 6-cell chain. Left half: synchronous VI (updates all cells from old table each step). Right half: Gauss-Seidel VI. Choose sweep direction. Watch Gauss-Seidel converge faster with right-to-left ordering.
python def gauss_seidel_vi(mdp, ordering=None, k_max=1000, eps=1e-6): """Algorithm 7.9: Gauss-Seidel (asynchronous) value iteration.""" # ordering: list of states in update order. Default: mdp.S as-is. ordering = ordering or mdp.S U = {s: 0.0 for s in mdp.S} for _ in range(k_max): delta = 0.0 for s in ordering: # update in-place, use fresh values immediately u_new = bellman_backup(mdp, U, s) delta = max(delta, abs(u_new - U[s])) U[s] = u_new # write back immediately (in-place) if delta < eps: break return {s: greedy(mdp, U, s) for s in mdp.S}, U # Backward induction: sweep from last time step to first def backward_induction(mdp_h, H): """Finite-horizon DP: sweep backward from t=H. Returns policy for each t. mdp_h.S assumed to contain (state, time) tuples. """ U = {(s, H): mdp_h.terminal_reward(s) for s in mdp_h.base_states} policies = {} for t in range(H - 1, -1, -1): # sweep backward for s in mdp_h.base_states: best_a, best_v = None, -float('inf') for a in mdp_h.A: v = mdp_h.R(s, a) + mdp_h.gamma * sum( mdp_h.T(s, a, sp) * U.get((sp, t+1), 0) for sp in mdp_h.base_states ) if v > best_v: best_a, best_v = a, v U[(s, t)] = best_v policies[(s, t)] = best_a return policies, U
Value iteration and policy iteration are iterative algorithms — they converge asymptotically. Is there a way to compute U* in one shot, as a direct optimization? Yes: rewrite the Bellman optimality equation as a linear program.
The Bellman optimality equation requires U*(s) = maxa[R(s,a) + γ∑T(s′|s,a)U*(s′)] for every state s. The “max” is the hard part — it makes the equation nonlinear. The LP trick: replace the max with a constraint. Instead of saying “U(s) equals the best lookahead,” we say “U(s) is at least as large as every lookahead”:
This is a system of |S|×|A| linear constraints in |S| variables. The feasible set is a convex polyhedron. But which point in this polyhedron should we pick? Any U satisfying the constraints is an overestimate of U*. The smallest feasible U is exactly U*.
So the LP is (eq. 7.19–7.20):
The LP has |S| variables and |S|×|A| constraints. LP is solvable in polynomial time (via interior-point methods). Therefore, MDPs with finite state and action spaces are solvable in polynomial time — they lie in the complexity class P.
| Property | Value |
|---|---|
| Variables | |S| |
| Constraints | |S| × |A| |
| Complexity | Polynomial (LP is in P) |
| Practical vs. VI | VI is usually faster in practice; LP matters theoretically |
Two states s0, s1. Two actions: left (L), right (R). γ = 0.9. Transitions: L always stays; R from s0 goes to s1; R from s1 stays. Rewards: R(s0,L)=−1, R(s0,R)=−1, R(s1,L)=−1, R(s1,R)=10.
LP constraints (variables: U0, U1):
From s0, action L: U0 ≥ −1 + 0.9 U0 → 0.1 U0 ≥ −1 → U0 ≥ −10
From s0, action R: U0 ≥ −1 + 0.9 U1
From s1, action L: U1 ≥ −1 + 0.9 U1 → 0.1 U1 ≥ −1 → U1 ≥ −10
From s1, action R: U1 ≥ 10 + 0.9 U1 → 0.1 U1 ≥ 10 → U1 ≥ 100
The binding constraint on U1 is U1 = 100. Then U0 ≥ −1 + 0.9×100 = 89. Minimize: U0 = 89, U1 = 100. The optimal policy: in s0, go right (worth 89); in s1, go right (keeps collecting reward 10, discounted sum = 100).
The two LP variables are U0 and U1. Each constraint is a half-plane. The feasible region (shaded) is their intersection. The objective “minimize U0+U1” sweeps a line; the LP optimum is where it first enters the feasible region. Drag γ to see constraints shift.
python from scipy.optimize import linprog import numpy as np def lp_solve(mdp): """Algorithm 7.10: Solve MDP via linear programming.""" S, A, gamma = mdp.S, mdp.A, mdp.gamma n = len(S) idx = {s: i for i, s in enumerate(S)} # Objective: minimize sum of U(s) c = np.ones(n) # Constraints: U(s) >= R(s,a) + gamma * sum_s' T(s'|s,a) U(s') # Rearranged: -U(s) + gamma * sum_s' T(s'|s,a) U(s') <= -R(s,a) rows = [] rhs = [] for s in S: for a in A: row = np.zeros(n) row[idx[s]] = -1.0 for sp in S: row[idx[sp]] += gamma * mdp.T(s, a, sp) rows.append(row) rhs.append(-mdp.R(s, a)) res = linprog(c, A_ub=np.array(rows), b_ub=np.array(rhs), method='highs') U = {s: res.x[idx[s]] for s in S} return {s: greedy(mdp, U, s) for s in S}, U
Every method so far assumed a finite state and action space. But most real physical systems — a pendulum, a drone, a robotic arm — have continuous states and actions. You can’t enumerate them. Can we still compute an exact optimal policy?
Yes — for one important special case: when dynamics are linear and rewards are quadratic. This is the Linear Quadratic Regulator (LQR), and it has a closed-form solution.
Linear dynamics (eq. 7.22): The next state is a linear function of the current state and action, plus Gaussian noise:
Here Ts is the state transition matrix (n×n), Ta is the action transition matrix (n×m), and w is zero-mean Gaussian noise.
Quadratic reward (eq. 7.23): The reward penalizes both large states and large actions:
Rs is negative semidefinite (penalizes large states); Ra is negative definite (penalizes large actions). The goal: keep the system near zero with minimal control effort.
The optimal value function turns out to be quadratic: Uh(s) = sTVhs + qh. The optimal policy turns out to be linear: πh(s) = Lh · s.
The gain matrix Lh (eq. 7.29):
Vh satisfies the discrete-time Riccati equation (eq. 7.30):
State: s = [x, v]. Action: scalar acceleration a. Timestep Δt = 0.1. Dynamics: x′ = x + vΔt + ½aΔt2, v′ = v + aΔt. Reward: R(s,a) = −x2 − v2 − 0.5a2.
In matrix form: Ts = [[1, 0.1],[0, 1]], Ta = [[0.005],[0.1]], Rs = diag(−1,−1), Ra = [−0.5].
Starting at s0 = [−10, 0], the Riccati recursion gives gains:
| Horizon h | Gain Lh |
|---|---|
| h = 1 | [0, 0] — last step, zero gain |
| h = 2 | [−0.286, −0.857] |
| h = 3 | [−0.462, −1.077] |
| h = 4 | [−0.496, −1.112] |
| h = 5 | [−0.504, −1.124] ← essentially converged |
With a longer horizon the controller pushes harder early: at x=−10, h=5 gives initial action ≈ +5.04, rapidly accelerating toward the origin, while h=2 gives only +2.86.
Position-velocity phase plane. The trajectory starts at the dot and converges to the origin under LQR control. Drag the starting dot. Change horizon h and watch the gain matrix update and the trajectory reshape.
python import numpy as np def lqr(Ts, Ta, Rs, Ra, h_max=50): """Algorithm 7.11: LQR via backward Riccati iteration.""" n = Ts.shape[0] V = np.zeros((n, n)) # V_0 = 0 at terminal step gains = [] for _ in range(h_max): M = Ta.T @ V @ Ta + Ra L = -np.linalg.solve(M, Ta.T @ V @ Ts) # eq 7.29 gains.append(L) V = Rs + Ts.T @ V @ Ts + Ts.T @ V @ Ta @ L # Riccati, eq 7.30 return gains # gains[h-1] = L_h # Example 7.4 setup dt = 0.1 Ts = np.array([[1.0, dt], [0.0, 1.0]]) Ta = np.array([[0.5*dt**2], [dt]]) Rs = np.diag([-1.0, -1.0]) Ra = np.array([[-0.5]]) gains = lqr(Ts, Ta, Rs, Ra, h_max=10) print("L_5 =", gains[4]) # [[-0.504, -1.124]] # Simulate trajectory from s0 = [-10, 0] s = np.array([-10.0, 0.0]) L = gains[-1] # use converged gain trajectory = [s.copy()] for _ in range(50): a = L @ s # linear policy: a = L*s s = Ts @ s + Ta.reshape(2) * a[0] trajectory.append(s.copy())
You’ve seen value iteration on fixed examples. Now you’re in control. Below is a fully configurable grid world: choose the size, place rewards and walls by clicking, set the discount factor and stochastic slip probability. Run value iteration step by step or let it animate to convergence. Watch the Bellman residual decay. Extract and display optimal policy arrows.
Left-click a cell to cycle: empty → wall → reward (+10) → penalty (−5) → empty. Then run value iteration and watch optimal arrows emerge. The residual plot (below) shows δ on a log scale.
Bellman residual δk = maxs|Uk+1(s)−Uk(s)| (log scale). Algorithm stops when δ < 0.01.
Both methods solve the same problem and both produce the same answer: the optimal policy. But they take very different paths to get there. The comparison below runs them simultaneously on an identical MDP so you can see the tradeoff directly.
Policy Iteration: Each outer iteration is expensive — a full policy evaluation before each improvement. But the number of outer iterations is tiny (often 2–5 even for large grids).
Value Iteration: Each iteration is cheap — just one Bellman backup per state. But many iterations are needed for the Bellman residual to shrink below the stopping threshold.
Both algorithms run on the same 5×5 grid. Left: policy iteration (shows current policy arrows). Right: value iteration (shows current value heatmap). Step both simultaneously and compare iteration counts.
For a grid with N = |S| states and A = |A| actions:
| Algorithm | Cost per outer iteration | Number of outer iterations |
|---|---|---|
| Policy Iteration | O(N3) for linear solve + O(N2A) for improvement | Often 2–10 in practice |
| Modified PI (m sweeps) | O(m · N2A) | More than PI, fewer than VI |
| Value Iteration | O(N2A) | O(log(1/ε) / log(1/γ)) |
For the 5×5 grid above (N=25, A=4): PI costs 253 ≈ 15,000 operations per outer iter but typically finishes in 3 outer iters. VI costs 252×4 = 2,500 operations per iter but needs ∼50–200 iters for γ=0.9. Total: PI ≈ 45,000 vs. VI ≈ 125,000–500,000. PI wins for small grids.
Five exact methods, each with its own niche. Here’s the map:
| Method | Per iteration | Outer iterations | Memory | Best for |
|---|---|---|---|---|
| Policy Iteration | O(|S|3) + O(|S|2|A|) | 2–10 in practice | O(|S|2) | Small state spaces (<1,000 states) |
| Value Iteration | O(|S|2|A|) | O(log(1/ε)/log(1/γ)) | O(|S|) | Medium state spaces |
| Async VI (Gauss-Seidel) | O(|S||A|) per sweep | Depends on ordering | O(|S|) | When good ordering is known |
| Linear Program | Polynomial (interior point) | One solve | O(|S|2) | Theoretical guarantee; rarely practical |
| LQR | O(n3) per Riccati step | h steps (horizon) | O(n2) | Linear dynamics, quadratic cost |
Policy Iteration breaks when |S| is large (thousands+). The O(|S|3) linear solve per iteration becomes prohibitive. Even sparse matrix solves have poor scaling for grid worlds beyond ∼100×100.
Value Iteration breaks when γ is close to 1. The convergence rate is γk, so γ = 0.999 needs ∼6,900 iterations to reduce error by 1000×. Every digit of γ closer to 1 roughly doubles the iteration count.
Async VI breaks when there’s no good state ordering. For ergodic MDPs without obvious structure, random ordering may not be much faster than synchronous VI.
LQR breaks when dynamics are nonlinear or the reward is non-quadratic. Even small nonlinearities can make the true optimal policy dramatically different from the LQR policy. Extensions like iLQR and DDP address this by iterating LQR around a nominal trajectory.
All exact methods share a fatal flaw: they require enumerating all states. A robot with 10 joints, each discretized to 100 positions, has 10010 = 1020 states. No computer can enumerate them. This is the curse of dimensionality — the fundamental barrier that drives the entire field of approximate dynamic programming (Chapter 8) and model-based RL (Chapter 9).
Chapter 7 is the foundation. Everything that follows in the book is an answer to one of three problems that exact methods cannot handle: state spaces that are too large, models that are unknown, and partial observability.
| Name | Equation |
|---|---|
| Bellman Expectation | Uπ(s) = R(s,π(s)) + γ∑s′T(s′|s,π(s))Uπ(s′) |
| Bellman Optimality | U*(s) = maxa[R(s,a) + γ∑s′T(s′|s,a)U*(s′)] |
| Q-Function | Q(s,a) = R(s,a) + γ∑s′T(s′|s,a)U(s′) |
| Greedy Policy | π(s) = arg maxa Q(s,a) |
| Policy Iteration | Evaluate Uπ exactly, then π ← greedy(Uπ), repeat |
| Value Iteration | U(s) ← maxa[R(s,a)+γ∑T(s′|s,a)U(s′)], repeat |
| VI Error Bound | ||Uk−U*||∞ ≤ δγ/(1−γ) where δ = max|Uk+1−Uk| |
| LQR Gain | Lh = −(TaTVh−1Ta+Ra)−1TaTVh−1Ts |
| Riccati Equation | Vh+1 = Rs + TsTVhTs − TsTVhTa(TaTVhTa+Ra)−1TaTVhTs |
These micro-lessons complement Chapter 7:
“An optimal policy has the property that whatever the initial state and initial decision are, the remaining decisions must constitute an optimal policy with regard to the state resulting from the first decision.”
— Richard Bellman, Dynamic Programming, 1957