Ch 7: Exact Solution Methods — Kochenderfer et al.

Chapter 0: Why Sequential Decisions Are Hard

A robot is navigating a grid to reach a goal. It has a map, it has motors, and it knows its target. Easy enough. But every time it sends a “move north” command, there is a 20% chance the wheels slip and it drifts east instead. Now what?

Unlike deterministic planning — where you compute one path and execute it — stochastic environments force you to reason about sequences of decisions under uncertainty. The same action from the same state might lead to different outcomes. You cannot plan a fixed path; you need a policy that tells you what to do in any situation you might land in.

The key shift: We’re not planning a sequence of actions. We’re computing a function π(s) that maps every state to the best action. That function — the policy — is the solution to a stochastic sequential decision problem.

The standard mathematical model for this class of problems is a Markov decision process (MDP). An MDP packages five ingredients:

Component	Symbol	What it says
State space	S	The finite set of all possible situations
Action space	A	The finite set of all possible moves
Transition function	T(s′\|s,a)	Prob. of landing in s′ after doing a in s
Reward function	R(s,a)	Expected reward for taking action a in state s
Discount factor	γ	How much we value future rewards (0 ≤ γ < 1)

The Markov property is the defining assumption: the distribution over next states T(s′|s,a) depends only on the current state and action — not on how we got there. No memory of past states is needed. This is what makes the math tractable.

Utility: What Are We Maximizing?

For a finite horizon of T steps, the total utility is just the sum of rewards: U = r₁ + r₂ + … + r_T. But for an infinite horizon problem, this sum can blow up. We need to discount future rewards.

The discounted return is:

U = r₁ + γr₂ + γ²r₃ + … = ∑_t=1^∞ γ^t−1 r_t

If γ = 0.9, a reward 10 steps away is worth 0.9¹⁰ ≈ 0.35 of its face value. A reward 20 steps away is worth only 0.9²⁰ ≈ 0.12. Discounting makes far-future rewards matter less — and crucially, it guarantees the infinite sum converges as long as rewards are bounded.

Exercise 7.1 worked out: Suppose you get the same reward r every step forever. What is the total discounted utility? Let S = r + γr + γ²r + … . Multiply both sides by γ: γS = γr + γ²r + … . Subtract: S − γS = r, so S(1−γ) = r, giving S = r/(1−γ). For r = 1 and γ = 0.9, the sum is 10. The discount factor acts as a geometric series multiplier.

A policy π(s) maps every state to an action. For infinite-horizon MDPs, the optimal policy is stationary (doesn’t change with time) and deterministic (picks one action per state). The value function U^π(s) is the expected discounted return from starting in state s and following policy π forever:

U^π(s) = 𝔼_π[ r₁ + γr₂ + γ²r₃ + … | s₁ = s ]

The optimal value function U*(s) = max_π U^π(s) gives the best achievable expected return from each state. Finding U* — and the policy that achieves it — is what the rest of this chapter is about.

A Worked Example: Exercise 7.2

Consider a 5-state MDP: states s₁…s₅. Two actions: stay (remain in current state) and continue (move to the next state). The only nonzero reward is R(s₄, continue) = 10, which takes you to s₅.

If we follow the optimal policy from s₁: always continue. We pass through s₁ → s₂ → s₃ → s₄, collecting zero reward at each step, then at step 4 we take the reward. The discounted value is:

U*(s₁) = γ³ × 10 = 1

Solving for γ: γ³ = 0.1, so γ = 0.1^1/3 ≈ 0.464. A discount factor of about 46% makes a reward 3 steps away worth exactly 1 unit today.

MDP Grid World: Discount Factor Reach

A 6×6 grid. The goal (green) is at bottom-right with reward +20. All other cells show the discounted value of the goal reward as a function of shortest-path distance. Drag γ to see how discounting “reaches” across the grid.

γ 0.80

High γ: distant rewards stay large (blue). Low γ: only nearby cells have significant value (dark).

Real-world example: Aircraft Collision Avoidance

ACAS X, the FAA’s collision avoidance system, is literally solved as an MDP. States encode positions and velocities of both aircraft. Actions are climb, descend, or level. Collision gives a large negative reward (−10000). Unnecessary maneuvering gives a small negative reward (−10). The system pre-computes the optimal policy over millions of states offline, then looks up the right action in real time. It outperforms the previous rule-based system both in safety and pilot annoyance metrics.

python
from dataclasses import dataclass
from typing import List, Callable

@dataclass
class MDP:
    S: List          # state space (list of states)
    A: List          # action space (list of actions)
    T: Callable      # T(s, a, sp) -> transition probability
    R: Callable      # R(s, a) -> expected immediate reward
    gamma: float     # discount factor in [0, 1)

# Example: 3-state chain. Two actions: left (0), right (1).
# Right from state 2 gives reward 10. All else gives -1.
def chain_T(s, a, sp):
    if a == 1:  # right: move to s+1 (stay if at edge)
        target = min(s + 1, 2)
    else:      # left: move to s-1 (stay if at edge)
        target = max(s - 1, 0)
    return 1.0 if sp == target else 0.0

def chain_R(s, a):
    return 10.0 if (s == 1 and a == 1) else -1.0

mdp = MDP(S=[0,1,2], A=[0,1], T=chain_T, R=chain_R, gamma=0.9)

A robot gets reward 100 at a goal exactly 5 steps away, with γ = 0.9. What is the discounted value of this reward from the robot’s current state?

100 (discount only applies to infinite horizon) 0.9⁵ × 100 ≈ 59.05 0.9 × 100 = 90 (discount applied once) 5 × 0.9 × 100 = 450

Chapter 1: Policy Evaluation — What Is a Policy Worth?

Before we can optimize, we need to evaluate. Given a fixed policy π, how good is it? The answer is the value function U^π(s): the expected discounted return starting from s and following π forever.

Computing U^π is called the prediction problem. It has a clean recursive structure. From any state s, following policy π gives immediate reward R(s, π(s)), then we land in state s′ with probability T(s′|s, π(s)), and get value U^π(s′) from there. Putting it together, the Bellman expectation equation (eq. 7.6) must hold at every state simultaneously:

U^π(s) = R(s, π(s)) + γ ∑_s′ T(s′|s, π(s)) U^π(s′)

This is a system of |S| equations in |S| unknowns. Two ways to solve it:

Iterative method (Algorithm 7.3): Start with U(s) = 0 for all s. Repeatedly apply the lookahead equation (eq. 7.5):

U_k+1^π(s) = R(s,π(s)) + γ∑_s′ T(s′|s,π(s)) U_k^π(s′)

Repeat until values stop changing. Cheap per iteration; needs many iterations.

Exact method (Algorithm 7.4): Write in matrix form:

U^π = R^π + γT^πU^π

Solve: U^π = (I − γT^π)⁻¹ R^π. Exact in one shot; costs O(|S|³) for matrix inversion.

Why Does the Iterative Method Converge?

The update is a contraction mapping. Define the Bellman operator B^π as “apply one round of the update to every state.” For any two value functions U₁ and U₂:

||B^πU₁ − B^πU₂||_∞ ≤ γ ||U₁ − U₂||_∞

After one application, the gap between any two guesses shrinks by at least a factor of γ. Because γ < 1, repeated application squeezes the gap to zero — regardless of starting point. The fixed point they converge to is exactly U^π.

Why the factor is γ exactly: In the update U_k+1(s) = R(s,π(s)) + γ∑T(s′|s,π(s))U_k(s′), the reward R cancels in the difference between two guesses (it’s the same for both). What’s left is γ times the sum of transition-weighted differences in U_k values. Since probabilities sum to 1, the weighted sum can’t exceed the max difference. So the gap shrinks by exactly γ. Not γ², not something clever — just γ, directly from the formula.

Iterative Evaluation: Worked Trace

Consider the 3-state chain MDP from Chapter 0 (states 0, 1, 2; actions left/right). Policy π: always go right. γ = 0.9. R(1, right) = 10, all else −1.

Iteration 0: U = [0, 0, 0].

Iteration 1: State 0: R(0,right)=−1, T(1|0,right)=1, so U(0)=−1+0.9×0=−1. State 1: R(1,right)=10, T(2|1,right)=1, so U(1)=10+0.9×0=10. State 2: stays at 2 (wall), R=−1, U(2)=−1.

Iteration 2: U(0)=−1+0.9×10=8. U(1)=10+0.9×(−1)=9.1. U(2)=−1+0.9×(−1)=−1.9.

Iteration 3: U(0)=−1+0.9×9.1=7.19. U(1)=10+0.9×(−1.9)=8.29. U(2)=−1+0.9×(−1.9)=−2.71.

Values are propagating backward from the reward. State 0 is slowly learning it’s one step away from a big payoff. The exact fixed point (via matrix inversion) is U^π = [−3.45, 6.36, −5.91] — which is what the iteration converges to over many steps. (The negative values at states 2 reflect the “dead end” of hitting the wall repeatedly.)

Convergence rate: After k iterations, the error is at most γ^k × ||U₀ − U^π||_∞. For γ = 0.9, reducing error by 100× requires k ≥ log(0.01)/log(0.9) ≈ 44 iterations. For γ = 0.99, the same accuracy needs ~459 iterations. High discount factors make convergence painfully slow — the fundamental speed-accuracy tradeoff in all DP methods.

Iterative Policy Evaluation: Step by Step

A 4×4 grid. Policy: always move right (or stay if at wall). Goal: bottom-right cell (+10). All else: −1 reward. Click Step to apply one round of the Bellman update. Watch values propagate from right to left.

k = 0 δ = —

python
import numpy as np

def iterative_policy_evaluation(mdp, pi, k_max=100, eps=1e-6):
    """Evaluate policy pi by repeated Bellman updates (Algorithm 7.3)."""
    U = {s: 0.0 for s in mdp.S}
    for _ in range(k_max):
        U_new = {}
        for s in mdp.S:
            a = pi[s]
            U_new[s] = mdp.R(s, a) + mdp.gamma * sum(
                mdp.T(s, a, sp) * U[sp] for sp in mdp.S
            )
        delta = max(abs(U_new[s] - U[s]) for s in mdp.S)
        U = U_new
        if delta < eps:
            break
    return U

def exact_policy_evaluation(mdp, pi):
    """Exact matrix solve: U = (I - gamma*T_pi)^{-1} R_pi (Algorithm 7.4)."""
    n = len(mdp.S)
    idx = {s: i for i, s in enumerate(mdp.S)}
    R_pi = np.array([mdp.R(s, pi[s]) for s in mdp.S])
    T_pi = np.array([[mdp.T(s, pi[s], sp) for sp in mdp.S] for s in mdp.S])
    U_vec = np.linalg.solve(np.eye(n) - mdp.gamma * T_pi, R_pi)
    return {s: U_vec[idx[s]] for s in mdp.S}

Policy evaluation converges because the Bellman update is a contraction mapping. What property of γ is essential for this?

γ must equal 1 exactly to preserve the reward scale γ must be larger than 0.5 so the rewards don’t vanish 0 ≤ γ < 1, making the max change shrink by factor γ each iteration γ must equal 0 to make evaluation instant

Chapter 2: Greedy Policies and Q-Functions

Suppose someone hands you a value function U. Maybe it’s an estimate, maybe it’s exact. Either way, you want to extract the best policy from it. The natural idea: at each state, pick the action that looks best over the next step.

Formally, the greedy policy with respect to U (eq. 7.11) is:

π(s) = arg max_a [ R(s,a) + γ ∑_s′ T(s′|s,a) U(s′) ]

If U = U* (optimal), the greedy policy is optimal. If U is approximate, the greedy policy is approximately optimal — and we can bound exactly how suboptimal it is.

The Q-Function

The expression inside the argmax has a name: the action value function (or Q-function), defined in eq. 7.12:

Q(s,a) = R(s,a) + γ ∑_s′ T(s′|s,a) U(s′)

Q(s,a) is the expected return of: (1) taking action a in state s, then (2) acting greedily forever after (using value function U to decide). From Q, we get everything for free:

Function	Formula	Policy extraction
Value U(s)	Expected return under optimal actions	Need T, R to compute lookahead first
Q(s,a)	R(s,a) + γ∑T(s′\|s,a)U(s′)	Just argmax_a Q(s,a) — no T, R needed
U(s) from Q	max_a Q(s,a)	Trivial once Q is known

Why Q matters in RL: In reinforcement learning, we often don’t know T and R explicitly. We only observe samples. The Q-function lets us extract the greedy policy from data alone — just take argmax over actions. This is why Q-learning and DQN operate on Q rather than U. The model-free advantage of Q is real: you bypass the transition model entirely at decision time.

The Advantage Function

The advantage function A(s,a) (eq. 7.15) measures how much better action a is compared to the best action:

A(s,a) = Q(s,a) − U(s)

Since U(s) = max_a Q(s,a), the greedy action has A(s, π(s)) = 0. Every other action has A(s,a) ≤ 0 — a negative advantage tells you exactly how much you lose by choosing that action.

Verified fact: A(s,a) ≤ 0 for all s, a. Equality holds (A=0) only for greedy actions. This means you can safely ignore any action with A << 0 — it’s far worse than the best. Advantage estimates are used in actor-critic algorithms (Ch. 13) to tell the actor which actions to reinforce.

Worked Example: Q-Table From U

Here is a toy 2×3 grid MDP with γ = 0.9. Goal state (bottom-right) has been solved; its value is U*(goal) = 0. All move costs are −1. Suppose the value function at state (0,0) is U(0,0) = −5, and from there you can move right to (0,1) with U(0,1) = −4, or down to (1,0) with U(1,0) = −4.

Q(s, right) = −1 + 0.9×(−4) = −1 − 3.6 = −4.6.

Q(s, down) = −1 + 0.9×(−4) = −4.6 (tie; both neighbors equally far from goal).

Q(s, left) = −1 + 0.9×U(wall=stay) = −1 + 0.9×(−5) = −5.5 (wall, loops to same state).

Q(s, up) = −1 + 0.9×U(wall=stay) = −5.5 (same; another wall).

So U(s) = max Q = −4.6. A(s, right) = −4.6−(−4.6) = 0. A(s, left) = −5.5−(−4.6) = −0.9. Going left wastes 0.9 units of expected return.

Interactive Q-Table Viewer

A 4×4 grid with precomputed Q-values. Each cell shows four arrows (up/down/left/right) sized by Q(s,a). The brightest arrow is the greedy action. Toggle to display U(s) or A(s,a) at each cell.

python
def lookahead(mdp, U, s, a):
    """Q(s,a): one-step lookahead value (Algorithm 7.2)."""
    return mdp.R(s, a) + mdp.gamma * sum(
        mdp.T(s, a, sp) * U[sp] for sp in mdp.S
    )

def greedy(mdp, U, s):
    """Extract greedy action at state s."""
    return max(mdp.A, key=lambda a: lookahead(mdp, U, s, a))

def Q_table(mdp, U):
    """Full Q-table: Q[s][a] for all states and actions."""
    return {s: {a: lookahead(mdp, U, s, a) for a in mdp.A} for s in mdp.S}

def advantage(mdp, U, s, a):
    """A(s,a) = Q(s,a) - U(s). Always <= 0."""
    return lookahead(mdp, U, s, a) - U[s]

If Q(s,up)=5, Q(s,down)=3, Q(s,left)=7, Q(s,right)=2, what is U(s) and A(s,down)?

U(s)=5, A(s,down)=−2 U(s)=7, A(s,down)=3−7=−4 U(s)=7, A(s,down)=3 U(s)=17/4=4.25 (average), A(s,down)=−1.25

Chapter 3: Policy Iteration

We know how to evaluate a policy (Chapter 1) and how to improve it (Chapter 2). The obvious question: what if we alternated? Evaluate, then improve, then evaluate again — until the policy stops changing?

That’s exactly policy iteration (Algorithm 7.6). Starting from any policy π, it cycles:

1. Policy Evaluation

Compute U^π exactly or iteratively — the true value of following π from every state

↓

2. Policy Improvement

π′(s) = argmax_a[R(s,a) + γ∑T(s′|s,a)U^π(s′)] — be greedy everywhere

↓

3. Check

If π′ = π, stop. Otherwise set π ← π′ and go back to step 1.

Why Does It Converge?

Three facts lock together to guarantee convergence:

(1) Monotonic improvement. The policy improvement theorem says: if π′ is greedy with respect to U^π, then U^π′(s) ≥ U^π(s) for all states. The new policy is never worse. Proof: at each state, π′ picks the best one-step lookahead, which is at least as large as what π would have gotten. Chain this inequality forward through all future states (just like we did in Chapter 2) and you get U^π′(s) ≥ U^π(s) globally.

(2) Finiteness. A finite MDP with |S| states and |A| actions has at most |A|^|S| deterministic policies (each state independently chooses one of |A| actions). Each improvement step either finds a strictly better policy or terminates. No cycles are possible — we can’t visit the same policy twice, because each visit strictly improves it.

(3) Optimality at termination. If π′ = π, then the greedy policy equals the current one, which means U^π(s) = max_a[R(s,a) + γ∑T(s′|s,a)U^π(s′)] for every state. That IS the Bellman optimality equation. So U^π = U* and π = π*.

Speed in practice: In principle there are up to 4²⁵ ≈ 10¹⁵ policies for a 5×5 grid with 4 actions. In practice, policy iteration on that problem converges in 2–4 outer iterations. The exponential bound is catastrophically pessimistic. Policy iteration typically solves small-to-medium MDPs in single-digit outer iterations — even though each iteration costs O(|S|³) for the exact solve.

The Modified Policy Iteration Spectrum

Using full matrix evaluation at each step is expensive. What if we only run a few sweeps of iterative evaluation before improving? This gives modified policy iteration:

Evaluation sweeps per improvement	Algorithm	Per-iteration cost
Until convergence (∞ sweeps)	Standard policy iteration	O(\|S\|³) (matrix solve)
m sweeps (m > 1)	Modified policy iteration	O(m×\|S\|²×\|A\|)
1 sweep	Value iteration	O(\|S\|²×\|A\|)

Generalized Policy Iteration (GPI): Any interleaving of partial evaluation and improvement converges to the optimal policy and value function. This insight — formalized by Sutton & Barto as GPI — unifies all DP methods. PI and VI are just two extremes of the same family. Most practical algorithms (actor-critic, Q-learning, TD(λ)) are GPI in disguise.

Hex World Example (Figure 7.5 from the book)

Start with the policy “always go east.” Iteration 1: evaluate (values propagate from reward state), improve (greedy extraction changes some arrows near the reward). Iteration 2: values update again, more arrows change. Iteration 3: converged. Only 3 full iterations for the exact optimal policy.

The early iterations change many arrows. Later iterations change only a few. Convergence is front-loaded — the policy improves fastest at the start, then makes tiny refinements near the end.

Policy Iteration: Step by Step on a 5×5 Grid

Goal (green, bottom-right): +10. All moves: −1. γ = 0.9. Initial policy: always right. Click Evaluate to compute U^π, then Improve to extract the greedy policy. Watch arrows change.

Step 1: click Evaluate

python
def policy_iteration(mdp, pi_init=None, k_eval=50):
    """Algorithm 7.6: Policy iteration."""
    # Initialize to first action if not provided
    pi = pi_init or {s: mdp.A[0] for s in mdp.S}
    while True:
        # 1. Evaluate current policy
        U = iterative_policy_evaluation(mdp, pi, k_max=k_eval)
        # 2. Improve greedily
        pi_new = {s: greedy(mdp, U, s) for s in mdp.S}
        # 3. Check convergence
        if all(pi_new[s] == pi[s] for s in mdp.S):
            return pi, U
        pi = pi_new

Policy iteration is guaranteed to converge in at most how many outer iterations for an MDP with |S| states and |A| actions?

|S| iterations (one per state) |A| × |S| iterations |A|^|S| iterations — the total number of deterministic policies (in practice, far fewer) It may not converge for some MDPs

Chapter 4: Value Iteration (Showcase)

What if we skipped the separate evaluation step entirely? Instead of first computing U^π and then extracting the greedy policy, we combine them into a single update. At every state, we immediately apply the optimal Bellman equation — the max over all actions — and use that as the new value estimate.

This is the Bellman backup (eq. 7.16), the engine of value iteration:

U_k+1(s) = max_a [ R(s,a) + γ ∑_s′ T(s′|s,a) U_k(s′) ]

Starting from U(s) = 0 everywhere and applying this update to every state repeatedly, we converge to the optimal value function U*, which satisfies the Bellman optimality equation (eq. 7.17):

U*(s) = max_a [ R(s,a) + γ ∑_s′ T(s′|s,a) U*(s′) ]

Why Does It Converge?

The max operator does not break the contraction. For any two value functions U₁ and U₂, applying the Bellman optimality operator T*:

||T*U₁ − T*U₂||_∞ ≤ γ ||U₁ − U₂||_∞

The key identity is |max_a f(a) − max_a g(a)| ≤ max_a |f(a) − g(a)|. Both players pick their best action; the gap between their best values can’t exceed the worst-case gap across all actions. And that worst-case gap is bounded by γ times the original difference in U₁, U₂. Since γ < 1, repeated application squeezes any initial error to zero.

Stopping and Error Bounds

We can’t measure ||U_k − U*|| directly (we don’t know U*). Instead, we track the Bellman residual:

δ = max_s |U_k+1(s) − U_k(s)|

The true error is bounded by: ||U_k − U*||_∞ ≤ δγ/(1−γ). And the policy loss — how much worse the extracted policy is than optimal — is bounded by: ||U^π − U*||_∞ ≤ 2εγ/(1−γ) where ε is the value error.

The (1−γ) problem: For γ = 0.99, the bound blows up by factor 99. A residual of δ = 0.01 only guarantees true error ≤ 0.01 × 0.99/0.01 = 0.99. You need δ ≈ 0.0001 to get true error below 0.01. This is why high discount factors make value iteration slow: the algorithm needs to run much longer before the residual criterion is tight.

Worked Example: 3-State MDP by Hand

States: s₀, s₁, s₂ (chain). Actions: stay (s), advance (a). R(s₁, advance) = 10; all else −1. Advance from s₀ goes to s₁; advance from s₁ goes to s₂; s₂ is absorbing (loops). Discount γ = 0.9.

k=0: U = [0, 0, 0].

k=1:
s₀: max(stay: −1+0.9×0=−1, advance: −1+0.9×0=−1) = −1
s₁: max(stay: −1+0.9×0=−1, advance: 10+0.9×0=10) = 10
s₂: max(stay: −1+0.9×0=−1, advance: −1+0.9×0=−1) = −1

k=2:
s₀: max(−1+0.9×(−1)=−1.9, −1+0.9×10=8) = 8 ← advance is now clearly better!
s₁: max(−1+0.9×10=8, 10+0.9×(−1)=9.1) = 9.1
s₂: max(−1+0.9×(−1)=−1.9, −1+0.9×(−1)=−1.9) = −1.9

k=3:
s₀: advance: −1+0.9×9.1=7.19
s₁: advance: 10+0.9×(−1.9)=8.29
s₂: — = −2.71

Already at k=2, value iteration “learned” that s₀ should advance (reaching s₁ gives 10). The backup propagates optimism: once s₁’s value is updated to 10, that knowledge immediately ripples to s₀ on the next step.

Value Iteration: Full Interactive Showcase

5×5 grid world. Drag sliders to change γ (discount) and slip probability (stochastic transitions). Step advances one Bellman backup. Run animates. Watch the Bellman residual decay and optimal policy arrows emerge.

k = 0 δ = —

γ0.90

slip %10%

Bellman residual δ over iterations (log scale). Convergence criterion: δ < 0.001.

python
def bellman_backup(mdp, U, s):
    """One Bellman backup at state s (Algorithm 7.7)."""
    return max(
        mdp.R(s, a) + mdp.gamma * sum(mdp.T(s, a, sp) * U[sp] for sp in mdp.S)
        for a in mdp.A
    )

def value_iteration(mdp, eps=1e-6, k_max=1000):
    """Algorithm 7.8: Value iteration with Bellman residual stopping."""
    U = {s: 0.0 for s in mdp.S}
    for k in range(k_max):
        U_new = {s: bellman_backup(mdp, U, s) for s in mdp.S}
        delta = max(abs(U_new[s] - U[s]) for s in mdp.S)
        U = U_new
        if delta < eps * (1 - mdp.gamma) / mdp.gamma:
            # Tighter stop: guarantees true error < eps
            break
    policy = {s: greedy(mdp, U, s) for s in mdp.S}
    return policy, U

Value iteration with γ = 0.99 takes many more iterations than γ = 0.5. Why?

High γ makes the reward function more complex The error bound is δγ/(1−γ); for γ=0.99 the denominator is 0.01, inflating the bound by 99×, requiring much tighter residual for the same accuracy guarantee High γ causes numerical instability in the Bellman backup Value iteration only works for γ < 0.9

Chapter 5: Asynchronous Value Iteration

Standard value iteration is synchronous: in each iteration, we compute new values for ALL states simultaneously, using only the old values. Then we swap in the new table and repeat. Clean, correct — but it ignores useful information. When we compute U(s₅), we already have a fresh U(s₁). Why not use it?

Asynchronous value iteration (Algorithm 7.9) updates states one at a time, in place, using the most recent value of every state already updated this sweep:

U(s) ← max_a [ R(s,a) + γ ∑_s′ T(s′|s,a) U(s′) ]

The difference: when we update U(s₅), we use the already-fresh U(s₁), U(s₂), etc. This is the Gauss-Seidel method, adapted from linear algebra to dynamic programming.

Why this helps: Imagine a linear chain 0 → 1 → 2 → 3 → [reward]. In synchronous VI, after iteration 1 only state 3 has a nonzero value. After iteration 2, only states 2 and 3. The reward propagates one step per iteration. In Gauss-Seidel VI sweeping left-to-right (3, 2, 1, 0), the same sweep propagates the reward from 3 all the way back to 0 in a single pass. It converges in 1 sweep instead of 4.

State Ordering Matters

The order in which we update states controls how fast information flows. There are three natural orderings:

Ordering	Behavior	Best for
Toward reward (backward)	Values propagate in one pass	Chain MDPs with known reward location
Away from reward (forward)	Values propagate slowly, one step per sweep	No benefit over synchronous VI
Random	Intermediate; converges but unpredictably	Unknown structure; still guaranteed to converge

Backward induction as a special case: If the state includes a time step t = 1, 2, …, H (finite horizon), then we can sweep backward from t=H to t=1. At t=H the values are just immediate rewards (no future). At t=H−1 we look one step ahead to the fresh t=H values. This propagates optimally in exactly H sweeps — the classic finite-horizon DP. Backward induction IS Gauss-Seidel VI with the perfect ordering.

One more key fact: even with a bad ordering, asynchronous VI cannot converge slower than synchronous VI. It can only be the same or faster. The only requirement for convergence: every state must be updated infinitely often. Any ordering that visits every state repeatedly works.

Worked Comparison: Same Problem, Two Orderings

Linear chain of 5 states (s₀→s₁→s₂→s₃→s₄). Reward at s₄: +10. γ = 0.9. All transitions deterministic (always advance).

Left-to-right ordering (s₀, s₁, s₂, s₃, s₄): Sweep 1 updates s₀ first — it sees U(s₁)=0, stays at 0. Then s₁ sees U(s₂)=0 → stays 0. … Only s₄ has the reward. After 1 sweep: same as synchronous VI. Needs ~5 sweeps to fully propagate.

Right-to-left ordering (s₄, s₃, s₂, s₁, s₀): Sweep 1 updates s₄ first: U(s₄) = 10. Then s₃ sees fresh U(s₄)=10: U(s₃) = −1 + 0.9×10 = 8. Then s₂ sees fresh U(s₃)=8: U(s₂) = 6.2. s₁: 4.58. s₀: 3.12. Full propagation in 1 sweep!

Synchronous vs. Gauss-Seidel Value Iteration

Same 6-cell chain. Left half: synchronous VI (updates all cells from old table each step). Right half: Gauss-Seidel VI. Choose sweep direction. Watch Gauss-Seidel converge faster with right-to-left ordering.

Sweep 0

python
def gauss_seidel_vi(mdp, ordering=None, k_max=1000, eps=1e-6):
    """Algorithm 7.9: Gauss-Seidel (asynchronous) value iteration."""
    # ordering: list of states in update order. Default: mdp.S as-is.
    ordering = ordering or mdp.S
    U = {s: 0.0 for s in mdp.S}
    for _ in range(k_max):
        delta = 0.0
        for s in ordering:           # update in-place, use fresh values immediately
            u_new = bellman_backup(mdp, U, s)
            delta = max(delta, abs(u_new - U[s]))
            U[s] = u_new              # write back immediately (in-place)
        if delta < eps:
            break
    return {s: greedy(mdp, U, s) for s in mdp.S}, U

# Backward induction: sweep from last time step to first
def backward_induction(mdp_h, H):
    """Finite-horizon DP: sweep backward from t=H. Returns policy for each t.
    mdp_h.S assumed to contain (state, time) tuples.
    """
    U = {(s, H): mdp_h.terminal_reward(s) for s in mdp_h.base_states}
    policies = {}
    for t in range(H - 1, -1, -1):    # sweep backward
        for s in mdp_h.base_states:
            best_a, best_v = None, -float('inf')
            for a in mdp_h.A:
                v = mdp_h.R(s, a) + mdp_h.gamma * sum(
                    mdp_h.T(s, a, sp) * U.get((sp, t+1), 0)
                    for sp in mdp_h.base_states
                )
                if v > best_v: best_a, best_v = a, v
            U[(s, t)] = best_v
            policies[(s, t)] = best_a
    return policies, U

In Gauss-Seidel value iteration, why does the order of state updates affect convergence speed?

Some orderings violate the Markov property, causing incorrect values Updated states immediately influence their neighbors; updating in the direction of value propagation lets information flow through the whole chain in one sweep instead of one step per sweep State ordering affects the discount factor applied at each state The ordering determines which states receive the reward signal first

Chapter 6: Linear Program Formulation

Value iteration and policy iteration are iterative algorithms — they converge asymptotically. Is there a way to compute U* in one shot, as a direct optimization? Yes: rewrite the Bellman optimality equation as a linear program.

The Bellman optimality equation requires U*(s) = max_a[R(s,a) + γ∑T(s′|s,a)U*(s′)] for every state s. The “max” is the hard part — it makes the equation nonlinear. The LP trick: replace the max with a constraint. Instead of saying “U(s) equals the best lookahead,” we say “U(s) is at least as large as every lookahead”:

U(s) ≥ R(s,a) + γ ∑_s′ T(s′|s,a) U(s′) ∀s, a

This is a system of |S|×|A| linear constraints in |S| variables. The feasible set is a convex polyhedron. But which point in this polyhedron should we pick? Any U satisfying the constraints is an overestimate of U*. The smallest feasible U is exactly U*.

So the LP is (eq. 7.19–7.20):

minimize ∑_s U(s)
subject to U(s) ≥ R(s,a) + γ ∑_s′ T(s′|s,a) U(s′) ∀ s ∈ S, a ∈ A

Why minimize the sum? Any solution satisfying all the constraints is an overestimate of U*. The objective “minimize ∑U(s)” pushes U as far down as possible while still satisfying every constraint. When U can’t be pushed any lower without violating some constraint, the active constraint at each state is U(s) = max_a[…] — the Bellman optimality equation. So the LP optimum IS U*.

Complexity: MDPs Are in P

The LP has |S| variables and |S|×|A| constraints. LP is solvable in polynomial time (via interior-point methods). Therefore, MDPs with finite state and action spaces are solvable in polynomial time — they lie in the complexity class P.

Property	Value
Variables	\|S\|
Constraints	\|S\| × \|A\|
Complexity	Polynomial (LP is in P)
Practical vs. VI	VI is usually faster in practice; LP matters theoretically

Why LP matters theoretically: Before this result it was conceivable that MDPs might require exponential time — there are |A|^|S| policies to search. The LP formulation proves that’s not necessary. The optimal value function is the vertex of a convex polytope that LP can find efficiently. This places exact MDP solution in a very different class from, say, general game-playing, which is PSPACE-hard.

A 2-State Example by Hand

Two states s₀, s₁. Two actions: left (L), right (R). γ = 0.9. Transitions: L always stays; R from s₀ goes to s₁; R from s₁ stays. Rewards: R(s₀,L)=−1, R(s₀,R)=−1, R(s₁,L)=−1, R(s₁,R)=10.

LP constraints (variables: U₀, U₁):

From s₀, action L: U₀ ≥ −1 + 0.9 U₀ → 0.1 U₀ ≥ −1 → U₀ ≥ −10

From s₀, action R: U₀ ≥ −1 + 0.9 U₁

From s₁, action L: U₁ ≥ −1 + 0.9 U₁ → 0.1 U₁ ≥ −1 → U₁ ≥ −10

From s₁, action R: U₁ ≥ 10 + 0.9 U₁ → 0.1 U₁ ≥ 10 → U₁ ≥ 100

The binding constraint on U₁ is U₁ = 100. Then U₀ ≥ −1 + 0.9×100 = 89. Minimize: U₀ = 89, U₁ = 100. The optimal policy: in s₀, go right (worth 89); in s₁, go right (keeps collecting reward 10, discounted sum = 100).

LP Feasible Region: 2-State MDP

The two LP variables are U₀ and U₁. Each constraint is a half-plane. The feasible region (shaded) is their intersection. The objective “minimize U₀+U₁” sweeps a line; the LP optimum is where it first enters the feasible region. Drag γ to see constraints shift.

γ 0.80

python
from scipy.optimize import linprog
import numpy as np

def lp_solve(mdp):
    """Algorithm 7.10: Solve MDP via linear programming."""
    S, A, gamma = mdp.S, mdp.A, mdp.gamma
    n = len(S)
    idx = {s: i for i, s in enumerate(S)}

    # Objective: minimize sum of U(s)
    c = np.ones(n)

    # Constraints: U(s) >= R(s,a) + gamma * sum_s' T(s'|s,a) U(s')
    # Rearranged: -U(s) + gamma * sum_s' T(s'|s,a) U(s') <= -R(s,a)
    rows = []
    rhs = []
    for s in S:
        for a in A:
            row = np.zeros(n)
            row[idx[s]] = -1.0
            for sp in S:
                row[idx[sp]] += gamma * mdp.T(s, a, sp)
            rows.append(row)
            rhs.append(-mdp.R(s, a))

    res = linprog(c, A_ub=np.array(rows), b_ub=np.array(rhs), method='highs')
    U = {s: res.x[idx[s]] for s in S}
    return {s: greedy(mdp, U, s) for s in S}, U

The LP formulation minimizes ∑U(s). Why minimize rather than maximize?

Minimizing gives the pessimistic policy, which is safer Any U satisfying all constraints is an overestimate of U*; minimizing pushes U down until each constraint becomes tight — exactly the Bellman optimality equation LP solvers require minimization; we negate to convert maximization problems Smaller U values correspond to more conservative policies

Chapter 7: Linear Quadratic Regulator

Every method so far assumed a finite state and action space. But most real physical systems — a pendulum, a drone, a robotic arm — have continuous states and actions. You can’t enumerate them. Can we still compute an exact optimal policy?

Yes — for one important special case: when dynamics are linear and rewards are quadratic. This is the Linear Quadratic Regulator (LQR), and it has a closed-form solution.

The Setup

Linear dynamics (eq. 7.22): The next state is a linear function of the current state and action, plus Gaussian noise:

s′ = T_s · s + T_a · a + w, w ∼ N(0, Σ_w)

Here T_s is the state transition matrix (n×n), T_a is the action transition matrix (n×m), and w is zero-mean Gaussian noise.

Quadratic reward (eq. 7.23): The reward penalizes both large states and large actions:

R(s,a) = s^T R_s s + a^T R_a a

R_s is negative semidefinite (penalizes large states); R_a is negative definite (penalizes large actions). The goal: keep the system near zero with minimal control effort.

Why quadratic? Any smooth reward can be approximated as quadratic near its optimum. LQR is the second-order Taylor expansion of general optimal control. When you linearize a nonlinear system and quadratically approximate its cost, you get LQR — which is why it’s the foundation of iLQR, DDP, and most trajectory optimization methods used in robotics today.

The Solution

The optimal value function turns out to be quadratic: U_h(s) = s^TV_hs + q_h. The optimal policy turns out to be linear: π_h(s) = L_h · s.

The gain matrix L_h (eq. 7.29):

L_h = −(T_a^T V_h−1 T_a + R_a)⁻¹ T_a^T V_h−1 T_s

V_h satisfies the discrete-time Riccati equation (eq. 7.30):

V_h+1 = R_s + T_s^TV_hT_s − T_s^TV_hT_a(T_a^TV_hT_a + R_a)⁻¹T_a^TV_hT_s

Certainty equivalence: L_h and V_h don’t depend on the noise covariance Σ_w at all. The optimal control policy for a stochastic LQR problem is identical to the optimal policy for the deterministic version (w = 0). You can design the controller as if you had no noise, and it remains optimal. The noise affects q_h (the expected cost), but not the gains.

Worked Example 7.4: Position-Velocity Control

State: s = [x, v]. Action: scalar acceleration a. Timestep Δt = 0.1. Dynamics: x′ = x + vΔt + ½aΔt², v′ = v + aΔt. Reward: R(s,a) = −x² − v² − 0.5a².

In matrix form: T_s = [[1, 0.1],[0, 1]], T_a = [[0.005],[0.1]], R_s = diag(−1,−1), R_a = [−0.5].

Starting at s₀ = [−10, 0], the Riccati recursion gives gains:

Horizon h	Gain L_h
h = 1	[0, 0] — last step, zero gain
h = 2	[−0.286, −0.857]
h = 3	[−0.462, −1.077]
h = 4	[−0.496, −1.112]
h = 5	[−0.504, −1.124] ← essentially converged

With a longer horizon the controller pushes harder early: at x=−10, h=5 gives initial action ≈ +5.04, rapidly accelerating toward the origin, while h=2 gives only +2.86.

LQR Phase Plane Simulation

Position-velocity phase plane. The trajectory starts at the dot and converges to the origin under LQR control. Drag the starting dot. Change horizon h and watch the gain matrix update and the trajectory reshape.

horizon h 5

L = [−0.504, −1.124]

python
import numpy as np

def lqr(Ts, Ta, Rs, Ra, h_max=50):
    """Algorithm 7.11: LQR via backward Riccati iteration."""
    n = Ts.shape[0]
    V = np.zeros((n, n))    # V_0 = 0 at terminal step
    gains = []
    for _ in range(h_max):
        M = Ta.T @ V @ Ta + Ra
        L = -np.linalg.solve(M, Ta.T @ V @ Ts)   # eq 7.29
        gains.append(L)
        V = Rs + Ts.T @ V @ Ts + Ts.T @ V @ Ta @ L  # Riccati, eq 7.30
    return gains   # gains[h-1] = L_h

# Example 7.4 setup
dt = 0.1
Ts = np.array([[1.0, dt], [0.0, 1.0]])
Ta = np.array([[0.5*dt**2], [dt]])
Rs = np.diag([-1.0, -1.0])
Ra = np.array([[-0.5]])
gains = lqr(Ts, Ta, Rs, Ra, h_max=10)
print("L_5 =", gains[4])    # [[-0.504, -1.124]]

# Simulate trajectory from s0 = [-10, 0]
s = np.array([-10.0, 0.0])
L = gains[-1]   # use converged gain
trajectory = [s.copy()]
for _ in range(50):
    a = L @ s        # linear policy: a = L*s
    s = Ts @ s + Ta.reshape(2) * a[0]
    trajectory.append(s.copy())

In LQR, the optimal gain matrix L_h does not depend on the noise covariance Σ_w. This result is called:

The Riccati property — the matrix equation absorbs noise automatically The separation principle — control and estimation objectives decouple Certainty equivalence — the optimal stochastic policy equals the deterministic optimal policy Linear duality — noise and reward are exchangeable in the Riccati equation

Chapter 8: Value Iteration Sandbox

You’ve seen value iteration on fixed examples. Now you’re in control. Below is a fully configurable grid world: choose the size, place rewards and walls by clicking, set the discount factor and stochastic slip probability. Run value iteration step by step or let it animate to convergence. Watch the Bellman residual decay. Extract and display optimal policy arrows.

This is THE showcase. Every concept from Chapters 1–5 is visible here simultaneously: value propagation from rewards backward, the effect of γ on how far values reach, stochasticity spreading value to neighbors, the residual plot showing contraction, policy arrows emerging from the value function.

Value Iteration: Full Interactive Sandbox

Left-click a cell to cycle: empty → wall → reward (+10) → penalty (−5) → empty. Then run value iteration and watch optimal arrows emerge. The residual plot (below) shows δ on a log scale.

Bellman residual δ_k = max_s|U_k+1(s)−U_k(s)| (log scale). Algorithm stops when δ < 0.01.

k = 0 δ = —

grid

γ0.90

slip %10%

Experiments to try: (1) Set γ = 0.5 and place a reward far from center — watch values fade to near zero before they reach the start. (2) Set slip = 40% and watch how values “spread” laterally. (3) Place a −5 penalty adjacent to the +10 reward — watch the optimal policy route around it. (4) Create a bottleneck corridor — all paths must squeeze through — and watch values drop as you force longer detours.

You run value iteration on a 6×6 grid with γ = 0.9 and a single reward at the goal. After 1 step, which cells have nonzero values?

All cells that can reach the goal in one step have nonzero values Only the goal cell itself — the Bellman backup at the goal gives a nonzero value, all others still see zero neighbors All cells get equal fractional values from the discount No cells have nonzero values after 1 step because the initial estimate is zero

Chapter 9: Policy vs. Value Iteration Side by Side

Both methods solve the same problem and both produce the same answer: the optimal policy. But they take very different paths to get there. The comparison below runs them simultaneously on an identical MDP so you can see the tradeoff directly.

Policy Iteration: Each outer iteration is expensive — a full policy evaluation before each improvement. But the number of outer iterations is tiny (often 2–5 even for large grids).

Value Iteration: Each iteration is cheap — just one Bellman backup per state. But many iterations are needed for the Bellman residual to shrink below the stopping threshold.

The fundamental tradeoff: Policy iteration does more work per step (policy evaluation = solving a linear system) but fewer steps. Value iteration does less work per step (one backup) but more steps. For small state spaces where O(|S|³) matrix solves are cheap, PI wins. For larger spaces where each linear solve is expensive, VI wins. The crossover depends on the specific MDP.

PI vs VI: Head-to-Head on the Same Grid

Both algorithms run on the same 5×5 grid. Left: policy iteration (shows current policy arrows). Right: value iteration (shows current value heatmap). Step both simultaneously and compare iteration counts.

      PI outer iters: 0
      VI total iters: 0

Per-Iteration Cost Breakdown

For a grid with N = |S| states and A = |A| actions:

Algorithm	Cost per outer iteration	Number of outer iterations
Policy Iteration	O(N³) for linear solve + O(N²A) for improvement	Often 2–10 in practice
Modified PI (m sweeps)	O(m · N²A)	More than PI, fewer than VI
Value Iteration	O(N²A)	O(log(1/ε) / log(1/γ))

For the 5×5 grid above (N=25, A=4): PI costs 25³ ≈ 15,000 operations per outer iter but typically finishes in 3 outer iters. VI costs 25²×4 = 2,500 operations per iter but needs ∼50–200 iters for γ=0.9. Total: PI ≈ 45,000 vs. VI ≈ 125,000–500,000. PI wins for small grids.

For a very large MDP with |S| = 10,000 states, why is value iteration usually preferred over policy iteration?

Value iteration always converges in fewer total steps Policy iteration cannot handle more than 1,000 states Policy evaluation requires solving a 10,000×10,000 linear system (O(N³) = 10¹² operations per outer iteration) — far more expensive than O(N²A) VI steps Value iteration uses less memory

Chapter 10: Method Comparison

Five exact methods, each with its own niche. Here’s the map:

Method	Per iteration	Outer iterations	Memory	Best for
Policy Iteration	O(\|S\|³) + O(\|S\|²\|A\|)	2–10 in practice	O(\|S\|²)	Small state spaces (<1,000 states)
Value Iteration	O(\|S\|²\|A\|)	O(log(1/ε)/log(1/γ))	O(\|S\|)	Medium state spaces
Async VI (Gauss-Seidel)	O(\|S\|\|A\|) per sweep	Depends on ordering	O(\|S\|)	When good ordering is known
Linear Program	Polynomial (interior point)	One solve	O(\|S\|²)	Theoretical guarantee; rarely practical
LQR	O(n³) per Riccati step	h steps (horizon)	O(n²)	Linear dynamics, quadratic cost

When Each Method Breaks

Policy Iteration breaks when |S| is large (thousands+). The O(|S|³) linear solve per iteration becomes prohibitive. Even sparse matrix solves have poor scaling for grid worlds beyond ∼100×100.

Value Iteration breaks when γ is close to 1. The convergence rate is γ^k, so γ = 0.999 needs ∼6,900 iterations to reduce error by 1000×. Every digit of γ closer to 1 roughly doubles the iteration count.

Async VI breaks when there’s no good state ordering. For ergodic MDPs without obvious structure, random ordering may not be much faster than synchronous VI.

LQR breaks when dynamics are nonlinear or the reward is non-quadratic. Even small nonlinearities can make the true optimal policy dramatically different from the LQR policy. Extensions like iLQR and DDP address this by iterating LQR around a nominal trajectory.

The Curse of Dimensionality

All exact methods share a fatal flaw: they require enumerating all states. A robot with 10 joints, each discretized to 100 positions, has 100¹⁰ = 10²⁰ states. No computer can enumerate them. This is the curse of dimensionality — the fundamental barrier that drives the entire field of approximate dynamic programming (Chapter 8) and model-based RL (Chapter 9).

ACAS X scale: The aircraft collision avoidance system has ~500 million states. It was solved by value iteration on a supercomputer, exploiting the structured (grid-like) nature of the state space. That’s near the practical limit for exact methods. For reference, a full humanoid robot has roughly 10⁵⁰ states when discretized at millimeter resolution — exact methods are irrelevant at that scale.

A continuous state space robot arm has 6 joints, each ranging 0–360°. If you discretize each joint to 1° resolution, how many states are there?

6 × 360 = 2,160 states 360⁶ / 6 ≈ 7.1 × 10¹³ states 360⁶ ≈ 2.2 × 10¹⁵ states — exact DP is infeasible 6³⁶⁰ states — one for each degree of each joint

Chapter 11: Connections and What’s Next

Chapter 7 is the foundation. Everything that follows in the book is an answer to one of three problems that exact methods cannot handle: state spaces that are too large, models that are unknown, and partial observability.

Ch 7: Exact Methods (this chapter)

Value iteration, policy iteration, LQR — optimal when state space is small and model is known

↓ when |S| is too large ↓

Ch 8: Approximate Value Functions

Parametric approximators (linear, neural networks) replace the full value table. LSTD, DQN, fitted VI.

↓ when the model is unknown ↓

Ch 9: Online Planning

MCTS, rollout policy, sparse sampling — plan at decision time using Monte Carlo estimates instead of precomputed U*.

↓ when we want to learn the policy directly ↓

Ch 10–13: Policy Gradient Methods

REINFORCE, actor-critic, PPO — directly optimize the policy parameters π_θ by gradient ascent on expected return. No value function required (or use one as a baseline).

↓ when state is partially observed ↓

Ch 6: POMDPs (partially observable MDPs)

The agent can’t see s directly. Ch 6 covers belief state MDPs, point-based solvers — all built on the same Bellman structure as Ch 7.

Equation Cheat Sheet

Name	Equation
Bellman Expectation	U^π(s) = R(s,π(s)) + γ∑_s′T(s′\|s,π(s))U^π(s′)
Bellman Optimality	U(s) = max_a[R(s,a) + γ∑_s′T(s′\|s,a)U(s′)]
Q-Function	Q(s,a) = R(s,a) + γ∑_s′T(s′\|s,a)U(s′)
Greedy Policy	π(s) = arg max_a Q(s,a)
Policy Iteration	Evaluate U^π exactly, then π ← greedy(U^π), repeat
Value Iteration	U(s) ← max_a[R(s,a)+γ∑T(s′\|s,a)U(s′)], repeat
VI Error Bound	\|\|U_k−U*\|\|_∞ ≤ δγ/(1−γ) where δ = max\|U_k+1−U_k\|
LQR Gain	L_h = −(T_a^TV_h−1T_a+R_a)⁻¹T_a^TV_h−1T_s
Riccati Equation	V_h+1 = R_s + T_s^TV_hT_s − T_s^TV_hT_a(T_a^TV_hT_a+R_a)⁻¹T_a^TV_hT_s

Related Lessons

These micro-lessons complement Chapter 7:

microLearning: Markov Decision Processes — visual intro to MDPs and value iteration from scratch
microLearning: RL Algorithms — Q-learning, SARSA, policy gradient when the model is unknown

The key insight of this chapter: The Bellman equation — “the value of a state equals the immediate reward plus the discounted expected value of the next state” — is a fixed-point equation. All exact algorithms (PI, VI, async VI, LP) are different strategies for finding that fixed point. LQR is the special case where the fixed point has a closed form. Everything downstream is an approximation of this fixed point when it can’t be computed exactly.

“An optimal policy has the property that whatever the initial state and initial decision are, the remaining decisions must constitute an optimal policy with regard to the state resulting from the first decision.”
— Richard Bellman, Dynamic Programming, 1957

Why can’t any of the exact methods in Chapter 7 solve a humanoid robot’s locomotion problem directly?

The Bellman optimality equation doesn’t apply to continuous action spaces The state space is astronomical (curse of dimensionality) — even storing one value per state requires more memory than exists, let alone iterating over them Policy iteration diverges for non-convex reward functions LQR only applies when there’s a single action dimension