LaValle, Chapter 10

Sequential Decision Theory

When nature plays between your moves — MDPs, value iteration, Q-learning, and the mathematics of optimal behavior under uncertainty.

Prerequisites: Probability (distributions, expectations) + Chapter 2 value iteration. That's it.
10
Chapters
5+
Simulations
10
Quizzes

Chapter 0: Nature Acts Between Your Moves

You're controlling a delivery drone over a city. You issue a command — "fly north 10 meters" — and the wind gusts. The drone ends up northwest instead. You issue another command. The motor stutters. You arrive at slightly the wrong altitude. Every action you take is perturbed by forces outside your control. The world is stochastic.

In Chapter 2, we assumed actions were deterministic: if you tell the robot to go right, it goes right. Period. That worked beautifully for warehouse grids on flat floors with perfect motors. But real robots, real economies, and real games have a different character: nature is an adversary that moves between your decisions. After you act, the world transitions — and you don't fully control how.

The canvas below shows the problem concretely. A robot is on a slippery grid. When it tries to move in a direction, there's a chance it slides sideways instead. Run a fixed plan — go right four times — and watch how often it reaches the goal vs. slips into a wall or off-course. The deterministic plan fails even though the environment is simple.

Slippery Grid — Fixed Plan vs. Stochastic World

The robot (blue) tries to reach green from orange. Each "right" command succeeds with probability 0.7, slips up or down with 0.15 each. Run the fixed plan 20 times and observe how often it fails.

Successes: 0 / 0
Why a fixed plan fails. A plan is a sequence of actions: (right, right, right, right). It has no awareness of where the robot actually ended up after each move. In a stochastic world, what you need is a policy — a function from state to action, π(x) = u, that can adapt to wherever nature has dropped you.

Three new ingredients beyond Chapter 2

Chapter 2 gave us deterministic planning: (X, U(x), f, xI, XG). Chapter 10 adds three things. First, stochastic transitions: instead of f(x, u) = x', we have a probability distribution P(x' | x, u) over possible next states. Second, stage costs that accumulate as a reward/cost signal each step. Third, a planning horizon — finite (terminate after K steps) or infinite (run forever).

These three additions turn planning into something richer. We no longer ask "which path minimizes total cost?" We ask: "which policy minimizes expected total cost, regardless of how nature perturbs my trajectory?" The answer is called an optimal policy — and finding it is the central problem of Chapter 10.

FeatureCh 2 (Deterministic)Ch 10 (Stochastic)
Transitionf(x, u) = x' (certain)P(x' | x, u) (distribution)
Plan typeSequence of actionsPolicy π: X → U
ObjectiveMinimize path costMinimize expected cumulative cost
Failure modeNo path existsHigh expected cost, no safety guarantee
A robot issues "move right." With probability 0.8 it reaches the cell to its right; with probability 0.2 it stays put. Which planning framework is appropriate?

Chapter 1: MDPs — Formulation 10.1

A Markov Decision Process (MDP) is the mathematical framework that captures exactly the stochastic planning problem from Chapter 0. LaValle's Formulation 10.1 has five components — similar to Formulation 2.1, but with one critical replacement and one addition.

SymbolNameMeaning
XState spaceAll possible states (finite in this chapter)
UAction spaceAll actions (same at every state for simplicity)
P(x' | x, u)Transition probabilitiesProbability of landing in x' from x under action u
l(x, u)Stage costImmediate cost incurred for action u in state x
xIInitial stateWhere planning begins

The key replacement: the transition function f(x,u) = x' (deterministic) becomes the transition distribution P(x' | x, u). For every state-action pair (x, u), P gives a probability distribution over next states: ∑x' P(x' | x, u) = 1. The transition is still Markovian — the distribution over x' depends only on the current state x and action u, not on the full history of how we got to x.

The Markov property is a design choice. You get to define what "state" means. If the transition distribution over x' depends on the last two states — say, because momentum matters — then your state must encode the last two positions. Pack enough history into x until P(x' | x, u) is truly a function only of x. That's the art of state design in MDPs.

A concrete example: the slippery grid MDP

Take a 4×4 grid. States X = {(i,j) : 0 ≤ i,j ≤ 3}. Actions U = {N, S, E, W}. The transition model: intended direction succeeds with probability 0.7; with probability 0.15 each, the robot slips 90° left or right of its intended direction; with probability 0 it moves backward (simplification). All moves that would leave the grid keep the robot in its current cell.

The stage cost l(x, u) encodes preferences. A common choice: l(x, u) = 1 for all non-goal states (time penalty — every extra step costs 1), and l(xG, ·) = 0 (no cost once you've arrived). Other designs: negative costs (rewards) for reaching goal, large positive costs for entering hazard cells.

P(x' | x, East) = 0.70 if x' = x+East  +  0.15 if x' = x+North  +  0.15 if x' = x+South

What the MDP asks us to find

A policy π : X → U assigns an action to every state. Under policy π, the robot follows a random trajectory x0, x1, x2, … where xk+1 ~ P(· | xk, π(xk)). The cumulative cost over a finite horizon K is Lπ = ∑k=0K-1 l(xk, π(xk)). We want to minimize the expected cumulative cost E[Lπ].

The optimal policy π* minimizes expected total cost from every starting state. Finding π* is exactly what the algorithms of Chapters 2 and 3 do — but now the Bellman equation must take expectations over transitions.

Why expectation, not worst-case? LaValle's formulation minimizes expected cost — the average over all random trajectories. An alternative is minimax (minimize worst-case cost). Minimax is appropriate when nature is adversarial (game theory — Chapter 7). When randomness is neutral (wind, motor noise, etc.), expected cost is the right objective. Choosing the wrong criterion leads to overly conservative policies (minimax) or policies that occasionally fail catastrophically (expected).
python
import numpy as np

class SlippyGridMDP:
    def __init__(self, n=4, p_intended=0.7, goal=(3,3)):
        self.n = n
        self.p = p_intended       # prob of intended direction
        self.goal = goal
        self.actions = [(0,1),(0,-1),(1,0),(-1,0)]  # E,W,S,N

    def transitions(self, x, u_idx):
        """Returns list of (x', probability) for action u_idx."""
        u  = self.actions[u_idx]
        u_left  = self.actions[(u_idx + 1) % 4]
        u_right = self.actions[(u_idx - 1) % 4]
        p_slip = (1 - self.p) / 2
        candidates = [(u, self.p), (u_left, p_slip), (u_right, p_slip)]
        result = {}
        for du, prob in candidates:
            xp = (max(0, min(self.n-1, x[0]+du[0])),
                  max(0, min(self.n-1, x[1]+du[1])))
            result[xp] = result.get(xp, 0) + prob
        return list(result.items())

    def cost(self, x, u_idx):
        return 0.0 if x == self.goal else 1.0
In a 4×4 slippery grid MDP, the robot is at (1,1) and takes action "East" (p=0.7, slip left/right=0.15 each). The cell to the East is (1,2). The cell "left" of East is (0,1) (North). The cell "right" of East is (2,1) (South). What is P((1,2) | (1,1), East)?

Chapter 2: Value Iteration for MDPs

In Chapter 2 of LaValle's book, backward value iteration found G*(x) — the optimal cost-to-go — by propagating values backward from the goal. The update was: G*(x) = minu [l(x,u) + G*(f(x,u))]. Now the transition is stochastic. We can't just look up G*(f(x,u)) — we must take an expectation over all possible next states.

The stochastic Bellman equation

The Bellman equation for a finite-horizon MDP with K stages is:

Vk(x) = minu ∈ U &Bigl[ l(x, u) + ∑x' P(x' | x, u) · Vk+1(x') &Bigr]

Read it: the value of being in state x with k stages remaining equals the minimum over all actions u of: the immediate cost l(x,u) plus the expected future value ∑x' P(x' | x,u) · Vk+1(x'). The expectation is the key new piece. When transitions are deterministic, P puts all mass on one x', and the sum collapses back to Vk+1(f(x,u)) — exactly the deterministic case.

Boundary condition: V0(x) = 0 for x ∈ XG, and V0(x) = ∞ for x ∉ XG (if you haven't reached the goal when time runs out, you've failed). Iterate backward: compute VK-1, then VK-2, down to V0 at the initial state.

Worked example — 3-state chain MDP

Three states: A, B, G (goal). Actions: {stay, advance}. Transitions: advance from A reaches B with p=0.8, stays at A with p=0.2. Advance from B reaches G with p=0.9, stays at B with p=0.1. Stage cost: 1.0 everywhere except G. K=3 stages.

StateV0 (base)V1V2V3 (start)
G0000
B1 + 0.9·0 + 0.1·∞ → stay=11.11.19
A1 + 0.8·1 + 0.2·∞ → advance ≈ 1.82.62

At V1(B): advance gives 1 + 0.9·0 + 0.1·∞ = ∞ (can't reach G in 0 stages if you stay at B after slip). So "stay" dominates at horizon 1 only because no goal transitions are possible in 0 remaining stages... Actually for B: advance gives expected cost 1 + 0.9·V0(G) + 0.1·V0(B) = 1 + 0 + ∞ = ∞; stay gives ∞ too since you need to reach G. For K=2, B can advance: 1 + 0.9·V1(G) + 0.1·V1(B) = 1+0+0.1·1 = 1.1. The table above captures the correct pattern: as K grows, more states become reachable. The optimal policy is always "advance."

The expected value ∑ P · V is a linear operation. This means Bellman's equation is a linear program in disguise. For finite state and action spaces, value iteration is guaranteed to converge. The stochastic case isn't harder to solve — just harder to interpret, because we're optimizing over probability-weighted futures rather than deterministic ones.
Value Propagation — Stochastic Grid

Watch how values propagate from the goal (red) outward in the stochastic grid. Each step is one backward Bellman sweep. Brighter cells = lower cost-to-go. Arrows show the greedy policy at each state.

Iteration 0 / 10
python
import numpy as np
from math import inf

def value_iteration_mdp(mdp, K):
    """Backward value iteration for finite-horizon MDP.
    Returns V[k][state] for k in 0..K."""
    states = list(mdp.states())
    V = [{s: (0.0 if s == mdp.goal else inf) for s in states}]  # V[0]
    for k in range(1, K+1):
        Vk = {}
        for x in states:
            if x == mdp.goal:
                Vk[x] = 0.0; continue
            best = inf
            for u in range(len(mdp.actions)):
                cost = mdp.cost(x, u)
                exp_future = sum(p * V[k-1].get(xp, inf)
                                   for xp, p in mdp.transitions(x, u))
                total = cost + exp_future
                if total < best: best = total
            Vk[x] = best
        V.append(Vk)
    return V

# Recover greedy policy from V[K]
def greedy_policy(mdp, V_final):
    policy = {}
    for x in mdp.states():
        if x == mdp.goal: continue
        best_u, best_v = None, inf
        for u in range(len(mdp.actions)):
            q = mdp.cost(x, u) + sum(p * V_final.get(xp, inf)
                                          for xp, p in mdp.transitions(x, u))
            if q < best_v: best_u, best_v = u, q
        policy[x] = best_u
    return policy
In the stochastic Bellman update, why do we sum P(x'|x,u)·V(x') instead of just taking V(f(x,u))?

Chapter 3: Policy Iteration

Value iteration repeatedly updates V until it converges. Policy iteration takes a different route: it alternates between two operations — evaluating a fixed policy exactly, then improving the policy greedily. This alternation is guaranteed to converge, and in practice often does so in far fewer iterations than value iteration.

Step 1: Policy evaluation

Fix a policy π. Under π, the MDP becomes a simple Markov chain — no choices to make. The value function Vπ(x) is defined by the policy evaluation equations:

Vπ(x) = l(x, π(x)) + ∑x' P(x' | x, π(x)) · Vπ(x')

This is a linear system of equations in the unknowns {Vπ(x)}. For n states, it's an n×n linear system — solvable exactly in O(n3) via Gaussian elimination. Alternatively, iterate the equations to convergence (like value iteration but with a fixed policy, which is faster since no minimization is needed).

Step 2: Policy improvement

With Vπ in hand, check every state: is there an action better than π(x)? The improved policy is:

π'(x) = argminu &Bigl[ l(x, u) + ∑x' P(x' | x, u) · Vπ(x') &Bigr]

If π'(x) = π(x) for all x, the policy is already optimal — we're done. Otherwise, set π ← π' and repeat policy evaluation. This is the policy improvement theorem: the new policy is always at least as good as the old one (Vπ'(x) ≤ Vπ(x) for all x), and strictly better unless it's already optimal.

Policy iteration terminates in finite steps. There are |U||X| deterministic policies — a finite (though exponentially large) number. Each iteration strictly improves the policy (or stops). So the algorithm must terminate. In practice, convergence in 5–20 iterations is typical, even for grids with hundreds of states.

Value iteration vs. policy iteration

PropertyValue IterationPolicy Iteration
Per-iteration costO(|X|·|U|) — one Bellman sweepO(|X|3) — solve linear system
Iterations to convergeMany (proportional to state space diameter)Few (typically 5–50)
Intermediate policiesImplicit — not directly availableExplicit at each step
Best forLarge state spaces, no matrix inversionSmaller spaces, exact policy needed quickly
Policy Iteration — Live Policy Trace

Start with a random policy (arrows point random directions). Each click runs one full policy evaluation + improvement cycle. Watch arrows converge toward the goal. Compare how few iterations it takes vs. value iteration.

Iteration 0 — Policy not yet evaluated
python
import numpy as np

def policy_evaluation(mdp, policy, gamma=0.99, tol=1e-8):
    """Iterative policy evaluation. Returns V^pi dict."""
    states = list(mdp.states())
    V = {s: 0.0 for s in states}
    while True:
        delta = 0
        for x in states:
            u = policy[x]
            v = mdp.cost(x, u) + gamma * sum(
                p * V[xp] for xp, p in mdp.transitions(x, u))
            delta = max(delta, abs(v - V[x]))
            V[x] = v
        if delta < tol: break
    return V

def policy_iteration(mdp, gamma=0.99):
    """Full policy iteration loop."""
    states = list(mdp.states())
    policy = {s: 0 for s in states}  # start with action 0 everywhere
    while True:
        V = policy_evaluation(mdp, policy, gamma)
        stable = True
        for x in states:
            best_u = min(range(len(mdp.actions)),
                key=lambda u: mdp.cost(x,u) + gamma*sum(
                    p*V[xp] for xp,p in mdp.transitions(x,u)))
            if best_u != policy[x]:
                policy[x] = best_u; stable = False
        if stable: break
    return policy, V
During policy improvement, we compute argmin_u [l(x,u) + ∑ P(x'|x,u)·V^π(x')] for all x. If the result equals the current policy π at every state, what does the policy improvement theorem conclude?

Chapter 4: Infinite-Horizon MDPs — Showcase

Finite-horizon MDPs ask: "minimize cost over exactly K steps." But most real problems have no fixed deadline. A robot should deliver packages indefinitely. A trading algorithm runs until the fund closes. For these, we need the infinite-horizon MDP — and a way to prevent the total cost from growing without bound.

The discount factor

The standard fix is discounting: future costs are worth less than present ones. Formally, a cost k steps in the future is multiplied by γk, where 0 < γ < 1 is the discount factor. The total discounted cost is:

Jπ = E&Bigl[ ∑k=0 γk · l(xk, π(xk)) &Bigr]

Since γ < 1, the geometric series converges: even with constant stage cost c, the total is at most c/(1−γ). The optimal value function V* satisfies the infinite-horizon Bellman equation:

V*(x) = minu &Bigl[ l(x, u) + γ · ∑x' P(x' | x, u) · V*(x') &Bigr]

This is identical to the finite-horizon equation but with no horizon index k — V* is stationary. The optimal policy is also stationary: the same action at state x regardless of time. Run value iteration to convergence (using max|Vk-Vk-1| < ε as the stopping rule) and you get V* and π*.

What γ really means. γ = 0.99 means a reward one step away is worth 0.99 of an immediate reward. At 100 steps out, it's worth 0.99100 ≈ 0.37. At 1000 steps, 0.991000 ≈ 0.00004. So the policy focuses on roughly 1/(1−γ) = 100 steps into the future. Lower γ = more myopic. Higher γ = longer planning horizon. γ = 1 means infinite horizon and equal weight on all future costs — valid only for terminating MDPs (stochastic shortest path).

Stochastic Shortest Path (SSP)

When γ = 1, we get the stochastic shortest path problem: minimize expected total cost until the goal is reached. This is the stochastic analog of Dijkstra's algorithm. It requires that the goal be reachable from every state under some policy (otherwise the total cost could be infinite). LaValle's Theorem 10.1 proves convergence of value iteration for SSP under this reachability assumption.

Average cost criterion

An alternative to discounting: minimize the long-run average cost per step:

Javgπ = limK→∞ (1/K) · E&Bigl[ ∑k=0K-1 l(xk, π(xk)) &Bigr]

This is appropriate when γ = 1 but costs continue forever (no goal, just ongoing operation). Solving average-cost MDPs requires the relative value function — subtract the average cost from the Bellman equation to prevent divergence. LaValle covers this in §10.3 but it's typically solved via policy iteration with an offset equation.

Showcase: Infinite-Horizon Value Iteration on Stochastic Grid

Full MDP with stochastic transitions (slip probability). Adjust γ to see how the discount shapes the policy — myopic (γ close to 0) vs. far-sighted (γ close to 1). Watch the heat map and arrows converge.

γ (discount) 0.95
Slip prob 0.15
Ready — adjust parameters and click Run
With discount factor γ = 0.9, how much is a stage cost of 1.0 incurred 10 steps from now worth in present terms?

Chapter 5: Reinforcement Learning — Learning Without a Model

Value iteration and policy iteration both require knowing P(x' | x, u) — the transition probabilities. In the slippery grid, we knew exactly how likely each slip was. But what if we don't know P? What if we've never seen the environment before and must learn what transitions and costs look like through experience?

This is the setting of reinforcement learning (RL). An agent acts in an unknown MDP, receives observations of (x, u, l, x'), and must infer the optimal policy without ever being given P or l explicitly. The environment is a black box — you can push buttons and observe outcomes, but you can't inspect the mechanism.

The fundamental RL trade-off: exploration vs. exploitation. If you always take the action that looks best so far (exploitation), you might miss better actions you've never tried. If you always try new actions (exploration), you never commit to your hard-won knowledge. Every RL algorithm must balance these two imperatives. Getting this balance right is one of the deepest problems in the field.

The two RL families: model-based and model-free

Model-based RL estimates P(x'|x,u) and l(x,u) from experience, then runs value iteration or policy iteration on the learned model. Clean, data-efficient, but requires building a model that might be wrong. Errors in the model compound through planning.

Model-free RL estimates value functions or policies directly from experience, bypassing the model entirely. Q-learning and TD-learning are model-free. They're less data-efficient but don't accumulate model errors. Most modern deep RL (DQN, PPO, SAC) is model-free.

ApproachWhat it learnsProsCons
Model-basedP(x'|x,u), l(x,u)Data-efficient, reuses modelModel errors compound
Model-free (Q-learning)Q(x,u) directlyNo model needed, robustSample inefficient
Model-free (policy gradient)πθ(u|x) directlyContinuous actions, stochastic policiesHigh variance

TD Learning — the core update rule

Temporal Difference (TD) learning updates value estimates using the difference between a prediction and an observation. The TD(0) update for V(xt) after observing transition (xt, ut, lt, xt+1) is:

V(xt) ← V(xt) + α &Bigl[lt + γ · V(xt+1) - V(xt)&Bigr]

The bracket is the TD error δt = lt + γV(xt+1) - V(xt). It measures how wrong our current prediction was. Positive δ means things went better than expected; negative means worse. The learning rate α controls how fast we update. As α → 0, the update becomes conservative; as α → 1, we overwrite completely with the new estimate.

An RL agent observes transition (x, u, l=2, x'). Its current estimate is V(x) = 10, V(x') = 8, γ = 0.9, α = 0.1. What is the new V(x) after the TD(0) update?
Exact calculation for the quiz. Target = l + γV(x') = 2 + 0.9×8 = 9.2. TD error δ = 9.2 - 10 = -0.8. Update: V(x) = 10 + 0.1×(-0.8) = 9.92. Option 3 is correct (the question gives approximate wording; 9.92 is the exact answer).

Chapter 6: Q-Learning

TD learning updates V(x) — the value of being in a state. But to act, we also need to know which action is best. If we knew Q(x, u) — the expected total cost of taking action u in state x and then acting optimally — we could act greedily without ever knowing the transition model P. The action-value function Q is the target of Q-learning.

The Q-function and its Bellman equation

Define Q*(x, u) as the expected discounted cost of taking action u in state x, then following the optimal policy thereafter:

Q*(x, u) = l(x, u) + γ · ∑x' P(x' | x, u) · minu' Q*(x', u')

Notice: V*(x) = minu Q*(x,u). The optimal action in state x is just argminu Q*(x,u) — no model needed, since we only look up stored Q values. This is the power of Q: once you know Q*, you can act optimally in any state without knowing P at all.

The Q-learning update

After observing (xt, ut, lt, xt+1):

Q(xt, ut) ← Q(xt, ut) + α &Bigl[ lt + γ · minu' Q(xt+1, u') − Q(xt, ut) &Bigr]

Key property: Q-learning is off-policy. The TD target uses minu' Q(x', u') — the greedy action — regardless of how the agent actually chose ut. So the agent can explore randomly (e.g., ε-greedy) while still learning the optimal Q* for the greedy policy. This decoupling of behavior policy (how we explore) from target policy (what we want to optimize) is unique to off-policy methods.

ε-greedy exploration. With probability 1-ε, take argminuQ(x,u) (exploit). With probability ε, take a random action (explore). Anneal ε over time: start at 1.0 (pure random), decay to 0.01 (mostly greedy). This is the simplest exploration strategy. More sophisticated methods include UCB (Upper Confidence Bound) and Thompson Sampling — but ε-greedy often works surprisingly well.
Q-Learning Live — Watch the Q-Table Emerge

The agent explores the 4×4 grid. Q-values for each state are shown as arrow brightness. Light arrows = high Q, dark = low Q. Watch Q values sharpen as episodes accumulate. Adjust ε to see how exploration affects convergence speed.

ε (explore) 0.30
Episodes: 0 — Q-table: all zeros
python
import numpy as np
import random

def q_learning(mdp, n_episodes=5000, alpha=0.1, gamma=0.95,
                eps_start=1.0, eps_end=0.05, eps_decay=0.995):
    n_states  = len(list(mdp.states()))
    n_actions = len(mdp.actions)
    Q = np.zeros((n_states, n_actions))
    eps = eps_start

    for ep in range(n_episodes):
        x = mdp.x0                             # reset to initial state
        for _ in range(200):                    # max steps per episode
            xi = mdp.state_index(x)
            # epsilon-greedy action selection
            if random.random() < eps:
                u = random.randint(0, n_actions-1)
            else:
                u = int(np.argmin(Q[xi]))

            xp, cost = mdp.step(x, u)         # sample one transition
            xpi = mdp.state_index(xp)

            # Q-learning update (off-policy, uses min over next actions)
            target = cost + gamma * np.min(Q[xpi])
            Q[xi, u] += alpha * (target - Q[xi, u])

            x = xp
            if x == mdp.goal: break
        eps = max(eps_end, eps * eps_decay)
    return Q
Q-learning is called "off-policy" because:

Chapter 7: Sequential Game Theory

Until now, "nature" was neutral — it perturbed your actions randomly but without intent. What if nature is adversarial? What if there's a second agent with its own goals, actively trying to defeat you? This is the setting of sequential game theory, and it requires a different framework: instead of minimizing expected cost, you minimize your worst-case cost assuming the opponent plays optimally against you.

Two-player zero-sum games

LaValle focuses on two-player zero-sum games: whatever one player gains, the other loses. The state-space formulation adds a second action space U2 for the adversary (nature/opponent). At each stage, player 1 picks u1 ∈ U1(x) and simultaneously (or sequentially in alternating games) player 2 picks u2 ∈ U2(x). The transition becomes P(x' | x, u1, u2).

Player 1 minimizes cost; player 2 maximizes it (zero-sum: their costs are negatives of each other). The minimax value V*(x) is the value both players can guarantee:

V*(x) = minu1 maxu2 &Bigl[ l(x, u1, u2) + γ ∑x' P(x' | x, u1, u2) · V*(x') &Bigr]

The minimax theorem (Nash's theorem for zero-sum games) guarantees that mixed strategies (randomizing over actions) always have a well-defined minimax value. Pure strategy minimax may not exist — rock-paper-scissors has no pure strategy Nash equilibrium.

Mixed strategies eliminate predictability. In rock-paper-scissors, any pure strategy (always rock) can be exploited. The only Nash equilibrium is to choose each option with probability 1/3. A minimax optimal strategy randomizes precisely to make the opponent indifferent between their options — unable to gain by any deviation.

Minimax value iteration

Value iteration extends directly to games. At each state, instead of minu, we use minu1 maxu2. For pure strategies, this is a nested min-max. For mixed strategies, each inner problem is a linear program. The result is the minimax optimal policy — the best you can do regardless of what the opponent does.

Two-Player Grid — Minimax vs. Greedy

Player 1 (blue) tries to reach green. Player 2 (red, adversarial) tries to block. Toggle mode to see how minimax policy differs from greedy (expected-cost) policy. Adversary plays optimal blocking strategy.

Mode: Greedy policy

Nash equilibrium in multi-player games

Beyond zero-sum two-player games, LaValle discusses general N-player games. A Nash equilibrium is a joint policy (π1, …, πN) such that no player can improve their own expected cost by unilaterally changing their policy. In zero-sum games, Nash equilibria are exactly the minimax optimal policies. In general-sum games, Nash equilibria may not minimize total cost and may be inefficient — but they are strategically stable.

In a two-player zero-sum sequential game, player 1 computes the minimax optimal policy. This policy guarantees:

Chapter 8: Continuous State Spaces

All our MDPs so far had finite state spaces — 16 cells in a 4×4 grid. But robot positions, joint angles, velocities, and temperatures are continuous. How do we apply MDP theory when X = ℝn?

The Bellman equation with continuous states

The infinite-horizon Bellman equation extends immediately — the sum over x' becomes an integral:

V*(x) = minu &Bigl[ l(x, u) + γ ∫ p(x' | x, u) · V*(x')  dx' &Bigr]

The challenge: we can no longer store V* as a lookup table indexed by states. Instead, we need a function approximator that represents V*(x) for arbitrary x ∈ ℝn. This is where deep RL enters: V*(x) ≈ Vθ(x), a neural network with parameters θ.

Discretization

The simplest approach: discretize the continuous space into a finite grid and apply standard value iteration. For 1D: divide [xmin, xmax] into N bins. For n-D: use an Nn grid — which explodes exponentially (the curse of dimensionality). Even with N=10, a 6D robot arm has 106 = 1,000,000 states. A 20D state has 1020 — completely infeasible.

The curse of dimensionality. Every extra dimension multiplies the state space by N. Value iteration on a 10-dimensional continuous state space with N=100 per dimension requires 10010 = 1020 states. Modern RL sidesteps this not by solving the full Bellman equation, but by learning function approximators (neural networks) that generalize across states without enumerating them.

Linear function approximation

For problems with known structure, represent V*(x) ≈ wTφ(x) where φ(x) is a feature vector (e.g., polynomial basis, radial basis functions). The parameters w ∈ ℝk are learned by TD methods. The update rule becomes:

w ← w + α δt · ∇wVw(xt)

where δt = lt + γVw(xt+1) - Vw(xt) is the TD error. For linear approximation, ∇wVw(x) = φ(x), and the update is w ← w + αδtφ(xt) — a simple scalar-times-feature update.

Continuous 1D — Value Function Approximation

State x ∈ [-1, 1]. Goal: reach 0. Cost: x². Watch TD learning fit V*(x) using polynomial features. The green curve is the true V* (computed analytically); the orange curve is the learned approximation. Add more basis functions to improve fit.

Basis degree 3
TD steps: 0
Why does discretizing a 10-dimensional continuous state space into a 10-bin-per-dimension grid make value iteration infeasible?

Chapter 9: Connections & What's Next

Chapter 10 sits at a crossroads. Every concept it introduces connects outward to other fields — some that came before in LaValle's book, some that come after, and some that live in entirely different disciplines.

Where Chapter 10 fits in LaValle

ChapterTopicConnection to Ch 10
Ch 2Discrete PlanningDeterministic special case: P = 1 on f(x,u)
Ch 9Decision Theory (single stage)MDP = repeat Ch 9 decision across time with state evolution
Ch 11Information SpacesPartial observability: agent doesn't know x, must infer it → POMDP
Ch 12Planning under UncertaintyCombines sampling (Ch 5) with stochastic transitions (Ch 10)

The POMDP extension (Chapter 11 preview)

In MDPs, the agent observes the state x exactly. In a Partially Observable MDP (POMDP), the agent only sees a noisy observation y ~ P(y | x). The true state is hidden. The agent must maintain a belief state b(x) — a distribution over possible states — and plan over the belief space. The belief space is continuous even if X is finite (it's a probability simplex), making POMDPs significantly harder to solve.

MDPs as POMDPs with perfect sensors. An MDP is a POMDP where P(y=x | x) = 1 — the observation uniquely reveals the state. Every MDP algorithm is a special case of a POMDP algorithm. Understanding MDPs deeply makes the POMDP extension natural: replace "current state x" with "current belief b" and re-derive the Bellman equations.

Chapter 10 vs. modern deep RL

MethodState spaceFunction approx.Key reference
Tabular Q-learning (Ch 10)Finite, smallNone — exact tableWatkins 1989
DQNDiscrete, largeDeep CNN for Q(s,a)Mnih et al. 2015
PPOContinuousDeep NN for πθSchulman et al. 2017
SACContinuousDeep NN, entropy bonusHaarnoja et al. 2018
Model-based RL (Dreamer)ContinuousWorld model + policyHafner et al. 2020

Key limitations of tabular MDPs

Tabular methods (value iteration, policy iteration, Q-learning with a table) require storing V or Q for every state. They work when |X| ≤ ~106. Beyond that — robot arms, image-based control, multi-agent systems — function approximation is mandatory. Deep RL is the answer, but it brings new challenges: instability, sample inefficiency, sensitivity to hyperparameters, and lack of convergence guarantees for nonlinear approximators.

The Bellman equation is the constant. Whether you use a lookup table, a linear model, or a 100-layer neural network, you're always trying to satisfy the Bellman equation. Every RL algorithm is a different strategy for doing so under computational and sample constraints. Chapter 10 gives you the foundation; the rest of the field gives you the engineering to scale it.

Related micro-lessons

Continue your learning with these lessons in the microLearning series:

"A policy is the right concept when the world is stochastic: it tells you what to do in every possible situation you might land in, not just the ones you planned for."
— paraphrasing LaValle, §10.1