When nature plays between your moves — MDPs, value iteration, Q-learning, and the mathematics of optimal behavior under uncertainty.
You're controlling a delivery drone over a city. You issue a command — "fly north 10 meters" — and the wind gusts. The drone ends up northwest instead. You issue another command. The motor stutters. You arrive at slightly the wrong altitude. Every action you take is perturbed by forces outside your control. The world is stochastic.
In Chapter 2, we assumed actions were deterministic: if you tell the robot to go right, it goes right. Period. That worked beautifully for warehouse grids on flat floors with perfect motors. But real robots, real economies, and real games have a different character: nature is an adversary that moves between your decisions. After you act, the world transitions — and you don't fully control how.
The canvas below shows the problem concretely. A robot is on a slippery grid. When it tries to move in a direction, there's a chance it slides sideways instead. Run a fixed plan — go right four times — and watch how often it reaches the goal vs. slips into a wall or off-course. The deterministic plan fails even though the environment is simple.
The robot (blue) tries to reach green from orange. Each "right" command succeeds with probability 0.7, slips up or down with 0.15 each. Run the fixed plan 20 times and observe how often it fails.
Chapter 2 gave us deterministic planning: (X, U(x), f, xI, XG). Chapter 10 adds three things. First, stochastic transitions: instead of f(x, u) = x', we have a probability distribution P(x' | x, u) over possible next states. Second, stage costs that accumulate as a reward/cost signal each step. Third, a planning horizon — finite (terminate after K steps) or infinite (run forever).
These three additions turn planning into something richer. We no longer ask "which path minimizes total cost?" We ask: "which policy minimizes expected total cost, regardless of how nature perturbs my trajectory?" The answer is called an optimal policy — and finding it is the central problem of Chapter 10.
| Feature | Ch 2 (Deterministic) | Ch 10 (Stochastic) |
|---|---|---|
| Transition | f(x, u) = x' (certain) | P(x' | x, u) (distribution) |
| Plan type | Sequence of actions | Policy π: X → U |
| Objective | Minimize path cost | Minimize expected cumulative cost |
| Failure mode | No path exists | High expected cost, no safety guarantee |
A Markov Decision Process (MDP) is the mathematical framework that captures exactly the stochastic planning problem from Chapter 0. LaValle's Formulation 10.1 has five components — similar to Formulation 2.1, but with one critical replacement and one addition.
| Symbol | Name | Meaning |
|---|---|---|
| X | State space | All possible states (finite in this chapter) |
| U | Action space | All actions (same at every state for simplicity) |
| P(x' | x, u) | Transition probabilities | Probability of landing in x' from x under action u |
| l(x, u) | Stage cost | Immediate cost incurred for action u in state x |
| xI | Initial state | Where planning begins |
The key replacement: the transition function f(x,u) = x' (deterministic) becomes the transition distribution P(x' | x, u). For every state-action pair (x, u), P gives a probability distribution over next states: ∑x' P(x' | x, u) = 1. The transition is still Markovian — the distribution over x' depends only on the current state x and action u, not on the full history of how we got to x.
Take a 4×4 grid. States X = {(i,j) : 0 ≤ i,j ≤ 3}. Actions U = {N, S, E, W}. The transition model: intended direction succeeds with probability 0.7; with probability 0.15 each, the robot slips 90° left or right of its intended direction; with probability 0 it moves backward (simplification). All moves that would leave the grid keep the robot in its current cell.
The stage cost l(x, u) encodes preferences. A common choice: l(x, u) = 1 for all non-goal states (time penalty — every extra step costs 1), and l(xG, ·) = 0 (no cost once you've arrived). Other designs: negative costs (rewards) for reaching goal, large positive costs for entering hazard cells.
A policy π : X → U assigns an action to every state. Under policy π, the robot follows a random trajectory x0, x1, x2, … where xk+1 ~ P(· | xk, π(xk)). The cumulative cost over a finite horizon K is Lπ = ∑k=0K-1 l(xk, π(xk)). We want to minimize the expected cumulative cost E[Lπ].
The optimal policy π* minimizes expected total cost from every starting state. Finding π* is exactly what the algorithms of Chapters 2 and 3 do — but now the Bellman equation must take expectations over transitions.
python import numpy as np class SlippyGridMDP: def __init__(self, n=4, p_intended=0.7, goal=(3,3)): self.n = n self.p = p_intended # prob of intended direction self.goal = goal self.actions = [(0,1),(0,-1),(1,0),(-1,0)] # E,W,S,N def transitions(self, x, u_idx): """Returns list of (x', probability) for action u_idx.""" u = self.actions[u_idx] u_left = self.actions[(u_idx + 1) % 4] u_right = self.actions[(u_idx - 1) % 4] p_slip = (1 - self.p) / 2 candidates = [(u, self.p), (u_left, p_slip), (u_right, p_slip)] result = {} for du, prob in candidates: xp = (max(0, min(self.n-1, x[0]+du[0])), max(0, min(self.n-1, x[1]+du[1]))) result[xp] = result.get(xp, 0) + prob return list(result.items()) def cost(self, x, u_idx): return 0.0 if x == self.goal else 1.0
In Chapter 2 of LaValle's book, backward value iteration found G*(x) — the optimal cost-to-go — by propagating values backward from the goal. The update was: G*(x) = minu [l(x,u) + G*(f(x,u))]. Now the transition is stochastic. We can't just look up G*(f(x,u)) — we must take an expectation over all possible next states.
The Bellman equation for a finite-horizon MDP with K stages is:
Read it: the value of being in state x with k stages remaining equals the minimum over all actions u of: the immediate cost l(x,u) plus the expected future value ∑x' P(x' | x,u) · Vk+1(x'). The expectation is the key new piece. When transitions are deterministic, P puts all mass on one x', and the sum collapses back to Vk+1(f(x,u)) — exactly the deterministic case.
Boundary condition: V0(x) = 0 for x ∈ XG, and V0(x) = ∞ for x ∉ XG (if you haven't reached the goal when time runs out, you've failed). Iterate backward: compute VK-1, then VK-2, down to V0 at the initial state.
Three states: A, B, G (goal). Actions: {stay, advance}. Transitions: advance from A reaches B with p=0.8, stays at A with p=0.2. Advance from B reaches G with p=0.9, stays at B with p=0.1. Stage cost: 1.0 everywhere except G. K=3 stages.
| State | V0 (base) | V1 | V2 | V3 (start) |
|---|---|---|---|---|
| G | 0 | 0 | 0 | 0 |
| B | ∞ | 1 + 0.9·0 + 0.1·∞ → stay=1 | 1.1 | 1.19 |
| A | ∞ | ∞ | 1 + 0.8·1 + 0.2·∞ → advance ≈ 1.8 | 2.62 |
At V1(B): advance gives 1 + 0.9·0 + 0.1·∞ = ∞ (can't reach G in 0 stages if you stay at B after slip). So "stay" dominates at horizon 1 only because no goal transitions are possible in 0 remaining stages... Actually for B: advance gives expected cost 1 + 0.9·V0(G) + 0.1·V0(B) = 1 + 0 + ∞ = ∞; stay gives ∞ too since you need to reach G. For K=2, B can advance: 1 + 0.9·V1(G) + 0.1·V1(B) = 1+0+0.1·1 = 1.1. The table above captures the correct pattern: as K grows, more states become reachable. The optimal policy is always "advance."
Watch how values propagate from the goal (red) outward in the stochastic grid. Each step is one backward Bellman sweep. Brighter cells = lower cost-to-go. Arrows show the greedy policy at each state.
python import numpy as np from math import inf def value_iteration_mdp(mdp, K): """Backward value iteration for finite-horizon MDP. Returns V[k][state] for k in 0..K.""" states = list(mdp.states()) V = [{s: (0.0 if s == mdp.goal else inf) for s in states}] # V[0] for k in range(1, K+1): Vk = {} for x in states: if x == mdp.goal: Vk[x] = 0.0; continue best = inf for u in range(len(mdp.actions)): cost = mdp.cost(x, u) exp_future = sum(p * V[k-1].get(xp, inf) for xp, p in mdp.transitions(x, u)) total = cost + exp_future if total < best: best = total Vk[x] = best V.append(Vk) return V # Recover greedy policy from V[K] def greedy_policy(mdp, V_final): policy = {} for x in mdp.states(): if x == mdp.goal: continue best_u, best_v = None, inf for u in range(len(mdp.actions)): q = mdp.cost(x, u) + sum(p * V_final.get(xp, inf) for xp, p in mdp.transitions(x, u)) if q < best_v: best_u, best_v = u, q policy[x] = best_u return policy
Value iteration repeatedly updates V until it converges. Policy iteration takes a different route: it alternates between two operations — evaluating a fixed policy exactly, then improving the policy greedily. This alternation is guaranteed to converge, and in practice often does so in far fewer iterations than value iteration.
Fix a policy π. Under π, the MDP becomes a simple Markov chain — no choices to make. The value function Vπ(x) is defined by the policy evaluation equations:
This is a linear system of equations in the unknowns {Vπ(x)}. For n states, it's an n×n linear system — solvable exactly in O(n3) via Gaussian elimination. Alternatively, iterate the equations to convergence (like value iteration but with a fixed policy, which is faster since no minimization is needed).
With Vπ in hand, check every state: is there an action better than π(x)? The improved policy is:
If π'(x) = π(x) for all x, the policy is already optimal — we're done. Otherwise, set π ← π' and repeat policy evaluation. This is the policy improvement theorem: the new policy is always at least as good as the old one (Vπ'(x) ≤ Vπ(x) for all x), and strictly better unless it's already optimal.
| Property | Value Iteration | Policy Iteration |
|---|---|---|
| Per-iteration cost | O(|X|·|U|) — one Bellman sweep | O(|X|3) — solve linear system |
| Iterations to converge | Many (proportional to state space diameter) | Few (typically 5–50) |
| Intermediate policies | Implicit — not directly available | Explicit at each step |
| Best for | Large state spaces, no matrix inversion | Smaller spaces, exact policy needed quickly |
Start with a random policy (arrows point random directions). Each click runs one full policy evaluation + improvement cycle. Watch arrows converge toward the goal. Compare how few iterations it takes vs. value iteration.
python import numpy as np def policy_evaluation(mdp, policy, gamma=0.99, tol=1e-8): """Iterative policy evaluation. Returns V^pi dict.""" states = list(mdp.states()) V = {s: 0.0 for s in states} while True: delta = 0 for x in states: u = policy[x] v = mdp.cost(x, u) + gamma * sum( p * V[xp] for xp, p in mdp.transitions(x, u)) delta = max(delta, abs(v - V[x])) V[x] = v if delta < tol: break return V def policy_iteration(mdp, gamma=0.99): """Full policy iteration loop.""" states = list(mdp.states()) policy = {s: 0 for s in states} # start with action 0 everywhere while True: V = policy_evaluation(mdp, policy, gamma) stable = True for x in states: best_u = min(range(len(mdp.actions)), key=lambda u: mdp.cost(x,u) + gamma*sum( p*V[xp] for xp,p in mdp.transitions(x,u))) if best_u != policy[x]: policy[x] = best_u; stable = False if stable: break return policy, V
Finite-horizon MDPs ask: "minimize cost over exactly K steps." But most real problems have no fixed deadline. A robot should deliver packages indefinitely. A trading algorithm runs until the fund closes. For these, we need the infinite-horizon MDP — and a way to prevent the total cost from growing without bound.
The standard fix is discounting: future costs are worth less than present ones. Formally, a cost k steps in the future is multiplied by γk, where 0 < γ < 1 is the discount factor. The total discounted cost is:
Since γ < 1, the geometric series converges: even with constant stage cost c, the total is at most c/(1−γ). The optimal value function V* satisfies the infinite-horizon Bellman equation:
This is identical to the finite-horizon equation but with no horizon index k — V* is stationary. The optimal policy is also stationary: the same action at state x regardless of time. Run value iteration to convergence (using max|Vk-Vk-1| < ε as the stopping rule) and you get V* and π*.
When γ = 1, we get the stochastic shortest path problem: minimize expected total cost until the goal is reached. This is the stochastic analog of Dijkstra's algorithm. It requires that the goal be reachable from every state under some policy (otherwise the total cost could be infinite). LaValle's Theorem 10.1 proves convergence of value iteration for SSP under this reachability assumption.
An alternative to discounting: minimize the long-run average cost per step:
This is appropriate when γ = 1 but costs continue forever (no goal, just ongoing operation). Solving average-cost MDPs requires the relative value function — subtract the average cost from the Bellman equation to prevent divergence. LaValle covers this in §10.3 but it's typically solved via policy iteration with an offset equation.
Full MDP with stochastic transitions (slip probability). Adjust γ to see how the discount shapes the policy — myopic (γ close to 0) vs. far-sighted (γ close to 1). Watch the heat map and arrows converge.
Value iteration and policy iteration both require knowing P(x' | x, u) — the transition probabilities. In the slippery grid, we knew exactly how likely each slip was. But what if we don't know P? What if we've never seen the environment before and must learn what transitions and costs look like through experience?
This is the setting of reinforcement learning (RL). An agent acts in an unknown MDP, receives observations of (x, u, l, x'), and must infer the optimal policy without ever being given P or l explicitly. The environment is a black box — you can push buttons and observe outcomes, but you can't inspect the mechanism.
Model-based RL estimates P(x'|x,u) and l(x,u) from experience, then runs value iteration or policy iteration on the learned model. Clean, data-efficient, but requires building a model that might be wrong. Errors in the model compound through planning.
Model-free RL estimates value functions or policies directly from experience, bypassing the model entirely. Q-learning and TD-learning are model-free. They're less data-efficient but don't accumulate model errors. Most modern deep RL (DQN, PPO, SAC) is model-free.
| Approach | What it learns | Pros | Cons |
|---|---|---|---|
| Model-based | P(x'|x,u), l(x,u) | Data-efficient, reuses model | Model errors compound |
| Model-free (Q-learning) | Q(x,u) directly | No model needed, robust | Sample inefficient |
| Model-free (policy gradient) | πθ(u|x) directly | Continuous actions, stochastic policies | High variance |
Temporal Difference (TD) learning updates value estimates using the difference between a prediction and an observation. The TD(0) update for V(xt) after observing transition (xt, ut, lt, xt+1) is:
The bracket is the TD error δt = lt + γV(xt+1) - V(xt). It measures how wrong our current prediction was. Positive δ means things went better than expected; negative means worse. The learning rate α controls how fast we update. As α → 0, the update becomes conservative; as α → 1, we overwrite completely with the new estimate.
TD learning updates V(x) — the value of being in a state. But to act, we also need to know which action is best. If we knew Q(x, u) — the expected total cost of taking action u in state x and then acting optimally — we could act greedily without ever knowing the transition model P. The action-value function Q is the target of Q-learning.
Define Q*(x, u) as the expected discounted cost of taking action u in state x, then following the optimal policy thereafter:
Notice: V*(x) = minu Q*(x,u). The optimal action in state x is just argminu Q*(x,u) — no model needed, since we only look up stored Q values. This is the power of Q: once you know Q*, you can act optimally in any state without knowing P at all.
After observing (xt, ut, lt, xt+1):
Key property: Q-learning is off-policy. The TD target uses minu' Q(x', u') — the greedy action — regardless of how the agent actually chose ut. So the agent can explore randomly (e.g., ε-greedy) while still learning the optimal Q* for the greedy policy. This decoupling of behavior policy (how we explore) from target policy (what we want to optimize) is unique to off-policy methods.
The agent explores the 4×4 grid. Q-values for each state are shown as arrow brightness. Light arrows = high Q, dark = low Q. Watch Q values sharpen as episodes accumulate. Adjust ε to see how exploration affects convergence speed.
python import numpy as np import random def q_learning(mdp, n_episodes=5000, alpha=0.1, gamma=0.95, eps_start=1.0, eps_end=0.05, eps_decay=0.995): n_states = len(list(mdp.states())) n_actions = len(mdp.actions) Q = np.zeros((n_states, n_actions)) eps = eps_start for ep in range(n_episodes): x = mdp.x0 # reset to initial state for _ in range(200): # max steps per episode xi = mdp.state_index(x) # epsilon-greedy action selection if random.random() < eps: u = random.randint(0, n_actions-1) else: u = int(np.argmin(Q[xi])) xp, cost = mdp.step(x, u) # sample one transition xpi = mdp.state_index(xp) # Q-learning update (off-policy, uses min over next actions) target = cost + gamma * np.min(Q[xpi]) Q[xi, u] += alpha * (target - Q[xi, u]) x = xp if x == mdp.goal: break eps = max(eps_end, eps * eps_decay) return Q
Until now, "nature" was neutral — it perturbed your actions randomly but without intent. What if nature is adversarial? What if there's a second agent with its own goals, actively trying to defeat you? This is the setting of sequential game theory, and it requires a different framework: instead of minimizing expected cost, you minimize your worst-case cost assuming the opponent plays optimally against you.
LaValle focuses on two-player zero-sum games: whatever one player gains, the other loses. The state-space formulation adds a second action space U2 for the adversary (nature/opponent). At each stage, player 1 picks u1 ∈ U1(x) and simultaneously (or sequentially in alternating games) player 2 picks u2 ∈ U2(x). The transition becomes P(x' | x, u1, u2).
Player 1 minimizes cost; player 2 maximizes it (zero-sum: their costs are negatives of each other). The minimax value V*(x) is the value both players can guarantee:
The minimax theorem (Nash's theorem for zero-sum games) guarantees that mixed strategies (randomizing over actions) always have a well-defined minimax value. Pure strategy minimax may not exist — rock-paper-scissors has no pure strategy Nash equilibrium.
Value iteration extends directly to games. At each state, instead of minu, we use minu1 maxu2. For pure strategies, this is a nested min-max. For mixed strategies, each inner problem is a linear program. The result is the minimax optimal policy — the best you can do regardless of what the opponent does.
Player 1 (blue) tries to reach green. Player 2 (red, adversarial) tries to block. Toggle mode to see how minimax policy differs from greedy (expected-cost) policy. Adversary plays optimal blocking strategy.
Beyond zero-sum two-player games, LaValle discusses general N-player games. A Nash equilibrium is a joint policy (π1, …, πN) such that no player can improve their own expected cost by unilaterally changing their policy. In zero-sum games, Nash equilibria are exactly the minimax optimal policies. In general-sum games, Nash equilibria may not minimize total cost and may be inefficient — but they are strategically stable.
All our MDPs so far had finite state spaces — 16 cells in a 4×4 grid. But robot positions, joint angles, velocities, and temperatures are continuous. How do we apply MDP theory when X = ℝn?
The infinite-horizon Bellman equation extends immediately — the sum over x' becomes an integral:
The challenge: we can no longer store V* as a lookup table indexed by states. Instead, we need a function approximator that represents V*(x) for arbitrary x ∈ ℝn. This is where deep RL enters: V*(x) ≈ Vθ(x), a neural network with parameters θ.
The simplest approach: discretize the continuous space into a finite grid and apply standard value iteration. For 1D: divide [xmin, xmax] into N bins. For n-D: use an Nn grid — which explodes exponentially (the curse of dimensionality). Even with N=10, a 6D robot arm has 106 = 1,000,000 states. A 20D state has 1020 — completely infeasible.
For problems with known structure, represent V*(x) ≈ wTφ(x) where φ(x) is a feature vector (e.g., polynomial basis, radial basis functions). The parameters w ∈ ℝk are learned by TD methods. The update rule becomes:
where δt = lt + γVw(xt+1) - Vw(xt) is the TD error. For linear approximation, ∇wVw(x) = φ(x), and the update is w ← w + αδtφ(xt) — a simple scalar-times-feature update.
State x ∈ [-1, 1]. Goal: reach 0. Cost: x². Watch TD learning fit V*(x) using polynomial features. The green curve is the true V* (computed analytically); the orange curve is the learned approximation. Add more basis functions to improve fit.
Chapter 10 sits at a crossroads. Every concept it introduces connects outward to other fields — some that came before in LaValle's book, some that come after, and some that live in entirely different disciplines.
| Chapter | Topic | Connection to Ch 10 |
|---|---|---|
| Ch 2 | Discrete Planning | Deterministic special case: P = 1 on f(x,u) |
| Ch 9 | Decision Theory (single stage) | MDP = repeat Ch 9 decision across time with state evolution |
| Ch 11 | Information Spaces | Partial observability: agent doesn't know x, must infer it → POMDP |
| Ch 12 | Planning under Uncertainty | Combines sampling (Ch 5) with stochastic transitions (Ch 10) |
In MDPs, the agent observes the state x exactly. In a Partially Observable MDP (POMDP), the agent only sees a noisy observation y ~ P(y | x). The true state is hidden. The agent must maintain a belief state b(x) — a distribution over possible states — and plan over the belief space. The belief space is continuous even if X is finite (it's a probability simplex), making POMDPs significantly harder to solve.
| Method | State space | Function approx. | Key reference |
|---|---|---|---|
| Tabular Q-learning (Ch 10) | Finite, small | None — exact table | Watkins 1989 |
| DQN | Discrete, large | Deep CNN for Q(s,a) | Mnih et al. 2015 |
| PPO | Continuous | Deep NN for πθ | Schulman et al. 2017 |
| SAC | Continuous | Deep NN, entropy bonus | Haarnoja et al. 2018 |
| Model-based RL (Dreamer) | Continuous | World model + policy | Hafner et al. 2020 |
Tabular methods (value iteration, policy iteration, Q-learning with a table) require storing V or Q for every state. They work when |X| ≤ ~106. Beyond that — robot arms, image-based control, multi-agent systems — function approximation is mandatory. Deep RL is the answer, but it brings new challenges: instability, sample inefficiency, sensitivity to hyperparameters, and lack of convergence guarantees for nonlinear approximators.
Continue your learning with these lessons in the microLearning series:
"A policy is the right concept when the world is stochastic: it tells you what to do in every possible situation you might land in, not just the ones you planned for."
— paraphrasing LaValle, §10.1