Thrun et al., Chapter 15

Markov Decision Processes

Planning under uncertainty: value iteration, optimal policies, payoff functions, and robot path planning.

Prerequisites: Basic probability, Chapters 1-2 (uncertainty, Bayes filters).
10
Chapters
1
Simulation
10
Quizzes

Chapter 0: Why Planning Under Uncertainty?

So far, this book has focused on perception: estimating the robot's state from noisy data. But perception alone doesn't make a robot useful. The robot must also act: choose actions that achieve goals despite uncertain outcomes.

Why can't we just plan deterministically and then execute? Because robot actions have uncertain outcomes. A "move forward 1 meter" command might actually move 0.95m or 1.07m. A planned path around an obstacle might fail if the robot drifts slightly.

The fundamental question: Given that actions have uncertain outcomes, what is the best action to take at each state? A good policy doesn't just aim for the goal — it prepares for contingencies. Near a cliff, the optimal policy might take a longer but safer route, even though the direct path is shorter.

This chapter introduces Markov Decision Processes (MDPs), the framework for optimal sequential decision-making under uncertainty when the state is fully observable. Chapter 16 extends to the partially observable case.

Check: Why is deterministic planning insufficient for robots?

Chapter 1: The MDP Framework

An MDP is defined by four components:

ComponentSymbolDescription
StatesSThe set of all possible states (e.g., robot positions)
ActionsAThe set of actions available at each state
Transition modelp(s' | a, s)Probability of reaching s' by doing a in s
Payoff / Rewardc(s)Immediate reward for being in state s

The Markov property: the transition probability depends only on the current state and action, not on history. This is the same assumption used throughout this book for Bayes filters.

MDPs assume full observability: The robot knows its state exactly at every time step. In reality, the robot only has a belief over states (from its localization system). The partially observable case (POMDPs, Chapter 16) handles this, but at much greater computational cost.
Check: What does the transition model p(s'|a,s) represent?

Chapter 2: Policies

A policy π(s) maps each state to an action. Given a policy, the robot knows what to do in any situation. A deterministic policy says "in state s, do action a." A stochastic policy says "in state s, do action a with probability π(s,a)."

The optimal policy π* is the one that maximizes the expected cumulative payoff over time. For a finite horizon T:

π* = argmaxπ E[Σt=0..T γt c(st) | π]

where γ ∈ [0,1] is the discount factor. With γ < 1, future rewards are worth less than immediate ones, encouraging the robot to achieve goals sooner.

Policy vs plan: A plan is a fixed sequence of actions. A policy is a function from states to actions. A policy is strictly more powerful because it tells the robot what to do regardless of what state it ends up in. If an action goes wrong, the policy adapts; a plan cannot.
Check: Why is a policy more robust than a fixed action sequence?

Chapter 3: Value Functions

The value function C(s) is the expected cumulative payoff starting from state s and following the optimal policy thereafter:

C(s) = E[Σt=0..∞ γt c(st) | s0 = s, π*]

The value function captures the "goodness" of each state. States near the goal have high value. States near obstacles have low value. States far from the goal have moderate value (it will take time to reach the goal, so future rewards are discounted).

The value function is everything: Once you have the optimal value function, the optimal policy is trivial — at each state, pick the action that leads to the highest-value successor state. The entire challenge of MDPs is computing C(s).

For robot path planning, the value function forms a gradient field over the map. Following the gradient (ascending toward higher values) leads to the goal along the optimal path.

Check: Given the optimal value function, how do you extract the optimal policy?

Chapter 4: The Bellman Equation

The optimal value function satisfies the Bellman equation, a recursive relationship:

C(s) = maxa ∫ [c(s') + γ C(s')] p(s' | a, s) ds'

In words: the value of state s is the best (over all actions) expected sum of (immediate payoff + discounted future value).

The recursive insight: The value at state s depends on the values at successor states. But those successor values depend on their successors, and so on. The Bellman equation captures this infinite chain in one self-consistent equation. Value iteration (next chapter) solves it by iterating until convergence.

For finite state spaces, the integral becomes a sum. For continuous spaces, discretization or function approximation is needed.

Check: What does the Bellman equation express?

Chapter 5: Value Iteration

Value iteration solves the Bellman equation by repeated application:

1. Initialize Ĉ(s) = 0 for all states.

2. Repeat until convergence:

Ĉ(s) ← maxa ∫ [c(s') + γ Ĉ(s')] p(s' | a, s) ds'   ∀ s

Each iteration improves the value estimate by looking one step further into the future. After k iterations, Ĉ reflects the optimal value for a k-step horizon.

Convergence: Value iteration converges if γ < 1. The error decreases geometrically: after k iterations, the error is at most γk · Cmax. With γ = 0.95, the error halves roughly every 14 iterations. In practice, convergence is observed after 50-200 iterations for typical grid-based robot planning problems.

The order in which states are updated doesn't matter for convergence, as long as each state is updated infinitely often. In practice, sweeping through all states once per iteration works well.

Check: What does each iteration of value iteration do?

Chapter 6: Convergence

Value iteration is guaranteed to converge when γ < 1. The key properties:

PropertyDescription
ContractionEach iteration reduces the maximum error by factor γ
MonotonicityIf Ĉk ≤ C*, then Ĉk+1 ≤ C* (values approach optimum from below)
Fixed pointThe only fixed point of the Bellman update is C*

For γ = 1 (undiscounted), convergence is not guaranteed in general, but works in practice for goal-reaching problems where all states can reach the goal in finite time.

Practical stopping criterion: Stop when the maximum change in any value is below a threshold: maxs |Ĉk+1(s) - Ĉk(s)| < ε. The resulting policy is ε/(1-γ)-optimal.
Check: Why does value iteration converge for γ < 1?

Chapter 7: Robot Path Planning

MDPs naturally solve the robot path planning problem. The state space is the robot's configuration space (x, y for a circular robot). Actions are motion commands (move in 8 directions). The payoff is high at the goal and zero (or negative) elsewhere.

After running value iteration, the value function forms a gradient field: brighter = higher value = closer to goal. Following the gradient leads to the goal along the optimal path, automatically avoiding obstacles (which have low/zero value).

MDP path planning vs shortest path: The shortest path ignores action uncertainty. The MDP optimal path considers that the robot might drift into obstacles. Near narrow passages, the MDP might prefer a wider but safer path. Near cliff edges, the MDP steers further from danger. This is risk-aware planning.

For circular robots on flat surfaces, planning in (x, y) space suffices. For non-circular robots or those with dynamics, the state must include orientation and velocity, making the problem 4-6 dimensional. Value iteration becomes expensive but still tractable for moderate resolutions.

Check: How does MDP path planning differ from shortest-path planning?

Chapter 8: Value Iteration Demo

Watch value iteration compute the optimal value function for a grid world. The goal cell has high reward. Obstacles have zero value. Value propagates outward from the goal, filling the navigable space.

MDP Value Iteration

Bright cells = high value (near goal). Dark cells = low value or obstacles. Click "Iterate" to run one sweep. The value function converges to a gradient pointing toward the goal.

Iteration 0. Click Iterate.
What to notice: The value function starts at zero everywhere and grows outward from the goal like a wave. After convergence, following the brightness gradient from any cell leads to the goal. The path naturally avoids obstacles because obstacle-adjacent cells have lower value.
Check: After value iteration converges, how does the robot find the optimal path?

Chapter 9: Summary

ConceptKey Idea
MDPFramework for sequential decision-making with stochastic transitions
PolicyMapping from states to actions; optimal policy maximizes expected cumulative reward
Value functionExpected future reward from each state under optimal policy
Bellman equationRecursive relationship defining the optimal value function
Value iterationIterative algorithm that converges to optimal value function
What comes next: MDPs assume the state is fully observable. But robots only have beliefs about their state (from localization). Chapter 16 introduces POMDPs — planning in belief space, where the robot must reason about its own uncertainty and choose actions that gather information as well as achieve goals.
"The value of a state is not where you are,
but where you can go from here."
Check: What key assumption does the MDP framework make about state observability?