Thrun et al., Chapter 15

Markov Decision Processes

Planning under uncertainty: value iteration, optimal policies, payoff functions, and robot path planning.

Prerequisites: Basic probability, Chapters 1-2 (uncertainty, Bayes filters).

Chapters

Simulation

Quizzes

Chapter 0: Why Planning Under Uncertainty?

So far, this book has focused on perception: estimating the robot's state from noisy data. But perception alone doesn't make a robot useful. The robot must also act: choose actions that achieve goals despite uncertain outcomes.

Why can't we just plan deterministically and then execute? Because robot actions have uncertain outcomes. A "move forward 1 meter" command might actually move 0.95m or 1.07m. A planned path around an obstacle might fail if the robot drifts slightly.

The fundamental question: Given that actions have uncertain outcomes, what is the best action to take at each state? A good policy doesn't just aim for the goal — it prepares for contingencies. Near a cliff, the optimal policy might take a longer but safer route, even though the direct path is shorter.

This chapter introduces Markov Decision Processes (MDPs), the framework for optimal sequential decision-making under uncertainty when the state is fully observable. Chapter 16 extends to the partially observable case.

Check: Why is deterministic planning insufficient for robots?

Robot actions have uncertain outcomes, so the actual trajectory may differ from the planned one Robots can't follow plans Plans are too slow to compute

Chapter 1: The MDP Framework

An MDP is defined by four components:

Component	Symbol	Description
States	S	The set of all possible states (e.g., robot positions)
Actions	A	The set of actions available at each state
Transition model	p(s' \| a, s)	Probability of reaching s' by doing a in s
Payoff / Reward	c(s)	Immediate reward for being in state s

The Markov property: the transition probability depends only on the current state and action, not on history. This is the same assumption used throughout this book for Bayes filters.

MDPs assume full observability: The robot knows its state exactly at every time step. In reality, the robot only has a belief over states (from its localization system). The partially observable case (POMDPs, Chapter 16) handles this, but at much greater computational cost.

Check: What does the transition model p(s'|a,s) represent?

The probability of transitioning to state s' when executing action a in state s The probability of being in state s The sensor measurement probability

Chapter 2: Policies

A policy π(s) maps each state to an action. Given a policy, the robot knows what to do in any situation. A deterministic policy says "in state s, do action a." A stochastic policy says "in state s, do action a with probability π(s,a)."

The optimal policy π* is the one that maximizes the expected cumulative payoff over time. For a finite horizon T:

π* = argmax_π E[Σ_t=0..T γ^t c(s_t) | π]

where γ ∈ [0,1] is the discount factor. With γ < 1, future rewards are worth less than immediate ones, encouraging the robot to achieve goals sooner.

Policy vs plan: A plan is a fixed sequence of actions. A policy is a function from states to actions. A policy is strictly more powerful because it tells the robot what to do regardless of what state it ends up in. If an action goes wrong, the policy adapts; a plan cannot.

Check: Why is a policy more robust than a fixed action sequence?

A policy prescribes an action for every possible state, so it adapts when actions have unexpected outcomes A policy is faster to compute A policy uses less memory

Chapter 3: Value Functions

The value function C(s) is the expected cumulative payoff starting from state s and following the optimal policy thereafter:

C(s) = E[Σ_t=0..∞ γ^t c(s_t) | s₀ = s, π*]

The value function captures the "goodness" of each state. States near the goal have high value. States near obstacles have low value. States far from the goal have moderate value (it will take time to reach the goal, so future rewards are discounted).

The value function is everything: Once you have the optimal value function, the optimal policy is trivial — at each state, pick the action that leads to the highest-value successor state. The entire challenge of MDPs is computing C(s).

For robot path planning, the value function forms a gradient field over the map. Following the gradient (ascending toward higher values) leads to the goal along the optimal path.

Check: Given the optimal value function, how do you extract the optimal policy?

At each state, choose the action that leads to the highest-value successor state (greedy with respect to C) Choose the action with the highest immediate reward Choose a random action

Chapter 4: The Bellman Equation

The optimal value function satisfies the Bellman equation, a recursive relationship:

C(s) = max_a ∫ [c(s') + γ C(s')] p(s' | a, s) ds'

In words: the value of state s is the best (over all actions) expected sum of (immediate payoff + discounted future value).

The recursive insight: The value at state s depends on the values at successor states. But those successor values depend on their successors, and so on. The Bellman equation captures this infinite chain in one self-consistent equation. Value iteration (next chapter) solves it by iterating until convergence.

For finite state spaces, the integral becomes a sum. For continuous spaces, discretization or function approximation is needed.

Check: What does the Bellman equation express?

The optimal value of a state equals the best expected (immediate reward + discounted future value) The probability of reaching the goal The shortest path to the goal

Chapter 5: Value Iteration

Value iteration solves the Bellman equation by repeated application:

1. Initialize Ĉ(s) = 0 for all states.

2. Repeat until convergence:

Ĉ(s) ← max_a ∫ [c(s') + γ Ĉ(s')] p(s' | a, s) ds' ∀ s

Each iteration improves the value estimate by looking one step further into the future. After k iterations, Ĉ reflects the optimal value for a k-step horizon.

Convergence: Value iteration converges if γ < 1. The error decreases geometrically: after k iterations, the error is at most γ^k · C_max. With γ = 0.95, the error halves roughly every 14 iterations. In practice, convergence is observed after 50-200 iterations for typical grid-based robot planning problems.

The order in which states are updated doesn't matter for convergence, as long as each state is updated infinitely often. In practice, sweeping through all states once per iteration works well.

Check: What does each iteration of value iteration do?

Updates each state's value to reflect the best one-step lookahead using the current value estimates Moves the robot one step Adds a new state to the grid

Chapter 6: Convergence

Value iteration is guaranteed to converge when γ < 1. The key properties:

Property	Description
Contraction	Each iteration reduces the maximum error by factor γ
Monotonicity	If Ĉ^k ≤ C, then Ĉ^k+1 ≤ C (values approach optimum from below)
Fixed point	The only fixed point of the Bellman update is C*

For γ = 1 (undiscounted), convergence is not guaranteed in general, but works in practice for goal-reaching problems where all states can reach the goal in finite time.

Practical stopping criterion: Stop when the maximum change in any value is below a threshold: max_s |Ĉ^k+1(s) - Ĉ^k(s)| < ε. The resulting policy is ε/(1-γ)-optimal.

Check: Why does value iteration converge for γ < 1?

The Bellman update is a contraction mapping, reducing the error by factor γ at each step The state space is finite The rewards are bounded

Chapter 7: Robot Path Planning

MDPs naturally solve the robot path planning problem. The state space is the robot's configuration space (x, y for a circular robot). Actions are motion commands (move in 8 directions). The payoff is high at the goal and zero (or negative) elsewhere.

After running value iteration, the value function forms a gradient field: brighter = higher value = closer to goal. Following the gradient leads to the goal along the optimal path, automatically avoiding obstacles (which have low/zero value).

MDP path planning vs shortest path: The shortest path ignores action uncertainty. The MDP optimal path considers that the robot might drift into obstacles. Near narrow passages, the MDP might prefer a wider but safer path. Near cliff edges, the MDP steers further from danger. This is risk-aware planning.

For circular robots on flat surfaces, planning in (x, y) space suffices. For non-circular robots or those with dynamics, the state must include orientation and velocity, making the problem 4-6 dimensional. Value iteration becomes expensive but still tractable for moderate resolutions.

Check: How does MDP path planning differ from shortest-path planning?

MDP planning accounts for action uncertainty, preferring safer paths even if they're longer MDP planning is always shorter There is no difference

Chapter 8: Value Iteration Demo

Watch value iteration compute the optimal value function for a grid world. The goal cell has high reward. Obstacles have zero value. Value propagates outward from the goal, filling the navigable space.

MDP Value Iteration

Bright cells = high value (near goal). Dark cells = low value or obstacles. Click "Iterate" to run one sweep. The value function converges to a gradient pointing toward the goal.

Iteration 0. Click Iterate.

What to notice: The value function starts at zero everywhere and grows outward from the goal like a wave. After convergence, following the brightness gradient from any cell leads to the goal. The path naturally avoids obstacles because obstacle-adjacent cells have lower value.

Check: After value iteration converges, how does the robot find the optimal path?

At each state, move to the neighboring state with the highest value (follow the gradient) Use A* search on the value function Move in a random direction

Chapter 9: Summary

Concept	Key Idea
MDP	Framework for sequential decision-making with stochastic transitions
Policy	Mapping from states to actions; optimal policy maximizes expected cumulative reward
Value function	Expected future reward from each state under optimal policy
Bellman equation	Recursive relationship defining the optimal value function
Value iteration	Iterative algorithm that converges to optimal value function

What comes next: MDPs assume the state is fully observable. But robots only have beliefs about their state (from localization). Chapter 16 introduces POMDPs — planning in belief space, where the robot must reason about its own uncertainty and choose actions that gather information as well as achieve goals.

"The value of a state is not where you are,
but where you can go from here."

Check: What key assumption does the MDP framework make about state observability?

The state is fully observable — the robot knows its exact state at all times The state is partially observable No assumption about observability