Planning under uncertainty: value iteration, optimal policies, payoff functions, and robot path planning.
So far, this book has focused on perception: estimating the robot's state from noisy data. But perception alone doesn't make a robot useful. The robot must also act: choose actions that achieve goals despite uncertain outcomes.
Why can't we just plan deterministically and then execute? Because robot actions have uncertain outcomes. A "move forward 1 meter" command might actually move 0.95m or 1.07m. A planned path around an obstacle might fail if the robot drifts slightly.
This chapter introduces Markov Decision Processes (MDPs), the framework for optimal sequential decision-making under uncertainty when the state is fully observable. Chapter 16 extends to the partially observable case.
An MDP is defined by four components:
| Component | Symbol | Description |
|---|---|---|
| States | S | The set of all possible states (e.g., robot positions) |
| Actions | A | The set of actions available at each state |
| Transition model | p(s' | a, s) | Probability of reaching s' by doing a in s |
| Payoff / Reward | c(s) | Immediate reward for being in state s |
The Markov property: the transition probability depends only on the current state and action, not on history. This is the same assumption used throughout this book for Bayes filters.
A policy π(s) maps each state to an action. Given a policy, the robot knows what to do in any situation. A deterministic policy says "in state s, do action a." A stochastic policy says "in state s, do action a with probability π(s,a)."
The optimal policy π* is the one that maximizes the expected cumulative payoff over time. For a finite horizon T:
where γ ∈ [0,1] is the discount factor. With γ < 1, future rewards are worth less than immediate ones, encouraging the robot to achieve goals sooner.
The value function C(s) is the expected cumulative payoff starting from state s and following the optimal policy thereafter:
The value function captures the "goodness" of each state. States near the goal have high value. States near obstacles have low value. States far from the goal have moderate value (it will take time to reach the goal, so future rewards are discounted).
For robot path planning, the value function forms a gradient field over the map. Following the gradient (ascending toward higher values) leads to the goal along the optimal path.
The optimal value function satisfies the Bellman equation, a recursive relationship:
In words: the value of state s is the best (over all actions) expected sum of (immediate payoff + discounted future value).
For finite state spaces, the integral becomes a sum. For continuous spaces, discretization or function approximation is needed.
Value iteration solves the Bellman equation by repeated application:
1. Initialize Ĉ(s) = 0 for all states.
2. Repeat until convergence:
Each iteration improves the value estimate by looking one step further into the future. After k iterations, Ĉ reflects the optimal value for a k-step horizon.
The order in which states are updated doesn't matter for convergence, as long as each state is updated infinitely often. In practice, sweeping through all states once per iteration works well.
Value iteration is guaranteed to converge when γ < 1. The key properties:
| Property | Description |
|---|---|
| Contraction | Each iteration reduces the maximum error by factor γ |
| Monotonicity | If Ĉk ≤ C*, then Ĉk+1 ≤ C* (values approach optimum from below) |
| Fixed point | The only fixed point of the Bellman update is C* |
For γ = 1 (undiscounted), convergence is not guaranteed in general, but works in practice for goal-reaching problems where all states can reach the goal in finite time.
MDPs naturally solve the robot path planning problem. The state space is the robot's configuration space (x, y for a circular robot). Actions are motion commands (move in 8 directions). The payoff is high at the goal and zero (or negative) elsewhere.
After running value iteration, the value function forms a gradient field: brighter = higher value = closer to goal. Following the gradient leads to the goal along the optimal path, automatically avoiding obstacles (which have low/zero value).
For circular robots on flat surfaces, planning in (x, y) space suffices. For non-circular robots or those with dynamics, the state must include orientation and velocity, making the problem 4-6 dimensional. Value iteration becomes expensive but still tractable for moderate resolutions.
Watch value iteration compute the optimal value function for a grid world. The goal cell has high reward. Obstacles have zero value. Value propagates outward from the goal, filling the navigable space.
Bright cells = high value (near goal). Dark cells = low value or obstacles. Click "Iterate" to run one sweep. The value function converges to a gradient pointing toward the goal.
| Concept | Key Idea |
|---|---|
| MDP | Framework for sequential decision-making with stochastic transitions |
| Policy | Mapping from states to actions; optimal policy maximizes expected cumulative reward |
| Value function | Expected future reward from each state under optimal policy |
| Bellman equation | Recursive relationship defining the optimal value function |
| Value iteration | Iterative algorithm that converges to optimal value function |