Ch 9: Basic Decision Theory — LaValle, Planning Algorithms

Chapter 0: Uncertainty in Outcomes

You’re a surgeon choosing between two procedures. Procedure A succeeds 80% of the time with a 5% chance of a minor complication. Procedure B succeeds 95% of the time but carries a 2% chance of a severe complication. Which do you choose?

This isn’t a planning problem in the geometric sense — there’s no configuration space, no obstacles, no path. Yet it is exactly the kind of problem planners face the moment the environment stops being fully predictable. Chapter 9 of LaValle asks: how do you make a rational decision when outcomes depend on chance?

The standard answer from economics and statistics is expected utility theory: assign a number (utility or cost) to each possible outcome, multiply by its probability, sum up. Pick the action that maximizes expected utility (or minimizes expected cost). Simple. But as we’ll see, “simple” hides a lot of controversy.

Three fundamentally different scenarios: Sometimes the randomness comes from nature (wind, sensor noise) — a game against nature where nature has no agenda. Sometimes it comes from an adversary who actively tries to make you fail — a zero-sum game. And sometimes it comes from other agents who have their own goals that partially overlap with yours — a nonzero-sum game. Each demands a different decision criterion.

The decision matrix

All three scenarios share a common representation: a cost matrix (or payoff matrix) where rows are your choices (actions) and columns are the states of the world (what nature/adversary/other player does). The cell at row i, column j is the cost you pay when you choose action i and the world is in state j.

The Decision Landscape

Three decision scenarios. Click each to see how the “opponent” behaves and which criterion applies.

A weather forecasting system must choose between two prediction models. The outcome depends on which type of storm actually develops, which is random. What kind of decision problem is this?

A game against nature — the storm doesn’t try to fool the forecaster; it just is what it is A zero-sum game — storms actively try to make forecasts wrong A nonzero-sum game — storms and forecasters have partially aligned goals

Chapter 1: Probability Review

Before we can talk about expected anything, we need a quick calibration on what probability means in this context. LaValle §9.1 sets up the mathematical scaffolding that everything else hangs on.

Sample spaces and probability distributions

A sample space Θ is the set of all possible states the world could be in. Each element θ ∈ Θ is called a nature state (or “state of nature”). A probability distribution P over Θ assigns a number P(θ) to each state, with two constraints: P(θ) ≥ 0 for all θ, and the probabilities sum to 1.

∑_{θ ∈ Θ} P(θ) = 1

If Θ is continuous (say, the exact wind speed) we use a probability density function p(θ) and integrate instead of sum. For planning, we usually discretize: finitely many possible states, each with a known probability.

Expected value

The expected value of a function f over the distribution P is a weighted average: each possible outcome f(θ) is weighted by how likely that outcome is.

E[f] = ∑_θ P(θ) · f(θ)

Think of it as a long-run average. If you repeated the random experiment thousands of times, the average value of f would converge to E[f]. This is the Law of Large Numbers — the bedrock justification for expected-value reasoning.

Why expected value? It’s the unique combination rule that is (a) linear in probabilities, (b) consistent across repeated decisions, and (c) derivable from a small set of reasonable axioms (the von Neumann–Morgenstern axioms). These axioms feel so natural that their violation — which real humans exhibit — is called “irrational.” Chapter 8 will scrutinize whether that label is fair.

Conditional probability and Bayes

When you have partial information — you observe something about the world — you update your beliefs using Bayes’ rule.

P(θ | z) = P(z | θ) · P(θ) / P(z)

Here z is an observation, P(θ) is your prior belief, P(z | θ) is how likely you were to see z if θ were true (the likelihood), and P(θ | z) is your updated posterior belief. In planning under uncertainty, the agent typically maintains a posterior over Θ and updates it as new sensor data arrives.

Probability Mass and Expected Value

Drag the sliders to set probabilities for three outcomes. Watch how E[cost] changes.

Cost A = 1, P(A)0.30

Cost B = 5, P(B)0.50

A robot can end up in one of three states: safe (probability 0.6, cost 0), slowed (probability 0.3, cost 4), crashed (probability 0.1, cost 20). What is the expected cost?

8.0 — average of the three costs 3.2 — weighted sum: 0.6×0 + 0.3×4 + 0.1×20 2.0 — only count the most likely outcome

Chapter 2: Game Against Nature

A game against nature is a decision problem where the randomness in outcomes has no agenda. Nature doesn’t try to foil you. It simply picks states according to a fixed probability distribution, indifferent to your choices.

This is the setting of classical statistics and most robot planning under uncertainty. The robot picks an action. Nature picks a state (perhaps wind speed, terrain friction, sensor noise level). The combination determines a cost. Repeat.

Setting up the problem

Formally (LaValle §9.1.2): you have a set of actions A = {a₁, …, a_m} and a set of nature states Θ = {θ₁, …, θ_n}. The cost function L: A × Θ → ℜ specifies L(a, θ) = cost when you take action a and nature is in state θ. Nature has a known distribution P over Θ.

The expected cost criterion

The expected cost criterion says: choose the action that minimizes expected cost.

a* = argmin_{a ∈ A} ∑_{θ ∈ Θ} P(θ) · L(a, θ)

This is the rational choice if you’ll be making this decision many times and P is correct. In a single shot, expected cost is still the best you can do given no other information — any other rule is dominated by expected-cost minimization under the right interpretation.

When does this apply? The expected cost criterion requires that (1) probabilities are known, (2) you care about long-run average performance, and (3) you’re not in a regime where rare catastrophes dominate all other considerations. When these conditions fail, other criteria (minimax, CVaR) are more appropriate.

A worked example

You’re a delivery drone choosing between two flight corridors. Corridor A has clear air 70% of the time (cost 1) and turbulence 30% of the time (cost 8). Corridor B is always calm (cost 3).

E[cost | A] = 0.7 × 1 + 0.3 × 8 = 0.7 + 2.4 = 3.1

E[cost | B] = 1.0 × 3 = 3.0

Expected cost slightly favors Corridor B (3.0 vs 3.1). But if turbulence cost were 7 instead of 8, A would win with E = 2.8. The decision is sensitive to the exact cost values — which is why getting the cost function right matters enormously in practice.

Expected Cost Calculator

Adjust the turbulence cost and probability. See when Corridor A beats Corridor B.

Turbulence cost8

P(turbulence)0.30

In a game against nature, probabilities are unknown. You can estimate them from data, but you’re not sure they’re right. The expected cost criterion is…

Completely invalid — you can’t use expected cost without perfect probabilities Guaranteed optimal — estimated probabilities are as good as exact ones A reasonable heuristic — with good estimates it performs well, but robustness criteria may be safer under severe uncertainty

Chapter 3: Minimax Criterion

What if you don’t know the probabilities? Or what if you’re designing a safety-critical system where rare worst-case events could be catastrophic, and you can’t afford to get unlucky?

The minimax criterion (LaValle §9.1.3) abandons probability entirely. Instead of minimizing expected cost, it minimizes the worst possible cost. You assume nature will conspire against you and pick the state that hurts you most. You counter by choosing the action that makes that worst case as good as possible.

a* = argmin_{a ∈ A} max_{θ ∈ Θ} L(a, θ)

Read this as: “For each action, find the worst thing nature can do. Then pick the action whose worst case is least terrible.”

Minimax vs expected cost — a concrete comparison

Consider two earthquake shelter designs. Each has a cost (structural resources wasted) in calm weather and a cost (lives endangered) in a major quake. Major quakes happen 5% of the time.

	Calm (95%)	Quake (5%)
Design A	2	50
Design B	10	12

Expected cost: A = 0.95×2 + 0.05×50 = 4.4, B = 0.95×10 + 0.05×12 = 10.1

Minimax: worst(A) = 50, worst(B) = 12 → choose B

Expected cost says Design A. Minimax says Design B. Which is right? If you’re building a school, the minimax argument is compelling — you can’t average over the children who get killed in the 5% scenario.

Minimax is conservative by design. It assumes an adversarial world. In games against nature (random, non-adversarial), minimax is often too pessimistic — you spend a lot to hedge against outcomes that are very unlikely. It earns its keep when (a) the worst case is catastrophic and irreversible, or (b) you genuinely don’t know the probabilities and can’t estimate them reliably.

Minimax vs Expected Cost

Drag the quake probability. See when each criterion flips its choice.

P(quake)0.05

A robot arm must choose between two grasps. Grasp A costs 1 in normal conditions but drops the object (cost 100) if humidity is high. Grasp B always costs 20, regardless of humidity. Humidity is high 3% of the time. Minimax prefers…

Grasp A — expected cost is lower (0.97×1 + 0.03×100 = 3.97 vs 20) Grasp B — the worst case (20) is better than Grasp A’s worst case (100) Neither — minimax can’t handle more than two options

Chapter 4: Zero-Sum Games

Now the adversary has a mind. In a zero-sum game (LaValle §9.2), two players compete: whatever Player 1 gains, Player 2 loses by exactly the same amount. Their interests are perfectly opposed. Think rock-paper-scissors, chess, or a robot being tested against an adversarial environment.

Both players choose actions simultaneously, neither knowing what the other will pick. The outcome is determined by the combination. Player 1 wants to minimize costs; Player 2 wants to maximize them (or equivalently, Player 2 minimizes their own payoff which equals −Player 1’s payoff).

Nash Equilibrium

A Nash equilibrium is a pair of strategies (one per player) such that neither player can improve their outcome by unilaterally changing their strategy. It’s the stable resting point where both players are playing optimally given what the other is doing.

In a zero-sum game, minimax and Nash equilibrium coincide: the optimal strategy for Player 1 is to minimax (minimize the maximum cost the adversary can impose), and Player 2 maximin-izes (maximizes the minimum gain). The value where these meet is the game value.

The minimax theorem (von Neumann, 1928): For every finite two-player zero-sum game, there exists a mixed-strategy Nash equilibrium, and both players achieve the same game value. Player 1 cannot do better than this value; Player 2 cannot prevent Player 1 from achieving it. The value is unique even though the equilibrium strategies may not be.

Finding Nash Equilibrium — pure strategies first

A pure strategy Nash equilibrium exists when there is a cell in the payoff matrix that is simultaneously a row minimum (best for Player 1 given that column) and a column maximum (best for Player 2 given that row). Such a cell is called a saddle point.

Interactive Payoff Matrix & Nash Equilibrium Finder

Click any cell to edit the cost. The finder highlights the Nash equilibrium (if pure) or shows that mixed strategies are needed.

In the payoff matrix below, Player 1 picks rows (minimizing), Player 2 picks columns (maximizing). Costs: Row A = [3, 5], Row B = [7, 2]. Is there a pure Nash equilibrium?

Yes: (A, col 1) with value 3 — it’s the row minimum of row A (min(3,5)=3) and column maximum of col 1 (max(3,7)=7 ≠ 3)… wait, let me reconsider. No pure Nash equilibrium exists — neither (A,1)=3 nor (B,2)=2 satisfies both conditions simultaneously, so mixed strategies are required Yes: (B, col 2) with value 2 — it is the global minimum

Chapter 5: Mixed Strategies

Most zero-sum games don’t have a saddle point. Rock-paper-scissors has no pure Nash equilibrium: whatever you play deterministically, the adversary can exploit it. The solution is to randomize.

A mixed strategy is a probability distribution over your pure actions. Instead of committing to one move, you commit to a randomization rule — “play Rock 40%, Scissors 30%, Paper 30%.” The key insight: if your mixed strategy is at Nash equilibrium, your opponent cannot gain by knowing your strategy (because you’re randomizing). The optimal mixed strategy is the one that makes your opponent indifferent between all their actions.

Computing mixed Nash equilibrium

For a 2×2 game with cost matrix L (Player 1 minimizes, Player 2 maximizes), let Player 1 mix with probability p on row 1 and (1−p) on row 2. Player 1 sets p to make Player 2 indifferent — i.e., so both columns give Player 2 the same expected payoff.

p · L(1,1) + (1−p) · L(2,1) = p · L(1,2) + (1−p) · L(2,2)

p* = [L(2,2) − L(2,1)] / [L(1,1) − L(1,2) − L(2,1) + L(2,2)]

Symmetrically, Player 2 sets their mix to make Player 1 indifferent between rows. The game value V is then the expected cost under these mixed strategies.

Why randomize against nature? You don’t — against a non-strategic opponent, pure deterministic strategies are always at least as good as mixed ones. Mixed strategies are only valuable when the opponent is strategic and can exploit your pattern. Against truly random nature, pick the best deterministic action.

The linear programming connection

For games larger than 2×2, finding the mixed Nash equilibrium reduces to a linear program. Player 1 solves: maximize V subject to, for each column j, ∑_i p_i L(i,j) ≤ V, and ∑_i p_i = 1, p_i ≥ 0. This LP is always feasible and has a unique optimal value (the game value), though the optimal strategies may not be unique.

Mixed Strategy Equilibrium (2×2)

Set the four payoffs. The diagram shows the optimal mixed strategy and game value.

L(1,1)0

L(1,2)5

L(2,1)3

L(2,2)-2

In rock-paper-scissors, what is the Nash equilibrium mixed strategy for Player 1?

Play Rock always — it beats Scissors and Scissors is least common Play Rock, Paper, Scissors each with probability 1/3 — this makes Player 2 indifferent between all three choices Play whichever move the opponent just used — matching their last move

Chapter 6: Nonzero-Sum Games

Not all conflicts are total. Two drivers approaching an intersection both want to cross quickly — but they both also want to avoid a collision. Their interests overlap but don’t perfectly align. This is a nonzero-sum game (LaValle §9.3): the losses of one player do not equal the gains of the other.

In a nonzero-sum game, each player has their own payoff matrix. Cooperation can sometimes leave both players better off than pure competition. But the question of whether to cooperate or defect — and whether to trust that the other player will cooperate — is where things get interesting.

Nash equilibrium in nonzero-sum games

Nash equilibrium still applies: a pair of strategies is a Nash equilibrium if neither player can improve their own payoff by unilaterally deviating. But now:

There can be multiple Nash equilibria, with different payoffs for both players.
Nash equilibria can be inefficient: both players might do better if they could coordinate to a different outcome.
The minimax theorem no longer applies in general — the game value concept breaks down.

The efficiency gap: In zero-sum games, Nash equilibrium is always efficient by definition (there’s no room for joint improvement). In nonzero-sum games, Nash equilibria are often Pareto suboptimal — there exists another outcome that makes both players better off, but rational self-interest leads them away from it. This is the central tragedy of strategic interaction.

Coordination games

Some nonzero-sum games have multiple equilibria that are Pareto-ranked. Two robots must choose which side of a corridor to walk on. Both prefer any consistent choice (left–left or right–right) over inconsistency (left–right collision). This is a coordination game: the problem is selecting which equilibrium to play, not whether to cooperate.

	Robot B: Left	Robot B: Right
Robot A: Left	3, 3 ← Nash	0, 0
Robot A: Right	0, 0	3, 3 ← Nash

Both (Left, Left) and (Right, Right) are Nash equilibria. Traffic conventions solve the coordination problem socially — a form of “pre-game communication” that game theory alone can’t provide.

Two firms both advertise their product. If neither advertises, each earns 10. If both advertise, each earns 5 (ad costs cancel out). If only one advertises, it earns 15 and the other earns 2. How many pure Nash equilibria does this game have?

Zero — neither “both advertise” nor “neither advertise” is stable One — (Both advertise, Both advertise): if one firm deviates to not advertising, it drops from 5 to 2, so neither wants to deviate Two — (Both advertise) and (Neither advertise) are both stable

Chapter 7: Prisoner’s Dilemma

The most famous nonzero-sum game in history. Two suspects are arrested. Each can either cooperate with the other (stay silent) or defect (betray the other). The police keep them separate — no communication is possible. The payoffs are set up so that defection is always better for you regardless of what the other person does, yet if both defect, both are worse off than if both had cooperated.

	B: Cooperate	B: Defect
A: Cooperate	(−1, −1)	(−10, 0)
A: Defect	(0, −10)	(−5, −5) ← Nash

Numbers are years in prison (negative = bad). If A cooperates and B defects, A gets 10 years and B goes free. If both defect, each gets 5 years. If both cooperate, each gets only 1 year.

Why defect is dominant

A dominant strategy is one that is strictly better regardless of what the other player does. For each player: if the other cooperates, defecting gives 0 years (vs 1 year for cooperating) — defect wins. If the other defects, defecting gives 5 years (vs 10 years for cooperating) — defect wins again.

So rational players defect. Both defect. Both get 5 years. But the Pareto-optimal outcome — both cooperate, 1 year each — is just sitting there. Individual rationality leads to collective irrationality. This is the dilemma.

Why does this matter for planning? Multi-robot systems face prisoner’s dilemmas constantly. Two autonomous vehicles approaching an intersection: each individually benefits from assuming the other will yield, but if both assume that, they collide. Resource sharing among robots: each benefits from using shared resources freely, but collective overuse depletes them. The dilemma appears wherever individually rational behavior produces collectively bad outcomes.

Escaping the dilemma

Three main escape routes have been studied:

Repeated interaction

If the game repeats indefinitely, “tit-for-tat” (cooperate first, then copy opponent) can sustain cooperation as a Nash equilibrium — defecting now invites retaliation later.

↓

Communication + commitment

If players can credibly commit to cooperation before playing (e.g., a binding contract), the dilemma dissolves. The key word is “credible” — promises without enforcement don’t help.

↓

Mechanism design

A third party (government, protocol designer) changes the payoffs — adding punishment for defection or reward for cooperation — so that cooperation becomes the dominant strategy.

Iterated Prisoner’s Dilemma Simulator

Run 20 rounds. Player A uses tit-for-tat. Choose Player B’s strategy and see cumulative years in prison.

In a single-shot prisoner’s dilemma (played exactly once), Player A knows Player B is perfectly rational and self-interested. What should Player A do?

Defect — defection is a dominant strategy regardless of B’s action, and since both players are rational, both will reason this way and both will defect Cooperate — since B is rational, B will cooperate, and cooperating with a cooperator yields the best joint outcome Randomize 50/50 — it’s the mixed Nash equilibrium of this game

Chapter 8: Decision Theory Under Scrutiny

Expected utility theory is elegant, axiomatically grounded, and computationally tractable. It is also empirically wrong — human beings systematically violate its predictions. LaValle §9.4 surveys the most important challenges.

The Allais Paradox

Maurice Allais (1953) posed two choice problems that expose a deep inconsistency in expected utility maximization:

Problem 1: Choose between:

A: $1M with certainty
B: $5M with prob 0.1, $1M with prob 0.89, $0 with prob 0.01

Problem 2: Choose between:

C: $1M with prob 0.11, $0 with prob 0.89
D: $5M with prob 0.10, $0 with prob 0.90

Most people choose A in Problem 1 (certainty is appealing) and D in Problem 2 (the extra $4M is worth the 1% extra risk). But any utility function U that rationalizes A > B implies C > D, and any that rationalizes D > C implies B > A. You can’t have both A and D without violating expected utility theory. Yet nearly everyone picks A and D. This is the Allais paradox.

What the paradox reveals: People weight certainty more than expected utility theory allows. A 100% chance feels qualitatively different from a 99% chance, even though the 1% difference affects expected value identically whether you’re going from 100%→99% or 11%→10%. Expected utility is linear in probability; human preferences are not.

The Ellsberg Paradox

Daniel Ellsberg (1961) showed that people are averse not just to risk (known probabilities) but to ambiguity (unknown probabilities). An urn contains 30 red balls and 60 balls that are either black or yellow in unknown proportion. Most people prefer to bet on red over black (Problem 1) yet prefer to bet on “black or yellow” over “red or yellow” (Problem 2). These preferences are inconsistent with any subjective probability assignment — showing that ambiguity aversion is not mere risk aversion under mis-calibrated probabilities.

Prospect theory (Kahneman & Tversky)

Kahneman and Tversky’s prospect theory (1979) replaces expected utility with a descriptively accurate model that captures observed behavior:

Reference dependence: People evaluate outcomes as gains or losses relative to a reference point, not as absolute wealth levels.
Loss aversion: Losses hurt roughly twice as much as equivalent gains feel good. The utility function is steeper for losses than for gains.
Probability weighting: Small probabilities are overweighted (lottery appeal), large probabilities are underweighted (insurance appeal).
Diminishing sensitivity: The marginal impact of gains and losses decreases as magnitude increases (“$1000 less painful to lose when you’re already losing $10,000”).

For planning systems: If you’re building a robot that serves humans, expected utility maximization may not produce the choices users will accept or prefer. Understanding where human decision-making deviates from the rational model is essential for designing trustworthy autonomous systems. If you’re building a planning algorithm, expected utility remains the standard normative benchmark — but robustness to model uncertainty (Ellsberg) is worth engineering explicitly.

The Allais paradox demonstrates that expected utility theory fails as a…

Descriptive model of human choice — people systematically violate its predictions, preferring certainty more than probability linearity allows Normative model of rational choice — the axioms leading to expected utility are self-contradictory Computational model — expected utility is too hard to compute in practice

Chapter 9: Connections & What’s Next

Decision theory is the foundation for everything that comes next in planning under uncertainty. Chapter 9 of LaValle is explicitly a gateway, not a destination.

How this chapter fits in LaValle’s arc

Ch 1–6: Deterministic planning

Full state knowledge, no randomness. Find a path.

↓

Ch 7–8: Feedback planning

Partially unknown state, react as you go. Sensor-based strategies.

↓

Ch 9: Decision theory (this lesson)

Formal framework for choices under uncertainty. Expected cost, minimax, Nash.

↓

Ch 10: Sequential decisions (MDPs)

Now make many decisions in sequence, with state uncertainty. Bellman equations.

↓

Ch 11: Sensors and information spaces

What does the robot know? How does it represent uncertain state?

↓

Ch 12: Planning under sensing uncertainty (POMDPs)

The full problem: uncertain state, imperfect sensors, optimal policy.

Decision criteria: comparison

Criterion	When to use	Key assumption	Main weakness
Expected cost	Known probabilities, repeated decisions	Probabilities are correct	Ignores tail risk
Minimax	Unknown probabilities, safety-critical	Adversarial nature	Too conservative
Nash (zero-sum)	Strategic adversary	Both players rational	May require mixing
Nash (nonzero-sum)	Strategic cooperation/conflict	Both players rational	Multiple equilibria, inefficiency
Prospect theory	Modeling human behavior	Descriptive accuracy	Hard to compute, non-axiomatic

Related lessons in this library

Ch 8: Feedback Planning — planning that reacts to sensor input, the bridge to uncertainty
Ch 10: Sequential Decisions — MDPs, Bellman equations, value iteration
Kochenderfer: Single-Agent Decision Making — a deeper treatment with computational focus

Three things to carry forward

1. Expected cost is the default. When you have reliable probabilities and a convex cost function, minimize expected cost. Everything else is a special case or a robustness hedge.

2. The adversary type determines the criterion. Nature → expected cost or minimax. Strategic adversary → Nash. Mixed motive → Nash for nonzero-sum, possibly mechanism design.

3. Rational ≠ human. Expected utility theory is normatively compelling but descriptively wrong. If your autonomous system must work with or for humans, build in space for this gap.

“The theory of probability is at bottom only common sense reduced to calculus.”
— Pierre-Simon Laplace, Théorie Analytique des Probabilités, 1812

You’re designing an autonomous vehicle that must choose between two lane-change strategies. You have extensive data on outcomes in 50,000 similar scenarios. Safety failures are rare but catastrophic. Which decision criterion is most appropriate?

Expected cost minimization, informed by the 50,000-scenario data — with large data you trust the probabilities, but also analyze the tail (CVaR or minimax on rare catastrophic cases) given irreversibility of safety failures Minimax only — always hedge against the worst case in safety-critical systems Nash equilibrium — other road users are strategic adversaries