When outcomes are uncertain, how do you choose? Expected cost, minimax, Nash equilibria, and the deep critiques that keep philosophers busy.
You’re a surgeon choosing between two procedures. Procedure A succeeds 80% of the time with a 5% chance of a minor complication. Procedure B succeeds 95% of the time but carries a 2% chance of a severe complication. Which do you choose?
This isn’t a planning problem in the geometric sense — there’s no configuration space, no obstacles, no path. Yet it is exactly the kind of problem planners face the moment the environment stops being fully predictable. Chapter 9 of LaValle asks: how do you make a rational decision when outcomes depend on chance?
The standard answer from economics and statistics is expected utility theory: assign a number (utility or cost) to each possible outcome, multiply by its probability, sum up. Pick the action that maximizes expected utility (or minimizes expected cost). Simple. But as we’ll see, “simple” hides a lot of controversy.
All three scenarios share a common representation: a cost matrix (or payoff matrix) where rows are your choices (actions) and columns are the states of the world (what nature/adversary/other player does). The cell at row i, column j is the cost you pay when you choose action i and the world is in state j.
Three decision scenarios. Click each to see how the “opponent” behaves and which criterion applies.
Before we can talk about expected anything, we need a quick calibration on what probability means in this context. LaValle §9.1 sets up the mathematical scaffolding that everything else hangs on.
A sample space Θ is the set of all possible states the world could be in. Each element θ ∈ Θ is called a nature state (or “state of nature”). A probability distribution P over Θ assigns a number P(θ) to each state, with two constraints: P(θ) ≥ 0 for all θ, and the probabilities sum to 1.
If Θ is continuous (say, the exact wind speed) we use a probability density function p(θ) and integrate instead of sum. For planning, we usually discretize: finitely many possible states, each with a known probability.
The expected value of a function f over the distribution P is a weighted average: each possible outcome f(θ) is weighted by how likely that outcome is.
Think of it as a long-run average. If you repeated the random experiment thousands of times, the average value of f would converge to E[f]. This is the Law of Large Numbers — the bedrock justification for expected-value reasoning.
When you have partial information — you observe something about the world — you update your beliefs using Bayes’ rule.
Here z is an observation, P(θ) is your prior belief, P(z | θ) is how likely you were to see z if θ were true (the likelihood), and P(θ | z) is your updated posterior belief. In planning under uncertainty, the agent typically maintains a posterior over Θ and updates it as new sensor data arrives.
Drag the sliders to set probabilities for three outcomes. Watch how E[cost] changes.
A game against nature is a decision problem where the randomness in outcomes has no agenda. Nature doesn’t try to foil you. It simply picks states according to a fixed probability distribution, indifferent to your choices.
This is the setting of classical statistics and most robot planning under uncertainty. The robot picks an action. Nature picks a state (perhaps wind speed, terrain friction, sensor noise level). The combination determines a cost. Repeat.
Formally (LaValle §9.1.2): you have a set of actions A = {a1, …, am} and a set of nature states Θ = {θ1, …, θn}. The cost function L: A × Θ → ℜ specifies L(a, θ) = cost when you take action a and nature is in state θ. Nature has a known distribution P over Θ.
The expected cost criterion says: choose the action that minimizes expected cost.
This is the rational choice if you’ll be making this decision many times and P is correct. In a single shot, expected cost is still the best you can do given no other information — any other rule is dominated by expected-cost minimization under the right interpretation.
You’re a delivery drone choosing between two flight corridors. Corridor A has clear air 70% of the time (cost 1) and turbulence 30% of the time (cost 8). Corridor B is always calm (cost 3).
Expected cost slightly favors Corridor B (3.0 vs 3.1). But if turbulence cost were 7 instead of 8, A would win with E = 2.8. The decision is sensitive to the exact cost values — which is why getting the cost function right matters enormously in practice.
Adjust the turbulence cost and probability. See when Corridor A beats Corridor B.
What if you don’t know the probabilities? Or what if you’re designing a safety-critical system where rare worst-case events could be catastrophic, and you can’t afford to get unlucky?
The minimax criterion (LaValle §9.1.3) abandons probability entirely. Instead of minimizing expected cost, it minimizes the worst possible cost. You assume nature will conspire against you and pick the state that hurts you most. You counter by choosing the action that makes that worst case as good as possible.
Read this as: “For each action, find the worst thing nature can do. Then pick the action whose worst case is least terrible.”
Consider two earthquake shelter designs. Each has a cost (structural resources wasted) in calm weather and a cost (lives endangered) in a major quake. Major quakes happen 5% of the time.
| Calm (95%) | Quake (5%) | |
|---|---|---|
| Design A | 2 | 50 |
| Design B | 10 | 12 |
Expected cost: A = 0.95×2 + 0.05×50 = 4.4, B = 0.95×10 + 0.05×12 = 10.1
Minimax: worst(A) = 50, worst(B) = 12 → choose B
Expected cost says Design A. Minimax says Design B. Which is right? If you’re building a school, the minimax argument is compelling — you can’t average over the children who get killed in the 5% scenario.
Drag the quake probability. See when each criterion flips its choice.
Now the adversary has a mind. In a zero-sum game (LaValle §9.2), two players compete: whatever Player 1 gains, Player 2 loses by exactly the same amount. Their interests are perfectly opposed. Think rock-paper-scissors, chess, or a robot being tested against an adversarial environment.
Both players choose actions simultaneously, neither knowing what the other will pick. The outcome is determined by the combination. Player 1 wants to minimize costs; Player 2 wants to maximize them (or equivalently, Player 2 minimizes their own payoff which equals −Player 1’s payoff).
A Nash equilibrium is a pair of strategies (one per player) such that neither player can improve their outcome by unilaterally changing their strategy. It’s the stable resting point where both players are playing optimally given what the other is doing.
In a zero-sum game, minimax and Nash equilibrium coincide: the optimal strategy for Player 1 is to minimax (minimize the maximum cost the adversary can impose), and Player 2 maximin-izes (maximizes the minimum gain). The value where these meet is the game value.
A pure strategy Nash equilibrium exists when there is a cell in the payoff matrix that is simultaneously a row minimum (best for Player 1 given that column) and a column maximum (best for Player 2 given that row). Such a cell is called a saddle point.
Click any cell to edit the cost. The finder highlights the Nash equilibrium (if pure) or shows that mixed strategies are needed.
Most zero-sum games don’t have a saddle point. Rock-paper-scissors has no pure Nash equilibrium: whatever you play deterministically, the adversary can exploit it. The solution is to randomize.
A mixed strategy is a probability distribution over your pure actions. Instead of committing to one move, you commit to a randomization rule — “play Rock 40%, Scissors 30%, Paper 30%.” The key insight: if your mixed strategy is at Nash equilibrium, your opponent cannot gain by knowing your strategy (because you’re randomizing). The optimal mixed strategy is the one that makes your opponent indifferent between all their actions.
For a 2×2 game with cost matrix L (Player 1 minimizes, Player 2 maximizes), let Player 1 mix with probability p on row 1 and (1−p) on row 2. Player 1 sets p to make Player 2 indifferent — i.e., so both columns give Player 2 the same expected payoff.
Symmetrically, Player 2 sets their mix to make Player 1 indifferent between rows. The game value V is then the expected cost under these mixed strategies.
For games larger than 2×2, finding the mixed Nash equilibrium reduces to a linear program. Player 1 solves: maximize V subject to, for each column j, ∑i pi L(i,j) ≤ V, and ∑i pi = 1, pi ≥ 0. This LP is always feasible and has a unique optimal value (the game value), though the optimal strategies may not be unique.
Set the four payoffs. The diagram shows the optimal mixed strategy and game value.
Not all conflicts are total. Two drivers approaching an intersection both want to cross quickly — but they both also want to avoid a collision. Their interests overlap but don’t perfectly align. This is a nonzero-sum game (LaValle §9.3): the losses of one player do not equal the gains of the other.
In a nonzero-sum game, each player has their own payoff matrix. Cooperation can sometimes leave both players better off than pure competition. But the question of whether to cooperate or defect — and whether to trust that the other player will cooperate — is where things get interesting.
Nash equilibrium still applies: a pair of strategies is a Nash equilibrium if neither player can improve their own payoff by unilaterally deviating. But now:
Some nonzero-sum games have multiple equilibria that are Pareto-ranked. Two robots must choose which side of a corridor to walk on. Both prefer any consistent choice (left–left or right–right) over inconsistency (left–right collision). This is a coordination game: the problem is selecting which equilibrium to play, not whether to cooperate.
| Robot B: Left | Robot B: Right | |
|---|---|---|
| Robot A: Left | 3, 3 ← Nash | 0, 0 |
| Robot A: Right | 0, 0 | 3, 3 ← Nash |
Both (Left, Left) and (Right, Right) are Nash equilibria. Traffic conventions solve the coordination problem socially — a form of “pre-game communication” that game theory alone can’t provide.
The most famous nonzero-sum game in history. Two suspects are arrested. Each can either cooperate with the other (stay silent) or defect (betray the other). The police keep them separate — no communication is possible. The payoffs are set up so that defection is always better for you regardless of what the other person does, yet if both defect, both are worse off than if both had cooperated.
| B: Cooperate | B: Defect | |
|---|---|---|
| A: Cooperate | (−1, −1) | (−10, 0) |
| A: Defect | (0, −10) | (−5, −5) ← Nash |
Numbers are years in prison (negative = bad). If A cooperates and B defects, A gets 10 years and B goes free. If both defect, each gets 5 years. If both cooperate, each gets only 1 year.
A dominant strategy is one that is strictly better regardless of what the other player does. For each player: if the other cooperates, defecting gives 0 years (vs 1 year for cooperating) — defect wins. If the other defects, defecting gives 5 years (vs 10 years for cooperating) — defect wins again.
So rational players defect. Both defect. Both get 5 years. But the Pareto-optimal outcome — both cooperate, 1 year each — is just sitting there. Individual rationality leads to collective irrationality. This is the dilemma.
Three main escape routes have been studied:
Run 20 rounds. Player A uses tit-for-tat. Choose Player B’s strategy and see cumulative years in prison.
Expected utility theory is elegant, axiomatically grounded, and computationally tractable. It is also empirically wrong — human beings systematically violate its predictions. LaValle §9.4 surveys the most important challenges.
Maurice Allais (1953) posed two choice problems that expose a deep inconsistency in expected utility maximization:
Problem 1: Choose between:
Problem 2: Choose between:
Most people choose A in Problem 1 (certainty is appealing) and D in Problem 2 (the extra $4M is worth the 1% extra risk). But any utility function U that rationalizes A > B implies C > D, and any that rationalizes D > C implies B > A. You can’t have both A and D without violating expected utility theory. Yet nearly everyone picks A and D. This is the Allais paradox.
Daniel Ellsberg (1961) showed that people are averse not just to risk (known probabilities) but to ambiguity (unknown probabilities). An urn contains 30 red balls and 60 balls that are either black or yellow in unknown proportion. Most people prefer to bet on red over black (Problem 1) yet prefer to bet on “black or yellow” over “red or yellow” (Problem 2). These preferences are inconsistent with any subjective probability assignment — showing that ambiguity aversion is not mere risk aversion under mis-calibrated probabilities.
Kahneman and Tversky’s prospect theory (1979) replaces expected utility with a descriptively accurate model that captures observed behavior:
Decision theory is the foundation for everything that comes next in planning under uncertainty. Chapter 9 of LaValle is explicitly a gateway, not a destination.
| Criterion | When to use | Key assumption | Main weakness |
|---|---|---|---|
| Expected cost | Known probabilities, repeated decisions | Probabilities are correct | Ignores tail risk |
| Minimax | Unknown probabilities, safety-critical | Adversarial nature | Too conservative |
| Nash (zero-sum) | Strategic adversary | Both players rational | May require mixing |
| Nash (nonzero-sum) | Strategic cooperation/conflict | Both players rational | Multiple equilibria, inefficiency |
| Prospect theory | Modeling human behavior | Descriptive accuracy | Hard to compute, non-axiomatic |
“The theory of probability is at bottom only common sense reduced to calculus.”
— Pierre-Simon Laplace, Théorie Analytique des Probabilités, 1812