Simple Decisions

Chapter 1: The VNM Axioms

The von Neumann–Morgenstern (VNM) axioms are four constraints on preferences that, taken together, guarantee the existence of a utility function. They define what it means to be rational. The theorem was published in 1944 and remains the mathematical foundation of game theory, economics, and decision-making AI:

Axiom	Statement	Why It Makes Sense
Completeness	For any A, B: exactly one of A≻B, B≻A, or A∼B	You must be able to compare any two lotteries
Transitivity	A⪰B and B⪰C implies A⪰C	Circular preferences lead to money-pump exploitation
Continuity	If A⪰C⪰B, ∃p: [A:p; B:1−p] ∼ C	No outcome is infinitely better or worse than all others
Independence	A⪰B iff [A:p; C:1−p] ⪰ [B:p; C:1−p] for all C, p>0	Adding a common component shouldn't flip preferences

The continuity axiom says that for any outcome C between best (A) and worst (B), there exists some probability p such that you are indifferent between C for certain and the lottery [A:p; B:1−p]. This is exactly the indifference equation used for utility elicitation (Chapter 3). The axiom rules out outcomes that are "infinitely better" or "infinitely worse" than others.

The independence axiom is the most contested. It says: if you prefer A to B, then you should still prefer [A:p; C:1−p] over [B:p; C:1−p] for any third outcome C and any probability p. In other words, adding a common lottery component C (at the same probability) should not change your preference between A and B. This rules out the Allais paradox (Chapter 7).

The VNM theorem proves: if your preferences satisfy all four axioms, then a utility function U exists such that U(A) > U(B) if and only if A ≻ B. Moreover, U is unique up to a positive affine transformation — you can always rescale and shift without changing any preferences.

Uniqueness up to positive affine transformation means: two utility functions U and U' represent the same preferences if and only if U' = aU + b for some constants a > 0, b. This is exactly like temperature scales: Celsius and Fahrenheit measure the same physical quantity with different origins and scales. You can rescale utility to any convenient range (e.g., [0,1] by setting U(worst)=0, U(best)=1) without changing any decisions.
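The invariance claim can be checked numerically. The sketch below is illustrative only: the three outcomes, their utilities, the lotteries, and the constants a and b are assumptions chosen for the demonstration, not values from the text.

```python
# Illustrative check: a positive affine transformation of U leaves the
# expected-utility ranking of lotteries unchanged.

def expected_utility(lottery, utility):
    """lottery: list of (outcome, probability) pairs; utility: dict outcome -> U."""
    return sum(p * utility[o] for o, p in lottery)

# Hypothetical utilities and lotteries, chosen only for illustration.
U = {"best": 1.0, "middle": 0.6, "worst": 0.0}
lotteries = {
    "L1": [("best", 0.5), ("worst", 0.5)],
    "L2": [("middle", 1.0)],
    "L3": [("best", 0.2), ("middle", 0.3), ("worst", 0.5)],
}

a, b = 1.8, 32.0  # any a > 0 works; these mimic Celsius -> Fahrenheit
U_prime = {o: a * u + b for o, u in U.items()}

def ranking(utility):
    """Lottery names sorted from most to least preferred."""
    return sorted(lotteries,
                  key=lambda name: expected_utility(lotteries[name], utility),
                  reverse=True)

rank_U = ranking(U)
rank_U_prime = ranking(U_prime)
print(rank_U, rank_U_prime)
```

Both rankings come out identical because E[aU + b] = a·E[U] + b, and a > 0 preserves order. A negative a would reverse the ranking, which is why the transformation must be positive.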

The proof is constructive and uses continuity directly. Fix U(best) = 1 and U(worst) = 0. For any outcome C, continuity guarantees there exists p_C such that C ~ [best:p_C; worst:1−p_C]. Define U(C) = p_C. Transitivity ensures this is consistent. Independence ensures that the expected utility of a compound lottery equals the sum of weighted utilities: U([L₁:q; L₂:1−q]) = q·U(L₁) + (1−q)·U(L₂). This is the linearity in probabilities property that makes EU maximization the unique correct decision rule under the axioms.

The utility representation theorem (VNM, 1944). Given a preference relation ⪰ satisfying completeness, transitivity, continuity, and independence over lotteries, there exists a function U: outcomes → ℝ such that: L₁ ⪰ L₂ if and only if E[U(L₁)] ≥ E[U(L₂)]. This function is unique up to positive affine transformation. This is one of the most important theorems in all of decision science.

Axioms are normative, not descriptive. These are not claims about how people actually behave. Experimental psychology shows that humans routinely violate independence and even transitivity. The axioms define what it would mean to be rational. Our goal is to build decision systems that satisfy them, even if their human users do not.

Why Each Axiom Is Necessary

Removing any single axiom breaks the utility representation:

If you drop...	What goes wrong	Example
Completeness	You cannot compare some pairs → no universal ranking exists	"I can't decide between A and B" is valid but blocks optimization
Transitivity	Cyclic preferences → money pump is possible (see quiz)	A≻B≻C≻A: pay $1 to swap B for A, pay $1 to swap A for C, pay $1 to swap C for B, repeat forever
Continuity	Some outcomes are "infinitely good/bad" → no finite U value	If you prefer the worst non-death outcome to any lottery with any chance of death, U(death)=−∞ breaks arithmetic
Independence	Preferences flip when irrelevant alternatives are mixed in → Allais paradox; no linear EU formula	Rank A≻B, but [A:0.1, C:0.9] ≺ [B:0.1, C:0.9]: the C mixture changed your mind
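The money-pump row can be made concrete with a short simulation. This is a minimal sketch: the `prefers` mapping encodes the cyclic preferences A≻B≻C≻A from the table, while the function name, the $1 fee, and the number of rounds are illustrative assumptions.

```python
# Money-pump sketch for the cyclic preferences A > B > C > A.
# prefers[x] is the item the agent strictly prefers to x.
prefers = {"B": "A", "A": "C", "C": "B"}

def run_money_pump(start_item, fee, rounds):
    """Offer the preferred swap each round; the agent pays `fee` every time."""
    item, paid = start_item, 0.0
    for _ in range(rounds):
        item = prefers[item]   # agent accepts: it strictly prefers the new item
        paid += fee
    return item, paid

item, paid = run_money_pump("B", fee=1.0, rounds=6)
print(item, paid)
```

After six swaps the agent holds the item it started with and is $6 poorer. Nothing limits the number of rounds, so the loss is unbounded.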

The independence axiom is the most controversial because it outlaws any "certainty effect" — the documented human tendency to overweight certainty relative to near-certainty. Kahneman and Tversky's Prospect Theory relaxes independence to model actual human behavior. But for autonomous systems, independence is maintained: a drone that violates independence can be trapped in cycles by an adversarial environment.

Why does a violation of transitivity allow a rational opponent to extract unlimited money from you?

If A≻B≻C≻A, the opponent can cycle you through paying small amounts at each swap (money pump) forever
Transitivity violations make all lotteries equivalent
Non-transitive preferences mean you cannot assign probabilities

Chapter 2: Utility Functions

The VNM theorem guarantees that if your preferences satisfy the four axioms, there exists a real-valued function U on outcomes such that:

A ≻ B if and only if U(A) > U(B)
The utility of a lottery is its expected utility: U([S₁:p₁;...;S_n:p_n]) = ∑_i p_i U(S_i)

This U is unique up to a positive affine transformation: if U works, then U'(x) = mU(x) + b (with m > 0) also works. This is exactly like temperature scales — Celsius and Fahrenheit encode the same information in different units. The only things that matter are the ordering and the relative spacings of utility values.

Utility is not money, not happiness, not a single quantity. U is a personal, subjective representation of preferences. Two rational people can have completely different utility functions. What matters is that each person's preferences are internally consistent with the four axioms. The textbook's collision avoidance system uses utility over states like "alert + no collision," "no alert + collision," etc.

Lottery Expected Utility Calculator

Set probabilities for three outcomes (A: U=10, B: U=5, C: U=−3). If P(A)+P(B)>1, P(B) is trimmed. The expected utility and the "certainty equivalent" (the utility bar height) update live.
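The calculation behind the calculator can be sketched as follows. The utilities (10, 5, −3) and the trimming rule come from the description above; the function name, and the inputs P(A) = 0.50 and P(B) = 0.30, are assumptions used for illustration.

```python
def lottery_eu(p_a, p_b, utilities=(10.0, 5.0, -3.0)):
    """Expected utility of a three-outcome lottery; P(C) is the remainder."""
    p_b = min(p_b, 1.0 - p_a)      # trim P(B) if the two probabilities exceed 1
    p_c = 1.0 - p_a - p_b
    u_a, u_b, u_c = utilities
    return p_a * u_a + p_b * u_b + p_c * u_c

eu = lottery_eu(0.50, 0.30)
print(f"EU = {eu:.2f}")
```

With these inputs, P(C) = 0.20 and the expected utility is 0.5×10 + 0.3×5 + 0.2×(−3) = 5.9.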


If U(x) = 3x + 7 is a valid utility function, is V(x) = 3x + 100 also valid?

Yes: utility functions are unique only up to positive affine transformations; adding a constant preserves all preference orderings
No: there is a unique correct utility function
Only if x is bounded

Chapter 3: Utility Elicitation

Utility elicitation is the process of discovering a person's utility function from choices. The standard method exploits the continuity axiom:

Step 1: Anchor

Fix U(worst outcome S̲) = 0 and U(best outcome S̄) = 1

↓

Step 2: Indifference lottery

For each intermediate outcome S, find p such that S ∼ [S̄:p; S̲:1−p]

↓

Step 3: Assign utility

Set U(S) = p. The indifference probability IS the utility.

For the collision avoidance textbook example: S̄ = "no alert, no collision" (U=1), S̲ = "alert, collision" (U=0). For the outcome "alert, no collision," a domain expert might say they are indifferent between that outcome for certain and the lottery [no alert/no collision : 0.9 ; alert/collision : 0.1]. Then U("alert, no collision") = 0.9.
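One way to run Step 2 in practice is to narrow in on the indifference probability by bisection. The sketch below is an assumption about how such a search could be organized, not a procedure from the textbook. The respondent is simulated by a hypothetical function with a true indifference point of 0.9, matching the example above.

```python
def elicit_utility(prefers_lottery, tol=1e-3):
    """
    Bisection search for the indifference probability p in S ~ [best:p; worst:1-p].
    prefers_lottery(p) -> True if the respondent prefers the lottery at probability p
    over receiving S for certain. Assumes the answer flips exactly once as p grows.
    """
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        p = (lo + hi) / 2
        if prefers_lottery(p):
            hi = p      # lottery too attractive: indifference point is lower
        else:
            lo = p      # sure outcome still preferred: indifference point is higher
    return (lo + hi) / 2

def simulated_expert(p):
    """Hypothetical stand-in for a domain expert indifferent at p = 0.9."""
    return p > 0.9

u_alert_no_collision = elicit_utility(simulated_expert)
print(round(u_alert_no_collision, 2))
```

Each question halves the remaining interval, so about ten questions pin the utility down to three decimal places. A real respondent is noisier than this simulation, so the consistency of the answers should be checked separately.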

Never use money directly as utility. The utility of money is not linear. $1000 to a billionaire is worth much less than $1000 to a student. This diminishing marginal utility is precisely why people buy insurance: paying a small certain cost (the premium) to avoid a tiny probability of catastrophic loss. Insurance is rational only if your utility is concave in wealth.

Practical Challenges in Utility Elicitation

The lottery method works in theory, but real-world elicitation faces significant challenges:

Challenge	Why It Occurs	Mitigation
Anchoring bias	Domain experts anchor on the first probability they are shown	Ask the question multiple ways; start from different anchors
Probability insensitivity	Humans cannot distinguish 1% from 0.1% intuitively	Use frequency framing ("1 in 100" rather than "1%"); use visual aids
Outcome scope sensitivity	Experts change their answers when the outcome set changes	Keep outcomes fixed; use consistent reference lotteries
Inconsistency across sessions	Experts give different indifference probabilities on different days	Average over multiple sessions; use internal consistency checks
Non-unique utility function	Multiple utility functions are consistent with elicited data	Elicit enough data points to over-constrain the function; fit with regression

The textbook notes that for safety-critical systems (aviation, medical), utility functions should be elicited from multiple domain experts and reconciled through a formal process. NASA and FAA require that ACAS X's utility function be explicitly documented and approved. The utility function over collision states and alert states is publicly available in the technical report and has been the subject of extensive review.

For large outcome spaces, elicitation becomes a structured optimization problem: elicit pairwise indifferences for a representative subset of outcomes, then fit a parametric utility model (e.g., exponential or power) to all elicited data points simultaneously. This reduces the number of required interviews from O(|outcomes|) to O(1) for simple parametric forms.
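As a rough illustration of the fitting step, the sketch below fits the single parameter λ of a rescaled exponential utility by least squares over a grid. Everything specific here is an assumption: the rescaling to [0, 1], the grid of candidate λ values, and the three "elicited" points, which are synthetic and generated from a known λ so that the fit can be verified.

```python
import math

def exp_utility(x, lam, x_max):
    """Exponential utility rescaled so U(0) = 0 and U(x_max) = 1."""
    return (1 - math.exp(-lam * x)) / (1 - math.exp(-lam * x_max))

def fit_lambda(points, x_max, grid):
    """Least-squares grid search for lam given elicited (outcome, utility) pairs."""
    def sse(lam):
        return sum((exp_utility(x, lam, x_max) - u) ** 2 for x, u in points)
    return min(grid, key=sse)

# Synthetic "elicited" utilities for three monetary outcomes (not real data).
x_max = 100.0
true_lam = 0.03
points = [(x, exp_utility(x, true_lam, x_max)) for x in (20.0, 50.0, 80.0)]

grid = [k / 1000 for k in range(1, 101)]     # candidate lam values 0.001 .. 0.100
lam_hat = fit_lambda(points, x_max, grid)
print(lam_hat)
```

With noisy answers from real experts, the same search returns the λ that best reconciles the data points. A continuous optimizer could replace the grid when more precision is needed.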

Elicitation Game: Find Your Indifference Point

Best outcome = "no alert, no collision" (U=1). Worst = "alert, collision" (U=0). For "alert, no collision," adjust p until the lottery feels equivalent to having that outcome for certain. Your U = p.


In the elicitation procedure, if you are indifferent between "medium outcome" for certain and [best:0.7; worst:0.3], what is U(medium)?

0.7: the indifference probability for the best outcome equals the utility
0.3
0.5

Chapter 4: Risk Attitudes

The shape of the utility function encodes how a person feels about risk. Consider: you can have $50 for certain, or a 50% chance of $100. Both have the same expected value ($50). But most people prefer the certain $50. Why? Because their utility of money is concave.

Attitude	Utility Shape	Choice	Typical Behavior
Risk neutral	Linear: U(x) = x	Indifferent: EU = 50 = value of certain $50	Maximizes expected value
Risk averse	Concave: U = √x or log x	Prefers $50 certain: U(50) > 0.5·U(100)	Buys insurance
Risk seeking	Convex: U = x²	Prefers lottery: 0.5·U(100) > U(50)	Buys lottery tickets
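The three rows can be reproduced by comparing U(50) with the expected utility of the 50/50 lottery over $100 and $0. This is a small sketch; the helper name and the choice of √x as the concave example are assumptions for illustration.

```python
import math

# Utility shapes from the table; the lottery pays $100 or $0 with equal probability.
shapes = {
    "risk neutral (U = x)": lambda x: x,
    "risk averse (U = sqrt(x))": math.sqrt,
    "risk seeking (U = x^2)": lambda x: x ** 2,
}

def attitude_choice(u):
    """Compare the sure $50 with the 50/50 lottery over $100 and $0."""
    sure = u(50)
    gamble = 0.5 * u(100) + 0.5 * u(0)
    if math.isclose(sure, gamble):
        return "indifferent"
    return "sure $50" if sure > gamble else "lottery"

choices = {name: attitude_choice(u) for name, u in shapes.items()}
for name, choice in choices.items():
    print(f"{name}: {choice}")
```

The choice depends only on the curvature of U, since all three agents face the same expected value of $50.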

Computing the Certainty Equivalent

The certainty equivalent (CE) of a lottery L is the guaranteed amount you would accept in exchange for the lottery. For exponential utility U(x) = 1 − e^(−λx) and a lottery L over wealth outcomes X (log denotes the natural logarithm):

CE(L) = −(1/λ) · log(E[e^(−λX)])

The expectation E[e^(−λX)] is the moment generating function of X evaluated at −λ, so the CE is −1/λ times its logarithm. For a discrete lottery with outcomes x₁,...,x_n and probabilities p₁,...,p_n:

CE = −(1/λ) · log(∑_i p_i · e^(−λx_i))

Worked example: 50/50 lottery for {$0, $100}. With λ=0.02 (mild risk aversion):

CE = −(1/0.02) · log(0.5·e⁰ + 0.5·e^(−0.02×100)) = −50 · log(0.5 + 0.5·e⁻²)

= −50 · log(0.5 + 0.5 × 0.135) = −50 · log(0.568) = −50 × (−0.566) = 28.3

So the certainty equivalent is $28.30, far below the expected value of $50. The risk premium is $50 − $28.30 = $21.70 — the maximum premium this person would pay for insurance that guarantees $50 instead of the 50/50 lottery. The more risk-averse (λ → ∞), the lower the CE and the higher the risk premium.
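The worked example can be reproduced in a few lines. This is a minimal sketch for discrete lotteries under exponential utility; the function and variable names are illustrative.

```python
import math

def certainty_equivalent(outcomes, probs, lam):
    """CE under exponential utility U(x) = 1 - exp(-lam * x), lam > 0."""
    mgf = sum(p * math.exp(-lam * x) for x, p in zip(outcomes, probs))
    return -math.log(mgf) / lam

outcomes, probs = [0.0, 100.0], [0.5, 0.5]
ce = certainty_equivalent(outcomes, probs, lam=0.02)
ev = sum(p * x for x, p in zip(outcomes, probs))
risk_premium = ev - ce
print(f"CE = {ce:.1f}, risk premium = {risk_premium:.1f}")
```

Raising `lam` lowers the certainty equivalent toward the worst outcome, while letting `lam` shrink toward zero brings it back up to the expected value of $50.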

CE as the true measure of value. When making decisions for risk-averse stakeholders, always report the certainty equivalent, not the expected value. An expected profit of $10M from a risky project sounds great; a CE of $2M reveals what that risk is actually worth to the stakeholder. The textbook uses this insight to argue that "maximize expected value" is only correct for risk-neutral agents; general agents should maximize expected utility.

The textbook discusses several functional forms. Exponential utility U(x) = 1 − e^(−λx) (with λ > 0 for risk aversion) is popular because it has constant absolute risk aversion: the risk premium does not depend on your wealth level. Power utility U(x) = x^α (with 0 < α < 1) has constant relative risk aversion. For collision avoidance systems, the textbook uses a piecewise specification where outcomes are enumerated explicitly and utilities are elicited from domain experts.

The risk premium for a lottery L is the amount E[L] − CE(L) you would pay to avoid the uncertainty. For exponential utility with λ > 0, the risk premium for a Gaussian-distributed payoff with mean μ and variance σ² is exactly λσ²/2 — a clean closed-form result that makes exponential utility tractable in continuous settings.
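The Gaussian result follows in three lines from the certainty-equivalent formula for exponential utility, using the standard moment generating function of a normal distribution:

```latex
\begin{align*}
\mathbb{E}\left[e^{-\lambda X}\right]
  &= \exp\!\left(-\lambda\mu + \tfrac{1}{2}\lambda^{2}\sigma^{2}\right),
  \qquad X \sim \mathcal{N}(\mu, \sigma^{2}) \\
\mathrm{CE}(L)
  &= -\frac{1}{\lambda}\log \mathbb{E}\left[e^{-\lambda X}\right]
   = \mu - \frac{\lambda\sigma^{2}}{2} \\
\mathbb{E}[L] - \mathrm{CE}(L)
  &= \frac{\lambda\sigma^{2}}{2}
\end{align*}
```

The premium grows linearly in both the risk-aversion parameter λ and the variance σ², and it does not depend on the mean μ.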

The certainty equivalent. For a lottery L, the certainty equivalent CE(L) is the guaranteed amount you would trade for L. For a risk-averse person, CE(L) < E[L] (the certainty equivalent is less than the expected value). The difference E[L] − CE(L) is the risk premium: how much you'd pay to avoid the uncertainty.

The Arrow-Pratt Coefficient of Risk Aversion

Different utility functions have different degrees of risk aversion. The Arrow-Pratt coefficient of absolute risk aversion formalizes this:

A(x) = − U''(x) / U'(x)

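For exponential utility U(x) = 1 − e^(−λx): U'(x) = λe^(−λx), U''(x) = −λ²e^(−λx). So A(x) = λ²e^(−λx) / (λe^(−λx)) = λ. The coefficient is constant, hence "constant absolute risk aversion" (CARA). A higher λ means more risk aversion at every wealth level.

For power utility U(x) = x^α: U'(x) = αx^(α−1) and U''(x) = α(α−1)x^(α−2), so A(x) = (1−α)/x. Risk aversion decreases with wealth: richer agents are less risk averse per dollar. This is decreasing absolute risk aversion (DARA). Because the relative coefficient x·A(x) = 1−α is constant, power utility is said to have "constant relative risk aversion" (CRRA). Both CARA and CRRA belong to the broader "hyperbolic absolute risk aversion" (HARA) family. Decreasing absolute risk aversion is empirically more realistic for humans.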

Utility Function	CARA?	Arrow-Pratt A(x)	Typical Use
U(x) = x (linear)	Yes, λ=0	0 (risk neutral)	Risk-neutral agents, EMV optimization
U(x) = log(x)	No (CRRA)	1/x	Kelly criterion, log-wealth portfolios
U(x) = 1−e^(−λx)	Yes	λ (constant)	Closed-form solutions; ACAS utility model
U(x) = x^α, α<1	No (CRRA)	(1−α)/x	Portfolio theory, diminishing returns
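The Arrow-Pratt column can be checked numerically with finite differences. This is a rough sketch: the step size h, the evaluation point x = 4, and the parameter values λ = 0.5 and α = 0.5 are illustrative assumptions.

```python
import math

def arrow_pratt(u, x, h=1e-3):
    """Numerical A(x) = -U''(x) / U'(x) using central finite differences."""
    d1 = (u(x + h) - u(x - h)) / (2 * h)
    d2 = (u(x + h) - 2 * u(x) + u(x - h)) / (h * h)
    return -d2 / d1

lam, alpha, x = 0.5, 0.5, 4.0   # illustrative parameter values
a_exp = arrow_pratt(lambda t: 1 - math.exp(-lam * t), x)   # analytic value: lam
a_log = arrow_pratt(math.log, x)                            # analytic value: 1 / x
a_pow = arrow_pratt(lambda t: t ** alpha, x)                # analytic value: (1 - alpha) / x
print(round(a_exp, 3), round(a_log, 3), round(a_pow, 3))
```

Evaluating at a second wealth level shows the difference between the families: the exponential coefficient stays at λ, while the log and power coefficients shrink as x grows.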

For autonomous decision systems, CARA (exponential utility) is often used because it has clean closed-form properties and is easy to elicit: the single parameter λ controls all risk aversion, and λ=1/R where R is the "risk tolerance" (maximum EV loss you'd accept for certainty). The textbook's collision avoidance utility is elicited as a piecewise specification equivalent to CARA over the safety-relevant outcome range. Notably, for small lotteries relative to the scale of R, CARA and all reasonable utility functions are approximately linear — expected value maximization is adequate for stakes far below your risk tolerance. This explains why insurance companies (with large capital reserves) can price based on expected cost alone, while individuals (with limited wealth) must use full EU theory.

Risk Attitude Explorer

Adjust λ to change risk aversion. The orange dot marks $50 certain. The teal dot marks the expected utility of the 50/50 lottery for $100. When the teal dot is below orange: risk averse.


A person is offered $50 certain vs a 50% chance of $100. Their CE is $35. What does this tell us?

They are risk averse: their utility function is concave, so they value the lottery at only $35 even though its expected value is $50
They are risk seeking
They are risk neutral

Chapter 5: Maximum Expected Utility

Everything in this chapter builds toward one equation. Given a utility function U over outcomes and a probabilistic model P(o | a) of how likely outcome o is when action a is taken, the maximum expected utility (MEU) principle says: choose the action that maximizes expected utility.

a^* = arg max_{a ∈ A} ∑_o P(o | a) · U(o)

Where MEU Comes From: Deriving It from the VNM Axioms

The MEU principle is not an assumption — it follows directly from the VNM axioms. Here is the argument:

1. By completeness + transitivity: there is a total ordering on actions (equivalently, on the lotteries they induce).
2. By continuity: for any action with outcome distribution P(o|a), there exists a utility U(a) such that you are indifferent between a and the lottery [best:U(a); worst:1−U(a)].
3. By independence: the utility of a compound lottery (which is what an action produces: a mixture over outcomes) equals the weighted sum of the utilities of its components: U([o₁:P(o₁|a); ...; o_n:P(o_n|a)]) = ∑_o P(o|a) · U(o).
4. Therefore: you prefer action a over action b if and only if ∑_o P(o|a) U(o) > ∑_o P(o|b) U(o). Maximizing this sum is equivalent to maximizing your preferences.

Step 3 is the crucial one: it says that the utility of an action (which produces a lottery over outcomes) equals the expected utility of that lottery. This is exactly the independence axiom in disguise. Without independence, you could not simplify compound lotteries this way — you would need to track the full joint distribution of outcomes, not just their expected utility.

One critical insight about MEU that the textbook emphasizes: the MEU computation is efficient only if P(o|a) can be computed efficiently. For a Bayesian network with n nodes, computing P(o|a) via exact inference takes time exponential in the network's treewidth. For the simple umbrella problem (treewidth 1), this is O(|states|). For a complex medical diagnosis network with many diseases (high treewidth), exact P(o|a) computation may require approximations (belief propagation, sampling). The MEU equation is always correct; the bottleneck is the probability model, not the utility maximization step.

MEU is complete and consistent. Given any VNM-rational utility function, MEU produces: (1) a complete ranking of all actions (no ties possible unless multiple actions have exactly equal EU), (2) a consistent policy (if you would choose a in situation S, you would still choose a if S is embedded in a larger problem), (3) an optimal policy (no other decision rule produces better outcomes in expectation). No other decision rule shares all three properties.

This is the central equation of decision theory. Every algorithm from here on — MDPs, POMDPs, reinforcement learning — is ultimately computing or approximating a^*. The Bellman equation for MDPs is the MEU principle applied recursively over time: U*(s) = max_a[R(s,a) + γ∑_s'T(s'|s,a)U*(s')].

For the textbook's umbrella problem: actions are {bring umbrella, leave umbrella}. Observations are {rain forecast, sun forecast}. The state space has four outcomes: (rain, umbrella), (rain, no umbrella), (sun, umbrella), (sun, no umbrella). The utility of each outcome is specified by a domain expert. We compute EU for each action given the forecast and pick the max.

The MEU principle has one subtle requirement: the action space A must be well-defined. In many real problems, the action space is continuous (set the thrust between 0 and 1) or has exponentially many elements (choose a portfolio over 500 assets). The MEU principle still applies in principle — argmax over a continuous action space — but computing it requires additional optimization machinery (gradient ascent for continuous actions, integer programming for combinatorial ones). For the textbook's examples, the action space is always small and discrete.

The MEU principle also naturally handles the exploration-exploitation tradeoff when the agent is uncertain about the environment. An agent that maximizes myopic EU ignores information value and always exploits: it takes the action with the highest immediate expected utility based on its current beliefs. An agent that accounts for the future value of information gained by taking uncertain actions will sometimes deliberately take a suboptimal immediate action to learn more. The Bayes-optimal policy (maximizing total expected utility over all future decisions, including the value of information from current actions) naturally balances exploration and exploitation. The chapters on online planning (Ch. 9) and MCTS approximate this Bayes-optimal exploration strategy for finite horizons.

The textbook gives specific utility values for the umbrella problem that have become canonical in the literature: U(rain, umbrella) = 70, U(sun, umbrella) = 20, U(rain, no umbrella) = 0, U(sun, no umbrella) = 100. The model: P(rain) = 0.4, P(forecast rain | rain) = 0.8, P(forecast rain | sun) = 0.2.

EU(umbrella | forecast rain) = P(rain | fc rain) · 70 + P(sun | fc rain) · 20

P(rain | fc rain) = P(fc rain | rain) · P(rain) / P(fc rain) = 0.8 × 0.4 / (0.8×0.4 + 0.2×0.6) = 0.727

EU(umbrella | fc rain) = 0.727 × 70 + 0.273 × 20 = 56.4

EU(no umbrella | fc rain) = 0.727 × 0 + 0.273 × 100 = 27.3

Conclusion: when the forecast says rain, bring the umbrella (EU 56.4 > 27.3). The computation is pure Bayesian inference plus a weighted average, exactly what the MEU formula prescribes.

MEU is normative, not optimal in computation. The MEU principle tells us what the right action IS. Computing it exactly can be intractable (NP-hard in general). Most of this book is about efficient approximations to MEU for complex, sequential problems. The principle never changes; the algorithms do.

python
def max_expected_utility(actions, outcomes, P, U):
    """
    Pick the action with the highest expected utility.

    actions:  list of possible actions
    outcomes: list of possible outcomes
    P: nested dict, P[o][a] = probability of outcome o given action a
       (missing entries are treated as probability 0)
    U: dict, U[o] = utility of outcome o
    Returns: (best_action, best_eu)
    """
    best_a, best_eu = None, -float('inf')
    for a in actions:
        # Expected utility: sum over outcomes of P(o | a) * U(o)
        eu = sum(P[o].get(a, 0.0) * U[o] for o in outcomes)
        if eu > best_eu:
            best_eu, best_a = eu, a
    return best_a, best_eu

# Umbrella example, conditioned on a rain forecast
p_rain = 0.8 * 0.4 / (0.8 * 0.4 + 0.2 * 0.6)  # P(rain | fc rain), about 0.727
actions = ['umbrella', 'no_umbrella']
outcomes = ['rain_u', 'sun_u', 'rain_no', 'sun_no']
# Outcomes that include the umbrella are impossible under 'no_umbrella', and vice versa
P = {'rain_u': {'umbrella': p_rain}, 'sun_u': {'umbrella': 1 - p_rain},
     'rain_no': {'no_umbrella': p_rain}, 'sun_no': {'no_umbrella': 1 - p_rain}}
U = {'rain_u': 70, 'sun_u': 20, 'rain_no': 0, 'sun_no': 100}
best, eu = max_expected_utility(actions, outcomes, P, U)
print(f"Best action: {best}, EU={eu:.1f}")
# Best action: umbrella, EU=56.4

If EU(bring umbrella | rain forecast) = 72 and EU(leave umbrella | rain forecast) = 20, what does a rational agent do?

Bring the umbrella: it has strictly higher expected utility (72 > 20)
Leave the umbrella: it might not actually rain
Flip a coin: we don't know for certain what will happen

Chapter 6: Decision Networks

A decision network (also called an influence diagram) is an extension of a Bayesian network that incorporates decisions and utilities. It has three node types:

Node Type	Shape	Meaning	How Solved
Chance node	Circle (oval)	Random variable with CPT	Inference (sum out)
Decision node	Rectangle (square)	Variable the agent controls	Optimization (max over)
Utility node	Diamond	Utility as a function of its parents	Expected value (product with P)

There are three edge types: conditional edges (into chance nodes, like a BN), informational edges (into decision nodes, showing what's observed before deciding), and functional edges (into utility nodes).

A decision network can be solved by a simple extension of the Bayesian network inference algorithms from Chapters 2–3. The key observation is that a decision node A with no informational edges (A is decided before any observation) is equivalent to conditioning on each possible value of A and taking the max: U*(unobserved) = max_a EU(a). A decision node with informational edges O → A (A is decided after observing O) becomes a strategy: for each value o of O, independently compute EU(a|O=o) for each action and take the argmax. The full solution is a conditional policy π(O) → A that maps observations to optimal actions.

The complexity of exact decision network solving: solving a single decision with k ancestors requires running Bayesian inference over those k variables, which is exponential in the treewidth. For the medical diagnosis network (3 nodes, treewidth 1), inference is exact and fast. For a network with 20 interconnected variables, exact inference may be intractable, and approximate methods (belief propagation, sampling) are needed. This is exactly the same complexity bottleneck as BN inference from Chapter 3: the decision extension multiplies the inference cost by |A| (one inference pass per action) but does not change the fundamental complexity.

To solve a decision network, we run inference on the chance nodes conditioned on the decision choices, compute the expected utility for each action, and return the action that maximizes it. This is literally applying the MEU equation from Chapter 5 using the BN inference machinery from Chapters 2–3.

The decision network representation separates mechanism (P(O|S), P(S)) from preferences (U(S,A)), which is crucial for modularity. You can update the probability model (better sensor, updated prior) without changing the utility function, or update the utility function (new stakeholder requirements) without changing the probability model. This separation of concerns is impossible with monolithic approaches like "hardcoded rules" or "naive weighted scoring" — changing either the model or the preferences requires re-engineering the entire system. Decision networks, like Bayesian networks before them, are the right abstraction for building maintainable decision-making systems.

The textbook's worked example: the collision avoidance domain from Kochenderfer's own research. Observation O has three possible values: O₁ = collision threat detected (strong), O₂ = possible threat (weak), O₃ = no threat. Action A has two values: A₁ = issue alert, A₂ = remain silent. The state S has two values: collision vs. no collision. The decision network has: chance nodes S and O, decision node A, utility node U with parents {S, A}. Given observation O, the optimal decision is computed by:

1. Compute P(S | O) using Bayes' rule on the BN
2. For each action a ∈ {alert, silent}: compute EU(a | O) = ∑_s P(s | O) · U(s, a)
3. Return a* = argmax_a EU(a | O)

Decision networks = BNs + actions + objectives. All the inference tools from Chapter 3 (factor operations, variable elimination) apply directly. The new steps are: (1) condition on the action node value (treat it as observed), (2) compute the expected utility at the utility node, (3) repeat for each action value and take the max.

Solving a Decision Network: Variable Elimination

The textbook solves decision networks using the same variable elimination (VE) algorithm as BN inference, extended with utility maximization. Here is the full procedure for the collision avoidance network with nodes: State S (collision threat), Observation O, Action A (alert/silent), Utility U(S, A).

Setup: We have observed O = o. We want argmax_a EU(A=a | O=o).

Initialize factors: P(S), P(O|S), U(S,A) (utility table)
Condition on observation: replace P(O|S) with P(O=o|S) (a factor over S only)
For each action a ∈ {alert, silent}:
1. Condition on action: replace U(S,A) with U(S, A=a) (a factor over S only)
2. Multiply all factors: P(S) × P(O=o|S) × U(S, A=a) → one joint factor over S
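3. Sum out S and normalize: EU(a|o) = ∑_s P(s) · P(O=o|s) · U(s, a) / ∑_s P(s) · P(O=o|s). The normalizer P(O=o) is the same for every action, so skipping it does not change the argmax.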
Return a* = argmax_a EU(a|o)

Numerical example: P(S=collision) = 0.05 (base rate), P(O=strong|collision) = 0.9, P(O=strong|no collision) = 0.1. Utilities: U(collision, alert) = −10 (alert cost; collision averted), U(collision, silent) = −1000 (catastrophic), U(no collision, alert) = −10 (false alarm cost), U(no collision, silent) = 0.

With O=strong: P(S=collision|O=strong) = 0.9×0.05 / (0.9×0.05 + 0.1×0.95) = 0.045/0.14 = 0.32. EU(alert|strong) = 0.32×(−10) + 0.68×(−10) = −10. EU(silent|strong) = 0.32×(−1000) + 0.68×0 = −320. Optimal: alert (EU −10 ≫ −320). Even though a collision is only 32% likely, the catastrophic cost of silence dominates.
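The numerical example can be reproduced with a few dictionary operations. This is a minimal sketch of the condition, weight, and maximize steps for a single decision node; the function name and data layout are assumptions, and the code keeps the unrounded posterior, so EU(silent) comes out near −321.4 rather than the −320 obtained from the rounded 0.32.

```python
def solve_decision(prior, likelihood, utility, actions, obs):
    """
    prior[s] = P(s); likelihood[s][o] = P(o | s); utility[(s, a)] = U(s, a).
    Returns (best_action, {action: EU(action | obs)}, posterior).
    """
    joint = {s: prior[s] * likelihood[s][obs] for s in prior}   # P(s) * P(o | s)
    p_obs = sum(joint.values())                                   # P(o)
    posterior = {s: joint[s] / p_obs for s in joint}              # P(s | o)
    eu = {a: sum(posterior[s] * utility[(s, a)] for s in posterior) for a in actions}
    return max(eu, key=eu.get), eu, posterior

prior = {"collision": 0.05, "no_collision": 0.95}
likelihood = {"collision": {"strong": 0.9}, "no_collision": {"strong": 0.1}}
utility = {("collision", "alert"): -10, ("collision", "silent"): -1000,
           ("no_collision", "alert"): -10, ("no_collision", "silent"): 0}

best, eu, post = solve_decision(prior, likelihood, utility, ["alert", "silent"], "strong")
print(best, {a: round(v, 1) for a, v in eu.items()}, round(post["collision"], 3))
```

Swapping in a different prior or likelihood table changes only the posterior; the utility table is untouched. This is the separation between mechanism and preferences described earlier in the chapter.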

Solving Decision Networks: Software and Complexity

Solving a decision network exactly has the same complexity as Bayesian network inference, plus the optimization over action nodes. For a single decision node, the complexity is O(|A| × inference cost), where inference cost depends on the BN structure (typically exponential in treewidth). For sequential decision networks with multiple decision nodes (influence diagrams), the problem is PSPACE-complete in general.

Practical solvers: the pgmpy Python library provides exact Bayesian network inference (for example, variable elimination), which can supply the posteriors needed for the expected-utility step. Note that its pgmpy.models.DynamicBayesianNetwork class models temporal Bayesian networks rather than influence diagrams. The Hugin software (commercial) is a long-established tool for decision network inference, used in medical diagnosis and risk assessment. For the textbook's collision avoidance network with 3 nodes, exact solution takes microseconds. For networks with 10+ variables, variable elimination or junction tree algorithms are required.

The key data structure for efficient decision network solving is the relevance graph: given a decision node A, only variables that are d-connected to A (conditional on what is observed before A) are relevant for computing EU(A). Irrelevant variables can be pruned before inference, dramatically reducing computation. For the umbrella problem, when forecasting rain, only the weather variable is relevant — all others are pruned.

The collision avoidance lesson. The key insight from the textbook: a rational system should alert even when collision probability is low, if the asymmetry in utilities is extreme enough. Here, −1000 vs −10 means the threshold P(collision) above which alerting is optimal is only 10/1000 = 1%. Below 1% collision probability, silence is rational; above it, alert. This threshold-based policy is exactly what ACAS X implements.
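The 1% threshold is a special case of a general break-even formula, obtained by setting EU(alert) = EU(silent) and solving for the collision probability. The sketch below assumes a two-state, two-action utility table keyed by (state, action); the function name is illustrative.

```python
def alert_threshold(u):
    """
    Collision probability p* at which EU(alert) = EU(silent).
    u[(state, action)] = utility; states: 'collision', 'no_collision'.
    Alerting is optimal whenever P(collision) > p*.
    """
    gain_if_collision = u[("collision", "alert")] - u[("collision", "silent")]
    loss_if_clear = u[("no_collision", "silent")] - u[("no_collision", "alert")]
    return loss_if_clear / (gain_if_collision + loss_if_clear)

utility = {("collision", "alert"): -10, ("collision", "silent"): -1000,
           ("no_collision", "alert"): -10, ("no_collision", "silent"): 0}
threshold = alert_threshold(utility)
print(f"Alert when P(collision) > {threshold:.2%}")
```

With these utilities the gain from alerting before a collision is 990 and the cost of a false alarm is 10, so the threshold is 10 / (990 + 10) = 1%.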

Medical Diagnosis Decision Network

Disease D has prior 0.15. A test O is run. We choose treatment T. The utility depends on D and T. Positive tests make treatment more likely optimal. Compare EU for each action.


What does an informational edge into a decision node represent in a decision network?

The decision-maker observes the source node's value before choosing the action
The action causally affects the source node
The utility depends on the source node

Chapter 7: Value of Information

Sometimes we can gather additional observations before making a decision. Should we? The value of information (VOI) answers this: it is the expected improvement in utility from observing a variable before acting.

VOI(O') = EU_{after observing O'}(optimal policy) − EU_{before observing O'}(optimal action)

The VOI is always non-negative. Knowing more can never hurt a rational agent: you can always ignore information. If VOI(O') > cost(measuring O'), then measuring O' is worthwhile. If VOI(O') = 0, the information would not change any decision, so there's no point gathering it.

The VOI is zero when either (1) the best action is already the same regardless of O', or (2) O' is independent of the relevant variables conditional on what you already know. In the umbrella problem: if the forecast is 100% accurate, VOI(forecast) reaches its maximum, the value of perfect information. If the forecast is random noise, VOI(forecast) = 0.

VOI for the medical test. Without the test, you choose the action with higher prior EU (maybe "treat" if P(disease) is high enough). After the test, you condition on the result and might reverse the decision. VOI = average (over possible test outcomes) of the improvement. A positive test dramatically changes P(disease), so it has high VOI.

The precise VOI formula: VOI(O') = ∑_o' P(o') · max_a EU(a|O'=o') − max_a EU(a). For each possible test outcome o', compute P(o') and the best achievable EU conditioned on that outcome, then average and subtract the uninformed baseline. An upper bound is the value of perfect information (VOPI): VOPI = ∑_s P(s) U(s, a*(s)) − max_a EU(a), where a*(s) is the action that would be optimal if we knew state s exactly. VOPI is easy to compute and tells you the maximum any test could be worth. If VOPI is small, no test of any quality is worth running; if large, a good test may be justified.

VOI and the "No Free Lunch" of Information

The VOI formula seems to say information is always free: since VOI ≥ 0, why not always collect it? Three reasons why we don't:

Measurement cost. Tests are expensive. A CT scan costs $1000; an MRI costs $3000; a biopsy has both monetary and physical cost. The decision to test is itself a decision, governed by MEU: take the test iff VOI > cost(test).
Latency cost. Gathering information takes time. In a collision avoidance system, waiting to gather better sensor data means waiting while the aircraft approaches a collision. The value of immediate action may exceed the value of delayed action after better information.
Irreversibility. Some actions are irreversible; once taken, no future information can undo them. In sequential problems, the timing of information relative to decisions matters. VOPI is only achievable when the information arrives before the relevant decision.

The formal treatment: in a sequential decision problem, compute the VOI for each possible observation at each decision node. The result is an information strategy — which observations to make, in what order, given what prior observations. This is computationally hard (exponential in the number of variables) but can be approximated greedily: compute myopic VOI for each variable and test the one with highest myopic VOI first. Myopic greedy is near-optimal when VOIs are submodular.
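One greedy step of this strategy can be sketched as follows. This is a minimal illustration, not the textbook's algorithm: the three candidate tests, their accuracies, and their costs are hypothetical, costs are assumed to be in the same units as utility, and the utilities are borrowed from the worked medical example in this chapter.

```python
def myopic_voi(prior, sens, spec, utility):
    """VOI of a single binary test; utility[action] = (U if disease, U if healthy)."""
    baseline = max(prior * ud + (1 - prior) * uh for ud, uh in utility.values())
    informed = 0.0
    # First pair is the positive outcome, second pair the negative outcome.
    for like_d, like_h in ((sens, 1 - spec), (1 - sens, spec)):
        # Joint weights P(state, outcome); maximizing over actions per outcome
        # is the same as P(outcome) * max_a EU(a | outcome).
        joint_d, joint_h = prior * like_d, (1 - prior) * like_h
        informed += max(joint_d * ud + joint_h * uh for ud, uh in utility.values())
    return informed - baseline


def pick_next_test(prior, tests, utility):
    """Greedy step: the test with the highest VOI minus cost, or None if none pays off."""
    scored = {
        name: myopic_voi(prior, sens, spec, utility) - cost
        for name, (sens, spec, cost) in tests.items()
    }
    best = max(scored, key=scored.get)
    return (best if scored[best] > 0 else None), scored


UTILITY = {"treat": (80, 60), "no treat": (20, 100)}
# Hypothetical tests: (sensitivity, specificity, cost in utility units)
TESTS = {"cheap": (0.9, 0.9, 0.5), "precise": (0.99, 0.99, 5.0), "noise": (0.5, 0.5, 0.1)}

best, scored = pick_next_test(0.1, TESTS, UTILITY)
print(best, scored)
```

With these made-up numbers the more accurate test has the higher VOI, but its cost eats most of the gain, so the greedy step prefers the cheaper test. A full strategy would repeat the step with the posterior as the new prior.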

A worked numerical example from the textbook: suppose P(disease) = 0.1, U(treat, disease) = 80, U(treat, healthy) = 60, U(no treat, disease) = 20, U(no treat, healthy) = 100. Without a test, EU(treat) = 0.1×80 + 0.9×60 = 62, EU(no treat) = 0.1×20 + 0.9×100 = 92. Optimal is no treatment, EU=92.

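Now with a test (sensitivity=0.9, specificity=0.9): P(test+) = 0.1×0.9 + 0.9×0.1 = 0.18. After a positive test, P(D|+) = 0.09/0.18 = 0.5. Now EU(treat|+) = 0.5×80 + 0.5×60 = 70; EU(no treat|+) = 0.5×20 + 0.5×100 = 60. Optimal given positive: treat (EU=70). This decision reverses! After a negative test, P(D|−) = 0.01/0.82 ≈ 0.012, so EU(no treat|−) ≈ 99.02 and EU(treat|−) ≈ 60.24; no treatment remains optimal. The VOI of the test comes entirely from the reversal on a positive result. Computing the full VOI: 0.18×70 + 0.82×99.02 − 92 ≈ 12.6 + 81.2 − 92 = 1.8, so the test is worth running only if it costs less than 1.8 utility units.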
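The arithmetic of this example can be checked with a short script. It is a sketch under the example's own numbers; the function names are illustrative and not taken from the textbook.

```python
def posterior(prior, sens, spec, positive):
    """Return (P(disease | result), P(result)) by Bayes' rule."""
    if positive:
        p_obs = prior * sens + (1 - prior) * (1 - spec)
        return prior * sens / p_obs, p_obs
    p_obs = prior * (1 - sens) + (1 - prior) * spec
    return prior * (1 - sens) / p_obs, p_obs


def best_eu(p_disease, utility):
    """Expected utility of the best action given P(disease)."""
    return max(p_disease * ud + (1 - p_disease) * uh for ud, uh in utility.values())


def value_of_information(prior, sens, spec, utility):
    """Sum over outcomes of P(o) * max_a EU(a | o), minus the uninformed baseline."""
    informed = 0.0
    for positive in (True, False):
        p_disease, p_obs = posterior(prior, sens, spec, positive)
        informed += p_obs * best_eu(p_disease, utility)
    return informed - best_eu(prior, utility)


def value_of_perfect_information(prior, utility):
    """Upper bound: pick the best action separately in each true state."""
    best_if_disease = max(ud for ud, _ in utility.values())
    best_if_healthy = max(uh for _, uh in utility.values())
    return prior * best_if_disease + (1 - prior) * best_if_healthy - best_eu(prior, utility)


# utility[action] = (U if disease, U if healthy), from the worked example
UTILITY = {"treat": (80, 60), "no treat": (20, 100)}

voi = value_of_information(0.1, 0.9, 0.9, UTILITY)
vopi = value_of_perfect_information(0.1, UTILITY)
print(f"VOI = {voi:.2f}, VOPI = {vopi:.2f}")  # VOI = 1.80, VOPI = 6.00
```

The 90%-accurate test captures 1.8 of the 6 utility units that a perfect test would provide. Setting sensitivity and specificity to 0.5 drives the VOI to zero, matching the random-noise case described earlier.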

VOI: Should You Run the Test?

Disease: costs −100 if untreated, −20 if treated (with side effects). No disease: treatment costs −20, no treatment costs 0. Adjust the prior and test quality to see when VOI justifies the test cost.

Default settings: prior P(disease) = 15%; test accuracy (sensitivity = specificity) = 85%.

Can the value of perfect information (VOPI) be negative?

No: a rational agent can always ignore perfect information, so knowing more can never reduce expected utility
Yes: too much information causes analysis paralysis
Only when the information is noisy

Chapter 8: Showcase — The Full Umbrella Decision

This is the textbook's umbrella problem, fully solved end-to-end as a decision network. You control the weather model (prior and forecast accuracy), the utility function, and the observation. The system computes all conditional probabilities, expected utilities, and the optimal action in real time.

States: (weather, umbrella) → four outcomes: {rain+umbrella, rain+no umbrella, sun+umbrella, sun+no umbrella}. The utility of each is shown in the table. The prior P(rain) and the forecast reliability determine P(rain | forecast).

Every slider can change the optimal action. Push the "utility of carrying umbrella when sunny" down far enough, and you should leave the umbrella even when rain is forecast. Push the "cost of getting wet" up enough, and you always bring it. The MEU principle tells you exactly where each switch happens.

Full Umbrella Decision Network

Adjust the sliders and select a forecast. The canvas shows: (1) the conditional probability table after inference, (2) expected utility for each action, and (3) the optimal action highlighted in teal.

Default settings: prior P(rain) = 40%; forecast accuracy P(forecast rain | rain) = 80%; U(rain, no umbrella), the cost of getting wet, = −70; U(sun, umbrella), the hassle of carrying, = −10.
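The inference and decision steps behind this showcase can be sketched in a few lines. Several values here are assumptions, because the page only lists two of the four utilities and one direction of forecast accuracy: the forecast is taken to be symmetric (equally accurate for rain and sun), U(rain, umbrella) is set to −10, and U(sun, no umbrella) is set to 0.

```python
# Utilities keyed by (weather, action). Entries marked "assumed" are not given in the text.
UTILITY = {
    ("rain", "umbrella"): -10.0,     # assumed: only the hassle of carrying
    ("rain", "no umbrella"): -70.0,  # cost of getting wet
    ("sun", "umbrella"): -10.0,      # hassle of carrying
    ("sun", "no umbrella"): 0.0,     # assumed baseline
}


def posterior_rain(prior_rain, accuracy, forecast):
    """P(rain | forecast), assuming P(says rain | rain) = P(says sun | sun) = accuracy."""
    like_rain = accuracy if forecast == "rain" else 1 - accuracy
    like_sun = 1 - accuracy if forecast == "rain" else accuracy
    joint_rain = prior_rain * like_rain
    joint_sun = (1 - prior_rain) * like_sun
    return joint_rain / (joint_rain + joint_sun)


def decide(prior_rain, accuracy, forecast, utility=UTILITY):
    """Return the MEU action and the expected utility of each action."""
    p = posterior_rain(prior_rain, accuracy, forecast)
    eu = {
        action: p * utility[("rain", action)] + (1 - p) * utility[("sun", action)]
        for action in ("umbrella", "no umbrella")
    }
    return max(eu, key=eu.get), eu


action, eu = decide(0.4, 0.8, "rain")
print(action, eu)
```

With the default settings and a rain forecast, P(rain | forecast) is about 0.73 and the umbrella wins by a wide margin. Changing the wet-cost entry and rerunning with a sun forecast shows the switch described above.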

In the umbrella problem, why might you choose to bring the umbrella even with a "sun" forecast?

If the cost of getting wet is very high and/or the forecast accuracy is low, EU(umbrella | sun) can still exceed EU(no umbrella | sun)
You should never bring the umbrella with a sun forecast
The forecast is always 100% accurate

Chapter 9: Connections & What's Next

One of the most important extensions in the textbook is the Allais paradox discussion (referenced throughout). The paradox demonstrates that most real people violate the independence axiom. Consider these two choice problems:

Choice 1: Which do you prefer?

A: $1M with certainty

B: 89% chance of $1M, 10% chance of $5M, 1% chance of $0

Choice 2: Which do you prefer?

C: 11% chance of $1M, 89% chance of $0

D: 10% chance of $5M, 90% chance of $0

Most people choose A in problem 1 (certainty is very attractive) and D in problem 2 (the extra $4M seems worth the small probability difference). But this combination violates independence! Let U($0)=0, U($1M)=u, U($5M)=v. Choosing A over B requires u > 0.89u + 0.10v, i.e., 0.11u > 0.10v. Choosing D over C requires 0.10v > 0.11u. Contradiction. The reason is that problem 2 is problem 1 with a common 89% chance of $1M replaced by an 89% chance of $0 in both options; by independence, swapping a common component cannot change which option you prefer.
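The contradiction can also be checked mechanically. The sketch below searches a grid of utility values (u, v) for any assignment under which an expected-utility maximizer picks both A and D; exact fractions are used so that ties are not blurred by floating-point rounding. The grid range is an arbitrary choice for illustration.

```python
from fractions import Fraction as F

# Lotteries as {outcome: probability}
A = {"1M": F(1)}
B = {"1M": F(89, 100), "5M": F(10, 100), "0": F(1, 100)}
C = {"1M": F(11, 100), "0": F(89, 100)}
D = {"5M": F(10, 100), "0": F(90, 100)}


def expected_utility(lottery, utility):
    return sum(p * utility[outcome] for outcome, p in lottery.items())


def shows_allais_pattern(u, v):
    """True if utilities U($1M)=u, U($5M)=v, U($0)=0 make both A > B and D > C."""
    utility = {"0": F(0), "1M": u, "5M": v}
    prefers_a = expected_utility(A, utility) > expected_utility(B, utility)
    prefers_d = expected_utility(D, utility) > expected_utility(C, utility)
    return prefers_a and prefers_d


grid = [F(i, 100) for i in range(1, 101)]
matches = [(u, v) for u in grid for v in grid if shows_allais_pattern(u, v)]
print(len(matches))  # 0
```

The search comes back empty because EU(A) − EU(B) and EU(C) − EU(D) are the same expression, 0.11u − 0.10v, so no utility function can give them opposite signs.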

The Ellsberg paradox is a different kind of violation: it targets the assumption that beliefs can be summarized by a single probability distribution, rather than the independence axiom over lotteries with known probabilities. In Ellsberg's original experiment, an urn contains 30 red balls and 60 balls that are either black or yellow in unknown proportions. People consistently prefer "bet on red" over "bet on black" (Problem 1) and "bet on black or yellow" over "bet on red or yellow" (Problem 2). But these two preferences are inconsistent with any probability distribution over black/yellow. The paradox reveals ambiguity aversion: people dislike uncertainty about probabilities, not just uncertainty about outcomes. Bayesian decision theory assumes all uncertainty is representable as a probability distribution; the Ellsberg paradox shows this assumption fails as a description of human choices. For autonomous systems, the Bayesian approach remains standard; robust Bayesian methods use sets of priors to handle ambiguity.
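A brute-force check makes the inconsistency concrete. The sketch below tries every possible number of black balls and asks whether that belief would justify both observed preferences, assuming equal prizes so that a bet is preferred when its winning probability is higher.

```python
RED, OTHER = 30, 60  # 30 red balls; 60 balls that are black or yellow

consistent = [
    black
    for black in range(OTHER + 1)
    # Problem 1: "red" beats "black" requires more red than black balls.
    if RED > black
    # Problem 2: "black or yellow" (60 balls) beats "red or yellow".
    and OTHER > RED + (OTHER - black)
]
print(consistent)  # []
```

The first condition needs fewer than 30 black balls and the second needs more than 30, so no count works. Because winning probabilities are linear in the expected number of black balls, averaging over several possible counts does not help either.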

The resolution for system designers: while human decision-makers violate independence, the systems we build for them should not. A collision avoidance algorithm that violates the axioms can be exploited: intransitive preferences can be money-pumped, and independence violations lead to plans the system will not follow through on (dynamic inconsistency). Rationality is a constraint on the system, not necessarily on the human operator.

Concept	What We Learned	Where It Goes
VNM axioms	Four constraints that guarantee a utility function exists	Foundation for all rational agent design
Utility functions	Real-valued preferences; unique up to positive affine transform	Reward functions in MDPs (Chapter 7)
Utility elicitation	Indifference-lottery method; U(S) = indifference probability p	Human-AI interaction, safety constraints
Risk attitudes	Concave=averse, linear=neutral, convex=seeking	Finance, insurance, robust optimization
MEU principle	a* = argmax_a ∑_o P(o|a)U(o); the central decision rule	Bellman equations, Q-functions, POMDP planning
Decision networks	BN + action nodes + utility nodes	Sequential decision networks, POMDPs
Value of information	VOI = EU(with info) − EU(without); always ≥ 0	Active sensing, exploration-exploitation

The expected utility framework also connects to robust optimization. Instead of a single probability model P(o|a), use a set of distributions and optimize for the worst case: a* = argmax_a min_{P ∈ P-set} E_P[U(o) | a]. This maximin EU approach (Gilboa & Schmeidler, 1989) is an axiomatized response to distributional ambiguity, and it accommodates Ellsberg-style choices rather than treating them as errors. In the umbrella example with P(rain) uncertain in [0.3, 0.5], each action is scored by its worst-case EU over that interval and the action with the better worst case is chosen, which yields a more conservative policy. Robust decision making is covered explicitly in Chapter 22 (Model Uncertainty) of the textbook.
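A minimal sketch of the maximin rule for the umbrella example follows. The interval [0.3, 0.5] comes from the text; U(rain, umbrella) = −10 and U(sun, no umbrella) = 0 are assumptions, since the page does not state them.

```python
UTILITY = {
    ("rain", "umbrella"): -10.0,     # assumed
    ("rain", "no umbrella"): -70.0,
    ("sun", "umbrella"): -10.0,
    ("sun", "no umbrella"): 0.0,     # assumed
}
ACTIONS = ("umbrella", "no umbrella")


def expected_utility(p_rain, action, utility=UTILITY):
    return p_rain * utility[("rain", action)] + (1 - p_rain) * utility[("sun", action)]


def maximin_action(p_low, p_high, utility=UTILITY):
    """Score each action by its worst-case EU over P(rain) in [p_low, p_high]."""
    # EU is linear in P(rain), so the minimum over an interval sits at an endpoint.
    worst = {
        action: min(expected_utility(p, action, utility) for p in (p_low, p_high))
        for action in ACTIONS
    }
    return max(worst, key=worst.get), worst


action, worst = maximin_action(0.3, 0.5)
print(action, worst)
```

Under these assumed utilities the worst case for leaving the umbrella is P(rain) = 0.5, which gives an EU of −35, so the maximin choice is to bring it. Note that each action is evaluated against its own worst-case distribution.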

From simple to sequential. Chapter 6 covered single-step decisions: observe, decide, receive outcome. The rest of the book extends to sequences of decisions where each action affects future states and future rewards. The MEU principle stays the same; we just sum it over time and discount future rewards.

MEU as the Foundation of the Bellman Equation

The MEU principle for a single-step decision is a* = argmax_a ∑_o P(o|a) U(o). The Bellman equation for an infinite-horizon discounted MDP (Chapter 7) is:

U*(s) = max_a [ R(s,a) + γ ∑_s' T(s'|s,a) U*(s') ]

These have the same structure. The MEU term max_a ∑P(o|a)U(o) becomes: max over action, expectation over next state (T plays the role of P(o|a)), where the "utility" of the next state is R + γU*(s'). The discount γ is how much you discount future utility relative to present utility: a parameter of the utility function over time, not a new idea. Under additional stationarity assumptions on preferences over trajectories, a rational agent behaves as if it had a discount factor.
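To make the correspondence concrete, here is a small value-iteration sketch in which each backup is literally an MEU computation over next states. The two-state MDP, its transition probabilities, and its rewards are hypothetical numbers chosen for illustration.

```python
GAMMA = 0.9
STATES = ("s0", "s1")
ACTIONS = ("stay", "move")

# Hypothetical model: T[s][a] = {next_state: probability}, R[s][a] = immediate reward
T = {
    "s0": {"stay": {"s0": 1.0}, "move": {"s0": 0.2, "s1": 0.8}},
    "s1": {"stay": {"s1": 1.0}, "move": {"s0": 0.8, "s1": 0.2}},
}
R = {"s0": {"stay": 0.0, "move": -1.0}, "s1": {"stay": 2.0, "move": -1.0}}


def q_value(state, action, U):
    """Expected utility of an action: reward plus discounted expected next-state value."""
    return R[state][action] + GAMMA * sum(p * U[s2] for s2, p in T[state][action].items())


def value_iteration(tol=1e-10):
    """Repeat the Bellman backup (an MEU step in every state) until values stop changing."""
    U = {s: 0.0 for s in STATES}
    while True:
        new_U = {s: max(q_value(s, a, U) for a in ACTIONS) for s in STATES}
        if max(abs(new_U[s] - U[s]) for s in STATES) < tol:
            return new_U
        U = new_U


def greedy_policy(U):
    return {s: max(ACTIONS, key=lambda a: q_value(s, a, U)) for s in STATES}


U_star = value_iteration()
policy = greedy_policy(U_star)
print(U_star, policy)
```

In this toy model the agent pays a small cost to move from s0 to the rewarding state s1 and then stays there, giving U*(s1) = 2 / (1 − 0.9) = 20.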

Utility vs Reward in Reinforcement Learning

In RL, the "reward" function R(s,a) plays the role of a local utility. The value function U*(s) is the total expected discounted utility from state s. This connection is deeper than it looks: a reinforcement learning agent that maximizes expected cumulative discounted reward is implementing the MEU principle, where the outcomes are infinite trajectories and the utility of a trajectory is its discounted sum of rewards.

The design question for any RL system is therefore a utility elicitation problem in disguise: what is the correct reward function R(s,a) such that maximizing expected cumulative reward corresponds to what we actually want the agent to do? This is the core of reward shaping, inverse RL, and RLHF — all are methods for inferring or specifying the right utility function.

From MEU to Reinforcement Learning

Every major RL algorithm is either computing or approximating the MEU principle for sequential problems. Here is how they map:

RL Algorithm	MEU Approximation Used	Key Difference
Q-learning	Approximates Q(s,a) = R + γmax_a'Q(s',a') directly from samples	Model-free: no T(s'|s,a) needed
SARSA	Same as Q-learning but on-policy (uses actual next action, not max)	On-policy; accounts for exploration, so it tends to learn more cautious behavior
Actor-Critic	Critic estimates U_θ; actor optimizes π to maximize E[U_θ]	Separates evaluation from optimization
PPO / TRPO	Policy gradient: maximize E_π[∑γ^t R_t] directly	Optimizes the policy directly, usually with a learned value baseline; constrained updates
Model-based RL (Dyna, MBPO)	Learns T, R; runs approximate value iteration on learned model	Data efficient; planning quality depends on model accuracy

Every row of this table is implementing the MEU principle from Chapter 6, extended to sequential decisions via the Bellman equation. The differences are in: (1) whether T and R are known or learned, (2) whether the value function or the policy is the primary optimization target, (3) how sample efficiency and stability are traded off. Knowing Chapter 6 deeply means understanding the "what" behind every RL algorithm; the chapters on sequential methods provide the "how."
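As one concrete instance of the table's first row, the sketch below runs tabular Q-learning on a tiny deterministic environment. The environment, learning rate, and step count are hypothetical choices; the point is that the update target is a sampled version of the Bellman backup and never consults T.

```python
import random

GAMMA, ALPHA = 0.9, 0.5
STATES = ("s0", "s1")
ACTIONS = ("stay", "move")


def step(state, action):
    """Hypothetical deterministic two-state environment: returns (next_state, reward)."""
    if action == "stay":
        return state, (2.0 if state == "s1" else 0.0)
    return ("s1" if state == "s0" else "s0"), -1.0


def q_learning(num_steps=20_000, seed=0):
    rng = random.Random(seed)
    Q = {(s, a): 0.0 for s in STATES for a in ACTIONS}
    state = "s0"
    for _ in range(num_steps):
        action = rng.choice(ACTIONS)  # uniformly random behavior policy (off-policy)
        next_state, reward = step(state, action)
        # Sampled Bellman target: reward plus discounted best next-state value.
        target = reward + GAMMA * max(Q[(next_state, a)] for a in ACTIONS)
        Q[(state, action)] += ALPHA * (target - Q[(state, action)])
        state = next_state
    return Q


Q = q_learning()
greedy = {s: max(ACTIONS, key=lambda a: Q[(s, a)]) for s in STATES}
print(greedy)
```

Even though the agent behaves randomly, the learned Q-values approach those of the optimal policy (move to s1, then stay), which is what "model-free" and "off-policy" mean in practice.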

Multi-Attribute Utility

Real-world decisions often have multiple competing objectives. The standard approach is multi-attribute utility theory (MAUT): if attributes X₁, ..., X_n are additively independent, then the joint utility factors as U(X₁,...,X_n) = ∑_i w_i U_i(X_i) (additive form); under the weaker condition of mutual utility independence it takes a multiplicative form instead. The weights w_i encode the trade-off between attributes and must be elicited from the decision-maker. ACAS X elicits utility over the two main attributes: safety (P(collision)) and efficiency (pilot workload from alerts). The utility function that governs the ACAS X collision avoidance system was elicited through extensive domain expert interviews using the lottery method from Chapter 3.

Sequential Decisions: How Chapter 6 Connects to the Rest of the Book

Chapter 6's simple decision framework covers one decision followed by one outcome. The rest of the book extends this to sequential decisions where each action affects future states and future opportunities to act. The extension requires two new ideas:

Time discounting: Future utility is worth less than present utility by a factor γ^t at time t. This is not irrationality; it is a feature of preferences over time. The MEU principle still applies: maximize the expected discounted sum E[∑_t γ^t U(o_t)].
Dynamics: Actions determine not just the immediate outcome but the next state. The MEU equation becomes the Bellman equation: U*(s) = max_a[R(s,a) + γ∑T(s'|s,a)U*(s')].

Every chapter from here builds on these two extensions. Chapter 7 (Exact Methods) solves the Bellman equation exactly for discrete, finite MDPs. Chapter 8 (Approximate Value Functions) handles continuous or enormous state spaces. Chapter 9 (Online Planning) searches from the current state rather than precomputing U* everywhere. Chapters 12–17 (RL) learn U* from interaction without knowing T or R.

Risk Aversion in Sequential Decisions: How Discount Factors Encode Time Preferences

The discount factor γ in MDPs is not merely a mathematical convenience — it is a manifestation of the agent's utility function over time. Consider an agent that values $100 today equally to $105 in one year. Its implied temporal discount rate is 5%, and the implied per-year discount factor is γ = 1/1.05 ≈ 0.952. An agent with γ = 0.99 per timestep (each timestep = 1 second) is implicitly saying: a reward one minute from now is worth e^−0.6 ≈ 0.55 of the same reward right now. This is a utility statement, not an approximation trick. Setting γ is utility elicitation for time.
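The two conversions in this paragraph are easy to reproduce. The helper names below are illustrative; the half-life calculation is an extra view of the same quantity, not something taken from the textbook.

```python
import math


def discount_factor_from_rate(rate):
    """Per-period discount factor implied by an interest-style discount rate."""
    return 1.0 / (1.0 + rate)


def present_value_weight(gamma, steps):
    """Weight placed today on a reward received `steps` timesteps from now."""
    return gamma ** steps


annual_gamma = discount_factor_from_rate(0.05)
one_minute_weight = present_value_weight(0.99, 60)
# Number of timesteps after which a reward is worth half as much as it is now.
half_life_steps = math.log(0.5) / math.log(0.99)

print(f"{annual_gamma:.3f} {one_minute_weight:.3f} {half_life_steps:.1f}")
```

Reading γ as a half-life is often a useful way to sanity-check it: with γ = 0.99 and one-second timesteps, the agent values a reward about 69 seconds away at half its immediate value.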

This connection matters for alignment. If you set γ too low, the agent becomes myopic: it accepts large future penalties to gain small immediate rewards. If you set γ too close to 1, distant rewards dominate, value iteration converges slowly, and at γ = 1 the infinite-horizon sum can be unbounded so that U* is no longer well defined. The "right" γ for ACAS X is calibrated to match the temporal urgency of collision avoidance: a collision in 15 seconds is treated as almost as bad as an immediate collision (γ ≈ 0.999 per 0.1-second timestep, so 0.999^150 ≈ 0.86). Choosing γ is not just hyperparameter tuning; it is a deliberate utility-theoretic design decision.

Sensitivity Analysis and Utility Robustness

Real decision systems rarely have a single precisely-known utility function. A practical approach: rather than committing to one U, compute the optimal action for a range of plausible utility functions and act only when they agree. Formally, let Υ be a set of admissible utility functions. An action is dominant if it is optimal under every U ∈ Υ. When a dominant action exists, you can act without resolving utility uncertainty. When no dominant action exists, you need either more utility information (elicitation) or a decision rule for choosing among admissible actions (maximin, expected utility under a prior over Υ, etc.).

In the umbrella example: if the utility for "wet without umbrella" is somewhere in the range [−100, −10], the optimal action may flip from "bring" to "leave" at some threshold. Computing this threshold is sensitivity analysis. It answers: "How wrong would my utility estimate have to be before I'd make the wrong decision?" If the threshold is far from your best estimate, the decision is robust. If it's close, refine your utility estimate before acting. Sensitivity analysis is standard practice in medical decision analysis (where quality-of-life utility estimates have high variance) and in safety-critical engineering (where stakeholder preferences are uncertain and auditable).
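The flip threshold has a closed form, because both expected utilities are linear in the uncertain utility. The sketch below uses P(rain) = 0.4 and U(sun, umbrella) = −10 from the showcase defaults; U(rain, umbrella) = −10 and U(sun, no umbrella) = 0 are assumptions.

```python
P_RAIN = 0.4
U_RAIN_UMBRELLA = -10.0   # assumed
U_SUN_UMBRELLA = -10.0
U_SUN_NO_UMBRELLA = 0.0   # assumed


def eu_umbrella():
    return P_RAIN * U_RAIN_UMBRELLA + (1 - P_RAIN) * U_SUN_UMBRELLA


def best_action(u_wet):
    """MEU action when U(rain, no umbrella) = u_wet."""
    eu_no_umbrella = P_RAIN * u_wet + (1 - P_RAIN) * U_SUN_NO_UMBRELLA
    return "bring" if eu_umbrella() >= eu_no_umbrella else "leave"


def flip_threshold():
    """Value of U(rain, no umbrella) at which both actions have equal EU."""
    return (eu_umbrella() - (1 - P_RAIN) * U_SUN_NO_UMBRELLA) / P_RAIN


print(flip_threshold())
print([(u, best_action(u)) for u in (-100, -50, -20, -10)])
```

Under these assumptions the decision flips near −25. An estimate of −70 for getting wet is far from that threshold, so the "bring" decision is robust; an estimate of −30 would call for more careful elicitation.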

Value of Information in Active Sensing and Exploration

The VOI framework from this chapter directly motivates exploration in sequential decision problems. An agent that does not know the true dynamics T(s'|s,a) should sometimes try actions whose outcomes are uncertain — the information gain about T may be worth more than the immediate reward loss.

Formally, the value of perfect information (VOPI) for variable X is: VOPI(X) = E_X[EU(optimal action | X=x)] − EU(optimal action without X). Exploration algorithms with regret or efficiency guarantees (UCB, Thompson sampling, MCTS) can loosely be viewed in this light: they explore states and actions where the estimated value of learning more is high, and exploit where it is low.

In the umbrella problem, this means: if you don't know how reliable the weather forecast is (its accuracy λ is unknown), it can pay to act on the forecast a few times, for example leaving the umbrella at home when it says sun, then observe the outcomes and update your belief about λ. The exploration cost is the EU lost if it rains while the umbrella is at home. The exploration benefit is a better model for future decisions. VOI quantifies when this tradeoff is worth making.

Connections to Safety and Alignment

The utility theory machinery from this chapter has deep implications for AI safety:

The textbook explicitly connects utility theory to inverse reinforcement learning (IRL): given observations of an agent's behavior, infer the utility function it is maximizing. If you observe a physician's treatment choices, you can infer their implicit utility function by finding U such that MEU(U) best predicts the observed choices. This is the inverse problem to decision network solving. IRL algorithms (Ng & Russell, 2000; Ziebart et al., 2008) solve this inversion. RLHF (Reinforcement Learning from Human Feedback), used in LLM alignment, follows the same logic: collect human preferences (comparisons between outputs), fit a utility (reward) function to those preferences, then optimize the LLM to maximize expected utility under that function. In this sense the RLHF pipeline is utility theory applied at scale.

Reward hacking: If the utility function is incorrectly specified, MEU will optimize the wrong thing to an extreme degree. An agent told to maximize "patient smiles" in a hospital may over-sedate patients. The MEU principle is only as good as the utility function used. This is the alignment problem in miniature.
Goodhart's Law: "When a measure becomes a target, it ceases to be a good measure." MEU agents will exploit proxies for utility if the proxy diverges from the true objective. Careful utility elicitation (Chapter 3) is the main technical mitigation: the closer the specified U is to the true objective, the less room there is to exploit.
Corrigibility: An MEU agent with high confidence in its utility function will resist being corrected (correction would lower its expected utility). Stuart Russell proposes making the agent uncertain about U; then it values human oversight as information about its own objective.
Multi-objective MEU: When multiple stakeholders have different utility functions, whose should the system optimize? Multi-attribute utility theory (MAUT) provides one answer: elicit weights w_i for each stakeholder's utility U_i and maximize the weighted sum. This is a normative framework for resolving utility conflicts.

How does the value of information relate to the quality of a sensor/measurement system?

A more accurate sensor has higher VOI; a perfectly uninformative sensor (random noise) has VOI=0
Sensor quality is unrelated to VOI
More accurate sensors always have VOI=1
