Kochenderfer, Wheeler & Wray, Chapter 2

Probabilistic Representation

How do we encode degrees of belief mathematically? This chapter builds the full language of probability — distributions, joint tables, Bayes' theorem, and Bayesian networks — the substrate every decision algorithm in the book runs on.

Prerequisites: basic algebra + summation notation. No prior probability required.
10
Chapters
7+
Simulations
0
Assumed Knowledge

Chapter 0: Reasoning Under Uncertainty

A satellite is drifting off course. Is it a battery failure? A solar panel fault? A software bug? You cannot see inside the spacecraft — you only have telemetry readings. How do you decide what to believe, and how to act?

Naive approaches fail. You cannot just pick the most dramatic explanation (that ignores the evidence). You cannot enumerate every possibility in full detail (the state space explodes). You need a language for representing and combining uncertainty systematically.

That language is probability theory. But not as a recipe — as the unique consistent extension of logical reasoning to uncertain situations. Kochenderfer et al. (§2.1) begin with a stunning result: if you want to rank beliefs by plausibility and satisfy two basic rationality requirements, you are forced into probability theory. There is no other option.

The two requirements are universal comparability — any two beliefs can be compared — and transitivity — if A is more plausible than B and B more plausible than C, then A is more plausible than C. From these two axioms alone, a numerical scale of belief emerges. Add a few more consistency conditions (Kolmogorov's axioms) and you have the full probability calculus.

Degrees of belief. Probability is not about frequencies of coin flips. It is about your degree of confidence in a statement given your current knowledge. P(battery failure) = 0.01 means you think a battery failure accounts for about 1% of your plausibility budget given what you know right now. Updating on new evidence shifts that number — this is what Bayes' theorem does.
Plausibility as a Number Line

Three possible causes of a spacecraft anomaly. Drag the sliders to set your belief in each. The order on the bar chart gives the plausibility ranking — comparability and transitivity are built in automatically.

P(battery failure) 0.10
P(solar failure) 0.25
P(software bug) 0.60
What two rationality requirements force a numerical plausibility scale into the structure of probability theory?

Chapter 1: The Probability Axioms

Once we accept that beliefs should be real numbers, we need rules for how they combine. Kochenderfer §2.1 follows Cox's theorem: three axioms produce a unique consistent system.

Axiom 1 — Non-negativity: P(A) ≥ 0 for any event A. Belief cannot be negative.

Axiom 2 — Normalization: P(Ω) = 1 where Ω is the set of all possible outcomes. You are certain that something happens.

Axiom 3 — Additivity: If A and B are mutually exclusive (they cannot both occur), then P(A ∪ B) = P(A) + P(B). Disjoint possibilities add.

P(A) ≥ 0     P(Ω) = 1     P(A ∪ B) = P(A) + P(B) if A ∩ B = ∅

From just these three axioms, everything else follows. The complement rule P(Ac) = 1 − P(A) is a theorem, not a postulate. The inclusion-exclusion formula P(A ∪ B) = P(A) + P(B) − P(A ∩ B) is a theorem. And crucially, conditional probability P(A|B) = P(A,B)/P(B) is the only consistent way to update beliefs on new information.

Why these three? Non-negativity and normalization set the scale: beliefs live in [0,1]. Additivity is the content: disjoint hypotheses contribute separately. Everything else — including Bayes' theorem — is algebraic consequence. The axioms are minimal; there is no redundancy.
Axiom Checker: Additivity

Events A and B partition a probability space. Drag to split: when A and B are disjoint, P(A∪B) = P(A)+P(B). Observe how the complement P(neither) = 1−P(A)−P(B) adjusts automatically to maintain normalization.

P(A) 0.35
P(B) 0.40

The Law of Total Probability

If B1, ..., Bn partition the sample space (exhaustive and mutually exclusive), then for any event A:

P(A) = ∑i P(A | Bi) P(Bi)

This law is the backbone of marginalization: we compute the probability of A by considering every scenario Bi, weighting by how probable that scenario is. We will use this constantly when computing P(evidence) in Bayes' theorem.

From the three axioms, what is P(Ac) — the probability that event A does NOT occur?

Chapter 2: Distributions

A probability distribution is a complete description of uncertainty over a variable. How you describe it depends on whether the variable is discrete or continuous.

Discrete: Probability Mass Functions

A probability mass function (PMF) assigns a probability to each discrete outcome. For a variable X taking values {x1, ..., xn}, the PMF satisfies P(X = xi) ≥ 0 and ∑i P(X = xi) = 1. The parameters of the distribution are the n−1 free probabilities (the last is fixed by normalization).

Example: a loaded die. Six faces, six probabilities that sum to 1. Five free parameters. If we bias face 1 toward probability 0.4, the remaining five faces share 0.6 equally at 0.12 each.

Continuous: Probability Density Functions

For a continuous variable X, we use a probability density function (PDF) p(x) where the probability of X falling in interval [a,b] is ∫ab p(x) dx. The density p(x) itself can exceed 1 — only areas under the curve are probabilities. The normalization condition is ∫-∞ p(x) dx = 1.

-∞ p(x) dx = 1     P(a ≤ X ≤ b) = ∫ab p(x) dx

Cumulative Distribution Function

The CDF F(x) = P(X ≤ x) is the integral of the PDF up to x. It starts at 0, ends at 1, and is always non-decreasing. Any valid CDF gives a valid probability distribution. The CDF is often easier to work with analytically than the PDF.

The Gaussian Distribution

The Gaussian N(μ, σ2) is parameterized by mean μ and variance σ2:

p(x) = (1/(σ√(2π))) exp(−(x−μ)2 / (2σ2))

It is unimodal and symmetric. The Central Limit Theorem guarantees that sums of independent random variables converge to a Gaussian, making it the natural model for aggregate noise. When distributions need multiple peaks, use a Gaussian mixture: a weighted sum of Gaussians ∑i ρi N(μi, σi2) where weights ρi sum to 1.

Discrete sums, continuous integrals. The same ideas appear in both cases — we just swap ∑ for ∫. In both, the total probability is 1. In both, marginalizing out a variable means summing (or integrating) over all its values. This symmetry lets us write unified formulas using measure theory, but for this chapter the intuition is: everything is addition or integration of probabilities.
Discrete PMF & Gaussian PDF

Left: loaded die PMF (bias face 1). Right: Gaussian PDF with adjustable mean and spread. Toggle mixture to see bimodal density.

Die bias (face 1) 0.20
Gaussian μ 0.0
Gaussian σ 1.0
A probability density function p(x) is valid as long as it integrates to 1. Can p(x) itself exceed 1 at some point x?

Chapter 3: Joint & Marginal Distributions

Real problems involve many variables at once. A joint distribution P(X, Y) specifies the probability of every combination of values. For two binary variables X ∈ {0,1} and Y ∈ {0,1}, this is a 2×2 table of four probabilities summing to 1.

The parameter explosion: for n binary variables, the full joint needs 2n − 1 free parameters. At n=20, that is over one million numbers. At n=100, it is more than atoms in the observable universe. Storing the full joint is impossible for realistic problems — this is why Bayesian networks exist.

Marginalization

From a joint distribution we can always recover the distribution over any subset of variables by marginalization — summing out the unwanted variables:

P(x) = ∑y P(x, y)

This is the law of total probability. If X and Y are continuous, replace the sum with an integral. Marginalization tells us: to find P(X=x), add up the joint probability over all possible Y values. The variable Y is summed away — it no longer appears in the result.

Independence

X and Y are independent if P(x,y) = P(x)P(y) for all values. This collapses 2n−1 parameters into n parameters — one marginal probability per variable. But independence is often too strong an assumption: knowing someone smokes and knowing they have lung cancer are not independent. Conditional independence is the practical middle ground, which Bayesian networks exploit.

The 2n problem. The full joint is the most expressive possible representation. It can represent any distribution. But it is also the least tractable. The entire art of probabilistic modeling is finding compact representations that capture the important dependencies without blowing up the parameter count.
Joint Table & Marginalization

A joint distribution over binary X and Y. Cell intensities reflect probabilities. Marginals P(X) and P(Y) are computed live by summing rows and columns. Hit Randomize to explore different joints, or Make Independent to set P(x,y)=P(x)P(y).

How many independent parameters does the full joint distribution over 3 binary variables require?

Chapter 4: Conditional Probability & Bayes' Theorem

You just got back a positive medical test. Should you panic? The answer depends on three numbers that most people — including doctors — confuse. Working through this example is the single most important lesson in probability.

Conditional Probability

The conditional distribution P(X | Y = y) is your belief about X after learning that Y has value y. The definition is:

P(x | y) = P(x, y) / P(y)

We take the slice of the joint table where Y=y, then renormalize so the probabilities sum to 1. This is all conditional probability is — rescaling to a reduced hypothesis space. Note that P(x|y) is a proper distribution over x for each fixed y.

The Chain Rule

Rearranging the definition gives the chain rule (also called the product rule):

P(x, y) = P(x | y) P(y) = P(y | x) P(x)

Extending to three variables: P(x, y, z) = P(x | y, z) P(y | z) P(z). This decomposition of a joint into conditionals is the foundation of Bayesian networks.

Bayes' Theorem

Since P(x,y) = P(y|x)P(x) and also P(x,y) = P(x|y)P(y), setting them equal gives Bayes' theorem:

P(x | y) = P(y | x) P(x) / P(y)

The four terms have names that recur throughout the book:

Prior P(x) — your belief before seeing evidence y. In the medical test: how common is this disease in the population?

Likelihood P(y|x) — how probable is the observation if hypothesis x is true? If the disease is present, how likely is a positive test (sensitivity)?

Evidence P(y) = ∑x P(y|x)P(x) — the total probability of the observation, summing over all hypotheses. This is the normalizing constant.

Posterior P(x|y) — your updated belief after seeing evidence y. This is the answer to the question "what should I believe now?"

The medical test calculation — always do this. Disease prevalence: 1%. Sensitivity P(+|disease) = 95%. Specificity P(−|healthy) = 90%, so false positive rate = 10%. A positive test gives posterior P(disease|+) = (0.95 × 0.01) / (0.95×0.01 + 0.10×0.99) ≈ 8.8%. You have an 8.8% chance of disease. Not 95%. The low prior swamps the high sensitivity. This is called the base rate fallacy and it kills people when doctors ignore it.
Bayes' Theorem: Interactive Medical Test

Adjust disease prevalence (prior), test sensitivity P(+|disease), and test specificity P(−|healthy). Watch the posterior P(disease|+) update. The step-by-step calculation is shown below the bars.

Prevalence P(disease) 1.0%
Sensitivity P(+|disease) 95.0%
Specificity P(−|healthy) 90.0%

Sequential Bayesian Updating

What if you take the test a second time? The posterior from the first test becomes the prior for the second test. Bayes' theorem applies iteratively — each new observation updates the belief. After k independent observations y1, ..., yk:

P(x | y1:k) ∝ P(x) ∏i=1k P(yi | x)

This sequential structure is exactly what Kalman filters, particle filters, and all Bayesian estimation algorithms exploit. The chapter you are reading is where it all starts.

A disease has 1% prevalence. A test has 95% sensitivity and 90% specificity. You test positive. What is approximately the probability you have the disease?

Chapter 5: Bayesian Networks

The joint distribution over n variables needs 2n−1 parameters. We need a compact alternative that still captures the real dependencies. The answer is the Bayesian network — a directed acyclic graph (DAG) where nodes are variables and edges encode direct probabilistic influence.

The DAG Structure

In a directed acyclic graph, arrows point from cause to effect, and there are no cycles. Each node Xi is associated with a conditional probability table (CPT): P(Xi | parents(Xi)). Nodes with no parents store marginal probabilities P(Xi).

The Chain Rule for Bayesian Networks

The graph structure encodes a factorization of the joint distribution:

P(x1, ..., xn) = ∏i=1n P(xi | parents(xi))

This is not an assumption — it is the definition of what the DAG means. Every variable is conditionally independent of its non-descendants given its parents. This is the Markov condition of a Bayesian network.

The Satellite Example

The textbook's spacecraft monitoring system has 5 binary variables: Battery failure, Solar panel failure, Electrical system failure, Deviation (trajectory), and Communication loss. The causal structure is: B and S both cause E; E causes D and C.

Full joint needs: 25−1 = 31 parameters

Bayesian network needs:

  • P(B): 1 parameter
  • P(S): 1 parameter
  • P(E|B,S): 4 parameters (2×2 parent combinations)
  • P(D|E): 2 parameters
  • P(C|E): 2 parameters
  • Total: 10 parameters
The savings compounds. For 20 variables with 3 parents each, the full joint needs >1M parameters. The network needs at most 20×23 = 160 parameters — a 6,000× compression. For 100 variables: full joint has ~1030 entries; network has at most a few thousand.
Satellite Bayesian Network

The 5-node satellite network. Click a node to cycle through: unobserved (dim) → true (orange) → false (teal) → unobserved. Posterior beliefs update via full enumeration of the joint. The chain rule computation is live.

Click a node to toggle evidence
In a Bayesian network, what does the conditional probability table (CPT) of node Xi store?

Chapter 6: Conditional Independence

Two variables X and Y are marginally independent if knowing one tells you nothing about the other: P(X|Y) = P(X), equivalently P(X,Y) = P(X)P(Y). But marginal independence is rarely the useful concept in complex systems.

The more powerful idea is conditional independence. X and Y are conditionally independent given Z if knowing Z makes them irrelevant to each other:

X ⊥ Y | Z   ⇔   P(X, Y | Z) = P(X | Z) P(Y | Z)

Equivalently, P(X | Y, Z) = P(X | Z) — once you know Z, learning Y adds nothing about X. This is written X ⊥⊥ Y | Z.

Three Structures That Generate Conditional Independence

Chain: X → Z → Y
X and Y are not independent marginally (X influences Y through Z). But given Z, they are conditionally independent: X ⊥⊥ Y | Z. Example: weather → road wetness → accident rate. Knowing road wetness screens off the weather.
Fork: X ← Z → Y
Z is a common cause. X and Y are marginally correlated (ice cream sales and drowning rates both rise in summer). But conditioning on Z (summer) blocks the spurious correlation: X ⊥⊥ Y | Z.
Collider (V-structure): X → Z ← Y
X and Y are marginally independent (two independent causes). But conditioning on their common effect Z creates a dependency. This is explaining away: if we see Z=1 and learn X=0, we infer Y=1 more strongly. X ⊥⊥ Y but X ¬⊥⊥ Y | Z.
Explaining away is counterintuitive. Battery failure (B) and solar panel failure (S) are independent: knowing one happens tells you nothing about the other. But once you observe electrical failure (E) — which both B and S cause — learning that B did NOT fail makes S failure more likely. The observation "explains away" one cause, boosting the other. This is collider activation, and it is central to understanding d-separation.
Conditional Independence Demonstrator

Three simple networks illustrating the three structures. Select a structure, toggle evidence on Z, and see how correlation between X and Y changes.

Chain selected
Battery failure B and solar panel failure S are independent. After observing electrical failure E (caused by both), what happens to the dependence between B and S?

Chapter 7: d-Separation

Given a complex Bayesian network with many nodes, how do we determine whether two variables X and Y are conditionally independent given some evidence set Z? The answer is d-separation (d = "directional") — a graphical criterion that reads independence directly from the DAG structure.

The Algorithm

To check whether X ⊥⊥ Y | Z in a DAG:

  1. Enumerate all undirected paths between X and Y (ignoring arrow directions).
  2. A path is blocked by Z if it contains a node that blocks flow (see below).
  3. If every path is blocked, X and Y are d-separated given Z, meaning X ⊥⊥ Y | Z.
  4. If any path is active (not blocked), X and Y are d-connected — not guaranteed independent.

When a Node on a Path Blocks or Activates Flow

Structure on pathEvidence on middle node?Flow status
Chain X→M→Y or X←M←YM in Z (observed)Blocked
Chain X→M→Y or X←M←YM not in ZActive
Fork X←M→YM in Z (observed)Blocked
Fork X←M→YM not in ZActive
Collider X→M←YM or descendant in ZActive (explains away)
Collider X→M←YM and descendants not in ZBlocked

The collider rule is the surprising one: unobserved colliders block flow (independent causes stay independent when we do not observe the effect). But observing a collider opens flow — independent causes become correlated through their shared effect.

d-separation is complete. The Markov condition guarantees that every conditional independence implied by d-separation is real in the distribution. For faithful distributions, the converse holds too: every conditional independence in the distribution corresponds to a d-separation in the graph. This makes d-separation a complete read-off of the distribution's independence structure from the graph.
d-Separation Explorer: Satellite Network

The satellite network (B, S, E, D, C). Click a node to toggle it as evidence (orange). Then click two different nodes while holding Shift to mark them as query nodes (teal). The explorer checks all paths and reports active/blocked status.

Click nodes to set evidence; Shift+click to query
In the satellite network, are B (battery) and S (solar) d-separated given E (electrical failure)?

Chapter 8: Building a Bayesian Network

The satellite network emerged naturally from the causal structure: battery and solar panels cause electrical failure; electrical failure causes trajectory deviation and communication loss. But how do we systematically construct a Bayesian network for a new domain?

Step 1 — Identify Variables and Values

List all relevant variables and their possible values. For discrete variables, decide on the resolution: binary (fail/ok) is often enough at the system level. Too many states per variable explodes the CPT size — a node with 3 parents each taking 4 values needs a CPT of 43×k entries where k is the variable's own cardinality.

Step 2 — Choose a Topological Order

Pick an ordering X1, ..., Xn such that causes come before effects. This ordering determines which parents each variable can have. The chain rule applied in this order always gives a valid factorization:

P(x1:n) = ∏i P(xi | x1:i-1)

But with n−1 possible parents for Xi, the full CPT is still exponential. The key step is identifying which predecessors are actually direct causes.

Step 3 — Select Parents (Apply Domain Knowledge)

For each variable Xi, ask: which variables in {X1, ..., Xi-1} directly influence Xi? A variable Xj is not needed as a parent of Xi if Xi is conditionally independent of Xj given the chosen parents. This conditional independence assumption is what makes the network compact — it is a modeling decision, not a mathematical fact.

Causal thinking is the best guide. Edges that follow causal direction (cause to effect) tend to produce sparse, interpretable networks with valid conditional independence assumptions. Edges running backwards or across causal chains tend to require more parents and produce denser networks. Kochenderfer et al. recommend: "When in doubt, follow the arrows of time and causation."

Step 4 — Elicit or Learn the CPTs

For each node Xi with parents pai, we need P(Xi | pai = v) for every parent configuration v. Sources:

The Noisy-OR Model

When Xi has many binary parent causes C1, ..., Ck, the full CPT needs 2k entries. The noisy-OR reduces this to k parameters q1, ..., qk (inhibition probabilities) via:

P(Xi=0 | c1:k) = ∏j: cj=1 qj

Interpretation: each active cause independently fails to produce the effect with probability qj. The effect occurs unless all causes fail to produce it. This reduces k parents from 2k to just k parameters.

CPT Size Calculator

Adjust number of parents and cardinalities to see how CPT size grows. Compare the full CPT vs noisy-OR. This is why parent selection matters so much.

Number of binary parents k 4
Child cardinality 2
A node has 6 binary parents. The full CPT requires how many parameters (assuming binary child)?

Chapter 9: Connections & What Comes Next

Chapter 2 has built the complete language of probabilistic representation. Before moving forward, here is where each concept leads in the book.

Concept from Ch 2Core formulaWhere it goes
Probability axiomsP(Ω)=1, additivityFoundation for all inference and decision theory
PMF / PDF / CDF∑P(x)=1, ∫p(x)dx=1Parameter learning (Ch 4), distributions over policies
Joint distributionP(x1,...,xn)Factor representation (Ch 2.5), inference targets
MarginalizationP(x)=∑yP(x,y)Variable elimination (Ch 3), particle filters
Conditional probabilityP(x|y)=P(x,y)/P(y)Every CPT in a Bayesian network
Bayes' theoremP(x|y)∝P(y|x)P(x)All inference (Ch 3), all estimation (Ch 19)
Bayesian networkiP(xi|pai)Inference (Ch 3), learning (Ch 4–5), decisions (Ch 6)
d-SeparationGraphical independenceStructure learning (Ch 5), POMDP belief updates (Ch 19)

Limitations of This Chapter's Representations

Discrete-only CPTs require exact enumeration of parent states. For continuous parents, we need parametric families (linear Gaussian, sigmoid, neural networks). Chapter 4 covers learning these from data.
Static networks represent a single time slice. Time-varying systems (the satellite's state over time) need dynamic Bayesian networks or hidden Markov models — Chapter 19 covers beliefs in sequential settings.
Exact inference in large networks is NP-hard in general. Chapter 3 develops variable elimination, belief propagation, and Monte Carlo sampling to handle large-scale inference.
Learning structure (not just parameters) requires searching over DAGs — an enormous combinatorial space. Chapter 5 covers constraint-based and score-based structure learning algorithms.

Connections to Other Parminces Lessons

The ideas in this chapter appear everywhere:

"Probability theory is nothing but common sense reduced to calculation."
— Pierre-Simon Laplace

You now have the vocabulary. Every distribution, network, and inference method in this book speaks this language. In Chapter 3, we learn to use it to reason backwards from observations to hidden causes.

What makes d-separation powerful as a tool for reading Bayesian networks?
← Chapter Index Chapter 3: Inference →