How do we encode degrees of belief mathematically? This chapter builds the full language of probability — distributions, joint tables, Bayes' theorem, and Bayesian networks — the substrate every decision algorithm in the book runs on.
A satellite is drifting off course. Is it a battery failure? A solar panel fault? A software bug? You cannot see inside the spacecraft — you only have telemetry readings. How do you decide what to believe, and how to act?
Naive approaches fail. You cannot just pick the most dramatic explanation (that ignores the evidence). You cannot enumerate every possibility in full detail (the state space explodes). You need a language for representing and combining uncertainty systematically.
That language is probability theory. But not as a recipe — as the unique consistent extension of logical reasoning to uncertain situations. Kochenderfer et al. (§2.1) begin with a stunning result: if you want to rank beliefs by plausibility and satisfy two basic rationality requirements, you are forced into probability theory. There is no other option.
The two requirements are universal comparability — any two beliefs can be compared — and transitivity — if A is more plausible than B and B more plausible than C, then A is more plausible than C. From these two axioms alone, a numerical scale of belief emerges. Add a few more consistency conditions (Kolmogorov's axioms) and you have the full probability calculus.
Three possible causes of a spacecraft anomaly. Drag the sliders to set your belief in each. The order on the bar chart gives the plausibility ranking — comparability and transitivity are built in automatically.
Once we accept that beliefs should be real numbers, we need rules for how they combine. Kochenderfer §2.1 follows Cox's theorem: three axioms produce a unique consistent system.
Axiom 1 — Non-negativity: P(A) ≥ 0 for any event A. Belief cannot be negative.
Axiom 2 — Normalization: P(Ω) = 1 where Ω is the set of all possible outcomes. You are certain that something happens.
Axiom 3 — Additivity: If A and B are mutually exclusive (they cannot both occur), then P(A ∪ B) = P(A) + P(B). Disjoint possibilities add.
From just these three axioms, everything else follows. The complement rule P(Ac) = 1 − P(A) is a theorem, not a postulate. The inclusion-exclusion formula P(A ∪ B) = P(A) + P(B) − P(A ∩ B) is a theorem. And crucially, conditional probability P(A|B) = P(A,B)/P(B) is the only consistent way to update beliefs on new information.
Events A and B partition a probability space. Drag to split: when A and B are disjoint, P(A∪B) = P(A)+P(B). Observe how the complement P(neither) = 1−P(A)−P(B) adjusts automatically to maintain normalization.
If B1, ..., Bn partition the sample space (exhaustive and mutually exclusive), then for any event A:
This law is the backbone of marginalization: we compute the probability of A by considering every scenario Bi, weighting by how probable that scenario is. We will use this constantly when computing P(evidence) in Bayes' theorem.
A probability distribution is a complete description of uncertainty over a variable. How you describe it depends on whether the variable is discrete or continuous.
A probability mass function (PMF) assigns a probability to each discrete outcome. For a variable X taking values {x1, ..., xn}, the PMF satisfies P(X = xi) ≥ 0 and ∑i P(X = xi) = 1. The parameters of the distribution are the n−1 free probabilities (the last is fixed by normalization).
Example: a loaded die. Six faces, six probabilities that sum to 1. Five free parameters. If we bias face 1 toward probability 0.4, the remaining five faces share 0.6 equally at 0.12 each.
For a continuous variable X, we use a probability density function (PDF) p(x) where the probability of X falling in interval [a,b] is ∫ab p(x) dx. The density p(x) itself can exceed 1 — only areas under the curve are probabilities. The normalization condition is ∫-∞∞ p(x) dx = 1.
The CDF F(x) = P(X ≤ x) is the integral of the PDF up to x. It starts at 0, ends at 1, and is always non-decreasing. Any valid CDF gives a valid probability distribution. The CDF is often easier to work with analytically than the PDF.
The Gaussian N(μ, σ2) is parameterized by mean μ and variance σ2:
It is unimodal and symmetric. The Central Limit Theorem guarantees that sums of independent random variables converge to a Gaussian, making it the natural model for aggregate noise. When distributions need multiple peaks, use a Gaussian mixture: a weighted sum of Gaussians ∑i ρi N(μi, σi2) where weights ρi sum to 1.
Left: loaded die PMF (bias face 1). Right: Gaussian PDF with adjustable mean and spread. Toggle mixture to see bimodal density.
Real problems involve many variables at once. A joint distribution P(X, Y) specifies the probability of every combination of values. For two binary variables X ∈ {0,1} and Y ∈ {0,1}, this is a 2×2 table of four probabilities summing to 1.
The parameter explosion: for n binary variables, the full joint needs 2n − 1 free parameters. At n=20, that is over one million numbers. At n=100, it is more than atoms in the observable universe. Storing the full joint is impossible for realistic problems — this is why Bayesian networks exist.
From a joint distribution we can always recover the distribution over any subset of variables by marginalization — summing out the unwanted variables:
This is the law of total probability. If X and Y are continuous, replace the sum with an integral. Marginalization tells us: to find P(X=x), add up the joint probability over all possible Y values. The variable Y is summed away — it no longer appears in the result.
X and Y are independent if P(x,y) = P(x)P(y) for all values. This collapses 2n−1 parameters into n parameters — one marginal probability per variable. But independence is often too strong an assumption: knowing someone smokes and knowing they have lung cancer are not independent. Conditional independence is the practical middle ground, which Bayesian networks exploit.
A joint distribution over binary X and Y. Cell intensities reflect probabilities. Marginals P(X) and P(Y) are computed live by summing rows and columns. Hit Randomize to explore different joints, or Make Independent to set P(x,y)=P(x)P(y).
You just got back a positive medical test. Should you panic? The answer depends on three numbers that most people — including doctors — confuse. Working through this example is the single most important lesson in probability.
The conditional distribution P(X | Y = y) is your belief about X after learning that Y has value y. The definition is:
We take the slice of the joint table where Y=y, then renormalize so the probabilities sum to 1. This is all conditional probability is — rescaling to a reduced hypothesis space. Note that P(x|y) is a proper distribution over x for each fixed y.
Rearranging the definition gives the chain rule (also called the product rule):
Extending to three variables: P(x, y, z) = P(x | y, z) P(y | z) P(z). This decomposition of a joint into conditionals is the foundation of Bayesian networks.
Since P(x,y) = P(y|x)P(x) and also P(x,y) = P(x|y)P(y), setting them equal gives Bayes' theorem:
The four terms have names that recur throughout the book:
Prior P(x) — your belief before seeing evidence y. In the medical test: how common is this disease in the population?
Likelihood P(y|x) — how probable is the observation if hypothesis x is true? If the disease is present, how likely is a positive test (sensitivity)?
Evidence P(y) = ∑x P(y|x)P(x) — the total probability of the observation, summing over all hypotheses. This is the normalizing constant.
Posterior P(x|y) — your updated belief after seeing evidence y. This is the answer to the question "what should I believe now?"
Adjust disease prevalence (prior), test sensitivity P(+|disease), and test specificity P(−|healthy). Watch the posterior P(disease|+) update. The step-by-step calculation is shown below the bars.
What if you take the test a second time? The posterior from the first test becomes the prior for the second test. Bayes' theorem applies iteratively — each new observation updates the belief. After k independent observations y1, ..., yk:
This sequential structure is exactly what Kalman filters, particle filters, and all Bayesian estimation algorithms exploit. The chapter you are reading is where it all starts.
The joint distribution over n variables needs 2n−1 parameters. We need a compact alternative that still captures the real dependencies. The answer is the Bayesian network — a directed acyclic graph (DAG) where nodes are variables and edges encode direct probabilistic influence.
In a directed acyclic graph, arrows point from cause to effect, and there are no cycles. Each node Xi is associated with a conditional probability table (CPT): P(Xi | parents(Xi)). Nodes with no parents store marginal probabilities P(Xi).
The graph structure encodes a factorization of the joint distribution:
This is not an assumption — it is the definition of what the DAG means. Every variable is conditionally independent of its non-descendants given its parents. This is the Markov condition of a Bayesian network.
The textbook's spacecraft monitoring system has 5 binary variables: Battery failure, Solar panel failure, Electrical system failure, Deviation (trajectory), and Communication loss. The causal structure is: B and S both cause E; E causes D and C.
Full joint needs: 25−1 = 31 parameters
Bayesian network needs:
The 5-node satellite network. Click a node to cycle through: unobserved (dim) → true (orange) → false (teal) → unobserved. Posterior beliefs update via full enumeration of the joint. The chain rule computation is live.
Two variables X and Y are marginally independent if knowing one tells you nothing about the other: P(X|Y) = P(X), equivalently P(X,Y) = P(X)P(Y). But marginal independence is rarely the useful concept in complex systems.
The more powerful idea is conditional independence. X and Y are conditionally independent given Z if knowing Z makes them irrelevant to each other:
Equivalently, P(X | Y, Z) = P(X | Z) — once you know Z, learning Y adds nothing about X. This is written X ⊥⊥ Y | Z.
Three simple networks illustrating the three structures. Select a structure, toggle evidence on Z, and see how correlation between X and Y changes.
Given a complex Bayesian network with many nodes, how do we determine whether two variables X and Y are conditionally independent given some evidence set Z? The answer is d-separation (d = "directional") — a graphical criterion that reads independence directly from the DAG structure.
To check whether X ⊥⊥ Y | Z in a DAG:
| Structure on path | Evidence on middle node? | Flow status |
|---|---|---|
| Chain X→M→Y or X←M←Y | M in Z (observed) | Blocked |
| Chain X→M→Y or X←M←Y | M not in Z | Active |
| Fork X←M→Y | M in Z (observed) | Blocked |
| Fork X←M→Y | M not in Z | Active |
| Collider X→M←Y | M or descendant in Z | Active (explains away) |
| Collider X→M←Y | M and descendants not in Z | Blocked |
The collider rule is the surprising one: unobserved colliders block flow (independent causes stay independent when we do not observe the effect). But observing a collider opens flow — independent causes become correlated through their shared effect.
The satellite network (B, S, E, D, C). Click a node to toggle it as evidence (orange). Then click two different nodes while holding Shift to mark them as query nodes (teal). The explorer checks all paths and reports active/blocked status.
The satellite network emerged naturally from the causal structure: battery and solar panels cause electrical failure; electrical failure causes trajectory deviation and communication loss. But how do we systematically construct a Bayesian network for a new domain?
List all relevant variables and their possible values. For discrete variables, decide on the resolution: binary (fail/ok) is often enough at the system level. Too many states per variable explodes the CPT size — a node with 3 parents each taking 4 values needs a CPT of 43×k entries where k is the variable's own cardinality.
Pick an ordering X1, ..., Xn such that causes come before effects. This ordering determines which parents each variable can have. The chain rule applied in this order always gives a valid factorization:
But with n−1 possible parents for Xi, the full CPT is still exponential. The key step is identifying which predecessors are actually direct causes.
For each variable Xi, ask: which variables in {X1, ..., Xi-1} directly influence Xi? A variable Xj is not needed as a parent of Xi if Xi is conditionally independent of Xj given the chosen parents. This conditional independence assumption is what makes the network compact — it is a modeling decision, not a mathematical fact.
For each node Xi with parents pai, we need P(Xi | pai = v) for every parent configuration v. Sources:
When Xi has many binary parent causes C1, ..., Ck, the full CPT needs 2k entries. The noisy-OR reduces this to k parameters q1, ..., qk (inhibition probabilities) via:
Interpretation: each active cause independently fails to produce the effect with probability qj. The effect occurs unless all causes fail to produce it. This reduces k parents from 2k to just k parameters.
Adjust number of parents and cardinalities to see how CPT size grows. Compare the full CPT vs noisy-OR. This is why parent selection matters so much.
Chapter 2 has built the complete language of probabilistic representation. Before moving forward, here is where each concept leads in the book.
| Concept from Ch 2 | Core formula | Where it goes |
|---|---|---|
| Probability axioms | P(Ω)=1, additivity | Foundation for all inference and decision theory |
| PMF / PDF / CDF | ∑P(x)=1, ∫p(x)dx=1 | Parameter learning (Ch 4), distributions over policies |
| Joint distribution | P(x1,...,xn) | Factor representation (Ch 2.5), inference targets |
| Marginalization | P(x)=∑yP(x,y) | Variable elimination (Ch 3), particle filters |
| Conditional probability | P(x|y)=P(x,y)/P(y) | Every CPT in a Bayesian network |
| Bayes' theorem | P(x|y)∝P(y|x)P(x) | All inference (Ch 3), all estimation (Ch 19) |
| Bayesian network | ∏iP(xi|pai) | Inference (Ch 3), learning (Ch 4–5), decisions (Ch 6) |
| d-Separation | Graphical independence | Structure learning (Ch 5), POMDP belief updates (Ch 19) |
The ideas in this chapter appear everywhere:
You now have the vocabulary. Every distribution, network, and inference method in this book speaks this language. In Chapter 3, we learn to use it to reason backwards from observations to hidden causes.