Piech, Chapter 10

Inference & Bayesian Networks

Updating beliefs across networks of random variables — from two-variable Bayes to graphical models.

Prerequisites: Chapter 9 (Joint distributions, marginals, conditional PMFs/PDFs). That's it.
10
Chapters
4
Simulations
10
Quizzes

Chapter 0: Why Inference?

You're building a medical diagnosis app. A patient walks in with a fever and fatigue. You know the prior probabilities of various diseases, and you know how likely each symptom is given each disease. The patient gives you evidence — their symptoms — and you need to update your beliefs about what disease they have.

This is inference: computing the probability of an unknown random variable given observed values of other random variables. It is the central computational task in probabilistic modelling. Everything we've built so far — joint distributions, conditional PMFs, Bayes' theorem for events — leads here.

In this chapter we'll do inference first with two random variables (straightforward Bayes' theorem), then scale up to networks of many variables where a naive approach would require tracking 2N combinations. Bayesian networks tame that exponential explosion by encoding independence assumptions in a directed graph.

The core idea: Inference is "running Bayes' theorem on random variables." You have a prior belief about X, you observe evidence Y = y, and you compute the posterior P(X | Y = y). The twist: when you have 100 interacting variables, you need structure — Bayesian networks — to make inference tractable.
Prior
P(X) — what you believe before evidence
↓ observe Y = y
Likelihood
P(Y = y | X) — how probable the evidence is under each hypothesis
↓ apply Bayes' theorem
Posterior
P(X | Y = y) — updated belief

We'll also cover two related but distinct concepts: independence (knowing X tells you nothing about Y) and correlation (a quantitative measure of linear association). These ideas govern when and how information flows between variables.

ConceptWhat it answers
InferenceWhat is P(X | observed evidence)?
Bayesian networkHow do we represent a joint distribution compactly?
IndependenceDoes knowing X tell us anything about Y?
Covariance / CorrelationHow strongly and in what direction do X and Y co-vary?
Check: What is the central task of inference?

Chapter 1: Bayes' Theorem for Random Variables

In Chapter 3 we learned Bayes' theorem for events. Now we extend it to random variables. The logic is identical — every relational operator on a random variable defines an event — but the notation changes to handle PMFs and PDFs.

Discrete case. Let X and Y be discrete random variables. The conditional PMF is:

P(X = x | Y = y) = P(X = x, Y = y) / P(Y = y)

And Bayes' theorem becomes:

P(X = x | Y = y) = P(Y = y | X = x) · P(X = x) / P(Y = y)

The denominator P(Y = y) is computed via the law of total probability: sum P(Y = y | X = x') · P(X = x') over all possible values x' of X.

Shorthand notation. We write P(x | y) as shorthand for P(X = x | Y = y). This keeps formulas compact when juggling many variables.

Mixed case. When X is continuous and N is discrete:

f(X = x | N = n) = P(N = n | X = x) · f(X = x) / P(N = n)

The rule: whenever the variable on the left of the conditional is continuous, use a density f; when discrete, use a probability P. The derivation uses the approximation P(X = x) ≈ f(X = x) · ε, and the ε terms cancel.

Key insight: The rules of probability — Bayes' theorem, chain rule, total probability — all transfer to random variables. For continuous variables, replace probabilities with densities. Whenever ε appears (from the density-to-probability conversion), it cancels.

Worked example — elephant inference: Girl elephants weigh N(μ=160, σ2=49) at birth; boys weigh N(μ=165, σ2=9). A newborn weighs 163 kg. What is P(girl)?

Let G = 1 mean girl, P(G=1) = 0.5. We need P(G=1 | X=163):

P(G=1 | X=163) = f(X=163 | G=1) · P(G=1) / f(X=163)

The likelihood f(X=163 | G=1) is the Gaussian PDF with μ=160, σ=7, evaluated at 163:

f(163 | girl) = (1 / 7√(2π)) · exp(−(163−160)2 / (2 · 49)) = (1 / 7√(2π)) · exp(−9/98)

Similarly, f(X=163 | G=0) uses μ=165, σ=3:

f(163 | boy) = (1 / 3√(2π)) · exp(−(163−165)2 / (2 · 9)) = (1 / 3√(2π)) · exp(−4/18)

The denominator via total probability: f(163) = f(163|girl)·0.5 + f(163|boy)·0.5. Plugging in numerically:

P(girl | 163kg) ≈ 0.328

Despite 163 being closer to the girl mean (160 vs 165), the boy distribution is much tighter (σ=3 vs σ=7), so 163 is more probable under the boy hypothesis. The posterior favours boy at about 67%.

Intuition: A tighter distribution concentrates more probability mass near its mean. Even though 163 is 3 kg from the girl mean and only 2 kg from the boy mean, the boy's narrow bell curve assigns much higher density at 163. Spread matters as much as distance.
Check: In the elephant example, why does the posterior favour "boy" even though 163 is closer to the girl mean?

Chapter 2: MAP & MLE Preview

Inference gives us the full posterior distribution P(X | evidence). But often we just want a single "best guess." There are two natural choices, and they differ in whether you account for the prior.

Maximum A Posteriori (MAP) picks the value of X that maximises the posterior:

MAP = argmaxx P(X = x | Y = y) = argmaxx P(Y = y | X = x) · P(X = x)

We can drop the denominator P(Y = y) because it doesn't depend on x. MAP uses both the likelihood and the prior — it asks "what value of X makes the observed evidence most probable, weighted by how likely X was a priori?"

Maximum Likelihood Estimation (MLE) ignores the prior entirely:

MLE = argmaxx P(Y = y | X = x)

MLE asks "what value of X makes the observed data most probable?" without any prior belief. This is equivalent to MAP with a uniform prior.

MAP vs. MLE: MAP = likelihood × prior. MLE = likelihood only. When the prior is uniform (you have no reason to prefer any value of X), MAP and MLE give the same answer. When you have strong prior information, MAP pulls the estimate towards the prior.

Worked example: Back to the elephant. The MLE asks: which sex makes 163 kg most likely? We compare f(163 | girl) vs f(163 | boy). Since f(163 | boy) is larger (as we computed), MLE says "boy." MAP also says "boy" because the prior is 50/50 — with equal priors, MAP = MLE.

Now suppose 70% of elephants born at this zoo are girls. Then P(G=1) = 0.7 and P(G=0) = 0.3. The MAP calculation becomes:

MAP: f(163|girl) · 0.7 vs. f(163|boy) · 0.3

Working out the numbers: the girl side is roughly 0.0227 × 0.7 = 0.0159; the boy side is roughly 0.0547 × 0.3 = 0.0164. They're almost tied! The strong prior towards girl nearly overcomes the likelihood advantage of boy. MLE still says "boy" (it ignores the 70% girl prior), but MAP is essentially a coin flip.

MethodFormulaUses prior?
MAPargmaxx P(y|x) P(x)Yes
MLEargmaxx P(y|x)No
Full BayesianEntire posterior P(x|y)Yes
Looking ahead: MLE is the workhorse of machine learning — when you "train a model" by maximising likelihood on data, you're doing MLE. MAP shows up in regularisation: an L2 penalty on weights is equivalent to a Gaussian prior, turning MLE into MAP. We'll revisit both in detail in the next chapter on general inference.
Check: When is MAP identical to MLE?

Chapter 3: Bayesian Networks

Consider a medical model with 100 binary random variables: demographics, conditions, and symptoms. To fully specify the joint distribution, you'd need to fill in a table with 2100 > 1030 entries — approaching the number of atoms in the universe. That's clearly infeasible.

Bayesian networks (Bayes nets) solve this by encoding the generative process — the causal flow of influence between variables — as a directed acyclic graph (DAG). Each node is a random variable. An arrow from X to Y means "X directly influences Y." We say X is a parent of Y.

Here's a concrete example. Consider four binary variables: University (U), Influenza (I), Fever (F), Tired (T). Being in university influences whether you get influenza. Having influenza influences whether you have a fever. Both university and influenza influence whether you're tired.

Bayesian Network: Disease Model

A simple four-node Bayes net. Arrows show the direction of causal influence. Hover over a node to see its conditional probability table.

To fully specify this Bayes net, we provide the conditional probability of each variable given its parents:

VariableConditionP(=1)
Uni(no parents)0.80
InfluenzaUni=10.20
InfluenzaUni=00.10
FeverInf=10.90
FeverInf=00.05
TiredUni=0, Inf=00.10
TiredUni=0, Inf=10.90
TiredUni=1, Inf=00.80
TiredUni=1, Inf=11.00
Key insight — the factorisation: The joint probability over all variables factors as a product of each variable's conditional probability given its parents:

P(U, I, F, T) = P(U) · P(I|U) · P(F|I) · P(T|U, I)

Instead of 24 − 1 = 15 free parameters (a full joint table), we only need 1 + 2 + 2 + 4 = 9 entries. For 100 binary variables each with at most 3 parents, we'd need about 800 parameters instead of 1030.

Worked example: What is P(Fever=1, Tired=1, Influenza=1, Uni=1)?

= P(U=1) · P(I=1|U=1) · P(F=1|I=1) · P(T=1|U=1,I=1)
= 0.80 × 0.20 × 0.90 × 1.00 = 0.144

Why this works: From the chain rule, the exact joint is P(x1,...,xn) = ∏ P(xi | xi-1,...,x1). The Bayes net assumes P(xi | xi-1,...,x1) = P(xi | parents of Xi). This is a conditional independence assumption: each variable, given its parents, is independent of all its non-descendants.

Designing a Bayes net: (1) Choose your random variables as nodes. (2) Add directed edges based on causal influence. (3) Define P(Xi | parents) for each node. The edges should follow the causal direction — demographics cause conditions cause symptoms — which makes the conditional independence assumptions natural.
Check: In a Bayes net with 100 binary variables and at most 3 parents per node, roughly how many parameters are needed?

Chapter 4: Conditional Independence

The power of Bayesian networks comes from conditional independence. The Bayes net encodes a specific independence assumption: each variable Xi, given its parents, is conditionally independent of all its non-descendants.

What does this mean concretely? In our disease model, once you know whether someone has influenza, learning their university status tells you nothing extra about their fever. Formally:

P(Fever | Influenza, Uni) = P(Fever | Influenza)

Fever depends on Influenza (its parent). University is a non-descendant of Fever, and once we condition on Influenza (which is Fever's parent), University provides no additional information.

The chain rule decomposition. Without any independence assumptions, the chain rule for n variables is:

P(x1,...,xn) = ∏i P(xi | xi-1,...,x1)

Each factor conditions on all preceding variables. The Bayes net simplifies each factor:

P(xi | xi-1,...,x1) = P(xi | parents of Xi)

This is exactly the conditional independence statement. It says: knowing the parents is sufficient; non-descendants add nothing.

Non-descendants vs. descendants: A descendant of Xi is any node in the subtree rooted at Xi (children, grandchildren, etc.). A non-descendant is everything else: ancestors, siblings, and unconnected nodes. The Bayes net assumption is only about non-descendants — descendants can still carry information about Xi (that's the direction inference flows).

Worked example: In our 4-node disease model, compute P(Uni=1 | Fever=1).

We need to marginalise over Influenza. By Bayes' theorem:

P(U=1|F=1) = P(F=1|U=1) · P(U=1) / P(F=1)

First compute P(F=1|U=1) by marginalising over I:

P(F=1|U=1) = P(F=1|I=1)P(I=1|U=1) + P(F=1|I=0)P(I=0|U=1)
= 0.90 × 0.20 + 0.05 × 0.80 = 0.180 + 0.040 = 0.220

Similarly, P(F=1|U=0) = 0.90 × 0.10 + 0.05 × 0.90 = 0.135. Then:

P(F=1) = P(F=1|U=1)P(U=1) + P(F=1|U=0)P(U=0) = 0.220 × 0.80 + 0.135 × 0.20 = 0.203
P(U=1|F=1) = 0.220 × 0.80 / 0.203 = 0.176 / 0.203 ≈ 0.867

So observing a fever nudges P(Uni) from 0.80 up to 0.867 — a small increase, because university students are slightly more likely to have influenza and hence fever.

Check: In the disease Bayes net, if you know Influenza=1, does learning Uni=1 change P(Fever)?

Chapter 5: Independence in Random Variables

Two random variables X and Y are independent if knowing the value of one tells you absolutely nothing about the other. Formally:

P(X = x, Y = y) = P(X = x) · P(Y = y)  for all x, y

For continuous variables, the equivalent statement uses CDFs or PDFs:

f(X = x, Y = y) = f(X = x) · f(Y = y)  for all x, y

A useful test: if the joint distribution factors into a product of a function of x alone and a function of y alone, the variables are independent. That is, if P(X=x, Y=y) = h(x) · g(y) for some functions h and g, then X and Y are independent.

Key insight — symmetry: Independence is symmetric. If X is independent of Y, then Y is independent of X. This seems obvious but is surprisingly powerful. Sometimes it's much easier to prove independence in one direction than the other.

Expectation of products. If X and Y are independent, the expectation of their product factors:

E[XY] = E[X] · E[Y]  (if X, Y independent)

More generally, E[g(X) · h(Y)] = E[g(X)] · E[h(Y)] for any functions g, h. Note: the converse is false! E[XY] = E[X]E[Y] does not prove independence.

Contrast with sums: E[X + Y] = E[X] + E[Y] holds always, whether or not X and Y are independent. The product rule E[XY] = E[X]E[Y] requires independence. Don't mix them up.

Worked example — Poisson splitting: Requests to a web server follow Poi(λ). Each request is human with probability p or bot with probability 1−p. Let X = human requests, Y = bot requests. Show X ⊥ Y.

Since requests split independently, (X|N) ~ Bin(N, p) and (Y|N) ~ Bin(N, 1−p). We compute the joint via chain rule:

P(X=x, Y=y) = P(X=x, Y=y | N=x+y) · P(N=x+y)
= C(x+y, x) · px(1−p)y · λx+y e−λ / (x+y)!

Expanding C(x+y, x) = (x+y)! / (x! y!) and cancelling:

= [(λp)x e−λp / x!] · [(λ(1−p))y e−λ(1−p) / y!]

This factors into h(x) · g(y) — each factor is the PMF of a Poisson! So X ~ Poi(λp) and Y ~ Poi(λ(1−p)) independently. This is the Poisson splitting theorem.

Check: E[X + Y] = E[X] + E[Y] requires independence. True or false?

Chapter 6: Covariance

Independence is an all-or-nothing property. But in practice, we want a quantitative measure of how two variables relate. Covariance measures the extent to which X and Y deviate from their means together.

Cov(X, Y) = E[(X − E[X])(Y − E[Y])]

Unpacking this: if X is above its mean when Y is above its mean (and below when below), each term (X − E[X])(Y − E[Y]) is positive, so the covariance is positive. If they move in opposite directions, covariance is negative.

A more convenient computational formula:

Cov(X, Y) = E[XY] − E[X] · E[Y]
Covariance of independent variables: If X and Y are independent, then E[XY] = E[X]E[Y], so Cov(X,Y) = 0. But the converse is not true — Cov(X,Y) = 0 does not imply independence. (Classic counterexample: X uniform on [−1,1], Y = X2. They are clearly dependent but Cov(X,Y) = 0.)

Properties of covariance:

PropertyFormula
SymmetryCov(X, Y) = Cov(Y, X)
Self-covarianceCov(X, X) = Var(X)
ScalingCov(aX + b, Y) = a · Cov(X, Y)
BilinearityCov(X1+X2, Y) = Cov(X1,Y) + Cov(X2,Y)

Variance of sums. For X = X1 + ... + Xn:

Var(X) = ∑ij Cov(Xi, Xj)

When the Xi are independent, cross-terms vanish and Var(X) = ∑ Var(Xi). When they're not independent, the cross-covariance terms matter — this is why diversification reduces portfolio risk.

Worked example: X and Y have joint PMF: P(0,0)=0.1, P(0,1)=0.2, P(1,0)=0.3, P(1,1)=0.4.

E[X] = 0·0.3 + 1·0.7 = 0.7. E[Y] = 0·0.4 + 1·0.6 = 0.6. E[XY] = 1·1·0.4 = 0.4.

Cov(X,Y) = 0.4 − 0.7 × 0.6 = 0.4 − 0.42 = −0.02

Slightly negative: X and Y have a weak tendency to move in opposite directions.

Check: Cov(X, Y) = 0 implies X and Y are independent. True or false?

Chapter 7: Correlation

Covariance has a problem: its magnitude depends on the scale of the variables. If you measure height in centimetres vs. metres, the covariance changes by a factor of 100. Correlation fixes this by normalising:

ρ(X, Y) = Cov(X, Y) / √(Var(X) · Var(Y))

Correlation is dimensionless and always lies in [−1, +1]:

ρMeaning
+1Perfect positive linear relationship: Y = aX + b with a > 0
−1Perfect negative linear relationship: Y = aX + b with a < 0
0No linear relationship (but possibly nonlinear dependence!)

When ρ(X,Y) = 0, we say X and Y are uncorrelated. Uncorrelated is weaker than independent: independent ⇒ uncorrelated, but uncorrelated ⇏ independent.

Correlation Explorer

Drag the slider to set a target ρ. The scatter plot shows 200 samples from a bivariate Gaussian with that correlation. Watch how the point cloud elongates and tilts.

ρ0.70
Pearson vs. Spearman: The correlation ρ we've defined is Pearson correlation — it measures linear association. Spearman correlation applies the same formula but first converts values to ranks, capturing monotonic (not just linear) relationships. Spearman is more robust to outliers. Piech notes Spearman is outside the scope of CS109.

Worked example: Let X ~ Uniform{1,2,3} and Y = 2X + 1. Then E[X] = 2, E[X2] = (1+4+9)/3 = 14/3, Var(X) = 14/3 − 4 = 2/3.

E[Y] = 2·2+1 = 5, Var(Y) = 4 · Var(X) = 8/3.

Cov(X,Y) = Cov(X, 2X+1) = 2·Var(X) = 4/3.

ρ(X,Y) = (4/3) / √(2/3 · 8/3) = (4/3) / (4/3) = 1

Perfect correlation, as expected for an exact linear relationship with positive slope.

Check: If ρ(X,Y) = −0.95, which statement is most accurate?

Chapter 8: Showcase — Interactive Bayesian Network

This is the payoff simulation. Below is the four-node disease Bayesian network from Chapter 3, fully interactive. Click any node to toggle its observed value (0 or 1). The network instantly recomputes all posterior probabilities using exact inference (enumeration over all configurations).

Try these experiments:

Interactive Bayes Net Inference

Click a node to cycle: unobserved → observed=1 → observed=0 → unobserved. Posterior probabilities update instantly.

Click a node to set evidence
What to notice: Fever is the strongest signal for influenza because P(Fever|Inf) is very high (0.9) while P(Fever|no Inf) is very low (0.05). University status alone barely shifts influenza probability. But combining evidence is more powerful than either alone — the posterior after observing two variables is not simply the average of two single-evidence posteriors.

How inference works under the hood: For this small network, we can enumerate all 24 = 16 joint configurations. For each unobserved variable, we sum the joint probabilities over all configurations consistent with the evidence, then normalise. This is inference by enumeration — exact but exponential. Chapter 11 will introduce smarter algorithms.

Covariance Heatmap

The pairwise covariance between all four variables in the disease model, computed from the full joint distribution. Darker warm = positive, darker teal = negative.

Check: In the interactive Bayes net, which single observation causes the largest shift in P(Influenza)?

Chapter 9: Connections

This chapter brings together inference, Bayesian networks, independence, and correlation. Let's see how these ideas connect to the bigger picture.

The Bayesian worldview: Inference is the computational expression of Bayesian reasoning. Start with a prior, observe evidence, update via Bayes' theorem. Bayesian networks give this process structure — they encode which variables influence which, making inference tractable even in large models. Independence and correlation quantify when and how strongly information flows between variables.
Inference (Ch 10)
Bayes' theorem for random variables: computing P(X|Y=y)
↓ scales to many variables via
Bayesian Networks (Ch 10)
Directed graphs that factor the joint distribution compactly
↓ encode assumptions of
Conditional Independence
Xi ⊥ non-descendants | parents
↓ measured quantitatively by
Covariance & Correlation
Cov(X,Y) for magnitude, ρ(X,Y) for normalised strength

What comes next:

ConceptKey formulaWhere it leads
Bayes for RVsP(x|y) = P(y|x)P(x)/P(y)Parameter estimation, filtering
MAP / MLEargmax P(y|x)P(x) / argmax P(y|x)All of machine learning
Bayes net factorisation∏ P(xi | parents)Graphical models, variational inference
IndependenceP(x,y) = P(x)P(y)Simplifying computation
Correlationρ = Cov(X,Y)/(σXσY)Feature selection, portfolio theory
Check: Independent ⇒ uncorrelated. Uncorrelated ⇒ independent. Which is true?