Ch 10: Inference & Bayesian Networks

Chapter 0: Why Inference?

You're building a medical diagnosis app. A patient walks in with a fever and fatigue. You know the prior probabilities of various diseases, and you know how likely each symptom is given each disease. The patient gives you evidence — their symptoms — and you need to update your beliefs about what disease they have.

This is inference: computing the probability of an unknown random variable given observed values of other random variables. It is the central computational task in probabilistic modelling. Everything we've built so far — joint distributions, conditional PMFs, Bayes' theorem for events — leads here.

In this chapter we'll do inference first with two random variables (straightforward Bayes' theorem), then scale up to networks of many variables where a naive approach would require tracking 2^N combinations. Bayesian networks tame that exponential explosion by encoding independence assumptions in a directed graph.

The core idea: Inference is "running Bayes' theorem on random variables." You have a prior belief about X, you observe evidence Y = y, and you compute the posterior P(X | Y = y). The twist: when you have 100 interacting variables, you need structure — Bayesian networks — to make inference tractable.

Prior

P(X) — what you believe before evidence

↓ observe Y = y

Likelihood

P(Y = y | X) — how probable the evidence is under each hypothesis

↓ apply Bayes' theorem

Posterior

P(X | Y = y) — updated belief

We'll also cover two related but distinct concepts: independence (knowing X tells you nothing about Y) and correlation (a quantitative measure of linear association). These ideas govern when and how information flows between variables.

Concept	What it answers
Inference	What is P(X \| observed evidence)?
Bayesian network	How do we represent a joint distribution compactly?
Independence	Does knowing X tell us anything about Y?
Covariance / Correlation	How strongly and in what direction do X and Y co-vary?

Check: What is the central task of inference?

Computing the joint distribution from scratch Computing the probability of an unknown variable given observed evidence Finding the maximum of a probability distribution

Chapter 1: Bayes' Theorem for Random Variables

In Chapter 3 we learned Bayes' theorem for events. Now we extend it to random variables. The logic is identical — every relational operator on a random variable defines an event — but the notation changes to handle PMFs and PDFs.

Discrete case. Let X and Y be discrete random variables. The conditional PMF is:

P(X = x | Y = y) = P(X = x, Y = y) / P(Y = y)

And Bayes' theorem becomes:

P(X = x | Y = y) = P(Y = y | X = x) · P(X = x) / P(Y = y)

The denominator P(Y = y) is computed via the law of total probability: sum P(Y = y | X = x') · P(X = x') over all possible values x' of X.

Shorthand notation. We write P(x | y) as shorthand for P(X = x | Y = y). This keeps formulas compact when juggling many variables.

Mixed case. When X is continuous and N is discrete:

f(X = x | N = n) = P(N = n | X = x) · f(X = x) / P(N = n)

The rule: whenever the variable on the left of the conditional is continuous, use a density f; when discrete, use a probability P. The derivation uses the approximation P(X = x) ≈ f(X = x) · ε, and the ε terms cancel.

Key insight: The rules of probability — Bayes' theorem, chain rule, total probability — all transfer to random variables. For continuous variables, replace probabilities with densities. Whenever ε appears (from the density-to-probability conversion), it cancels.

Worked example — elephant inference: Girl elephants weigh N(μ=160, σ²=49) at birth; boys weigh N(μ=165, σ²=9). A newborn weighs 163 kg. What is P(girl)?

Let G = 1 mean girl, P(G=1) = 0.5. We need P(G=1 | X=163):

P(G=1 | X=163) = f(X=163 | G=1) · P(G=1) / f(X=163)

The likelihood f(X=163 | G=1) is the Gaussian PDF with μ=160, σ=7, evaluated at 163:

f(163 | girl) = (1 / 7√(2π)) · exp(−(163−160)² / (2 · 49)) = (1 / 7√(2π)) · exp(−9/98)

Similarly, f(X=163 | G=0) uses μ=165, σ=3:

f(163 | boy) = (1 / 3√(2π)) · exp(−(163−165)² / (2 · 9)) = (1 / 3√(2π)) · exp(−4/18)

The denominator via total probability: f(163) = f(163|girl)·0.5 + f(163|boy)·0.5. Plugging in numerically:

P(girl | 163kg) ≈ 0.328

Despite 163 being closer to the girl mean (160 vs 165), the boy distribution is much tighter (σ=3 vs σ=7), so 163 is more probable under the boy hypothesis. The posterior favours boy at about 67%.

Intuition: A tighter distribution concentrates more probability mass near its mean. Even though 163 is 3 kg from the girl mean and only 2 kg from the boy mean, the boy's narrow bell curve assigns much higher density at 163. Spread matters as much as distance.

Check: In the elephant example, why does the posterior favour "boy" even though 163 is closer to the girl mean?

Because boys are more common Because the boy distribution is much narrower (σ=3), concentrating more density near 163 Because Bayes' theorem always favours the higher mean

Chapter 2: MAP & MLE Preview

Inference gives us the full posterior distribution P(X | evidence). But often we just want a single "best guess." There are two natural choices, and they differ in whether you account for the prior.

Maximum A Posteriori (MAP) picks the value of X that maximises the posterior:

x̂_MAP = argmax_x P(X = x | Y = y) = argmax_x P(Y = y | X = x) · P(X = x)

We can drop the denominator P(Y = y) because it doesn't depend on x. MAP uses both the likelihood and the prior — it asks "what value of X makes the observed evidence most probable, weighted by how likely X was a priori?"

Maximum Likelihood Estimation (MLE) ignores the prior entirely:

x̂_MLE = argmax_x P(Y = y | X = x)

MLE asks "what value of X makes the observed data most probable?" without any prior belief. This is equivalent to MAP with a uniform prior.

MAP vs. MLE: MAP = likelihood × prior. MLE = likelihood only. When the prior is uniform (you have no reason to prefer any value of X), MAP and MLE give the same answer. When you have strong prior information, MAP pulls the estimate towards the prior.

Worked example: Back to the elephant. The MLE asks: which sex makes 163 kg most likely? We compare f(163 | girl) vs f(163 | boy). Since f(163 | boy) is larger (as we computed), MLE says "boy." MAP also says "boy" because the prior is 50/50 — with equal priors, MAP = MLE.

Now suppose 70% of elephants born at this zoo are girls. Then P(G=1) = 0.7 and P(G=0) = 0.3. The MAP calculation becomes:

MAP: f(163|girl) · 0.7 vs. f(163|boy) · 0.3

Working out the numbers: the girl side is roughly 0.0227 × 0.7 = 0.0159; the boy side is roughly 0.0547 × 0.3 = 0.0164. They're almost tied! The strong prior towards girl nearly overcomes the likelihood advantage of boy. MLE still says "boy" (it ignores the 70% girl prior), but MAP is essentially a coin flip.

Method	Formula	Uses prior?
MAP	argmax_x P(y\|x) P(x)	Yes
MLE	argmax_x P(y\|x)	No
Full Bayesian	Entire posterior P(x\|y)	Yes

Looking ahead: MLE is the workhorse of machine learning — when you "train a model" by maximising likelihood on data, you're doing MLE. MAP shows up in regularisation: an L2 penalty on weights is equivalent to a Gaussian prior, turning MLE into MAP. We'll revisit both in detail in the next chapter on general inference.

Check: When is MAP identical to MLE?

When the prior P(X) is uniform (all values equally likely) When the likelihood is Gaussian When there is only one random variable

Chapter 3: Bayesian Networks

Consider a medical model with 100 binary random variables: demographics, conditions, and symptoms. To fully specify the joint distribution, you'd need to fill in a table with 2¹⁰⁰ > 10³⁰ entries — approaching the number of atoms in the universe. That's clearly infeasible.

Bayesian networks (Bayes nets) solve this by encoding the generative process — the causal flow of influence between variables — as a directed acyclic graph (DAG). Each node is a random variable. An arrow from X to Y means "X directly influences Y." We say X is a parent of Y.

Here's a concrete example. Consider four binary variables: University (U), Influenza (I), Fever (F), Tired (T). Being in university influences whether you get influenza. Having influenza influences whether you have a fever. Both university and influenza influence whether you're tired.

Bayesian Network: Disease Model

A simple four-node Bayes net. Arrows show the direction of causal influence. Hover over a node to see its conditional probability table.

To fully specify this Bayes net, we provide the conditional probability of each variable given its parents:

Variable	Condition	P(=1)
Uni	(no parents)	0.80
Influenza	Uni=1	0.20
Influenza	Uni=0	0.10
Fever	Inf=1	0.90
Fever	Inf=0	0.05
Tired	Uni=0, Inf=0	0.10
Tired	Uni=0, Inf=1	0.90
Tired	Uni=1, Inf=0	0.80
Tired	Uni=1, Inf=1	1.00

Key insight — the factorisation: The joint probability over all variables factors as a product of each variable's conditional probability given its parents:

P(U, I, F, T) = P(U) · P(I|U) · P(F|I) · P(T|U, I)

Instead of 2⁴ − 1 = 15 free parameters (a full joint table), we only need 1 + 2 + 2 + 4 = 9 entries. For 100 binary variables each with at most 3 parents, we'd need about 800 parameters instead of 10³⁰.

Worked example: What is P(Fever=1, Tired=1, Influenza=1, Uni=1)?

= P(U=1) · P(I=1|U=1) · P(F=1|I=1) · P(T=1|U=1,I=1)

= 0.80 × 0.20 × 0.90 × 1.00 = 0.144

Why this works: From the chain rule, the exact joint is P(x₁,...,x_n) = ∏ P(x_i | x_i-1,...,x₁). The Bayes net assumes P(x_i | x_i-1,...,x₁) = P(x_i | parents of X_i). This is a conditional independence assumption: each variable, given its parents, is independent of all its non-descendants.

Designing a Bayes net: (1) Choose your random variables as nodes. (2) Add directed edges based on causal influence. (3) Define P(X_i | parents) for each node. The edges should follow the causal direction — demographics cause conditions cause symptoms — which makes the conditional independence assumptions natural.

Check: In a Bayes net with 100 binary variables and at most 3 parents per node, roughly how many parameters are needed?

2¹⁰⁰ (the full joint table) About 800 (each node needs at most 2³ = 8 entries) Exactly 100 (one per variable)

Chapter 4: Conditional Independence

The power of Bayesian networks comes from conditional independence. The Bayes net encodes a specific independence assumption: each variable X_i, given its parents, is conditionally independent of all its non-descendants.

What does this mean concretely? In our disease model, once you know whether someone has influenza, learning their university status tells you nothing extra about their fever. Formally:

P(Fever | Influenza, Uni) = P(Fever | Influenza)

Fever depends on Influenza (its parent). University is a non-descendant of Fever, and once we condition on Influenza (which is Fever's parent), University provides no additional information.

The chain rule decomposition. Without any independence assumptions, the chain rule for n variables is:

P(x₁,...,x_n) = ∏_i P(x_i | x_i-1,...,x₁)

Each factor conditions on all preceding variables. The Bayes net simplifies each factor:

P(x_i | x_i-1,...,x₁) = P(x_i | parents of X_i)

This is exactly the conditional independence statement. It says: knowing the parents is sufficient; non-descendants add nothing.

Non-descendants vs. descendants: A descendant of X_i is any node in the subtree rooted at X_i (children, grandchildren, etc.). A non-descendant is everything else: ancestors, siblings, and unconnected nodes. The Bayes net assumption is only about non-descendants — descendants can still carry information about X_i (that's the direction inference flows).

Worked example: In our 4-node disease model, compute P(Uni=1 | Fever=1).

We need to marginalise over Influenza. By Bayes' theorem:

P(U=1|F=1) = P(F=1|U=1) · P(U=1) / P(F=1)

First compute P(F=1|U=1) by marginalising over I:

P(F=1|U=1) = P(F=1|I=1)P(I=1|U=1) + P(F=1|I=0)P(I=0|U=1)
= 0.90 × 0.20 + 0.05 × 0.80 = 0.180 + 0.040 = 0.220

Similarly, P(F=1|U=0) = 0.90 × 0.10 + 0.05 × 0.90 = 0.135. Then:

P(F=1) = P(F=1|U=1)P(U=1) + P(F=1|U=0)P(U=0) = 0.220 × 0.80 + 0.135 × 0.20 = 0.203

P(U=1|F=1) = 0.220 × 0.80 / 0.203 = 0.176 / 0.203 ≈ 0.867

So observing a fever nudges P(Uni) from 0.80 up to 0.867 — a small increase, because university students are slightly more likely to have influenza and hence fever.

Check: In the disease Bayes net, if you know Influenza=1, does learning Uni=1 change P(Fever)?

No — Fever is conditionally independent of Uni given Influenza (its parent) Yes — Uni always affects Fever Only if Tired is also observed

Chapter 5: Independence in Random Variables

Two random variables X and Y are independent if knowing the value of one tells you absolutely nothing about the other. Formally:

P(X = x, Y = y) = P(X = x) · P(Y = y) for all x, y

For continuous variables, the equivalent statement uses CDFs or PDFs:

f(X = x, Y = y) = f(X = x) · f(Y = y) for all x, y

A useful test: if the joint distribution factors into a product of a function of x alone and a function of y alone, the variables are independent. That is, if P(X=x, Y=y) = h(x) · g(y) for some functions h and g, then X and Y are independent.

Key insight — symmetry: Independence is symmetric. If X is independent of Y, then Y is independent of X. This seems obvious but is surprisingly powerful. Sometimes it's much easier to prove independence in one direction than the other.

Expectation of products. If X and Y are independent, the expectation of their product factors:

E[XY] = E[X] · E[Y] (if X, Y independent)

More generally, E[g(X) · h(Y)] = E[g(X)] · E[h(Y)] for any functions g, h. Note: the converse is false! E[XY] = E[X]E[Y] does not prove independence.

Contrast with sums: E[X + Y] = E[X] + E[Y] holds always, whether or not X and Y are independent. The product rule E[XY] = E[X]E[Y] requires independence. Don't mix them up.

Worked example — Poisson splitting: Requests to a web server follow Poi(λ). Each request is human with probability p or bot with probability 1−p. Let X = human requests, Y = bot requests. Show X ⊥ Y.

Since requests split independently, (X|N) ~ Bin(N, p) and (Y|N) ~ Bin(N, 1−p). We compute the joint via chain rule:

P(X=x, Y=y) = P(X=x, Y=y | N=x+y) · P(N=x+y)

= C(x+y, x) · p^x(1−p)^y · λ^x+y e^−λ / (x+y)!

Expanding C(x+y, x) = (x+y)! / (x! y!) and cancelling:

= [(λp)^x e^−λp / x!] · [(λ(1−p))^y e^−λ(1−p) / y!]

This factors into h(x) · g(y) — each factor is the PMF of a Poisson! So X ~ Poi(λp) and Y ~ Poi(λ(1−p)) independently. This is the Poisson splitting theorem.

Check: E[X + Y] = E[X] + E[Y] requires independence. True or false?

False — the sum rule holds always, independence or not True — you need independence for any expectation decomposition

Chapter 6: Covariance

Independence is an all-or-nothing property. But in practice, we want a quantitative measure of how two variables relate. Covariance measures the extent to which X and Y deviate from their means together.

Cov(X, Y) = E[(X − E[X])(Y − E[Y])]

Unpacking this: if X is above its mean when Y is above its mean (and below when below), each term (X − E[X])(Y − E[Y]) is positive, so the covariance is positive. If they move in opposite directions, covariance is negative.

A more convenient computational formula:

Cov(X, Y) = E[XY] − E[X] · E[Y]

Covariance of independent variables: If X and Y are independent, then E[XY] = E[X]E[Y], so Cov(X,Y) = 0. But the converse is not true — Cov(X,Y) = 0 does not imply independence. (Classic counterexample: X uniform on [−1,1], Y = X². They are clearly dependent but Cov(X,Y) = 0.)

Properties of covariance:

Property	Formula
Symmetry	Cov(X, Y) = Cov(Y, X)
Self-covariance	Cov(X, X) = Var(X)
Scaling	Cov(aX + b, Y) = a · Cov(X, Y)
Bilinearity	Cov(X₁+X₂, Y) = Cov(X₁,Y) + Cov(X₂,Y)

Variance of sums. For X = X₁ + ... + X_n:

Var(X) = ∑_i ∑_j Cov(X_i, X_j)

When the X_i are independent, cross-terms vanish and Var(X) = ∑ Var(X_i). When they're not independent, the cross-covariance terms matter — this is why diversification reduces portfolio risk.

Worked example: X and Y have joint PMF: P(0,0)=0.1, P(0,1)=0.2, P(1,0)=0.3, P(1,1)=0.4.

E[X] = 0·0.3 + 1·0.7 = 0.7. E[Y] = 0·0.4 + 1·0.6 = 0.6. E[XY] = 1·1·0.4 = 0.4.

Cov(X,Y) = 0.4 − 0.7 × 0.6 = 0.4 − 0.42 = −0.02

Slightly negative: X and Y have a weak tendency to move in opposite directions.

Check: Cov(X, Y) = 0 implies X and Y are independent. True or false?

False — zero covariance means no linear relationship, but nonlinear dependence can still exist True — zero covariance is equivalent to independence

Chapter 7: Correlation

Covariance has a problem: its magnitude depends on the scale of the variables. If you measure height in centimetres vs. metres, the covariance changes by a factor of 100. Correlation fixes this by normalising:

ρ(X, Y) = Cov(X, Y) / √(Var(X) · Var(Y))

Correlation is dimensionless and always lies in [−1, +1]:

ρ	Meaning
+1	Perfect positive linear relationship: Y = aX + b with a > 0
−1	Perfect negative linear relationship: Y = aX + b with a < 0
0	No linear relationship (but possibly nonlinear dependence!)

When ρ(X,Y) = 0, we say X and Y are uncorrelated. Uncorrelated is weaker than independent: independent ⇒ uncorrelated, but uncorrelated ⇏ independent.

Correlation Explorer

Drag the slider to set a target ρ. The scatter plot shows 200 samples from a bivariate Gaussian with that correlation. Watch how the point cloud elongates and tilts.

ρ0.70

Pearson vs. Spearman: The correlation ρ we've defined is Pearson correlation — it measures linear association. Spearman correlation applies the same formula but first converts values to ranks, capturing monotonic (not just linear) relationships. Spearman is more robust to outliers. Piech notes Spearman is outside the scope of CS109.

Worked example: Let X ~ Uniform{1,2,3} and Y = 2X + 1. Then E[X] = 2, E[X²] = (1+4+9)/3 = 14/3, Var(X) = 14/3 − 4 = 2/3.

E[Y] = 2·2+1 = 5, Var(Y) = 4 · Var(X) = 8/3.

Cov(X,Y) = Cov(X, 2X+1) = 2·Var(X) = 4/3.

ρ(X,Y) = (4/3) / √(2/3 · 8/3) = (4/3) / (4/3) = 1

Perfect correlation, as expected for an exact linear relationship with positive slope.

Check: If ρ(X,Y) = −0.95, which statement is most accurate?

X and Y are independent X and Y have a strong negative linear relationship X causes Y to decrease

Chapter 8: Showcase — Interactive Bayesian Network

This is the payoff simulation. Below is the four-node disease Bayesian network from Chapter 3, fully interactive. Click any node to toggle its observed value (0 or 1). The network instantly recomputes all posterior probabilities using exact inference (enumeration over all configurations).

Try these experiments:

Click Fever = 1 and watch P(Influenza) jump from 0.18 to about 0.76.
Then also set Uni = 0 and see how the posterior changes further.
Set Tired = 1 alone — it slightly increases P(Influenza) but mainly confirms university.
Click Reset to clear all evidence and return to priors.

Interactive Bayes Net Inference

Click a node to cycle: unobserved → observed=1 → observed=0 → unobserved. Posterior probabilities update instantly.

Click a node to set evidence

What to notice: Fever is the strongest signal for influenza because P(Fever|Inf) is very high (0.9) while P(Fever|no Inf) is very low (0.05). University status alone barely shifts influenza probability. But combining evidence is more powerful than either alone — the posterior after observing two variables is not simply the average of two single-evidence posteriors.

How inference works under the hood: For this small network, we can enumerate all 2⁴ = 16 joint configurations. For each unobserved variable, we sum the joint probabilities over all configurations consistent with the evidence, then normalise. This is inference by enumeration — exact but exponential. Chapter 11 will introduce smarter algorithms.

Covariance Heatmap

The pairwise covariance between all four variables in the disease model, computed from the full joint distribution. Darker warm = positive, darker teal = negative.

Check: In the interactive Bayes net, which single observation causes the largest shift in P(Influenza)?

Observing Fever = 1 Observing Uni = 1 Observing Tired = 0

Chapter 9: Connections

This chapter brings together inference, Bayesian networks, independence, and correlation. Let's see how these ideas connect to the bigger picture.

The Bayesian worldview: Inference is the computational expression of Bayesian reasoning. Start with a prior, observe evidence, update via Bayes' theorem. Bayesian networks give this process structure — they encode which variables influence which, making inference tractable even in large models. Independence and correlation quantify when and how strongly information flows between variables.

Inference (Ch 10)

Bayes' theorem for random variables: computing P(X|Y=y)

↓ scales to many variables via

Bayesian Networks (Ch 10)

Directed graphs that factor the joint distribution compactly

↓ encode assumptions of

Conditional Independence

X_i ⊥ non-descendants | parents

↓ measured quantitatively by

Covariance & Correlation

Cov(X,Y) for magnitude, ρ(X,Y) for normalised strength

What comes next:

Chapter 11 — General inference: How to do inference efficiently in large Bayes nets with many variables. Variable elimination, sampling methods, and the complexity of exact inference.
Machine learning: MLE and MAP are the foundation of parameter estimation. When you "train a neural network," you're doing MLE over millions of parameters. Regularisation is MAP with a prior.
Probabilistic graphical models: Bayes nets are one flavour. Markov random fields (undirected graphs) are another. Factor graphs unify both. These structures are the backbone of modern probabilistic programming.
Causal inference: Bayes nets describe correlation and conditional dependence. Causal models (Pearl's do-calculus) go further and distinguish "seeing" from "doing." The graph structure is the same; the interpretation is deeper.

Concept	Key formula	Where it leads
Bayes for RVs	P(x\|y) = P(y\|x)P(x)/P(y)	Parameter estimation, filtering
MAP / MLE	argmax P(y\|x)P(x) / argmax P(y\|x)	All of machine learning
Bayes net factorisation	∏ P(x_i \| parents)	Graphical models, variational inference
Independence	P(x,y) = P(x)P(y)	Simplifying computation
Correlation	ρ = Cov(X,Y)/(σ_Xσ_Y)	Feature selection, portfolio theory

Check: Independent ⇒ uncorrelated. Uncorrelated ⇒ independent. Which is true?

Only the first: independent implies uncorrelated, but uncorrelated does not imply independent Both are true Neither is true