Updating beliefs across networks of random variables — from two-variable Bayes to graphical models.
You're building a medical diagnosis app. A patient walks in with a fever and fatigue. You know the prior probabilities of various diseases, and you know how likely each symptom is given each disease. The patient gives you evidence — their symptoms — and you need to update your beliefs about what disease they have.
This is inference: computing the probability of an unknown random variable given observed values of other random variables. It is the central computational task in probabilistic modelling. Everything we've built so far — joint distributions, conditional PMFs, Bayes' theorem for events — leads here.
In this chapter we'll do inference first with two random variables (straightforward Bayes' theorem), then scale up to networks of many variables, where a naive approach would require tracking 2^N combinations for N binary variables. Bayesian networks tame that exponential explosion by encoding independence assumptions in a directed graph.
We'll also cover two related but distinct concepts: independence (knowing X tells you nothing about Y) and correlation (a quantitative measure of linear association). These ideas govern when and how information flows between variables.
| Concept | What it answers |
|---|---|
| Inference | What is P(X \| observed evidence)? |
| Bayesian network | How do we represent a joint distribution compactly? |
| Independence | Does knowing X tell us anything about Y? |
| Covariance / Correlation | How strongly and in what direction do X and Y co-vary? |
In Chapter 3 we learned Bayes' theorem for events. Now we extend it to random variables. The logic is identical — every relational operator on a random variable defines an event — but the notation changes to handle PMFs and PDFs.
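Discrete case. Let X and Y be discrete random variables. The conditional PMF is:
$$P(X = x \mid Y = y) = \frac{P(X = x,\, Y = y)}{P(Y = y)}$$
And Bayes' theorem becomes:
$$P(X = x \mid Y = y) = \frac{P(Y = y \mid X = x)\, P(X = x)}{P(Y = y)}$$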
The denominator P(Y = y) is computed via the law of total probability: sum P(Y = y | X = x') · P(X = x') over all possible values x' of X.
Shorthand notation. We write P(x | y) as shorthand for P(X = x | Y = y). This keeps formulas compact when juggling many variables.
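The discrete update can be sketched in a few lines of Python. This is only an illustration: the function name `posterior` and the flu/fever numbers are made up for the example and do not come from the chapter.

```python
def posterior(prior, likelihood, y):
    """Return P(X = x | Y = y) for every x.

    prior:      dict mapping x -> P(X = x)
    likelihood: dict mapping x -> dict mapping y -> P(Y = y | X = x)
    """
    # Numerator of Bayes' theorem for each candidate x.
    unnormalised = {x: likelihood[x][y] * prior[x] for x in prior}
    # Denominator: law of total probability, summing over all x'.
    evidence = sum(unnormalised.values())
    return {x: value / evidence for x, value in unnormalised.items()}


# Hypothetical numbers: X = disease status, Y = whether a fever is observed.
prior = {"flu": 0.1, "healthy": 0.9}
likelihood = {
    "flu": {"fever": 0.9, "no fever": 0.1},
    "healthy": {"fever": 0.05, "no fever": 0.95},
}
post = posterior(prior, likelihood, "fever")
print(post)
```

With these assumed inputs the posterior for "flu" is 0.09 / (0.09 + 0.045) = 2/3, even though the prior was only 0.1.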
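Mixed case. When X is continuous and N is discrete:
$$f(X = x \mid N = n) = \frac{P(N = n \mid X = x)\, f(X = x)}{P(N = n)}, \qquad P(N = n \mid X = x) = \frac{f(X = x \mid N = n)\, P(N = n)}{f(X = x)}$$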
The rule: whenever the variable on the left of the conditional is continuous, use a density f; when discrete, use a probability P. The derivation uses the approximation P(X = x) ≈ f(X = x) · ε, and the ε terms cancel.
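Worked example (elephant inference): Girl elephants weigh N(μ=160, σ²=49) kg at birth; boys weigh N(μ=165, σ²=9) kg. A newborn weighs 163 kg. What is P(girl)?
Let G = 1 mean girl, with P(G=1) = 0.5. We need P(G=1 | X=163):
$$P(G=1 \mid X=163) = \frac{f(X=163 \mid G=1)\, P(G=1)}{f(X=163)}$$
The likelihood f(X=163 | G=1) is the Gaussian PDF with μ=160, σ=7, evaluated at 163:
$$f(X=163 \mid G=1) = \frac{1}{7\sqrt{2\pi}} \exp\!\left(-\frac{(163-160)^2}{2 \cdot 49}\right) \approx 0.0520$$
Similarly, f(X=163 | G=0) uses μ=165, σ=3:
$$f(X=163 \mid G=0) = \frac{1}{3\sqrt{2\pi}} \exp\!\left(-\frac{(163-165)^2}{2 \cdot 9}\right) \approx 0.1065$$
The denominator via total probability: f(163) = f(163|girl)·0.5 + f(163|boy)·0.5. Plugging in numerically:
$$P(G=1 \mid X=163) = \frac{0.0520 \times 0.5}{0.0520 \times 0.5 + 0.1065 \times 0.5} \approx 0.328$$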
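In raw kilograms 163 is nearer the boy mean (2 kg from 165 vs. 3 kg from 160), although measured in standard deviations it sits closer to the girl mean (z ≈ 0.43 vs. z ≈ 0.67). The boy hypothesis still wins because the boy distribution is much tighter (σ=3 vs σ=7): its density peaks more than twice as high and is still the larger of the two at 163. The posterior favours boy at about 67%.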
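The elephant calculation can be checked numerically. The sketch below uses only the standard library; the helper name `normal_pdf` is our own choice, not something defined in the chapter.

```python
import math


def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) evaluated at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))


weight = 163
f_girl = normal_pdf(weight, mu=160, sigma=7)  # sigma^2 = 49
f_boy = normal_pdf(weight, mu=165, sigma=3)   # sigma^2 = 9
p_girl = 0.5                                  # prior P(G = 1)

# Law of total probability for the density of the observed weight.
evidence = f_girl * p_girl + f_boy * (1 - p_girl)
posterior_girl = f_girl * p_girl / evidence

print(f"f(163 | girl) = {f_girl:.4f}")
print(f"f(163 | boy)  = {f_boy:.4f}")
print(f"P(girl | 163) = {posterior_girl:.3f}")
```

The two densities come out near 0.052 and 0.106, so the posterior probability of girl is about 0.33 and of boy about 0.67.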
Inference gives us the full posterior distribution P(X | evidence). But often we just want a single "best guess." There are two natural choices, and they differ in whether you account for the prior.
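Maximum A Posteriori (MAP) picks the value of X that maximises the posterior:
$$\hat{x}_{\text{MAP}} = \arg\max_x P(X = x \mid Y = y) = \arg\max_x P(Y = y \mid X = x)\, P(X = x)$$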
We can drop the denominator P(Y = y) because it doesn't depend on x. MAP uses both the likelihood and the prior — it asks "what value of X makes the observed evidence most probable, weighted by how likely X was a priori?"
Maximum Likelihood Estimation (MLE) ignores the prior entirely:
$$\hat{x}_{\text{MLE}} = \arg\max_x P(Y = y \mid X = x)$$
MLE asks "what value of X makes the observed data most probable?" without any prior belief. This is equivalent to MAP with a uniform prior.
Worked example: Back to the elephant. The MLE asks: which sex makes 163 kg most likely? We compare f(163 | girl) vs f(163 | boy). Since f(163 | boy) is larger (as we computed), MLE says "boy." MAP also says "boy" because the prior is 50/50 — with equal priors, MAP = MLE.
Now suppose 70% of elephants born at this zoo are girls. Then P(G=1) = 0.7 and P(G=0) = 0.3. The MAP calculation becomes:
$$\hat{g}_{\text{MAP}} = \arg\max_{g \in \{0,1\}} f(X=163 \mid G=g)\, P(G=g)$$
Working out the numbers: the girl side is roughly 0.0520 × 0.7 = 0.0364; the boy side is roughly 0.1065 × 0.3 = 0.0319. They're almost tied, but the strong prior towards girl narrowly overcomes the likelihood advantage of boy, giving P(girl | 163) ≈ 0.53. MLE still says "boy" (it ignores the 70% girl prior), while MAP now says "girl", though only just.
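A short sketch makes the MLE/MAP comparison concrete by recomputing both scores for the 70% girl prior. The dictionary layout and variable names are illustrative choices, not part of the chapter.

```python
import math


def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) evaluated at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))


likelihood = {
    "girl": normal_pdf(163, mu=160, sigma=7),
    "boy": normal_pdf(163, mu=165, sigma=3),
}
prior = {"girl": 0.7, "boy": 0.3}

# MLE: maximise the likelihood alone.
mle = max(likelihood, key=likelihood.get)

# MAP: maximise likelihood times prior; the evidence term is a shared
# constant, so it can be dropped from the comparison.
score = {sex: likelihood[sex] * prior[sex] for sex in prior}
map_estimate = max(score, key=score.get)

print("MLE:", mle)
print("MAP:", map_estimate, {k: round(v, 4) for k, v in score.items()})
```

Changing `prior` back to 0.5/0.5 makes the two estimates agree again, which is the uniform-prior equivalence stated above.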
| Method | Formula | Uses prior? |
|---|---|---|
| MAP | argmax_x P(y \| x) P(x) | Yes |
| MLE | argmax_x P(y \| x) | No |
| Full Bayesian | Entire posterior P(x \| y) | Yes |
Consider a medical model with 100 binary random variables: demographics, conditions, and symptoms. To fully specify the joint distribution, you'd need to fill in a table with 2^100 > 10^30 entries, far more than any computer could store, let alone estimate from data. That's clearly infeasible.
Bayesian networks (Bayes nets) solve this by encoding the generative process — the causal flow of influence between variables — as a directed acyclic graph (DAG). Each node is a random variable. An arrow from X to Y means "X directly influences Y." We say X is a parent of Y.
Here's a concrete example. Consider four binary variables: University (U), Influenza (I), Fever (F), Tired (T). Being in university influences whether you get influenza. Having influenza influences whether you have a fever. Both university and influenza influence whether you're tired.
Figure: A simple four-node Bayes net with edges U → I, I → F, U → T and I → T. Arrows show the direction of causal influence.
To fully specify this Bayes net, we provide the conditional probability of each variable given its parents:
| Variable | Parent values | P(variable = 1) |
|---|---|---|
| Uni | (no parents) | 0.80 |
| Influenza | Uni=1 | 0.20 |
| Influenza | Uni=0 | 0.10 |
| Fever | Inf=1 | 0.90 |
| Fever | Inf=0 | 0.05 |
| Tired | Uni=0, Inf=0 | 0.10 |
| Tired | Uni=0, Inf=1 | 0.90 |
| Tired | Uni=1, Inf=0 | 0.80 |
| Tired | Uni=1, Inf=1 | 1.00 |
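Worked example: What is P(Fever=1, Tired=1, Influenza=1, Uni=1)? Multiply one factor per node, each conditioned on its parents:
$$P(F=1, T=1, I=1, U=1) = P(U=1)\, P(I=1 \mid U=1)\, P(F=1 \mid I=1)\, P(T=1 \mid U=1, I=1) = 0.80 \times 0.20 \times 0.90 \times 1.00 = 0.144$$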
Why this works: From the chain rule, the exact joint is P(x_1, ..., x_n) = ∏_i P(x_i | x_{i−1}, ..., x_1). With the variables ordered so that parents come before their children, the Bayes net assumes P(x_i | x_{i−1}, ..., x_1) = P(x_i | parents of X_i). This is a conditional independence assumption: each variable, given its parents, is independent of all its non-descendants.
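The factorisation translates directly into code. The sketch below stores the conditional probability tables from the table above in plain dictionaries; the names `bernoulli` and `joint` are our own, chosen for illustration.

```python
# Conditional probability tables, each giving P(variable = 1 | parents).
P_U = 0.80
P_I = {1: 0.20, 0: 0.10}                    # keyed by Uni
P_F = {1: 0.90, 0: 0.05}                    # keyed by Influenza
P_T = {(0, 0): 0.10, (0, 1): 0.90,
       (1, 0): 0.80, (1, 1): 1.00}          # keyed by (Uni, Influenza)


def bernoulli(p_one, value):
    """P(V = value) for a binary V with P(V = 1) = p_one."""
    return p_one if value == 1 else 1.0 - p_one


def joint(u, i, f, t):
    """P(U=u, I=i, F=f, T=t) as a product of one factor per node."""
    return (bernoulli(P_U, u)
            * bernoulli(P_I[u], i)
            * bernoulli(P_F[i], f)
            * bernoulli(P_T[(u, i)], t))


print(joint(u=1, i=1, f=1, t=1))
```

Only 1 + 2 + 2 + 4 = 9 numbers are stored here, compared with the 15 free parameters of an unrestricted joint table over four binary variables.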
The power of Bayesian networks comes from conditional independence. The Bayes net encodes a specific independence assumption: each variable X_i, given its parents, is conditionally independent of all its non-descendants.
What does this mean concretely? In our disease model, once you know whether someone has influenza, learning their university status tells you nothing extra about their fever. Formally:
$$P(F \mid I, U) = P(F \mid I)$$
Fever depends on Influenza (its parent). University is a non-descendant of Fever, and once we condition on Influenza (which is Fever's parent), University provides no additional information.
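The chain rule decomposition. Without any independence assumptions, the chain rule for n variables is:
$$P(x_1, \dots, x_n) = \prod_{i=1}^{n} P(x_i \mid x_1, \dots, x_{i-1})$$
Each factor conditions on all preceding variables. The Bayes net simplifies each factor:
$$P(x_i \mid x_1, \dots, x_{i-1}) = P(x_i \mid \text{parents}(X_i))$$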
This is exactly the conditional independence statement. It says: knowing the parents is sufficient; non-descendants add nothing.
Worked example: In our 4-node disease model, compute P(Uni=1 | Fever=1).
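We need to marginalise over Influenza. By Bayes' theorem:
$$P(U=1 \mid F=1) = \frac{P(F=1 \mid U=1)\, P(U=1)}{P(F=1 \mid U=1)\, P(U=1) + P(F=1 \mid U=0)\, P(U=0)}$$
First compute P(F=1|U=1) by marginalising over I, using the fact that Fever depends on Uni only through Influenza:
$$P(F=1 \mid U=1) = P(F=1 \mid I=1)\, P(I=1 \mid U=1) + P(F=1 \mid I=0)\, P(I=0 \mid U=1) = 0.90 \times 0.20 + 0.05 \times 0.80 = 0.22$$
Similarly, P(F=1|U=0) = 0.90 × 0.10 + 0.05 × 0.90 = 0.135. Then:
$$P(U=1 \mid F=1) = \frac{0.22 \times 0.80}{0.22 \times 0.80 + 0.135 \times 0.20} = \frac{0.176}{0.203} \approx 0.867$$
So observing a fever nudges P(Uni=1) from 0.80 up to about 0.867. The direction makes sense: university students are more likely to have influenza (0.20 vs. 0.10) and hence fever.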
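The same query can be reproduced step by step in code. This is a sketch that hard-codes the three tables the query needs; the function name `p_fever_given_uni` is an illustrative choice.

```python
p_u = {1: 0.80, 0: 0.20}            # P(U = u)
p_i_given_u = {1: 0.20, 0: 0.10}    # P(I = 1 | U = u)
p_f_given_i = {1: 0.90, 0: 0.05}    # P(F = 1 | I = i)


def p_fever_given_uni(u):
    """P(F = 1 | U = u), summing out Influenza."""
    p_i1 = p_i_given_u[u]
    return p_f_given_i[1] * p_i1 + p_f_given_i[0] * (1 - p_i1)


# Bayes' theorem with the denominator from the law of total probability.
numerator = p_fever_given_uni(1) * p_u[1]
evidence = sum(p_fever_given_uni(u) * p_u[u] for u in (0, 1))
p_uni_given_fever = numerator / evidence

print(round(p_uni_given_fever, 3))
```

Tired never appears because it is neither observed nor queried, so summing it out contributes a factor of 1.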
Two random variables X and Y are independent if knowing the value of one tells you absolutely nothing about the other. Formally:
$$P(X = x,\, Y = y) = P(X = x)\, P(Y = y) \quad \text{for all } x, y$$
For continuous variables, the equivalent statement uses CDFs or PDFs:
$$F_{X,Y}(x, y) = F_X(x)\, F_Y(y) \quad \text{or equivalently} \quad f_{X,Y}(x, y) = f_X(x)\, f_Y(y) \quad \text{for all } x, y$$
A useful test: if the joint distribution factors into a product of a function of x alone and a function of y alone, the variables are independent. That is, if P(X=x, Y=y) = h(x) · g(y) for all x and y, for some functions h and g, then X and Y are independent.
Expectation of products. If X and Y are independent, the expectation of their product factors:
$$E[XY] = E[X]\, E[Y]$$
More generally, E[g(X) · h(Y)] = E[g(X)] · E[h(Y)] for any functions g, h. Note: the converse is false! E[XY] = E[X]E[Y] does not prove independence.
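A standard counterexample shows why the converse fails. The sketch below takes X uniform on {−1, 0, 1} and Y = X², which is our own choice of example rather than one from the chapter, and uses exact fractions so the comparisons are not affected by rounding.

```python
from fractions import Fraction

# X is uniform on {-1, 0, 1} and Y = X^2, so Y is a function of X.
support = [-1, 0, 1]
p = Fraction(1, 3)

e_x = sum(p * x for x in support)
e_y = sum(p * x**2 for x in support)
e_xy = sum(p * x * x**2 for x in support)

print(e_xy == e_x * e_y)       # the product rule for expectations holds

# Independence would need P(X=0, Y=0) = P(X=0) * P(Y=0).
p_joint = Fraction(1, 3)       # X = 0 forces Y = 0
p_product = Fraction(1, 3) * Fraction(1, 3)
print(p_joint == p_product)    # the factorisation fails
```

Here E[XY] = E[X]E[Y] = 0, yet Y is completely determined by X, so the two are as dependent as they can be.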
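Worked example (Poisson splitting): The number of requests N reaching a web server in a fixed interval follows Poi(λ). Each request is independently human with probability p or bot with probability 1−p. Let X = number of human requests and Y = number of bot requests, so N = X + Y. Show X ⊥ Y.
Given N = n, the requests split independently, so (X | N = n) ~ Bin(n, p) and (Y | N = n) ~ Bin(n, 1−p). We compute the joint via the chain rule, noting that X = x and Y = y together force N = x + y:
$$P(X=x, Y=y) = P(X=x \mid N=x+y)\, P(N=x+y) = \binom{x+y}{x} p^x (1-p)^y \cdot \frac{e^{-\lambda} \lambda^{x+y}}{(x+y)!}$$
Expanding C(x+y, x) = (x+y)! / (x! y!), cancelling (x+y)!, and splitting e^{−λ} = e^{−λp} · e^{−λ(1−p)}:
$$P(X=x, Y=y) = \frac{e^{-\lambda p} (\lambda p)^x}{x!} \cdot \frac{e^{-\lambda (1-p)} (\lambda (1-p))^y}{y!}$$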
This factors into h(x) · g(y) — each factor is the PMF of a Poisson! So X ~ Poi(λp) and Y ~ Poi(λ(1−p)) independently. This is the Poisson splitting theorem.
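The factorisation can be sanity-checked numerically. The sketch below compares the chain-rule joint with the product of the two Poisson PMFs on a small grid; the values λ = 4 and p = 0.3 are arbitrary illustrative choices, and a grid check is evidence rather than a proof.

```python
import math


def poisson_pmf(k, rate):
    """P(K = k) for K ~ Poi(rate)."""
    return math.exp(-rate) * rate**k / math.factorial(k)


def binomial_pmf(k, n, p):
    """P(K = k) for K ~ Bin(n, p)."""
    return math.comb(n, k) * p**k * (1 - p)**(n - k)


lam, p = 4.0, 0.3   # illustrative values, not from the text


def joint_pmf(x, y):
    """P(X=x, Y=y) = P(X=x | N=x+y) * P(N=x+y)."""
    n = x + y
    return binomial_pmf(x, n, p) * poisson_pmf(n, lam)


# Largest disagreement between the joint and the product of marginals.
max_gap = max(
    abs(joint_pmf(x, y) - poisson_pmf(x, lam * p) * poisson_pmf(y, lam * (1 - p)))
    for x in range(10)
    for y in range(10)
)
print(max_gap)
```

The gap should be at the level of floating-point rounding, consistent with X ~ Poi(λp) and Y ~ Poi(λ(1−p)) being independent.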
Independence is an all-or-nothing property. But in practice, we want a quantitative measure of how two variables relate. Covariance measures the extent to which X and Y deviate from their means together:
$$\mathrm{Cov}(X, Y) = E\big[(X - E[X])(Y - E[Y])\big]$$
Unpacking this: if X is above its mean when Y is above its mean (and below when below), each term (X − E[X])(Y − E[Y]) is positive, so the covariance is positive. If they move in opposite directions, covariance is negative.
A more convenient computational formula:
$$\mathrm{Cov}(X, Y) = E[XY] - E[X]\, E[Y]$$
Properties of covariance:
| Property | Formula |
|---|---|
| Symmetry | Cov(X, Y) = Cov(Y, X) |
| Self-covariance | Cov(X, X) = Var(X) |
| Scaling | Cov(aX + b, Y) = a · Cov(X, Y) |
| Bilinearity | Cov(X_1 + X_2, Y) = Cov(X_1, Y) + Cov(X_2, Y) |
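Variance of sums. For X = X_1 + ... + X_n:
$$\mathrm{Var}(X) = \sum_{i=1}^{n} \mathrm{Var}(X_i) + 2 \sum_{i<j} \mathrm{Cov}(X_i, X_j)$$
When the X_i are independent, the cross-terms vanish and Var(X) = ∑ Var(X_i). When they're not independent, the covariance terms matter: positively correlated terms inflate the total variance, while weakly or negatively correlated terms keep it down, which is why diversification reduces portfolio risk.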
Worked example: X and Y have joint PMF: P(0,0)=0.1, P(0,1)=0.2, P(1,0)=0.3, P(1,1)=0.4.
E[X] = 0·0.3 + 1·0.7 = 0.7. E[Y] = 0·0.4 + 1·0.6 = 0.6. E[XY] = 1·1·0.4 = 0.4.
$$\mathrm{Cov}(X, Y) = E[XY] - E[X]\, E[Y] = 0.4 - 0.7 \times 0.6 = -0.02$$
Slightly negative: X and Y have a weak tendency to move in opposite directions.
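The covariance of this joint PMF can be computed both ways, from the definition and from the shortcut formula. The sketch below stores the PMF as a dictionary keyed by (x, y); the variable names are illustrative.

```python
# Joint PMF from the worked example, keyed by (x, y).
joint = {(0, 0): 0.1, (0, 1): 0.2, (1, 0): 0.3, (1, 1): 0.4}

e_x = sum(p * x for (x, y), p in joint.items())
e_y = sum(p * y for (x, y), p in joint.items())
e_xy = sum(p * x * y for (x, y), p in joint.items())

# Shortcut formula: E[XY] - E[X]E[Y].
cov_shortcut = e_xy - e_x * e_y
# Definition: E[(X - E[X])(Y - E[Y])].
cov_definition = sum(p * (x - e_x) * (y - e_y) for (x, y), p in joint.items())

print(round(cov_shortcut, 4), round(cov_definition, 4))
```

Both routes give −0.02 up to floating-point rounding, as the algebra says they must.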
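Covariance has a problem: its magnitude depends on the scale of the variables. If you measure one variable, say height, in centimetres instead of metres, the covariance changes by a factor of 100. Correlation fixes this by normalising:
$$\rho(X, Y) = \frac{\mathrm{Cov}(X, Y)}{\sqrt{\mathrm{Var}(X)\, \mathrm{Var}(Y)}}$$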
Correlation is dimensionless and always lies in [−1, +1]:
| ρ | Meaning |
|---|---|
| +1 | Perfect positive linear relationship: Y = aX + b with a > 0 |
| −1 | Perfect negative linear relationship: Y = aX + b with a < 0 |
| 0 | No linear relationship (but possibly nonlinear dependence!) |
When ρ(X,Y) = 0, we say X and Y are uncorrelated. Uncorrelated is weaker than independent: independent ⇒ uncorrelated, but uncorrelated ⇏ independent.
Figure: scatter plot of 200 samples from a bivariate Gaussian with a chosen correlation ρ. As |ρ| grows, the point cloud elongates and tilts.
Worked example: Let X ~ Uniform{1,2,3} and Y = 2X + 1. Then E[X] = 2, E[X²] = (1+4+9)/3 = 14/3, Var(X) = 14/3 − 4 = 2/3.
E[Y] = 2·2+1 = 5, Var(Y) = 4 · Var(X) = 8/3.
Cov(X,Y) = Cov(X, 2X+1) = 2·Var(X) = 4/3.
$$\rho(X, Y) = \frac{4/3}{\sqrt{(2/3)(8/3)}} = \frac{4/3}{4/3} = 1$$
Perfect correlation, as expected for an exact linear relationship with positive slope.
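The same numbers can be recomputed from first principles, without using the scaling properties. The sketch below uses exact fractions; because ρ involves a square root, it checks ρ² exactly and ρ itself in floating point. The helper name `mean` is our own.

```python
from fractions import Fraction

xs = [1, 2, 3]
p = Fraction(1, 3)              # X is uniform on {1, 2, 3}
ys = [2 * x + 1 for x in xs]    # Y = 2X + 1


def mean(values):
    """Expectation of a quantity listed once per outcome of X."""
    return sum(p * v for v in values)


e_x, e_y = mean(xs), mean(ys)
var_x = mean([x * x for x in xs]) - e_x**2
var_y = mean([y * y for y in ys]) - e_y**2
cov_xy = mean([x * y for x, y in zip(xs, ys)]) - e_x * e_y

# rho^2 stays an exact fraction; rho itself needs a square root.
rho_squared = cov_xy**2 / (var_x * var_y)
rho = float(cov_xy) / (float(var_x) * float(var_y)) ** 0.5

print(var_x, var_y, cov_xy, rho_squared, rho)
```

Replacing the slope 2 with a negative number flips the sign of `cov_xy` and gives ρ = −1, matching the table above.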
This is the payoff simulation. Below is the four-node disease Bayesian network introduced earlier in this chapter, fully interactive. Click any node to toggle its observed value (0 or 1). The network instantly recomputes all posterior probabilities using exact inference (enumeration over all configurations).
Try these experiments:
Click a node to cycle: unobserved → observed=1 → observed=0 → unobserved. Posterior probabilities update instantly.
How inference works under the hood: For this small network, we can enumerate all 2^4 = 16 joint configurations. For each unobserved variable, we sum the joint probabilities over all configurations consistent with the evidence, then normalise. This is inference by enumeration: exact but exponential. Chapter 11 will introduce smarter algorithms.
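Inference by enumeration for the four-node network fits in a short script. This is a sketch of the procedure just described, not the code behind the interactive widget; the function names and the evidence format (a dictionary such as `{"F": 1}`) are assumptions made for the example.

```python
from itertools import product

NAMES = ("U", "I", "F", "T")
P_U = 0.80
P_I = {1: 0.20, 0: 0.10}                     # P(I = 1 | U)
P_F = {1: 0.90, 0: 0.05}                     # P(F = 1 | I)
P_T = {(0, 0): 0.10, (0, 1): 0.90,
       (1, 0): 0.80, (1, 1): 1.00}           # P(T = 1 | U, I)


def bern(p_one, value):
    """P(V = value) for a binary V with P(V = 1) = p_one."""
    return p_one if value == 1 else 1.0 - p_one


def joint(u, i, f, t):
    """Joint probability from the Bayes net factorisation."""
    return bern(P_U, u) * bern(P_I[u], i) * bern(P_F[i], f) * bern(P_T[(u, i)], t)


def posteriors(evidence):
    """Return P(V = 1 | evidence) for every unobserved variable V."""
    totals = {name: 0.0 for name in NAMES if name not in evidence}
    norm = 0.0
    for values in product((0, 1), repeat=4):   # all 16 configurations
        config = dict(zip(NAMES, values))
        if any(config[k] != v for k, v in evidence.items()):
            continue                            # inconsistent with evidence
        weight = joint(*values)
        norm += weight
        for name in totals:
            if config[name] == 1:
                totals[name] += weight
    return {name: total / norm for name, total in totals.items()}


print(posteriors({"F": 1}))
```

Calling `posteriors({})` returns the prior marginals, and adding more evidence, for example `{"F": 1, "T": 1}`, simply filters out more of the 16 configurations before normalising.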
The pairwise covariance between all four variables in the disease model, computed from the full joint distribution. Darker warm = positive, darker teal = negative.
This chapter brings together inference, Bayesian networks, independence, and correlation. Let's see how these ideas connect to the bigger picture.
What comes next:
| Concept | Key formula | Where it leads |
|---|---|---|
| Bayes for RVs | P(x \| y) = P(y \| x) P(x) / P(y) | Parameter estimation, filtering |
| MAP / MLE | argmax_x P(y \| x) P(x) / argmax_x P(y \| x) | All of machine learning |
| Bayes net factorisation | ∏_i P(x_i \| parents(X_i)) | Graphical models, variational inference |
| Independence | P(x,y) = P(x)P(y) | Simplifying computation |
| Correlation | ρ = Cov(X,Y)/(σ_X σ_Y) | Feature selection, portfolio theory |