Given a Bayesian network and some observed evidence, how do we compute the probability of anything else? This chapter derives every major method — exact and approximate — from first principles.
Imagine a doctor who knows from statistics that 1% of patients have a rare disease. A test for the disease is 95% sensitive (catches 95% of true cases) and 90% specific (correctly rules out 90% of healthy patients). A patient just tested positive. What is the probability they actually have the disease?
This is an inference problem. We have a Bayesian network encoding beliefs about Disease, Test result, and their relationship. We have observed evidence (Test = positive). We want to compute a posterior probability — P(Disease = true | Test = positive).
The naive answer is "95% — the test is very accurate." The correct answer is about 8.7%. Understanding why requires computing posteriors precisely, which is exactly what inference algorithms do.
Adjust the disease prevalence and test accuracy. Watch how the posterior probability of disease given a positive test changes — often dramatically.
The network here is tiny — two variables. Real problems involve dozens or hundreds of variables. We need systematic algorithms. The rest of this chapter develops them.
The blunt-force approach to inference is to directly apply the definition of conditional probability. We want P(Q = q | E = e). By definition, this equals P(Q = q, E = e) / P(E = e). Both of these are just marginals of the joint distribution.
The joint distribution over all variables X1:n in a Bayesian network factorizes as a product of conditional probability tables (CPTs):
To compute P(Q = q, E = e), we sum this joint over all configurations of the remaining hidden variables H (those not in Q or E):
This is inference by enumeration: enumerate every possible assignment to the hidden variables, evaluate the joint for each, and sum up. It works. It is also catastrophically slow for large networks — if there are k hidden binary variables, there are 2k terms to sum.
Network: B (Battery) → S (Start) ← F (Fuel). We observe S = false (car won't start). Enumeration computes P(B=false|S=false) by summing over all values of F.
The enumeration algorithm has exponential time complexity in the number of hidden variables. For 30 binary hidden variables, that is over a billion terms. We need something smarter.
Enumeration is slow because it repeatedly computes the same sub-products. Variable elimination fixes this by pushing sums inside products — eliminating one variable at a time, caching the intermediate results as factors.
A factor is a function from a set of variables to a non-negative real number. Every CPT in the network is a factor. We can multiply two factors (produce a new factor over their union of variables) and marginalize a factor (sum out one variable to get a factor over fewer variables).
The algorithm:
Network: Rain → Wet ← Sprinkler ← Cloudy, Cloudy → Rain. Evidence: Wet = true. Query: P(Rain | Wet=true). Step through the elimination to see factors grow and shrink.
The cost of variable elimination depends critically on the elimination order. A good order keeps intermediate factors small. Finding the optimal order is itself NP-hard, but good heuristics (eliminate the variable that creates the smallest new factor) work well in practice.
| Operation | Inputs | Output | Cost |
|---|---|---|---|
| Factor product | f(X,Y), g(Y,Z) | h(X,Y,Z) | |X|·|Y|·|Z| |
| Marginalize out Y | h(X,Y,Z) | m(X,Z) | |X|·|Y|·|Z| |
| Reduce (observe Y=y) | f(X,Y) | r(X) | |X| |
Variable elimination answers a single query efficiently. But what if we want to compute posteriors for every variable simultaneously? Running elimination separately for each variable wastes computation. Belief propagation does all of them at once on tree-structured networks.
On a tree (a network with no undirected cycles), we pick any node as root and run two passes: a collect pass (leaves send messages to root) and a distribute pass (root sends messages back to leaves). After two passes, every node has the information it needs to compute its own marginal.
Formally, the message from node i to node j is:
The posterior at node j after collecting all messages is proportional to its own CPT times the product of all incoming messages:
Five-node chain: A → B → C → D → E. Observe E = true. Click "Collect" to propagate messages from E toward A, then "Distribute" to propagate back. Watch belief bars update.
Belief propagation is exact on trees. On networks with cycles (loopy graphs), the same message-passing equations are applied iteratively — this is called loopy belief propagation. It often converges to good approximate answers, but correctness is not guaranteed.
Here is the sobering truth: exact inference in general Bayesian networks is NP-hard. This means no algorithm can efficiently solve all instances unless P = NP. The hardness comes from networks with tightly coupled loops that prevent the factorization tricks from reducing the problem size.
The key quantity governing difficulty is treewidth — roughly, how tree-like is the network? A tree has treewidth 1. Each extra edge that creates a loop can increase treewidth. Variable elimination runs in time O(n · dw+1) where d is the domain size and w is the treewidth. On a tree (w=1), this is linear. On a dense network, it is exponential.
As network density grows, treewidth increases and inference cost explodes exponentially. Adjust domain size to see how the exponent multiplies.
The NP-hardness motivates two responses: (1) restrict to tractable network structures (trees, polytrees, networks with small treewidth), or (2) use approximate inference via sampling. The next chapters cover sampling methods.
| Method | Exact? | Works on loops? | Complexity |
|---|---|---|---|
| Enumeration | Yes | Yes | O(dn) |
| Variable Elimination | Yes | Yes | O(n · dw+1) |
| Belief Propagation | Yes (trees only) | Approximate | O(n · d2) |
| Sampling methods | Approximate | Yes | O(samples · n) |
Sampling methods sidestep the complexity barrier by estimating probabilities from random samples. The simplest approach is direct sampling (also called prior sampling or ancestral sampling): generate complete assignments to all variables by following the network structure from parents to children.
The algorithm is elegantly simple. Process variables in topological order (parents before children). For each variable, look up its CPT given the already-sampled values of its parents, then draw a random sample from that distribution. Repeat N times, then count.
Network: Cloudy → Rain → Wet ← Sprinkler ← Cloudy. Evidence: Wet = true. Watch accepted (colored) vs rejected (grey) samples. Rare evidence = high rejection rate.
Likelihood weighted (LW) sampling solves the rejection problem by forcing every sample to be consistent with the evidence. Instead of sampling evidence variables, we fix them to their observed values. To compensate for this "cheating," each sample gets a weight equal to the probability of the evidence given the sampled non-evidence variables.
For a sample x with evidence E = e, the weight is:
We then estimate the posterior as a weighted frequency:
LW sampling is much more efficient than rejection sampling for low-probability evidence. Every sample contributes, though samples with very low weights contribute little. The algorithm degrades gracefully: rare evidence leads to low-weight samples, not wasted samples.
Both methods estimate the same posterior P(Rain|Wet=true). LW (blue) uses all samples; direct (orange) rejects most. Watch effective sample size (ESS) diverge as evidence becomes rarer.
python def likelihood_weighted_sample(bn, evidence): # bn: Bayesian network with topo_order, cpd(node, parent_vals) w = 1.0 sample = {} for node in bn.topo_order: parents = bn.parents[node] parent_vals = tuple(sample[p] for p in parents) if node in evidence: val = evidence[node] w *= bn.cpd(node, parent_vals)[val] # multiply likelihood sample[node] = val else: dist = bn.cpd(node, parent_vals) sample[node] = np.random.choice(len(dist), p=dist) return sample, w
Both direct and LW sampling draw each sample independently. Gibbs sampling takes a completely different approach: it constructs a Markov chain whose stationary distribution is the posterior P(X | E = e). We run the chain for many steps and collect samples from it.
Start from any complete assignment consistent with the evidence. Then repeatedly: pick a non-evidence variable Xi at random, and resample it from its full conditional — the distribution P(Xi | all other variables fixed).
The full conditional for Xi given all others x−i is proportional to:
Two binary variables A and B with adjustable correlation. The Gibbs sampler alternately resamples each from its full conditional. Red = burn-in. Watch the posterior histogram for B converge.
python def gibbs_sample(bn, evidence, n_samples, burn_in=100): state = {**evidence} for node in bn.non_evidence_nodes: state[node] = np.random.choice(bn.domain_size[node]) samples = [] for step in range(burn_in + n_samples): node = np.random.choice(bn.non_evidence_nodes) # Full conditional from Markov blanket (local CPT lookups only) probs = bn.markov_blanket_conditional(node, state) state[node] = np.random.choice(len(probs), p=probs) if step >= burn_in: samples.append(state.copy()) return samples
When all variables are jointly Gaussian, inference has a beautiful closed-form solution. No sampling, no enumeration, no message passing — just matrix algebra.
If X and Y are jointly Gaussian with covariance structure Σ, then conditioning on Y = y gives:
This is the conditional Gaussian formula. The posterior mean is the prior mean plus a correction proportional to how surprising the observation was. The posterior covariance is the prior covariance minus the information gained from observing Y — observing evidence always reduces uncertainty.
Two jointly Gaussian variables: X (true position) and Y (noisy measurement). Grey = prior over X. Teal = posterior P(X|Y=y). Orange line = observation. High correlation means a surprising observation strongly shifts the posterior.
A Gaussian Bayesian network is one where every CPD is a linear Gaussian: Xi | pa(Xi) ~ N(μi + βiT pa(Xi), σi2). The joint over all variables is then a single multivariate Gaussian, and inference reduces to the conditional formula above.
python import numpy as np def gaussian_inference(mu_x, mu_y, sigma_xx, sigma_xy, sigma_yy, y_obs): # Scalar case: P(X | Y=y_obs) given joint Gaussian parameters K = sigma_xy / sigma_yy # Kalman gain mu_post = mu_x + K * (y_obs - mu_y) # Updated mean sigma_post = sigma_xx - K * sigma_xy # Updated variance return mu_post, sigma_post # Example: rho=0.7, observe Y=1.0 mu_p, s_p = gaussian_inference( mu_x=0, mu_y=0, sigma_xx=1.0, sigma_xy=0.7, sigma_yy=1.0, y_obs=1.0 ) # mu_post = 0.70, sigma_post = 0.51
Inference is the computational engine underlying everything else in this book. Every algorithm that reasons about uncertainty — filtering, planning, learning — calls some form of inference as a subroutine.
| Method | Best For | Limitation |
|---|---|---|
| Enumeration | Teaching, tiny networks (n < 10) | Exponential in hidden variables |
| Variable Elimination | General exact inference | Exponential in treewidth |
| Belief Propagation | Trees, all marginals at once | Approximate on loopy graphs |
| Direct Sampling | Simple estimation, common evidence | Rejects low-probability evidence |
| LW Sampling | Moderate-probability evidence | Weight degeneracy under extreme evidence |
| Gibbs Sampling | Complex posteriors, correlated vars | Slow mixing, burn-in required |
| Gaussian Inference | Linear-Gaussian models, exactly | Requires Gaussian assumption |
Chapter 4: Parameter Learning
To learn CPT parameters from data with hidden variables, the EM algorithm calls inference at every E-step to compute expected sufficient statistics under the posterior.
Chapter 6: Simple Decisions
Decision-making requires computing expected utilities — E[U(outcome) | action, evidence] — which is a weighted sum over the posterior.
Chapters 19–22: POMDPs
The "belief state" in a partially observable system is a posterior over hidden states. Updating it is exactly a Bayesian inference step at each time point.
Kalman Filter Connection
The Gaussian inference formula is the Kalman update equation. Linear-Gaussian state-space models are Gaussian Bayesian networks unrolled through time, and the Kalman filter runs exact inference on them.
Particle Filters
Direct sampling applied to dynamic models. Each particle is a sample from the prior propagated forward, reweighted by observation likelihood. This is LW sampling in disguise.
Variational Autoencoders
Neural networks trained to do approximate inference. The encoder learns P(Z|X) without explicit enumeration by maximizing a variational lower bound.
"In the long run it's not just whether you're right or wrong, but how confident you are, and how you respond to the evidence."
— Nate Silver, The Signal and the Noise