Kochenderfer, Wheeler & Wray, Chapter 3

Inference

Given a Bayesian network and some observed evidence, how do we compute the probability of anything else? This chapter derives every major method — exact and approximate — from first principles.

Prerequisites: Chapter 2 (Representation) — Bayesian networks, conditional probability, joint distributions.
10
Chapters
5+
Simulations
pp 43–69
Source

Chapter 0: The Question Behind the Network

Imagine a doctor who knows from statistics that 1% of patients have a rare disease. A test for the disease is 95% sensitive (catches 95% of true cases) and 90% specific (correctly rules out 90% of healthy patients). A patient just tested positive. What is the probability they actually have the disease?

This is an inference problem. We have a Bayesian network encoding beliefs about Disease, Test result, and their relationship. We have observed evidence (Test = positive). We want to compute a posterior probability — P(Disease = true | Test = positive).

The naive answer is "95% — the test is very accurate." The correct answer is about 8.7%. Understanding why requires computing posteriors precisely, which is exactly what inference algorithms do.

The core task: Given a Bayesian network over variables X1, …, Xn, a set of query variables Q, and a set of evidence variables E observed to take values e, compute P(Q | E = e).
The Doctor’s Dilemma

Adjust the disease prevalence and test accuracy. Watch how the posterior probability of disease given a positive test changes — often dramatically.

Prevalence P(D) 1%
Sensitivity P(T+|D) 95%
Specificity P(T−|¬D) 90%

The network here is tiny — two variables. Real problems involve dozens or hundreds of variables. We need systematic algorithms. The rest of this chapter develops them.

In the inference problem, what are the "evidence variables"?

Chapter 1: Exact Inference by Enumeration

The blunt-force approach to inference is to directly apply the definition of conditional probability. We want P(Q = q | E = e). By definition, this equals P(Q = q, E = e) / P(E = e). Both of these are just marginals of the joint distribution.

The joint distribution over all variables X1:n in a Bayesian network factorizes as a product of conditional probability tables (CPTs):

P(X1, …, Xn) = ∏i P(Xi | pa(Xi))

To compute P(Q = q, E = e), we sum this joint over all configurations of the remaining hidden variables H (those not in Q or E):

P(Q = q, E = e) = ∑hi P(Xi | pa(Xi))

This is inference by enumeration: enumerate every possible assignment to the hidden variables, evaluate the joint for each, and sum up. It works. It is also catastrophically slow for large networks — if there are k hidden binary variables, there are 2k terms to sum.

Why it is correct: Marginalization (summing over hidden variables) is exact because we are literally accounting for every possible world. The conditional probability formula P(Q|E) = P(Q,E)/P(E) requires no approximation. The only cost is computational.
Enumeration on a Small Network

Network: B (Battery) → S (Start) ← F (Fuel). We observe S = false (car won't start). Enumeration computes P(B=false|S=false) by summing over all values of F.

P(B=bad) 0.10
P(F=empty) 0.05

The enumeration algorithm has exponential time complexity in the number of hidden variables. For 30 binary hidden variables, that is over a billion terms. We need something smarter.

In enumeration, why do we divide P(Q,E) by P(E)?

Chapter 2: Variable Elimination — Smarter Summing

Enumeration is slow because it repeatedly computes the same sub-products. Variable elimination fixes this by pushing sums inside products — eliminating one variable at a time, caching the intermediate results as factors.

A factor is a function from a set of variables to a non-negative real number. Every CPT in the network is a factor. We can multiply two factors (produce a new factor over their union of variables) and marginalize a factor (sum out one variable to get a factor over fewer variables).

The key trick: Instead of ∑h1,h2,h3 f(h1) · g(h1,h2) · p(h2,h3) — which iterates over all triples — we compute ∑h1 f(h1) · g(h1,h2) first (a new factor over h2 only), then ∑h2 [result](h2) · p(h2,h3). Each step is smaller.

The algorithm:

1. Initialize
Start with all CPTs as factors. Set evidence variables to their observed values (reduce each factor).
2. Choose variable to eliminate
Pick a hidden variable Xi. Multiply all factors involving Xi into one big factor.
3. Marginalize
Sum the big factor over Xi, producing a new factor that no longer involves Xi.
↻ repeat for all hidden variables
4. Normalize
Multiply remaining factors (over query variables only). Normalize to get the posterior.
Step-Through Variable Elimination

Network: Rain → Wet ← Sprinkler ← Cloudy, Cloudy → Rain. Evidence: Wet = true. Query: P(Rain | Wet=true). Step through the elimination to see factors grow and shrink.

The cost of variable elimination depends critically on the elimination order. A good order keeps intermediate factors small. Finding the optimal order is itself NP-hard, but good heuristics (eliminate the variable that creates the smallest new factor) work well in practice.

OperationInputsOutputCost
Factor productf(X,Y), g(Y,Z)h(X,Y,Z)|X|·|Y|·|Z|
Marginalize out Yh(X,Y,Z)m(X,Z)|X|·|Y|·|Z|
Reduce (observe Y=y)f(X,Y)r(X)|X|
Why does variable elimination outperform naive enumeration on large networks?

Chapter 3: Belief Propagation

Variable elimination answers a single query efficiently. But what if we want to compute posteriors for every variable simultaneously? Running elimination separately for each variable wastes computation. Belief propagation does all of them at once on tree-structured networks.

On a tree (a network with no undirected cycles), we pick any node as root and run two passes: a collect pass (leaves send messages to root) and a distribute pass (root sends messages back to leaves). After two passes, every node has the information it needs to compute its own marginal.

What is a "message"? When node Xi sends a message to its neighbor Xj, it is saying: "Given everything I know from my subtree, here is a factor over your values." Multiplying all incoming messages gives the posterior at each node.

Formally, the message from node i to node j is:

mi→j(xj) = ∑xi P(xi | pa(xi)) · ∏k ∈ nb(i)\{j} mk→i(xi)

The posterior at node j after collecting all messages is proportional to its own CPT times the product of all incoming messages:

P(Xj | e) ∝ P(Xj | pa(Xj)) · ∏i ∈ nb(j) mi→j(Xj)
Message Passing on a Tree

Five-node chain: A → B → C → D → E. Observe E = true. Click "Collect" to propagate messages from E toward A, then "Distribute" to propagate back. Watch belief bars update.

Belief propagation is exact on trees. On networks with cycles (loopy graphs), the same message-passing equations are applied iteratively — this is called loopy belief propagation. It often converges to good approximate answers, but correctness is not guaranteed.

On a tree-structured Bayesian network, how many message-passing passes does belief propagation need to compute all marginals exactly?

Chapter 4: Computational Complexity

Here is the sobering truth: exact inference in general Bayesian networks is NP-hard. This means no algorithm can efficiently solve all instances unless P = NP. The hardness comes from networks with tightly coupled loops that prevent the factorization tricks from reducing the problem size.

The key quantity governing difficulty is treewidth — roughly, how tree-like is the network? A tree has treewidth 1. Each extra edge that creates a loop can increase treewidth. Variable elimination runs in time O(n · dw+1) where d is the domain size and w is the treewidth. On a tree (w=1), this is linear. On a dense network, it is exponential.

Intuition for treewidth: Imagine eliminating variables one at a time. Each elimination potentially "connects" the remaining neighbors of the eliminated variable. Treewidth measures the worst-case clique size that forms during the best possible elimination order. High treewidth means unavoidable large intermediate factors.
Treewidth vs Inference Cost

As network density grows, treewidth increases and inference cost explodes exponentially. Adjust domain size to see how the exponent multiplies.

Effective treewidth w 3
Domain size d 2

The NP-hardness motivates two responses: (1) restrict to tractable network structures (trees, polytrees, networks with small treewidth), or (2) use approximate inference via sampling. The next chapters cover sampling methods.

MethodExact?Works on loops?Complexity
EnumerationYesYesO(dn)
Variable EliminationYesYesO(n · dw+1)
Belief PropagationYes (trees only)ApproximateO(n · d2)
Sampling methodsApproximateYesO(samples · n)
What does treewidth measure about a Bayesian network?

Chapter 5: Direct Sampling

Sampling methods sidestep the complexity barrier by estimating probabilities from random samples. The simplest approach is direct sampling (also called prior sampling or ancestral sampling): generate complete assignments to all variables by following the network structure from parents to children.

The algorithm is elegantly simple. Process variables in topological order (parents before children). For each variable, look up its CPT given the already-sampled values of its parents, then draw a random sample from that distribution. Repeat N times, then count.

1. Topological sort
Order variables so every parent appears before its children.
2. Sample each variable
For each Xi in order: draw from P(Xi | pa(Xi) = already sampled values).
3. Record the sample
One complete assignment to all variables. Discard if inconsistent with evidence.
↻ repeat N times
4. Estimate
P(Q=q|E=e) ≈ (# samples with Q=q AND E=e) / (# samples with E=e)
Why it works: Each complete sample is drawn from the joint distribution P(X1, …, Xn). By the law of large numbers, the empirical frequency of any event converges to its true probability as the number of samples grows. Direct sampling produces unbiased estimates.
The rejection problem: To estimate P(Q|E=e), we throw away any sample where E ≠ e. If P(E=e) is small, almost all samples are rejected. With 1% evidence probability, we need ~100 samples to get one useful sample. This is direct sampling's fatal flaw for rare evidence.
Direct Sampling Demo

Network: Cloudy → Rain → Wet ← Sprinkler ← Cloudy. Evidence: Wet = true. Watch accepted (colored) vs rejected (grey) samples. Rare evidence = high rejection rate.

P(Cloudy) 0.50
What is the main weakness of direct sampling when evidence probability is low?

Chapter 6: Likelihood Weighted Sampling

Likelihood weighted (LW) sampling solves the rejection problem by forcing every sample to be consistent with the evidence. Instead of sampling evidence variables, we fix them to their observed values. To compensate for this "cheating," each sample gets a weight equal to the probability of the evidence given the sampled non-evidence variables.

For a sample x with evidence E = e, the weight is:

w(x) = ∏Ei ∈ E P(Ei = ei | pa(Ei) = sampled parent values)

We then estimate the posterior as a weighted frequency:

P(Q = q | E = e) ≈ ∑samples with Q=q w(x)  /  ∑all samples w(x)
Correctness: LW sampling is an instance of importance sampling. We sample from a proposal distribution (the prior with evidence fixed) and reweight by the likelihood ratio. The estimator is unbiased and converges to the true posterior with enough samples.

LW sampling is much more efficient than rejection sampling for low-probability evidence. Every sample contributes, though samples with very low weights contribute little. The algorithm degrades gracefully: rare evidence leads to low-weight samples, not wasted samples.

When LW struggles: If the evidence is "surprising" given the prior, most samples receive very small weights. The effective sample size collapses. This is called weight degeneracy and motivates Gibbs sampling for extreme cases.
LW vs Direct Sampling Efficiency

Both methods estimate the same posterior P(Rain|Wet=true). LW (blue) uses all samples; direct (orange) rejects most. Watch effective sample size (ESS) diverge as evidence becomes rarer.

Evidence probability 5%
Number of samples 200
python
def likelihood_weighted_sample(bn, evidence):
    # bn: Bayesian network with topo_order, cpd(node, parent_vals)
    w = 1.0
    sample = {}
    for node in bn.topo_order:
        parents = bn.parents[node]
        parent_vals = tuple(sample[p] for p in parents)
        if node in evidence:
            val = evidence[node]
            w *= bn.cpd(node, parent_vals)[val]  # multiply likelihood
            sample[node] = val
        else:
            dist = bn.cpd(node, parent_vals)
            sample[node] = np.random.choice(len(dist), p=dist)
    return sample, w
In likelihood weighted sampling, what does the weight w(x) represent?

Chapter 7: Gibbs Sampling — MCMC Inference

Both direct and LW sampling draw each sample independently. Gibbs sampling takes a completely different approach: it constructs a Markov chain whose stationary distribution is the posterior P(X | E = e). We run the chain for many steps and collect samples from it.

Start from any complete assignment consistent with the evidence. Then repeatedly: pick a non-evidence variable Xi at random, and resample it from its full conditional — the distribution P(Xi | all other variables fixed).

The Markov blanket: In a Bayesian network, Xi is conditionally independent of all non-blanket variables given its Markov blanket (parents + children + children's other parents). So the full conditional only involves a small local neighborhood, computable from a few CPT lookups.

The full conditional for Xi given all others x−i is proportional to:

P(Xi | x−i) ∝ P(Xi | pa(Xi)) · ∏Cj: child of Xi P(Cj | pa(Cj))
Burn-in: The first few hundred Gibbs samples may not represent the posterior well — the chain hasn't "mixed" yet. We discard these initial samples (burn-in period). After mixing, consecutive samples are correlated (unlike independent sampling), so we may need more total samples to get equivalent effective sample size.
Gibbs Sampler in Action

Two binary variables A and B with adjustable correlation. The Gibbs sampler alternately resamples each from its full conditional. Red = burn-in. Watch the posterior histogram for B converge.

Correlation ρ 0.50
Burn-in steps 50
python
def gibbs_sample(bn, evidence, n_samples, burn_in=100):
    state = {**evidence}
    for node in bn.non_evidence_nodes:
        state[node] = np.random.choice(bn.domain_size[node])

    samples = []
    for step in range(burn_in + n_samples):
        node = np.random.choice(bn.non_evidence_nodes)
        # Full conditional from Markov blanket (local CPT lookups only)
        probs = bn.markov_blanket_conditional(node, state)
        state[node] = np.random.choice(len(probs), p=probs)
        if step >= burn_in:
            samples.append(state.copy())
    return samples
Why does Gibbs sampling only require the Markov blanket of Xi to compute its full conditional?

Chapter 8: Inference in Gaussian Models

When all variables are jointly Gaussian, inference has a beautiful closed-form solution. No sampling, no enumeration, no message passing — just matrix algebra.

If X and Y are jointly Gaussian with covariance structure Σ, then conditioning on Y = y gives:

P(X | Y = y) = 𝒩(μX + ΣXYΣYY−1(y − μY),  ΣXX − ΣXYΣYY−1ΣYX)

This is the conditional Gaussian formula. The posterior mean is the prior mean plus a correction proportional to how surprising the observation was. The posterior covariance is the prior covariance minus the information gained from observing Y — observing evidence always reduces uncertainty.

Kalman gain: The matrix ΣXYΣYY−1 is exactly the Kalman gain from filtering theory. It is the optimal linear combination of observations to update the mean. The Kalman filter is simply this Gaussian inference formula applied recursively over time.
Gaussian Posterior Update

Two jointly Gaussian variables: X (true position) and Y (noisy measurement). Grey = prior over X. Teal = posterior P(X|Y=y). Orange line = observation. High correlation means a surprising observation strongly shifts the posterior.

Correlation ρ 0.70
Observation Y = y 1.0

A Gaussian Bayesian network is one where every CPD is a linear Gaussian: Xi | pa(Xi) ~ N(μi + βiT pa(Xi), σi2). The joint over all variables is then a single multivariate Gaussian, and inference reduces to the conditional formula above.

python
import numpy as np

def gaussian_inference(mu_x, mu_y, sigma_xx, sigma_xy, sigma_yy, y_obs):
    # Scalar case: P(X | Y=y_obs) given joint Gaussian parameters
    K = sigma_xy / sigma_yy              # Kalman gain
    mu_post = mu_x + K * (y_obs - mu_y) # Updated mean
    sigma_post = sigma_xx - K * sigma_xy # Updated variance
    return mu_post, sigma_post

# Example: rho=0.7, observe Y=1.0
mu_p, s_p = gaussian_inference(
    mu_x=0, mu_y=0, sigma_xx=1.0, sigma_xy=0.7, sigma_yy=1.0, y_obs=1.0
)
# mu_post = 0.70, sigma_post = 0.51
In the conditional Gaussian formula, the term ΣXYΣYY−1(y−μY) represents what?

Chapter 9: Connections & What Comes Next

Inference is the computational engine underlying everything else in this book. Every algorithm that reasons about uncertainty — filtering, planning, learning — calls some form of inference as a subroutine.

MethodBest ForLimitation
EnumerationTeaching, tiny networks (n < 10)Exponential in hidden variables
Variable EliminationGeneral exact inferenceExponential in treewidth
Belief PropagationTrees, all marginals at onceApproximate on loopy graphs
Direct SamplingSimple estimation, common evidenceRejects low-probability evidence
LW SamplingModerate-probability evidenceWeight degeneracy under extreme evidence
Gibbs SamplingComplex posteriors, correlated varsSlow mixing, burn-in required
Gaussian InferenceLinear-Gaussian models, exactlyRequires Gaussian assumption
Beyond this chapter: Modern practitioners turn to variational inference (approximate P(X|E) with a simpler distribution from a tractable family) or neural amortized inference (train a network to predict posteriors directly, as in a variational autoencoder). Both avoid the per-query cost of exact methods.

How this connects to later chapters

Chapter 4: Parameter Learning
To learn CPT parameters from data with hidden variables, the EM algorithm calls inference at every E-step to compute expected sufficient statistics under the posterior.

Chapter 6: Simple Decisions
Decision-making requires computing expected utilities — E[U(outcome) | action, evidence] — which is a weighted sum over the posterior.

Chapters 19–22: POMDPs
The "belief state" in a partially observable system is a posterior over hidden states. Updating it is exactly a Bayesian inference step at each time point.

Kalman Filter Connection
The Gaussian inference formula is the Kalman update equation. Linear-Gaussian state-space models are Gaussian Bayesian networks unrolled through time, and the Kalman filter runs exact inference on them.

Particle Filters
Direct sampling applied to dynamic models. Each particle is a sample from the prior propagated forward, reweighted by observation likelihood. This is LW sampling in disguise.

Variational Autoencoders
Neural networks trained to do approximate inference. The encoder learns P(Z|X) without explicit enumeration by maximizing a variational lower bound.

The big picture: Every exact inference algorithm is a smart way to avoid recomputing the same products. Every approximate inference algorithm trades exactness for scalability. The right choice depends on network structure, evidence probability, and the cost of being wrong.

Related micro-lessons

The representation behind this chapter — structure, CPTs, d-separation.
Inference over time — the Bayes filter as a dynamic Bayesian network.
Gaussian inference applied to linear dynamical systems.
"In the long run it's not just whether you're right or wrong, but how confident you are, and how you respond to the evidence."
— Nate Silver, The Signal and the Noise