Inference — Kochenderfer et al., Chapter 3

Chapter 0: The Question Behind the Network

Imagine a doctor who knows from statistics that 1% of patients have a rare disease. A test for the disease is 95% sensitive (catches 95% of true cases) and 90% specific (correctly rules out 90% of healthy patients). A patient just tested positive. What is the probability they actually have the disease?

This is an inference problem. We have a Bayesian network encoding beliefs about Disease, Test result, and their relationship. We have observed evidence (Test = positive). We want to compute a posterior probability — P(Disease = true | Test = positive).

The naive answer is "95% — the test is very accurate." The correct answer is about 8.7%. Understanding why requires computing posteriors precisely, which is exactly what inference algorithms do.

The core task: Given a Bayesian network over variables X₁, …, X_n, a set of query variables Q, and a set of evidence variables E observed to take values e, compute P(Q | E = e).

The Doctor’s Dilemma

Adjust the disease prevalence and test accuracy. Watch how the posterior probability of disease given a positive test changes — often dramatically.

Prevalence P(D) 1%

Sensitivity P(T+|D) 95%

Specificity P(T−|¬D) 90%

The network here is tiny — two variables. Real problems involve dozens or hundreds of variables. We need systematic algorithms. The rest of this chapter develops them.

In the inference problem, what are the "evidence variables"?

Variables we want to predict All variables in the network Variables whose values we have observed

Chapter 1: Exact Inference by Enumeration

The blunt-force approach to inference is to directly apply the definition of conditional probability. We want P(Q = q | E = e). By definition, this equals P(Q = q, E = e) / P(E = e). Both of these are just marginals of the joint distribution.

The joint distribution over all variables X_1:n in a Bayesian network factorizes as a product of conditional probability tables (CPTs):

P(X₁, …, X_n) = ∏_i P(X_i | pa(X_i))

To compute P(Q = q, E = e), we sum this joint over all configurations of the remaining hidden variables H (those not in Q or E):

P(Q = q, E = e) = ∑_h ∏_i P(X_i | pa(X_i))

This is inference by enumeration: enumerate every possible assignment to the hidden variables, evaluate the joint for each, and sum up. It works. It is also catastrophically slow for large networks — if there are k hidden binary variables, there are 2^k terms to sum.

Why it is correct: Marginalization (summing over hidden variables) is exact because we are literally accounting for every possible world. The conditional probability formula P(Q|E) = P(Q,E)/P(E) requires no approximation. The only cost is computational.

Enumeration on a Small Network

Network: B (Battery) → S (Start) ← F (Fuel). We observe S = false (car won't start). Enumeration computes P(B=false|S=false) by summing over all values of F.

P(B=bad) 0.10

P(F=empty) 0.05

The enumeration algorithm has exponential time complexity in the number of hidden variables. For 30 binary hidden variables, that is over a billion terms. We need something smarter.

In enumeration, why do we divide P(Q,E) by P(E)?

To normalize so the result sums to 1 across all possible hidden variable states To subtract the prior probability of the evidence Because P(Q|E) = P(Q,E)/P(E) by the definition of conditional probability

Chapter 2: Variable Elimination — Smarter Summing

Enumeration is slow because it repeatedly computes the same sub-products. Variable elimination fixes this by pushing sums inside products — eliminating one variable at a time, caching the intermediate results as factors.

A factor is a function from a set of variables to a non-negative real number. Every CPT in the network is a factor. We can multiply two factors (produce a new factor over their union of variables) and marginalize a factor (sum out one variable to get a factor over fewer variables).

The key trick: Instead of ∑_h1,h2,h3 f(h1) · g(h1,h2) · p(h2,h3) — which iterates over all triples — we compute ∑_h1 f(h1) · g(h1,h2) first (a new factor over h2 only), then ∑_h2 [result](h2) · p(h2,h3). Each step is smaller.

The algorithm:

1. Initialize

Start with all CPTs as factors. Set evidence variables to their observed values (reduce each factor).

↓

2. Choose variable to eliminate

Pick a hidden variable X_i. Multiply all factors involving X_i into one big factor.

↓

3. Marginalize

Sum the big factor over X_i, producing a new factor that no longer involves X_i.

↻ repeat for all hidden variables

4. Normalize

Multiply remaining factors (over query variables only). Normalize to get the posterior.

Step-Through Variable Elimination

Network: Rain → Wet ← Sprinkler ← Cloudy, Cloudy → Rain. Evidence: Wet = true. Query: P(Rain | Wet=true). Step through the elimination to see factors grow and shrink.

The cost of variable elimination depends critically on the elimination order. A good order keeps intermediate factors small. Finding the optimal order is itself NP-hard, but good heuristics (eliminate the variable that creates the smallest new factor) work well in practice.

Operation	Inputs	Output	Cost
Factor product	f(X,Y), g(Y,Z)	h(X,Y,Z)	\|X\|·\|Y\|·\|Z\|
Marginalize out Y	h(X,Y,Z)	m(X,Z)	\|X\|·\|Y\|·\|Z\|
Reduce (observe Y=y)	f(X,Y)	r(X)	\|X\|

Why does variable elimination outperform naive enumeration on large networks?

It uses approximate probabilities instead of exact ones It pushes summations inside products, avoiding redundant computation by caching intermediate factors It only processes variables that are ancestors of the query

Chapter 3: Belief Propagation

Variable elimination answers a single query efficiently. But what if we want to compute posteriors for every variable simultaneously? Running elimination separately for each variable wastes computation. Belief propagation does all of them at once on tree-structured networks.

On a tree (a network with no undirected cycles), we pick any node as root and run two passes: a collect pass (leaves send messages to root) and a distribute pass (root sends messages back to leaves). After two passes, every node has the information it needs to compute its own marginal.

What is a "message"? When node X_i sends a message to its neighbor X_j, it is saying: "Given everything I know from my subtree, here is a factor over your values." Multiplying all incoming messages gives the posterior at each node.

Formally, the message from node i to node j is:

m_i→j(x_j) = ∑_{x_i} P(x_i | pa(x_i)) · ∏_{k ∈ nb(i)\{j}} m_k→i(x_i)

The posterior at node j after collecting all messages is proportional to its own CPT times the product of all incoming messages:

P(X_j | e) ∝ P(X_j | pa(X_j)) · ∏_{i ∈ nb(j)} m_i→j(X_j)

Message Passing on a Tree

Five-node chain: A → B → C → D → E. Observe E = true. Click "Collect" to propagate messages from E toward A, then "Distribute" to propagate back. Watch belief bars update.

Belief propagation is exact on trees. On networks with cycles (loopy graphs), the same message-passing equations are applied iteratively — this is called loopy belief propagation. It often converges to good approximate answers, but correctness is not guaranteed.

On a tree-structured Bayesian network, how many message-passing passes does belief propagation need to compute all marginals exactly?

One pass from root to leaves One pass per node in the tree Two passes — collect (leaves to root) then distribute (root to leaves)

Chapter 4: Computational Complexity

Here is the sobering truth: exact inference in general Bayesian networks is NP-hard. This means no algorithm can efficiently solve all instances unless P = NP. The hardness comes from networks with tightly coupled loops that prevent the factorization tricks from reducing the problem size.

The key quantity governing difficulty is treewidth — roughly, how tree-like is the network? A tree has treewidth 1. Each extra edge that creates a loop can increase treewidth. Variable elimination runs in time O(n · d^w+1) where d is the domain size and w is the treewidth. On a tree (w=1), this is linear. On a dense network, it is exponential.

Intuition for treewidth: Imagine eliminating variables one at a time. Each elimination potentially "connects" the remaining neighbors of the eliminated variable. Treewidth measures the worst-case clique size that forms during the best possible elimination order. High treewidth means unavoidable large intermediate factors.

Treewidth vs Inference Cost

As network density grows, treewidth increases and inference cost explodes exponentially. Adjust domain size to see how the exponent multiplies.

Effective treewidth w 3

Domain size d 2

The NP-hardness motivates two responses: (1) restrict to tractable network structures (trees, polytrees, networks with small treewidth), or (2) use approximate inference via sampling. The next chapters cover sampling methods.

Method	Exact?	Works on loops?	Complexity
Enumeration	Yes	Yes	O(dⁿ)
Variable Elimination	Yes	Yes	O(n · d^w+1)
Belief Propagation	Yes (trees only)	Approximate	O(n · d²)
Sampling methods	Approximate	Yes	O(samples · n)

What does treewidth measure about a Bayesian network?

The size of the largest clique formed during optimal variable elimination (minus 1) The total number of edges in the network The maximum number of parents any node has

Chapter 5: Direct Sampling

Sampling methods sidestep the complexity barrier by estimating probabilities from random samples. The simplest approach is direct sampling (also called prior sampling or ancestral sampling): generate complete assignments to all variables by following the network structure from parents to children.

The algorithm is elegantly simple. Process variables in topological order (parents before children). For each variable, look up its CPT given the already-sampled values of its parents, then draw a random sample from that distribution. Repeat N times, then count.

1. Topological sort

Order variables so every parent appears before its children.

↓

2. Sample each variable

For each X_i in order: draw from P(X_i | pa(X_i) = already sampled values).

↓

3. Record the sample

One complete assignment to all variables. Discard if inconsistent with evidence.

↻ repeat N times

4. Estimate

P(Q=q|E=e) ≈ (# samples with Q=q AND E=e) / (# samples with E=e)

Why it works: Each complete sample is drawn from the joint distribution P(X₁, …, X_n). By the law of large numbers, the empirical frequency of any event converges to its true probability as the number of samples grows. Direct sampling produces unbiased estimates.

The rejection problem: To estimate P(Q|E=e), we throw away any sample where E ≠ e. If P(E=e) is small, almost all samples are rejected. With 1% evidence probability, we need ~100 samples to get one useful sample. This is direct sampling's fatal flaw for rare evidence.

Direct Sampling Demo

Network: Cloudy → Rain → Wet ← Sprinkler ← Cloudy. Evidence: Wet = true. Watch accepted (colored) vs rejected (grey) samples. Rare evidence = high rejection rate.

P(Cloudy) 0.50

What is the main weakness of direct sampling when evidence probability is low?

It produces biased estimates because samples are correlated Most samples are rejected (inconsistent with evidence), so few useful samples are obtained It requires topological sort, which is expensive for large networks

Chapter 6: Likelihood Weighted Sampling

Likelihood weighted (LW) sampling solves the rejection problem by forcing every sample to be consistent with the evidence. Instead of sampling evidence variables, we fix them to their observed values. To compensate for this "cheating," each sample gets a weight equal to the probability of the evidence given the sampled non-evidence variables.

For a sample x with evidence E = e, the weight is:

w(x) = ∏_{E_i ∈ E} P(E_i = e_i | pa(E_i) = sampled parent values)

We then estimate the posterior as a weighted frequency:

P(Q = q | E = e) ≈ ∑_{samples with Q=q} w(x) / ∑_{all samples} w(x)

Correctness: LW sampling is an instance of importance sampling. We sample from a proposal distribution (the prior with evidence fixed) and reweight by the likelihood ratio. The estimator is unbiased and converges to the true posterior with enough samples.

LW sampling is much more efficient than rejection sampling for low-probability evidence. Every sample contributes, though samples with very low weights contribute little. The algorithm degrades gracefully: rare evidence leads to low-weight samples, not wasted samples.

When LW struggles: If the evidence is "surprising" given the prior, most samples receive very small weights. The effective sample size collapses. This is called weight degeneracy and motivates Gibbs sampling for extreme cases.

LW vs Direct Sampling Efficiency

Both methods estimate the same posterior P(Rain|Wet=true). LW (blue) uses all samples; direct (orange) rejects most. Watch effective sample size (ESS) diverge as evidence becomes rarer.

Evidence probability 5%

Number of samples 200

python
def likelihood_weighted_sample(bn, evidence):
    # bn: Bayesian network with topo_order, cpd(node, parent_vals)
    w = 1.0
    sample = {}
    for node in bn.topo_order:
        parents = bn.parents[node]
        parent_vals = tuple(sample[p] for p in parents)
        if node in evidence:
            val = evidence[node]
            w *= bn.cpd(node, parent_vals)[val]  # multiply likelihood
            sample[node] = val
        else:
            dist = bn.cpd(node, parent_vals)
            sample[node] = np.random.choice(len(dist), p=dist)
    return sample, w

In likelihood weighted sampling, what does the weight w(x) represent?

The probability of the evidence given the sampled non-evidence variables The probability of the entire sample under the joint distribution How many times this sample should be counted in the final estimate

Chapter 7: Gibbs Sampling — MCMC Inference

Both direct and LW sampling draw each sample independently. Gibbs sampling takes a completely different approach: it constructs a Markov chain whose stationary distribution is the posterior P(X | E = e). We run the chain for many steps and collect samples from it.

Start from any complete assignment consistent with the evidence. Then repeatedly: pick a non-evidence variable X_i at random, and resample it from its full conditional — the distribution P(X_i | all other variables fixed).

The Markov blanket: In a Bayesian network, X_i is conditionally independent of all non-blanket variables given its Markov blanket (parents + children + children's other parents). So the full conditional only involves a small local neighborhood, computable from a few CPT lookups.

The full conditional for X_i given all others x_−i is proportional to:

P(X_i | x_−i) ∝ P(X_i | pa(X_i)) · ∏_{C_j: child of X_i} P(C_j | pa(C_j))

Burn-in: The first few hundred Gibbs samples may not represent the posterior well — the chain hasn't "mixed" yet. We discard these initial samples (burn-in period). After mixing, consecutive samples are correlated (unlike independent sampling), so we may need more total samples to get equivalent effective sample size.

Gibbs Sampler in Action

Two binary variables A and B with adjustable correlation. The Gibbs sampler alternately resamples each from its full conditional. Red = burn-in. Watch the posterior histogram for B converge.

Correlation ρ 0.50

Burn-in steps 50

python
def gibbs_sample(bn, evidence, n_samples, burn_in=100):
    state = {**evidence}
    for node in bn.non_evidence_nodes:
        state[node] = np.random.choice(bn.domain_size[node])

    samples = []
    for step in range(burn_in + n_samples):
        node = np.random.choice(bn.non_evidence_nodes)
        # Full conditional from Markov blanket (local CPT lookups only)
        probs = bn.markov_blanket_conditional(node, state)
        state[node] = np.random.choice(len(probs), p=probs)
        if step >= burn_in:
            samples.append(state.copy())
    return samples

Why does Gibbs sampling only require the Markov blanket of X_i to compute its full conditional?

Because non-blanket variables are always set to their prior means Because we marginalize out all other variables during the update step Because X_i is conditionally independent of all non-blanket variables given its blanket

Chapter 8: Inference in Gaussian Models

When all variables are jointly Gaussian, inference has a beautiful closed-form solution. No sampling, no enumeration, no message passing — just matrix algebra.

If X and Y are jointly Gaussian with covariance structure Σ, then conditioning on Y = y gives:

P(X | Y = y) = 𝒩(μ_X + Σ_XYΣ_YY⁻¹(y − μ_Y), Σ_XX − Σ_XYΣ_YY⁻¹Σ_YX)

This is the conditional Gaussian formula. The posterior mean is the prior mean plus a correction proportional to how surprising the observation was. The posterior covariance is the prior covariance minus the information gained from observing Y — observing evidence always reduces uncertainty.

Kalman gain: The matrix Σ_XYΣ_YY⁻¹ is exactly the Kalman gain from filtering theory. It is the optimal linear combination of observations to update the mean. The Kalman filter is simply this Gaussian inference formula applied recursively over time.

Gaussian Posterior Update

Two jointly Gaussian variables: X (true position) and Y (noisy measurement). Grey = prior over X. Teal = posterior P(X|Y=y). Orange line = observation. High correlation means a surprising observation strongly shifts the posterior.

Correlation ρ 0.70

Observation Y = y 1.0

A Gaussian Bayesian network is one where every CPD is a linear Gaussian: X_i | pa(X_i) ~ N(μ_i + β_i^T pa(X_i), σ_i²). The joint over all variables is then a single multivariate Gaussian, and inference reduces to the conditional formula above.

python
import numpy as np

def gaussian_inference(mu_x, mu_y, sigma_xx, sigma_xy, sigma_yy, y_obs):
    # Scalar case: P(X | Y=y_obs) given joint Gaussian parameters
    K = sigma_xy / sigma_yy              # Kalman gain
    mu_post = mu_x + K * (y_obs - mu_y) # Updated mean
    sigma_post = sigma_xx - K * sigma_xy # Updated variance
    return mu_post, sigma_post

# Example: rho=0.7, observe Y=1.0
mu_p, s_p = gaussian_inference(
    mu_x=0, mu_y=0, sigma_xx=1.0, sigma_xy=0.7, sigma_yy=1.0, y_obs=1.0
)
# mu_post = 0.70, sigma_post = 0.51

In the conditional Gaussian formula, the term Σ_XYΣ_YY⁻¹(y−μ_Y) represents what?

The correction added to the prior mean based on how surprising the observation was The reduction in variance from observing Y The marginal distribution of X before observing Y

Chapter 9: Connections & What Comes Next

Inference is the computational engine underlying everything else in this book. Every algorithm that reasons about uncertainty — filtering, planning, learning — calls some form of inference as a subroutine.

Method	Best For	Limitation
Enumeration	Teaching, tiny networks (n < 10)	Exponential in hidden variables
Variable Elimination	General exact inference	Exponential in treewidth
Belief Propagation	Trees, all marginals at once	Approximate on loopy graphs
Direct Sampling	Simple estimation, common evidence	Rejects low-probability evidence
LW Sampling	Moderate-probability evidence	Weight degeneracy under extreme evidence
Gibbs Sampling	Complex posteriors, correlated vars	Slow mixing, burn-in required
Gaussian Inference	Linear-Gaussian models, exactly	Requires Gaussian assumption

Beyond this chapter: Modern practitioners turn to variational inference (approximate P(X|E) with a simpler distribution from a tractable family) or neural amortized inference (train a network to predict posteriors directly, as in a variational autoencoder). Both avoid the per-query cost of exact methods.

How this connects to later chapters

Chapter 4: Parameter Learning
To learn CPT parameters from data with hidden variables, the EM algorithm calls inference at every E-step to compute expected sufficient statistics under the posterior.

Chapter 6: Simple Decisions
Decision-making requires computing expected utilities — E[U(outcome) | action, evidence] — which is a weighted sum over the posterior.

Chapters 19–22: POMDPs
The "belief state" in a partially observable system is a posterior over hidden states. Updating it is exactly a Bayesian inference step at each time point.

Kalman Filter Connection
The Gaussian inference formula is the Kalman update equation. Linear-Gaussian state-space models are Gaussian Bayesian networks unrolled through time, and the Kalman filter runs exact inference on them.

Particle Filters
Direct sampling applied to dynamic models. Each particle is a sample from the prior propagated forward, reweighted by observation likelihood. This is LW sampling in disguise.

Variational Autoencoders
Neural networks trained to do approximate inference. The encoder learns P(Z|X) without explicit enumeration by maximizing a variational lower bound.

The big picture: Every exact inference algorithm is a smart way to avoid recomputing the same products. Every approximate inference algorithm trades exactness for scalability. The right choice depends on network structure, evidence probability, and the cost of being wrong.

Related micro-lessons

Bayesian Networks

The representation behind this chapter — structure, CPTs, d-separation.

↓

Bayes Filter

Inference over time — the Bayes filter as a dynamic Bayesian network.

↓

Kalman Filter

Gaussian inference applied to linear dynamical systems.

"In the long run it's not just whether you're right or wrong, but how confident you are, and how you respond to the evidence."
— Nate Silver, The Signal and the Noise