Bishop PRML, Chapter 10

Approximate Inference

Variational inference, mean field approximation, variational EM, and expectation propagation — when exact inference is intractable.

Prerequisites: Chapters 2, 8–9 (distributions, graphical models, EM, ELBO).
11 chapters · 2 simulations · 11 quizzes

Chapter 0: Why Approximate Inference?

Bayesian inference requires computing the posterior p(Z|X) = p(X|Z)p(Z) / p(X). The denominator p(X) = ∫ p(X|Z)p(Z) dZ is an integral (or sum) over all latent variables.

For most interesting models, this integral is intractable:

• The latent space is too high-dimensional (deep latent variable models).

• The likelihood and prior are not conjugate (non-Gaussian posteriors).

• The graphical model has loops (no exact message passing).

Two families of approximation:
Deterministic: Variational methods (this chapter) — approximate the posterior with a simpler distribution by optimization. Fast but biased.
Stochastic: Sampling methods (Chapter 11) — generate samples from the posterior via MCMC. Exact in the limit but slow to converge.
Modern ML uses both: variational autoencoders use variational inference for training and sometimes MCMC for evaluation.
Check: Why is exact Bayesian inference often intractable?

Chapter 1: Variational Inference

Recall the ELBO decomposition from Chapter 9:

ln p(X) = L(q) + KL(q || p(Z|X))

In EM, we set q = p(Z|X) exactly. In variational inference, we restrict q to a tractable family Q and find the best approximation:

q*(Z) = arg min_{q ∈ Q} KL(q(Z) || p(Z|X))

Equivalently (since ln p(X) is constant w.r.t. q), we maximize the ELBO:

q* = arg max_{q ∈ Q} L(q) = arg max_{q ∈ Q} { E_q[ln p(X, Z)] − E_q[ln q(Z)] }
Inference becomes optimization: Instead of computing an intractable integral, we solve an optimization problem: find the distribution q in our family that is closest to the true posterior. The ELBO serves as the objective. This is the key insight of variational inference — it turns inference into optimization, which we know how to do.
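
The decomposition ln p(X) = L(q) + KL(q || p(Z|X)) can be checked numerically on a tiny discrete model. The sketch below is only an illustration: the three-state latent variable and its joint probabilities are made-up values, and `elbo_and_kl` is a hypothetical helper name, not something from the text.

```python
import numpy as np

def elbo_and_kl(log_joint, q):
    """log_joint[k] = ln p(X, Z=k) for the observed X; q[k] = q(Z=k)."""
    log_evidence = np.log(np.sum(np.exp(log_joint)))   # ln p(X), tractable here by summation
    log_posterior = log_joint - log_evidence            # ln p(Z|X)
    elbo = np.sum(q * (log_joint - np.log(q)))          # E_q[ln p(X,Z)] - E_q[ln q(Z)]
    kl = np.sum(q * (np.log(q) - log_posterior))        # KL(q || p(Z|X))
    return log_evidence, elbo, kl

# Hypothetical toy model: one latent variable with three states.
log_joint = np.log(np.array([0.10, 0.25, 0.05]))   # p(X, Z=k); sums to p(X) = 0.40
q = np.array([0.2, 0.5, 0.3])                       # an arbitrary approximating distribution

log_evidence, elbo, kl = elbo_and_kl(log_joint, q)
print(log_evidence, elbo + kl)   # the two numbers agree up to rounding
```

Because ln p(X) does not depend on q, any change to q that raises the ELBO lowers the KL term by the same amount, which is why the two optimization problems are equivalent.
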
Check: What does variational inference optimize?

Chapter 2: Mean Field Approximation

The most common variational family is the mean field (fully factorized) approximation:

q(Z) = ∏_{i=1}^{M} q_i(Z_i)

Each group of latent variables Z_i is treated independently. We optimize each factor q_i while holding the others fixed.

The optimal factor for Z_j has a beautiful closed form:

ln q_j*(Z_j) = E_{i≠j}[ln p(X, Z)] + const
The mean field update: The optimal q_j is the normalized exponential of the expected log joint, where the expectation is taken over all the other factors q_i with i ≠ j. This is coordinate ascent: optimize one factor at a time, cycling through all factors until convergence. Each update is guaranteed to increase (or maintain) the ELBO. The form of each factor follows from the joint distribution: for conjugate exponential family models, each q_j has the same functional form as its conditional in the original model.

Mean field ignores correlations between variable groups. This makes the approximation fast but potentially inaccurate: it tends to underestimate posterior variances and miss multimodality.
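
The variance underestimation can be seen in a case where everything is available in closed form: a correlated two-dimensional Gaussian target approximated by a factorized Gaussian. The sketch below assumes that setup; the function name and the particular mean and covariance are illustrative choices. For a Gaussian target with mean μ and precision matrix Λ, the coordinate updates reduce to m_1 = μ_1 − Λ_11^{-1} Λ_12 (m_2 − μ_2) (and symmetrically for m_2), with factor variances 1/Λ_11 and 1/Λ_22.

```python
import numpy as np

def mean_field_gaussian(mu, Lam, n_iters=200):
    """Coordinate ascent for q(z1) q(z2) approximating N(z | mu, Lam^{-1})."""
    m = np.zeros(2)                                   # initial factor means
    for _ in range(n_iters):
        # Each factor mean uses the current mean of the other factor.
        m[0] = mu[0] - Lam[0, 1] * (m[1] - mu[1]) / Lam[0, 0]
        m[1] = mu[1] - Lam[1, 0] * (m[0] - mu[0]) / Lam[1, 1]
    variances = 1.0 / np.diag(Lam)                    # factor variances are fixed by Lam
    return m, variances

# Illustrative target with correlation 0.9 and unit marginal variances.
Sigma = np.array([[1.0, 0.9], [0.9, 1.0]])
mu = np.array([1.0, -1.0])
m, var_q = mean_field_gaussian(mu, np.linalg.inv(Sigma))
print(m)       # converges to the true mean
print(var_q)   # about 0.19 per factor, versus a true marginal variance of 1.0
```

The means are recovered exactly, but each factor's variance is the conditional variance 1 − 0.9² rather than the marginal variance, so the stronger the correlation, the more overconfident the factorized approximation becomes.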

Check: What does the mean field approximation assume?

Chapter 3: Example: Univariate Gaussian

To illustrate, consider inferring the mean μ and precision τ of a Gaussian from data x_1, ..., x_N. The true posterior p(μ, τ | x) is correlated: the uncertainty in the mean depends on the precision, so the posterior does not factorize into independent terms for μ and τ.

The mean field approximation: q(μ, τ) = q_μ(μ) q_τ(τ).

Applying the optimal update formula:

q_μ(μ) = N(μ | μ_N, λ_N^{-1})    (Gaussian)
q_τ(τ) = Gam(τ | a_N, b_N)    (Gamma)
Conjugacy is preserved: The variational factors have the same functional form as the conditional priors. The Gaussian conjugate prior on μ gives a Gaussian variational factor. The Gamma conjugate prior on τ gives a Gamma variational factor. This is a general property of conjugate exponential family models: mean field variational inference stays within the same family.

The updates couple through expectations: q_μ depends on E[τ] (the expected precision), and q_τ depends on E[μ] and E[μ²]. We iterate until convergence.
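
A minimal sketch of this iteration follows. It assumes the conjugate prior used in PRML §10.1.3, p(μ|τ) = N(μ | μ_0, (λ_0 τ)^{-1}) and p(τ) = Gam(τ | a_0, b_0), which the text above does not spell out; the default hyperparameter values and the function name are illustrative.

```python
import numpy as np

def vi_univariate_gaussian(x, mu0=0.0, lam0=1.0, a0=1.0, b0=1.0, n_iters=50):
    """Mean field updates for q(mu) = N(mu_N, 1/lam_N) and q(tau) = Gam(a_N, b_N)."""
    x = np.asarray(x, dtype=float)
    N, xbar = x.size, x.mean()
    mu_N = (lam0 * mu0 + N * xbar) / (lam0 + N)   # fixed: does not depend on E[tau]
    a_N = a0 + (N + 1) / 2.0                       # fixed as well
    e_tau = a0 / b0                                # initial guess for E[tau]
    for _ in range(n_iters):
        lam_N = (lam0 + N) * e_tau                 # q(mu) needs E[tau]
        # q(tau) needs E[mu] and E[mu^2]: E[(c - mu)^2] = (c - mu_N)^2 + 1/lam_N
        e_data = np.sum((x - mu_N) ** 2) + N / lam_N
        e_prior = lam0 * ((mu_N - mu0) ** 2 + 1.0 / lam_N)
        b_N = b0 + 0.5 * (e_data + e_prior)
        e_tau = a_N / b_N                          # mean of Gam(a_N, b_N)
    return mu_N, lam_N, a_N, b_N

rng = np.random.default_rng(0)
data = rng.normal(3.0, 0.5, size=500)              # synthetic data, true precision 4
mu_N, lam_N, a_N, b_N = vi_univariate_gaussian(data)
print(mu_N, a_N / b_N)                             # posterior mean of mu and E[tau]
```

Only λ_N and b_N change between iterations, so the loop is a fixed-point iteration on E[τ] and typically settles within a handful of passes.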

Check: In the Gaussian example, what form does the variational factor q_mu take?

Chapter 4: Variational Gaussian Mixtures

The Bayesian GMM puts priors on all parameters: Dirichlet on π, Gaussian-Wishart on (μ_k, Λ_k). The mean field approximation factorizes:

q(Z, π, μ, Λ) = q(Z) q(π) ∏_{k=1}^{K} q(μ_k, Λ_k)

The resulting update equations resemble EM but are fully Bayesian:

Update q(Z): Responsibilities now use expected parameters (expected log mixing coefficients, expected log precision, etc.).

Update q(π): Dirichlet with updated concentration parameters.

Update q(μ_k, Λ_k): Gaussian-Wishart with updated parameters.

Automatic model selection: Unlike maximum likelihood (ML) estimation, the variational Bayesian GMM can determine the effective number of components automatically. Start with a large K. As the algorithm converges, unnecessary components have their expected mixing coefficients E[π_k] driven toward zero, so they are effectively pruned. The ELBO, a lower bound on the evidence, naturally penalizes complexity, so no cross-validation is needed to choose K. This also avoids the singularity problem of ML: the priors prevent components from collapsing onto single points.
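
The pruning effect is visible in the q(π) update alone. The sketch below covers only that one step, not the full algorithm: it assumes a symmetric Dirichlet prior with concentration α_0, takes a made-up responsibility matrix as given, and uses a hypothetical function name. The updated concentrations are α_k = α_0 + N_k, where N_k is the total responsibility assigned to component k.

```python
import numpy as np

def expected_mixing_coefficients(resp, alpha0=1e-3):
    """resp: (N, K) responsibilities from q(Z). Returns Dirichlet parameters and E[pi_k]."""
    N_k = resp.sum(axis=0)                 # effective number of points per component
    alpha = alpha0 + N_k                   # updated Dirichlet concentrations
    return alpha, alpha / alpha.sum()      # E[pi_k] = alpha_k / sum_j alpha_j

# Illustrative responsibilities: K = 4 components, but only two of them claim any data.
resp = np.zeros((100, 4))
resp[:60, 0] = 1.0
resp[60:, 2] = 1.0
alpha, e_pi = expected_mixing_coefficients(resp)
print(e_pi)   # components 1 and 3 get expected weights close to zero
```

With a small α_0, a component that explains no data keeps an expected weight of roughly α_0 / N, which is what lets the surplus components fade out as the iterations proceed.
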
Check: What key advantage does the variational Bayesian GMM have over the ML GMM?

Chapter 5: Variational Linear Regression

For Bayesian linear regression with unknown weight precision α and noise precision β, the variational treatment factorizes q(w, α, β) = q(w) q(α) q(β).

The optimal factors are:

q(w) = N(w | m_N, S_N)
q(α) = Gam(α | a_α, b_α),    q(β) = Gam(β | a_β, b_β)
Comparison to the evidence framework: The evidence framework (Ch 3) point-estimates α and β by maximizing the marginal likelihood. Variational inference maintains full posterior distributions over them. In practice, the variational approach is more robust when data is scarce — it properly accounts for uncertainty in the hyperparameters rather than committing to point estimates.

The predictive distribution integrates over the weight posterior, giving both a prediction and calibrated uncertainty bands.
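
The three factors can be updated in turn as in the sketch below. It assumes the prior p(w|α) = N(w | 0, α^{-1} I), the likelihood N(t | Φw, β^{-1}), and broad Gamma priors on α and β; the hyperparameter defaults, variable names, and synthetic data are illustrative choices rather than values from the text.

```python
import numpy as np

def vi_linear_regression(Phi, t, a0=1e-2, b0=1e-2, c0=1e-2, d0=1e-2, n_iters=100):
    """Mean field updates for q(w) q(alpha) q(beta) in Bayesian linear regression."""
    N, M = Phi.shape
    PhiT_Phi, PhiT_t = Phi.T @ Phi, Phi.T @ t
    e_alpha, e_beta = 1.0, 1.0                       # initial E[alpha], E[beta]
    for _ in range(n_iters):
        # q(w) = N(w | m, S), using the current expected precisions
        S = np.linalg.inv(e_alpha * np.eye(M) + e_beta * PhiT_Phi)
        m = e_beta * S @ PhiT_t
        # q(alpha) = Gam(a, b), using E[w^T w] = m^T m + Tr(S)
        a = a0 + M / 2.0
        b = b0 + 0.5 * (m @ m + np.trace(S))
        e_alpha = a / b
        # q(beta) = Gam(c, d), using E[||t - Phi w||^2]
        resid = t - Phi @ m
        c = c0 + N / 2.0
        d = d0 + 0.5 * (resid @ resid + np.trace(PhiT_Phi @ S))
        e_beta = c / d
    return m, S, e_alpha, e_beta

# Synthetic data: t = 0.5 - 2 x + noise with standard deviation 0.2 (precision 25).
rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=200)
Phi = np.column_stack([np.ones_like(x), x])
t = 0.5 - 2.0 * x + rng.normal(0.0, 0.2, size=200)
m, S, e_alpha, e_beta = vi_linear_regression(Phi, t)
print(m, e_beta)
```

The trace terms are where the uncertainty in w enters the updates for α and β; dropping them would amount to plugging in a point estimate of the weights.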

Check: How does variational linear regression differ from the evidence framework?

Chapter 6: Local Variational Methods

Instead of approximating the posterior globally (mean field), local variational methods introduce bounds on individual terms in the likelihood.

The key idea: replace a difficult nonlinear function with a simpler bound that can be optimized. For example, the logistic sigmoid σ(a) = 1/(1 + e^{−a}) has no conjugate prior. But we can bound it:

σ(a) ≥ σ(ξ) exp{(a − ξ)/2 − λ(ξ)(a² − ξ²)}

where ξ is a variational parameter and λ(ξ) = [σ(ξ) − 1/2]/(2ξ).

The bound trick: By replacing the sigmoid with its variational lower bound, the integrand becomes the exponential of a quadratic in a, so the intractable integral becomes Gaussian and can be computed in closed form. The variational parameter ξ is then optimized to make the bound as tight as possible. This trades one hard integral for an easier optimization problem. The bound holds with equality at a = ±ξ, that is, when ξ = |a|.
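
The bound is easy to check numerically. The sketch below evaluates it on a grid for one arbitrary choice of ξ; the function names and the value ξ = 2.5 are illustrative, and ξ must be nonzero because λ(ξ) is written as a ratio.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lam(xi):
    # lambda(xi) = [sigma(xi) - 1/2] / (2 xi); requires xi != 0
    return (sigmoid(xi) - 0.5) / (2.0 * xi)

def sigmoid_lower_bound(a, xi):
    # sigma(xi) * exp{(a - xi)/2 - lambda(xi) (a^2 - xi^2)}
    return sigmoid(xi) * np.exp((a - xi) / 2.0 - lam(xi) * (a ** 2 - xi ** 2))

a = np.linspace(-6.0, 6.0, 241)
xi = 2.5
gap = sigmoid(a) - sigmoid_lower_bound(a, xi)
print(gap.min())                                       # never negative, up to rounding
print(sigmoid(xi) - sigmoid_lower_bound(xi, xi))       # zero gap at a = xi
print(sigmoid(-xi) - sigmoid_lower_bound(-xi, xi))     # zero gap at a = -xi
```

The exponent is quadratic in a, which is what makes the bound combine with a Gaussian prior into another Gaussian.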

This technique is particularly useful for models with non-conjugate likelihoods (classification with Gaussian priors), where mean field alone doesn't yield tractable updates.

Check: What do local variational methods do?

Chapter 7: Variational Logistic Regression

Combining the local variational bound on the sigmoid with a Gaussian prior on weights gives a fully variational treatment of logistic regression.

The variational posterior over weights is Gaussian: q(w) = N(w|m, S), where:

S^{-1} = A + 2 ∑_n λ(ξ_n) φ_n φ_n^T
m = S ∑_n (t_n − ½) φ_n

where A = diag(α_1, ..., α_M) are the prior precisions.

Comparison of approaches to Bayesian logistic regression: The Laplace approximation (Ch 4) fits a Gaussian at the posterior mode using only the curvature at that point. The variational approach fits a Gaussian by maximizing a lower bound on the marginal likelihood, with the parameters ξ_n adapting the fit to the data, which typically gives a more accurate approximation than Laplace. That same lower bound on the evidence is useful for model selection.

The variational parameters ξ_n and hyperparameters α_i are updated alternately until convergence. When ARD priors are used (one α_i per weight), sparsity emerges naturally, just as in the RVM from Chapter 7.
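
A simplified sketch of the inner loop is given below. It keeps the prior precision fixed at A = αI rather than updating the hyperparameters, assumes a zero-mean prior and targets t_n ∈ {0, 1}, and re-estimates each ξ_n from ξ_n² = φ_n^T (S + m m^T) φ_n, the standard update in PRML §10.6. The function name, the value of α, and the synthetic data are illustrative.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lam(xi):
    return (sigmoid(xi) - 0.5) / (2.0 * xi)

def vi_logistic_regression(Phi, t, alpha=1.0, n_iters=50):
    """Alternate between q(w) = N(w | m, S) and the local parameters xi_n."""
    N, M = Phi.shape
    A = alpha * np.eye(M)                      # fixed prior precision (no ARD update here)
    xi = np.ones(N)                            # initial variational parameters
    for _ in range(n_iters):
        # S^{-1} = A + 2 sum_n lambda(xi_n) phi_n phi_n^T
        S = np.linalg.inv(A + 2.0 * (Phi.T * lam(xi)) @ Phi)
        # m = S sum_n (t_n - 1/2) phi_n
        m = S @ (Phi.T @ (t - 0.5))
        # xi_n^2 = phi_n^T (S + m m^T) phi_n
        xi = np.sqrt(np.einsum("ni,ij,nj->n", Phi, S + np.outer(m, m), Phi))
    return m, S, xi

# Synthetic 1-D data: two overlapping classes centred at -2 and +2.
rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(-2.0, 1.0, 100), rng.normal(2.0, 1.0, 100)])
t = np.concatenate([np.zeros(100), np.ones(100)])
Phi = np.column_stack([np.ones_like(x), x])
m, S, xi = vi_logistic_regression(Phi, t)
accuracy = np.mean((sigmoid(Phi @ m) > 0.5) == (t == 1))
print(m, accuracy)
```

Each pass is a closed-form Gaussian update followed by a closed-form update of the ξ_n, so no gradient steps or line searches are needed.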

Check: How does the variational approach to logistic regression differ from the Laplace approximation?

Chapter 8: Expectation Propagation

Expectation propagation (EP) is an alternative to variational inference that uses a different KL direction. While mean field minimizes KL(q||p), EP minimizes KL(p||q) locally.

The key difference in KL direction:

KL(q||p) (variational):
q avoids regions where p is small. Tends to be mode-seeking — locks onto one mode of a multimodal posterior. Underestimates variance.

KL(p||q) (EP):
q covers all regions where p is significant. Tends to be moment-matching — spreads mass to cover all modes. Overestimates variance for multimodal posteriors.

EP algorithm: EP approximates the true posterior p(Z|X) ∝ ∏_n f_n(Z) by a product of simpler factors q(Z) ∝ ∏_n f̃_n(Z). It refines each approximate factor f̃_n one at a time: remove it from q, incorporate the exact factor f_n, then project back to the approximate family by matching moments. This is "assumed density filtering" iterated until convergence.

EP is often more accurate than mean field for classification problems because the moment-matching property gives better-calibrated uncertainties. It's used in practical systems like the Bayes Point Machine.
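
The remove, incorporate, and project cycle can be sketched on a one-dimensional toy problem. The example below uses a "clutter" model, in which each observation comes either from N(x | θ, 1) or from a broad background N(x | 0, 10), with a Gaussian prior on θ. All numerical settings, the data, and the function names are assumptions made for illustration, and the moments of the tilted distribution are computed by brute-force quadrature on a grid rather than with analytic formulas.

```python
import numpy as np

GRID = np.linspace(-12.0, 12.0, 4801)      # quadrature grid for the scalar latent theta
DZ = GRID[1] - GRID[0]

def gauss(x, mean, var):
    return np.exp(-0.5 * (x - mean) ** 2 / var) / np.sqrt(2.0 * np.pi * var)

def exact_factors(x, w, clutter_var):
    # f_n(theta) = (1 - w) N(x_n | theta, 1) + w N(x_n | 0, clutter_var), evaluated on GRID
    return [(1.0 - w) * gauss(xn, GRID, 1.0) + w * gauss(xn, 0.0, clutter_var) for xn in x]

def moments(density):
    density = density / (density.sum() * DZ)          # normalize on the grid
    mean = (GRID * density).sum() * DZ
    var = ((GRID - mean) ** 2 * density).sum() * DZ
    return mean, var

def ep_clutter(x, w=0.3, clutter_var=10.0, prior_var=4.0, n_sweeps=20):
    factors = exact_factors(x, w, clutter_var)
    # Gaussian sites stored as natural parameters: precision and precision * mean.
    site_prec, site_pm = np.zeros(len(x)), np.zeros(len(x))
    q_prec, q_pm = 1.0 / prior_var, 0.0               # q starts at the prior
    for _ in range(n_sweeps):
        for n in range(len(x)):
            # 1. Remove site n from q to get the cavity distribution.
            cav_prec, cav_pm = q_prec - site_prec[n], q_pm - site_pm[n]
            if cav_prec <= 0:
                continue                               # skip an improper cavity
            cavity = gauss(GRID, cav_pm / cav_prec, 1.0 / cav_prec)
            # 2. Incorporate the exact factor, 3. project by matching mean and variance.
            mean, var = moments(cavity * factors[n])
            q_prec, q_pm = 1.0 / var, mean / var
            site_prec[n], site_pm[n] = q_prec - cav_prec, q_pm - cav_pm
    return q_pm / q_prec, 1.0 / q_prec

def exact_moments(x, w=0.3, clutter_var=10.0, prior_var=4.0):
    post = gauss(GRID, 0.0, prior_var)
    for f in exact_factors(x, w, clutter_var):
        post = post * f
    return moments(post)

obs = [1.8, 2.3, 2.1, -3.0, 2.6, 1.9]                 # made-up data with one outlier
print(ep_clutter(obs))                                 # EP mean and variance
print(exact_moments(obs))                              # grid-based reference values
```

Storing the sites as natural parameters turns "remove" and "incorporate" into subtraction and addition, which is the usual way to organize EP for exponential family approximations.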

Check: How does EP differ from mean field variational inference?

Chapter 9: KL Direction Comparison

The two KL directions lead to qualitatively different approximations. Explore the difference below.

[Simulation: KL(q||p) vs KL(p||q). Control: mode separation (shown at 3.0).]

The true posterior p (gray) is bimodal. Orange: KL(q||p) minimizer (variational, mode-seeking). Teal: KL(p||q) minimizer (EP-style, moment-matching).

Choose your KL wisely: KL(q||p) (variational) finds the single best mode — good when you care about the most likely region. KL(p||q) (EP) tries to cover everything — good when you need to average over the full posterior (e.g., for predictions). In practice, modern methods like stochastic variational inference and normalizing flows can partially overcome the limitations of both.
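
The same comparison can be reproduced offline with a short script. The sketch below assumes a symmetric two-component Gaussian mixture with mode separation 3.0 and a per-mode standard deviation of 0.5 (the latter is a guess, not a value from the text), fits a single Gaussian q under each KL direction, and uses a coarse grid search for KL(q||p) purely for simplicity.

```python
import numpy as np

def log_gauss(z, mean, std):
    return -0.5 * np.log(2.0 * np.pi * std ** 2) - 0.5 * ((z - mean) / std) ** 2

def fit_both_kl_directions(separation=3.0, mode_std=0.5):
    z = np.linspace(-10.0, 10.0, 4001)
    dz = z[1] - z[0]
    # Bimodal target: equal mixture of two Gaussians centred at +/- separation / 2.
    log_p = np.log(0.5) + np.logaddexp(log_gauss(z, -separation / 2, mode_std),
                                       log_gauss(z, separation / 2, mode_std))
    p = np.exp(log_p)
    # KL(p||q) over Gaussians is minimized by matching the mean and variance of p.
    fwd_mean = np.sum(z * p) * dz
    fwd_std = np.sqrt(np.sum((z - fwd_mean) ** 2 * p) * dz)
    # KL(q||p) has no closed form here, so search over a grid of (mean, std) pairs.
    best_kl, rev_mean, rev_std = np.inf, None, None
    for mean in np.linspace(-3.0, 3.0, 61):
        for std in np.linspace(0.2, 2.5, 47):
            log_q = log_gauss(z, mean, std)
            kl = np.sum(np.exp(log_q) * (log_q - log_p)) * dz
            if kl < best_kl:
                best_kl, rev_mean, rev_std = kl, mean, std
    return (fwd_mean, fwd_std), (rev_mean, rev_std)

(fwd_mean, fwd_std), (rev_mean, rev_std) = fit_both_kl_directions()
print("KL(p||q):", fwd_mean, fwd_std)   # centred between the modes, broad
print("KL(q||p):", rev_mean, rev_std)   # sits on one mode, narrow
```

Which of the two modes the KL(q||p) fit lands on is arbitrary here, since the target is symmetric and both choices give the same divergence.
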
Check: For a bimodal posterior, how does minimizing KL(q||p) behave?

Chapter 10: Summary

Method              KL direction              Behavior           Strengths
Mean field VI       KL(q||p)                  Mode-seeking       Fast, scalable, guaranteed convergence
EP                  KL(p||q) locally          Moment-matching    Better calibrated, covers posterior mass
Local variational   Bound-based               Tight bounds       Handles non-conjugate terms
Laplace (Ch 4)      N/A (mode + curvature)    Local Gaussian     Simple, one-shot
The variational principle: Variational inference converts intractable integration into tractable optimization. The ELBO connects it to EM. The mean field assumption makes it scalable. The KL direction determines the approximation's character. This framework underpins modern generative models: VAEs use amortized variational inference, and diffusion models can be interpreted through variational bounds.

What comes next: Chapter 11 covers sampling methods (MCMC) — the stochastic alternative to variational inference. Instead of approximating the posterior analytically, we generate samples from it.

"Variational methods allow us to recast inference problems
as optimization problems, which we can then solve using a
variety of efficient techniques."
— Christopher Bishop, PRML §10
Check: What is the core idea of variational inference?