Bishop PRML, Chapter 10

Approximate Inference

Variational inference, mean field approximation, variational EM, and expectation propagation — when exact inference is intractable.

Prerequisites: Chapters 2, 8–9 (distributions, graphical models, EM, ELBO).

Chapter 0: Why Approximate Inference?

Bayesian inference requires computing the posterior p(Z|X) = p(X|Z)p(Z) / p(X). The denominator p(X) = ∫ p(X|Z)p(Z) dZ is an integral (or sum) over all latent variables.

For most interesting models, this integral is intractable:

• The latent space is too high-dimensional (deep latent variable models).

• The likelihood and prior are not conjugate (non-Gaussian posteriors).

• The graphical model has loops (exact message passing is no longer efficient).

Two families of approximation:
Deterministic: Variational methods (this chapter) — approximate the posterior with a simpler distribution by optimization. Fast but biased.
Stochastic: Sampling methods (Chapter 11) — generate samples from the posterior via MCMC. Exact in the limit but slow to converge.
Modern ML uses both: variational autoencoders use variational inference for training and sometimes MCMC for evaluation.

Check: Why is exact Bayesian inference often intractable?

• The normalization integral over all latent variables is too expensive to compute for most models
• Bayes' theorem is too complex to apply
• The prior distribution is always unknown

Chapter 1: Variational Inference

Recall the ELBO decomposition from Chapter 9:

ln p(X) = L(q) + KL(q || p(Z|X))

In EM, we set q = p(Z|X) exactly. In variational inference, we restrict q to a tractable family Q and find the best approximation:

q*(Z) = arg min_{q ∈ Q} KL(q(Z) || p(Z|X))

Equivalently (since ln p(X) is constant w.r.t. q), we maximize the ELBO:

q* = arg max_{q ∈ Q} L(q) = arg max_{q ∈ Q} { E_q[ln p(X,Z)] − E_q[ln q(Z)] }

Inference becomes optimization: Instead of computing an intractable integral, we search the family Q for the distribution q that is closest to the true posterior, with the ELBO as the objective. This is the key insight of variational inference: it replaces integration with optimization, for which mature numerical tools exist.
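The decomposition ln p(X) = L(q) + KL(q || p(Z|X)) can be checked numerically on a model small enough to enumerate. The sketch below assumes a hypothetical toy model (one discrete latent variable with three states and made-up probabilities); none of the numbers come from the text.

```python
import numpy as np

# Hypothetical toy model: one discrete latent Z with 3 states, one fixed observation X.
prior = np.array([0.5, 0.3, 0.2])          # p(Z)
likelihood = np.array([0.10, 0.40, 0.70])  # p(X | Z) for the observed X

joint = prior * likelihood                 # p(X, Z)
evidence = joint.sum()                     # p(X), tractable here only because Z is tiny
posterior = joint / evidence               # p(Z | X)

def elbo(q):
    """L(q) = E_q[ln p(X, Z)] - E_q[ln q(Z)]."""
    return float(np.sum(q * (np.log(joint) - np.log(q))))

def kl(q, p):
    """KL(q || p) for discrete distributions with full support."""
    return float(np.sum(q * (np.log(q) - np.log(p))))

q = np.array([0.6, 0.3, 0.1])              # an arbitrary approximating distribution

print(np.log(evidence), elbo(q) + kl(q, posterior))  # the two numbers agree
print(elbo(posterior) - np.log(evidence))            # ~0: the bound is tight at q = posterior
```

Because KL is non-negative, any q gives L(q) ≤ ln p(X), which is why maximizing the ELBO and minimizing the KL divergence are the same problem.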

Check: What does variational inference optimize?

• The ELBO (equivalently, minimizing KL divergence between the approximate and true posterior)
• The log-likelihood directly
• The prior distribution

Chapter 2: Mean Field Approximation

The most common variational family is the mean field (fully factorized) approximation:

q(Z) = ∏_{i=1}^{M} q_i(Z_i)

Each group of latent variables Z_i is treated independently. We optimize each factor q_i while holding the others fixed.

The optimal factor for Z_j has a beautiful closed form:

ln q_j*(Z_j) = E_{i≠j}[ln p(X, Z)] + const

The mean field update: The optimal q_j is proportional to the exponential of the expected log joint, where the expectation is taken over all other factors and the constant is fixed by normalization. This is coordinate ascent: optimize one factor at a time, cycling through all factors until convergence. Each update is guaranteed to increase (or maintain) the ELBO. The result depends on the form of the joint distribution: for conjugate exponential family models, each q_j has the same functional form as its conditional in the original model.

Mean field ignores correlations between variable groups. This makes the approximation fast but potentially inaccurate: it tends to underestimate posterior variances and miss multimodality.
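Variance underestimation is easy to see when the target is itself a correlated two-dimensional Gaussian, where the mean field updates are available in closed form. The sketch below is a minimal illustration under that assumption; the target mean, covariance, and starting point are hypothetical values chosen for the demo.

```python
import numpy as np

# Hypothetical target: a strongly correlated 2-D Gaussian p(z) = N(z | mu, Sigma).
mu = np.array([0.0, 0.0])
Sigma = np.array([[1.0, 0.9],
                  [0.9, 1.0]])
Lam = np.linalg.inv(Sigma)           # precision matrix

# Mean field q(z) = q1(z1) q2(z2). For a Gaussian target each optimal factor is
# Gaussian with variance 1 / Lam[i, i]; only the means need iterating.
m = np.array([3.0, -2.0])            # arbitrary starting means
for _ in range(200):
    m[0] = mu[0] - Lam[0, 1] / Lam[0, 0] * (m[1] - mu[1])
    m[1] = mu[1] - Lam[1, 0] / Lam[1, 1] * (m[0] - mu[0])

mf_var = 1.0 / np.diag(Lam)          # variances of the mean field factors
true_var = np.diag(Sigma)            # true marginal variances

print(m)                             # converges to the true mean
print(mf_var, true_var)              # mean field variances are much smaller
```

The means converge to the true mean, but each factor's variance is 1/Λ_ii rather than Σ_ii, so the stronger the correlation, the more the factorized approximation shrinks.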

Check: What does the mean field approximation assume?

• The variational posterior factorizes into independent groups: q(Z) = product of q_i(Z_i)
• The posterior is Gaussian
• All variables are observed

Chapter 3: Example: Univariate Gaussian

To illustrate, consider inferring the mean μ and precision τ of a Gaussian from data x₁, ..., x_N. The true posterior p(μ, τ|x) does not factorize: the posterior uncertainty in μ depends on τ, so the two variables are coupled.

The mean field approximation: q(μ, τ) = q_μ(μ) q_τ(τ).

Applying the optimal update formula:

q_μ(μ) = N(μ | μ_N, λ_N⁻¹) (Gaussian)

q_τ(τ) = Gam(τ | a_N, b_N) (Gamma)

Conjugacy is preserved: The variational factors have the same functional form as the conditional priors. The Gaussian conjugate prior on μ gives a Gaussian variational factor. The Gamma conjugate prior on τ gives a Gamma variational factor. This is a general property of conjugate exponential family models: mean field variational inference stays within the same family.

The updates couple through expectations: q_μ depends on E_{q_τ}[τ] (the expected precision), and q_τ depends on the moments E_{q_μ}[μ] and E_{q_μ}[μ²]. We iterate the two updates until convergence.
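The coupled updates can be sketched in a few lines. This is a minimal illustration that assumes the conjugate prior p(μ|τ) = N(μ | μ₀, (λ₀τ)⁻¹), p(τ) = Gam(τ | a₀, b₀), which the text does not spell out; the synthetic data and the hyperparameter values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=0.5, size=200)   # synthetic data (hypothetical)
N, xbar = x.size, x.mean()

# Assumed broad prior hyperparameters.
mu0, lam0, a0, b0 = 0.0, 1e-3, 1e-3, 1e-3

# These two quantities do not depend on the other factor, so they are set once.
mu_N = (lam0 * mu0 + N * xbar) / (lam0 + N)
a_N = a0 + (N + 1) / 2

E_tau = 1.0                                    # initial guess for E[tau]
for _ in range(50):
    # Update q_mu = N(mu | mu_N, 1 / lam_N) using the current E[tau].
    lam_N = (lam0 + N) * E_tau
    E_mu, E_mu2 = mu_N, mu_N**2 + 1.0 / lam_N
    # Update q_tau = Gam(tau | a_N, b_N) using the moments of q_mu.
    b_N = b0 + 0.5 * (np.sum(x**2) - 2 * E_mu * np.sum(x) + N * E_mu2
                      + lam0 * (E_mu2 - 2 * E_mu * mu0 + mu0**2))
    E_tau = a_N / b_N

print(mu_N, xbar)              # posterior mean of mu is close to the sample mean
print(1.0 / E_tau, x.var())    # 1 / E[tau] is close to the sample variance
```

With a broad prior and this much data, the variational solution lands close to the maximum likelihood estimates, as one would hope.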

Check: In the Gaussian example, what form does the variational factor q_mu take?

• A Gaussian: conjugacy is preserved under the mean field approximation
• A uniform distribution
• A Dirichlet distribution

Chapter 4: Variational Gaussian Mixtures

The Bayesian GMM puts priors on all parameters: Dirichlet on π, Gaussian-Wishart on (μ_k, Λ_k). The mean field approximation factorizes:

q(Z, π, μ, Λ) = q(Z) q(π) ∏_{k=1}^{K} q(μ_k, Λ_k)

The resulting update equations resemble EM but are fully Bayesian:

Update q(Z): Responsibilities now use expected parameters (expected log mixing coefficients, expected log precision, etc.).

Update q(π): Dirichlet with updated concentration parameters.

Update q(μ_k, Λ_k): Gaussian-Wishart with updated parameters.

Automatic model selection: Unlike ML estimation, the variational Bayesian GMM can determine the effective number of components automatically. Start with a large K. As the algorithm converges, components that explain little data have their expected mixing coefficients E[π_k] driven toward zero, so they are effectively pruned. The ELBO, a lower bound on the log evidence, naturally penalizes complexity, which reduces the need for cross-validation. This also avoids the singularity problem of ML: the priors prevent components from collapsing onto single points.
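The pruning effect shows up directly in the Dirichlet update. The sketch below assumes a symmetric Dirichlet prior with concentration α₀ and uses made-up effective counts N_k (the sums of responsibilities per component); it covers only the q(π) step, not a full variational GMM.

```python
import numpy as np

K = 6
alpha0 = 1e-3                        # assumed small symmetric prior concentration

# Hypothetical effective counts N_k = sum_n r_nk after convergence:
# two components explain all the data, four explain none.
N_k = np.array([120.0, 80.0, 0.0, 0.0, 0.0, 0.0])

alpha = alpha0 + N_k                 # updated Dirichlet concentration parameters
E_pi = alpha / alpha.sum()           # expected mixing coefficients under q(pi)

print(np.round(E_pi, 6))             # unused components get weights near zero
```

A small α₀ favors solutions in which unused components receive negligible weight, which is the mechanism behind the pruning described above.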

Check: What key advantage does the variational Bayesian GMM have over the ML GMM?

• It can automatically determine the number of components by pruning unnecessary ones, and avoids singularities
• It always converges faster
• It uses fewer data points

Chapter 5: Variational Linear Regression

For Bayesian linear regression with unknown weight precision α and noise precision β, the variational treatment factorizes q(w, α, β) = q(w) q(α) q(β).

The optimal factors are:

q(w) = N(w|m_N, S_N)

q(α) = Gam(α|a_α, b_α), q(β) = Gam(β|a_β, b_β)

Comparison to the evidence framework: The evidence framework (Ch 3) point-estimates α and β by maximizing the marginal likelihood. Variational inference maintains full posterior distributions over them. In practice, the variational approach is more robust when data is scarce — it properly accounts for uncertainty in the hyperparameters rather than committing to point estimates.

The predictive distribution integrates over the weight posterior, giving both a prediction and calibrated uncertainty bands.
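The three factors can be updated in turn, much like the Gaussian example. The sketch below assumes Gamma priors Gam(α | a₀, b₀) and Gam(β | c₀, d₀), a quadratic polynomial basis, and synthetic data; all of these are illustrative choices rather than details given in the text.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 100
x = rng.uniform(-1, 1, size=N)
Phi = np.column_stack([np.ones(N), x, x**2])    # assumed polynomial basis
w_true = np.array([0.5, -1.0, 2.0])             # hypothetical ground truth
t = Phi @ w_true + rng.normal(scale=0.1, size=N)
M = Phi.shape[1]

a0 = b0 = c0 = d0 = 1e-3                        # assumed broad Gamma priors
E_alpha, E_beta = 1.0, 1.0                      # initial expectations

for _ in range(100):
    # q(w) = N(w | m_N, S_N), using the current expected precisions.
    S_N = np.linalg.inv(E_alpha * np.eye(M) + E_beta * Phi.T @ Phi)
    m_N = E_beta * S_N @ Phi.T @ t
    # q(alpha) = Gam(a_N, b_N), using E[w^T w] = m_N^T m_N + Tr(S_N).
    a_N = a0 + M / 2
    b_N = b0 + 0.5 * (m_N @ m_N + np.trace(S_N))
    E_alpha = a_N / b_N
    # q(beta) = Gam(c_N, d_N), using the expected squared residual.
    resid = t - Phi @ m_N
    c_N = c0 + N / 2
    d_N = d0 + 0.5 * (resid @ resid + np.trace(Phi.T @ Phi @ S_N))
    E_beta = c_N / d_N

print(m_N)                     # close to w_true
print(1.0 / np.sqrt(E_beta))   # roughly the true noise standard deviation 0.1
```

Unlike the evidence framework, the loop never commits to a single α or β; it only ever uses their expectations under the current Gamma factors.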

Check: How does variational linear regression differ from the evidence framework?

• It maintains full posterior distributions over hyperparameters instead of point estimates
• It uses a different basis function set
• It cannot compute predictions

Chapter 6: Local Variational Methods

Instead of approximating the posterior globally (mean field), local variational methods introduce bounds on individual terms in the likelihood.

The key idea: replace a difficult nonlinear function with a simpler bound that can be optimized. For example, the logistic sigmoid σ(a) = 1/(1 + e^{−a}) has no conjugate prior. But we can bound it:

σ(a) ≥ σ(ξ) exp{(a − ξ)/2 − λ(ξ)(a² − ξ²)}

where ξ is a variational parameter and λ(ξ) = [σ(ξ) − 1/2]/(2ξ).

The bound trick: By replacing the sigmoid with its variational lower bound, the integrand becomes the exponential of a quadratic in a, so the intractable integral becomes Gaussian and can be computed in closed form. The variational parameter ξ is then optimized to make the bound as tight as possible. This trades one hard integral for an easier optimization problem. For a single fixed a, the bound holds with equality at ξ = ±a.

This technique is particularly useful for models with non-conjugate likelihoods (classification with Gaussian priors), where mean field alone doesn't yield tractable updates.
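A quick numerical check makes the bound concrete. The sketch below evaluates the bound on a grid for one arbitrary choice of ξ; the function names and the grid are illustrative.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lam(xi):
    """lambda(xi) = [sigmoid(xi) - 1/2] / (2 xi), for xi != 0."""
    return (sigmoid(xi) - 0.5) / (2.0 * xi)

def sigmoid_lower_bound(a, xi):
    """sigmoid(xi) * exp{(a - xi)/2 - lambda(xi) (a^2 - xi^2)}."""
    return sigmoid(xi) * np.exp((a - xi) / 2.0 - lam(xi) * (a**2 - xi**2))

a = np.linspace(-6.0, 6.0, 241)
xi = 2.5                                   # arbitrary variational parameter
gap = sigmoid(a) - sigmoid_lower_bound(a, xi)

print(gap.min())                           # never meaningfully below zero
print(sigmoid_lower_bound(xi, xi) - sigmoid(xi))     # 0 at a = +xi
print(sigmoid_lower_bound(-xi, xi) - sigmoid(-xi))   # ~0 at a = -xi
```

The bound touches the sigmoid at a = ±ξ and falls below it everywhere else, and its logarithm is quadratic in a, which is what lets it combine with a Gaussian prior.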

Check: What do local variational methods do?

• Replace difficult nonlinear terms with optimizable bounds, making intractable integrals tractable
• Approximate the prior distribution
• Sample from the posterior

Chapter 7: Variational Logistic Regression

Combining the local variational bound on the sigmoid with a Gaussian prior on weights gives a fully variational treatment of logistic regression.

The variational posterior over weights is Gaussian: q(w) = N(w|m, S), where:

S⁻¹ = A + 2 ∑_n λ(ξ_n) φ_nφ_n^T

m = S ∑_n (t_n − ½) φ_n

where A = diag(α₁, ..., α_M) are the prior precisions.

Comparison of approaches to Bayesian logistic regression: The Laplace approximation (Ch 4) fits a Gaussian at the posterior mode. The variational approach fits a Gaussian by minimizing KL divergence, which better captures the posterior's shape (especially the tails). The variational bound also gives a lower bound on the evidence, useful for model selection.

The variational parameters ξ_n and hyperparameters α_i are updated alternately until convergence. When ARD priors are used (one α_i per weight), sparsity emerges naturally, just as in the RVM from Chapter 7 of PRML.
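The alternating scheme can be sketched as follows. To keep the example short it holds the prior precision fixed at A = αI instead of updating the α_i, and it re-estimates each ξ_n from ξ_n² = φ_nᵀ(S + m mᵀ)φ_n, the standard update for this bound; the data are synthetic and every numeric value is a hypothetical choice.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lam(xi):
    return (sigmoid(xi) - 0.5) / (2.0 * xi)

rng = np.random.default_rng(0)
N = 200
Phi = np.column_stack([np.ones(N), rng.normal(size=(N, 2))])  # bias + 2 features
w_true = np.array([-0.5, 2.0, -1.0])                          # hypothetical weights
t = (rng.uniform(size=N) < sigmoid(Phi @ w_true)).astype(float)

alpha = 1.0                               # fixed prior precision (simplification)
A = alpha * np.eye(Phi.shape[1])
xi = np.ones(N)                           # initial variational parameters

for _ in range(50):
    # Gaussian posterior given the current bounds:
    # S^-1 = A + 2 sum_n lambda(xi_n) phi_n phi_n^T,  m = S sum_n (t_n - 1/2) phi_n
    S = np.linalg.inv(A + 2.0 * (Phi.T * lam(xi)) @ Phi)
    m = S @ (Phi.T @ (t - 0.5))
    # Re-tighten each bound: xi_n^2 = phi_n^T (S + m m^T) phi_n
    xi = np.sqrt(np.einsum("ni,ij,nj->n", Phi, S + np.outer(m, m), Phi))

print(m)                                  # posterior mean of the weights
print(np.sqrt(np.diag(S)))                # posterior standard deviations
```

Each pass solves a Gaussian problem exactly and then re-tightens the bounds, so no gradient steps or step sizes are needed.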

Check: How does the variational approach to logistic regression differ from the Laplace approximation?

• Variational minimizes KL divergence (better posterior shape), while Laplace fits a Gaussian at the mode
• They are identical
• Variational requires more data

Chapter 8: Expectation Propagation

Expectation propagation (EP) is an alternative to variational inference that uses a different KL direction. While mean field minimizes KL(q||p), EP minimizes KL(p||q) locally.

The key difference in KL direction:

KL(q||p) (variational):
q avoids regions where p is small. Tends to be mode-seeking — locks onto one mode of a multimodal posterior. Underestimates variance.

KL(p||q) (EP):
q covers all regions where p is significant. Tends to be moment-matching — spreads mass to cover all modes. Overestimates variance for multimodal posteriors.

EP algorithm: EP approximates the true posterior p(Z|X) ∝ ∏_n f_n(Z) by a product of simpler factors q(Z) ∝ ∏_n f̃_n(Z). It refines each approximate factor f̃_n one at a time: remove it from q, incorporate the exact factor f_n, then project back to the approximate family by matching moments. This is "assumed density filtering" iterated until convergence.

EP is often more accurate than mean field for classification problems because the moment-matching property gives better-calibrated uncertainties. It's used in practical systems like the Bayes Point Machine.
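The remove, incorporate, project cycle is easiest to follow in one dimension, where the moments of the tilted distribution can be computed on a grid. The sketch below is a toy illustration under assumptions that are not in the text: a prior N(z | 0, 1), four logistic factors f_n(z) = σ(y_n z) with made-up labels, and Gaussian approximate factors stored in natural parameters.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# Hypothetical 1-D problem: prior N(z | 0, 1) and factors f_n(z) = sigmoid(y_n * z).
y = np.array([1.0, 1.0, -1.0, 1.0])
grid = np.linspace(-10.0, 10.0, 4001)     # quadrature grid for tilted moments

# Gaussians in natural parameters: exp(-0.5 * r * z**2 + h * z).
r0, h0 = 1.0, 0.0                          # prior
r_site = np.zeros(y.size)                  # approximate factors start flat
h_site = np.zeros(y.size)

for sweep in range(20):
    for n in range(y.size):
        # 1. Remove site n from q to get the cavity distribution.
        r_cav = r0 + r_site.sum() - r_site[n]
        h_cav = h0 + h_site.sum() - h_site[n]
        # 2. Multiply the cavity by the exact factor (tilted distribution).
        tilted = np.exp(-0.5 * r_cav * grid**2 + h_cav * grid) * sigmoid(y[n] * grid)
        tilted /= tilted.sum()
        # 3. Project back to a Gaussian by matching mean and variance.
        mean = np.sum(tilted * grid)
        var = np.sum(tilted * (grid - mean) ** 2)
        # 4. New site = matched Gaussian divided by the cavity.
        r_site[n] = 1.0 / var - r_cav
        h_site[n] = mean / var - h_cav

ep_var = 1.0 / (r0 + r_site.sum())
ep_mean = ep_var * (h0 + h_site.sum())

# Reference: exact posterior moments by brute-force quadrature (feasible only in 1-D).
exact = np.exp(-0.5 * grid**2) * np.prod(sigmoid(np.outer(y, grid)), axis=0)
exact /= exact.sum()
exact_mean = np.sum(exact * grid)
exact_var = np.sum(exact * (grid - exact_mean) ** 2)

print(ep_mean, exact_mean)
print(ep_var, exact_var)
```

In higher dimensions the grid is not available, so the moment-matching step has to be done analytically or with low-dimensional quadrature; that is where most of the practical work in EP lies.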

Check: How does EP differ from mean field variational inference?

• EP minimizes KL(p||q) via moment matching, while variational inference minimizes KL(q||p)
• EP is always more accurate
• EP cannot handle classification

Chapter 9: KL Direction Comparison

The two KL directions lead to qualitatively different approximations, as the figure below illustrates.

[Figure: KL(q||p) vs KL(p||q). The true posterior p (gray) is bimodal. Orange: KL(q||p) minimizer (variational, mode-seeking). Teal: KL(p||q) minimizer (EP-style, moment-matching). Mode separation: 3.0.]

Choose your KL wisely: KL(q||p) (variational) finds the single best mode — good when you care about the most likely region. KL(p||q) (EP) tries to cover everything — good when you need to average over the full posterior (e.g., for predictions). In practice, modern methods like stochastic variational inference and normalizing flows can partially overcome the limitations of both.
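Both minimizers can be computed numerically for a one-dimensional bimodal target. The sketch below assumes an equal-weight mixture of two unit-variance Gaussians with modes placed at ±3 (one possible reading of a separation setting of 3.0), finds the KL(p||q) solution by moment matching, and finds the KL(q||p) solution by brute-force search over a grid of Gaussian candidates.

```python
import numpy as np

z = np.linspace(-12.0, 12.0, 2401)
dz = z[1] - z[0]

def normal_pdf(z, mean, std):
    return np.exp(-0.5 * ((z - mean) / std) ** 2) / (std * np.sqrt(2 * np.pi))

sep = 3.0   # assumption: modes at -sep and +sep
p = 0.5 * normal_pdf(z, -sep, 1.0) + 0.5 * normal_pdf(z, sep, 1.0)

# KL(p||q) over Gaussians is minimized by matching the mean and variance of p.
mean_pq = np.sum(p * z) * dz
std_pq = np.sqrt(np.sum(p * (z - mean_pq) ** 2) * dz)

def kl_qp(mean, std):
    """KL(q||p) for q = N(mean, std^2), by numerical integration on the grid."""
    q = normal_pdf(z, mean, std)
    mask = q > 1e-300                      # skip points where q underflows to zero
    return np.sum(q[mask] * (np.log(q[mask]) - np.log(p[mask]))) * dz

# KL(q||p) has no closed form here, so search a grid of candidate Gaussians.
candidates = [(m, s) for m in np.linspace(-5, 5, 101)
                     for s in np.linspace(0.3, 5.0, 95)]
mean_qp, std_qp = min(candidates, key=lambda ms: kl_qp(*ms))

print(mean_pq, std_pq)    # centered between the modes, wide
print(mean_qp, std_qp)    # sitting on one mode, narrow
```

The KL(p||q) solution straddles both modes and places most of its mass where p is small, while the KL(q||p) solution picks a single mode and reports roughly that mode's width.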

Check: For a bimodal posterior, how does minimizing KL(q||p) behave?

• The Gaussian q locks onto one mode (mode-seeking), underestimating variance
• The Gaussian q centers between the modes, covering both
• The approximation becomes exact

Chapter 10: Summary

Method	KL direction	Behavior	Strengths
Mean field VI	KL(q||p)	Mode-seeking	Fast, scalable, guaranteed convergence
EP	KL(p||q) locally	Moment-matching	Better calibrated, covers posterior mass
Local variational	Bound-based	Tight bounds	Handles non-conjugate terms
Laplace (Ch 4)	N/A (mode + curvature)	Local Gaussian	Simple, one-shot

The variational principle: Variational inference converts intractable integration into tractable optimization. The ELBO connects it to EM. The mean field assumption makes it scalable. The KL direction determines the approximation's character. This framework underpins modern generative models: VAEs use amortized variational inference, and diffusion models can be interpreted through variational bounds.

What comes next: Chapter 11 covers sampling methods (MCMC) — the stochastic alternative to variational inference. Instead of approximating the posterior analytically, we generate samples from it.

"Variational methods allow us to recast inference problems
as optimization problems, which we can then solve using a
variety of efficient techniques."
— Christopher Bishop, PRML §10

Check: What is the core idea of variational inference?

• Turn intractable integration into tractable optimization by finding the best approximation within a simpler family
• Generate samples from the posterior
• Compute the exact posterior using message passing