Variational inference, mean field approximation, variational EM, and expectation propagation — when exact inference is intractable.
Bayesian inference requires computing the posterior p(Z|X) = p(X|Z)p(Z) / p(X). The denominator p(X) = ∫ p(X|Z)p(Z) dZ is an integral (or sum) over all latent variables.
For most interesting models, this integral is intractable:
• The latent space is too high-dimensional (deep latent variable models).
• The likelihood and prior are not conjugate (non-Gaussian posteriors).
• The graphical model has loops (no exact message passing).
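Recall the ELBO decomposition from Chapter 9: ln p(X) = L(q) + KL(q || p(Z|X)), where L(q) = ∫ q(Z) ln[p(X, Z) / q(Z)] dZ is the evidence lower bound (ELBO).
In EM, we set q = p(Z|X) exactly. In variational inference, we restrict q to a tractable family Q and find the best approximation: q* = argmin over q ∈ Q of KL(q || p(Z|X)).
Equivalently (since ln p(X) is constant w.r.t. q), we maximize the ELBO: q* = argmax over q ∈ Q of L(q).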
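The decomposition ln p(X) = L(q) + KL(q || p(Z|X)) can be checked numerically on a tiny discrete model. The sketch below is only an illustration: the three-state latent variable and all probabilities are made-up numbers, and the helper names `elbo` and `kl` are hypothetical.

```python
import numpy as np

# Toy model (all numbers are made up): Z takes 3 values, X is one fixed observation.
prior = np.array([0.5, 0.3, 0.2])      # p(Z)
lik = np.array([0.10, 0.40, 0.70])     # p(X = x_obs | Z)
joint = prior * lik                    # p(X, Z) at the observed X
evidence = joint.sum()                 # p(X)
posterior = joint / evidence           # p(Z | X)

def elbo(q):
    """L(q) = sum_z q(z) ln[p(x, z) / q(z)] for a strictly positive q."""
    return float(np.sum(q * (np.log(joint) - np.log(q))))

def kl(q, p):
    """KL(q || p) for strictly positive discrete distributions."""
    return float(np.sum(q * (np.log(q) - np.log(p))))

q = np.array([0.6, 0.3, 0.1])          # an arbitrary approximation
gap = np.log(evidence) - elbo(q)       # should equal KL(q || posterior)
print(gap, kl(q, posterior))
```

Because the gap equals the KL divergence, which is non-negative, L(q) can never exceed ln p(X), and it reaches ln p(X) only when q is the exact posterior.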
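The most common variational family is the mean field (fully factorized) approximation: q(Z) = ∏_i q_i(Z_i).
Each group of latent variables Z_i gets its own independent factor. We optimize each factor q_i while holding the others fixed.
The optimal factor for Z_j has a beautiful closed form: ln q_j*(Z_j) = E_{i≠j}[ln p(X, Z)] + const, where the expectation is taken over all the other factors q_i, i ≠ j.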
Mean field ignores correlations between variable groups. This makes the approximation fast but potentially inaccurate: it tends to underestimate posterior variances and miss multimodality.
To illustrate, consider inferring the mean μ and precision τ of a Gaussian from data x_1, ..., x_N. The true posterior p(μ, τ|x) does not factorize: the uncertainty about the mean depends on the precision, and vice versa.
The mean field approximation: q(μ, τ) = q_μ(μ) q_τ(τ).
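Applying the optimal update formula (assuming a conjugate Gaussian-Gamma prior) gives a Gaussian factor q_μ(μ) and a Gamma factor q_τ(τ).
The updates couple through expectations: q_μ depends on E[τ] under q_τ (the expected precision), and q_τ depends on E[μ] and E[μ²] under q_μ. We iterate until convergence.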
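The coupled updates can be sketched in a few lines. This is a minimal illustration, not a reference implementation: it assumes the conjugate prior p(μ|τ) = N(μ|μ0, (λ0 τ)⁻¹), p(τ) = Gamma(a0, rate b0), and the function name `cavi_gaussian` and default hyperparameters are hypothetical.

```python
import numpy as np

def cavi_gaussian(x, mu0=0.0, lam0=1.0, a0=1.0, b0=1.0, iters=50):
    """Coordinate ascent for q(mu) q(tau) under an assumed Gaussian-Gamma prior.

    Returns (mu_n, lam_n, a_n, b_n) with q(mu) = N(mu_n, 1/lam_n)
    and q(tau) = Gamma(a_n, rate=b_n).
    """
    x = np.asarray(x, dtype=float)
    n, xbar = x.size, x.mean()
    # The mean of q(mu) and the shape of q(tau) do not change across iterations.
    mu_n = (lam0 * mu0 + n * xbar) / (lam0 + n)
    a_n = a0 + 0.5 * (n + 1)
    e_tau = a0 / b0                        # initial guess for E[tau]
    b_n = b0
    for _ in range(iters):
        var_mu = 1.0 / ((lam0 + n) * e_tau)   # variance of q(mu) given E[tau]
        # E over q(mu) of sum_n (x_n - mu)^2 + lam0 * (mu - mu0)^2
        e_sq = (np.sum((x - mu_n) ** 2) + n * var_mu
                + lam0 * ((mu_n - mu0) ** 2 + var_mu))
        b_n = b0 + 0.5 * e_sq                 # rate of q(tau) given moments of mu
        e_tau = a_n / b_n
    lam_n = (lam0 + n) * e_tau
    return mu_n, lam_n, a_n, b_n

rng = np.random.default_rng(0)
data = rng.normal(2.0, 0.5, size=500)      # synthetic data, true precision 4
mu_n, lam_n, a_n, b_n = cavi_gaussian(data)
print(mu_n, a_n / b_n)                     # posterior mean of mu and E[tau]
```

Only the expected precision E[τ] and the variance of q_μ change from one sweep to the next, so the iteration is a one-dimensional fixed-point problem that typically settles within a handful of sweeps.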
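The Bayesian GMM puts priors on all parameters: Dirichlet on π, Gaussian-Wishart on (μ_k, Λ_k). The mean field approximation factorizes between latent assignments and parameters: q(Z, π, μ, Λ) = q(Z) q(π, μ, Λ). The further factorization q(π) ∏_k q(μ_k, Λ_k) then follows from the model structure without extra assumptions.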
The resulting update equations resemble EM but are fully Bayesian:
Update q(Z): Responsibilities now use expected parameters (expected log mixing coefficients, expected log precision, etc.).
Update q(π): Dirichlet with updated concentration parameters.
Update q(μ_k, Λ_k): Gaussian-Wishart with updated parameters.
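As a small illustration of the q(π) step, the sketch below assumes a symmetric Dirichlet prior with concentration `alpha0` and uses the standard Dirichlet results α_k = α_0 + N_k and E[ln π_k] = ψ(α_k) − ψ(Σ_j α_j), where N_k is the total responsibility assigned to component k. The function names, the toy responsibilities, and the hand-rolled `digamma` helper (used to avoid a SciPy dependency) are illustrative assumptions.

```python
import math
import numpy as np

def digamma(x):
    """Digamma for x > 0 via psi(x) = psi(x + 1) - 1/x and an asymptotic series."""
    result = 0.0
    while x < 6.0:                 # shift x upward until the series is accurate
        result -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    series = inv2 * (1.0 / 12.0 - inv2 * (1.0 / 120.0 - inv2 / 252.0))
    return result + math.log(x) - 0.5 / x - series

def update_q_pi(resp, alpha0=1.0):
    """resp: (N, K) responsibilities from q(Z).

    Returns the Dirichlet parameters of q(pi) and the expected log mixing
    coefficients that feed back into the next q(Z) update.
    """
    n_k = resp.sum(axis=0)                     # effective counts per component
    alpha = alpha0 + n_k
    e_log_pi = np.array([digamma(a) for a in alpha]) - digamma(alpha.sum())
    return alpha, e_log_pi

resp = np.array([[0.9, 0.1],                   # made-up responsibilities
                 [0.8, 0.2],
                 [0.3, 0.7]])
alpha, e_log_pi = update_q_pi(resp)
print(alpha, e_log_pi)
```

Note that exp(E[ln π_k]) sums to less than one. This is how the Bayesian treatment penalizes components with little support, in contrast to the point estimates of π used in EM.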
For Bayesian linear regression with unknown weight precision α and noise precision β, the variational treatment factorizes q(w, α, β) = q(w) q(α) q(β).
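Assuming conjugate Gamma priors on α and β, the optimal factors are a Gaussian q(w) = N(w|m, S) and Gamma distributions q(α) and q(β). Each factor depends on the others only through expectations such as E[α], E[β], and E[wᵀw].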
The predictive distribution integrates over the weight posterior, giving both a prediction and calibrated uncertainty bands.
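A minimal sketch of these updates follows. It assumes a zero-mean prior p(w|α) = N(0, α⁻¹I) and Gamma(shape, rate) priors on α and β. The names `vb_linear_regression` and `predict`, the hyperparameter defaults, and the synthetic data are all hypothetical. The predictive variance plugs in E[β] for the noise precision, which is an approximation rather than the exact predictive.

```python
import numpy as np

def vb_linear_regression(Phi, t, a0=1e-2, b0=1e-2, c0=1e-2, d0=1e-2, iters=100):
    """Mean field q(w) q(alpha) q(beta) for t = Phi w + noise.

    Assumed priors: w ~ N(0, I/alpha), alpha ~ Gamma(a0, rate=b0),
    beta ~ Gamma(c0, rate=d0). Returns (mean, S, E[alpha], E[beta]).
    """
    n, m = Phi.shape
    PtP, Ptt = Phi.T @ Phi, Phi.T @ t
    a_n, c_n = a0 + 0.5 * m, c0 + 0.5 * n      # Gamma shapes stay fixed
    e_alpha, e_beta = 1.0, 1.0                 # initial expected precisions
    for _ in range(iters):
        # q(w): Gaussian using the current expected precisions
        S = np.linalg.inv(e_alpha * np.eye(m) + e_beta * PtP)
        mean = e_beta * S @ Ptt
        # q(alpha): rate uses E[w^T w] = m^T m + tr(S)
        b_n = b0 + 0.5 * (mean @ mean + np.trace(S))
        # q(beta): rate uses E||t - Phi w||^2 = ||t - Phi m||^2 + tr(Phi^T Phi S)
        resid = t - Phi @ mean
        d_n = d0 + 0.5 * (resid @ resid + np.trace(PtP @ S))
        e_alpha, e_beta = a_n / b_n, c_n / d_n
    return mean, S, e_alpha, e_beta

def predict(phi_new, mean, S, e_beta):
    """Predictive mean and variance, plugging in E[beta] for the noise precision."""
    return phi_new @ mean, 1.0 / e_beta + phi_new @ S @ phi_new

rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=200)
Phi = np.column_stack([np.ones_like(x), x])    # features: bias and slope
t = 0.5 + 2.0 * x + rng.normal(0.0, 0.1, size=x.size)
mean, S, e_alpha, e_beta = vb_linear_regression(Phi, t)
pred_mean, pred_var = predict(np.array([1.0, 0.3]), mean, S, e_beta)
print(mean, e_beta, pred_mean, pred_var)
```

The predictive variance has two parts: the noise term 1/E[β] and the weight-uncertainty term φᵀSφ, which grows for inputs far from the training data.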
Instead of approximating the posterior globally (mean field), local variational methods introduce bounds on individual terms in the likelihood.
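The key idea: replace a difficult nonlinear function with a simpler bound that can be optimized. For example, the logistic sigmoid σ(a) = 1/(1 + e^(−a)) is not conjugate to a Gaussian prior. But we can bound it from below by a function whose exponent is quadratic in a: σ(a) ≥ σ(ξ) exp{(a − ξ)/2 − λ(ξ)(a² − ξ²)},
where ξ is a variational parameter and λ(ξ) = [σ(ξ) − 1/2]/(2ξ). The bound is tight at a = ±ξ.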
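The bound can be verified numerically. The sketch below assumes the standard Jaakkola-Jordan form, σ(a) ≥ σ(ξ) exp{(a − ξ)/2 − λ(ξ)(a² − ξ²)}, with λ(ξ) as defined above. The function names and the choice ξ = 2.5 are illustrative only.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lam(xi):
    """lambda(xi) = [sigmoid(xi) - 1/2] / (2 xi); requires xi != 0."""
    return (sigmoid(xi) - 0.5) / (2.0 * xi)

def sigmoid_lower_bound(a, xi):
    """Lower bound on sigmoid(a) whose exponent is quadratic in a."""
    return sigmoid(xi) * np.exp((a - xi) / 2.0 - lam(xi) * (a ** 2 - xi ** 2))

a = np.linspace(-6.0, 6.0, 241)
xi = 2.5                                    # arbitrary variational parameter
gap = sigmoid(a) - sigmoid_lower_bound(a, xi)
print(gap.min(), gap.max())                 # gap is non-negative everywhere
```

Because the exponent is quadratic in a, and a is linear in the weights, multiplying this bound by a Gaussian prior gives an unnormalized Gaussian in the weights. That is what makes the bound useful for logistic regression.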
This technique is particularly useful for models with non-conjugate likelihoods (classification with Gaussian priors), where mean field alone doesn't yield tractable updates.
Combining the local variational bound on the sigmoid with a Gaussian prior on weights gives a fully variational treatment of logistic regression.
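The variational posterior over weights is Gaussian: q(w) = N(w|m, S), where, for a zero-mean Gaussian prior with precision matrix A, S⁻¹ = A + 2 Σ_n λ(ξ_n) φ_n φ_nᵀ and m = S Σ_n (t_n − 1/2) φ_n. Here φ_n is the feature vector and t_n ∈ {0, 1} the label of the n-th example,
and A = diag(α_1, ..., α_M) holds the prior precisions.
The variational parameters ξ_n and hyperparameters α_i are updated alternately until convergence, with ξ_n² = φ_nᵀ(S + m mᵀ)φ_n. When ARD priors are used (one α_i per weight), sparsity emerges naturally, just as in the RVM from Chapter 7.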
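The alternating scheme can be sketched as follows. It assumes a zero-mean Gaussian prior with fixed precisions, labels t_n ∈ {0, 1}, and the standard updates S⁻¹ = A + 2 Σ_n λ(ξ_n) φ_n φ_nᵀ, m = S Σ_n (t_n − 1/2) φ_n, and ξ_n² = φ_nᵀ(S + m mᵀ)φ_n. The hyperparameter (ARD) update is deliberately left out. The function name and the synthetic data are hypothetical.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lam(xi):
    return (sigmoid(xi) - 0.5) / (2.0 * xi)

def vb_logistic_regression(Phi, t, alpha, iters=50):
    """Phi: (N, M) features, t: (N,) labels in {0, 1}, alpha: (M,) prior precisions.

    The precisions are held fixed here; an ARD treatment would also update them.
    """
    n, m = Phi.shape
    A = np.diag(alpha)
    xi = np.ones(n)                            # one variational parameter per example
    for _ in range(iters):
        # Gaussian q(w) given the current bound parameters xi
        S = np.linalg.inv(A + 2.0 * (Phi.T * lam(xi)) @ Phi)
        mean = S @ (Phi.T @ (t - 0.5))
        # Re-tighten each bound: xi_n^2 = phi_n^T (S + m m^T) phi_n
        second_moment = S + np.outer(mean, mean)
        xi = np.sqrt(np.einsum("ni,ij,nj->n", Phi, second_moment, Phi))
    return mean, S, xi

rng = np.random.default_rng(1)
x = rng.normal(size=300)
Phi = np.column_stack([np.ones_like(x), x])
w_true = np.array([-0.5, 2.0])                 # made-up generating weights
t = (rng.uniform(size=x.size) < sigmoid(Phi @ w_true)).astype(float)
mean, S, xi = vb_logistic_regression(Phi, t, alpha=np.ones(2))
accuracy = np.mean((Phi @ mean > 0) == (t == 1))
print(mean, accuracy)
```

Each sweep solves a Gaussian problem exactly and then moves every ξ_n to the point where its bound is tightest under the current q(w), so the lower bound on the evidence never decreases.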
Expectation propagation (EP) is an alternative to variational inference that uses a different KL direction. While mean field minimizes KL(q||p), EP minimizes KL(p||q) locally.
The key difference in KL direction:
KL(q||p) (variational):
q avoids regions where p is small. Tends to be mode-seeking — locks onto one mode of a multimodal posterior. Underestimates variance.
KL(p||q) (EP):
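q covers all regions where p is significant. Tends to be mass-covering: for an exponential-family q, the minimizer matches the moments of p, so it spreads mass across all modes. For a multimodal posterior, the result is far broader than any single mode and places mass in the low-density regions between modes.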
EP is often more accurate than mean field for classification problems because the moment-matching property gives better-calibrated uncertainties. It's used in practical systems like the Bayes Point Machine.
The two KL directions lead to qualitatively different approximations.
The true posterior p (gray) is bimodal. Orange: KL(q||p) minimizer (variational, mode-seeking). Teal: KL(p||q) minimizer (EP-style, moment-matching).
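The same comparison can be reproduced numerically. The sketch below fits a single Gaussian q to a made-up bimodal target on a grid. It assumes an equal mixture of N(−4, 1) and N(4, 1) as the "posterior", uses moment matching for KL(p||q), and uses a brute-force grid search over (mean, standard deviation) for KL(q||p). This illustrates the two objectives and is not an implementation of EP itself.

```python
import numpy as np

z = np.linspace(-12.0, 12.0, 2401)
dz = z[1] - z[0]

def normal_pdf(z, m, s):
    return np.exp(-0.5 * ((z - m) / s) ** 2) / (s * np.sqrt(2.0 * np.pi))

# Made-up bimodal target: equal mixture of two well-separated Gaussians.
p = 0.5 * normal_pdf(z, -4.0, 1.0) + 0.5 * normal_pdf(z, 4.0, 1.0)

# KL(p || q) over Gaussians is minimized by matching the mean and variance of p.
m_incl = np.sum(z * p) * dz
s_incl = np.sqrt(np.sum((z - m_incl) ** 2 * p) * dz)

def kl_q_p(m, s):
    """KL(q || p) by numerical integration, skipping points where q underflows."""
    q = normal_pdf(z, m, s)
    mask = q > 1e-300
    return np.sum(q[mask] * (np.log(q[mask]) - np.log(p[mask]))) * dz

# KL(q || p): brute-force search over a coarse (mean, std) grid.
candidates = [(m, s)
              for m in np.linspace(-5.0, 5.0, 101)
              for s in np.linspace(0.3, 4.0, 75)]
m_excl, s_excl = min(candidates, key=lambda ms: kl_q_p(*ms))

print("KL(p||q) fit:", m_incl, s_incl)   # centered between the modes, wide
print("KL(q||p) fit:", m_excl, s_excl)   # sits on one mode, narrow
```

The two modes are symmetric, so which one the KL(q||p) search lands on is arbitrary. In practice this depends on initialization.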
| Method | KL direction | Behavior | Strengths |
|---|---|---|---|
| Mean field VI | KL(q\|\|p) | Mode-seeking | Fast, scalable, guaranteed convergence |
| EP | KL(p\|\|q) locally | Moment-matching | Better calibrated, covers posterior mass |
| Local variational | Bound-based | Tight bounds | Handles non-conjugate terms |
| Laplace (Ch 4) | N/A (mode + curvature) | Local Gaussian | Simple, one-shot |
What comes next: Chapter 11 covers sampling methods (MCMC) — the stochastic alternative to variational inference. Instead of approximating the posterior analytically, we generate samples from it.