Rejection sampling, importance sampling, MCMC, Metropolis-Hastings, Gibbs sampling, and Hamiltonian Monte Carlo — approximating intractable integrals with random samples.
Many quantities in Bayesian inference require computing expectations:
When this integral is intractable, we can approximate it using samples z(1), ..., z(L) drawn from p(z):
The challenge: how do we draw samples from a complex, high-dimensional distribution that we can only evaluate (up to a normalizing constant)?
The simplest sampling method. Suppose we can evaluate p̃(z) (the unnormalized target) and we have a proposal distribution q(z) that we can sample from and evaluate, with kq(z) ≥ p̃(z) for all z.
Algorithm:
1. Sample z0 from q(z).
2. Sample u uniformly from [0, kq(z0)].
3. If u ≤ p̃(z0), accept z0. Otherwise reject.
Adaptive rejection sampling: For log-concave distributions, we can construct an envelope from tangent lines to ln p̃(z), which automatically tightens as we add more tangent points. This improves the acceptance rate without requiring a hand-crafted proposal.
Instead of generating samples from p and averaging f(z), we can sample from a different distribution q and reweight:
where z(l) ~ q and wl = p(z(l))/q(z(l)) are the importance weights.
If we only know p̃ (unnormalized), we use self-normalized importance sampling:
where w̃l = p̃(z(l))/q(z(l)). This is biased but consistent.
SIR combines importance sampling with resampling to produce unweighted samples from (approximately) p:
1. Draw L samples z(1), ..., z(L) from proposal q(z).
2. Compute importance weights wl = p̃(z(l))/q(z(l)), normalize to w̃l.
3. Resample L' points from {z(l)} with probabilities {w̃l}.
The resampled points are approximately distributed as p, but with the limitation that they're drawn from a finite set. Taking L large and L' small gives good approximations.
The methods above don't scale to high dimensions. MCMC takes a fundamentally different approach: instead of independent samples, generate a chain of correlated samples that converges to the target distribution.
A Markov chain is a sequence z(1), z(2), ... where each sample depends only on the previous one via a transition kernel T(z(τ+1)|z(τ)). If the chain is:
• Ergodic (can reach any state from any state), and
• Satisfies detailed balance: p(z) T(z'|z) = p(z') T(z|z') for all z, z',
then the chain has p as its unique stationary distribution, and time averages converge to expectations under p.
The Metropolis-Hastings algorithm constructs a valid MCMC chain for any target distribution:
1. From current state z(τ), propose z* ~ q(z*|z(τ)).
2. Accept with probability:
3. If accepted, z(τ+1) = z*. Otherwise, z(τ+1) = z(τ).
If the proposal is symmetric (q(z*|z) = q(z|z*), e.g., a Gaussian centered at the current point), the ratio simplifies to p̃(z*)/p̃(z(τ)). This is the original Metropolis algorithm.
The proposal scale matters critically: too small → high acceptance but slow exploration (random walk). Too large → most proposals rejected. The optimal acceptance rate for Gaussian proposals in high dimensions is about 23%.
Watch Metropolis-Hastings sample from a 2D distribution. Observe how the proposal scale affects mixing.
Sampling from a banana-shaped target. Teal dots = accepted, dim dots = chain history. Adjust proposal scale.
Gibbs sampling is a special case of MH where the proposal is the full conditional distribution. For a D-dimensional vector z = (z1, ..., zD):
For each dimension i = 1, ..., D:
Gibbs sampling is widely used in Bayesian statistics (BUGS, JAGS, Stan's NUTS is HMC). It excels when conditionals are easy to sample from. It struggles with strong correlations between variables — each step can only move along one coordinate axis.
Slice sampling is an adaptive method that automatically tunes the proposal scale. The idea: sample uniformly from the region under the curve of p̃(z).
Algorithm:
1. At current z, sample u uniformly from [0, p̃(z)].
2. The "slice" S = {z' : p̃(z') > u} is the set of z-values where the target exceeds u.
3. Sample z' uniformly from S.
Hamiltonian Monte Carlo (HMC) uses gradient information to make large, directed moves through the distribution, avoiding the random walk behavior of vanilla MH.
The trick: introduce auxiliary "momentum" variables r and define the Hamiltonian:
Simulate Hamiltonian dynamics using the leapfrog integrator for L steps, then accept/reject.
HMC requires:
• Gradients of ln p̃(z) (available via automatic differentiation).
• Two tuning parameters: step size ε and number of leapfrog steps L.
The No-U-Turn Sampler (NUTS), used in Stan, automatically adapts both.
MCMC produces samples from p without knowing the normalizing constant Z. But sometimes we need Z itself (e.g., for model comparison via Bayes factors). How?
Bridge sampling: Introduce a sequence of distributions that "bridge" between a simple distribution q0 (known Z) and the target p (unknown Z):
where each pt is close to pt+1. The ratio of successive normalizing constants can be estimated from samples of pt, and the telescoping product gives Zp/Zq.
| Method | Key idea | Best for |
|---|---|---|
| Rejection sampling | Accept/reject under envelope | Low-D, simple targets |
| Importance sampling | Reweight samples from proposal | Estimating expectations, IS within SMC |
| Metropolis-Hastings | Random walk with accept/reject | General MCMC, any dimension |
| Gibbs sampling | Sample from full conditionals | Conjugate models, graphical models |
| Slice sampling | Sample uniform slice under curve | Adaptive 1D sampling within Gibbs |
| HMC | Hamiltonian dynamics with gradients | High-D continuous posteriors (the gold standard) |
What comes next: Chapter 12 introduces continuous latent variable models (PCA, factor analysis, ICA) where both variational and sampling methods are applied.