Bishop PRML, Chapter 11

Sampling Methods

Rejection sampling, importance sampling, MCMC, Metropolis-Hastings, Gibbs sampling, and Hamiltonian Monte Carlo — approximating intractable integrals with random samples.

Prerequisites: Chapters 2, 10 (distributions, approximate inference).
12
Chapters
2
Simulations
12
Quizzes

Chapter 0: Why Sampling?

Many quantities in Bayesian inference require computing expectations:

E[f] = ∫ f(z) p(z) dz

When this integral is intractable, we can approximate it using samples z(1), ..., z(L) drawn from p(z):

E[f] ≈ (1/L) ∑l=1L f(z(l))
Samples vs. variational: Chapter 10 gave us variational methods — fast but biased (the approximation family may not contain the true posterior). Sampling methods are asymptotically exact: as L → ∞, the sample estimate converges to the true expectation regardless of the posterior's shape. The price: convergence can be slow, especially in high dimensions.

The challenge: how do we draw samples from a complex, high-dimensional distribution that we can only evaluate (up to a normalizing constant)?

Check: What is the fundamental advantage of sampling over variational inference?

Chapter 1: Rejection Sampling

The simplest sampling method. Suppose we can evaluate p̃(z) (the unnormalized target) and we have a proposal distribution q(z) that we can sample from and evaluate, with kq(z) ≥ p̃(z) for all z.

Algorithm:

1. Sample z0 from q(z).

2. Sample u uniformly from [0, kq(z0)].

3. If u ≤ p̃(z0), accept z0. Otherwise reject.

The curse of dimensionality strikes again: In D dimensions, the acceptance rate is proportional to the ratio of the volume under p̃ to the volume under kq. This ratio typically decreases exponentially with D. In 1000 dimensions, you might accept 1 in 10300 proposals. Rejection sampling is practical only in low dimensions (D ≤ ~5).

Adaptive rejection sampling: For log-concave distributions, we can construct an envelope from tangent lines to ln p̃(z), which automatically tightens as we add more tangent points. This improves the acceptance rate without requiring a hand-crafted proposal.

Check: Why does rejection sampling fail in high dimensions?

Chapter 2: Importance Sampling

Instead of generating samples from p and averaging f(z), we can sample from a different distribution q and reweight:

Ep[f] = ∫ f(z) p(z) dz = ∫ f(z) (p(z)/q(z)) q(z) dz ≈ (1/L) ∑l=1L f(z(l)) wl

where z(l) ~ q and wl = p(z(l))/q(z(l)) are the importance weights.

No rejection, no waste: Every sample is used (no rejection). But the variance of the estimate depends critically on how well q matches p. If q is very different from p, a few samples will have huge weights and dominate the estimate — high variance. The effective sample size Neff = (∑ wl)2 / ∑ wl2 measures how many samples are actually contributing.

If we only know p̃ (unnormalized), we use self-normalized importance sampling:

Ep[f] ≈ ∑ll f(z(l)) / ∑ll

where w̃l = p̃(z(l))/q(z(l)). This is biased but consistent.

Check: What determines the quality of importance sampling?

Chapter 3: Sampling-Importance-Resampling

SIR combines importance sampling with resampling to produce unweighted samples from (approximately) p:

1. Draw L samples z(1), ..., z(L) from proposal q(z).

2. Compute importance weights wl = p̃(z(l))/q(z(l)), normalize to w̃l.

3. Resample L' points from {z(l)} with probabilities {w̃l}.

SIR in particle filters: SIR is the core algorithm behind particle filters (sequential Monte Carlo), used in robotics, tracking, and state estimation. At each time step, particles (weighted samples) are propagated forward, reweighted by the likelihood, and resampled. This is the sampling analog of the Kalman filter for nonlinear, non-Gaussian models.

The resampled points are approximately distributed as p, but with the limitation that they're drawn from a finite set. Taking L large and L' small gives good approximations.

Check: What does the resampling step in SIR achieve?

Chapter 4: Markov Chain Monte Carlo

The methods above don't scale to high dimensions. MCMC takes a fundamentally different approach: instead of independent samples, generate a chain of correlated samples that converges to the target distribution.

A Markov chain is a sequence z(1), z(2), ... where each sample depends only on the previous one via a transition kernel T(z(τ+1)|z(τ)). If the chain is:

Ergodic (can reach any state from any state), and

Satisfies detailed balance: p(z) T(z'|z) = p(z') T(z|z') for all z, z',

then the chain has p as its unique stationary distribution, and time averages converge to expectations under p.

The MCMC idea: Design a Markov chain whose stationary distribution is our target posterior p(z|X). Run the chain long enough (past the "burn-in" period), and the samples will be distributed according to p. The key challenge: designing transition kernels that mix quickly (explore the distribution efficiently) while satisfying detailed balance.
Check: What property must an MCMC chain satisfy to converge to the target distribution?

Chapter 5: Metropolis-Hastings

The Metropolis-Hastings algorithm constructs a valid MCMC chain for any target distribution:

1. From current state z(τ), propose z* ~ q(z*|z(τ)).

2. Accept with probability:

A(z*, z(τ)) = min(1, p̃(z*) q(z(τ)|z*) / [p̃(z(τ)) q(z*|z(τ))])

3. If accepted, z(τ+1) = z*. Otherwise, z(τ+1) = z(τ).

Only ratios needed: The acceptance probability uses p̃(z*)/p̃(z(τ)) — a ratio. The normalizing constant cancels! This is why MCMC is so powerful: we only need the unnormalized target. For Bayesian posteriors, p̃(z) = p(X|z)p(z) — no need to compute the evidence p(X).

If the proposal is symmetric (q(z*|z) = q(z|z*), e.g., a Gaussian centered at the current point), the ratio simplifies to p̃(z*)/p̃(z(τ)). This is the original Metropolis algorithm.

The proposal scale matters critically: too small → high acceptance but slow exploration (random walk). Too large → most proposals rejected. The optimal acceptance rate for Gaussian proposals in high dimensions is about 23%.

Check: Why doesn't Metropolis-Hastings need the normalizing constant?

Chapter 6: MH in Action

Watch Metropolis-Hastings sample from a 2D distribution. Observe how the proposal scale affects mixing.

Metropolis-Hastings Sampling

Sampling from a banana-shaped target. Teal dots = accepted, dim dots = chain history. Adjust proposal scale.

Proposal σ 0.50 Accept: --
Check: What happens when the MH proposal scale is too small?

Chapter 7: Gibbs Sampling

Gibbs sampling is a special case of MH where the proposal is the full conditional distribution. For a D-dimensional vector z = (z1, ..., zD):

For each dimension i = 1, ..., D:

zi(τ+1) ~ p(zi | z1(τ+1), ..., zi-1(τ+1), zi+1(τ), ..., zD(τ))
Every proposal is accepted: Because we sample from the exact conditional, the MH acceptance probability is always 1. No wasted proposals! The catch: we need to be able to sample from the full conditionals p(zi|z−i), which requires conjugacy or other tractable structure. For conjugate exponential family models (Gaussians, Dirichlet-Multinomial, etc.), the conditionals have known forms.

Gibbs sampling is widely used in Bayesian statistics (BUGS, JAGS, Stan's NUTS is HMC). It excels when conditionals are easy to sample from. It struggles with strong correlations between variables — each step can only move along one coordinate axis.

Check: What is the acceptance rate of Gibbs sampling?

Chapter 8: Slice Sampling

Slice sampling is an adaptive method that automatically tunes the proposal scale. The idea: sample uniformly from the region under the curve of p̃(z).

Algorithm:

1. At current z, sample u uniformly from [0, p̃(z)].

2. The "slice" S = {z' : p̃(z') > u} is the set of z-values where the target exceeds u.

3. Sample z' uniformly from S.

No tuning required: Slice sampling adapts to the local scale of the distribution. In wide regions, the slice is large → big steps. In narrow regions, the slice is small → small steps. In practice, finding the slice boundaries uses a "stepping out" and "shrinking" procedure. Slice sampling is particularly useful as a component within Gibbs sampling, where we need to sample from an awkward 1D conditional.
Check: What is the main advantage of slice sampling over Metropolis-Hastings?

Chapter 9: Hamiltonian Monte Carlo

Hamiltonian Monte Carlo (HMC) uses gradient information to make large, directed moves through the distribution, avoiding the random walk behavior of vanilla MH.

The trick: introduce auxiliary "momentum" variables r and define the Hamiltonian:

H(z, r) = −ln p̃(z) + ½rTr

Simulate Hamiltonian dynamics using the leapfrog integrator for L steps, then accept/reject.

The physics analogy: Think of z as position and r as momentum of a particle on a surface where height = −ln p(z). The particle rolls downhill (toward high probability), gaining momentum. It swings through the distribution's contours, making large moves while staying in high-probability regions. The leapfrog integrator preserves volume (symplectic), ensuring high acceptance rates. This is why HMC is the gold standard for continuous posteriors.

HMC requires:

• Gradients of ln p̃(z) (available via automatic differentiation).

• Two tuning parameters: step size ε and number of leapfrog steps L.

The No-U-Turn Sampler (NUTS), used in Stan, automatically adapts both.

Check: Why is HMC more efficient than random-walk MH?

Chapter 10: Estimating Partition Functions

MCMC produces samples from p without knowing the normalizing constant Z. But sometimes we need Z itself (e.g., for model comparison via Bayes factors). How?

Bridge sampling: Introduce a sequence of distributions that "bridge" between a simple distribution q0 (known Z) and the target p (unknown Z):

p0 = q0,   p1,   ...,   pT = p

where each pt is close to pt+1. The ratio of successive normalizing constants can be estimated from samples of pt, and the telescoping product gives Zp/Zq.

Annealed importance sampling: A practical version uses geometric averages: pt(z) ∝ q0(z)1−βt p(z)βt with 0 = β0 < ... < βT = 1. This "anneals" from the simple distribution to the target. Each step uses MCMC transitions, and importance weights correct for the approximation.
Check: Why is estimating the partition function harder than sampling?

Chapter 11: Summary

MethodKey ideaBest for
Rejection samplingAccept/reject under envelopeLow-D, simple targets
Importance samplingReweight samples from proposalEstimating expectations, IS within SMC
Metropolis-HastingsRandom walk with accept/rejectGeneral MCMC, any dimension
Gibbs samplingSample from full conditionalsConjugate models, graphical models
Slice samplingSample uniform slice under curveAdaptive 1D sampling within Gibbs
HMCHamiltonian dynamics with gradientsHigh-D continuous posteriors (the gold standard)
The sampling hierarchy: Simple methods (rejection, importance) work in low dimensions. MH and Gibbs handle moderate dimensions. HMC (especially NUTS) is the state of the art for continuous posteriors in hundreds to thousands of dimensions. For very high dimensions (millions of parameters, as in deep learning), even HMC is too slow — and we return to variational methods (Ch 10). The choice between variational and sampling is the fundamental tradeoff: speed vs. accuracy.

What comes next: Chapter 12 introduces continuous latent variable models (PCA, factor analysis, ICA) where both variational and sampling methods are applied.

"Sampling methods provide a flexible alternative to
analytical and variational approximation schemes."
— Christopher Bishop, PRML §11
Check: When is HMC preferred over simpler MCMC methods?