Bishop PRML, Chapter 11

Sampling Methods

Rejection sampling, importance sampling, MCMC, Metropolis-Hastings, Gibbs sampling, and Hamiltonian Monte Carlo — approximating intractable integrals with random samples.

Prerequisites: Chapters 2, 10 (distributions, approximate inference).

Chapters

Simulations

Quizzes

Chapter 0: Why Sampling?

Many quantities in Bayesian inference require computing expectations:

E[f] = ∫ f(z) p(z) dz

When this integral is intractable, we can approximate it using samples z⁽¹⁾, ..., z^(L) drawn from p(z):

E[f] ≈ (1/L) ∑_l=1^L f(z^(l))

Samples vs. variational: Chapter 10 gave us variational methods — fast but biased (the approximation family may not contain the true posterior). Sampling methods are asymptotically exact: as L → ∞, the sample estimate converges to the true expectation regardless of the posterior's shape. The price: convergence can be slow, especially in high dimensions.

The challenge: how do we draw samples from a complex, high-dimensional distribution that we can only evaluate (up to a normalizing constant)?

Check: What is the fundamental advantage of sampling over variational inference?

Sampling is asymptotically exact — it converges to the true expectation as the number of samples grows Sampling is always faster Sampling requires fewer assumptions

Chapter 1: Rejection Sampling

The simplest sampling method. Suppose we can evaluate p̃(z) (the unnormalized target) and we have a proposal distribution q(z) that we can sample from and evaluate, with kq(z) ≥ p̃(z) for all z.

Algorithm:

1. Sample z₀ from q(z).

2. Sample u uniformly from [0, kq(z₀)].

3. If u ≤ p̃(z₀), accept z₀. Otherwise reject.

The curse of dimensionality strikes again: In D dimensions, the acceptance rate is proportional to the ratio of the volume under p̃ to the volume under kq. This ratio typically decreases exponentially with D. In 1000 dimensions, you might accept 1 in 10³⁰⁰ proposals. Rejection sampling is practical only in low dimensions (D ≤ ~5).

Adaptive rejection sampling: For log-concave distributions, we can construct an envelope from tangent lines to ln p̃(z), which automatically tightens as we add more tangent points. This improves the acceptance rate without requiring a hand-crafted proposal.

Check: Why does rejection sampling fail in high dimensions?

The acceptance rate decreases exponentially with dimension, wasting almost all proposals It cannot evaluate the target distribution It requires the normalizing constant

Chapter 2: Importance Sampling

Instead of generating samples from p and averaging f(z), we can sample from a different distribution q and reweight:

E_p[f] = ∫ f(z) p(z) dz = ∫ f(z) (p(z)/q(z)) q(z) dz ≈ (1/L) ∑_l=1^L f(z^(l)) w_l

where z^(l) ~ q and w_l = p(z^(l))/q(z^(l)) are the importance weights.

No rejection, no waste: Every sample is used (no rejection). But the variance of the estimate depends critically on how well q matches p. If q is very different from p, a few samples will have huge weights and dominate the estimate — high variance. The effective sample size N_eff = (∑ w_l)² / ∑ w_l² measures how many samples are actually contributing.

If we only know p̃ (unnormalized), we use self-normalized importance sampling:

E_p[f] ≈ ∑_l w̃_l f(z^(l)) / ∑_l w̃_l

where w̃_l = p̃(z^(l))/q(z^(l)). This is biased but consistent.

Check: What determines the quality of importance sampling?

How well the proposal distribution q matches the target p — poor match leads to high-variance weight estimates The number of dimensions only The normalizing constant of p

Chapter 3: Sampling-Importance-Resampling

SIR combines importance sampling with resampling to produce unweighted samples from (approximately) p:

1. Draw L samples z⁽¹⁾, ..., z^(L) from proposal q(z).

2. Compute importance weights w_l = p̃(z^(l))/q(z^(l)), normalize to w̃_l.

3. Resample L' points from {z^(l)} with probabilities {w̃_l}.

SIR in particle filters: SIR is the core algorithm behind particle filters (sequential Monte Carlo), used in robotics, tracking, and state estimation. At each time step, particles (weighted samples) are propagated forward, reweighted by the likelihood, and resampled. This is the sampling analog of the Kalman filter for nonlinear, non-Gaussian models.

The resampled points are approximately distributed as p, but with the limitation that they're drawn from a finite set. Taking L large and L' small gives good approximations.

Check: What does the resampling step in SIR achieve?

It converts weighted samples into approximately unweighted samples from the target distribution It removes all duplicate samples It normalizes the target distribution

Chapter 4: Markov Chain Monte Carlo

The methods above don't scale to high dimensions. MCMC takes a fundamentally different approach: instead of independent samples, generate a chain of correlated samples that converges to the target distribution.

A Markov chain is a sequence z⁽¹⁾, z⁽²⁾, ... where each sample depends only on the previous one via a transition kernel T(z^(τ+1)|z^(τ)). If the chain is:

• Ergodic (can reach any state from any state), and

• Satisfies detailed balance: p(z) T(z'|z) = p(z') T(z|z') for all z, z',

then the chain has p as its unique stationary distribution, and time averages converge to expectations under p.

The MCMC idea: Design a Markov chain whose stationary distribution is our target posterior p(z|X). Run the chain long enough (past the "burn-in" period), and the samples will be distributed according to p. The key challenge: designing transition kernels that mix quickly (explore the distribution efficiently) while satisfying detailed balance.

Check: What property must an MCMC chain satisfy to converge to the target distribution?

Detailed balance: p(z)T(z'|z) = p(z')T(z|z'), ensuring the target is the stationary distribution The samples must be independent The chain must visit every state at least once

Chapter 5: Metropolis-Hastings

The Metropolis-Hastings algorithm constructs a valid MCMC chain for any target distribution:

1. From current state z^(τ), propose z* ~ q(z*|z^(τ)).

2. Accept with probability:

A(z*, z^(τ)) = min(1, p̃(z*) q(z^(τ)|z*) / [p̃(z^(τ)) q(z*|z^(τ))])

3. If accepted, z^(τ+1) = z*. Otherwise, z^(τ+1) = z^(τ).

Only ratios needed: The acceptance probability uses p̃(z*)/p̃(z^(τ)) — a ratio. The normalizing constant cancels! This is why MCMC is so powerful: we only need the unnormalized target. For Bayesian posteriors, p̃(z) = p(X|z)p(z) — no need to compute the evidence p(X).

If the proposal is symmetric (q(z*|z) = q(z|z*), e.g., a Gaussian centered at the current point), the ratio simplifies to p̃(z*)/p̃(z^(τ)). This is the original Metropolis algorithm.

The proposal scale matters critically: too small → high acceptance but slow exploration (random walk). Too large → most proposals rejected. The optimal acceptance rate for Gaussian proposals in high dimensions is about 23%.

Check: Why doesn't Metropolis-Hastings need the normalizing constant?

The acceptance ratio uses p(z*)/p(z), so the normalizing constant cancels in the ratio It computes the normalizing constant as a side effect It only works for normalized distributions

Chapter 6: MH in Action

Watch Metropolis-Hastings sample from a 2D distribution. Observe how the proposal scale affects mixing.

Metropolis-Hastings Sampling

Sampling from a banana-shaped target. Teal dots = accepted, dim dots = chain history. Adjust proposal scale.

Proposal σ 0.50 Accept: --

Check: What happens when the MH proposal scale is too small?

High acceptance rate but slow exploration — the chain takes tiny steps (random walk behavior) Most proposals are rejected The chain converges faster

Chapter 7: Gibbs Sampling

Gibbs sampling is a special case of MH where the proposal is the full conditional distribution. For a D-dimensional vector z = (z₁, ..., z_D):

For each dimension i = 1, ..., D:

z_i^(τ+1) ~ p(z_i | z₁^(τ+1), ..., z_i-1^(τ+1), z_i+1^(τ), ..., z_D^(τ))

Every proposal is accepted: Because we sample from the exact conditional, the MH acceptance probability is always 1. No wasted proposals! The catch: we need to be able to sample from the full conditionals p(z_i|z_−i), which requires conjugacy or other tractable structure. For conjugate exponential family models (Gaussians, Dirichlet-Multinomial, etc.), the conditionals have known forms.

Gibbs sampling is widely used in Bayesian statistics (BUGS, JAGS, Stan's NUTS is HMC). It excels when conditionals are easy to sample from. It struggles with strong correlations between variables — each step can only move along one coordinate axis.

Check: What is the acceptance rate of Gibbs sampling?

Always 100% — because each proposal samples from the exact full conditional distribution About 23%, the optimal MH rate It depends on the proposal distribution

Chapter 8: Slice Sampling

Slice sampling is an adaptive method that automatically tunes the proposal scale. The idea: sample uniformly from the region under the curve of p̃(z).

Algorithm:

1. At current z, sample u uniformly from [0, p̃(z)].

2. The "slice" S = {z' : p̃(z') > u} is the set of z-values where the target exceeds u.

3. Sample z' uniformly from S.

No tuning required: Slice sampling adapts to the local scale of the distribution. In wide regions, the slice is large → big steps. In narrow regions, the slice is small → small steps. In practice, finding the slice boundaries uses a "stepping out" and "shrinking" procedure. Slice sampling is particularly useful as a component within Gibbs sampling, where we need to sample from an awkward 1D conditional.

Check: What is the main advantage of slice sampling over Metropolis-Hastings?

It automatically adapts to the local scale of the distribution — no proposal scale to tune It produces independent samples It works only in 1D

Chapter 9: Hamiltonian Monte Carlo

Hamiltonian Monte Carlo (HMC) uses gradient information to make large, directed moves through the distribution, avoiding the random walk behavior of vanilla MH.

The trick: introduce auxiliary "momentum" variables r and define the Hamiltonian:

H(z, r) = −ln p̃(z) + ½r^Tr

Simulate Hamiltonian dynamics using the leapfrog integrator for L steps, then accept/reject.

The physics analogy: Think of z as position and r as momentum of a particle on a surface where height = −ln p(z). The particle rolls downhill (toward high probability), gaining momentum. It swings through the distribution's contours, making large moves while staying in high-probability regions. The leapfrog integrator preserves volume (symplectic), ensuring high acceptance rates. This is why HMC is the gold standard for continuous posteriors.

HMC requires:

• Gradients of ln p̃(z) (available via automatic differentiation).

• Two tuning parameters: step size ε and number of leapfrog steps L.

The No-U-Turn Sampler (NUTS), used in Stan, automatically adapts both.

Check: Why is HMC more efficient than random-walk MH?

HMC uses gradients to make large, directed moves along the distribution's contours, avoiding random walk behavior HMC doesn't need the unnormalized target HMC produces independent samples

Chapter 10: Estimating Partition Functions

MCMC produces samples from p without knowing the normalizing constant Z. But sometimes we need Z itself (e.g., for model comparison via Bayes factors). How?

Bridge sampling: Introduce a sequence of distributions that "bridge" between a simple distribution q₀ (known Z) and the target p (unknown Z):

p₀ = q₀, p₁, ..., p_T = p

where each p_t is close to p_t+1. The ratio of successive normalizing constants can be estimated from samples of p_t, and the telescoping product gives Z_p/Z_q.

Annealed importance sampling: A practical version uses geometric averages: p_t(z) ∝ q₀(z)^1−β_t p(z)^β_t with 0 = β₀ < ... < β_T = 1. This "anneals" from the simple distribution to the target. Each step uses MCMC transitions, and importance weights correct for the approximation.

Check: Why is estimating the partition function harder than sampling?

MCMC samples from p without knowing Z (only ratios needed), but computing Z itself requires integrating over the entire space The partition function is always 1 It requires knowing the exact posterior

Chapter 11: Summary

Method	Key idea	Best for
Rejection sampling	Accept/reject under envelope	Low-D, simple targets
Importance sampling	Reweight samples from proposal	Estimating expectations, IS within SMC
Metropolis-Hastings	Random walk with accept/reject	General MCMC, any dimension
Gibbs sampling	Sample from full conditionals	Conjugate models, graphical models
Slice sampling	Sample uniform slice under curve	Adaptive 1D sampling within Gibbs
HMC	Hamiltonian dynamics with gradients	High-D continuous posteriors (the gold standard)

The sampling hierarchy: Simple methods (rejection, importance) work in low dimensions. MH and Gibbs handle moderate dimensions. HMC (especially NUTS) is the state of the art for continuous posteriors in hundreds to thousands of dimensions. For very high dimensions (millions of parameters, as in deep learning), even HMC is too slow — and we return to variational methods (Ch 10). The choice between variational and sampling is the fundamental tradeoff: speed vs. accuracy.

What comes next: Chapter 12 introduces continuous latent variable models (PCA, factor analysis, ICA) where both variational and sampling methods are applied.

"Sampling methods provide a flexible alternative to
analytical and variational approximation schemes."
— Christopher Bishop, PRML §11

Check: When is HMC preferred over simpler MCMC methods?

For high-dimensional continuous posteriors where random walk MH would be too slow, and gradients are available Only for 1D distributions When the distribution is discrete