Multivariate Distributions — From Correlation to Covariance Matrices

Chapter 0: Why Multivariate?

You're building a self-driving car. The vehicle has a position x, a velocity v, and a heading θ. These three quantities aren't independent — a car moving fast probably moved far from where it started, and its heading determines which way it went. Treating them as separate scalar distributions would throw away this critical structure.

Multivariate distributions model several random variables together, capturing how they relate to each other. The magic ingredient is the covariance matrix — a symmetric, positive-definite matrix whose off-diagonal entries encode correlations.

The core idea: A univariate distribution says "this value is uncertain." A multivariate distribution says "these values are uncertain and linked to each other." That linkage — covariance — is what makes Kalman filters, Gaussian processes, and topic models work.

The covariance matrix Σ is a d×d matrix for d variables. Its diagonal entries Σ_ii are variances (how spread each variable is). Its off-diagonal entries Σ_ij capture how variables i and j co-vary. If Σ_ij > 0, knowing that variable i is above average tells you variable j probably is too.

The covariance matrix must always be symmetric (Σ_ij = Σ_ji) and positive semi-definite (no direction of negative variance). These constraints matter: when you construct priors on covariance matrices (chapters 3-4), the distribution must respect them. The correlation matrix R normalizes Σ so diagonal entries are 1: R_ij = Σ_ij / (σ_iσ_j). Entries of R lie between −1 and 1.

Joint Distribution

Models all variables together: p(x₁, ..., x_d)

↓

Covariance Matrix

Σ_ij = E[(x_i − μ_i)(x_j − μ_j)]

↓

Correlation Matrix

R_ij = Σ_ij / (σ_iσ_j), entries in [−1, 1]

Covariance Shapes Uncertainty

Adjust the correlation ρ between two variables. When ρ=0, the scatter is circular. As ρ grows, the cloud stretches into an ellipse — knowing one variable tells you about the other.

Correlation ρ0.00

Why can't we treat correlated variables as independent scalars?

It's slower to compute We lose the correlation structure that links them Scalar distributions don't have variances

Chapter 1: The Multivariate Gaussian

If you learn one multivariate distribution, make it this one. The multivariate Gaussian (or multivariate normal) is the workhorse of estimation, robotics, and machine learning. It's defined by a mean vector μ (the center) and a covariance matrix Σ (the shape of uncertainty).

p(x) = (2π)^-d/2 |Σ|^-1/2 exp(−½ (x − μ)^T Σ⁻¹ (x − μ))

That exponent (x − μ)^T Σ⁻¹ (x − μ) is the Mahalanobis distance — it measures how far x is from the mean, scaled by the covariance. Points along the direction of high variance are "closer" in Mahalanobis terms. This is why the contours are ellipses, not circles.

Why it's everywhere: The multivariate Gaussian is closed under conditioning, marginalization, and linear transformation. If your state is Gaussian and your model is linear, the output is still Gaussian. This is why the Kalman filter works — it never leaves the Gaussian family.

Key properties: If x ~ N(μ, Σ), then any linear transformation Ax + b is also Gaussian: N(Aμ+b, AΣA^T). Any subset of variables is Gaussian (marginalization). And the conditional distribution of some variables given others is also Gaussian — with a mean that shifts linearly and a covariance that shrinks.

Conditioning example: If [x, y]^T is jointly Gaussian with correlation ρ, then knowing y = y₀ makes x conditionally Gaussian with mean μ_x + ρ(σ_x/σ_y)(y₀ − μ_y) and variance σ_x²(1 − ρ²). The higher the correlation, the more knowing y shrinks the uncertainty in x.

Where it appears: Kalman filters (state belief), Gaussian processes (function priors), variational autoencoders (latent space), sensor fusion, any system where you track mean + covariance.

2D Gaussian: Contours & Samples

Adjust the variances and correlation. The teal ellipses are 1σ and 2σ contours. Orange dots are samples.

σ_x1.00

σ_y1.00

ρ0.50

python
import numpy as np

mu = np.array([0, 0])
Sigma = np.array([[1.0, 0.5],
                   [0.5, 1.0]])

samples = np.random.multivariate_normal(mu, Sigma, size=1000)
# samples.shape = (1000, 2)

What does the Mahalanobis distance measure?

The Euclidean distance from the origin Distance from the mean, scaled by the covariance The determinant of the covariance matrix

Chapter 2: The Dirichlet Distribution

Imagine you're building a topic model. Each document is a mixture of topics: 30% sports, 50% politics, 20% tech. Where do those mixing proportions come from? You need a distribution over probability vectors — vectors whose entries are non-negative and sum to 1. That's the simplex, and the Dirichlet distribution lives on it.

p(x₁, ..., x_K) = (1/B(α)) ∏_i=1^K x_i^α_i−1

The parameter vector α = (α₁, ..., α_K) controls where mass concentrates on the simplex. When all α_i are large and equal, the distribution concentrates near the center (uniform mixing). When all α_i are small (<1), it concentrates near the corners (sparse, one-dominant-category vectors). When α_i are unequal, it favors certain categories.

Think of it this way: The Dirichlet is to the categorical distribution what the Beta is to the Bernoulli. The Beta gives you a prior over a single coin's bias. The Dirichlet gives you a prior over the probabilities of a K-sided die.

Where it appears: Latent Dirichlet Allocation (LDA) for topic models, Bayesian priors on multinomial parameters, mixture model weights, any time you need a distribution over "how to split probability among K categories."

Worked example: You're estimating the probabilities of a 3-sided die. You start with a Dirichlet(1,1,1) prior (uniform — any combination is equally likely). You roll the die 10 times and see: 4 ones, 3 twos, 3 threes. The posterior is Dirichlet(1+4, 1+3, 1+3) = Dirichlet(5, 4, 4). The posterior mean is (5/13, 4/13, 4/13) ≈ (0.38, 0.31, 0.31) — shifted toward the observed frequencies, but regularized by the prior.

python
import numpy as np

# Prior: uniform Dirichlet
alpha_prior = np.array([1, 1, 1])
counts = np.array([4, 3, 3])  # observed data

alpha_post = alpha_prior + counts  # [5, 4, 4]
posterior_mean = alpha_post / alpha_post.sum()
# array([0.385, 0.308, 0.308])

# Sample probability vectors from the posterior
samples = np.random.dirichlet(alpha_post, size=1000)

Dirichlet on the 3-Simplex

The triangle is the 3-simplex (all probability vectors of length 3). Each orange dot is a sampled probability vector. Adjust α to see how the distribution changes.

α₁1.0

α₂1.0

α₃1.0

When all α_i < 1, where do Dirichlet samples concentrate?

Near the corners of the simplex (sparse, one-dominant-category vectors) Near the center (uniform mixing) Uniformly across the entire simplex

Chapter 3: The Wishart Distribution

You've measured 50 data points in 3D and computed their sample covariance matrix. How much should you trust it? If you had only 4 data points, that 3×3 covariance estimate would be wildly noisy. You need a distribution over covariance matrices to quantify this uncertainty. That's the Wishart.

The Wishart distribution W(ν, V) has two parameters: the degrees of freedom ν (how many "pseudo-observations" inform the estimate) and the scale matrix V (the expected shape, scaled by ν). Its mean is ν · V.

X ~ W_d(ν, V) ⇔ X = ∑_i=1^ν z_i z_i^T, z_i ~ N(0, V)

That's the key intuition: a Wishart matrix is a sum of outer products of Gaussian draws. This is exactly the form of a sample covariance matrix (up to a 1/n factor). So the Wishart naturally describes the distribution of sample covariance matrices from Gaussian data.

Degrees of freedom as confidence: When ν is small (close to d), the Wishart produces highly variable matrices — your covariance estimate is uncertain. As ν grows, the distribution concentrates around νV — your estimate becomes reliable. It's a confidence dial for covariance.

Where it appears: Sample covariance distribution in multivariate statistics, Bayesian estimation of precision matrices, prior in Gaussian mixture models, MANOVA tests.

Key properties: The Wishart is the multivariate generalization of the chi-squared distribution. In 1D, if you sum the squares of ν standard normal draws, you get a chi-squared(ν) variable. In d dimensions, if you sum the outer products of ν multivariate normal draws, you get a Wishart matrix. The degrees of freedom ν must be at least d (the dimension) for the distribution to be non-degenerate.

Wishart Samples as Covariance Ellipses

Each orange ellipse is one sampled 2×2 covariance matrix from a Wishart. The teal ellipse is the mean. Higher ν means tighter clustering around the mean.

ν (degrees of freedom)5

What happens to Wishart samples as ν increases?

They concentrate around the mean νV (less variation) They become more spread out They become diagonal matrices

Chapter 4: The Inverse Wishart

The Wishart describes the distribution of sums of outer products (like sample covariance matrices scaled by n). But when we do Bayesian inference, we usually want a prior directly on the covariance matrix Σ of a Gaussian. The Inverse Wishart is the conjugate prior for this job.

If X ~ W(ν, V), then X⁻¹ ~ IW(ν, V⁻¹). The Inverse Wishart IW(ν, Ψ) has degrees of freedom ν and a scale matrix Ψ. Its mean is Ψ / (ν − d − 1), which only exists when ν > d + 1.

Σ ~ IW(ν, Ψ) ⇒ E[Σ] = Ψ / (ν − d − 1)

Conjugate magic: If your prior on Σ is IW(ν₀, Ψ₀) and you observe n data points from N(μ, Σ), the posterior on Σ is also Inverse Wishart: IW(ν₀ + n, Ψ₀ + S), where S is the sum of squared deviations. Prior parameters act as "pseudo-data" that get added to real data.

Where it appears: Prior on the covariance in Bayesian linear regression, Gaussian mixture model fitting with unknown covariance, any Bayesian analysis where Σ is unknown.

A practical limitation: The Inverse Wishart couples the variances and correlations together in ways that can be restrictive. For a d×d matrix, increasing ν simultaneously shrinks the uncertainty in all variance and correlation parameters. In high dimensions, this inflexibility can be problematic. That's one reason the LKJ + half-Cauchy decomposition (Chapter 6) has become the modern standard in probabilistic programming.

Property	Wishart	Inverse Wishart
Samples are	PD matrices (like sample covariance)	PD matrices (like true covariance)
Conjugate prior for	Precision Λ = Σ⁻¹	Covariance Σ
Mean	ν · V	Ψ / (ν − d − 1)
Concentration	Increases with ν	Increases with ν

Inverse Wishart: Bayesian Covariance Update

The purple ellipses show prior IW samples. Click "Add Data" to observe points from a true Gaussian. The teal ellipses show posterior samples — they converge to the truth.

Prior strength ν₀4

Why is the Inverse Wishart popular as a prior on covariance?

It always produces diagonal matrices It's the conjugate prior for the Gaussian's covariance, making updates tractable It has the smallest variance of any covariance prior

Chapter 5: The Multivariate Student-t

The multivariate Gaussian is beautiful but fragile. One outlier can destroy your mean estimate. If you're tracking a robot and a sensor glitches, reporting a position 100 meters off, the Gaussian posterior shifts dramatically. You need heavier tails to gracefully absorb outliers.

The multivariate Student-t distribution has an extra parameter ν (degrees of freedom) that controls tail heaviness. When ν is small (like 3), the tails are very fat — extreme values are plausible. As ν → ∞, it converges to a Gaussian.

p(x) ∝ (1 + ν⁻¹(x−μ)^TΣ⁻¹(x−μ))^−(ν+d)/2

The key difference from the Gaussian: instead of exp(−...), we have a power law decay. This means the probability of extreme events falls off polynomially, not exponentially. The t-distribution is actually a Gaussian scale mixture: each point is drawn from a Gaussian whose variance is itself random (drawn from an Inverse-Gamma). Some points get large variances — those are the outliers.

Gaussian scale mixture: Draw w ~ InvGamma(ν/2, ν/2). Then draw x ~ N(μ, wΣ). Marginalizing out w gives the multivariate-t. Points where w is large produce outliers naturally.

Where it appears: Robust state estimation, outlier-resistant Kalman filters, financial risk modeling, Bayesian predictive distributions (when Σ is estimated, the predictive is t, not Gaussian).

A surprising fact: When you fit a Gaussian to data and then predict a new data point, the predictive distribution is not Gaussian — it's a Student-t. That's because you're uncertain about Σ, and marginalizing over that uncertainty fattens the tails. With n data points in d dimensions, the predictive has ν = n − d + 1 degrees of freedom. More data → larger ν → closer to Gaussian.

python
import numpy as np
from scipy.stats import multivariate_t

# Multivariate-t with 3 degrees of freedom
rv = multivariate_t(
    loc=[0, 0],
    shape=[[1, 0.5], [0.5, 1]],
    df=3
)
samples = rv.rvs(size=1000)
# Notice outliers that would be nearly impossible under a Gaussian

Gaussian vs Student-t: Tail Behavior

Teal = Gaussian samples. Orange = Student-t samples. Lower ν means fatter tails and more outliers.

ν (degrees of freedom)3

Why use a multivariate-t instead of a Gaussian?

It's faster to compute It always has zero correlation Its heavy tails absorb outliers without distorting the estimate

Chapter 6: The LKJ Distribution

You're building a Bayesian model with several correlated parameters. You need a prior on the correlation matrix R. But correlation matrices are tricky — they must be symmetric, positive definite, with ones on the diagonal and off-diagonal entries between −1 and 1. The LKJ distribution (named for Lewandowski, Kurowicka, and Joe) is designed exactly for this.

p(R) ∝ |R|^η−1

The single parameter η controls how much the distribution favors the identity matrix (no correlation). When η = 1, the distribution is uniform over all valid correlation matrices. When η > 1, it concentrates near the identity (shrinks correlations toward zero). When η < 1, it favors matrices with strong correlations.

Why not just use Inverse Wishart? The IW conflates scale (variances) with structure (correlations). If you want to put separate priors on the standard deviations and the correlations, you decompose Σ = diag(σ) · R · diag(σ) and put an LKJ prior on R. This is the standard approach in Stan, PyMC, and other probabilistic programming languages.

Where it appears: Stan and PyMC priors on correlation, hierarchical models, any Bayesian model where you want to separate the "how big" (scale) from the "how related" (correlation) aspects of covariance.

python (PyMC)
import pymc as pm

with pm.Model():
    # Separate priors on scale and correlation
    sigma = pm.HalfCauchy('sigma', beta=2.5, shape=3)
    chol, corr, stds = pm.LKJCholeskyCov(
        'chol', n=3, eta=2.0,  # eta>1: shrink toward identity
        sd_dist=pm.HalfCauchy.dist(beta=2.5)
    )
    # cov = diag(sigma) @ corr @ diag(sigma)

LKJ: Sampled 2×2 Correlation Matrices

A 2×2 correlation matrix has one free parameter: ρ. This histogram shows the distribution of ρ under different η values.

η1.0

When η = 1, what does the LKJ distribution look like?

Uniform over all valid correlation matrices Concentrated at the identity (no correlation) Concentrated at perfect correlation

Chapter 7: The Dirichlet Process

You're clustering customer data. How many clusters should there be? 3? 10? 50? The Dirichlet Process (DP) answers: "let the data decide." It's a distribution over distributions — specifically, over discrete probability measures with a potentially infinite number of atoms (clusters).

The DP has two parameters: a concentration parameter α and a base distribution G₀. Small α produces few clusters. Large α produces many. The DP is not a distribution over finite vectors like the Dirichlet — it's a distribution over infinite-dimensional objects.

The Chinese Restaurant Process (CRP): Imagine customers entering a restaurant one at a time. Each customer either sits at an existing table (with probability proportional to how many people are already there) or starts a new table (with probability proportional to α). After all customers are seated, the table assignments define a clustering. This is the Dirichlet Process.

Customer n arrives

Look at existing tables

↓

Join table k

Probability ∝ n_k (number already seated)

Start new table

Probability ∝ α

Where it appears: Nonparametric clustering (infinite mixture models), topic models with unknown number of topics, density estimation, any setting where you don't want to pre-specify the number of components.

Stick-breaking view: Another way to construct a DP: break a unit-length stick repeatedly. At each step, break off a fraction β_k ~ Beta(1, α) of the remaining stick. The length of piece k becomes the weight π_k of the k-th cluster. Small α means you break off large chunks early (few clusters). Large α means small chips (many clusters). The weights sum to 1, but there are infinitely many of them.

β_k ~ Beta(1, α), π_k = β_k ∏_j=1^k-1 (1 − β_j)

Chinese Restaurant Process Simulator

Each circle is a customer. Colors represent tables. Click "Seat Customer" to add one. Higher α means more new tables.

α (concentration)1.0

      Customers: 0  |  Tables: 0
    

In the Chinese Restaurant Process, what does a larger α produce?

Fewer, larger clusters More clusters (new tables are more likely) Always exactly α clusters

Chapter 8: The Gaussian Process

You have 5 data points and want to predict the function that generated them. Not just a single best-fit line — you want a distribution over all possible functions that pass near those points, with uncertainty bands that widen where you have no data. That's a Gaussian Process (GP).

A GP defines a distribution over functions. Any finite collection of function values f(x₁), ..., f(x_n) is jointly Gaussian. The GP is fully specified by a mean function m(x) (usually zero) and a kernel (covariance function) k(x, x') that says how correlated function values at x and x' should be.

f ~ GP(m(x), k(x, x')) ⇒ [f(x₁), ..., f(x_n)]^T ~ N(m, K)

The most common kernel is the squared exponential (RBF): k(x, x') = σ² exp(−||x−x'||² / 2l²). The lengthscale l controls how smooth the function is (large l = slowly varying). The signal variance σ² controls the amplitude.

Kernel	Formula	Functions Look Like
Squared Exponential	σ² exp(−\|\|x−x'\|\|²/2l²)	Infinitely smooth
Matérn 3/2	σ²(1+√3 r/l)exp(−√3 r/l)	Once differentiable
Periodic	σ² exp(−2sin²(π\|x−x'\|/p)/l²)	Periodic with period p

Conditioning = learning: Before seeing data, the GP gives you a prior over smooth functions. After conditioning on observations, the posterior mean passes through the data points, and the posterior variance shrinks near observed locations. This is exact Bayesian inference — no approximation needed.

Where it appears: Bayesian optimization (surrogate model for expensive functions), geostatistics (kriging), robot model learning, hyperparameter tuning, any regression problem where uncertainty quantification matters.

Worked example: You're optimizing an expensive neural network training run. Each run costs $100 and takes 4 hours. You've tried 5 hyperparameter settings and observed validation accuracies. A GP models accuracy as a function of learning rate. The posterior mean tells you the expected accuracy at any untried learning rate. The posterior variance tells you where you're uncertain. Bayesian optimization picks the next point by balancing exploration (high variance) with exploitation (high mean).

python
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

X_train = np.array([[0.001], [0.01], [0.1], [0.5], [1.0]])
y_train = np.array([0.65, 0.89, 0.93, 0.78, 0.55])

gp = GaussianProcessRegressor(kernel=RBF(0.5))
gp.fit(X_train, y_train)

X_test = np.linspace(0, 1.2, 100).reshape(-1, 1)
mu, sigma = gp.predict(X_test, return_std=True)
# mu = posterior mean, sigma = posterior std at each test point

Gaussian Process Regression

Click on the canvas to add observations. The teal line is the GP posterior mean. The shaded band is ±2σ. Orange dots are your data.

Lengthscale l0.50

Noise σ_n0.10

What does the GP's lengthscale parameter control?

How smooth (slowly varying) the sampled functions are The number of data points needed The mean of the function

Chapter 9: Normal-Inverse-Wishart

In practice, you rarely know the mean or the covariance of a Gaussian. You need a joint prior on both μ and Σ. The Normal-Inverse-Wishart (NIW) distribution is the conjugate prior for this case: it provides a prior on (μ, Σ) that updates cleanly when you observe data from N(μ, Σ).

Σ ~ IW(ν, Ψ), μ | Σ ~ N(μ₀, Σ / κ)

Four hyperparameters govern the NIW: μ₀ (prior mean location), κ (how strongly you believe in μ₀), ν (degrees of freedom for covariance), and Ψ (scale matrix). Notice that the prior on μ depends on Σ — the mean's uncertainty scales with the covariance itself. This coupling is what makes the conjugate update work.

The update rule: After observing n data points with sample mean x̄ and scatter matrix S:
κ_n = κ₀ + n, ν_n = ν₀ + n
μ_n = (κ₀μ₀ + n·x̄) / κ_n
Ψ_n = Ψ₀ + S + (κ₀n/κ_n)(x̄ − μ₀)(x̄ − μ₀)^T
Everything is a weighted combination of prior and data.

Where it appears: Full Bayesian Gaussian inference (unknown mean and covariance), Bayesian Gaussian mixture models, Bayesian linear regression with unknown noise covariance.

Worked example: You're fitting a 2D Gaussian to sensor data. Your prior: NIW with μ₀ = [0, 0], κ₀ = 1, ν₀ = 4, Ψ₀ = I. You observe 20 data points with sample mean [1.2, 0.8] and scatter matrix S. The posterior is still NIW with updated parameters — κ₂₀ = 21 (so the mean is tightly determined), ν₂₀ = 24 (so the covariance is well-estimated). The posterior mean of μ is (1·[0,0] + 20·[1.2,0.8]) / 21 ≈ [1.14, 0.76] — dominated by data.

python
import numpy as np

# NIW prior
mu0 = np.array([0, 0])
kappa0, nu0 = 1, 4
Psi0 = np.eye(2)

# Observed data
data = np.random.multivariate_normal([1.2, 0.8], [[0.5, 0.2], [0.2, 0.8]], 20)
n = len(data)
xbar = data.mean(axis=0)
S = (data - xbar).T @ (data - xbar)

# NIW posterior
kappa_n = kappa0 + n
nu_n = nu0 + n
mu_n = (kappa0 * mu0 + n * xbar) / kappa_n
diff = (xbar - mu0).reshape(-1, 1)
Psi_n = Psi0 + S + (kappa0 * n / kappa_n) * diff @ diff.T

NIW: Joint Inference on Mean & Covariance

Each sample from the NIW posterior gives a (μ, Σ) pair. Orange dots = sampled means. Teal ellipses = corresponding covariances. Add data to see the posterior tighten.

κ₀ (mean confidence)1.0

Why does the NIW couple the prior on μ with Σ?

To reduce computation Because means and covariances are always independent Because the uncertainty in μ should scale with Σ — this coupling makes conjugate updates possible

Chapter 10: Von Mises-Fisher

Text embeddings from models like BERT live on the unit sphere — they're normalized to unit length, and similarity is measured by cosine (angle), not Euclidean distance. The Gaussian doesn't respect this geometry; it would put mass outside the sphere. You need a distribution designed for directional data.

The Von Mises-Fisher (vMF) distribution is the "Gaussian on a sphere." It's defined by a mean direction μ (a unit vector) and a concentration parameter κ. When κ = 0, the distribution is uniform on the sphere. As κ grows, it concentrates tightly around μ.

p(x | μ, κ) = C_d(κ) · exp(κ · μ^T x)

The exponent κ · μ^Tx is just κ times the cosine similarity between x and μ. Points aligned with μ (high cosine) get high probability. Points opposite to μ get low probability. The normalizing constant C_d(κ) involves Bessel functions and depends on the dimension d.

vMF vs Gaussian: The Gaussian is for unconstrained R^d. The vMF is for the unit sphere S^d-1. Use vMF whenever your data are directions (unit vectors, angles, orientations) rather than positions. In 2D, the vMF reduces to the von Mises distribution on the circle.

Properties: The vMF mean direction is μ (the parameter itself), and the mean resultant length is R̄ = I_d/2(κ) / I_d/2-1(κ), where I_v is the modified Bessel function. When κ = 0, R̄ = 0 (uniform spread). As κ → ∞, R̄ → 1 (all mass concentrated at μ). Maximum likelihood estimation of κ from data uses the equation R̄_data = I_d/2(κ) / I_d/2-1(κ), which must be solved numerically.

Property	Gaussian (on R^d)	vMF (on S^d-1)
Domain	All of R^d	Unit sphere
Center	μ (mean vector)	μ (mean direction)
Spread	Σ (covariance matrix)	1/κ (inverse concentration)
Distance metric	Euclidean / Mahalanobis	Cosine (dot product)
In 1D becomes	Normal distribution	Von Mises (on the circle)

Where it appears: Directional statistics (wind directions, animal migration), text embedding clustering, spherical topic models, robot orientation estimation, molecular biology (protein structure angles).

Worked example: You're clustering text embeddings from a sentence transformer. Each embedding is a 384-dimensional unit vector. The cosine similarity between two embeddings equals their dot product (since they're unit length). A vMF mixture model clusters these embeddings naturally, because vMF's probability depends on μ^Tx = cos(θ) — exactly the similarity metric you care about. Each cluster has a mean direction μ_k and concentration κ_k.

python
import numpy as np

def sample_vmf(mu, kappa, n=100):
    """Sample from vMF in d dimensions (Wood's algorithm)."""
    d = len(mu)
    result = []
    for _ in range(n):
        # Sample w from the marginal
        w = _sample_w(kappa, d)
        # Sample v uniformly on the (d-2)-sphere
        v = np.random.randn(d - 1)
        v /= np.linalg.norm(v)
        # Combine: x = sqrt(1-w^2)*v + w*mu
        sample = np.sqrt(1 - w**2) * _householder(mu, v) + w * mu
        result.append(sample)
    return np.array(result)

Von Mises-Fisher on the Circle

In 2D, the vMF becomes the von Mises distribution on the circle. The orange dots are samples. The teal curve is the density. Drag κ to see concentration change.

κ (concentration)3.0

Mean direction θ0.00

When should you use the Von Mises-Fisher instead of a Gaussian?

When your data are unit vectors (directions on a sphere) When your data have heavy tails When you have more than 3 dimensions

Chapter 11: The Landscape

You've now met ten multivariate distributions. Each solves a specific problem in the Bayesian modeling toolkit. Here's how they connect.

Distribution Relationship Map

The arrows show key relationships: conjugacy, limiting behavior, and composition.

Distribution	Lives On	Key Use	Parameters
Multivariate Gaussian	R^d	State estimation, GP, VAE	μ, Σ
Dirichlet	Simplex	Topic models, categorical prior	α_1..K
Wishart	PD matrices	Sample covariance distribution	ν, V
Inverse Wishart	PD matrices	Prior on Σ	ν, Ψ
Multivariate-t	R^d	Robust estimation, outlier resistance	μ, Σ, ν
LKJ	Correlation matrices	Prior on R in Stan/PyMC	η
Dirichlet Process	Distributions	Nonparametric clustering	α, G₀
Gaussian Process	Functions	Bayesian optimization, regression	m(x), k(x,x')
Normal-Inverse-Wishart	R^d × PD	Joint prior on μ, Σ	μ₀, κ, ν, Ψ
Von Mises-Fisher	Unit sphere	Directional statistics, embeddings	μ, κ

When to Use What

Rule of thumb:
• Data in R^d: Start with the multivariate Gaussian. If outliers matter, use multivariate-t.
• Probability vectors: Dirichlet. Unknown number of categories? Dirichlet Process.
• Prior on covariance: Inverse Wishart (simple), or LKJ + half-Cauchy (more flexible).
• Prior on mean + covariance: Normal-Inverse-Wishart.
• Functions: Gaussian Process.
• Directions: Von Mises-Fisher.

Connections to Other Lessons

The Kalman Filter is built entirely on multivariate Gaussians. Bayesian Estimation uses conjugate priors like the ones we studied (Inverse Wishart, NIW). GPT and Diffusion Models both use high-dimensional Gaussians in their core mechanics.

The meta-pattern: Every distribution in this lesson exists because engineers needed to put uncertainty on a specific type of mathematical object — vectors, matrices, functions, directions, probability vectors. The object's geometry determines which distribution is appropriate. That's the unifying idea.

The Conjugacy Map

Several distributions form conjugate pairs for the multivariate Gaussian:

Unknown	Conjugate Prior	Known
μ (mean only)	Multivariate Gaussian	Σ known
Σ (covariance only)	Inverse Wishart	μ known
(μ, Σ) jointly	Normal-Inverse-Wishart	Nothing known
Λ = Σ⁻¹ (precision)	Wishart	μ known

Conjugacy means the posterior has the same functional form as the prior, with updated parameters. This is the engine behind tractable Bayesian inference. When conjugacy isn't available (as in most modern models), we resort to MCMC or variational inference.

Closing thought: "The purpose of computing is insight, not numbers." — Richard Hamming. Every distribution in this lesson is a tool for representing what you know and what you don't. The better you understand their shapes, the better you can reason under uncertainty.

You need a prior on the correlation structure (not the scale) of a multivariate model. Which distribution?

Inverse Wishart LKJ Dirichlet

Multivariate Distributionsfor Engineers