Ten distributions that power Kalman filters, topic models, Gaussian processes, and Bayesian inference — built from absolute zero.
You're building a self-driving car. The vehicle has a position x, a velocity v, and a heading θ. These three quantities aren't independent — a car moving fast probably moved far from where it started, and its heading determines which way it went. Treating them as separate scalar distributions would throw away this critical structure.
Multivariate distributions model several random variables together, capturing how they relate to each other. The magic ingredient is the covariance matrix — a symmetric, positive-definite matrix whose off-diagonal entries encode correlations.
The covariance matrix Σ is a d×d matrix for d variables. Its diagonal entries Σii are variances (how spread each variable is). Its off-diagonal entries Σij capture how variables i and j co-vary. If Σij > 0, knowing that variable i is above average tells you variable j probably is too.
The covariance matrix must always be symmetric (Σij = Σji) and positive semi-definite (no direction of negative variance). These constraints matter: when you construct priors on covariance matrices (chapters 3-4), the distribution must respect them. The correlation matrix R normalizes Σ so diagonal entries are 1: Rij = Σij / (σiσj). Entries of R lie between −1 and 1.
Adjust the correlation ρ between two variables. When ρ=0, the scatter is circular. As ρ grows, the cloud stretches into an ellipse — knowing one variable tells you about the other.
If you learn one multivariate distribution, make it this one. The multivariate Gaussian (or multivariate normal) is the workhorse of estimation, robotics, and machine learning. It's defined by a mean vector μ (the center) and a covariance matrix Σ (the shape of uncertainty).
That exponent (x − μ)T Σ−1 (x − μ) is the Mahalanobis distance — it measures how far x is from the mean, scaled by the covariance. Points along the direction of high variance are "closer" in Mahalanobis terms. This is why the contours are ellipses, not circles.
Key properties: If x ~ N(μ, Σ), then any linear transformation Ax + b is also Gaussian: N(Aμ+b, AΣAT). Any subset of variables is Gaussian (marginalization). And the conditional distribution of some variables given others is also Gaussian — with a mean that shifts linearly and a covariance that shrinks.
Where it appears: Kalman filters (state belief), Gaussian processes (function priors), variational autoencoders (latent space), sensor fusion, any system where you track mean + covariance.
Adjust the variances and correlation. The teal ellipses are 1σ and 2σ contours. Orange dots are samples.
python import numpy as np mu = np.array([0, 0]) Sigma = np.array([[1.0, 0.5], [0.5, 1.0]]) samples = np.random.multivariate_normal(mu, Sigma, size=1000) # samples.shape = (1000, 2)
Imagine you're building a topic model. Each document is a mixture of topics: 30% sports, 50% politics, 20% tech. Where do those mixing proportions come from? You need a distribution over probability vectors — vectors whose entries are non-negative and sum to 1. That's the simplex, and the Dirichlet distribution lives on it.
The parameter vector α = (α1, ..., αK) controls where mass concentrates on the simplex. When all αi are large and equal, the distribution concentrates near the center (uniform mixing). When all αi are small (<1), it concentrates near the corners (sparse, one-dominant-category vectors). When αi are unequal, it favors certain categories.
Where it appears: Latent Dirichlet Allocation (LDA) for topic models, Bayesian priors on multinomial parameters, mixture model weights, any time you need a distribution over "how to split probability among K categories."
Worked example: You're estimating the probabilities of a 3-sided die. You start with a Dirichlet(1,1,1) prior (uniform — any combination is equally likely). You roll the die 10 times and see: 4 ones, 3 twos, 3 threes. The posterior is Dirichlet(1+4, 1+3, 1+3) = Dirichlet(5, 4, 4). The posterior mean is (5/13, 4/13, 4/13) ≈ (0.38, 0.31, 0.31) — shifted toward the observed frequencies, but regularized by the prior.
python import numpy as np # Prior: uniform Dirichlet alpha_prior = np.array([1, 1, 1]) counts = np.array([4, 3, 3]) # observed data alpha_post = alpha_prior + counts # [5, 4, 4] posterior_mean = alpha_post / alpha_post.sum() # array([0.385, 0.308, 0.308]) # Sample probability vectors from the posterior samples = np.random.dirichlet(alpha_post, size=1000)
The triangle is the 3-simplex (all probability vectors of length 3). Each orange dot is a sampled probability vector. Adjust α to see how the distribution changes.
You've measured 50 data points in 3D and computed their sample covariance matrix. How much should you trust it? If you had only 4 data points, that 3×3 covariance estimate would be wildly noisy. You need a distribution over covariance matrices to quantify this uncertainty. That's the Wishart.
The Wishart distribution W(ν, V) has two parameters: the degrees of freedom ν (how many "pseudo-observations" inform the estimate) and the scale matrix V (the expected shape, scaled by ν). Its mean is ν · V.
That's the key intuition: a Wishart matrix is a sum of outer products of Gaussian draws. This is exactly the form of a sample covariance matrix (up to a 1/n factor). So the Wishart naturally describes the distribution of sample covariance matrices from Gaussian data.
Where it appears: Sample covariance distribution in multivariate statistics, Bayesian estimation of precision matrices, prior in Gaussian mixture models, MANOVA tests.
Key properties: The Wishart is the multivariate generalization of the chi-squared distribution. In 1D, if you sum the squares of ν standard normal draws, you get a chi-squared(ν) variable. In d dimensions, if you sum the outer products of ν multivariate normal draws, you get a Wishart matrix. The degrees of freedom ν must be at least d (the dimension) for the distribution to be non-degenerate.
Each orange ellipse is one sampled 2×2 covariance matrix from a Wishart. The teal ellipse is the mean. Higher ν means tighter clustering around the mean.
The Wishart describes the distribution of sums of outer products (like sample covariance matrices scaled by n). But when we do Bayesian inference, we usually want a prior directly on the covariance matrix Σ of a Gaussian. The Inverse Wishart is the conjugate prior for this job.
If X ~ W(ν, V), then X−1 ~ IW(ν, V−1). The Inverse Wishart IW(ν, Ψ) has degrees of freedom ν and a scale matrix Ψ. Its mean is Ψ / (ν − d − 1), which only exists when ν > d + 1.
Where it appears: Prior on the covariance in Bayesian linear regression, Gaussian mixture model fitting with unknown covariance, any Bayesian analysis where Σ is unknown.
A practical limitation: The Inverse Wishart couples the variances and correlations together in ways that can be restrictive. For a d×d matrix, increasing ν simultaneously shrinks the uncertainty in all variance and correlation parameters. In high dimensions, this inflexibility can be problematic. That's one reason the LKJ + half-Cauchy decomposition (Chapter 6) has become the modern standard in probabilistic programming.
| Property | Wishart | Inverse Wishart |
|---|---|---|
| Samples are | PD matrices (like sample covariance) | PD matrices (like true covariance) |
| Conjugate prior for | Precision Λ = Σ−1 | Covariance Σ |
| Mean | ν · V | Ψ / (ν − d − 1) |
| Concentration | Increases with ν | Increases with ν |
The purple ellipses show prior IW samples. Click "Add Data" to observe points from a true Gaussian. The teal ellipses show posterior samples — they converge to the truth.
The multivariate Gaussian is beautiful but fragile. One outlier can destroy your mean estimate. If you're tracking a robot and a sensor glitches, reporting a position 100 meters off, the Gaussian posterior shifts dramatically. You need heavier tails to gracefully absorb outliers.
The multivariate Student-t distribution has an extra parameter ν (degrees of freedom) that controls tail heaviness. When ν is small (like 3), the tails are very fat — extreme values are plausible. As ν → ∞, it converges to a Gaussian.
The key difference from the Gaussian: instead of exp(−...), we have a power law decay. This means the probability of extreme events falls off polynomially, not exponentially. The t-distribution is actually a Gaussian scale mixture: each point is drawn from a Gaussian whose variance is itself random (drawn from an Inverse-Gamma). Some points get large variances — those are the outliers.
Where it appears: Robust state estimation, outlier-resistant Kalman filters, financial risk modeling, Bayesian predictive distributions (when Σ is estimated, the predictive is t, not Gaussian).
A surprising fact: When you fit a Gaussian to data and then predict a new data point, the predictive distribution is not Gaussian — it's a Student-t. That's because you're uncertain about Σ, and marginalizing over that uncertainty fattens the tails. With n data points in d dimensions, the predictive has ν = n − d + 1 degrees of freedom. More data → larger ν → closer to Gaussian.
python import numpy as np from scipy.stats import multivariate_t # Multivariate-t with 3 degrees of freedom rv = multivariate_t( loc=[0, 0], shape=[[1, 0.5], [0.5, 1]], df=3 ) samples = rv.rvs(size=1000) # Notice outliers that would be nearly impossible under a Gaussian
Teal = Gaussian samples. Orange = Student-t samples. Lower ν means fatter tails and more outliers.
You're building a Bayesian model with several correlated parameters. You need a prior on the correlation matrix R. But correlation matrices are tricky — they must be symmetric, positive definite, with ones on the diagonal and off-diagonal entries between −1 and 1. The LKJ distribution (named for Lewandowski, Kurowicka, and Joe) is designed exactly for this.
The single parameter η controls how much the distribution favors the identity matrix (no correlation). When η = 1, the distribution is uniform over all valid correlation matrices. When η > 1, it concentrates near the identity (shrinks correlations toward zero). When η < 1, it favors matrices with strong correlations.
Where it appears: Stan and PyMC priors on correlation, hierarchical models, any Bayesian model where you want to separate the "how big" (scale) from the "how related" (correlation) aspects of covariance.
python (PyMC) import pymc as pm with pm.Model(): # Separate priors on scale and correlation sigma = pm.HalfCauchy('sigma', beta=2.5, shape=3) chol, corr, stds = pm.LKJCholeskyCov( 'chol', n=3, eta=2.0, # eta>1: shrink toward identity sd_dist=pm.HalfCauchy.dist(beta=2.5) ) # cov = diag(sigma) @ corr @ diag(sigma)
A 2×2 correlation matrix has one free parameter: ρ. This histogram shows the distribution of ρ under different η values.
You're clustering customer data. How many clusters should there be? 3? 10? 50? The Dirichlet Process (DP) answers: "let the data decide." It's a distribution over distributions — specifically, over discrete probability measures with a potentially infinite number of atoms (clusters).
The DP has two parameters: a concentration parameter α and a base distribution G0. Small α produces few clusters. Large α produces many. The DP is not a distribution over finite vectors like the Dirichlet — it's a distribution over infinite-dimensional objects.
Where it appears: Nonparametric clustering (infinite mixture models), topic models with unknown number of topics, density estimation, any setting where you don't want to pre-specify the number of components.
Stick-breaking view: Another way to construct a DP: break a unit-length stick repeatedly. At each step, break off a fraction βk ~ Beta(1, α) of the remaining stick. The length of piece k becomes the weight πk of the k-th cluster. Small α means you break off large chunks early (few clusters). Large α means small chips (many clusters). The weights sum to 1, but there are infinitely many of them.
Each circle is a customer. Colors represent tables. Click "Seat Customer" to add one. Higher α means more new tables.
You have 5 data points and want to predict the function that generated them. Not just a single best-fit line — you want a distribution over all possible functions that pass near those points, with uncertainty bands that widen where you have no data. That's a Gaussian Process (GP).
A GP defines a distribution over functions. Any finite collection of function values f(x1), ..., f(xn) is jointly Gaussian. The GP is fully specified by a mean function m(x) (usually zero) and a kernel (covariance function) k(x, x') that says how correlated function values at x and x' should be.
The most common kernel is the squared exponential (RBF): k(x, x') = σ2 exp(−||x−x'||2 / 2l2). The lengthscale l controls how smooth the function is (large l = slowly varying). The signal variance σ2 controls the amplitude.
| Kernel | Formula | Functions Look Like |
|---|---|---|
| Squared Exponential | σ2 exp(−||x−x'||2/2l2) | Infinitely smooth |
| Matérn 3/2 | σ2(1+√3 r/l)exp(−√3 r/l) | Once differentiable |
| Periodic | σ2 exp(−2sin2(π|x−x'|/p)/l2) | Periodic with period p |
Where it appears: Bayesian optimization (surrogate model for expensive functions), geostatistics (kriging), robot model learning, hyperparameter tuning, any regression problem where uncertainty quantification matters.
Worked example: You're optimizing an expensive neural network training run. Each run costs $100 and takes 4 hours. You've tried 5 hyperparameter settings and observed validation accuracies. A GP models accuracy as a function of learning rate. The posterior mean tells you the expected accuracy at any untried learning rate. The posterior variance tells you where you're uncertain. Bayesian optimization picks the next point by balancing exploration (high variance) with exploitation (high mean).
python from sklearn.gaussian_process import GaussianProcessRegressor from sklearn.gaussian_process.kernels import RBF X_train = np.array([[0.001], [0.01], [0.1], [0.5], [1.0]]) y_train = np.array([0.65, 0.89, 0.93, 0.78, 0.55]) gp = GaussianProcessRegressor(kernel=RBF(0.5)) gp.fit(X_train, y_train) X_test = np.linspace(0, 1.2, 100).reshape(-1, 1) mu, sigma = gp.predict(X_test, return_std=True) # mu = posterior mean, sigma = posterior std at each test point
Click on the canvas to add observations. The teal line is the GP posterior mean. The shaded band is ±2σ. Orange dots are your data.
In practice, you rarely know the mean or the covariance of a Gaussian. You need a joint prior on both μ and Σ. The Normal-Inverse-Wishart (NIW) distribution is the conjugate prior for this case: it provides a prior on (μ, Σ) that updates cleanly when you observe data from N(μ, Σ).
Four hyperparameters govern the NIW: μ0 (prior mean location), κ (how strongly you believe in μ0), ν (degrees of freedom for covariance), and Ψ (scale matrix). Notice that the prior on μ depends on Σ — the mean's uncertainty scales with the covariance itself. This coupling is what makes the conjugate update work.
Where it appears: Full Bayesian Gaussian inference (unknown mean and covariance), Bayesian Gaussian mixture models, Bayesian linear regression with unknown noise covariance.
Worked example: You're fitting a 2D Gaussian to sensor data. Your prior: NIW with μ0 = [0, 0], κ0 = 1, ν0 = 4, Ψ0 = I. You observe 20 data points with sample mean [1.2, 0.8] and scatter matrix S. The posterior is still NIW with updated parameters — κ20 = 21 (so the mean is tightly determined), ν20 = 24 (so the covariance is well-estimated). The posterior mean of μ is (1·[0,0] + 20·[1.2,0.8]) / 21 ≈ [1.14, 0.76] — dominated by data.
python import numpy as np # NIW prior mu0 = np.array([0, 0]) kappa0, nu0 = 1, 4 Psi0 = np.eye(2) # Observed data data = np.random.multivariate_normal([1.2, 0.8], [[0.5, 0.2], [0.2, 0.8]], 20) n = len(data) xbar = data.mean(axis=0) S = (data - xbar).T @ (data - xbar) # NIW posterior kappa_n = kappa0 + n nu_n = nu0 + n mu_n = (kappa0 * mu0 + n * xbar) / kappa_n diff = (xbar - mu0).reshape(-1, 1) Psi_n = Psi0 + S + (kappa0 * n / kappa_n) * diff @ diff.T
Each sample from the NIW posterior gives a (μ, Σ) pair. Orange dots = sampled means. Teal ellipses = corresponding covariances. Add data to see the posterior tighten.
Text embeddings from models like BERT live on the unit sphere — they're normalized to unit length, and similarity is measured by cosine (angle), not Euclidean distance. The Gaussian doesn't respect this geometry; it would put mass outside the sphere. You need a distribution designed for directional data.
The Von Mises-Fisher (vMF) distribution is the "Gaussian on a sphere." It's defined by a mean direction μ (a unit vector) and a concentration parameter κ. When κ = 0, the distribution is uniform on the sphere. As κ grows, it concentrates tightly around μ.
The exponent κ · μTx is just κ times the cosine similarity between x and μ. Points aligned with μ (high cosine) get high probability. Points opposite to μ get low probability. The normalizing constant Cd(κ) involves Bessel functions and depends on the dimension d.
Properties: The vMF mean direction is μ (the parameter itself), and the mean resultant length is R̄ = Id/2(κ) / Id/2-1(κ), where Iv is the modified Bessel function. When κ = 0, R̄ = 0 (uniform spread). As κ → ∞, R̄ → 1 (all mass concentrated at μ). Maximum likelihood estimation of κ from data uses the equation R̄data = Id/2(κ) / Id/2-1(κ), which must be solved numerically.
| Property | Gaussian (on Rd) | vMF (on Sd-1) |
|---|---|---|
| Domain | All of Rd | Unit sphere |
| Center | μ (mean vector) | μ (mean direction) |
| Spread | Σ (covariance matrix) | 1/κ (inverse concentration) |
| Distance metric | Euclidean / Mahalanobis | Cosine (dot product) |
| In 1D becomes | Normal distribution | Von Mises (on the circle) |
Where it appears: Directional statistics (wind directions, animal migration), text embedding clustering, spherical topic models, robot orientation estimation, molecular biology (protein structure angles).
Worked example: You're clustering text embeddings from a sentence transformer. Each embedding is a 384-dimensional unit vector. The cosine similarity between two embeddings equals their dot product (since they're unit length). A vMF mixture model clusters these embeddings naturally, because vMF's probability depends on μTx = cos(θ) — exactly the similarity metric you care about. Each cluster has a mean direction μk and concentration κk.
python import numpy as np def sample_vmf(mu, kappa, n=100): """Sample from vMF in d dimensions (Wood's algorithm).""" d = len(mu) result = [] for _ in range(n): # Sample w from the marginal w = _sample_w(kappa, d) # Sample v uniformly on the (d-2)-sphere v = np.random.randn(d - 1) v /= np.linalg.norm(v) # Combine: x = sqrt(1-w^2)*v + w*mu sample = np.sqrt(1 - w**2) * _householder(mu, v) + w * mu result.append(sample) return np.array(result)
In 2D, the vMF becomes the von Mises distribution on the circle. The orange dots are samples. The teal curve is the density. Drag κ to see concentration change.
You've now met ten multivariate distributions. Each solves a specific problem in the Bayesian modeling toolkit. Here's how they connect.
The arrows show key relationships: conjugacy, limiting behavior, and composition.
| Distribution | Lives On | Key Use | Parameters |
|---|---|---|---|
| Multivariate Gaussian | Rd | State estimation, GP, VAE | μ, Σ |
| Dirichlet | Simplex | Topic models, categorical prior | α1..K |
| Wishart | PD matrices | Sample covariance distribution | ν, V |
| Inverse Wishart | PD matrices | Prior on Σ | ν, Ψ |
| Multivariate-t | Rd | Robust estimation, outlier resistance | μ, Σ, ν |
| LKJ | Correlation matrices | Prior on R in Stan/PyMC | η |
| Dirichlet Process | Distributions | Nonparametric clustering | α, G0 |
| Gaussian Process | Functions | Bayesian optimization, regression | m(x), k(x,x') |
| Normal-Inverse-Wishart | Rd × PD | Joint prior on μ, Σ | μ0, κ, ν, Ψ |
| Von Mises-Fisher | Unit sphere | Directional statistics, embeddings | μ, κ |
The Kalman Filter is built entirely on multivariate Gaussians. Bayesian Estimation uses conjugate priors like the ones we studied (Inverse Wishart, NIW). GPT and Diffusion Models both use high-dimensional Gaussians in their core mechanics.
Several distributions form conjugate pairs for the multivariate Gaussian:
| Unknown | Conjugate Prior | Known |
|---|---|---|
| μ (mean only) | Multivariate Gaussian | Σ known |
| Σ (covariance only) | Inverse Wishart | μ known |
| (μ, Σ) jointly | Normal-Inverse-Wishart | Nothing known |
| Λ = Σ−1 (precision) | Wishart | μ known |
Conjugacy means the posterior has the same functional form as the prior, with updated parameters. This is the engine behind tractable Bayesian inference. When conjugacy isn't available (as in most modern models), we resort to MCMC or variational inference.