Probability for AI & Robotics

Multivariate Distributions
for Engineers

Ten distributions that power Kalman filters, topic models, Gaussian processes, and Bayesian inference — built from absolute zero.

Prerequisites: Basic probability + Vectors & matrices. That's it.
12
Chapters
12+
Simulations
0
Assumed Knowledge

Chapter 0: Why Multivariate?

You're building a self-driving car. The vehicle has a position x, a velocity v, and a heading θ. These three quantities aren't independent — a car moving fast probably moved far from where it started, and its heading determines which way it went. Treating them as separate scalar distributions would throw away this critical structure.

Multivariate distributions model several random variables together, capturing how they relate to each other. The magic ingredient is the covariance matrix — a symmetric, positive-definite matrix whose off-diagonal entries encode correlations.

The core idea: A univariate distribution says "this value is uncertain." A multivariate distribution says "these values are uncertain and linked to each other." That linkage — covariance — is what makes Kalman filters, Gaussian processes, and topic models work.

The covariance matrix Σ is a d×d matrix for d variables. Its diagonal entries Σii are variances (how spread each variable is). Its off-diagonal entries Σij capture how variables i and j co-vary. If Σij > 0, knowing that variable i is above average tells you variable j probably is too.

The covariance matrix must always be symmetricij = Σji) and positive semi-definite (no direction of negative variance). These constraints matter: when you construct priors on covariance matrices (chapters 3-4), the distribution must respect them. The correlation matrix R normalizes Σ so diagonal entries are 1: Rij = Σij / (σiσj). Entries of R lie between −1 and 1.

Joint Distribution
Models all variables together: p(x1, ..., xd)
Covariance Matrix
Σij = E[(xi − μi)(xj − μj)]
Correlation Matrix
Rij = Σij / (σiσj), entries in [−1, 1]
Covariance Shapes Uncertainty

Adjust the correlation ρ between two variables. When ρ=0, the scatter is circular. As ρ grows, the cloud stretches into an ellipse — knowing one variable tells you about the other.

Correlation ρ0.00
Why can't we treat correlated variables as independent scalars?

Chapter 1: The Multivariate Gaussian

If you learn one multivariate distribution, make it this one. The multivariate Gaussian (or multivariate normal) is the workhorse of estimation, robotics, and machine learning. It's defined by a mean vector μ (the center) and a covariance matrix Σ (the shape of uncertainty).

p(x) = (2π)-d/2 |Σ|-1/2 exp(−½ (x − μ)T Σ−1 (x − μ))

That exponent (x − μ)T Σ−1 (x − μ) is the Mahalanobis distance — it measures how far x is from the mean, scaled by the covariance. Points along the direction of high variance are "closer" in Mahalanobis terms. This is why the contours are ellipses, not circles.

Why it's everywhere: The multivariate Gaussian is closed under conditioning, marginalization, and linear transformation. If your state is Gaussian and your model is linear, the output is still Gaussian. This is why the Kalman filter works — it never leaves the Gaussian family.

Key properties: If x ~ N(μ, Σ), then any linear transformation Ax + b is also Gaussian: N(Aμ+b, AΣAT). Any subset of variables is Gaussian (marginalization). And the conditional distribution of some variables given others is also Gaussian — with a mean that shifts linearly and a covariance that shrinks.

Conditioning example: If [x, y]T is jointly Gaussian with correlation ρ, then knowing y = y0 makes x conditionally Gaussian with mean μx + ρ(σxy)(y0 − μy) and variance σx2(1 − ρ2). The higher the correlation, the more knowing y shrinks the uncertainty in x.

Where it appears: Kalman filters (state belief), Gaussian processes (function priors), variational autoencoders (latent space), sensor fusion, any system where you track mean + covariance.

2D Gaussian: Contours & Samples

Adjust the variances and correlation. The teal ellipses are 1σ and 2σ contours. Orange dots are samples.

σx1.00
σy1.00
ρ0.50
python
import numpy as np

mu = np.array([0, 0])
Sigma = np.array([[1.0, 0.5],
                   [0.5, 1.0]])

samples = np.random.multivariate_normal(mu, Sigma, size=1000)
# samples.shape = (1000, 2)
What does the Mahalanobis distance measure?

Chapter 2: The Dirichlet Distribution

Imagine you're building a topic model. Each document is a mixture of topics: 30% sports, 50% politics, 20% tech. Where do those mixing proportions come from? You need a distribution over probability vectors — vectors whose entries are non-negative and sum to 1. That's the simplex, and the Dirichlet distribution lives on it.

p(x1, ..., xK) = (1/B(α)) ∏i=1K xiαi−1

The parameter vector α = (α1, ..., αK) controls where mass concentrates on the simplex. When all αi are large and equal, the distribution concentrates near the center (uniform mixing). When all αi are small (<1), it concentrates near the corners (sparse, one-dominant-category vectors). When αi are unequal, it favors certain categories.

Think of it this way: The Dirichlet is to the categorical distribution what the Beta is to the Bernoulli. The Beta gives you a prior over a single coin's bias. The Dirichlet gives you a prior over the probabilities of a K-sided die.

Where it appears: Latent Dirichlet Allocation (LDA) for topic models, Bayesian priors on multinomial parameters, mixture model weights, any time you need a distribution over "how to split probability among K categories."

Worked example: You're estimating the probabilities of a 3-sided die. You start with a Dirichlet(1,1,1) prior (uniform — any combination is equally likely). You roll the die 10 times and see: 4 ones, 3 twos, 3 threes. The posterior is Dirichlet(1+4, 1+3, 1+3) = Dirichlet(5, 4, 4). The posterior mean is (5/13, 4/13, 4/13) ≈ (0.38, 0.31, 0.31) — shifted toward the observed frequencies, but regularized by the prior.

python
import numpy as np

# Prior: uniform Dirichlet
alpha_prior = np.array([1, 1, 1])
counts = np.array([4, 3, 3])  # observed data

alpha_post = alpha_prior + counts  # [5, 4, 4]
posterior_mean = alpha_post / alpha_post.sum()
# array([0.385, 0.308, 0.308])

# Sample probability vectors from the posterior
samples = np.random.dirichlet(alpha_post, size=1000)
Dirichlet on the 3-Simplex

The triangle is the 3-simplex (all probability vectors of length 3). Each orange dot is a sampled probability vector. Adjust α to see how the distribution changes.

α11.0
α21.0
α31.0
When all αi < 1, where do Dirichlet samples concentrate?

Chapter 3: The Wishart Distribution

You've measured 50 data points in 3D and computed their sample covariance matrix. How much should you trust it? If you had only 4 data points, that 3×3 covariance estimate would be wildly noisy. You need a distribution over covariance matrices to quantify this uncertainty. That's the Wishart.

The Wishart distribution W(ν, V) has two parameters: the degrees of freedom ν (how many "pseudo-observations" inform the estimate) and the scale matrix V (the expected shape, scaled by ν). Its mean is ν · V.

X ~ Wd(ν, V)   ⇔   X = ∑i=1ν zi ziT,   zi ~ N(0, V)

That's the key intuition: a Wishart matrix is a sum of outer products of Gaussian draws. This is exactly the form of a sample covariance matrix (up to a 1/n factor). So the Wishart naturally describes the distribution of sample covariance matrices from Gaussian data.

Degrees of freedom as confidence: When ν is small (close to d), the Wishart produces highly variable matrices — your covariance estimate is uncertain. As ν grows, the distribution concentrates around νV — your estimate becomes reliable. It's a confidence dial for covariance.

Where it appears: Sample covariance distribution in multivariate statistics, Bayesian estimation of precision matrices, prior in Gaussian mixture models, MANOVA tests.

Key properties: The Wishart is the multivariate generalization of the chi-squared distribution. In 1D, if you sum the squares of ν standard normal draws, you get a chi-squared(ν) variable. In d dimensions, if you sum the outer products of ν multivariate normal draws, you get a Wishart matrix. The degrees of freedom ν must be at least d (the dimension) for the distribution to be non-degenerate.

Wishart Samples as Covariance Ellipses

Each orange ellipse is one sampled 2×2 covariance matrix from a Wishart. The teal ellipse is the mean. Higher ν means tighter clustering around the mean.

ν (degrees of freedom)5
What happens to Wishart samples as ν increases?

Chapter 4: The Inverse Wishart

The Wishart describes the distribution of sums of outer products (like sample covariance matrices scaled by n). But when we do Bayesian inference, we usually want a prior directly on the covariance matrix Σ of a Gaussian. The Inverse Wishart is the conjugate prior for this job.

If X ~ W(ν, V), then X−1 ~ IW(ν, V−1). The Inverse Wishart IW(ν, Ψ) has degrees of freedom ν and a scale matrix Ψ. Its mean is Ψ / (ν − d − 1), which only exists when ν > d + 1.

Σ ~ IW(ν, Ψ)   ⇒   E[Σ] = Ψ / (ν − d − 1)
Conjugate magic: If your prior on Σ is IW(ν0, Ψ0) and you observe n data points from N(μ, Σ), the posterior on Σ is also Inverse Wishart: IW(ν0 + n, Ψ0 + S), where S is the sum of squared deviations. Prior parameters act as "pseudo-data" that get added to real data.

Where it appears: Prior on the covariance in Bayesian linear regression, Gaussian mixture model fitting with unknown covariance, any Bayesian analysis where Σ is unknown.

A practical limitation: The Inverse Wishart couples the variances and correlations together in ways that can be restrictive. For a d×d matrix, increasing ν simultaneously shrinks the uncertainty in all variance and correlation parameters. In high dimensions, this inflexibility can be problematic. That's one reason the LKJ + half-Cauchy decomposition (Chapter 6) has become the modern standard in probabilistic programming.

PropertyWishartInverse Wishart
Samples arePD matrices (like sample covariance)PD matrices (like true covariance)
Conjugate prior forPrecision Λ = Σ−1Covariance Σ
Meanν · VΨ / (ν − d − 1)
ConcentrationIncreases with νIncreases with ν
Inverse Wishart: Bayesian Covariance Update

The purple ellipses show prior IW samples. Click "Add Data" to observe points from a true Gaussian. The teal ellipses show posterior samples — they converge to the truth.

Prior strength ν04
Why is the Inverse Wishart popular as a prior on covariance?

Chapter 5: The Multivariate Student-t

The multivariate Gaussian is beautiful but fragile. One outlier can destroy your mean estimate. If you're tracking a robot and a sensor glitches, reporting a position 100 meters off, the Gaussian posterior shifts dramatically. You need heavier tails to gracefully absorb outliers.

The multivariate Student-t distribution has an extra parameter ν (degrees of freedom) that controls tail heaviness. When ν is small (like 3), the tails are very fat — extreme values are plausible. As ν → ∞, it converges to a Gaussian.

p(x) ∝ (1 + ν−1(x−μ)TΣ−1(x−μ))−(ν+d)/2

The key difference from the Gaussian: instead of exp(−...), we have a power law decay. This means the probability of extreme events falls off polynomially, not exponentially. The t-distribution is actually a Gaussian scale mixture: each point is drawn from a Gaussian whose variance is itself random (drawn from an Inverse-Gamma). Some points get large variances — those are the outliers.

Gaussian scale mixture: Draw w ~ InvGamma(ν/2, ν/2). Then draw x ~ N(μ, wΣ). Marginalizing out w gives the multivariate-t. Points where w is large produce outliers naturally.

Where it appears: Robust state estimation, outlier-resistant Kalman filters, financial risk modeling, Bayesian predictive distributions (when Σ is estimated, the predictive is t, not Gaussian).

A surprising fact: When you fit a Gaussian to data and then predict a new data point, the predictive distribution is not Gaussian — it's a Student-t. That's because you're uncertain about Σ, and marginalizing over that uncertainty fattens the tails. With n data points in d dimensions, the predictive has ν = n − d + 1 degrees of freedom. More data → larger ν → closer to Gaussian.

python
import numpy as np
from scipy.stats import multivariate_t

# Multivariate-t with 3 degrees of freedom
rv = multivariate_t(
    loc=[0, 0],
    shape=[[1, 0.5], [0.5, 1]],
    df=3
)
samples = rv.rvs(size=1000)
# Notice outliers that would be nearly impossible under a Gaussian
Gaussian vs Student-t: Tail Behavior

Teal = Gaussian samples. Orange = Student-t samples. Lower ν means fatter tails and more outliers.

ν (degrees of freedom)3
Why use a multivariate-t instead of a Gaussian?

Chapter 6: The LKJ Distribution

You're building a Bayesian model with several correlated parameters. You need a prior on the correlation matrix R. But correlation matrices are tricky — they must be symmetric, positive definite, with ones on the diagonal and off-diagonal entries between −1 and 1. The LKJ distribution (named for Lewandowski, Kurowicka, and Joe) is designed exactly for this.

p(R) ∝ |R|η−1

The single parameter η controls how much the distribution favors the identity matrix (no correlation). When η = 1, the distribution is uniform over all valid correlation matrices. When η > 1, it concentrates near the identity (shrinks correlations toward zero). When η < 1, it favors matrices with strong correlations.

Why not just use Inverse Wishart? The IW conflates scale (variances) with structure (correlations). If you want to put separate priors on the standard deviations and the correlations, you decompose Σ = diag(σ) · R · diag(σ) and put an LKJ prior on R. This is the standard approach in Stan, PyMC, and other probabilistic programming languages.

Where it appears: Stan and PyMC priors on correlation, hierarchical models, any Bayesian model where you want to separate the "how big" (scale) from the "how related" (correlation) aspects of covariance.

python (PyMC)
import pymc as pm

with pm.Model():
    # Separate priors on scale and correlation
    sigma = pm.HalfCauchy('sigma', beta=2.5, shape=3)
    chol, corr, stds = pm.LKJCholeskyCov(
        'chol', n=3, eta=2.0,  # eta>1: shrink toward identity
        sd_dist=pm.HalfCauchy.dist(beta=2.5)
    )
    # cov = diag(sigma) @ corr @ diag(sigma)
LKJ: Sampled 2×2 Correlation Matrices

A 2×2 correlation matrix has one free parameter: ρ. This histogram shows the distribution of ρ under different η values.

η1.0
When η = 1, what does the LKJ distribution look like?

Chapter 7: The Dirichlet Process

You're clustering customer data. How many clusters should there be? 3? 10? 50? The Dirichlet Process (DP) answers: "let the data decide." It's a distribution over distributions — specifically, over discrete probability measures with a potentially infinite number of atoms (clusters).

The DP has two parameters: a concentration parameter α and a base distribution G0. Small α produces few clusters. Large α produces many. The DP is not a distribution over finite vectors like the Dirichlet — it's a distribution over infinite-dimensional objects.

The Chinese Restaurant Process (CRP): Imagine customers entering a restaurant one at a time. Each customer either sits at an existing table (with probability proportional to how many people are already there) or starts a new table (with probability proportional to α). After all customers are seated, the table assignments define a clustering. This is the Dirichlet Process.
Customer n arrives
Look at existing tables
Join table k
Probability ∝ nk (number already seated)
or
Start new table
Probability ∝ α

Where it appears: Nonparametric clustering (infinite mixture models), topic models with unknown number of topics, density estimation, any setting where you don't want to pre-specify the number of components.

Stick-breaking view: Another way to construct a DP: break a unit-length stick repeatedly. At each step, break off a fraction βk ~ Beta(1, α) of the remaining stick. The length of piece k becomes the weight πk of the k-th cluster. Small α means you break off large chunks early (few clusters). Large α means small chips (many clusters). The weights sum to 1, but there are infinitely many of them.

βk ~ Beta(1, α),    πk = βkj=1k-1 (1 − βj)
Chinese Restaurant Process Simulator

Each circle is a customer. Colors represent tables. Click "Seat Customer" to add one. Higher α means more new tables.

α (concentration)1.0
Customers: 0  |  Tables: 0
In the Chinese Restaurant Process, what does a larger α produce?

Chapter 8: The Gaussian Process

You have 5 data points and want to predict the function that generated them. Not just a single best-fit line — you want a distribution over all possible functions that pass near those points, with uncertainty bands that widen where you have no data. That's a Gaussian Process (GP).

A GP defines a distribution over functions. Any finite collection of function values f(x1), ..., f(xn) is jointly Gaussian. The GP is fully specified by a mean function m(x) (usually zero) and a kernel (covariance function) k(x, x') that says how correlated function values at x and x' should be.

f ~ GP(m(x), k(x, x'))   ⇒   [f(x1), ..., f(xn)]T ~ N(m, K)

The most common kernel is the squared exponential (RBF): k(x, x') = σ2 exp(−||x−x'||2 / 2l2). The lengthscale l controls how smooth the function is (large l = slowly varying). The signal variance σ2 controls the amplitude.

KernelFormulaFunctions Look Like
Squared Exponentialσ2 exp(−||x−x'||2/2l2)Infinitely smooth
Matérn 3/2σ2(1+√3 r/l)exp(−√3 r/l)Once differentiable
Periodicσ2 exp(−2sin2(π|x−x'|/p)/l2)Periodic with period p
Conditioning = learning: Before seeing data, the GP gives you a prior over smooth functions. After conditioning on observations, the posterior mean passes through the data points, and the posterior variance shrinks near observed locations. This is exact Bayesian inference — no approximation needed.

Where it appears: Bayesian optimization (surrogate model for expensive functions), geostatistics (kriging), robot model learning, hyperparameter tuning, any regression problem where uncertainty quantification matters.

Worked example: You're optimizing an expensive neural network training run. Each run costs $100 and takes 4 hours. You've tried 5 hyperparameter settings and observed validation accuracies. A GP models accuracy as a function of learning rate. The posterior mean tells you the expected accuracy at any untried learning rate. The posterior variance tells you where you're uncertain. Bayesian optimization picks the next point by balancing exploration (high variance) with exploitation (high mean).

python
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

X_train = np.array([[0.001], [0.01], [0.1], [0.5], [1.0]])
y_train = np.array([0.65, 0.89, 0.93, 0.78, 0.55])

gp = GaussianProcessRegressor(kernel=RBF(0.5))
gp.fit(X_train, y_train)

X_test = np.linspace(0, 1.2, 100).reshape(-1, 1)
mu, sigma = gp.predict(X_test, return_std=True)
# mu = posterior mean, sigma = posterior std at each test point
Gaussian Process Regression

Click on the canvas to add observations. The teal line is the GP posterior mean. The shaded band is ±2σ. Orange dots are your data.

Lengthscale l0.50
Noise σn0.10
What does the GP's lengthscale parameter control?

Chapter 9: Normal-Inverse-Wishart

In practice, you rarely know the mean or the covariance of a Gaussian. You need a joint prior on both μ and Σ. The Normal-Inverse-Wishart (NIW) distribution is the conjugate prior for this case: it provides a prior on (μ, Σ) that updates cleanly when you observe data from N(μ, Σ).

Σ ~ IW(ν, Ψ),    μ | Σ ~ N(μ0, Σ / κ)

Four hyperparameters govern the NIW: μ0 (prior mean location), κ (how strongly you believe in μ0), ν (degrees of freedom for covariance), and Ψ (scale matrix). Notice that the prior on μ depends on Σ — the mean's uncertainty scales with the covariance itself. This coupling is what makes the conjugate update work.

The update rule: After observing n data points with sample mean x̄ and scatter matrix S:
κn = κ0 + n,   νn = ν0 + n
μn = (κ0μ0 + n·x̄) / κn
Ψn = Ψ0 + S + (κ0n/κn)(x̄ − μ0)(x̄ − μ0)T
Everything is a weighted combination of prior and data.

Where it appears: Full Bayesian Gaussian inference (unknown mean and covariance), Bayesian Gaussian mixture models, Bayesian linear regression with unknown noise covariance.

Worked example: You're fitting a 2D Gaussian to sensor data. Your prior: NIW with μ0 = [0, 0], κ0 = 1, ν0 = 4, Ψ0 = I. You observe 20 data points with sample mean [1.2, 0.8] and scatter matrix S. The posterior is still NIW with updated parameters — κ20 = 21 (so the mean is tightly determined), ν20 = 24 (so the covariance is well-estimated). The posterior mean of μ is (1·[0,0] + 20·[1.2,0.8]) / 21 ≈ [1.14, 0.76] — dominated by data.

python
import numpy as np

# NIW prior
mu0 = np.array([0, 0])
kappa0, nu0 = 1, 4
Psi0 = np.eye(2)

# Observed data
data = np.random.multivariate_normal([1.2, 0.8], [[0.5, 0.2], [0.2, 0.8]], 20)
n = len(data)
xbar = data.mean(axis=0)
S = (data - xbar).T @ (data - xbar)

# NIW posterior
kappa_n = kappa0 + n
nu_n = nu0 + n
mu_n = (kappa0 * mu0 + n * xbar) / kappa_n
diff = (xbar - mu0).reshape(-1, 1)
Psi_n = Psi0 + S + (kappa0 * n / kappa_n) * diff @ diff.T
NIW: Joint Inference on Mean & Covariance

Each sample from the NIW posterior gives a (μ, Σ) pair. Orange dots = sampled means. Teal ellipses = corresponding covariances. Add data to see the posterior tighten.

κ0 (mean confidence)1.0
Why does the NIW couple the prior on μ with Σ?

Chapter 10: Von Mises-Fisher

Text embeddings from models like BERT live on the unit sphere — they're normalized to unit length, and similarity is measured by cosine (angle), not Euclidean distance. The Gaussian doesn't respect this geometry; it would put mass outside the sphere. You need a distribution designed for directional data.

The Von Mises-Fisher (vMF) distribution is the "Gaussian on a sphere." It's defined by a mean direction μ (a unit vector) and a concentration parameter κ. When κ = 0, the distribution is uniform on the sphere. As κ grows, it concentrates tightly around μ.

p(x | μ, κ) = Cd(κ) · exp(κ · μT x)

The exponent κ · μTx is just κ times the cosine similarity between x and μ. Points aligned with μ (high cosine) get high probability. Points opposite to μ get low probability. The normalizing constant Cd(κ) involves Bessel functions and depends on the dimension d.

vMF vs Gaussian: The Gaussian is for unconstrained Rd. The vMF is for the unit sphere Sd-1. Use vMF whenever your data are directions (unit vectors, angles, orientations) rather than positions. In 2D, the vMF reduces to the von Mises distribution on the circle.

Properties: The vMF mean direction is μ (the parameter itself), and the mean resultant length is R̄ = Id/2(κ) / Id/2-1(κ), where Iv is the modified Bessel function. When κ = 0, R̄ = 0 (uniform spread). As κ → ∞, R̄ → 1 (all mass concentrated at μ). Maximum likelihood estimation of κ from data uses the equation R̄data = Id/2(κ) / Id/2-1(κ), which must be solved numerically.

PropertyGaussian (on Rd)vMF (on Sd-1)
DomainAll of RdUnit sphere
Centerμ (mean vector)μ (mean direction)
SpreadΣ (covariance matrix)1/κ (inverse concentration)
Distance metricEuclidean / MahalanobisCosine (dot product)
In 1D becomesNormal distributionVon Mises (on the circle)

Where it appears: Directional statistics (wind directions, animal migration), text embedding clustering, spherical topic models, robot orientation estimation, molecular biology (protein structure angles).

Worked example: You're clustering text embeddings from a sentence transformer. Each embedding is a 384-dimensional unit vector. The cosine similarity between two embeddings equals their dot product (since they're unit length). A vMF mixture model clusters these embeddings naturally, because vMF's probability depends on μTx = cos(θ) — exactly the similarity metric you care about. Each cluster has a mean direction μk and concentration κk.

python
import numpy as np

def sample_vmf(mu, kappa, n=100):
    """Sample from vMF in d dimensions (Wood's algorithm)."""
    d = len(mu)
    result = []
    for _ in range(n):
        # Sample w from the marginal
        w = _sample_w(kappa, d)
        # Sample v uniformly on the (d-2)-sphere
        v = np.random.randn(d - 1)
        v /= np.linalg.norm(v)
        # Combine: x = sqrt(1-w^2)*v + w*mu
        sample = np.sqrt(1 - w**2) * _householder(mu, v) + w * mu
        result.append(sample)
    return np.array(result)
Von Mises-Fisher on the Circle

In 2D, the vMF becomes the von Mises distribution on the circle. The orange dots are samples. The teal curve is the density. Drag κ to see concentration change.

κ (concentration)3.0
Mean direction θ0.00
When should you use the Von Mises-Fisher instead of a Gaussian?

Chapter 11: The Landscape

You've now met ten multivariate distributions. Each solves a specific problem in the Bayesian modeling toolkit. Here's how they connect.

Distribution Relationship Map

The arrows show key relationships: conjugacy, limiting behavior, and composition.

DistributionLives OnKey UseParameters
Multivariate GaussianRdState estimation, GP, VAEμ, Σ
DirichletSimplexTopic models, categorical priorα1..K
WishartPD matricesSample covariance distributionν, V
Inverse WishartPD matricesPrior on Σν, Ψ
Multivariate-tRdRobust estimation, outlier resistanceμ, Σ, ν
LKJCorrelation matricesPrior on R in Stan/PyMCη
Dirichlet ProcessDistributionsNonparametric clusteringα, G0
Gaussian ProcessFunctionsBayesian optimization, regressionm(x), k(x,x')
Normal-Inverse-WishartRd × PDJoint prior on μ, Σμ0, κ, ν, Ψ
Von Mises-FisherUnit sphereDirectional statistics, embeddingsμ, κ

When to Use What

Rule of thumb:
Data in Rd: Start with the multivariate Gaussian. If outliers matter, use multivariate-t.
Probability vectors: Dirichlet. Unknown number of categories? Dirichlet Process.
Prior on covariance: Inverse Wishart (simple), or LKJ + half-Cauchy (more flexible).
Prior on mean + covariance: Normal-Inverse-Wishart.
Functions: Gaussian Process.
Directions: Von Mises-Fisher.

Connections to Other Lessons

The Kalman Filter is built entirely on multivariate Gaussians. Bayesian Estimation uses conjugate priors like the ones we studied (Inverse Wishart, NIW). GPT and Diffusion Models both use high-dimensional Gaussians in their core mechanics.

The meta-pattern: Every distribution in this lesson exists because engineers needed to put uncertainty on a specific type of mathematical object — vectors, matrices, functions, directions, probability vectors. The object's geometry determines which distribution is appropriate. That's the unifying idea.

The Conjugacy Map

Several distributions form conjugate pairs for the multivariate Gaussian:

UnknownConjugate PriorKnown
μ (mean only)Multivariate GaussianΣ known
Σ (covariance only)Inverse Wishartμ known
(μ, Σ) jointlyNormal-Inverse-WishartNothing known
Λ = Σ−1 (precision)Wishartμ known

Conjugacy means the posterior has the same functional form as the prior, with updated parameters. This is the engine behind tractable Bayesian inference. When conjugacy isn't available (as in most modern models), we resort to MCMC or variational inference.

Closing thought: "The purpose of computing is insight, not numbers." — Richard Hamming. Every distribution in this lesson is a tool for representing what you know and what you don't. The better you understand their shapes, the better you can reason under uncertainty.
You need a prior on the correlation structure (not the scale) of a multivariate model. Which distribution?