The ten distributions that power modern Bayesian inference — from conjugate priors to the Gumbel-Softmax trick.
You want to estimate an unknown parameter — maybe the mean of a sensor, the sparsity pattern of a neural network, or the topic mixture of a document. You have data, but data alone is never enough. You need a prior: a distribution that encodes what you believed before seeing any data.
Pick an awkward prior and inference becomes intractable: the posterior has no closed form. Pick the right prior and mathematics hands you the answer on a silver platter. That's the magic of conjugacy: when the prior is conjugate to the likelihood, the posterior belongs to the same family as the prior, so updating reduces to adjusting a few parameters.
But not every problem fits a conjugate pair. Modern Bayesian inference uses distributions designed for specific structural assumptions: sparsity, boundedness, compositions, extremes. This lesson covers ten such distributions, each solving a problem that standard Gaussians and Betas cannot.
Flip a coin with unknown bias θ. The orange curve is your prior Beta(α,β). Click "Flip" to add data. The teal posterior is always another Beta — that's conjugacy.
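The coin-flip update can be sketched in a few lines. This is a minimal illustration of Beta-Bernoulli conjugacy; the function names `beta_update` and `beta_mean` and the example data are hypothetical, not part of the lesson's widget.

```python
def beta_update(alpha, beta, flips):
    """Return posterior Beta parameters after observing coin flips.

    flips is an iterable of 0/1 outcomes (1 = heads).
    """
    flips = list(flips)
    heads = sum(flips)
    tails = len(flips) - heads
    # Conjugacy: heads are added to alpha, tails to beta.
    return alpha + heads, beta + tails


def beta_mean(alpha, beta):
    """Mean of a Beta(alpha, beta) distribution."""
    return alpha / (alpha + beta)


# Hypothetical example: uniform prior Beta(1, 1), then 7 heads and 3 tails.
a_post, b_post = beta_update(1.0, 1.0, [1, 1, 1, 0, 1, 1, 0, 1, 0, 1])
print(a_post, b_post, beta_mean(a_post, b_post))
```

The posterior is fully described by two numbers, which is why no numerical integration is needed.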
A single Gaussian can only model one bump. Real data often has multiple clusters — heights of men and women, income groups, cell types in a tissue sample. A Gaussian Mixture Model (GMM) is a weighted sum of K Gaussians, each representing one cluster.
The mixture density is p(x) = ∑k πk N(x; μk, σk²). Each component k has a weight πk (how common that cluster is), a mean μk (cluster center), and a variance σk² (cluster spread). The weights must sum to 1: ∑k πk = 1.
| Property | Value |
|---|---|
| Support | (−∞, +∞) |
| Parameters | πk, μk, σk for each component |
| Mean | ∑ πk μk |
| Fitting | EM algorithm (Expectation-Maximization) |
| Used in | Clustering, density estimation, speech recognition, GANs |
Adjust the means, standard deviations, and weights of three Gaussian components. The white curve is the mixture density.
```python
import numpy as np
from scipy.stats import norm


def gmm_pdf(x, mus, sigmas, weights):
    """Evaluate GMM density at points x."""
    # Weighted sum of the component densities.
    return sum(w * norm.pdf(x, m, s) for w, m, s in zip(weights, mus, sigmas))
```
What if your posterior is so complex that no named distribution can describe it? Maybe it's multimodal, skewed, or lives on a weird manifold. The particle distribution (empirical distribution) sidesteps the problem entirely: represent the distribution as a collection of N weighted samples (particles).
The approximation is p(x) ≈ ∑i wi δ(x − xi). Each particle xi is a specific value, and wi is its weight (with ∑wi = 1). The δ is a Dirac delta, a spike at that point. More total weight in a region means higher density there. This is the foundation of particle filters and Sequential Monte Carlo (SMC).
The critical operation is resampling: when particle weights become uneven (a few particles hog all the weight), you duplicate high-weight particles, discard low-weight ones, and reset every weight to 1/N. This keeps the particles concentrated where the probability mass actually is.
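Resampling can be sketched as follows. This is a minimal version of systematic resampling, one of several standard schemes, assuming the weights are already normalized. The effective sample size (ESS) helper is a common diagnostic for deciding when to resample; it is added here for illustration and is not something the lesson defines.

```python
import random


def effective_sample_size(weights):
    """ESS = 1 / sum(w_i^2) for normalized weights; equals N when uniform."""
    return 1.0 / sum(w * w for w in weights)


def systematic_resample(particles, weights, rng=random):
    """Return N equally weighted particles drawn by systematic resampling.

    Assumes weights are non-negative and sum to 1.
    """
    n = len(particles)
    # One uniform offset, then n evenly spaced pointers in [0, 1).
    start = rng.random() / n
    pointers = [start + i / n for i in range(n)]

    resampled = []
    cumulative = weights[0]
    j = 0
    for p in pointers:
        # Advance to the particle whose cumulative-weight bin contains p.
        while p > cumulative and j < n - 1:
            j += 1
            cumulative += weights[j]
        resampled.append(particles[j])
    # After resampling every particle carries the same weight.
    return resampled, [1.0 / n] * n
```

A typical pattern is to resample only when the ESS drops below some fraction of N (N/2 is a common rule of thumb), since resampling itself adds noise.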
| Property | Value |
|---|---|
| Support | Anywhere the particles are |
| Parameters | {xi, wi} for i = 1..N |
| Mean | ∑ wi xi |
| Key operation | Resampling (multinomial, systematic, stratified) |
| Used in | Particle filters, SMC, robotics SLAM |
Particles (vertical bars) approximate a target distribution (teal curve). Bar height = weight. Click Resample to see weighted resampling in action. Click Scatter to spread particles randomly again.
You're fitting a regression model with 10,000 features, but you suspect only 50 actually matter. You want the model to discover which features are relevant and set the rest to exactly zero. This is variable selection, and the spike-and-slab prior is the Bayesian gold standard for it.
The spike is a point mass at zero — it says "this coefficient is exactly zero, the feature is irrelevant." The slab is a broad Gaussian — it says "this coefficient is nonzero, the feature matters." The mixing weight π controls the prior probability that any given feature is relevant.
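To make the mixture concrete, here is a minimal sketch that draws coefficients from the prior, assuming the usual form (1 − π)·δ0 + π·N(0, σslab²). The function name and arguments are illustrative. This only samples the prior; posterior inference still needs MCMC or a variational approximation.

```python
import random


def sample_spike_and_slab(pi, sigma_slab, p, rng=random):
    """Draw p coefficients from (1 - pi) * delta_0 + pi * N(0, sigma_slab^2)."""
    coefs = []
    for _ in range(p):
        # Inclusion indicator z_j ~ Bernoulli(pi): True means "feature matters".
        included = rng.random() < pi
        # Slab draw if included, exact zero (the spike) otherwise.
        coefs.append(rng.gauss(0.0, sigma_slab) if included else 0.0)
    return coefs
```

With π = 0.005 and p = 10,000, about 50 coefficients come out nonzero on average, matching the scenario described above.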
| Property | Value |
|---|---|
| Support | (−∞, +∞), with a point mass at 0 |
| Parameters | π (inclusion prob), σslab (slab width) |
| Sparsity | Exact zeros with probability 1 − π |
| Computation | Requires MCMC or variational approximation |
| Used in | Genomics, feature selection, sparse Bayesian learning |
The orange spike at zero represents irrelevant features. The teal slab covers nonzero coefficients. Adjust π to control sparsity.
Spike-and-slab is theoretically beautiful but computationally expensive: with p features there are 2^p possible inclusion patterns to explore. The horseshoe prior achieves similar sparsity with a continuous distribution, making it far more practical for high-dimensional problems.
The trick is the global-local structure. The global scale τ controls overall shrinkage (small τ = most coefficients near zero). Each coefficient also gets a local scale λj drawn from a half-Cauchy. The heavy tails of the Cauchy let truly important coefficients escape the shrinkage.
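The global-local structure is easy to simulate. The sketch below assumes the standard hierarchy βj | λj, τ ~ N(0, τ²λj²) with λj ~ half-Cauchy(0, 1), and treats τ as a fixed input rather than giving it its own prior. Function names are illustrative.

```python
import math
import random


def sample_half_cauchy(scale=1.0, rng=random):
    """Draw from half-Cauchy(0, scale) via the inverse CDF x = scale * tan(pi * u / 2)."""
    return scale * math.tan(0.5 * math.pi * rng.random())


def sample_horseshoe(tau, p, rng=random):
    """Draw p coefficients beta_j ~ N(0, tau^2 * lambda_j^2), lambda_j ~ C+(0, 1)."""
    draws = []
    for _ in range(p):
        lam = sample_half_cauchy(rng=rng)      # local scale for coefficient j
        draws.append(rng.gauss(0.0, tau * lam))  # global scale tau shrinks everything
    return draws
```

Most draws land very close to zero, while an occasional large λj produces a coefficient many multiples of τ away. That combination is the behavior the prior is designed for.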
| Property | Value |
|---|---|
| Support | (−∞, +∞) |
| Parameters | τ (global shrinkage), λj (local scales) |
| Tails | Polynomial (heavier than Gaussian, like Cauchy) |
| Advantage | Continuous — no combinatorial explosion |
| Used in | High-dimensional regression, Bayesian neural nets, genomics |
The horseshoe (orange) concentrates mass near zero but has heavy tails. Compare with a Gaussian of the same scale. Adjust τ to see global shrinkage.
Here's a common problem: you're observing data from a Gaussian, but you don't know the mean or the variance. You need a joint prior over both. The Normal-Gamma distribution is the conjugate prior for exactly this situation.
The prior factorizes as p(μ, λ) = N(μ | μ0, 1/(κ0λ)) · Gamma(λ | α0, β0). Here λ = 1/σ² is the precision (inverse variance). The prior says: first draw a precision λ from a Gamma, then draw the mean μ from a Gaussian whose variance depends on λ. This coupling is what makes the math work out.
After observing data, the updates are elegant: κn = κ0 + n, μn is the precision-weighted average of prior and sample means, αn = α0 + n/2, and βn incorporates the sample variance and the prior-data disagreement.
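The update rules can be written out explicitly. The sketch below assumes the common parameterization μ | λ ~ N(μ0, 1/(κ0λ)) and λ ~ Gamma(α0, rate β0); the function name is illustrative.

```python
def normal_gamma_update(mu0, kappa0, alpha0, beta0, data):
    """Posterior Normal-Gamma parameters after observing Gaussian data."""
    n = len(data)
    if n == 0:
        return mu0, kappa0, alpha0, beta0
    xbar = sum(data) / n
    # Sum of squared deviations from the sample mean.
    ss = sum((x - xbar) ** 2 for x in data)

    kappa_n = kappa0 + n
    mu_n = (kappa0 * mu0 + n * xbar) / kappa_n
    alpha_n = alpha0 + n / 2
    # Sample scatter plus a penalty for prior-data disagreement.
    beta_n = beta0 + 0.5 * ss + kappa0 * n * (xbar - mu0) ** 2 / (2 * kappa_n)
    return mu_n, kappa_n, alpha_n, beta_n


# Hypothetical example: prior (0, 1, 1, 1) and three observations.
print(normal_gamma_update(0.0, 1.0, 1.0, 1.0, [1.0, 2.0, 3.0]))
```

Because the family is conjugate, feeding the data in one batch or one observation at a time gives the same posterior parameters, which is what makes it convenient for online learning.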
| Property | Value |
|---|---|
| Support | μ ∈ (−∞, +∞), λ ∈ (0, ∞) |
| Parameters | μ0, κ0, α0, β0 |
| Conjugate to | Gaussian likelihood with unknown mean and variance |
| Marginal on μ | Student-t distribution |
| Used in | Bayesian regression, online learning, Kalman filters |
The heatmap shows the joint density p(μ, λ). Click Add Data to draw random observations and watch the posterior sharpen around the true parameters.
You need a distribution over probability vectors — like topic proportions in a document (30% politics, 50% sports, 20% tech). The Dirichlet distribution is the standard choice, but it has a limitation: its components are negatively correlated by construction. If topic A increases, the others must decrease. What if topics tend to co-occur?
The Logistic-Normal distribution fixes this. Draw a vector from a multivariate Gaussian, then push it through the softmax function. The Gaussian's covariance matrix lets you model arbitrary correlations between components.
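The two-step recipe (Gaussian draw, then softmax) can be sketched with the standard library alone. The hand-written Cholesky factorization is an implementation choice made here to avoid dependencies. The sketch follows the softmax construction described above rather than the K−1 dimensional log-ratio form also found in the literature, and all function names are illustrative.

```python
import math
import random


def cholesky(cov):
    """Lower-triangular L with L @ L.T == cov (cov must be positive definite)."""
    n = len(cov)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = math.sqrt(cov[i][i] - s)
            else:
                L[i][j] = (cov[i][j] - s) / L[j][j]
    return L


def softmax(z):
    """Numerically stable softmax: subtract the max before exponentiating."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def sample_logistic_normal(mu, cov, rng=random):
    """Draw one probability vector: z ~ N(mu, cov), then softmax(z)."""
    L = cholesky(cov)
    eps = [rng.gauss(0.0, 1.0) for _ in mu]
    # z = mu + L @ eps has covariance L @ L.T = cov.
    z = [m + sum(L[i][k] * eps[k] for k in range(i + 1)) for i, m in enumerate(mu)]
    return softmax(z)
```

A positive off-diagonal entry in `cov` makes the corresponding two components tend to rise and fall together, which a Dirichlet cannot express.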
| Property | Value |
|---|---|
| Support | Probability simplex (components sum to 1) |
| Parameters | μ (location in log-space), Σ (covariance) |
| Correlations | Arbitrary (unlike Dirichlet) |
| Conjugate? | No — requires variational inference |
| Used in | Topic models, compositional data, VAE priors |
Samples from a Logistic-Normal shown as dots on the probability simplex (triangle). Adjust the mean to shift the cloud, and the correlation to skew it.
A standard Gaussian assigns nonzero probability to the entire real line. But what if your parameter can't be negative? A length, a concentration, a time duration — these are strictly positive. Naively clamping a Gaussian to [0, ∞) changes its mean and variance in hard-to-track ways.
The truncated Gaussian solves this cleanly: take a normal distribution N(μ, σ²) and restrict it to an interval [a, b], then renormalize so the density integrates to 1.
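On [a, b] the density is f(x) = φ((x − μ)/σ) / (σ [Φ((b − μ)/σ) − Φ((a − μ)/σ)]), and it is zero elsewhere. Here φ is the standard normal PDF and Φ is its CDF. The bracketed term in the denominator is just the probability that an untruncated sample would fall in [a, b]. The shape stays Gaussian within the interval, but the effective mean shifts toward the interval center.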
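A minimal sketch of the density, mean, and an inverse-CDF sampler, using `statistics.NormalDist` from the Python standard library. It assumes the standard formulas with α = (a − μ)/σ and β = (b − μ)/σ; the function names are illustrative. Inverse-CDF sampling loses accuracy when [a, b] lies far in a tail, where specialized samplers are preferable.

```python
import random
from statistics import NormalDist

_STD = NormalDist()  # standard normal: provides pdf, cdf, inv_cdf


def _standardize(mu, sigma, a, b):
    """Return standardized bounds and the parent Gaussian's mass in [a, b]."""
    alpha, beta = (a - mu) / sigma, (b - mu) / sigma
    return alpha, beta, _STD.cdf(beta) - _STD.cdf(alpha)


def truncnorm_pdf(x, mu, sigma, a, b):
    """Density of N(mu, sigma^2) restricted to [a, b] and renormalized."""
    if x < a or x > b:
        return 0.0
    _, _, mass = _standardize(mu, sigma, a, b)
    return _STD.pdf((x - mu) / sigma) / (sigma * mass)


def truncnorm_mean(mu, sigma, a, b):
    """Closed-form mean: mu + sigma * (phi(alpha) - phi(beta)) / mass."""
    alpha, beta, mass = _standardize(mu, sigma, a, b)
    return mu + sigma * (_STD.pdf(alpha) - _STD.pdf(beta)) / mass


def truncnorm_sample(mu, sigma, a, b, rng=random):
    """Inverse-CDF sampling: draw u between Phi(alpha) and Phi(beta), then invert."""
    alpha, beta, _ = _standardize(mu, sigma, a, b)
    lo, hi = _STD.cdf(alpha), _STD.cdf(beta)
    u = lo + (hi - lo) * rng.random()
    return mu + sigma * _STD.inv_cdf(u)
```

For a standard normal truncated to [0, 10], `truncnorm_mean` is close to √(2/π), the half-normal mean, since almost no mass is lost on the right.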
| Property | Value |
|---|---|
| Support | [a, b] (user-specified interval) |
| Parameters | μ, σ (parent Gaussian), a, b (bounds) |
| Mean | Shifted toward interval center from μ |
| Variance | ≤ σ² (truncation reduces spread) |
| Used in | Constrained optimization, probit models, Gibbs sampling |
The blue dashed curve is the full Gaussian. The orange region is the truncated density. Notice how truncation shifts the effective mean.
You're measuring the magnitude of some quantity — the absolute deviation from a target, the strength of a signal, the distance a robot drifted from its path. You know the underlying error is Gaussian, but you only observe the absolute value. The result is a folded normal distribution.
The PDF is the sum of the original Gaussian density and its mirror image reflected at zero: f(x) = N(x; μ, σ²) + N(x; −μ, σ²) for x ≥ 0, and 0 otherwise.
When μ = 0, the folded normal simplifies to the half-normal distribution — a popular default prior for scale parameters in Bayesian hierarchical models (recommended by Andrew Gelman as a weakly informative prior for standard deviations).
| Property | Value |
|---|---|
| Support | [0, ∞) |
| Parameters | μ, σ (of the underlying Gaussian) |
| Special case | μ = 0 gives the half-normal |
| Mean | σ√(2/π) exp(−μ²/(2σ²)) + μ(1 − 2Φ(−μ/σ)) |
| Used in | Signal magnitudes, hierarchical model priors, noise modeling |
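The density and the closed-form mean from the table can be checked numerically. This is a minimal sketch using `statistics.NormalDist`; the function names are illustrative, and the PDF is the standard "Gaussian plus mirror image" form.

```python
import math
from statistics import NormalDist


def folded_normal_pdf(x, mu, sigma):
    """Density of |X| for X ~ N(mu, sigma^2): the Gaussian plus its reflection."""
    if x < 0:
        return 0.0
    return NormalDist(mu, sigma).pdf(x) + NormalDist(-mu, sigma).pdf(x)


def folded_normal_mean(mu, sigma):
    """Closed-form mean of the folded normal."""
    std = NormalDist()
    return (sigma * math.sqrt(2 / math.pi) * math.exp(-mu ** 2 / (2 * sigma ** 2))
            + mu * (1 - 2 * std.cdf(-mu / sigma)))
```

Setting `mu = 0` recovers the half-normal: the density doubles the Gaussian on x ≥ 0 and the mean reduces to σ√(2/π).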
The blue dashed curve is the original Gaussian. Its negative portion "folds" to the positive side, creating the orange folded normal.
Imagine measuring the maximum daily temperature every year and asking: "What's the distribution of the annual maximum?" This is an extreme value problem, and the Gumbel distribution is one of the three possible limiting distributions for maxima of independent samples.
The Gumbel has a distinctive asymmetric shape: a steep rise on the left and a long tail on the right (for the maximum case). But its most famous modern use has nothing to do with weather.
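That use is the Gumbel-max trick for sampling from a categorical distribution. If gk ~ Gumbel(0, 1) independently, then argmaxk(log πk + gk) is an exact sample from Categorical(π). Replacing the argmax with a softmax at temperature τ gives a differentiable relaxation, the Gumbel-Softmax; as τ → 0 the relaxed samples approach one-hot vectors.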
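Both steps can be sketched in plain Python. The code assumes every class probability is strictly positive (so the log is defined), and the function names are illustrative. In practice the relaxed sample would be computed inside an autodiff framework so gradients can flow through it.

```python
import math
import random


def sample_gumbel(rng=random):
    """Draw from Gumbel(0, 1) via the inverse CDF: g = -log(-log(u))."""
    u = rng.random()
    while u == 0.0:  # avoid log(0)
        u = rng.random()
    return -math.log(-math.log(u))


def gumbel_max_sample(probs, rng=random):
    """Exact categorical sample: argmax_k(log pi_k + g_k)."""
    scores = [math.log(p) + sample_gumbel(rng) for p in probs]
    return max(range(len(probs)), key=scores.__getitem__)


def gumbel_softmax_sample(probs, temperature, rng=random):
    """Relaxed sample: softmax((log pi_k + g_k) / temperature)."""
    scores = [(math.log(p) + sample_gumbel(rng)) / temperature for p in probs]
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]
```

Drawing many indices with `gumbel_max_sample` reproduces the class frequencies in `probs`, while `gumbel_softmax_sample` returns a point on the simplex that sharpens toward a one-hot vector as the temperature decreases.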
| Property | Value |
|---|---|
| Support | (−∞, +∞) |
| Parameters | μ (location), β (scale) |
| Mean | μ + βγ where γ ≈ 0.5772 (Euler-Mascheroni) |
| Variance | π²β²/6 |
| Used in | Gumbel-Softmax, extreme value theory, RL exploration |
Left: the Gumbel PDF (note the asymmetry). Right: Gumbel-Softmax samples — lower temperature τ approaches a hard one-hot vector.
The Beta distribution is the conjugate prior for Bernoulli and binomial likelihoods, and it's ubiquitous in Bayesian statistics. But it has a dirty secret: its CDF has no closed form. You need the incomplete Beta function, which requires numerical approximation. For variational inference, where you need to compute KL divergences and sample using the reparameterization trick, this is a real problem.
Enter the Kumaraswamy distribution: a two-parameter distribution on [0, 1] that looks like a Beta but has closed-form PDF, CDF, and inverse CDF (quantile function).
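For reference, the standard closed forms for x ∈ [0, 1] and u ∈ (0, 1), with shape parameters a, b > 0, are:

$$
f(x; a, b) = a\,b\,x^{a-1}\,(1 - x^{a})^{b-1}, \qquad
F(x; a, b) = 1 - (1 - x^{a})^{b}, \qquad
F^{-1}(u; a, b) = \bigl(1 - (1 - u)^{1/b}\bigr)^{1/a}.
$$

Solving F(x) = u for x gives the third expression directly, which is why a uniform draw u can be turned into a sample with nothing more than powers.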
| Property | Value |
|---|---|
| Support | [0, 1] |
| Parameters | a > 0, b > 0 |
| CDF | Closed-form (unlike Beta) |
| Reparameterizable | Yes, via inverse CDF |
| Used in | VAEs with bounded latents, stick-breaking constructions, flow models |
The Kumaraswamy (orange) and Beta (teal) with matched parameters. They're similar but not identical — the Kumaraswamy's closed-form CDF is its computational advantage.
```python
import numpy as np


def kumaraswamy_sample(a, b, n=1000):
    """Sample via inverse CDF (no special functions needed)."""
    u = np.random.uniform(0, 1, n)
    # Invert F(x) = 1 - (1 - x**a)**b at the uniform draws.
    return (1 - (1 - u)**(1/b))**(1/a)
```
You've now met ten distributions that power modern Bayesian inference. Each solves a specific structural problem that standard distributions can't handle. Here's your decision guide.
| Distribution | Use When | Key Property |
|---|---|---|
| GMM | Data has multiple clusters | Universal density approximator |
| Particle | Posterior is too complex for any named family | Nonparametric, any shape |
| Spike-and-Slab | You want exact variable selection | True zeros with posterior probabilities |
| Horseshoe | High-dimensional sparsity, scalable | Continuous shrinkage with heavy tails |
| Normal-Gamma | Unknown Gaussian mean AND variance | Conjugate, closed-form updates |
| Logistic-Normal | Correlated proportions / topics | Flexible correlation via Σ |
| Truncated Gaussian | Parameter is bounded to an interval | Gaussian shape with hard constraints |
| Folded Normal | You observe magnitudes (|X|) | Reflects negative mass to positive |
| Gumbel | Extreme values or differentiable discrete sampling | Gumbel-Softmax trick |
| Kumaraswamy | Beta-like prior with reparameterization | Closed-form CDF/inverse CDF |
When your problem has a conjugate pair, use it — the posterior updates are free.
| Likelihood | Conjugate Prior | Posterior |
|---|---|---|
| Bernoulli / Binomial | Beta(α, β) | Beta(α+k, β+n−k) |
| Gaussian (known σ) | Gaussian(μ0, σ0²) | Gaussian (precision-weighted mean) |
| Gaussian (unknown μ, σ) | Normal-Gamma | Normal-Gamma (updated params) |
| Poisson | Gamma(α, β) | Gamma(α+∑x, β+n) |
| Multinomial | Dirichlet(α) | Dirichlet(α+counts) |
| Exponential | Gamma(α, β) | Gamma(α+n, β+∑x) |
All ten distributions on one canvas. Click a name to highlight it.
Every distribution in this lesson was born from a practical need. The Gaussian alone cannot model sparsity, boundedness, compositions, extremes, or arbitrary posteriors. Bayesian inference is as much about choosing the right distributional family as it is about applying Bayes' theorem.