Probability & Inference

Bayesian Distributions

The ten distributions that power modern Bayesian inference — from conjugate priors to the Gumbel-Softmax trick.

Prerequisites: Basic probability + Bayes' theorem. Familiarity with Gaussians helps.
12
Chapters
10+
Simulations
0
Assumed Knowledge

Chapter 0: Why Bayesian Distributions?

You want to estimate an unknown parameter — maybe the mean of a sensor, the sparsity pattern of a neural network, or the topic mixture of a document. You have data, but data alone is never enough. You need a prior: a distribution that encodes what you believed before seeing any data.

Pick the wrong prior and inference becomes intractable — you can't compute the posterior in closed form. Pick the right prior and mathematics hands you the answer on a silver platter. That's the magic of conjugacy: when the prior and likelihood belong to the same family, the posterior does too.

Conjugacy in one sentence: A prior is conjugate to a likelihood if the posterior has the same functional form as the prior. You just update the parameters — no integrals, no approximations, no MCMC.

But not every problem fits a conjugate pair. Modern Bayesian inference uses distributions designed for specific structural assumptions: sparsity, boundedness, compositions, extremes. This lesson covers ten such distributions, each solving a problem that standard Gaussians and Betas cannot.

Prior p(θ)
Your belief before data
× likelihood p(data|θ) ↓
Posterior p(θ|data)
Updated belief after data
conjugate? same family ↻
Conjugacy in Action: Beta-Binomial

Flip a coin with unknown bias θ. The orange curve is your prior Beta(α,β). Click "Flip" to add data. The teal posterior is always another Beta — that's conjugacy.

Prior α2.0
Prior β2.0
Flips: 0 | Heads: 0
Why do Bayesian practitioners care about conjugate priors?

Chapter 1: Gaussian Mixture Model

A single Gaussian can only model one bump. Real data often has multiple clusters — heights of men and women, income groups, cell types in a tissue sample. A Gaussian Mixture Model (GMM) is a weighted sum of K Gaussians, each representing one cluster.

p(x) = ∑k=1K πk · N(x | μk, σk2)

Each component k has a weight πk (how common that cluster is), a mean μk (cluster center), and a variance σk2 (cluster spread). The weights must sum to 1: ∑πk = 1.

Key insight: A GMM is a universal density approximator. With enough components, it can approximate any continuous density to arbitrary accuracy. It's the Swiss Army knife of density estimation.
PropertyValue
Support(−∞, +∞)
Parametersπk, μk, σk for each component
Mean∑ πk μk
FittingEM algorithm (Expectation-Maximization)
Used inClustering, density estimation, speech recognition, GANs
Interactive GMM: Drag Means & Weights

Adjust the means, standard deviations, and weights of three Gaussian components. The white curve is the mixture density.

μ1-2.0
μ20.0
μ32.5
π1 weight0.30
π2 weight0.40
python
import numpy as np
from scipy.stats import norm

def gmm_pdf(x, mus, sigmas, weights):
    """Evaluate GMM density at points x."""
    return sum(w * norm.pdf(x, m, s)
               for w, m, s in zip(weights, mus, sigmas))
In a 3-component GMM with weights [0.5, 0.3, 0.2] and means [-1, 0, 3], what is the overall mean?

Chapter 2: Particle Distribution

What if your posterior is so complex that no named distribution can describe it? Maybe it's multimodal, skewed, or lives on a weird manifold. The particle distribution (empirical distribution) sidesteps the problem entirely: represent the distribution as a collection of N weighted samples (particles).

p(x) ≈ ∑i=1N wi · δ(x − xi)

Each particle xi is a specific value, and wi is its weight (with ∑wi = 1). The δ is a Dirac delta — a spike at that point. More particles in a region means higher density there. This is the foundation of particle filters and Sequential Monte Carlo (SMC).

The power of particles: They can represent any distribution — multimodal, heavy-tailed, discontinuous, you name it. The cost? You need enough particles. Too few and you get a poor approximation. This is the curse of dimensionality for particle methods.

The critical operation is resampling: when particle weights become uneven (a few particles hog all the weight), you duplicate high-weight particles and discard low-weight ones. This keeps the approximation healthy.

PropertyValue
SupportAnywhere the particles are
Parameters{xi, wi} for i = 1..N
Mean∑ wi xi
Key operationResampling (multinomial, systematic, stratified)
Used inParticle filters, SMC, robotics SLAM
Particle Filter Resampling

Particles (vertical bars) approximate a target distribution (teal curve). Bar height = weight. Click Resample to see weighted resampling in action. Click Scatter to spread particles randomly again.

N = 40 particles | Effective N = 40
After resampling, what happens to particles with very low weights?

Chapter 3: Spike-and-Slab Prior

You're fitting a regression model with 10,000 features, but you suspect only 50 actually matter. You want the model to discover which features are relevant and set the rest to exactly zero. This is variable selection, and the spike-and-slab prior is the Bayesian gold standard for it.

p(θj) = (1 − π) · δ(0) + π · N(0, σslab2)

The spike is a point mass at zero — it says "this coefficient is exactly zero, the feature is irrelevant." The slab is a broad Gaussian — it says "this coefficient is nonzero, the feature matters." The mixing weight π controls the prior probability that any given feature is relevant.

Why not just use L1 regularization? L1 (Lasso) encourages sparsity but doesn't truly set coefficients to zero during inference — it's a continuous approximation. Spike-and-slab gives you a posterior probability that each feature is active, which is far more informative for scientific discovery.
PropertyValue
Support{0} ∪ (−∞, +∞)
Parametersπ (inclusion prob), σslab (slab width)
SparsityExact zeros with probability 1 − π
ComputationRequires MCMC or variational approximation
Used inGenomics, feature selection, sparse Bayesian learning
Spike-and-Slab: Variable Selection

The orange spike at zero represents irrelevant features. The teal slab covers nonzero coefficients. Adjust π to control sparsity.

Inclusion π0.20
Slab σ2.0
If π = 0.05 in a spike-and-slab prior over 1000 features, how many features do you expect to be nonzero a priori?

Chapter 4: The Horseshoe Prior

Spike-and-slab is theoretically beautiful but computationally expensive — you need to explore 2p possible inclusion patterns. The horseshoe prior achieves similar sparsity with a continuous distribution, making it far more practical for high-dimensional problems.

θj | λj, τ ~ N(0, λj2τ2)
λj ~ Half-Cauchy(0, 1)

The trick is the global-local structure. The global scale τ controls overall shrinkage (small τ = most coefficients near zero). Each coefficient also gets a local scale λj drawn from a half-Cauchy. The heavy tails of the Cauchy let truly important coefficients escape the shrinkage.

The horseshoe shape: Plot the marginal prior on θj and you get a density with a tall spike near zero (like the nail of a horseshoe) and heavy tails (the curved shoe). It aggressively shrinks noise to zero while leaving large signals untouched.
PropertyValue
Support(−∞, +∞)
Parametersτ (global shrinkage), λj (local scales)
TailsPolynomial (heavier than Gaussian, like Cauchy)
AdvantageContinuous — no combinatorial explosion
Used inHigh-dimensional regression, Bayesian neural nets, genomics
Horseshoe vs Gaussian Prior

The horseshoe (orange) concentrates mass near zero but has heavy tails. Compare with a Gaussian of the same scale. Adjust τ to see global shrinkage.

Global τ0.50
What gives the horseshoe prior its sparsity-inducing property?

Chapter 5: The Normal-Gamma Prior

Here's a common problem: you're observing data from a Gaussian, but you don't know the mean or the variance. You need a joint prior over both. The Normal-Gamma distribution is the conjugate prior for exactly this situation.

p(μ, λ) = N(μ | μ0, 1/(κ0λ)) · Gamma(λ | α0, β0)

Here λ = 1/σ2 is the precision (inverse variance). The prior says: first draw a precision λ from a Gamma, then draw the mean μ from a Gaussian whose variance depends on λ. This coupling is what makes the math work out.

Why precision, not variance? The Gamma distribution has support on (0, ∞), which is exactly right for precision. And with this parameterization, the posterior after observing n data points {xi} is another Normal-Gamma with updated parameters — pure conjugacy.

After observing data, the updates are elegant: κn = κ0 + n, μn is the precision-weighted average of prior and sample means, αn = α0 + n/2, and βn incorporates the sample variance and the prior-data disagreement.

PropertyValue
Supportμ ∈ (−∞, +∞), λ ∈ (0, ∞)
Parametersμ0, κ0, α0, β0
Conjugate toGaussian likelihood with unknown mean and variance
Marginal on μStudent-t distribution
Used inBayesian regression, online learning, Kalman filters
Normal-Gamma: Joint Prior on Mean & Precision

The heatmap shows the joint density p(μ, λ). Click Add Data to draw random observations and watch the posterior sharpen around the true parameters.

Observations: 0 | μ̂ = 0.0 | λ̂ = 1.0
Why is the Normal-Gamma parameterized with precision λ rather than variance σ²?

Chapter 6: The Logistic-Normal

You need a distribution over probability vectors — like topic proportions in a document (30% politics, 50% sports, 20% tech). The Dirichlet distribution is the standard choice, but it has a limitation: its components are negatively correlated by construction. If topic A increases, the others must decrease. What if topics tend to co-occur?

The Logistic-Normal distribution fixes this. Draw a vector from a multivariate Gaussian, then push it through the softmax function. The Gaussian's covariance matrix lets you model arbitrary correlations between components.

z ~ N(μ, Σ),   θk = exp(zk) / ∑j exp(zj)
Logistic-Normal vs Dirichlet: The Dirichlet is conjugate to the multinomial (easy updates) but can only model negative correlations. The Logistic-Normal can model any correlation structure via Σ, but loses conjugacy. It's the prior behind Correlated Topic Models and many VAE decoders.
PropertyValue
SupportProbability simplex (components sum to 1)
Parametersμ (location in log-space), Σ (covariance)
CorrelationsArbitrary (unlike Dirichlet)
Conjugate?No — requires variational inference
Used inTopic models, compositional data, VAE priors
Logistic-Normal on the 3-Simplex

Samples from a Logistic-Normal shown as dots on the probability simplex (triangle). Adjust the mean to shift the cloud, and the correlation to skew it.

μ10.0
μ20.0
Correlation ρ0.0
What advantage does the Logistic-Normal have over the Dirichlet for modeling topic proportions?

Chapter 7: The Truncated Gaussian

A standard Gaussian assigns nonzero probability to the entire real line. But what if your parameter can't be negative? A length, a concentration, a time duration — these are strictly positive. Naively clamping a Gaussian to [0, ∞) changes its mean and variance in hard-to-track ways.

The truncated Gaussian solves this cleanly: take a normal distribution N(μ, σ2) and restrict it to an interval [a, b], then renormalize so the density integrates to 1.

p(x | μ, σ, a, b) = φ((x−μ)/σ) / (σ · [Φ((b−μ)/σ) − Φ((a−μ)/σ)])

Here φ is the standard normal PDF and Φ is the CDF. The denominator is just the probability that an untruncated sample would fall in [a, b]. The shape stays Gaussian within the interval, but the effective mean shifts toward the interval center.

Where it appears: Truncated Gaussians show up in probit regression (latent variables are truncated based on the observed binary outcome), physics simulations with hard constraints, and Gibbs sampling steps where conditional distributions are truncated by other variables.
PropertyValue
Support[a, b] (user-specified interval)
Parametersμ, σ (parent Gaussian), a, b (bounds)
MeanShifted toward interval center from μ
Variance≤ σ2 (truncation reduces spread)
Used inConstrained optimization, probit models, Gibbs sampling
Truncated vs Full Gaussian

The blue dashed curve is the full Gaussian. The orange region is the truncated density. Notice how truncation shifts the effective mean.

μ0.5
Lower bound a0.0
Upper bound b3.0
If you truncate N(0, 1) to [0, ∞), what happens to the mean?

Chapter 8: The Folded Normal

You're measuring the magnitude of some quantity — the absolute deviation from a target, the strength of a signal, the distance a robot drifted from its path. You know the underlying error is Gaussian, but you only observe the absolute value. The result is a folded normal distribution.

Y = |X|   where   X ~ N(μ, σ2)

The PDF is the sum of the original Gaussian and its mirror image reflected at zero:

p(y) = φ((y−μ)/σ)/σ + φ((y+μ)/σ)/σ   for y ≥ 0

When μ = 0, the folded normal simplifies to the half-normal distribution — a popular default prior for scale parameters in Bayesian hierarchical models (recommended by Andrew Gelman as a weakly informative prior for standard deviations).

Folded vs Truncated: Don't confuse them! A truncated Gaussian discards the negative part and renormalizes. A folded Gaussian folds the negative part onto the positive side. The folded normal has more mass near zero (it gets contributions from both tails).
PropertyValue
Support[0, ∞)
Parametersμ, σ (of the underlying Gaussian)
Special caseμ = 0 gives the half-normal
Meanσ√(2/π) exp(−μ2/2σ2) + μ(1 − 2Φ(−μ/σ))
Used inSignal magnitudes, hierarchical model priors, noise modeling
Folding a Gaussian

The blue dashed curve is the original Gaussian. Its negative portion "folds" to the positive side, creating the orange folded normal.

μ1.0
σ1.0
What is the key difference between a folded normal and a truncated normal?

Chapter 9: The Gumbel Distribution

Imagine measuring the maximum daily temperature every year and asking: "What's the distribution of the annual maximum?" This is an extreme value problem, and the Gumbel distribution is one of the three possible limiting distributions for maxima of independent samples.

p(x) = (1/β) exp(−z − exp(−z))   where   z = (x − μ) / β

The Gumbel has a distinctive asymmetric shape: a steep rise on the left and a long tail on the right (for the maximum case). But its most famous modern use has nothing to do with weather.

The Gumbel-Softmax trick: In reinforcement learning and discrete VAEs, you need to sample from a categorical distribution but backpropagate through the sampling step. The Gumbel-Max trick says: add Gumbel noise to log-probabilities, then take the argmax. The Gumbel-Softmax relaxation replaces argmax with softmax, giving a differentiable approximation. This single trick unlocked gradient-based training of discrete latent variable models.

The math: if gk ~ Gumbel(0, 1) independently, then argmaxk(log πk + gk) gives a sample from Categorical(π). Replace argmax with softmax at temperature τ for a differentiable relaxation.

PropertyValue
Support(−∞, +∞)
Parametersμ (location), β (scale)
Meanμ + βγ where γ ≈ 0.5772 (Euler-Mascheroni)
Varianceπ2β2/6
Used inGumbel-Softmax, extreme value theory, RL exploration
Gumbel Distribution & Softmax Trick

Left: the Gumbel PDF (note the asymmetry). Right: Gumbel-Softmax samples — lower temperature τ approaches a hard one-hot vector.

Location μ0.0
Scale β1.0
Softmax τ1.0
Why is the Gumbel-Softmax trick important for training models with discrete latent variables?

Chapter 10: The Kumaraswamy Distribution

The Beta distribution is the conjugate prior for Bernoulli and binomial likelihoods, and it's ubiquitous in Bayesian statistics. But it has a dirty secret: its CDF has no closed form. You need the incomplete Beta function, which requires numerical approximation. For variational inference, where you need to compute KL divergences and sample using the reparameterization trick, this is a real problem.

Enter the Kumaraswamy distribution: a two-parameter distribution on [0, 1] that looks like a Beta but has closed-form PDF, CDF, and inverse CDF (quantile function).

p(x | a, b) = a · b · xa−1 (1 − xa)b−1
F(x) = 1 − (1 − xa)b
F−1(u) = (1 − (1 − u)1/b)1/a
Why this matters for VAEs: In a variational autoencoder, the approximate posterior q(z|x) is usually Gaussian. But if your latent variable lives on [0, 1], Gaussian won't do. The Kumaraswamy lets you reparameterize: sample u ~ Uniform(0,1), apply F−1(u), and you have a Kumaraswamy sample with gradients flowing through a and b. No special functions needed.
PropertyValue
Support[0, 1]
Parametersa > 0, b > 0
CDFClosed-form (unlike Beta)
ReparameterizableYes, via inverse CDF
Used inVAEs with bounded latents, stick-breaking constructions, flow models
Kumaraswamy vs Beta

The Kumaraswamy (orange) and Beta (teal) with matched parameters. They're similar but not identical — the Kumaraswamy's closed-form CDF is its computational advantage.

a2.0
b5.0
python
import numpy as np

def kumaraswamy_sample(a, b, n=1000):
    """Sample via inverse CDF — no special functions!"""
    u = np.random.uniform(0, 1, n)
    return (1 - (1 - u)**(1/b))**(1/a)
What is the main computational advantage of the Kumaraswamy over the Beta distribution?

Chapter 11: When to Use What

You've now met ten distributions that power modern Bayesian inference. Each solves a specific structural problem that standard distributions can't handle. Here's your decision guide.

DistributionUse WhenKey Property
GMMData has multiple clustersUniversal density approximator
ParticlePosterior is too complex for any named familyNonparametric, any shape
Spike-and-SlabYou want exact variable selectionTrue zeros with posterior probabilities
HorseshoeHigh-dimensional sparsity, scalableContinuous shrinkage with heavy tails
Normal-GammaUnknown Gaussian mean AND varianceConjugate, closed-form updates
Logistic-NormalCorrelated proportions / topicsFlexible correlation via Σ
Truncated GaussianParameter is bounded to an intervalGaussian shape with hard constraints
Folded NormalYou observe magnitudes (|X|)Reflects negative mass to positive
GumbelExtreme values or differentiable discrete samplingGumbel-Softmax trick
KumaraswamyBeta-like prior with reparameterizationClosed-form CDF/inverse CDF

Conjugacy Cheat Sheet

When your problem has a conjugate pair, use it — the posterior updates are free.

LikelihoodConjugate PriorPosterior
Bernoulli / BinomialBeta(α, β)Beta(α+k, β+n−k)
Gaussian (known σ)Gaussian(μ0, σ02)Gaussian (precision-weighted mean)
Gaussian (unknown μ, σ)Normal-GammaNormal-Gamma (updated params)
PoissonGamma(α, β)Gamma(α+∑x, β+n)
MultinomialDirichlet(α)Dirichlet(α+counts)
ExponentialGamma(α, β)Gamma(α+n, β+∑x)
Rule of thumb: If your problem fits a conjugate pair, start there. If you need sparsity, use the horseshoe (scalable) or spike-and-slab (precise). If you need bounded support, look at Kumaraswamy or truncated Gaussian. If nothing fits, particles or variational inference with a flexible family (Logistic-Normal, normalizing flows) are your fallback.
Distribution Gallery

All ten distributions on one canvas. Click a name to highlight it.

Every distribution in this lesson was born from a practical need. The Gaussian alone cannot model sparsity, boundedness, compositions, extremes, or arbitrary posteriors. Bayesian inference is as much about choosing the right distributional family as it is about applying Bayes' theorem.

What's next? See Bayesian Estimation for how these priors combine with data to form optimal estimates. See Bayes Filters for recursive inference over time. And see Bayesian Networks for graphical models that chain these distributions together.
You're building a VAE with latent variables on [0, 1] and need reparameterizable sampling. Which distribution should you use?