Bayesian Distributions — From Priors to Posteriors

Chapter 0: Why Bayesian Distributions?

You want to estimate an unknown parameter — maybe the mean of a sensor, the sparsity pattern of a neural network, or the topic mixture of a document. You have data, but data alone is never enough. You need a prior: a distribution that encodes what you believed before seeing any data.

Pick the wrong prior and inference becomes intractable — you can't compute the posterior in closed form. Pick the right prior and mathematics hands you the answer on a silver platter. That's the magic of conjugacy: when the prior and likelihood belong to the same family, the posterior does too.

Conjugacy in one sentence: A prior is conjugate to a likelihood if the posterior has the same functional form as the prior. You just update the parameters — no integrals, no approximations, no MCMC.

But not every problem fits a conjugate pair. Modern Bayesian inference uses distributions designed for specific structural assumptions: sparsity, boundedness, compositions, extremes. This lesson covers ten such distributions, each solving a problem that standard Gaussians and Betas cannot.

Prior p(θ)

Your belief before data

× likelihood p(data|θ) ↓

Posterior p(θ|data)

Updated belief after data

conjugate? same family ↻

Conjugacy in Action: Beta-Binomial

Flip a coin with unknown bias θ. The orange curve is your prior Beta(α,β). Click "Flip" to add data. The teal posterior is always another Beta — that's conjugacy.

Prior α2.0

Prior β2.0

Flips: 0 | Heads: 0

Why do Bayesian practitioners care about conjugate priors?

They always give the most accurate posterior The posterior stays in the same family, so updates are closed-form They eliminate the need for data entirely

Chapter 1: Gaussian Mixture Model

A single Gaussian can only model one bump. Real data often has multiple clusters — heights of men and women, income groups, cell types in a tissue sample. A Gaussian Mixture Model (GMM) is a weighted sum of K Gaussians, each representing one cluster.

p(x) = ∑_k=1^K π_k · N(x | μ_k, σ_k²)

Each component k has a weight π_k (how common that cluster is), a mean μ_k (cluster center), and a variance σ_k² (cluster spread). The weights must sum to 1: ∑π_k = 1.

Key insight: A GMM is a universal density approximator. With enough components, it can approximate any continuous density to arbitrary accuracy. It's the Swiss Army knife of density estimation.

Property	Value
Support	(−∞, +∞)
Parameters	π_k, μ_k, σ_k for each component
Mean	∑ π_k μ_k
Fitting	EM algorithm (Expectation-Maximization)
Used in	Clustering, density estimation, speech recognition, GANs

Interactive GMM: Drag Means & Weights

Adjust the means, standard deviations, and weights of three Gaussian components. The white curve is the mixture density.

μ₁-2.0

μ₂0.0

μ₃2.5

π₁ weight0.30

π₂ weight0.40

python
import numpy as np
from scipy.stats import norm

def gmm_pdf(x, mus, sigmas, weights):
    """Evaluate GMM density at points x."""
    return sum(w * norm.pdf(x, m, s)
               for w, m, s in zip(weights, mus, sigmas))

In a 3-component GMM with weights [0.5, 0.3, 0.2] and means [-1, 0, 3], what is the overall mean?

0.5(-1) + 0.3(0) + 0.2(3) = 0.1 (-1 + 0 + 3) / 3 = 0.67 0

Chapter 2: Particle Distribution

What if your posterior is so complex that no named distribution can describe it? Maybe it's multimodal, skewed, or lives on a weird manifold. The particle distribution (empirical distribution) sidesteps the problem entirely: represent the distribution as a collection of N weighted samples (particles).

p(x) ≈ ∑_i=1^N w_i · δ(x − x_i)

Each particle x_i is a specific value, and w_i is its weight (with ∑w_i = 1). The δ is a Dirac delta — a spike at that point. More particles in a region means higher density there. This is the foundation of particle filters and Sequential Monte Carlo (SMC).

The power of particles: They can represent any distribution — multimodal, heavy-tailed, discontinuous, you name it. The cost? You need enough particles. Too few and you get a poor approximation. This is the curse of dimensionality for particle methods.

The critical operation is resampling: when particle weights become uneven (a few particles hog all the weight), you duplicate high-weight particles and discard low-weight ones. This keeps the approximation healthy.

Property	Value
Support	Anywhere the particles are
Parameters	{x_i, w_i} for i = 1..N
Mean	∑ w_i x_i
Key operation	Resampling (multinomial, systematic, stratified)
Used in	Particle filters, SMC, robotics SLAM

Particle Filter Resampling

Particles (vertical bars) approximate a target distribution (teal curve). Bar height = weight. Click Resample to see weighted resampling in action. Click Scatter to spread particles randomly again.

N = 40 particles | Effective N = 40

After resampling, what happens to particles with very low weights?

Their weights increase to match the others They are likely eliminated and replaced by copies of high-weight particles They stay unchanged

Chapter 3: Spike-and-Slab Prior

You're fitting a regression model with 10,000 features, but you suspect only 50 actually matter. You want the model to discover which features are relevant and set the rest to exactly zero. This is variable selection, and the spike-and-slab prior is the Bayesian gold standard for it.

p(θ_j) = (1 − π) · δ(0) + π · N(0, σ_slab²)

The spike is a point mass at zero — it says "this coefficient is exactly zero, the feature is irrelevant." The slab is a broad Gaussian — it says "this coefficient is nonzero, the feature matters." The mixing weight π controls the prior probability that any given feature is relevant.

Why not just use L1 regularization? L1 (Lasso) encourages sparsity but doesn't truly set coefficients to zero during inference — it's a continuous approximation. Spike-and-slab gives you a posterior probability that each feature is active, which is far more informative for scientific discovery.

Property	Value
Support	{0} ∪ (−∞, +∞)
Parameters	π (inclusion prob), σ_slab (slab width)
Sparsity	Exact zeros with probability 1 − π
Computation	Requires MCMC or variational approximation
Used in	Genomics, feature selection, sparse Bayesian learning

Spike-and-Slab: Variable Selection

The orange spike at zero represents irrelevant features. The teal slab covers nonzero coefficients. Adjust π to control sparsity.

Inclusion π0.20

Slab σ2.0

If π = 0.05 in a spike-and-slab prior over 1000 features, how many features do you expect to be nonzero a priori?

50 950 5

Chapter 4: The Horseshoe Prior

Spike-and-slab is theoretically beautiful but computationally expensive — you need to explore 2^p possible inclusion patterns. The horseshoe prior achieves similar sparsity with a continuous distribution, making it far more practical for high-dimensional problems.

θ_j | λ_j, τ ~ N(0, λ_j²τ²)
λ_j ~ Half-Cauchy(0, 1)

The trick is the global-local structure. The global scale τ controls overall shrinkage (small τ = most coefficients near zero). Each coefficient also gets a local scale λ_j drawn from a half-Cauchy. The heavy tails of the Cauchy let truly important coefficients escape the shrinkage.

The horseshoe shape: Plot the marginal prior on θ_j and you get a density with a tall spike near zero (like the nail of a horseshoe) and heavy tails (the curved shoe). It aggressively shrinks noise to zero while leaving large signals untouched.

Property	Value
Support	(−∞, +∞)
Parameters	τ (global shrinkage), λ_j (local scales)
Tails	Polynomial (heavier than Gaussian, like Cauchy)
Advantage	Continuous — no combinatorial explosion
Used in	High-dimensional regression, Bayesian neural nets, genomics

Horseshoe vs Gaussian Prior

The horseshoe (orange) concentrates mass near zero but has heavy tails. Compare with a Gaussian of the same scale. Adjust τ to see global shrinkage.

Global τ0.50

What gives the horseshoe prior its sparsity-inducing property?

It uses L1 regularization internally It sets coefficients to exactly zero Heavy-tailed local scales let large signals escape global shrinkage toward zero

Chapter 5: The Normal-Gamma Prior

Here's a common problem: you're observing data from a Gaussian, but you don't know the mean or the variance. You need a joint prior over both. The Normal-Gamma distribution is the conjugate prior for exactly this situation.

p(μ, λ) = N(μ | μ₀, 1/(κ₀λ)) · Gamma(λ | α₀, β₀)

Here λ = 1/σ² is the precision (inverse variance). The prior says: first draw a precision λ from a Gamma, then draw the mean μ from a Gaussian whose variance depends on λ. This coupling is what makes the math work out.

Why precision, not variance? The Gamma distribution has support on (0, ∞), which is exactly right for precision. And with this parameterization, the posterior after observing n data points {x_i} is another Normal-Gamma with updated parameters — pure conjugacy.

After observing data, the updates are elegant: κ_n = κ₀ + n, μ_n is the precision-weighted average of prior and sample means, α_n = α₀ + n/2, and β_n incorporates the sample variance and the prior-data disagreement.

Property	Value
Support	μ ∈ (−∞, +∞), λ ∈ (0, ∞)
Parameters	μ₀, κ₀, α₀, β₀
Conjugate to	Gaussian likelihood with unknown mean and variance
Marginal on μ	Student-t distribution
Used in	Bayesian regression, online learning, Kalman filters

Normal-Gamma: Joint Prior on Mean & Precision

The heatmap shows the joint density p(μ, λ). Click Add Data to draw random observations and watch the posterior sharpen around the true parameters.

Observations: 0 | μ̂ = 0.0 | λ̂ = 1.0

Why is the Normal-Gamma parameterized with precision λ rather than variance σ²?

Because precision is always larger than variance Because the Gamma distribution's support (0, ∞) naturally matches precision, enabling conjugacy Because precision is easier to estimate from data

Chapter 6: The Logistic-Normal

You need a distribution over probability vectors — like topic proportions in a document (30% politics, 50% sports, 20% tech). The Dirichlet distribution is the standard choice, but it has a limitation: its components are negatively correlated by construction. If topic A increases, the others must decrease. What if topics tend to co-occur?

The Logistic-Normal distribution fixes this. Draw a vector from a multivariate Gaussian, then push it through the softmax function. The Gaussian's covariance matrix lets you model arbitrary correlations between components.

z ~ N(μ, Σ), θ_k = exp(z_k) / ∑_j exp(z_j)

Logistic-Normal vs Dirichlet: The Dirichlet is conjugate to the multinomial (easy updates) but can only model negative correlations. The Logistic-Normal can model any correlation structure via Σ, but loses conjugacy. It's the prior behind Correlated Topic Models and many VAE decoders.

Property	Value
Support	Probability simplex (components sum to 1)
Parameters	μ (location in log-space), Σ (covariance)
Correlations	Arbitrary (unlike Dirichlet)
Conjugate?	No — requires variational inference
Used in	Topic models, compositional data, VAE priors

Logistic-Normal on the 3-Simplex

Samples from a Logistic-Normal shown as dots on the probability simplex (triangle). Adjust the mean to shift the cloud, and the correlation to skew it.

μ₁0.0

μ₂0.0

Correlation ρ0.0

What advantage does the Logistic-Normal have over the Dirichlet for modeling topic proportions?

It's conjugate to the multinomial likelihood It assigns higher probability to sparse vectors It can model positive correlations between topics via Σ

Chapter 7: The Truncated Gaussian

A standard Gaussian assigns nonzero probability to the entire real line. But what if your parameter can't be negative? A length, a concentration, a time duration — these are strictly positive. Naively clamping a Gaussian to [0, ∞) changes its mean and variance in hard-to-track ways.

The truncated Gaussian solves this cleanly: take a normal distribution N(μ, σ²) and restrict it to an interval [a, b], then renormalize so the density integrates to 1.

p(x | μ, σ, a, b) = φ((x−μ)/σ) / (σ · [Φ((b−μ)/σ) − Φ((a−μ)/σ)])

Here φ is the standard normal PDF and Φ is the CDF. The denominator is just the probability that an untruncated sample would fall in [a, b]. The shape stays Gaussian within the interval, but the effective mean shifts toward the interval center.

Where it appears: Truncated Gaussians show up in probit regression (latent variables are truncated based on the observed binary outcome), physics simulations with hard constraints, and Gibbs sampling steps where conditional distributions are truncated by other variables.

Property	Value
Support	[a, b] (user-specified interval)
Parameters	μ, σ (parent Gaussian), a, b (bounds)
Mean	Shifted toward interval center from μ
Variance	≤ σ² (truncation reduces spread)
Used in	Constrained optimization, probit models, Gibbs sampling

Truncated vs Full Gaussian

The blue dashed curve is the full Gaussian. The orange region is the truncated density. Notice how truncation shifts the effective mean.

μ0.5

Lower bound a0.0

Upper bound b3.0

If you truncate N(0, 1) to [0, ∞), what happens to the mean?

It stays at 0 It shifts to a positive value (roughly 0.798) It becomes undefined

Chapter 8: The Folded Normal

You're measuring the magnitude of some quantity — the absolute deviation from a target, the strength of a signal, the distance a robot drifted from its path. You know the underlying error is Gaussian, but you only observe the absolute value. The result is a folded normal distribution.

Y = |X| where X ~ N(μ, σ²)

The PDF is the sum of the original Gaussian and its mirror image reflected at zero:

p(y) = φ((y−μ)/σ)/σ + φ((y+μ)/σ)/σ for y ≥ 0

When μ = 0, the folded normal simplifies to the half-normal distribution — a popular default prior for scale parameters in Bayesian hierarchical models (recommended by Andrew Gelman as a weakly informative prior for standard deviations).

Folded vs Truncated: Don't confuse them! A truncated Gaussian discards the negative part and renormalizes. A folded Gaussian folds the negative part onto the positive side. The folded normal has more mass near zero (it gets contributions from both tails).

Property	Value
Support	[0, ∞)
Parameters	μ, σ (of the underlying Gaussian)
Special case	μ = 0 gives the half-normal
Mean	σ√(2/π) exp(−μ²/2σ²) + μ(1 − 2Φ(−μ/σ))
Used in	Signal magnitudes, hierarchical model priors, noise modeling

Folding a Gaussian

The blue dashed curve is the original Gaussian. Its negative portion "folds" to the positive side, creating the orange folded normal.

μ1.0

σ1.0

What is the key difference between a folded normal and a truncated normal?

Folding reflects the negative part onto the positive side; truncation discards it They are the same distribution with different names Truncation gives heavier tails

Chapter 9: The Gumbel Distribution

Imagine measuring the maximum daily temperature every year and asking: "What's the distribution of the annual maximum?" This is an extreme value problem, and the Gumbel distribution is one of the three possible limiting distributions for maxima of independent samples.

p(x) = (1/β) exp(−z − exp(−z)) where z = (x − μ) / β

The Gumbel has a distinctive asymmetric shape: a steep rise on the left and a long tail on the right (for the maximum case). But its most famous modern use has nothing to do with weather.

The Gumbel-Softmax trick: In reinforcement learning and discrete VAEs, you need to sample from a categorical distribution but backpropagate through the sampling step. The Gumbel-Max trick says: add Gumbel noise to log-probabilities, then take the argmax. The Gumbel-Softmax relaxation replaces argmax with softmax, giving a differentiable approximation. This single trick unlocked gradient-based training of discrete latent variable models.

The math: if g_k ~ Gumbel(0, 1) independently, then argmax_k(log π_k + g_k) gives a sample from Categorical(π). Replace argmax with softmax at temperature τ for a differentiable relaxation.

Property	Value
Support	(−∞, +∞)
Parameters	μ (location), β (scale)
Mean	μ + βγ where γ ≈ 0.5772 (Euler-Mascheroni)
Variance	π²β²/6
Used in	Gumbel-Softmax, extreme value theory, RL exploration

Gumbel Distribution & Softmax Trick

Left: the Gumbel PDF (note the asymmetry). Right: Gumbel-Softmax samples — lower temperature τ approaches a hard one-hot vector.

Location μ0.0

Scale β1.0

Softmax τ1.0

Why is the Gumbel-Softmax trick important for training models with discrete latent variables?

It eliminates the need for any sampling It converts continuous variables to discrete ones It provides a differentiable approximation to categorical sampling, enabling backpropagation

Chapter 10: The Kumaraswamy Distribution

The Beta distribution is the conjugate prior for Bernoulli and binomial likelihoods, and it's ubiquitous in Bayesian statistics. But it has a dirty secret: its CDF has no closed form. You need the incomplete Beta function, which requires numerical approximation. For variational inference, where you need to compute KL divergences and sample using the reparameterization trick, this is a real problem.

Enter the Kumaraswamy distribution: a two-parameter distribution on [0, 1] that looks like a Beta but has closed-form PDF, CDF, and inverse CDF (quantile function).

p(x | a, b) = a · b · x^a−1 (1 − x^a)^b−1

F(x) = 1 − (1 − x^a)^b

F⁻¹(u) = (1 − (1 − u)^1/b)^1/a

Why this matters for VAEs: In a variational autoencoder, the approximate posterior q(z|x) is usually Gaussian. But if your latent variable lives on [0, 1], Gaussian won't do. The Kumaraswamy lets you reparameterize: sample u ~ Uniform(0,1), apply F⁻¹(u), and you have a Kumaraswamy sample with gradients flowing through a and b. No special functions needed.

Property	Value
Support	[0, 1]
Parameters	a > 0, b > 0
CDF	Closed-form (unlike Beta)
Reparameterizable	Yes, via inverse CDF
Used in	VAEs with bounded latents, stick-breaking constructions, flow models

Kumaraswamy vs Beta

The Kumaraswamy (orange) and Beta (teal) with matched parameters. They're similar but not identical — the Kumaraswamy's closed-form CDF is its computational advantage.

a2.0

b5.0

python
import numpy as np

def kumaraswamy_sample(a, b, n=1000):
    """Sample via inverse CDF — no special functions!"""
    u = np.random.uniform(0, 1, n)
    return (1 - (1 - u)**(1/b))**(1/a)

What is the main computational advantage of the Kumaraswamy over the Beta distribution?

It has more flexible shapes Its CDF and inverse CDF are closed-form, enabling easy reparameterization It's conjugate to more likelihoods

Chapter 11: When to Use What

You've now met ten distributions that power modern Bayesian inference. Each solves a specific structural problem that standard distributions can't handle. Here's your decision guide.

Distribution	Use When	Key Property
GMM	Data has multiple clusters	Universal density approximator
Particle	Posterior is too complex for any named family	Nonparametric, any shape
Spike-and-Slab	You want exact variable selection	True zeros with posterior probabilities
Horseshoe	High-dimensional sparsity, scalable	Continuous shrinkage with heavy tails
Normal-Gamma	Unknown Gaussian mean AND variance	Conjugate, closed-form updates
Logistic-Normal	Correlated proportions / topics	Flexible correlation via Σ
Truncated Gaussian	Parameter is bounded to an interval	Gaussian shape with hard constraints
Folded Normal	You observe magnitudes (\|X\|)	Reflects negative mass to positive
Gumbel	Extreme values or differentiable discrete sampling	Gumbel-Softmax trick
Kumaraswamy	Beta-like prior with reparameterization	Closed-form CDF/inverse CDF

Conjugacy Cheat Sheet

When your problem has a conjugate pair, use it — the posterior updates are free.

Likelihood	Conjugate Prior	Posterior
Bernoulli / Binomial	Beta(α, β)	Beta(α+k, β+n−k)
Gaussian (known σ)	Gaussian(μ₀, σ₀²)	Gaussian (precision-weighted mean)
Gaussian (unknown μ, σ)	Normal-Gamma	Normal-Gamma (updated params)
Poisson	Gamma(α, β)	Gamma(α+∑x, β+n)
Multinomial	Dirichlet(α)	Dirichlet(α+counts)
Exponential	Gamma(α, β)	Gamma(α+n, β+∑x)

Rule of thumb: If your problem fits a conjugate pair, start there. If you need sparsity, use the horseshoe (scalable) or spike-and-slab (precise). If you need bounded support, look at Kumaraswamy or truncated Gaussian. If nothing fits, particles or variational inference with a flexible family (Logistic-Normal, normalizing flows) are your fallback.

Distribution Gallery

All ten distributions on one canvas. Click a name to highlight it.

Every distribution in this lesson was born from a practical need. The Gaussian alone cannot model sparsity, boundedness, compositions, extremes, or arbitrary posteriors. Bayesian inference is as much about choosing the right distributional family as it is about applying Bayes' theorem.

What's next? See Bayesian Estimation for how these priors combine with data to form optimal estimates. See Bayes Filters for recursive inference over time. And see Bayesian Networks for graphical models that chain these distributions together.

You're building a VAE with latent variables on [0, 1] and need reparameterizable sampling. Which distribution should you use?

Beta Dirichlet Kumaraswamy