The 14 probability distributions that power machine learning, estimation theory, and robotics — derived from intuition, not memorization.
You build a robot that measures distance to a wall with a laser sensor. You take 1,000 readings of the same wall. They should all be the same number, right? They're not. They scatter around the true value in a pattern — some errors are tiny and frequent, others are large and rare. That pattern is a probability distribution.
A probability distribution is a mathematical function that tells you how likely each possible value is. It's the DNA of randomness — once you know which distribution your data follows, you can predict, filter, estimate, and make decisions optimally.
Every continuous distribution has two faces. The probability density function (PDF) tells you the relative likelihood of each value — where the curve is tall, values are common. The cumulative distribution function (CDF) tells you the probability of getting a value less than or equal to x — it always rises from 0 to 1.
Drag the vertical marker to see how the shaded area under the PDF (left) corresponds to the CDF value (right). The PDF shows density; the CDF shows accumulated probability.
Every branch of AI relies on distributions. Bayesian inference uses them as priors and posteriors. Neural networks initialize weights from them. Kalman filters assume Gaussian noise. Reinforcement learning samples actions from them. Generative models learn to transform simple distributions into complex ones.
This lesson covers 14 continuous distributions. Each one models a different kind of uncertainty. By the end, you'll know which distribution to reach for in any situation — and why.
You're measuring the height of 10,000 adults. Most cluster around the average. A few are very tall or very short, but extreme heights are rare. Plot a histogram and you get the most famous shape in all of statistics: the bell curve.
The Gaussian distribution (also called Normal distribution) is parameterized by two numbers: the mean μ (the center of the bell) and the variance σ² (how spread out it is). It arises naturally whenever many small, independent effects add together — this is the Central Limit Theorem.
The Gaussian maximizes entropy (uncertainty) among all distributions with a given mean and variance. In other words, if all you know is the average and the spread, the Gaussian is the most "honest" distribution — it assumes nothing else. The exponential of a negative quadratic gives the bell shape: values near μ get high density, and density drops off as a squared exponential.
Drag μ and σ to reshape the bell curve. The orange line marks the mean. The teal bars show ±1σ (68% of data) and ±2σ (95%).
| Property | Value |
|---|---|
| Support | (−∞, +∞) |
| Mean | μ |
| Variance | σ² |
| Mode | μ |
| MGF | exp(μt + σ²t²/2) |
Sensor noise modeling: Kalman filters assume process and measurement noise are Gaussian. Weight initialization: Neural network weights are often initialized from N(0, 1/n). Variational autoencoders: The latent space is regularized to be Gaussian. Gaussian processes: Entire function priors are built from Gaussians.
A lidar sensor has noise N(0, 0.1²). What's the probability the error exceeds 0.3 meters?
We need P(|X| > 0.3) = P(|Z| > 3) where Z = X/0.1 is standard normal. From the 68-95-99.7 rule, P(|Z| > 3) ≈ 0.27%. So only about 1 in 370 readings will be off by more than 30 cm.
python from scipy import stats # Define the distribution dist = stats.norm(loc=0, scale=0.1) # P(|X| > 0.3) p = 2 * dist.sf(0.3) # sf = survival function = 1 - CDF print(f"P(|error| > 0.3) = {p:.4f}") # 0.0027 # Sample 1000 measurements samples = dist.rvs(size=1000) print(f"Empirical fraction > 0.3: {(abs(samples) > 0.3).mean():.4f}")
You're testing whether a new robot controller improves lap times, but you only have 8 test runs. You compute a mean and standard deviation, but with so few samples your estimate of σ is shaky. If you pretend you know σ exactly and use a Gaussian, you'll be overconfident. The Student-t distribution accounts for this extra uncertainty by having heavier tails.
The key parameter is degrees of freedom ν (nu). With ν = 1 you get the extremely heavy-tailed Cauchy. As ν grows, the t-distribution converges to the Gaussian. Think of ν as "how much data backs up our estimate of σ." More data → thinner tails → closer to Gaussian.
Take a standard Gaussian Z and divide it by the square root of an independent Chi-Squared variable divided by its degrees of freedom: T = Z / √(V/ν). Because V fluctuates, T has more spread than Z alone — hence heavier tails. The formula is:
Increase ν and watch the heavy tails shrink toward a Gaussian. The gray dashed curve is the standard Gaussian for comparison.
| Property | Value |
|---|---|
| Support | (−∞, +∞) |
| Mean | 0 (for ν > 1), undefined for ν ≤ 1 |
| Variance | ν/(ν − 2) for ν > 2, ∞ for 1 < ν ≤ 2 |
| Mode | 0 |
Small-sample inference: When n < 30, use t-tests instead of z-tests. Robust regression: Replacing Gaussian likelihoods with Student-t makes regression resistant to outliers because the heavier tails assign more probability to extreme residuals. Bayesian deep learning: Student-t priors on weights encourage sparsity while tolerating outlier weights.
You measure 5 robot lap times: [12.1, 11.8, 12.4, 12.0, 11.9] seconds. The sample mean is 12.04, sample std is 0.228. The 95% confidence interval uses t4, 0.025 = 2.776:
CI = 12.04 ± 2.776 × 0.228/√5 = 12.04 ± 0.283 = [11.76, 12.32]
python import numpy as np from scipy import stats times = np.array([12.1, 11.8, 12.4, 12.0, 11.9]) n = len(times) mean, sem = times.mean(), stats.sem(times) ci = stats.t.interval(0.95, df=n-1, loc=mean, scale=sem) print(f"95% CI: [{ci[0]:.2f}, {ci[1]:.2f}]") # Compare with Gaussian (overconfident with n=5) ci_z = stats.norm.interval(0.95, loc=mean, scale=sem) print(f"Gaussian CI: [{ci_z[0]:.2f}, {ci_z[1]:.2f}]") # narrower!
Imagine compressing an image. Most pixel differences between neighboring pixels are zero or tiny, but occasionally there's an edge with a big jump. If you histogram these differences, you get a sharp peak at zero with exponential tails — much sharper than a Gaussian. This is the Laplace distribution, sometimes called the double exponential.
The Laplace has two parameters: a location μ (the peak) and a scale b (how quickly the tails decay). Its sharp peak at μ makes it the natural prior when you believe most values are near zero — which is exactly the assumption behind L1 regularization (Lasso).
The Gaussian uses exp(−x²), which penalizes deviations quadratically. The Laplace uses exp(−|x|), penalizing deviations linearly. Linear penalty means the density decreases more slowly in the tails (heavier tails than Gaussian) but concentrates more mass at exactly μ (sharper peak). It's the maximum entropy distribution for a given mean absolute deviation.
Compare the sharp Laplace peak with the rounded Gaussian. Adjust scale b to control the spread. The gray dashed Gaussian has the same variance for comparison.
| Property | Value |
|---|---|
| Support | (−∞, +∞) |
| Mean | μ |
| Variance | 2b² |
| Mode | μ |
| MGF | exp(μt) / (1 − b²t²) for |t| < 1/b |
L1 / Lasso regularization: Adding λ|w| to a loss is equivalent to a Laplace prior on weights. This drives many weights to exactly zero, producing sparse models. Sparse coding: Natural image patches are well-modeled by sparse coefficients with Laplace priors. Robust statistics: The median is the maximum likelihood estimate under a Laplace model, making it naturally outlier-resistant.
A sparse coding model assumes coefficients follow Laplace(0, 0.5). What fraction of coefficients have |c| < 0.1?
P(|X| < 0.1) = 1 − exp(−0.1/0.5) = 1 − exp(−0.2) ≈ 1 − 0.819 = 0.181. So about 18% are near-zero. The CDF at x ≥ 0 is 1 − 0.5 exp(−x/b).
python from scipy import stats dist = stats.laplace(loc=0, scale=0.5) p_near_zero = dist.cdf(0.1) - dist.cdf(-0.1) print(f"P(|c| < 0.1) = {p_near_zero:.3f}") # ~0.181 # Sample and check sparsity samples = dist.rvs(size=10000) print(f"Fraction |c| < 0.01: {(abs(samples) < 0.01).mean():.3f}")
A coin lands heads 7 times out of 10. What's the probability it's a fair coin? What's the probability the true heads-rate is between 0.6 and 0.8? You need a distribution over probabilities — a distribution on [0, 1]. That's the Beta distribution.
The Beta is parameterized by two shape parameters α and β. Think of α as "pseudo-counts of successes" and β as "pseudo-counts of failures." With α = β = 1, you get a flat (uniform) distribution — total ignorance. After observing 7 heads and 3 tails, the posterior becomes Beta(8, 4), peaked around 0.67.
The Beta is the conjugate prior to the Bernoulli/Binomial likelihood. "Conjugate" means the posterior has the same functional form as the prior: Beta prior + Binomial data = Beta posterior. The shape comes from xα−1(1−x)β−1, which creates flexible asymmetric bumps on [0, 1].
Small α, β = wide uncertainty. Large values = concentrated. α > β shifts right (more successes). Try α = β = 1 (uniform) and α = β = 0.5 (Jeffrey's prior).
| Property | Value |
|---|---|
| Support | [0, 1] |
| Mean | α / (α + β) |
| Variance | αβ / [(α+β)²(α+β+1)] |
| Mode | (α−1) / (α+β−2) for α, β > 1 |
Bayesian A/B testing: Model click-through rates as Beta, update with observations. Thompson sampling: Each arm's reward probability is Beta-distributed; sample to decide which arm to pull. Topic models (LDA): Document-topic and topic-word distributions use Dirichlet priors (the multivariate generalization of Beta).
A robot grasps successfully 14 out of 20 times. With a uniform prior Beta(1, 1), the posterior is Beta(15, 7). What's P(success rate > 0.8)?
python from scipy import stats posterior = stats.beta(15, 7) p_above_80 = 1 - posterior.cdf(0.8) print(f"P(rate > 0.8) = {p_above_80:.3f}") # ~0.099 print(f"Posterior mean = {posterior.mean():.3f}") # 0.682 print(f"95% CI: {posterior.ppf([0.025, 0.975])}")
You're monitoring a server, and requests arrive randomly. The time until the first request is Exponential. But what about the time until the k-th request? That's the Gamma distribution — it's the sum of k independent Exponential waiting times.
The Gamma has a shape parameter k (or α) and a rate parameter θ (or β, depending on convention). When k = 1, it reduces to the Exponential. As k grows, the distribution becomes more symmetric and Gaussian-like (another instance of the CLT).
The factor xk−1 comes from the probability that exactly k−1 events occurred before time x (a polynomial growth), multiplied by e−x/θ from the exponential decay of waiting. The combination creates a right-skewed hump that shifts rightward as k increases.
k = 1 is the Exponential. Increase k to see the peak emerge and shift right. The orange marker shows the mean, the teal marker shows the mode.
| Property | Value |
|---|---|
| Support | (0, +∞) |
| Mean | kθ |
| Variance | kθ² |
| Mode | (k − 1)θ for k ≥ 1 |
| MGF | (1 − θt)−k for t < 1/θ |
Bayesian precision priors: The Gamma is the conjugate prior for the precision (inverse variance) of a Gaussian. Queuing theory: Time until k-th arrival in a Poisson process. Rainfall modeling: Precipitation amounts are often Gamma-distributed. Neural network priors: Gamma priors on noise precision in Bayesian neural networks.
Server requests arrive at 5 per minute (Poisson). Time until 3rd request is Gamma(k=3, θ=1/5 min). What's P(wait > 1 min)?
python from scipy import stats # Gamma with shape k=3, scale theta=1/5 minutes = 0.2 dist = stats.gamma(a=3, scale=0.2) p_wait = 1 - dist.cdf(1.0) print(f"P(wait > 1 min) = {p_wait:.4f}") # ~0.0803 print(f"Expected wait = {dist.mean():.2f} min") # 0.6 min
In Bayesian inference, we often need a prior on the variance σ² of a Gaussian. The variance must be positive, and we'd like a conjugate prior (so the posterior has the same form). The Inverse Gamma distribution is exactly that — if X ~ Gamma(α, β), then 1/X ~ InverseGamma(α, β).
The parameters α (shape) and β (scale) control the distribution. Larger α concentrates mass more tightly. Think of α as "how many prior observations worth of confidence" and β as "prior sum of squared deviations."
The Inverse Gamma is a change of variables from the Gamma. Applying x → 1/x to a Gamma-distributed variable transforms the polynomial growth into polynomial decay near zero and the exponential decay into exponential decay in 1/x. The result is a distribution that's zero at zero, rises to a peak, then has a heavy right tail.
Increase α to concentrate mass. Increase β to shift the distribution right. The orange marker shows the mean (exists only for α > 1).
| Property | Value |
|---|---|
| Support | (0, +∞) |
| Mean | β / (α − 1) for α > 1 |
| Variance | β² / [(α−1)²(α−2)] for α > 2 |
| Mode | β / (α + 1) |
Bayesian linear regression: The standard model places an Inverse Gamma prior on the noise variance σ². Hierarchical models: Group-level variance parameters often get Inverse Gamma priors. Gaussian mixture models: Each component's variance can have an Inverse Gamma prior.
You place an InverseGamma(3, 1) prior on σ². What's the prior mean and the probability that σ² > 1?
python from scipy import stats dist = stats.invgamma(a=3, scale=1) print(f"Prior mean of sigma^2 = {dist.mean():.3f}") # 0.5 print(f"P(sigma^2 > 1) = {1 - dist.cdf(1):.3f}") # ~0.080 print(f"Mode = {1 / (3 + 1):.3f}") # 0.25
A sensor on your robot fails at random. There's no "wearing out" — a sensor that's survived 1,000 hours is just as likely to fail in the next hour as a brand-new one. This memoryless property uniquely characterizes the Exponential distribution. It models the waiting time between events in a Poisson process.
The single parameter λ (rate) controls how frequently events happen. Higher λ means shorter average wait times. The mean wait is 1/λ.
If events happen at a constant rate λ (independently, no memory), then the probability of "no event for t time" decays as e−λt. The PDF is the rate of that decay: λ e−λt. The exponential shape is the only continuous distribution that is memoryless.
Increase λ for more frequent events (shorter waits). The orange marker shows the mean 1/λ. Notice the PDF always starts at λ on the y-axis.
| Property | Value |
|---|---|
| Support | [0, +∞) |
| Mean | 1/λ |
| Variance | 1/λ² |
| Mode | 0 |
| MGF | λ / (λ − t) for t < λ |
| Memoryless | P(X > s+t | X > s) = P(X > t) |
Sensor failure modeling: Time until a component fails (if failure rate is constant). Poisson processes: Time between arrivals of requests, photons, earthquakes. Survival analysis: Baseline hazard model. Queuing theory: Service times often modeled as Exponential.
A robot's motor fails on average every 500 hours (λ = 1/500). What's the probability it lasts at least 700 hours?
P(X > 700) = e−700/500 = e−1.4 ≈ 0.247. About a 25% chance.
python from scipy import stats dist = stats.expon(scale=500) # scale = 1/lambda = mean print(f"P(X > 700) = {dist.sf(700):.3f}") # 0.247 print(f"Median lifetime = {dist.median():.1f} hours") # 346.6
Real components DO wear out. A bearing that's been spinning for 10,000 hours is more likely to fail in the next hour than a new one. The Weibull distribution generalizes the Exponential by adding a shape parameter k that controls whether the failure rate increases, decreases, or stays constant over time.
When k = 1, it's exactly Exponential (constant failure rate). When k > 1, the failure rate increases over time (aging). When k < 1, the failure rate decreases (infant mortality — if it survives the early period, it's likely fine).
The Weibull modifies the Exponential by replacing λt with (λt)k. This means the hazard function (instantaneous failure rate) is h(t) = (k/λ)(t/λ)k−1, which is a power law in time. This single extra parameter captures a wide range of failure behaviors.
k < 1: decreasing failure rate. k = 1: constant (Exponential). k > 1: increasing failure rate. k ≈ 3.6 approximates a Gaussian shape.
| Property | Value |
|---|---|
| Support | [0, +∞) |
| Mean | λ Γ(1 + 1/k) |
| Variance | λ² [Γ(1+2/k) − Γ(1+1/k)²] |
| Mode | λ ((k−1)/k)1/k for k > 1, else 0 |
Reliability engineering: Modeling component lifetimes where failure rate isn't constant. Wind speed modeling: Wind speeds at a location often follow a Weibull. Survival analysis: The Weibull proportional hazards model. Robotics maintenance scheduling: Predicting when to replace components before failure.
A robot joint has Weibull-distributed lifetime with k=2, λ=1000 hours. What's P(survive 800 hours)?
python from scipy import stats import numpy as np dist = stats.weibull_min(c=2, scale=1000) # c=shape, scale=lambda print(f"P(X > 800) = {dist.sf(800):.3f}") # 0.527 print(f"Mean lifetime = {dist.mean():.1f} hours") # ~886 print(f"Replace at 95% reliability: {dist.ppf(0.05):.1f} hours")
A startup's valuation grows: 10% this year, −5% next, +20% the year after. Each year multiplies the previous value. After many years, the log of the valuation is a sum of independent effects — and by the CLT, sums converge to Gaussian. So the valuation itself follows a Log-Normal distribution: the exponential of a Gaussian.
If X ~ N(μ, σ²), then Y = eX is Log-Normal. It's always positive, right-skewed, and heavy-tailed. The parameters μ and σ are the mean and standard deviation of the log, not of the distribution itself.
Taking exp() of a Gaussian stretches the right tail exponentially while compressing the left into zero. Values near exp(μ) are most common, but there's a long right tail of large values. This creates the characteristic right-skewed shape seen in wealth, city sizes, and word frequencies.
Increase σ to make the right tail heavier. The orange marker is the mean, teal marker is the median. Notice the mean is always right of the median.
| Property | Value |
|---|---|
| Support | (0, +∞) |
| Mean | exp(μ + σ²/2) |
| Variance | [exp(σ²) − 1] exp(2μ + σ²) |
| Median | exp(μ) |
| Mode | exp(μ − σ²) |
Word frequencies: Word counts in natural language follow approximately log-normal distributions. Neural network weights after training: Weight magnitudes are often log-normally distributed. Financial modeling: Stock returns are multiplicative, so prices are log-normal (Black-Scholes model). Signal processing: Shadowing in wireless channels is log-normal.
Word frequencies in a corpus follow LogNormal(μ=2, σ=1.5). What fraction of words appear fewer than 10 times?
python from scipy import stats dist = stats.lognorm(s=1.5, scale=np.exp(2)) # s=sigma, scale=exp(mu) print(f"P(count < 10) = {dist.cdf(10):.3f}") print(f"Median count = {dist.median():.1f}") # exp(2) = 7.4 print(f"Mean count = {dist.mean():.1f}") # exp(2+1.5^2/2) = 24.5
You're spinning a wheel that can land anywhere between 0 and 360 degrees with equal probability. No angle is more likely than any other. This is the Uniform distribution — the simplest continuous distribution. It's the formal way of saying "I have absolutely no idea, and every value in this range is equally plausible."
The Uniform on [a, b] has constant density 1/(b − a) within the interval and zero outside. Two parameters: the lower bound a and the upper bound b.
The Uniform maximizes entropy among all distributions with bounded support. If all you know is that x lies in [a, b], the most honest distribution assigns equal density everywhere. A flat line is the only function that integrates to 1 over [a, b] without favoring any subregion.
Adjust the bounds. Notice the PDF height adjusts so the total area is always 1. Narrower range = taller rectangle.
| Property | Value |
|---|---|
| Support | [a, b] |
| Mean | (a + b) / 2 |
| Variance | (b − a)² / 12 |
| Mode | Any value in [a, b] |
| MGF | (etb − eta) / [t(b−a)] |
Random initialization: Xavier/He initialization of neural network weights uses uniform or Gaussian variants. Uninformative priors: When you have no information, start with Uniform. Random number generation: Every random number generator starts with Uniform[0,1] and transforms it. Monte Carlo sampling: Uniform random variables drive importance sampling and MCMC acceptance.
Xavier initialization sets weights from Uniform(−√(6/n), √(6/n)) where n = fan_in + fan_out. For a layer with 256 inputs and 128 outputs, n = 384. What's the weight range and variance?
python import numpy as np from scipy import stats n = 256 + 128 bound = np.sqrt(6 / n) print(f"Range: [{-bound:.4f}, {bound:.4f}]") # [-0.1250, 0.1250] dist = stats.uniform(loc=-bound, scale=2*bound) print(f"Variance = {dist.var():.6f}") # ~0.005208 = 2/n
Spin a pointer mounted at (0, 1) in the plane. It points at a random angle θ, and you record where it hits the x-axis. The resulting distribution of x-intercepts is the Cauchy distribution. It looks like a bell curve but with tails so heavy that the mean and variance don't exist.
That's not a typo. If you average N samples from a Cauchy, the average doesn't converge — it has the same distribution as a single sample. The CLT fails completely. This makes the Cauchy the ultimate stress test for any algorithm that assumes finite variance.
The Cauchy is the ratio of two independent standard Gaussians: Z1/Z2. Since Z2 can be very close to zero, the ratio can be astronomically large, creating the heavy tails. The PDF decays as 1/x² (not exponentially), so the tails contain infinite expected mass.
Compare with the gray Gaussian. The Cauchy tails never really die — extreme values are far more common than with a Gaussian. The orange marks the median (the mean doesn't exist!).
| Property | Value |
|---|---|
| Support | (−∞, +∞) |
| Mean | Undefined |
| Variance | Undefined |
| Median | x0 |
| Mode | x0 |
Robust estimation: The Cauchy likelihood makes estimators highly resistant to outliers because it assigns non-negligible probability to extreme values. Physics (Lorentzian): The spectral line shape of atomic resonances is Cauchy (called the Lorentz profile). Testing algorithms: If your algorithm breaks on Cauchy-distributed data, it's relying too heavily on the existence of moments.
You suspect outliers in your data. Fit both a Gaussian and Cauchy likelihood. If the data point x = 15 comes from N(0, 1), the density is essentially 0. From Cauchy(0, 1), the density is 1/(π(1 + 225)) ≈ 0.0014 — still small, but vastly larger than the Gaussian density.
python from scipy import stats x = 15 gauss_pdf = stats.norm.pdf(x, 0, 1) cauchy_pdf = stats.cauchy.pdf(x, 0, 1) print(f"Gaussian density at x=15: {gauss_pdf:.2e}") # ~5.5e-50 print(f"Cauchy density at x=15: {cauchy_pdf:.4f}") # ~0.0014 print(f"Cauchy/Gaussian ratio: {cauchy_pdf/gauss_pdf:.2e}")
You're tracking a target with a radar system. The signal bounces off the target and returns with both an in-phase (X) and quadrature (Y) component, each corrupted by independent Gaussian noise. The magnitude of this 2D noise — √(X² + Y²) — follows the Rayleigh distribution.
Whenever you take the distance (L2 norm) of a 2D Gaussian vector, you get a Rayleigh. This makes it fundamental in signal processing, communications, and any 2D measurement system.
Convert 2D Gaussian coordinates to polar: x = r cos(θ), y = r sin(θ). The Jacobian of this transformation introduces an extra factor of r. So the density of r is proportional to r · exp(−r²/2σ²). The r factor suppresses values near zero (small circles have small circumference) while the exponential kills large values.
Adjust σ (the spread of each Gaussian component). Notice the PDF starts at zero, rises to a peak at σ, then decays. The orange marker shows the mean.
| Property | Value |
|---|---|
| Support | [0, +∞) |
| Mean | σ √(π/2) |
| Variance | σ² (4 − π) / 2 |
| Mode | σ |
Radar/sonar signal processing: The envelope of a noisy sinusoid follows a Rayleigh distribution. Wireless channel fading: The Rayleigh fading model describes signal amplitude in non-line-of-sight channels. GPS error modeling: 2D positioning error (CEP) is Rayleigh-distributed. Wind speed: Wind speed distributions are often approximately Rayleigh.
A GPS receiver has independent N(0, 3²) meter errors in X and Y. What's P(2D error < 5m)?
python from scipy import stats dist = stats.rayleigh(scale=3) # sigma = 3 meters print(f"P(error < 5m) = {dist.cdf(5):.3f}") # ~0.672 print(f"Mean 2D error = {dist.mean():.2f}m") # ~3.76m print(f"95th percentile = {dist.ppf(0.95):.2f}m") # ~7.35m
You have n independent standard Gaussian random variables Z1, ..., Zn. Square each one and add them up: Q = Z1² + ... + Zn². The resulting Q follows a Chi-Squared distribution with n degrees of freedom, written χ²(n). It measures the "total squared deviation" from zero.
This distribution appears whenever you compute a sum of squared residuals, which is everywhere in statistics: goodness-of-fit tests, confidence intervals for variance, the denominator of the F-test, and covariance matrix estimation.
For n = 1, you're squaring a single Gaussian — most values cluster near zero (small squared values) with a heavy right tail. For larger n, the sum of many squared Gaussians becomes more symmetric by the CLT. The Chi-Squared is literally Gamma(n/2, 2), which explains its shape: the Gamma captures the sum of Exponential-like variables.
Increase the degrees of freedom n. At n=1 the density peaks at zero. At n=2 it's Exponential. For large n it looks Gaussian. The orange marker is the mean (= n).
| Property | Value |
|---|---|
| Support | (0, +∞) |
| Mean | n |
| Variance | 2n |
| Mode | max(n − 2, 0) |
| MGF | (1 − 2t)−n/2 for t < 1/2 |
Hypothesis testing: The χ² test checks if observed frequencies match expected frequencies. Variance estimation: The sample variance times (n−1)/σ² follows χ²(n−1). Covariance estimation: The Wishart distribution (multivariate χ²) is the conjugate prior for covariance matrices. Mahalanobis distance: The squared Mahalanobis distance follows χ²(k) for k-dimensional Gaussian data.
You fit a model to data with n=10 residuals. Under H0, the sum of squared standardized residuals follows χ²(10). You observe Q = 18.3. Is this significant at α = 0.05?
python from scipy import stats n = 10 Q = 18.3 p_value = 1 - stats.chi2.cdf(Q, df=n) critical = stats.chi2.ppf(0.95, df=n) print(f"p-value = {p_value:.3f}") # ~0.050 print(f"Critical value = {critical:.2f}") # 18.31 print(f"Reject H0? {Q > critical}") # borderline!
Your robot has a compass that estimates heading angle. The heading is 0° to 360°, and the important thing is that 359° and 1° are close together — angles wrap around. You can't use a Gaussian here because a Gaussian at μ = 350° with σ = 20° would put significant mass at 370°, which doesn't exist. You need a distribution that lives on a circle. That's the Von Mises distribution.
The Von Mises is the "Gaussian on a circle." It has a mean direction μ and a concentration κ (analogous to 1/σ²). When κ = 0, it's uniform on the circle. As κ → ∞, it concentrates at μ.
The maximum-entropy distribution on a circle with a specified mean direction and mean resultant length is proportional to exp(κ cos(θ − μ)). The cosine ensures the density wraps around smoothly: at θ = μ the exponent is maximized, and at the antipodal point it's minimized. The normalization constant involves the modified Bessel function I0(κ).
Left: the distribution on a circle. Right: unrolled as a function of θ. Increase κ to concentrate around the mean direction. The orange arrow shows the mean direction.
| Property | Value |
|---|---|
| Support | [−π, π) (circular) |
| Mean direction | μ |
| Circular variance | 1 − I1(κ)/I0(κ) |
| Mode | μ |
Robotics heading estimation: Fusing compass and gyroscope readings for orientation. Wind direction modeling: Meteorological wind direction is inherently circular. Protein structure prediction: Backbone dihedral angles (φ, ψ) are modeled with Von Mises distributions. Image orientation: Edge directions in computer vision.
A compass reads headings with Von Mises noise (κ = 5). What's P(heading within ±30° of true)?
python import numpy as np from scipy import stats kappa = 5 dist = stats.vonmises(kappa) angle_30 = np.radians(30) p = dist.cdf(angle_30) - dist.cdf(-angle_30) print(f"P(within ±30°) = {p:.3f}") # ~0.891 # Approximate circular std dev print(f"Circ. std dev ≈ {np.degrees(1/np.sqrt(kappa)):.1f}°") # ~25.5°
Now let's put it all together. The explorer below lets you select any of the 14 distributions, adjust its parameters, and see the PDF, CDF, and random samples update in real time. Use it to compare distributions side-by-side and build intuition for how parameters change shapes.
Select a distribution, adjust parameters, and watch the PDF (top), CDF (middle), and histogram of 1000 samples (bottom) update live.
Distributions don't exist in isolation. They form a rich family tree connected by special cases, limits, and transformations.
Lines show how distributions are related. Hover over a node to see the connection.
| Distribution | Support | Params | Key Use |
|---|---|---|---|
| Gaussian | (−∞, +∞) | μ, σ | Noise, priors, CLT |
| Student-t | (−∞, +∞) | ν | Small samples, robust regression |
| Laplace | (−∞, +∞) | μ, b | L1 regularization, sparsity |
| Beta | [0, 1] | α, β | Priors on probabilities |
| Gamma | (0, ∞) | k, θ | Precision priors, waiting times |
| Inverse Gamma | (0, ∞) | α, β | Variance priors |
| Exponential | [0, ∞) | λ | Memoryless waiting, Poisson |
| Weibull | [0, ∞) | k, λ | Reliability, failure analysis |
| Log-Normal | (0, ∞) | μ, σ | Multiplicative processes |
| Uniform | [a, b] | a, b | Initialization, uninformative priors |
| Cauchy | (−∞, +∞) | x0, γ | Robust estimation |
| Rayleigh | [0, ∞) | σ | 2D noise magnitude, radar |
| Chi-Squared | (0, ∞) | n | Hypothesis testing, covariance |
| Von Mises | [−π, π) | μ, κ | Circular data, heading estimation |