Probability & Statistics for AI

Continuous Distributions

The 14 probability distributions that power machine learning, estimation theory, and robotics — derived from intuition, not memorization.

Prerequisites: Basic calculus + What a probability is. That's it.
16
Chapters
15+
Simulations
0
Assumed Knowledge

Chapter 0: Why Distributions?

You build a robot that measures distance to a wall with a laser sensor. You take 1,000 readings of the same wall. They should all be the same number, right? They're not. They scatter around the true value in a pattern — some errors are tiny and frequent, others are large and rare. That pattern is a probability distribution.

A probability distribution is a mathematical function that tells you how likely each possible value is. It's the DNA of randomness — once you know which distribution your data follows, you can predict, filter, estimate, and make decisions optimally.

PDF vs CDF

Every continuous distribution has two faces. The probability density function (PDF) tells you the relative likelihood of each value — where the curve is tall, values are common. The cumulative distribution function (CDF) tells you the probability of getting a value less than or equal to x — it always rises from 0 to 1.

Critical distinction: For continuous distributions, the PDF at a single point is NOT a probability. Probabilities come from integrating the PDF over an interval. The PDF can exceed 1 (the Laplace at its peak is 0.5/b, which can be > 1 for small b). Only the total area under the curve must equal 1.
PDF vs CDF

Drag the vertical marker to see how the shaded area under the PDF (left) corresponds to the CDF value (right). The PDF shows density; the CDF shows accumulated probability.

Marker x0.00

Why ML needs distributions

Every branch of AI relies on distributions. Bayesian inference uses them as priors and posteriors. Neural networks initialize weights from them. Kalman filters assume Gaussian noise. Reinforcement learning samples actions from them. Generative models learn to transform simple distributions into complex ones.

This lesson covers 14 continuous distributions. Each one models a different kind of uncertainty. By the end, you'll know which distribution to reach for in any situation — and why.

The big picture: Distributions aren't abstract math. They're models of how the world generates data. A Gaussian says "errors are symmetric and mostly small." An Exponential says "events happen randomly and independently." Choosing the right distribution is choosing the right assumption about reality.
What does the PDF of a continuous distribution tell you?

Chapter 1: The Gaussian (Normal)

You're measuring the height of 10,000 adults. Most cluster around the average. A few are very tall or very short, but extreme heights are rare. Plot a histogram and you get the most famous shape in all of statistics: the bell curve.

The Gaussian distribution (also called Normal distribution) is parameterized by two numbers: the mean μ (the center of the bell) and the variance σ² (how spread out it is). It arises naturally whenever many small, independent effects add together — this is the Central Limit Theorem.

Why does it have this shape?

The Gaussian maximizes entropy (uncertainty) among all distributions with a given mean and variance. In other words, if all you know is the average and the spread, the Gaussian is the most "honest" distribution — it assumes nothing else. The exponential of a negative quadratic gives the bell shape: values near μ get high density, and density drops off as a squared exponential.

f(x) = 1√(2πσ²) · exp(− (x − μ)²2σ²)
Interactive Gaussian

Drag μ and σ to reshape the bell curve. The orange line marks the mean. The teal bars show ±1σ (68% of data) and ±2σ (95%).

Mean μ0.0
Std dev σ1.0
PropertyValue
Support(−∞, +∞)
Meanμ
Varianceσ²
Modeμ
MGFexp(μt + σ²t²/2)

Where it's used

Sensor noise modeling: Kalman filters assume process and measurement noise are Gaussian. Weight initialization: Neural network weights are often initialized from N(0, 1/n). Variational autoencoders: The latent space is regularized to be Gaussian. Gaussian processes: Entire function priors are built from Gaussians.

Relationships: The Gaussian is the limit of the Student-t as degrees of freedom → ∞. It's the limit of the Binomial for large n. Log-Normal is exp(Gaussian). Chi-Squared is the sum of squared Gaussians. Nearly every distribution in this lesson connects back to the Gaussian.

Worked example

A lidar sensor has noise N(0, 0.1²). What's the probability the error exceeds 0.3 meters?

We need P(|X| > 0.3) = P(|Z| > 3) where Z = X/0.1 is standard normal. From the 68-95-99.7 rule, P(|Z| > 3) ≈ 0.27%. So only about 1 in 370 readings will be off by more than 30 cm.

python
from scipy import stats

# Define the distribution
dist = stats.norm(loc=0, scale=0.1)

# P(|X| > 0.3)
p = 2 * dist.sf(0.3)  # sf = survival function = 1 - CDF
print(f"P(|error| > 0.3) = {p:.4f}")  # 0.0027

# Sample 1000 measurements
samples = dist.rvs(size=1000)
print(f"Empirical fraction > 0.3: {(abs(samples) > 0.3).mean():.4f}")
Why is the Gaussian the "default" distribution in so many algorithms?

Chapter 2: The Student-t

You're testing whether a new robot controller improves lap times, but you only have 8 test runs. You compute a mean and standard deviation, but with so few samples your estimate of σ is shaky. If you pretend you know σ exactly and use a Gaussian, you'll be overconfident. The Student-t distribution accounts for this extra uncertainty by having heavier tails.

The key parameter is degrees of freedom ν (nu). With ν = 1 you get the extremely heavy-tailed Cauchy. As ν grows, the t-distribution converges to the Gaussian. Think of ν as "how much data backs up our estimate of σ." More data → thinner tails → closer to Gaussian.

Where does the shape come from?

Take a standard Gaussian Z and divide it by the square root of an independent Chi-Squared variable divided by its degrees of freedom: T = Z / √(V/ν). Because V fluctuates, T has more spread than Z alone — hence heavier tails. The formula is:

f(x) = Γ((ν+1)/2)√(νπ) Γ(ν/2) · (1 + x²/ν)−(ν+1)/2
Interactive Student-t

Increase ν and watch the heavy tails shrink toward a Gaussian. The gray dashed curve is the standard Gaussian for comparison.

d.f. ν3
PropertyValue
Support(−∞, +∞)
Mean0 (for ν > 1), undefined for ν ≤ 1
Varianceν/(ν − 2) for ν > 2, ∞ for 1 < ν ≤ 2
Mode0

Where it's used

Small-sample inference: When n < 30, use t-tests instead of z-tests. Robust regression: Replacing Gaussian likelihoods with Student-t makes regression resistant to outliers because the heavier tails assign more probability to extreme residuals. Bayesian deep learning: Student-t priors on weights encourage sparsity while tolerating outlier weights.

Relationship: Student-t with ν = 1 IS the Cauchy distribution. Student-t with ν → ∞ IS the Gaussian. It's the bridge between them, parameterized by how much data you have.

Worked example

You measure 5 robot lap times: [12.1, 11.8, 12.4, 12.0, 11.9] seconds. The sample mean is 12.04, sample std is 0.228. The 95% confidence interval uses t4, 0.025 = 2.776:

CI = 12.04 ± 2.776 × 0.228/√5 = 12.04 ± 0.283 = [11.76, 12.32]

python
import numpy as np
from scipy import stats

times = np.array([12.1, 11.8, 12.4, 12.0, 11.9])
n = len(times)
mean, sem = times.mean(), stats.sem(times)
ci = stats.t.interval(0.95, df=n-1, loc=mean, scale=sem)
print(f"95% CI: [{ci[0]:.2f}, {ci[1]:.2f}]")

# Compare with Gaussian (overconfident with n=5)
ci_z = stats.norm.interval(0.95, loc=mean, scale=sem)
print(f"Gaussian CI: [{ci_z[0]:.2f}, {ci_z[1]:.2f}]")  # narrower!
Why does the Student-t have heavier tails than the Gaussian?

Chapter 3: The Laplace

Imagine compressing an image. Most pixel differences between neighboring pixels are zero or tiny, but occasionally there's an edge with a big jump. If you histogram these differences, you get a sharp peak at zero with exponential tails — much sharper than a Gaussian. This is the Laplace distribution, sometimes called the double exponential.

The Laplace has two parameters: a location μ (the peak) and a scale b (how quickly the tails decay). Its sharp peak at μ makes it the natural prior when you believe most values are near zero — which is exactly the assumption behind L1 regularization (Lasso).

Why does it have this shape?

The Gaussian uses exp(−x²), which penalizes deviations quadratically. The Laplace uses exp(−|x|), penalizing deviations linearly. Linear penalty means the density decreases more slowly in the tails (heavier tails than Gaussian) but concentrates more mass at exactly μ (sharper peak). It's the maximum entropy distribution for a given mean absolute deviation.

f(x) = 12b · exp(− |x − μ| / b)
Interactive Laplace

Compare the sharp Laplace peak with the rounded Gaussian. Adjust scale b to control the spread. The gray dashed Gaussian has the same variance for comparison.

Location μ0.0
Scale b1.0
PropertyValue
Support(−∞, +∞)
Meanμ
Variance2b²
Modeμ
MGFexp(μt) / (1 − b²t²) for |t| < 1/b

Where it's used

L1 / Lasso regularization: Adding λ|w| to a loss is equivalent to a Laplace prior on weights. This drives many weights to exactly zero, producing sparse models. Sparse coding: Natural image patches are well-modeled by sparse coefficients with Laplace priors. Robust statistics: The median is the maximum likelihood estimate under a Laplace model, making it naturally outlier-resistant.

Gaussian vs Laplace, one sentence: Gaussian = L2 penalty = many small weights. Laplace = L1 penalty = few nonzero weights. If you want sparsity, think Laplace.

Worked example

A sparse coding model assumes coefficients follow Laplace(0, 0.5). What fraction of coefficients have |c| < 0.1?

P(|X| < 0.1) = 1 − exp(−0.1/0.5) = 1 − exp(−0.2) ≈ 1 − 0.819 = 0.181. So about 18% are near-zero. The CDF at x ≥ 0 is 1 − 0.5 exp(−x/b).

python
from scipy import stats

dist = stats.laplace(loc=0, scale=0.5)
p_near_zero = dist.cdf(0.1) - dist.cdf(-0.1)
print(f"P(|c| < 0.1) = {p_near_zero:.3f}")  # ~0.181

# Sample and check sparsity
samples = dist.rvs(size=10000)
print(f"Fraction |c| < 0.01: {(abs(samples) < 0.01).mean():.3f}")
Why is the Laplace distribution the natural prior for L1 (Lasso) regularization?

Chapter 4: The Beta

A coin lands heads 7 times out of 10. What's the probability it's a fair coin? What's the probability the true heads-rate is between 0.6 and 0.8? You need a distribution over probabilities — a distribution on [0, 1]. That's the Beta distribution.

The Beta is parameterized by two shape parameters α and β. Think of α as "pseudo-counts of successes" and β as "pseudo-counts of failures." With α = β = 1, you get a flat (uniform) distribution — total ignorance. After observing 7 heads and 3 tails, the posterior becomes Beta(8, 4), peaked around 0.67.

Why does it have this shape?

The Beta is the conjugate prior to the Bernoulli/Binomial likelihood. "Conjugate" means the posterior has the same functional form as the prior: Beta prior + Binomial data = Beta posterior. The shape comes from xα−1(1−x)β−1, which creates flexible asymmetric bumps on [0, 1].

f(x) = 1B(α, β) · xα−1 (1 − x)β−1    for x ∈ [0, 1]
Interactive Beta

Small α, β = wide uncertainty. Large values = concentrated. α > β shifts right (more successes). Try α = β = 1 (uniform) and α = β = 0.5 (Jeffrey's prior).

α2.0
β5.0
PropertyValue
Support[0, 1]
Meanα / (α + β)
Varianceαβ / [(α+β)²(α+β+1)]
Mode(α−1) / (α+β−2) for α, β > 1

Where it's used

Bayesian A/B testing: Model click-through rates as Beta, update with observations. Thompson sampling: Each arm's reward probability is Beta-distributed; sample to decide which arm to pull. Topic models (LDA): Document-topic and topic-word distributions use Dirichlet priors (the multivariate generalization of Beta).

Relationship: Beta(α, β) with α = β = 1 is the Uniform(0, 1). The Dirichlet distribution generalizes Beta to multiple dimensions. If X ~ Beta(α, β), then 1−X ~ Beta(β, α).

Worked example

A robot grasps successfully 14 out of 20 times. With a uniform prior Beta(1, 1), the posterior is Beta(15, 7). What's P(success rate > 0.8)?

python
from scipy import stats

posterior = stats.beta(15, 7)
p_above_80 = 1 - posterior.cdf(0.8)
print(f"P(rate > 0.8) = {p_above_80:.3f}")  # ~0.099
print(f"Posterior mean = {posterior.mean():.3f}")  # 0.682
print(f"95% CI: {posterior.ppf([0.025, 0.975])}")
What makes the Beta distribution special for Bayesian inference about probabilities?

Chapter 5: The Gamma

You're monitoring a server, and requests arrive randomly. The time until the first request is Exponential. But what about the time until the k-th request? That's the Gamma distribution — it's the sum of k independent Exponential waiting times.

The Gamma has a shape parameter k (or α) and a rate parameter θ (or β, depending on convention). When k = 1, it reduces to the Exponential. As k grows, the distribution becomes more symmetric and Gaussian-like (another instance of the CLT).

Why does it have this shape?

The factor xk−1 comes from the probability that exactly k−1 events occurred before time x (a polynomial growth), multiplied by e−x/θ from the exponential decay of waiting. The combination creates a right-skewed hump that shifts rightward as k increases.

f(x) = 1Γ(k) θk · xk−1 e−x/θ    for x > 0
Interactive Gamma

k = 1 is the Exponential. Increase k to see the peak emerge and shift right. The orange marker shows the mean, the teal marker shows the mode.

Shape k2.0
Scale θ1.0
PropertyValue
Support(0, +∞)
Mean
Variancekθ²
Mode(k − 1)θ for k ≥ 1
MGF(1 − θt)−k for t < 1/θ

Where it's used

Bayesian precision priors: The Gamma is the conjugate prior for the precision (inverse variance) of a Gaussian. Queuing theory: Time until k-th arrival in a Poisson process. Rainfall modeling: Precipitation amounts are often Gamma-distributed. Neural network priors: Gamma priors on noise precision in Bayesian neural networks.

Relationships: Gamma(1, θ) = Exponential(θ). Gamma(n/2, 2) = Chi-Squared(n). If X ~ Gamma(k, θ), then 1/X ~ Inverse-Gamma(k, 1/θ). The Gamma is the parent of a whole family of distributions.

Worked example

Server requests arrive at 5 per minute (Poisson). Time until 3rd request is Gamma(k=3, θ=1/5 min). What's P(wait > 1 min)?

python
from scipy import stats

# Gamma with shape k=3, scale theta=1/5 minutes = 0.2
dist = stats.gamma(a=3, scale=0.2)
p_wait = 1 - dist.cdf(1.0)
print(f"P(wait > 1 min) = {p_wait:.4f}")  # ~0.0803
print(f"Expected wait = {dist.mean():.2f} min")  # 0.6 min
What distribution do you get when you set the Gamma shape parameter k = 1?

Chapter 6: The Inverse Gamma

In Bayesian inference, we often need a prior on the variance σ² of a Gaussian. The variance must be positive, and we'd like a conjugate prior (so the posterior has the same form). The Inverse Gamma distribution is exactly that — if X ~ Gamma(α, β), then 1/X ~ InverseGamma(α, β).

The parameters α (shape) and β (scale) control the distribution. Larger α concentrates mass more tightly. Think of α as "how many prior observations worth of confidence" and β as "prior sum of squared deviations."

Why does it have this shape?

The Inverse Gamma is a change of variables from the Gamma. Applying x → 1/x to a Gamma-distributed variable transforms the polynomial growth into polynomial decay near zero and the exponential decay into exponential decay in 1/x. The result is a distribution that's zero at zero, rises to a peak, then has a heavy right tail.

f(x) = βαΓ(α) · x−α−1 e−β/x    for x > 0
Interactive Inverse Gamma

Increase α to concentrate mass. Increase β to shift the distribution right. The orange marker shows the mean (exists only for α > 1).

Shape α3.0
Scale β2.0
PropertyValue
Support(0, +∞)
Meanβ / (α − 1) for α > 1
Varianceβ² / [(α−1)²(α−2)] for α > 2
Modeβ / (α + 1)

Where it's used

Bayesian linear regression: The standard model places an Inverse Gamma prior on the noise variance σ². Hierarchical models: Group-level variance parameters often get Inverse Gamma priors. Gaussian mixture models: Each component's variance can have an Inverse Gamma prior.

Relationship: If X ~ Gamma(α, β), then 1/X ~ InverseGamma(α, 1/β). The Inverse Gamma is the conjugate prior for the variance of a Gaussian with known mean. The Scaled Inverse Chi-Squared is a reparameterization of the Inverse Gamma.

Worked example

You place an InverseGamma(3, 1) prior on σ². What's the prior mean and the probability that σ² > 1?

python
from scipy import stats

dist = stats.invgamma(a=3, scale=1)
print(f"Prior mean of sigma^2 = {dist.mean():.3f}")  # 0.5
print(f"P(sigma^2 > 1) = {1 - dist.cdf(1):.3f}")  # ~0.080
print(f"Mode = {1 / (3 + 1):.3f}")  # 0.25
Why is the Inverse Gamma the standard prior for variance in Bayesian models?

Chapter 7: The Exponential

A sensor on your robot fails at random. There's no "wearing out" — a sensor that's survived 1,000 hours is just as likely to fail in the next hour as a brand-new one. This memoryless property uniquely characterizes the Exponential distribution. It models the waiting time between events in a Poisson process.

The single parameter λ (rate) controls how frequently events happen. Higher λ means shorter average wait times. The mean wait is 1/λ.

Why does it have this shape?

If events happen at a constant rate λ (independently, no memory), then the probability of "no event for t time" decays as e−λt. The PDF is the rate of that decay: λ e−λt. The exponential shape is the only continuous distribution that is memoryless.

f(x) = λ e−λx    for x ≥ 0
Interactive Exponential

Increase λ for more frequent events (shorter waits). The orange marker shows the mean 1/λ. Notice the PDF always starts at λ on the y-axis.

Rate λ1.0
PropertyValue
Support[0, +∞)
Mean1/λ
Variance1/λ²
Mode0
MGFλ / (λ − t) for t < λ
MemorylessP(X > s+t | X > s) = P(X > t)

Where it's used

Sensor failure modeling: Time until a component fails (if failure rate is constant). Poisson processes: Time between arrivals of requests, photons, earthquakes. Survival analysis: Baseline hazard model. Queuing theory: Service times often modeled as Exponential.

The memoryless property: If your sensor has survived 1,000 hours, the expected remaining lifetime is still 1/λ. The past gives you zero information about the future. This is powerful but often unrealistic — real components DO wear out, which motivates the Weibull (next chapter).

Worked example

A robot's motor fails on average every 500 hours (λ = 1/500). What's the probability it lasts at least 700 hours?

P(X > 700) = e−700/500 = e−1.4 ≈ 0.247. About a 25% chance.

python
from scipy import stats

dist = stats.expon(scale=500)  # scale = 1/lambda = mean
print(f"P(X > 700) = {dist.sf(700):.3f}")  # 0.247
print(f"Median lifetime = {dist.median():.1f} hours")  # 346.6
What does the "memoryless" property of the Exponential mean?

Chapter 8: The Weibull

Real components DO wear out. A bearing that's been spinning for 10,000 hours is more likely to fail in the next hour than a new one. The Weibull distribution generalizes the Exponential by adding a shape parameter k that controls whether the failure rate increases, decreases, or stays constant over time.

When k = 1, it's exactly Exponential (constant failure rate). When k > 1, the failure rate increases over time (aging). When k < 1, the failure rate decreases (infant mortality — if it survives the early period, it's likely fine).

Why does it have this shape?

The Weibull modifies the Exponential by replacing λt with (λt)k. This means the hazard function (instantaneous failure rate) is h(t) = (k/λ)(t/λ)k−1, which is a power law in time. This single extra parameter captures a wide range of failure behaviors.

f(x) = kλ (x/λ)k−1 e−(x/λ)k    for x ≥ 0
Interactive Weibull

k < 1: decreasing failure rate. k = 1: constant (Exponential). k > 1: increasing failure rate. k ≈ 3.6 approximates a Gaussian shape.

Shape k1.5
Scale λ1.0
PropertyValue
Support[0, +∞)
Meanλ Γ(1 + 1/k)
Varianceλ² [Γ(1+2/k) − Γ(1+1/k)²]
Modeλ ((k−1)/k)1/k for k > 1, else 0

Where it's used

Reliability engineering: Modeling component lifetimes where failure rate isn't constant. Wind speed modeling: Wind speeds at a location often follow a Weibull. Survival analysis: The Weibull proportional hazards model. Robotics maintenance scheduling: Predicting when to replace components before failure.

Relationship: Weibull(k=1, λ) = Exponential(1/λ). Weibull(k=2, λ) is the Rayleigh distribution (up to reparameterization). The "bathtub curve" in reliability combines early Weibull (k<1), middle Exponential (k=1), and late Weibull (k>1).

Worked example

A robot joint has Weibull-distributed lifetime with k=2, λ=1000 hours. What's P(survive 800 hours)?

python
from scipy import stats
import numpy as np

dist = stats.weibull_min(c=2, scale=1000)  # c=shape, scale=lambda
print(f"P(X > 800) = {dist.sf(800):.3f}")  # 0.527
print(f"Mean lifetime = {dist.mean():.1f} hours")  # ~886
print(f"Replace at 95% reliability: {dist.ppf(0.05):.1f} hours")
What happens when the Weibull shape parameter k > 1?

Chapter 9: The Log-Normal

A startup's valuation grows: 10% this year, −5% next, +20% the year after. Each year multiplies the previous value. After many years, the log of the valuation is a sum of independent effects — and by the CLT, sums converge to Gaussian. So the valuation itself follows a Log-Normal distribution: the exponential of a Gaussian.

If X ~ N(μ, σ²), then Y = eX is Log-Normal. It's always positive, right-skewed, and heavy-tailed. The parameters μ and σ are the mean and standard deviation of the log, not of the distribution itself.

Why does it have this shape?

Taking exp() of a Gaussian stretches the right tail exponentially while compressing the left into zero. Values near exp(μ) are most common, but there's a long right tail of large values. This creates the characteristic right-skewed shape seen in wealth, city sizes, and word frequencies.

f(x) = 1x σ √(2π) · exp(− (ln x − μ)²2σ²)    for x > 0
Interactive Log-Normal

Increase σ to make the right tail heavier. The orange marker is the mean, teal marker is the median. Notice the mean is always right of the median.

μ (of log)0.0
σ (of log)0.5
PropertyValue
Support(0, +∞)
Meanexp(μ + σ²/2)
Variance[exp(σ²) − 1] exp(2μ + σ²)
Medianexp(μ)
Modeexp(μ − σ²)

Where it's used

Word frequencies: Word counts in natural language follow approximately log-normal distributions. Neural network weights after training: Weight magnitudes are often log-normally distributed. Financial modeling: Stock returns are multiplicative, so prices are log-normal (Black-Scholes model). Signal processing: Shadowing in wireless channels is log-normal.

Relationship: If X ~ Normal(μ, σ²), then exp(X) ~ LogNormal(μ, σ²). Conversely, if Y ~ LogNormal, then log(Y) ~ Normal. The CLT for products: products of positive random variables converge to Log-Normal.

Worked example

Word frequencies in a corpus follow LogNormal(μ=2, σ=1.5). What fraction of words appear fewer than 10 times?

python
from scipy import stats

dist = stats.lognorm(s=1.5, scale=np.exp(2))  # s=sigma, scale=exp(mu)
print(f"P(count < 10) = {dist.cdf(10):.3f}")
print(f"Median count = {dist.median():.1f}")  # exp(2) = 7.4
print(f"Mean count = {dist.mean():.1f}")  # exp(2+1.5^2/2) = 24.5
When does a Log-Normal distribution arise naturally?

Chapter 10: The Uniform

You're spinning a wheel that can land anywhere between 0 and 360 degrees with equal probability. No angle is more likely than any other. This is the Uniform distribution — the simplest continuous distribution. It's the formal way of saying "I have absolutely no idea, and every value in this range is equally plausible."

The Uniform on [a, b] has constant density 1/(b − a) within the interval and zero outside. Two parameters: the lower bound a and the upper bound b.

Why does it have this shape?

The Uniform maximizes entropy among all distributions with bounded support. If all you know is that x lies in [a, b], the most honest distribution assigns equal density everywhere. A flat line is the only function that integrates to 1 over [a, b] without favoring any subregion.

f(x) = 1b − a    for x ∈ [a, b],   0 otherwise
Interactive Uniform

Adjust the bounds. Notice the PDF height adjusts so the total area is always 1. Narrower range = taller rectangle.

Lower a-2.0
Upper b2.0
PropertyValue
Support[a, b]
Mean(a + b) / 2
Variance(b − a)² / 12
ModeAny value in [a, b]
MGF(etb − eta) / [t(b−a)]

Where it's used

Random initialization: Xavier/He initialization of neural network weights uses uniform or Gaussian variants. Uninformative priors: When you have no information, start with Uniform. Random number generation: Every random number generator starts with Uniform[0,1] and transforms it. Monte Carlo sampling: Uniform random variables drive importance sampling and MCMC acceptance.

Relationships: Uniform(0, 1) = Beta(1, 1). Every continuous CDF F transforms a Uniform into the corresponding distribution: if U ~ Uniform(0,1), then F−1(U) has CDF F. This is the inverse transform method.

Worked example

Xavier initialization sets weights from Uniform(−√(6/n), √(6/n)) where n = fan_in + fan_out. For a layer with 256 inputs and 128 outputs, n = 384. What's the weight range and variance?

python
import numpy as np
from scipy import stats

n = 256 + 128
bound = np.sqrt(6 / n)
print(f"Range: [{-bound:.4f}, {bound:.4f}]")  # [-0.1250, 0.1250]

dist = stats.uniform(loc=-bound, scale=2*bound)
print(f"Variance = {dist.var():.6f}")  # ~0.005208 = 2/n
Why is the Uniform distribution the maximum-entropy choice for bounded variables?

Chapter 11: The Cauchy

Spin a pointer mounted at (0, 1) in the plane. It points at a random angle θ, and you record where it hits the x-axis. The resulting distribution of x-intercepts is the Cauchy distribution. It looks like a bell curve but with tails so heavy that the mean and variance don't exist.

That's not a typo. If you average N samples from a Cauchy, the average doesn't converge — it has the same distribution as a single sample. The CLT fails completely. This makes the Cauchy the ultimate stress test for any algorithm that assumes finite variance.

Why does it have this shape?

The Cauchy is the ratio of two independent standard Gaussians: Z1/Z2. Since Z2 can be very close to zero, the ratio can be astronomically large, creating the heavy tails. The PDF decays as 1/x² (not exponentially), so the tails contain infinite expected mass.

f(x) = 1πγ · 11 + ((x − x0)/γ)²
Interactive Cauchy

Compare with the gray Gaussian. The Cauchy tails never really die — extreme values are far more common than with a Gaussian. The orange marks the median (the mean doesn't exist!).

Location x00.0
Scale γ1.0
PropertyValue
Support(−∞, +∞)
MeanUndefined
VarianceUndefined
Medianx0
Modex0

Where it's used

Robust estimation: The Cauchy likelihood makes estimators highly resistant to outliers because it assigns non-negligible probability to extreme values. Physics (Lorentzian): The spectral line shape of atomic resonances is Cauchy (called the Lorentz profile). Testing algorithms: If your algorithm breaks on Cauchy-distributed data, it's relying too heavily on the existence of moments.

Warning: The sample mean of Cauchy data does NOT converge. Taking the average of a million Cauchy samples gives an estimate with the same uncertainty as a single sample. Always use the median for Cauchy data.

Worked example

You suspect outliers in your data. Fit both a Gaussian and Cauchy likelihood. If the data point x = 15 comes from N(0, 1), the density is essentially 0. From Cauchy(0, 1), the density is 1/(π(1 + 225)) ≈ 0.0014 — still small, but vastly larger than the Gaussian density.

python
from scipy import stats

x = 15
gauss_pdf = stats.norm.pdf(x, 0, 1)
cauchy_pdf = stats.cauchy.pdf(x, 0, 1)
print(f"Gaussian density at x=15: {gauss_pdf:.2e}")  # ~5.5e-50
print(f"Cauchy density at x=15:   {cauchy_pdf:.4f}")  # ~0.0014
print(f"Cauchy/Gaussian ratio:    {cauchy_pdf/gauss_pdf:.2e}")
Why does the sample mean fail for Cauchy-distributed data?

Chapter 12: The Rayleigh

You're tracking a target with a radar system. The signal bounces off the target and returns with both an in-phase (X) and quadrature (Y) component, each corrupted by independent Gaussian noise. The magnitude of this 2D noise — √(X² + Y²) — follows the Rayleigh distribution.

Whenever you take the distance (L2 norm) of a 2D Gaussian vector, you get a Rayleigh. This makes it fundamental in signal processing, communications, and any 2D measurement system.

Why does it have this shape?

Convert 2D Gaussian coordinates to polar: x = r cos(θ), y = r sin(θ). The Jacobian of this transformation introduces an extra factor of r. So the density of r is proportional to r · exp(−r²/2σ²). The r factor suppresses values near zero (small circles have small circumference) while the exponential kills large values.

f(r) = rσ² · exp(−r² / 2σ²)    for r ≥ 0
Interactive Rayleigh

Adjust σ (the spread of each Gaussian component). Notice the PDF starts at zero, rises to a peak at σ, then decays. The orange marker shows the mean.

Scale σ1.0
PropertyValue
Support[0, +∞)
Meanσ √(π/2)
Varianceσ² (4 − π) / 2
Modeσ

Where it's used

Radar/sonar signal processing: The envelope of a noisy sinusoid follows a Rayleigh distribution. Wireless channel fading: The Rayleigh fading model describes signal amplitude in non-line-of-sight channels. GPS error modeling: 2D positioning error (CEP) is Rayleigh-distributed. Wind speed: Wind speed distributions are often approximately Rayleigh.

Relationships: If X, Y ~ N(0, σ²) independently, then √(X²+Y²) ~ Rayleigh(σ). Rayleigh is a special case of the Weibull with k=2. If you add a mean to the 2D Gaussian (non-zero-mean components), you get the Rice distribution instead.

Worked example

A GPS receiver has independent N(0, 3²) meter errors in X and Y. What's P(2D error < 5m)?

python
from scipy import stats

dist = stats.rayleigh(scale=3)  # sigma = 3 meters
print(f"P(error < 5m) = {dist.cdf(5):.3f}")  # ~0.672
print(f"Mean 2D error = {dist.mean():.2f}m")  # ~3.76m
print(f"95th percentile = {dist.ppf(0.95):.2f}m")  # ~7.35m
Why does the Rayleigh PDF start at zero (f(0) = 0) instead of at a maximum?

Chapter 13: The Chi-Squared

You have n independent standard Gaussian random variables Z1, ..., Zn. Square each one and add them up: Q = Z1² + ... + Zn². The resulting Q follows a Chi-Squared distribution with n degrees of freedom, written χ²(n). It measures the "total squared deviation" from zero.

This distribution appears whenever you compute a sum of squared residuals, which is everywhere in statistics: goodness-of-fit tests, confidence intervals for variance, the denominator of the F-test, and covariance matrix estimation.

Why does it have this shape?

For n = 1, you're squaring a single Gaussian — most values cluster near zero (small squared values) with a heavy right tail. For larger n, the sum of many squared Gaussians becomes more symmetric by the CLT. The Chi-Squared is literally Gamma(n/2, 2), which explains its shape: the Gamma captures the sum of Exponential-like variables.

f(x) = 12n/2 Γ(n/2) · xn/2 − 1 e−x/2    for x > 0
Interactive Chi-Squared

Increase the degrees of freedom n. At n=1 the density peaks at zero. At n=2 it's Exponential. For large n it looks Gaussian. The orange marker is the mean (= n).

d.f. n3
PropertyValue
Support(0, +∞)
Meann
Variance2n
Modemax(n − 2, 0)
MGF(1 − 2t)−n/2 for t < 1/2

Where it's used

Hypothesis testing: The χ² test checks if observed frequencies match expected frequencies. Variance estimation: The sample variance times (n−1)/σ² follows χ²(n−1). Covariance estimation: The Wishart distribution (multivariate χ²) is the conjugate prior for covariance matrices. Mahalanobis distance: The squared Mahalanobis distance follows χ²(k) for k-dimensional Gaussian data.

Relationships: χ²(n) = Gamma(n/2, 2). If Q ~ χ²(n) and Z ~ N(0,1) independently, then Z/√(Q/n) ~ Student-t(n). The Rayleigh is related: √(χ²(2)) ~ Rayleigh. The χ² is the backbone connecting Gaussian, Student-t, F, and Wishart distributions.

Worked example

You fit a model to data with n=10 residuals. Under H0, the sum of squared standardized residuals follows χ²(10). You observe Q = 18.3. Is this significant at α = 0.05?

python
from scipy import stats

n = 10
Q = 18.3
p_value = 1 - stats.chi2.cdf(Q, df=n)
critical = stats.chi2.ppf(0.95, df=n)
print(f"p-value = {p_value:.3f}")  # ~0.050
print(f"Critical value = {critical:.2f}")  # 18.31
print(f"Reject H0? {Q > critical}")  # borderline!
The Chi-Squared distribution with n degrees of freedom is a special case of which distribution?

Chapter 14: Von Mises

Your robot has a compass that estimates heading angle. The heading is 0° to 360°, and the important thing is that 359° and 1° are close together — angles wrap around. You can't use a Gaussian here because a Gaussian at μ = 350° with σ = 20° would put significant mass at 370°, which doesn't exist. You need a distribution that lives on a circle. That's the Von Mises distribution.

The Von Mises is the "Gaussian on a circle." It has a mean direction μ and a concentration κ (analogous to 1/σ²). When κ = 0, it's uniform on the circle. As κ → ∞, it concentrates at μ.

Why does it have this shape?

The maximum-entropy distribution on a circle with a specified mean direction and mean resultant length is proportional to exp(κ cos(θ − μ)). The cosine ensures the density wraps around smoothly: at θ = μ the exponent is maximized, and at the antipodal point it's minimized. The normalization constant involves the modified Bessel function I0(κ).

f(θ) = 12π I0(κ) · exp(κ cos(θ − μ))    for θ ∈ [−π, π)
Interactive Von Mises

Left: the distribution on a circle. Right: unrolled as a function of θ. Increase κ to concentrate around the mean direction. The orange arrow shows the mean direction.

Mean μ (deg)0
Concentration κ2.0
PropertyValue
Support[−π, π) (circular)
Mean directionμ
Circular variance1 − I1(κ)/I0(κ)
Modeμ

Where it's used

Robotics heading estimation: Fusing compass and gyroscope readings for orientation. Wind direction modeling: Meteorological wind direction is inherently circular. Protein structure prediction: Backbone dihedral angles (φ, ψ) are modeled with Von Mises distributions. Image orientation: Edge directions in computer vision.

Why not just use a Gaussian? Gaussians on angles cause "wrapping artifacts." A Gaussian centered at 350° will think 10° is 340 degrees away instead of 20 degrees away. The Von Mises correctly handles the circular topology. For small κ, the difference from a wrapped Gaussian is negligible, but for high concentration the proper circular treatment matters.

Worked example

A compass reads headings with Von Mises noise (κ = 5). What's P(heading within ±30° of true)?

python
import numpy as np
from scipy import stats

kappa = 5
dist = stats.vonmises(kappa)
angle_30 = np.radians(30)
p = dist.cdf(angle_30) - dist.cdf(-angle_30)
print(f"P(within ±30°) = {p:.3f}")  # ~0.891

# Approximate circular std dev
print(f"Circ. std dev ≈ {np.degrees(1/np.sqrt(kappa)):.1f}°")  # ~25.5°
Why can't you use a standard Gaussian for angular data like robot heading?

Chapter 15: Distribution Explorer

Now let's put it all together. The explorer below lets you select any of the 14 distributions, adjust its parameters, and see the PDF, CDF, and random samples update in real time. Use it to compare distributions side-by-side and build intuition for how parameters change shapes.

Universal Distribution Explorer

Select a distribution, adjust parameters, and watch the PDF (top), CDF (middle), and histogram of 1000 samples (bottom) update live.

Relationship Map

Distributions don't exist in isolation. They form a rich family tree connected by special cases, limits, and transformations.

Distribution Family Tree

Lines show how distributions are related. Hover over a node to see the connection.

Comparison Table

DistributionSupportParamsKey Use
Gaussian(−∞, +∞)μ, σNoise, priors, CLT
Student-t(−∞, +∞)νSmall samples, robust regression
Laplace(−∞, +∞)μ, bL1 regularization, sparsity
Beta[0, 1]α, βPriors on probabilities
Gamma(0, ∞)k, θPrecision priors, waiting times
Inverse Gamma(0, ∞)α, βVariance priors
Exponential[0, ∞)λMemoryless waiting, Poisson
Weibull[0, ∞)k, λReliability, failure analysis
Log-Normal(0, ∞)μ, σMultiplicative processes
Uniform[a, b]a, bInitialization, uninformative priors
Cauchy(−∞, +∞)x0, γRobust estimation
Rayleigh[0, ∞)σ2D noise magnitude, radar
Chi-Squared(0, ∞)nHypothesis testing, covariance
Von Mises[−π, π)μ, κCircular data, heading estimation
Key relationships to remember: Gamma(1, θ) = Exponential. Gamma(n/2, 2) = χ²(n). Student-t(ν→∞) = Gaussian. Student-t(1) = Cauchy. Beta(1,1) = Uniform(0,1). Weibull(2, λ) ≈ Rayleigh. exp(Gaussian) = Log-Normal. 1/Gamma = Inverse Gamma.
Which distribution would you use as a prior on the variance parameter of a Gaussian likelihood?