When Gaussians aren't enough. The exotic probability distributions that power modern ML, robotics, finance, and physics.
You know the Gaussian. You know the Exponential. Maybe you know the Poisson and Binomial. These cover a huge range of problems. But try to model any of these situations and they fall apart:
A robot arm needs to express uncertainty over rotations in 3D space. Rotations live on a curved manifold, not a flat number line — a Gaussian centered at 359 degrees would put probability mass at 360, 361, 362... but those are the same as 0, 1, 2 degrees. Standard distributions have no idea that the space wraps around.
An RL agent chooses actions between -1 and +1. It learns a Gaussian policy, but Gaussians have infinite support — they assign probability to actions at +47, which is meaningless. You need a distribution that respects the bounds.
An earthquake happens. Now the probability of another earthquake spikes — aftershocks cluster. Standard distributions treat events as independent. You need a process where events excite more events.
Toggle between scenarios to see why standard distributions fail. Each colored region shows the problem domain.
You have a neural network that outputs raw scores (logits) for 5 classes: [2.0, 1.0, 0.1, -1.0, 3.5]. Which class should it pick? The scores aren't probabilities — they can be negative, they don't sum to 1. You need a way to convert arbitrary real numbers into a valid probability distribution.
The softmax function (also called the Gibbs distribution in statistical mechanics) does exactly this. For a vector of scores z = [z1, ..., zK], it produces:
The parameter τ (tau) is the temperature. At high temperature (τ → ∞), the distribution becomes uniform — all classes equally likely. At low temperature (τ → 0), all probability concentrates on the highest-scoring class. Think of τ as a "sharpness knob."
| Property | Value |
|---|---|
| Input | Any real-valued vector z ∈ RK |
| Output | Probability simplex: pi > 0, ∑ pi = 1 |
| Temperature τ → 0 | argmax (greedy selection) |
| Temperature τ → ∞ | Uniform distribution |
| Shift invariance | softmax(z + c) = softmax(z) |
Five logits are shown as bars. Adjust temperature to see how the probability distribution changes. Low temp = peaky. High temp = flat.
Logits: z = [2.0, 1.0, 0.1]. With τ = 1:
exp(2.0) = 7.389, exp(1.0) = 2.718, exp(0.1) = 1.105. Sum = 11.212.
p = [0.659, 0.242, 0.099]. The highest logit gets 66% of the probability mass.
python import numpy as np def softmax(z, tau=1.0): e = np.exp(z / tau - np.max(z / tau)) # subtract max for numerical stability return e / e.sum() logits = np.array([2.0, 1.0, 0.1, -1.0, 3.5]) print(softmax(logits, tau=1.0)) # peaked at class 4 print(softmax(logits, tau=0.1)) # nearly one-hot print(softmax(logits, tau=10.0)) # nearly uniform
A robot arm grasps a cup. To describe the cup's orientation, you need three angles (roll, pitch, yaw). But these angles don't live on a flat Euclidean space — they live on a curved surface called a manifold. Specifically, 3D rotations form the group SO(3) (Special Orthogonal group in 3 dimensions).
Why can't we just use a Gaussian on Euler angles? Because of gimbal lock (singularities where two axes align) and wrapping (an angle of 361° is the same as 1°). A Gaussian centered at 350° with σ = 20° would naively split probability between 330-370, but 370° = 10°, so the distribution should wrap around.
The standard solution uses quaternions (4D unit vectors) or rotation matrices (3×3 orthogonal matrices). On these representations, we define distributions directly on the manifold.
The Bingham distribution is the analog of a Gaussian on the unit sphere. For a unit quaternion q:
where M is a symmetric matrix controlling the concentration and preferred direction. The Matrix Fisher distribution works directly on rotation matrices R:
where F encodes the mean rotation and concentration.
Samples from a distribution on SO(3), projected to a sphere. Adjust concentration κ — low = spread out, high = concentrated around the mean rotation.
python import numpy as np from scipy.spatial.transform import Rotation def sample_rotation_vmf(mean_axis, kappa, n=100): """Sample rotations concentrated around an axis using von Mises-Fisher.""" # Sample angle from vMF on the circle angles = np.random.vonmises(0, kappa, n) # Create rotation objects rotations = [Rotation.from_rotvec(a * mean_axis) for a in angles] return rotations samples = sample_rotation_vmf(np.array([0,0,1]), kappa=10)
You're studying wind directions at a weather station. Wind can blow from any compass heading: 0° to 360°, and then it wraps back to 0°. You need a probability distribution on the circle. The von Mises distribution (the "circular Gaussian") is one option, but what if your data has heavy tails — occasional extreme deviations from the mean direction?
The Wrapped Cauchy distribution is the circular analog of the Cauchy distribution. Take a Cauchy distribution on the real line, wrap it around the unit circle, and you get a distribution with much heavier tails than von Mises. Its PDF on the circle [0, 2π) is:
Here μ is the mean direction and ρ ∈ [0, 1) is the concentration parameter. When ρ = 0, the distribution is uniform on the circle. As ρ → 1, it concentrates around μ — but always with heavier tails than von Mises at the same concentration.
Compare the two circular distributions. The orange is Wrapped Cauchy, teal is von Mises. Notice the heavier tails of the Wrapped Cauchy at the same concentration.
python import numpy as np def wrapped_cauchy_pdf(theta, mu, rho): """Wrapped Cauchy PDF on [0, 2pi).""" return (1-rho**2) / (2*np.pi * (1+rho**2 - 2*rho*np.cos(theta-mu))) # Sample via inverse CDF def sample_wrapped_cauchy(mu, rho, n=1000): u = np.random.uniform(0, 1, n) return (mu + 2*np.arctan2((1-rho)*np.tan(np.pi*(u-0.5)), 1+rho)) % (2*np.pi)
You're monitoring a river's water level. Most days it's between 2-4 meters. But during a flood, it could hit 10, 15, even 20 meters. Insurance companies, engineers, and risk analysts don't care about typical days — they need to model the extreme events. How high could the river get in a once-per-century flood?
The Generalized Pareto Distribution (GPD) is the theoretically justified distribution for modeling exceedances over a high threshold. By the Pickands-Balkema-de Haan theorem, the distribution of values above a high threshold converges to a GPD, regardless of the original distribution.
The shape parameter ξ (xi) controls the tail behavior. When ξ > 0, the tail is heavy (power-law decay — extreme events are relatively likely). When ξ = 0, it reduces to the exponential distribution. When ξ < 0, the distribution has a finite upper bound at -σ/ξ.
| ξ value | Tail type | Example |
|---|---|---|
| ξ > 0 | Heavy (Pareto-like) | Financial losses, insurance claims |
| ξ = 0 | Exponential | Waiting times above threshold |
| ξ < 0 | Bounded | Physical measurements with limits |
Adjust ξ to see how the tail weight changes. Orange is the GPD density, gray dashed is exponential for reference.
python from scipy import stats # Fit GPD to exceedances above a threshold data = np.random.pareto(2, 10000) # heavy-tailed data threshold = np.percentile(data, 95) exceedances = data[data > threshold] - threshold shape, loc, scale = stats.genpareto.fit(exceedances, floc=0) print(f"Shape xi = {shape:.3f}, Scale sigma = {scale:.3f}")
What if you want a distribution that can be any shape? Not Gaussian, not Pareto, not anything from a textbook — but a complex, multimodal, asymmetric blob learned from data? Normalizing flows let you start with a simple distribution (like a Gaussian) and warp it through a sequence of invertible transformations until it matches your target.
The key idea: if z ~ N(0, I) and x = f(z) where f is an invertible, differentiable function, then the density of x is given by the change of variables formula:
where J is the Jacobian matrix of partial derivatives. The determinant tells you how much the transformation stretches or compresses space locally.
By stacking K invertible layers, x = fK ˆ ... ˆ f1(z), you can learn arbitrarily complex distributions. Each layer must be (1) invertible (so you can compute f-1), and (2) have a tractable Jacobian determinant (so you can evaluate the density).
Watch a simple Gaussian (left) get transformed through flow layers. Each layer stretches and bends space. The result (right) can be multimodal.
python import torch from torch import nn class AffineFlow(nn.Module): """Simplest normalizing flow: x = z * exp(s) + t""" def __init__(self): super().__init__() self.s = nn.Parameter(torch.zeros(1)) self.t = nn.Parameter(torch.zeros(1)) def forward(self, z): x = z * torch.exp(self.s) + self.t log_det = self.s # log|det J| = s return x, log_det def inverse(self, x): z = (x - self.t) * torch.exp(-self.s) return z
You're training a robot to control a motor. The action (torque) must be between -1 and +1. Your policy network outputs a mean μ and standard deviation σ, defining a Gaussian. But a Gaussian has infinite support — it assigns probability to torques of +50 or -100, which are physically impossible.
The squashed Gaussian fixes this by passing the Gaussian sample through tanh, which squishes any real number into (-1, 1). If u ~ N(μ, σ2) and a = tanh(u), then a is guaranteed to be in (-1, 1). But we need the density of a, not u.
Since tanh'(u) = 1 - tanh2(u), we divide by this factor. In log-probability (which is what we use for policy gradients):
Gray is the raw Gaussian. Orange is after tanh squashing. Notice the density piles up near the boundaries ±1.
python import torch from torch.distributions import Normal def squashed_gaussian_sample(mu, sigma): dist = Normal(mu, sigma) u = dist.rsample() # reparameterized sample a = torch.tanh(u) # squash to (-1, 1) # Log-prob with correction log_prob = dist.log_prob(u) - torch.log(1 - a**2 + 1e-6) return a, log_prob.sum(-1)
Stock returns are famously asymmetric. Losses tend to be larger and more sudden than gains. If you fit a Gaussian, you miss this asymmetry entirely — the bell curve is perfectly symmetric. You need a distribution that can lean to one side.
The Skew-Normal distribution adds a single parameter α to the Gaussian that controls asymmetry. Its PDF is:
where φ(x) is the standard normal PDF and Φ(x) is the standard normal CDF. The parameter α controls skewness: α = 0 gives the standard Gaussian, α > 0 skews right (long right tail), α < 0 skews left.
| Parameter | Effect |
|---|---|
| α = 0 | Standard Gaussian (no skew) |
| α > 0 | Right skew (long right tail) |
| α < 0 | Left skew (long left tail) |
| |α| → ∞ | Approaches half-normal distribution |
Adjust α to skew the distribution left or right. The gray dashed line is the symmetric Gaussian for reference.
python from scipy import stats # skewnorm(a) where a = alpha (shape parameter) x = np.linspace(-4, 8, 200) for alpha in [-4, 0, 4, 8]: y = stats.skewnorm.pdf(x, alpha) print(f"alpha={alpha}: mean={stats.skewnorm.mean(alpha):.3f}, " f"skew={stats.skewnorm.stats(alpha, moments='s')[0]:.3f}")
The Gaussian penalizes outliers with squared distance (x - μ)2. The Laplace distribution uses absolute distance |x - μ|, which is gentler on outliers. What if you want something in between? Or even sharper than Gaussian? The Generalized Normal Distribution (GND) provides a single "shape dial."
The parameter β controls the tail weight:
| β | Distribution | Tail weight |
|---|---|---|
| β = 1 | Laplace (double exponential) | Heavy tails |
| β = 2 | Gaussian | Medium tails |
| β → ∞ | Uniform on [μ-α, μ+α] | No tails |
| β < 1 | Super-heavy tails | Extremely heavy |
Drag β to morph the distribution. Watch the tails grow heavy (β<2) or shrink (β>2). Gray dashed = Gaussian reference.
python from scipy import stats # scipy.stats.gennorm(beta) is the generalized normal x = np.linspace(-5, 5, 300) for beta in [0.5, 1, 2, 5]: y = stats.gennorm.pdf(x, beta) print(f"beta={beta}: kurtosis={stats.gennorm.stats(beta, moments='k')[0]:.2f}")
Two stocks both have heavy-tailed returns (modeled by, say, Student-t distributions). You know each stock's marginal distribution. But how do they move together? In calm markets they're weakly correlated, but during crashes they drop in lockstep. The dependency structure is separate from the individual distributions.
A copula separates the problem of modeling a multivariate distribution into two parts: (1) the marginals (each variable's individual distribution) and (2) the dependency structure (how they relate). Sklar's theorem guarantees this decomposition always exists:
where FX, FY are the marginal CDFs and C is the copula function mapping [0,1]2 → [0,1]. The copula C captures all the dependency information.
| Copula | Dependency type | Tail dependence |
|---|---|---|
| Gaussian | Symmetric, no tail dependence | None |
| Clayton | Lower tail dependence | Left tail only |
| Gumbel | Upper tail dependence | Right tail only |
| Frank | Symmetric, no tail dependence | None |
| Student-t | Symmetric tail dependence | Both tails |
Samples from different copulas with uniform marginals. Notice how Clayton concentrates points in the lower-left (joint lows), while Gumbel clusters the upper-right (joint highs).
python from scipy.stats import norm import numpy as np def gaussian_copula_sample(rho, n=500): """Sample from a Gaussian copula with correlation rho.""" mean = [0, 0] cov = [[1, rho], [rho, 1]] z = np.random.multivariate_normal(mean, cov, n) u = norm.cdf(z) # transform to uniform marginals return u
An earthquake strikes. Over the next few hours, the rate of aftershocks spikes dramatically, then slowly decays. Each aftershock can itself trigger more aftershocks. A tweet goes viral: every retweet increases the chance of more retweets. These are self-exciting processes — events make future events more likely.
The Hawkes process is a point process where the event rate (intensity) at time t depends on past events. Its conditional intensity is:
Three parameters tell the whole story: μ is the background rate (events that happen spontaneously). α is the excitation (how much each event boosts the rate). β is the decay rate (how fast the excitement fades). The process is stable (doesn't explode) when α < 1.
| Domain | Events | Self-excitation mechanism |
|---|---|---|
| Seismology | Earthquakes | Aftershock sequences |
| Social media | Posts/retweets | Viral cascades |
| Finance | Trades | Order flow clustering |
| Crime | Incidents | Copycat/retaliation |
| Neuroscience | Neural spikes | Post-synaptic excitation |
Watch events cascade. Each event (vertical tick) boosts the intensity (orange curve). Adjust α to see the process go from quiet to nearly explosive.
python def simulate_hawkes(mu, alpha, beta, T=100): """Simulate a univariate Hawkes process via thinning.""" events = [] t = 0 while t < T: lam_bar = mu + alpha * beta * sum( np.exp(-beta*(t - ti)) for ti in events) dt = np.random.exponential(1 / lam_bar) t += dt if t >= T: break lam_t = mu + alpha * beta * sum( np.exp(-beta*(t - ti)) for ti in events) if np.random.uniform() < lam_t / lam_bar: events.append(t) return events
The Central Limit Theorem says that if you average many independent, finite-variance random variables, you get a Gaussian. But what if the variance is infinite? City sizes, wealth distributions, earthquake magnitudes, and stock market crashes all have such heavy tails that their variance doesn't converge to a finite number. The Gaussian CLT doesn't apply.
The stable distributions (also called Lévy alpha-stable) generalize the Gaussian to handle these cases. They satisfy a generalized CLT: the sum of heavy-tailed i.i.d. variables converges to a stable distribution.
A stable distribution has four parameters:
| Parameter | Role | Range |
|---|---|---|
| α (stability) | Tail heaviness: smaller = heavier | (0, 2] |
| β (skewness) | Asymmetry | [-1, 1] |
| γ (scale) | Width | > 0 |
| δ (location) | Center | Any real |
When α = 2, you get the Gaussian. When α = 1 and β = 0, you get the Cauchy. For α < 2, the distribution has infinite variance. For α ≤ 1, even the mean is undefined.
Drag α from 2 (Gaussian) toward 1 (Cauchy). Watch the tails grow heavier. The gray dashed Gaussian reference stays fixed.
python from scipy import stats # scipy.stats.levy_stable(alpha, beta, loc, scale) # alpha=2 → Gaussian, alpha=1, beta=0 → Cauchy x = np.linspace(-10, 10, 500) for alpha in [0.5, 1.0, 1.5, 2.0]: y = stats.levy_stable.pdf(x, alpha, beta=0) # Note: for alpha < 2, variance = infinity
You want to generate photorealistic images. One approach: start with pure noise and gradually denoise it into a crisp image. But to denoise, you need to know the score function — the gradient of the log-density ∇x log p(x) — at every noise level. This is the core idea behind diffusion models.
The forward process gradually adds Gaussian noise to data x0 over T steps:
After enough steps, xT ≈ N(0, I) — pure noise. The noise schedule {βt} controls how fast the data is destroyed. Common schedules include:
| Schedule | Formula | Property |
|---|---|---|
| Linear | βt linearly from 10-4 to 0.02 | Simple, original DDPM |
| Cosine | βt from cosine of cumulative α | Gentler, better for small images |
| Sigmoid | βt from sigmoid function | Smooth acceleration |
The reverse process learns to denoise: a neural network sθ(xt, t) predicts the score (noise direction) at each step. The score matching loss is:
A 1D "image" (initial data distribution, two peaks) gets progressively noisier. Drag the timestep to watch it dissolve into a Gaussian.
python import torch def linear_schedule(T=1000, start=1e-4, end=0.02): return torch.linspace(start, end, T) def forward_diffusion(x0, t, betas): alphas = 1 - betas alpha_bar = torch.cumprod(alphas, dim=0)[t] noise = torch.randn_like(x0) xt = torch.sqrt(alpha_bar) * x0 + torch.sqrt(1 - alpha_bar) * noise return xt, noise
We've covered 12 specialized distributions. Each exists because some domain has structure that standard distributions can't capture. Here's the complete taxonomy:
| Structure | Distribution | Standard Alternative |
|---|---|---|
| Scores → probabilities | Softmax / Gibbs | Normalization |
| Rotation manifold | SO(3) / SE(3) (Bingham, Fisher) | Gaussian on Euler angles |
| Circular domain | Wrapped Cauchy | von Mises |
| Extreme values | Generalized Pareto | Gaussian tail |
| Arbitrary shape | Normalizing Flows | GMM |
| Bounded actions | Squashed Gaussian | Clipped Gaussian |
| Asymmetric noise | Skew-Normal | Gaussian |
| Adjustable tails | Generalized Normal | Gaussian or Laplace |
| Dependency modeling | Copulas | Multivariate Gaussian |
| Self-excitation | Hawkes Process | Poisson process |
| Infinite variance | Lévy / Stable | Gaussian (fails) |
| Noise schedules | Diffusion / Score-Based | VAE latent space |
How the 12 specialized distributions relate to standard ones and to each other. Hover over nodes to highlight connections.
These distributions appear throughout modern ML and robotics:
Diffusion Models — Full lesson on the denoising framework using diffusion/score-based distributions.
Flow Matching — A simpler alternative to normalizing flows with straight-line paths.
RL Algorithms — SAC uses the squashed Gaussian policy.
VLAs — Robot policies use SO(3)/SE(3) distributions for pose prediction.
Classical SLAM — SE(3) distributions for robot pose uncertainty.
Bayesian Estimation — Conjugate priors, the foundation for many of these distributions.
"All models are wrong, but some are useful." — George Box