Probability & Statistics

Specialized Distributions

When Gaussians aren't enough. The exotic probability distributions that power modern ML, robotics, finance, and physics.

Prerequisites: Basic probability + Gaussian/Exponential familiarity. Everything else built from scratch.
14
Chapters
14+
Simulations
0
Assumed Knowledge

Chapter 0: Why Specialized?

You know the Gaussian. You know the Exponential. Maybe you know the Poisson and Binomial. These cover a huge range of problems. But try to model any of these situations and they fall apart:

A robot arm needs to express uncertainty over rotations in 3D space. Rotations live on a curved manifold, not a flat number line — a Gaussian centered at 359 degrees would put probability mass at 360, 361, 362... but those are the same as 0, 1, 2 degrees. Standard distributions have no idea that the space wraps around.

An RL agent chooses actions between -1 and +1. It learns a Gaussian policy, but Gaussians have infinite support — they assign probability to actions at +47, which is meaningless. You need a distribution that respects the bounds.

An earthquake happens. Now the probability of another earthquake spikes — aftershocks cluster. Standard distributions treat events as independent. You need a process where events excite more events.

The pattern: Specialized distributions exist because domains have structure. Circular spaces, bounded intervals, heavy tails, self-excitation, manifold geometry. Each distribution in this lesson was invented because someone hit a wall with the standard toolkit and had to build something new.
Standard vs Specialized: Where Gaussians Break

Toggle between scenarios to see why standard distributions fail. Each colored region shows the problem domain.

Why do we need specialized distributions beyond the Gaussian?

Chapter 1: Softmax / Gibbs Distribution

You have a neural network that outputs raw scores (logits) for 5 classes: [2.0, 1.0, 0.1, -1.0, 3.5]. Which class should it pick? The scores aren't probabilities — they can be negative, they don't sum to 1. You need a way to convert arbitrary real numbers into a valid probability distribution.

The softmax function (also called the Gibbs distribution in statistical mechanics) does exactly this. For a vector of scores z = [z1, ..., zK], it produces:

p(i) = exp(zi / τ) / ∑j exp(zj / τ)

The parameter τ (tau) is the temperature. At high temperature (τ → ∞), the distribution becomes uniform — all classes equally likely. At low temperature (τ → 0), all probability concentrates on the highest-scoring class. Think of τ as a "sharpness knob."

Why "Gibbs"? In statistical mechanics, the Boltzmann distribution assigns probability to energy states as p(state) ∝ exp(-E / kT). Replace energy E with negative logit -z and physical temperature kT with τ, and you get softmax. Neural networks borrowed this idea from 19th-century physics.

Properties

PropertyValue
InputAny real-valued vector z ∈ RK
OutputProbability simplex: pi > 0, ∑ pi = 1
Temperature τ → 0argmax (greedy selection)
Temperature τ → ∞Uniform distribution
Shift invariancesoftmax(z + c) = softmax(z)
Interactive: Temperature Controls Sharpness

Five logits are shown as bars. Adjust temperature to see how the probability distribution changes. Low temp = peaky. High temp = flat.

Temperature τ1.00

Worked Example

Logits: z = [2.0, 1.0, 0.1]. With τ = 1:

exp(2.0) = 7.389, exp(1.0) = 2.718, exp(0.1) = 1.105. Sum = 11.212.

p = [0.659, 0.242, 0.099]. The highest logit gets 66% of the probability mass.

python
import numpy as np

def softmax(z, tau=1.0):
    e = np.exp(z / tau - np.max(z / tau))  # subtract max for numerical stability
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.1, -1.0, 3.5])
print(softmax(logits, tau=1.0))   # peaked at class 4
print(softmax(logits, tau=0.1))   # nearly one-hot
print(softmax(logits, tau=10.0))  # nearly uniform
What happens to the softmax distribution as temperature τ approaches zero?

Chapter 2: SO(3)/SE(3) Distributions

A robot arm grasps a cup. To describe the cup's orientation, you need three angles (roll, pitch, yaw). But these angles don't live on a flat Euclidean space — they live on a curved surface called a manifold. Specifically, 3D rotations form the group SO(3) (Special Orthogonal group in 3 dimensions).

Why can't we just use a Gaussian on Euler angles? Because of gimbal lock (singularities where two axes align) and wrapping (an angle of 361° is the same as 1°). A Gaussian centered at 350° with σ = 20° would naively split probability between 330-370, but 370° = 10°, so the distribution should wrap around.

The standard solution uses quaternions (4D unit vectors) or rotation matrices (3×3 orthogonal matrices). On these representations, we define distributions directly on the manifold.

Key Distributions on SO(3)

The Bingham distribution is the analog of a Gaussian on the unit sphere. For a unit quaternion q:

p(q) ∝ exp(qT M q)

where M is a symmetric matrix controlling the concentration and preferred direction. The Matrix Fisher distribution works directly on rotation matrices R:

p(R) ∝ exp(tr(FT R))

where F encodes the mean rotation and concentration.

SE(3) = SO(3) + translation: A full 6-DoF pose (position + orientation) lives in SE(3). Distributions on SE(3) combine a spatial distribution (e.g., Gaussian on R3) with a rotational distribution (e.g., Bingham on SO(3)). This is essential for SLAM, grasp planning, and 3D object detection.
Rotational Uncertainty on a Sphere

Samples from a distribution on SO(3), projected to a sphere. Adjust concentration κ — low = spread out, high = concentrated around the mean rotation.

Concentration κ10
python
import numpy as np
from scipy.spatial.transform import Rotation

def sample_rotation_vmf(mean_axis, kappa, n=100):
    """Sample rotations concentrated around an axis using von Mises-Fisher."""
    # Sample angle from vMF on the circle
    angles = np.random.vonmises(0, kappa, n)
    # Create rotation objects
    rotations = [Rotation.from_rotvec(a * mean_axis) for a in angles]
    return rotations

samples = sample_rotation_vmf(np.array([0,0,1]), kappa=10)
Why can't we simply use a Gaussian distribution over Euler angles for 3D rotations?

Chapter 3: Wrapped Cauchy

You're studying wind directions at a weather station. Wind can blow from any compass heading: 0° to 360°, and then it wraps back to 0°. You need a probability distribution on the circle. The von Mises distribution (the "circular Gaussian") is one option, but what if your data has heavy tails — occasional extreme deviations from the mean direction?

The Wrapped Cauchy distribution is the circular analog of the Cauchy distribution. Take a Cauchy distribution on the real line, wrap it around the unit circle, and you get a distribution with much heavier tails than von Mises. Its PDF on the circle [0, 2π) is:

f(θ; μ, ρ) = (1 / 2π) · (1 − ρ2) / (1 + ρ2 − 2ρ cos(θ − μ))

Here μ is the mean direction and ρ ∈ [0, 1) is the concentration parameter. When ρ = 0, the distribution is uniform on the circle. As ρ → 1, it concentrates around μ — but always with heavier tails than von Mises at the same concentration.

Why heavy tails on a circle? In directional statistics, "heavy tails" means the distribution assigns more probability to directions far from the mean. A weather vane that occasionally spins wildly, or a migrating bird with occasional off-course deviations — the Wrapped Cauchy captures this better than von Mises.
Wrapped Cauchy vs von Mises

Compare the two circular distributions. The orange is Wrapped Cauchy, teal is von Mises. Notice the heavier tails of the Wrapped Cauchy at the same concentration.

Concentration ρ0.50
Mean μ0.0
python
import numpy as np

def wrapped_cauchy_pdf(theta, mu, rho):
    """Wrapped Cauchy PDF on [0, 2pi)."""
    return (1-rho**2) / (2*np.pi * (1+rho**2 - 2*rho*np.cos(theta-mu)))

# Sample via inverse CDF
def sample_wrapped_cauchy(mu, rho, n=1000):
    u = np.random.uniform(0, 1, n)
    return (mu + 2*np.arctan2((1-rho)*np.tan(np.pi*(u-0.5)), 1+rho)) % (2*np.pi)
What does the parameter ρ = 0 give in the Wrapped Cauchy distribution?

Chapter 4: Generalized Pareto

You're monitoring a river's water level. Most days it's between 2-4 meters. But during a flood, it could hit 10, 15, even 20 meters. Insurance companies, engineers, and risk analysts don't care about typical days — they need to model the extreme events. How high could the river get in a once-per-century flood?

The Generalized Pareto Distribution (GPD) is the theoretically justified distribution for modeling exceedances over a high threshold. By the Pickands-Balkema-de Haan theorem, the distribution of values above a high threshold converges to a GPD, regardless of the original distribution.

F(x) = 1 − (1 + ξ x / σ)−1/ξ,   x ≥ 0

The shape parameter ξ (xi) controls the tail behavior. When ξ > 0, the tail is heavy (power-law decay — extreme events are relatively likely). When ξ = 0, it reduces to the exponential distribution. When ξ < 0, the distribution has a finite upper bound at -σ/ξ.

Extreme Value Theory in one sentence: If you only care about the worst-case outcomes (floods, market crashes, server failures), the GPD is your distribution. It's not a modeling choice — it's a mathematical theorem that extremes behave this way.
ξ valueTail typeExample
ξ > 0Heavy (Pareto-like)Financial losses, insurance claims
ξ = 0ExponentialWaiting times above threshold
ξ < 0BoundedPhysical measurements with limits
GPD: Shape Controls the Tail

Adjust ξ to see how the tail weight changes. Orange is the GPD density, gray dashed is exponential for reference.

Shape ξ0.30
Scale σ1.0
python
from scipy import stats

# Fit GPD to exceedances above a threshold
data = np.random.pareto(2, 10000)  # heavy-tailed data
threshold = np.percentile(data, 95)
exceedances = data[data > threshold] - threshold

shape, loc, scale = stats.genpareto.fit(exceedances, floc=0)
print(f"Shape xi = {shape:.3f}, Scale sigma = {scale:.3f}")
What makes the GPD special compared to fitting a Gaussian to extreme events?

Chapter 5: Normalizing Flows

What if you want a distribution that can be any shape? Not Gaussian, not Pareto, not anything from a textbook — but a complex, multimodal, asymmetric blob learned from data? Normalizing flows let you start with a simple distribution (like a Gaussian) and warp it through a sequence of invertible transformations until it matches your target.

The key idea: if z ~ N(0, I) and x = f(z) where f is an invertible, differentiable function, then the density of x is given by the change of variables formula:

p(x) = pz(f-1(x)) · |det(Jf-1(x))|

where J is the Jacobian matrix of partial derivatives. The determinant tells you how much the transformation stretches or compresses space locally.

By stacking K invertible layers, x = fK ˆ ... ˆ f1(z), you can learn arbitrarily complex distributions. Each layer must be (1) invertible (so you can compute f-1), and (2) have a tractable Jacobian determinant (so you can evaluate the density).

Popular flow architectures: RealNVP uses affine coupling layers (splits dimensions and scales half conditioned on the other half). GLOW adds invertible 1×1 convolutions. Neural Spline Flows use monotonic rational-quadratic splines for maximum flexibility.
Flow: Warping a Gaussian

Watch a simple Gaussian (left) get transformed through flow layers. Each layer stretches and bends space. The result (right) can be multimodal.

Flow layers0
python
import torch
from torch import nn

class AffineFlow(nn.Module):
    """Simplest normalizing flow: x = z * exp(s) + t"""
    def __init__(self):
        super().__init__()
        self.s = nn.Parameter(torch.zeros(1))
        self.t = nn.Parameter(torch.zeros(1))

    def forward(self, z):
        x = z * torch.exp(self.s) + self.t
        log_det = self.s  # log|det J| = s
        return x, log_det

    def inverse(self, x):
        z = (x - self.t) * torch.exp(-self.s)
        return z
What mathematical formula lets us compute the density of the transformed variable in a normalizing flow?

Chapter 6: Squashed Gaussian

You're training a robot to control a motor. The action (torque) must be between -1 and +1. Your policy network outputs a mean μ and standard deviation σ, defining a Gaussian. But a Gaussian has infinite support — it assigns probability to torques of +50 or -100, which are physically impossible.

The squashed Gaussian fixes this by passing the Gaussian sample through tanh, which squishes any real number into (-1, 1). If u ~ N(μ, σ2) and a = tanh(u), then a is guaranteed to be in (-1, 1). But we need the density of a, not u.

π(a | s) = μu(u | s) · |da/du|-1 = μu(u | s) / (1 − tanh2(u))

Since tanh'(u) = 1 - tanh2(u), we divide by this factor. In log-probability (which is what we use for policy gradients):

log π(a) = log N(u; μ, σ2) − ∑ log(1 − ai2)
Where this lives: Soft Actor-Critic (SAC), one of the most successful continuous-control RL algorithms, uses the squashed Gaussian as its policy distribution. Without the tanh squashing, SAC would sample invalid actions. Without the log-probability correction, the policy gradient would be wrong.
Gaussian vs Squashed Gaussian

Gray is the raw Gaussian. Orange is after tanh squashing. Notice the density piles up near the boundaries ±1.

Mean μ0.0
Std σ1.0
python
import torch
from torch.distributions import Normal

def squashed_gaussian_sample(mu, sigma):
    dist = Normal(mu, sigma)
    u = dist.rsample()            # reparameterized sample
    a = torch.tanh(u)             # squash to (-1, 1)
    # Log-prob with correction
    log_prob = dist.log_prob(u) - torch.log(1 - a**2 + 1e-6)
    return a, log_prob.sum(-1)
Why do we subtract log(1 - a2) from the Gaussian log-probability in the squashed Gaussian?

Chapter 7: Skew-Normal

Stock returns are famously asymmetric. Losses tend to be larger and more sudden than gains. If you fit a Gaussian, you miss this asymmetry entirely — the bell curve is perfectly symmetric. You need a distribution that can lean to one side.

The Skew-Normal distribution adds a single parameter α to the Gaussian that controls asymmetry. Its PDF is:

f(x) = 2 · φ(x) · Φ(α x)

where φ(x) is the standard normal PDF and Φ(x) is the standard normal CDF. The parameter α controls skewness: α = 0 gives the standard Gaussian, α > 0 skews right (long right tail), α < 0 skews left.

Why 2φ(x)Φ(αx)? Think of it this way: start with a symmetric Gaussian φ(x). Multiply by Φ(αx), which is the probability that a second Gaussian αx is below zero. When α > 0, positive values of x make Φ(αx) close to 1 (keeping the density), while negative values make it close to 0 (suppressing the density). The factor of 2 renormalizes.
ParameterEffect
α = 0Standard Gaussian (no skew)
α > 0Right skew (long right tail)
α < 0Left skew (long left tail)
|α| → ∞Approaches half-normal distribution
Skew-Normal: Adding Asymmetry

Adjust α to skew the distribution left or right. The gray dashed line is the symmetric Gaussian for reference.

Skewness α4.0
python
from scipy import stats

# skewnorm(a) where a = alpha (shape parameter)
x = np.linspace(-4, 8, 200)
for alpha in [-4, 0, 4, 8]:
    y = stats.skewnorm.pdf(x, alpha)
    print(f"alpha={alpha}: mean={stats.skewnorm.mean(alpha):.3f}, "
          f"skew={stats.skewnorm.stats(alpha, moments='s')[0]:.3f}")
What does the Skew-Normal distribution reduce to when α = 0?

Chapter 8: Generalized Normal

The Gaussian penalizes outliers with squared distance (x - μ)2. The Laplace distribution uses absolute distance |x - μ|, which is gentler on outliers. What if you want something in between? Or even sharper than Gaussian? The Generalized Normal Distribution (GND) provides a single "shape dial."

f(x) = β / (2 α Γ(1/β)) · exp(−|x − μ|β / αβ)

The parameter β controls the tail weight:

βDistributionTail weight
β = 1Laplace (double exponential)Heavy tails
β = 2GaussianMedium tails
β → ∞Uniform on [μ-α, μ+α]No tails
β < 1Super-heavy tailsExtremely heavy
In machine learning: L1 regularization assumes Laplace priors (β=1) on weights. L2 regularization assumes Gaussian priors (β=2). The GND unifies these: choosing β lets you interpolate between sparse (L1) and smooth (L2) solutions. Elastic net regularization roughly corresponds to a mixture.
Shape Parameter β: From Laplace to Uniform

Drag β to morph the distribution. Watch the tails grow heavy (β<2) or shrink (β>2). Gray dashed = Gaussian reference.

Shape β2.0
python
from scipy import stats

# scipy.stats.gennorm(beta) is the generalized normal
x = np.linspace(-5, 5, 300)
for beta in [0.5, 1, 2, 5]:
    y = stats.gennorm.pdf(x, beta)
    print(f"beta={beta}: kurtosis={stats.gennorm.stats(beta, moments='k')[0]:.2f}")
At what value of β does the Generalized Normal reduce to the standard Gaussian?

Chapter 9: Copulas

Two stocks both have heavy-tailed returns (modeled by, say, Student-t distributions). You know each stock's marginal distribution. But how do they move together? In calm markets they're weakly correlated, but during crashes they drop in lockstep. The dependency structure is separate from the individual distributions.

A copula separates the problem of modeling a multivariate distribution into two parts: (1) the marginals (each variable's individual distribution) and (2) the dependency structure (how they relate). Sklar's theorem guarantees this decomposition always exists:

F(x, y) = C(FX(x), FY(y))

where FX, FY are the marginal CDFs and C is the copula function mapping [0,1]2 → [0,1]. The copula C captures all the dependency information.

Common Copulas

CopulaDependency typeTail dependence
GaussianSymmetric, no tail dependenceNone
ClaytonLower tail dependenceLeft tail only
GumbelUpper tail dependenceRight tail only
FrankSymmetric, no tail dependenceNone
Student-tSymmetric tail dependenceBoth tails
The 2008 financial crisis and copulas: Banks used Gaussian copulas to price CDOs, assuming no tail dependence between mortgage defaults. When the housing market crashed, defaults happened simultaneously — exactly the scenario a Gaussian copula says is near-impossible. The lesson: your copula choice encodes assumptions about extreme co-movements.
Copula Explorer: Dependency Structures

Samples from different copulas with uniform marginals. Notice how Clayton concentrates points in the lower-left (joint lows), while Gumbel clusters the upper-right (joint highs).

Dependence θ2.0
python
from scipy.stats import norm
import numpy as np

def gaussian_copula_sample(rho, n=500):
    """Sample from a Gaussian copula with correlation rho."""
    mean = [0, 0]
    cov = [[1, rho], [rho, 1]]
    z = np.random.multivariate_normal(mean, cov, n)
    u = norm.cdf(z)  # transform to uniform marginals
    return u
What does a copula separate in a multivariate distribution?

Chapter 10: Hawkes Process

An earthquake strikes. Over the next few hours, the rate of aftershocks spikes dramatically, then slowly decays. Each aftershock can itself trigger more aftershocks. A tweet goes viral: every retweet increases the chance of more retweets. These are self-exciting processes — events make future events more likely.

The Hawkes process is a point process where the event rate (intensity) at time t depends on past events. Its conditional intensity is:

λ(t) = μ + ∑ti < t α · β · exp(−β(t − ti))

Three parameters tell the whole story: μ is the background rate (events that happen spontaneously). α is the excitation (how much each event boosts the rate). β is the decay rate (how fast the excitement fades). The process is stable (doesn't explode) when α < 1.

The branching ratio α: Each event triggers, on average, α child events. If α = 0.8, each earthquake causes 0.8 aftershocks on average, each aftershock causes 0.64 more, and so on. Total expected events from one trigger: 1/(1-α) = 5. If α ≥ 1, the process explodes — an infinite cascade.

Applications

DomainEventsSelf-excitation mechanism
SeismologyEarthquakesAftershock sequences
Social mediaPosts/retweetsViral cascades
FinanceTradesOrder flow clustering
CrimeIncidentsCopycat/retaliation
NeuroscienceNeural spikesPost-synaptic excitation
Hawkes Process: Self-Exciting Events

Watch events cascade. Each event (vertical tick) boosts the intensity (orange curve). Adjust α to see the process go from quiet to nearly explosive.

Background μ0.5
Excitation α0.50
Decay β2.0
python
def simulate_hawkes(mu, alpha, beta, T=100):
    """Simulate a univariate Hawkes process via thinning."""
    events = []
    t = 0
    while t < T:
        lam_bar = mu + alpha * beta * sum(
            np.exp(-beta*(t - ti)) for ti in events)
        dt = np.random.exponential(1 / lam_bar)
        t += dt
        if t >= T: break
        lam_t = mu + alpha * beta * sum(
            np.exp(-beta*(t - ti)) for ti in events)
        if np.random.uniform() < lam_t / lam_bar:
            events.append(t)
    return events
What happens when the branching ratio α exceeds 1 in a Hawkes process?

Chapter 11: Lévy / Stable Distributions

The Central Limit Theorem says that if you average many independent, finite-variance random variables, you get a Gaussian. But what if the variance is infinite? City sizes, wealth distributions, earthquake magnitudes, and stock market crashes all have such heavy tails that their variance doesn't converge to a finite number. The Gaussian CLT doesn't apply.

The stable distributions (also called Lévy alpha-stable) generalize the Gaussian to handle these cases. They satisfy a generalized CLT: the sum of heavy-tailed i.i.d. variables converges to a stable distribution.

A stable distribution has four parameters:

ParameterRoleRange
α (stability)Tail heaviness: smaller = heavier(0, 2]
β (skewness)Asymmetry[-1, 1]
γ (scale)Width> 0
δ (location)CenterAny real

When α = 2, you get the Gaussian. When α = 1 and β = 0, you get the Cauchy. For α < 2, the distribution has infinite variance. For α ≤ 1, even the mean is undefined.

Lévy flights: Random walks where step sizes follow a stable distribution with α < 2 produce Lévy flights — trajectories with occasional enormous jumps. This models animal foraging, particle diffusion in turbulent flows, and financial market jumps. The "fractal" quality of these walks is controlled by α.
Stable Distributions: From Gaussian to Cauchy

Drag α from 2 (Gaussian) toward 1 (Cauchy). Watch the tails grow heavier. The gray dashed Gaussian reference stays fixed.

Stability α2.00
Skewness β0.0
python
from scipy import stats

# scipy.stats.levy_stable(alpha, beta, loc, scale)
# alpha=2 → Gaussian, alpha=1, beta=0 → Cauchy
x = np.linspace(-10, 10, 500)
for alpha in [0.5, 1.0, 1.5, 2.0]:
    y = stats.levy_stable.pdf(x, alpha, beta=0)
    # Note: for alpha < 2, variance = infinity
What special property do stable distributions with α < 2 have?

Chapter 12: Diffusion / Score-Based

You want to generate photorealistic images. One approach: start with pure noise and gradually denoise it into a crisp image. But to denoise, you need to know the score function — the gradient of the log-density ∇x log p(x) — at every noise level. This is the core idea behind diffusion models.

The forward process gradually adds Gaussian noise to data x0 over T steps:

q(xt | xt-1) = N(xt; √(1 − βt) xt-1, βt I)

After enough steps, xT ≈ N(0, I) — pure noise. The noise schedulet} controls how fast the data is destroyed. Common schedules include:

ScheduleFormulaProperty
Linearβt linearly from 10-4 to 0.02Simple, original DDPM
Cosineβt from cosine of cumulative αGentler, better for small images
Sigmoidβt from sigmoid functionSmooth acceleration

The reverse process learns to denoise: a neural network sθ(xt, t) predicts the score (noise direction) at each step. The score matching loss is:

L = Et, x0, ε [ || sθ(xt, t) − ε ||2 ]
The key insight: At each noise level, xt follows a known distribution (a Gaussian convolution of the data). The score function ∇ log p(xt) points "uphill" toward regions of higher density. By following the score from pure noise back toward data, we reverse the diffusion.
Forward Diffusion: Destroying Data

A 1D "image" (initial data distribution, two peaks) gets progressively noisier. Drag the timestep to watch it dissolve into a Gaussian.

Timestep t0
python
import torch

def linear_schedule(T=1000, start=1e-4, end=0.02):
    return torch.linspace(start, end, T)

def forward_diffusion(x0, t, betas):
    alphas = 1 - betas
    alpha_bar = torch.cumprod(alphas, dim=0)[t]
    noise = torch.randn_like(x0)
    xt = torch.sqrt(alpha_bar) * x0 + torch.sqrt(1 - alpha_bar) * noise
    return xt, noise
In diffusion models, what does the neural network learn to predict during training?

Chapter 13: Taxonomy & Connections

We've covered 12 specialized distributions. Each exists because some domain has structure that standard distributions can't capture. Here's the complete taxonomy:

By Domain Structure

StructureDistributionStandard Alternative
Scores → probabilitiesSoftmax / GibbsNormalization
Rotation manifoldSO(3) / SE(3) (Bingham, Fisher)Gaussian on Euler angles
Circular domainWrapped Cauchyvon Mises
Extreme valuesGeneralized ParetoGaussian tail
Arbitrary shapeNormalizing FlowsGMM
Bounded actionsSquashed GaussianClipped Gaussian
Asymmetric noiseSkew-NormalGaussian
Adjustable tailsGeneralized NormalGaussian or Laplace
Dependency modelingCopulasMultivariate Gaussian
Self-excitationHawkes ProcessPoisson process
Infinite varianceLévy / StableGaussian (fails)
Noise schedulesDiffusion / Score-BasedVAE latent space

Relationship Map

Distribution Family Tree

How the 12 specialized distributions relate to standard ones and to each other. Hover over nodes to highlight connections.

When to Reach for Each

Quick decision guide: Need to convert scores to probs? → Softmax. Working with rotations? → SO(3). Circular data? → Wrapped Cauchy (heavy tails) or von Mises (light tails). Modeling rare extremes? → GPD. Need any-shape density? → Normalizing Flows. Bounded continuous actions? → Squashed Gaussian. Asymmetric bell curve? → Skew-Normal. Tunable tails? → Generalized Normal. Separate marginals from dependency? → Copulas. Events trigger events? → Hawkes Process. Power-law tails, no finite variance? → Lévy Stable. Generative model via denoising? → Diffusion.

Connections to Other Lessons

These distributions appear throughout modern ML and robotics:

Diffusion Models — Full lesson on the denoising framework using diffusion/score-based distributions.

Flow Matching — A simpler alternative to normalizing flows with straight-line paths.

RL Algorithms — SAC uses the squashed Gaussian policy.

VLAs — Robot policies use SO(3)/SE(3) distributions for pose prediction.

Classical SLAM — SE(3) distributions for robot pose uncertainty.

Bayesian Estimation — Conjugate priors, the foundation for many of these distributions.

The meta-lesson: Distributions are tools. The Gaussian is a hammer — versatile, essential, and wrong for most specialized jobs. Knowing which wrench, screwdriver, or saw to reach for is what separates a practitioner from someone who forces everything into a normal distribution and hopes for the best.

"All models are wrong, but some are useful." — George Box