Specialized Distributions — From Standard to Exotic

Chapter 0: Why Specialized?

You know the Gaussian. You know the Exponential. Maybe you know the Poisson and Binomial. These cover a huge range of problems. But try to model any of these situations and they fall apart:

A robot arm needs to express uncertainty over rotations in 3D space. Rotations live on a curved manifold, not a flat number line — a Gaussian centered at 359 degrees would put probability mass at 360, 361, 362... but those are the same as 0, 1, 2 degrees. Standard distributions have no idea that the space wraps around.

An RL agent chooses actions between -1 and +1. It learns a Gaussian policy, but Gaussians have infinite support — they assign probability to actions at +47, which is meaningless. You need a distribution that respects the bounds.

An earthquake happens. Now the probability of another earthquake spikes — aftershocks cluster. Standard distributions treat events as independent. You need a process where events excite more events.

The pattern: Specialized distributions exist because domains have structure. Circular spaces, bounded intervals, heavy tails, self-excitation, manifold geometry. Each distribution in this lesson was invented because someone hit a wall with the standard toolkit and had to build something new.

Standard vs Specialized: Where Gaussians Break

Toggle between scenarios to see why standard distributions fail. Each colored region shows the problem domain.

Why do we need specialized distributions beyond the Gaussian?

Because Gaussians are mathematically incorrect Because many real domains have structure (boundaries, curvature, self-excitation) that standard distributions cannot capture Because specialized distributions are always more accurate

Chapter 1: Softmax / Gibbs Distribution

You have a neural network that outputs raw scores (logits) for 5 classes: [2.0, 1.0, 0.1, -1.0, 3.5]. Which class should it pick? The scores aren't probabilities — they can be negative, they don't sum to 1. You need a way to convert arbitrary real numbers into a valid probability distribution.

The softmax function (also called the Gibbs distribution in statistical mechanics) does exactly this. For a vector of scores z = [z₁, ..., z_K], it produces:

p(i) = exp(z_i / τ) / ∑_j exp(z_j / τ)

The parameter τ (tau) is the temperature. At high temperature (τ → ∞), the distribution becomes uniform — all classes equally likely. At low temperature (τ → 0), all probability concentrates on the highest-scoring class. Think of τ as a "sharpness knob."

Why "Gibbs"? In statistical mechanics, the Boltzmann distribution assigns probability to energy states as p(state) ∝ exp(-E / kT). Replace energy E with negative logit -z and physical temperature kT with τ, and you get softmax. Neural networks borrowed this idea from 19th-century physics.

Properties

Property	Value
Input	Any real-valued vector z ∈ R^K
Output	Probability simplex: p_i > 0, ∑ p_i = 1
Temperature τ → 0	argmax (greedy selection)
Temperature τ → ∞	Uniform distribution
Shift invariance	softmax(z + c) = softmax(z)

Interactive: Temperature Controls Sharpness

Five logits are shown as bars. Adjust temperature to see how the probability distribution changes. Low temp = peaky. High temp = flat.

Temperature τ1.00

Worked Example

Logits: z = [2.0, 1.0, 0.1]. With τ = 1:

exp(2.0) = 7.389, exp(1.0) = 2.718, exp(0.1) = 1.105. Sum = 11.212.

p = [0.659, 0.242, 0.099]. The highest logit gets 66% of the probability mass.

python
import numpy as np

def softmax(z, tau=1.0):
    e = np.exp(z / tau - np.max(z / tau))  # subtract max for numerical stability
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.1, -1.0, 3.5])
print(softmax(logits, tau=1.0))   # peaked at class 4
print(softmax(logits, tau=0.1))   # nearly one-hot
print(softmax(logits, tau=10.0))  # nearly uniform

What happens to the softmax distribution as temperature τ approaches zero?

It becomes uniform over all classes All probabilities become zero All probability concentrates on the highest-scoring class

Chapter 2: SO(3)/SE(3) Distributions

A robot arm grasps a cup. To describe the cup's orientation, you need three angles (roll, pitch, yaw). But these angles don't live on a flat Euclidean space — they live on a curved surface called a manifold. Specifically, 3D rotations form the group SO(3) (Special Orthogonal group in 3 dimensions).

Why can't we just use a Gaussian on Euler angles? Because of gimbal lock (singularities where two axes align) and wrapping (an angle of 361° is the same as 1°). A Gaussian centered at 350° with σ = 20° would naively split probability between 330-370, but 370° = 10°, so the distribution should wrap around.

The standard solution uses quaternions (4D unit vectors) or rotation matrices (3×3 orthogonal matrices). On these representations, we define distributions directly on the manifold.

Key Distributions on SO(3)

The Bingham distribution is the analog of a Gaussian on the unit sphere. For a unit quaternion q:

p(q) ∝ exp(q^T M q)

where M is a symmetric matrix controlling the concentration and preferred direction. The Matrix Fisher distribution works directly on rotation matrices R:

p(R) ∝ exp(tr(F^T R))

where F encodes the mean rotation and concentration.

SE(3) = SO(3) + translation: A full 6-DoF pose (position + orientation) lives in SE(3). Distributions on SE(3) combine a spatial distribution (e.g., Gaussian on R³) with a rotational distribution (e.g., Bingham on SO(3)). This is essential for SLAM, grasp planning, and 3D object detection.

Rotational Uncertainty on a Sphere

Samples from a distribution on SO(3), projected to a sphere. Adjust concentration κ — low = spread out, high = concentrated around the mean rotation.

Concentration κ10

python
import numpy as np
from scipy.spatial.transform import Rotation

def sample_rotation_vmf(mean_axis, kappa, n=100):
    """Sample rotations concentrated around an axis using von Mises-Fisher."""
    # Sample angle from vMF on the circle
    angles = np.random.vonmises(0, kappa, n)
    # Create rotation objects
    rotations = [Rotation.from_rotvec(a * mean_axis) for a in angles]
    return rotations

samples = sample_rotation_vmf(np.array([0,0,1]), kappa=10)

Why can't we simply use a Gaussian distribution over Euler angles for 3D rotations?

Because Euler angles have singularities (gimbal lock) and wrapping issues that violate Gaussian assumptions Because Gaussians can only model 1D data Because rotations only come in discrete increments

Chapter 3: Wrapped Cauchy

You're studying wind directions at a weather station. Wind can blow from any compass heading: 0° to 360°, and then it wraps back to 0°. You need a probability distribution on the circle. The von Mises distribution (the "circular Gaussian") is one option, but what if your data has heavy tails — occasional extreme deviations from the mean direction?

The Wrapped Cauchy distribution is the circular analog of the Cauchy distribution. Take a Cauchy distribution on the real line, wrap it around the unit circle, and you get a distribution with much heavier tails than von Mises. Its PDF on the circle [0, 2π) is:

f(θ; μ, ρ) = (1 / 2π) · (1 − ρ²) / (1 + ρ² − 2ρ cos(θ − μ))

Here μ is the mean direction and ρ ∈ [0, 1) is the concentration parameter. When ρ = 0, the distribution is uniform on the circle. As ρ → 1, it concentrates around μ — but always with heavier tails than von Mises at the same concentration.

Why heavy tails on a circle? In directional statistics, "heavy tails" means the distribution assigns more probability to directions far from the mean. A weather vane that occasionally spins wildly, or a migrating bird with occasional off-course deviations — the Wrapped Cauchy captures this better than von Mises.

Wrapped Cauchy vs von Mises

Compare the two circular distributions. The orange is Wrapped Cauchy, teal is von Mises. Notice the heavier tails of the Wrapped Cauchy at the same concentration.

Concentration ρ0.50

Mean μ0.0

python
import numpy as np

def wrapped_cauchy_pdf(theta, mu, rho):
    """Wrapped Cauchy PDF on [0, 2pi)."""
    return (1-rho**2) / (2*np.pi * (1+rho**2 - 2*rho*np.cos(theta-mu)))

# Sample via inverse CDF
def sample_wrapped_cauchy(mu, rho, n=1000):
    u = np.random.uniform(0, 1, n)
    return (mu + 2*np.arctan2((1-rho)*np.tan(np.pi*(u-0.5)), 1+rho)) % (2*np.pi)

What does the parameter ρ = 0 give in the Wrapped Cauchy distribution?

A point mass at the mean direction A uniform distribution on the circle A standard Cauchy distribution on the real line

Chapter 4: Generalized Pareto

You're monitoring a river's water level. Most days it's between 2-4 meters. But during a flood, it could hit 10, 15, even 20 meters. Insurance companies, engineers, and risk analysts don't care about typical days — they need to model the extreme events. How high could the river get in a once-per-century flood?

The Generalized Pareto Distribution (GPD) is the theoretically justified distribution for modeling exceedances over a high threshold. By the Pickands-Balkema-de Haan theorem, the distribution of values above a high threshold converges to a GPD, regardless of the original distribution.

F(x) = 1 − (1 + ξ x / σ)^−1/ξ, x ≥ 0

The shape parameter ξ (xi) controls the tail behavior. When ξ > 0, the tail is heavy (power-law decay — extreme events are relatively likely). When ξ = 0, it reduces to the exponential distribution. When ξ < 0, the distribution has a finite upper bound at -σ/ξ.

Extreme Value Theory in one sentence: If you only care about the worst-case outcomes (floods, market crashes, server failures), the GPD is your distribution. It's not a modeling choice — it's a mathematical theorem that extremes behave this way.

ξ value	Tail type	Example
ξ > 0	Heavy (Pareto-like)	Financial losses, insurance claims
ξ = 0	Exponential	Waiting times above threshold
ξ < 0	Bounded	Physical measurements with limits

GPD: Shape Controls the Tail

Adjust ξ to see how the tail weight changes. Orange is the GPD density, gray dashed is exponential for reference.

Shape ξ0.30

Scale σ1.0

python
from scipy import stats

# Fit GPD to exceedances above a threshold
data = np.random.pareto(2, 10000)  # heavy-tailed data
threshold = np.percentile(data, 95)
exceedances = data[data > threshold] - threshold

shape, loc, scale = stats.genpareto.fit(exceedances, floc=0)
print(f"Shape xi = {shape:.3f}, Scale sigma = {scale:.3f}")

What makes the GPD special compared to fitting a Gaussian to extreme events?

Theory guarantees that threshold exceedances converge to a GPD regardless of the underlying distribution The GPD has more parameters than a Gaussian GPD is faster to compute

Chapter 5: Normalizing Flows

What if you want a distribution that can be any shape? Not Gaussian, not Pareto, not anything from a textbook — but a complex, multimodal, asymmetric blob learned from data? Normalizing flows let you start with a simple distribution (like a Gaussian) and warp it through a sequence of invertible transformations until it matches your target.

The key idea: if z ~ N(0, I) and x = f(z) where f is an invertible, differentiable function, then the density of x is given by the change of variables formula:

p(x) = p_z(f^-1(x)) · |det(J_f^-1(x))|

where J is the Jacobian matrix of partial derivatives. The determinant tells you how much the transformation stretches or compresses space locally.

By stacking K invertible layers, x = f_K ˆ ... ˆ f₁(z), you can learn arbitrarily complex distributions. Each layer must be (1) invertible (so you can compute f^-1), and (2) have a tractable Jacobian determinant (so you can evaluate the density).

Popular flow architectures: RealNVP uses affine coupling layers (splits dimensions and scales half conditioned on the other half). GLOW adds invertible 1×1 convolutions. Neural Spline Flows use monotonic rational-quadratic splines for maximum flexibility.

Flow: Warping a Gaussian

Watch a simple Gaussian (left) get transformed through flow layers. Each layer stretches and bends space. The result (right) can be multimodal.

Flow layers0

python
import torch
from torch import nn

class AffineFlow(nn.Module):
    """Simplest normalizing flow: x = z * exp(s) + t"""
    def __init__(self):
        super().__init__()
        self.s = nn.Parameter(torch.zeros(1))
        self.t = nn.Parameter(torch.zeros(1))

    def forward(self, z):
        x = z * torch.exp(self.s) + self.t
        log_det = self.s  # log|det J| = s
        return x, log_det

    def inverse(self, x):
        z = (x - self.t) * torch.exp(-self.s)
        return z

What mathematical formula lets us compute the density of the transformed variable in a normalizing flow?

Bayes' theorem The change of variables formula with the Jacobian determinant The central limit theorem

Chapter 6: Squashed Gaussian

You're training a robot to control a motor. The action (torque) must be between -1 and +1. Your policy network outputs a mean μ and standard deviation σ, defining a Gaussian. But a Gaussian has infinite support — it assigns probability to torques of +50 or -100, which are physically impossible.

The squashed Gaussian fixes this by passing the Gaussian sample through tanh, which squishes any real number into (-1, 1). If u ~ N(μ, σ²) and a = tanh(u), then a is guaranteed to be in (-1, 1). But we need the density of a, not u.

π(a | s) = μ_u(u | s) · |da/du|^-1 = μ_u(u | s) / (1 − tanh²(u))

Since tanh'(u) = 1 - tanh²(u), we divide by this factor. In log-probability (which is what we use for policy gradients):

log π(a) = log N(u; μ, σ²) − ∑ log(1 − a_i²)

Where this lives: Soft Actor-Critic (SAC), one of the most successful continuous-control RL algorithms, uses the squashed Gaussian as its policy distribution. Without the tanh squashing, SAC would sample invalid actions. Without the log-probability correction, the policy gradient would be wrong.

Gaussian vs Squashed Gaussian

Gray is the raw Gaussian. Orange is after tanh squashing. Notice the density piles up near the boundaries ±1.

Mean μ0.0

Std σ1.0

python
import torch
from torch.distributions import Normal

def squashed_gaussian_sample(mu, sigma):
    dist = Normal(mu, sigma)
    u = dist.rsample()            # reparameterized sample
    a = torch.tanh(u)             # squash to (-1, 1)
    # Log-prob with correction
    log_prob = dist.log_prob(u) - torch.log(1 - a**2 + 1e-6)
    return a, log_prob.sum(-1)

Why do we subtract log(1 - a²) from the Gaussian log-probability in the squashed Gaussian?

To account for the change-of-variables correction from the tanh transformation To make the distribution wider To ensure the mean is zero

Chapter 7: Skew-Normal

Stock returns are famously asymmetric. Losses tend to be larger and more sudden than gains. If you fit a Gaussian, you miss this asymmetry entirely — the bell curve is perfectly symmetric. You need a distribution that can lean to one side.

The Skew-Normal distribution adds a single parameter α to the Gaussian that controls asymmetry. Its PDF is:

f(x) = 2 · φ(x) · Φ(α x)

where φ(x) is the standard normal PDF and Φ(x) is the standard normal CDF. The parameter α controls skewness: α = 0 gives the standard Gaussian, α > 0 skews right (long right tail), α < 0 skews left.

Why 2φ(x)Φ(αx)? Think of it this way: start with a symmetric Gaussian φ(x). Multiply by Φ(αx), which is the probability that a second Gaussian αx is below zero. When α > 0, positive values of x make Φ(αx) close to 1 (keeping the density), while negative values make it close to 0 (suppressing the density). The factor of 2 renormalizes.

Parameter	Effect
α = 0	Standard Gaussian (no skew)
α > 0	Right skew (long right tail)
α < 0	Left skew (long left tail)
\|α\| → ∞	Approaches half-normal distribution

Skew-Normal: Adding Asymmetry

Adjust α to skew the distribution left or right. The gray dashed line is the symmetric Gaussian for reference.

Skewness α4.0

python
from scipy import stats

# skewnorm(a) where a = alpha (shape parameter)
x = np.linspace(-4, 8, 200)
for alpha in [-4, 0, 4, 8]:
    y = stats.skewnorm.pdf(x, alpha)
    print(f"alpha={alpha}: mean={stats.skewnorm.mean(alpha):.3f}, "
          f"skew={stats.skewnorm.stats(alpha, moments='s')[0]:.3f}")

What does the Skew-Normal distribution reduce to when α = 0?

A uniform distribution A half-normal distribution A standard Gaussian

Chapter 8: Generalized Normal

The Gaussian penalizes outliers with squared distance (x - μ)². The Laplace distribution uses absolute distance |x - μ|, which is gentler on outliers. What if you want something in between? Or even sharper than Gaussian? The Generalized Normal Distribution (GND) provides a single "shape dial."

f(x) = β / (2 α Γ(1/β)) · exp(−|x − μ|^β / α^β)

The parameter β controls the tail weight:

β	Distribution	Tail weight
β = 1	Laplace (double exponential)	Heavy tails
β = 2	Gaussian	Medium tails
β → ∞	Uniform on [μ-α, μ+α]	No tails
β < 1	Super-heavy tails	Extremely heavy

In machine learning: L1 regularization assumes Laplace priors (β=1) on weights. L2 regularization assumes Gaussian priors (β=2). The GND unifies these: choosing β lets you interpolate between sparse (L1) and smooth (L2) solutions. Elastic net regularization roughly corresponds to a mixture.

Shape Parameter β: From Laplace to Uniform

Drag β to morph the distribution. Watch the tails grow heavy (β<2) or shrink (β>2). Gray dashed = Gaussian reference.

Shape β2.0

python
from scipy import stats

# scipy.stats.gennorm(beta) is the generalized normal
x = np.linspace(-5, 5, 300)
for beta in [0.5, 1, 2, 5]:
    y = stats.gennorm.pdf(x, beta)
    print(f"beta={beta}: kurtosis={stats.gennorm.stats(beta, moments='k')[0]:.2f}")

At what value of β does the Generalized Normal reduce to the standard Gaussian?

β = 1 β = 2 β = ∞

Chapter 9: Copulas

Two stocks both have heavy-tailed returns (modeled by, say, Student-t distributions). You know each stock's marginal distribution. But how do they move together? In calm markets they're weakly correlated, but during crashes they drop in lockstep. The dependency structure is separate from the individual distributions.

A copula separates the problem of modeling a multivariate distribution into two parts: (1) the marginals (each variable's individual distribution) and (2) the dependency structure (how they relate). Sklar's theorem guarantees this decomposition always exists:

F(x, y) = C(F_X(x), F_Y(y))

where F_X, F_Y are the marginal CDFs and C is the copula function mapping [0,1]² → [0,1]. The copula C captures all the dependency information.

Common Copulas

Copula	Dependency type	Tail dependence
Gaussian	Symmetric, no tail dependence	None
Clayton	Lower tail dependence	Left tail only
Gumbel	Upper tail dependence	Right tail only
Frank	Symmetric, no tail dependence	None
Student-t	Symmetric tail dependence	Both tails

The 2008 financial crisis and copulas: Banks used Gaussian copulas to price CDOs, assuming no tail dependence between mortgage defaults. When the housing market crashed, defaults happened simultaneously — exactly the scenario a Gaussian copula says is near-impossible. The lesson: your copula choice encodes assumptions about extreme co-movements.

Copula Explorer: Dependency Structures

Samples from different copulas with uniform marginals. Notice how Clayton concentrates points in the lower-left (joint lows), while Gumbel clusters the upper-right (joint highs).

Dependence θ2.0

python
from scipy.stats import norm
import numpy as np

def gaussian_copula_sample(rho, n=500):
    """Sample from a Gaussian copula with correlation rho."""
    mean = [0, 0]
    cov = [[1, rho], [rho, 1]]
    z = np.random.multivariate_normal(mean, cov, n)
    u = norm.cdf(z)  # transform to uniform marginals
    return u

What does a copula separate in a multivariate distribution?

Mean from variance Marginal distributions from the dependency structure Discrete from continuous parts

Chapter 10: Hawkes Process

An earthquake strikes. Over the next few hours, the rate of aftershocks spikes dramatically, then slowly decays. Each aftershock can itself trigger more aftershocks. A tweet goes viral: every retweet increases the chance of more retweets. These are self-exciting processes — events make future events more likely.

The Hawkes process is a point process where the event rate (intensity) at time t depends on past events. Its conditional intensity is:

λ(t) = μ + ∑_{t_i < t} α · β · exp(−β(t − t_i))

Three parameters tell the whole story: μ is the background rate (events that happen spontaneously). α is the excitation (how much each event boosts the rate). β is the decay rate (how fast the excitement fades). The process is stable (doesn't explode) when α < 1.

The branching ratio α: Each event triggers, on average, α child events. If α = 0.8, each earthquake causes 0.8 aftershocks on average, each aftershock causes 0.64 more, and so on. Total expected events from one trigger: 1/(1-α) = 5. If α ≥ 1, the process explodes — an infinite cascade.

Applications

Domain	Events	Self-excitation mechanism
Seismology	Earthquakes	Aftershock sequences
Social media	Posts/retweets	Viral cascades
Finance	Trades	Order flow clustering
Crime	Incidents	Copycat/retaliation
Neuroscience	Neural spikes	Post-synaptic excitation

Hawkes Process: Self-Exciting Events

Watch events cascade. Each event (vertical tick) boosts the intensity (orange curve). Adjust α to see the process go from quiet to nearly explosive.

Background μ0.5

Excitation α0.50

Decay β2.0

python
def simulate_hawkes(mu, alpha, beta, T=100):
    """Simulate a univariate Hawkes process via thinning."""
    events = []
    t = 0
    while t < T:
        lam_bar = mu + alpha * beta * sum(
            np.exp(-beta*(t - ti)) for ti in events)
        dt = np.random.exponential(1 / lam_bar)
        t += dt
        if t >= T: break
        lam_t = mu + alpha * beta * sum(
            np.exp(-beta*(t - ti)) for ti in events)
        if np.random.uniform() < lam_t / lam_bar:
            events.append(t)
    return events

What happens when the branching ratio α exceeds 1 in a Hawkes process?

The process becomes explosive — events cascade without bound The process becomes periodic The process stops generating events

Chapter 11: Lévy / Stable Distributions

The Central Limit Theorem says that if you average many independent, finite-variance random variables, you get a Gaussian. But what if the variance is infinite? City sizes, wealth distributions, earthquake magnitudes, and stock market crashes all have such heavy tails that their variance doesn't converge to a finite number. The Gaussian CLT doesn't apply.

The stable distributions (also called Lévy alpha-stable) generalize the Gaussian to handle these cases. They satisfy a generalized CLT: the sum of heavy-tailed i.i.d. variables converges to a stable distribution.

A stable distribution has four parameters:

Parameter	Role	Range
α (stability)	Tail heaviness: smaller = heavier	(0, 2]
β (skewness)	Asymmetry	[-1, 1]
γ (scale)	Width	> 0
δ (location)	Center	Any real

When α = 2, you get the Gaussian. When α = 1 and β = 0, you get the Cauchy. For α < 2, the distribution has infinite variance. For α ≤ 1, even the mean is undefined.

Lévy flights: Random walks where step sizes follow a stable distribution with α < 2 produce Lévy flights — trajectories with occasional enormous jumps. This models animal foraging, particle diffusion in turbulent flows, and financial market jumps. The "fractal" quality of these walks is controlled by α.

Stable Distributions: From Gaussian to Cauchy

Drag α from 2 (Gaussian) toward 1 (Cauchy). Watch the tails grow heavier. The gray dashed Gaussian reference stays fixed.

Stability α2.00

Skewness β0.0

python
from scipy import stats

# scipy.stats.levy_stable(alpha, beta, loc, scale)
# alpha=2 → Gaussian, alpha=1, beta=0 → Cauchy
x = np.linspace(-10, 10, 500)
for alpha in [0.5, 1.0, 1.5, 2.0]:
    y = stats.levy_stable.pdf(x, alpha, beta=0)
    # Note: for alpha < 2, variance = infinity

What special property do stable distributions with α < 2 have?

They always have zero mean They have infinite variance They are always symmetric

Chapter 12: Diffusion / Score-Based

You want to generate photorealistic images. One approach: start with pure noise and gradually denoise it into a crisp image. But to denoise, you need to know the score function — the gradient of the log-density ∇_x log p(x) — at every noise level. This is the core idea behind diffusion models.

The forward process gradually adds Gaussian noise to data x₀ over T steps:

q(x_t | x_t-1) = N(x_t; √(1 − β_t) x_t-1, β_t I)

After enough steps, x_T ≈ N(0, I) — pure noise. The noise schedule {β_t} controls how fast the data is destroyed. Common schedules include:

Schedule	Formula	Property
Linear	β_t linearly from 10^-4 to 0.02	Simple, original DDPM
Cosine	β_t from cosine of cumulative α	Gentler, better for small images
Sigmoid	β_t from sigmoid function	Smooth acceleration

The reverse process learns to denoise: a neural network s_θ(x_t, t) predicts the score (noise direction) at each step. The score matching loss is:

L = E_{t, x₀, ε} [ || s_θ(x_t, t) − ε ||² ]

The key insight: At each noise level, x_t follows a known distribution (a Gaussian convolution of the data). The score function ∇ log p(x_t) points "uphill" toward regions of higher density. By following the score from pure noise back toward data, we reverse the diffusion.

Forward Diffusion: Destroying Data

A 1D "image" (initial data distribution, two peaks) gets progressively noisier. Drag the timestep to watch it dissolve into a Gaussian.

Timestep t0

python
import torch

def linear_schedule(T=1000, start=1e-4, end=0.02):
    return torch.linspace(start, end, T)

def forward_diffusion(x0, t, betas):
    alphas = 1 - betas
    alpha_bar = torch.cumprod(alphas, dim=0)[t]
    noise = torch.randn_like(x0)
    xt = torch.sqrt(alpha_bar) * x0 + torch.sqrt(1 - alpha_bar) * noise
    return xt, noise

In diffusion models, what does the neural network learn to predict during training?

The noise (or equivalently, the score function) at each timestep The final clean image directly The noise schedule parameters

Chapter 13: Taxonomy & Connections

We've covered 12 specialized distributions. Each exists because some domain has structure that standard distributions can't capture. Here's the complete taxonomy:

By Domain Structure

Structure	Distribution	Standard Alternative
Scores → probabilities	Softmax / Gibbs	Normalization
Rotation manifold	SO(3) / SE(3) (Bingham, Fisher)	Gaussian on Euler angles
Circular domain	Wrapped Cauchy	von Mises
Extreme values	Generalized Pareto	Gaussian tail
Arbitrary shape	Normalizing Flows	GMM
Bounded actions	Squashed Gaussian	Clipped Gaussian
Asymmetric noise	Skew-Normal	Gaussian
Adjustable tails	Generalized Normal	Gaussian or Laplace
Dependency modeling	Copulas	Multivariate Gaussian
Self-excitation	Hawkes Process	Poisson process
Infinite variance	Lévy / Stable	Gaussian (fails)
Noise schedules	Diffusion / Score-Based	VAE latent space

Relationship Map

Distribution Family Tree

How the 12 specialized distributions relate to standard ones and to each other. Hover over nodes to highlight connections.

When to Reach for Each

Quick decision guide: Need to convert scores to probs? → Softmax. Working with rotations? → SO(3). Circular data? → Wrapped Cauchy (heavy tails) or von Mises (light tails). Modeling rare extremes? → GPD. Need any-shape density? → Normalizing Flows. Bounded continuous actions? → Squashed Gaussian. Asymmetric bell curve? → Skew-Normal. Tunable tails? → Generalized Normal. Separate marginals from dependency? → Copulas. Events trigger events? → Hawkes Process. Power-law tails, no finite variance? → Lévy Stable. Generative model via denoising? → Diffusion.

Connections to Other Lessons

These distributions appear throughout modern ML and robotics:

Diffusion Models — Full lesson on the denoising framework using diffusion/score-based distributions.

Flow Matching — A simpler alternative to normalizing flows with straight-line paths.

RL Algorithms — SAC uses the squashed Gaussian policy.

VLAs — Robot policies use SO(3)/SE(3) distributions for pose prediction.

Classical SLAM — SE(3) distributions for robot pose uncertainty.

Bayesian Estimation — Conjugate priors, the foundation for many of these distributions.

The meta-lesson: Distributions are tools. The Gaussian is a hammer — versatile, essential, and wrong for most specialized jobs. Knowing which wrench, screwdriver, or saw to reach for is what separates a practitioner from someone who forces everything into a normal distribution and hopes for the best.

"All models are wrong, but some are useful." — George Box