Discrete Distributions — From Coins to Categories

Chapter 0: Why Discrete?

A self-driving car classifies each pixel as road, car, pedestrian, or sky. A spam filter says yes or no. A language model picks the next word from 50,000 candidates. A sensor counts how many packets arrived in the last second.

All of these are discrete outcomes — they come from a finite or countable set of possibilities. You can list them: heads or tails, category 1 through K, 0 events or 1 event or 2 events. There are no values "between" them. You can't get 2.7 heads in 5 coin flips.

Contrast this with continuous outcomes: a temperature reading of 37.284°C, a voltage of 3.14159V, a distance of 2.718 meters. Those live on the real number line with infinitely many values between any two points. Discrete outcomes snap to specific, separated values.

Why does this distinction matter? Because it determines the mathematical tools you'll use. Discrete distributions let you sum probabilities directly: P(X = 3) + P(X = 4) = p(3) + p(4). Continuous distributions require integration: P(3 ≤ X ≤ 4) = ∫₃⁴ f(x)dx. In practice, many AI systems deliberately convert continuous quantities into discrete ones — quantization in neural networks, tokenization in language models, binning in histograms — precisely because discrete distributions are easier to work with computationally.

The cumulative distribution function (CDF) works for both: F(k) = P(X ≤ k). For discrete distributions, the CDF is a staircase function that jumps at each supported value. The jump height at k equals p(k). CDFs are useful for answering "at most" and "at least" questions: P(X ≥ 5) = 1 − F(4).

PMF vs PDF

Continuous distributions (like the Gaussian) assign probability densities to intervals — you must integrate to get actual probabilities. Discrete distributions assign actual probabilities to individual outcomes directly. The function that does this is the probability mass function (PMF):

P(X = k) = p(k), where ∑_k p(k) = 1

Each bar in a PMF chart has a definite height between 0 and 1, and all bars sum to exactly 1. No integrals needed — just addition. This is why discrete distributions are often easier to work with: you can enumerate all possibilities and add up their probabilities.

The big picture: Every discrete distribution answers a specific type of counting question. Bernoulli: "yes or no?" Binomial: "how many yeses in n tries?" Poisson: "how many events in a window?" Knowing which question you're asking tells you which distribution to reach for.

Discrete vs Continuous: A Quick Guide

How do you know if your problem is discrete? Ask yourself: can the outcome be listed? If you can write down all possible values (even if the list is infinite, like 0, 1, 2, 3, ...), it's discrete. If the outcome can take any real value in some interval (like a temperature between 36.0 and 42.0), it's continuous.

Feature	Discrete (PMF)	Continuous (PDF)
Outcome space	Countable (finite or infinite list)	Uncountable (real intervals)
P(X = x)	Nonzero for each outcome	Always zero for any single point
Sum/integral to 1	∑ p(k) = 1	∫ f(x) dx = 1
Examples	Coin flips, counts, categories	Heights, voltages, durations

The Nine Distributions

In this lesson, we'll build up a family of nine distributions, each designed for a different scenario you'll encounter in AI, robotics, and estimation:

Binary outcome

Bernoulli → Binomial → Geometric → Neg. Binomial

↓

Multi-category

Categorical → Multinomial

↓

Count / rate

Poisson · Hypergeometric · Discrete Uniform

PMF vs PDF — Spot the Difference

Left: a discrete PMF (bars sum to 1). Right: a continuous PDF (area under curve = 1). Toggle between them to see the fundamental difference.

Why it matters for AI: Classification outputs are Categorical. Token prediction is Categorical. Counting detections is Poisson. Sampling batches uses Hypergeometric logic. Every probabilistic model in machine learning is built on these distributions. Without them, you cannot define loss functions (cross-entropy needs the Categorical), generative models (VAEs need sampling), or Bayesian inference (priors and posteriors are distributions).

Reading PMF Charts

Throughout this lesson, every distribution gets an interactive bar chart. Here's how to read them:

The x-axis shows the possible outcomes (the support). The y-axis shows the probability of each outcome. Each bar's height is P(X = k) for that value of k. The bars must always sum to 1 — probability has to go somewhere.

Key patterns to watch for as you explore:

Shape	What It Means	Example
Tall narrow peak	High confidence, low variance	Bernoulli with p near 0 or 1
Flat/uniform	Maximum uncertainty	Fair die, Discrete Uniform
Right-skewed	Most values small, occasional large ones	Poisson with small λ
Symmetric bell	Values cluster around the mean	Binomial with p ≈ 0.5
Long tail	Rare but possible extreme values	Geometric with small p

What does a probability mass function (PMF) assign to each outcome?

A density that must be integrated over an interval A probability between 0 and 1, with all values summing to 1 A weight that can exceed 1 as long as the total is finite

Chapter 1: The Bernoulli Distribution

The simplest possible random experiment: something happens, or it doesn't. A coin lands heads. An email is spam. A patient tests positive. A neural network predicts "cat." There are exactly two outcomes, and we label them 1 (success) and 0 (failure).

A Bernoulli random variable X (named after Swiss mathematician Jacob Bernoulli, 1655-1705) has a single parameter p — the probability of success. Its PMF is beautifully compact:

P(X = k) = p^k(1 − p)^1−k, k ∈ {0, 1}

Let's unpack this formula. When k = 1 (success): P(X = 1) = p¹(1 − p)⁰ = p. When k = 0 (failure): P(X = 0) = p⁰(1 − p)¹ = 1 − p. The two probabilities sum to exactly 1. That's the entire distribution.

The mean (expected value) is E[X] = 0·(1−p) + 1·p = p. This makes intuitive sense: the average value of many Bernoulli draws converges to p (the law of large numbers). The variance is Var(X) = p(1 − p). Notice: variance is maximized at p = 0.5 (maximum uncertainty, like a fair coin) and drops to zero at p = 0 or p = 1 (certain outcome).

Property	Value
Support	{0, 1}
Mean	p
Variance	p(1 − p)
Parameters	p ∈ [0, 1]

Where you'll see it: Binary classification (cat vs dog), logistic regression output (the sigmoid function outputs a Bernoulli parameter p), each bit in a communication channel, each pixel in a binary image, dropout in neural networks (each neuron is kept with probability p). The Bernoulli is the atom from which Binomial, Geometric, and Negative Binomial are built.

The Entropy Connection

The entropy of a Bernoulli is H = −p log₂ p − (1−p) log₂(1−p). This is maximized at p = 0.5 (1 bit of information — you need exactly one bit to encode the outcome) and zero at p = 0 or p = 1 (no uncertainty, zero bits needed). Binary cross-entropy loss in neural network training is exactly this formula applied to each prediction, using natural log instead of log₂. It measures how far the model's predicted p is from the true label.

When your model's sigmoid output gives p = 0.5 for every input, it has maximum entropy — it's learned nothing. Training drives p toward 0 or 1 for each example, reducing entropy and increasing the model's confidence.

Bernoulli PMF Explorer

Drag p to see how the two bars change. The variance peaks at p = 0.5 — maximum uncertainty about the outcome.

p (success prob)0.50

Deriving the Variance

Let's derive the variance from scratch, step by step. We need E[X²] and E[X].

E[X] = 0 · (1−p) + 1 · p = p.

E[X²] = 0² · (1−p) + 1² · p = p. (For Bernoulli, E[X²] = E[X] since X is 0 or 1.)

Var(X) = E[X²] − (E[X])² = p − p² = p(1−p).

This parabola opens downward, hitting zero at the endpoints (certainty) and peaking at 0.25 when p = 0.5 (maximum confusion). The fact that X² = X for Bernoulli is a unique property — it only works because the support is {0, 1}.

Worked Example

A spam filter assigns p = 0.85 that an email is spam. What's the probability it's not spam? What's the uncertainty?

P(not spam) = 1 − 0.85 = 0.15. The variance is 0.85 × 0.15 = 0.1275. Compare this to a filter that outputs p = 0.99: variance = 0.99 × 0.01 = 0.0099. The more confident the classifier, the lower the variance — which is exactly what we want in a well-calibrated model.

python
from scipy.stats import bernoulli

rv = bernoulli(p=0.85)
rv.pmf(1)    # 0.85  — P(spam)
rv.pmf(0)    # 0.15  — P(not spam)
rv.mean()    # 0.85
rv.var()     # 0.1275

# Simulate 1000 emails
samples = rv.rvs(size=1000)
print(samples.mean())  # ≈ 0.85

# Binary cross-entropy loss
import numpy as np
y_true = 1  # actual label
p_pred = 0.85
bce = -(y_true * np.log(p_pred) + (1 - y_true) * np.log(1 - p_pred))
print(bce)  # 0.163 — low loss for a good prediction

A Bernoulli variable has p = 0.3. What is its variance?

0.30 0.21 0.09

Chapter 2: The Categorical Distribution

A Bernoulli picks between two outcomes. But what if your die has K faces? A language model picks from 50,000 tokens. An image classifier chooses from 1,000 classes. A robot decides among 8 possible actions. You need a distribution over K categories, each with its own probability.

The Categorical distribution (also called a generalized Bernoulli, a discrete distribution, or sometimes a "multinoulli") has K parameters p₁, p₂, ..., p_K where each p_i ≥ 0 and they sum to 1. Only K−1 parameters are free since p_K = 1 − p₁ − ... − p_K−1. The PMF for outcome i is simply:

P(X = i) = p_i, i ∈ {1, 2, ..., K}

That's it. No fancy formula — you just look up the probability for category i. The beauty is in what this enables: by assigning different probabilities to different categories, you can encode everything from a fair die (all p_i = 1/K) to a completely confident classifier (one p_i = 1, all others = 0), to anything in between. The space of all possible Categorical distributions over K categories forms a (K−1)-dimensional simplex — a triangle for K=3, a tetrahedron for K=4.

For any single category i, the indicator "did category i occur?" is itself a Bernoulli(p_i). So the mean of that indicator is p_i and its variance is p_i(1 − p_i). The entropy of the Categorical is H = −∑ p_i log p_i, which measures how "spread out" the distribution is. A uniform Categorical has maximum entropy (maximum uncertainty); a one-hot distribution (one p_i = 1, rest = 0) has zero entropy (no uncertainty).

Property	Value
Support	{1, 2, ..., K}
Mean (of indicator i)	p_i
Variance (of indicator i)	p_i(1 − p_i)
Parameters	p₁, ..., p_K with ∑p_i = 1

The softmax connection: Every neural network classifier ends with a softmax layer that converts raw scores (logits) into a Categorical distribution. When GPT predicts the next word, it outputs a Categorical over the entire vocabulary. Temperature scaling reshapes this distribution — low temperature makes it peaky (confident), high temperature makes it flat (random). Temperature = 0 gives argmax (greedy decoding).

Sampling Strategies

Once you have a Categorical distribution, how do you pick from it? There are several strategies, each used in different AI contexts:

Strategy	Method	Use Case
Greedy (argmax)	Always pick the highest-probability category	Classification, deterministic decoding
Sampling	Draw randomly according to the probabilities	Creative text generation
Top-k	Zero out all but the top k probabilities, renormalize, then sample	Focused but diverse generation
Nucleus (top-p)	Keep the smallest set of categories whose cumulative probability ≥ p	Adaptive diversity in LLMs

All of these operate on the same underlying Categorical distribution — they just filter or reshape it before sampling.

One-Hot Encoding

In neural networks, we typically represent a Categorical outcome as a one-hot vector: a vector of all zeros except for a 1 in the position of the chosen category. For K = 4 classes, outcome 2 becomes [0, 1, 0, 0]. The cross-entropy loss between the true one-hot vector and the predicted probability vector is:

L = −∑_i y_i log(p_i) = −log(p_{true class})

Since only one y_i is nonzero, the loss simplifies to the negative log of the predicted probability for the true class. Minimizing cross-entropy is equivalent to maximizing the likelihood of the Categorical distribution — it's the same optimization problem viewed from two angles. When the model assigns p = 0.99 to the correct class, the loss is −log(0.99) = 0.01 (tiny). When it assigns p = 0.01, the loss is −log(0.01) = 4.6 (huge). This logarithmic penalty harshly punishes confident wrong answers.

Categorical PMF — Loaded Die

Adjust weights for face 1 and face 6. All weights auto-normalize to sum to 1. Faces 2-5 have fixed weight 1.0. Watch how loading one face steals probability from the others.

Face 1 weight1.0

Face 6 weight1.0

Worked Example

A language model outputs logits [2.0, 1.0, 0.5] for three tokens. Softmax converts these to probabilities:

p₁ = e^2.0 / (e^2.0 + e^1.0 + e^0.5) = 7.389 / (7.389 + 2.718 + 1.649) = 7.389 / 11.756 ≈ 0.628

Similarly p₂ ≈ 0.231, p₃ ≈ 0.140. The model then samples from this Categorical to pick a token.

python
import numpy as np

logits = np.array([2.0, 1.0, 0.5])
probs = np.exp(logits) / np.sum(np.exp(logits))  # softmax
print(probs)  # [0.628, 0.231, 0.140]

# Sample a token
token = np.random.choice(len(probs), p=probs)

# Temperature scaling (T=0.5 = more confident)
T = 0.5
probs_cold = np.exp(logits / T) / np.sum(np.exp(logits / T))
print(probs_cold)  # [0.794, 0.145, 0.061] — peakier

A softmax output over 4 classes is [0.1, 0.6, 0.2, 0.1]. Which class is most likely?

Class 1 Class 2 Class 3

Chapter 3: The Binomial Distribution

Now we extend the Bernoulli from a single trial to many. You run a Bernoulli experiment n times, independently. A medical trial tests a drug on 100 patients — each patient either responds or doesn't. You send 50 network packets through a lossy channel — each arrives or gets lost. You classify 200 images and want to know how many are correct. An A/B test shows a new landing page to 1,000 visitors — each visitor converts or bounces.

The question in every case: how many successes in n independent trials?

The Binomial distribution Bin(n, p) counts the total number of successes k out of n independent Bernoulli(p) trials. Each trial has the same probability p of success, and trials don't affect each other. Its PMF is:

P(X = k) = C(n, k) · p^k · (1 − p)^n−k

Let's derive why this formula makes sense. Any specific sequence with exactly k successes and (n−k) failures has probability p^k(1−p)^n−k. But there are many such sequences — the successes can fall on any k of the n positions. The number of ways to choose those k positions is the binomial coefficient C(n, k) = n! / (k!(n−k)!). Multiply the two together and you get the PMF.

The mean is E[X] = np (each trial contributes p on average, and expectation is linear). The variance is Var(X) = np(1−p). Why? Each trial is an independent Bernoulli(p) with variance p(1−p). Since the trials are independent, variances add: np(1−p). This additivity of variance only works because the trials don't influence each other.

The Binomial Coefficient

C(n, k) = n! / (k!(n−k)!) deserves a moment. For n = 10, k = 3: C(10, 3) = 10!/(3! × 7!) = (10 × 9 × 8)/(3 × 2 × 1) = 120. There are 120 different ways to place 3 successes among 10 trial positions. You might hear this called "n choose k" — it counts combinations (order doesn't matter). A common computational trick: compute C(n, k) iteratively as a product to avoid overflow from large factorials.

Pascal's triangle gives all binomial coefficients: each entry is the sum of the two entries above it. The identity C(n, k) = C(n−1, k−1) + C(n−1, k) has a beautiful probabilistic interpretation: the k successes in n trials either include the last trial (in which case the first n−1 trials must contain k−1 successes) or they don't (in which case the first n−1 trials contain all k successes). This recursive structure is why dynamic programming can efficiently compute binomial coefficients.

Property	Value
Support	{0, 1, 2, ..., n}
Mean	np
Variance	np(1 − p)
Parameters	n ∈ {1, 2, ...}, p ∈ [0, 1]

Where you'll see it: A/B testing (how many users convert out of n visitors), error counting in communication channels, number of defective items in a batch, accuracy of a classifier on a test set (k correct out of n), dropout masks (how many neurons survive out of n).

Normal approximation: When n is large and p is not too close to 0 or 1, the Binomial is well-approximated by a Normal distribution: Bin(n, p) ≈ N(np, np(1−p)). A common rule of thumb: the approximation works when np ≥ 5 and n(1−p) ≥ 5. This is the basis of many statistical tests, including z-tests for proportions.

Binomial in Practice: The Classifier Test

Suppose your classifier achieves 92% accuracy on a test set of n = 500. Is this significantly better than random (50%)? Under the null hypothesis (random guessing), the number of correct predictions follows Bin(500, 0.5). The expected correct count would be 250 with std dev = √(500 × 0.5 × 0.5) = 11.18. Your observed 460 correct predictions is (460 − 250)/11.18 = 18.8 standard deviations above the mean — astronomically significant. But is 92% significantly better than 90%? Now the null is Bin(500, 0.90), with mean 450 and std dev 6.71. Your observed 460 is (460 − 450)/6.71 = 1.49 standard deviations — not quite significant at the 95% level. The Binomial provides the math to answer these questions rigorously.

Binomial PMF Explorer

Drag n and p to see how the distribution shifts and spreads. At p = 0.5, the distribution is symmetric. Large n makes it approach a bell curve.

n (trials)10

p (success prob)0.50

Worked Example

An A/B test shows 200 visitors version B, with a 5% conversion rate (p = 0.05). What's the probability of exactly 15 conversions?

X ~ Bin(200, 0.05). Mean = np = 200 × 0.05 = 10. Std dev = √(np(1−p)) = √(200 × 0.05 × 0.95) = √9.5 ≈ 3.08.

P(X = 15) = C(200, 15) · 0.05¹⁵ · 0.95¹⁸⁵ ≈ 0.0346. That's about 1.6 standard deviations above the mean — uncommon but not rare.

What if we want a confidence interval? Using the Normal approximation: 95% of the time, conversions will be in the range np ± 1.96√(np(1−p)) = 10 ± 6.04, so between about 4 and 16. If we observe 20 conversions, that's over 3 standard deviations from the mean — strong evidence that the true rate is higher than 5%.

python
from scipy.stats import binom

rv = binom(n=200, p=0.05)
rv.pmf(15)    # 0.0346
rv.mean()    # 10.0
rv.std()     # 3.08

# P(at least 15 conversions) — upper tail
1 - rv.cdf(14)  # 0.0559

# 95% confidence interval
lo, hi = rv.ppf(0.025), rv.ppf(0.975)
print(f"95% CI: [{lo:.0f}, {hi:.0f}]")  # [4, 16]

You flip a fair coin 20 times. What is the expected number of heads?

10 20 5

Chapter 4: The Poisson Distribution

How many customers arrive at a store per hour? How many photons hit a sensor per millisecond? How many server errors occur per day? How many typos appear on a page? How many objects does a lidar detect per scan? These are all counts of events happening at some average rate in a fixed interval. The events are independent, could theoretically be any non-negative integer, and there is no hard upper limit on the count.

Unlike the Binomial, which requires a fixed number of trials n, the Poisson has no upper bound. Events can keep happening. This makes it ideal for situations where the "number of opportunities" is either unknown or effectively infinite.

The Poisson distribution (named after French mathematician Simeon Denis Poisson, 1781-1840) Pois(λ) has a single parameter λ (lambda, Greek letter for "rate") — the average number of events per interval. Its PMF is:

P(X = k) = λ^k e^−λ / k!

Where does this come from? Imagine dividing the interval into n tiny sub-intervals, each with probability p = λ/n of containing an event. The count of events is Bin(n, λ/n). As n → ∞:

C(n, k) · (λ/n)^k · (1 − λ/n)^n−k → λ^k/k! · e^−λ

The factor e^−λ is the probability of zero events — the "nothing happens" baseline. It's surprisingly large for small λ: when λ = 1, there's a 37% chance of zero events. The factor λ^k/k! distributes the remaining probability among configurations with exactly k events, with the factorial ensuring proper normalization.

The elegant thing about the Poisson: λ is simultaneously the mean AND the variance. No other common distribution has this property. When λ is small (say, 0.5), the distribution is heavily skewed right — most intervals have 0 events, some have 1, and 2+ is rare. When λ is moderate (say, 5), the distribution starts looking bell-shaped. As λ grows past 15-20, it becomes nearly indistinguishable from a Normal(λ, λ).

Property	Value
Support	{0, 1, 2, ...} (unbounded)
Mean	λ
Variance	λ
Parameters	λ > 0

The Poisson limit theorem: If you take a Binomial(n, p) with n → ∞ and p → 0 while np = λ stays constant, you get a Poisson(λ). This is why the Poisson models "rare events in many trials" — website crashes, particle decays, server errors, typos. If the individual probability is tiny but there are many opportunities, reach for the Poisson.

Poisson PMF Explorer

Watch how the distribution shifts right and becomes more symmetric as λ increases. At λ = 1, most outcomes are 0 or 1. At λ = 15, it looks nearly Gaussian.

λ (rate)4.0

Worked Example

A lidar sensor detects an average of λ = 3 obstacles per scan. What's the probability of detecting exactly 5 in one scan?

P(X = 5) = 3⁵ · e⁻³ / 5! = 243 × 0.04979 / 120 ≈ 0.1008.

What about zero obstacles? P(X = 0) = e⁻³ ≈ 0.0498. So there's about a 5% chance of a completely clear scan.

python
from scipy.stats import poisson

rv = poisson(mu=3)
rv.pmf(5)     # 0.1008
rv.pmf(0)     # 0.0498
rv.mean()     # 3.0
rv.var()      # 3.0  (mean = variance!)

# P(more than 5 obstacles) — when to worry
1 - rv.cdf(5)  # 0.0839

# Superposition: two independent sensors
from scipy.stats import poisson
rv_combined = poisson(mu=3 + 5)  # combined rate
rv_combined.mean()  # 8.0

Mean = Variance test: If your count data has variance much larger than the mean, the Poisson assumption is violated. This is called overdispersion, and you should use the Negative Binomial instead (Chapter 5). Conversely, if variance < mean (underdispersion), a Binomial with a known n might be more appropriate.

Poisson Process

The Poisson distribution is intimately connected to the Poisson process — a continuous-time model where events occur independently at a constant average rate. If events arrive at rate λ per unit time, then the number of events in any interval of length t follows Pois(λt). The time between consecutive events follows an Exponential(λ) distribution. This duality between counts (Poisson) and waiting times (Exponential) is one of the most useful relationships in applied probability.

Superposition and Thinning

Two useful properties make the Poisson especially flexible in practice:

Superposition: If X ~ Pois(λ₁) and Y ~ Pois(λ₂) are independent, then X + Y ~ Pois(λ₁ + λ₂). If two independent sensors detect events at rates 2/sec and 3/sec, the combined detection rate is Poisson with λ = 5/sec. This extends to any number of independent Poisson sources.

Thinning: If X ~ Pois(λ) and each event is independently kept with probability p (and discarded with probability 1−p), the kept events follow Pois(λp) and the discarded events follow Pois(λ(1−p)), and the two are independent. This models filtering: if a lidar sees 10 objects/scan on average but only 40% are cars, the car count is Pois(4) and the non-car count is Pois(6), independently.

These two properties — merging and splitting Poisson streams — are the foundation of queueing theory, which models waiting lines in everything from server load balancing to robotic task scheduling. The M/M/1 queue (Poisson arrivals, exponential service times, one server) is the simplest queueing model and a cornerstone of performance engineering.

A Poisson distribution has λ = 7. What is its variance?

7 49 √7

Chapter 5: The Negative Binomial

The Binomial asks "how many successes in n fixed trials?" The Negative Binomial flips the question entirely: "how many failures before we accumulate r successes?" Now the number of trials isn't fixed in advance — you keep going until you've hit your target. The experiment ends when the r-th success occurs.

Imagine you're debugging code and need to find 3 bugs. Each test run has a 20% chance of revealing a bug. How many failed tests before you've found all 3? Or you're training a reinforcement learning agent and need it to reach the goal r = 5 times to call training "done." How many failed episodes will there be?

NegBin(r, p) counts the number of failures k before the r-th success (some textbooks count total trials instead — always check the convention). Its PMF is:

P(X = k) = C(k + r − 1, k) · p^r · (1 − p)^k

Why C(k + r − 1, k)? Think of it this way. After all trials are done, the last trial is always a success (the r-th one). Before that, we've had k failures and r − 1 successes in some interleaved order — that's k + r − 1 trials total. The number of ways to arrange k failures among those k + r − 1 positions is C(k + r − 1, k). Multiply by p^r (probability of all successes) and (1−p)^k (probability of all failures).

Naming confusion alert: The "Negative Binomial" name comes from the fact that the PMF can be written using a negative binomial coefficient: C(−r, k)·(−1)^k = C(k+r−1, k). Different sources use different parameterizations. Python's scipy uses nbinom(n=r, p=p). R uses dnbinom(x, size=r, prob=p). Always verify whether k counts failures or total trials.

Property	Value
Support	{0, 1, 2, ...} (failures before r-th success)
Mean	r(1 − p) / p
Variance	r(1 − p) / p²
Parameters	r ∈ {1, 2, ...}, p ∈ (0, 1]

Why AI cares: The Negative Binomial handles overdispersed count data — data where the variance exceeds the mean. Check: variance/mean = r(1−p)/(p²) / (r(1−p)/p) = 1/p > 1 whenever p < 1. Real-world event counts (website visits, gene expression levels, word frequencies in documents) are almost always overdispersed. When the Poisson assumption (mean = variance) fails, the Negative Binomial is the standard upgrade.

Negative Binomial PMF Explorer

r = target successes, p = success probability per trial. Watch how the distribution stretches with more targets or lower success rates.

r (successes needed)3

p (success prob)0.40

Why Variance > Mean

For the Negative Binomial, variance = r(1−p)/p² and mean = r(1−p)/p. The ratio variance/mean = 1/p, which is always > 1 when p < 1. This built-in overdispersion is why the Negative Binomial is the go-to replacement when Poisson doesn't fit.

In practice, you can estimate r and p from data using the method of moments: set the sample mean = r(1−p)/p and sample variance = r(1−p)/p², then solve for r and p. Or use maximum likelihood, which is more efficient but requires numerical optimization.

Gamma-Poisson mixture: The Negative Binomial can also be derived as a Poisson distribution with a Gamma-distributed rate. If λ ~ Gamma(r, p/(1−p)) and X | λ ~ Pois(λ), then marginally X ~ NegBin(r, p). This "random rate" interpretation explains why real count data is overdispersed: the rate isn't truly constant across observations (different days, locations, or conditions produce different rates).

Negative Binomial in Modern ML

The Negative Binomial appears in several modern ML applications:

scRNA-seq: Single-cell RNA sequencing data is heavily overdispersed. Gene expression counts are modeled as Negative Binomial, not Poisson. Tools like DESeq2 and scVI use NegBin likelihoods.

NLP count models: In some text analysis pipelines, word frequencies within documents are modeled as NegBin when the Poisson fails to capture the burstiness of natural language (a word that appears once in a document tends to appear again).

Count regression: When Poisson regression gives poor fits (residual deviance >> degrees of freedom, a clear sign of overdispersion), switching to Negative Binomial regression often resolves the issue. The extra parameter r absorbs the excess variance, producing well-calibrated confidence intervals and p-values. In Python, statsmodels.discrete.discrete_model.NegativeBinomial handles this.

Worked Example

You need r = 3 positive training examples from a noisy dataset where only p = 0.2 of samples are positive. How many negatives will you wade through?

Expected failures = r(1−p)/p = 3 × 0.8/0.2 = 12 failures. So you'll examine about 15 samples total. The variance is 3 × 0.8/0.04 = 60, so the standard deviation is √60 ≈ 7.7 — quite variable!

python
from scipy.stats import nbinom

rv = nbinom(n=3, p=0.2)  # n=r in scipy
rv.mean()    # 12.0 failures expected
rv.var()     # 60.0 (variance >> mean = overdispersed)
rv.pmf(10)   # P(exactly 10 failures before 3rd success)

# Compare: Poisson with same mean has variance = 12
# NegBin variance = 60 — five times higher!

Why would you choose a Negative Binomial over a Poisson for modeling event counts?

When events happen at a constant rate When the variance of counts exceeds the mean (overdispersion) When you only have two possible outcomes

Chapter 6: The Geometric Distribution

You keep sending a network packet until it gets through. You keep rolling a die until you get a 6. You keep asking users until one clicks your ad. You keep sampling from a noisy dataset until you find a valid example. You keep generating candidate solutions until one passes a test. The question: how many failures until the very first success?

The Geometric distribution Geom(p) is the simplest waiting-time distribution and the special case of the Negative Binomial with r = 1. It counts the number of failures k before the very first success:

P(X = k) = (1 − p)^k · p, k = 0, 1, 2, ...

The logic is clean: to have exactly k failures followed by a success, you need k consecutive failures (each with probability 1−p) and then one success (probability p). The probability decays exponentially — long waits are possible but increasingly unlikely. The CDF has a simple closed form: P(X ≤ k) = 1 − (1−p)^k+1.

Property	Value
Support	{0, 1, 2, ...} (failures before 1st success)
Mean	(1 − p) / p
Variance	(1 − p) / p²
Parameters	p ∈ (0, 1]

The memoryless property: The Geometric is the only discrete distribution that is memoryless. If you've already failed 10 times, the probability of succeeding on the next trial is still p — the same as if you'd never tried at all. Formally: P(X > s + t | X > s) = P(X > t). Past failures give you zero information about when the next success will come. This makes it the discrete analog of the continuous Exponential distribution.

Geometric PMF Explorer

Low p means long expected waits — watch the tail stretch as p shrinks. High p concentrates mass at k = 0 (succeed immediately).

p (success prob)0.30

Connection to the Exponential

The Geometric is to discrete time what the Exponential is to continuous time. Both are memoryless. Both model "waiting for the first event." If you discretize an Exponential waiting time into integer time steps, you get a Geometric. This makes the Geometric the natural model for counting discrete retries, rounds, or attempts.

The Coupon Collector Problem

A famous application: how many random draws (with replacement) until you've collected all K distinct items? Each item you still need arrives like a Geometric waiting time. The first unique item takes 1 draw. The second takes Geom(p = (K−1)/K) extra draws. The k-th takes Geom(p = (K−k+1)/K) extra draws. The total expected draws is K · H_K where H_K = 1 + 1/2 + 1/3 + ... + 1/K is the K-th harmonic number. For K = 50 stickers, you need about 225 purchases — much more than 50!

Worked Example

A wireless channel has p = 0.3 packet success rate. How many retransmissions on average before a packet gets through?

Expected failures = (1 − 0.3)/0.3 = 2.33 retransmissions. Total expected attempts = 1/p = 3.33. The variance is 0.7/0.09 = 7.78, so there's wide variability — sometimes the packet goes through first try, sometimes it takes 10+ attempts.

What's the probability the packet takes more than 5 attempts? P(X > 4 failures) = (1 − 0.3)⁵ = 0.7⁵ = 0.168. About 1 in 6 packets will need 6+ attempts. This long tail is critical for timeout calculations in network protocols.

Now consider the memoryless property in action. After 3 failed attempts, what's the probability the next one succeeds? It's still just p = 0.3. The 3 failures gave you zero information about the future. This is deeply counterintuitive — in real life, repeated failure often signals a deeper problem (maybe the server is down, not just noisy). But the Geometric assumes each trial is truly independent: the channel doesn't degrade, the server doesn't crash, each coin flip is fresh. When this independence assumption fails (correlated failures, degrading systems), you need more sophisticated models like Markov chains or hidden Markov models.

python
from scipy.stats import geom

# scipy's geom counts trials (starting at 1), not failures
rv = geom(p=0.3)
rv.mean()    # 3.33 (total trials including the success)
rv.pmf(1)    # 0.3  (succeed on first try)
rv.pmf(4)    # 0.3 * 0.7^3 = 0.1029

# P(need more than 5 attempts)
1 - rv.cdf(5)  # 0.168

What does "memoryless" mean for the Geometric distribution?

The distribution has no parameters Past failures don't change the probability of success on the next trial The mean is always 1

Chapter 7: The Hypergeometric Distribution

You have a dataset of 1,000 images: 200 are cats, 800 are dogs. You randomly draw 50 without replacement for your test set. How many cats will you get?

Unlike the Binomial, each draw changes the composition of what's left. After pulling out 3 cats, the pool has 197 cats out of 997 remaining — the probability of drawing a cat on the next draw has shifted slightly. This is sampling without replacement, and it's fundamentally different from independent draws.

The Hypergeometric distribution models exactly this scenario. It has three integer parameters: N (total population size), K (number of "success" items in the population), and n (number of draws without replacement). The PMF for getting exactly k successes among your n draws is:

P(X = k) = C(K, k) · C(N − K, n − k) / C(N, n)

Think of it as a counting argument. The numerator counts favorable draws: C(K, k) ways to choose k items from the K successes, times C(N−K, n−k) ways to choose the remaining n−k items from the N−K non-successes. The denominator C(N, n) counts all possible draws of n from N. The ratio gives the probability.

Unlike the Binomial, the support isn't always {0, 1, ..., n}. If n > N − K, you're forced to include some successes. The minimum k is max(0, n + K − N), and the maximum is min(n, K).

Property	Value
Support	{max(0, n+K−N), ..., min(n, K)}
Mean	nK / N
Variance	n · (K/N) · (1−K/N) · (N−n)/(N−1)
Parameters	N, K, n (integers)

Hypergeometric vs Binomial: The variance includes a factor (N−n)/(N−1) called the finite population correction (FPC). When N is huge relative to n (say, drawing 50 from a million), this factor ≈ 1 and the Hypergeometric ≈ Binomial. But when n is a significant fraction of N (say, drawing 500 from 1,000), this factor shrinks the variance considerably — sampling without replacement gives tighter results because each draw reveals information about the remaining population.

In machine learning, the Hypergeometric arises naturally in stratified sampling (ensuring class balance in train/test splits), feature selection (Fisher's exact test for gene expression), and quality control (acceptance sampling). Whenever you're selecting a subset from a finite labeled pool, you're in Hypergeometric territory.

Fisher's Exact Test

Fisher's exact test is one of the most important applications of the Hypergeometric. It tests whether two categorical variables are associated. Given a 2x2 contingency table, the probability of observing the data under the null hypothesis of independence follows a Hypergeometric distribution. Unlike the chi-squared test, Fisher's exact test works even with small sample sizes — it computes exact probabilities rather than relying on asymptotic approximations.

Dataset Splitting

When you split a dataset of N items (K positive, N−K negative) into a test set of size n, the number of positives in the test set follows Hypergeom(N, K, n). If you want a stratified split (same class proportions in train and test), you're essentially conditioning on the Hypergeometric outcome being close to its mean nK/N. Understanding this distribution tells you how much class imbalance to expect in random splits and when stratification is necessary.

For a concrete example: if your dataset has N = 1000 samples with K = 50 positives (5% prevalence) and you randomly take n = 200 for testing, the expected positives in the test set is nK/N = 10 with standard deviation ≈ 2.8 (including the finite population correction). There's about a 4% chance of getting fewer than 5 positives — potentially too few for reliable evaluation of metrics like precision and recall. This is why stratified splitting matters for imbalanced datasets.

python
from scipy.stats import hypergeom
# Dataset splitting scenario
rv = hypergeom(M=1000, n=50, N=200)
rv.mean()                  # 10.0 positives expected
rv.std()                   # 2.76
rv.cdf(4)                  # 0.038 — P(fewer than 5 positives)
1 - rv.cdf(15)             # 0.024 — P(more than 15 positives)

Hypergeometric PMF Explorer

N = population, K = successes in population, n = draws. Watch how sampling without replacement narrows the distribution compared to equivalent Binomial parameters.

N (population)50

K (successes in pop.)15

n (draws)10

Worked Example

A factory produces N = 100 chips, K = 5 are defective. An inspector tests n = 10 chips. What's the expected number of defectives, and what's the probability of catching at least one?

Expected defectives: nK/N = 10 × 5/100 = 0.5.

P(at least 1 defective) = 1 − P(X=0) = 1 − C(5,0)·C(95,10)/C(100,10) ≈ 0.41. So there's a 41% chance the inspector catches a problem. If the inspector tested n = 30 instead, P(at least 1) rises to about 0.81 — testing more items dramatically increases detection power when defect rates are low.

python
from scipy.stats import hypergeom

# scipy convention: M=pop, n=success_in_pop, N=draws
rv = hypergeom(M=100, n=5, N=10)
rv.mean()         # 0.5
1 - rv.pmf(0)    # 0.41 — P(at least 1 defective)

# Compare to Binomial(10, 0.05): P(>=1) = 0.40
# Close because n/N = 10/100 is small

When does the Hypergeometric differ most from the Binomial?

When the population is very large relative to the sample When the sample is a large fraction of the population When p is very small

Chapter 8: The Discrete Uniform

A fair die. A random integer between 1 and 100. An index chosen uniformly at random from an array. A random action in ε-greedy exploration. A pixel chosen at random from an image. When every outcome is equally likely and there is absolutely no reason to favor one over another, you have the Discrete Uniform distribution.

For outcomes {a, a+1, ..., b}, each has the same probability:

P(X = k) = 1 / (b − a + 1), k ∈ {a, a+1, ..., b}

This is the "maximum ignorance" distribution — you have no reason to prefer one outcome over another. Its entropy is log(b − a + 1), which is the maximum possible entropy for any distribution over the same support. In information theory, this means it carries the most uncertainty and requires the most bits to encode an outcome.

The mean is exactly the midpoint: (a + b)/2. The variance grows quadratically with the range — more possible outcomes means more spread. Specifically, for n = b − a + 1 outcomes, the variance is (n² − 1)/12. For a fair 6-sided die: variance = (36 − 1)/12 = 2.917.

Property	Value
Support	{a, a+1, ..., b}
Mean	(a + b) / 2
Variance	((b − a + 1)² − 1) / 12
Parameters	a, b (integers, a ≤ b)

Where you'll see it: Random weight initialization (uniform over integers), hash function outputs (ideal hash maps to uniform distribution), random sampling indices for mini-batches, Monte Carlo integration, the exploration component of ε-greedy RL policies (choose a random action uniformly), random seeds, and shuffling algorithms (Fisher-Yates uses uniform random integers).

Maximum Entropy Principle

The Discrete Uniform is the result of the maximum entropy principle: if all you know is the set of possible outcomes (no other constraints), the distribution that makes the fewest additional assumptions is the uniform. This principle, formalized by E.T. Jaynes, is why uninformative priors in Bayesian statistics are often uniform. It's the mathematical way of saying "I genuinely don't know."

Generating Uniform Random Integers

Most random number generators start with uniform randomness. random.randint(a, b) in Python gives you a Discrete Uniform variable. But beware of the modulo bias: if your underlying generator produces uniform integers in [0, M) and you compute x % n, the result is only truly uniform if n divides M evenly. Modern implementations (like Python's random module) handle this correctly, but it's a subtle source of bugs in hand-rolled solutions.

From Uniform to Any Distribution

A remarkable fact: you can sample from any discrete distribution given only a Uniform random number. This is the inverse CDF method. Generate U ~ Uniform(0, 1), then find the smallest k such that the CDF F(k) ≥ U. This works because the probability that U falls in the interval [F(k−1), F(k)] is exactly P(X = k). This is how most software implements discrete random number generation under the hood.

Discrete Uniform PMF Explorer

All bars are the same height — that's the whole point. Change the range to see how the probability per outcome shrinks as you add more outcomes.

a (min)1

b (max)6

Worked Example

An ε-greedy RL agent has 8 actions. With probability ε = 0.1, it explores by picking uniformly at random. With probability 1 − ε = 0.9, it picks the best known action. During exploration, what's the probability of selecting action 3?

P(action 3 | exploring) = 1/8 = 0.125. The overall probability of action 3 during exploration is ε × 1/8 = 0.0125. Mean action index during exploration = (0 + 7)/2 = 3.5. Variance = (8² − 1)/12 = 63/12 = 5.25.

python
from scipy.stats import randint
import numpy as np

rv = randint(low=0, high=8)  # [0, 8) = {0,1,...,7}
rv.pmf(3)     # 0.125
rv.mean()     # 3.5
rv.var()      # 5.25

# Epsilon-greedy: explore with prob epsilon
epsilon = 0.1
action = np.random.randint(8) if np.random.random() < epsilon else best_action

A fair 12-sided die follows DiscreteUniform(1, 12). What is P(X = 7)?

7/12 1/7 1/12

Chapter 9: The Multinomial Distribution

The Binomial extends Bernoulli from 1 trial to n trials. The Multinomial does the same for the Categorical — it counts outcomes across K categories over n independent trials.

You roll a loaded die 100 times and count how many of each face appeared. You classify 500 documents into 5 topics and count how many fall in each. You look at a paragraph and count the frequency of each word. You generate 1,000 tokens from an LLM and tally how many times each word appears.

The result is a count vector (x₁, x₂, ..., x_K) where each x_i counts how many times category i appeared, and all counts must sum to n. This vector lives on a simplex — a constrained space where only K−1 of the counts are truly free (the last is determined by n minus the rest).

P(X₁=x₁, ..., X_K=x_K) = n! / (x₁! ··· x_K!) · p₁^x₁ ··· p_K^x_K

The multinomial coefficient n!/(x₁!...x_K!) counts the number of distinct ways to arrange n trials into the given pattern of K categories. Think of it as: how many different orderings of the n trials produce the same count vector? Any single such ordering has probability p₁^x₁...p_K^x_K. Multiply by the number of orderings and you get the PMF.

For K = 2, the Multinomial reduces to the Binomial (the multinomial coefficient becomes the binomial coefficient, and the count vector is just (k, n−k)). For n = 1, it reduces to the Categorical (the coefficient is always 1, and the count vector is a one-hot). So the Multinomial truly unifies the counting distributions — it is the most general "repeat and count" distribution in this family.

Each marginal X_i (the count for a single category) follows a Binomial(n, p_i). But the X_i's are not independent — if many trials land in category 1, fewer are left for the others. They have negative covariance: Cov(X_i, X_j) = −np_ip_j. This negative correlation is intuitive: categories compete for the same pool of n trials.

The Dirichlet distribution is the conjugate prior for the Multinomial. If your prior over (p₁, ..., p_K) is Dirichlet(α), then after observing Multinomial counts, the posterior is also Dirichlet with updated parameters. This Dirichlet-Multinomial pair is the multivariate generalization of the Beta-Binomial pair.

Property	Value
Support	x_i ≥ 0, ∑x_i = n
Mean of X_i	np_i
Variance of X_i	np_i(1 − p_i)
Cov(X_i, X_j)	−np_ip_j
Parameters	n, (p₁, ..., p_K)

The bag-of-words model: In NLP, a document is often modeled as a Multinomial draw from a vocabulary distribution. Each word in the document is one "trial," and the vocabulary has K "categories." Topic models like Latent Dirichlet Allocation (LDA) assume each topic has its own Multinomial distribution over words, and each document is a mixture of topics. The Multinomial is also the natural distribution for the output of a multi-class classifier applied to n samples.

Chi-Squared Test

A classic application of the Multinomial is the chi-squared goodness-of-fit test. If you observe counts (x₁, ..., x_K) and expect (np₁, ..., np_K), the test statistic is:

χ² = ∑_i (x_i − np_i)² / np_i

Under the null hypothesis (the data really came from this Multinomial), this statistic follows a chi-squared distribution with K−1 degrees of freedom for large n. Large values mean the observed counts deviate significantly from expected — evidence that the assumed distribution is wrong. This test is used everywhere from A/B testing to checking if an LLM's output distribution matches the training distribution (detecting distribution shift).

Multinomial Sampler — Word Counts

Sample n words from a vocabulary of 8 tokens with given probabilities. Each bar shows the observed proportion for one "word." Click Re-sample to see how the proportions fluctuate.

n (trials)100

Worked Example

A 3-topic classifier with probabilities [0.5, 0.3, 0.2] classifies n = 200 documents. What are the expected counts, variances, and covariances?

Expected count vector: E[X] = [np₁, np₂, np₃] = [100, 60, 40].

Variance of topic 1 count: np₁(1−p₁) = 200 × 0.5 × 0.5 = 50. Standard deviation ≈ 7.07, so we'd expect roughly 100 ± 14 documents in topic 1 (95% interval).

Covariance between topic 1 and topic 2: Cov = −np₁p₂ = −200 × 0.5 × 0.3 = −30. The negative sign confirms that more documents in topic 1 means fewer in topic 2. The correlation is Cov/(std₁ × std₂) = −30/(√50 × √42) = −30/45.8 ≈ −0.655. The categories are moderately anti-correlated.

python
from scipy.stats import multinomial
import numpy as np

rv = multinomial(n=200, p=[0.5, 0.3, 0.2])
sample = rv.rvs(size=1)  # e.g. [[98, 63, 39]]
rv.mean()                 # [100., 60., 40.]

# Covariance between topic 1 and topic 2
# Cov = -n * p1 * p2 = -200 * 0.5 * 0.3 = -30
print(rv.cov())  # [[50, -30, -20], [-30, 42, -12], [-20, -12, 32]]

The Multinomial is to the Categorical as the Binomial is to the ___?

Bernoulli Poisson Geometric

Chapter 10: The Family Tree

These nine distributions aren't isolated — they form a family connected by limiting cases, special cases, and extensions. Understanding these relationships is the key to choosing the right distribution for your problem. When you see a new random variable, your first question should be: what is this a special case of? What does it converge to in the limit?

The relationship map below shows how the distributions connect. Every arrow represents either a special case (setting a parameter to a specific value) or a limiting case (taking a parameter to infinity). This web of connections means you never need to memorize all nine from scratch — understand a few core distributions (Bernoulli, Poisson, Categorical), and the rest follow as natural extensions.

A useful mental model: Bernoulli and Categorical are the "atoms" — single-trial distributions. Binomial and Multinomial are "molecules" (repeating the atom n times and counting). Geometric and Negative Binomial are "waiting time" versions (repeating until a stopping condition). Poisson is the limiting case of rare events. Hypergeometric is the finite-population version. And the Discrete Uniform is the "maximum ignorance" baseline that says "I have no idea."

Distribution Relationship Map

Nodes show distributions. Dashed lines show relationships: special cases, extensions, and limiting behaviors. The Normal (Gaussian) appears as the limiting distribution for large sample sizes.

When to Use Which

Start with the question you're asking. Each distribution answers a different type of question:

Question You're Asking	Distribution	AI/Robotics Example
Yes or no?	Bernoulli	Is this email spam?
Which category (one trial)?	Categorical	Which word comes next?
How many successes in n trials?	Binomial	How many users convert?
How many events in an interval?	Poisson	Server errors per hour
Trials until r successes?	Negative Binomial	Overdispersed counts
Trials until first success?	Geometric	Packet retransmissions
Successes when sampling w/o replacement?	Hypergeometric	Defectives in a sample
All outcomes equally likely?	Discrete Uniform	Random action in ε-greedy
Counts across K categories in n trials?	Multinomial	Word frequency vectors

Key Relationships

The distributions in this lesson are not independent inventions — they're connected by mathematical relationships that reveal deep structure. If you understand one, you can often derive or approximate another.

Special cases: Bernoulli = Binomial with n = 1. Geometric = Negative Binomial with r = 1. Binomial = Multinomial with K = 2. Bernoulli = Categorical with K = 2. Fair die = Discrete Uniform(1, 6).

Limiting cases: Binomial(n, p) → Poisson(λ = np) as n → ∞, p → 0. Hypergeometric → Binomial as N → ∞ (infinite population = sampling with replacement). Binomial → Normal as n → ∞ (central limit theorem). Poisson → Normal as λ → ∞.

Conjugate Priors

In Bayesian statistics, each of these distributions has a natural conjugate prior that makes updating tractable:

Likelihood	Conjugate Prior	What It Means
Bernoulli / Binomial	Beta	Prior on success probability p
Categorical / Multinomial	Dirichlet	Prior on probability vector (p₁,...,p_K)
Poisson	Gamma	Prior on rate λ
Geometric / Neg. Binomial	Beta	Prior on success probability p

Conjugacy means the posterior has the same functional form as the prior, just with updated parameters. This makes sequential Bayesian updating fast and analytically tractable — no numerical integration needed. For example, with a Beta(1,1) prior (uniform) and 7 heads in 10 coin flips (Binomial likelihood), the posterior is Beta(8, 4), which you can compute in closed form.

Comparison at a Glance

Distribution	Parameters	Mean	Variance
Bernoulli	p	p	p(1−p)
Categorical	p_1..K	p_i	p_i(1−p_i)
Binomial	n, p	np	np(1−p)
Poisson	λ	λ	λ
Neg. Binomial	r, p	r(1−p)/p	r(1−p)/p²
Geometric	p	(1−p)/p	(1−p)/p²
Hypergeometric	N, K, n	nK/N	nK(N−K)(N−n)/[N²(N−1)]
Discrete Uniform	a, b	(a+b)/2	((b−a+1)²−1)/12
Multinomial	n, p_1..K	np_i	np_i(1−p_i)

Decision Flowchart

Start here

What kind of outcome?

↓

Binary (yes/no)

1 trial → Bernoulli | n trials → Binomial | until success → Geometric

↓

K categories

1 trial → Categorical | n trials → Multinomial

↓

Unbounded count

Var ≈ Mean → Poisson | Var > Mean → Neg. Binomial

↓

Finite pool, no replacement

Hypergeometric

↓

No preference among outcomes

Discrete Uniform

The practitioner's decision rule: Start simple. If the outcome is binary, use Bernoulli/Binomial. If it's a count with no upper bound, try Poisson first — if variance > mean, upgrade to Negative Binomial. If you're sampling from a finite pool without replacement, use Hypergeometric. If you're choosing among K categories, it's Categorical (one trial) or Multinomial (n trials). If you have no information to prefer one outcome, it's Discrete Uniform.

Connections to Other Lessons

These distributions are the building blocks for everything else:

The Categorical is the output of every classifier — see the GPT and Transformer lessons for how softmax creates it.

The Bernoulli and Binomial appear in Bayesian Estimation (Beta-Binomial conjugacy) and Bayes Filters.

The Poisson drives event detection in Hidden Markov Models and sensor fusion in Kalman Filters.

The Multinomial is the heart of topic models and Bayesian Networks with discrete nodes.

What We Didn't Cover

This lesson focused on the most common discrete distributions. But the family extends further:

Distribution	What It Does	When You'd Need It
Beta-Binomial	Binomial where p is itself random (Beta-distributed)	Overdispersed binary data, hierarchical models
Dirichlet-Multinomial	Multinomial where probabilities are Dirichlet-distributed	Document modeling with uncertainty in topic proportions
Zero-Inflated Poisson	Mixture of point mass at 0 and Poisson	Count data with excess zeros (rare events)
Zipf / Power Law	P(X=k) ∝ k^−s	Word frequencies, city sizes, network degrees

Each of these builds on the foundations covered in this lesson. Once you understand the nine core distributions, these extensions are natural generalizations.

Practical Tips

For practitioners:
• Always plot your data as a histogram before choosing a distribution. The shape tells you a lot.
• Check mean vs variance: if they're approximately equal, Poisson is likely. If variance >> mean, use Negative Binomial.
• If your data has an upper bound (e.g., number correct out of n), it's probably Binomial. If it's unbounded, think Poisson or Negative Binomial.
• Use QQ-plots (quantile-quantile plots) to visually check if your data matches a candidate distribution.
• Scipy's .fit() method can estimate parameters from data for many distributions.
• When in doubt, simulate! Generate random samples from your candidate distribution and compare to your actual data.

Summary in One Sentence Each

Distribution	One Sentence
Bernoulli	A single coin flip with bias p.
Categorical	A single roll of a K-sided die.
Binomial	Count of heads in n coin flips.
Poisson	Count of events at a constant rate.
Neg. Binomial	Number of failures before r-th success.
Geometric	Number of failures before first success.
Hypergeometric	Successes when drawing without replacement.
Discrete Uniform	Every outcome equally likely.
Multinomial	Count vector from n rolls of a K-sided die.

You're modeling the number of sensor faults per day. The mean is 2.3 but the variance is 8.1. Which distribution is most appropriate?

Poisson (variance must equal mean) Negative Binomial (handles overdispersion) Binomial (fixed number of trials)

Where to Go From Here

The continuous world awaits. The most important continuous distributions for AI and robotics are:

Each continuous distribution pairs naturally with its discrete counterpart through conjugacy, limits, or analogous structure. The discrete foundations you've built in this lesson will make continuous distributions feel like natural extensions rather than entirely new concepts.

With these nine distributions as your vocabulary, you can describe and reason about discrete uncertainty in any AI system. They are the building blocks of probabilistic thinking.

"Probability theory is nothing but common sense reduced to calculation." — Pierre-Simon Laplace

Distribution	Discrete Analog	Where Used
Normal (Gaussian)	Binomial (large n limit)	Everywhere: noise, weights, beliefs
Exponential	Geometric	Waiting times, event arrivals
Beta	Prior for Bernoulli p	Bayesian inference, Thompson sampling
Gamma	Prior for Poisson λ	Rate estimation, mixture models
Dirichlet	Prior for Categorical (p₁,...,p_K)	Topic models, Bayesian RL