Nine probability distributions that power classification, counting, sampling, and decision-making across AI, estimation, and robotics.
A self-driving car classifies each pixel as road, car, pedestrian, or sky. A spam filter says yes or no. A language model picks the next word from 50,000 candidates. A sensor counts how many packets arrived in the last second.
All of these are discrete outcomes — they come from a finite or countable set of possibilities. You can list them: heads or tails, category 1 through K, 0 events or 1 event or 2 events. There are no values "between" them. You can't get 2.7 heads in 5 coin flips.
Contrast this with continuous outcomes: a temperature reading of 37.284°C, a voltage of 3.14159V, a distance of 2.718 meters. Those live on the real number line with infinitely many values between any two points. Discrete outcomes snap to specific, separated values.
Why does this distinction matter? Because it determines the mathematical tools you'll use. Discrete distributions let you sum probabilities directly: P(X = 3) + P(X = 4) = p(3) + p(4). Continuous distributions require integration: P(3 ≤ X ≤ 4) = ∫34 f(x)dx. In practice, many AI systems deliberately convert continuous quantities into discrete ones — quantization in neural networks, tokenization in language models, binning in histograms — precisely because discrete distributions are easier to work with computationally.
The cumulative distribution function (CDF) works for both: F(k) = P(X ≤ k). For discrete distributions, the CDF is a staircase function that jumps at each supported value. The jump height at k equals p(k). CDFs are useful for answering "at most" and "at least" questions: P(X ≥ 5) = 1 − F(4).
Continuous distributions (like the Gaussian) assign probability densities to intervals — you must integrate to get actual probabilities. Discrete distributions assign actual probabilities to individual outcomes directly. The function that does this is the probability mass function (PMF):
Each bar in a PMF chart has a definite height between 0 and 1, and all bars sum to exactly 1. No integrals needed — just addition. This is why discrete distributions are often easier to work with: you can enumerate all possibilities and add up their probabilities.
How do you know if your problem is discrete? Ask yourself: can the outcome be listed? If you can write down all possible values (even if the list is infinite, like 0, 1, 2, 3, ...), it's discrete. If the outcome can take any real value in some interval (like a temperature between 36.0 and 42.0), it's continuous.
| Feature | Discrete (PMF) | Continuous (PDF) |
|---|---|---|
| Outcome space | Countable (finite or infinite list) | Uncountable (real intervals) |
| P(X = x) | Nonzero for each outcome | Always zero for any single point |
| Sum/integral to 1 | ∑ p(k) = 1 | ∫ f(x) dx = 1 |
| Examples | Coin flips, counts, categories | Heights, voltages, durations |
In this lesson, we'll build up a family of nine distributions, each designed for a different scenario you'll encounter in AI, robotics, and estimation:
Left: a discrete PMF (bars sum to 1). Right: a continuous PDF (area under curve = 1). Toggle between them to see the fundamental difference.
Throughout this lesson, every distribution gets an interactive bar chart. Here's how to read them:
The x-axis shows the possible outcomes (the support). The y-axis shows the probability of each outcome. Each bar's height is P(X = k) for that value of k. The bars must always sum to 1 — probability has to go somewhere.
Key patterns to watch for as you explore:
| Shape | What It Means | Example |
|---|---|---|
| Tall narrow peak | High confidence, low variance | Bernoulli with p near 0 or 1 |
| Flat/uniform | Maximum uncertainty | Fair die, Discrete Uniform |
| Right-skewed | Most values small, occasional large ones | Poisson with small λ |
| Symmetric bell | Values cluster around the mean | Binomial with p ≈ 0.5 |
| Long tail | Rare but possible extreme values | Geometric with small p |
The simplest possible random experiment: something happens, or it doesn't. A coin lands heads. An email is spam. A patient tests positive. A neural network predicts "cat." There are exactly two outcomes, and we label them 1 (success) and 0 (failure).
A Bernoulli random variable X (named after Swiss mathematician Jacob Bernoulli, 1655-1705) has a single parameter p — the probability of success. Its PMF is beautifully compact:
Let's unpack this formula. When k = 1 (success): P(X = 1) = p1(1 − p)0 = p. When k = 0 (failure): P(X = 0) = p0(1 − p)1 = 1 − p. The two probabilities sum to exactly 1. That's the entire distribution.
The mean (expected value) is E[X] = 0·(1−p) + 1·p = p. This makes intuitive sense: the average value of many Bernoulli draws converges to p (the law of large numbers). The variance is Var(X) = p(1 − p). Notice: variance is maximized at p = 0.5 (maximum uncertainty, like a fair coin) and drops to zero at p = 0 or p = 1 (certain outcome).
| Property | Value |
|---|---|
| Support | {0, 1} |
| Mean | p |
| Variance | p(1 − p) |
| Parameters | p ∈ [0, 1] |
The entropy of a Bernoulli is H = −p log2 p − (1−p) log2(1−p). This is maximized at p = 0.5 (1 bit of information — you need exactly one bit to encode the outcome) and zero at p = 0 or p = 1 (no uncertainty, zero bits needed). Binary cross-entropy loss in neural network training is exactly this formula applied to each prediction, using natural log instead of log2. It measures how far the model's predicted p is from the true label.
When your model's sigmoid output gives p = 0.5 for every input, it has maximum entropy — it's learned nothing. Training drives p toward 0 or 1 for each example, reducing entropy and increasing the model's confidence.
Drag p to see how the two bars change. The variance peaks at p = 0.5 — maximum uncertainty about the outcome.
Let's derive the variance from scratch, step by step. We need E[X²] and E[X].
E[X] = 0 · (1−p) + 1 · p = p.
E[X²] = 0² · (1−p) + 1² · p = p. (For Bernoulli, E[X²] = E[X] since X is 0 or 1.)
Var(X) = E[X²] − (E[X])² = p − p² = p(1−p).
This parabola opens downward, hitting zero at the endpoints (certainty) and peaking at 0.25 when p = 0.5 (maximum confusion). The fact that X² = X for Bernoulli is a unique property — it only works because the support is {0, 1}.
A spam filter assigns p = 0.85 that an email is spam. What's the probability it's not spam? What's the uncertainty?
P(not spam) = 1 − 0.85 = 0.15. The variance is 0.85 × 0.15 = 0.1275. Compare this to a filter that outputs p = 0.99: variance = 0.99 × 0.01 = 0.0099. The more confident the classifier, the lower the variance — which is exactly what we want in a well-calibrated model.
python from scipy.stats import bernoulli rv = bernoulli(p=0.85) rv.pmf(1) # 0.85 — P(spam) rv.pmf(0) # 0.15 — P(not spam) rv.mean() # 0.85 rv.var() # 0.1275 # Simulate 1000 emails samples = rv.rvs(size=1000) print(samples.mean()) # ≈ 0.85 # Binary cross-entropy loss import numpy as np y_true = 1 # actual label p_pred = 0.85 bce = -(y_true * np.log(p_pred) + (1 - y_true) * np.log(1 - p_pred)) print(bce) # 0.163 — low loss for a good prediction
A Bernoulli picks between two outcomes. But what if your die has K faces? A language model picks from 50,000 tokens. An image classifier chooses from 1,000 classes. A robot decides among 8 possible actions. You need a distribution over K categories, each with its own probability.
The Categorical distribution (also called a generalized Bernoulli, a discrete distribution, or sometimes a "multinoulli") has K parameters p1, p2, ..., pK where each pi ≥ 0 and they sum to 1. Only K−1 parameters are free since pK = 1 − p1 − ... − pK−1. The PMF for outcome i is simply:
That's it. No fancy formula — you just look up the probability for category i. The beauty is in what this enables: by assigning different probabilities to different categories, you can encode everything from a fair die (all pi = 1/K) to a completely confident classifier (one pi = 1, all others = 0), to anything in between. The space of all possible Categorical distributions over K categories forms a (K−1)-dimensional simplex — a triangle for K=3, a tetrahedron for K=4.
For any single category i, the indicator "did category i occur?" is itself a Bernoulli(pi). So the mean of that indicator is pi and its variance is pi(1 − pi). The entropy of the Categorical is H = −∑ pi log pi, which measures how "spread out" the distribution is. A uniform Categorical has maximum entropy (maximum uncertainty); a one-hot distribution (one pi = 1, rest = 0) has zero entropy (no uncertainty).
| Property | Value |
|---|---|
| Support | {1, 2, ..., K} |
| Mean (of indicator i) | pi |
| Variance (of indicator i) | pi(1 − pi) |
| Parameters | p1, ..., pK with ∑pi = 1 |
Once you have a Categorical distribution, how do you pick from it? There are several strategies, each used in different AI contexts:
| Strategy | Method | Use Case |
|---|---|---|
| Greedy (argmax) | Always pick the highest-probability category | Classification, deterministic decoding |
| Sampling | Draw randomly according to the probabilities | Creative text generation |
| Top-k | Zero out all but the top k probabilities, renormalize, then sample | Focused but diverse generation |
| Nucleus (top-p) | Keep the smallest set of categories whose cumulative probability ≥ p | Adaptive diversity in LLMs |
All of these operate on the same underlying Categorical distribution — they just filter or reshape it before sampling.
In neural networks, we typically represent a Categorical outcome as a one-hot vector: a vector of all zeros except for a 1 in the position of the chosen category. For K = 4 classes, outcome 2 becomes [0, 1, 0, 0]. The cross-entropy loss between the true one-hot vector and the predicted probability vector is:
Since only one yi is nonzero, the loss simplifies to the negative log of the predicted probability for the true class. Minimizing cross-entropy is equivalent to maximizing the likelihood of the Categorical distribution — it's the same optimization problem viewed from two angles. When the model assigns p = 0.99 to the correct class, the loss is −log(0.99) = 0.01 (tiny). When it assigns p = 0.01, the loss is −log(0.01) = 4.6 (huge). This logarithmic penalty harshly punishes confident wrong answers.
Adjust weights for face 1 and face 6. All weights auto-normalize to sum to 1. Faces 2-5 have fixed weight 1.0. Watch how loading one face steals probability from the others.
A language model outputs logits [2.0, 1.0, 0.5] for three tokens. Softmax converts these to probabilities:
p1 = e2.0 / (e2.0 + e1.0 + e0.5) = 7.389 / (7.389 + 2.718 + 1.649) = 7.389 / 11.756 ≈ 0.628
Similarly p2 ≈ 0.231, p3 ≈ 0.140. The model then samples from this Categorical to pick a token.
python import numpy as np logits = np.array([2.0, 1.0, 0.5]) probs = np.exp(logits) / np.sum(np.exp(logits)) # softmax print(probs) # [0.628, 0.231, 0.140] # Sample a token token = np.random.choice(len(probs), p=probs) # Temperature scaling (T=0.5 = more confident) T = 0.5 probs_cold = np.exp(logits / T) / np.sum(np.exp(logits / T)) print(probs_cold) # [0.794, 0.145, 0.061] — peakier
Now we extend the Bernoulli from a single trial to many. You run a Bernoulli experiment n times, independently. A medical trial tests a drug on 100 patients — each patient either responds or doesn't. You send 50 network packets through a lossy channel — each arrives or gets lost. You classify 200 images and want to know how many are correct. An A/B test shows a new landing page to 1,000 visitors — each visitor converts or bounces.
The question in every case: how many successes in n independent trials?
The Binomial distribution Bin(n, p) counts the total number of successes k out of n independent Bernoulli(p) trials. Each trial has the same probability p of success, and trials don't affect each other. Its PMF is:
Let's derive why this formula makes sense. Any specific sequence with exactly k successes and (n−k) failures has probability pk(1−p)n−k. But there are many such sequences — the successes can fall on any k of the n positions. The number of ways to choose those k positions is the binomial coefficient C(n, k) = n! / (k!(n−k)!). Multiply the two together and you get the PMF.
The mean is E[X] = np (each trial contributes p on average, and expectation is linear). The variance is Var(X) = np(1−p). Why? Each trial is an independent Bernoulli(p) with variance p(1−p). Since the trials are independent, variances add: np(1−p). This additivity of variance only works because the trials don't influence each other.
C(n, k) = n! / (k!(n−k)!) deserves a moment. For n = 10, k = 3: C(10, 3) = 10!/(3! × 7!) = (10 × 9 × 8)/(3 × 2 × 1) = 120. There are 120 different ways to place 3 successes among 10 trial positions. You might hear this called "n choose k" — it counts combinations (order doesn't matter). A common computational trick: compute C(n, k) iteratively as a product to avoid overflow from large factorials.
Pascal's triangle gives all binomial coefficients: each entry is the sum of the two entries above it. The identity C(n, k) = C(n−1, k−1) + C(n−1, k) has a beautiful probabilistic interpretation: the k successes in n trials either include the last trial (in which case the first n−1 trials must contain k−1 successes) or they don't (in which case the first n−1 trials contain all k successes). This recursive structure is why dynamic programming can efficiently compute binomial coefficients.
| Property | Value |
|---|---|
| Support | {0, 1, 2, ..., n} |
| Mean | np |
| Variance | np(1 − p) |
| Parameters | n ∈ {1, 2, ...}, p ∈ [0, 1] |
Suppose your classifier achieves 92% accuracy on a test set of n = 500. Is this significantly better than random (50%)? Under the null hypothesis (random guessing), the number of correct predictions follows Bin(500, 0.5). The expected correct count would be 250 with std dev = √(500 × 0.5 × 0.5) = 11.18. Your observed 460 correct predictions is (460 − 250)/11.18 = 18.8 standard deviations above the mean — astronomically significant. But is 92% significantly better than 90%? Now the null is Bin(500, 0.90), with mean 450 and std dev 6.71. Your observed 460 is (460 − 450)/6.71 = 1.49 standard deviations — not quite significant at the 95% level. The Binomial provides the math to answer these questions rigorously.
Drag n and p to see how the distribution shifts and spreads. At p = 0.5, the distribution is symmetric. Large n makes it approach a bell curve.
An A/B test shows 200 visitors version B, with a 5% conversion rate (p = 0.05). What's the probability of exactly 15 conversions?
X ~ Bin(200, 0.05). Mean = np = 200 × 0.05 = 10. Std dev = √(np(1−p)) = √(200 × 0.05 × 0.95) = √9.5 ≈ 3.08.
P(X = 15) = C(200, 15) · 0.0515 · 0.95185 ≈ 0.0346. That's about 1.6 standard deviations above the mean — uncommon but not rare.
What if we want a confidence interval? Using the Normal approximation: 95% of the time, conversions will be in the range np ± 1.96√(np(1−p)) = 10 ± 6.04, so between about 4 and 16. If we observe 20 conversions, that's over 3 standard deviations from the mean — strong evidence that the true rate is higher than 5%.
python from scipy.stats import binom rv = binom(n=200, p=0.05) rv.pmf(15) # 0.0346 rv.mean() # 10.0 rv.std() # 3.08 # P(at least 15 conversions) — upper tail 1 - rv.cdf(14) # 0.0559 # 95% confidence interval lo, hi = rv.ppf(0.025), rv.ppf(0.975) print(f"95% CI: [{lo:.0f}, {hi:.0f}]") # [4, 16]
How many customers arrive at a store per hour? How many photons hit a sensor per millisecond? How many server errors occur per day? How many typos appear on a page? How many objects does a lidar detect per scan? These are all counts of events happening at some average rate in a fixed interval. The events are independent, could theoretically be any non-negative integer, and there is no hard upper limit on the count.
Unlike the Binomial, which requires a fixed number of trials n, the Poisson has no upper bound. Events can keep happening. This makes it ideal for situations where the "number of opportunities" is either unknown or effectively infinite.
The Poisson distribution (named after French mathematician Simeon Denis Poisson, 1781-1840) Pois(λ) has a single parameter λ (lambda, Greek letter for "rate") — the average number of events per interval. Its PMF is:
Where does this come from? Imagine dividing the interval into n tiny sub-intervals, each with probability p = λ/n of containing an event. The count of events is Bin(n, λ/n). As n → ∞:
C(n, k) · (λ/n)k · (1 − λ/n)n−k → λk/k! · e−λ
The factor e−λ is the probability of zero events — the "nothing happens" baseline. It's surprisingly large for small λ: when λ = 1, there's a 37% chance of zero events. The factor λk/k! distributes the remaining probability among configurations with exactly k events, with the factorial ensuring proper normalization.
The elegant thing about the Poisson: λ is simultaneously the mean AND the variance. No other common distribution has this property. When λ is small (say, 0.5), the distribution is heavily skewed right — most intervals have 0 events, some have 1, and 2+ is rare. When λ is moderate (say, 5), the distribution starts looking bell-shaped. As λ grows past 15-20, it becomes nearly indistinguishable from a Normal(λ, λ).
| Property | Value |
|---|---|
| Support | {0, 1, 2, ...} (unbounded) |
| Mean | λ |
| Variance | λ |
| Parameters | λ > 0 |
Watch how the distribution shifts right and becomes more symmetric as λ increases. At λ = 1, most outcomes are 0 or 1. At λ = 15, it looks nearly Gaussian.
A lidar sensor detects an average of λ = 3 obstacles per scan. What's the probability of detecting exactly 5 in one scan?
P(X = 5) = 35 · e−3 / 5! = 243 × 0.04979 / 120 ≈ 0.1008.
What about zero obstacles? P(X = 0) = e−3 ≈ 0.0498. So there's about a 5% chance of a completely clear scan.
python from scipy.stats import poisson rv = poisson(mu=3) rv.pmf(5) # 0.1008 rv.pmf(0) # 0.0498 rv.mean() # 3.0 rv.var() # 3.0 (mean = variance!) # P(more than 5 obstacles) — when to worry 1 - rv.cdf(5) # 0.0839 # Superposition: two independent sensors from scipy.stats import poisson rv_combined = poisson(mu=3 + 5) # combined rate rv_combined.mean() # 8.0
The Poisson distribution is intimately connected to the Poisson process — a continuous-time model where events occur independently at a constant average rate. If events arrive at rate λ per unit time, then the number of events in any interval of length t follows Pois(λt). The time between consecutive events follows an Exponential(λ) distribution. This duality between counts (Poisson) and waiting times (Exponential) is one of the most useful relationships in applied probability.
Two useful properties make the Poisson especially flexible in practice:
Superposition: If X ~ Pois(λ1) and Y ~ Pois(λ2) are independent, then X + Y ~ Pois(λ1 + λ2). If two independent sensors detect events at rates 2/sec and 3/sec, the combined detection rate is Poisson with λ = 5/sec. This extends to any number of independent Poisson sources.
Thinning: If X ~ Pois(λ) and each event is independently kept with probability p (and discarded with probability 1−p), the kept events follow Pois(λp) and the discarded events follow Pois(λ(1−p)), and the two are independent. This models filtering: if a lidar sees 10 objects/scan on average but only 40% are cars, the car count is Pois(4) and the non-car count is Pois(6), independently.
These two properties — merging and splitting Poisson streams — are the foundation of queueing theory, which models waiting lines in everything from server load balancing to robotic task scheduling. The M/M/1 queue (Poisson arrivals, exponential service times, one server) is the simplest queueing model and a cornerstone of performance engineering.
The Binomial asks "how many successes in n fixed trials?" The Negative Binomial flips the question entirely: "how many failures before we accumulate r successes?" Now the number of trials isn't fixed in advance — you keep going until you've hit your target. The experiment ends when the r-th success occurs.
Imagine you're debugging code and need to find 3 bugs. Each test run has a 20% chance of revealing a bug. How many failed tests before you've found all 3? Or you're training a reinforcement learning agent and need it to reach the goal r = 5 times to call training "done." How many failed episodes will there be?
NegBin(r, p) counts the number of failures k before the r-th success (some textbooks count total trials instead — always check the convention). Its PMF is:
Why C(k + r − 1, k)? Think of it this way. After all trials are done, the last trial is always a success (the r-th one). Before that, we've had k failures and r − 1 successes in some interleaved order — that's k + r − 1 trials total. The number of ways to arrange k failures among those k + r − 1 positions is C(k + r − 1, k). Multiply by pr (probability of all successes) and (1−p)k (probability of all failures).
nbinom(n=r, p=p). R uses dnbinom(x, size=r, prob=p). Always verify whether k counts failures or total trials.| Property | Value |
|---|---|
| Support | {0, 1, 2, ...} (failures before r-th success) |
| Mean | r(1 − p) / p |
| Variance | r(1 − p) / p² |
| Parameters | r ∈ {1, 2, ...}, p ∈ (0, 1] |
r = target successes, p = success probability per trial. Watch how the distribution stretches with more targets or lower success rates.
For the Negative Binomial, variance = r(1−p)/p² and mean = r(1−p)/p. The ratio variance/mean = 1/p, which is always > 1 when p < 1. This built-in overdispersion is why the Negative Binomial is the go-to replacement when Poisson doesn't fit.
In practice, you can estimate r and p from data using the method of moments: set the sample mean = r(1−p)/p and sample variance = r(1−p)/p², then solve for r and p. Or use maximum likelihood, which is more efficient but requires numerical optimization.
The Negative Binomial appears in several modern ML applications:
scRNA-seq: Single-cell RNA sequencing data is heavily overdispersed. Gene expression counts are modeled as Negative Binomial, not Poisson. Tools like DESeq2 and scVI use NegBin likelihoods.
NLP count models: In some text analysis pipelines, word frequencies within documents are modeled as NegBin when the Poisson fails to capture the burstiness of natural language (a word that appears once in a document tends to appear again).
Count regression: When Poisson regression gives poor fits (residual deviance >> degrees of freedom, a clear sign of overdispersion), switching to Negative Binomial regression often resolves the issue. The extra parameter r absorbs the excess variance, producing well-calibrated confidence intervals and p-values. In Python, statsmodels.discrete.discrete_model.NegativeBinomial handles this.
You need r = 3 positive training examples from a noisy dataset where only p = 0.2 of samples are positive. How many negatives will you wade through?
Expected failures = r(1−p)/p = 3 × 0.8/0.2 = 12 failures. So you'll examine about 15 samples total. The variance is 3 × 0.8/0.04 = 60, so the standard deviation is √60 ≈ 7.7 — quite variable!
python from scipy.stats import nbinom rv = nbinom(n=3, p=0.2) # n=r in scipy rv.mean() # 12.0 failures expected rv.var() # 60.0 (variance >> mean = overdispersed) rv.pmf(10) # P(exactly 10 failures before 3rd success) # Compare: Poisson with same mean has variance = 12 # NegBin variance = 60 — five times higher!
You keep sending a network packet until it gets through. You keep rolling a die until you get a 6. You keep asking users until one clicks your ad. You keep sampling from a noisy dataset until you find a valid example. You keep generating candidate solutions until one passes a test. The question: how many failures until the very first success?
The Geometric distribution Geom(p) is the simplest waiting-time distribution and the special case of the Negative Binomial with r = 1. It counts the number of failures k before the very first success:
The logic is clean: to have exactly k failures followed by a success, you need k consecutive failures (each with probability 1−p) and then one success (probability p). The probability decays exponentially — long waits are possible but increasingly unlikely. The CDF has a simple closed form: P(X ≤ k) = 1 − (1−p)k+1.
| Property | Value |
|---|---|
| Support | {0, 1, 2, ...} (failures before 1st success) |
| Mean | (1 − p) / p |
| Variance | (1 − p) / p² |
| Parameters | p ∈ (0, 1] |
Low p means long expected waits — watch the tail stretch as p shrinks. High p concentrates mass at k = 0 (succeed immediately).
The Geometric is to discrete time what the Exponential is to continuous time. Both are memoryless. Both model "waiting for the first event." If you discretize an Exponential waiting time into integer time steps, you get a Geometric. This makes the Geometric the natural model for counting discrete retries, rounds, or attempts.
A famous application: how many random draws (with replacement) until you've collected all K distinct items? Each item you still need arrives like a Geometric waiting time. The first unique item takes 1 draw. The second takes Geom(p = (K−1)/K) extra draws. The k-th takes Geom(p = (K−k+1)/K) extra draws. The total expected draws is K · HK where HK = 1 + 1/2 + 1/3 + ... + 1/K is the K-th harmonic number. For K = 50 stickers, you need about 225 purchases — much more than 50!
A wireless channel has p = 0.3 packet success rate. How many retransmissions on average before a packet gets through?
Expected failures = (1 − 0.3)/0.3 = 2.33 retransmissions. Total expected attempts = 1/p = 3.33. The variance is 0.7/0.09 = 7.78, so there's wide variability — sometimes the packet goes through first try, sometimes it takes 10+ attempts.
What's the probability the packet takes more than 5 attempts? P(X > 4 failures) = (1 − 0.3)5 = 0.75 = 0.168. About 1 in 6 packets will need 6+ attempts. This long tail is critical for timeout calculations in network protocols.
Now consider the memoryless property in action. After 3 failed attempts, what's the probability the next one succeeds? It's still just p = 0.3. The 3 failures gave you zero information about the future. This is deeply counterintuitive — in real life, repeated failure often signals a deeper problem (maybe the server is down, not just noisy). But the Geometric assumes each trial is truly independent: the channel doesn't degrade, the server doesn't crash, each coin flip is fresh. When this independence assumption fails (correlated failures, degrading systems), you need more sophisticated models like Markov chains or hidden Markov models.
python from scipy.stats import geom # scipy's geom counts trials (starting at 1), not failures rv = geom(p=0.3) rv.mean() # 3.33 (total trials including the success) rv.pmf(1) # 0.3 (succeed on first try) rv.pmf(4) # 0.3 * 0.7^3 = 0.1029 # P(need more than 5 attempts) 1 - rv.cdf(5) # 0.168
You have a dataset of 1,000 images: 200 are cats, 800 are dogs. You randomly draw 50 without replacement for your test set. How many cats will you get?
Unlike the Binomial, each draw changes the composition of what's left. After pulling out 3 cats, the pool has 197 cats out of 997 remaining — the probability of drawing a cat on the next draw has shifted slightly. This is sampling without replacement, and it's fundamentally different from independent draws.
The Hypergeometric distribution models exactly this scenario. It has three integer parameters: N (total population size), K (number of "success" items in the population), and n (number of draws without replacement). The PMF for getting exactly k successes among your n draws is:
Think of it as a counting argument. The numerator counts favorable draws: C(K, k) ways to choose k items from the K successes, times C(N−K, n−k) ways to choose the remaining n−k items from the N−K non-successes. The denominator C(N, n) counts all possible draws of n from N. The ratio gives the probability.
Unlike the Binomial, the support isn't always {0, 1, ..., n}. If n > N − K, you're forced to include some successes. The minimum k is max(0, n + K − N), and the maximum is min(n, K).
| Property | Value |
|---|---|
| Support | {max(0, n+K−N), ..., min(n, K)} |
| Mean | nK / N |
| Variance | n · (K/N) · (1−K/N) · (N−n)/(N−1) |
| Parameters | N, K, n (integers) |
Fisher's exact test is one of the most important applications of the Hypergeometric. It tests whether two categorical variables are associated. Given a 2x2 contingency table, the probability of observing the data under the null hypothesis of independence follows a Hypergeometric distribution. Unlike the chi-squared test, Fisher's exact test works even with small sample sizes — it computes exact probabilities rather than relying on asymptotic approximations.
When you split a dataset of N items (K positive, N−K negative) into a test set of size n, the number of positives in the test set follows Hypergeom(N, K, n). If you want a stratified split (same class proportions in train and test), you're essentially conditioning on the Hypergeometric outcome being close to its mean nK/N. Understanding this distribution tells you how much class imbalance to expect in random splits and when stratification is necessary.
For a concrete example: if your dataset has N = 1000 samples with K = 50 positives (5% prevalence) and you randomly take n = 200 for testing, the expected positives in the test set is nK/N = 10 with standard deviation ≈ 2.8 (including the finite population correction). There's about a 4% chance of getting fewer than 5 positives — potentially too few for reliable evaluation of metrics like precision and recall. This is why stratified splitting matters for imbalanced datasets.
python from scipy.stats import hypergeom # Dataset splitting scenario rv = hypergeom(M=1000, n=50, N=200) rv.mean() # 10.0 positives expected rv.std() # 2.76 rv.cdf(4) # 0.038 — P(fewer than 5 positives) 1 - rv.cdf(15) # 0.024 — P(more than 15 positives)
N = population, K = successes in population, n = draws. Watch how sampling without replacement narrows the distribution compared to equivalent Binomial parameters.
A factory produces N = 100 chips, K = 5 are defective. An inspector tests n = 10 chips. What's the expected number of defectives, and what's the probability of catching at least one?
Expected defectives: nK/N = 10 × 5/100 = 0.5.
P(at least 1 defective) = 1 − P(X=0) = 1 − C(5,0)·C(95,10)/C(100,10) ≈ 0.41. So there's a 41% chance the inspector catches a problem. If the inspector tested n = 30 instead, P(at least 1) rises to about 0.81 — testing more items dramatically increases detection power when defect rates are low.
python from scipy.stats import hypergeom # scipy convention: M=pop, n=success_in_pop, N=draws rv = hypergeom(M=100, n=5, N=10) rv.mean() # 0.5 1 - rv.pmf(0) # 0.41 — P(at least 1 defective) # Compare to Binomial(10, 0.05): P(>=1) = 0.40 # Close because n/N = 10/100 is small
A fair die. A random integer between 1 and 100. An index chosen uniformly at random from an array. A random action in ε-greedy exploration. A pixel chosen at random from an image. When every outcome is equally likely and there is absolutely no reason to favor one over another, you have the Discrete Uniform distribution.
For outcomes {a, a+1, ..., b}, each has the same probability:
This is the "maximum ignorance" distribution — you have no reason to prefer one outcome over another. Its entropy is log(b − a + 1), which is the maximum possible entropy for any distribution over the same support. In information theory, this means it carries the most uncertainty and requires the most bits to encode an outcome.
The mean is exactly the midpoint: (a + b)/2. The variance grows quadratically with the range — more possible outcomes means more spread. Specifically, for n = b − a + 1 outcomes, the variance is (n² − 1)/12. For a fair 6-sided die: variance = (36 − 1)/12 = 2.917.
| Property | Value |
|---|---|
| Support | {a, a+1, ..., b} |
| Mean | (a + b) / 2 |
| Variance | ((b − a + 1)² − 1) / 12 |
| Parameters | a, b (integers, a ≤ b) |
The Discrete Uniform is the result of the maximum entropy principle: if all you know is the set of possible outcomes (no other constraints), the distribution that makes the fewest additional assumptions is the uniform. This principle, formalized by E.T. Jaynes, is why uninformative priors in Bayesian statistics are often uniform. It's the mathematical way of saying "I genuinely don't know."
Most random number generators start with uniform randomness. random.randint(a, b) in Python gives you a Discrete Uniform variable. But beware of the modulo bias: if your underlying generator produces uniform integers in [0, M) and you compute x % n, the result is only truly uniform if n divides M evenly. Modern implementations (like Python's random module) handle this correctly, but it's a subtle source of bugs in hand-rolled solutions.
A remarkable fact: you can sample from any discrete distribution given only a Uniform random number. This is the inverse CDF method. Generate U ~ Uniform(0, 1), then find the smallest k such that the CDF F(k) ≥ U. This works because the probability that U falls in the interval [F(k−1), F(k)] is exactly P(X = k). This is how most software implements discrete random number generation under the hood.
All bars are the same height — that's the whole point. Change the range to see how the probability per outcome shrinks as you add more outcomes.
An ε-greedy RL agent has 8 actions. With probability ε = 0.1, it explores by picking uniformly at random. With probability 1 − ε = 0.9, it picks the best known action. During exploration, what's the probability of selecting action 3?
P(action 3 | exploring) = 1/8 = 0.125. The overall probability of action 3 during exploration is ε × 1/8 = 0.0125. Mean action index during exploration = (0 + 7)/2 = 3.5. Variance = (8² − 1)/12 = 63/12 = 5.25.
python from scipy.stats import randint import numpy as np rv = randint(low=0, high=8) # [0, 8) = {0,1,...,7} rv.pmf(3) # 0.125 rv.mean() # 3.5 rv.var() # 5.25 # Epsilon-greedy: explore with prob epsilon epsilon = 0.1 action = np.random.randint(8) if np.random.random() < epsilon else best_action
The Binomial extends Bernoulli from 1 trial to n trials. The Multinomial does the same for the Categorical — it counts outcomes across K categories over n independent trials.
You roll a loaded die 100 times and count how many of each face appeared. You classify 500 documents into 5 topics and count how many fall in each. You look at a paragraph and count the frequency of each word. You generate 1,000 tokens from an LLM and tally how many times each word appears.
The result is a count vector (x1, x2, ..., xK) where each xi counts how many times category i appeared, and all counts must sum to n. This vector lives on a simplex — a constrained space where only K−1 of the counts are truly free (the last is determined by n minus the rest).
The multinomial coefficient n!/(x1!...xK!) counts the number of distinct ways to arrange n trials into the given pattern of K categories. Think of it as: how many different orderings of the n trials produce the same count vector? Any single such ordering has probability p1x1...pKxK. Multiply by the number of orderings and you get the PMF.
For K = 2, the Multinomial reduces to the Binomial (the multinomial coefficient becomes the binomial coefficient, and the count vector is just (k, n−k)). For n = 1, it reduces to the Categorical (the coefficient is always 1, and the count vector is a one-hot). So the Multinomial truly unifies the counting distributions — it is the most general "repeat and count" distribution in this family.
Each marginal Xi (the count for a single category) follows a Binomial(n, pi). But the Xi's are not independent — if many trials land in category 1, fewer are left for the others. They have negative covariance: Cov(Xi, Xj) = −npipj. This negative correlation is intuitive: categories compete for the same pool of n trials.
| Property | Value |
|---|---|
| Support | xi ≥ 0, ∑xi = n |
| Mean of Xi | npi |
| Variance of Xi | npi(1 − pi) |
| Cov(Xi, Xj) | −npipj |
| Parameters | n, (p1, ..., pK) |
A classic application of the Multinomial is the chi-squared goodness-of-fit test. If you observe counts (x1, ..., xK) and expect (np1, ..., npK), the test statistic is:
Under the null hypothesis (the data really came from this Multinomial), this statistic follows a chi-squared distribution with K−1 degrees of freedom for large n. Large values mean the observed counts deviate significantly from expected — evidence that the assumed distribution is wrong. This test is used everywhere from A/B testing to checking if an LLM's output distribution matches the training distribution (detecting distribution shift).
Sample n words from a vocabulary of 8 tokens with given probabilities. Each bar shows the observed proportion for one "word." Click Re-sample to see how the proportions fluctuate.
A 3-topic classifier with probabilities [0.5, 0.3, 0.2] classifies n = 200 documents. What are the expected counts, variances, and covariances?
Expected count vector: E[X] = [np1, np2, np3] = [100, 60, 40].
Variance of topic 1 count: np1(1−p1) = 200 × 0.5 × 0.5 = 50. Standard deviation ≈ 7.07, so we'd expect roughly 100 ± 14 documents in topic 1 (95% interval).
Covariance between topic 1 and topic 2: Cov = −np1p2 = −200 × 0.5 × 0.3 = −30. The negative sign confirms that more documents in topic 1 means fewer in topic 2. The correlation is Cov/(std1 × std2) = −30/(√50 × √42) = −30/45.8 ≈ −0.655. The categories are moderately anti-correlated.
python from scipy.stats import multinomial import numpy as np rv = multinomial(n=200, p=[0.5, 0.3, 0.2]) sample = rv.rvs(size=1) # e.g. [[98, 63, 39]] rv.mean() # [100., 60., 40.] # Covariance between topic 1 and topic 2 # Cov = -n * p1 * p2 = -200 * 0.5 * 0.3 = -30 print(rv.cov()) # [[50, -30, -20], [-30, 42, -12], [-20, -12, 32]]
These nine distributions aren't isolated — they form a family connected by limiting cases, special cases, and extensions. Understanding these relationships is the key to choosing the right distribution for your problem. When you see a new random variable, your first question should be: what is this a special case of? What does it converge to in the limit?
The relationship map below shows how the distributions connect. Every arrow represents either a special case (setting a parameter to a specific value) or a limiting case (taking a parameter to infinity). This web of connections means you never need to memorize all nine from scratch — understand a few core distributions (Bernoulli, Poisson, Categorical), and the rest follow as natural extensions.
A useful mental model: Bernoulli and Categorical are the "atoms" — single-trial distributions. Binomial and Multinomial are "molecules" (repeating the atom n times and counting). Geometric and Negative Binomial are "waiting time" versions (repeating until a stopping condition). Poisson is the limiting case of rare events. Hypergeometric is the finite-population version. And the Discrete Uniform is the "maximum ignorance" baseline that says "I have no idea."
Nodes show distributions. Dashed lines show relationships: special cases, extensions, and limiting behaviors. The Normal (Gaussian) appears as the limiting distribution for large sample sizes.
Start with the question you're asking. Each distribution answers a different type of question:
| Question You're Asking | Distribution | AI/Robotics Example |
|---|---|---|
| Yes or no? | Bernoulli | Is this email spam? |
| Which category (one trial)? | Categorical | Which word comes next? |
| How many successes in n trials? | Binomial | How many users convert? |
| How many events in an interval? | Poisson | Server errors per hour |
| Trials until r successes? | Negative Binomial | Overdispersed counts |
| Trials until first success? | Geometric | Packet retransmissions |
| Successes when sampling w/o replacement? | Hypergeometric | Defectives in a sample |
| All outcomes equally likely? | Discrete Uniform | Random action in ε-greedy |
| Counts across K categories in n trials? | Multinomial | Word frequency vectors |
The distributions in this lesson are not independent inventions — they're connected by mathematical relationships that reveal deep structure. If you understand one, you can often derive or approximate another.
In Bayesian statistics, each of these distributions has a natural conjugate prior that makes updating tractable:
| Likelihood | Conjugate Prior | What It Means |
|---|---|---|
| Bernoulli / Binomial | Beta | Prior on success probability p |
| Categorical / Multinomial | Dirichlet | Prior on probability vector (p1,...,pK) |
| Poisson | Gamma | Prior on rate λ |
| Geometric / Neg. Binomial | Beta | Prior on success probability p |
Conjugacy means the posterior has the same functional form as the prior, just with updated parameters. This makes sequential Bayesian updating fast and analytically tractable — no numerical integration needed. For example, with a Beta(1,1) prior (uniform) and 7 heads in 10 coin flips (Binomial likelihood), the posterior is Beta(8, 4), which you can compute in closed form.
| Distribution | Parameters | Mean | Variance |
|---|---|---|---|
| Bernoulli | p | p | p(1−p) |
| Categorical | p1..K | pi | pi(1−pi) |
| Binomial | n, p | np | np(1−p) |
| Poisson | λ | λ | λ |
| Neg. Binomial | r, p | r(1−p)/p | r(1−p)/p² |
| Geometric | p | (1−p)/p | (1−p)/p² |
| Hypergeometric | N, K, n | nK/N | nK(N−K)(N−n)/[N²(N−1)] |
| Discrete Uniform | a, b | (a+b)/2 | ((b−a+1)²−1)/12 |
| Multinomial | n, p1..K | npi | npi(1−pi) |
These distributions are the building blocks for everything else:
This lesson focused on the most common discrete distributions. But the family extends further:
| Distribution | What It Does | When You'd Need It |
|---|---|---|
| Beta-Binomial | Binomial where p is itself random (Beta-distributed) | Overdispersed binary data, hierarchical models |
| Dirichlet-Multinomial | Multinomial where probabilities are Dirichlet-distributed | Document modeling with uncertainty in topic proportions |
| Zero-Inflated Poisson | Mixture of point mass at 0 and Poisson | Count data with excess zeros (rare events) |
| Zipf / Power Law | P(X=k) ∝ k−s | Word frequencies, city sizes, network degrees |
Each of these builds on the foundations covered in this lesson. Once you understand the nine core distributions, these extensions are natural generalizations.
.fit() method can estimate parameters from data for many distributions.
| Distribution | One Sentence |
|---|---|
| Bernoulli | A single coin flip with bias p. |
| Categorical | A single roll of a K-sided die. |
| Binomial | Count of heads in n coin flips. |
| Poisson | Count of events at a constant rate. |
| Neg. Binomial | Number of failures before r-th success. |
| Geometric | Number of failures before first success. |
| Hypergeometric | Successes when drawing without replacement. |
| Discrete Uniform | Every outcome equally likely. |
| Multinomial | Count vector from n rolls of a K-sided die. |
The continuous world awaits. The most important continuous distributions for AI and robotics are:
| Distribution | Discrete Analog | Where Used |
|---|---|---|
| Normal (Gaussian) | Binomial (large n limit) | Everywhere: noise, weights, beliefs |
| Exponential | Geometric | Waiting times, event arrivals |
| Beta | Prior for Bernoulli p | Bayesian inference, Thompson sampling |
| Gamma | Prior for Poisson λ | Rate estimation, mixture models |
| Dirichlet | Prior for Categorical (p1,...,pK) | Topic models, Bayesian RL |
Each continuous distribution pairs naturally with its discrete counterpart through conjugacy, limits, or analogous structure. The discrete foundations you've built in this lesson will make continuous distributions feel like natural extensions rather than entirely new concepts.
With these nine distributions as your vocabulary, you can describe and reason about discrete uncertainty in any AI system. They are the building blocks of probabilistic thinking.
"Probability theory is nothing but common sense reduced to calculation." — Pierre-Simon Laplace