The named families of discrete random variables — Bernoulli, Binomial, Poisson, Geometric, and beyond.
You are building a ride-sharing app. You need to answer questions like: "How many requests will we get in the next minute?" or "How many coin flips until we see heads?" or "Out of 100 servers, how many will crash today?" Each of these has a different shape, a different PMF, a different variance. Are you going to derive the expectation and variance from scratch every single time?
No. You are going to recognize that each of these situations fits a named distribution — a parametric family with pre-derived properties. Once you identify the type and plug in the parameters, you inherit the PMF, expectation, variance, and more for free. Think of it like calling a constructor in a programming language: X = Binomial(n=100, p=0.02).
This chapter introduces six families of discrete distributions. Each one models a different type of random phenomenon. By the end, you will have a toolkit that covers the vast majority of discrete random variables you will encounter in practice.
| Distribution | Models | Parameters |
|---|---|---|
| Bernoulli | Single yes/no trial | p |
| Binomial | Count of successes in n trials | n, p |
| Poisson | Count of events in a fixed interval | λ |
| Geometric | Number of trials until first success | p |
| Negative Binomial | Number of trials until r-th success | r, p |
| Categorical | One draw from k categories | $p_1, \dots, p_k$ |
Notice the relationships already visible in this table. A Bernoulli is a Binomial with n = 1. A Geometric is a Negative Binomial with r = 1. A Binomial becomes a Poisson in a certain limit. These connections are not coincidences — they reflect deep structural relationships that we will derive explicitly.
The simplest parametric random variable. A coin flip, a bit, a yes/no question. Did the server crash? Did the user click? Did the patient test positive? Each of these is a Bernoulli trial — an experiment with exactly two outcomes.
If X is a Bernoulli random variable with parameter p, written X ~ Bern(p), then X = 1 with probability p (success) and X = 0 with probability 1 − p (failure). That is the entire distribution.
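The PMF can be written in one line:

$$P(X = x) = p^x (1 - p)^{1 - x}, \quad x \in \{0, 1\}$$

This compact formula encodes both cases: when $x = 1$, it gives $p^1(1 - p)^0 = p$. When $x = 0$, it gives $p^0(1 - p)^1 = 1 - p$. Elegant.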
Expectation. From the definition of expectation for a discrete random variable:

$$E[X] = \sum_{x} x \, P(X = x) = 0 \cdot (1 - p) + 1 \cdot p = p$$
Variance. First compute $E[X^2]$. Since X only takes values 0 and 1, $X^2 = X$, so $E[X^2] = p$. Then:

$$\mathrm{Var}(X) = E[X^2] - (E[X])^2 = p - p^2 = p(1 - p)$$
Notice that the variance is maximized at p = 0.5, where uncertainty is greatest. At p = 0 or p = 1, there is no randomness, so the variance is zero.
Worked example. A website A/B test shows the new design to each visitor with probability p = 0.3. Let X ~ Bern(0.3) indicate whether a particular visitor sees the new design.
• P(X = 1) = 0.3 • P(X = 0) = 0.7
• E[X] = 0.3 • Var(X) = 0.3 × 0.7 = 0.21
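The worked example can be checked by simulation. The following is a minimal sketch using only Python's standard library; the helper names `bernoulli_sample` and `bernoulli_mean_var`, the seed, and the sample size are illustrative choices, not part of the original text.

```python
import random


def bernoulli_sample(p: float, rng: random.Random) -> int:
    """Return 1 with probability p, otherwise 0."""
    return 1 if rng.random() < p else 0


def bernoulli_mean_var(p: float):
    """Closed-form E[X] and Var(X) for X ~ Bern(p)."""
    return p, p * (1 - p)


rng = random.Random(0)  # fixed seed so the run is repeatable
p = 0.3                 # probability a visitor sees the new design
draws = [bernoulli_sample(p, rng) for _ in range(100_000)]

# Empirical mean and (population) variance of the simulated draws.
sample_mean = sum(draws) / len(draws)
sample_var = sum((x - sample_mean) ** 2 for x in draws) / len(draws)

mean, var = bernoulli_mean_var(p)
print(f"theory:    mean={mean:.3f}, var={var:.3f}")
print(f"simulated: mean={sample_mean:.3f}, var={sample_var:.3f}")
```

With enough draws, the simulated mean and variance should land close to 0.3 and 0.21.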
| Property | Value |
|---|---|
| Notation | X ~ Bern(p) |
| Support | {0, 1} |
| PMF | $p^x (1 - p)^{1 - x}$ |
| E[X] | p |
| Var(X) | p(1 − p) |
Now suppose you repeat a Bernoulli trial n times, independently. You flip a coin 20 times. You test 100 servers. You survey 50 people. Let X count the total number of successes. Then X follows a Binomial distribution, written X ~ Bin(n, p).
The key insight is that a Binomial is just a sum of independent Bernoullis: $X = Y_1 + Y_2 + \dots + Y_n$, where each $Y_i \sim \text{Bern}(p)$.
Deriving the PMF. What is P(X = k), the probability of exactly k successes in n trials? Think of it this way: any specific sequence with k successes and (n − k) failures has probability $p^k (1 - p)^{n - k}$. How many such sequences are there? We must choose which k of the n trials are successes, and there are C(n, k) ways to do that. So:

$$P(X = k) = C(n, k) \, p^k (1 - p)^{n - k}, \quad k = 0, 1, \dots, n$$

where C(n, k) = n! / (k!(n − k)!) is the binomial coefficient from Chapter 1.
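Expectation via indicator decomposition. This is the elegant approach. Since $X = \sum_{i=1}^{n} Y_i$ and each $Y_i \sim \text{Bern}(p)$:

$$E[X] = E\left[\sum_{i=1}^{n} Y_i\right] = \sum_{i=1}^{n} E[Y_i] = np$$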
No PMF manipulation needed. The linearity of expectation does all the work, and it doesn't even require independence.
Variance. Because the $Y_i$ are independent, we can also sum the variances:

$$\mathrm{Var}(X) = \sum_{i=1}^{n} \mathrm{Var}(Y_i) = np(1 - p)$$
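The Binomial results $E[X] = np$ and $\mathrm{Var}(X) = np(1 - p)$ can be confirmed directly from the PMF. This sketch sums over the whole support for the "100 servers" scenario, taking p = 0.02 from the constructor analogy earlier in the chapter; the function name `binomial_pmf` is an illustrative choice.

```python
from math import comb


def binomial_pmf(k: int, n: int, p: float) -> float:
    """P(X = k) for X ~ Bin(n, p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


n, p = 100, 0.02  # 100 servers, each crashing with probability 0.02 (assumed)
pmf = [binomial_pmf(k, n, p) for k in range(n + 1)]

# Sum over the support {0, ..., n} to recover total mass, mean, and variance.
total = sum(pmf)
mean = sum(k * q for k, q in enumerate(pmf))
var = sum((k - mean) ** 2 * q for k, q in enumerate(pmf))

print(f"sum of PMF = {total:.6f}")
print(f"mean = {mean:.4f}  (np = {n * p:.4f})")
print(f"var  = {var:.4f}  (np(1-p) = {n * p * (1 - p):.4f})")
```

The brute-force sums agree with the closed forms, which is the payoff of recognizing the family: no summation is needed once n and p are known.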
[Interactive figure: Binomial PMF explorer. Sliders set n and p; bars show P(X = k) for each k, and a dotted line marks E[X].] As n grows with moderate p, the PMF takes on a bell shape.
| Property | Value |
|---|---|
| Notation | X ~ Bin(n, p) |
| Support | {0, 1, …, n} |
| PMF | $C(n,k) \, p^k (1 - p)^{n - k}$ |
| E[X] | np |
| Var(X) | np(1−p) |
You are monitoring a web server. From historical data you know it averages λ = 5 requests per minute. What is the probability of getting exactly 3 requests in the next minute? Or 0? Or 10?
The Poisson distribution models the count of events in a fixed interval when events occur at a constant average rate λ and independently of each other. We write X ~ Poi(λ).
Deriving the Poisson from the Binomial limit. This is one of the most beautiful derivations in probability. Here is the idea: split one minute into n tiny buckets. Each bucket is so small that at most one event can occur in it. The probability of an event in any single bucket is λ/n (chosen so that the expected total is λ). Then the total count is X ~ Bin(n, λ/n).
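The approximation improves as n grows. At n = 60 (one-second buckets), it is rough. At n = 600 (decisecond buckets), it is better. As n → ∞, it becomes exact. Let's take the limit of the Binomial PMF:

$$P(X = k) = \lim_{n \to \infty} C(n, k) \left(\frac{\lambda}{n}\right)^{k} \left(1 - \frac{\lambda}{n}\right)^{n - k}$$

Expanding C(n,k) = n!/(k!(n−k)!) and separating terms:

$$P(X = k) = \lim_{n \to \infty} \frac{n(n-1)\cdots(n-k+1)}{n^k} \cdot \frac{\lambda^k}{k!} \cdot \left(1 - \frac{\lambda}{n}\right)^{n} \cdot \left(1 - \frac{\lambda}{n}\right)^{-k}$$

Now apply three limit facts: (1) $n(n-1)\cdots(n-k+1)/n^k \to 1$ as n → ∞, because the numerator is a product of k factors that each grow like n; (2) $(1 - \lambda/n)^n \to e^{-\lambda}$, the standard limit that defines the exponential; (3) $(1 - \lambda/n)^{-k} \to 1$ since k is fixed. Everything collapses:

$$P(X = k) = \frac{\lambda^k e^{-\lambda}}{k!}, \quad k = 0, 1, 2, \dots$$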
Expectation and variance. Since the Poisson is the limit of Bin(n, λ/n), the expectation is the limit of n · (λ/n) = λ. The variance is the limit of n · (λ/n) · (1 − λ/n) = λ(1 − λ/n) → λ. So both the mean and variance of a Poisson are λ. This is a distinctive fingerprint: if you see data where the sample mean roughly equals the sample variance, think Poisson.
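The Binomial-to-Poisson limit is easy to watch numerically. The sketch below evaluates P(X = 3) under Bin(n, λ/n) for growing n and compares it with the Poisson value at λ = 5, the web-server rate from the start of this section. The bucket counts and function names are illustrative choices.

```python
from math import comb, exp, factorial


def binomial_pmf(k: int, n: int, p: float) -> float:
    """P(X = k) for X ~ Bin(n, p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def poisson_pmf(k: int, lam: float) -> float:
    """P(X = k) for X ~ Poi(lam)."""
    return lam**k * exp(-lam) / factorial(k)


lam, k = 5.0, 3
target = poisson_pmf(k, lam)

gaps = []
for n in (60, 600, 6000, 60000):  # finer and finer time buckets
    approx = binomial_pmf(k, n, lam / n)
    gaps.append(abs(approx - target))
    print(f"n={n:>6}: Bin={approx:.6f}  Poi={target:.6f}  gap={gaps[-1]:.2e}")
```

The gap shrinks roughly in proportion to 1/n, which matches the intuition that each tenfold refinement of the buckets buys about one more correct digit.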
Worked example. A call center gets λ = 8 calls per hour. What is P(exactly 5 calls in the next hour)?

$$P(X = 5) = \frac{8^5 e^{-8}}{5!} = \frac{32768 \, e^{-8}}{120} \approx 0.0916$$
So about a 9.2% chance of exactly 5 calls. The expected count is 8, and the standard deviation is √8 ≈ 2.83.
Changing time frames. If events arrive at rate λ per minute, then in t minutes the count follows Poi(λt). For 20 minutes at λ = 5 per minute, use Poi(100).
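Both the call-center calculation and the change of time frame can be reproduced in a few lines. This is a sketch, not a library API: `poisson_pmf` and `rescale_rate` are hypothetical helper names, and the PMF is computed in log space (via `math.lgamma`) only so that large values such as λ = 100 do not overflow.

```python
from math import exp, lgamma, log


def poisson_pmf(k: int, lam: float) -> float:
    """P(X = k) for X ~ Poi(lam), lam > 0, computed in log space."""
    return exp(k * log(lam) - lam - lgamma(k + 1))


def rescale_rate(rate_per_unit: float, units: float) -> float:
    """Rate for a window of `units` time units, given the per-unit rate."""
    return rate_per_unit * units


# Call center: lambda = 8 calls per hour, exactly 5 calls in the next hour.
p5 = poisson_pmf(5, 8.0)
print(f"P(X = 5 | lam = 8) = {p5:.4f}")

# Web server: 5 requests per minute observed over a 20-minute window.
lam_20min = rescale_rate(5.0, 20)
p_mode = poisson_pmf(100, lam_20min)
print(f"20-minute rate = {lam_20min:.0f}")
print(f"P(X = 100 | lam = 100) = {p_mode:.4f}")
```

Note that even the most likely single count under Poi(100) has a small probability, because the mass is spread over a standard deviation of √100 = 10.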
[Interactive figure: orange bars show Bin(n, λ/n) and teal dots show the true Poisson(λ); a slider for n shows the two converging as n increases.]
| Property | Value |
|---|---|
| Notation | X ~ Poi(λ) |
| Support | {0, 1, 2, …} |
| PMF | $\lambda^k e^{-\lambda} / k!$ |
| E[X] | λ |
| Var(X) | λ |
You are flipping a coin repeatedly and waiting for the first heads. You are refreshing a webpage until the server responds. You are interviewing candidates until you find a qualified one. In each case, you are counting how many trials until the first success. This is the Geometric distribution.
If X ~ Geo(p), then X is the number of independent Bernoulli(p) trials needed to get the first success. X can take values 1, 2, 3, … (it starts at 1 because the success trial itself is counted).
Deriving the PMF. For X = k, we need k − 1 failures followed by one success. Each failure has probability (1 − p) and the success has probability p. Since the trials are independent:

$$P(X = k) = (1 - p)^{k - 1} p, \quad k = 1, 2, 3, \dots$$
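Expectation. We derive E[X] = 1/p. Intuitively: if each trial succeeds with probability p = 0.1, you expect to wait about 1/0.1 = 10 trials. Formally:

$$E[X] = \sum_{k=1}^{\infty} k (1 - p)^{k - 1} p = p \cdot \frac{1}{\bigl(1 - (1 - p)\bigr)^2} = \frac{p}{p^2} = \frac{1}{p}$$

(using the identity $\sum_{k=1}^{\infty} k x^{k-1} = 1/(1 - x)^2$ for |x| < 1, with x = 1 − p).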
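Variance. A similar series calculation gives $E[X^2] = (2 - p)/p^2$, so:

$$\mathrm{Var}(X) = E[X^2] - (E[X])^2 = \frac{2 - p}{p^2} - \frac{1}{p^2} = \frac{1 - p}{p^2}$$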
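Proof of memorylessness. The Geometric is memoryless: having already seen s failures does not change the distribution of the remaining wait. To see this, note that $P(X > k) = (1 - p)^k$ (the first k trials all fail). Then:

$$P(X > s + t \mid X > s) = \frac{P(X > s + t)}{P(X > s)} = \frac{(1 - p)^{s + t}}{(1 - p)^{s}} = (1 - p)^{t} = P(X > t)$$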
Worked example. A basketball player makes free throws with p = 0.75. Let X = number of attempts until the first make. X ~ Geo(0.75).
• P(X = 1) = 0.75 (makes it on the first try)
• P(X = 2) = 0.25 × 0.75 = 0.1875 (miss, then make)
• P(X = 3) = $0.25^2$ × 0.75 = 0.0625 × 0.75 = 0.046875
• E[X] = 1/0.75 = 4/3 ≈ 1.33 attempts
• Var(X) = 0.25/0.5625 = 4/9 ≈ 0.444
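The free-throw numbers, and the memoryless property, can be verified with a short script. This is a sketch with illustrative names (`geometric_pmf`, `geometric_tail`); the infinite sums are truncated at 200 terms, which is an assumption that is harmless here because the tail beyond that point is negligible for p = 0.75.

```python
def geometric_pmf(k: int, p: float) -> float:
    """P(X = k) for X ~ Geo(p), with support k = 1, 2, 3, ..."""
    return (1 - p) ** (k - 1) * p


def geometric_tail(k: int, p: float) -> float:
    """P(X > k): the first k trials all fail."""
    return (1 - p) ** k


p = 0.75
first_three = [geometric_pmf(k, p) for k in (1, 2, 3)]
print("P(X=1), P(X=2), P(X=3) =", first_three)

# Truncated sums standing in for the infinite series.
ks = range(1, 200)
mean = sum(k * geometric_pmf(k, p) for k in ks)
var = sum((k - mean) ** 2 * geometric_pmf(k, p) for k in ks)
print(f"mean = {mean:.4f}  (1/p = {1 / p:.4f})")
print(f"var  = {var:.4f}  ((1-p)/p^2 = {(1 - p) / p**2:.4f})")

# Memorylessness: P(X > s + t | X > s) should equal P(X > t).
s, t = 4, 2
conditional = geometric_tail(s + t, p) / geometric_tail(s, p)
print(f"P(X > {s + t} | X > {s}) = {conditional:.6f}")
print(f"P(X > {t})         = {geometric_tail(t, p):.6f}")
```

The last two printed values agree for any choice of s and t, which is exactly what memorylessness says: four misses in a row tell you nothing about how many more attempts are needed.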
| Property | Value |
|---|---|
| Notation | X ~ Geo(p) |
| Support | {1, 2, 3, …} |
| PMF | $(1 - p)^{k - 1} p$ |
| E[X] | 1/p |
| Var(X) | $(1 - p)/p^2$ |
| Special | Memoryless |
The Geometric counts trials until the first success. What if you want the r-th success? For example: how many patients must you screen until you find 3 who are eligible for a clinical trial? This is the Negative Binomial, written X ~ NegBin(r, p).
X is the total number of trials required to accumulate r successes. Each trial is an independent Bernoulli(p). The support is x ∈ {r, r+1, r+2, …} because you need at least r trials.
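Deriving the PMF. To have the r-th success on trial x, two things must happen: (1) among the first x − 1 trials, exactly r − 1 were successes, and (2) trial x is a success. Condition (1) is a Binomial count, and (2) is one more factor of p:

$$P(X = x) = C(x - 1, r - 1) \, p^{r - 1} (1 - p)^{x - r} \cdot p = C(x - 1, r - 1) \, p^{r} (1 - p)^{x - r}, \quad x = r, r + 1, \dots$$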
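Expectation and variance. Since X is the sum of r independent Geometric(p) waiting times (wait for the 1st success, then the 2nd, …, then the r-th), expectations add by linearity and variances add by independence:

$$E[X] = r \cdot \frac{1}{p} = \frac{r}{p}, \qquad \mathrm{Var}(X) = r \cdot \frac{1 - p}{p^2} = \frac{r(1 - p)}{p^2}$$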
Worked example. A recruiter interviews candidates where each has probability p = 0.2 of being qualified. Let X = number of interviews to find r = 3 qualified candidates. X ~ NegBin(3, 0.2).
• E[X] = 3/0.2 = 15 interviews
• Var(X) = 3 × 0.8 / 0.04 = 2.4 / 0.04 = 60
• SD(X) = √60 ≈ 7.75 interviews
• P(X = 3) = C(2,2) × $0.2^3$ × $0.8^0$ = 1 × 0.008 × 1 = 0.008 (all three immediately qualified, which is rare!)
• P(X = 5) = C(4,2) × $0.2^3$ × $0.8^2$ = 6 × 0.008 × 0.64 = 0.03072
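So the third qualified candidate turns up on exactly the fifth interview only about 3% of the time (and within the first five interviews only about 5.8% of the time, adding P(X = 3), P(X = 4), and P(X = 5)). The expected 15 interviews really drives home how long the search can take when p is small.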
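The recruiter example can be recomputed from the PMF. The sketch below uses an illustrative helper `negbin_pmf` and truncates the infinite support at 1000 interviews, an assumption that leaves out only a vanishingly small tail for p = 0.2.

```python
from math import comb


def negbin_pmf(x: int, r: int, p: float) -> float:
    """P(X = x): the r-th success lands on trial x, for X ~ NegBin(r, p)."""
    if x < r:
        return 0.0  # fewer than r trials cannot contain r successes
    return comb(x - 1, r - 1) * p**r * (1 - p) ** (x - r)


r, p = 3, 0.2
p3 = negbin_pmf(3, r, p)
p5 = negbin_pmf(5, r, p)
within_5 = sum(negbin_pmf(x, r, p) for x in range(r, 6))
print(f"P(X = 3)  = {p3:.5f}")
print(f"P(X = 5)  = {p5:.5f}")
print(f"P(X <= 5) = {within_5:.5f}")

# Truncated sums standing in for the infinite series.
xs = range(r, 1000)
mean = sum(x * negbin_pmf(x, r, p) for x in xs)
var = sum((x - mean) ** 2 * negbin_pmf(x, r, p) for x in xs)
print(f"mean = {mean:.3f}  (r/p = {r / p:.3f})")
print(f"var  = {var:.3f}  (r(1-p)/p^2 = {r * (1 - p) / p**2:.3f})")
```

Setting r = 1 in the same function reproduces the Geometric PMF, which mirrors the relationship noted in the overview table.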
| Property | Value |
|---|---|
| Notation | X ~ NegBin(r, p) |
| Support | {r, r+1, r+2, …} |
| PMF | $C(x - 1, r - 1) \, p^{r} (1 - p)^{x - r}$ |
| E[X] | r/p |
| Var(X) | $r(1 - p)/p^2$ |
So far, every distribution has taken numerical values — 0 and 1, counts, trial numbers. But what about a random variable for today's weather? Or the color of a randomly chosen M&M? The values are categories, not numbers.
A Categorical distribution is the generalization of Bernoulli to more than two outcomes. A random variable X is categorical if it takes one of k possible values $\{c_1, c_2, \dots, c_k\}$ with probabilities $\{p_1, p_2, \dots, p_k\}$ where $p_1 + p_2 + \dots + p_k = 1$.
The PMF is simply a table:
| Value | Probability |
|---|---|
| Sunny | P(X = Sunny) = 0.49 |
| Cloudy | P(X = Cloudy) = 0.30 |
| Rainy | P(X = Rainy) = 0.20 |
| Snowy | P(X = Snowy) = 0.01 |
The probabilities must sum to 1.0. Since the values are not numbers, there is no meaningful expectation or variance in the numerical sense. You cannot average "Sunny" and "Rainy."
Connection to the Multinomial. Just as Binomial counts successes over n Bernoulli trials, the Multinomial distribution counts how many times each category appears over n Categorical trials. If you roll a 6-sided die 100 times and count how many of each face appear, the vector $(\text{count}_1, \dots, \text{count}_6)$ follows a Multinomial. We will meet this in a later chapter.
Worked example. A user visits a website and clicks one of four buttons with probabilities: Home (0.45), Products (0.30), About (0.15), Contact (0.10). Let X be the button clicked. X is Categorical with k = 4.
• P(X = Home) = 0.45
• P(X ≠ Home) = 1 − 0.45 = 0.55
• P(X = Products or X = About) = 0.30 + 0.15 = 0.45
We can answer probability questions, but not compute E[X] (what is the "average" of Home and Contact?).
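A Categorical variable is naturally represented as a mapping from labels to probabilities. The sketch below encodes the button example that way, answers the same probability questions, and draws simulated clicks with `random.choices` from the standard library. The variable names, seed, and sample size are illustrative choices.

```python
import random
from collections import Counter

# PMF of the clicked button, stored as label -> probability.
probs = {"Home": 0.45, "Products": 0.30, "About": 0.15, "Contact": 0.10}
assert abs(sum(probs.values()) - 1.0) < 1e-12  # probabilities must sum to 1

# Probability questions reduce to adding or subtracting table entries.
p_not_home = 1 - probs["Home"]
p_products_or_about = probs["Products"] + probs["About"]
print(f"P(X != Home) = {p_not_home:.2f}")
print(f"P(X = Products or About) = {p_products_or_about:.2f}")

# Simulate n independent clicks; the per-button counts form one Multinomial draw.
rng = random.Random(0)
n = 100_000
clicks = rng.choices(list(probs), weights=list(probs.values()), k=n)
counts = Counter(clicks)
for button, p in probs.items():
    print(f"{button:9s} theory={p:.2f}  simulated={counts[button] / n:.3f}")
```

Notice that nothing in the code averages the labels themselves; only the counts and frequencies are numeric.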
Now it is time to play. The interactive explorer below lets you switch between all five numeric distributions and adjust their parameters in real time. Watch how the PMF shape, expectation, and variance change. Try to build intuition for how each parameter controls the distribution.
[Interactive figure: distribution explorer. A selector chooses the distribution and sliders set its parameters; bars show the PMF, a dashed line marks E[X], and a shaded region shows ±1 standard deviation from the mean.]
With six distributions in our toolkit, it is easy to mix them up. This section puts them side by side and sharpens the distinctions. The key question for pattern recognition is: what is the random variable counting?
| Distribution | What X counts | Trials | Support | E[X] | Var(X) |
|---|---|---|---|---|---|
| Bern(p) | Success on one trial | 1 | {0,1} | p | p(1−p) |
| Bin(n,p) | Successes in n trials | n (fixed) | {0,…,n} | np | np(1−p) |
| Poi(λ) | Events in fixed interval | ∞ (limit) | {0,1,…} | λ | λ |
| Geo(p) | Trials to 1st success | Random | {1,2,…} | 1/p | $(1 - p)/p^2$ |
| NegBin(r,p) | Trials to r-th success | Random | {r,r+1,…} | r/p | $r(1 - p)/p^2$ |
| Categorical | Which category | 1 | $\{c_1, \dots, c_k\}$ | N/A* | N/A* |
*Categorical has E[X] only if categories happen to be numeric.
[Interactive figure: two distributions plotted together, with a selector for the pair and sliders for their shared parameters, to compare shapes.]
Decision flowchart. When you encounter a discrete random variable in the wild, ask these questions in order:
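One way to phrase such questions is to read them off the "What X counts" column of the comparison table. The sketch below is a possible ordering under that assumption, not the original flowchart; the function `suggest_distribution` and its argument names are hypothetical.

```python
from typing import Optional


def suggest_distribution(
    numeric: bool,
    counts_events_in_interval: bool = False,
    fixed_trials: Optional[int] = None,
    successes_needed: Optional[int] = None,
) -> str:
    """Map answers about what X counts to one of the six families."""
    # Q1: are the outcomes categories rather than numbers?
    if not numeric:
        return "Categorical"
    # Q2: is X a count of events in a fixed interval at a constant average rate?
    if counts_events_in_interval:
        return "Poisson"
    # Q3: is X the number of successes in a fixed number of independent trials?
    if fixed_trials is not None:
        return "Bernoulli" if fixed_trials == 1 else "Binomial"
    # Q4: is X the number of trials needed to reach a target number of successes?
    if successes_needed is not None:
        return "Geometric" if successes_needed == 1 else "Negative Binomial"
    return "none of the six"


# The three motivating questions from the start of the chapter.
print(suggest_distribution(numeric=True, counts_events_in_interval=True))  # requests per minute
print(suggest_distribution(numeric=True, successes_needed=1))              # flips until heads
print(suggest_distribution(numeric=True, fixed_trials=100))                # crashes among 100 servers
```

The ordering matters less than the habit: pin down what is fixed (the number of trials, the number of successes, or the length of the interval) and what is random.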
This chapter gave you six named distributions — the essential toolkit for modeling discrete random phenomena. Let's connect them to the bigger picture.
Looking back. Chapter 5 introduced random variables. Chapter 6 gave us expectation and variance as summary tools. This chapter shows that many real-world random variables fall into known families, so we rarely need to compute PMFs from scratch — we just identify the type and plug in parameters.
Looking forward. Chapter 8 moves from discrete to continuous distributions — random variables that take real-valued outcomes. The ideas are parallel: each continuous distribution has a density function (analogous to the PMF), expectation, and variance. The most important continuous distribution — the Normal — will turn out to be the limiting shape that the Binomial and Poisson approach for large parameters. That is the Central Limit Theorem, and it is one of the most profound results in all of mathematics.
One last connection worth noting: the Exponential distribution (Chapter 8) is the continuous analog of the Geometric. Just as the Geometric is memoryless among discrete distributions, the Exponential is memoryless among continuous distributions. And just as the Poisson counts events in an interval, the Exponential models the time between Poisson events. These pairs — Geometric/Exponential and Poisson/Exponential — are two sides of the same coin.