Ch 7: Discrete Distributions — Piech Probability CS

Chapter 0: Why Distributions?

You are building a ride-sharing app. You need to answer questions like: "How many requests will we get in the next minute?" or "How many coin flips until we see heads?" or "Out of 100 servers, how many will crash today?" Each of these has a different shape, a different PMF, a different variance. Are you going to derive the expectation and variance from scratch every single time?

No. You are going to recognize that each of these situations fits a named distribution — a parametric family with pre-derived properties. Once you identify the type and plug in the parameters, you inherit the PMF, expectation, variance, and more for free. Think of it like calling a constructor in a programming language: X = Binomial(n=100, p=0.02).

This chapter introduces six families of discrete distributions. Each one models a different type of random phenomenon. By the end, you will have a toolkit that covers the vast majority of discrete random variables you will encounter in practice.

The core idea: Named distributions are reusable abstractions. If you can argue that your random variable fits one of these families, you get its PMF, expectation, and variance instantly — no re-derivation needed. The skill is pattern recognition: which distribution matches this scenario?

Distribution	Models	Parameters
Bernoulli	Single yes/no trial	p
Binomial	Count of successes in n trials	n, p
Poisson	Count of events in a fixed interval	λ
Geometric	Number of trials until first success	p
Negative Binomial	Number of trials until r-th success	r, p
Categorical	One draw from k categories	p₁, …, p_k

Notice the relationships already visible in this table. A Bernoulli is a Binomial with n = 1. A Geometric is a Negative Binomial with r = 1. A Binomial becomes a Poisson in a certain limit. These connections are not coincidences — they reflect deep structural relationships that we will derive explicitly.

How to use this chapter: For each distribution, we will (1) motivate it with a concrete scenario, (2) state the PMF, (3) derive expectation and variance, (4) work a numerical example with full arithmetic, and (5) test your understanding with a quiz. By the end, you will recognize each pattern on sight.

Check: What is the main advantage of recognizing that a random variable follows a named distribution?

It makes the variable deterministic You inherit pre-derived PMF, expectation, and variance without re-deriving them It guarantees the variable has finite support

Chapter 1: The Bernoulli Distribution

The simplest parametric random variable. A coin flip, a bit, a yes/no question. Did the server crash? Did the user click? Did the patient test positive? Each of these is a Bernoulli trial — an experiment with exactly two outcomes.

If X is a Bernoulli random variable with parameter p, written X ~ Bern(p), then X = 1 with probability p (success) and X = 0 with probability 1 − p (failure). That is the entire distribution.

P(X = x) = p^x(1 − p)^1−x, x ∈ {0, 1}

This compact formula encodes both cases: when x = 1, it gives p¹(1 − p)⁰ = p. When x = 0, it gives p⁰(1 − p)¹ = 1 − p. Elegant.

Expectation. From the definition of expectation for a discrete random variable:

E[X] = ∑_x x · P(X = x) = 0 · (1 − p) + 1 · p = p

Variance. First compute E[X²]. Since X only takes values 0 and 1, X² = X, so E[X²] = p. Then:

Var(X) = E[X²] − (E[X])² = p − p² = p(1 − p)

Notice that the variance is maximized at p = 0.5, where uncertainty is greatest. At p = 0 or p = 1, there is no randomness, so the variance is zero.

Indicator variables: An indicator variable is a Bernoulli that encodes whether an event A occurred: I_A = 1 if A happens, 0 otherwise. Then E[I_A] = P(A). This trick — converting events into numbers — is enormously useful. We will use it to derive the Binomial expectation painlessly.

Worked example. A website A/B test shows the new design to each visitor with probability p = 0.3. Let X ~ Bern(0.3) indicate whether a particular visitor sees the new design.

• P(X = 1) = 0.3 • P(X = 0) = 0.7

• E[X] = 0.3 • Var(X) = 0.3 × 0.7 = 0.21

Property	Value
Notation	X ~ Bern(p)
Support	{0, 1}
PMF	p^x(1 − p)^1−x
E[X]	p
Var(X)	p(1 − p)

Check: If X ~ Bern(0.8), what is Var(X)?

0.64 0.16 0.80

Chapter 2: The Binomial Distribution

Now suppose you repeat a Bernoulli trial n times, independently. You flip a coin 20 times. You test 100 servers. You survey 50 people. Let X count the total number of successes. Then X follows a Binomial distribution, written X ~ Bin(n, p).

The key insight is that a Binomial is just a sum of independent Bernoullis: X = Y₁ + Y₂ + … + Y_n, where each Y_i ~ Bern(p).

Deriving the PMF. What is P(X = k) — the probability of exactly k successes in n trials? Think of it this way: any specific sequence with k successes and (n − k) failures has probability p^k(1 − p)^n−k. How many such sequences are there? We must choose which k of the n trials are successes — that is C(n, k). So:

P(X = k) = C(n, k) · p^k · (1 − p)^n−k, k = 0, 1, …, n

where C(n, k) = n! / (k!(n − k)!) is the binomial coefficient from Chapter 1.

Expectation via indicator decomposition. This is the elegant approach. Since X = ∑ Y_i and each Y_i ~ Bern(p):

E[X] = E[∑_i=1ⁿ Y_i] = ∑_i=1ⁿ E[Y_i] = ∑_i=1ⁿ p = n · p

No PMF manipulation needed. The linearity of expectation does all the work, and it doesn't even require independence.

Variance. Because the Y_i are independent, we can also sum the variances:

Var(X) = ∑_i=1ⁿ Var(Y_i) = ∑_i=1ⁿ p(1 − p) = n · p · (1 − p)

Worked example. A fair die is rolled 18 times. Let X = number of sixes. Then X ~ Bin(18, 1/6).
• E[X] = 18 × (1/6) = 3
• Var(X) = 18 × (1/6) × (5/6) = 18 × 5/36 = 90/36 = 2.5
• P(X = 3) = C(18,3) × (1/6)³ × (5/6)¹⁵
   = 816 × (1/216) × (5¹⁵/6¹⁵)
   = 816 × (1/216) × (30517578125 / 470184984576)
   ≈ 816 × 0.000300 ≈ 0.2452

Below, an interactive simulation lets you see the Binomial PMF change shape as you adjust n and p. Notice how the bell curve emerges for large n and moderate p.

Binomial PMF Explorer

Drag the sliders to change n and p. The bars show P(X = k) for each k. The dotted line marks E[X].

n20

p0.50

E[X] = 10.00   Var(X) = 5.00

Property	Value
Notation	X ~ Bin(n, p)
Support	{0, 1, …, n}
PMF	C(n,k) p^k (1−p)^n−k
E[X]	np
Var(X)	np(1−p)

Check: If X ~ Bin(10, 0.3), what is E[X]?

3 7 0.3

Chapter 3: The Poisson Distribution

You are monitoring a web server. From historical data you know it averages λ = 5 requests per minute. What is the probability of getting exactly 3 requests in the next minute? Or 0? Or 10?

The Poisson distribution models the count of events in a fixed interval when events occur at a constant average rate λ and independently of each other. We write X ~ Poi(λ).

Deriving the Poisson from the Binomial limit. This is one of the most beautiful derivations in probability. Here is the idea: split one minute into n tiny buckets. Each bucket is so small that at most one event can occur in it. The probability of an event in any single bucket is λ/n (chosen so that the expected total is λ). Then the total count is X ~ Bin(n, λ/n).

The approximation improves as n grows. At n = 60 (one-second buckets), it is rough. At n = 600 (decisecond buckets), it is better. At n → ∞, it becomes exact. Let's take the limit of the Binomial PMF:

P(X = k) = lim_n→∞ C(n,k) · (λ/n)^k · (1 − λ/n)^n−k

Expanding C(n,k) = n!/(k!(n−k)!) and separating terms:

= lim_n→∞ [n(n−1)…(n−k+1) / n^k] · (λ^k / k!) · (1 − λ/n)ⁿ · (1 − λ/n)^−k

Now apply three limit facts: (1) n(n−1)…(n−k+1)/n^k → 1 as n → ∞, (2) (1 − λ/n)ⁿ → e^−λ by the definition of e, (3) (1 − λ/n)^−k → 1 since k is fixed. Everything collapses:

P(X = k) = λ^k e^−λ / k!

The Poisson PMF: P(X = k) = λ^k e^−λ / k! for k = 0, 1, 2, … This emerged as the Binomial in the limit of many trials with small per-trial probability. The single parameter λ captures both the "number of trials" and "probability per trial" — it is just the expected count.

Expectation and variance. Since the Poisson is the limit of Bin(n, λ/n), the expectation is the limit of n · (λ/n) = λ. The variance is the limit of n · (λ/n) · (1 − λ/n) = λ(1 − λ/n) → λ. So both the mean and variance of a Poisson are λ. This is a distinctive fingerprint: if you see data where the sample mean roughly equals the sample variance, think Poisson.

E[X] = λ, Var(X) = λ

Worked example. A call center gets λ = 8 calls per hour. What is P(exactly 5 calls in the next hour)?

P(X = 5) = 8⁵ · e⁻⁸ / 5! = 32768 × 0.000335 / 120 = 10.987 / 120 ≈ 0.0916

So about a 9.2% chance of exactly 5 calls. The expected count is 8, and the standard deviation is √8 ≈ 2.83.

Changing time frames. If events arrive at rate λ per minute, then in t minutes the count follows Poi(λt). For 20 minutes at λ = 5 per minute, use Poi(100).

Poisson: From Binomial to Limit

The orange bars show Bin(n, λ/n). The teal dots show the true Poisson(λ). As n increases, they converge. Drag n to see the limit in action.

λ5

n (buckets)20

Property	Value
Notation	X ~ Poi(λ)
Support	{0, 1, 2, …}
PMF	λ^k e^−λ / k!
E[X]	λ
Var(X)	λ

Check: As n → ∞ in Bin(n, λ/n), what does the variance converge to?

λ λ² 0

Chapter 4: The Geometric Distribution

You are flipping a coin repeatedly and waiting for the first heads. You are refreshing a webpage until the server responds. You are interviewing candidates until you find a qualified one. In each case, you are counting how many trials until the first success. This is the Geometric distribution.

If X ~ Geo(p), then X is the number of independent Bernoulli(p) trials needed to get the first success. X can take values 1, 2, 3, … (it starts at 1 because the success trial itself is counted).

Deriving the PMF. For X = k, we need k − 1 failures followed by one success. Each failure has probability (1 − p) and the success has probability p. Since the trials are independent:

P(X = k) = (1 − p)^k−1 · p, k = 1, 2, 3, …

Expectation. We derive E[X] = 1/p. Intuitively: if each trial succeeds with probability p = 0.1, you expect to wait about 1/0.1 = 10 trials. Formally:

E[X] = ∑_k=1^∞ k · (1−p)^k−1 p = p ∑_k=1^∞ k(1−p)^k−1 = p · 1/(1−(1−p))² = p / p² = 1/p

(using the identity ∑_k=1^∞ kx^k−1 = 1/(1−x)² for |x| < 1, with x = 1 − p).

Variance.

Var(X) = (1 − p) / p²

The memoryless property. The Geometric is the only discrete distribution with this remarkable property: P(X > s + t | X > s) = P(X > t). In words: given that you have already waited s trials without success, the probability of waiting at least t more is the same as if you had just started. The past does not help you predict the future. Each trial is a fresh start.

Proof of memorylessness. P(X > k) = (1 − p)^k (all k trials fail). Then:

P(X > s + t | X > s) = P(X > s + t) / P(X > s) = (1−p)^s+t / (1−p)^s = (1−p)^t = P(X > t)

Worked example. A basketball player makes free throws with p = 0.75. Let X = number of attempts until the first make. X ~ Geo(0.75).

• P(X = 1) = 0.75 (makes it on the first try)

• P(X = 2) = 0.25 × 0.75 = 0.1875 (miss, then make)

• P(X = 3) = 0.25² × 0.75 = 0.0625 × 0.75 = 0.046875

• E[X] = 1/0.75 = 4/3 ≈ 1.33 attempts

• Var(X) = 0.25/0.5625 = 4/9 ≈ 0.444

Property	Value
Notation	X ~ Geo(p)
Support	{1, 2, 3, …}
PMF	(1−p)^k−1 p
E[X]	1/p
Var(X)	(1−p)/p²
Special	Memoryless

Check: You have been flipping a fair coin (p = 0.5) and gotten 10 tails in a row. What is the probability the next flip is heads?

0.5, by the memoryless property Greater than 0.5, because heads is "due" Less than 0.5, because the coin might be biased

Chapter 5: The Negative Binomial

The Geometric counts trials until the first success. What if you want the r-th success? For example: how many patients must you screen until you find 3 who are eligible for a clinical trial? This is the Negative Binomial, written X ~ NegBin(r, p).

X is the total number of trials required to accumulate r successes. Each trial is an independent Bernoulli(p). The support is x ∈ {r, r+1, r+2, …} because you need at least r trials.

Deriving the PMF. To have the r-th success on trial x, two things must happen: (1) among the first x − 1 trials, exactly r − 1 were successes, and (2) trial x is a success. Condition (1) is a Binomial count, and (2) is one more factor of p:

P(X = x) = C(x−1, r−1) · p^r · (1 − p)^x−r, x = r, r+1, …

Expectation and variance. Since X is the sum of r independent Geometric(p) waiting times (wait for the 1st success, then the 2nd, …, then the r-th):

E[X] = r/p, Var(X) = r(1 − p)/p²

Connection to Geometric: A Geometric(p) is exactly NegBin(1, p). Everything about the Geometric — PMF, expectation, variance — is the special case r = 1. The Negative Binomial generalizes it.

Worked example. A recruiter interviews candidates where each has probability p = 0.2 of being qualified. Let X = number of interviews to find r = 3 qualified candidates. X ~ NegBin(3, 0.2).

• E[X] = 3/0.2 = 15 interviews

• Var(X) = 3 × 0.8 / 0.04 = 2.4 / 0.04 = 60

• SD(X) = √60 ≈ 7.75 interviews

• P(X = 3) = C(2,2) × 0.2³ × 0.8⁰ = 1 × 0.008 × 1 = 0.008 (all three immediately qualified — rare!)

• P(X = 5) = C(4,2) × 0.2³ × 0.8² = 6 × 0.008 × 0.64 = 0.03072

So even finding 3 qualified people in 5 interviews is only a ~3% chance. The expected 15 interviews really drives home how long the search can take when p is small.

Property	Value
Notation	X ~ NegBin(r, p)
Support	{r, r+1, r+2, …}
PMF	C(x−1, r−1) p^r (1−p)^x−r
E[X]	r/p
Var(X)	r(1−p)/p²

Check: What distribution is NegBin(1, p) equivalent to?

Binomial(1, p) Geometric(p) Poisson(p)

Chapter 6: The Categorical Distribution

So far, every distribution has taken numerical values — 0 and 1, counts, trial numbers. But what about a random variable for today's weather? Or the color of a randomly chosen M&M? The values are categories, not numbers.

A Categorical distribution is the generalization of Bernoulli to more than two outcomes. A random variable X is categorical if it takes one of k possible values {c₁, c₂, …, c_k} with probabilities {p₁, p₂, …, p_k} where p₁ + p₂ + … + p_k = 1.

The PMF is simply a table:

Value	Probability
Sunny	P(X = Sunny) = 0.49
Cloudy	P(X = Cloudy) = 0.30
Rainy	P(X = Rainy) = 0.20
Snowy	P(X = Snowy) = 0.01

The probabilities must sum to 1.0. Since the values are not numbers, there is no meaningful expectation or variance in the numerical sense. You cannot average "Sunny" and "Rainy."

Bernoulli is a special case: A Bernoulli(p) is just a Categorical with two categories: {success, failure} with probabilities {p, 1−p}. We happened to encode those categories as the numbers 1 and 0, which is why Bernoulli has a meaningful expectation. But the underlying structure is categorical.

Connection to the Multinomial. Just as Binomial counts successes over n Bernoulli trials, the Multinomial distribution counts how many times each category appears over n Categorical trials. If you roll a 6-sided die 100 times and count how many of each face appear, the vector (count₁, …, count₆) follows a Multinomial. We will meet this in a later chapter.

Worked example. A user visits a website and clicks one of four buttons with probabilities: Home (0.45), Products (0.30), About (0.15), Contact (0.10). Let X be the button clicked. X is Categorical with k = 4.

• P(X = Home) = 0.45

• P(X ≠ Home) = 1 − 0.45 = 0.55

• P(X = Products or X = About) = 0.30 + 0.15 = 0.45

We can answer probability questions, but not compute E[X] (what is the "average" of Home and Contact?).

When categories become numbers: Sometimes categories are numbers — the face of a die, a zip code, a jersey number. In that case you can compute E[X] and Var(X), but only because you chose a numeric encoding. A die roll is Categorical with values {1,2,3,4,5,6} each at probability 1/6, and E[X] = 3.5 makes sense. The key point: the Categorical framework doesn't require the values to be numeric.

Check: Why does a Categorical random variable with non-numeric categories lack a meaningful expectation?

Because its PMF doesn't sum to 1 Because expectation requires multiplying values by probabilities, and you cannot multiply "Sunny" by 0.49 Because it has too many categories

Chapter 7: Showcase — Distribution Explorer

Now it is time to play. The interactive explorer below lets you switch between all five numeric distributions and adjust their parameters in real time. Watch how the PMF shape, expectation, and variance change. Try to build intuition for how each parameter controls the distribution.

Interactive Distribution Explorer

Choose a distribution and adjust parameters. Bars show the PMF. The dashed line marks E[X]. Shaded region shows ±1 standard deviation from the mean.

p0.50

E[X] = 0.50   Var(X) = 0.25

Things to try:
• Bernoulli: Drag p from 0 to 1 and watch the two bars tilt.
• Binomial: Fix p = 0.5, increase n. Watch the bell curve emerge.
• Poisson: Increase λ. Notice it also becomes bell-shaped. Compare Poi(20) to Bin(40, 0.5) — they look nearly identical.
• Geometric: Set p small (0.05). The tail stretches far to the right. Most of the probability is piled on the first few values, but extreme waits are always possible.
• Neg. Binomial: Increase r. The distribution shifts right and becomes more symmetric.

Check: Which distribution's PMF bars look most symmetric (bell-shaped) for large parameter values?

Geometric Bernoulli Binomial (or Poisson, which approaches it)

Chapter 8: Distribution Comparison

With six distributions in our toolkit, it is easy to mix them up. This chapter puts them side by side and sharpens the distinctions. The key question for pattern recognition is: what is the random variable counting?

Distribution	What X counts	Trials	Support	E[X]	Var(X)
Bern(p)	Success on one trial	1	{0,1}	p	p(1−p)
Bin(n,p)	Successes in n trials	n (fixed)	{0,…,n}	np	np(1−p)
Poi(λ)	Events in fixed interval	∞ (limit)	{0,1,…}	λ	λ
Geo(p)	Trials to 1st success	Random	{1,2,…}	1/p	(1−p)/p²
NegBin(r,p)	Trials to r-th success	Random	{r,r+1,…}	r/p	r(1−p)/p²
Categorical	Which category	1	{c₁,…,c_k}	N/A*	N/A*

*Categorical has E[X] only if categories happen to be numeric.

The family tree:
• Bernoulli → sum n copies → Binomial
• Binomial → n → ∞, p → 0, np = λ → Poisson
• Geometric → sum r copies → Negative Binomial
• Bernoulli → add more categories → Categorical
Every distribution in this chapter is connected to Bernoulli. It is the atomic building block.

Side-by-Side Comparison

Two distributions plotted together. Choose the pair and adjust shared parameters to compare shapes.

n30

p0.20

Decision flowchart. When you encounter a discrete random variable in the wild, ask these questions in order:

Is the outcome a category (not a count)?

→ Categorical

↓ No, it's numeric

Is there only one trial with two outcomes?

→ Bernoulli(p)

↓ No, multiple trials

Fixed number of trials, counting successes?

→ Binomial(n, p)

↓ No, variable trials

Counting trials until 1st success?

→ Geometric(p)

↓ No, until r-th success

Counting trials until r-th success?

→ NegBin(r, p)

↓ No, counting events in a time/space interval

Events at constant rate, independently?

→ Poisson(λ)

Check: A quality inspector tests items until finding the 5th defective one. Each item is independently defective with probability 0.02. What distribution models the total number of items tested?

Binomial(5, 0.02) NegBin(5, 0.02) Poisson(5)

Chapter 9: Connections

This chapter gave you six named distributions — the essential toolkit for modeling discrete random phenomena. Let's connect them to the bigger picture.

What we built: Starting from the simplest possible random variable (Bernoulli: one trial, two outcomes), we constructed every other distribution by changing what we count. Sum Bernoullis → Binomial. Let n → ∞ → Poisson. Count trials to success → Geometric. Count trials to r-th success → Negative Binomial. Add more categories → Categorical. One atom, six molecules.

Looking back. Chapter 5 introduced random variables. Chapter 6 gave us expectation and variance as summary tools. This chapter shows that many real-world random variables fall into known families, so we rarely need to compute PMFs from scratch — we just identify the type and plug in parameters.

Looking forward. Chapter 8 moves from discrete to continuous distributions — random variables that take real-valued outcomes. The ideas are parallel: each continuous distribution has a density function (analogous to the PMF), expectation, and variance. The most important continuous distribution — the Normal — will turn out to be the limiting shape that the Binomial and Poisson approach for large parameters. That is the Central Limit Theorem, and it is one of the most profound results in all of mathematics.

Ch 5–6: Random Variables & Expectation

The language and tools

↓

Ch 7: Discrete Distributions (you are here)

Six named families with pre-derived properties

↓

Ch 8: Continuous Distributions

Uniform, Exponential, Normal — the continuous counterparts

↓

Ch 9+: Joint, CLT, Inference

Multiple variables, limit theorems, real-world applications

One last connection worth noting: the Exponential distribution (Chapter 8) is the continuous analog of the Geometric. Just as the Geometric is memoryless among discrete distributions, the Exponential is memoryless among continuous distributions. And just as the Poisson counts events in an interval, the Exponential models the time between Poisson events. These pairs — Geometric/Exponential and Poisson/Exponential — are two sides of the same coin.

"God does not play dice with the universe —
but if He did, He would use named distributions."
— (loosely after Einstein)

Check: The Poisson distribution can be derived as the limit of which other distribution?

Geometric, as p → 0 Binomial, as n → ∞ and p → 0 with np = λ Negative Binomial, as r → ∞