Piech, Chapter 5

Random Variables

Turning outcomes into numbers — the bridge from sample spaces to computation.

Prerequisites: Chapter 4 (Applications of conditional probability). That's it.
10
Chapters
4
Simulations
10
Quizzes

Chapter 0: Why Random Variables?

You flip three fair coins. You could describe the outcome as (H, T, H), or (T, T, T), or any of the eight possible sequences. But often you don't care about the exact sequence — you care about how many heads you got. That number — 0, 1, 2, or 3 — captures the thing you actually want to reason about.

A random variable is exactly this idea: a way to assign a number to each outcome of an experiment. It is the bridge between the messy world of sample spaces and the clean world of arithmetic.

Think of it like a variable in a programming language. In Python, x = len(results) maps a list of outcomes to a single integer. A random variable does the same thing for probability: it takes an outcome from the sample space and maps it to a number. The value hasn't been determined yet — it depends on which outcome occurs — but once the experiment runs, the variable takes on exactly one value.

The core idea: Random variables convert outcomes into numbers. Once you have numbers, you can compute averages, measure spread, compare distributions, and build the entire machinery of modern probability. They are to probability what variables are to programming — the foundational abstraction everything else is built on.

Here is the three-coin example. Define Y = number of heads on three fair coin flips:

EventOutcomesY valueProbability
No heads(T,T,T)01/8
One head(H,T,T), (T,H,T), (T,T,H)13/8
Two heads(H,H,T), (H,T,H), (T,H,H)23/8
Three heads(H,H,H)31/8

Notice something crucial: multiple outcomes can map to the same number. Three different sequences all give Y = 1. The random variable collapses the sample space into something more manageable. Instead of tracking 8 outcomes, we now track 4 possible values and their probabilities.

Why is this useful? Because once we have a mapping from outcomes to numbers, we can ask powerful questions: What is the average value? How spread out are the values? What is the probability the value exceeds some threshold? These questions would be awkward to phrase in terms of raw outcomes, but with numbers they become natural.

Random variables in CS: Every time your code calls random.randint(1, 6), it is sampling from a random variable. Hash functions, load balancers, randomized algorithms, A/B tests, machine learning — they all reason about random variables. This chapter gives you the formal language for that reasoning.
Check: What does a random variable do?

Chapter 1: Definition & Notation

Formally, a random variable is a function from the sample space Ω to the real numbers. If the experiment produces outcome ω ∈ Ω, then the random variable X assigns the number X(ω) to it.

X : Ω → ℝ

We use capital letters (X, Y, Z) for random variables and lowercase letters (x, y, z) for specific non-random values. This convention is universal in probability and statistics.

A random variable is not the same thing as an event. An event is a scenario (a subset of the sample space). A random variable is an object that takes on a value. But you can create events from random variables using relational operators:

ExpressionMeaningType
YThe random variable itselfRandom variable
Y = 2"Y takes on the value 2"Event
Y < 3"Y is less than 3"Event
Y ≥ 1"Y is at least 1"Event
P(Y = 2)Probability of the event Y = 2Number

This is just like coding. In Python, y is an integer variable, y == 2 is a boolean expression, and prob(y == 2) returns a float. Same hierarchy: variable → event (boolean) → probability (number).

Key insight: The expression P(Y = y) uses both conventions at once. Capital Y is the random variable, lowercase y is a specific value. The whole thing reads: "the probability that the random variable Y takes on the particular value y." This lets us write one formula that works for any value — plug in y = 0, y = 1, etc. to get specific probabilities.

Worked example: Let Y = number of heads on 3 coin flips. What is P(Y = y) for a general value y?

P(Y = y) = C(3, y) / 8     for y ∈ {0, 1, 2, 3}

Where C(3, y) = 3! / (y!(3 − y)!) is the binomial coefficient. Let's check: C(3, 0)/8 = 1/8, C(3, 1)/8 = 3/8, C(3, 2)/8 = 3/8, C(3, 3)/8 = 1/8. These match our table from Chapter 0.

Every random variable has a set of properties that fully describe its behavior:

PropertyNotationWhat it tells you
Support (range){0, 1, 2, 3}Which values the RV can take on
PMF / PDFP(X = x)How likely each value is
CDFP(X ≤ x)Cumulative probability up to x
ExpectationE[X]Weighted average (center of mass)
VarianceVar(X)How spread out the values are

We will build up each of these properties over this chapter and the next. The most fundamental is the PMF — once you know it, you can derive everything else.

Discrete vs. continuous: A discrete random variable takes on a countable set of values (integers, categories). A continuous random variable can take on any real value in an interval. This chapter focuses on discrete random variables. The continuous case comes later and is remarkably similar — sums become integrals, PMFs become PDFs, but the intuition is identical.
Check: In the expression P(Y = y), which is the random variable and which is the specific value?

Chapter 2: The PMF

The most important thing to know about a random variable is: how likely is each value? For discrete random variables, this information is captured by the probability mass function (PMF).

The PMF is a function that takes a value and returns the probability that the random variable equals that value:

PMF:   p(x) = P(X = x)

Think of "mass" literally. Imagine you have one kilogram of probability to distribute across all possible values. The PMF tells you how much mass sits at each point. A fair die puts 1/6 kg at each of {1, 2, 3, 4, 5, 6}. The sum of two dice puts the most mass at 7 and tapers off toward the edges.

Let's see both examples. For a single fair die, X = value shown:

P(X = x) = 1/6    for x ∈ {1, 2, 3, 4, 5, 6}

For Y = sum of two dice, the PMF has a triangular shape because there are more ways to roll a 7 (six combinations) than a 2 (one combination):

y23456789101112
P(Y=y)1/362/363/364/365/366/365/364/363/362/361/36
PMF Visualizer: Dice Distributions

Toggle between 1 die and 2 dice. Hover over bars to see exact probabilities. The PMF shows how probability mass is distributed across possible values.

Mode: 1 die

You can also write a PMF as code. Here is the PMF for the sum of two dice:

python
def pmf_sum_two_dice(y):
    # Returns P(Y = y) where Y = sum of two dice
    if y < 2 or y > 12:
        return 0
    if y <= 7:
        return (y - 1) / 36
    else:
        return (13 - y) / 36
PMFs sum to 1: For any valid random variable, the total probability across all values must equal 1: ∑k P(X = k) = 1. This follows from the axioms of probability — the events "X = k" for all k are mutually exclusive and exhaustive (X must take on some value), so their probabilities must sum to 1. If your PMF doesn't sum to 1, something is wrong.

Worked example — from data to PMF: You simulate rolling two dice 10,000 times and count: the value 7 appears 1,672 times, the value 2 appears 275 times. What are the approximate PMF values?

P(Y = 7) ≈ 1672 / 10000 = 0.1672. The true value is 6/36 = 0.1667. Very close.

P(Y = 2) ≈ 275 / 10000 = 0.0275. The true value is 1/36 = 0.0278. Again, close. With more samples, these empirical frequencies converge to the true PMF — this is the definition of probability at work.

Check: For the sum of two dice, which value has the highest probability mass?

Chapter 3: The CDF

The PMF answers "what is the probability of exactly this value?" But often you want to ask: "what is the probability the value is at most x?" This cumulative question is answered by the cumulative distribution function (CDF).

CDF:   F(x) = P(X ≤ x) = ∑k ≤ x P(X = k)

The CDF at any point x is just the sum of all PMF values at or below x. It accumulates probability from left to right — hence the name "cumulative."

Worked example: Let Y = number of heads on 3 coin flips. The PMF is P(0) = 1/8, P(1) = 3/8, P(2) = 3/8, P(3) = 1/8. The CDF is:

yP(Y = y) [PMF]F(y) = P(Y ≤ y) [CDF]
01/8 = 0.1251/8 = 0.125
13/8 = 0.3754/8 = 0.500
23/8 = 0.3757/8 = 0.875
31/8 = 0.1258/8 = 1.000

The CDF always starts near 0 (for very small x) and ends at 1 (for large x). It is non-decreasing — it can only go up or stay flat, never drop. For discrete random variables, the CDF is a staircase function: it jumps up at each value in the support, and is flat between them.

Key insight: You can always recover the PMF from the CDF and vice versa. The PMF is the "height of each step" in the CDF staircase: P(X = k) = F(k) − F(k − 1). The CDF is the running sum of the PMF. They encode the same information in different forms.

The CDF is especially useful for answering range queries:

QuestionFormulaExample (Y = heads on 3 flips)
P(Y ≤ 2)F(2)7/8 = 0.875
P(Y > 1)1 − F(1)1 − 4/8 = 0.500
P(1 < Y ≤ 3)F(3) − F(1)1 − 0.5 = 0.500
CDF Staircase: Coin Flips

The CDF for Y = number of heads on n coin flips. Toggle the number of coins. The staircase jumps at each integer, and the height of each jump equals the PMF at that point.

Coins: 3

Worked example: Using the CDF for Y = sum of two dice, what is P(4 ≤ Y ≤ 9)?

P(4 ≤ Y ≤ 9) = F(9) − F(3). We need F(9) = (1+2+3+4+5+6+5+4)/36 = 30/36. And F(3) = (1+2)/36 = 3/36. So P(4 ≤ Y ≤ 9) = 27/36 = 3/4 = 0.75. Three quarters of the time, two dice sum to something between 4 and 9.

Check: For a discrete CDF, what does the height of each "step" represent?

Chapter 4: Properties of PMF & CDF

Now that we have two ways to describe a distribution — the PMF and CDF — let's nail down their formal properties. These are the rules of the game, and every valid distribution must obey them.

Three properties of a valid PMF:

Non-negativity
P(X = x) ≥ 0 for all x. Probabilities can't be negative.
Sums to one
x P(X = x) = 1. All the mass must be accounted for.
Zero outside support
P(X = x) = 0 for any x not in the support of X.

Three properties of a valid CDF:

Non-decreasing
If a ≤ b then F(a) ≤ F(b). Cumulative probability never goes down.
Bounded [0, 1]
limx → −∞ F(x) = 0,   limx → +∞ F(x) = 1.
Right-continuous
F(x) = limt → x+ F(t). The function "claims" the top of each step.
Why right-continuous? For the CDF, F(x) = P(X ≤ x) includes the point x itself. So at a jump from, say, F = 0.5 to F = 0.875, the CDF takes the upper value at the jump point. This is a convention, but it ensures F is well-defined everywhere and consistent with the ≤ in P(X ≤ x).

Worked example — is this a valid PMF? Someone claims a random variable Z has P(Z = 1) = 0.3, P(Z = 2) = 0.5, P(Z = 3) = 0.1.

Check non-negativity: all values are ≥ 0. Check sum: 0.3 + 0.5 + 0.1 = 0.9 ≠ 1. This is not a valid PMF. The mass doesn't add up. There's 0.1 of probability unaccounted for — either there's a missing value or the probabilities are wrong.

Converting between PMF and CDF:

python
def pmf_to_cdf(pmf_values, support):
    # pmf_values[i] = P(X = support[i])
    cdf = []
    running_sum = 0
    for p in pmf_values:
        running_sum += p
        cdf.append(running_sum)
    return cdf

def cdf_to_pmf(cdf_values):
    # PMF is the difference of consecutive CDF values
    pmf = [cdf_values[0]]
    for i in range(1, len(cdf_values)):
        pmf.append(cdf_values[i] - cdf_values[i - 1])
    return pmf
Check: A function f(x) has f(1) = 0.4, f(2) = 0.3, f(3) = −0.1, f(4) = 0.4. Is this a valid PMF?

Chapter 5: Functions of Random Variables

If X is a random variable and g is a function, then g(X) is also a random variable. You take whatever value X produces and feed it through g. The new random variable inherits its randomness from X, but the distribution can look completely different.

Example: Let X = value of a fair die roll, so X ∈ {1, 2, 3, 4, 5, 6}. Define Y = X2. Then Y takes values in {1, 4, 9, 16, 25, 36}, each with probability 1/6.

To find the PMF of g(X), work through each value of X:

P(g(X) = y) = ∑x : g(x) = y P(X = x)

If g maps multiple x values to the same y, you sum their probabilities. This is where things get interesting.

Worked example: Let X = fair die roll and define Y = X mod 3. What is the PMF of Y?

X123456
Y = X mod 3120120

Y = 0 when X ∈ {3, 6}, so P(Y = 0) = 2/6 = 1/3.

Y = 1 when X ∈ {1, 4}, so P(Y = 1) = 2/6 = 1/3.

Y = 2 when X ∈ {2, 5}, so P(Y = 2) = 2/6 = 1/3.

The die roll mod 3 produces a uniform distribution over {0, 1, 2}. This is not obvious — it happens because exactly two of the six die values map to each remainder.

Key insight: Applying a function to a random variable can collapse, spread, or reshape the distribution entirely. Squaring makes the distribution skewed. Taking a modulus can make it uniform. Understanding how functions transform distributions is essential — it is the heart of the Law of the Unconscious Statistician (LOTUS), which we will meet in the Expectation chapter.

Worked example — non-uniform collapse: Let X be uniform on {1, 2, 3, 4} and define Y = |X − 2.5|. Then:

X = 1 → |1 − 2.5| = 1.5.   X = 2 → |2 − 2.5| = 0.5.   X = 3 → |3 − 2.5| = 0.5.   X = 4 → |4 − 2.5| = 1.5.

P(Y = 0.5) = P(X = 2) + P(X = 3) = 1/4 + 1/4 = 1/2.   P(Y = 1.5) = P(X = 1) + P(X = 4) = 1/2.

The absolute deviation collapses 4 values into 2, and both outcomes are equally likely.

Check: If X is uniform on {1,2,3,4,5,6} and Y = min(X, 4), what is P(Y = 4)?

Chapter 6: Indicator Variables

An indicator random variable is the simplest possible random variable. It equals 1 if some event happens and 0 otherwise. That's it.

IA = { 1 if event A occurs,   0 otherwise }

This looks trivially simple, but indicator variables are one of the most powerful tools in all of probability. They turn counting problems into addition problems and make complex expected value calculations clean and elegant.

Why are they useful? Because of one magical property:

The indicator trick: E[IA] = P(A). The expected value of an indicator variable equals the probability of the event it indicates. This follows directly from the definition of expectation: E[IA] = 1 · P(A) + 0 · P(AC) = P(A). Simple, but incredibly useful.

Worked example — counting with indicators: Flip 10 fair coins. Let X = number of heads. We can write X as a sum of indicators:

X = I1 + I2 + … + I10

where Ik = 1 if coin k is heads, 0 otherwise. Now we can compute the expected number of heads without knowing the full PMF of X:

E[X] = E[I1] + E[I2] + … + E[I10] = 0.5 + 0.5 + … + 0.5 = 5

We used linearity of expectation — the expectation of a sum equals the sum of expectations. This works regardless of dependence. We will derive this fully in the next chapter on expectation, but the indicator decomposition technique is so important it deserves its own introduction.

Worked example — birthday problem: In a room of n people, what is the expected number of pairs who share a birthday? There are C(n, 2) = n(n − 1)/2 pairs. For each pair, define an indicator Iij = 1 if persons i and j share a birthday.

E[Iij] = P(same birthday) = 1/365 (assuming 365 equally likely birthdays).

By linearity: E[total matching pairs] = C(n, 2) · (1/365) = n(n − 1) / 730.

For n = 30: E = 30 · 29 / 730 ≈ 1.19 matching pairs. This is remarkably easy compared to computing the full distribution of matching pairs!

The pattern: Whenever you need to count something random, try to write it as a sum of indicator variables. Then use linearity of expectation. This sidesteps the need to compute the full PMF, which is often combinatorially complex. Professional probabilists use this trick constantly.
python
# Indicator variable simulation: birthday problem
import random

def birthday_matching_pairs(n=30, trials=100000):
    total = 0
    for _ in range(trials):
        bdays = [random.randint(1, 365) for _ in range(n)]
        pairs = 0
        for i in range(n):
            for j in range(i + 1, n):
                if bdays[i] == bdays[j]:
                    pairs += 1
        total += pairs
    return total / trials
# Returns ~1.19, matching n(n-1)/730
Check: What is E[IA] for an indicator variable of event A?

Chapter 7: Showcase — Interactive PMF/CDF Builder

This is the payoff. Below is an interactive simulation that lets you build your own random variable from scratch. Roll dice, flip coins, or define a custom distribution — then see the PMF and CDF update in real time as you accumulate samples.

You can also run the experiment thousands of times at once and watch the empirical PMF (the histogram of observed frequencies) converge toward the true PMF. This is probability's fundamental promise: with enough trials, observed frequencies converge to true probabilities.

PMF/CDF Builder: Empirical vs. Theoretical

Select a distribution. Click Sample to draw one value, or Run 1000 to simulate many at once. The warm bars show the empirical PMF (observed frequencies). The teal line shows the true PMF. The lower panel shows the CDF. Watch them converge.

n = 0 samples
Distribution
What to notice: With just 10 samples, the empirical PMF is noisy and jagged — some values haven't even appeared yet. By 100 samples, you start to see the shape. By 1000, it closely matches the theoretical PMF. This is the law of large numbers at work: sample averages converge to true expectations. Try the geometric distribution to see how rare high values are — you may need thousands of samples before the tail fills in.

Try this experiment: select "Sum of 2 Dice" and click Sample 1 repeatedly. After about 30 samples, guess which value has the highest bar. Then click Run 1000 and see if you were right. The triangular shape of the PMF emerges reliably with enough data.

Check: As you collect more and more samples, the empirical PMF converges to the true PMF. What fundamental principle guarantees this?

Chapter 8: Worked Problems

Problem 1: Building a PMF from scratch.

A spinner has 5 equal sectors labeled 1 through 5. You spin it twice and let X = maximum of the two spins. Find the PMF of X.

Solution: There are 25 equally likely outcomes (5 × 5). For X = k, we need both spins ≤ k and at least one spin = k.

P(X ≤ k) = (k/5)2 because both spins must be ≤ k. So P(X = k) = P(X ≤ k) − P(X ≤ k − 1) = k2/25 − (k − 1)2/25 = (2k − 1)/25.

k12345
P(X = k)1/253/255/257/259/25

Check: 1 + 3 + 5 + 7 + 9 = 25, so the sum is 25/25 = 1. Valid PMF.

Problem 2: CDF range query.

Using X from Problem 1, find P(2 ≤ X ≤ 4).

Solution: P(2 ≤ X ≤ 4) = F(4) − F(1) = (1 + 3 + 5 + 7)/25 − 1/25 = 16/25 − 1/25 = 15/25 = 3/5 = 0.6.

Problem 3: Functions of random variables.

Let X be a fair die roll. Define Y = (X mod 2). Find the PMF of Y (i.e., is X even or odd?).

Solution: X mod 2 = 1 when X ∈ {1, 3, 5} and X mod 2 = 0 when X ∈ {2, 4, 6}. So P(Y = 0) = 3/6 = 1/2 and P(Y = 1) = 3/6 = 1/2. Y is a fair coin flip. This makes X mod 2 an indicator for "X is odd."

Problem 4: Indicator decomposition.

You deal 5 cards from a standard 52-card deck. What is the expected number of aces?

Solution: Let Ik = 1 if card k (of the 5 dealt) is an ace. By symmetry, P(card k is an ace) = 4/52 = 1/13 for each card position. By linearity of expectation:

E[aces] = E[I1 + … + I5] = 5 × (1/13) = 5/13 ≈ 0.385

On average, about 0.385 of the 5 cards are aces. Notice we did not need the hypergeometric distribution — the indicator trick made it trivial.

Problem 5: Validating a PMF.

A classmate claims P(X = k) = k/10 for k ∈ {1, 2, 3, 4}. Is this a valid PMF?

Solution: Sum = 1/10 + 2/10 + 3/10 + 4/10 = 10/10 = 1. All values are non-negative. This is a valid PMF. The random variable is more likely to take on larger values — P(X = 4) = 0.4 is the mode.

Check: In Problem 1, why does the max of two spins have a PMF that increases linearly (1/25, 3/25, 5/25, ...)?

Chapter 9: Connections

Random variables are the central abstraction of probability. Everything before this chapter — sample spaces, events, conditional probability — was building toward this moment. Everything after this chapter builds on top of random variables.

Chapter 6: Expectation
The weighted average E[X] = ∑ x · P(X = x). The single most useful summary of a random variable. Linearity of expectation is derived here.
Chapter 7: Variance
How spread out is the distribution? Var(X) = E[(X − E[X])2]. Uses the PMF and LOTUS to compute.
Named Distributions
Bernoulli, Binomial, Poisson, Geometric — each is a random variable with a specific PMF formula and known expectation/variance.
Continuous RVs
Everything generalizes: PMF becomes PDF, sums become integrals, but the intuition is identical.

What comes next: Chapter 6 introduces expectation — the most important single number you can extract from a distribution. You already saw a preview with indicator variables: E[IA] = P(A). The full treatment will give you linearity of expectation, the Law of the Unconscious Statistician (LOTUS), and the tools to compute averages for any discrete random variable.

The big picture: Random variables are to probability what variables are to algebra. Before this chapter, we could only talk about "the probability of event E." Now we can talk about distributions, expectations, variances, functions of random variables, and decompositions via indicators. The entire language of statistics, machine learning, and data science is built on this foundation. Master PMFs and CDFs here, and everything that follows — named distributions, joint distributions, the central limit theorem — will feel like natural extensions.
ConceptWhat it answersWhere it leads
Random variableWhat number does the outcome map to?Everything in probability
PMFHow likely is each value?Named distributions (Binomial, Poisson, ...)
CDFHow likely is "at most x"?Continuous distributions, quantiles, percentiles
Functions of RVsWhat happens when you transform a variable?LOTUS, derived distributions
Indicator variablesDid event A happen (0 or 1)?Counting arguments, linearity of expectation
Check: Which property of indicator variables makes them especially powerful for computing expected counts?