Summarizing a random variable with a single number — then measuring how much it spreads.
A random variable's PMF is a complete picture — it tells you the probability of every possible outcome. But sometimes a complete picture is too much information. You are designing a game. A player rolls two dice and gets the sum as their score. The PMF of the sum has 11 possible values (2 through 12), each with a different probability. Your boss asks: "On average, how many points will a player score per roll?"
You need a single number that summarizes the entire distribution. That number is the expectation (also called the mean, or expected value). It is the weighted average of all possible outcomes, where each outcome is weighted by its probability.
But knowing the average is not enough. Suppose two games both have an average score of 7. In game A, players almost always get between 6 and 8. In game B, players often get 2 or 12 — wild swings. The averages are identical, but the experiences are totally different. You need a second number to capture the spread. That number is the variance.
These two concepts are not just mathematical abstractions. They are the workhorses of computer science. Every time you analyze an algorithm's average-case runtime, compute confidence intervals, train a machine learning model, or decide whether a Monte Carlo estimate has converged, you are using expectation and variance.
| Summary | What it measures | Symbol |
|---|---|---|
| Expectation | Center / weighted average | E[X] or μ |
| Variance | Spread around the center | Var(X) or σ² |
| Standard deviation | Spread in original units | Std(X) or σ |
This chapter derives both concepts from scratch. We will start with the definition of expectation, develop its key properties (linearity, LOTUS), then turn to variance and standard deviation. Every formula will be accompanied by a concrete numerical example with actual arithmetic.
The expectation of a discrete random variable X is defined as:
$$E[X] = \sum_{x} x \cdot P(X = x)$$
In words: take each value x that the random variable can assume, multiply it by the probability of that value, and sum everything up. This is a weighted average — values that are more likely get more weight.
Worked example — single die: Let X be the outcome of rolling a single fair die. Each face has probability 1/6.
$$E[X] = 1 \cdot \tfrac{1}{6} + 2 \cdot \tfrac{1}{6} + 3 \cdot \tfrac{1}{6} + 4 \cdot \tfrac{1}{6} + 5 \cdot \tfrac{1}{6} + 6 \cdot \tfrac{1}{6} = \tfrac{21}{6} = 3.5$$
Notice: 3.5 is not even a possible outcome! You can never roll a 3.5. But if you rolled a die a million times and averaged, you would get very close to 3.5. That is what expectation means.
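As a minimal sketch of the definition, the snippet below computes the exact expectation of a fair die from its PMF and compares it with a simulated long-run average. The names `expectation` and `die_pmf`, the fixed seed, and the choice of 100,000 rolls are illustrative assumptions, not part of the original text.

```python
import random
from fractions import Fraction


def expectation(pmf):
    """Return the sum of x * P(X = x) for a PMF given as {value: probability}."""
    return sum(x * p for x, p in pmf.items())


# PMF of one fair die: each face 1..6 has probability 1/6.
die_pmf = {face: Fraction(1, 6) for face in range(1, 7)}
exact_mean = expectation(die_pmf)
print(exact_mean)  # 7/2

# Long-run average of simulated rolls (seed fixed only for repeatability).
rng = random.Random(0)
rolls = [rng.randint(1, 6) for _ in range(100_000)]
sample_mean = sum(rolls) / len(rolls)
print(sample_mean)  # close to 3.5; the exact digits depend on the seed
```

Using `Fraction` keeps the theoretical value exact, so any gap between the two printed numbers comes from sampling noise alone.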
Worked example — sum of two dice: Let X be the sum of two fair dice. The PMF is:
| x | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| P(X=x) | 1⁄36 | 2⁄36 | 3⁄36 | 4⁄36 | 5⁄36 | 6⁄36 | 5⁄36 | 4⁄36 | 3⁄36 | 2⁄36 | 1⁄36 |
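Weighting each value by its probability and summing:
$$E[X] = \frac{2 \cdot 1 + 3 \cdot 2 + 4 \cdot 3 + 5 \cdot 4 + 6 \cdot 5 + 7 \cdot 6 + 8 \cdot 5 + 9 \cdot 4 + 10 \cdot 3 + 11 \cdot 2 + 12 \cdot 1}{36} = \frac{252}{36} = 7$$
The expected value of the sum of two dice is 7. In this case, 7 is also the mode (the most likely outcome). That agreement comes from the symmetry of this particular PMF; in general, expectation and mode need not coincide.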
```python
def pmf_sum(x):
    """P(sum of two fair dice == x), found by counting the 36 equally likely rolls."""
    count = 0
    for d1 in range(1, 7):
        for d2 in range(1, 7):
            if d1 + d2 == x:
                count += 1
    return count / 36


def expectation_sum_two_dice():
    exp = 0
    for x in range(2, 13):      # x from 2 to 12
        pr_x = pmf_sum(x)       # probability that the sum equals x
        exp += x * pr_x
    return exp                  # 7.0, up to floating-point rounding
```
Expectation has a remarkable property that makes it incredibly powerful. It is linear. This means two things:
$$E[aX + b] = a\,E[X] + b$$
where a and b are constants. You can pull a constant factor out of the expectation, and an added constant passes through unchanged.
Worked example: Suppose E[X] = 5 (say, the expected score on a quiz). Your professor curves by doubling everyone's score and adding 10. The new score is Y = 2X + 10. What is E[Y]?
$$E[Y] = E[2X + 10] = 2\,E[X] + 10 = 2 \cdot 5 + 10 = 20$$
Even more powerful is the sum rule:
$$E[X + Y] = E[X] + E[Y]$$
This holds always — even when X and Y are dependent. Even when they have completely different distributions. The expectation of a sum equals the sum of the expectations. No conditions needed.
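To see that the sum rule does not need independence, here is a small sketch with a deliberately dependent pair. The choice Y = 7 − X (the face opposite a standard die roll) is an illustrative assumption, picked because Y is completely determined by X.

```python
from fractions import Fraction

p = Fraction(1, 6)      # probability of each face of a fair die
faces = range(1, 7)

# X is the die roll; Y = 7 - X is completely determined by X (strong dependence).
e_x = sum(x * p for x in faces)
e_y = sum((7 - x) * p for x in faces)
e_sum = sum((x + (7 - x)) * p for x in faces)

print(e_x, e_y, e_sum)  # 7/2 7/2 7
```

Here X + Y is always exactly 7, so the pair is as far from independent as possible, yet E[X + Y] still equals E[X] + E[Y].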
Worked example — sum of two dice, the easy way: Let D1 and D2 be two independent fair dice. Instead of computing the PMF of the sum (11 values!), just use linearity:
$$E[D_1 + D_2] = E[D_1] + E[D_2] = 3.5 + 3.5 = 7$$
We already computed E[D1] = 3.5 for a single die. Linearity gives us the answer instantly, with no PMF table needed. This is the real power of linearity — it lets you decompose hard problems into easy pieces.
Worked example — 10 dice: You roll 10 fair dice and take the sum. What is the expected sum?
$$E[D_1 + D_2 + \cdots + D_{10}] = E[D_1] + E[D_2] + \cdots + E[D_{10}] = 10 \cdot 3.5 = 35$$
Without linearity, you would need the PMF of the sum of 10 dice (51 possible values from 10 to 60). With linearity: one line.
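For comparison, the long way can be sketched in code: build the exact PMF of the sum of 10 dice by repeated convolution, then take its weighted average. The helper name `pmf_of_dice_sum` is hypothetical, and this is only one of several ways to construct that PMF.

```python
from fractions import Fraction


def pmf_of_dice_sum(n):
    """Exact PMF of the sum of n fair dice, built by adding one die at a time."""
    pmf = {0: Fraction(1)}              # sum of zero dice is 0 with certainty
    for _ in range(n):
        nxt = {}
        for total, p in pmf.items():
            for face in range(1, 7):
                key = total + face
                nxt[key] = nxt.get(key, Fraction(0)) + p * Fraction(1, 6)
        pmf = nxt
    return pmf


pmf10 = pmf_of_dice_sum(10)
mean_from_pmf = sum(x * p for x, p in pmf10.items())
mean_from_linearity = 10 * Fraction(7, 2)

print(mean_from_pmf, mean_from_linearity)  # 35 35
```

Both routes agree, but the linearity route needs no PMF at all.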
Worked example — expected heads: Flip a biased coin (P(H) = 0.6) exactly 100 times. Let X = number of heads.
Write X = X1 + X2 + … + X100, where Xi = 1 if flip i lands heads and 0 otherwise, so E[Xi] = 0.6. By linearity:
$$E[X] = \sum_{i=1}^{100} E[X_i] = 100 \cdot 0.6 = 60$$
You know the PMF of X. You want E[g(X)] for some function g. Do you need to compute the PMF of the new random variable g(X) first? No. The Law of the Unconscious Statistician (LOTUS) says you can compute it directly:
$$E[g(X)] = \sum_{x} g(x) \cdot P(X = x)$$
In words: apply g to each value x, weight by the probability, and sum. You never need to find the distribution of g(X). The name is humorous — the idea is so natural that "even an unconscious statistician" would use it without thinking.
Worked example — E[X²] for a single die: Let X be a single fair die roll. Compute E[X²].
Here g(x) = x². By LOTUS:
$$E[X^2] = \frac{1^2 + 2^2 + 3^2 + 4^2 + 5^2 + 6^2}{6} = \frac{1 + 4 + 9 + 16 + 25 + 36}{6} = \frac{91}{6} \approx 15.17$$
Notice that E[X²] ≠ (E[X])². We have E[X] = 3.5, so (E[X])² = 12.25. But E[X²] = 15.17. This gap between E[X²] and (E[X])² turns out to be exactly the variance, as we will see in Chapter 4.
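LOTUS translates almost directly into code. The sketch below takes a PMF and an arbitrary function g and never constructs the distribution of g(X). The function name `lotus` and the dictionary representation of the PMF are assumptions made for illustration.

```python
from fractions import Fraction


def lotus(pmf, g):
    """E[g(X)] as the sum of g(x) * P(X = x), for a PMF given as {value: probability}."""
    return sum(g(x) * p for x, p in pmf.items())


die_pmf = {face: Fraction(1, 6) for face in range(1, 7)}

e_x = lotus(die_pmf, lambda x: x)         # identity g recovers E[X]
e_x2 = lotus(die_pmf, lambda x: x ** 2)   # g(x) = x^2

print(e_x2)      # 91/6
print(e_x ** 2)  # 49/4
```

The two printed values differ, which is the gap between E[X²] and (E[X])² described above.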
Worked example — E[X²] for sum of two dice: Let X be the sum of two fair dice. Compute E[X²].
$$E[X^2] = \frac{4 \cdot 1 + 9 \cdot 2 + 16 \cdot 3 + 25 \cdot 4 + 36 \cdot 5 + 49 \cdot 6 + 64 \cdot 5 + 81 \cdot 4 + 100 \cdot 3 + 121 \cdot 2 + 144 \cdot 1}{36} = \frac{1974}{36} \approx 54.83$$
We will use this value in Chapter 4 to compute Var(X).
```python
def e_x_squared_two_dice():
    # Reuses pmf_sum(x) from the earlier two-dice listing.
    result = 0
    for x in range(2, 13):
        pr_x = pmf_sum(x)
        result += x**2 * pr_x   # g(x) = x^2, weighted by P(X=x)
    return result               # approximately 54.8333
```
Expectation tells you the center of a distribution. But two distributions can share the same center and look completely different. Imagine two graders scoring a homework assignment that deserves a 70:
• Grader A gives scores all over the map — sometimes 40, sometimes 100. High spread.
• Grader B consistently gives scores between 65 and 75. Low spread.
• Both have E[Score] = 70, but you would much rather have Grader B.
We need a number that quantifies spread. The variance measures the average squared distance from the mean:
$$\operatorname{Var}(X) = E[(X - \mu)^2] = \sum_{x} (x - \mu)^2 \cdot P(X = x)$$
where μ = E[X]. In words: take each possible value x, compute how far it is from the mean (x − μ), square that distance, weight by probability, and sum. The squaring ensures that deviations above and below the mean do not cancel out.
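Proof of the shortcut Var(X) = E[X²] − (E[X])²: Starting from the definition, expanding the square, and applying linearity of expectation (μ is a constant):
$$\begin{aligned}
\operatorname{Var}(X) &= E[(X - \mu)^2] \\
&= E[X^2 - 2\mu X + \mu^2] \\
&= E[X^2] - 2\mu\,E[X] + \mu^2 \\
&= E[X^2] - 2\mu^2 + \mu^2 \\
&= E[X^2] - (E[X])^2
\end{aligned}$$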
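The definition E[(X − μ)²] and the shortcut E[X²] − (E[X])² can be checked against each other numerically. This is a sketch under the assumption that a PMF is stored as a `{value: probability}` dictionary; the function names are illustrative.

```python
from fractions import Fraction
from itertools import product


def variance_by_definition(pmf):
    """Average squared distance from the mean: sum of (x - mu)^2 * P(X = x)."""
    mu = sum(x * p for x, p in pmf.items())
    return sum((x - mu) ** 2 * p for x, p in pmf.items())


def variance_by_shortcut(pmf):
    """E[X^2] - (E[X])^2."""
    mu = sum(x * p for x, p in pmf.items())
    e_x2 = sum(x ** 2 * p for x, p in pmf.items())
    return e_x2 - mu ** 2


die_pmf = {face: Fraction(1, 6) for face in range(1, 7)}

# PMF of the sum of two fair dice, by enumerating all 36 equally likely rolls.
two_dice_pmf = {}
for d1, d2 in product(range(1, 7), repeat=2):
    two_dice_pmf[d1 + d2] = two_dice_pmf.get(d1 + d2, Fraction(0)) + Fraction(1, 36)

print(variance_by_definition(die_pmf), variance_by_shortcut(die_pmf))            # 35/12 35/12
print(variance_by_definition(two_dice_pmf), variance_by_shortcut(two_dice_pmf))  # 35/6 35/6
```

With exact fractions the two formulas agree exactly. With floats, the shortcut can lose precision when E[X²] and (E[X])² are both large and nearly equal.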
Worked example — single die: X is a fair die roll. We already computed E[X] = 3.5 and E[X²] = 91/6 ≈ 15.167.
$$\operatorname{Var}(X) = E[X^2] - (E[X])^2 = \frac{91}{6} - \frac{49}{4} = \frac{182 - 147}{12} = \frac{35}{12} \approx 2.917$$
Worked example — sum of two dice: X is the sum of two dice. E[X] = 7, E[X²] = 1974/36 = 54.833...
$$\operatorname{Var}(X) = \frac{1974}{36} - 7^2 = \frac{1974 - 1764}{36} = \frac{210}{36} = \frac{35}{6} \approx 5.833$$
Variance behaves differently from expectation. Unlike expectation, variance is NOT linear. Here are the key properties.
Property 1 — Adding a constant does not change variance:
$$\operatorname{Var}(X + b) = \operatorname{Var}(X)$$
Shifting a distribution left or right moves the center but does not change the spread. If everyone gets 10 extra points on an exam, the scores are all higher, but the spread is identical.
Property 2 — Scaling multiplies variance by the square:
$$\operatorname{Var}(aX) = a^2 \operatorname{Var}(X)$$
If you double every value, distances from the mean also double, and squared distances quadruple.
Combined:
$$\operatorname{Var}(aX + b) = a^2 \operatorname{Var}(X)$$
Worked example: Suppose X has E[X] = 10 and Var(X) = 4. Let Y = 3X + 5. Then:
$$E[Y] = 3 \cdot 10 + 5 = 35, \qquad \operatorname{Var}(Y) = 3^2 \cdot 4 = 36$$
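The worked example does not say which distribution X has, so the sketch below assumes a hypothetical one that matches its numbers: X is 8 or 12 with probability 1/2 each, giving E[X] = 10 and Var(X) = 4. It then builds the PMF of Y = 3X + 5 and recomputes both summaries.

```python
from fractions import Fraction


def mean(pmf):
    return sum(x * p for x, p in pmf.items())


def variance(pmf):
    mu = mean(pmf)
    return sum((x - mu) ** 2 * p for x, p in pmf.items())


# Hypothetical X chosen only to match E[X] = 10 and Var(X) = 4.
x_pmf = {8: Fraction(1, 2), 12: Fraction(1, 2)}

# PMF of Y = a*X + b: same probabilities, transformed values.
a, b = 3, 5
y_pmf = {a * x + b: p for x, p in x_pmf.items()}

print(mean(x_pmf), variance(x_pmf))  # 10 4
print(mean(y_pmf), variance(y_pmf))  # 35 36
```

The mean follows a·E[X] + b, while the variance picks up only the factor a² and ignores b.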
Property 3 — Variance of a sum (independent case):
$$\operatorname{Var}(X + Y) = \operatorname{Var}(X) + \operatorname{Var}(Y) \quad \text{when } X \text{ and } Y \text{ are independent}$$
Unlike expectation of sums (which holds always), variance of sums requires independence. If X and Y are dependent, there is a correction term involving the covariance (covered in later chapters).
Worked example — two dice variance: Let D1 and D2 be independent fair dice. We know Var(D1) = 35/12.
$$\operatorname{Var}(D_1 + D_2) = \operatorname{Var}(D_1) + \operatorname{Var}(D_2) = \frac{35}{12} + \frac{35}{12} = \frac{35}{6} \approx 5.833$$
This matches our direct computation from Chapter 4.
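The independence requirement can be illustrated by comparing two sums of dice-like variables. The dependent case below, where Y is the very same roll as X so that X + Y = 2X, is an illustrative assumption chosen to make the failure obvious.

```python
from fractions import Fraction
from itertools import product


def variance(pmf):
    mu = sum(x * p for x, p in pmf.items())
    return sum((x - mu) ** 2 * p for x, p in pmf.items())


die = {face: Fraction(1, 6) for face in range(1, 7)}

# Independent case: the joint probability of (d1, d2) is the product p1 * p2.
indep_sum = {}
for (d1, p1), (d2, p2) in product(die.items(), repeat=2):
    indep_sum[d1 + d2] = indep_sum.get(d1 + d2, 0) + p1 * p2

# Dependent case: Y is the same roll as X, so X + Y = 2X.
dep_sum = {2 * d: p for d, p in die.items()}

print(variance(die))        # 35/12
print(variance(indep_sum))  # 35/6, equal to Var(X) + Var(Y)
print(variance(dep_sum))    # 35/3, twice as large as Var(X) + Var(Y)
```

In the dependent case the scaling rule applies instead: Var(2X) = 4·Var(X).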
| Property | Expectation | Variance |
|---|---|---|
| Add constant b | E[X+b] = E[X] + b | Var(X+b) = Var(X) |
| Scale by a | E[aX] = aE[X] | Var(aX) = a²Var(X) |
| Sum (always) | E[X+Y] = E[X]+E[Y] | Not always decomposable |
| Sum (independent) | E[X+Y] = E[X]+E[Y] | Var(X+Y) = Var(X)+Var(Y) |
Variance has a units problem. If X is measured in points, Var(X) is measured in points². If X is dollars, Var(X) is dollars². Squared units are hard to interpret intuitively. How do you make sense of "the spread is 5.83 points-squared"?
The solution is the standard deviation:
$$\operatorname{Std}(X) = \sigma = \sqrt{\operatorname{Var}(X)}$$
Standard deviation has the same units as X. It is the root-mean-square distance of X from its mean, so it can be read as a typical deviation from the mean.
Worked example — single die:
$$\operatorname{Std}(X) = \sqrt{35/12} \approx 1.708$$
So a typical die roll deviates about 1.7 from the mean of 3.5. That matches intuition: every roll is 0.5, 1.5, or 2.5 away from 3.5.
Worked example — sum of two dice:
$$\operatorname{Std}(X) = \sqrt{35/6} \approx 2.415$$
The sum of two dice typically deviates about 2.4 from the mean of 7. So most sums fall between roughly 5 and 9, which lines up with the PMF we saw earlier.
Scaling property: Standard deviation scales linearly (unlike variance which squares):
$$\operatorname{Std}(aX + b) = |a| \cdot \operatorname{Std}(X)$$
This is because Std(aX + b) = √(a² Var(X)) = |a| √Var(X) = |a| Std(X).
Worked example: Suppose X has Std(X) = 3 points. You triple every value and add 10: Y = 3X + 10. Then Std(Y) = 3 · 3 = 9 points. The +10 does not affect spread, and the factor of 3 scales the standard deviation by 3 (not 9 — that is what happens to variance).
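A quick numerical check of the scaling property, including a negative scale factor to show why the absolute value appears. The fair die used as X is an assumption for illustration; the worked example above does not specify a distribution.

```python
import math


def std(pmf):
    """Standard deviation of a PMF given as {value: probability}."""
    mu = sum(x * p for x, p in pmf.items())
    var = sum((x - mu) ** 2 * p for x, p in pmf.items())
    return math.sqrt(var)


die = {face: 1 / 6 for face in range(1, 7)}

for a, b in [(3, 10), (-3, 10)]:
    y = {a * x + b: p for x, p in die.items()}   # PMF of Y = a*X + b
    print(a, b, math.isclose(std(y), abs(a) * std(die)))  # True for both rows

print(f"{std(die):.3f}")  # 1.708
```

Flipping the sign of a mirrors the distribution but leaves its spread unchanged, which is why the factor is |a| rather than a.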
Time to put it all together. For the sum of any number of dice, the expectation, variance, and standard deviation can be computed both theoretically (from the formulas) and empirically (from simulated rolls). The empirical values converge to the theoretical ones as the number of trials increases.
Figure: PMF of the sum of N dice for N = 1, 2, and 5, side by side. The distribution becomes more bell-shaped as N grows, even though the range of values expands. The standard deviation grows as √N, but relative to the mean it shrinks.
Figure: running average of dice sums converging to E[X]. The shaded band shows ±1 standard deviation of the sample mean.
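The comparison between theoretical and empirical values can be approximated offline with a short simulation. This is a minimal sketch, not the page's own simulator: the function names, the fixed seed, and the trial counts are all assumptions, and the theoretical values use E = 3.5·n and Var = n·35/12 from the independent-sum rules above.

```python
import random


def theoretical(n_dice):
    """Mean, variance, and std of the sum of n_dice independent fair dice."""
    mean = 3.5 * n_dice
    var = n_dice * 35 / 12
    return mean, var, var ** 0.5


def empirical(n_dice, n_trials, seed=0):
    """Same three summaries, estimated from n_trials simulated sums."""
    rng = random.Random(seed)
    sums = [sum(rng.randint(1, 6) for _ in range(n_dice)) for _ in range(n_trials)]
    mean = sum(sums) / n_trials
    var = sum((s - mean) ** 2 for s in sums) / n_trials
    return mean, var, var ** 0.5


print("theory:", theoretical(2))
for n_trials in (100, 10_000, 100_000):
    print(n_trials, empirical(2, n_trials))
```

With more trials the empirical triple should drift toward the theoretical one, though any single run can still be off by a small amount.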
Problem 1: Weighted coin. A coin has P(Heads) = 0.3. You flip it once. Heads pays $10, tails pays $2. What is the expected payout?
$$E[\text{Payout}] = 0.3 \cdot 10 + 0.7 \cdot 2 = 3 + 1.4 = 4.4$$
Problem 2: Variance of weighted coin. Same coin. What is Var(Payout)?
$$E[\text{Payout}^2] = 0.3 \cdot 10^2 + 0.7 \cdot 2^2 = 30 + 2.8 = 32.8$$
$$\operatorname{Var}(\text{Payout}) = 32.8 - 4.4^2 = 32.8 - 19.36 = 13.44$$
Problem 3: Linearity with indicator variables. In a class of 30 students, each independently has a 0.2 probability of getting an A. What is the expected number of A's?
Let Xi = 1 if student i gets an A, 0 otherwise. E[Xi] = 0.2. Total A's = X1 + X2 + … + X30.
$$E[X_1 + X_2 + \cdots + X_{30}] = E[X_1] + E[X_2] + \cdots + E[X_{30}] = 30 \cdot 0.2 = 6$$
Problem 4: Variance of a sum of independent variables. Same setup. Since students are independent, Var(Xi) = E[Xi²] − (E[Xi])² = 0.2 − 0.04 = 0.16.
$$\operatorname{Var}(X_1 + X_2 + \cdots + X_{30}) = 30 \cdot 0.16 = 4.8$$
Problem 5: Scaling. A casino game costs $5 to play. You roll a fair die and win $X (1 through 6). Your profit is Y = X − 5. What is E[Y] and Var(Y)?
$$E[Y] = E[X] - 5 = 3.5 - 5 = -1.5, \qquad \operatorname{Var}(Y) = \operatorname{Var}(X) = \frac{35}{12} \approx 2.917$$
The house takes $1.50 per game on average. The variance is unchanged because subtracting a constant does not affect spread.
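Problem 6: LOTUS in action. X is uniform on {1, 2, 3, 4}. Compute E[2^X].
$$E[2^X] = \frac{2^1 + 2^2 + 2^3 + 2^4}{4} = \frac{2 + 4 + 8 + 16}{4} = \frac{30}{4} = 7.5$$
Note: E[X] = 2.5, and 2^2.5 ≈ 5.66 ≠ 7.5. This shows that E[g(X)] and g(E[X]) can differ when g is nonlinear.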
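The practice problems can be checked with exact arithmetic. The sketch below assumes the `{value: probability}` representation for PMFs, and it reads Problem 6 as g(x) = 2^x, which is what the note's comparison against 7.5 implies.

```python
from fractions import Fraction


def mean(pmf):
    return sum(x * p for x, p in pmf.items())


def variance(pmf):
    mu = mean(pmf)
    return sum((x - mu) ** 2 * p for x, p in pmf.items())


# Problems 1 and 2: payout of the weighted coin.
payout = {10: Fraction(3, 10), 2: Fraction(7, 10)}
print(mean(payout), variance(payout))   # 22/5 336/25, i.e. 4.4 and 13.44

# Problems 3 and 4: 30 independent indicators, each with P(A) = 1/5.
p = Fraction(1, 5)
expected_as = 30 * p
variance_as = 30 * p * (1 - p)
print(expected_as, variance_as)         # 6 24/5, i.e. 6 and 4.8

# Problem 5: profit Y = X - 5 for a fair die X.
profit = {x - 5: Fraction(1, 6) for x in range(1, 7)}
print(mean(profit), variance(profit))   # -3/2 35/12

# Problem 6: E[2^X] for X uniform on {1, 2, 3, 4}, via LOTUS.
uniform = {x: Fraction(1, 4) for x in range(1, 5)}
e_g = sum(2 ** x * q for x, q in uniform.items())
print(e_g)                              # 15/2
```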
Expectation and variance are the foundation upon which an enormous amount of probability and statistics is built. Here is how this chapter connects to what comes before and what comes next.
Builds on:
• Ch 5 — Random Variables: We defined PMFs; expectation and variance summarize them.
• Ch 3 — Conditional Probability: Independence matters for Var(X+Y).
• Ch 1 — Counting: Many PMFs (and hence expectations) come from counting arguments.
Leads to:
• Ch 7 — Named Distributions: Bernoulli, Binomial, Geometric — each has formulas for E[X] and Var(X) derived using linearity and LOTUS.
• Ch 10 — Continuous RVs: Sums become integrals, but the definitions are the same.
• Ch 14 — Central Limit Theorem: The CLT uses μ and σ² to describe how sums of random variables converge to Normal.
| Concept from this chapter | Where it appears next |
|---|---|
| E[X] definition | Every named distribution (Ch 7-9) derives its mean |
| Linearity of expectation | Binomial mean = np (Ch 7), coupon collector (Ch 8) |
| LOTUS | Moment generating functions (Ch 12), variance computations everywhere |
| Var(X) = E[X²] − (E[X])² | Bernoulli variance = p(1−p), Normal parameterization (Ch 10) |
| Variance of independent sums | Central Limit Theorem (Ch 14), confidence intervals |
| Standard deviation | z-scores, hypothesis testing, error bars |