Ch 6: Expectation & Variance — Piech Probability CS

Chapter 0: Why Expectation?

A random variable's PMF is a complete picture — it tells you the probability of every possible outcome. But sometimes a complete picture is too much information. You are designing a game. A player rolls two dice and gets the sum as their score. The PMF of the sum has 11 possible values (2 through 12), each with a different probability. Your boss asks: "On average, how many points will a player score per roll?"

You need a single number that summarizes the entire distribution. That number is the expectation (also called the mean, or expected value). It is the weighted average of all possible outcomes, where each outcome is weighted by its probability.

But knowing the average is not enough. Suppose two games both have an average score of 7. In game A, players almost always get between 6 and 8. In game B, players often get 2 or 12 — wild swings. The averages are identical, but the experiences are totally different. You need a second number to capture the spread. That number is the variance.

The core idea: Expectation tells you the center of gravity of a distribution — where it "balances." Variance tells you how spread out the mass is around that center. Together, they give you a two-number summary of any random variable.

These two concepts are not just mathematical abstractions. They are the workhorses of computer science. Every time you analyze an algorithm's average-case runtime, compute confidence intervals, train a machine learning model, or decide whether a Monte Carlo estimate has converged, you are using expectation and variance.

Quantity	What it measures	Symbol
Expectation	Center / weighted average	E[X] or μ
Variance	Spread around the center	Var(X) or σ²
Standard deviation	Spread in original units	Std(X) or σ

This chapter derives both concepts from scratch. We will start with the definition of expectation, develop its key properties (linearity, LOTUS), then turn to variance and standard deviation. Every formula will be accompanied by a concrete numerical example with actual arithmetic.

Names for expectation: Mean, weighted average, center of mass, first moment — they all refer to the same quantity E[X]. You will see all of these in different textbooks and papers. They are computed with the same formula.

Check: Why do we need variance in addition to expectation?

• Because expectation is always zero
• Because two distributions can have the same expectation but very different spreads
• Because expectation only works for continuous random variables

Chapter 1: Definition of E[X]

The expectation of a discrete random variable X is defined as:

E[X] = ∑_x x · P(X = x)

In words: take each value x that the random variable can assume, multiply it by the probability of that value, and sum everything up. This is a weighted average — values that are more likely get more weight.

Key insight: Expectation is NOT the most likely outcome. It is the long-run average. If you sampled from the distribution a billion times and averaged, the result would be very close to E[X]. The most likely outcome is the mode. The two can coincide, but they do not have to.

Worked example — single die: Let X be the outcome of rolling a single fair die. Each face has probability 1/6.

E[X] = 1 · ¹⁄₆ + 2 · ¹⁄₆ + 3 · ¹⁄₆ + 4 · ¹⁄₆ + 5 · ¹⁄₆ + 6 · ¹⁄₆

= (1 + 2 + 3 + 4 + 5 + 6) / 6 = ²¹⁄₆ = 3.5

Notice: 3.5 is not even a possible outcome! You can never roll a 3.5. But if you rolled a die a million times and averaged, you would get very close to 3.5. That is what expectation means.
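The definition translates almost word for word into code. The sketch below is one way to write it, assuming the PMF is stored as a dictionary from value to probability; the names `expectation` and `die_pmf` are illustrative, not from the course. Exact fractions are used so the arithmetic matches the hand calculation.

```python
from fractions import Fraction

def expectation(pmf):
    """Return E[X] = sum of x * P(X = x) for a PMF given as {value: probability}."""
    return sum(x * p for x, p in pmf.items())

# Fair six-sided die: each face has probability 1/6.
die_pmf = {face: Fraction(1, 6) for face in range(1, 7)}

print(expectation(die_pmf))         # 7/2
print(float(expectation(die_pmf)))  # 3.5
```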

Worked example — sum of two dice: Let X be the sum of two fair dice. The PMF is:

x	2	3	4	5	6	7	8	9	10	11	12
P(X=x)	¹⁄₃₆	²⁄₃₆	³⁄₃₆	⁴⁄₃₆	⁵⁄₃₆	⁶⁄₃₆	⁵⁄₃₆	⁴⁄₃₆	³⁄₃₆	²⁄₃₆	¹⁄₃₆

E[X] = 2 · ¹⁄₃₆ + 3 · ²⁄₃₆ + 4 · ³⁄₃₆ + 5 · ⁴⁄₃₆ + 6 · ⁵⁄₃₆ + 7 · ⁶⁄₃₆ + 8 · ⁵⁄₃₆ + 9 · ⁴⁄₃₆ + 10 · ³⁄₃₆ + 11 · ²⁄₃₆ + 12 · ¹⁄₃₆

= (2 + 6 + 12 + 20 + 30 + 42 + 40 + 36 + 30 + 22 + 12) / 36 = ²⁵²⁄₃₆ = 7

The expected value of the sum of two dice is 7. Here 7 is also the mode (the most likely outcome), because this PMF is symmetric with a single peak at its center. In general, expectation and mode need not coincide.

Expectation of a constant: If c is a constant (not a random variable), then E[c] = c. A constant never changes, so its "average" is just itself. This seems trivial, but it appears frequently in proofs.

python
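def pmf_sum(x):
    """Return P(D1 + D2 = x) for two fair six-sided dice."""
    count = 0
    for d1 in range(1, 7):
        for d2 in range(1, 7):
            if d1 + d2 == x:
                count += 1          # one of the 36 equally likely outcomes
    return count / 36

def expectation_sum_two_dice():
    """Return E[X] for X = sum of two fair dice, straight from the definition."""
    exp = 0.0
    for x in range(2, 13):          # x from 2 to 12
        pr_x = pmf_sum(x)           # probability that the sum equals x
        exp += x * pr_x
    return exp                      # 7.0, up to floating-point rounding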

Check: You flip a fair coin. Heads = +10, Tails = −4. What is E[X]?

• 10
• 6
• 3 (since 10 · 0.5 + (−4) · 0.5 = 5 − 2 = 3)

Chapter 2: Linearity of Expectation

Expectation has a remarkable property that makes it incredibly powerful: it is linear. Linearity comes in two parts. The first covers scaling and shifting:

E[aX + b] = a · E[X] + b

Where a and b are constants. You can pull constants out of the expectation and add constants outside.

Worked example: Suppose E[X] = 5 (say, the expected score on a quiz). Your professor curves by doubling everyone's score and adding 10. The new score is Y = 2X + 10. What is E[Y]?

E[Y] = E[2X + 10] = 2 · E[X] + 10 = 2 · 5 + 10 = 20

Even more powerful is the sum rule:

E[X + Y] = E[X] + E[Y]

This holds always — even when X and Y are dependent. Even when they have completely different distributions. The expectation of a sum equals the sum of the expectations. No conditions needed.

Why this is surprising: Think about it — X and Y could be tangled up in complex ways. Knowing the value of X might completely determine Y. Yet the expectation still splits cleanly into a sum. This property extends to any number of random variables: E[X₁ + X₂ + … + X_n] = E[X₁] + E[X₂] + … + E[X_n].
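To see the sum rule survive total dependence, here is a small check, offered as a sketch rather than a proof. It assumes X is a fair die and picks Y = X², so that Y is completely determined by X. The choice of Y and the variable names are illustrative assumptions.

```python
from fractions import Fraction

die = {x: Fraction(1, 6) for x in range(1, 7)}

# Y = X**2 is completely determined by X, so X and Y are strongly dependent.
e_x = sum(x * p for x, p in die.items())             # E[X]   = 7/2
e_y = sum(x**2 * p for x, p in die.items())          # E[Y]   = 91/6
e_sum = sum((x + x**2) * p for x, p in die.items())  # E[X+Y], computed directly

print(e_sum, e_x + e_y)  # 56/3 56/3
```

Both routes give the same number even though knowing X tells you Y exactly.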

Worked example — sum of two dice, the easy way: Let D₁ and D₂ be two independent fair dice. Instead of computing the PMF of the sum (11 values!), just use linearity:

E[D₁ + D₂] = E[D₁] + E[D₂] = 3.5 + 3.5 = 7

We already computed E[D₁] = 3.5 for a single die. Linearity gives us the answer instantly, with no PMF table needed. This is the real power of linearity — it lets you decompose hard problems into easy pieces.

Worked example — 10 dice: You roll 10 fair dice and take the sum. What is the expected sum?

E[D₁ + D₂ + … + D₁₀] = 10 · E[D₁] = 10 · 3.5 = 35

Without linearity, you would need the PMF of the sum of 10 dice (51 possible values from 10 to 60). With linearity: one line.
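For comparison, the long way can be done by machine. The sketch below builds the PMF of the sum one die at a time and then applies the definition of expectation. The helper name `pmf_sum_of_dice` is hypothetical, and the dice are assumed to be independent and fair.

```python
from fractions import Fraction

def pmf_sum_of_dice(n):
    """PMF of the sum of n independent fair dice, built one die at a time."""
    pmf = {0: Fraction(1)}                      # before any roll, the sum is 0
    for _ in range(n):
        nxt = {}
        for total, p in pmf.items():
            for face in range(1, 7):
                key = total + face
                nxt[key] = nxt.get(key, Fraction(0)) + p * Fraction(1, 6)
        pmf = nxt
    return pmf

pmf10 = pmf_sum_of_dice(10)
brute_force = sum(x * p for x, p in pmf10.items())

print(brute_force)                          # 35
print(brute_force == 10 * Fraction(7, 2))   # True
```

The brute-force answer agrees with the one-line linearity calculation, which is the point: linearity avoids building the PMF at all.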

The indicator trick: Many clever expectation calculations write a complex random variable as a sum of indicator variables. For instance, the number of heads in n coin flips is X = X₁ + X₂ + … + X_n, where X_i = 1 if flip i is heads, 0 otherwise. Then E[X] = n · E[X₁] = n · p. This trick works because linearity does not require independence.

Worked example — expected heads: Flip a biased coin (P(H) = 0.6) exactly 100 times. Let X = number of heads.

E[X] = E[X₁ + X₂ + … + X₁₀₀] = 100 · E[X₁] = 100 · 0.6 = 60
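As a sanity check, the indicator answer can be compared with the definition of expectation applied to the full PMF of X. This sketch assumes the 100 flips are independent, so that X follows the Binomial PMF (covered in a later chapter); the variable names are illustrative.

```python
from fractions import Fraction
from math import comb

n, p = 100, Fraction(3, 5)   # 100 flips, P(H) = 0.6

# Long way: E[X] = sum over k of k * P(X = k), using the Binomial PMF.
long_way = sum(k * comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1))

# Indicator way: each X_i has E[X_i] = 1*p + 0*(1 - p) = p, and there are n of them.
indicator_way = n * p

print(long_way, indicator_way)  # 60 60
```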

Check: Does E[X + Y] = E[X] + E[Y] require X and Y to be independent?

• No: linearity of expectation holds always, regardless of dependence
• Yes: it only works for independent random variables
• Only if X and Y have the same distribution

Chapter 3: LOTUS

You know the PMF of X. You want E[g(X)] for some function g. Do you need to compute the PMF of the new random variable g(X) first? No. The Law of the Unconscious Statistician (LOTUS) says you can compute it directly:

E[g(X)] = ∑_x g(x) · P(X = x)

In words: apply g to each value x, weight by the probability, and sum. You never need to find the distribution of g(X). The name is humorous — the idea is so natural that "even an unconscious statistician" would use it without thinking.

Key insight: LOTUS says you compute E[g(X)] by summing over the values that X itself can take, applying g to each one. You do NOT need to figure out what values g(X) can take or their probabilities. This saves enormous effort.

Worked example — E[X²] for a single die: Let X be a single fair die roll. Compute E[X²].

Here g(x) = x². By LOTUS:

E[X²] = ∑_{x=1}^{6} x² · ¹⁄₆ = 1² · ¹⁄₆ + 2² · ¹⁄₆ + 3² · ¹⁄₆ + 4² · ¹⁄₆ + 5² · ¹⁄₆ + 6² · ¹⁄₆

= (1 + 4 + 9 + 16 + 25 + 36) / 6 = ⁹¹⁄₆ ≈ 15.17

Notice that E[X²] ≠ (E[X])². We have E[X] = 3.5, so (E[X])² = 12.25. But E[X²] = 15.17. This gap between E[X²] and (E[X])² turns out to be exactly the variance, as we will see in Chapter 4.

Critical warning: E[g(X)] ≠ g(E[X]) in general! Expectation is linear, but it does NOT "pass through" nonlinear functions. E[X²] ≠ (E[X])². E[1/X] ≠ 1/E[X]. The only functions for which E[g(X)] = g(E[X]) holds for every random variable X are those of the form g(x) = ax + b. For any other g, use LOTUS.
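A short sketch makes the warning concrete. The helper name `lotus` is hypothetical; it takes a function g and a PMF stored as a dictionary, and the fair die is the example distribution. Each printed pair shows E[g(X)] next to g(E[X]).

```python
from fractions import Fraction

def lotus(g, pmf):
    """E[g(X)] = sum of g(x) * P(X = x), summing over the values of X itself."""
    return sum(g(x) * p for x, p in pmf.items())

die = {x: Fraction(1, 6) for x in range(1, 7)}
e_x = lotus(lambda x: x, die)                           # E[X] = 7/2

print(lotus(lambda x: x**2, die), e_x**2)               # 91/6 49/4
print(lotus(lambda x: Fraction(1, x), die), 1 / e_x)    # 49/120 2/7
print(lotus(lambda x: 2 * x + 10, die), 2 * e_x + 10)   # 17 17
```

Only the last pair agrees, and its g has the form ax + b.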

Worked example — E[X²] for sum of two dice: Let X be the sum of two fair dice. Compute E[X²].

E[X²] = 4 · ¹⁄₃₆ + 9 · ²⁄₃₆ + 16 · ³⁄₃₆ + 25 · ⁴⁄₃₆ + 36 · ⁵⁄₃₆ + 49 · ⁶⁄₃₆ + 64 · ⁵⁄₃₆ + 81 · ⁴⁄₃₆ + 100 · ³⁄₃₆ + 121 · ²⁄₃₆ + 144 · ¹⁄₃₆

= (4 + 18 + 48 + 100 + 180 + 294 + 320 + 324 + 300 + 242 + 144) / 36 = ¹⁹⁷⁴⁄₃₆ = 54.833...

We will use this value in Chapter 4 to compute Var(X).

python
def e_x_squared_two_dice():
    """Return E[X²] for the sum of two fair dice via LOTUS (reuses pmf_sum from Chapter 1)."""
    result = 0.0
    for x in range(2, 13):
        pr_x = pmf_sum(x)
        result += x**2 * pr_x    # g(x) = x^2, weighted by P(X=x)
    return result                 # about 54.8333

Check: You know E[X] = 4 for some random variable. What is E[X²]?

• 16
• 4
• Cannot be determined: E[X²] ≠ (E[X])² in general

Chapter 4: Variance Definition

Expectation tells you the center of a distribution. But two distributions can share the same center and look completely different. Imagine three graders scoring a homework assignment that deserves a 70:

• Grader A gives scores all over the map — sometimes 40, sometimes 100. High spread.
• Grader B consistently gives scores between 65 and 75. Low spread.
• Both have E[Score] = 70, but you would much rather have Grader B.

We need a number that quantifies spread. The variance measures the average squared distance from the mean:

Var(X) = E[(X − μ)²]

where μ = E[X]. In words: take each possible value x, compute how far it is from the mean (x − μ), square that distance, weight by probability, and sum. The squaring ensures that deviations above and below the mean do not cancel out.

The shortcut formula: There is an equivalent form that is much easier to compute:

Var(X) = E[X²] − (E[X])²

This says: compute E[X²] (using LOTUS), compute E[X], square E[X], and subtract. This is almost always the formula you should use in practice.

Proof of the shortcut: Starting from the definition and expanding:

Var(X) = E[(X − μ)²] = E[X² − 2μX + μ²]

= E[X²] − 2μ E[X] + μ² (linearity)

= E[X²] − 2μ² + μ² = E[X²] − μ² = E[X²] − (E[X])²

Worked example — single die: X is a fair die roll. We already computed E[X] = 3.5 and E[X²] = 91/6 ≈ 15.167.

Var(X) = E[X²] − (E[X])² = ⁹¹⁄₆ − 3.5² = ⁹¹⁄₆ − ⁴⁹⁄₄ = ¹⁸²⁄₁₂ − ¹⁴⁷⁄₁₂ = ³⁵⁄₁₂ ≈ 2.917
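The definition and the shortcut can be checked against each other numerically. This is a minimal sketch for the fair die, using exact fractions; the variable names are illustrative.

```python
from fractions import Fraction

die = {x: Fraction(1, 6) for x in range(1, 7)}

mu = sum(x * p for x, p in die.items())                         # E[X] = 7/2
by_definition = sum((x - mu)**2 * p for x, p in die.items())    # E[(X - mu)^2]
by_shortcut = sum(x**2 * p for x, p in die.items()) - mu**2     # E[X^2] - (E[X])^2

print(by_definition, by_shortcut)  # 35/12 35/12
```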

Worked example — sum of two dice: X is the sum of two dice. E[X] = 7, E[X²] = 1974/36 = 54.833...

Var(X) = E[X²] − (E[X])² = ¹⁹⁷⁴⁄₃₆ − 49 = ¹⁹⁷⁴⁄₃₆ − ¹⁷⁶⁴⁄₃₆ = ²¹⁰⁄₃₆ = ³⁵⁄₆ ≈ 5.833

Variance is always non-negative. Since Var(X) = E[(X − μ)²], and a squared quantity is always ≥ 0, the expectation of a non-negative thing is non-negative. Var(X) = 0 if and only if X is a constant (never deviates from its mean).

Check: What is the shortcut formula for variance?

• Var(X) = E[X]² − E[X²]
• Var(X) = E[X²] − (E[X])²
• Var(X) = (E[X])²

Chapter 5: Properties of Variance

Variance behaves differently from expectation. Unlike expectation, variance is NOT linear. Here are the key properties.

Property 1 — Adding a constant does not change variance:

Var(X + b) = Var(X)

Shifting a distribution left or right moves the center but does not change the spread. If everyone gets 10 extra points on an exam, the scores are all higher, but the spread is identical.

Property 2 — Scaling multiplies variance by the square:

Var(aX) = a² · Var(X)

If you double every value, distances from the mean also double, and squared distances quadruple.

Combined:

Var(aX + b) = a² · Var(X)

Critical difference from expectation: Constants add into expectation: E[aX + b] = aE[X] + b. But the additive constant b vanishes in variance, and the multiplicative constant a gets squared. Variance is NOT linear.

Worked example: Suppose X has E[X] = 10 and Var(X) = 4. Let Y = 3X + 5. Then:

E[Y] = 3 · 10 + 5 = 35

Var(Y) = 3² · Var(X) = 9 · 4 = 36
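The worked example does not say which distribution X has, so the sketch below assumes one that fits: X is 8 or 12 with equal probability, which gives E[X] = 10 and Var(X) = 4. That choice, and the helper name `mean_var`, are assumptions made only for illustration.

```python
from fractions import Fraction

def mean_var(pmf):
    """Return (E[X], Var(X)) for a PMF given as {value: probability}."""
    mu = sum(x * p for x, p in pmf.items())
    var = sum(x**2 * p for x, p in pmf.items()) - mu**2
    return mu, var

x_pmf = {8: Fraction(1, 2), 12: Fraction(1, 2)}   # hypothetical X: E[X] = 10, Var(X) = 4
y_pmf = {3 * x + 5: p for x, p in x_pmf.items()}  # PMF of Y = 3X + 5

mu_x, var_x = mean_var(x_pmf)
mu_y, var_y = mean_var(y_pmf)
print(mu_x, var_x)  # 10 4
print(mu_y, var_y)  # 35 36
```

The mean picks up both the factor 3 and the shift 5, while the variance only sees the factor, squared.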

Property 3 — Variance of a sum (independent case):

If X and Y are independent: Var(X + Y) = Var(X) + Var(Y)

Unlike expectation of sums (which holds always), variance of sums requires independence. If X and Y are dependent, there is a correction term involving the covariance (covered in later chapters).

Worked example — two dice variance: Let D₁ and D₂ be independent fair dice. We know Var(D₁) = 35/12.

Var(D₁ + D₂) = Var(D₁) + Var(D₂) = ³⁵⁄₁₂ + ³⁵⁄₁₂ = ⁷⁰⁄₁₂ = ³⁵⁄₆ ≈ 5.833

This matches our direct computation from Chapter 4.
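To see why independence matters, the sketch below compares two sums. In the first, the dice are independent. In the second, the "second die" is assumed to be a copy of the first, an extreme case of dependence chosen for illustration. The helper name `variance` is hypothetical.

```python
from fractions import Fraction
from itertools import product

def variance(pmf):
    """Var(X) = E[X^2] - (E[X])^2 for a PMF given as {value: probability}."""
    mu = sum(x * p for x, p in pmf.items())
    return sum(x**2 * p for x, p in pmf.items()) - mu**2

die = {x: Fraction(1, 6) for x in range(1, 7)}

# Independent case: every pair (d1, d2) has probability 1/36.
indep_sum = {}
for d1, d2 in product(range(1, 7), repeat=2):
    indep_sum[d1 + d2] = indep_sum.get(d1 + d2, 0) + Fraction(1, 36)

# Dependent case: the second die copies the first, so the sum is 2 * D1.
copy_sum = {2 * d1: p for d1, p in die.items()}

print(variance(die))        # 35/12
print(variance(indep_sum))  # 35/6, equal to Var(D1) + Var(D2)
print(variance(copy_sum))   # 35/3, equal to 4 * Var(D1), not 2 * Var(D1)
```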

Property	Expectation	Variance
Add constant b	E[X+b] = E[X] + b	Var(X+b) = Var(X)
Scale by a	E[aX] = aE[X]	Var(aX) = a²Var(X)
Sum (always)	E[X+Y] = E[X]+E[Y]	Not always decomposable
Sum (independent)	E[X+Y] = E[X]+E[Y]	Var(X+Y) = Var(X)+Var(Y)

Check: X has Var(X) = 9. What is Var(2X + 100)?

• 118
• 36 (since Var(2X + 100) = 2² · 9 = 36; the +100 vanishes)
• 18

Chapter 6: Standard Deviation

Variance has a units problem. If X is measured in points, Var(X) is measured in points². If X is dollars, Var(X) is dollars². Squared units are hard to interpret intuitively. How do you make sense of "the spread is 5.83 points-squared"?

The solution is the standard deviation:

Std(X) = σ = √Var(X)

Standard deviation has the same units as X. It is the root-mean-square distance of X from its mean, which you can read as a "typical" deviation from the mean.

Worked example — single die:

Var(X) = ³⁵⁄₁₂ ≈ 2.917 points²

Std(X) = √(³⁵⁄₁₂) ≈ 1.708 points

So a typical die roll deviates about 1.7 from the mean of 3.5. That matches intuition: every roll lands 0.5, 1.5, or 2.5 away from 3.5.

Worked example — sum of two dice:

Var(X) = ³⁵⁄₆ ≈ 5.833 points²

Std(X) = √(³⁵⁄₆) ≈ 2.415 points

The sum of two dice typically deviates about 2.4 from the mean of 7. So most sums fall between roughly 5 and 9, which lines up with the PMF we saw earlier.

Rules of thumb: For many bell-shaped distributions, roughly 68% of the probability mass falls within one standard deviation of the mean (between μ − σ and μ + σ), and about 95% falls within two standard deviations. These figures come from the Normal distribution (about 68.3% and 95.4%) and are a reasonable approximation for many others.
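The rule of thumb can be tested on the sum of two dice, which is not Normal. The sketch below adds up the PMF within k standard deviations of the mean; the helper name `mass_within` is hypothetical.

```python
from fractions import Fraction
from itertools import product
from math import sqrt

pmf = {}
for d1, d2 in product(range(1, 7), repeat=2):
    pmf[d1 + d2] = pmf.get(d1 + d2, 0) + Fraction(1, 36)

mu = sum(x * p for x, p in pmf.items())                      # 7
sigma = sqrt(sum(x**2 * p for x, p in pmf.items()) - mu**2)  # about 2.415

def mass_within(k):
    """P(|X - mu| <= k * sigma)."""
    return sum(p for x, p in pmf.items() if abs(x - mu) <= k * sigma)

print(float(mass_within(1)))  # about 0.667 (sums 5 through 9)
print(float(mass_within(2)))  # about 0.944 (sums 3 through 11)
```

For this distribution the one- and two-sigma masses land close to, but not exactly on, 68% and 95%.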

Scaling property: Standard deviation scales linearly (unlike variance which squares):

Std(aX + b) = |a| · Std(X)

This is because Std(aX + b) = √(a² Var(X)) = |a| √Var(X) = |a| Std(X).

Worked example: Suppose X has Std(X) = 3 points. You triple every value and add 10: Y = 3X + 10. Then Std(Y) = 3 · 3 = 9 points. The +10 does not affect spread, and the factor of 3 scales the standard deviation by 3 (not 9 — that is what happens to variance).
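The absolute value in |a| matters when a is negative. As a rough illustration, the sketch below uses a fair die with a = −3 and b = 10 (values chosen here for the example, not taken from the text) and a hypothetical helper `std` for a uniform distribution over a list of values.

```python
from math import sqrt, isclose

def std(values):
    """Std of a random variable that is uniform over the given list of values."""
    mu = sum(values) / len(values)
    return sqrt(sum((v - mu)**2 for v in values) / len(values))

faces = [1, 2, 3, 4, 5, 6]
a, b = -3, 10
scaled = [a * x + b for x in faces]   # values of Y = aX + b, still equally likely

print(round(std(faces), 3))                      # 1.708
print(round(std(scaled), 3))                     # 5.123
print(isclose(std(scaled), abs(a) * std(faces))) # True
```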

When to use which: Variance is better for mathematical manipulation (its properties are cleaner for proofs). Standard deviation is better for interpretation (it has the right units). In practice, you compute with variance and report in standard deviation.

Check: Var(X) = 25. What is Std(X)?

• 5 (since √25 = 5)
• 625
• 12.5

Chapter 7: Showcase — Dice Lab

Time to put it all together. The simulation below lets you choose how many dice to roll and computes the expectation, variance, and standard deviation of the sum — both theoretically (from the formulas) and empirically (from actual simulated rolls). Watch the empirical values converge to the theoretical ones as you increase the number of trials.

Dice Expectation & Variance Calculator

Choose the number of dice. Click Roll 1 to add a single trial, or Roll 100 to add many at once. The histogram shows the distribution of sums across all your trials. Theoretical values are computed from the formulas; empirical values from your samples.

PMF Spread Comparison

This canvas shows the PMF of the sum of N dice for N = 1, 2, and 5 side by side. Notice how the distribution becomes more concentrated (bell-shaped) as N grows, even though the range of values expands. The standard deviation grows as √N, but relative to the mean it shrinks.
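The claim about √N can be tabulated directly from the formulas in this chapter. The sketch assumes independent fair dice, so the mean of the sum is 3.5·N and the variance is N·35/12; the helper name `relative_spread` is hypothetical.

```python
from math import sqrt

def relative_spread(n):
    """Std of the sum of n independent fair dice, divided by its mean."""
    mean = 3.5 * n              # linearity of expectation
    std = sqrt(n * 35 / 12)     # variances of independent dice add, then take the root
    return std / mean

for n in (1, 2, 5, 100):
    print(n, 3.5 * n, round(sqrt(n * 35 / 12), 3), round(relative_spread(n), 3))
# Relative spread is about 0.488, 0.345, 0.218, 0.049 for n = 1, 2, 5, 100.
```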

Convergence of Sample Mean

Watch the running average of dice sums converge to E[X]. The shaded band shows ±1 standard deviation of the sample mean. Click Run to start, Pause to stop.

What to notice: The sample mean converges to E[X] as the number of trials grows. This is the Law of Large Numbers in action. The rate of convergence depends on the variance — higher variance means slower convergence and wider bands around the theoretical mean.
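A text-only stand-in for the lab might look like the sketch below. It is not the page's own simulation code: the function names, the fixed seed, and the trial counts are assumptions. The band uses σ/√n, the standard deviation of the sample mean for n independent trials.

```python
import random
from math import sqrt

def simulate_sums(num_dice, trials, seed=0):
    """Roll num_dice fair dice `trials` times and return the list of sums."""
    rng = random.Random(seed)   # fixed seed so the run is repeatable
    return [sum(rng.randint(1, 6) for _ in range(num_dice)) for _ in range(trials)]

def mean_and_variance(samples):
    """Empirical mean and (population-style) variance of a list of numbers."""
    m = sum(samples) / len(samples)
    v = sum((s - m)**2 for s in samples) / len(samples)
    return m, v

num_dice = 2
theory_mean, theory_var = 3.5 * num_dice, num_dice * 35 / 12
for trials in (100, 10_000, 100_000):
    m, v = mean_and_variance(simulate_sums(num_dice, trials))
    band = sqrt(theory_var / trials)   # std of the sample mean: sigma / sqrt(n)
    print(trials, round(m, 3), round(v, 3), "band:", round(band, 3))
```

As the trial count grows, the empirical mean and variance should settle near 7 and 35/6, and the band should narrow.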

Check: You roll 5 dice. Using linearity, what is the expected sum?

• 17.5 (since 5 × 3.5 = 17.5)
• 15
• 21

Chapter 8: Worked Problems

Problem 1: Weighted coin. A coin has P(Heads) = 0.3. You flip it once. Heads pays $10, tails pays $2. What is the expected payout?

E[Payout] = 10 · 0.3 + 2 · 0.7 = 3.0 + 1.4 = $4.40

Problem 2: Variance of weighted coin. Same coin. What is Var(Payout)?

E[Payout²] = 10² · 0.3 + 2² · 0.7 = 100 · 0.3 + 4 · 0.7 = 30 + 2.8 = 32.8

Var(Payout) = E[Payout²] − (E[Payout])² = 32.8 − (4.4)² = 32.8 − 19.36 = 13.44

Std(Payout) = √13.44 ≈ $3.67
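Problems 1 and 2 can be reproduced in a few lines. This is a minimal sketch with the payout PMF stored as a dictionary; the variable names are illustrative.

```python
from math import sqrt

payout_pmf = {10: 0.3, 2: 0.7}   # heads pays $10, tails pays $2

mean = sum(x * p for x, p in payout_pmf.items())              # E[Payout]
second_moment = sum(x**2 * p for x, p in payout_pmf.items())  # E[Payout^2], via LOTUS
var = second_moment - mean**2                                 # shortcut formula

print(round(mean, 2), round(var, 2), round(sqrt(var), 2))  # 4.4 13.44 3.67
```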

Problem 3: Linearity with indicator variables. In a class of 30 students, each independently has a 0.2 probability of getting an A. What is the expected number of A's?

Let X_i = 1 if student i gets an A, 0 otherwise. E[X_i] = 0.2. Total A's = X₁ + X₂ + … + X₃₀.

E[Total] = 30 · 0.2 = 6 students

Problem 4: Variance of a sum of independent variables. Same setup. Each indicator takes only the values 0 and 1, so X_i² = X_i and Var(X_i) = E[X_i²] − (E[X_i])² = 0.2 − 0.04 = 0.16. Because the students are independent, the variances add:

Var(Total) = 30 · 0.16 = 4.8

Std(Total) = √4.8 ≈ 2.19 students
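The indicator answers for Problems 3 and 4 can be checked against the full PMF of the total. The sketch assumes, as the problem states, that the 30 students are independent, so the total follows the Binomial PMF (named in a later chapter); the variable names are illustrative.

```python
from fractions import Fraction
from math import comb

n, p = 30, Fraction(1, 5)   # 30 students, each gets an A with probability 0.2

# Full PMF of the number of A's, then mean and variance from the definitions.
pmf = {k: comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1)}
mean = sum(k * q for k, q in pmf.items())
var = sum(k**2 * q for k, q in pmf.items()) - mean**2

print(mean, var)               # 6 24/5
print(var == n * p * (1 - p))  # True: 30 * 0.16 = 4.8
```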

Problem 5: Scaling. A casino game costs $5 to play. You roll a fair die and win $X (1 through 6). Your profit is Y = X − 5. What is E[Y] and Var(Y)?

E[Y] = E[X − 5] = E[X] − 5 = 3.5 − 5 = −$1.50

Var(Y) = Var(X − 5) = Var(X) = ³⁵⁄₁₂ ≈ 2.92 dollars²

The house takes $1.50 per game on average. The variance is unchanged because subtracting a constant does not affect spread.

Problem 6: LOTUS in action. X is uniform on {1, 2, 3, 4}. Compute E[2^X].

E[2^X] = 2¹ · ¹⁄₄ + 2² · ¹⁄₄ + 2³ · ¹⁄₄ + 2⁴ · ¹⁄₄ = (2 + 4 + 8 + 16) / 4 = ³⁰⁄₄ = 7.5

Note: E[X] = 2.5, and 2^2.5 ≈ 5.66 ≠ 7.5. This confirms E[g(X)] ≠ g(E[X]) for nonlinear g.

Problem-solving checklist:
1. Identify what random variable you want the expectation/variance of.
2. Can you decompose it as a sum? Use linearity.
3. Is it a function of a simpler RV? Use LOTUS.
4. For variance, compute E[X²] first (via LOTUS), then use the shortcut Var(X) = E[X²] − (E[X])².
5. For sums: check independence before adding variances.

Check: X is uniform on {0, 1, 2}. What is Var(X)?

• 1
• 2/3 (E[X]=1, E[X²]=5/3, Var=5/3−1=2/3)
• 1/3

Chapter 9: Connections

Expectation and variance are the foundation upon which an enormous amount of probability and statistics is built. Here is how this chapter connects to what comes before and what comes next.

Builds on:

• Ch 5 — Random Variables: We defined PMFs; expectation and variance summarize them.
• Ch 3 — Conditional Probability: Independence matters for Var(X+Y).
• Ch 1 — Counting: Many PMFs (and hence expectations) come from counting arguments.

Leads to:

• Ch 7 — Named Distributions: Bernoulli, Binomial, Geometric — each has formulas for E[X] and Var(X) derived using linearity and LOTUS.
• Ch 10 — Continuous RVs: Sums become integrals, but the definitions are the same.
• Ch 14 — Central Limit Theorem: The CLT uses μ and σ² to describe how sums of random variables converge to Normal.

Concept from this chapter	Where it appears next
E[X] definition	Every named distribution (Ch 7-9) derives its mean
Linearity of expectation	Binomial mean = np (Ch 7), coupon collector (Ch 8)
LOTUS	Moment generating functions (Ch 12), variance computations everywhere
Var(X) = E[X²] − (E[X])²	Bernoulli variance = p(1−p), Normal parameterization (Ch 10)
Variance of independent sums	Central Limit Theorem (Ch 14), confidence intervals
Standard deviation	z-scores, hypothesis testing, error bars

The big picture: Expectation gives you the first number you need to know about a random variable. Variance gives you the second. With just these two numbers, you can state the Central Limit Theorem, construct confidence intervals, analyze algorithms, and understand the bias-variance tradeoff in machine learning. Master these, and the rest of probability is variations on a theme.

"The expected value is the single most useful summary
of a probability distribution."
— Chris Piech, Probability for Computer Scientists

Check: Which property of expectation lets you compute E[X₁ + X₂ + … + X_n] = n · E[X₁] when all X_i have the same distribution?

• LOTUS
• Linearity of expectation
• The shortcut formula for variance