Ch 2: Definition of Probability — Piech, Probability for CS

Chapter 0: Why Probability?

Someone tells you "there's a 30% chance of rain tomorrow." What does that actually mean? If tomorrow either rains or it doesn't — there's no 30% of a rainstorm — then what is this number measuring?

It turns out that humans didn't have a rigorous answer to this question until the 20th century. The word "probability" was used for centuries before anyone pinned down a formal definition. That definition, and the rules that follow from it, are the foundation of everything in this course.

This chapter covers three big ideas. First, the empirical definition of probability: run an experiment many times and count. Second, equally likely outcomes: a powerful shortcut when every outcome has the same chance. Third, the probability of "or": how to combine events, with the inclusion-exclusion principle as the workhorse formula.

The core idea: Probability is a language for quantifying uncertainty. It's not that the world is inherently random — it's that we don't have complete information. Probability gives us a rigorous way to reason about what we believe will happen, given what we know.

Experiment: a repeatable process with uncertain outcomes
↓ run many trials
Frequency: count(E) / n converges as n → ∞
↓ in the limit
Probability: a number in [0, 1] that satisfies Kolmogorov's axioms

Think of probability as a measuring tool. Just as a ruler measures length and a scale measures weight, probability measures how likely something is to happen. The measurement is always a number between 0 (impossible) and 1 (certain). Everything else in probability theory — conditional probability, Bayes' theorem, distributions — is built on top of this definition and three simple axioms.

Term	Meaning
Experiment	A repeatable process (flip a coin, roll a die, sample a user)
Sample Space S	The set of all possible outcomes
Event E	A subset of S that we care about
P(E)	A number in [0, 1] measuring the likelihood of E

Check: What does a probability of 0.30 fundamentally represent?

30% of the event will happen
In the long run, ~30% of trials produce the event
The event is 30% random

Chapter 1: The Empirical Definition

Here is the formal definition that took centuries to nail down. Suppose you perform n trials of an experiment. Let count(E) be the number of trials where event E occurs. Then:

P(E) = lim_{n→∞} count(E) / n

In plain English: the probability of E is the fraction of trials that produce E, in the limit as you run infinitely many trials. With 10 coin flips you might get 7 heads (70%). With 1,000 flips you will probably be much closer to 50%. With a million flips, you will almost certainly be very close. The ratio converges to the true probability.

Key insight: This is sometimes called the "frequentist" definition. It ties probability to something concrete and observable — the long-run frequency of an event. You don't need to know the underlying mechanism. You just need to be able to repeat the experiment.

Let's see this convergence in action. The simulation below rolls a fair six-sided die repeatedly. The event E is "rolling a 5 or 6." The true probability is 2/6 ≈ 0.333. Watch how the empirical ratio oscillates wildly at first, then settles down as n grows.

Interactive simulation: Convergence of Empirical Probability. Event E: rolling a 5 or 6 on a fair die. True P(E) = 2/6 ≈ 0.333.
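The same convergence can be sketched in a few lines of Python. This is a minimal sketch rather than the code behind the interactive simulation: the function name `running_estimates`, the checkpoint values, and the fixed seed are assumptions made for illustration.

```python
import random


def running_estimates(n_trials, checkpoints, seed=0):
    """Roll a fair die n_trials times and record count(E)/n at each checkpoint.

    E is the event "roll a 5 or 6", so the true probability is 2/6.
    The seed is an illustrative choice that makes the run repeatable.
    """
    rng = random.Random(seed)
    count_e = 0
    estimates = {}
    for n in range(1, n_trials + 1):
        roll = rng.randint(1, 6)  # uniform over 1..6, both ends included
        if roll >= 5:
            count_e += 1
        if n in checkpoints:
            estimates[n] = count_e / n
    return estimates


checkpoints = {10, 100, 1_000, 10_000, 100_000}
for n, ratio in sorted(running_estimates(100_000, checkpoints).items()):
    print(f"n = {n:>7,}  count(E)/n = {ratio:.4f}  (true value 0.3333)")
```

The early checkpoints can land far from 1/3, while the later ones should sit close to it. Changing the seed changes the path but not the limit.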

This definition also works for computing probabilities from data. Here's a worked example from the textbook:

Worked example — Elephant births: What is the probability a newborn elephant in Myanmar is male? Data: 3,070 births, 2,180 male. By the empirical definition: P(Male) ≈ 2180/3070 ≈ 0.710. Since 3,070 is much less than infinity, this is an approximation — but a good one. The sample space is {Male, Female, Intersex}. The event is {Male}. The outcomes are not equally likely.

Another way to compute probabilities is via simulation. For complex problems where analytical calculation is too hard, you can run millions of trials on a computer. If your simulations faithfully generate outcomes from the sample space, the fraction of trials producing E converges to P(E). We'll use this technique throughout the course.
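As a rough sketch of that simulation recipe, the helper below takes a function that runs one trial and a function that tests whether the outcome is in the event. The names `estimate_probability`, `flip_two_coins`, and `at_least_one_head` are hypothetical, and the two-coin experiment is only a convenient stand-in for a harder problem.

```python
import random


def estimate_probability(run_trial, in_event, n_trials, seed=0):
    """Estimate P(E) as the fraction of simulated trials that land in E."""
    rng = random.Random(seed)  # fixed seed: an illustrative choice for repeatability
    hits = sum(1 for _ in range(n_trials) if in_event(run_trial(rng)))
    return hits / n_trials


def flip_two_coins(rng):
    """One trial: flip two fair coins and return the ordered pair."""
    return (rng.choice("HT"), rng.choice("HT"))


def at_least_one_head(outcome):
    """Event test: does the outcome contain at least one head?"""
    return "H" in outcome


estimate = estimate_probability(flip_two_coins, at_least_one_head, 200_000)
print(f"estimated P(at least one head) = {estimate:.3f}")
```

For this small experiment the exact answer is 3/4 (three of the four ordered outcomes contain a head), so the estimate can be checked by hand. The value of the approach is that the same helper works when no such hand calculation is available.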

Check: You flip a biased coin 10,000 times and get 7,200 heads. What is your best estimate of P(Heads)?

0.50
0.72
0.28

Chapter 2: Sample Spaces & the Axioms

Before computing any probability, you need to precisely define two things: the sample space S (the set of all possible outcomes) and the event E (the subset you care about). Getting these right is half the battle.

Experiment	Sample Space S	Example Event E
Flip a coin	{H, T}	E = {H} (heads)
Flip two coins	{(H,H), (H,T), (T,H), (T,T)}	E = {(H,H), (H,T), (T,H)} (at least one head)
Roll a die	{1, 2, 3, 4, 5, 6}	E = {1, 2, 3} (3 or less)
Emails per day	{x | x ∈ Z, x ≥ 0}	E = {x | 0 ≤ x < 20}
YouTube hours	{x | x ∈ R, 0 ≤ x ≤ 24}	E = {x | 5 ≤ x ≤ 24} ("wasted day")

Notice that sample spaces can be discrete (finite or countably infinite, like dice or emails) or continuous (uncountably infinite, like YouTube hours). The definition of probability handles both.
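One way to make the table concrete is to write the finite sample spaces as Python sets and the infinite ones as membership tests. This is only a sketch: the variable and function names are assumptions, and the predicates are a stand-in for sets that cannot be listed element by element.

```python
from itertools import product

# Finite sample spaces can be written out as sets.
coin = {"H", "T"}
two_coins = set(product("HT", repeat=2))  # ordered pairs such as ("H", "T")
die = set(range(1, 7))

# Events are subsets of their sample space.
heads = {"H"}
at_least_one_head = {outcome for outcome in two_coins if "H" in outcome}
three_or_less = {face for face in die if face <= 3}

for event, space in [(heads, coin), (at_least_one_head, two_coins), (three_or_less, die)]:
    assert event <= space  # "<=" on sets means "is a subset of"


# Infinite sample spaces cannot be listed, so describe them by a membership test.
def in_email_space(x):
    """Emails per day: non-negative integers (countably infinite)."""
    return isinstance(x, int) and x >= 0


def in_email_event(x):
    """Event: fewer than 20 emails."""
    return in_email_space(x) and x < 20


def in_youtube_space(x):
    """YouTube hours: real numbers from 0 to 24 (uncountably infinite)."""
    return 0 <= x <= 24


def in_wasted_day(x):
    """Event: 5 or more hours."""
    return in_youtube_space(x) and x >= 5


print(sorted(at_least_one_head))
```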

Now for the rules. In 1933, Andrey Kolmogorov showed that all of probability theory can be built from just three axioms:

Axiom 1: 0 ≤ P(E) ≤ 1
Axiom 2: P(S) = 1
Axiom 3: If E ∩ F = ∅, then P(E ∪ F) = P(E) + P(F)

Axiom 1 says probabilities live between 0 and 1. This follows naturally from the empirical definition: count(E) can never exceed the number of trials n, and it can never be negative.

Axiom 2 says the probability of something happening is 1. If your sample space covers every possible outcome, then every trial must produce some outcome in S.

Axiom 3 says if two events share no outcomes (they are mutually exclusive), then the probability of either one happening is just the sum. This is the addition rule for disjoint events, and it is the foundation for the or-rules in Chapters 5 and 6 below.
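For a small finite sample space, the three axioms can be checked by brute force. The sketch below assumes a probability assignment is given as a dictionary from outcomes to weights, with P(E) defined as the sum of the weights in E. The function names and the example dice are hypothetical. Because P is built by adding outcome weights, Axiom 3 holds by construction here, so the interesting failures come from Axioms 1 and 2.

```python
from fractions import Fraction
from itertools import combinations


def prob(event, weights):
    """P(E): the sum of the weights of the outcomes in E."""
    return sum((weights[outcome] for outcome in event), Fraction(0))


def all_events(space):
    """Yield every subset of the sample space, including the empty set and S."""
    items = sorted(space)
    for size in range(len(items) + 1):
        for subset in combinations(items, size):
            yield frozenset(subset)


def satisfies_axioms(weights):
    """Check Kolmogorov's three axioms over every event of a finite space."""
    space = frozenset(weights)
    events = list(all_events(space))
    axiom1 = all(0 <= prob(e, weights) <= 1 for e in events)
    axiom2 = prob(space, weights) == 1
    axiom3 = all(
        prob(e | f, weights) == prob(e, weights) + prob(f, weights)
        for e in events
        for f in events
        if not (e & f)  # only disjoint pairs
    )
    return axiom1 and axiom2 and axiom3


fair_die = {face: Fraction(1, 6) for face in range(1, 7)}
loaded_die = {1: Fraction(1, 2), **{face: Fraction(1, 10) for face in range(2, 7)}}
broken_die = {face: Fraction(1, 5) for face in range(1, 7)}  # weights sum to 6/5

print(satisfies_axioms(fair_die), satisfies_axioms(loaded_die), satisfies_axioms(broken_die))
```

Exact fractions are used instead of floats so that the equality tests in Axioms 2 and 3 are not affected by rounding.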

Historical context: Kolmogorov's axioms (1933) were a breakthrough. Before them, mathematicians debated whether probability was "real math" or just applied intuition. Kolmogorov proved that all the tools of rigorous mathematics — proofs, theorems, set theory — apply to probability. Everything we derive in this course traces back to these three statements.

Check: A bag has 3 red and 7 blue marbles. You draw one at random. What is P(S), where S is the sample space?

1
10
0.3

Chapter 3: Equally Likely Outcomes

Many experiments have a beautiful property: every outcome in the sample space is equally likely. Fair coins, fair dice, well-shuffled decks, random selections — these all produce equally likely outcomes. And when outcomes are equally likely, computing probabilities becomes pure counting.

If all outcomes are equally likely: P(E) = |E| / |S|

Where |E| is the number of outcomes in event E, and |S| is the total number of outcomes in the sample space. This is just counting the favorable outcomes and dividing by the total outcomes.

Key insight: This formula is why counting matters so much in probability. The permutations and combinations from Ch 1 (Counting) are tools for computing |E| and |S|. The equally likely assumption converts probability questions into counting questions.

But there's art in setting this up correctly. You must: (1) define S so that all outcomes are equally likely, (2) count |S|, and (3) count |E| using the same definition of outcomes. Getting step 1 wrong is the most common mistake.

Worked example — Sum of two dice equals 7:
Buggy approach: Define S = {2, 3, ..., 12} (all possible sums). But these are NOT equally likely — sum = 7 is far more likely than sum = 2. This sample space fails step 1.

Correct approach: Treat the dice as distinct. S = {(1,1), (1,2), ..., (6,6)} has |S| = 36 equally likely outcomes. Event E (sum = 7) = {(1,6), (2,5), (3,4), (4,3), (5,2), (6,1)}, so |E| = 6.

P(sum = 7) = 6/36 = 1/6 ≈ 0.167

The simulation below lets you explore this. It shows all 36 outcomes in a grid, highlighting those that sum to your chosen target. You can verify that 7 is the most probable sum — it has 6 outcomes, more than any other target.

Interactive simulation: Dice Sum, Equally Likely Outcomes. All 36 outcomes on two dice; highlighted cells are the event for the chosen target sum, and P(E) = |E|/36. For target sum 7, P(E) = 6/36 ≈ 0.167.
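The two-dice example can also be checked by enumerating the sample space directly. This is a minimal sketch with illustrative variable names. It lists the 36 ordered outcomes, counts those that sum to 7, and tallies how many outcomes produce each sum to show why S = {2, ..., 12} is not an equally likely sample space.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# Treat the dice as distinct: each outcome is an ordered pair (first die, second die).
outcomes = list(product(range(1, 7), repeat=2))

# Event E: the two dice sum to 7.
event = [outcome for outcome in outcomes if sum(outcome) == 7]

p_seven = Fraction(len(event), len(outcomes))
print(f"|S| = {len(outcomes)}, |E| = {len(event)}, P(sum = 7) = {p_seven}")

# How many of the 36 outcomes give each sum? The counts are unequal.
sum_counts = Counter(a + b for a, b in outcomes)
for total in sorted(sum_counts):
    print(f"sum = {total:>2}: {sum_counts[total]} outcome(s)")
```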

This idea extends to continuous sample spaces too. Consider a random number generator that produces a real number uniformly between 0 and 1. The probability of the number landing in [0.3, 0.7] is the ratio of the interval length to the total length: 0.4 / 1 = 0.4. All "locations" are equally likely, so probability reduces to measuring lengths (or areas, or volumes).
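The length-ratio calculation can be compared against a quick simulation. The sketch below is an illustration under assumptions: the function names are hypothetical, and Python's `random.random()` draws from [0, 1), which excludes the endpoint 1. That exclusion does not matter here because a single point has length zero.

```python
import random


def interval_probability(lo, hi, total_lo=0.0, total_hi=1.0):
    """P(lo <= X <= hi) for X uniform on [total_lo, total_hi]: a ratio of lengths."""
    return (hi - lo) / (total_hi - total_lo)


def simulate_interval(lo, hi, n_trials=200_000, seed=0):
    """Estimate the same probability by drawing uniform numbers and counting hits."""
    rng = random.Random(seed)  # fixed seed: an illustrative choice
    hits = sum(1 for _ in range(n_trials) if lo <= rng.random() <= hi)
    return hits / n_trials


exact = interval_probability(0.3, 0.7)
estimate = simulate_interval(0.3, 0.7)
print(f"length ratio = {exact:.3f}, simulated = {estimate:.3f}")
```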

Check: You roll two fair dice. What is the probability the sum is 2?

1/36: only (1,1) gives sum 2
1/11: there are 11 possible sums
2/36: two dice can each be 1

Chapter 4: The Complement Rule

From the three axioms, we can immediately prove some useful identities. The most important is the complement rule: the probability of an event NOT happening is 1 minus the probability of it happening.

P(E^C) = 1 − P(E)

The proof is short and elegant. The event E and its complement E^C together cover the entire sample space: E ∪ E^C = S. They share no outcomes: E ∩ E^C = ∅. So by Axiom 3:

P(S) = P(E) + P(E^C)
1 = P(E) + P(E^C) [by Axiom 2]
P(E^C) = 1 − P(E) □

This identity is more than a curiosity — it's a problem-solving strategy. Whenever computing P(E) directly is hard, try computing P(E^C) instead. If the complement is simpler, you win.

Worked example — At least one head: Flip 3 fair coins. What is P(at least one head)?

Direct approach: E = {HHH, HHT, HTH, HTT, THH, THT, TTH}. Count: |E| = 7 out of |S| = 8. P(E) = 7/8.

Complement approach: E^C = "no heads at all" = {TTT}. P(E^C) = 1/8. So P(E) = 1 − 1/8 = 7/8 = 0.875.

The complement was simpler: just one outcome to count instead of seven.
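Both approaches can be written as code and compared. This sketch generalizes the example to n fair coin flips, which is an extension chosen for illustration. The function names are hypothetical, and the complement version relies on the equally likely assumption (all 2^n flip sequences have the same probability).

```python
from fractions import Fraction
from itertools import product


def p_at_least_one_head_direct(n):
    """Direct approach: list all 2**n sequences and count those containing an H."""
    space = list(product("HT", repeat=n))
    event = [outcome for outcome in space if "H" in outcome]
    return Fraction(len(event), len(space))


def p_at_least_one_head_complement(n):
    """Complement approach: only the all-tails sequence has no heads."""
    p_no_heads = Fraction(1, 2**n)
    return 1 - p_no_heads


print(p_at_least_one_head_direct(3))      # 7/8
print(p_at_least_one_head_complement(3))  # 7/8
```

The direct version does work proportional to 2^n, while the complement version is a single subtraction. That gap is the practical reason to reach for the complement.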

There's a second provable identity from the axioms: if E ⊆ F, then P(E) ≤ P(F). This makes intuitive sense: every outcome in E is also in F, so F is at least as likely as E. The argument is short. Write F = E ∪ (F − E), where E and F − E share no outcomes. Axiom 3 gives P(F) = P(E) + P(F − E), and Axiom 1 gives P(F − E) ≥ 0, so P(F) ≥ P(E).

When to use complements: The complement trick is especially powerful for "at least one" problems. "P(at least one success in n trials)" is hard to compute directly because you must consider exactly 1, exactly 2, ..., all the way up to n successes. But the complement, "zero successes", is a single case. You'll see this pattern over and over.

Check: The probability of a system failure is 0.03. What is the probability the system does NOT fail?

0.97
0.03
0.50

Chapter 5: Probability of Or (Mutually Exclusive)

How do you compute the probability of event E or event F happening? Written P(E ∪ F), this is one of the most common probability calculations. The answer depends on whether the events are mutually exclusive.

Two events are mutually exclusive (or "disjoint") if they share no outcomes: E ∩ F = ∅. In plain English, they can't both happen. Drawing a heart and drawing a spade from a single card draw are mutually exclusive. Rolling a 1 and rolling a 6 on one die are mutually exclusive.

If E ∩ F = ∅: P(E ∪ F) = P(E) + P(F)

This is just Axiom 3 — the very foundation. When events don't overlap, you simply add their probabilities. No correction needed.

Worked example — Card suit: Draw one card from a standard 52-card deck. What is P(Heart or Spade)?

E = Heart (13 cards), F = Spade (13 cards). No card is both a heart and a spade, so E and F are mutually exclusive.

P(Heart or Spade) = P(Heart) + P(Spade) = 13/52 + 13/52 = 26/52 = 1/2 = 0.5
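The card example can be verified by building the deck and checking that the two events really are disjoint before adding. This is a sketch: the rank and suit labels and the helper `p` are illustrative choices, not notation from the text.

```python
from fractions import Fraction
from itertools import product

RANKS = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
SUITS = ["hearts", "diamonds", "clubs", "spades"]

# Sample space: 52 equally likely (rank, suit) cards.
deck = set(product(RANKS, SUITS))


def p(event):
    """Equally likely outcomes: P(E) = |E| / |S|."""
    return Fraction(len(event), len(deck))


hearts = {card for card in deck if card[1] == "hearts"}
spades = {card for card in deck if card[1] == "spades"}

# Axiom 3 applies only if the events share no outcomes.
assert hearts.isdisjoint(spades)

p_heart_or_spade = p(hearts) + p(spades)
print(p_heart_or_spade)  # 1/2
```

Counting the union directly, `p(hearts | spades)`, gives the same value, which is what Axiom 3 promises for disjoint events.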

This extends to any number of mutually exclusive events. If E₁, E₂, ..., E_n are all pairwise mutually exclusive (no outcome appears in more than one event):

P(E₁ ∪ E₂ ∪ ... ∪ E_n) = P(E₁) + P(E₂) + ... + P(E_n) = ∑_{i=1}^{n} P(E_i)

Caution: Mutual exclusion only simplifies P(E or F). It tells you nothing about P(E and F) — in fact, for mutually exclusive events, P(E and F) = 0 by definition. Don't confuse "or" with "and."

But what happens when events are NOT mutually exclusive? That's where things get interesting — and where most students make their first mistake. Simply adding probabilities double-counts the overlap. The next chapter shows how to fix this.

Check: You roll a fair die. E = {1,2}, F = {5,6}. What is P(E or F)?

1/3
1/6
2/3: add 2/6 + 2/6 since they're mutually exclusive

Chapter 6: Inclusion-Exclusion

What if events E and F are NOT mutually exclusive? Simply adding P(E) + P(F) double-counts every outcome that belongs to both events. The fix is elegant: subtract the overlap.

P(E ∪ F) = P(E) + P(F) − P(E ∩ F)

This is the inclusion-exclusion principle. You "include" E and F, then "exclude" their intersection to correct the double-counting. If E and F are mutually exclusive, the intersection is empty and P(E ∩ F) = 0, so the formula reduces to simple addition.

Worked example — Dice (buggy vs correct):
E = even number on a die = {2, 4, 6}, so P(E) = 3/6 = 0.5.
F = three or less = {1, 2, 3}, so P(F) = 3/6 = 0.5.

Buggy: P(E or F) = 0.5 + 0.5 = 1.0. But 5 is neither even nor ≤ 3, so probability can't be 1!

Correct: E ∩ F = {2} (in both events). P(E ∩ F) = 1/6.
P(E or F) = 0.5 + 0.5 − 1/6 = 5/6 ≈ 0.833.
Check: E ∪ F = {1, 2, 3, 4, 6} = 5 outcomes. 5/6. Correct.
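The buggy and correct calculations can be placed side by side in code. This is a minimal sketch using Python sets, where `&` is intersection and `|` is union. The helper `p` assumes equally likely outcomes.

```python
from fractions import Fraction

S = set(range(1, 7))                      # one fair die
E = {face for face in S if face % 2 == 0}  # even number
F = {face for face in S if face <= 3}      # three or less


def p(event):
    """Equally likely outcomes: P(E) = |E| / |S|."""
    return Fraction(len(event), len(S))


naive = p(E) + p(F)                 # double-counts the outcome 2
corrected = p(E) + p(F) - p(E & F)  # inclusion-exclusion
direct = p(E | F)                   # count the union itself

print(f"naive = {naive}, corrected = {corrected}, direct = {direct}")
```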

For three events, the pattern extends. You add the singles, subtract the pairs, then add back the triple intersection:

P(E₁ ∪ E₂ ∪ E₃) = P(E₁) + P(E₂) + P(E₃)
− P(E₁ ∩ E₂) − P(E₁ ∩ E₃) − P(E₂ ∩ E₃)
+ P(E₁ ∩ E₂ ∩ E₃)

The general pattern for n events: add all singles, subtract all pairs, add all triples, subtract all quadruples, and so on, alternating signs. This grows combinatorially — with n events you have 2ⁿ − 1 terms. For large n, it's usually better to find a cleverer approach.
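The alternating pattern can be sketched as a loop over all non-empty groups of events. The function below assumes equally likely outcomes so that each intersection probability is a count divided by |S|. The function name and the example (numbers 1 to 30 that are multiples of 2, 3, or 5) are hypothetical illustrations.

```python
from fractions import Fraction
from itertools import combinations


def p_union_inclusion_exclusion(events, space):
    """Return (P(union of events), number of terms) via inclusion-exclusion.

    Groups of odd size are added and groups of even size are subtracted.
    """
    total = Fraction(0)
    n_terms = 0
    for size in range(1, len(events) + 1):
        sign = 1 if size % 2 == 1 else -1
        for group in combinations(events, size):
            overlap = set.intersection(*group)
            total += sign * Fraction(len(overlap), len(space))
            n_terms += 1
    return total, n_terms


# Illustrative example: pick a number from 1..30 uniformly at random.
space = set(range(1, 31))
events = [{x for x in space if x % d == 0} for d in (2, 3, 5)]

p_union, n_terms = p_union_inclusion_exclusion(events, space)
p_direct = Fraction(len(set.union(*events)), len(space))
print(f"inclusion-exclusion: {p_union} using {n_terms} terms, direct count: {p_direct}")
```

With three events the loop visits 2³ − 1 = 7 groups. Each extra event roughly doubles that number, which is why the formula becomes impractical for large n.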

Key insight: Inclusion-exclusion for non-mutually-exclusive events requires knowing P(E and F). We haven't yet learned how to compute that in general; that is the topic of Ch 3 of the course (conditional probability). For now, in problems with equally likely outcomes, you can count: P(E ∩ F) = |E ∩ F| / |S|.

Below is a visual simulator. Two events E and F are represented as circles in a sample space of hexagons. Drag the slider to control how much they overlap. Watch how the inclusion-exclusion correction changes.

Interactive visualizer: Inclusion-Exclusion. Adjust the overlap between events E and F and observe how P(E or F) changes. When the overlap is 0, the events are mutually exclusive.

Check: P(A) = 0.4, P(B) = 0.5, P(A ∩ B) = 0.2. What is P(A ∪ B)?

0.9
0.7
0.5

Chapter 7: Showcase — Probability Simulator

Time to bring everything together. The simulator below lets you define a sample space, choose events, and compute probabilities using every method from this chapter. Roll dice, flip coins, or draw cards — and watch the empirical probability converge to the theoretical value.

Choose an experiment and event, then run trials. The top panel shows the sample space with event outcomes highlighted. The bottom panel tracks the empirical probability converging to the true value over thousands of trials.

Interactive Probability Simulator: pick an experiment and event, run trials, and watch the empirical probability converge. Try all three experiments to build intuition.

What to observe: Whichever experiment you pick, the empirical ratio settles toward the true probability. With 10 trials the estimate is noisy. With 100 it is usually in the right neighborhood. With 1,000 or more it is typically within a few percentage points. This is the law of large numbers in action: the formal statement that the empirical definition of probability actually works.

For the "Card: Heart or Face" experiment, note that this uses inclusion-exclusion. E = Heart (13 cards), F = Face card (12 cards). But 3 cards are both hearts AND face cards (J♥, Q♥, K♥). So P(E ∪ F) = 13/52 + 12/52 − 3/52 = 22/52 ≈ 0.423. The simulator verifies this.
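The Heart-or-Face calculation can be reproduced by counting cards and then checked against a simulated draw. This is a sketch under assumptions, not the simulator's own code: the card labels, the function name `simulate_heart_or_face`, the trial count, and the seed are all illustrative.

```python
import random
from fractions import Fraction
from itertools import product

RANKS = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
SUITS = ["hearts", "diamonds", "clubs", "spades"]
deck = list(product(RANKS, SUITS))  # 52 equally likely cards

hearts = {card for card in deck if card[1] == "hearts"}
faces = {card for card in deck if card[0] in {"J", "Q", "K"}}
hearts_or_faces = hearts | faces

# Inclusion-exclusion on counts: |E| + |F| - |E and F|, divided by |S|.
exact = Fraction(len(hearts) + len(faces) - len(hearts & faces), len(deck))


def simulate_heart_or_face(n_trials=200_000, seed=0):
    """Draw one card per trial and return the fraction landing in the event."""
    rng = random.Random(seed)
    hits = sum(1 for _ in range(n_trials) if rng.choice(deck) in hearts_or_faces)
    return hits / n_trials


print(f"exact = {exact} = {float(exact):.3f}, simulated = {simulate_heart_or_face():.3f}")
```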

Check: In the card experiment, why can't we just add P(Heart) + P(Face card)?

Hearts and face cards are independent
They are NOT mutually exclusive: J/Q/K of hearts are in both events
The sample space is too large

Chapter 8: Worked Problems

Let's solidify the concepts with four fully worked problems, each using a different technique from this chapter.

Problem 1: Equally Likely + Complement

You roll two fair dice. What is the probability that the sum is NOT 7?

Solution: From Chapter 3 (Equally Likely Outcomes), P(sum = 7) = 6/36 = 1/6.
By the complement rule: P(sum ≠ 7) = 1 − P(sum = 7) = 1 − 1/6 = 5/6 ≈ 0.833.
We avoided counting 30 outcomes by counting 6 and subtracting.

Problem 2: Mutually Exclusive Or

A bag has 5 red, 3 blue, and 2 green marbles. You draw one. What is P(Red or Green)?

Solution: S has 10 equally likely outcomes (each marble is distinct). E₁ = Red (5 outcomes), E₂ = Green (2 outcomes). A marble can't be both red and green, so the events are mutually exclusive.
P(Red or Green) = P(Red) + P(Green) = 5/10 + 2/10 = 7/10 = 0.70.

Problem 3: Inclusion-Exclusion

In a class of 30 students, 18 study math, 15 study CS, and 10 study both. A student is chosen at random. What is P(Math or CS)?

Solution: S has 30 equally likely outcomes. E = Math (18), F = CS (15), E ∩ F = both (10).
P(Math or CS) = P(Math) + P(CS) − P(Math and CS) = 18/30 + 15/30 − 10/30 = 23/30 ≈ 0.767.
Without the correction we'd get 33/30 > 1 — impossible!
Check: 18 + 15 − 10 = 23 students study at least one subject. 23/30. Correct.
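The arithmetic in Problem 3 can be double-checked with exact fractions and an explicit roster. The roster below is a made-up assignment of student numbers that matches the given counts (18 math, 15 CS, 10 both). Any assignment with those counts would give the same answer.

```python
from fractions import Fraction

n_students, n_math, n_cs, n_both = 30, 18, 15, 10

# Formula: add the two events, then subtract the overlap.
naive = Fraction(n_math + n_cs, n_students)
p_math_or_cs = (
    Fraction(n_math, n_students)
    + Fraction(n_cs, n_students)
    - Fraction(n_both, n_students)
)

# Hypothetical roster consistent with the counts: students are numbered 0..29.
math = set(range(0, 18))  # students 0..17
cs = set(range(8, 23))    # students 8..22, so 8..17 study both
assert len(math) == n_math and len(cs) == n_cs and len(math & cs) == n_both

p_from_roster = Fraction(len(math | cs), n_students)
print(f"naive = {naive}, corrected = {p_math_or_cs}, from roster = {p_from_roster}")
```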

Problem 4: Empirical + Data

A website receives 50,000 visits in a month. Of those, 12,000 result in a purchase. What is the estimated probability a visit results in a purchase?

Solution: By the empirical definition, P(Purchase) ≈ count(Purchase) / n = 12,000 / 50,000 = 0.24.
This is an approximation. With more data (more months, more visits), the estimate would become more precise.

Check: P(A) = 0.6, P(B) = 0.5, A and B are mutually exclusive. What is P(A or B)?

0.8
Undefined: not enough info
Impossible: mutually exclusive events with P(A)+P(B) > 1 can't exist

Chapter 9: Connections

This chapter established the language and rules we'll use for the rest of the course. Here's how it connects to what's ahead:

Concept from Ch 2	Where it goes
Sample spaces & events	Every chapter — this is the vocabulary of probability
Kolmogorov axioms	Every theorem we prove traces back to these three rules
Equally likely outcomes	Ch 1 counting + Ch 2 = toolkit for discrete uniform problems
Complement rule	Essential trick in Ch 3 (conditional) and Ch 5 (independence)
Mutually exclusive or	Foundation for Ch 7 (total probability) and partitions
Inclusion-exclusion	Returns in Ch 3 when computing conditional probabilities
Empirical definition	Basis for Ch 11 (estimation) and simulation methods

The big picture: Chapter 1 gave you counting tools. This chapter showed you what to count and why. With equally likely outcomes, probability = counting. But many real problems have unequal likelihoods, and that's where conditional probability (Chapter 3) becomes essential. The natural next question is: what happens when the probability of one event depends on whether another event occurred?

Ch 1: Counting. How to compute |E| and |S|
↓
Ch 2: Probability (this chapter). P(E) = |E|/|S|, axioms, or-rules
↓
Ch 3: Conditional Probability. P(E|F): what if events are linked?

You now have three tools for computing probabilities: (1) the empirical definition (run trials, count), (2) equally likely outcomes (count favorable / total), and (3) the axioms and their corollaries (complement, or-rules, inclusion-exclusion). Chapter 3 adds the fourth and most powerful tool: conditioning.

Check: Which technique is most useful for "what is P(at least one X in n trials)?"

Complement rule: compute P(zero X's) and subtract from 1
Inclusion-exclusion on all n trials
Direct counting of all outcomes with at least one X