What it means when we say "the probability of rain is 52%" — and how to compute the probability of combined events.
Someone tells you "there's a 30% chance of rain tomorrow." What does that actually mean? If tomorrow either rains or it doesn't — there's no 30% of a rainstorm — then what is this number measuring?
It turns out that humans didn't have a rigorous answer to this question until the 20th century. The word "probability" was used for centuries before anyone pinned down a formal definition. That definition, and the rules that follow from it, are the foundation of everything in this course.
This chapter covers three big ideas. First, the empirical definition of probability: run an experiment many times and count. Second, equally likely outcomes: a powerful shortcut when every outcome has the same chance. Third, the probability of "or": how to combine events, with the inclusion-exclusion principle as the workhorse formula.
Think of probability as a measuring tool. Just as a ruler measures length and a scale measures weight, probability measures how likely something is to happen. The measurement is always a number between 0 (impossible) and 1 (certain). Everything else in probability theory — conditional probability, Bayes' theorem, distributions — is built on top of this definition and three simple axioms.
| Term | Meaning |
|---|---|
| Experiment | A repeatable process (flip a coin, roll a die, sample a user) |
| Sample Space S | The set of all possible outcomes |
| Event E | A subset of S that we care about |
| P(E) | A number in [0, 1] measuring the likelihood of E |
Here is the formal definition that took centuries to nail down. Suppose you perform n trials of an experiment. Let count(E) be the number of trials where event E occurs. Then:
$$P(E) = \lim_{n \to \infty} \frac{\text{count}(E)}{n}$$
In plain English: the probability of E is the fraction of trials that produce E, in the limit as you run infinitely many trials. With 10 coin flips you might get 7 heads (70%). With 1,000 flips you will typically be much closer to 50%, and with a million flips closer still. The ratio converges to the true probability.
Let's see this convergence in action. The simulation below rolls a fair six-sided die repeatedly. The event E is "rolling a 5 or 6." The true probability is 2/6 ≈ 0.333. Watch how the empirical ratio oscillates wildly at first, then settles down as n grows.
Event E: rolling a 5 or 6 on a fair die. True P(E) = 2/6 ≈ 0.333. Click Run to start rolling, or Step for one roll at a time.
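The convergence described above can be reproduced in a few lines. This is a minimal sketch, not the page's own simulator: the function name `empirical_probability`, the seed, and the trial counts are illustrative choices.

```python
import random

def empirical_probability(n_trials, rng):
    """Fraction of n_trials fair-die rolls that land on 5 or 6."""
    hits = 0
    for _ in range(n_trials):
        roll = rng.randint(1, 6)   # one roll of a fair six-sided die
        if roll >= 5:              # event E: rolling a 5 or 6
            hits += 1
    return hits / n_trials

rng = random.Random(0)             # fixed seed so runs are repeatable
for n in (10, 100, 1_000, 100_000):
    # The printed ratio should drift toward 1/3 as n grows.
    print(n, round(empirical_probability(n, rng), 4))
```

Small values of n can land far from 0.333; the larger runs should sit within about a percentage point of it.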
This definition also works for computing probabilities from data. Here's a worked example from the textbook:
Another way to compute probabilities is via simulation. For complex problems where analytical calculation is too hard, you can run millions of trials on a computer. If your simulations faithfully generate outcomes from the sample space, the fraction of trials producing E converges to P(E). We'll use this technique throughout the course.
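One way to package this technique is a small helper that takes the experiment and the event as functions. The names `estimate_probability`, `flip_two_coins`, and `at_least_one_head` are hypothetical, and the two-coin experiment is only a stand-in for a harder problem.

```python
import random

def estimate_probability(experiment, event, n_trials, seed=0):
    """Run the experiment n_trials times; return the fraction of trials where event holds."""
    rng = random.Random(seed)
    hits = sum(1 for _ in range(n_trials) if event(experiment(rng)))
    return hits / n_trials

def flip_two_coins(rng):
    """One trial: an ordered pair of fair coin flips."""
    return (rng.choice("HT"), rng.choice("HT"))

def at_least_one_head(outcome):
    """Event: the pair contains at least one head."""
    return "H" in outcome

estimate = estimate_probability(flip_two_coins, at_least_one_head, 100_000)
print(estimate)  # should be close to the exact value 3/4
```

Keeping the experiment separate from the event means the same helper can be reused for any event on the same sample space.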
Before computing any probability, you need to precisely define two things: the sample space S (the set of all possible outcomes) and the event E (the subset you care about). Getting these right is half the battle.
| Experiment | Sample Space S | Example Event E |
|---|---|---|
| Flip a coin | {H, T} | E = {H} (heads) |
| Flip two coins | {(H,H), (H,T), (T,H), (T,T)} | E = {(H,H), (H,T), (T,H)} (at least one head) |
| Roll a die | {1, 2, 3, 4, 5, 6} | E = {1, 2, 3} (3 or less) |
| Emails per day | {x \| x ∈ ℤ, x ≥ 0} | E = {x \| 0 ≤ x < 20} |
| YouTube hours | {x \| x ∈ ℝ, 0 ≤ x ≤ 24} | E = {x \| 5 ≤ x ≤ 24} ("wasted day") |
Notice that sample spaces can be discrete (finite or countably infinite, like dice or emails) or continuous (uncountably infinite, like YouTube hours). The definition of probability handles both.
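Finite sample spaces and events map naturally onto Python sets, which makes "an event is a subset of S" something you can check directly. The sketch below mirrors rows of the table; the variable names are illustrative, and the infinite "emails" space is represented by a membership test rather than a set.

```python
# Finite sample spaces as sets of outcomes.
die = {1, 2, 3, 4, 5, 6}
three_or_less = {1, 2, 3}

two_coins = {(a, b) for a in "HT" for b in "HT"}
at_least_one_head = {outcome for outcome in two_coins if "H" in outcome}

# An event must be a subset of its sample space.
print(three_or_less <= die)            # True
print(at_least_one_head <= two_coins)  # True
print(len(two_coins), len(at_least_one_head))  # 4 3

# A countably infinite space cannot be listed, so describe the event by a predicate.
def fewer_than_20_emails(x):
    """Event E = {x | 0 <= x < 20} on the non-negative integers."""
    return isinstance(x, int) and 0 <= x < 20

print(fewer_than_20_emails(7), fewer_than_20_emails(25))  # True False
```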
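Now for the rules. In 1933, Andrey Kolmogorov showed that all of probability theory can be built from just three axioms:
$$\text{Axiom 1: } 0 \le P(E) \le 1$$
$$\text{Axiom 2: } P(S) = 1$$
$$\text{Axiom 3: if } E \cap F = \emptyset, \text{ then } P(E \cup F) = P(E) + P(F)$$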
Axiom 1 says probabilities live between 0 and 1. This follows naturally from the empirical definition: E can't occur in more trials than you ran, and it can't occur a negative number of times.
Axiom 2 says the probability of something happening is 1. If your sample space covers every possible outcome, then every trial must produce some outcome in S.
Axiom 3 says if two events share no outcomes (they are mutually exclusive), then the probability of either one happening is just the sum. This is the addition rule for disjoint events — and it's the foundation for everything in Chapters 5 and 6.
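As a sanity check, the three axioms can be verified exhaustively on a small example. This sketch assumes a fair six-sided die in which every outcome carries weight 1/6; the function `P` and the list `all_events` are illustrative names, not part of the course material.

```python
from fractions import Fraction
from itertools import combinations

S = frozenset(range(1, 7))  # sample space of one die

def P(event):
    """Probability of an event (a subset of S), assuming a fair die."""
    return Fraction(len(event), len(S))

# Every possible event: all 2**6 = 64 subsets of S.
all_events = [frozenset(c) for r in range(len(S) + 1) for c in combinations(S, r)]

axiom1 = all(0 <= P(E) <= 1 for E in all_events)
axiom2 = P(S) == 1
axiom3 = all(
    P(E | F) == P(E) + P(F)
    for E in all_events
    for F in all_events
    if not (E & F)  # only pairs with no shared outcomes
)
print(axiom1, axiom2, axiom3)  # True True True
```

Using `Fraction` keeps the arithmetic exact, so the equality in Axiom 3 is not disturbed by floating-point rounding.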
Many experiments have a beautiful property: every outcome in the sample space is equally likely. Fair coins, fair dice, well-shuffled decks, random selections — these all produce equally likely outcomes. And when outcomes are equally likely, computing probabilities becomes pure counting.
When every outcome in S is equally likely:
$$P(E) = \frac{|E|}{|S|}$$
Here |E| is the number of outcomes in event E, and |S| is the total number of outcomes in the sample space. In other words, count the favorable outcomes and divide by the total number of outcomes.
But there's art in setting this up correctly. You must: (1) define S so that all outcomes are equally likely, (2) count |S|, and (3) count |E| using the same definition of outcomes. Getting step 1 wrong is the most common mistake.
The simulation below lets you explore this for a roll of two fair dice. It shows all 36 outcomes in a grid, highlighting those that sum to your chosen target. You can verify that 7 is the most probable sum: it has 6 outcomes, more than any other target.
All 36 outcomes on two dice. Adjust the target sum — highlighted cells are the event. P(E) = |E|/36.
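The same grid can be enumerated in code. This is a small sketch, treating each ordered pair of dice as one equally likely outcome; `ways` and `p_sum` are illustrative names.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# Sample space: all ordered pairs (first die, second die).
S = list(product(range(1, 7), repeat=2))

# Number of outcomes |E| for each target sum.
ways = Counter(a + b for a, b in S)

def p_sum(target):
    """P(sum == target) = |E| / |S| for two fair dice."""
    return Fraction(ways[target], len(S))

for target in range(2, 13):
    print(target, ways[target], p_sum(target))
```

Ordered pairs matter here: (1, 6) and (6, 1) are distinct outcomes. Collapsing them into one unordered outcome would make the outcomes no longer equally likely, which is the step 1 mistake described above.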
This idea extends to continuous sample spaces too. Consider a random number generator that produces a real number uniformly between 0 and 1. The probability of the number landing in [0.3, 0.7] is the ratio of the interval length to the total length: 0.4 / 1 = 0.4. All "locations" are equally likely, so probability reduces to measuring lengths (or areas, or volumes).
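The interval example can be checked both ways: by the length ratio and by simulation. This sketch uses Python's `random.random()` as the uniform generator; `interval_probability` is a hypothetical helper name.

```python
import random

def interval_probability(a, b, lo=0.0, hi=1.0):
    """P(a <= X <= b) for X uniform on [lo, hi]: a ratio of lengths."""
    return (b - a) / (hi - lo)

exact = interval_probability(0.3, 0.7)

rng = random.Random(0)  # fixed seed so runs are repeatable
n = 100_000
hits = sum(1 for _ in range(n) if 0.3 <= rng.random() <= 0.7)
estimate = hits / n

print(exact, estimate)  # both approximately 0.4
```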
From the three axioms, we can immediately prove some useful identities. The most important is the complement rule: the probability of an event NOT happening is 1 minus the probability of it happening.
$$P(E^C) = 1 - P(E)$$
The proof is short and elegant. The event E and its complement E^C together cover the entire sample space: E ∪ E^C = S. They share no outcomes: E ∩ E^C = ∅. So by Axiom 3:
$$P(S) = P(E \cup E^C) = P(E) + P(E^C)$$
By Axiom 2, P(S) = 1, so P(E^C) = 1 − P(E).
This identity is more than a curiosity: it's a problem-solving strategy. Whenever computing P(E) directly is hard, try computing P(E^C) instead. If the complement is simpler, you win.
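A tiny example of the strategy, using the two-coin experiment: "at least one head" has three outcomes, but its complement "no heads" has only one. The names below are illustrative.

```python
from fractions import Fraction
from itertools import product

S = set(product("HT", repeat=2))            # {(H,H), (H,T), (T,H), (T,T)}
E = {outcome for outcome in S if "H" in outcome}  # at least one head
E_c = S - E                                 # complement: no heads at all

def P(event):
    """Equally likely outcomes: |event| / |S|."""
    return Fraction(len(event), len(S))

direct = P(E)                # count E directly
via_complement = 1 - P(E_c)  # count the simpler complement instead
print(direct, via_complement)  # 3/4 3/4
```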
There's a second provable identity from the axioms: if E ⊆ F, then P(E) ≤ P(F). This makes intuitive sense: if every outcome in E is also in F, then F is at least as likely as E. The proof takes one line. Write F = E ∪ (F − E), a union of two disjoint sets, so Axiom 3 gives P(F) = P(E) + P(F − E), and P(F − E) ≥ 0 by Axiom 1.
How do you compute the probability of event E or event F happening? Written P(E ∪ F), this is one of the most common probability calculations. The answer depends on whether the events are mutually exclusive.
Two events are mutually exclusive (or "disjoint") if they share no outcomes: E ∩ F = ∅. In plain English, they can't both happen. Drawing a heart and drawing a spade from a single card draw are mutually exclusive. Rolling a 1 and rolling a 6 on one die are mutually exclusive.
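For mutually exclusive events, the probability of "or" is a plain sum:
$$P(E \cup F) = P(E) + P(F) \quad \text{when } E \cap F = \emptyset$$
This is just Axiom 3, the very foundation. When events don't overlap, you simply add their probabilities. No correction needed.
This extends to any number of mutually exclusive events. If E_1, E_2, ..., E_n are all pairwise mutually exclusive (no outcome appears in more than one event):
$$P(E_1 \cup E_2 \cup \cdots \cup E_n) = \sum_{i=1}^{n} P(E_i)$$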
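The heart-or-spade example can be checked by building a deck and confirming that the two events are disjoint before adding. This is a sketch with an assumed card encoding of (rank, suit) pairs.

```python
from fractions import Fraction
from itertools import product

ranks = ["A"] + [str(n) for n in range(2, 11)] + ["J", "Q", "K"]
suits = ["hearts", "diamonds", "clubs", "spades"]
deck = set(product(ranks, suits))  # 52 (rank, suit) pairs

def P(event):
    """Equally likely single-card draw: |event| / 52."""
    return Fraction(len(event), len(deck))

hearts = {card for card in deck if card[1] == "hearts"}
spades = {card for card in deck if card[1] == "spades"}

print(hearts.isdisjoint(spades))                # True
print(P(hearts | spades), P(hearts) + P(spades))  # 1/2 1/2

# n pairwise-disjoint events: the four suits split the whole deck.
by_suit = [{card for card in deck if card[1] == s} for s in suits]
total = sum(P(event) for event in by_suit)
print(total)  # 1
```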
But what happens when events are NOT mutually exclusive? That's where things get interesting — and where most students make their first mistake. Simply adding probabilities double-counts the overlap. The next chapter shows how to fix this.
What if events E and F are NOT mutually exclusive? Simply adding P(E) + P(F) double-counts every outcome that belongs to both events. The fix is elegant: subtract the overlap.
$$P(E \cup F) = P(E) + P(F) - P(E \cap F)$$
This is the inclusion-exclusion principle. You "include" E and F, then "exclude" their intersection to correct the double-counting. If E and F are mutually exclusive, the intersection is empty and P(E ∩ F) = 0, so the formula reduces to simple addition.
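To see the double-counting concretely, here is a sketch on one fair die with two overlapping events chosen purely for illustration: E = "even" and F = "3 or less". They share the outcome 2.

```python
from fractions import Fraction

S = {1, 2, 3, 4, 5, 6}
E = {2, 4, 6}  # even roll
F = {1, 2, 3}  # 3 or less

def P(event):
    """Equally likely outcomes on a fair die."""
    return Fraction(len(event), len(S))

naive = P(E) + P(F)                 # counts the outcome 2 twice
corrected = P(E) + P(F) - P(E & F)  # inclusion-exclusion
direct = P(E | F)                   # count the union directly

print(naive, corrected, direct)  # 1 5/6 5/6
```

The naive sum claims the union is certain, which is clearly wrong because rolling a 5 belongs to neither event.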
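For three events, the pattern extends. You add the singles, subtract the pairs, then add back the triple intersection:
$$P(E \cup F \cup G) = P(E) + P(F) + P(G) - P(E \cap F) - P(E \cap G) - P(F \cap G) + P(E \cap F \cap G)$$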
The general pattern for n events: add all singles, subtract all pairs, add all triples, subtract all quadruples, and so on, alternating signs. This grows combinatorially: with n events you have 2^n − 1 terms. For large n, it's usually better to find a cleverer approach.
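The alternating pattern translates directly into a loop over subsets of the events. The sketch below assumes equally likely outcomes; the function `union_probability` and the example events on a hypothetical fair 12-sided die are illustrative only.

```python
from fractions import Fraction
from itertools import combinations

def union_probability(events, sample_space):
    """P(union of events) by inclusion-exclusion; also returns the number of terms used."""
    total = Fraction(0)
    n_terms = 0
    for k in range(1, len(events) + 1):
        sign = 1 if k % 2 == 1 else -1      # add odd-sized groups, subtract even-sized
        for group in combinations(events, k):
            overlap = set.intersection(*group)  # outcomes shared by every event in the group
            total += sign * Fraction(len(overlap), len(sample_space))
            n_terms += 1
    return total, n_terms

S = set(range(1, 13))              # hypothetical fair 12-sided die
E = {x for x in S if x % 2 == 0}   # even
F = {x for x in S if x % 3 == 0}   # multiple of 3
G = {x for x in S if x <= 4}       # 4 or less

prob, n_terms = union_probability([E, F, G], S)
print(prob, n_terms)  # 3/4 7
```

With three events the loop visits 2^3 − 1 = 7 groups. Each additional event doubles the work, which is why the direct count of the union is the better check when it is available.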
Below is a visual simulator. Two events E and F are represented as circles in a sample space of hexagons. Drag the slider to control how much they overlap. Watch how the inclusion-exclusion correction changes.
Adjust overlap between events E and F. When overlap = 0, they are mutually exclusive. Observe how P(E or F) changes.
Time to bring everything together. The simulator below lets you define a sample space, choose events, and compute probabilities using every method from this chapter. Roll dice, flip coins, or draw cards — and watch the empirical probability converge to the theoretical value.
Choose an experiment and event, then run trials. The top panel shows the sample space with event outcomes highlighted. The bottom panel tracks the empirical probability converging to the true value over thousands of trials.
Pick an experiment and event. Run trials and watch empirical probability converge. Try all three experiments to build intuition.
For the "Card: Heart or Face" experiment, note that this uses inclusion-exclusion. E = Heart (13 cards), F = Face card (12 cards). But 3 cards are both hearts AND face cards (J♥, Q♥, K♥). So P(E ∪ F) = 13/52 + 12/52 − 3/52 = 22/52 ≈ 0.423. The simulator verifies this.
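The card counts above can be confirmed by enumeration. This sketch assumes a standard 52-card deck encoded as (rank, suit) pairs; the encoding is an illustrative choice.

```python
from fractions import Fraction
from itertools import product

ranks = ["A"] + [str(n) for n in range(2, 11)] + ["J", "Q", "K"]
suits = ["♥", "♦", "♣", "♠"]
deck = set(product(ranks, suits))

hearts = {card for card in deck if card[1] == "♥"}
faces = {card for card in deck if card[0] in {"J", "Q", "K"}}
both = hearts & faces  # J♥, Q♥, K♥

def P(event):
    """Equally likely single-card draw."""
    return Fraction(len(event), len(deck))

p_union = P(hearts) + P(faces) - P(both)  # inclusion-exclusion
print(len(hearts), len(faces), len(both))  # 13 12 3
print(p_union, round(float(p_union), 3))   # 11/26 0.423
```

`Fraction` reduces 22/52 to 11/26, which is the same value.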
Let's solidify the concepts with four worked problems, each using a different technique from this chapter.
You roll two fair dice. What is the probability that the sum is NOT 7?
A bag has 5 red, 3 blue, and 2 green marbles. You draw one. What is P(Red or Green)?
In a class of 30 students, 18 study math, 15 study CS, and 10 study both. A student is chosen at random. What is P(Math or CS)?
A website receives 50,000 visits in a month. Of those, 12,000 result in a purchase. What is the estimated probability a visit results in a purchase?
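One way to check your answers to the problems above is to compute each with exact fractions. The pairing of each problem with a technique (complement rule, mutually exclusive "or", inclusion-exclusion, empirical definition) is a reading of the problem statements, not a solution key from the source.

```python
from fractions import Fraction

# Two dice, sum is NOT 7: complement rule, 1 - P(sum is 7).
sums_to_7 = sum(1 for a in range(1, 7) for b in range(1, 7) if a + b == 7)
p_not_7 = 1 - Fraction(sums_to_7, 36)

# Marbles, Red or Green: one draw cannot be both colors, so add.
total_marbles = 5 + 3 + 2
p_red_or_green = Fraction(5, total_marbles) + Fraction(2, total_marbles)

# Class of 30, Math or CS: 10 students are in both, so subtract the overlap.
p_math_or_cs = Fraction(18, 30) + Fraction(15, 30) - Fraction(10, 30)

# Website visits: empirical definition, count(E) / n.
p_purchase = Fraction(12_000, 50_000)

print(p_not_7, p_red_or_green, p_math_or_cs, float(p_purchase))
# 5/6 7/10 23/30 0.24
```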
This chapter established the language and rules we'll use for the rest of the course. Here's how it connects to what's ahead:
| Concept from Ch 2 | Where it goes |
|---|---|
| Sample spaces & events | Every chapter — this is the vocabulary of probability |
| Kolmogorov axioms | Every theorem we prove traces back to these three rules |
| Equally likely outcomes | Ch 1 counting + Ch 2 = toolkit for discrete uniform problems |
| Complement rule | Essential trick in Ch 3 (conditional) and Ch 5 (independence) |
| Mutually exclusive or | Foundation for Ch 7 (total probability) and partitions |
| Inclusion-exclusion | Returns in Ch 3 when computing conditional probabilities |
| Empirical definition | Basis for Ch 11 (estimation) and simulation methods |
You now have three tools for computing probabilities: (1) the empirical definition (run trials, count), (2) equally likely outcomes (count favorable / total), and (3) the axioms and their corollaries (complement, or-rules, inclusion-exclusion). Chapter 3 adds the fourth and most powerful tool: conditioning.