Piech, Chapter 3

Conditional Probability

Updating beliefs with new evidence — from conditional probability to Bayes' theorem.

Prerequisites: Chapter 2 (Probability basics, axioms, sample spaces). That's it.
10 chapters · 3 simulations · 10 quizzes

Chapter 0: Why Conditional Probability?

Imagine a streaming service like Netflix wants to recommend movies. The raw probability that any user watches Life is Beautiful is about 2%. Not very useful. But if you know the user just watched Amélie, the probability jumps to 9%. Knowing one thing changed our belief about another. That is conditional probability.

In English, a conditional probability answers: "what is the chance of event E happening, given that I have already observed some other event F?" It is the single most important idea in applied probability because it lets us update our beliefs in the face of new evidence.

When you condition on an event F, you enter the universe where F has taken place. Your sample space shrinks to just the outcomes consistent with F. Inside that smaller universe, all the rules of probability still hold — axioms, complements, inclusion-exclusion, everything.

The core idea: Conditioning is like putting on a filter. Once you know F happened, you throw away every outcome where F didn't happen. The remaining outcomes — now your entire world — are the only ones that matter. Conditional probability measures how much of that filtered world is also in E.

This chapter builds five tightly linked tools. Conditional probability defines the filter. The chain rule lets us compute joint probabilities by chaining conditionals. Independence tells us when the filter changes nothing. The law of total probability lets us decompose hard problems into easy conditional pieces. And Bayes' theorem flips a conditional around — converting P(evidence | cause) into P(cause | evidence).

Tool              What it does
P(E|F)            Probability of E after observing F
Chain Rule        P(E and F) = P(E|F) · P(F)
Independence      P(E|F) = P(E): F tells you nothing about E
LOTP              Decompose P(E) using background events
Bayes' Theorem    Flip the conditional: P(F|E) from P(E|F)

These five tools, combined with the axioms from Chapter 2, are enough to solve virtually every probability problem in this course. They are also the mathematical foundation of machine learning, medical diagnostics, spam filters, and recommendation engines.

Real-world applications:
Medical testing: A positive test doesn't mean you have the disease — Bayes' theorem quantifies the actual probability.
Spam filters: Naive Bayes classifiers use conditional probability to score every email.
Recommendation engines: P(watch movie E | watched movie F) drives Netflix-style recommendations.
Search engines: The relevance of a document given a query is a conditional probability problem.

[Interactive figure: Conditioning: Shrinking the Sample Space. 50 equally likely hexagonal outcomes; blue = event E, orange = event F, dark = E ∩ F. Toggling event F switches the display from P(E) to P(E|F) = |E ∩ F| / |F|, the fraction of F's outcomes that are also in E.]

Check: If you know a user watched Amélie, and the probability they watch Life is Beautiful jumps from 2% to 9%, what does this tell you about the two events?

Chapter 1: Definition of P(E|F)

Here is the setup: you have a sample space S with events E and F. You learn that F has occurred. What is the probability of E now?

Once F is known, your new sample space is F itself. The only outcomes of E that still matter are the ones also in F — that is, the intersection E ∩ F. So you take the fraction of the original probability that lands in both E and F, relative to the total probability of F.

P(E|F) = P(E and F) / P(F)

This definition requires P(F) > 0. If P(F) = 0, the conditional is undefined: you cannot condition on an event that has zero probability.

Visual intuition: Imagine 50 equally likely outcomes drawn as hexagons. F covers 14 of them. E covers some other set. The overlap E ∩ F contains 3 hexagons. Then P(E|F) = (3/50) / (14/50) = 3/14 ≈ 0.21. We simply zoomed into F and asked what fraction of F is also in E.

Worked example — Netflix movies: A streaming service has 50,923,123 users. Of those, 1,234,231 watched Life is Beautiful (event E), and 2,500,000 watched Amélie (event F). Of users who watched Amélie, 225,000 also watched Life is Beautiful.

P(E|F) = P(E and F) / P(F) = (225,000 / 50,923,123) / (2,500,000 / 50,923,123) = 225,000 / 2,500,000 = 0.09

So the probability jumped from P(E) = 0.024 to P(E|F) = 0.09. Knowing someone watched Amélie nearly quadrupled our belief they watched Life is Beautiful. That is the power of conditioning.
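
The arithmetic above can be checked with a few lines of Python. This is a minimal sketch that uses only the counts stated in the example; the variable names are illustrative, not part of any real dataset or API.

```python
from fractions import Fraction

# Counts taken from the worked example above.
total_users = 50_923_123
watched_life_is_beautiful = 1_234_231   # event E
watched_amelie = 2_500_000              # event F
watched_both = 225_000                  # event E and F

p_e = Fraction(watched_life_is_beautiful, total_users)
p_f = Fraction(watched_amelie, total_users)
p_e_and_f = Fraction(watched_both, total_users)

# Definition: P(E|F) = P(E and F) / P(F). The total user count cancels.
p_e_given_f = p_e_and_f / p_f

print(float(p_e))          # roughly 0.024
print(float(p_e_given_f))  # 0.09
```

Using Fraction keeps the division exact, which makes it easy to see that the total user count drops out of the ratio.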

The conditional paradigm: When you condition consistently on the same event G, every rule of probability still holds. Axiom 1: 0 ≤ P(E|G) ≤ 1. Axiom 2: P(S|G) = 1. Axiom 3: for mutually exclusive E, F: P(E or F | G) = P(E|G) + P(F|G). Even the complement rule transfers: P(E^C|G) = 1 − P(E|G).

Another worked example: A bag contains 6 red and 4 blue marbles. You draw one marble and learn it is red (event F). What is the probability the marble weighs more than 5g (event E), if 2 of the 6 red marbles are heavy and 3 of the 4 blue marbles are heavy?

P(E|F) = P(E and F) / P(F) = (2/10) / (6/10) = 2/6 = 1/3 ≈ 0.333

Before learning the color, P(E) = 5/10 = 0.5. After learning it is red, our belief dropped to 1/3 because red marbles are less likely to be heavy than blue ones. The conditioning changed our belief.
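
One way to see the "filter" at work is to build the bag explicitly and count. The sketch below assumes each of the 10 marbles is equally likely to be drawn; the helper name prob and the (color, is_heavy) encoding are illustrative choices, not something the text prescribes.

```python
from fractions import Fraction

# Build the bag as (color, is_heavy) pairs, matching the counts in the example.
bag = (
    [("red", True)] * 2 + [("red", False)] * 4
    + [("blue", True)] * 3 + [("blue", False)] * 1
)

def prob(event, given=lambda m: True):
    """P(event | given) for one marble drawn uniformly at random from `bag`."""
    universe = [m for m in bag if given(m)]   # conditioning = filtering
    if not universe:
        raise ValueError("cannot condition on an event with probability 0")
    favorable = [m for m in universe if event(m)]
    return Fraction(len(favorable), len(universe))

is_heavy = lambda m: m[1]
is_red = lambda m: m[0] == "red"

print(prob(is_heavy))                # 1/2
print(prob(is_heavy, given=is_red))  # 1/3
```

Conditioning here is literally a list filter: the given predicate throws away every marble that is not red, and the ordinary counting rule is applied to what remains.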

Conditioning on multiple events: The definition extends naturally. If you consistently condition on G:

P(E | F, G) = P(E and F | G) / P(F | G)

Read P(E|F,G) as "the probability of E, given both F and G have occurred." Should P(E|F,G) equal P(E|F)? Sometimes yes, sometimes no — it depends on whether G gives additional information beyond F.

Check: In a sample space of 100 equally likely outcomes, F contains 25 outcomes and E ∩ F contains 5. What is P(E|F)?

Chapter 2: The Chain Rule

The definition of conditional probability can be rearranged to give us a way to calculate the probability of two events both happening. Just multiply both sides of P(E|F) = P(E and F)/P(F) by P(F):

P(E and F) = P(E|F) · P(F)

This is the chain rule (also called the multiplication rule or product rule). It says: the probability of E and F both occurring equals the probability that F occurs, times the probability that E occurs given F already happened.

There is nothing special about the order. Equivalently:

P(E and F) = P(F|E) · P(E)
Key insight: The chain rule is just the definition of conditional probability, rearranged. But this rearrangement is enormously useful — it lets you compute joint probabilities by breaking them into a sequence of conditionals, which are often easier to estimate.

Generalization to n events: The chain rule extends to any number of events by peeling off one event at a time:

P(E1 and E2 and … and En) = P(E1) · P(E2|E1) · P(E3|E1, E2) ··· P(En|E1, …, En−1)

Worked example — drawing cards: What is the probability of drawing two aces in a row from a standard 52-card deck (without replacement)?

Let A1 = first card is an ace, A2 = second card is an ace.

P(A1 and A2) = P(A1) · P(A2|A1) = (4/52) · (3/51) = 12/2652 ≈ 0.0045

After drawing one ace, only 3 remain among 51 cards. The chain rule naturally handles this dependence.

Why not just multiply P(A1) · P(A2)? Because the draws are dependent — the first draw changes the deck. If you naively multiplied (4/52)(4/52) = 0.0059, you'd overestimate. The chain rule gives the correct answer because it accounts for the dependence through the conditional P(A2|A1).
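
As a sanity check, the chain-rule answer can be compared against brute-force enumeration of every ordered pair of cards. This is a sketch under the usual assumption that all ordered pairs are equally likely; the deck encoding and the ACE label are arbitrary choices made for illustration.

```python
from fractions import Fraction
from itertools import permutations

# Chain rule: P(A1 and A2) = P(A1) * P(A2 | A1).
p_chain = Fraction(4, 52) * Fraction(3, 51)

# Brute-force check: enumerate every ordered pair of distinct cards.
# Only ranks matter here, so the deck is 13 ranks x 4 suits.
deck = [(rank, suit) for rank in range(13) for suit in range(4)]
ACE = 0  # arbitrary label for the ace rank
pairs = list(permutations(deck, 2))
both_aces = sum(1 for first, second in pairs if first[0] == ACE and second[0] == ACE)
p_enumerated = Fraction(both_aces, len(pairs))

print(p_chain, p_enumerated)   # both 1/221
print(float(p_chain))          # about 0.0045
```

The enumeration finds 12 favorable pairs out of 2652, the same 12/2652 that the chain rule produces without listing anything.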

Extended example — three-event chain: You roll a die, flip a coin, and draw a card. Let D = die shows 6, C = coin is heads, K = card is a king. If these are independent (different physical processes), the chain rule simplifies:

P(D, C, K) = P(D) · P(C|D) · P(K|D,C) = P(D) · P(C) · P(K) = (1/6)(1/2)(4/52) ≈ 0.0064

Independence made each conditional collapse to the marginal. Without independence, we would need to know the full conditionals — the chain rule handles both cases.

Chain rule = universal joint probability tool. Any joint probability can be decomposed using the chain rule. Independence simplifies the terms. Dependence keeps them as full conditionals. Either way, the chain rule works.
Check: Using the chain rule, what is P(A and B and C) expanded as a product of conditionals?

Chapter 3: Independence

Sometimes learning that F occurred tells you absolutely nothing about E. Two dice rolls, for example: the outcome of the first die gives you zero information about the second. When this happens, we say E and F are independent.

P(E|F) = P(E)     ⇔   E and F are independent

Equivalently, using the chain rule with independence substituted in:

P(E and F) = P(E) · P(F)     (alternative definition)

This second form is the one you will use most often in practice. If two events are independent, the probability of both happening is just the product of their individual probabilities. No conditionals needed.
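
The product form is easy to test by enumeration. The sketch below checks it on two fair dice; the specific events (first die is 6, sum is 7, sum is 8) are illustrative examples chosen here, not taken from the text.

```python
from fractions import Fraction
from itertools import product

# Sample space: all 36 equally likely outcomes of two fair six-sided dice.
outcomes = list(product(range(1, 7), repeat=2))

def prob(event):
    return Fraction(sum(1 for o in outcomes if event(o)), len(outcomes))

def independent(e, f):
    """Product test: P(E and F) == P(E) * P(F)."""
    return prob(lambda o: e(o) and f(o)) == prob(e) * prob(f)

first_is_six = lambda o: o[0] == 6
second_is_six = lambda o: o[1] == 6
sum_is_seven = lambda o: o[0] + o[1] == 7
sum_is_eight = lambda o: o[0] + o[1] == 8

print(independent(first_is_six, second_is_six))  # True
print(independent(first_is_six, sum_is_seven))   # True
print(independent(first_is_six, sum_is_eight))   # False
```

The middle case is worth a second look: "first die is 6" and "sum is 7" involve the same die, yet the product test passes. Independence is a statement about probabilities, not about whether two events share a physical cause.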

Independence is symmetric. If E is independent of F, then F is independent of E. Proof using Bayes' theorem (derived in Chapter 6), assuming P(E) > 0 and P(F) > 0: P(E|F) = P(F|E)·P(E)/P(F). If P(F|E) = P(F), then P(E|F) = P(F)·P(E)/P(F) = P(E). The "given" direction doesn't matter.

Generalized independence: Events E1, …, En are independent if for every subset of r events (r ≤ n):

P(E_{i_1}, E_{i_2}, …, E_{i_r}) = ∏_{j=1}^{r} P(E_{i_j})

Worked example — five coin flips: What is the probability of getting 5 heads on 5 fair coin flips, assuming independence?

P(H_1, H_2, H_3, H_4, H_5) = ∏_{i=1}^{5} P(H_i) = (1/2)^5 = 1/32 = 0.03125

How to establish independence: The gold standard is mathematical proof: show P(E|F) = P(E). In practice, with data-derived probabilities, exact equality is rare. We often assume independence when one event is unlikely to influence belief about the other. This is a modelling choice — potentially wrong, but useful, because it makes calculations tractable.

Independence vs. mutual exclusion: These are completely different properties. Mutually exclusive events cannot both happen — P(E and F) = 0. Independent events can both happen — P(E and F) = P(E)·P(F). In fact, if two events with positive probability are mutually exclusive, they are necessarily dependent: knowing one happened tells you the other definitely didn't.

Conditional independence: Events can be independent within a conditioned universe without being independent overall (and vice versa). If E1, E2, E3 are conditionally independent given F, then P(E1, E2, E3 | F) = P(E1|F) · P(E2|F) · P(E3|F). But this does not imply P(E1, E2, E3) = P(E1) · P(E2) · P(E3).

Warning: Conditioning can create or destroy independence. Two events that are independent can become dependent when conditioned on a third event. Two dependent events can become independent when conditioned. Always check independence in the specific universe you are working in.
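
A small enumeration makes the warning concrete. The example below is an assumption chosen for illustration (two fair coin flips, conditioned on the coins landing differently); it shows one direction only, independent events becoming dependent after conditioning.

```python
from fractions import Fraction
from itertools import product

outcomes = list(product("HT", repeat=2))  # HH, HT, TH, TT, equally likely

def prob(event, given=lambda o: True):
    universe = [o for o in outcomes if given(o)]
    return Fraction(sum(1 for o in universe if event(o)), len(universe))

first_heads = lambda o: o[0] == "H"
second_heads = lambda o: o[1] == "H"
both_heads = lambda o: first_heads(o) and second_heads(o)
coins_differ = lambda o: o[0] != o[1]   # the conditioning event G

# Unconditioned: the product test passes, so the events are independent.
print(prob(both_heads) == prob(first_heads) * prob(second_heads))  # True

# Conditioned on G: the product test fails, so they are dependent given G.
lhs = prob(both_heads, given=coins_differ)
rhs = prob(first_heads, given=coins_differ) * prob(second_heads, given=coins_differ)
print(lhs, rhs)  # 0 1/4
```

Inside the universe where the coins differ, learning that the first coin is heads tells you the second is tails with certainty, so the two events are as dependent as they can be.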

Independence and complements: If A and B are independent, then A and B^C are also independent. This follows from P(A and B^C) = P(A) − P(A and B) = P(A) − P(A)P(B) = P(A)(1 − P(B)) = P(A)P(B^C). The same argument shows A^C and B, and A^C and B^C, are also independent.

Check: If P(E) = 0.3 and P(F) = 0.5, and E and F are independent, what is P(E and F)?

Chapter 4: Probability of And

We now have two tools for computing P(E and F), and which one you reach for depends on whether the events are independent or dependent.

Relationship          Formula for P(E and F)
Independent           P(E) · P(F)
Dependent (general)   P(E|F) · P(F)   (chain rule)

The independent case is just the chain rule with P(E|F) replaced by P(E). The general case always works, independent or not.

Key insight: Independence makes "and" easy — just multiply. This is why independence is so prized in probability: it collapses conditional reasoning into simple multiplication, the same way mutual exclusion collapses "or" into simple addition.

Worked example — parallel network: Two independent routers connect computers A and B. Router 1 works with probability p1 = 0.95 and router 2 with p2 = 0.90. Information gets through if at least one router works. What is P(connection)?

Let Fi = router i fails. Then P(F1) = 0.05 and P(F2) = 0.10.

P(both fail) = P(F1) · P(F2) = 0.05 × 0.10 = 0.005
P(connection) = 1 − P(both fail) = 1 − 0.005 = 0.995

Generalizing to n independent routers: P(connection) = 1 − ∏_{i=1}^{n} (1 − p_i).
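
The general formula translates directly into code. This is a minimal sketch assuming the routers fail independently, as the example states; the function name p_connection is made up for illustration.

```python
import math

def p_connection(reliabilities):
    """P(at least one router works), assuming routers fail independently."""
    p_all_fail = math.prod(1 - p for p in reliabilities)
    return 1 - p_all_fail

print(p_connection([0.95, 0.90]))   # about 0.995
print(p_connection([0.90] * 3))     # about 0.999
```

Floating-point rounding may show a digit or two of noise in the last decimal places, so compare results with a tolerance rather than exact equality.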

[Interactive simulation: Parallel Network Reliability. Each router is independent, and the network works if at least one router functions. Controls: number of routers (default 3) and per-router reliability p (default 0.90). Adding routers in parallel drives the failure probability toward zero.]

Proof that P(E and F) = P(E) · P(F) for independent events: Start with P(E|F) = P(E and F)/P(F). Independence says P(E|F) = P(E). Substitute: P(E) = P(E and F)/P(F). Multiply both sides by P(F): P(E and F) = P(E) · P(F). Done.

Dependent example — drawing without replacement: A box has 3 red and 2 blue balls. You draw two without replacement. What is the probability both are red?

P(R1 and R2) = P(R1) · P(R2|R1) = (3/5) · (2/4) = 6/20 = 0.30

If the draws were independent (with replacement), we would get (3/5)(3/5) = 0.36 — a higher probability because the first draw doesn't deplete the pool. The chain rule correctly accounts for the shrinking population.
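
When an exact calculation feels uncertain, a quick simulation is a useful cross-check. The sketch below estimates P(both red) by repeated sampling; the trial count and seed are arbitrary choices, and the result is an estimate that will only be close to 0.30, not equal to it.

```python
import random

def estimate_both_red(trials=100_000, seed=0):
    """Monte Carlo estimate of P(both red) when drawing 2 of 5 balls without replacement."""
    rng = random.Random(seed)
    box = ["red"] * 3 + ["blue"] * 2
    hits = 0
    for _ in range(trials):
        first, second = rng.sample(box, 2)  # sample() draws without replacement
        if first == "red" and second == "red":
            hits += 1
    return hits / trials

print(estimate_both_red())  # should land near the exact value 0.30
```

Swapping rng.sample(box, 2) for two independent rng.choice(box) calls would model drawing with replacement, and the estimate should then drift toward 0.36 instead.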

Summary decision tree: Need P(E and F)? Ask: are E and F independent? If yes: P(E) · P(F). If no: P(E|F) · P(F) (chain rule). If you are unsure, use the chain rule — it always works.
Check: Three independent routers each work with probability 0.80. What is the probability the parallel network fails (all three fail)?

Chapter 5: Law of Total Probability

Sometimes you need P(E), but it is hard to compute directly. However, you can easily compute P(E) in different contexts — "given high risk," "given medium risk," "given low risk." The law of total probability (LOTP) lets you combine these conditional pieces into the overall probability.

The simplest version uses an event F and its complement F^C:

P(E) = P(E|F) · P(F) + P(E|F^C) · P(F^C)

Why does this work? Event E can be split into two mutually exclusive parts: the part inside F (that is, E ∩ F) and the part outside F (E ∩ F^C). Since these two parts cover all of E with no overlap:

P(E) = P(E and F) + P(E and F^C)

Apply the chain rule to each term and you get the LOTP.

The general version: If B1, B2, …, Bn are mutually exclusive and cover the entire sample space, then:
P(E) = ∑_{i=1}^{n} P(E|B_i) · P(B_i)
The Bi are called "background events." Each one is a different context in which you evaluate E.

Worked example — disease testing: A population splits into three risk groups: high-risk (B1, 10%), medium-risk (B2, 30%), low-risk (B3, 60%). The probability of testing positive for a disease in each group: P(+|B1) = 0.60, P(+|B2) = 0.20, P(+|B3) = 0.05. What is P(+)?

P(+) = 0.60 × 0.10 + 0.20 × 0.30 + 0.05 × 0.60 = 0.06 + 0.06 + 0.03 = 0.15

15% of the population tests positive overall. We computed this without knowing P(+) directly — we only needed the conditional probabilities within each group and the group sizes.
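
The weighted sum is short enough to write as a helper. This sketch assumes the background events form a partition, and it checks only that the priors sum to 1; the function name total_probability is illustrative.

```python
import math

def total_probability(conditionals, priors):
    """P(E) = sum_i P(E|B_i) * P(B_i) over a partition B_1, ..., B_n."""
    if len(conditionals) != len(priors):
        raise ValueError("need one conditional per background event")
    if not math.isclose(sum(priors), 1.0):
        raise ValueError("background events must be exhaustive: priors should sum to 1")
    return sum(c * p for c, p in zip(conditionals, priors))

# High-, medium-, and low-risk groups from the example.
p_positive = total_probability([0.60, 0.20, 0.05], [0.10, 0.30, 0.60])
print(round(p_positive, 4))  # 0.15
```

The check on the priors catches a common slip: forgetting a group, which silently drops part of the sample space and understates P(E).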

Notice how a large low-risk group with a small positive rate can still contribute substantially to the total: in the example above, the low-risk group contributes 0.03 of the 0.15.

[Interactive simulation: Law of Total Probability Visualizer. Three mutually exclusive groups; sliders set each group's conditional probability and proportion, and each group's contribution adds up to the total P(E). Default settings: P(E|B1) = 0.60, P(E|B2) = 0.20, P(E|B3) = 0.05, P(B1) = 0.10, P(B2) = 0.30, giving P(E) = 0.150.]

When to use LOTP: Whenever computing P(E) directly is hard, but computing P(E | some context) is easy. The "contexts" (background events) must be mutually exclusive and exhaustive. Classic examples: disease risk groups, weather conditions, different machine types in a factory.
Two requirements for background events: (1) Mutually exclusive: no outcome belongs to two groups simultaneously. (2) Exhaustive: every outcome belongs to at least one group. Together these mean the groups partition the sample space. F and F^C always satisfy both conditions, which is why the two-event LOTP always works.

LOTP and the chain rule: The LOTP is just the chain rule applied to a partition. Each term P(E|Bi)P(Bi) equals P(E and Bi) by the chain rule. Summing over the partition gives P(E) because the events E∩Bi are mutually exclusive and their union is E. The LOTP is not a new axiom — it is a consequence of the axioms and the chain rule working together.

Check: A factory has two machines. Machine A produces 60% of items with 2% defect rate. Machine B produces 40% with 5% defect rate. What is the overall defect rate?

Chapter 6: Bayes' Theorem

Here is the motivating situation: you observe some evidence E and want to know the probability of some underlying cause or state B. For example, a medical test comes back positive (evidence) and you want to know the probability the patient actually has the disease (cause). You know P(E|B) — the probability of a positive test given the disease — but you need P(B|E), the reverse direction.

Bayes' theorem flips the conditional. Start with the definition of conditional probability and apply the chain rule:

P(B|E) = P(E|B) · P(B) / P(E)

We can derive this in two lines. By definition, P(B|E) = P(B and E)/P(E). By the chain rule, P(B and E) = P(E|B) · P(B). Substitute and you have Bayes' theorem.

The vocabulary of Bayes:
P(B) = the prior — your belief about B before seeing evidence.
P(E|B) = the likelihood — how probable the evidence is if B is true.
P(B|E) = the posterior — your updated belief after seeing evidence.
P(E) = the normalizing constant — ensures the posterior sums to 1.

When P(E) is unknown, expand it using the law of total probability. This gives Bayes' theorem with LOTP:

P(B|E) = P(E|B) · P(B) / [ P(E|B) · P(B) + P(E|B^C) · P(B^C) ]

Worked example (mammogram test): Breast cancer has a natural prevalence of P(I) = 0.08. The mammogram returns positive 95% of the time for patients with cancer: P(+|I) = 0.95. It also returns positive 7% of the time for patients without cancer (false positive): P(+|I^C) = 0.07. What is P(I|+)?

P(I|+) = (0.95 × 0.08) / (0.95 × 0.08 + 0.07 × 0.92) = 0.076 / (0.076 + 0.0644) = 0.076 / 0.1404 ≈ 0.541

Only 54%! Even with a 95%-sensitive test, the posterior is barely above a coin flip because the disease is rare (low prior) and the false-positive rate applies to the large healthy population.
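
The calculation can be wrapped in a small function so you can try other numbers. This is a sketch for the two-hypothesis case only (condition present or absent); the function and parameter names are illustrative.

```python
def posterior(prior, sensitivity, false_positive_rate):
    """P(B|E) via Bayes' theorem, with P(E) expanded by the law of total probability."""
    true_pos = sensitivity * prior                   # P(E|B) * P(B)
    false_pos = false_positive_rate * (1 - prior)    # P(E|B^C) * P(B^C)
    evidence = true_pos + false_pos                  # P(E)
    if evidence == 0:
        raise ValueError("P(E) = 0: cannot condition on this evidence")
    return true_pos / evidence

print(round(posterior(0.08, 0.95, 0.07), 3))  # 0.541
```

Holding the test fixed and lowering the prior is a quick way to see how strongly the base rate controls the answer.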

Why Bayes surprises people: The base rate matters enormously. When a disease is rare, even a small false-positive rate generates many false alarms relative to true positives. Bayes' theorem quantifies this trade-off precisely. It is the reason doctors order confirmation tests.

Natural frequency intuition: Imagine 1000 people. About 80 have cancer. Of those, 0.95 × 80 = 76 test positive. Of the 920 healthy people, 0.07 × 920 ≈ 64 also test positive. Total positives: 76 + 64 = 140. Fraction who actually have cancer: 76/140 ≈ 0.543. This matches Bayes' theorem up to rounding (0.07 × 920 is 64.4, not exactly 64), and many people find it more intuitive than the formula.

Term       Name                        In our example
P(I)       Prior                       0.08 (base rate of cancer)
P(+|I)     Likelihood (sensitivity)    0.95
P(+|I^C)   False positive rate         0.07
P(I|+)     Posterior                   0.541
P(+)       Normalizing constant        0.1404

Bayes with general LOTP: When the belief B can take more than two values (B1, …, Bn), use the general form. For example, tracking a phone across n locations: P(Bi|E) = P(E|Bi)P(Bi) / ∑j P(E|Bj)P(Bj). Each location's posterior is proportional to its prior times its likelihood.
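
The general form is "multiply prior by likelihood, then normalize." The sketch below implements that for any number of hypotheses. The three-location numbers are hypothetical values invented for illustration; they do not come from the text.

```python
def posteriors(priors, likelihoods):
    """Return P(B_i | E) for each i, given P(B_i) and P(E | B_i)."""
    unnormalized = [p * l for p, l in zip(priors, likelihoods)]
    evidence = sum(unnormalized)  # P(E), by the law of total probability
    if evidence == 0:
        raise ValueError("evidence has probability 0 under every hypothesis")
    return [u / evidence for u in unnormalized]

# Hypothetical numbers for three locations (not from the text):
priors = [0.5, 0.3, 0.2]
likelihoods = [0.1, 0.4, 0.7]   # P(observed signal | location i)
result = posteriors(priors, likelihoods)
print([round(r, 3) for r in result])
```

In this made-up example the location with the smallest prior ends up with the largest posterior, because its likelihood is high enough to outweigh the prior.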

Check: In the mammogram example, what is the main reason P(I|+) is so much lower than the test's 95% sensitivity?

Chapter 7: Showcase — Interactive Bayes Calculator

This is the payoff simulation. Set the disease prevalence, the test's sensitivity (true positive rate), and the false positive rate. The calculator applies Bayes' theorem in real time and shows a population of 1000 dots to build the natural-frequency intuition.

[Interactive simulation: Bayes' Theorem: Disease Testing. The canvas shows 1000 people: blue = has disease, pink = healthy, dark shading = tested positive. The posterior P(disease | +) is computed live. Default settings: prevalence P(D) = 0.08, sensitivity P(+|D) = 0.95, false positive rate P(+|D^C) = 0.07, giving P(D|+) = 0.541.]

Things to try:
• Set prevalence to 1% and watch P(D|+) plummet, even with 95% sensitivity.
• Drag false positive rate down to 1% — the posterior jumps dramatically.
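• Set prevalence to 50% (coin flip prior): the prior odds are 1, so the posterior depends only on the test's two rates. With the default sensitivity and false positive rate it is 0.95 / (0.95 + 0.07) ≈ 0.931.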
• Notice: the total number of dark dots (all positives) is the denominator P(+), and the dark blue dots are the numerator P(+ and D).

The ratio trick: When P(E) is unknown, you can compute the posterior odds P(B|E)/P(B^C|E). The P(E) terms cancel:

P(B|E) / P(B^C|E) = [ P(E|B) · P(B) ] / [ P(E|B^C) · P(B^C) ]

This ratio tells you how many times more likely B is than not-B, given evidence E. To recover the actual probability: P(B|E) = ratio / (1 + ratio).

Concrete ratio example: Using the mammogram numbers (prevalence 8%, sensitivity 95%, FPR 7%):

odds = (0.95 × 0.08) / (0.07 × 0.92) = 0.076 / 0.0644 = 1.180

So the disease is about 1.18 times more likely than no disease given a positive test. Converting: P(D|+) = 1.180 / (1 + 1.180) = 0.541. Same answer, no need to compute P(+) directly.

Sequential Bayes: Suppose the patient takes a second test whose result is independent of the first given the patient's true disease status, and it also comes back positive. Now the posterior from the first test becomes the prior for the second. The odds multiply by the same likelihood ratio: new odds = 1.180 × (0.95/0.07) ≈ 1.180 × 13.57 ≈ 16.0. So P(D|two positives) ≈ 16.0/17.0 ≈ 0.941. Two tests dramatically increase confidence. This is why doctors order confirmation tests.
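
The odds form makes repeated updating a one-line multiplication. The sketch below assumes each positive result is conditionally independent of the others given the true disease status, and that every test has the same sensitivity and false positive rate; the function names are illustrative.

```python
def update_odds(prior_odds, sensitivity, false_positive_rate):
    """Multiply the odds by the likelihood ratio of one positive result."""
    return prior_odds * (sensitivity / false_positive_rate)

def odds_to_prob(odds):
    return odds / (1 + odds)

prevalence, sensitivity, fpr = 0.08, 0.95, 0.07
odds = prevalence / (1 - prevalence)          # prior odds, P(D) / P(D^C)

odds = update_odds(odds, sensitivity, fpr)    # after the first positive
print(round(odds, 2), round(odds_to_prob(odds), 3))   # 1.18 0.541

odds = update_odds(odds, sensitivity, fpr)    # after the second positive
print(round(odds, 2), round(odds_to_prob(odds), 3))
```

The second line should show a probability of about 0.941. Its odds can differ from a hand calculation in the last digit, because working by hand usually rounds the intermediate values.
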
Check: With prevalence 1%, sensitivity 99%, and false positive rate 5%, roughly what is P(disease | positive)?

Chapter 8: Worked Problems

Problem 1: Two children. A family has two children. You learn that at least one child is a girl. What is the probability both are girls? (Assume each child is equally likely to be a boy or girl, independently.)

Sample space: {BB, BG, GB, GG}. Event F = at least one girl = {BG, GB, GG}. Event E = both girls = {GG}.

P(E|F) = P(E and F) / P(F) = (1/4) / (3/4) = 1/3

Not 1/2! Conditioning on "at least one girl" removes BB but leaves three equally likely outcomes, only one of which is GG.

Common mistake: People often reason "one is a girl, so the other has a 50/50 chance, giving 1/2." This is wrong because "at least one girl" does not identify which child is the girl. The conditional probability formula gives the correct 1/3.
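
Because the sample space has only four outcomes, the answer can be confirmed by listing them. This sketch assumes the four birth orders are equally likely, as the problem states; the names are illustrative.

```python
from fractions import Fraction
from itertools import product

families = list(product("BG", repeat=2))  # (older, younger), 4 equally likely outcomes

def prob(event, given):
    universe = [f for f in families if given(f)]
    return Fraction(sum(1 for f in universe if event(f)), len(universe))

both_girls = lambda f: f == ("G", "G")
at_least_one_girl = lambda f: "G" in f

print(prob(both_girls, given=at_least_one_girl))  # 1/3
```

Changing the given predicate is all it takes to explore other versions of the puzzle, which is a good way to see how sensitive the answer is to exactly what was learned.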

Problem 2: Chain rule with three events. In a bag are 5 red and 3 blue marbles. You draw three without replacement. What is P(all three red)?

P(R1, R2, R3) = P(R1) · P(R2|R1) · P(R3|R1,R2) = (5/8) · (4/7) · (3/6) = 60/336 ≈ 0.179

Problem 3: LOTP in the wild. 70% of emails are spam. A spam filter flags 90% of spam and 5% of legitimate email. What fraction of all email gets flagged?

P(flagged) = P(flagged|spam) · P(spam) + P(flagged|legit) · P(legit) = 0.90 × 0.70 + 0.05 × 0.30 = 0.645

64.5% of all email is flagged. Follow-up using Bayes: of flagged emails, what fraction are actually spam?

P(spam|flagged) = (0.90 × 0.70) / 0.645 = 0.630 / 0.645 ≈ 0.977

97.7% of flagged emails are truly spam. The filter is quite precise because spam is so prevalent.

Problem 4: Independence and complements. If A and B are independent, prove A and B^C are independent.

We need to show P(A and B^C) = P(A) · P(B^C).

P(A and B^C) = P(A) − P(A and B)     (LOTP: A = (A∩B) ∪ (A∩B^C))
= P(A) − P(A)·P(B) = P(A)[1 − P(B)] = P(A) · P(B^C)   ■

Problem 5: Full Bayes workflow. A factory has two machines. Machine A (60% of production) has a 3% defect rate. Machine B (40%) has a 7% defect rate. A randomly selected item is defective. What is the probability it came from Machine A?

Let D = defective, A = came from Machine A.

P(D) = P(D|A)P(A) + P(D|B)P(B) = 0.03 × 0.60 + 0.07 × 0.40 = 0.018 + 0.028 = 0.046
P(A|D) = P(D|A)P(A) / P(D) = 0.018 / 0.046 ≈ 0.391

Even though Machine A produces 60% of all items, it only accounts for 39% of defects because its defect rate is much lower. Machine B, despite making fewer items, produces the majority of defects.

Pattern recognition: This problem used two tools in sequence: LOTP to find P(D), then Bayes' theorem to flip the conditional. This "LOTP in the denominator" pattern appears in nearly every applied Bayes problem.

Problem 6: Conditional probability chain. P(A) = 0.4, P(B|A) = 0.7, P(C|A,B) = 0.9. Find P(A and B and C).

P(A, B, C) = P(A) · P(B|A) · P(C|A,B) = 0.4 × 0.7 × 0.9 = 0.252
Check: In Problem 1, if you instead learn that the older child is a girl, what is P(both girls)?

Chapter 9: Connections

Conditional probability is not an isolated topic — it is the connective tissue of the entire course and the foundation of machine learning.

Where it leads             How conditional probability appears
Random variables (Ch 4)    Conditional distributions P(X=x | Y=y) extend everything here to numerical quantities
Naive Bayes classifiers    Bayes' theorem + conditional independence assumption → simple but powerful ML classifier
Bayesian networks          Chain rule decomposes joint distributions into products of conditionals along a directed graph
Hidden Markov models       LOTP marginalizes over hidden states; Bayes updates beliefs at each time step
Reinforcement learning     Conditional independence (Markov property) makes sequential decision-making tractable
Medical diagnostics        Bayes' theorem converts test accuracy into patient-level disease probability

The Bayesian worldview: Bayes' theorem is more than a formula. It represents a philosophy: start with a prior belief, observe evidence, update your belief. This cycle (prior × likelihood → posterior) is the engine of Bayesian statistics, Bayesian deep learning, and a large fraction of modern AI. Every time a language model predicts the next token given the preceding ones, it is computing a conditional probability, and that computation is sometimes interpreted as approximate Bayesian inference.

What we built:

• P(E|F) definition & the conditional paradigm
• Chain rule for joint probabilities
• Independence & its implications
• Law of total probability
• Bayes' theorem (classic + LOTP form)

What comes next:

Chapter 4: Random variables — numeric outcomes
• Expectation, variance, common distributions
• Conditional distributions & independence for random variables
• The central limit theorem

Master the five tools of this chapter and you have the complete toolkit for discrete probability. Everything ahead — random variables, distributions, expectation — builds directly on these foundations.

The five tools, one more time: (1) P(E|F) = P(E∩F)/P(F) — the definition. (2) P(E∩F) = P(E|F)P(F) — the chain rule. (3) P(E|F) = P(E) — independence. (4) P(E) = ∑ P(E|Bi)P(Bi) — LOTP. (5) P(B|E) = P(E|B)P(B)/P(E) — Bayes. These five identities, plus the three axioms from Chapter 2, are the complete foundation. Every computation in discrete probability reduces to applying these rules in the right order.
"The theory of probabilities is at bottom nothing
but common sense reduced to calculus."
— Pierre-Simon Laplace
Check: Which tool from this chapter would you use to convert P(test result | disease) into P(disease | test result)?