Updating beliefs with new evidence — from conditional probability to Bayes' theorem.
Imagine a streaming service like Netflix wants to recommend movies. The raw probability that any user watches Life is Beautiful is about 2%. Not very useful. But if you know the user just watched Amélie, the probability jumps to 9%. Knowing one thing changed our belief about another. That is conditional probability.
In English, a conditional probability answers: "what is the chance of event E happening, given that I have already observed some other event F?" It is the single most important idea in applied probability because it lets us update our beliefs in the face of new evidence.
When you condition on an event F, you enter the universe where F has taken place. Your sample space shrinks to just the outcomes consistent with F. Inside that smaller universe, all the rules of probability still hold — axioms, complements, inclusion-exclusion, everything.
This chapter builds five tightly linked tools. Conditional probability defines the filter. The chain rule lets us compute joint probabilities by chaining conditionals. Independence tells us when the filter changes nothing. The law of total probability lets us decompose hard problems into easy conditional pieces. And Bayes' theorem flips a conditional around — converting P(evidence | cause) into P(cause | evidence).
| Tool | What it does |
|---|---|
| P(E\|F) | Probability of E after observing F |
| Chain rule | P(E and F) = P(E\|F) · P(F) |
| Independence | P(E\|F) = P(E): F tells you nothing about E |
| Law of total probability (LOTP) | Decompose P(E) using background events |
| Bayes' theorem | Flip the conditional: P(F\|E) from P(E\|F) |
These five tools, combined with the axioms from Chapter 2, are enough to solve virtually every probability problem in this course. They are also the mathematical foundation of machine learning, medical diagnostics, spam filters, and recommendation engines.
[Interactive figure: conditioning shrinks the sample space. 50 equally likely hexagonal outcomes; blue = event E, orange = event F, dark = E ∩ F. Toggling F on restricts the outcomes to F, so P(E) becomes P(E|F) = |E ∩ F| / |F|.]
Here is the setup: you have a sample space S with events E and F. You learn that F has occurred. What is the probability of E now?
Once F is known, your new sample space is F itself. The only outcomes of E that still matter are the ones also in F — that is, the intersection E ∩ F. So you take the fraction of the original probability that lands in both E and F, relative to the total probability of F.
Definition: P(E|F) = P(E and F) / P(F), where P(F) > 0. If P(F) = 0, the conditional is undefined: you cannot condition on an event that has zero probability.
Worked example — Netflix movies: A streaming service has 50,923,123 users. Of those, 1,234,231 watched Life is Beautiful (event E), and 2,500,000 watched Amélie (event F). Of users who watched Amélie, 225,000 also watched Life is Beautiful.
So the probability jumped from P(E) = 0.024 to P(E|F) = 0.09. Knowing someone watched Amélie nearly quadrupled our belief they watched Life is Beautiful. That is the power of conditioning.
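The arithmetic behind these two numbers can be checked with a short Python sketch. The helper name `conditional_from_counts` is an illustrative assumption, not something from the course; the counts are the ones given in the example.

```python
from fractions import Fraction

def conditional_from_counts(n_both: int, n_given: int) -> Fraction:
    """P(E|F) as (# outcomes in both E and F) / (# outcomes in F)."""
    if n_given <= 0:
        raise ValueError("cannot condition on an event with no outcomes")
    return Fraction(n_both, n_given)

total_users = 50_923_123
watched_lib = 1_234_231     # event E: watched Life is Beautiful
watched_amelie = 2_500_000  # event F: watched Amélie
watched_both = 225_000      # E and F

p_e = Fraction(watched_lib, total_users)
p_e_given_f = conditional_from_counts(watched_both, watched_amelie)

print(f"P(E)   = {float(p_e):.3f}")
print(f"P(E|F) = {float(p_e_given_f):.3f}")
print(f"ratio  = {float(p_e_given_f / p_e):.2f}")
```

Exact fractions are used so that the only rounding happens when the results are printed.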
Another worked example: A bag contains 6 red and 4 blue marbles. You draw one marble and learn it is red (event F). What is the probability the marble weighs more than 5g (event E), if 2 of the 6 red marbles are heavy and 3 of the 4 blue marbles are heavy? Conditioning on red, P(E|F) = P(E and F) / P(F) = (2/10) / (6/10) = 1/3.
Before learning the color, P(E) = 5/10 = 0.5. After learning it is red, our belief dropped to 1/3 because red marbles are less likely to be heavy than blue ones. The conditioning changed our belief.
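Conditioning on multiple events: The definition extends naturally. If every term is already conditioned on an event G, the same rule holds inside that universe: P(E|F,G) = P(E and F | G) / P(F|G), provided P(F|G) > 0.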
Read P(E|F,G) as "the probability of E, given both F and G have occurred." Should P(E|F,G) equal P(E|F)? Sometimes yes, sometimes no — it depends on whether G gives additional information beyond F.
Rearranging the definition of conditional probability gives a way to calculate the probability that two events both happen. Multiply both sides of P(E|F) = P(E and F)/P(F) by P(F): P(E and F) = P(E|F) · P(F).
This is the chain rule (also called the multiplication rule or product rule). It says: the probability of E and F both occurring equals the probability that F occurs, times the probability that E occurs given F already happened.
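There is nothing special about the order. Equivalently: P(E and F) = P(F|E) · P(E).
Generalization to n events: The chain rule extends to any number of events by peeling off one event at a time: P(E1 and E2 and … and En) = P(E1) · P(E2|E1) · P(E3|E1,E2) · … · P(En|E1,…,En−1).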
Worked example — drawing cards: What is the probability of drawing two aces in a row from a standard 52-card deck (without replacement)?
Let A1 = first card is an ace, A2 = second card is an ace. By the chain rule, P(A1 and A2) = P(A1) · P(A2|A1) = (4/52) · (3/51) = 1/221 ≈ 0.0045.
After drawing one ace, only 3 remain among 51 cards. The chain rule naturally handles this dependence.
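The same peeling-off pattern applies to any run of draws without replacement. The sketch below is one way to express it; the function name and parameters are illustrative assumptions.

```python
from fractions import Fraction

def all_hits_without_replacement(hits: int, total: int, draws: int) -> Fraction:
    """P(the first `draws` items drawn are all hits), without replacement.

    Chain rule: P(H1) * P(H2|H1) * ..., where each factor is
    (hits remaining) / (items remaining).
    """
    if not (0 <= hits <= total and 0 <= draws <= total):
        raise ValueError("need 0 <= hits <= total and 0 <= draws <= total")
    prob = Fraction(1)
    for k in range(draws):
        # After k successful draws, hits - k hits remain among total - k items.
        prob *= Fraction(max(hits - k, 0), total - k)
    return prob

two_aces = all_hits_without_replacement(hits=4, total=52, draws=2)
print(two_aces, float(two_aces))
```

Each loop iteration is one conditional factor, so the dependence between draws is handled by updating the counts rather than by any special case.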
Extended example (three-event chain): You roll a die, flip a coin, and draw a card. Let D = die shows 6, C = coin is heads, K = card is a king. If these are independent (different physical processes), the chain rule simplifies: P(D and C and K) = P(D) · P(C|D) · P(K|D,C) = P(D) · P(C) · P(K) = (1/6) · (1/2) · (4/52) = 1/156 ≈ 0.0064.
Independence made each conditional collapse to the marginal. Without independence, we would need to know the full conditionals — the chain rule handles both cases.
Sometimes learning that F occurred tells you absolutely nothing about E. Two dice rolls, for example: the outcome of the first die gives you zero information about the second. When this happens, we say E and F are independent: P(E|F) = P(E).
Equivalently, using the chain rule with independence substituted in: P(E and F) = P(E) · P(F).
This second form is the one you will use most often in practice. If two events are independent, the probability of both happening is just the product of their individual probabilities. No conditionals needed.
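Generalized independence: Events E1, …, En are independent if for every subset of r events (2 ≤ r ≤ n), the probability that all r occur equals the product of their individual probabilities: P(E1' and E2' and … and Er') = P(E1') · P(E2') · … · P(Er'), where E1', …, Er' denote the chosen events.
Worked example (five coin flips): What is the probability of getting 5 heads on 5 fair coin flips, assuming independence? Each flip is heads with probability 1/2, so P(5 heads) = (1/2)^5 = 1/32 ≈ 0.031.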
How to establish independence: The gold standard is mathematical proof: show P(E|F) = P(E). In practice, with data-derived probabilities, exact equality is rare. We often assume independence when one event is unlikely to influence belief about the other. This is a modelling choice — potentially wrong, but useful, because it makes calculations tractable.
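For a small, fully enumerable sample space, the check can be done exactly. The following sketch uses two fair dice; the event names and the `independent` helper are illustrative assumptions. It tests the product form P(E and F) = P(E) · P(F), which avoids dividing by P(F).

```python
from fractions import Fraction
from itertools import product

# Sample space: all 36 equally likely (first die, second die) pairs.
OUTCOMES = set(product(range(1, 7), repeat=2))

def prob(event: set) -> Fraction:
    """Probability of an event, given as a set of outcomes."""
    return Fraction(len(event), len(OUTCOMES))

def independent(e: set, f: set) -> bool:
    """True when P(E and F) equals P(E) * P(F) exactly."""
    return prob(e & f) == prob(e) * prob(f)

first_is_3 = {o for o in OUTCOMES if o[0] == 3}
second_is_5 = {o for o in OUTCOMES if o[1] == 5}
sum_is_7 = {o for o in OUTCOMES if sum(o) == 7}
sum_is_8 = {o for o in OUTCOMES if sum(o) == 8}

print(independent(first_is_3, second_is_5))  # separate dice
print(independent(first_is_3, sum_is_7))     # independent, perhaps surprisingly
print(independent(first_is_3, sum_is_8))     # dependent
```

Exact fractions matter here: with floating-point or data-derived estimates, a tolerance would be needed in place of `==`.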
Conditional independence: Events can be independent within a conditioned universe without being independent overall (and vice versa). If E1, E2, E3 are conditionally independent given F, then P(E1, E2, E3 | F) = P(E1|F) · P(E2|F) · P(E3|F). But this does not imply P(E1, E2, E3) = P(E1) · P(E2) · P(E3).
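A small numeric illustration may help. The setup is hypothetical and not from the course: pick one of two coins at random (a fair coin, or a biased coin that lands heads 9/10 of the time), then flip it twice. Given the coin, the two flips are independent. Without knowing the coin, they are not, because the first flip carries information about which coin is in play. The unconditional probabilities below are weighted averages over the two coins, which is the law of total probability covered later in this chapter.

```python
from fractions import Fraction

# Hypothetical setup: choose one coin at random, then flip it twice.
coins = {
    "fair":   {"prior": Fraction(1, 2), "p_heads": Fraction(1, 2)},
    "biased": {"prior": Fraction(1, 2), "p_heads": Fraction(9, 10)},
}

# Given the coin, the flips are independent: P(H1 and H2 | coin) = p * p.
# Averaging over the coins gives the unconditional probabilities.
p_h1 = sum(c["prior"] * c["p_heads"] for c in coins.values())
p_h1_and_h2 = sum(c["prior"] * c["p_heads"] ** 2 for c in coins.values())

print("P(H1)          =", p_h1)
print("P(H1 and H2)   =", p_h1_and_h2)
print("P(H1) * P(H2)  =", p_h1 * p_h1)
```

By symmetry P(H2) = P(H1), so comparing the last two printed values shows whether the flips are independent overall.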
Independence and complements: If A and B are independent, then A and B^C are also independent. This follows from P(A and B^C) = P(A) − P(A and B) = P(A) − P(A)P(B) = P(A)(1 − P(B)) = P(A)P(B^C). The same argument shows that A^C and B, and A^C and B^C, are also independent.
We now have two tools for computing P(E and F), and which one you reach for depends on whether the events are independent or dependent.
| Relationship | Formula for P(E and F) |
|---|---|
| Independent | P(E) · P(F) |
| Dependent (general) | P(E\|F) · P(F) (chain rule) |
The independent case is just the chain rule with P(E|F) replaced by P(E). The general case always works, independent or not.
Worked example — parallel network: Two independent routers connect computers A and B. Router 1 works with probability p1 = 0.95 and router 2 with p2 = 0.90. Information gets through if at least one router works. What is P(connection)?
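Let Fi = router i fails. Then P(F1) = 0.05 and P(F2) = 0.10. The connection is lost only if both routers fail, and by independence P(F1 and F2) = P(F1) · P(F2) = 0.05 · 0.10 = 0.005. So P(connection) = 1 − 0.005 = 0.995.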
Generalizing to n independent routers: P(connection) = 1 − ∏i=1..n (1 − pi).
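The general formula is short enough to write as a function. This is a minimal sketch assuming independent routers; the name `p_connection` is an illustrative choice.

```python
from math import prod

def p_connection(reliabilities) -> float:
    """P(at least one router works), assuming the routers are independent.

    Computed as 1 - P(every router fails) = 1 - prod(1 - p_i).
    """
    reliabilities = list(reliabilities)
    for p in reliabilities:
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"reliability must be in [0, 1], got {p}")
    return 1.0 - prod(1.0 - p for p in reliabilities)

print(round(p_connection([0.95, 0.90]), 6))  # the two-router example
print(round(p_connection([0.90] * 3), 6))    # three routers at 0.90 each
```

Working through the complement keeps the calculation to a single product; enumerating every way that at least one router could work would need 2^n − 1 terms.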
[Interactive simulation: parallel routers. Each router is independent, and the network works if at least one router functions. Raising the router count or the per-router reliability drives the failure probability toward zero.]
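Dependent example (drawing without replacement): A box has 3 red and 2 blue balls. You draw two without replacement. What is the probability both are red? Let R1 and R2 be the events that the first and second ball are red. By the chain rule, P(R1 and R2) = P(R1) · P(R2|R1) = (3/5) · (2/4) = 3/10 = 0.3.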
If the draws were independent (with replacement), we would get (3/5)(3/5) = 0.36 — a higher probability because the first draw doesn't deplete the pool. The chain rule correctly accounts for the shrinking population.
Sometimes you need P(E), but it is hard to compute directly. However, you can easily compute P(E) in different contexts — "given high risk," "given medium risk," "given low risk." The law of total probability (LOTP) lets you combine these conditional pieces into the overall probability.
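The simplest version uses an event F and its complement F^C: P(E) = P(E|F) · P(F) + P(E|F^C) · P(F^C).
Why does this work? Event E can be split into two mutually exclusive parts: the part inside F (that is, E ∩ F) and the part outside F (E ∩ F^C). Since these two parts cover all of E with no overlap: P(E) = P(E and F) + P(E and F^C).
Apply the chain rule to each term and you get the LOTP. The same argument works for any partition B1, …, Bn of the sample space (mutually exclusive events that together cover every outcome): P(E) = ∑i P(E|Bi) · P(Bi).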
Worked example — disease testing: A population splits into three risk groups: high-risk (B1, 10%), medium-risk (B2, 30%), low-risk (B3, 60%). The probability of testing positive for a disease in each group: P(+|B1) = 0.60, P(+|B2) = 0.20, P(+|B3) = 0.05. What is P(+)?
By the LOTP, P(+) = (0.60)(0.10) + (0.20)(0.30) + (0.05)(0.60) = 0.06 + 0.06 + 0.03 = 0.15. So 15% of the population tests positive overall. We computed this without knowing P(+) directly: we only needed the conditional probabilities within each group and the group sizes.
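As a rough sketch, the partition form of the LOTP can be written as a weighted sum. The function name and the validation are illustrative assumptions; the numbers are those of the disease-testing example.

```python
from math import isclose

def total_probability(priors, conditionals) -> float:
    """P(E) = sum over i of P(E|B_i) * P(B_i), for a partition B_1..B_n."""
    if len(priors) != len(conditionals):
        raise ValueError("need one conditional probability per group")
    if not isclose(sum(priors), 1.0, abs_tol=1e-9):
        raise ValueError("group probabilities must sum to 1 (a partition)")
    return sum(c * p for p, c in zip(priors, conditionals))

group_sizes = [0.10, 0.30, 0.60]        # P(B1), P(B2), P(B3)
p_pos_given_group = [0.60, 0.20, 0.05]  # P(+|B1), P(+|B2), P(+|B3)

p_pos = total_probability(group_sizes, p_pos_given_group)
contributions = [c * p for p, c in zip(group_sizes, p_pos_given_group)]

print(f"P(+) = {p_pos:.3f}")
print("per-group contributions:", [round(x, 3) for x in contributions])
```

The check that the group probabilities sum to 1 guards against the most common mistake, which is applying the formula to groups that do not cover the whole population.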
[Interactive figure: three mutually exclusive groups, with sliders for the group proportions and the conditional probabilities. Each group contributes P(E|Bi) · P(Bi) to the total P(E). Readout: P(E) = 0.150. A large low-risk group with a small conditional probability can still contribute substantially to the total.]
LOTP and the chain rule: The LOTP is just the chain rule applied to a partition. Each term P(E|Bi)P(Bi) equals P(E and Bi) by the chain rule. Summing over the partition gives P(E) because the events E∩Bi are mutually exclusive and their union is E. The LOTP is not a new axiom — it is a consequence of the axioms and the chain rule working together.
Here is the motivating situation: you observe some evidence E and want to know the probability of some underlying cause or state B. For example, a medical test comes back positive (evidence) and you want to know the probability the patient actually has the disease (cause). You know P(E|B) — the probability of a positive test given the disease — but you need P(B|E), the reverse direction.
Bayes' theorem flips the conditional. Start with the definition of conditional probability and apply the chain rule: P(B|E) = P(E|B) · P(B) / P(E).
We can derive this in two lines. By definition, P(B|E) = P(B and E)/P(E). By the chain rule, P(B and E) = P(E|B) · P(B). Substitute and you have Bayes' theorem.
When P(E) is unknown, expand it using the law of total probability. This gives Bayes' theorem with LOTP: P(B|E) = P(E|B) · P(B) / [P(E|B) · P(B) + P(E|B^C) · P(B^C)].
Worked example (mammogram test): Breast cancer has a natural prevalence of P(I) = 0.08. The mammogram returns positive 95% of the time for patients with cancer: P(+|I) = 0.95. It also returns positive 7% of the time for patients without cancer (false positive): P(+|I^C) = 0.07. What is P(I|+)?
P(I|+) = (0.95)(0.08) / [(0.95)(0.08) + (0.07)(0.92)] = 0.076 / 0.1404 ≈ 0.541. Only 54%! Even with a 95%-sensitive test, the posterior is barely above a coin flip because the disease is rare (low prior) and the false-positive rate applies to the large healthy population.
Natural frequency intuition: Imagine 1000 people. About 80 have cancer. Of those, 0.95 × 80 = 76 test positive. Of the 920 healthy people, 0.07 × 920 ≈ 64 also test positive. Total positives: 76 + 64 = 140. Fraction who actually have cancer: 76/140 ≈ 0.543. This agrees with Bayes' theorem (0.541) up to the rounding of 64.4 people to 64, and many people find it more intuitive than the formula.
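Both routes to the answer can be reproduced in a few lines. This is a sketch under the example's numbers; the function name `posterior` and its parameter names are illustrative assumptions.

```python
def posterior(prior: float, sensitivity: float, false_positive_rate: float) -> float:
    """P(condition | positive test): Bayes' theorem with the LOTP denominator."""
    true_pos = sensitivity * prior                   # P(+|I) * P(I)
    false_pos = false_positive_rate * (1.0 - prior)  # P(+|I^C) * P(I^C)
    evidence = true_pos + false_pos                  # P(+)
    if evidence == 0.0:
        raise ValueError("a positive test has probability zero")
    return true_pos / evidence

p = posterior(prior=0.08, sensitivity=0.95, false_positive_rate=0.07)
print(f"Bayes' theorem:      {p:.3f}")

# Natural-frequency version: expected counts in a population of 1000.
population = 1000
sick_pos = 0.95 * 0.08 * population           # 76 expected true positives
healthy_pos = 0.07 * (1 - 0.08) * population  # 64.4 expected false positives
print(f"Natural frequencies: {sick_pos / (sick_pos + healthy_pos):.3f}")
```

Without rounding the head counts, the two calculations are the same fraction scaled by the population size, so they agree exactly.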
| Term | Name | In our example |
|---|---|---|
| P(I) | Prior | 0.08 (base rate of cancer) |
| P(+\|I) | Likelihood (sensitivity) | 0.95 |
| P(+\|I^C) | False positive rate | 0.07 |
| P(I\|+) | Posterior | 0.541 |
| P(+) | Normalizing constant | 0.1404 |
Bayes with general LOTP: When the belief B can take more than two values (B1, …, Bn), use the general form. For example, tracking a phone across n locations: P(Bi|E) = P(E|Bi)P(Bi) / ∑j P(E|Bj)P(Bj). Each location's posterior is proportional to its prior times its likelihood.
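The general form amounts to "multiply prior by likelihood, then normalize". A minimal sketch follows; the three locations and all of their numbers are made up for illustration and do not come from the course.

```python
def posteriors(priors, likelihoods):
    """Return P(B_i | E) for each hypothesis B_i.

    Each posterior is prior * likelihood, divided by the sum of those
    products over all hypotheses (the LOTP expansion of P(E)).
    """
    if len(priors) != len(likelihoods):
        raise ValueError("need one likelihood per hypothesis")
    joint = [l * p for p, l in zip(priors, likelihoods)]  # P(E and B_i)
    evidence = sum(joint)                                  # P(E)
    if evidence == 0.0:
        raise ValueError("the evidence has probability zero under every hypothesis")
    return [j / evidence for j in joint]

# Hypothetical example: a phone is in one of three locations.
priors = [0.5, 0.3, 0.2]       # belief before the observation
likelihoods = [0.1, 0.6, 0.3]  # P(observed signal | location), assumed values

post = posteriors(priors, likelihoods)
print([round(x, 3) for x in post])
```

The location with the largest prior does not end up with the largest posterior here, because the observation is much more likely under the second location.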
[Interactive calculator: set the disease prevalence, the test's sensitivity (true positive rate), and the false positive rate. The canvas shows 1000 people: blue = has disease, pink = healthy, dark shading = tested positive. The posterior P(disease | +) is computed live with Bayes' theorem, which builds the natural-frequency intuition.]
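The ratio trick: When P(E) is unknown, you can compute the posterior odds P(B|E)/P(B^C|E) instead. Both posteriors share the denominator P(E), so it cancels: P(B|E) / P(B^C|E) = [P(E|B) · P(B)] / [P(E|B^C) · P(B^C)].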
This ratio tells you how many times more likely B is than not-B, given evidence E. To recover the actual probability: P(B|E) = ratio / (1 + ratio).
Concrete ratio example: Using the mammogram numbers (prevalence 8%, sensitivity 95%, FPR 7%): P(I|+) / P(I^C|+) = (0.95 · 0.08) / (0.07 · 0.92) = 0.076 / 0.0644 ≈ 1.180.
So the disease is about 1.18 times more likely than no disease given a positive test. Converting: P(I|+) = 1.180 / (1 + 1.180) ≈ 0.541. Same answer, no need to compute P(+) directly.
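The odds route can be sketched as two small functions. The names are illustrative assumptions; the inputs are the mammogram numbers.

```python
def posterior_odds(prior: float, likelihood: float, likelihood_given_not: float) -> float:
    """Odds P(B|E) / P(B^C|E). P(E) cancels, so it is never computed."""
    denominator = likelihood_given_not * (1.0 - prior)
    if denominator == 0.0:
        raise ValueError("odds are undefined when P(E|B^C) * P(B^C) is zero")
    return (likelihood * prior) / denominator

def odds_to_probability(odds: float) -> float:
    """Convert odds in favour of B back into a probability."""
    return odds / (1.0 + odds)

odds = posterior_odds(prior=0.08, likelihood=0.95, likelihood_given_not=0.07)
print(f"odds = {odds:.3f}, probability = {odds_to_probability(odds):.3f}")
```

The odds factor into prior odds times a likelihood ratio, (0.08/0.92) × (0.95/0.07), which is often a convenient way to update by hand.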
Problem 1: Two children. A family has two children. You learn that at least one child is a girl. What is the probability both are girls? (Assume each child is equally likely to be a boy or girl, independently.)
Sample space: {BB, BG, GB, GG}. Event F = at least one girl = {BG, GB, GG}. Event E = both girls = {GG}. Then P(E|F) = P(E and F) / P(F) = (1/4) / (3/4) = 1/3.
Not 1/2! Conditioning on "at least one girl" removes BB but leaves three equally likely outcomes, only one of which is GG.
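Because the sample space is tiny, the answer can be confirmed by brute-force enumeration. This sketch assumes the four birth-order outcomes are equally likely, as the problem states.

```python
from fractions import Fraction
from itertools import product

# All equally likely (older child, younger child) combinations.
families = list(product("BG", repeat=2))

# Conditioning = keeping only the outcomes consistent with the evidence.
at_least_one_girl = [f for f in families if "G" in f]
both_girls = [f for f in at_least_one_girl if f == ("G", "G")]

p_both_given_one = Fraction(len(both_girls), len(at_least_one_girl))
print(p_both_given_one)
```

The filtering step mirrors the shrinking sample space: BB is discarded and the remaining three outcomes stay equally likely.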
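Problem 2: Chain rule with three events. In a bag are 5 red and 3 blue marbles. You draw three without replacement. What is P(all three red)? By the chain rule, P(R1 and R2 and R3) = P(R1) · P(R2|R1) · P(R3|R1,R2) = (5/8) · (4/7) · (3/6) = 5/28 ≈ 0.179.
Problem 3: LOTP in the wild. 70% of emails are spam. A spam filter flags 90% of spam and 5% of legitimate email. What fraction of all email gets flagged? By the LOTP, P(flag) = (0.90)(0.70) + (0.05)(0.30) = 0.63 + 0.015 = 0.645.
64.5% of all email is flagged. Follow-up using Bayes: of flagged emails, what fraction are actually spam? P(spam | flag) = P(flag | spam) · P(spam) / P(flag) = 0.63 / 0.645 ≈ 0.977.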
97.7% of flagged emails are truly spam. The filter is quite precise because spam is so prevalent.
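Problem 4: Independence and complements. If A and B are independent, prove A and B^C are independent.
We need to show P(A and B^C) = P(A) · P(B^C). Event A splits into the disjoint parts (A and B) and (A and B^C), so P(A and B^C) = P(A) − P(A and B). By independence this equals P(A) − P(A)P(B) = P(A)(1 − P(B)) = P(A) · P(B^C).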
Problem 5: Full Bayes workflow. A factory has two machines. Machine A (60% of production) has a 3% defect rate. Machine B (40%) has a 7% defect rate. A randomly selected item is defective. What is the probability it came from Machine A?
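Let D = defective, A = came from Machine A. By the LOTP, P(D) = (0.03)(0.60) + (0.07)(0.40) = 0.018 + 0.028 = 0.046. By Bayes' theorem, P(A|D) = P(D|A) · P(A) / P(D) = 0.018 / 0.046 ≈ 0.391.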
Even though Machine A produces 60% of all items, it only accounts for 39% of defects because its defect rate is much lower. Machine B, despite making fewer items, produces the majority of defects.
Problem 6: Conditional probability chain. P(A) = 0.4, P(B|A) = 0.7, P(C|A,B) = 0.9. Find P(A and B and C). By the chain rule, P(A and B and C) = P(A) · P(B|A) · P(C|A,B) = 0.4 · 0.7 · 0.9 = 0.252.
Conditional probability is not an isolated topic — it is the connective tissue of the entire course and the foundation of machine learning.
| Where it leads | How conditional probability appears |
|---|---|
| Random variables (Ch 4) | Conditional distributions P(X=x \| Y=y) extend everything here to numerical quantities |
| Naive Bayes classifiers | Bayes' theorem + conditional independence assumption → simple but powerful ML classifier |
| Bayesian networks | Chain rule decomposes joint distributions into products of conditionals along a directed graph |
| Hidden Markov models | LOTP marginalizes over hidden states; Bayes updates beliefs at each time step |
| Reinforcement learning | Conditional independence (Markov property) makes sequential decision-making tractable |
| Medical diagnostics | Bayes' theorem converts test accuracy into patient-level disease probability |
What we built:
• P(E|F) definition & the conditional paradigm
• Chain rule for joint probabilities
• Independence & its implications
• Law of total probability
• Bayes' theorem (classic + LOTP form)
What comes next:
• Chapter 4: Random variables — numeric outcomes
• Expectation, variance, common distributions
• Conditional distributions & independence for random variables
• The central limit theorem
Master the five tools of this chapter and you have the core toolkit for reasoning about events in discrete probability. Everything ahead (random variables, distributions, expectation) builds directly on these foundations.