Piech, Chapter 9

Joint Distributions

Reasoning about several random variables at once — from joint tables to continuous densities.

Prerequisites: Chapter 8 (Continuous random variables, PDFs, CDFs). That's it.
10 chapters · 4 simulations · 10 quizzes

Chapter 0: Why Joint Distributions?

Suppose you are building a tool to diagnose an illness called Determinitis. A patient walks in. They might have a fever (none, low, or high), and they might have lost their sense of smell. You know the probability of each symptom individually — but that is not enough. Whether fever and smell-loss appear together tells you far more about the disease than either symptom alone.

Until now, every distribution we have studied describes one random variable in isolation. But real problems involve several variables interacting with each other. A recommendation engine reasons about a user's age, location, and watch history simultaneously. A robot fuses its GPS reading with its wheel odometry. A medical model combines fever, cough, and test result.

To handle these situations we need a single function that captures the probability of every possible combination of values for multiple variables at once. That function is the joint distribution.

The core idea: A joint distribution is a lookup table (for discrete variables) or a density surface (for continuous ones) that tells you the probability of any combination of values across all your variables. From this one object you can recover everything — marginals, conditionals, expectations, covariances — for each variable and every relationship between them.
Single Variable: P(X = x), one PMF or PDF per variable
  ↓ multiple variables
Joint Distribution: P(X = x, Y = y, …), one function for all variables together
  ↓ recover individual info
Marginals & Conditionals: P(X = x) and P(X = x | Y = y), extracted from the joint

Here is a taste. Three random variables describe a patient: disease D (0 or 1), fever F (none, low, high), and smell S (0 or 1). The joint table has 2 × 3 × 2 = 12 cells, and every cell stores the probability of one specific combination — say P(D=1, F=low, S=0) = 0.005. Those 12 numbers encode everything about how these variables relate.

Property                  What it gives you
Joint PMF / PDF           Probability (or density) for any assignment of all variables
Marginal                  Distribution of one variable, summing (or integrating) out the others
Conditional               Distribution of one variable given specific values of others
Covariance / Correlation  How much two variables move together

This chapter builds these tools one at a time: joint PMFs and their tables, marginalization, conditionals from the joint, the multinomial distribution, continuous joint PDFs, marginal PDFs, and finally covariance and correlation. By the end, you will be able to take any joint distribution and extract every piece of information it contains.

Check: Why do we need joint distributions instead of separate single-variable distributions?

Chapter 1: The Joint PMF

For a single discrete variable, the PMF maps each value to a probability. For two discrete variables X and Y, the joint PMF maps every (x, y) pair to the probability that X takes value x and Y takes value y simultaneously:

P(X = x, Y = y)

Read the comma as "and." A common shorthand is P(x, y), which means the same thing. The joint PMF must satisfy two rules: every entry is non-negative, and the sum over all (x, y) pairs equals 1.

The most concrete way to represent a joint PMF is a joint probability table. Each row is a value of one variable, each column is a value of the other, and the cell contains the joint probability. Here is the Determinitis example from the textbook, simplified to two variables: fever F (none, low, high) and disease D (0 or 1).

            D = 0    D = 1
F = none    0.807    0.020
F = low     0.095    0.016
F = high    0.047    0.015

Reading the table: The probability someone has no fever and no disease is P(F=none, D=0) = 0.807. The probability someone has a high fever and Determinitis is P(F=high, D=1) = 0.015. Sum all six cells: 0.807 + 0.020 + 0.095 + 0.016 + 0.047 + 0.015 = 1.000. Good — the table is a valid distribution.
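The two validity rules can be checked mechanically. The sketch below stores the table as a Python dict keyed by (fever, disease) pairs; the dict layout and the helper name `is_valid_joint` are illustrative choices, not something the textbook prescribes.

```python
import math

# Joint PMF for fever F and disease D, copied from the table above.
# Storing it as a dict keyed by (fever, disease) is an assumption made
# for illustration only.
joint_fd = {
    ("none", 0): 0.807, ("none", 1): 0.020,
    ("low", 0): 0.095,  ("low", 1): 0.016,
    ("high", 0): 0.047, ("high", 1): 0.015,
}

def is_valid_joint(table, tol=1e-9):
    """Return True if every entry is non-negative and the entries sum to 1."""
    nonnegative = all(p >= 0 for p in table.values())
    sums_to_one = math.isclose(sum(table.values()), 1.0, abs_tol=tol)
    return nonnegative and sums_to_one

print(is_valid_joint(joint_fd))   # True
print(joint_fd[("high", 1)])      # 0.015
```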

Key insight — joint vs. conditional: The value 0.016 is the joint probability P(F=low, D=1) — the probability of having both a low fever and the disease. It is not P(F=low | D=1). A table of conditional probabilities would be called a conditional probability table. In a joint table, every cell is an unconditional joint probability and the full table sums to 1.

How big are these tables? If variable i can take n_i values, the table has ∏ n_i cells. For Determinitis with D (2 values), F (3 values), and S (2 values), that is 2 × 3 × 2 = 12 cells. Adding a fourth binary variable doubles it to 24. With 20 binary variables, the table has 2^20 ≈ 1 million cells. Joint tables grow exponentially, which is both their power and their weakness.

Think of it this way: A joint table is a brute-force approach. It stores one number for every possible world. That is complete information — you can answer any probability question about these variables — but the cost is exponential in the number of variables. Later tools (Bayesian networks, conditional independence) will compress this.
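To make the growth concrete, here is a small sketch that multiplies the number of values each variable can take. The function name `joint_table_cells` is hypothetical.

```python
import math

def joint_table_cells(cardinalities):
    """Number of cells in a joint table: the product of the n_i."""
    return math.prod(cardinalities)

print(joint_table_cells([2, 3, 2]))      # 12       (D, F, S)
print(joint_table_cells([2, 3, 2, 2]))   # 24       (one more binary variable)
print(joint_table_cells([2] * 20))       # 1048576  (20 binary variables)
```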
Check: In the Determinitis table above, what is P(F=low, D=0)?

Chapter 2: Marginalization

You have a joint table for X and Y, but you only care about X. How do you recover P(X = x) from the joint? The answer is marginalization — you sum out the variable you don't care about, using the Law of Total Probability (LOTP):

P(X = x) = ∑_y P(X = x, Y = y)

In words: to find the probability that X equals some particular value x, add up the joint probabilities across every possible value of Y. You are collapsing the Y dimension of the table.

Worked example — favorite digit: X is a Stanford student's favorite binary digit (0 or 1) and Y is their year. Here is the joint table from the textbook:

          X = 0    X = 1
Frosh     0.01     0.13
Soph      0.05     0.33
Junior    0.04     0.21
Senior    0.03     0.12
5+        0.02     0.06

What is P(X = 0)? Sum down the X=0 column:

P(X = 0) = 0.01 + 0.05 + 0.04 + 0.03 + 0.02 = 0.15

And P(X = 1) = 0.13 + 0.33 + 0.21 + 0.12 + 0.06 = 0.85. Check: 0.15 + 0.85 = 1.00.

The name "marginalization" comes from literally writing the sums in the margins of the table. The resulting single-variable distribution P(X) is called the marginal distribution of X.
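The column sum above can be reproduced in a few lines. This is a minimal sketch, assuming the table is stored as a dict keyed by (x, year); the helper `marginal` is a hypothetical name.

```python
from collections import defaultdict

# Favorite-digit joint table from the text, keyed by (x, year).
joint_xy = {
    (0, "Frosh"): 0.01,  (1, "Frosh"): 0.13,
    (0, "Soph"): 0.05,   (1, "Soph"): 0.33,
    (0, "Junior"): 0.04, (1, "Junior"): 0.21,
    (0, "Senior"): 0.03, (1, "Senior"): 0.12,
    (0, "5+"): 0.02,     (1, "5+"): 0.06,
}

def marginal(joint, axis):
    """Sum out every variable except the one at position `axis` of each key."""
    out = defaultdict(float)
    for assignment, p in joint.items():
        out[assignment[axis]] += p
    return dict(out)

p_x = marginal(joint_xy, 0)   # sums over year
p_y = marginal(joint_xy, 1)   # sums over digit
print({x: round(p, 2) for x, p in p_x.items()})   # {0: 0.15, 1: 0.85}
print(round(sum(p_x.values()), 2))                # 1.0
```

The same helper gives the marginal over years by passing `axis=1`.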

Marginalization with more variables: The same idea extends to any number of variables. With three variables X, Y, Z, you can marginalize out Y and Z to get P(X = x) = ∑_{y,z} P(X = x, Y = y, Z = z). The double sum means you loop over all combinations of y and z. If Y has 3 values and Z has 4, that is 12 terms per value of x.
Verification: After marginalizing, always check that the resulting marginal sums to 1. If it doesn't, either the original joint was wrong or you missed a term. This is a quick sanity check that catches arithmetic errors.
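The double sum can be sketched as a loop over every (y, z) combination. The joint used here is a placeholder (uniform over 2 × 3 × 4 cells), chosen only so that the code runs; it is not data from the text.

```python
from itertools import product

x_vals, y_vals, z_vals = [0, 1], [0, 1, 2], [0, 1, 2, 3]

# Placeholder joint: uniform over all 24 cells (an assumption for illustration).
joint_xyz = {cell: 1 / 24 for cell in product(x_vals, y_vals, z_vals)}

def marginal_x(joint, x):
    """Return P(X = x) and the number of terms that were summed."""
    terms = [joint[(x, y, z)] for y, z in product(y_vals, z_vals)]
    return sum(terms), len(terms)

for x in x_vals:
    p, n_terms = marginal_x(joint_xyz, x)
    print(x, round(p, 6), n_terms)   # each x: probability 0.5 from 12 terms
```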
Check: In the favorite-digit table, what is P(Y = Soph)?

Chapter 3: Conditional from Joint

The joint table is "complete information" — you can compute any conditional distribution directly from it. Recall from Chapter 3 that P(X = x | Y = y) = P(X = x, Y = y) / P(Y = y). The numerator comes straight from the joint table and the denominator is the marginal we just learned to compute.

P(X = x | Y = y) = P(X = x, Y = y) / P(Y = y)

Worked example: Using the favorite-digit table, what is P(X = 0 | Y = Soph)? We need two things. First, the joint: P(X=0, Y=Soph) = 0.05. Second, the marginal: P(Y=Soph) = 0.05 + 0.33 = 0.38. Now divide:

P(X = 0 | Y = Soph) = 0.05 / 0.38 ≈ 0.132

So about 13.2% of sophomores prefer 0. Compare this to the unconditional P(X=0) = 0.15 — sophomores are slightly less likely to favor 0 than the overall population.

You can compute the full conditional distribution P(X | Y = Soph) by dividing each cell in the Soph row by the row total:

               X = 0                X = 1
P(X | Soph)    0.05/0.38 ≈ 0.132    0.33/0.38 ≈ 0.868

Check: 0.132 + 0.868 = 1.000. The conditional distribution sums to 1, as it must.

Key insight — the three-step recipe: (1) Look up the joint probability in the table. (2) Marginalize to get the denominator. (3) Divide. This recipe works for any conditional involving variables in the joint. The joint table truly is complete information.
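The three-step recipe translates almost line for line into code. This is a sketch under the same assumed dict layout, keyed by (x, year); `conditional_x_given_y` is a hypothetical helper name.

```python
# Favorite-digit joint table from the text, keyed by (x, year).
joint_xy = {
    (0, "Frosh"): 0.01,  (1, "Frosh"): 0.13,
    (0, "Soph"): 0.05,   (1, "Soph"): 0.33,
    (0, "Junior"): 0.04, (1, "Junior"): 0.21,
    (0, "Senior"): 0.03, (1, "Senior"): 0.12,
    (0, "5+"): 0.02,     (1, "5+"): 0.06,
}

def conditional_x_given_y(joint, y):
    """Return the distribution P(X | Y = y) as a dict over x."""
    # Step 2: marginalize to get the denominator P(Y = y).
    p_y = sum(p for (_, y_val), p in joint.items() if y_val == y)
    # Steps 1 and 3: look up each joint entry in that row and divide.
    return {x: p / p_y for (x, y_val), p in joint.items() if y_val == y}

cond = conditional_x_given_y(joint_xy, "Soph")
print({x: round(p, 3) for x, p in cond.items()})   # {0: 0.132, 1: 0.868}
```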

This also lets you go backwards. If you know P(Y = y | X = x) and P(X = x), you can reconstruct the joint: P(X = x, Y = y) = P(Y = y | X = x) · P(X = x). This is just the chain rule from Chapter 3, now applied to random variables.
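As a quick check of the backwards direction with numbers from the favorite-digit table (using P(X = 0) = 0.15 computed earlier), the chain rule recovers the original cell:

```latex
\begin{align*}
P(Y=\text{Soph} \mid X=0) &= \frac{P(X=0,\,Y=\text{Soph})}{P(X=0)} = \frac{0.05}{0.15} = \tfrac{1}{3},\\
P(X=0,\,Y=\text{Soph}) &= P(Y=\text{Soph} \mid X=0)\,P(X=0) = \tfrac{1}{3}\cdot 0.15 = 0.05.
\end{align*}
```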

Check: Using the favorite-digit table, what is P(X = 1 | Y = Frosh)?

Chapter 4: The Multinomial Distribution

We have been looking at joint tables that are defined by listing every probability. The multinomial distribution is our first parametric joint distribution — one defined by a formula.

Think of it as an extension of the binomial. In a binomial, you flip a coin n times and count successes. In a multinomial, you roll an m-sided die n times and count how many times each face appears. The multinomial tells you the probability of any particular combination of face counts.

Setup: Perform n independent trials. Each trial produces one of m outcomes with probabilities p_1, p_2, …, p_m (where ∑ p_i = 1). Let X_i be the count of outcome i. Then:

P(X_1 = c_1, …, X_m = c_m) = (n! / (c_1! c_2! … c_m!)) · p_1^{c_1} · p_2^{c_2} … p_m^{c_m}

where c_1 + c_2 + … + c_m = n. The first factor is the multinomial coefficient: it counts how many orderings produce the given counts (permutations with indistinct objects).

Why this formula works: Any single ordering with c_1 copies of outcome 1, c_2 of outcome 2, etc., has probability p_1^{c_1} · p_2^{c_2} … p_m^{c_m} (by independence). The multinomial coefficient counts how many distinct orderings produce those same counts. Multiply the two together.

Worked example — Bayeslandia weather: Each day in Bayeslandia is Sunny (p = 0.7), Cloudy (p = 0.2), or Rainy (p = 0.1), independently. What is the probability that over 7 days, exactly 5 are sunny, 1 cloudy, and 1 rainy?

P(S=5, C=1, R=1) = (7! / (5! · 1! · 1!)) · 0.7^5 · 0.2^1 · 0.1^1

Step by step: 7! / (5! · 1! · 1!) = 5040 / (120 · 1 · 1) = 42. And 0.7^5 = 0.16807. So:

= 42 · 0.16807 · 0.2 · 0.1 = 42 · 0.0033614 ≈ 0.141

Compare: the probability that all 7 days are sunny is 0.7^7 ≈ 0.082. So a mix of 5 sunny, 1 cloudy, 1 rainy is about 1.7 times as likely as 7 straight sunny days, because there are 42 ways to arrange that mix versus only 1 way for all-sunny.
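The Bayeslandia calculation can be reproduced with a short function. This is a minimal sketch using only the standard library; the name `multinomial_pmf` is an illustrative choice.

```python
import math

def multinomial_pmf(counts, probs):
    """P(X_1 = c_1, ..., X_m = c_m) for n = sum(counts) independent trials."""
    # Multinomial coefficient n! / (c_1! ... c_m!), kept as an exact integer.
    coef = math.factorial(sum(counts))
    for c in counts:
        coef //= math.factorial(c)
    # Probability of one specific ordering with these counts.
    one_ordering = math.prod(p ** c for p, c in zip(probs, counts))
    return coef * one_ordering

weather = [0.7, 0.2, 0.1]   # Sunny, Cloudy, Rainy
print(round(multinomial_pmf([5, 1, 1], weather), 3))   # 0.141
print(round(multinomial_pmf([7, 0, 0], weather), 3))   # 0.082
```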

Multinomial property    Value
Parameters              n (trials), p_1, …, p_m (outcome probs, sum to 1)
Support                 c_i ∈ {0, 1, …, n} for each i, with ∑ c_i = n
E[X_i]                  n · p_i
Var(X_i)                n · p_i · (1 − p_i)
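The mean and variance formulas can be sanity-checked by brute force: enumerate every count vector in the support for the Bayeslandia weather (n = 7) and accumulate moments of the sunny-day count. This is an illustrative check under those assumed parameters, not a proof.

```python
import math

def multinomial_pmf(counts, probs):
    """Multinomial probability of the given count vector."""
    coef = math.factorial(sum(counts))
    for c in counts:
        coef //= math.factorial(c)
    return coef * math.prod(p ** c for p, c in zip(probs, counts))

n, probs = 7, (0.7, 0.2, 0.1)   # Sunny, Cloudy, Rainy
total = mean_s = second_moment_s = 0.0
for s in range(n + 1):
    for c in range(n - s + 1):
        r = n - s - c                     # counts must sum to n
        p = multinomial_pmf((s, c, r), probs)
        total += p
        mean_s += s * p
        second_moment_s += s * s * p

var_s = second_moment_s - mean_s ** 2
# Expect total = 1, E[S] = 7 * 0.7 = 4.9, Var(S) = 7 * 0.7 * 0.3 = 1.47.
print(round(total, 6), round(mean_s, 6), round(var_s, 6))   # 1.0 4.9 1.47
```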
Check: A fair 6-sided die is rolled 7 times. The multinomial coefficient for the outcome (1 one, 1 two, 0 threes, 2 fours, 0 fives, 3 sixes) is 7!/(1!1!0!2!0!3!). What is this value?

Chapter 5: Joint PDF (Continuous)

When variables are continuous, probabilities live in areas, not cells. Instead of a joint PMF we have a joint probability density function (PDF) f(x, y). The density at a point is not a probability — it is a relative likelihood. To get an actual probability, you integrate over a region:

P(a_1 < X ≤ a_2, b_1 < Y ≤ b_2) = ∫_{a_1}^{a_2} ∫_{b_1}^{b_2} f(x, y) dy dx

Think of the joint PDF as a mountain range over the x-y plane. The height at each point tells you how likely that combination is relative to others. The volume under the surface over any region gives the probability of landing in that region. The total volume under the entire surface is 1.

From discrete to continuous: the dartboard intuition. Imagine throwing darts at a target. The x and y landing coordinates are continuous random variables. Discretize the target into a grid of small squares. Each square gets a probability (its share of the darts). As the squares shrink, each square's probability divided by its area approaches the density at that point. The discrete joint table becomes the continuous joint PDF.

The independent multivariate Gaussian: The most important continuous joint distribution is the multivariate Gaussian. In the simplest (independent) case, each variable Xi is an independent normal with mean μi and standard deviation σi. The joint PDF is the product of individual PDFs:

f(x_1, …, x_n) = ∏_i (1 / (σ_i √(2π))) · exp(−(x_i − μ_i)^2 / (2σ_i^2))

In 2D this produces the familiar bell-shaped mound centered at (μ1, μ2). The simulation below lets you explore this surface.
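The product form is easy to evaluate directly. The sketch below assumes independence, as in the formula above; the function names are hypothetical.

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of a single normal variable at x."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def independent_gaussian_pdf(xs, mus, sigmas):
    """Joint density of independent normals: the product of the 1D densities."""
    return math.prod(normal_pdf(x, m, s) for x, m, s in zip(xs, mus, sigmas))

# Height of the 2D mound at its center for mu = (0, 0), sigma = (1, 1).
peak = independent_gaussian_pdf([0, 0], [0, 0], [1, 1])
print(round(peak, 4))   # 0.1592, which is 1 / (2π)
```

Note that the peak value is a density, not a probability: it exceeds 1 whenever the standard deviations are small enough.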

2D Gaussian Joint Density

Drag the sliders to change the standard deviations. The heatmap shows density as brightness — brighter means higher density. The density is highest at the center (μ1 = 0, μ2 = 0).

Sliders: σx = 1, σy = 1

Worked example — Gaussian blur: An image is blurred using a 2D Gaussian with μ = (0, 0) and σ = (3, 3). What fraction of the density falls inside the center pixel (−0.5 ≤ x ≤ 0.5, −0.5 ≤ y ≤ 0.5)? Because X and Y are independent, F(x, y) = Φ(x/3) · Φ(y/3). The center-pixel mass is:

F(0.5, 0.5) − F(−0.5, 0.5) − F(0.5, −0.5) + F(−0.5, −0.5) = (Φ(0.5/3) − Φ(−0.5/3))^2 ≈ (0.5662 − 0.4338)^2 ≈ 0.0175

Only about 1.8% of the density falls in the center pixel. The rest spreads out into neighboring pixels, which is exactly how Gaussian blur smooths an image.
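The center-pixel mass can be evaluated numerically with the standard library's error function. This sketch assumes the independent case with σ = 3 in both directions, as stated in the example; `phi` and `joint_cdf` are illustrative names.

```python
import math

def phi(z):
    """Standard normal CDF, written in terms of the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def joint_cdf(x, y, sigma=3.0):
    """F(x, y) for independent zero-mean normals with a shared sigma."""
    return phi(x / sigma) * phi(y / sigma)

# Inclusion-exclusion over the four corners of the pixel.
mass = (joint_cdf(0.5, 0.5) - joint_cdf(-0.5, 0.5)
        - joint_cdf(0.5, -0.5) + joint_cdf(-0.5, -0.5))
print(round(mass, 4))   # 0.0175
```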

Check: If f(1, 2) = 0.05 for a joint PDF, does that mean P(X=1, Y=2) = 0.05?

Chapter 6: Marginal PDF

Marginalization works the same way for continuous variables as it does for discrete ones — except you replace the sum with an integral. Given a joint PDF f(x, y), the marginal PDF of X is:

f_X(x) = ∫_{−∞}^{∞} f(x, y) dy

You are "integrating out" Y, collapsing the 2D density surface onto the x-axis. The result is a 1D density that describes X alone.

Similarly, the marginal of Y integrates out X:

f_Y(y) = ∫_{−∞}^{∞} f(x, y) dx
Key insight — the shadow analogy: Imagine the joint PDF as a 3D mountain. The marginal of X is the "shadow" you get by shining a light along the y-axis onto the x-axis. The marginal of Y is the shadow from a light along the x-axis. Each shadow loses the detail of the other dimension but preserves the overall shape for its own variable.

Worked example: Suppose f(x, y) = 6(1 − y) for 0 ≤ x ≤ y ≤ 1, and 0 otherwise. What is fX(x)?

We integrate out y over the valid range y ∈ [x, 1]:

f_X(x) = ∫_x^1 6(1 − y) dy = 6 [y − y^2/2]_x^1 = 6 [(1 − 1/2) − (x − x^2/2)] = 6 [1/2 − x + x^2/2] = 3(1 − x)^2, for 0 ≤ x ≤ 1

Check: ∫_0^1 3(1 − x)^2 dx = 3 · [−(1 − x)^3/3]_0^1 = 3 · (0 − (−1/3)) = 1. The marginal integrates to 1, as required.
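A numerical integration gives an independent check on the closed form. The sketch below uses a simple midpoint rule (a choice made here for brevity, not a method from the text) to integrate the joint density over y and compares the result with 3(1 − x)^2.

```python
def joint_pdf(x, y):
    """f(x, y) = 6(1 - y) on the triangle 0 <= x <= y <= 1, else 0."""
    return 6.0 * (1.0 - y) if 0.0 <= x <= y <= 1.0 else 0.0

def marginal_x(x, steps=10_000):
    """Midpoint-rule approximation of the integral of f(x, y) over y in [0, 1]."""
    h = 1.0 / steps
    return sum(joint_pdf(x, (k + 0.5) * h) for k in range(steps)) * h

for x in (0.0, 0.2, 0.5):
    numeric = marginal_x(x)
    closed_form = 3 * (1 - x) ** 2
    print(x, round(numeric, 3), round(closed_form, 3))   # the two values should agree
```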

Marginal PDF Visualizer

The center shows a 2D joint density. The top panel shows fX(x) — the marginal obtained by integrating out Y. The right panel shows fY(y) — the marginal obtained by integrating out X.

Continuous conditional from the joint: Just as in the discrete case, you can compute conditional densities: f(x | y) = f(x, y) / fY(y). The recipe is identical — divide the joint by the marginal.
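Applying this recipe to the worked example f(x, y) = 6(1 − y) on 0 ≤ x ≤ y ≤ 1 gives a short derivation. Here the marginal of Y is computed by integrating x over [0, y]:

```latex
\begin{align*}
f_Y(y) &= \int_0^y 6(1-y)\,dx = 6y(1-y), \qquad 0 \le y \le 1,\\
f(x \mid y) &= \frac{f(x,y)}{f_Y(y)} = \frac{6(1-y)}{6y(1-y)} = \frac{1}{y}, \qquad 0 \le x \le y,\ 0 < y < 1.
\end{align*}
```

So, given Y = y, X is uniform on [0, y] in this example.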
Check: To compute the marginal PDF of X from a joint PDF f(x,y), what operation do you perform?

Chapter 7: Covariance & Correlation

Marginals tell you about each variable individually. Covariance tells you how two variables move together. Do large values of X tend to appear alongside large values of Y (positive covariance), or alongside small values of Y (negative covariance)?

Cov(X, Y) = E[(X − μ_X)(Y − μ_Y)] = E[XY] − E[X]E[Y]

The second form is the computational shortcut — often easier to evaluate. E[XY] is the expectation of the product (computed from the joint), and E[X], E[Y] come from the marginals.

Sign interpretation: If Cov(X,Y) > 0, X and Y tend to be above (or below) their means simultaneously — they rise and fall together. If Cov(X,Y) < 0, when one is above its mean the other tends to be below — they move in opposition. If Cov(X,Y) = 0, there is no linear relationship (but there could still be a non-linear one).

The problem with covariance: Covariance depends on the units and scale of X and Y. If you measure height in centimeters vs. meters, the covariance changes by a factor of 100. This makes it hard to compare covariances across different pairs of variables. The fix is correlation.

The Pearson correlation coefficient normalizes covariance by the standard deviations:

ρ_{X,Y} = Cov(X, Y) / (σ_X · σ_Y)

Correlation is always between −1 and +1. At ρ = +1, Y is an exact increasing linear function of X. At ρ = −1, Y is an exact decreasing linear function. At ρ = 0, there is no linear relationship.

Worked example: Let X and Y have the following discrete joint:

         Y = 0    Y = 1
X = 0    0.4      0.1
X = 1    0.1      0.4

E[X] = 0 · 0.5 + 1 · 0.5 = 0.5. E[Y] = 0 · 0.5 + 1 · 0.5 = 0.5. E[XY] = 0·0·0.4 + 0·1·0.1 + 1·0·0.1 + 1·1·0.4 = 0.4. So Cov(X,Y) = 0.4 − 0.5·0.5 = 0.4 − 0.25 = 0.15. Since Var(X) = Var(Y) = 0.5 · 0.5 = 0.25, σX = σY = 0.5, and ρ = 0.15 / (0.5 · 0.5) = 0.6.

A correlation of 0.6 indicates a moderately strong positive linear relationship: when X is high, Y tends to be high too.
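The same numbers fall out of a direct computation from the joint table. This is a minimal sketch with the table stored as a dict keyed by (x, y); the helper names are illustrative.

```python
import math

# Joint table from the worked example, keyed by (x, y).
joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}

def expectation(table, g):
    """E[g(X, Y)] computed by summing g over every cell of the joint."""
    return sum(g(x, y) * p for (x, y), p in table.items())

def cov_and_corr(table):
    ex = expectation(table, lambda x, y: x)
    ey = expectation(table, lambda x, y: y)
    cov = expectation(table, lambda x, y: x * y) - ex * ey
    var_x = expectation(table, lambda x, y: x * x) - ex ** 2
    var_y = expectation(table, lambda x, y: y * y) - ey ** 2
    return cov, cov / math.sqrt(var_x * var_y)

cov, rho = cov_and_corr(joint)
print(round(cov, 2), round(rho, 2))   # 0.15 0.6
```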

Independence kills covariance: If X and Y are independent, then E[XY] = E[X]E[Y], so Cov(X,Y) = 0 and ρ = 0. The converse is not true: zero correlation does not guarantee independence. Two variables can be uncorrelated yet highly dependent in a non-linear way.
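A standard small example of "uncorrelated yet dependent" takes X uniform on {−1, 0, 1} and Y = X². The sketch below uses exact fractions so that the zero covariance is not a rounding artifact; the example is chosen here for illustration and does not come from the text.

```python
from fractions import Fraction

third = Fraction(1, 3)
# X is uniform on {-1, 0, 1} and Y = X^2, so only three cells carry mass.
joint = {(-1, 1): third, (0, 0): third, (1, 1): third}

def expect(g):
    """E[g(X, Y)] under the joint above."""
    return sum(g(x, y) * p for (x, y), p in joint.items())

cov = expect(lambda x, y: x * y) - expect(lambda x, y: x) * expect(lambda x, y: y)
p_x0 = sum(p for (x, _), p in joint.items() if x == 0)
p_y0 = sum(p for (_, y), p in joint.items() if y == 0)

print(cov)                           # 0
print(joint[(0, 0)], p_x0 * p_y0)    # 1/3 1/9
```

The covariance is exactly 0, yet P(X = 0, Y = 0) = 1/3 differs from P(X = 0) · P(Y = 0) = 1/9, so X and Y are not independent.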
Correlation Explorer

Each dot is a sample from a 2D Gaussian with the chosen correlation ρ. Watch how the scatter cloud stretches from a circle (ρ = 0) to a thin line (ρ ≈ ±1).

Slider: ρ = 0.60
Check: If two variables are independent, what is their correlation?

Chapter 8: Showcase — Joint PMF Explorer

This is the payoff. Below is an interactive joint probability table for two discrete random variables X (rows) and Y (columns). You can edit any cell's probability. The tool automatically computes and displays the marginal distributions P(X) and P(Y), the conditional P(X|Y) for a selected column, and the covariance and correlation.

Try building different joint distributions and observe how the marginals, conditionals, and correlation change. Can you create a table where X and Y are uncorrelated (ρ ≈ 0) yet clearly dependent?

Joint PMF Table Builder

Click any cell to cycle its weight (0–9). The table is auto-normalized so all cells sum to 1. Marginals appear on the edges. Select a Y column to see the conditional P(X|Y=y).

Condition on Y = 0
Things to try:
• Click Positive Corr and watch ρ go positive. The diagonal cells dominate.
• Click Negative Corr and watch ρ go negative. The anti-diagonal dominates.
• Click Independent — every conditional P(X|Y=y) is the same regardless of y.
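• Try building a joint where ρ = 0 but X and Y are clearly dependent (hint: if the table has three values per variable, put equal mass on the four corners and on the center cell; the four corners alone would make X and Y independent).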
Check: In an independent joint distribution, what happens to P(X|Y=y) as you change y?

Chapter 9: Connections

Joint distributions are not just a topic — they are the language of multi-variable probability. Everything ahead in this course and in machine learning assumes you can fluently work with joint, marginal, and conditional distributions.

Where it leads                How joint distributions appear
Bayesian inference (Ch 10)    The posterior P(θ|data) is derived from the joint P(θ, data) using Bayes' rule and marginalization
Naive Bayes classifiers       Assume conditional independence to factorize a huge joint distribution into manageable pieces
Bayesian networks             Compact representation of joint distributions via directed graphs and conditional probability tables
Hidden Markov models          The joint over hidden states and observations is factored via the Markov property
Gaussian mixture models       The joint density is a weighted sum of multivariate Gaussians; EM fits the parameters
Linear regression             Covariance and correlation quantify linear relationships; the regression line minimizes squared error
The master principle: "The joint is complete information." From P(X, Y, Z, …) you can derive any marginal, any conditional, any expectation, any covariance. In machine learning, most of what we do is either (a) estimating a joint distribution from data, (b) computing conditionals from a joint, or (c) finding compact representations of joints that would otherwise be too large to store. Joint distributions are the foundation of it all.

What we built:

• Joint PMF and probability tables
• Marginalization (sum out variables)
• Conditional from joint (divide by marginal)
• Multinomial distribution (parametric joint)
• Joint PDF for continuous variables
• Marginal PDF (integrate out variables)
• Covariance & correlation

What comes next:

• Chapter 10: Inference, using joints and conditionals to reason about unknown quantities from observed data
• Parameter estimation (MLE, MAP)
• The central limit theorem
• Bayesian networks

Every probability model you will ever build — from a two-variable table to a billion-parameter neural network — is ultimately a joint distribution over its variables. Master the tools of this chapter and the rest is refinement.

"Probability theory is nothing but common sense
reduced to calculation."
— Pierre-Simon Laplace
Check: Which operation extracts P(X = x) from the joint P(X = x, Y = y)?