Reasoning about several random variables at once — from joint tables to continuous densities.
Suppose you are building a tool to diagnose an illness called Determinitis. A patient walks in. They might have a fever (none, low, or high), and they might have lost their sense of smell. You know the probability of each symptom individually — but that is not enough. Whether fever and smell-loss appear together tells you far more about the disease than either symptom alone.
Until now, every distribution we have studied describes one random variable in isolation. But real problems involve several variables interacting with each other. A recommendation engine reasons about a user's age, location, and watch history simultaneously. A robot fuses its GPS reading with its wheel odometry. A medical model combines fever, cough, and test result.
To handle these situations we need a single function that captures the probability of every possible combination of values for multiple variables at once. That function is the joint distribution.
Here is a taste. Three random variables describe a patient: disease D (0 or 1), fever F (none, low, high), and smell S (0 or 1). The joint table has 2 × 3 × 2 = 12 cells, and every cell stores the probability of one specific combination — say P(D=1, F=low, S=0) = 0.005. Those 12 numbers encode everything about how these variables relate.
| Property | What it gives you |
|---|---|
| Joint PMF / PDF | Probability (or density) for any assignment of all variables |
| Marginal | Distribution of one variable, summing (or integrating) out the others |
| Conditional | Distribution of one variable given specific values of others |
| Covariance / Correlation | How much two variables move together |
This chapter builds these tools one at a time: joint PMFs and their tables, marginalization, conditionals from the joint, the multinomial distribution, continuous joint PDFs, marginal PDFs, and finally covariance and correlation. By the end, you will be able to take any joint distribution and extract every piece of information it contains.
For a single discrete variable, the PMF maps each value to a probability. For two discrete variables X and Y, the joint PMF maps every (x, y) pair to the probability that X takes value x and Y takes value y simultaneously:

$$p_{X,Y}(x, y) = P(X = x,\; Y = y)$$
Read the comma as "and." A common shorthand is P(x, y), which means the same thing. The joint PMF must satisfy two rules: every entry is non-negative, and the sum over all (x, y) pairs equals 1.
The most concrete way to represent a joint PMF is a joint probability table. Each row is a value of one variable, each column is a value of the other, and the cell contains the joint probability. Here is the Determinitis example from the textbook, simplified to two variables: fever F (none, low, high) and disease D (0 or 1).
| | D = 0 | D = 1 |
|---|---|---|
| F = none | 0.807 | 0.020 |
| F = low | 0.095 | 0.016 |
| F = high | 0.047 | 0.015 |
Reading the table: The probability someone has no fever and no disease is P(F=none, D=0) = 0.807. The probability someone has a high fever and Determinitis is P(F=high, D=1) = 0.015. Sum all six cells: 0.807 + 0.020 + 0.095 + 0.016 + 0.047 + 0.015 = 1.000. Good — the table is a valid distribution.
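The two validity rules can be checked mechanically. The sketch below stores the fever/disease table as a Python dictionary keyed by (fever, disease) pairs; the dictionary layout and the helper name `is_valid_joint` are illustrative choices, not something the textbook prescribes.

```python
import math

# Joint PMF of fever F and disease D, copied from the table above.
joint_fd = {
    ("none", 0): 0.807, ("none", 1): 0.020,
    ("low", 0): 0.095, ("low", 1): 0.016,
    ("high", 0): 0.047, ("high", 1): 0.015,
}

def is_valid_joint(joint, tol=1e-9):
    """Return True if every entry is non-negative and the entries sum to 1."""
    non_negative = all(p >= 0 for p in joint.values())
    sums_to_one = math.isclose(sum(joint.values()), 1.0, abs_tol=tol)
    return non_negative and sums_to_one

print(is_valid_joint(joint_fd))   # checks both rules on the six cells
print(joint_fd[("high", 1)])      # look up P(F=high, D=1)
```

A tolerance is used for the sum because floating-point addition of decimal values is rarely exact.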
How big are these tables? If variable i can take nᵢ values, the table has ∏ᵢ nᵢ cells. For Determinitis with D (2 values), F (3 values), and S (2 values), that is 2 × 3 × 2 = 12 cells. Adding a fourth binary variable doubles it to 24. With 20 binary variables, the table has 2²⁰ = 1,048,576 ≈ 1 million cells. Joint tables grow exponentially, which is both their power and their weakness.
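As a quick sanity check on the cell counts, here is a minimal sketch; the function name `joint_table_size` is made up for illustration.

```python
import math

def joint_table_size(value_counts):
    """Number of cells in a joint table: the product of each variable's value count."""
    return math.prod(value_counts)

print(joint_table_size([2, 3, 2]))      # D, F, S
print(joint_table_size([2, 3, 2, 2]))   # add a fourth binary variable
print(joint_table_size([2] * 20))       # 20 binary variables
```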
You have a joint table for X and Y, but you only care about X. How do you recover P(X = x) from the joint? The answer is marginalization: you sum out the variable you don't care about, using the Law of Total Probability (LOTP):

$$P(X = x) = \sum_{y} P(X = x,\; Y = y)$$
In words: to find the probability that X equals some particular value x, add up the joint probabilities across every possible value of Y. You are collapsing the Y dimension of the table.
Worked example — favorite digit: X is a Stanford student's favorite binary digit (0 or 1) and Y is their year. Here is the joint table from the textbook:
| | X = 0 | X = 1 |
|---|---|---|
| Frosh | 0.01 | 0.13 |
| Soph | 0.05 | 0.33 |
| Junior | 0.04 | 0.21 |
| Senior | 0.03 | 0.12 |
| 5+ | 0.02 | 0.06 |
What is P(X = 0)? Sum down the X=0 column:

$$P(X = 0) = 0.01 + 0.05 + 0.04 + 0.03 + 0.02 = 0.15$$
And P(X = 1) = 0.13 + 0.33 + 0.21 + 0.12 + 0.06 = 0.85. Check: 0.15 + 0.85 = 1.00.
The name "marginalization" comes from literally writing the sums in the margins of the table. The resulting single-variable distribution P(X) is called the marginal distribution of X.
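Marginalization is a short loop in code. The sketch below assumes the favorite-digit table is stored as a dictionary keyed by (year, digit); the helper name `marginal` and its `axis` parameter are illustrative, not part of any library.

```python
from collections import defaultdict

# Joint PMF keyed by (year, digit), copied from the favorite-digit table.
joint_yx = {
    ("Frosh", 0): 0.01, ("Frosh", 1): 0.13,
    ("Soph", 0): 0.05, ("Soph", 1): 0.33,
    ("Junior", 0): 0.04, ("Junior", 1): 0.21,
    ("Senior", 0): 0.03, ("Senior", 1): 0.12,
    ("5+", 0): 0.02, ("5+", 1): 0.06,
}

def marginal(joint, axis):
    """Sum out every variable except the one at position `axis` of the key."""
    totals = defaultdict(float)
    for key, p in joint.items():
        totals[key[axis]] += p
    return dict(totals)

p_x = marginal(joint_yx, axis=1)   # sum over years: distribution of the digit
p_y = marginal(joint_yx, axis=0)   # sum over digits: distribution of the year

print({x: round(p, 2) for x, p in p_x.items()})
print({y: round(p, 2) for y, p in p_y.items()})
```

The same function serves both margins: `axis=1` collapses the year dimension, `axis=0` collapses the digit dimension.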
The joint table is "complete information" — you can compute any conditional distribution directly from it. Recall from Chapter 3 that P(X = x | Y = y) = P(X = x, Y = y) / P(Y = y). The numerator comes straight from the joint table and the denominator is the marginal we just learned to compute.
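Worked example: Using the favorite-digit table, what is P(X = 0 | Y = Soph)? We need two things. First, the joint: P(X=0, Y=Soph) = 0.05. Second, the marginal: P(Y=Soph) = 0.05 + 0.33 = 0.38. Now divide:

$$P(X = 0 \mid Y = \text{Soph}) = \frac{P(X = 0,\; Y = \text{Soph})}{P(Y = \text{Soph})} = \frac{0.05}{0.38} \approx 0.132$$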
So about 13.2% of sophomores prefer 0. Compare this to the unconditional P(X=0) = 0.15 — sophomores are slightly less likely to favor 0 than the overall population.
You can compute the full conditional distribution P(X | Y = Soph) by dividing each cell in the Soph row by the row total:
| | X = 0 | X = 1 |
|---|---|---|
| P(X \| Soph) | 0.05/0.38 ≈ 0.132 | 0.33/0.38 ≈ 0.868 |
Check: 0.132 + 0.868 = 1.000. The conditional distribution sums to 1, as it must.
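The row-normalization step translates directly into code. This is a minimal sketch under the same dictionary layout as the table (keys are (year, digit) pairs); the function name `conditional_x_given_y` is an illustrative choice.

```python
# Joint PMF keyed by (year, digit), copied from the favorite-digit table.
joint_yx = {
    ("Frosh", 0): 0.01, ("Frosh", 1): 0.13,
    ("Soph", 0): 0.05, ("Soph", 1): 0.33,
    ("Junior", 0): 0.04, ("Junior", 1): 0.21,
    ("Senior", 0): 0.03, ("Senior", 1): 0.12,
    ("5+", 0): 0.02, ("5+", 1): 0.06,
}

def conditional_x_given_y(joint, y):
    """P(X = x | Y = y) for every x: divide the row for y by its total."""
    row = {x: p for (year, x), p in joint.items() if year == y}
    p_y = sum(row.values())   # the marginal P(Y = y)
    if p_y == 0:
        raise ValueError(f"P(Y = {y!r}) is zero, so the conditional is undefined")
    return {x: p / p_y for x, p in row.items()}

cond_soph = conditional_x_given_y(joint_yx, "Soph")
print({x: round(p, 3) for x, p in cond_soph.items()})
```

Conditioning on a value with zero probability is undefined, so the sketch raises an error rather than dividing by zero.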
This also lets you go backwards. If you know P(Y = y | X = x) and P(X = x), you can reconstruct the joint: P(X = x, Y = y) = P(Y = y | X = x) · P(X = x). This is just the chain rule from Chapter 3, now applied to random variables.
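To see the chain rule round-trip, the sketch below takes the favorite-digit joint apart into P(X) and P(Y | X) and then multiplies them back together. The variable names are illustrative, and the decomposition shown is just one way to organize the computation.

```python
# Joint PMF keyed by (year, digit), copied from the favorite-digit table.
joint_yx = {
    ("Frosh", 0): 0.01, ("Frosh", 1): 0.13,
    ("Soph", 0): 0.05, ("Soph", 1): 0.33,
    ("Junior", 0): 0.04, ("Junior", 1): 0.21,
    ("Senior", 0): 0.03, ("Senior", 1): 0.12,
    ("5+", 0): 0.02, ("5+", 1): 0.06,
}

# Marginal P(X = x): sum the joint over every year.
p_x = {}
for (year, x), p in joint_yx.items():
    p_x[x] = p_x.get(x, 0.0) + p

# Conditional P(Y = y | X = x): each cell divided by its column total.
p_y_given_x = {(year, x): p / p_x[x] for (year, x), p in joint_yx.items()}

# Chain rule: P(X = x, Y = y) = P(Y = y | X = x) * P(X = x).
rebuilt = {key: p_y_given_x[key] * p_x[key[1]] for key in joint_yx}

max_error = max(abs(rebuilt[key] - joint_yx[key]) for key in joint_yx)
print(max_error)   # tiny floating-point residue, if any
```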
We have been looking at joint tables that are defined by listing every probability. The multinomial distribution is our first parametric joint distribution — one defined by a formula.
Think of it as an extension of the binomial. In a binomial, you flip a coin n times and count successes. In a multinomial, you roll an m-sided die n times and count how many times each face appears. The multinomial tells you the probability of any particular combination of face counts.
Setup: Perform n independent trials. Each trial produces one of m outcomes with probabilities p₁, p₂, …, pₘ (where ∑ pᵢ = 1). Let Xᵢ be the count of outcome i. Then:

$$P(X_1 = c_1, \ldots, X_m = c_m) = \frac{n!}{c_1!\, c_2! \cdots c_m!}\; p_1^{c_1}\, p_2^{c_2} \cdots p_m^{c_m}$$

where c₁ + c₂ + … + cₘ = n. The first factor is the multinomial coefficient: it counts how many orderings produce the given counts (permutations with indistinct objects).
Worked example — Bayeslandia weather: Each day in Bayeslandia is Sunny (p = 0.7), Cloudy (p = 0.2), or Rainy (p = 0.1), independently. What is the probability that over 7 days, exactly 5 are sunny, 1 cloudy, and 1 rainy?
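Step by step: 7! / (5! · 1! · 1!) = 5040 / (120 · 1 · 1) = 42. And 0.7⁵ = 0.16807. So:

$$P(X_{\text{sunny}} = 5,\; X_{\text{cloudy}} = 1,\; X_{\text{rainy}} = 1) = 42 \cdot 0.7^5 \cdot 0.2^1 \cdot 0.1^1 = 42 \cdot 0.16807 \cdot 0.02 \approx 0.141$$

Compare: the probability that all 7 days are sunny is 0.7⁷ ≈ 0.082. So a mix of 5 sunny, 1 cloudy, 1 rainy is about 1.7 times as likely as 7 straight sunny days, because there are 42 ways to arrange that mix versus only 1 way for all-sunny.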
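The Bayeslandia calculation can be reproduced with a few lines of standard-library Python. This is a sketch rather than a reference implementation; the function name `multinomial_pmf` is an assumption made for illustration.

```python
import math

def multinomial_pmf(counts, probs):
    """Probability of observing exactly `counts` in sum(counts) independent trials."""
    if len(counts) != len(probs):
        raise ValueError("counts and probs must have the same length")
    # Multinomial coefficient n! / (c1! * c2! * ... * cm!), kept as an exact integer.
    coeff = math.factorial(sum(counts))
    for c in counts:
        coeff //= math.factorial(c)
    # Multiply by p_i ** c_i for each outcome.
    prob = float(coeff)
    for c, p in zip(counts, probs):
        prob *= p ** c
    return prob

weather = [0.7, 0.2, 0.1]                        # Sunny, Cloudy, Rainy
mixed = multinomial_pmf([5, 1, 1], weather)      # 5 sunny, 1 cloudy, 1 rainy
all_sunny = multinomial_pmf([7, 0, 0], weather)  # 7 sunny days

print(round(mixed, 4), round(all_sunny, 4), round(mixed / all_sunny, 2))
```

Integer division is safe at every step because each partial quotient of the factorials is itself a whole number.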
| Multinomial property | Value |
|---|---|
| Parameters | n (trials), p₁, …, pₘ (outcome probabilities, sum to 1) |
| Support | cᵢ ∈ {0, 1, …, n} for each i, with ∑ cᵢ = n |
| E[Xᵢ] | n · pᵢ |
| Var(Xᵢ) | n · pᵢ · (1 − pᵢ) |
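The mean and variance formulas in the table can be verified exactly for a small case by enumerating the whole support. The sketch below does this for the 7-day Bayeslandia weather model; the helper names `multinomial_pmf` and `count_vectors` are illustrative.

```python
import math

def multinomial_pmf(counts, probs):
    """Multinomial probability of the given counts."""
    coeff = math.factorial(sum(counts))
    for c in counts:
        coeff //= math.factorial(c)
    return coeff * math.prod(p ** c for p, c in zip(probs, counts))

def count_vectors(n, m):
    """Yield every tuple of m non-negative integers that sums to n."""
    if m == 1:
        yield (n,)
        return
    for first in range(n + 1):
        for rest in count_vectors(n - first, m - 1):
            yield (first,) + rest

n, probs = 7, [0.7, 0.2, 0.1]
support = list(count_vectors(n, len(probs)))
pmf = {c: multinomial_pmf(c, probs) for c in support}

total = sum(pmf.values())
mean_sunny = sum(c[0] * p for c, p in pmf.items())
var_sunny = sum((c[0] - mean_sunny) ** 2 * p for c, p in pmf.items())

print(len(support), round(total, 6))
print(round(mean_sunny, 6), n * probs[0])                   # compare with n * p
print(round(var_sunny, 6), n * probs[0] * (1 - probs[0]))   # compare with n * p * (1 - p)
```

Enumeration is only practical for small n and m; the closed-form rows in the table are what you would use in general.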
When variables are continuous, probabilities live in areas, not cells. Instead of a joint PMF we have a joint probability density function (PDF) f(x, y). The density at a point is not a probability; it is a relative likelihood. To get an actual probability, you integrate over a region A of the x-y plane:

$$P\big((X, Y) \in A\big) = \iint_A f(x, y)\, dx\, dy$$
Think of the joint PDF as a mountain range over the x-y plane. The height at each point tells you how likely that combination is relative to others. The volume under the surface over any region gives the probability of landing in that region. The total volume under the entire surface is 1.
The independent multivariate Gaussian: The most important continuous joint distribution is the multivariate Gaussian. In the simplest (independent) case, each variable Xᵢ is an independent normal with mean μᵢ and standard deviation σᵢ. The joint PDF is the product of individual PDFs:

$$f(x_1, \ldots, x_n) = \prod_{i=1}^{n} \frac{1}{\sigma_i \sqrt{2\pi}} \exp\!\left(-\frac{(x_i - \mu_i)^2}{2\sigma_i^2}\right)$$
In 2D this produces the familiar bell-shaped mound centered at (μ1, μ2). The simulation below lets you explore this surface.
Drag the sliders to change the standard deviations. The heatmap shows density as brightness — brighter means higher density. The density is highest at the center (μ1 = 0, μ2 = 0).
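For readers without the interactive view, the product form of the independent Gaussian density is easy to evaluate directly. The sketch below is a minimal standard-library version; the function names `normal_pdf` and `independent_gaussian_pdf` are illustrative.

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of a single normal variable at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def independent_gaussian_pdf(point, mus, sigmas):
    """Joint density of independent normals: the product of the 1D densities."""
    density = 1.0
    for x, mu, sigma in zip(point, mus, sigmas):
        density *= normal_pdf(x, mu, sigma)
    return density

peak = independent_gaussian_pdf((0.0, 0.0), (0.0, 0.0), (1.0, 1.0))
off_center = independent_gaussian_pdf((1.0, 1.0), (0.0, 0.0), (1.0, 1.0))

print(peak, 1.0 / (2.0 * math.pi))   # with unit sigmas the peak height is 1 / (2π)
print(off_center < peak)             # density falls off away from the mean
```

Widening either standard deviation lowers the peak, since the total volume under the surface must stay at 1.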
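Worked example (Gaussian blur): An image is blurred using a 2D Gaussian with μ = (0, 0) and σ = (3, 3). What fraction of the density falls inside the center pixel (−0.5 ≤ x ≤ 0.5, −0.5 ≤ y ≤ 0.5)? Because X and Y are independent, the joint CDF factors as F(x, y) = Φ(x/3) · Φ(y/3), so the probability of the square is the product of two identical one-dimensional interval probabilities. The center-pixel mass is:

$$P(-0.5 \le X \le 0.5,\; -0.5 \le Y \le 0.5) = \left[\Phi\!\left(\tfrac{0.5}{3}\right) - \Phi\!\left(-\tfrac{0.5}{3}\right)\right]^2 \approx (0.5662 - 0.4338)^2 \approx 0.0175$$

Only about 1.8% of the density falls in the center pixel. The rest spreads out into neighboring pixels, which is exactly how Gaussian blur smooths an image.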
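The center-pixel mass can be computed numerically with the standard library's error function. This is a sketch under the example's assumptions (zero means, independent axes); the helper names `normal_cdf` and `box_probability` are illustrative.

```python
import math

def normal_cdf(x, mu=0.0, sigma=1.0):
    """CDF of a normal variable, written in terms of the error function."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def box_probability(x_lo, x_hi, y_lo, y_hi, sigma_x, sigma_y):
    """P(x_lo <= X <= x_hi, y_lo <= Y <= y_hi) for independent zero-mean normals."""
    p_x = normal_cdf(x_hi, 0.0, sigma_x) - normal_cdf(x_lo, 0.0, sigma_x)
    p_y = normal_cdf(y_hi, 0.0, sigma_y) - normal_cdf(y_lo, 0.0, sigma_y)
    return p_x * p_y   # independence lets the two interval probabilities multiply

center_pixel = box_probability(-0.5, 0.5, -0.5, 0.5, 3.0, 3.0)
print(round(center_pixel, 4))
```

A quick cross-check: the pixel has area 1 and the density at the origin is 1/(2π · 9) ≈ 0.0177, so the mass should be just under that value.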
Marginalization works the same way for continuous variables as it does for discrete ones, except you replace the sum with an integral. Given a joint PDF f(x, y), the marginal PDF of X is:

$$f_X(x) = \int_{-\infty}^{\infty} f(x, y)\, dy$$
You are "integrating out" Y, collapsing the 2D density surface onto the x-axis. The result is a 1D density that describes X alone.
Similarly, the marginal of Y integrates out X:

$$f_Y(y) = \int_{-\infty}^{\infty} f(x, y)\, dx$$
Worked example: Suppose f(x, y) = 6(1 − y) for 0 ≤ x ≤ y ≤ 1, and 0 otherwise. What is fX(x)?
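We integrate out y over the valid range y ∈ [x, 1]:

$$f_X(x) = \int_x^1 6(1 - y)\, dy = 6\left[y - \frac{y^2}{2}\right]_x^1 = 6\left(\frac{1}{2} - x + \frac{x^2}{2}\right) = 3(1 - x)^2, \quad 0 \le x \le 1$$

Check: ∫₀¹ 3(1 − x)² dx = 3 · [−(1 − x)³/3]₀¹ = 3 · (0 − (−1/3)) = 1. The marginal integrates to 1, as required.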
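A numerical integration gives an independent check on this marginal. The sketch below uses a simple midpoint rule; the function names and step counts are arbitrary illustrative choices, not a recommended quadrature scheme.

```python
def joint_pdf(x, y):
    """The example density: 6(1 - y) on the triangle 0 <= x <= y <= 1, else 0."""
    return 6.0 * (1.0 - y) if 0.0 <= x <= y <= 1.0 else 0.0

def marginal_x(x, steps=200):
    """Midpoint-rule approximation of the integral of f(x, y) over y in [x, 1]."""
    h = (1.0 - x) / steps
    return sum(joint_pdf(x, x + (k + 0.5) * h) for k in range(steps)) * h

def closed_form(x):
    """The marginal derived by hand: 3(1 - x)^2."""
    return 3.0 * (1.0 - x) ** 2

for x in (0.0, 0.25, 0.5, 0.9):
    print(x, round(marginal_x(x), 6), round(closed_form(x), 6))

# Integrate the numerical marginal over x in [0, 1]; it should come out close to 1.
steps = 200
total = sum(marginal_x((k + 0.5) / steps) for k in range(steps)) / steps
print(round(total, 4))
```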
The center shows a 2D joint density. The top panel shows fX(x) — the marginal obtained by integrating out Y. The right panel shows fY(y) — the marginal obtained by integrating out X.
Marginals tell you about each variable individually. Covariance tells you how two variables move together. Do large values of X tend to appear alongside large values of Y (positive covariance), or alongside small values of Y (negative covariance)?

$$\mathrm{Cov}(X, Y) = E\big[(X - E[X])(Y - E[Y])\big] = E[XY] - E[X]\,E[Y]$$

The second form is the computational shortcut and is often easier to evaluate. E[XY] is the expectation of the product (computed from the joint), and E[X], E[Y] come from the marginals.
Sign interpretation: If Cov(X,Y) > 0, X and Y tend to be above (or below) their means simultaneously — they rise and fall together. If Cov(X,Y) < 0, when one is above its mean the other tends to be below — they move in opposition. If Cov(X,Y) = 0, there is no linear relationship (but there could still be a non-linear one).
The Pearson correlation coefficient normalizes covariance by the standard deviations:

$$\rho(X, Y) = \frac{\mathrm{Cov}(X, Y)}{\sigma_X\, \sigma_Y}$$
Correlation is always between −1 and +1. At ρ = +1, Y is an exact increasing linear function of X. At ρ = −1, Y is an exact decreasing linear function. At ρ = 0, there is no linear relationship.
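Zero covariance rules out a linear relationship but not dependence. A standard illustration, chosen here as an assumption and not taken from the text, is X uniform on {−1, 0, 1} with Y = X². The sketch uses exact fractions so the zero is not a rounding artifact.

```python
from fractions import Fraction

# X is uniform on {-1, 0, 1} and Y = X**2, so Y is completely determined by X.
third = Fraction(1, 3)
joint_xy = {(-1, 1): third, (0, 0): third, (1, 1): third}

def expectation(joint, g):
    """E[g(X, Y)] computed directly from the joint PMF."""
    return sum(g(x, y) * p for (x, y), p in joint.items())

e_x = expectation(joint_xy, lambda x, y: x)
e_y = expectation(joint_xy, lambda x, y: y)
cov = expectation(joint_xy, lambda x, y: x * y) - e_x * e_y

# Independence would require P(X=0, Y=0) == P(X=0) * P(Y=0).
p_x0 = sum(p for (x, y), p in joint_xy.items() if x == 0)
p_y0 = sum(p for (x, y), p in joint_xy.items() if y == 0)
p_x0_y0 = joint_xy.get((0, 0), Fraction(0))
factorizes = p_x0_y0 == p_x0 * p_y0

print(cov)                              # exact covariance
print(p_x0_y0, p_x0 * p_y0, factorizes) # joint cell vs product of marginals
```

The covariance is exactly zero because the contributions from x = −1 and x = 1 cancel, yet knowing X pins down Y completely.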
Worked example: Let X and Y have the following discrete joint:
| | Y = 0 | Y = 1 |
|---|---|---|
| X = 0 | 0.4 | 0.1 |
| X = 1 | 0.1 | 0.4 |
E[X] = 0 · 0.5 + 1 · 0.5 = 0.5. E[Y] = 0 · 0.5 + 1 · 0.5 = 0.5. E[XY] = 0·0·0.4 + 0·1·0.1 + 1·0·0.1 + 1·1·0.4 = 0.4. So Cov(X,Y) = 0.4 − 0.5·0.5 = 0.4 − 0.25 = 0.15. Since Var(X) = Var(Y) = 0.5 · 0.5 = 0.25, σX = σY = 0.5, and ρ = 0.15 / (0.5 · 0.5) = 0.6.
A correlation of 0.6 indicates a moderately strong positive linear relationship: when X is high, Y tends to be high too.
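The same numbers fall out of a direct computation over the table. The sketch below assumes the joint is stored as a dictionary keyed by (x, y); the helper names `expectation`, `covariance`, and `correlation` are illustrative.

```python
import math

# Joint PMF keyed by (x, y), copied from the worked example.
joint_xy = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}

def expectation(joint, g):
    """E[g(X, Y)] computed directly from the joint PMF."""
    return sum(g(x, y) * p for (x, y), p in joint.items())

def covariance(joint):
    """Cov(X, Y) via the shortcut E[XY] - E[X]E[Y]."""
    e_x = expectation(joint, lambda x, y: x)
    e_y = expectation(joint, lambda x, y: y)
    return expectation(joint, lambda x, y: x * y) - e_x * e_y

def correlation(joint):
    """Pearson correlation: covariance divided by both standard deviations."""
    e_x = expectation(joint, lambda x, y: x)
    e_y = expectation(joint, lambda x, y: y)
    var_x = expectation(joint, lambda x, y: x * x) - e_x ** 2
    var_y = expectation(joint, lambda x, y: y * y) - e_y ** 2
    return covariance(joint) / math.sqrt(var_x * var_y)

print(round(covariance(joint_xy), 4), round(correlation(joint_xy), 4))
```

Swapping the two rows of the table (so the mass sits on the off-diagonal cells) flips the sign of both quantities.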
Each dot is a sample from a 2D Gaussian with the chosen correlation ρ. Watch how the scatter cloud stretches from a circle (ρ = 0) to a thin line (ρ ≈ ±1).
This is the payoff. Below is an interactive joint probability table for two discrete random variables X (rows) and Y (columns). You can edit any cell's probability. The tool automatically computes and displays the marginal distributions P(X) and P(Y), the conditional P(X|Y) for a selected column, and the covariance and correlation.
Try building different joint distributions and observe how the marginals, conditionals, and correlation change. Can you create a table where X and Y are uncorrelated (ρ ≈ 0) yet clearly dependent?
Click any cell to cycle its weight (0–9). The table is auto-normalized so all cells sum to 1. Marginals appear on the edges. Select a Y column to see the conditional P(X|Y=y).
Joint distributions are not just a topic — they are the language of multi-variable probability. Everything ahead in this course and in machine learning assumes you can fluently work with joint, marginal, and conditional distributions.
| Where it leads | How joint distributions appear |
|---|---|
| Bayesian inference (Ch 10) | The posterior P(θ \| data) is derived from the joint P(θ, data) using Bayes' rule and marginalization |
| Naive Bayes classifiers | Assume conditional independence to factorize a huge joint distribution into manageable pieces |
| Bayesian networks | Compact representation of joint distributions via directed graphs and conditional probability tables |
| Hidden Markov models | The joint over hidden states and observations is factored via the Markov property |
| Gaussian mixture models | The joint density is a weighted sum of multivariate Gaussians; EM fits the parameters |
| Linear regression | Covariance and correlation quantify linear relationships; the regression line minimizes squared error |
What we built:
• Joint PMF and probability tables
• Marginalization (sum out variables)
• Conditional from joint (divide by marginal)
• Multinomial distribution (parametric joint)
• Joint PDF for continuous variables
• Marginal PDF (integrate out variables)
• Covariance & correlation
What comes next:
• Chapter 10: Inference — using joints and conditionals to reason about unknown quantities from observed data
• Parameter estimation (MLE, MAP)
• The central limit theorem
• Bayesian networks
Every probability model you will ever build — from a two-variable table to a billion-parameter neural network — is ultimately a joint distribution over its variables. Master the tools of this chapter and the rest is refinement.