Ch 6: Probability & Distributions

Chapter 0: Why Probability?

You have a sensor. It tells you the temperature is 22.3°C. Do you trust it completely? Of course not — every measurement has noise. Your model predicts rain tomorrow with some confidence, but the atmosphere is chaotic. A neural network classifies an image as "cat" — but how sure is it?

Probability is the mathematical language for reasoning about uncertainty. In machine learning, uncertainty is everywhere: noisy data, incomplete observations, model mismatch, finite training sets. Probability gives us a principled framework to quantify, propagate, and reduce uncertainty.

The core idea: Probability is not just about flipping coins. It is the calculus of belief — a way to assign numbers to how plausible different outcomes are, and to update those numbers as evidence arrives. Every ML model is, at its heart, a probabilistic statement about data.

This chapter builds the probabilistic toolkit that the rest of the book depends on. We start from axioms, build up to distributions, and culminate in two workhorses of ML: the Gaussian distribution and Bayes' theorem.

Foundations: probability spaces, axioms, random variables

↓

Rules: sum rule, product rule, Bayes' theorem

↓

Statistics: mean, variance, covariance, correlation

↓

Distributions: Gaussian, conjugate priors, exponential family

Concept	Why it matters for ML
Bayes' theorem	Posterior inference, updating beliefs with data
Gaussian distribution	Linear regression, GP, Kalman filters, VAEs
Conjugate priors	Closed-form Bayesian updates
Exponential family	Unifies most common distributions, enables GLMs
Change of variables	Normalizing flows, reparameterization trick

Think of probability as the operating system of ML. Optimization (Chapter 7) tells you how to learn. Probability tells you what to learn and how certain you should be about it.

Check: Why is probability essential to machine learning?

• It makes models run faster

• It provides a principled framework for reasoning about uncertainty in data and models

• It is only needed for random number generation

Chapter 1: Probability Spaces

Before we can compute with probability, we need to define what probability is. The formal machinery has three parts, collectively called a probability space (Ω, A, P).

The sample space Ω is the set of all possible outcomes. Roll a die: Ω = {1, 2, 3, 4, 5, 6}. Measure a person's height: Ω = (0, ∞). The sample space can be finite, countably infinite, or uncountably infinite.

The event space A is a collection of subsets of Ω that we assign probabilities to. For a die, A includes events like "roll an even number" = {2, 4, 6}. Technically, A must be a σ-algebra — closed under complements and countable unions — but the key intuition is: A is the set of questions we can ask about outcomes.

The probability measure P assigns a number in [0, 1] to each event in A, following three axioms:

1. P(A) ≥ 0 for all A ∈ A

2. P(Ω) = 1

3. P(A₁ ∪ A₂ ∪ …) = P(A₁) + P(A₂) + … if A_i are disjoint

Key insight: These three axioms are all you need. Every theorem in probability theory — Bayes' theorem, the law of large numbers, the central limit theorem — follows from these three rules. They are the "axioms of Euclidean geometry" for uncertainty.

A random variable X is a function from the sample space to the real numbers: X: Ω → R. It maps outcomes to numbers we can compute with. "The number showing on the die" is a random variable. "The height of a randomly chosen person" is a random variable. We describe random variables through their probability distribution — a rule that assigns probabilities to the values X can take.

Component	Symbol	Example (die roll)
Sample space	Ω	{1, 2, 3, 4, 5, 6}
Event space	A	All subsets of Ω
Probability measure	P	P({k}) = 1/6 for each k
Random variable	X	X(ω) = ω (the face value)
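
The die-roll probability space in the table above can be written out directly. The following is a minimal sketch using only the Python standard library; the names `omega`, `prob`, `event_space`, and `dist_X` are illustrative choices, not part of any library, and the axioms are only spot-checked on this finite example.

```python
from fractions import Fraction
from itertools import combinations

# Sample space for one roll of a fair die.
omega = frozenset({1, 2, 3, 4, 5, 6})

def prob(event):
    """Probability measure P: each outcome carries mass 1/6."""
    return Fraction(len(event & omega), len(omega))

# Event space: every subset of omega (2^6 = 64 events).
event_space = [frozenset(c)
               for r in range(len(omega) + 1)
               for c in combinations(sorted(omega), r)]

even = frozenset({2, 4, 6})
low_odd = frozenset({1, 3})          # disjoint from `even`

# The three axioms, checked on this example.
axiom_nonneg = all(prob(e) >= 0 for e in event_space)
axiom_total = prob(omega) == 1
axiom_additive = prob(even | low_odd) == prob(even) + prob(low_odd)

# Random variable X(w) = w and the distribution it induces.
X = lambda w: w
dist_X = {X(w): prob(frozenset({w})) for w in omega}

print(prob(even), axiom_nonneg, axiom_total, axiom_additive)
```

Exact fractions are used instead of floats so that the equality checks in the axioms are not disturbed by rounding.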

Two interpretations: The frequentist view says P(A) is the long-run frequency of A in repeated experiments. The Bayesian view says P(A) is your degree of belief that A is true. The math is identical — only the interpretation differs. ML uses both perspectives freely.

Check: What are the three components of a probability space?

• Sample space Ω, event space A, probability measure P

• Mean, variance, standard deviation

• Prior, likelihood, posterior

Chapter 2: Discrete vs Continuous

Random variables come in two fundamental flavors, and they are described differently.

A discrete random variable X takes values from a countable set. We describe it with a probability mass function (PMF): P(X = x) gives the probability of each specific value. The PMF must be non-negative and sum to 1.

P(X = x_i) ≥ 0 and ∑_i P(X = x_i) = 1

Example: a fair coin has PMF P(H) = 0.5, P(T) = 0.5. A loaded die might have P(6) = 0.5 and P(k) = 0.1 for k = 1, …, 5.

A continuous random variable X takes values in an uncountable set (typically an interval in R). We describe it with a probability density function (PDF) f(x). The crucial difference: f(x) is not a probability. It is a density — probability per unit length. You get probabilities by integrating:

P(a ≤ X ≤ b) = ∫_a^b f(x) dx

Key insight: For continuous variables, the probability of any single exact value is zero: P(X = 3.14159...) = 0. Probability only lives in intervals. This is why we need densities rather than mass functions. The density can exceed 1 — what matters is that it integrates to 1 over its domain.
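
To see a density exceed 1 while still integrating to 1, here is a small numerical sketch with NumPy. The narrow Gaussian with σ = 0.1, the grid, and the helper name `gaussian_pdf` are assumptions made for illustration, and the integral is approximated by a plain Riemann sum.

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) evaluated at x."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

sigma = 0.1                           # narrow Gaussian, chosen for illustration
x = np.linspace(-1.0, 1.0, 200_001)   # covers +/- 10 standard deviations
dx = x[1] - x[0]
f = gaussian_pdf(x, 0.0, sigma)

peak = f.max()                        # density at the mean, well above 1
total = np.sum(f) * dx                # Riemann-sum approximation of the integral
inside = (x >= -sigma) & (x <= sigma)
p_one_sigma = np.sum(f[inside]) * dx  # P(-sigma <= X <= sigma)

print(peak, total, p_one_sigma)
```

The peak value is a density (probability per unit length), so it is free to be larger than 1; the probability of the one-sigma interval is still a number between 0 and 1.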

The cumulative distribution function (CDF) unifies both cases: F(x) = P(X ≤ x). For discrete X, F is a staircase. For continuous X, F is a smooth curve rising from 0 to 1. The CDF always exists, even when the PMF or PDF is awkward to write down.

PMF vs PDF

Left: a discrete distribution (Binomial with n=10, p=0.5). Right: a continuous distribution (Gaussian with μ=0, σ=1). Notice how the PMF gives bar heights that sum to 1, while the PDF gives a curve whose area integrates to 1.

Property	Discrete (PMF)	Continuous (PDF)
Values	Countable set	Uncountable (interval)
Assigns	Probabilities to points	Densities (prob per length)
Sum/Integral	∑ P(x) = 1	∫ f(x) dx = 1
P(X = x)	Can be > 0	Always 0
f(x) or P(x)	≤ 1	Can be > 1

Check: Can a probability density function f(x) exceed 1?

• Yes: it is a density, not a probability; only the integral must equal 1

• No: probabilities can never exceed 1

• Only for multivariate distributions

Chapter 3: Sum Rule & Product Rule

Two rules govern how probabilities compose. Master them, and everything else in this chapter becomes a consequence.

Given two random variables X and Y with joint distribution p(X, Y), the product rule decomposes the joint into a conditional and a marginal:

p(X, Y) = p(Y | X) · p(X)

Read this as: the probability of seeing X and Y together equals the probability of X times the probability of Y given X. The conditional p(Y | X) captures what you learn about Y once you know X.

The sum rule (or marginalization) recovers a marginal distribution by summing (or integrating) out the other variable:

p(X) = ∑_Y p(X, Y) (discrete)

p(X) = ∫ p(X, Y) dY (continuous)

Why these two rules matter so much: Every probabilistic computation in ML — inference, learning, prediction — reduces to repeated application of the product rule (factoring joints) and the sum rule (integrating out variables). Bayesian inference, EM, variational methods, message passing — all are built from these two bricks.

Independence is the special case where conditioning tells you nothing: p(Y | X) = p(Y). In that case, the product rule simplifies to p(X, Y) = p(X) p(Y). Conditional independence X ⊥ Y | Z means p(X, Y | Z) = p(X | Z) p(Y | Z) — once you know Z, X and Y decouple. This is the foundation of graphical models.

Joint, Marginal, and Conditional

A 2D joint distribution over two discrete variables. The right margin sums columns (marginal of X). The bottom margin sums rows (marginal of Y). Click a cell to see the conditional distribution.

Rule	Formula	What it does
Product	p(X, Y) = p(Y | X) p(X)	Factorize a joint distribution
Sum	p(X) = ∑_Y p(X, Y)	Marginalize out a variable
Chain	p(X₁, ..., X_n) = ∏_i p(X_i | X₁, ..., X_{i−1})	Factor any joint into conditionals
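
The sum and product rules can be exercised on a small joint table. This is a sketch with NumPy; the numbers in `joint` are made up for illustration, with rows indexing X and columns indexing Y.

```python
import numpy as np

# Hypothetical joint p(X, Y): rows index X (3 values), columns index Y (2 values).
joint = np.array([[0.10, 0.20],
                  [0.30, 0.15],
                  [0.05, 0.20]])
assert np.isclose(joint.sum(), 1.0)

# Sum rule: marginalize by summing over the other variable's axis.
p_x = joint.sum(axis=1)    # p(X) = sum_Y p(X, Y)
p_y = joint.sum(axis=0)    # p(Y) = sum_X p(X, Y)

# Product rule rearranged: p(Y | X) = p(X, Y) / p(X).
p_y_given_x = joint / p_x[:, None]

# Multiplying back recovers the joint: p(X, Y) = p(Y | X) p(X).
reconstructed = p_y_given_x * p_x[:, None]

# Independence would require p(X, Y) = p(X) p(Y) in every cell.
independent = np.allclose(joint, np.outer(p_x, p_y))

print(p_x, p_y, independent)
```

Each row of `p_y_given_x` is itself a distribution over Y, so every row sums to 1. For this particular table the variables are not independent.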

Check: What does the sum rule (marginalization) let you do?

• Multiply two probabilities together

• Recover the distribution of one variable by summing out the other from the joint

• Compute the mode of a distribution

Chapter 4: Bayes' Theorem

Bayes' theorem is arguably the most important equation in machine learning. It tells you how to update your beliefs when you see new data. Start with a prior belief, observe evidence, and get a posterior belief. The formula is a direct consequence of the product rule:

p(θ | D) = p(D | θ) · p(θ) ⁄ p(D)

Let's unpack each piece:

Term	Name	Meaning
p(θ)	Prior	What you believed about θ before seeing data
p(D | θ)	Likelihood	How probable the data is if θ were true
p(θ | D)	Posterior	Your updated belief after seeing data
p(D)	Evidence / Marginal likelihood	A normalizing constant = ∫ p(D | θ) p(θ) dθ

The Bayesian recipe: Prior × Likelihood → Posterior. That's it. Your old beliefs, multiplied by how well they explain the data, give you your new beliefs. As data accumulates, the posterior concentrates. A bad prior is eventually overwhelmed by the data, provided it did not assign zero probability to the true value. The evidence p(D) just makes the posterior integrate to 1.

Let's make this concrete. Suppose you find a coin and wonder if it is fair. Your prior belief about the bias θ (probability of heads) is a Beta distribution — maybe Beta(2, 2), meaning you think the coin is probably roughly fair but you're not sure. You flip the coin N times and observe h heads. The likelihood is Binomial. Bayes' theorem gives you a posterior that is also Beta — this is the magic of conjugacy (Chapter 9).

Prior: θ ~ Beta(α, β)

Likelihood: h | θ ~ Binomial(N, θ)

Posterior: θ | h ~ Beta(α + h, β + N − h)

The posterior is simply the prior with updated counts. Each head increments α; each tail increments β. The more flips, the more concentrated the posterior becomes. Try it below.

Bayes' Theorem: Coin Flipper

Set the prior Beta(α, β) with the sliders. Click Flip to generate a random coin flip (true bias = 0.6). Watch the posterior update live. Dashed = prior, dotted = likelihood, solid = posterior.

What to notice: With a weak prior (small α, β), the posterior quickly tracks the data. With a strong prior (large α, β), many flips are needed before the data overrides your prior. The posterior always concentrates toward the true bias as N grows — this is Bayesian consistency.
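
The Beta-Binomial update described above takes only a few lines. The sketch below uses the standard library; the helper names, the seed, the 1000 simulated flips, and the two priors are assumptions chosen for illustration, with the true bias set to 0.6 as in the text.

```python
import random

def beta_update(alpha, beta, flips):
    """Conjugate update: each head adds 1 to alpha, each tail adds 1 to beta."""
    heads = sum(flips)
    return alpha + heads, beta + len(flips) - heads

def beta_mean(alpha, beta):
    return alpha / (alpha + beta)

def beta_var(alpha, beta):
    total = alpha + beta
    return alpha * beta / (total ** 2 * (total + 1))

rng = random.Random(0)               # fixed seed so the run is repeatable
true_bias = 0.6
flips = [1 if rng.random() < true_bias else 0 for _ in range(1000)]

weak = beta_update(2, 2, flips)          # weak prior Beta(2, 2)
strong = beta_update(200, 200, flips)    # strong prior centred on 0.5

print(beta_mean(*weak), beta_var(*weak))
print(beta_mean(*strong), beta_var(*strong))
```

With the weak prior the posterior mean should sit close to the observed frequency of heads, while the strong prior keeps it pulled toward 0.5 even after 1000 flips.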

The evidence integral: The denominator p(D) = ∫ p(D | θ) p(θ) dθ is often intractable for complex models. This is why much of modern ML (variational inference, MCMC, normalizing flows) is dedicated to avoiding or approximating this integral. But conceptually, it is just a normalizing constant.

Check: In Bayes' theorem, what does the likelihood p(D | θ) represent?

• Your prior belief about the parameter

• How probable the observed data is under a specific parameter value

• The probability of the parameter being correct

Chapter 5: Mean & Variance

Distributions can be complicated objects. Two numbers summarize the most important features: where the distribution is centered and how spread out it is.

The mean (or expected value) E[X] is the probability-weighted average of all values X can take:

E[X] = ∑_x x · P(X = x) (discrete)

E[X] = ∫ x · f(x) dx (continuous)

Think of the mean as the "center of mass" of the distribution. If you placed weights proportional to the probability at each point on a number line, the mean is where the line balances.

The variance V[X] measures how far values typically deviate from the mean:

V[X] = E[(X − E[X])²] = E[X²] − (E[X])²

The second form — "E of X-squared minus the square of E of X" — is usually easier to compute. The standard deviation σ = √V[X] has the same units as X, making it more interpretable.

Key insight: The mean and variance are expectations — integrals against the distribution. This is a general pattern: any summary of a distribution is an expected value of some function g(X). The mean uses g(X) = X. The variance uses g(X) = (X − μ)². Higher moments use g(X) = X^k.

Linearity of expectation is enormously useful: E[aX + b] = aE[X] + b. It also extends to sums of any two random variables X and Y, even dependent ones: E[X + Y] = E[X] + E[Y]. No independence needed. Variance is not linear: V[aX + b] = a²V[X] (the constant b disappears, and a gets squared).

Property	Mean	Variance
Shift by b	E[X + b] = E[X] + b	V[X + b] = V[X]
Scale by a	E[aX] = aE[X]	V[aX] = a²V[X]
Sum	E[X + Y] = E[X] + E[Y] (always)	V[X + Y] = V[X] + V[Y] (X, Y independent)
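
The shift and scale rules can be verified exactly on a fair die. This is a small sketch with exact fractions; the helpers `expect`, `variance`, and `transform`, and the constants a = 3, b = 10, are illustrative assumptions.

```python
from fractions import Fraction

# PMF of a fair die: value -> probability.
pmf = {k: Fraction(1, 6) for k in range(1, 7)}

def expect(pmf, g=lambda x: x):
    """E[g(X)] for a discrete PMF."""
    return sum(g(x) * p for x, p in pmf.items())

def variance(pmf):
    mu = expect(pmf)
    return expect(pmf, lambda x: (x - mu) ** 2)

def transform(pmf, a, b):
    """PMF of aX + b (assumes a != 0, so distinct values stay distinct)."""
    return {a * x + b: p for x, p in pmf.items()}

mean_x = expect(pmf)                                  # 7/2
var_x = variance(pmf)                                 # 35/12
var_shortcut = expect(pmf, lambda x: x * x) - mean_x ** 2

a, b = 3, 10
pmf_y = transform(pmf, a, b)

print(mean_x, var_x, expect(pmf_y), variance(pmf_y))
```

Both variance formulas agree, the mean of aX + b is a·E[X] + b, and the variance picks up a factor of a² while ignoring b.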

Check: Why does V[X + b] = V[X] — adding a constant doesn't change variance?

• Because shifting all values by b shifts the mean by b too, so deviations from the mean are unchanged

• Because constants have zero probability

• Because variance is always non-negative

Chapter 6: Covariance & Correlation

Mean and variance describe single variables. For pairs of variables, we need to capture how they move together. Enter covariance.

Cov[X, Y] = E[(X − E[X])(Y − E[Y])] = E[XY] − E[X]E[Y]

Positive covariance: when X is above its mean, Y tends to be above its mean too. Negative: they move in opposite directions. Zero: no linear relationship (but they might still be dependent nonlinearly).

The problem with covariance is that its magnitude depends on the units and scales of X and Y. Correlation normalizes this:

Corr[X, Y] = Cov[X, Y] / (σ_X σ_Y) ∈ [−1, 1]

Correlation of +1 means perfect positive linear relationship. Correlation of −1 means perfect negative linear relationship. Correlation of 0 means no linear relationship.

Covariance is variance, generalized. Notice that Cov[X, X] = V[X]. The variance is just the covariance of a variable with itself. For vectors X = (X₁, ..., X_D), the covariance matrix Σ has entries Σ_ij = Cov[X_i, X_j]. Diagonal entries are variances; off-diagonals are covariances. This matrix is always symmetric and positive semi-definite.

For a random vector X ∈ R^D with mean μ = E[X]:

Σ = Cov[X] = E[(X − μ)(X − μ)^T]

This D×D matrix encodes all pairwise linear relationships. It is the multivariate generalization of variance, and it appears everywhere: Gaussian distributions, PCA, Kalman filters, Mahalanobis distance.
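
As a rough sketch of the definition above, a sample covariance matrix can be computed from centered data and checked for symmetry and positive semi-definiteness. The synthetic data, the mixing matrix, and the seed are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 500 samples of a 3-dimensional vector with correlated columns.
z = rng.normal(size=(500, 3))
samples = z @ np.array([[1.0, 0.5, 0.0],
                        [0.0, 1.0, 0.3],
                        [0.0, 0.0, 1.0]])

mu = samples.mean(axis=0)
centered = samples - mu

# Sample version of E[(X - mu)(X - mu)^T], with the unbiased 1/(N-1) factor.
sigma = centered.T @ centered / (len(samples) - 1)

# Correlation matrix: divide each entry by the two standard deviations.
std = np.sqrt(np.diag(sigma))
corr = sigma / np.outer(std, std)

eigenvalues = np.linalg.eigvalsh(sigma)   # non-negative for a PSD matrix
print(sigma, corr, eigenvalues, sep="\n")
```

The result matches `np.cov(samples, rowvar=False)`, which applies the same 1/(N−1) normalization by default.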

Important caveat: Zero correlation does not imply independence. Consider X ~ Uniform(−1, 1) and Y = X². Then Corr[X, Y] = 0, yet Y is completely determined by X. Correlation only measures linear dependence. Independence is a much stronger condition.

Corr[X,Y]	Interpretation
+1	Perfect positive linear relationship (Y = aX + b, a > 0)
0	No linear relationship (may still be dependent!)
−1	Perfect negative linear relationship (Y = aX + b, a < 0)
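
The X and Y = X² example can be checked numerically. In this sketch a symmetric, evenly spaced grid stands in for X ~ Uniform(−1, 1), which is an assumption made so that the result is deterministic rather than dependent on random draws.

```python
import numpy as np

# Deterministic, symmetric grid standing in for X ~ Uniform(-1, 1).
x = np.linspace(-1.0, 1.0, 2001)
y = x ** 2                                   # Y is a deterministic function of X

corr_xy = np.corrcoef(x, y)[0, 1]            # linear correlation, essentially 0
corr_absx_y = np.corrcoef(np.abs(x), y)[0, 1]  # large once the symmetry is removed

print(corr_xy, corr_absx_y)
```

The first correlation vanishes because positive and negative values of X contribute equal and opposite terms; the second shows that the dependence was there all along, just not in linear form.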

Check: If Corr[X, Y] = 0, are X and Y necessarily independent?

• Yes: zero correlation always implies independence

• No: zero correlation means no linear relationship, but nonlinear dependence can still exist

• Only if both are Gaussian

Chapter 7: The Gaussian

If you learn only one distribution in your life, make it the Gaussian. It appears everywhere in ML: linear regression assumes Gaussian noise, the central limit theorem says averages are approximately Gaussian, variational autoencoders use Gaussian latent spaces, and Gaussian processes are built entirely from it.

The univariate Gaussian (or Normal) distribution has two parameters — mean μ and variance σ²:

N(x | μ, σ²) = (2πσ²)^(−1/2) exp(−(x − μ)² / (2σ²))

The bell curve. Centered at μ, width controlled by σ. About 68% of probability mass lies within one standard deviation of the mean, 95% within two, 99.7% within three.

The multivariate Gaussian generalizes to vectors X ∈ R^D with mean vector μ ∈ R^D and covariance matrix Σ ∈ R^(D×D):

N(x | μ, Σ) = (2π)^(−D/2) |Σ|^(−1/2) exp(−½(x − μ)^T Σ⁻¹(x − μ))

The quadratic form: The exponent (x − μ)^T Σ⁻¹(x − μ) is the squared Mahalanobis distance from x to μ, that is, distance weighted by the inverse covariance. Contours of constant density are ellipses whose axes point along the eigenvectors of Σ and whose half-lengths are proportional to the square roots of the eigenvalues. Everything from Chapter 4 (eigendecomposition) shows up here.
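
A direct way to evaluate the multivariate density is to work in log space. The function below is a minimal sketch, not a library API: `mvn_logpdf` is a hypothetical helper, and the values ρ = 0.5, σ₁ = 1.0, σ₂ = 0.7 are illustrative.

```python
import numpy as np

def mvn_logpdf(x, mu, sigma):
    """Log density of N(mu, sigma) at a single point x."""
    d = len(mu)
    diff = x - mu
    # Solve sigma @ v = diff rather than forming the inverse explicitly.
    maha_sq = diff @ np.linalg.solve(sigma, diff)   # squared Mahalanobis distance
    _, logdet = np.linalg.slogdet(sigma)
    return -0.5 * (d * np.log(2 * np.pi) + logdet + maha_sq)

mu = np.array([0.0, 0.0])
rho, s1, s2 = 0.5, 1.0, 0.7                          # illustrative values
sigma = np.array([[s1**2,         rho * s1 * s2],
                  [rho * s1 * s2, s2**2        ]])

# Contour geometry: eigenvectors give the axis directions, and square roots
# of the eigenvalues give the half-lengths of the 1-sigma ellipse.
eigvals, eigvecs = np.linalg.eigh(sigma)
half_lengths = np.sqrt(eigvals)

peak_density = np.exp(mvn_logpdf(mu, mu, sigma))
print(peak_density, half_lengths)
```

Using `solve` and `slogdet` avoids an explicit matrix inverse and determinant, which tends to be more stable numerically when Σ is poorly conditioned.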

2D Gaussian Contours

Adjust the mean, variances, and correlation to see how the Gaussian's shape changes. The ellipses are contours of constant density (1σ and 2σ).

Why Gaussians are special:

• Maximum entropy. Among all distributions with a given mean and variance, the Gaussian has the highest entropy — it makes the fewest assumptions.

• Closure under linear transformations. If X ~ N(μ, Σ) and Y = AX + b, then Y ~ N(Aμ + b, AΣA^T). Gaussians beget Gaussians.

• Closure under conditioning and marginalization. Marginals and conditionals of Gaussians are also Gaussian (Chapter 8).

• Central Limit Theorem. Suitably normalized sums of many independent, identically distributed variables with finite variance converge in distribution to a Gaussian, regardless of the original distribution.
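
The closure under linear transformations listed above can be checked empirically. This sketch draws samples with NumPy and compares the sample mean and covariance of Y = AX + b against Aμ + b and AΣA^T; the particular μ, Σ, A, b, seed, and sample size are assumptions for illustration. It checks the first two moments only, not Gaussianity itself.

```python
import numpy as np

rng = np.random.default_rng(0)

mu = np.array([1.0, -2.0])
sigma = np.array([[2.0, 0.6],
                  [0.6, 1.0]])
A = np.array([[1.0, 2.0],
              [3.0, -1.0]])
b = np.array([0.5, -1.0])

x = rng.multivariate_normal(mu, sigma, size=200_000)
y = x @ A.T + b                      # apply Y = A X + b to every sample

pred_mean = A @ mu + b               # predicted mean of Y
pred_cov = A @ sigma @ A.T           # predicted covariance of Y

emp_mean = y.mean(axis=0)
emp_cov = np.cov(y, rowvar=False)

print(pred_mean, emp_mean)
print(pred_cov, emp_cov, sep="\n")
```

The empirical values should agree with the predictions up to Monte Carlo error, which shrinks roughly as one over the square root of the sample size.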

Check: If X ~ N(μ, Σ) and Y = AX + b, what distribution does Y follow?

• N(Aμ + b, AΣA^T): still Gaussian

• It depends on A

• A uniform distribution

Chapter 8: Gaussian Conditioning

One of the most powerful properties of the Gaussian is that conditioning and marginalization produce new Gaussians with closed-form parameters. This is why Gaussians dominate Bayesian ML — you can do exact inference without approximation.

Consider a joint Gaussian over two sub-vectors X and Y:

p(X, Y) = N( [μ_X; μ_Y], [Σ_XX, Σ_XY; Σ_YX, Σ_YY] )

Marginalization: p(X) = N(X | μ_X, Σ_XX). Just read off the mean and covariance of X from the joint. The cross-covariance Σ_XY is irrelevant.

Conditioning: p(X | Y = y) is Gaussian with:

μ_X|Y = μ_X + Σ_XY Σ_YY⁻¹ (y − μ_Y)

Σ_X|Y = Σ_XX − Σ_XY Σ_YY⁻¹ Σ_YX

Read the conditional mean formula: Start at your prior mean μ_X. Observe that Y deviates from its mean by (y − μ_Y). Multiply by the "regression coefficient" Σ_XYΣ_YY⁻¹ to translate Y's deviation into X's space. This is exactly linear regression — the conditional mean is a linear function of the observed value y.

Key insight on the conditional covariance: Notice that Σ_X|Y does not depend on the observed value y. The uncertainty reduction from conditioning is the same regardless of what value Y takes. Conditioning never increases uncertainty: the subtracted term Σ_XYΣ_YY⁻¹Σ_YX is positive semi-definite, and the result Σ_X|Y is the Schur complement of Σ_YY in the joint covariance.
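
The two conditioning formulas translate almost line for line into NumPy. The function `condition_gaussian` and the block values below are hypothetical, chosen only so that the joint covariance is valid; this is a sketch rather than a hardened implementation.

```python
import numpy as np

def condition_gaussian(mu_x, mu_y, s_xx, s_xy, s_yy, y):
    """Parameters of p(X | Y = y) for a joint Gaussian over (X, Y)."""
    # gain = S_xy S_yy^{-1}, computed with a linear solve (S_yy is symmetric).
    gain = np.linalg.solve(s_yy, s_xy.T).T
    mu_cond = mu_x + gain @ (y - mu_y)
    s_cond = s_xx - gain @ s_xy.T          # uses S_yx = S_xy^T
    return mu_cond, s_cond

# Illustrative joint: X is 2-dimensional, Y is 1-dimensional.
mu_x, mu_y = np.array([0.0, 1.0]), np.array([2.0])
s_xx = np.array([[1.0, 0.3],
                 [0.3, 2.0]])
s_xy = np.array([[0.5],
                 [0.8]])
s_yy = np.array([[1.5]])

m1, c1 = condition_gaussian(mu_x, mu_y, s_xx, s_xy, s_yy, np.array([3.0]))
m2, c2 = condition_gaussian(mu_x, mu_y, s_xx, s_xy, s_yy, np.array([-4.0]))

print(m1, m2)     # conditional means differ with the observation
print(c1, c2)     # conditional covariances are identical
```

Two different observations move the conditional mean in different directions, yet leave the conditional covariance unchanged, and Σ_XX minus that covariance is positive semi-definite.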

These formulas are the engine behind:

Application	How it uses Gaussian conditioning
Bayesian linear regression	Posterior over weights is Gaussian conditioned on data
Kalman filter	Update step is Gaussian conditioning
Gaussian processes	Predictions are conditional Gaussians
Factor analysis / PPCA	Latent variable inference via conditioning

Check: Does the conditional covariance Σ_X|Y depend on the observed value y?

• Yes: different observations lead to different uncertainties

• No: the uncertainty reduction is the same regardless of what value y takes

• Only for 1D Gaussians

Chapter 9: Conjugate Priors

Bayes' theorem says posterior ∝ likelihood × prior. For most likelihood-prior combinations, the resulting posterior has no closed form — you need MCMC or variational methods. But for certain special pairings, the posterior belongs to the same family as the prior. These are called conjugate priors.

Conjugacy means convenience: If your prior is Beta and your likelihood is Binomial, the posterior is Beta. If your prior is Gaussian and your likelihood is Gaussian, the posterior is Gaussian. No integrals, no approximation — just update the parameters.

We already saw one example in the coin flipper (Chapter 4): the Beta distribution is conjugate to the Binomial likelihood. Here is the update:

Prior: θ ~ Beta(α, β)

Likelihood: h heads in N flips ~ Binomial(N, θ)

Posterior: θ | data ~ Beta(α + h, β + N − h)

The prior parameters α and β act like "pseudo-counts" — imagine you have already seen α − 1 heads and β − 1 tails before any data arrives. The data adds real counts on top. As N → ∞, the data dominates and the prior becomes irrelevant.

Likelihood	Conjugate Prior	Posterior
Binomial / Bernoulli	Beta	Beta
Gaussian (known σ)	Gaussian	Gaussian
Gaussian (known μ)	Inverse-Gamma	Inverse-Gamma
Multinomial	Dirichlet	Dirichlet
Poisson	Gamma	Gamma

Key insight: Conjugacy is not just a mathematical curiosity. It means you can process data sequentially: yesterday's posterior becomes today's prior, and the update is just parameter addition. This is how online Bayesian learning works — and it is exactly what the Kalman filter does for continuous states.

For the Gaussian with known variance σ², the conjugate prior on the mean μ is also Gaussian:

Prior: μ ~ N(m₀, s₀²)

Posterior: μ | x₁, ..., x_N ~ N(m_N, s_N²)

s_N² = (1/s₀² + N/σ²)⁻¹

m_N = s_N²(m₀/s₀² + N x̄/σ²)

The posterior precision (inverse variance) is the sum of prior precision and data precision. The posterior mean is a precision-weighted average of the prior mean and the data mean. More data = more precision = tighter posterior.
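
The Gaussian update for the mean, and the claim that sequential processing gives the same answer as a batch update, can be sketched in plain Python. The data values, the noise variance, and the prior below are made up for illustration.

```python
def gaussian_mean_posterior(m0, s0_sq, data, sigma_sq):
    """Posterior N(m_N, s_N^2) over the mean, with known noise variance sigma_sq."""
    n = len(data)
    xbar = sum(data) / n
    precision = 1.0 / s0_sq + n / sigma_sq      # prior precision + data precision
    s_n_sq = 1.0 / precision
    m_n = s_n_sq * (m0 / s0_sq + n * xbar / sigma_sq)
    return m_n, s_n_sq

data = [4.8, 5.3, 5.1, 4.6, 5.4]    # made-up observations
sigma_sq = 0.25                     # assumed known noise variance
m0, s0_sq = 0.0, 10.0               # broad prior on the mean

# Batch: all five points at once.
m_batch, s_batch = gaussian_mean_posterior(m0, s0_sq, data, sigma_sq)

# Sequential: each posterior becomes the prior for the next point.
m_seq, s_seq = m0, s0_sq
for x in data:
    m_seq, s_seq = gaussian_mean_posterior(m_seq, s_seq, [x], sigma_sq)

print(m_batch, s_batch)
print(m_seq, s_seq)
```

Both routes land on the same posterior up to floating-point rounding, and the posterior mean lies between the prior mean and the sample mean.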

Check: What does "conjugate prior" mean?

• The prior that maximizes the likelihood

• The prior that is always uniform

• A prior that, when combined with a given likelihood, produces a posterior in the same family

Chapter 10: Exponential Family

We have seen several distributions: Gaussian, Bernoulli, Beta, Gamma, Poisson. They look very different, but most of them share a common algebraic structure. The exponential family unifies them.

A distribution belongs to the exponential family if its density can be written as:

p(x | η) = h(x) exp(η^T T(x) − A(η))

Symbol	Name	Role
η	Natural parameter	Controls the distribution shape
T(x)	Sufficient statistic	Summarizes data — no information lost
A(η)	Log-partition function	Ensures the distribution integrates to 1
h(x)	Base measure	Scaling factor independent of η

Why this matters for ML: If your model is in the exponential family, you get three powerful gifts. (1) Sufficient statistics: T(x) captures all the information the data carries about η, so you can throw away the raw data and keep only the accumulated T(x). For a Gaussian, T(x) = (x, x²); for a Bernoulli, T(x) = x. (2) A conjugate prior family always exists. (3) The log-likelihood is concave in η, so maximum likelihood reduces to moment matching: choose η so that E[T(x)] equals the average of T(x) over the data.

The log-partition function A(η) is deceptively useful. Its derivatives give you moments:

dA/dη = E[T(x)] (the mean of the sufficient statistic)

d²A/dη² = Var[T(x)] (its variance)

Example (Bernoulli): P(x | p) = p^x (1−p)^(1−x). Let η = log(p/(1−p)) (the log-odds). Then T(x) = x, h(x) = 1, and A(η) = log(1 + exp(η)), which is the softplus function! The sigmoid function p = 1/(1+exp(−η)) maps natural parameters back to probabilities. Logistic regression is just fitting natural parameters of a Bernoulli exponential family model.
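
The Bernoulli example can be verified numerically, including the claim that derivatives of A(η) give moments. The sketch below uses finite differences; p = 0.3 and the step size `eps` are arbitrary choices for illustration.

```python
import math

def log_partition(eta):
    """A(eta) = log(1 + exp(eta)), the softplus function."""
    return math.log1p(math.exp(eta))

def sigmoid(eta):
    return 1.0 / (1.0 + math.exp(-eta))

def bernoulli_expfam(x, eta):
    """p(x | eta) = h(x) exp(eta * T(x) - A(eta)) with h(x) = 1 and T(x) = x."""
    return math.exp(eta * x - log_partition(eta))

p = 0.3
eta = math.log(p / (1 - p))          # natural parameter (log-odds)

# Finite-difference derivatives of A at eta.
eps = 1e-4
dA = (log_partition(eta + eps) - log_partition(eta - eps)) / (2 * eps)
d2A = (log_partition(eta + eps) - 2 * log_partition(eta)
       + log_partition(eta - eps)) / eps ** 2

print(bernoulli_expfam(1, eta), bernoulli_expfam(0, eta))   # should be p and 1 - p
print(dA, d2A)                                              # should be near p and p(1 - p)
```

The first derivative recovers the mean p (which is also sigmoid(η)), and the second derivative recovers the Bernoulli variance p(1 − p).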

Generalized linear models (GLMs) extend linear regression by using any exponential family distribution for the response variable. The natural parameter is a linear function of the inputs: η = w^Tx. This gives you logistic regression (Bernoulli), Poisson regression (Poisson), and classical linear regression (Gaussian) as special cases of the same framework.

Check: What is the sufficient statistic for a Gaussian distribution?

• Just x

• (x, x²): you need both to recover the mean and variance

• The sample median

Chapter 11: Change of Variables

If X has a known distribution and Y = g(X) for some function g, what is the distribution of Y? This is the change of variables problem, and the answer involves the Jacobian from Chapter 5.

For a strictly monotonic, differentiable function g with inverse g⁻¹, the density of Y = g(X) is:

f_Y(y) = f_X(g⁻¹(y)) · |dg⁻¹/dy|

The absolute derivative |dg⁻¹/dy| is the Jacobian factor. It accounts for the "stretching" or "compressing" that g does to the probability mass. If g spreads out a region, the density decreases there; if g compresses, the density increases.

Key insight: Probability mass is conserved. If you transform a random variable, the total probability must still equal 1. The Jacobian is the correction factor that ensures this. It is the "exchange rate" between the old and new coordinate systems.
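
As a worked instance of the univariate formula, take X ~ N(0, 1) and Y = exp(X), so that g⁻¹(y) = log y and |dg⁻¹/dy| = 1/y. The sketch below checks numerically that the transformed density conserves mass; the grid bounds and the test point 2.0 are assumptions, and the integrals are plain Riemann sums.

```python
import math
import numpy as np

def normal_pdf(x):
    return np.exp(-0.5 * x ** 2) / np.sqrt(2 * np.pi)

def density_of_exp_x(y):
    """Density of Y = exp(X), X ~ N(0, 1): f_X(log y) * |d(log y)/dy|."""
    return normal_pdf(np.log(y)) / y

y = np.linspace(1e-6, 50.0, 1_000_001)
dy = y[1] - y[0]

# Check 1: the transformed density still integrates to about 1.
total = np.sum(density_of_exp_x(y)) * dy

# Check 2: P(Y <= 2) from the new density equals P(X <= log 2).
p_formula = np.sum(density_of_exp_x(y[y <= 2.0])) * dy
p_exact = 0.5 * (1.0 + math.erf(math.log(2.0) / math.sqrt(2.0)))

print(total, p_formula, p_exact)
```

Dropping the 1/y factor would break both checks, which is one way to see that the Jacobian term is what keeps total probability equal to 1.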

For multivariate transformations Y = g(X) where X ∈ R^D:

f_Y(y) = f_X(g⁻¹(y)) · |det(J_g⁻¹(y))|

where J_g⁻¹ is the Jacobian matrix of the inverse mapping. The absolute determinant measures how volumes change under the transformation.

This is the foundation of normalizing flows. A normalizing flow transforms a simple distribution (like a standard Gaussian) through a chain of invertible transformations Y = g_K ∘ … ∘ g₁(X). At each step, the log-density changes by the log-absolute-determinant of the Jacobian. By choosing transformations with tractable Jacobians, you can model arbitrarily complex distributions while still computing exact likelihoods.
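
A toy one-dimensional flow shows the bookkeeping. The two layers below (an affine map followed by arcsinh) are arbitrary illustrative choices, not taken from any flow library, and nothing is learned here; the sketch only shows how the log-density is assembled by inverting the layers and subtracting each log-derivative.

```python
import numpy as np

# Each layer: (forward map, inverse map, log |d forward / d input|).
layers = [
    (lambda x: 2.0 * x + 1.0,
     lambda y: (y - 1.0) / 2.0,
     lambda x: np.full_like(x, np.log(2.0))),
    (np.arcsinh,
     np.sinh,
     lambda u: -0.5 * np.log1p(u ** 2)),     # d/du arcsinh(u) = 1 / sqrt(1 + u^2)
]

def flow_forward(x):
    """Push base samples x through every layer in order."""
    for forward, _, _ in layers:
        x = forward(x)
    return x

def flow_log_density(y):
    """log f_Y(y) for Y = g_2(g_1(X)) with X ~ N(0, 1)."""
    log_det = np.zeros_like(y)
    z = y
    for _, inverse, log_abs_deriv in reversed(layers):
        z = inverse(z)                  # undo this layer
        log_det += log_abs_deriv(z)     # derivative evaluated at the layer's input
    log_base = -0.5 * z ** 2 - 0.5 * np.log(2 * np.pi)
    return log_base - log_det

y = np.linspace(-6.0, 6.0, 120_001)
dy = y[1] - y[0]
total = np.sum(np.exp(flow_log_density(y))) * dy   # should be close to 1
print(total)
```

In higher dimensions the scalar log-derivatives become log-absolute-determinants of Jacobian matrices, which is why practical flows are built from layers whose determinants are cheap to compute.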

Change of Variables: Stretching Densities

Start with X ~ N(0, 1). Apply Y = g(X). Watch how the density transforms. The Jacobian correction preserves total probability mass.

The reparameterization trick, used in VAEs, is also a change of variables. Instead of sampling Z ~ q(z | x) directly, you write Z = μ(x) + σ(x) · ε where ε ~ N(0, 1). The randomness is moved to the fixed distribution of ε, so the sample Z becomes a differentiable function of μ and σ, and the gradients ∂/∂μ and ∂/∂σ are well-defined. This is possible because the Gaussian is closed under affine transformations.
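
Here is a small NumPy sketch of the trick on a toy objective, E[Z²] with Z ~ N(μ, σ²), whose exact value μ² + σ² gives gradients 2μ and 2σ to compare against. The objective, the parameter values, and the sample size are assumptions for illustration; a VAE would use automatic differentiation rather than the hand-written chain rule shown here.

```python
import numpy as np

rng = np.random.default_rng(0)

mu, sigma = 1.5, 0.8                  # illustrative parameter values
eps = rng.normal(size=400_000)        # noise from the fixed base distribution
z = mu + sigma * eps                  # reparameterized samples of N(mu, sigma^2)

# Objective: E[f(Z)] with f(z) = z^2.
# Chain rule through z = mu + sigma * eps:
#   d f / d mu    = f'(z) * 1
#   d f / d sigma = f'(z) * eps
f_prime = 2.0 * z
grad_mu = np.mean(f_prime)            # Monte Carlo estimate of 2 * mu
grad_sigma = np.mean(f_prime * eps)   # Monte Carlo estimate of 2 * sigma

print(grad_mu, 2 * mu)
print(grad_sigma, 2 * sigma)
```

Because ε does not depend on μ or σ, the derivative can be taken inside the expectation and estimated by averaging over samples.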

Check: What is the role of the Jacobian determinant in the change of variables formula?

• It corrects for how the transformation stretches or compresses probability mass

• It makes the transformation invertible

• It computes the mean of the transformed variable