Deisenroth et al., Chapter 6

Probability & Distributions

The language machines use to reason under uncertainty — from axioms to Gaussians to Bayes.

Prerequisites: Chapters 2–5 (linear algebra, calculus). Comfort with matrices and partial derivatives.
12 chapters · 5 simulations · 12 quizzes

Chapter 0: Why Probability?

You have a sensor. It tells you the temperature is 22.3°C. Do you trust it completely? Of course not — every measurement has noise. Your model predicts rain tomorrow with some confidence, but the atmosphere is chaotic. A neural network classifies an image as "cat" — but how sure is it?

Probability is the mathematical language for reasoning about uncertainty. In machine learning, uncertainty is everywhere: noisy data, incomplete observations, model mismatch, finite training sets. Probability gives us a principled framework to quantify, propagate, and reduce uncertainty.

The core idea: Probability is not just about flipping coins. It is the calculus of belief — a way to assign numbers to how plausible different outcomes are, and to update those numbers as evidence arrives. Every ML model is, at its heart, a probabilistic statement about data.

This chapter builds the probabilistic toolkit that the rest of the book depends on. We start from axioms, build up to distributions, and culminate in two workhorses of ML: the Gaussian distribution and Bayes' theorem.

Foundations: Probability spaces, axioms, random variables
Rules: Sum rule, product rule, Bayes' theorem
Statistics: Mean, variance, covariance, correlation
Distributions: Gaussian, conjugate priors, exponential family

| Concept | Why it matters for ML |
| --- | --- |
| Bayes' theorem | Posterior inference, updating beliefs with data |
| Gaussian distribution | Linear regression, GP, Kalman filters, VAEs |
| Conjugate priors | Closed-form Bayesian updates |
| Exponential family | Unifies most common distributions, enables GLMs |
| Change of variables | Normalizing flows, reparameterization trick |

Think of probability as the operating system of ML. Optimization (Chapter 7) tells you how to learn. Probability tells you what to learn and how certain you should be about it.

Check: Why is probability essential to machine learning?

Chapter 1: Probability Spaces

Before we can compute with probability, we need to define what probability is. The formal machinery has three parts, collectively called a probability space (Ω, A, P).

The sample space Ω is the set of all possible outcomes. Roll a die: Ω = {1, 2, 3, 4, 5, 6}. Measure a person's height: Ω = (0, ∞). The sample space can be finite, countably infinite, or uncountably infinite.

The event space A is a collection of subsets of Ω that we assign probabilities to. For a die, A includes events like "roll an even number" = {2, 4, 6}. Technically, A must be a σ-algebra — closed under complements and countable unions — but the key intuition is: A is the set of questions we can ask about outcomes.

The probability measure P assigns a number in [0, 1] to each event in A, following three axioms:

1.  P(E) ≥ 0  for every event  E ∈ A
2.  P(Ω) = 1
3.  P(E_1 ∪ E_2 ∪ …) = P(E_1) + P(E_2) + …  if the events E_i are pairwise disjoint

Key insight: These three axioms are all you need. Every theorem in probability theory, including Bayes' theorem, the law of large numbers, and the central limit theorem, follows from these three rules. They play the role for uncertainty that Euclid's axioms play for geometry.

A random variable X is a function from the sample space to the real numbers: X: Ω → R. It maps outcomes to numbers we can compute with. "The number showing on the die" is a random variable. "The height of a randomly chosen person" is a random variable. We describe random variables through their probability distribution — a rule that assigns probabilities to the values X can take.

| Component | Symbol | Example (die roll) |
| --- | --- | --- |
| Sample space | Ω | {1, 2, 3, 4, 5, 6} |
| Event space | A | All subsets of Ω |
| Probability measure | P | P({k}) = 1/6 for each k |
| Random variable | X | X(ω) = ω (the face value) |

Two interpretations: The frequentist view says P(A) is the long-run frequency of A in repeated experiments. The Bayesian view says P(A) is your degree of belief that A is true. The math is identical; only the interpretation differs. ML uses both perspectives freely.
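The die-roll probability space and the frequentist reading can be sketched in a few lines of Python. This is a minimal illustration using only the standard library; the names `omega`, `prob`, and `event_space` are chosen here for the example and are not from the book.

```python
from fractions import Fraction
from itertools import chain, combinations
import random

# Sample space for one roll of a fair die.
omega = frozenset({1, 2, 3, 4, 5, 6})

def prob(event):
    """Probability measure P: each outcome carries mass 1/6."""
    return Fraction(len(event & omega), len(omega))

# Event space: here, every subset of omega (the power set, 64 events).
event_space = [frozenset(s) for s in chain.from_iterable(
    combinations(sorted(omega), k) for k in range(len(omega) + 1))]

even = frozenset({2, 4, 6})
low = frozenset({1})

# The three axioms: non-negativity, normalization, additivity for disjoint events.
assert all(prob(e) >= 0 for e in event_space)
assert prob(omega) == 1
assert prob(even | low) == prob(even) + prob(low)

# Frequentist reading: the relative frequency of "even" approaches P(even).
rng = random.Random(0)  # fixed seed, an arbitrary choice
rolls = [rng.randint(1, 6) for _ in range(100_000)]
freq_even = sum(r in even for r in rolls) / len(rolls)
print(prob(even), round(freq_even, 3))
```

Exact fractions keep the axiom checks free of floating-point noise; the simulated frequency only approximates P(even) and will wobble with the seed.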
Check: What are the three components of a probability space?

Chapter 2: Discrete vs Continuous

Random variables come in two fundamental flavors, and they are described differently.

A discrete random variable X takes values from a countable set. We describe it with a probability mass function (PMF): P(X = x) gives the probability of each specific value. The PMF must be non-negative and sum to 1.

P(X = xi) ≥ 0  and  ∑i P(X = xi) = 1

Example: a fair coin has PMF P(H) = 0.5, P(T) = 0.5. A loaded die might have P(6) = 0.5 and P(k) = 0.1 for k = 1, …, 5.

A continuous random variable X takes values in an uncountable set (typically an interval in R). We describe it with a probability density function (PDF) f(x). The crucial difference: f(x) is not a probability. It is a density — probability per unit length. You get probabilities by integrating:

P(a ≤ X ≤ b) = ∫_a^b f(x) dx

Key insight: For continuous variables, the probability of any single exact value is zero: P(X = 3.14159...) = 0. Probability only lives in intervals. This is why we need densities rather than mass functions. The density can exceed 1; what matters is that it integrates to 1 over its domain.

The cumulative distribution function (CDF) unifies both cases: F(x) = P(X ≤ x). For discrete X, F is a staircase. For continuous X, F is a smooth curve rising from 0 to 1. The CDF always exists, even when the PMF or PDF is awkward to write down.
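To see the PMF and PDF conditions side by side, here is a small numerical sketch. It reuses the loaded die from the example above and a narrow Gaussian with σ = 0.1, which is a choice made for this illustration; the midpoint-rule `integrate` helper is likewise just a convenient assumption, not a recommended integrator.

```python
import math

def gaussian_pdf(x, mu=0.0, sigma=1.0):
    """Density of N(mu, sigma^2) at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def integrate(f, a, b, n=20_000):
    """Midpoint-rule approximation of the integral of f over [a, b]."""
    h = (b - a) / n
    return h * sum(f(a + (i + 0.5) * h) for i in range(n))

# Discrete: the loaded die from the text. The masses sum to 1.
loaded_die = {1: 0.1, 2: 0.1, 3: 0.1, 4: 0.1, 5: 0.1, 6: 0.5}
total_mass = sum(loaded_die.values())

# Continuous: a narrow Gaussian (sigma = 0.1) has a peak density above 1 ...
def narrow(x):
    return gaussian_pdf(x, mu=0.0, sigma=0.1)

peak = narrow(0.0)
# ... yet its total area is still 1 (integrating over +/- 10 sigma).
area = integrate(narrow, -1.0, 1.0)
# The probability of an interval is the area under the density over it.
p_interval = integrate(narrow, -0.1, 0.1)

print(total_mass, peak, area, p_interval)
```

The peak value is a density, not a probability, so a value near 4 is no contradiction: the curve is tall but narrow.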

PMF vs PDF

Left: a discrete distribution (Binomial with n=10, p=0.5). Right: a continuous distribution (Gaussian with μ=0, σ=1). Notice how the PMF gives bar heights that sum to 1, while the PDF gives a curve whose area integrates to 1.

| Property | Discrete (PMF) | Continuous (PDF) |
| --- | --- | --- |
| Values | Countable set | Uncountable (interval) |
| Assigns | Probabilities to points | Densities (probability per unit length) |
| Sum/Integral | ∑ P(x) = 1 | ∫ f(x) dx = 1 |
| P(X = x) | Can be > 0 | Always 0 |
| Value of P(x) or f(x) | ≤ 1 | Can be > 1 |

Check: Can a probability density function f(x) exceed 1?

Chapter 3: Sum Rule & Product Rule

Two rules govern how probabilities compose. Master them, and everything else in this chapter becomes a consequence.

Given two random variables X and Y with joint distribution p(X, Y), the product rule decomposes the joint into a conditional and a marginal:

p(X, Y) = p(Y | X) · p(X)

Read this as: the probability of seeing X and Y together equals the probability of X times the probability of Y given X. The conditional p(Y | X) captures what you learn about Y once you know X.

The sum rule (or marginalization) recovers a marginal distribution by summing (or integrating) out the other variable:

p(X) = ∑_Y p(X, Y)   (discrete)
p(X) = ∫ p(X, Y) dY   (continuous)

Why these two rules matter so much: Every probabilistic computation in ML (inference, learning, prediction) reduces to repeated application of the product rule (factoring joints) and the sum rule (integrating out variables). Bayesian inference, EM, variational methods, and message passing are all built from these two bricks.

Independence is the special case where conditioning tells you nothing: p(Y | X) = p(Y). In that case, the product rule simplifies to p(X, Y) = p(X) p(Y). Conditional independence X ⊥ Y | Z means p(X, Y | Z) = p(X | Z) p(Y | Z) — once you know Z, X and Y decouple. This is the foundation of graphical models.
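The sum rule, the product rule, and the independence test can all be read off a small joint table. The sketch below uses a made-up 2×2 joint distribution; the labels (`rain`, `umbrella`) and the numbers are hypothetical and only serve the illustration.

```python
# Joint distribution p(X, Y) as a nested dict: joint[x][y].
# The numbers are illustrative, not taken from the text.
joint = {
    "rain": {"umbrella": 0.27, "no_umbrella": 0.03},
    "dry":  {"umbrella": 0.07, "no_umbrella": 0.63},
}

def marginal_x(joint):
    """Sum rule: p(x) = sum over y of p(x, y)."""
    return {x: sum(row.values()) for x, row in joint.items()}

def marginal_y(joint):
    """Sum rule: p(y) = sum over x of p(x, y)."""
    out = {}
    for row in joint.values():
        for y, p in row.items():
            out[y] = out.get(y, 0.0) + p
    return out

def conditional_y_given_x(joint, x):
    """Product rule rearranged: p(y | x) = p(x, y) / p(x)."""
    px = sum(joint[x].values())
    return {y: p / px for y, p in joint[x].items()}

def is_independent(joint, tol=1e-9):
    """Independence holds iff p(x, y) = p(x) p(y) in every cell."""
    px, py = marginal_x(joint), marginal_y(joint)
    return all(abs(joint[x][y] - px[x] * py[y]) <= tol
               for x in joint for y in joint[x])

p_x = marginal_x(joint)
p_y_given_rain = conditional_y_given_x(joint, "rain")
print(p_x, p_y_given_rain, is_independent(joint))
```

In this table knowing X shifts the distribution of Y considerably, so the independence check fails, as it should.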

Joint, Marginal, and Conditional

A 2D joint distribution over two discrete variables. The right margin sums columns (marginal of X). The bottom margin sums rows (marginal of Y). Click a cell to see the conditional distribution.

| Rule | Formula | What it does |
| --- | --- | --- |
| Product | p(X, Y) = p(Y \| X) p(X) | Factorize a joint distribution |
| Sum | p(X) = ∑_Y p(X, Y) | Marginalize out a variable |
| Chain | p(X_1, ..., X_n) = ∏_i p(X_i \| X_1, ..., X_{i−1}) | Factor any joint into conditionals |

Check: What does the sum rule (marginalization) let you do?

Chapter 4: Bayes' Theorem

Bayes' theorem is arguably the most important equation in machine learning. It tells you how to update your beliefs when you see new data. Start with a prior belief, observe evidence, and get a posterior belief. The formula is a direct consequence of the product rule:

p(θ | D) = p(D | θ) · p(θ) ⁄ p(D)

Let's unpack each piece:

| Term | Name | Meaning |
| --- | --- | --- |
| p(θ) | Prior | What you believed about θ before seeing data |
| p(D \| θ) | Likelihood | How probable the data is if θ were true |
| p(θ \| D) | Posterior | Your updated belief after seeing data |
| p(D) | Evidence / Marginal likelihood | A normalizing constant = ∫ p(D \| θ) p(θ) dθ |

The Bayesian recipe: Prior × Likelihood → Posterior. That's it. Your old beliefs, multiplied by how well they explain the data, give you your new beliefs. More data means the posterior concentrates. A bad prior is eventually overwhelmed by the data, provided it assigns nonzero density to the true value. The evidence p(D) just makes the posterior integrate to 1.

Let's make this concrete. Suppose you find a coin and wonder if it is fair. Your prior belief about the bias θ (probability of heads) is a Beta distribution — maybe Beta(2, 2), meaning you think the coin is probably roughly fair but you're not sure. You flip the coin N times and observe h heads. The likelihood is Binomial. Bayes' theorem gives you a posterior that is also Beta — this is the magic of conjugacy (Chapter 9).

Prior: θ ~ Beta(α, β)
Likelihood: h | θ ~ Binomial(N, θ)
Posterior: θ | h ~ Beta(α + h, β + N − h)

The posterior is simply the prior with updated counts. Each head increments α; each tail increments β. The more flips, the more concentrated the posterior becomes. Try it below.

Bayes' Theorem: Coin Flipper

Set the prior Beta(α, β) with the sliders. Click Flip to generate a random coin flip (true bias = 0.6). Watch the posterior update live. Dashed = prior, dotted = likelihood, solid = posterior.

What to notice: With a weak prior (small α, β), the posterior quickly tracks the data. With a strong prior (large α, β), many flips are needed before the data overrides your prior. The posterior always concentrates toward the true bias as N grows. This is Bayesian consistency.

The evidence integral: The denominator p(D) = ∫ p(D | θ) p(θ) dθ is often intractable for complex models. This is why much of modern ML (variational inference, MCMC, normalizing flows) is dedicated to avoiding or approximating this integral. But conceptually, it is just a normalizing constant.
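For the coin example the evidence integral is one-dimensional, so you can approximate it on a grid and compare against the closed-form Beta posterior. The sketch below assumes a Beta(2, 2) prior and 7 heads in 10 flips; those numbers, the grid size, and the helper names are illustrative choices, not part of the original demo.

```python
import math

def beta_pdf(theta, a, b):
    """Density of Beta(a, b) at theta, using log-gamma for the normalizer."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(log_norm + (a - 1) * math.log(theta)
                    + (b - 1) * math.log(1 - theta))

def binomial_likelihood(theta, h, n):
    """Probability of h heads in n flips given bias theta."""
    return math.comb(n, h) * theta**h * (1 - theta)**(n - h)

alpha, beta, n_flips, heads = 2.0, 2.0, 10, 7   # illustrative numbers

# Midpoint grid on (0, 1); it never hits theta = 0 or theta = 1 exactly.
m = 10_000
grid = [(i + 0.5) / m for i in range(m)]
unnormalized = [binomial_likelihood(t, heads, n_flips) * beta_pdf(t, alpha, beta)
                for t in grid]

# Evidence p(D): the integral of likelihood x prior over theta.
evidence = sum(unnormalized) / m
posterior = [u / evidence for u in unnormalized]

# Compare with the conjugate result Beta(alpha + h, beta + N - h).
closed_form = [beta_pdf(t, alpha + heads, beta + n_flips - heads) for t in grid]
max_gap = max(abs(p - q) for p, q in zip(posterior, closed_form))
posterior_mean = sum(t * p for t, p in zip(grid, posterior)) / m
print(evidence, max_gap, posterior_mean)
```

Grid integration like this only works in very low dimensions, which is exactly why the approximate methods named above exist.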
Check: In Bayes' theorem, what does the likelihood p(D | θ) represent?

Chapter 5: Mean & Variance

Distributions can be complicated objects. Two numbers summarize the most important features: where the distribution is centered and how spread out it is.

The mean (or expected value) E[X] is the probability-weighted average of all values X can take:

E[X] = ∑x x · P(X = x)   (discrete)
E[X] = ∫ x · f(x) dx   (continuous)

Think of the mean as the "center of mass" of the distribution. If you placed weights proportional to the probability at each point on a number line, the mean is where the line balances.

The variance V[X] measures how far values typically deviate from the mean:

V[X] = E[(X − E[X])^2] = E[X^2] − (E[X])^2

The second form — "E of X-squared minus the square of E of X" — is usually easier to compute. The standard deviation σ = √V[X] has the same units as X, making it more interpretable.

Key insight: The mean and variance are expectations, that is, sums or integrals against the distribution. This is a general pattern: every moment of a distribution is the expected value of some function g(X). The mean uses g(X) = X. The variance uses g(X) = (X − μ)^2. Higher moments use g(X) = X^k.

Linearity of expectation is enormously useful: E[aX + b] = aE[X] + b. For two random variables X and Y, E[X + Y] = E[X] + E[Y] holds even if they are dependent. No independence needed. Variance is not linear: V[aX + b] = a^2 V[X] (the constant b disappears, a gets squared).
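These identities are easy to check numerically on a small PMF. The sketch below reuses the loaded die from Chapter 2; the helper names `expect`, `variance`, and `transform`, and the constants a = 3, b = 10, are assumptions made for this example.

```python
# Loaded die from Chapter 2: P(6) = 0.5, P(k) = 0.1 otherwise.
pmf = {1: 0.1, 2: 0.1, 3: 0.1, 4: 0.1, 5: 0.1, 6: 0.5}

def expect(pmf, g=lambda x: x):
    """E[g(X)] for a discrete PMF given as {value: probability}."""
    return sum(g(x) * p for x, p in pmf.items())

def variance(pmf):
    """V[X] via the definition E[(X - E[X])^2]."""
    mu = expect(pmf)
    return expect(pmf, lambda x: (x - mu) ** 2)

def transform(pmf, a, b):
    """PMF of aX + b (assumes a != 0 so the values stay distinct)."""
    return {a * x + b: p for x, p in pmf.items()}

mean = expect(pmf)
var_def = variance(pmf)
var_shortcut = expect(pmf, lambda x: x * x) - mean ** 2   # E[X^2] - (E[X])^2

a, b = 3, 10
scaled = transform(pmf, a, b)
print(mean, var_def, var_shortcut)
print(expect(scaled), a * mean + b)          # linearity of expectation
print(variance(scaled), a ** 2 * var_def)    # b drops out, a is squared
```

Both variance formulas agree up to floating-point rounding, and the transformed PMF confirms that the shift b affects the mean but not the variance.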

| Property | Mean | Variance |
| --- | --- | --- |
| Shift by b | E[X + b] = E[X] + b | V[X + b] = V[X] |
| Scale by a | E[aX] = aE[X] | V[aX] = a^2 V[X] |
| Sum (independent) | E[X + Y] = E[X] + E[Y] | V[X + Y] = V[X] + V[Y] |

Check: Why does V[X + b] = V[X], that is, why doesn't adding a constant change the variance?

Chapter 6: Covariance & Correlation

Mean and variance describe single variables. For pairs of variables, we need to capture how they move together. Enter covariance.

Cov[X, Y] = E[(X − E[X])(Y − E[Y])] = E[XY] − E[X]E[Y]

Positive covariance: when X is above its mean, Y tends to be above its mean too. Negative: they move in opposite directions. Zero: no linear relationship (but they might still be dependent nonlinearly).

The problem with covariance is that its magnitude depends on the units and scales of X and Y. Correlation normalizes this:

Corr[X, Y] = Cov[X, Y] / (σ_X σ_Y)   ∈ [−1, 1]

Correlation of +1 means perfect positive linear relationship. Correlation of −1 means perfect negative linear relationship. Correlation of 0 means no linear relationship.

Covariance is variance, generalized. Notice that Cov[X, X] = V[X]. The variance is just the covariance of a variable with itself. For vectors X = (X_1, ..., X_D), the covariance matrix Σ has entries Σ_ij = Cov[X_i, X_j]. Diagonal entries are variances; off-diagonals are covariances. This matrix is always symmetric and positive semi-definite.

For a random vector X ∈ R^D with mean μ = E[X]:

Σ = Cov[X] = E[(X − μ)(X − μ)^T]

This D×D matrix encodes all pairwise linear relationships. It is the multivariate generalization of variance, and it appears everywhere: Gaussian distributions, PCA, Kalman filters, Mahalanobis distance.

Important caveat: Zero correlation does not imply independence. Consider X ~ Uniform(−1, 1) and Y = X^2. Then Corr[X, Y] = 0, yet Y is completely determined by X. Correlation only measures linear dependence. Independence is a much stronger condition.

| Corr[X, Y] | Interpretation |
| --- | --- |
| +1 | Perfect positive linear relationship (Y = aX + b, a > 0) |
| 0 | No linear relationship (may still be dependent!) |
| −1 | Perfect negative linear relationship (Y = aX + b, a < 0) |
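The Y = X^2 caveat is easy to reproduce by simulation. The sketch below estimates the sample correlation for a quadratic and a linear relationship; the sample size, seed, and the linear map 2x + 1 are arbitrary choices made for this illustration.

```python
import math
import random

def correlation(xs, ys):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs) / n)
    sy = math.sqrt(sum((y - my) ** 2 for y in ys) / n)
    return cov / (sx * sy)

rng = random.Random(0)                      # fixed seed, an arbitrary choice
xs = [rng.uniform(-1.0, 1.0) for _ in range(100_000)]
ys_quadratic = [x * x for x in xs]          # fully determined by X
ys_linear = [2.0 * x + 1.0 for x in xs]     # perfect linear relationship

corr_quadratic = correlation(xs, ys_quadratic)
corr_linear = correlation(xs, ys_linear)
print(round(corr_quadratic, 3), round(corr_linear, 3))
```

The quadratic case lands near 0 (up to sampling noise) even though Y is a deterministic function of X, while the linear case sits at +1.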
Check: If Corr[X, Y] = 0, are X and Y necessarily independent?

Chapter 7: The Gaussian

If you learn only one distribution in your life, make it the Gaussian. It appears everywhere in ML: linear regression assumes Gaussian noise, the central limit theorem says averages are approximately Gaussian, variational autoencoders use Gaussian latent spaces, and Gaussian processes are built entirely from it.

The univariate Gaussian (or Normal) distribution has two parameters, the mean μ and the variance σ^2:

N(x | μ, σ^2) = (2πσ^2)^{−1/2} exp(−(x − μ)^2 / (2σ^2))

The bell curve. Centered at μ, width controlled by σ. About 68% of probability mass lies within one standard deviation of the mean, 95% within two, 99.7% within three.
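The 68 / 95 / 99.7 figures follow from the Gaussian CDF and can be checked with the error function from Python's standard library. The helper name `prob_within` is made up for this sketch.

```python
import math

def prob_within(k):
    """P(|X - mu| <= k * sigma) for any Gaussian; equals erf(k / sqrt(2))."""
    return math.erf(k / math.sqrt(2.0))

for k in (1, 2, 3):
    print(k, round(prob_within(k), 4))
# 1 0.6827
# 2 0.9545
# 3 0.9973
```

The result does not depend on μ or σ, because standardizing X removes both.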

The multivariate Gaussian generalizes to vectors X ∈ R^D with mean vector μ ∈ R^D and covariance matrix Σ ∈ R^{D×D}:

N(x | μ, Σ) = (2π)^{−D/2} |Σ|^{−1/2} exp(−½ (x − μ)^T Σ^{−1} (x − μ))

The quadratic form: The exponent (x − μ)^T Σ^{−1} (x − μ) is the squared Mahalanobis distance from x to μ, weighted by the inverse covariance. Contours of constant density are ellipses (ellipsoids for D > 2) whose axes point along the eigenvectors of Σ and whose semi-axis lengths are proportional to the square roots of the eigenvalues. Everything from Chapter 4 (eigendecomposition) shows up here.

2D Gaussian Contours

Adjust the mean, variances, and correlation to see how the Gaussian's shape changes. The ellipses are contours of constant density (1σ and 2σ).


Why Gaussians are special:

Maximum entropy. Among all distributions with a given mean and variance, the Gaussian has the highest entropy — it makes the fewest assumptions.

Closure under linear transformations. If X ~ N(μ, Σ) and Y = AX + b, then Y ~ N(Aμ + b, A Σ A^T). Gaussians beget Gaussians.

Closure under conditioning and marginalization. Marginals and conditionals of Gaussians are also Gaussian (Chapter 8).

Central Limit Theorem. Suitably normalized sums of many independent, identically distributed variables with finite variance converge to a Gaussian, regardless of the original distribution.
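Of these properties, closure under linear transformations is the most mechanical to compute: the new mean is Aμ + b and the new covariance is A Σ A^T. The sketch below works it out for a 2D example with plain lists; the matrices and the helper names are illustrative assumptions.

```python
def matmul(a, b):
    """Product of two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(a):
    """Transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*a)]

def affine_gaussian(mu, sigma, a, b):
    """Mean and covariance of Y = A X + b when X ~ N(mu, sigma)."""
    mu_col = [[m] for m in mu]
    new_mu = [row[0] + bi for row, bi in zip(matmul(a, mu_col), b)]
    new_sigma = matmul(matmul(a, sigma), transpose(a))
    return new_mu, new_sigma

# Illustrative parameters, not taken from the text.
mu = [1.0, 2.0]
sigma = [[1.0, 0.5],
         [0.5, 2.0]]
a = [[2.0, 0.0],      # Y1 = 2 * X1
     [1.0, 1.0]]      # Y2 = X1 + X2 - 1
b = [0.0, -1.0]

mu_y, sigma_y = affine_gaussian(mu, sigma, a, b)
print(mu_y)      # [2.0, 2.0]
print(sigma_y)   # [[4.0, 3.0], [3.0, 4.0]]
```

You can sanity-check the entries by hand: V[2 X1] = 4 · 1, and V[X1 + X2] = 1 + 2 + 2 · 0.5 = 4.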

Check: If X ~ N(μ, Σ) and Y = AX + b, what distribution does Y follow?

Chapter 8: Gaussian Conditioning

One of the most powerful properties of the Gaussian is that conditioning and marginalization produce new Gaussians with closed-form parameters. This is why Gaussians dominate Bayesian ML — you can do exact inference without approximation.

Consider a joint Gaussian over two sub-vectors X and Y:

p(X, Y) = N( [μ_X; μ_Y], [Σ_XX, Σ_XY; Σ_YX, Σ_YY] )   (semicolons separate block rows)

Marginalization: p(X) = N(X | μ_X, Σ_XX). Just read off the mean and covariance of X from the joint. The cross-covariance Σ_XY is irrelevant.

Conditioning: p(X | Y = y) is Gaussian with:

μ_{X|Y} = μ_X + Σ_XY Σ_YY^{−1} (y − μ_Y)
Σ_{X|Y} = Σ_XX − Σ_XY Σ_YY^{−1} Σ_YX
Read the conditional mean formula: Start at your prior mean μ_X. Observe that Y deviates from its mean by (y − μ_Y). Multiply by the "regression coefficient" Σ_XY Σ_YY^{−1} to translate Y's deviation into X's space. This is exactly linear regression: the conditional mean is a linear (affine) function of the observed value y.

Key insight on the conditional covariance: Notice that Σ_{X|Y} does not depend on the observed value y. The uncertainty reduction from conditioning is the same regardless of what value Y takes. Conditioning never increases uncertainty: the subtracted term Σ_XY Σ_YY^{−1} Σ_YX is positive semi-definite, and the result Σ_{X|Y} is the Schur complement of Σ_YY in the joint covariance.
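When X and Y are both scalars, the conditioning formulas reduce to ordinary arithmetic, which makes both insights easy to verify. The sketch below is that scalar special case; the function name and the joint parameters (unit variances, covariance 0.8) are assumptions chosen for the example.

```python
def condition_bivariate(mu_x, mu_y, s_xx, s_xy, s_yy, y_obs):
    """Mean and variance of X | Y = y_obs for a joint Gaussian over scalars.

    Scalar special case of the conditioning formulas:
      mean = mu_x + s_xy / s_yy * (y_obs - mu_y)
      var  = s_xx - s_xy**2 / s_yy
    """
    gain = s_xy / s_yy                     # the "regression coefficient"
    cond_mean = mu_x + gain * (y_obs - mu_y)
    cond_var = s_xx - gain * s_xy
    return cond_mean, cond_var

# Illustrative joint: zero means, unit variances, covariance 0.8.
params = dict(mu_x=0.0, mu_y=0.0, s_xx=1.0, s_xy=0.8, s_yy=1.0)

mean_a, var_a = condition_bivariate(**params, y_obs=1.0)
mean_b, var_b = condition_bivariate(**params, y_obs=-2.0)
print(mean_a, var_a)
print(mean_b, var_b)
```

The conditional mean moves linearly with the observation, while the conditional variance is the same for both observations and smaller than the prior variance of 1.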

These formulas are the engine behind:

| Application | How it uses Gaussian conditioning |
| --- | --- |
| Bayesian linear regression | Posterior over weights is Gaussian conditioned on data |
| Kalman filter | Update step is Gaussian conditioning |
| Gaussian processes | Predictions are conditional Gaussians |
| Factor analysis / PPCA | Latent variable inference via conditioning |

Check: Does the conditional covariance Σ_{X|Y} depend on the observed value y?

Chapter 9: Conjugate Priors

Bayes' theorem says posterior ∝ likelihood × prior. For most likelihood-prior combinations, the resulting posterior has no closed form — you need MCMC or variational methods. But for certain special pairings, the posterior belongs to the same family as the prior. These are called conjugate priors.

Conjugacy means convenience: If your prior is Beta and your likelihood is Binomial, the posterior is Beta. If your prior is Gaussian and your likelihood is Gaussian, the posterior is Gaussian. No integrals, no approximation — just update the parameters.

We already saw one example in the coin flipper (Chapter 4): the Beta distribution is conjugate to the Binomial likelihood. Here is the update:

Prior: θ ~ Beta(α, β)
Likelihood: h heads in N flips ~ Binomial(N, θ)
Posterior: θ | data ~ Beta(α + h, β + N − h)

The prior parameters α and β act like "pseudo-counts" — imagine you have already seen α − 1 heads and β − 1 tails before any data arrives. The data adds real counts on top. As N → ∞, the data dominates and the prior becomes irrelevant.

| Likelihood | Conjugate Prior | Posterior |
| --- | --- | --- |
| Binomial / Bernoulli | Beta | Beta |
| Gaussian (known σ) | Gaussian | Gaussian |
| Gaussian (known μ) | Inverse-Gamma | Inverse-Gamma |
| Multinomial | Dirichlet | Dirichlet |
| Poisson | Gamma | Gamma |

Key insight: Conjugacy is not just a mathematical curiosity. It means you can process data sequentially: yesterday's posterior becomes today's prior, and the update is just parameter addition. This is how online Bayesian learning works, and it is exactly what the Kalman filter does for continuous states.
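The sequential claim can be checked directly for the Beta-Binomial pair: updating one flip at a time must land on the same posterior as a single batch update. This is a minimal sketch with simulated flips; the seed, the number of flips, and the `update_beta` helper are assumptions for the example, and the true bias of 0.6 mirrors the coin-flipper demo.

```python
import random

def update_beta(alpha, beta, flips):
    """Conjugate update: each head (1) adds to alpha, each tail (0) to beta."""
    heads = sum(flips)
    return alpha + heads, beta + len(flips) - heads

rng = random.Random(1)                 # fixed seed, an arbitrary choice
true_bias = 0.6                        # same value as the coin-flipper demo
flips = [1 if rng.random() < true_bias else 0 for _ in range(200)]

# Batch: process all flips at once, starting from Beta(2, 2).
batch = update_beta(2, 2, flips)

# Sequential: yesterday's posterior is today's prior, one flip at a time.
sequential = (2, 2)
for flip in flips:
    sequential = update_beta(*sequential, [flip])

posterior_mean = batch[0] / (batch[0] + batch[1])
print(batch, sequential, round(posterior_mean, 3))
```

Because the update is plain addition of counts, the order and grouping of the observations cannot matter.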

For the Gaussian with known variance σ^2, the conjugate prior on the mean μ is also Gaussian:

Prior: μ ~ N(m_0, s_0^2)
Posterior: μ | x_1, ..., x_N ~ N(m_N, s_N^2)
s_N^2 = (1/s_0^2 + N/σ^2)^{−1}
m_N = s_N^2 (m_0/s_0^2 + N x̄/σ^2)

The posterior precision (inverse variance) is the sum of prior precision and data precision. The posterior mean is a precision-weighted average of the prior mean and the data mean. More data = more precision = tighter posterior.
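The precision-weighted update is short enough to write out directly. The sketch below follows the formulas for s_N^2 and m_N given above; the function name and the numbers (prior N(0, 1), unit noise variance, four observations) are illustrative assumptions.

```python
def gaussian_mean_posterior(m0, s0_sq, sigma_sq, data):
    """Posterior N(m_N, s_N^2) over mu for Gaussian data with known variance."""
    n = len(data)
    xbar = sum(data) / n
    precision = 1.0 / s0_sq + n / sigma_sq          # prior + data precision
    s_n_sq = 1.0 / precision
    m_n = s_n_sq * (m0 / s0_sq + n * xbar / sigma_sq)
    return m_n, s_n_sq

# Illustrative numbers: prior N(0, 1), noise variance 1, data averaging 2.
m_n, s_n_sq = gaussian_mean_posterior(m0=0.0, s0_sq=1.0, sigma_sq=1.0,
                                      data=[1.5, 2.5, 2.0, 2.0])
print(m_n, s_n_sq)
```

With four observations the data precision (4) outweighs the prior precision (1), so the posterior mean sits four fifths of the way from the prior mean 0 to the sample mean 2.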

Check: What does "conjugate prior" mean?

Chapter 10: Exponential Family

We have seen several distributions: Gaussian, Bernoulli, Beta, Gamma, Poisson. They look very different, but most of them share a common algebraic structure. The exponential family unifies them.

A distribution belongs to the exponential family if its density can be written as:

p(x | η) = h(x) exp(η^T T(x) − A(η))

| Symbol | Name | Role |
| --- | --- | --- |
| η | Natural parameter | Controls the distribution shape |
| T(x) | Sufficient statistic | Summarizes data; no information about η is lost |
| A(η) | Log-partition function | Ensures the distribution integrates to 1 |
| h(x) | Base measure | Scaling factor independent of η |

Why this matters for ML: If your model is in the exponential family, you get three powerful gifts. (1) Sufficient statistics: T(x) captures all the information in the data about the parameter, so you can throw away the raw data and keep only T(x). For a Gaussian, T(x) = (x, x^2); for a Bernoulli, T(x) = x. (2) Conjugate priors always exist. (3) Maximum likelihood has a unique solution related to matching moments.

The log-partition function A(η) is deceptively useful. Its derivatives give you moments:

dA/dη = E[T(x)]   (the mean of the sufficient statistic)
d^2A/dη^2 = Var[T(x)]   (its variance)

Example (Bernoulli): P(x | p) = p^x (1 − p)^{1−x}. Let η = log(p/(1−p)) (the log-odds). Then T(x) = x, h(x) = 1, and A(η) = log(1 + exp(η)), which is the softplus function! The sigmoid function p = 1/(1+exp(−η)) maps natural parameters back to probabilities. Logistic regression is just fitting natural parameters of a Bernoulli exponential family model.
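You can check the moment identities for the Bernoulli case numerically: differentiating the softplus A(η) should return the mean p and the variance p(1 − p). The sketch below uses central finite differences as a stand-in for exact derivatives; the step sizes and p = 0.3 are arbitrary choices for this illustration.

```python
import math

def log_partition(eta):
    """A(eta) = log(1 + exp(eta)), the softplus function."""
    return math.log1p(math.exp(eta))

def sigmoid(eta):
    """Maps the natural parameter back to the success probability."""
    return 1.0 / (1.0 + math.exp(-eta))

def derivative(f, x, h=1e-5):
    """Central finite difference for f'(x)."""
    return (f(x + h) - f(x - h)) / (2.0 * h)

def second_derivative(f, x, h=1e-4):
    """Central finite difference for f''(x)."""
    return (f(x + h) - 2.0 * f(x) + f(x - h)) / (h * h)

p = 0.3                                   # illustrative success probability
eta = math.log(p / (1.0 - p))             # natural parameter (log-odds)

mean_from_A = derivative(log_partition, eta)         # approximates E[x] = p
var_from_A = second_derivative(log_partition, eta)   # approximates p(1 - p)
print(sigmoid(eta), mean_from_A, var_from_A)
```

The first derivative of softplus is the sigmoid, which is why the mean comes back as p.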

Generalized linear models (GLMs) extend linear regression by using any exponential family distribution for the response variable. The natural parameter is a linear function of the inputs: η = w^T x. This gives you logistic regression (Bernoulli), Poisson regression (Poisson), and classical linear regression (Gaussian) as special cases of the same framework.

Check: What is the sufficient statistic for a Gaussian distribution?

Chapter 11: Change of Variables

If X has a known distribution and Y = g(X) for some function g, what is the distribution of Y? This is the change of variables problem, and the answer involves the Jacobian from Chapter 5.

For a monotonic, differentiable function g with inverse g^{−1}, the density of Y = g(X) is:

f_Y(y) = f_X(g^{−1}(y)) · |d g^{−1}(y)/dy|

The absolute derivative |d g^{−1}(y)/dy| is the Jacobian factor. It accounts for the "stretching" or "compressing" that g does to the probability mass. If g spreads out a region, the density decreases there; if g compresses, the density increases.

Key insight: Probability mass is conserved. If you transform a random variable, the total probability must still equal 1. The Jacobian is the correction factor that ensures this. It is the "exchange rate" between the old and new coordinate systems.
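A concrete one-dimensional case makes the correction factor visible. The sketch below takes X ~ N(0, 1) and Y = exp(X), a transformation chosen here purely for illustration, and checks that the transformed density assigns probability 0.5 to Y ≤ 1, matching P(X ≤ 0). The midpoint-rule integrator is an assumption of the example.

```python
import math

def normal_pdf(x):
    """Standard normal density."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def density_of_exp_x(y):
    """Density of Y = exp(X) with X ~ N(0, 1).

    Inverse map: x = log(y); Jacobian factor: |d log(y) / dy| = 1 / y.
    """
    return normal_pdf(math.log(y)) / y

def integrate(f, a, b, n=100_000):
    """Midpoint-rule approximation of the integral of f over [a, b]."""
    h = (b - a) / n
    return h * sum(f(a + (i + 0.5) * h) for i in range(n))

# P(Y <= 1) must equal P(X <= 0) = 0.5, because exp is increasing.
p_below_one = integrate(density_of_exp_x, 0.0, 1.0)

# Dropping the Jacobian factor gives a function that is not the density
# of Y: the same integral no longer comes out at 0.5.
without_jacobian = integrate(lambda y: normal_pdf(math.log(y)), 0.0, 1.0)
print(p_below_one, without_jacobian)
```

Without the 1/y factor the mass on (0, 1] comes out at roughly half of what it should be, which is the "exchange rate" going missing.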

For multivariate transformations Y = g(X) where X ∈ RD:

f_Y(y) = f_X(g^{−1}(y)) · |det J_{g^{−1}}(y)|

where J_{g^{−1}} is the Jacobian matrix of the inverse mapping. The absolute determinant measures how volumes change under the transformation.

This is the foundation of normalizing flows. A normalizing flow transforms a simple distribution (like a standard Gaussian) through a chain of invertible transformations Y = (g_K ∘ … ∘ g_1)(X). At each step, the log-density changes by the log-absolute-determinant of the Jacobian. By choosing transformations with tractable Jacobians, you can model arbitrarily complex distributions while still computing exact likelihoods.

Change of Variables: Stretching Densities

Start with X ~ N(0, 1). Apply Y = g(X). Watch how the density transforms. The Jacobian correction preserves total probability mass.

The reparameterization trick, used in VAEs, is also a change of variables. Instead of sampling Z ~ q(z | x) directly, you write Z = μ(x) + σ(x) · ε where ε ~ N(0, 1). The randomness is moved to the fixed distribution of ε, so the gradients with respect to μ and σ are well-defined. This is possible because the Gaussian is closed under affine transformations.
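Here is a minimal sketch of why the trick helps, using a toy objective E[z^2] whose exact gradients are known (2μ and 2σ). The objective, the parameter values, and the sample size are assumptions for this illustration; a real VAE would use automatic differentiation rather than hand-written gradients.

```python
import random

def sample_reparameterized(mu, sigma, eps):
    """z = mu + sigma * eps with eps ~ N(0, 1); deterministic in (mu, sigma)."""
    return mu + sigma * eps

rng = random.Random(0)                       # fixed seed, an arbitrary choice
eps_samples = [rng.gauss(0.0, 1.0) for _ in range(100_000)]

mu, sigma = 1.5, 0.5                         # illustrative parameters
zs = [sample_reparameterized(mu, sigma, e) for e in eps_samples]

# Toy objective E[z^2] = mu^2 + sigma^2. Because z is a differentiable
# function of (mu, sigma) for fixed eps, the chain rule gives per-sample
# gradients: d(z^2)/dmu = 2z and d(z^2)/dsigma = 2z * eps.
grad_mu = sum(2.0 * z for z in zs) / len(zs)
grad_sigma = sum(2.0 * z * e for z, e in zip(zs, eps_samples)) / len(zs)

print(grad_mu, 2.0 * mu)        # Monte Carlo estimate vs exact 2 * mu
print(grad_sigma, 2.0 * sigma)  # Monte Carlo estimate vs exact 2 * sigma
```

The estimates match the exact gradients up to Monte Carlo noise, because the sampling noise ε no longer depends on the parameters being differentiated.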

Check: What is the role of the Jacobian determinant in the change of variables formula?