The language machines use to reason under uncertainty — from axioms to Gaussians to Bayes.
You have a sensor. It tells you the temperature is 22.3°C. Do you trust it completely? Of course not — every measurement has noise. Your model predicts rain tomorrow with some confidence, but the atmosphere is chaotic. A neural network classifies an image as "cat" — but how sure is it?
Probability is the mathematical language for reasoning about uncertainty. In machine learning, uncertainty is everywhere: noisy data, incomplete observations, model mismatch, finite training sets. Probability gives us a principled framework to quantify, propagate, and reduce uncertainty.
This chapter builds the probabilistic toolkit that the rest of the book depends on. We start from axioms, build up to distributions, and culminate in two workhorses of ML: the Gaussian distribution and Bayes' theorem.
| Concept | Why it matters for ML |
|---|---|
| Bayes' theorem | Posterior inference, updating beliefs with data |
| Gaussian distribution | Linear regression, GP, Kalman filters, VAEs |
| Conjugate priors | Closed-form Bayesian updates |
| Exponential family | Unifies most common distributions, enables GLMs |
| Change of variables | Normalizing flows, reparameterization trick |
Think of probability as the operating system of ML. Optimization (Chapter 7) tells you how to learn. Probability tells you what to learn and how certain you should be about it.
Before we can compute with probability, we need to define what probability is. The formal machinery has three parts, collectively called a probability space (Ω, A, P).
The sample space Ω is the set of all possible outcomes. Roll a die: Ω = {1, 2, 3, 4, 5, 6}. Measure a person's height: Ω = (0, ∞). The sample space can be finite, countably infinite, or uncountably infinite.
The event space A is a collection of subsets of Ω that we assign probabilities to. For a die, A includes events like "roll an even number" = {2, 4, 6}. Technically, A must be a σ-algebra — closed under complements and countable unions — but the key intuition is: A is the set of questions we can ask about outcomes.
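The probability measure P assigns a number in [0, 1] to each event in A, following three axioms (the Kolmogorov axioms): non-negativity, normalization, and countable additivity.

$$P(E) \ge 0 \ \text{for every } E \in A, \qquad P(\Omega) = 1, \qquad P\Big(\bigcup_{i=1}^{\infty} E_i\Big) = \sum_{i=1}^{\infty} P(E_i) \ \text{for pairwise disjoint } E_i \in A.$$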
A random variable X is a function from the sample space to the real numbers: X: Ω → R. It maps outcomes to numbers we can compute with. "The number showing on the die" is a random variable. "The height of a randomly chosen person" is a random variable. We describe random variables through their probability distribution — a rule that assigns probabilities to the values X can take.
| Component | Symbol | Example (die roll) |
|---|---|---|
| Sample space | Ω | {1, 2, 3, 4, 5, 6} |
| Event space | A | All subsets of Ω |
| Probability measure | P | P({k}) = 1/6 for each k |
| Random variable | X | X(ω) = ω (the face value) |
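The die-roll row of the table can be written out directly. The following is a minimal sketch using only the standard library; the names `omega`, `events`, `P`, and `X` are chosen here for illustration, and the event space is taken to be the full power set, which is only practical for a small finite sample space.

```python
from fractions import Fraction
from itertools import chain, combinations

# Sample space for one roll of a fair die.
omega = frozenset(range(1, 7))

def P(event):
    """Probability measure: every outcome carries mass 1/6."""
    return Fraction(len(event), len(omega))

# Event space: for a finite sample space we can take every subset.
events = [
    frozenset(c)
    for c in chain.from_iterable(
        combinations(sorted(omega), r) for r in range(len(omega) + 1)
    )
]

def X(outcome):
    """Random variable: maps an outcome to a number (here the face value)."""
    return outcome

even = frozenset({2, 4, 6})
low_odd = frozenset({1, 3})          # disjoint from `even`

# An event defined through the random variable: {X >= 5}.
prob_at_least_5 = P({w for w in omega if X(w) >= 5})

print(len(events))                                   # 64 subsets of a 6-element set
print(P(omega))                                      # 1 (normalization)
print(P(even | low_odd) == P(even) + P(low_odd))     # True (additivity)
print(prob_at_least_5)                               # 1/3
```

Using `Fraction` keeps every probability exact, so the additivity check is an equality rather than a floating-point comparison.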
Random variables come in two fundamental flavors, and they are described differently.
A discrete random variable X takes values from a countable set. We describe it with a probability mass function (PMF): P(X = x) gives the probability of each specific value. The PMF must be non-negative and sum to 1.
Example: a fair coin has PMF P(H) = 0.5, P(T) = 0.5. A loaded die might have P(6) = 0.5 and P(k) = 0.1 for k = 1, …, 5.
A continuous random variable X takes values in an uncountable set (typically an interval in R). We describe it with a probability density function (PDF) f(x). The crucial difference: f(x) is not a probability. It is a density, that is, probability per unit length. You get probabilities by integrating:

$$P(a \le X \le b) = \int_a^b f(x)\, dx.$$
The cumulative distribution function (CDF) unifies both cases: F(x) = P(X ≤ x). For discrete X, F is a staircase. For continuous X, F is a smooth curve rising from 0 to 1. The CDF always exists, even when the PMF or PDF is awkward to write down.
Left: a discrete distribution (Binomial with n=10, p=0.5). Right: a continuous distribution (Gaussian with μ=0, σ=1). Notice how the PMF gives bar heights that sum to 1, while the PDF gives a curve whose area integrates to 1.
| Property | Discrete (PMF) | Continuous (PDF) |
|---|---|---|
| Values | Countable set | Uncountable (interval) |
| Assigns | Probabilities to points | Densities (prob per length) |
| Sum/Integral | ∑ P(x) = 1 | ∫ f(x) dx = 1 |
| P(X = x) | Can be > 0 | Always 0 |
| f(x) or P(x) | ≤ 1 | Can be > 1 |
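The last two rows of the table are easy to check numerically. The sketch below uses the Binomial(n=10, p=0.5) example from the figure and a deliberately narrow Gaussian (σ = 0.1, an illustrative choice not taken from the text) to show a density value above 1 that still yields a valid probability once integrated.

```python
import math

def binomial_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p)."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)

def gaussian_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) evaluated at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))

def gaussian_cdf(x, mu, sigma):
    """F(x) = P(X <= x), written with the error function."""
    return 0.5 * (1 + math.erf((x - mu) / (sigma * math.sqrt(2))))

# PMF values are probabilities and sum to 1.
pmf_total = sum(binomial_pmf(k, 10, 0.5) for k in range(11))

# A narrow Gaussian: the density at the mean exceeds 1 ...
peak_density = gaussian_pdf(0.0, 0.0, 0.1)

# ... yet the probability of an interval, obtained from the CDF, stays in [0, 1].
prob_interval = gaussian_cdf(0.1, 0.0, 0.1) - gaussian_cdf(-0.1, 0.0, 0.1)

print(pmf_total)       # 1 up to floating-point error
print(peak_density)    # about 3.99
print(prob_interval)   # about 0.683
```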
Two rules govern how probabilities compose. Master them, and everything else in this chapter becomes a consequence.
Given two random variables X and Y with joint distribution p(X, Y), the product rule decomposes the joint into a conditional and a marginal:

$$p(X, Y) = p(Y \mid X)\, p(X).$$
Read this as: the probability of seeing X and Y together equals the probability of X times the probability of Y given X. The conditional p(Y | X) captures what you learn about Y once you know X.
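The sum rule (or marginalization) recovers a marginal distribution by summing (or integrating) out the other variable:

$$p(X) = \sum_{Y} p(X, Y) \quad \text{(discrete)}, \qquad p(X) = \int p(X, Y)\, dY \quad \text{(continuous)}.$$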
Independence is the special case where conditioning tells you nothing: p(Y | X) = p(Y). In that case, the product rule simplifies to p(X, Y) = p(X) p(Y). Conditional independence X ⊥ Y | Z means p(X, Y | Z) = p(X | Z) p(Y | Z) — once you know Z, X and Y decouple. This is the foundation of graphical models.
A 2D joint distribution over two discrete variables. The right margin sums columns (marginal of X). The bottom margin sums rows (marginal of Y). Click a cell to see the conditional distribution.
| Rule | Formula | What it does |
|---|---|---|
| Product | p(X, Y) = p(Y \| X) p(X) | Factorize a joint distribution |
| Sum | p(X) = ∑_Y p(X, Y) | Marginalize out a variable |
| Chain | p(X₁, …, Xₙ) = ∏ᵢ p(Xᵢ \| X₁, …, Xᵢ₋₁) | Factor any joint into conditionals |
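For discrete variables, both rules are a few lines of arithmetic on a table. The sketch below assumes a small joint table with made-up numbers (rows index X, columns index Y); nothing in it comes from the text beyond the rules themselves.

```python
# Joint distribution p(X, Y): rows index X, columns index Y.
# The numbers are illustrative only.
joint = [
    [0.10, 0.20, 0.10],   # X = 0
    [0.30, 0.05, 0.25],   # X = 1
]

# Sum rule: marginalize by summing out the other variable.
p_x = [sum(row) for row in joint]            # p(X): sum over Y
p_y = [sum(col) for col in zip(*joint)]      # p(Y): sum over X

def conditional_y_given_x(x):
    """Product rule rearranged: p(Y | X = x) = p(X = x, Y) / p(X = x)."""
    return [p / p_x[x] for p in joint[x]]

# Independence would require p(X, Y) = p(X) p(Y) in every cell.
independent = all(
    abs(joint[i][j] - p_x[i] * p_y[j]) < 1e-12
    for i in range(len(p_x))
    for j in range(len(p_y))
)

print(p_x)                        # approximately [0.4, 0.6]
print(p_y)                        # approximately [0.4, 0.25, 0.35]
print(conditional_y_given_x(0))   # approximately [0.25, 0.5, 0.25]
print(independent)                # False for this table
```

Each conditional row sums to 1 because dividing by p(X = x) renormalizes the slice of the joint.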
Bayes' theorem is arguably the most important equation in machine learning. It tells you how to update your beliefs when you see new data. Start with a prior belief, observe evidence, and get a posterior belief. The formula is a direct consequence of the product rule, applied to the joint p(θ, D) in both orders:

$$p(\theta \mid D) = \frac{p(D \mid \theta)\, p(\theta)}{p(D)}.$$
Let's unpack each piece:
| Term | Name | Meaning |
|---|---|---|
| p(θ) | Prior | What you believed about θ before seeing data |
| p(D \| θ) | Likelihood | How probable the data is if θ were true |
| p(θ \| D) | Posterior | Your updated belief after seeing data |
| p(D) | Evidence / Marginal likelihood | A normalizing constant = ∫ p(D \| θ) p(θ) dθ |
Let's make this concrete. Suppose you find a coin and wonder if it is fair. Your prior belief about the bias θ (probability of heads) is a Beta distribution — maybe Beta(2, 2), meaning you think the coin is probably roughly fair but you're not sure. You flip the coin N times and observe h heads. The likelihood is Binomial. Bayes' theorem gives you a posterior that is also Beta — this is the magic of conjugacy (Chapter 9).
The posterior is simply the prior with updated counts: after h heads in N flips, Beta(α, β) becomes Beta(α + h, β + N − h). Each head increments α; each tail increments β. The more flips, the more concentrated the posterior becomes. Try it below.
Set the prior Beta(α, β) with the sliders. Click Flip to generate a random coin flip (true bias = 0.6). Watch the posterior update live. Dashed = prior, dotted = likelihood, solid = posterior.
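The same experiment can be run offline. The sketch below assumes the Beta(2, 2) prior and the true bias of 0.6 mentioned above; the helper names, the seed, and the number of flips are arbitrary choices made for illustration.

```python
import random

def beta_update(alpha, beta, flips):
    """Posterior Beta parameters after observing flips (1 = heads, 0 = tails)."""
    heads = sum(flips)
    tails = len(flips) - heads
    return alpha + heads, beta + tails

def beta_mean(alpha, beta):
    return alpha / (alpha + beta)

def beta_variance(alpha, beta):
    total = alpha + beta
    return alpha * beta / (total**2 * (total + 1))

rng = random.Random(0)      # fixed seed so the run is repeatable
true_bias = 0.6             # probability of heads used to simulate flips
flips = [1 if rng.random() < true_bias else 0 for _ in range(200)]

prior = (2.0, 2.0)
post_10 = beta_update(*prior, flips[:10])
post_200 = beta_update(*prior, flips)

for label, params in [("prior", prior), ("10 flips", post_10), ("200 flips", post_200)]:
    print(label, params, beta_mean(*params), beta_variance(*params))
```

The printed variance shrinks as flips accumulate, which is the numerical counterpart of the posterior curve narrowing in the demo.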
Distributions can be complicated objects. Two numbers summarize the most important features: where the distribution is centered and how spread out it is.
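The mean (or expected value) E[X] is the probability-weighted average of all values X can take:

$$\mathbb{E}[X] = \sum_x x\, P(X = x) \quad \text{(discrete)}, \qquad \mathbb{E}[X] = \int x\, f(x)\, dx \quad \text{(continuous)}.$$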
Think of the mean as the "center of mass" of the distribution. If you placed weights proportional to the probability at each point on a number line, the mean is where the line balances.
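The variance V[X] measures how far values typically deviate from the mean:

$$\mathbb{V}[X] = \mathbb{E}\big[(X - \mathbb{E}[X])^2\big] = \mathbb{E}[X^2] - (\mathbb{E}[X])^2.$$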
The second form — "E of X-squared minus the square of E of X" — is usually easier to compute. The standard deviation σ = √V[X] has the same units as X, making it more interpretable.
Linearity of expectation is enormously useful: E[aX + b] = aE[X] + b. For two random variables X and Y, E[X + Y] = E[X] + E[Y] holds even when they are dependent; no independence is needed. Variance is not linear: V[aX + b] = a²V[X] (the constant b disappears and a gets squared).
| Property | Mean | Variance |
|---|---|---|
| Shift by b | E[X + b] = E[X] + b | V[X + b] = V[X] |
| Scale by a | E[aX] = aE[X] | V[aX] = a²V[X] |
| Sum | E[X + Y] = E[X] + E[Y] (always) | V[X + Y] = V[X] + V[Y] (independent X, Y) |
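These identities can be verified exactly on a small discrete distribution. The sketch below reuses the loaded die from earlier (P(6) = 0.5, P(k) = 0.1 otherwise); the constants a = 3 and b = 5 and the helper names are illustrative assumptions.

```python
from fractions import Fraction

# PMF of the loaded die: P(6) = 1/2 and P(k) = 1/10 for k = 1..5.
pmf = {k: Fraction(1, 10) for k in range(1, 6)}
pmf[6] = Fraction(1, 2)

def expectation(g, pmf):
    """E[g(X)] for a discrete X with the given PMF."""
    return sum(g(x) * p for x, p in pmf.items())

def variance(g, pmf):
    """V[g(X)] computed as E[g(X)^2] - (E[g(X)])^2."""
    m = expectation(g, pmf)
    return expectation(lambda x: g(x) ** 2, pmf) - m**2

a, b = 3, 5
mean_x = expectation(lambda x: x, pmf)
var_x = variance(lambda x: x, pmf)

# The definitional form E[(X - E[X])^2] gives the same variance.
var_x_definition = expectation(lambda x: (x - mean_x) ** 2, pmf)

scaled_mean = expectation(lambda x: a * x + b, pmf)
scaled_var = variance(lambda x: a * x + b, pmf)

print(mean_x, var_x)                      # 9/2 13/4
print(var_x == var_x_definition)          # True
print(scaled_mean == a * mean_x + b)      # True: E[aX + b] = aE[X] + b
print(scaled_var == a**2 * var_x)         # True: V[aX + b] = a^2 V[X]
```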
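Mean and variance describe single variables. For pairs of variables, we need to capture how they move together. Enter covariance:

$$\mathrm{Cov}[X, Y] = \mathbb{E}\big[(X - \mathbb{E}[X])(Y - \mathbb{E}[Y])\big] = \mathbb{E}[XY] - \mathbb{E}[X]\,\mathbb{E}[Y].$$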
Positive covariance: when X is above its mean, Y tends to be above its mean too. Negative: they move in opposite directions. Zero: no linear relationship (but they might still be dependent nonlinearly).
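The problem with covariance is that its magnitude depends on the units and scales of X and Y. Correlation normalizes this by dividing by the two standard deviations:

$$\mathrm{Corr}[X, Y] = \frac{\mathrm{Cov}[X, Y]}{\sqrt{\mathbb{V}[X]\,\mathbb{V}[Y]}} \in [-1, 1].$$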
Correlation of +1 means perfect positive linear relationship. Correlation of −1 means perfect negative linear relationship. Correlation of 0 means no linear relationship.
For a random vector X ∈ R^D with mean μ = E[X], the covariance matrix is:

$$\Sigma = \mathbb{E}\big[(X - \mu)(X - \mu)^\top\big].$$
This D×D matrix encodes all pairwise linear relationships. It is the multivariate generalization of variance, and it appears everywhere: Gaussian distributions, PCA, Kalman filters, Mahalanobis distance.
| Corr[X,Y] | Interpretation |
|---|---|
| +1 | Perfect positive linear relationship (Y = aX + b, a > 0) |
| 0 | No linear relationship (may still be dependent!) |
| −1 | Perfect negative linear relationship (Y = aX + b, a < 0) |
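All three rows of the table can be reproduced with a handful of equally likely points. In the sketch below, X is uniform on {−2, −1, 0, 1, 2} (an illustrative choice), and Y = X² provides the "zero correlation but still dependent" case.

```python
import math
from fractions import Fraction

def mean(values):
    return sum(values) / len(values)

def covariance(xs, ys):
    """Cov[X, Y] = E[(X - E[X])(Y - E[Y])] for equally likely (x, y) pairs."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)

def correlation(xs, ys):
    """Covariance divided by the product of the standard deviations."""
    return covariance(xs, ys) / math.sqrt(covariance(xs, xs) * covariance(ys, ys))

xs = [Fraction(v) for v in (-2, -1, 0, 1, 2)]   # equally likely values of X

linear = [3 * x + 1 for x in xs]      # Y = 3X + 1   (a > 0)
flipped = [-2 * x + 4 for x in xs]    # Y = -2X + 4  (a < 0)
squared = [x * x for x in xs]         # Y = X^2, fully determined by X

print(correlation(xs, linear))     # 1.0
print(correlation(xs, flipped))    # -1.0
print(covariance(xs, squared))     # 0: no linear relationship despite dependence
```

The last line is the warning in the table made concrete: Y is a deterministic function of X, yet the covariance vanishes because the relationship is symmetric rather than linear.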
If you learn only one distribution in your life, make it the Gaussian. It appears everywhere in ML: linear regression assumes Gaussian noise, the central limit theorem says averages are approximately Gaussian, variational autoencoders use Gaussian latent spaces, and Gaussian processes are built entirely from it.
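The univariate Gaussian (or Normal) distribution has two parameters, the mean μ and the variance σ²:

$$\mathcal{N}(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{(x - \mu)^2}{2\sigma^2}\right).$$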
The bell curve. Centered at μ, width controlled by σ. About 68% of probability mass lies within one standard deviation of the mean, 95% within two, 99.7% within three.
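The multivariate Gaussian generalizes to vectors X ∈ R^D with mean vector μ ∈ R^D and covariance matrix Σ ∈ R^(D×D):

$$\mathcal{N}(x \mid \mu, \Sigma) = (2\pi)^{-D/2}\, |\Sigma|^{-1/2} \exp\!\left(-\tfrac{1}{2}(x - \mu)^\top \Sigma^{-1} (x - \mu)\right).$$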
Adjust the mean, variances, and correlation to see how the Gaussian's shape changes. The ellipses are contours of constant density (1σ and 2σ).
Why Gaussians are special:
• Maximum entropy. Among all distributions with a given mean and variance, the Gaussian has the highest entropy — it makes the fewest assumptions.
• Closure under linear transformations. If X ~ N(μ, Σ) and Y = AX + b, then Y ~ N(Aμ + b, AΣAᵀ). Gaussians beget Gaussians.
• Closure under conditioning and marginalization. Marginals and conditionals of Gaussians are also Gaussian (Chapter 8).
• Central Limit Theorem. Suitably standardized sums of many independent, identically distributed variables with finite variance converge in distribution to a Gaussian, regardless of the original distribution.
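Two of the facts above lend themselves to a quick numerical check: the 68 / 95 / 99.7 rule and closure under linear transformations. The sketch below is standard-library only; the particular μ, Σ, A, and b are illustrative values, and the tiny matrix helpers are written out by hand rather than taken from any library.

```python
import math

def mass_within(k):
    """P(|X - mu| <= k * sigma) for any univariate Gaussian."""
    return math.erf(k / math.sqrt(2))

def matmul(A, B):
    """Product of two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def affine_gaussian(mu, Sigma, A, b):
    """Parameters of Y = A X + b when X ~ N(mu, Sigma): (A mu + b, A Sigma A^T)."""
    mean = [sum(a * m for a, m in zip(row, mu)) + bi for row, bi in zip(A, b)]
    cov = matmul(matmul(A, Sigma), transpose(A))
    return mean, cov

print([round(mass_within(k), 4) for k in (1, 2, 3)])   # about 0.683, 0.954, 0.997

mu = [1.0, -1.0]
Sigma = [[2.0, 0.5], [0.5, 1.0]]
A = [[1.0, 1.0], [1.0, -1.0]]     # sum and difference of the two components
b = [0.0, 3.0]

new_mean, new_cov = affine_gaussian(mu, Sigma, A, b)
print(new_mean)   # [0.0, 5.0]
print(new_cov)    # [[4.0, 1.0], [1.0, 2.0]]
```

The top-left entry of the new covariance is V[X₁ + X₂] = 2 + 1 + 2 · 0.5 = 4, which matches the usual variance-of-a-sum formula with a covariance term.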
One of the most powerful properties of the Gaussian is that conditioning and marginalization produce new Gaussians with closed-form parameters. This is why Gaussians dominate Bayesian ML — you can do exact inference without approximation.
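Consider a joint Gaussian over two sub-vectors X and Y:

$$\begin{bmatrix} X \\ Y \end{bmatrix} \sim \mathcal{N}\!\left(\begin{bmatrix} \mu_X \\ \mu_Y \end{bmatrix},\ \begin{bmatrix} \Sigma_{XX} & \Sigma_{XY} \\ \Sigma_{YX} & \Sigma_{YY} \end{bmatrix}\right).$$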
Marginalization: p(X) = N(X | μ_X, Σ_XX). Just read off the mean and covariance of X from the joint. The cross-covariance Σ_XY plays no role.
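Conditioning: p(X | Y = y) is Gaussian with mean and covariance:

$$\mu_{X \mid Y} = \mu_X + \Sigma_{XY}\Sigma_{YY}^{-1}(y - \mu_Y), \qquad \Sigma_{X \mid Y} = \Sigma_{XX} - \Sigma_{XY}\Sigma_{YY}^{-1}\Sigma_{YX}.$$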
These formulas are the engine behind:
| Application | How it uses Gaussian conditioning |
|---|---|
| Bayesian linear regression | Posterior over weights is Gaussian conditioned on data |
| Kalman filter | Update step is Gaussian conditioning |
| Gaussian processes | Predictions are conditional Gaussians |
| Factor analysis / PPCA | Latent variable inference via conditioning |
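When X and Y are both scalars, Gaussian conditioning reduces to ordinary arithmetic, which makes the mechanics easy to see. The sketch below assumes the standard conditioning result (conditional mean μ_X + Σ_XY Σ_YY⁻¹ (y − μ_Y), conditional variance Σ_XX − Σ_XY Σ_YY⁻¹ Σ_YX); the function name and the numbers are hypothetical.

```python
def condition_bivariate(mu_x, mu_y, var_x, var_y, cov_xy, y):
    """Mean and variance of X | Y = y for a jointly Gaussian scalar pair."""
    gain = cov_xy / var_y                  # Sigma_XY * Sigma_YY^{-1}
    cond_mean = mu_x + gain * (y - mu_y)   # shift the mean toward the evidence
    cond_var = var_x - gain * cov_xy       # observing Y never increases the variance
    return cond_mean, cond_var

# Illustrative numbers: X and Y are positively correlated.
mean, var = condition_bivariate(
    mu_x=0.0, mu_y=1.0, var_x=4.0, var_y=1.0, cov_xy=1.5, y=2.0
)
print(mean, var)   # 1.5 1.75

# With zero cross-covariance, observing Y changes nothing about X.
print(condition_bivariate(0.0, 1.0, 4.0, 1.0, 0.0, 2.0))   # (0.0, 4.0)
```

Note that the conditional variance does not depend on the observed value y, only on the covariance structure. The `gain` term plays the same role as the gain in a Kalman filter update.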
Bayes' theorem says posterior ∝ likelihood × prior. For most likelihood-prior combinations, the resulting posterior has no closed form — you need MCMC or variational methods. But for certain special pairings, the posterior belongs to the same family as the prior. These are called conjugate priors.
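We already saw one example in the coin flipper (Chapter 4): the Beta distribution is conjugate to the Binomial likelihood. With h heads observed in N flips, here is the update:

$$p(\theta \mid h, N) = \mathrm{Beta}(\theta \mid \alpha + h,\ \beta + N - h).$$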
The prior parameters α and β act like "pseudo-counts" — imagine you have already seen α − 1 heads and β − 1 tails before any data arrives. The data adds real counts on top. As N → ∞, the data dominates and the prior becomes irrelevant.
| Likelihood | Conjugate Prior | Posterior |
|---|---|---|
| Binomial / Bernoulli | Beta | Beta |
| Gaussian (known σ) | Gaussian | Gaussian |
| Gaussian (known μ) | Inverse-Gamma | Inverse-Gamma |
| Multinomial | Dirichlet | Dirichlet |
| Poisson | Gamma | Gamma |
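For the Gaussian with known variance σ², the conjugate prior on the mean μ is also Gaussian. With prior μ ~ N(μ₀, σ₀²) and N observations with sample mean x̄, the posterior is N(μ_N, σ_N²) with:

$$\frac{1}{\sigma_N^2} = \frac{1}{\sigma_0^2} + \frac{N}{\sigma^2}, \qquad \mu_N = \sigma_N^2\left(\frac{\mu_0}{\sigma_0^2} + \frac{N \bar{x}}{\sigma^2}\right).$$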
The posterior precision (inverse variance) is the sum of prior precision and data precision. The posterior mean is a precision-weighted average of the prior mean and the data mean. More data = more precision = tighter posterior.
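The precision bookkeeping described above can be sketched in a few lines. The function name and the example numbers are illustrative; the update itself is the standard one for a Gaussian likelihood with known noise variance and a Gaussian prior on the mean.

```python
def posterior_mean_known_variance(prior_mean, prior_var, data, noise_var):
    """Posterior over a Gaussian mean when the noise variance is known.

    Returns (posterior mean, posterior variance).
    """
    n = len(data)
    data_mean = sum(data) / n
    prior_precision = 1.0 / prior_var
    data_precision = n / noise_var                  # each observation adds 1 / noise_var
    post_precision = prior_precision + data_precision
    post_mean = (
        prior_precision * prior_mean + data_precision * data_mean
    ) / post_precision                              # precision-weighted average
    return post_mean, 1.0 / post_precision

# Prior N(0, 1), unit noise variance, four observations all equal to 2.
post_mean, post_var = posterior_mean_known_variance(0.0, 1.0, [2.0, 2.0, 2.0, 2.0], 1.0)
print(post_mean, post_var)   # 1.6 0.2
```

With four observations the data carries four times the prior's precision, so the posterior mean sits four fifths of the way from the prior mean (0) to the data mean (2).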
We have seen several distributions: Gaussian, Bernoulli, Beta, Gamma, Poisson. They look very different, but most of them share a common algebraic structure. The exponential family unifies them.
A distribution belongs to the exponential family if its density can be written as:

$$p(x \mid \eta) = h(x)\, \exp\!\big(\eta^\top T(x) - A(\eta)\big).$$
| Symbol | Name | Role |
|---|---|---|
| η | Natural parameter | Controls the distribution shape |
| T(x) | Sufficient statistic | Summarizes data — no information lost |
| A(η) | Log-partition function | Ensures the distribution integrates to 1 |
| h(x) | Base measure | Scaling factor independent of η |
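The log-partition function A(η) is deceptively useful. Its derivatives give you the moments of the sufficient statistic:

$$\nabla_\eta A(\eta) = \mathbb{E}[T(x)], \qquad \nabla_\eta^2 A(\eta) = \mathrm{Cov}[T(x)].$$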
Generalized linear models (GLMs) extend linear regression by using any exponential family distribution for the response variable. The natural parameter is a linear function of the inputs: η = wᵀx. This gives you logistic regression (Bernoulli), Poisson regression (Poisson), and classical linear regression (Gaussian) as special cases of the same framework.
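The Bernoulli distribution makes a compact worked instance of the exponential-family form. The sketch below assumes the standard Bernoulli parameterization, which the text does not spell out: natural parameter η = log(p / (1 − p)), sufficient statistic T(x) = x, base measure h(x) = 1, and log-partition A(η) = log(1 + e^η). Derivatives of A are approximated with finite differences to keep the example dependency-free.

```python
import math

def log_partition(eta):
    """A(eta) for the Bernoulli distribution: log(1 + e^eta)."""
    return math.log1p(math.exp(eta))

def bernoulli_expfam(x, eta):
    """h(x) * exp(eta * T(x) - A(eta)) with h(x) = 1 and T(x) = x."""
    return math.exp(eta * x - log_partition(eta))

p = 0.3
eta = math.log(p / (1 - p))     # natural parameter (the log-odds)

# Central finite differences approximate the first two derivatives of A.
step = 1e-4
first = (log_partition(eta + step) - log_partition(eta - step)) / (2 * step)
second = (
    log_partition(eta + step) - 2 * log_partition(eta) + log_partition(eta - step)
) / step**2

print(bernoulli_expfam(1, eta), bernoulli_expfam(0, eta))   # about 0.3 and 0.7
print(first)    # about 0.3  = E[T(x)] = p
print(second)   # about 0.21 = V[T(x)] = p(1 - p)
```

The first derivative of A is the logistic sigmoid of η, which is why a GLM with η = wᵀx and a Bernoulli response is logistic regression.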
If X has a known distribution and Y = g(X) for some function g, what is the distribution of Y? This is the change of variables problem, and the answer involves the Jacobian from Chapter 5.
For a strictly monotonic, differentiable function g with inverse g⁻¹, the density of Y = g(X) is:

$$p_Y(y) = p_X\big(g^{-1}(y)\big)\, \left|\frac{d\, g^{-1}(y)}{dy}\right|.$$
The absolute derivative |dg⁻¹/dy| is the Jacobian factor. It accounts for the "stretching" or "compressing" that g does to the probability mass. If g spreads out a region, the density decreases there; if g compresses, the density increases.
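For multivariate transformations Y = g(X) where X ∈ R^D and g is invertible and differentiable:

$$p_Y(y) = p_X\big(g^{-1}(y)\big)\, \big|\det J_{g^{-1}}(y)\big|,$$

where J_{g⁻¹} is the Jacobian matrix of the inverse mapping. The absolute determinant measures how volumes change under the transformation.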
Start with X ~ N(0, 1). Apply Y = g(X). Watch how the density transforms. The Jacobian correction preserves total probability mass.
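A concrete case is Y = exp(X) with X ~ N(0, 1), where g⁻¹(y) = log y and the Jacobian factor is 1/y. The sketch below picks this g as an illustrative example (the text does not fix a particular g) and checks that the transformed density assigns the event {Y ≤ e²} the same probability as {X ≤ 2}. The midpoint-rule integrator is a simple stand-in for a proper quadrature routine.

```python
import math

def normal_pdf(x):
    return math.exp(-0.5 * x * x) / math.sqrt(2 * math.pi)

def normal_cdf(x):
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

def density_of_exp(y):
    """Density of Y = exp(X), X ~ N(0, 1): p_X(g^{-1}(y)) * |d g^{-1}/dy|."""
    if y <= 0:
        return 0.0
    return normal_pdf(math.log(y)) / y     # g^{-1}(y) = log y, derivative 1 / y

def integrate(f, lo, hi, steps=20000):
    """Midpoint rule; adequate for this smooth integrand."""
    width = (hi - lo) / steps
    return sum(f(lo + (i + 0.5) * width) for i in range(steps)) * width

upper = math.e**2
prob_y = integrate(density_of_exp, 0.0, upper)    # P(Y <= e^2)
prob_x = normal_cdf(2.0)                          # P(X <= 2), the same event

# Dropping the Jacobian factor gives something that is not a density at all.
naive = integrate(lambda y: normal_pdf(math.log(y)), 0.0, upper)

print(prob_y, prob_x)   # both about 0.977
print(naive)            # exceeds 1, so it cannot be a probability
```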
The reparameterization trick, used in VAEs, is also a change of variables. Instead of sampling Z ~ q(z | x) directly, you write Z = μ(x) + σ(x) · ε where ε ~ N(0, 1). The randomness is moved into ε, whose distribution is fixed and does not depend on the parameters, so the gradients ∂/∂μ and ∂/∂σ of an expectation over Z are well-defined. This is possible because the Gaussian is closed under affine transformations.
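The effect of the trick can be seen without an autodiff framework. The sketch below estimates the gradients of E[f(Z)] for the illustrative choice f(z) = z², for which the exact answer is known: E[Z²] = μ² + σ², so the gradients are 2μ and 2σ. The function name, sample count, and seed are assumptions made for this example, and the chain rule is applied by hand.

```python
import random

def reparam_gradients(mu, sigma, f_prime, num_samples=100_000, seed=0):
    """Monte Carlo estimate of d/dmu and d/dsigma of E[f(Z)], Z = mu + sigma * eps."""
    rng = random.Random(seed)
    grad_mu = 0.0
    grad_sigma = 0.0
    for _ in range(num_samples):
        eps = rng.gauss(0.0, 1.0)     # noise from the fixed distribution N(0, 1)
        z = mu + sigma * eps          # deterministic and differentiable in mu, sigma
        slope = f_prime(z)
        grad_mu += slope              # chain rule: dz/dmu = 1
        grad_sigma += slope * eps     # chain rule: dz/dsigma = eps
    return grad_mu / num_samples, grad_sigma / num_samples

# f(z) = z^2 has derivative 2z; the exact gradients are 2 * mu and 2 * sigma.
g_mu, g_sigma = reparam_gradients(1.0, 0.5, f_prime=lambda z: 2 * z)
print(g_mu, g_sigma)   # close to 2.0 and 1.0, up to Monte Carlo error
```

In a VAE the same pattern appears with f replaced by the decoder loss and the hand-written chain rule replaced by automatic differentiation.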