Every ML model is a probabilistic claim about data. Here we build the toolkit — from coin flips to multivariate Gaussians.
You build a spam filter. It looks at an email and says "spam" or "not spam." But how confident is it? A patient walks into a clinic and tests positive for a rare disease. Should the doctor be alarmed? A self-driving car's camera sees something that might be a pedestrian. How certain must it be before it brakes?
In every case, we need to quantify uncertainty. Not just a binary answer, but a calibrated degree of belief. That is what probability gives us: a mathematical language for reasoning about things we don't know for sure.
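Murphy identifies two kinds of uncertainty: epistemic (or model) uncertainty, which comes from our ignorance of the mechanism that generated the data and shrinks as we collect more data, and aleatoric (or data) uncertainty, which comes from intrinsic variability and cannot be reduced no matter how much data we collect.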
This chapter covers the probabilistic toolkit that the entire book builds on: random variables, distributions, Bayes' rule, the Gaussian, and multivariate models. By the end, you will be able to build a live Bayesian inference engine.
Before we can do any math, we need the concept of a random variable. Think of it as a quantity whose value is uncertain. The temperature tomorrow. The label of an image. The number of heads in 10 coin flips. We describe the random variable by its probability distribution — a rule that assigns probabilities to each possible value.
Discrete random variables take values from a countable set. We describe them with a probability mass function (PMF): p(X = x) gives the probability of each specific value. All probabilities must be non-negative and sum to 1.
Continuous random variables take values in an uncountable set (like the real line). We describe them with a probability density function (PDF) p(x). The crucial difference: p(x) is not a probability. It is a density, that is, probability per unit length, so it can exceed 1. Probability lives in intervals:

$$P(a \le X \le b) = \int_a^b p(x)\,dx$$
The cumulative distribution function (CDF) unifies both cases: F(x) = P(X ≤ x). For discrete X it is a staircase; for continuous X it is a smooth curve rising from 0 to 1.
Key statistics of a distribution:
| Statistic | Formula | Meaning |
|---|---|---|
| Mean μ | E[X] = ∑ x · p(x) | Center of mass — the "expected" value |
| Variance σ² | E[(X − μ)²] | Spread around the mean |
| Mode | argmax_x p(x) | Most probable value |
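The statistics in the table can be computed directly from a PMF. The following is a minimal sketch for a Binomial(10, 0.3) variable, using only the Python standard library; the helper name `binomial_pmf` and the parameter values are illustrative choices, not something taken from the book.

```python
from math import comb


def binomial_pmf(n: int, theta: float) -> list[float]:
    """Return [p(X = 0), ..., p(X = n)] for X ~ Binomial(n, theta)."""
    return [comb(n, k) * theta**k * (1 - theta) ** (n - k) for k in range(n + 1)]


# Illustrative parameters (assumed for this example).
pmf = binomial_pmf(10, 0.3)

total = sum(pmf)                                                # a valid PMF sums to 1
mean = sum(k * p for k, p in enumerate(pmf))                    # E[X]
variance = sum((k - mean) ** 2 * p for k, p in enumerate(pmf))  # E[(X - mean)^2]
mode = max(range(len(pmf)), key=lambda k: pmf[k])               # argmax_k p(k)
cdf = [sum(pmf[: k + 1]) for k in range(len(pmf))]              # F(k) = P(X <= k)

print(f"sum={total:.6f} mean={mean:.3f} variance={variance:.3f} mode={mode}")
```

For a Binomial the mean and variance also have closed forms, Nθ and Nθ(1 − θ), which gives a quick sanity check on the sums above.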
Left: discrete Binomial(10, p). Right: continuous Gaussian(μ, σ). Drag sliders to change parameters.
You test positive for a rare disease. The test has a 99% sensitivity (true positive rate) and a 1% false positive rate. Should you panic? Most people's intuition says yes. But if the disease prevalence is 1 in 10,000, the math tells a very different story.
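This is the domain of Bayes' rule, the formula that tells you how to update your beliefs in the face of evidence. It follows directly from the product rule of probability, p(θ, D) = p(D | θ) p(θ) = p(θ | D) p(D):

$$p(\theta \mid D) = \frac{p(D \mid \theta)\,p(\theta)}{p(D)}$$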
| Term | Name | Role |
|---|---|---|
| p(θ) | Prior | What you believed before seeing data |
| p(D \| θ) | Likelihood | How probable the data is under each θ |
| p(θ \| D) | Posterior | Updated belief after seeing data |
| p(D) | Evidence | Normalizing constant = ∫ p(D \| θ) p(θ) dθ |
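Back to the disease test. Let D = "positive test", θ = "has disease." The prior p(θ) = 0.0001. The likelihood p(D | θ) = 0.99. The false positive rate p(D | not θ) = 0.01. By Bayes' rule, with the evidence expanded over both cases:

$$p(\theta \mid D) = \frac{0.99 \times 0.0001}{0.99 \times 0.0001 + 0.01 \times 0.9999} = \frac{0.000099}{0.010098} \approx 0.0098$$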
Less than 1%! Despite a "99% accurate" test, the low base rate means a positive result is still probably a false alarm. This is the base rate fallacy, and Bayes' rule is the antidote.
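The same arithmetic can be wrapped in a small function so that other prevalence and error rates are easy to try. This is a sketch only; the function name `posterior_given_positive` and its argument names are made up for the example.

```python
def posterior_given_positive(
    prevalence: float, sensitivity: float, false_positive_rate: float
) -> float:
    """p(disease | positive test) via Bayes' rule for a binary hypothesis."""
    true_positive = sensitivity * prevalence                  # p(D | theta) p(theta)
    false_positive = false_positive_rate * (1 - prevalence)   # p(D | not theta) p(not theta)
    evidence = true_positive + false_positive                 # p(D)
    return true_positive / evidence


# Numbers from the disease example in the text.
p_rare = posterior_given_positive(0.0001, 0.99, 0.01)
# Same test, but a much more common condition (assumed value for contrast).
p_common = posterior_given_positive(0.10, 0.99, 0.01)

print(f"prevalence 1 in 10,000: {p_rare:.4f}")
print(f"prevalence 1 in 10:     {p_common:.4f}")
```

Holding the test fixed and raising the prevalence shows that the prior, not the test accuracy, is what drives the posterior in the rare-disease case.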
Adjust prevalence, sensitivity, and specificity. See how the posterior probability of disease changes after a positive test.
The simplest interesting distribution: a coin flip. A Bernoulli distribution models a single binary trial with probability θ of success:

$$\mathrm{Ber}(x \mid \theta) = \theta^{x}(1-\theta)^{1-x}, \quad x \in \{0, 1\}$$
The mean is θ and the variance is θ(1 − θ). Simple — but this is the foundation of logistic regression, Naive Bayes, and binary classification.
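Now flip the coin N times. The number of heads k follows a Binomial distribution:

$$\mathrm{Bin}(k \mid N, \theta) = \binom{N}{k}\,\theta^{k}(1-\theta)^{N-k}, \quad k = 0, 1, \dots, N$$

Murphy connects these to ML through the sigmoid function σ(a) = 1/(1 + e^(−a)), which maps any real number to the open interval (0, 1). This makes it a natural way to parameterize a Bernoulli: θ = σ(wᵀx). This is exactly what logistic regression does (Chapter 10).
For multiple classes C > 2, we generalize to the Categorical distribution, with class probabilities given by the softmax function applied to a vector of logits a = (a₁, …, a_C):

$$p(y = c \mid a) = \mathrm{softmax}(a)_c = \frac{e^{a_c}}{\sum_{c'=1}^{C} e^{a_{c'}}}$$

Left: the sigmoid squashes any real number into (0, 1). Right: softmax converts a vector of logits to a probability distribution. Drag the slider to change the input.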
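Both functions are a few lines of code. The sketch below uses the usual numerical-stability tricks (branching on the sign for the sigmoid, subtracting the maximum logit for softmax); these are common practice rather than something the text prescribes, and the function names are illustrative.

```python
import math


def sigmoid(a: float) -> float:
    """Map a real number to (0, 1), avoiding overflow for large |a|."""
    if a >= 0:
        return 1.0 / (1.0 + math.exp(-a))
    z = math.exp(a)  # safe: a < 0, so exp(a) < 1
    return z / (1.0 + z)


def softmax(logits: list[float]) -> list[float]:
    """Convert logits to probabilities that are positive and sum to 1."""
    m = max(logits)  # shifting by the max does not change the result
    exps = [math.exp(a - m) for a in logits]
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax([2.0, 1.0, 0.1])
print([round(p, 3) for p in probs], "sum =", round(sum(probs), 6))

# With two classes and logits (0, a), softmax reduces to the sigmoid.
a = 1.7
print(softmax([0.0, a])[1], sigmoid(a))
```

The two-class check at the end shows why the sigmoid is the C = 2 special case of softmax.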
If you could only know one distribution, this would be it. The Gaussian (or normal) distribution is the workhorse of ML — it appears in linear regression, Kalman filters, variational autoencoders, Gaussian processes, and almost everywhere else.
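Its density is

$$\mathcal{N}(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$

It has two parameters: the mean μ (where the bell is centered) and the variance σ² (how wide the bell is). The standard deviation σ is the square root of the variance.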
Key properties:
| Property | Value |
|---|---|
| Mean | μ |
| Mode | μ (same as mean) |
| Variance | σ² |
| 68-95-99.7 rule | 68% within ±1σ, 95% within ±2σ, 99.7% within ±3σ |
| Entropy | ½ ln(2πeσ²) — maximum entropy for given mean and variance |
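The 68-95-99.7 rule can be checked numerically. For any Gaussian, the probability of landing within k standard deviations of the mean is erf(k/√2), which the standard library exposes as `math.erf`. The helper name `prob_within` is an illustrative choice.

```python
import math


def prob_within(k: float) -> float:
    """P(|X - mu| <= k * sigma) for a Gaussian; independent of mu and sigma."""
    return math.erf(k / math.sqrt(2))


for k in (1, 2, 3):
    print(f"within {k} sigma: {prob_within(k):.4f}")
```

The three printed values round to 0.6827, 0.9545, and 0.9973, matching the rule in the table.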
For regression, we model the output as y = f(x) + ε, where ε ~ N(0, σ²). With a linear model f(x) = wᵀx, the likelihood is p(y | x, w) = N(y | wᵀx, σ²). Maximizing this likelihood with respect to w is equivalent to minimizing squared error, which is why least squares is so natural.
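The link to squared error can be made explicit by writing out the negative log-likelihood. The derivation below is a sketch that assumes N independent training pairs (x_n, y_n) and a fixed noise variance σ²; the indexing notation is introduced here for illustration.

```latex
-\log p(y_{1:N} \mid x_{1:N}, w)
  = -\sum_{n=1}^{N} \log \mathcal{N}(y_n \mid w^\top x_n, \sigma^2)
  = \frac{N}{2}\log(2\pi\sigma^2)
    + \frac{1}{2\sigma^2}\sum_{n=1}^{N} \left(y_n - w^\top x_n\right)^2
```

The first term does not depend on w, and the factor 1/(2σ²) is a positive constant, so the w that maximizes the likelihood is the w that minimizes the sum of squared residuals.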
Adjust μ and σ to see how the bell curve changes. The shaded regions show the 68-95-99.7 rule.
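Real data has many dimensions. Images have millions of pixels. An ML model might have thousands of parameters. We need distributions over vectors, not just scalars. The multivariate Gaussian (MVN) generalizes the bell curve to D dimensions:

$$\mathcal{N}(x \mid \mu, \Sigma) = \frac{1}{(2\pi)^{D/2}\,|\Sigma|^{1/2}} \exp\left(-\tfrac{1}{2}(x-\mu)^\top \Sigma^{-1}(x-\mu)\right)$$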
The mean vector μ gives the center. The covariance matrix Σ encodes both the spread and the correlations between dimensions. Its shape determines whether the distribution looks like a sphere, an axis-aligned ellipse, or a tilted ellipse.
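Partition x into two blocks x₁ and x₂, with μ = (μ₁, μ₂) and Σ split into blocks Σ₁₁, Σ₁₂, Σ₂₁, Σ₂₂. The conditional distribution p(x₁ | x₂) is again Gaussian:

$$p(x_1 \mid x_2) = \mathcal{N}\left(x_1 \mid \mu_{1|2}, \Sigma_{1|2}\right), \quad \mu_{1|2} = \mu_1 + \Sigma_{12}\Sigma_{22}^{-1}(x_2 - \mu_2), \quad \Sigma_{1|2} = \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}$$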
This formula is the heart of Gaussian processes, Kalman filters, and Bayesian linear regression.
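In two dimensions every block of the covariance matrix is a scalar, so the standard Gaussian conditioning formulas (mean shifted by Σ₁₂Σ₂₂⁻¹(x₂ − μ₂), covariance reduced by Σ₁₂Σ₂₂⁻¹Σ₂₁) need no matrix library. The sketch below is for that bivariate case only; the function name and the example numbers are assumptions made for illustration.

```python
def condition_bivariate(
    mu1: float, mu2: float, s11: float, s12: float, s22: float, x2: float
) -> tuple[float, float]:
    """Mean and variance of p(x1 | x2) for a 2-D Gaussian.

    s11 and s22 are the variances of x1 and x2; s12 is their covariance.
    """
    gain = s12 / s22                      # Sigma_12 * Sigma_22^{-1}
    cond_mean = mu1 + gain * (x2 - mu2)   # shift the mean toward the observation
    cond_var = s11 - gain * s12           # observing x2 can only shrink the variance
    return cond_mean, cond_var


# Unit variances and correlation 0.8 (illustrative values), observing x2 = 1.
rho = 0.8
cond_mean, cond_var = condition_bivariate(0.0, 0.0, 1.0, rho, 1.0, x2=1.0)
print(f"mean={cond_mean:.2f} variance={cond_var:.2f}")
```

With unit variances the conditional mean is ρ·x₂ and the conditional variance is 1 − ρ², so a stronger correlation means the observation tells us more about x₁. Note that the conditional variance does not depend on the observed value of x₂.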
Adjust the correlation ρ to see how the covariance matrix shapes the distribution. Click to condition on x₂ and see the conditional distribution.
A single Gaussian can model one cluster. But real data often has multiple clusters: digit images contain 10 classes, customer data has segments, genes form functional groups. A Gaussian Mixture Model (GMM) handles this by combining K Gaussians:

$$p(x) = \sum_{k=1}^{K} \pi_k\,\mathcal{N}(x \mid \mu_k, \Sigma_k)$$

Each component k has a mean μₖ, covariance Σₖ, and mixing weight πₖ (the πₖ are non-negative and sum to 1). The mixing weights tell us how likely each cluster is a priori.
Given a data point x, we can compute the responsibility, the posterior probability that x belongs to cluster k. Writing z for the hidden cluster label:

$$r_k(x) = p(z = k \mid x) = \frac{\pi_k\,\mathcal{N}(x \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j\,\mathcal{N}(x \mid \mu_j, \Sigma_j)}$$
This is just Bayes' rule again. The mixture weight is the prior, the Gaussian is the likelihood, and the responsibility is the posterior.
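To make the prior, likelihood, and posterior roles concrete, here is a minimal sketch of the responsibility computation for a one-dimensional mixture. The two-component parameters are made-up illustrative values, and the function names are not from the book.

```python
import math


def normal_pdf(x: float, mu: float, sigma: float) -> float:
    """Density of N(x | mu, sigma^2)."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))


def responsibilities(
    x: float, weights: list[float], means: list[float], sigmas: list[float]
) -> list[float]:
    """Posterior probability of each component given x (Bayes' rule)."""
    # prior (mixing weight) times likelihood (Gaussian density), per component
    joint = [w * normal_pdf(x, m, s) for w, m, s in zip(weights, means, sigmas)]
    evidence = sum(joint)  # p(x), the mixture density at x
    return [j / evidence for j in joint]


# Illustrative two-cluster mixture.
weights, means, sigmas = [0.5, 0.5], [-2.0, 2.0], [1.0, 1.0]

for x in (-2.0, 0.0, 1.0):
    r = responsibilities(x, weights, means, sigmas)
    print(x, [round(v, 3) for v in r])
```

A point midway between two equally weighted, equally wide clusters gets responsibility 0.5 for each; moving toward one cluster shifts the posterior toward it.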
This is where everything comes together. We have a coin with unknown bias θ. We place a Beta prior on θ, flip the coin repeatedly, and watch the posterior update in real time. This is Bayesian inference in its purest form.
The Beta-Binomial model is the textbook conjugate pair. The prior is Beta(α, β), the likelihood is Binomial, and the posterior is Beta(α + h, β + N − h), where h is the number of heads in N flips.
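Because the posterior stays in the Beta family, the whole update is two additions. The sketch below simulates flips from a coin with an assumed true bias and applies the conjugate update; the seed, the true bias, the prior Beta(1, 1), and the helper names are all illustrative choices.

```python
import random


def update_beta(alpha: float, beta: float, flips: list[int]) -> tuple[float, float]:
    """Conjugate update: Beta(alpha, beta) prior + flips (1 = heads) -> Beta posterior."""
    heads = sum(flips)
    tails = len(flips) - heads
    return alpha + heads, beta + tails


def beta_mean(alpha: float, beta: float) -> float:
    """Mean of a Beta(alpha, beta) distribution."""
    return alpha / (alpha + beta)


rng = random.Random(0)   # fixed seed so the run is reproducible
true_theta = 0.7         # assumed true coin bias for the simulation
flips = [1 if rng.random() < true_theta else 0 for _ in range(200)]

alpha_post, beta_post = update_beta(1.0, 1.0, flips)  # uniform Beta(1, 1) prior
print(f"posterior: Beta({alpha_post:.0f}, {beta_post:.0f}), "
      f"mean = {beta_mean(alpha_post, beta_post):.3f}")
```

Updating one flip at a time gives exactly the same posterior as updating on the whole batch, which is what makes the live, flip-by-flip demo cheap to run.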
Set the true coin bias and prior parameters. Click Flip to generate data and watch the posterior concentrate. Gray dashed = prior, teal = posterior, orange line = true θ.
We have seen Bernoulli, Binomial, Categorical, and Gaussian distributions. They seem different, but Murphy reveals they all belong to a single unified family: the exponential family.

$$p(x \mid \eta) = h(x)\,\exp\left(\eta^\top T(x) - A(\eta)\right)$$
Where:
| Symbol | Name | Role |
|---|---|---|
| η | Natural parameters | The canonical parameterization of the distribution |
| T(x) | Sufficient statistics | All the info the data provides about η |
| A(η) | Log partition function | Normalizer; its derivatives give the mean and variance |
| h(x) | Base measure | A scaling factor independent of η |
Every member of the exponential family has a beautiful property: the derivatives of the log partition function A(η) give the moments. The first derivative gives the mean of the sufficient statistics. The second derivative gives their covariance.
| Distribution | Natural param η | Sufficient stat T(x) | A(η) |
|---|---|---|---|
| Bernoulli | log(θ/(1−θ)) | x | log(1 + e^η) |
| Gaussian | (μ/σ², −1/(2σ²)) | (x, x²) | −η₁²/(4η₂) − ½ log(−2η₂) |
| Poisson | log(λ) | x | e^η |
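The moment property is easy to verify numerically for the Bernoulli row of the table. The sketch below differentiates A(η) = log(1 + e^η) by central finite differences and compares the results with the known mean θ and variance θ(1 − θ). The step size and the value of θ are arbitrary illustrative choices.

```python
import math


def log_partition(eta: float) -> float:
    """A(eta) for the Bernoulli: log(1 + e^eta)."""
    return math.log1p(math.exp(eta))


def first_derivative(f, x: float, h: float = 1e-3) -> float:
    """Central finite-difference approximation of f'(x)."""
    return (f(x + h) - f(x - h)) / (2 * h)


def second_derivative(f, x: float, h: float = 1e-3) -> float:
    """Central finite-difference approximation of f''(x)."""
    return (f(x + h) - 2 * f(x) + f(x - h)) / (h * h)


theta = 0.3                              # illustrative success probability
eta = math.log(theta / (1 - theta))      # natural parameter (the log-odds)

mean_from_A = first_derivative(log_partition, eta)   # should approximate theta
var_from_A = second_derivative(log_partition, eta)   # should approximate theta * (1 - theta)

print(f"A'(eta)  = {mean_from_A:.5f}  vs  theta          = {theta}")
print(f"A''(eta) = {var_from_A:.5f}  vs  theta(1-theta) = {theta * (1 - theta):.2f}")
```

Analytically, A′(η) is the sigmoid σ(η), which recovers θ from the log-odds, and A″(η) = σ(η)(1 − σ(η)).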
With many random variables, writing out the full joint distribution becomes intractable. If we have D binary variables, the joint has 2^D entries. For D = 100, that is about 1.3 × 10³⁰ entries, far more than any computer can store.
Probabilistic graphical models (PGMs) solve this by encoding conditional independence structure as a graph. Each node is a random variable. Each edge (or lack of edge) encodes a dependence (or independence) relationship. This lets us factorize the joint into a product of smaller terms.
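A quick count shows how much the factorization saves. The sketch below compares the number of free parameters in an unrestricted joint over D binary variables with the number needed when the variables form a first-order Markov chain, where each variable depends only on its predecessor. The chain structure is an assumption chosen for illustration; other graphs give other counts.

```python
def full_joint_params(d: int) -> int:
    """Free parameters in an unrestricted joint over d binary variables."""
    return 2**d - 1  # one probability per state, minus the sum-to-1 constraint


def chain_params(d: int) -> int:
    """Free parameters when p(x_1..x_d) = p(x_1) * prod_t p(x_t | x_{t-1})."""
    # p(x_1): 1 parameter; each p(x_t | x_{t-1}): 2 parameters (one per parent value)
    return 1 + 2 * (d - 1)


for d in (3, 10, 100):
    print(f"D={d}: full joint {full_joint_params(d)}, chain {chain_params(d)}")
```

For D = 100 the chain needs 199 parameters, while the unrestricted joint needs 2^100 − 1.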
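Two main types: directed graphical models (Bayesian networks), where the joint factorizes into a product of conditional distributions, one per node given its parents; and undirected graphical models (Markov random fields), where the joint is proportional to a product of potential functions defined over groups of connected nodes.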
PGMs unify a vast range of models. Hidden Markov Models, Kalman filters, mixture models, topic models, deep generative models — all can be expressed as graphical models with specific structure.
Probability is the foundation that every subsequent chapter of Murphy's book builds on. Here is the map of where each concept leads:
| Concept from this chapter | Where it leads in the book |
|---|---|
| Bayes' rule | Bayesian statistics (Ch 4), MAP estimation, Bayesian neural nets (Ch 13) |
| Gaussian distribution | Linear regression (Ch 11), GPs (Ch 17), Kalman filters |
| Bernoulli / sigmoid | Logistic regression (Ch 10), binary classification |
| Mixture models | EM algorithm (Ch 8), clustering (Ch 21), VAEs (Ch 20) |
| Exponential family | GLMs (Ch 12), conjugate priors, sufficient statistics |
| Graphical models | HMMs, Bayesian networks, deep generative models |
| Conditional Gaussians | Gaussian processes, Bayesian linear regression, sensor fusion |
"Probability theory is nothing but common sense reduced to calculation." — Pierre-Simon Laplace