Murphy, Chapters 2–3

Probability: The Language of Uncertainty

Every ML model is a probabilistic claim about data. Here we build the toolkit — from coin flips to multivariate Gaussians.

Prerequisites: Basic algebra + comfort with summation/integral notation. That's it.
11 Chapters · 5 Simulations · 11 Quizzes

Chapter 0: Why Probability?

You build a spam filter. It looks at an email and says "spam" or "not spam." But how confident is it? A patient walks into a clinic and tests positive for a rare disease. Should the doctor be alarmed? A self-driving car's camera sees something that might be a pedestrian. How certain must it be before it brakes?

In every case, we need to quantify uncertainty. Not just a binary answer, but a calibrated degree of belief. That is what probability gives us: a mathematical language for reasoning about things we don't know for sure.

The core idea of this book: Almost every model in machine learning is a probabilistic statement. Classification assigns class probabilities. Regression predicts a distribution over outputs. Even neural networks, behind the softmax, are computing conditional probability distributions. Understanding probability is not optional — it is the operating system of ML.

Murphy identifies two kinds of uncertainty:

Epistemic uncertainty (model uncertainty) comes from our ignorance. We don't have enough data, or our model is too simple. More data can reduce it.
Aleatoric uncertainty (data uncertainty) comes from inherent randomness. A fair coin has aleatoric uncertainty — no amount of data tells you the next flip.

This chapter covers the probabilistic toolkit that the entire book builds on: random variables, distributions, Bayes' rule, the Gaussian, and multivariate models. By the end, you will be able to build a live Bayesian inference engine.

Foundations: Random variables, PMF, PDF, CDF
Bayes' Rule: Prior × Likelihood → Posterior
Distributions: Bernoulli, Gaussian, Exponential Family
Multivariate: MVN, covariance, mixtures, graphical models

Why does machine learning need probability, rather than just hard yes/no predictions?

Chapter 1: Random Variables

Before we can do any math, we need the concept of a random variable. Think of it as a quantity whose value is uncertain. The temperature tomorrow. The label of an image. The number of heads in 10 coin flips. We describe the random variable by its probability distribution: a rule that assigns probabilities to each possible value or, for continuous quantities, to each range of values.

Discrete random variables take values from a countable set. We describe them with a probability mass function (PMF): p(X = x) gives the probability of each specific value. The probabilities must be non-negative and sum to 1.

p(X = x) ≥ 0  for all x,    ∑_x p(X = x) = 1

Continuous random variables take values in an uncountable set (like the real line). We describe them with a probability density function (PDF) p(x). The crucial difference: p(x) is not a probability. It is a density — probability per unit length. Probability lives in intervals:

P(a ≤ X ≤ b) = ∫_a^b p(x) dx
Key insight: For continuous variables, P(X = exactly 3.14159...) = 0. Probability only lives in intervals. A density can exceed 1 — what matters is that it integrates to 1 over the entire domain.
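The claim that a density can exceed 1 while still integrating to 1 can be checked numerically. The sketch below is only an illustration: the narrow Gaussian with σ = 0.1, the integration limits, and the helper names `gaussian_pdf` and `integrate` are assumptions chosen for this example, not part of the original page.

```python
import math

def gaussian_pdf(x, mu=0.0, sigma=1.0):
    """Density of N(mu, sigma^2) evaluated at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def integrate(f, a, b, n=10_000):
    """Midpoint-rule approximation of the integral of f over [a, b]."""
    h = (b - a) / n
    return h * sum(f(a + (i + 0.5) * h) for i in range(n))

sigma = 0.1  # narrow Gaussian, chosen only for illustration

def density(x):
    return gaussian_pdf(x, 0.0, sigma)

peak_density = density(0.0)                  # a density value, not a probability
total_mass = integrate(density, -1.0, 1.0)   # covers +/- 10 sigma
interval_prob = integrate(density, -0.1, 0.1)  # P(-sigma <= X <= sigma)

print(f"peak density      = {peak_density:.3f}")
print(f"total mass        = {total_mass:.4f}")
print(f"P(-0.1 <= X <= 0.1) = {interval_prob:.4f}")
```

The peak density comes out well above 1, yet the total mass is 1 and the probability of any interval stays between 0 and 1.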

The cumulative distribution function (CDF) unifies both cases: F(x) = P(X ≤ x). For discrete X it is a staircase; for continuous X it is a smooth curve rising from 0 to 1.

Key statistics of a distribution:

Statistic | Formula | Meaning
Mean μ | E[X] = ∑_x x · p(x) | Center of mass: the "expected" value
Variance σ² | E[(X − μ)²] | Spread around the mean
Mode | argmax_x p(x) | Most probable value

[Interactive figure: PMF vs PDF. Left: discrete Binomial(10, p). Right: continuous Gaussian(μ, σ). Sliders: p (default 0.50), μ (default 0.0), σ (default 1.0).]

Can a probability density function p(x) exceed 1?

Chapter 2: Bayes' Rule

You test positive for a rare disease. The test has a 99% sensitivity (true positive rate) and a 1% false positive rate. Should you panic? Most people's intuition says yes. But if the disease prevalence is 1 in 10,000, the math tells a very different story.

This is the domain of Bayes' rule — the formula that tells you how to update your beliefs in the face of evidence. It follows directly from the product rule of probability:

p(θ | D) = p(D | θ) · p(θ) / p(D)

Term | Name | Role
p(θ) | Prior | What you believed before seeing data
p(D | θ) | Likelihood | How probable the data is under each θ
p(θ | D) | Posterior | Updated belief after seeing data
p(D) | Evidence | Normalizing constant = ∫ p(D | θ) p(θ) dθ

The Bayesian recipe: Prior × Likelihood → Posterior. Your old beliefs, multiplied by how well they explain the data, give your new beliefs. More data concentrates the posterior, and with enough data even a poor prior gets overwhelmed, provided it assigns nonzero probability to the truth. The evidence p(D) just rescales the product so that the posterior integrates (or, for discrete θ, sums) to 1.

Back to the disease test. Let D = "positive test", θ = "has disease." The prior p(θ) = 0.0001. The likelihood p(D|θ) = 0.99. The false positive rate p(D|not θ) = 0.01. By Bayes' rule:

p(θ|D) = (0.99 × 0.0001) / (0.99 × 0.0001 + 0.01 × 0.9999) ≈ 0.0098

Less than 1%! Despite a "99% accurate" test, the low base rate means a positive result is still probably a false alarm. This is the base rate fallacy, and Bayes' rule is the antidote.

COVID-19 example from Murphy: In Section 2.3.1, Murphy works through Bayesian reasoning for COVID tests. With a 1% prevalence, even a test with 87.5% sensitivity and 97.5% specificity gives a posterior of only about 26% after a single positive test. This is why repeat testing matters — each test is another Bayesian update.
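The two worked examples above can be reproduced in a few lines. This is a minimal sketch: the function name `posterior_given_positive` is made up for illustration, and the repeat-test calculation assumes the two test results are conditionally independent given the true disease status, which real tests may not satisfy.

```python
def posterior_given_positive(prior, sensitivity, specificity):
    """P(disease | positive test) via Bayes' rule."""
    true_pos = sensitivity * prior                    # p(D | θ) p(θ)
    false_pos = (1.0 - specificity) * (1.0 - prior)   # p(D | not θ) p(not θ)
    return true_pos / (true_pos + false_pos)

# Rare disease: prevalence 1 in 10,000, 99% sensitivity, 1% false positive rate.
rare = posterior_given_positive(prior=0.0001, sensitivity=0.99, specificity=0.99)

# COVID numbers quoted above: 1% prevalence, 87.5% sensitivity, 97.5% specificity.
covid_once = posterior_given_positive(0.01, 0.875, 0.975)

# Second positive test: yesterday's posterior becomes today's prior
# (assumes the tests are conditionally independent given disease status).
covid_twice = posterior_given_positive(covid_once, 0.875, 0.975)

print(f"rare disease, one positive test: {rare:.4f}")
print(f"COVID, one positive test:        {covid_once:.3f}")
print(f"COVID, two positive tests:       {covid_twice:.3f}")
```

Feeding the first posterior back in as the prior is all that "another Bayesian update" means here; under the independence assumption the second positive result moves the belief far more than the first.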
[Interactive figure: Bayes' Rule: Disease Testing. Adjust prevalence, sensitivity, and specificity to see how the posterior probability of disease changes after a positive test. Sliders: prevalence (default 1.00%), sensitivity (default 95.0%), specificity (default 99.0%).]

A rare disease has prevalence 0.1%. A test has 95% sensitivity and 95% specificity. After one positive test, is the patient likely sick?

Chapter 3: The Bernoulli & Binomial

The simplest interesting distribution: a coin flip. A Bernoulli distribution models a single binary trial with probability θ of success:

Y ~ Ber(θ)    p(Y=1) = θ,   p(Y=0) = 1 − θ

The mean is θ and the variance is θ(1 − θ). Simple — but this is the foundation of logistic regression, Naive Bayes, and binary classification.

Now flip the coin N times. The number of heads follows a Binomial distribution:

Y ~ Bin(N, θ)    p(Y = k) = C(N, k) · θ^k · (1 − θ)^(N − k)
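As a quick check of the Binomial formula, the sketch below evaluates the PMF for N = 10 and θ = 0.5 (values chosen only for illustration) and computes the mean, variance, and mode directly from it. The helper name `binomial_pmf` is an assumption of this example.

```python
import math

def binomial_pmf(k, n, theta):
    """p(Y = k) for Y ~ Bin(n, theta)."""
    return math.comb(n, k) * theta**k * (1.0 - theta) ** (n - k)

n, theta = 10, 0.5
pmf = [binomial_pmf(k, n, theta) for k in range(n + 1)]

total = sum(pmf)                                     # a valid PMF sums to 1
mean = sum(k * p for k, p in enumerate(pmf))         # E[Y]
variance = sum((k - mean) ** 2 * p for k, p in enumerate(pmf))
mode = max(range(n + 1), key=lambda k: pmf[k])       # most probable count

print(f"sum = {total:.6f}, mean = {mean:.3f}, variance = {variance:.3f}, mode = {mode}")
```

The mean and variance match the standard closed forms Nθ and Nθ(1 − θ), which reduce to the Bernoulli values θ and θ(1 − θ) when N = 1.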

Murphy connects these to ML through the sigmoid function σ(a) = 1/(1 + e^(−a)), which maps any real number to the open interval (0, 1). This makes it perfect for parameterizing a Bernoulli: θ = σ(w^T x). This is exactly what logistic regression does (Chapter 10).

Why sigmoid? If you model the log-odds (logit) as a linear function of the input, log(θ/(1 − θ)) = w^T x, then solving for θ gives θ = σ(w^T x). The sigmoid is not arbitrary: it is the inverse of the canonical link function (the logit) for the Bernoulli in the exponential family.
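The inverse relationship between the sigmoid and the logit is easy to verify numerically. In the sketch below, the weight vector `w` and input `x` are hypothetical values picked for illustration, and the branch on the sign of `a` is just one common way to avoid overflow in `exp`.

```python
import math

def sigmoid(a):
    """σ(a) = 1 / (1 + e^(-a)), written to avoid overflow for large |a|."""
    if a >= 0:
        return 1.0 / (1.0 + math.exp(-a))
    e = math.exp(a)
    return e / (1.0 + e)

def logit(theta):
    """Log-odds log(θ / (1 - θ)), the inverse of the sigmoid."""
    return math.log(theta / (1.0 - theta))

w = [0.5, -1.0]   # hypothetical weights
x = [2.0, 0.5]    # hypothetical input
a = sum(wi * xi for wi, xi in zip(w, x))   # linear score w^T x
theta = sigmoid(a)                          # Bernoulli parameter p(Y = 1 | x)

print(f"w^T x = {a:.3f}, theta = {theta:.4f}, logit(theta) = {logit(theta):.3f}")
```

Applying `logit` to `theta` recovers the linear score, which is the sense in which logistic regression models the log-odds as linear in x.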

For multiple classes C > 2, we generalize to the Categorical distribution with the softmax function:

p(Y = c | x) = exp(a_c) / ∑_k exp(a_k)
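A minimal sketch of the softmax follows. Subtracting the largest logit before exponentiating is a standard stability trick and does not change the result; the example logits are arbitrary values chosen for illustration.

```python
import math

def softmax(logits):
    """Convert a list of real-valued logits into probabilities that sum to 1."""
    m = max(logits)                             # shift for numerical stability
    exps = [math.exp(a - m) for a in logits]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])     # three-class example
two_class = softmax([1.5, 0.0])      # two classes with logits (a, 0)

print("three classes:", [round(p, 4) for p in probs])
print("two classes:  ", [round(p, 4) for p in two_class])
```

With two classes and logits (a, 0), the first softmax output equals 1/(1 + e^(−a)), so the softmax reduces to the sigmoid in the binary case.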
[Interactive figure: Sigmoid & Softmax. Left: the sigmoid squashes any real number into (0, 1). Right: softmax converts a vector of logits to a probability distribution. Slider: input a (default 0.0).]

What does the sigmoid function do?

Chapter 4: The Gaussian Distribution

If you could only know one distribution, this would be it. The Gaussian (or normal) distribution is the workhorse of ML — it appears in linear regression, Kalman filters, variational autoencoders, Gaussian processes, and almost everywhere else.

N(x | μ, σ²) = (1 / √(2πσ²)) · exp(−(x − μ)² / (2σ²))

It has two parameters: the mean μ (where the bell is centered) and the variance σ² (how wide the bell is). The standard deviation σ is the square root of the variance.

Why is the Gaussian everywhere? The Central Limit Theorem says: if you sum many independent random effects, each with finite variance and none dominating the rest, the result is approximately Gaussian, whatever the original distributions were. Heights are approximately Gaussian because they result from many genetic and environmental factors. Measurement noise is often close to Gaussian because it aggregates many small error sources. This is why the Gaussian appears in so much of science and engineering.
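The Central Limit Theorem can be seen in a small simulation. The sketch below sums 30 Uniform(0, 1) draws (a distribution that looks nothing like a bell curve), standardizes the sum, and checks how much mass falls within one and two standard deviations. The choice of 30 terms, 20,000 repetitions, and the fixed seed are assumptions made for illustration.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable

def standardized_sum(n_terms):
    """Sum n_terms Uniform(0, 1) draws, then shift and scale to mean 0, variance 1."""
    s = sum(random.random() for _ in range(n_terms))
    mean = n_terms * 0.5                 # each uniform has mean 1/2
    std = (n_terms / 12.0) ** 0.5        # each uniform has variance 1/12
    return (s - mean) / std

samples = [standardized_sum(30) for _ in range(20_000)]

sample_mean = statistics.fmean(samples)
sample_std = statistics.pstdev(samples)
within_1 = sum(abs(z) <= 1 for z in samples) / len(samples)
within_2 = sum(abs(z) <= 2 for z in samples) / len(samples)

print(f"mean = {sample_mean:.3f}, std = {sample_std:.3f}")
print(f"within 1 std: {within_1:.3f}, within 2 std: {within_2:.3f}")
```

The fractions should land close to the Gaussian values of roughly 68% and 95%, even though each individual term is uniform.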

Key properties:

Property | Value
Mean | μ
Mode | μ (same as mean)
Variance | σ²
68-95-99.7 rule | About 68% within ±1σ, 95% within ±2σ, 99.7% within ±3σ
Entropy | ½ ln(2πeσ²), the maximum entropy for a given mean and variance
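The 68-95-99.7 rule and the entropy formula can both be evaluated with the standard library. The sketch relies on the standard identity P(|X − μ| ≤ kσ) = erf(k/√2) for a Gaussian; the function names are made up for this example.

```python
import math

def prob_within(k):
    """P(|X - mu| <= k * sigma) for any Gaussian, via the error function."""
    return math.erf(k / math.sqrt(2.0))

def gaussian_entropy(sigma):
    """Differential entropy of N(mu, sigma^2) in nats: 0.5 * ln(2 * pi * e * sigma^2)."""
    return 0.5 * math.log(2.0 * math.pi * math.e * sigma**2)

for k in (1, 2, 3):
    print(f"within {k} sigma: {prob_within(k):.4f}")

print(f"entropy at sigma = 1: {gaussian_entropy(1.0):.4f} nats")
```

The probabilities do not depend on μ or σ, which is why the rule can be stated once for every Gaussian. The entropy, by contrast, grows with σ.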

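For linear regression, we model the output as y = w^T x + ε, where ε ~ N(0, σ²). The likelihood is then p(y | x, w) = N(y | w^T x, σ²). Maximizing this likelihood over w is equivalent to minimizing squared error, because the negative log-likelihood equals the sum of squared residuals divided by 2σ², plus a constant that does not depend on w. That is why least squares is so natural.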
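The equivalence between maximum likelihood and least squares can be checked on a toy problem. The sketch below uses made-up one-dimensional data, a model y = w·x with no intercept, a noise level σ = 1 assumed known, and a brute-force grid search; all of these are simplifications chosen for illustration.

```python
import math

xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [0.1, 1.9, 4.2, 5.8, 8.1]   # made-up data, roughly y = 2x
sigma = 1.0                       # noise standard deviation, assumed known

def sse(w):
    """Sum of squared errors for the model y = w * x."""
    return sum((y - w * x) ** 2 for x, y in zip(xs, ys))

def nll(w):
    """Gaussian negative log-likelihood: a constant plus sse(w) / (2 * sigma^2)."""
    n = len(xs)
    return 0.5 * n * math.log(2.0 * math.pi * sigma**2) + sse(w) / (2.0 * sigma**2)

grid = [i / 1000 for i in range(1000, 3001)]   # candidate slopes in [1, 3]
w_sse = min(grid, key=sse)
w_nll = min(grid, key=nll)

# Closed-form least-squares slope for a no-intercept model.
w_closed = sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

print(f"argmin SSE = {w_sse}, argmin NLL = {w_nll}, closed form = {w_closed:.4f}")
```

Both objectives pick the same slope on the grid, and it agrees with the closed-form least-squares answer up to the grid resolution.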

[Interactive figure: The Gaussian Distribution. Adjust μ and σ to see how the bell curve changes. The shaded regions show the 68-95-99.7 rule. Sliders: μ (default 0.0), σ (default 1.0).]

Why does the Gaussian appear so often in practice?

Chapter 5: The Multivariate Gaussian

Real data has many dimensions. Images have millions of pixels. An ML model might have thousands of parameters. We need distributions over vectors, not just scalars. The multivariate Gaussian (MVN) generalizes the bell curve to D dimensions:

N(x | μ, Σ) = (2π)^(−D/2) |Σ|^(−1/2) exp(−½ (x − μ)^T Σ^(−1) (x − μ))

The mean vector μ gives the center. The covariance matrix Σ encodes both the spread and the correlations between dimensions. Its shape determines whether the distribution looks like a sphere, an axis-aligned ellipse, or a tilted ellipse.

The magic of Gaussians: Marginals of a Gaussian are Gaussian. Conditionals of a Gaussian are Gaussian. Linear transformations of a Gaussian are Gaussian. This closure is why Gaussians are so analytically tractable — and why they dominate probabilistic ML.

The conditional distribution p(x_1 | x_2) for a partitioned MVN is:

p(x_1 | x_2) = N(x_1 | μ_{1|2}, Σ_{1|2})
μ_{1|2} = μ_1 + Σ_12 Σ_22^(−1) (x_2 − μ_2)
Σ_{1|2} = Σ_11 − Σ_12 Σ_22^(−1) Σ_21

This formula is the heart of Gaussian processes, Kalman filters, and Bayesian linear regression.
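For two scalar variables the conditioning formula needs no matrix algebra. The sketch below applies the conditional mean given above together with the standard conditional covariance, Σ_{1|2} = Σ_11 − Σ_12 Σ_22^(−1) Σ_21, to a zero-mean, unit-variance pair with correlation ρ = 0.6. The observed value x_2 = 1.5 and the function name are assumptions chosen for illustration.

```python
def condition_bivariate(mu1, mu2, s11, s12, s22, x2):
    """Mean and variance of x1 | x2 for a bivariate Gaussian.

    s11 and s22 are the variances of x1 and x2; s12 is their covariance.
    """
    cond_mean = mu1 + (s12 / s22) * (x2 - mu2)   # shift toward the observation
    cond_var = s11 - s12 * s12 / s22             # never larger than s11
    return cond_mean, cond_var

rho = 0.6  # with unit variances, the covariance equals the correlation
cond_mean, cond_var = condition_bivariate(
    mu1=0.0, mu2=0.0, s11=1.0, s12=rho, s22=1.0, x2=1.5
)

print(f"x1 | x2 = 1.5  ~  N({cond_mean:.2f}, {cond_var:.2f})")
```

Two things are worth noticing: the conditional mean moves in proportion to how surprising x_2 is, and the conditional variance does not depend on the observed value at all, only on the covariance structure.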

[Interactive figure: 2D Gaussian Contours. Adjust the correlation ρ to see how the covariance matrix shapes the distribution. Click on the canvas to condition on x_2 and see the conditional distribution. Slider: correlation ρ (default 0.60).]

What happens when you condition a multivariate Gaussian on some of its variables?

Chapter 6: Mixture Models

A single Gaussian can model one cluster. But real data often has multiple clusters — digit images contain 10 classes, customer data has segments, genes form functional groups. A Gaussian Mixture Model (GMM) handles this by combining K Gaussians:

p(x) = ∑_{k=1}^{K} π_k · N(x | μ_k, Σ_k)

Each component k has a mean μ_k, covariance Σ_k, and mixing weight π_k (where the π_k are non-negative and sum to 1). The mixing weights tell us how likely each cluster is a priori.

Latent variables: We can think of a GMM as a two-step generative process. First, pick a cluster z ~ Cat(π). Then, draw a sample from that cluster's Gaussian: x | z = k ~ N(μ_k, Σ_k). The cluster assignment z is a latent variable: it is hidden and never directly observed. This idea of latent variables is foundational to VAEs, topic models, and much of unsupervised learning.

Given a data point x, we can compute the responsibility — the posterior probability that x belongs to cluster k:

r_k(x) = π_k N(x | μ_k, Σ_k) / ∑_j π_j N(x | μ_j, Σ_j)

This is just Bayes' rule again. The mixture weight is the prior, the Gaussian is the likelihood, and the responsibility is the posterior.
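Responsibilities are straightforward to compute in one dimension. The sketch below uses a hypothetical two-component mixture with equal weights, means at −2 and +2, and unit standard deviations; these numbers and the function names are assumptions for illustration only.

```python
import math

def gaussian_pdf(x, mu, sigma):
    """Density of a 1D Gaussian N(mu, sigma^2) at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def responsibilities(x, weights, means, sigmas):
    """Posterior probability of each component given x (Bayes' rule)."""
    joint = [pi * gaussian_pdf(x, mu, s)          # prior times likelihood
             for pi, mu, s in zip(weights, means, sigmas)]
    evidence = sum(joint)                          # p(x), the mixture density
    return [j / evidence for j in joint]

weights = [0.5, 0.5]    # hypothetical mixing weights
means = [-2.0, 2.0]     # hypothetical component means
sigmas = [1.0, 1.0]     # hypothetical component standard deviations

r_middle = responsibilities(0.0, weights, means, sigmas)
r_right = responsibilities(1.5, weights, means, sigmas)

print("x = 0.0:", [round(r, 4) for r in r_middle])
print("x = 1.5:", [round(r, 4) for r in r_right])
```

A point midway between the components is split evenly, while a point near one mean is assigned almost entirely to that component.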

In a GMM, what role does the mixing weight πk play?

Chapter 7: Bayesian Inference Updater

This is where everything comes together. We have a coin with unknown bias θ. We place a Beta prior on θ, flip the coin repeatedly, and watch the posterior update in real time. This is Bayesian inference in its purest form.

The Beta-Binomial model is the textbook conjugate pair. The prior is Beta(α, β), the likelihood is Binomial, and the posterior is Beta(α + h, β + N − h), where h is the number of heads in N flips.

Prior: θ ~ Beta(α, β)
Likelihood: h | θ ~ Binomial(N, θ)
Posterior: θ | h ~ Beta(α + h, β + N − h)
Conjugacy means convenience: When the prior and likelihood are "conjugate," the posterior has the same functional form as the prior. We just update the parameters. No integrals needed. The Beta-Binomial is the simplest example; the Gaussian-Gaussian and Dirichlet-Multinomial are others.
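Because the update only touches two numbers, the whole model fits in a few lines. The sketch below is a minimal illustration: the function names, the Beta(2, 2) prior, and the particular sequence of flips are assumptions made up for this example, and the mean and variance formulas are the standard ones for the Beta distribution.

```python
def beta_update(alpha, beta, heads, flips):
    """Posterior Beta parameters after observing `heads` in `flips` tosses."""
    return alpha + heads, beta + (flips - heads)

def beta_mean(alpha, beta):
    return alpha / (alpha + beta)

def beta_variance(alpha, beta):
    total = alpha + beta
    return alpha * beta / (total**2 * (total + 1.0))

prior = (2.0, 2.0)

# Batch update: 7 heads in 10 flips.
posterior = beta_update(*prior, heads=7, flips=10)

# Sequential update: one flip at a time (1 = heads, 0 = tails).
a, b = prior
for flip in [1, 1, 0, 1, 1, 1, 0, 1, 0, 1]:
    a, b = beta_update(a, b, heads=flip, flips=1)

print(f"batch posterior:      Beta{posterior}")
print(f"sequential posterior: Beta{(a, b)}")
print(f"posterior mean = {beta_mean(*posterior):.3f}")
print(f"variance: prior {beta_variance(*prior):.4f} -> posterior {beta_variance(*posterior):.4f}")
```

Updating flip by flip gives the same posterior as processing all the data at once, and the posterior variance shrinks as evidence accumulates.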
[Interactive figure: Live Bayesian Inference. Set the true coin bias and prior parameters. Click Flip to generate data and watch the posterior concentrate. Gray dashed = prior, teal = posterior, orange line = true θ. Sliders: true θ (default 0.70), prior α (default 2.0), prior β (default 2.0).]
What to observe: With a weak prior (small α, β near 1), the posterior quickly tracks the data. With a strong prior (large α, β), it takes many flips before the data overwhelms the prior. As N → ∞, the posterior always concentrates around the true θ — this is Bayesian consistency.
Prior sensitivity: Try setting a strong prior at α=20, β=2 (believing the coin is very biased toward heads) with a true θ=0.3 (actually biased toward tails). Watch how many flips it takes for the data to correct the prior. This demonstrates why strong priors need strong justification.
In the Beta-Binomial model, what happens to the posterior as you collect more data?

Chapter 8: The Exponential Family

We have seen Bernoulli, Binomial, Categorical, and Gaussian distributions. They seem different, but Murphy reveals they all belong to a single unified family: the exponential family.

p(x | η) = h(x) · exp(η^T T(x) − A(η))

Where:

Symbol | Name | Role
η | Natural parameters | The canonical parameterization of the distribution
T(x) | Sufficient statistics | All the info the data provides about η
A(η) | Log partition function | Normalizer; its derivatives give the mean and variance
h(x) | Base measure | A scaling factor independent of η

Why this matters: The exponential family gives us: (1) conjugate priors for free, since the conjugate prior always has the form p(η) ∝ exp(η^T ν − τ A(η)); (2) a unified approach to Generalized Linear Models (GLMs) in Chapter 12; (3) sufficient statistics, meaning we can compress an entire i.i.d. dataset into the sum of T(x_n) over the data points without losing any information about η.

Every member of the exponential family has a beautiful property: the derivatives of the log partition function A(η) give the moments. The first derivative gives the mean of the sufficient statistics. The second derivative gives their covariance.

E[T(x)] = ∇A(η)     Cov[T(x)] = ∇²A(η)

Distribution | Natural param η | Sufficient stat T(x) | A(η)
Bernoulli | log(θ/(1 − θ)) | x | log(1 + e^η)
Gaussian | (μ/σ², −1/(2σ²)) | (x, x²) | −η_1²/(4η_2) − ½ log(−2η_2)
Poisson | log(λ) | x | e^η
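The moment property can be verified numerically for the Bernoulli row, where A(η) = log(1 + e^η) and T(x) = x. The sketch below differentiates A by finite differences and compares the results with the known mean θ and variance θ(1 − θ). The step sizes and the choice θ = 0.3 are assumptions for illustration.

```python
import math

def log_partition(eta):
    """A(η) = log(1 + e^η) for the Bernoulli distribution."""
    return math.log1p(math.exp(eta))

def first_derivative(f, x, h=1e-4):
    """Central finite-difference approximation of f'(x)."""
    return (f(x + h) - f(x - h)) / (2.0 * h)

def second_derivative(f, x, h=1e-3):
    """Central finite-difference approximation of f''(x)."""
    return (f(x + h) - 2.0 * f(x) + f(x - h)) / (h * h)

theta = 0.3
eta = math.log(theta / (1.0 - theta))    # natural parameter (log-odds)

mean_from_A = first_derivative(log_partition, eta)    # should match θ
var_from_A = second_derivative(log_partition, eta)    # should match θ(1 - θ)

print(f"dA/dη   = {mean_from_A:.5f}   (θ        = {theta})")
print(f"d²A/dη² = {var_from_A:.5f}   (θ(1 - θ) = {theta * (1 - theta):.2f})")
```

The first derivative of A is exactly the sigmoid of η, which ties this property back to the Bernoulli parameterization used in logistic regression.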
What is the key advantage of the exponential family for Bayesian inference?

Chapter 9: Probabilistic Graphical Models

With many random variables, writing out the full joint distribution becomes intractable. If we have D binary variables, the joint has 2^D entries. For D = 100, that is about 1.3 × 10^30 entries, far more than could ever be stored or estimated from data.

Probabilistic graphical models (PGMs) solve this by encoding conditional independence structure as a graph. Each node is a random variable. Each edge (or lack of edge) encodes a dependence (or independence) relationship. This lets us factorize the joint into a product of smaller terms.

The key insight: Most real-world systems have sparse dependencies. A patient's headache (X) depends on whether they have the flu (Y), which depends on whether they were exposed (Z). But the headache doesn't directly depend on the exposure once we know the flu status. PGMs formalize this as the conditional independence statement X ⊥ Z | Y.

Two main types:

Directed graphs (Bayesian networks): edges are arrows showing causal or generative direction. The joint factors as p(x_1, ..., x_D) = ∏_d p(x_d | pa(x_d)), where pa(x_d) are the parents of node d.
Undirected graphs (Markov random fields): edges are undirected, encoding symmetric affinities. The joint factors over cliques: p(x) ∝ ∏_c ψ_c(x_c).
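The directed factorization can be made concrete with the exposure → flu → headache chain. In the sketch below every probability in the conditional tables is a made-up number for illustration; the point is only that the joint is a product of small local factors and that the conditional independence can be read off numerically.

```python
from itertools import product

# Hypothetical conditional probability tables (illustrative numbers only).
p_exposed = 0.1                                   # p(Z = 1)
p_flu_given_exposed = {True: 0.4, False: 0.05}    # p(Y = 1 | Z)
p_headache_given_flu = {True: 0.8, False: 0.1}    # p(X = 1 | Y)

def bern(p, value):
    """Probability of a binary outcome under Bernoulli(p)."""
    return p if value else 1.0 - p

def joint(exposed, flu, headache):
    """p(z, y, x) = p(z) * p(y | z) * p(x | y): one factor per node."""
    return (bern(p_exposed, exposed)
            * bern(p_flu_given_exposed[exposed], flu)
            * bern(p_headache_given_flu[flu], headache))

total = sum(joint(z, y, x) for z, y, x in product([True, False], repeat=3))

def p_headache_given(flu, exposed):
    """p(X = 1 | Y = flu, Z = exposed), computed from the joint."""
    num = joint(exposed, flu, True)
    den = joint(exposed, flu, True) + joint(exposed, flu, False)
    return num / den

print(f"joint sums to {total:.6f}")
print(f"p(headache | flu, exposed)     = {p_headache_given(True, True):.3f}")
print(f"p(headache | flu, not exposed) = {p_headache_given(True, False):.3f}")
```

Once flu status is known, exposure no longer changes the headache probability. The chain is also cheaper to specify: it needs 1 + 2 + 2 = 5 numbers, compared with 2³ − 1 = 7 for an unrestricted joint over three binary variables, and the gap grows exponentially with D.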

PGMs unify a vast range of models. Hidden Markov Models, Kalman filters, mixture models, topic models, deep generative models — all can be expressed as graphical models with specific structure.

What problem do probabilistic graphical models solve?

Chapter 10: Connections

Probability is the foundation that every subsequent chapter of Murphy's book builds on. Here is the map of where each concept leads:

Concept from this chapter | Where it leads in the book
Bayes' rule | Bayesian statistics (Ch 4), MAP estimation, Bayesian neural nets (Ch 13)
Gaussian distribution | Linear regression (Ch 11), GPs (Ch 17), Kalman filters
Bernoulli / sigmoid | Logistic regression (Ch 10), binary classification
Mixture models | EM algorithm (Ch 8), clustering (Ch 21), VAEs (Ch 20)
Exponential family | GLMs (Ch 12), conjugate priors, sufficient statistics
Graphical models | HMMs, Bayesian networks, deep generative models
Conditional Gaussians | Gaussian processes, Bayesian linear regression, sensor fusion

What we covered: Random variables (discrete and continuous), PMFs and PDFs, Bayes' rule (the engine of inference), the Bernoulli/Binomial/Categorical family, the Gaussian and multivariate Gaussian, mixture models, live Bayesian updating, the exponential family, and graphical models.

What comes next: Chapter 4 (Statistics) takes these distributions and asks: given data, how do we estimate the parameters? MLE finds the parameters that maximize the likelihood. MAP adds a prior. Full Bayesian inference computes the entire posterior. These are the three pillars of statistical learning.

"Probability theory is nothing but common sense reduced to calculation." — Pierre-Simon Laplace

Which concept from this chapter is the foundation of all Bayesian machine learning?