Ch 2–3: Probability

Chapter 0: Why Probability?

You build a spam filter. It looks at an email and says "spam" or "not spam." But how confident is it? A patient walks into a clinic and tests positive for a rare disease. Should the doctor be alarmed? A self-driving car's camera sees something that might be a pedestrian. How certain must it be before it brakes?

In every case, we need to quantify uncertainty. Not just a binary answer, but a calibrated degree of belief. That is what probability gives us: a mathematical language for reasoning about things we don't know for sure.

The core idea of this book: Almost every model in machine learning is a probabilistic statement. Classification assigns class probabilities. Regression predicts a distribution over outputs. Even neural networks, behind the softmax, are computing conditional probability distributions. Understanding probability is not optional — it is the operating system of ML.

Murphy identifies two kinds of uncertainty:

Epistemic uncertainty (model uncertainty) comes from our ignorance. We don't have enough data, or our model is too simple. More data can reduce it.

Aleatoric uncertainty (data uncertainty) comes from inherent randomness. A fair coin has aleatoric uncertainty — no amount of data tells you the next flip.

This chapter covers the probabilistic toolkit that the entire book builds on: random variables, distributions, Bayes' rule, the Gaussian, and multivariate models. By the end, you will be able to build a live Bayesian inference engine.

Foundations: random variables, PMF, PDF, CDF
↓
Bayes' Rule: Prior × Likelihood → Posterior
↓
Distributions: Bernoulli, Gaussian, Exponential Family
↓
Multivariate: MVN, covariance, mixtures, graphical models

Why does machine learning need probability, rather than just hard yes/no predictions?

(a) It makes computations faster
(b) It provides a principled framework for quantifying and reasoning about uncertainty
(c) It is only needed for random number generation

Chapter 1: Random Variables

Before we can do any math, we need the concept of a random variable. Think of it as a quantity whose value is uncertain. The temperature tomorrow. The label of an image. The number of heads in 10 coin flips. We describe the random variable by its probability distribution — a rule that assigns probabilities to each possible value.

Discrete random variables take values from a countable set. We describe them with a probability mass function (PMF): p(X = x) gives the probability of each specific value. All probabilities must be non-negative and sum to 1.

p(X = x) ≥ 0 for all x, ∑_x p(X = x) = 1

Continuous random variables take values in an uncountable set (like the real line). We describe them with a probability density function (PDF) p(x). The crucial difference: p(x) is not a probability. It is a density — probability per unit length. Probability lives in intervals:

P(a ≤ X ≤ b) = ∫_a^b p(x) dx

Key insight: For continuous variables, P(X = exactly 3.14159...) = 0. Probability only lives in intervals. A density can exceed 1 — what matters is that it integrates to 1 over the entire domain.
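To make the density-versus-probability distinction concrete, here is a minimal sketch using only the standard library. The helper names `gaussian_pdf` and `integrate` and the choice σ = 0.1 are illustrative assumptions, not part of the original text. It evaluates a narrow Gaussian density at its peak and then integrates it numerically.

```python
import math

def gaussian_pdf(x, mu=0.0, sigma=1.0):
    """Density of N(mu, sigma^2) evaluated at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))

def integrate(f, a, b, n=10_000):
    """Midpoint-rule approximation of the integral of f over [a, b]."""
    h = (b - a) / n
    return h * sum(f(a + (i + 0.5) * h) for i in range(n))

sigma = 0.1  # a narrow bell, chosen only for illustration

def narrow(x):
    return gaussian_pdf(x, 0.0, sigma)

peak = narrow(0.0)                    # density at the mean
total = integrate(narrow, -1.0, 1.0)  # covers +/- 10 sigma
prob = integrate(narrow, -0.1, 0.1)   # P(-0.1 <= X <= 0.1)

print(f"peak density     = {peak:.3f}")   # well above 1
print(f"total mass       = {total:.4f}")  # close to 1
print(f"P(|X| <= 0.1)    = {prob:.4f}")   # a genuine probability, below 1
```

The peak density is roughly 4, yet every interval probability stays at or below 1 because the region of high density is very narrow.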

The cumulative distribution function (CDF) unifies both cases: F(x) = P(X ≤ x). For discrete X it is a staircase; for continuous X it is a smooth curve rising from 0 to 1.

Key statistics of a distribution:

Statistic	Formula	Meaning
Mean μ	E[X] = ∑ x · p(x) (discrete) or ∫ x · p(x) dx (continuous)	Center of mass: the "expected" value
Variance σ²	E[(X − μ)²]	Spread around the mean
Mode	argmax_x p(x)	Most probable value
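The three statistics in the table can be computed directly from a PMF. The sketch below uses a hypothetical loaded six-sided die (the probabilities are made up for illustration) stored as a plain dictionary.

```python
# Hypothetical loaded die: value -> probability (made-up numbers).
pmf = {1: 0.1, 2: 0.1, 3: 0.1, 4: 0.1, 5: 0.2, 6: 0.4}

# A valid PMF is non-negative and sums to 1.
assert all(p >= 0 for p in pmf.values())
assert abs(sum(pmf.values()) - 1.0) < 1e-12

mean = sum(x * p for x, p in pmf.items())                    # E[X]
variance = sum((x - mean) ** 2 * p for x, p in pmf.items())  # E[(X - mu)^2]
mode = max(pmf, key=pmf.get)                                 # argmax_x p(x)

print(f"mean = {mean:.2f}, variance = {variance:.2f}, mode = {mode}")
```

For this die the mean (4.4) and the mode (6) differ, a reminder that "expected" and "most probable" are not the same thing.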

[Interactive figure: PMF vs PDF. Left: discrete Binomial(10, p). Right: continuous Gaussian(μ, σ). Sliders: p (Binomial), μ and σ (Gaussian).]

Can a probability density function p(x) exceed 1?

(a) Yes: it is a density, not a probability. Only the integral must equal 1.
(b) No: probabilities never exceed 1
(c) Only for multivariate distributions

Chapter 2: Bayes' Rule

You test positive for a rare disease. The test has a 99% sensitivity (true positive rate) and a 1% false positive rate. Should you panic? Most people's intuition says yes. But if the disease prevalence is 1 in 10,000, the math tells a very different story.

This is the domain of Bayes' rule — the formula that tells you how to update your beliefs in the face of evidence. It follows directly from the product rule of probability:

p(θ | D) = p(D | θ) · p(θ) / p(D)

Term	Name	Role
p(θ)	Prior	What you believed before seeing data
p(D | θ)	Likelihood	How probable the data is under each θ
p(θ | D)	Posterior	Updated belief after seeing data
p(D)	Evidence	Normalizing constant = ∫ p(D | θ) p(θ) dθ

The Bayesian recipe: Prior × Likelihood → Posterior. Your old beliefs, multiplied by how well they explain the data, give your new beliefs. More data concentrates the posterior, so a poorly chosen prior is eventually overwhelmed, as long as it gives nonzero probability to the truth. The evidence p(D) is the constant that makes the posterior sum (or integrate) to 1.

Back to the disease test. Let D = "positive test", θ = "has disease." The prior p(θ) = 0.0001. The likelihood p(D|θ) = 0.99. The false positive rate p(D|not θ) = 0.01. By Bayes' rule:

p(θ|D) = (0.99 × 0.0001) / (0.99 × 0.0001 + 0.01 × 0.9999) ≈ 0.0098

Less than 1%! Despite a "99% accurate" test, the low base rate means a positive result is still probably a false alarm. Ignoring the base rate in this way is the base rate fallacy, and Bayes' rule is the antidote.
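The calculation above is short enough to check in code. This is a minimal sketch; the function name `posterior_given_positive` is an illustrative choice, and the inputs are the numbers from the example.

```python
def posterior_given_positive(prevalence, sensitivity, false_positive_rate):
    """P(disease | positive test) via Bayes' rule for a binary hypothesis."""
    true_pos = sensitivity * prevalence                    # p(D | theta) * p(theta)
    false_pos = false_positive_rate * (1.0 - prevalence)   # p(D | not theta) * p(not theta)
    evidence = true_pos + false_pos                        # p(D)
    return true_pos / evidence

p = posterior_given_positive(
    prevalence=0.0001, sensitivity=0.99, false_positive_rate=0.01
)
print(f"{p:.4f}")  # 0.0098
```

The denominator is dominated by the false-positive term, which is why the posterior stays small.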

COVID-19 example from Murphy: In Section 2.3.1, Murphy works through Bayesian reasoning for COVID tests. With a 1% prevalence, even a test with 87.5% sensitivity and 97.5% specificity gives a posterior of only about 26% after a single positive test. This is why repeat testing matters — each test is another Bayesian update.
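Repeat testing can be sketched as a loop in which each posterior becomes the next prior. The code below uses the prevalence, sensitivity and specificity quoted above, and assumes (this is an added assumption, not something stated in the text) that test results are conditionally independent given the true infection status. The function name `update` is illustrative.

```python
def update(prior, sensitivity, specificity, positive=True):
    """One Bayesian update of P(infected) after a single test result."""
    if positive:
        like_inf, like_not = sensitivity, 1.0 - specificity
    else:
        like_inf, like_not = 1.0 - sensitivity, specificity
    numerator = like_inf * prior
    return numerator / (numerator + like_not * (1.0 - prior))

belief = 0.01   # prior: 1% prevalence
history = []
for result in [True, True]:  # two positive tests in a row
    # Assumes tests are conditionally independent given infection status.
    belief = update(belief, sensitivity=0.875, specificity=0.975, positive=result)
    history.append(belief)

for i, b in enumerate(history, start=1):
    print(f"after test {i}: P(infected) = {b:.3f}")
```

Under that independence assumption, a second positive result moves the belief from roughly a quarter to above 90%. Real tests often share failure modes, so the true gain may be smaller.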

[Interactive figure: Bayes' Rule: Disease Testing. Sliders for prevalence, sensitivity, and specificity show how the posterior probability of disease changes after a positive test.]

A rare disease has prevalence 0.1%. A test has 95% sensitivity and 95% specificity. After one positive test, is the patient likely sick?

(a) Yes: the test is 95% accurate
(b) No: the low base rate means the posterior is still under 2%
(c) It depends on the patient's age

Chapter 3: The Bernoulli & Binomial

The simplest interesting distribution: a coin flip. A Bernoulli distribution models a single binary trial with probability θ of success:

Y ~ Ber(θ):  p(Y=1) = θ,  p(Y=0) = 1 − θ

The mean is θ and the variance is θ(1 − θ). Simple — but this is the foundation of logistic regression, Naive Bayes, and binary classification.

Now flip the coin N times. The number of heads follows a Binomial distribution:

Y ~ Bin(N, θ):  p(Y=k) = C(N,k) · θ^k · (1−θ)^(N−k)
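As a quick sanity check of the Binomial PMF, the sketch below evaluates it for N = 10 and θ = 0.5 (values chosen for illustration) and confirms that it sums to 1. It also compares the mean and variance with the standard closed forms Nθ and Nθ(1 − θ), which the text does not state but which follow from summing N independent Bernoulli trials.

```python
import math

def binomial_pmf(k, n, theta):
    """p(Y = k) for Y ~ Bin(n, theta)."""
    return math.comb(n, k) * theta**k * (1 - theta) ** (n - k)

n, theta = 10, 0.5
pmf = [binomial_pmf(k, n, theta) for k in range(n + 1)]

total = sum(pmf)
mean = sum(k * p for k, p in enumerate(pmf))
variance = sum((k - mean) ** 2 * p for k, p in enumerate(pmf))

print(f"sum = {total:.6f}")
print(f"mean = {mean:.3f} (N*theta = {n * theta:.3f})")
print(f"variance = {variance:.3f} (N*theta*(1-theta) = {n * theta * (1 - theta):.3f})")
```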

Murphy connects these to ML through the sigmoid function σ(a) = 1/(1 + e^(−a)), which maps any real number to the open interval (0, 1). This makes it well suited to parameterizing a Bernoulli: θ = σ(w^Tx). This is exactly what logistic regression does (Chapter 10 of Murphy's book).

Why sigmoid? If you model the log-odds (logit) as a linear function of the input, log(θ/(1−θ)) = w^Tx, then inverting gives θ = σ(w^Tx). The sigmoid is not arbitrary: it is the inverse of the logit, which is the canonical link function for the Bernoulli in the exponential family.
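The inverse relationship between the logit and the sigmoid is easy to verify numerically. The sketch below is one common way to write a numerically stable sigmoid. The branch on the sign of the input is an implementation choice added here to avoid overflow, not something from the text.

```python
import math

def sigmoid(a):
    """Logistic function 1 / (1 + exp(-a)), written to avoid overflow."""
    if a >= 0:
        return 1.0 / (1.0 + math.exp(-a))
    z = math.exp(a)          # safe: a < 0, so exp(a) < 1
    return z / (1.0 + z)

def logit(theta):
    """Log-odds of a probability theta in (0, 1)."""
    return math.log(theta / (1.0 - theta))

for a in [-3.0, 0.0, 1.7]:
    theta = sigmoid(a)
    print(f"a = {a:+.1f}  sigmoid = {theta:.4f}  logit(sigmoid) = {logit(theta):+.4f}")
```

Applying `logit` to the output of `sigmoid` recovers the original input up to floating-point error.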

For multiple classes C > 2, we generalize to the Categorical distribution with the softmax function:

p(Y=c | x) = exp(a_c) / ∑_k exp(a_k)
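A minimal softmax sketch follows. Subtracting the maximum logit before exponentiating is a standard numerical-stability trick added here; it does not change the result because the common factor cancels in the ratio. The example also checks that with two classes and logits (a, 0), softmax reduces to the sigmoid of a.

```python
import math

def softmax(logits):
    """Convert a list of real-valued logits into probabilities."""
    m = max(logits)                             # shift for numerical stability
    exps = [math.exp(a - m) for a in logits]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])   # illustrative logits
print([round(p, 3) for p in probs], "sum =", round(sum(probs), 6))

# Two-class special case: softmax([a, 0])[0] equals 1 / (1 + exp(-a)).
a = 1.3
two_class = softmax([a, 0.0])[0]
print(round(two_class, 6), round(1.0 / (1.0 + math.exp(-a)), 6))
```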

[Interactive figure: Sigmoid & Softmax. Left: the sigmoid squashes any real number into the unit interval. Right: softmax converts a vector of logits to a probability distribution. Slider: input a.]

What does the sigmoid function do?

(a) It doubles the input
(b) It maps any real number to the range (0, 1), making it suitable for parameterizing a Bernoulli
(c) It computes the log of the input

Chapter 4: The Gaussian Distribution

If you could only know one distribution, this would be it. The Gaussian (or normal) distribution is the workhorse of ML — it appears in linear regression, Kalman filters, variational autoencoders, Gaussian processes, and almost everywhere else.

N(x | μ, σ²) = (1 / √(2πσ²)) · exp(−(x − μ)² / (2σ²))

It has two parameters: the mean μ (where the bell is centered) and the variance σ² (how wide the bell is). The standard deviation σ is the square root of the variance.

Why is the Gaussian everywhere? The Central Limit Theorem says: if you sum many small independent random effects, each with finite variance, the suitably normalized result is approximately Gaussian, whatever the original distributions. Heights are approximately Gaussian because they result from many genetic and environmental factors. Measurement noise is often close to Gaussian because it aggregates many small error sources. This is why the Gaussian appears in so much of science and engineering.
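A small simulation can illustrate the Central Limit Theorem. The sketch below sums 30 independent Uniform(0, 1) draws many times and measures how often the sum lands within one and two standard deviations of its mean. The number of terms, the sample size, and the seed are arbitrary choices made for this illustration.

```python
import random

rng = random.Random(0)   # fixed seed so the run is repeatable
n_terms = 30             # effects added together per sample
n_samples = 20_000

# Each sample is a sum of n_terms independent Uniform(0, 1) draws.
sums = [sum(rng.random() for _ in range(n_terms)) for _ in range(n_samples)]

mu = n_terms * 0.5               # mean of the sum
sd = (n_terms / 12.0) ** 0.5     # std of the sum (Uniform(0,1) has variance 1/12)

within_1sd = sum(abs(s - mu) <= sd for s in sums) / n_samples
within_2sd = sum(abs(s - mu) <= 2 * sd for s in sums) / n_samples

print(f"within 1 sd: {within_1sd:.3f}")
print(f"within 2 sd: {within_2sd:.3f}")
```

Although a single uniform draw looks nothing like a bell curve, the fractions should land near the Gaussian values of about 0.68 and 0.95.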

Key properties:

Property	Value
Mean	μ
Mode	μ (same as mean)
Variance	σ²
68-95-99.7 rule	68% within ±1σ, 95% within ±2σ, 99.7% within ±3σ
Entropy	½ ln(2πeσ²) — maximum entropy for given mean and variance
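The 68-95-99.7 rule in the table can be reproduced with the error function: for any Gaussian, P(|X − μ| ≤ kσ) = erf(k/√2). The sketch below assumes nothing beyond the standard library; the helper name `prob_within` is illustrative.

```python
import math

def prob_within(k):
    """P(|X - mu| <= k * sigma) for a Gaussian, independent of mu and sigma."""
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    print(f"within {k} sigma: {prob_within(k):.4f}")
# within 1 sigma: 0.6827
# within 2 sigma: 0.9545
# within 3 sigma: 0.9973
```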

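For regression, we model the output as y = f(x) + ε, where ε ~ N(0, σ²). With a linear model f(x) = w^Tx, the likelihood is p(y | x, w) = N(y | w^Tx, σ²). Maximizing this likelihood over w is equivalent to minimizing squared error, because the negative log-likelihood is the sum of squared residuals divided by 2σ², plus a constant that does not depend on w. That is why least squares is so natural.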
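The equivalence between maximum likelihood and least squares can be checked on a toy problem. The sketch below uses made-up one-dimensional data and a single weight w with no intercept (both simplifying assumptions). It finds w by the closed-form least-squares formula and, separately, by minimizing the Gaussian negative log-likelihood over a grid.

```python
import math

xs = [0.0, 1.0, 2.0, 3.0, 4.0]   # made-up inputs
ys = [0.1, 1.9, 4.2, 5.8, 8.1]   # made-up targets, roughly y = 2x
sigma = 1.0                      # assumed known noise level

def sse(w):
    """Sum of squared residuals for the model y = w * x."""
    return sum((y - w * x) ** 2 for x, y in zip(xs, ys))

def nll(w):
    """Negative log-likelihood under y ~ N(w * x, sigma^2)."""
    n = len(xs)
    return 0.5 * n * math.log(2 * math.pi * sigma**2) + sse(w) / (2 * sigma**2)

# Closed-form least squares for a single weight: w = sum(x*y) / sum(x*x).
w_ls = sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

# Maximum likelihood by brute-force grid search over w in [0, 4].
grid = [i / 1000 for i in range(4001)]
w_mle = min(grid, key=nll)

print(f"least squares w = {w_ls:.4f}")
print(f"grid MLE      w = {w_mle:.4f}")
```

The two estimates agree up to the grid resolution, since `nll` and `sse` differ only by a positive scale factor and an additive constant.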

[Interactive figure: The Gaussian Distribution. Sliders for μ and σ reshape the bell curve; shaded regions show the 68-95-99.7 rule.]

Why does the Gaussian appear so often in practice?

(a) The Central Limit Theorem: sums of many independent effects converge to a Gaussian
(b) It is the simplest distribution to compute
(c) It was mandated by a standards committee

Chapter 5: The Multivariate Gaussian

Real data has many dimensions. Images have millions of pixels. An ML model might have thousands of parameters. We need distributions over vectors, not just scalars. The multivariate Gaussian (MVN) generalizes the bell curve to D dimensions:

N(x | μ, Σ) = (2π)^(−D/2) |Σ|^(−1/2) exp(−½ (x − μ)^T Σ⁻¹ (x − μ))

The mean vector μ gives the center. The covariance matrix Σ encodes both the spread and the correlations between dimensions. Its shape determines whether the distribution looks like a sphere, an axis-aligned ellipse, or a tilted ellipse.

The magic of Gaussians: Marginals of a Gaussian are Gaussian. Conditionals of a Gaussian are Gaussian. Linear transformations of a Gaussian are Gaussian. This closure is why Gaussians are so analytically tractable — and why they dominate probabilistic ML.

The conditional distribution p(x₁ | x₂) for a partitioned MVN is:

p(x₁ | x₂) = N(x₁ | μ_{1|2}, Σ_{1|2})

μ_{1|2} = μ₁ + Σ₁₂ Σ₂₂⁻¹ (x₂ − μ₂)

Σ_{1|2} = Σ₁₁ − Σ₁₂ Σ₂₂⁻¹ Σ₂₁

This formula is the heart of Gaussian processes, Kalman filters, and Bayesian linear regression.
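For a two-dimensional Gaussian the conditioning formula reduces to scalar arithmetic. The sketch below applies the conditional mean given above, and for the conditional variance uses the standard expression Σ₁₁ − Σ₁₂Σ₂₂⁻¹Σ₂₁. The function name and the example numbers (zero means, unit variances, ρ = 0.6) are illustrative assumptions.

```python
def condition_bivariate(mu1, mu2, s1, s2, rho, x2):
    """Mean and variance of x1 | x2 for a 2D Gaussian.

    Covariance blocks: Sigma11 = s1^2, Sigma22 = s2^2, Sigma12 = rho * s1 * s2.
    """
    sigma11 = s1**2
    sigma22 = s2**2
    sigma12 = rho * s1 * s2
    cond_mean = mu1 + sigma12 / sigma22 * (x2 - mu2)
    cond_var = sigma11 - sigma12 * sigma12 / sigma22
    return cond_mean, cond_var

m, v = condition_bivariate(mu1=0.0, mu2=0.0, s1=1.0, s2=1.0, rho=0.6, x2=1.0)
print(f"mean = {m:.2f}, variance = {v:.2f}")  # mean = 0.60, variance = 0.64
```

Observing x₂ shifts the mean of x₁ in proportion to the correlation. The variance shrinks from 1 to 1 − ρ², regardless of the observed value.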

[Interactive figure: 2D Gaussian Contours. A slider for the correlation ρ shows how the covariance matrix shapes the distribution; clicking the canvas conditions on x₂ and displays the conditional distribution.]

What happens when you condition a multivariate Gaussian on some of its variables?

(a) You get another Gaussian, with a shifted mean and reduced variance
(b) You get a uniform distribution
(c) The result is always the same as the marginal

Chapter 6: Mixture Models

A single Gaussian can model one cluster. But real data often has multiple clusters — digit images contain 10 classes, customer data has segments, genes form functional groups. A Gaussian Mixture Model (GMM) handles this by combining K Gaussians:

p(x) = ∑_{k=1}^{K} π_k · N(x | μ_k, Σ_k)

Each component k has a mean μ_k, covariance Σ_k, and mixing weight π_k (where all π_k sum to 1). The mixing weights tell us how likely each cluster is a priori.

Latent variables: We can think of a GMM as a two-step generative process. First, pick a cluster z ~ Cat(π). Then, draw a sample from that cluster's Gaussian: x | z=k ~ N(μ_k, Σ_k). The cluster assignment z is a latent variable — hidden, never directly observed. This idea of latent variables is foundational to VAEs, topic models, and much of unsupervised learning.
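The two-step generative process can be sketched for a one-dimensional mixture. The weights, means, standard deviations and seed below are made up for illustration, and `sample_gmm` is a hypothetical helper name.

```python
import random

def sample_gmm(n, weights, means, sigmas, seed=0):
    """Draw n (z, x) pairs from a 1D Gaussian mixture."""
    rng = random.Random(seed)
    components = list(range(len(weights)))
    samples = []
    for _ in range(n):
        z = rng.choices(components, weights=weights)[0]  # step 1: z ~ Cat(pi)
        x = rng.gauss(means[z], sigmas[z])               # step 2: x | z ~ N(mu_z, sigma_z^2)
        samples.append((z, x))
    return samples

samples = sample_gmm(10_000, weights=[0.3, 0.7], means=[-2.0, 3.0], sigmas=[0.5, 1.0])
frac_cluster_1 = sum(z == 1 for z, _ in samples) / len(samples)
print(f"fraction drawn from component 1: {frac_cluster_1:.3f}")
```

In real data only the x values would be observed; the z values are kept here purely to confirm that each component is chosen about as often as its mixing weight.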

Given a data point x, we can compute the responsibility — the posterior probability that x belongs to cluster k:

r_k(x) = π_k N(x | μ_k, Σ_k) / ∑_j π_j N(x | μ_j, Σ_j)

This is just Bayes' rule again. The mixture weight is the prior, the Gaussian is the likelihood, and the responsibility is the posterior.
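The responsibility formula is a direct application of Bayes' rule, as the sketch below shows for a one-dimensional, two-component mixture. All parameter values are made up for illustration.

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))

def responsibilities(x, weights, means, sigmas):
    """Posterior probability of each component given the point x."""
    joint = [w * normal_pdf(x, m, s)            # prior * likelihood
             for w, m, s in zip(weights, means, sigmas)]
    evidence = sum(joint)                       # p(x), the normalizer
    return [j / evidence for j in joint]

weights, means, sigmas = [0.5, 0.5], [-2.0, 2.0], [1.0, 1.0]

r_mid = responsibilities(0.0, weights, means, sigmas)   # halfway between the clusters
r_right = responsibilities(2.0, weights, means, sigmas) # at the second cluster's mean

print([round(r, 4) for r in r_mid])
print([round(r, 4) for r in r_right])
```

A point midway between two equally weighted, equally wide clusters is split evenly, while a point at one cluster's mean is assigned to it almost entirely.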

In a GMM, what role does the mixing weight π_k play?

(a) It is the prior probability that a data point belongs to component k
(b) It controls the variance of component k
(c) It is the likelihood of the data under component k

Chapter 7: Bayesian Inference Updater

This is where everything comes together. We have a coin with unknown bias θ. We place a Beta prior on θ, flip the coin repeatedly, and watch the posterior update in real time. This is Bayesian inference in its purest form.

The Beta-Binomial model is the textbook conjugate pair. The prior is Beta(α, β), the likelihood is Binomial, and the posterior is Beta(α + h, β + N − h), where h is the number of heads in N flips.

Prior: θ ~ Beta(α, β)

Likelihood: h | θ ~ Binomial(N, θ)

Posterior: θ | h ~ Beta(α + h, β + N − h)

Conjugacy means convenience: When the prior and likelihood are "conjugate," the posterior has the same functional form as the prior. We just update the parameters. No integrals needed. The Beta-Binomial is the simplest example; the Gaussian-Gaussian and Dirichlet-Multinomial are others.
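Because the update only adds counts to the prior parameters, the whole Beta-Binomial model fits in a few lines. The sketch below simulates flips from a coin with an assumed true bias of 0.7 and a Beta(2, 2) prior (illustrative values), then checks that updating one flip at a time gives the same posterior as one batch update.

```python
import random

def beta_update(alpha, beta, heads, flips):
    """Posterior Beta parameters after observing `heads` in `flips` tosses."""
    return alpha + heads, beta + flips - heads

def beta_mean(alpha, beta):
    """Mean of a Beta(alpha, beta) distribution."""
    return alpha / (alpha + beta)

rng = random.Random(0)
true_theta = 0.7                      # assumed true bias, unknown to the model
alpha, beta = 2.0, 2.0                # prior Beta(2, 2)

flips = [rng.random() < true_theta for _ in range(1000)]
heads = sum(flips)

# Batch update: all data at once.
post_a, post_b = beta_update(alpha, beta, heads, len(flips))

# Sequential update: one flip at a time, posterior becomes the next prior.
seq_a, seq_b = alpha, beta
for f in flips:
    seq_a, seq_b = beta_update(seq_a, seq_b, int(f), 1)

print(f"heads = {heads}, posterior = Beta({post_a}, {post_b})")
print(f"posterior mean = {beta_mean(post_a, post_b):.3f}")
print("sequential matches batch:", (seq_a, seq_b) == (post_a, post_b))
```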

[Interactive figure: Live Bayesian Inference. Controls: true θ, prior α, prior β, and a Flip button that generates data. Gray dashed = prior, teal = posterior, orange line = true θ.]

What to observe: With a weak prior (small α, β near 1), the posterior quickly tracks the data. With a strong prior (large α, β), it takes many flips before the data overwhelms the prior. As N → ∞, the posterior always concentrates around the true θ — this is Bayesian consistency.

Prior sensitivity: Try setting a strong prior at α=20, β=2 (believing the coin is very biased toward heads) with a true θ=0.3 (actually biased toward tails). Watch how many flips it takes for the data to correct the prior. This demonstrates why strong priors need strong justification.

In the Beta-Binomial model, what happens to the posterior as you collect more data?

(a) It concentrates around the true parameter value, regardless of the prior
(b) It always stays equal to the prior
(c) It becomes uniform

Chapter 8: The Exponential Family

We have seen Bernoulli, Binomial, Categorical, and Gaussian distributions. They seem different, but Murphy reveals they all belong to a single unified family: the exponential family.

p(x | η) = h(x) · exp(η^T T(x) − A(η))

Where:

Symbol	Name	Role
η	Natural parameters	The canonical parameterization of the distribution
T(x)	Sufficient statistics	All the info the data provides about η
A(η)	Log partition function	Normalizer; its derivatives give the mean and variance
h(x)	Base measure	A scaling factor independent of η

Why this matters: The exponential family gives us: (1) conjugate priors for free: the conjugate prior always has the form p(η) ∝ exp(η^Tν − τ A(η)); (2) a unified approach to Generalized Linear Models (GLMs), covered in Chapter 12 of Murphy's book; (3) sufficient statistics, meaning we can compress the entire dataset into the sum of T(x) over the data points without losing any information about η.

Every member of the exponential family has a beautiful property: the derivatives of the log partition function A(η) give the moments. The first derivative gives the mean of the sufficient statistics. The second derivative gives their covariance.

E[T(x)] = ∇A(η)

Cov[T(x)] = ∇²A(η)

Distribution	Natural param η	Sufficient stat T(x)	A(η)
Bernoulli	log(θ/(1−θ))	x	log(1 + e^η)
Gaussian	(μ/σ², −1/(2σ²))	(x, x²)	−η₁²/(4η₂) − ½log(−2η₂)
Poisson	log(λ)	x	e^η
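The moment property can be checked numerically for the Bernoulli row of the table. The sketch below differentiates A(η) = log(1 + e^η) with central finite differences (an approximation chosen for simplicity, with an arbitrary step size) and compares the results with the known mean θ and variance θ(1 − θ).

```python
import math

def log_partition(eta):
    """A(eta) for the Bernoulli: log(1 + exp(eta))."""
    return math.log1p(math.exp(eta))

def first_derivative(f, x, h=1e-3):
    """Central-difference approximation of f'(x)."""
    return (f(x + h) - f(x - h)) / (2 * h)

def second_derivative(f, x, h=1e-3):
    """Central-difference approximation of f''(x)."""
    return (f(x + h) - 2 * f(x) + f(x - h)) / (h * h)

theta = 0.3                               # illustrative success probability
eta = math.log(theta / (1 - theta))       # natural parameter (log-odds)

mean_from_A = first_derivative(log_partition, eta)
var_from_A = second_derivative(log_partition, eta)

print(f"dA/deta   = {mean_from_A:.5f}   (theta = {theta})")
print(f"d2A/deta2 = {var_from_A:.5f}   (theta*(1-theta) = {theta * (1 - theta):.5f})")
```

The first derivative of A is the sigmoid of η, which is why the sigmoid shows up whenever a Bernoulli is written in natural-parameter form.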

What is the key advantage of the exponential family for Bayesian inference?

(a) It makes distributions discrete
(b) It guarantees the existence of conjugate priors, enabling closed-form posterior updates
(c) It makes all distributions Gaussian

Chapter 9: Probabilistic Graphical Models

With many random variables, writing out the full joint distribution becomes intractable. If we have D binary variables, the joint has 2^D entries. For D = 100, that is about 1.3 × 10^30 entries, far more than any computer could store or any dataset could estimate.

Probabilistic graphical models (PGMs) solve this by encoding conditional independence structure as a graph. Each node is a random variable. Each edge (or lack of edge) encodes a dependence (or independence) relationship. This lets us factorize the joint into a product of smaller terms.

The key insight: Most real-world systems have sparse dependencies. A patient's headache depends on whether they have the flu, which depends on whether they were exposed, but the headache doesn't directly depend on the exposure once we know the flu status. PGMs formalize this as a conditional independence statement, X ⊥ Z | Y, where X is the headache, Y the flu status, and Z the exposure.

Two main types:

Directed graphs (Bayesian networks): edges are arrows showing causal or generative direction. The joint factors as p(x₁, ..., x_D) = ∏_d p(x_d | pa(x_d)), where pa(x_d) are the parents of node d.

Undirected graphs (Markov random fields): edges are undirected, encoding symmetric affinities. The joint factors over cliques: p(x) ∝ ∏_c ψ_c(x_c).
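The directed factorization can be illustrated with a three-node chain, Exposure → Flu → Headache. The conditional probability tables below are made-up numbers. The sketch builds the joint from the factors, confirms it sums to 1, and checks that the headache is independent of exposure once flu status is known.

```python
from itertools import product

# Hypothetical conditional probability tables (all numbers are made up).
p_e = {1: 0.2, 0: 0.8}                                      # p(E = e)
p_f_given_e = {1: {1: 0.5, 0: 0.5}, 0: {1: 0.05, 0: 0.95}}  # p_f_given_e[e][f]
p_h_given_f = {1: {1: 0.8, 0: 0.2}, 0: {1: 0.1, 0: 0.9}}    # p_h_given_f[f][h]

def joint(e, f, h):
    """p(e, f, h) = p(e) * p(f | e) * p(h | f): product over nodes given parents."""
    return p_e[e] * p_f_given_e[e][f] * p_h_given_f[f][h]

total = sum(joint(e, f, h) for e, f, h in product([0, 1], repeat=3))

def p_headache_given(f, e):
    """P(H = 1 | F = f, E = e), computed from the joint."""
    return joint(e, f, 1) / (joint(e, f, 0) + joint(e, f, 1))

print(f"joint sums to {total:.6f}")
for f in (0, 1):
    print(f"flu={f}: P(H=1 | exposed) = {p_headache_given(f, 1):.3f}, "
          f"P(H=1 | not exposed) = {p_headache_given(f, 0):.3f}")

# Free parameters: full table over 3 binary variables vs. the chain factorization.
full_params = 2**3 - 1      # 7
chain_params = 1 + 2 + 2    # p(E), p(F | E) for each e, p(H | F) for each f
print(full_params, chain_params)
```

With three variables the saving is small (7 free parameters versus 5), but for a chain of D binary variables the factored form needs 1 + 2(D − 1) parameters instead of 2^D − 1.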

PGMs unify a vast range of models. Hidden Markov Models, Kalman filters, mixture models, topic models, deep generative models — all can be expressed as graphical models with specific structure.

What problem do probabilistic graphical models solve?

(a) They make the joint distribution tractable by encoding conditional independence as graph structure
(b) They make all variables independent
(c) They eliminate the need for probability entirely

Chapter 10: Connections

Probability is the foundation that every subsequent chapter of Murphy's book builds on. Here is the map of where each concept leads:

Concept from this chapter	Where it leads in the book
Bayes' rule	Bayesian statistics (Ch 4), MAP estimation, Bayesian neural nets (Ch 13)
Gaussian distribution	Linear regression (Ch 11), GPs (Ch 17), Kalman filters
Bernoulli / sigmoid	Logistic regression (Ch 10), binary classification
Mixture models	EM algorithm (Ch 8), clustering (Ch 21), VAEs (Ch 20)
Exponential family	GLMs (Ch 12), conjugate priors, sufficient statistics
Graphical models	HMMs, Bayesian networks, deep generative models
Conditional Gaussians	Gaussian processes, Bayesian linear regression, sensor fusion

What we covered: Random variables (discrete and continuous), PMFs and PDFs, Bayes' rule (the engine of inference), the Bernoulli/Binomial/Categorical family, the Gaussian and multivariate Gaussian, mixture models, live Bayesian updating, the exponential family, and graphical models.

What comes next: Chapter 4 (Statistics) takes these distributions and asks: given data, how do we estimate the parameters? MLE finds the parameters that maximize the likelihood. MAP adds a prior. Full Bayesian inference computes the entire posterior. These are the three pillars of statistical learning.

"Probability theory is nothing but common sense reduced to calculation." — Pierre-Simon Laplace

Which concept from this chapter is the foundation of all Bayesian machine learning?

(a) The sigmoid function
(b) Bayes' rule: Prior × Likelihood → Posterior
(c) The Central Limit Theorem

Probability: The Language of Uncertainty