Ch 2: Probability Distributions

Chapter 0: Why Distributions?

In Chapter 1, we treated the data as fixed and learned parameters. But where does the data come from? What assumptions are we making about how it was generated? To answer these questions, we need a language for describing random processes. That language is probability distributions.

A probability distribution is a mathematical recipe for generating data. If we believe our data was generated by some distribution, then learning amounts to figuring out which distribution (or which parameters of that distribution) best explains what we've observed.

The generative story: Behind every dataset, there is (at least conceptually) a process that generated it. "Nature picked parameters, then generated data from those parameters." Our job is to run this story in reverse: observe data, infer parameters. The distributions in this chapter are the building blocks for these generative stories.

This chapter covers a progression of increasingly rich distributions:

Distribution	What it models	Example
Bernoulli	Binary outcomes	Coin flips, spam/not-spam
Binomial	Count of successes in N trials	Number of heads in 10 flips
Beta	Prior over Bernoulli parameter	Belief about coin bias
Multinomial	Outcomes with K categories	Dice rolls, word counts
Dirichlet	Prior over multinomial parameters	Belief about dice bias
Gaussian	Continuous, bell-shaped data	Measurement noise, heights

A unifying theme: each distribution has a natural conjugate prior — a prior distribution that, when combined with the likelihood via Bayes' theorem, yields a posterior in the same family. This makes Bayesian inference analytically tractable. Conjugacy is elegant mathematics, but it's also practically useful: it gives us closed-form updates instead of intractable integrals.

Check: What is a conjugate prior?

A prior that, combined with the likelihood, gives a posterior in the same distributional family
Any prior distribution with zero mean
The prior that maximizes the likelihood

Chapter 1: The Bernoulli Distribution

The simplest possible distribution: a single binary random variable x ∈ {0, 1}. A coin flip. The Bernoulli distribution has one parameter μ ∈ [0, 1], the probability of "heads" (x = 1):

Bern(x|μ) = μ^x (1 − μ)^{1−x}

Its mean is E[x] = μ and variance is var[x] = μ(1 − μ). Given N observations D = {x₁, …, x_N}, the likelihood is:

p(D|μ) = ∏_{n=1}^{N} μ^{x_n} (1 − μ)^{1−x_n} = μ^m (1 − μ)^{N−m}

where m = ∑ x_n is the number of "heads." The sufficient statistic is m — given m and N, the individual order of observations doesn't matter.

Maximizing the log-likelihood gives the maximum likelihood estimate:

μ_ML = m / N

The sample proportion. If you flip 3 heads in 3 tries, ML says μ = 1. The coin is definitely always heads. This is clearly absurd for small N — ML overfits with limited data.
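The overfitting is easy to reproduce. The following is a minimal Python sketch, not a reference implementation; the function names (`bernoulli_log_likelihood`, `bernoulli_mle`) and the sample data are illustrative assumptions.

```python
import math

def _count_times_log(count, p):
    """Return count * log(p), treating 0 * log(0) as 0."""
    if count == 0:
        return 0.0
    if p == 0.0:
        return -math.inf
    return count * math.log(p)

def bernoulli_log_likelihood(mu, data):
    """Log of p(D | mu) = mu^m (1 - mu)^(N - m) for binary data."""
    m = sum(data)   # number of heads, the sufficient statistic
    n = len(data)
    return _count_times_log(m, mu) + _count_times_log(n - m, 1.0 - mu)

def bernoulli_mle(data):
    """Maximum likelihood estimate: the sample proportion m / N."""
    return sum(data) / len(data)

flips = [1, 1, 1]                               # three heads in three tries
mu_ml = bernoulli_mle(flips)
print(mu_ml)                                    # 1.0
print(bernoulli_log_likelihood(mu_ml, flips))   # 0.0, i.e. likelihood 1
```

With μ = 1 the observed data has likelihood exactly 1, so no other value of μ can score higher, even though three flips are hardly conclusive.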

Key insight: With just 3 observations, ML estimates μ = 1 with certainty. But surely we wouldn't bet our life that the coin always lands heads. We need a way to express uncertainty about μ and to incorporate prior knowledge (like "most coins are roughly fair"). That's what the beta prior gives us.

The binomial distribution counts the total number m of heads in N independent Bernoulli trials:

Bin(m|N, μ) = C(N, m) μ^m (1 − μ)^{N−m}

where C(N, m) = N! / (m!(N−m)!) is the binomial coefficient. The mean is Nμ and variance is Nμ(1 − μ).
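As a quick numerical check of the mean and variance formulas, here is a small sketch using only the Python standard library. The values N = 10 and μ = 0.3 are arbitrary choices for illustration.

```python
import math

def binomial_pmf(m, n, mu):
    """Bin(m | N, mu) = C(N, m) mu^m (1 - mu)^(N - m)."""
    return math.comb(n, m) * mu**m * (1 - mu) ** (n - m)

n, mu = 10, 0.3   # illustrative values
pmf = [binomial_pmf(m, n, mu) for m in range(n + 1)]

total = sum(pmf)                                         # should be 1
mean = sum(m * p for m, p in enumerate(pmf))             # should be N * mu
variance = sum((m - mean) ** 2 * p for m, p in enumerate(pmf))  # N * mu * (1 - mu)

print(round(total, 6), round(mean, 6), round(variance, 6))
```

The three printed values should agree with 1, Nμ = 3, and Nμ(1 − μ) = 2.1 up to floating-point rounding.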

Check: What is the ML estimate of μ from 3 heads out of 3 flips?

μ = 1 (which is clearly overfit to the small sample)
μ = 0.5 (fair coin assumption)
μ = 0.75

Chapter 2: The Beta Prior

To do Bayesian inference on the Bernoulli parameter μ, we need a prior distribution over μ ∈ [0, 1]. The beta distribution is the conjugate prior for the Bernoulli:

Beta(μ|a, b) = Γ(a+b) / (Γ(a)Γ(b)) · μ^{a−1} (1−μ)^{b−1}

The hyperparameters a and b control the shape. Think of them as "pseudo-counts": a is the number of imaginary heads, b is the number of imaginary tails we've seen before collecting any real data.

[Interactive demo: Beta Distribution & Bayesian Updating. Set the prior (a, b), then observe coin flips and watch the posterior update as evidence accumulates. Initial state: a = 2.0, b = 2.0, 0 heads, 0 tails.]

After observing m heads and l = N − m tails, the posterior is also a beta distribution:

p(μ|D) = Beta(μ| a + m, b + l)

Conjugacy in action: The prior was Beta(a, b). The posterior is Beta(a+m, b+l). Same family! We just add the observed counts to the prior pseudo-counts. This is the beauty of conjugate priors: Bayesian updating reduces to simple arithmetic. No integrals needed.

The posterior mean is (a+m)/(a+b+N), which is a weighted average of the prior mean a/(a+b) and the ML estimate m/N. As N grows, the data overwhelms the prior and the posterior concentrates around the ML estimate. With small N, the prior keeps us from extreme conclusions.
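To make the weighted-average claim concrete, here is a short sketch that revisits the "3 heads in 3 flips" example with a Beta(2, 2) prior. The helper names are illustrative; the weight (a+b)/(a+b+N) on the prior mean follows from rearranging (a+m)/(a+b+N).

```python
def beta_posterior(a, b, heads, tails):
    """Conjugate update: add observed counts to the prior pseudo-counts."""
    return a + heads, b + tails

def beta_mean(a, b):
    """Mean of Beta(a, b)."""
    return a / (a + b)

a, b = 2.0, 2.0        # prior pseudo-counts
heads, tails = 3, 0    # three heads in three flips
n = heads + tails

a_post, b_post = beta_posterior(a, b, heads, tails)
posterior_mean = beta_mean(a_post, b_post)    # 5/7, about 0.714

# The same number as a weighted average of prior mean and ML estimate.
w_prior = (a + b) / (a + b + n)
blended = w_prior * beta_mean(a, b) + (1 - w_prior) * (heads / n)

print(posterior_mean, blended)
```

Where ML reports μ = 1, the posterior mean is pulled back toward the prior mean of 0.5. As N grows, the prior's weight (a+b)/(a+b+N) shrinks toward zero.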

Sequential updating: The posterior from one batch of data becomes the prior for the next. If you observe i.i.d. data one point at a time, the final posterior is identical to the one you'd get from processing all the data at once. Bayesian inference is inherently sequential.
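A small sketch of this equivalence for the Beta-Bernoulli model follows. The flip sequence and the `update` helper are made up for illustration.

```python
def update(a, b, x):
    """One-observation Beta update: a head bumps a, a tail bumps b."""
    return (a + 1, b) if x == 1 else (a, b + 1)

flips = [1, 0, 1, 1, 0, 1]   # illustrative data
a, b = 2, 2                  # Beta(2, 2) prior

# Sequential: yesterday's posterior is today's prior.
sequential = (a, b)
for x in flips:
    sequential = update(*sequential, x)

# Batch: add all counts at once.
heads = sum(flips)
batch = (a + heads, b + len(flips) - heads)

print(sequential, batch)   # (6, 4) (6, 4)
```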

Check: Starting with Beta(2, 2) and observing 3 heads and 1 tail, what is the posterior?

Beta(5, 3)
Beta(3, 1)
Beta(2, 2) (the prior doesn't change)

Chapter 3: Multinomial Variables

Binary outcomes are a special case. Often we have K ≥ 2 possible outcomes: K sides of a die, K categories of email, K words in a vocabulary. The multinomial distribution generalizes the binomial to K categories.

Represent a single observation as a one-hot vector x = (0, …, 0, 1, 0, …, 0)^T where the k-th element is 1 and the rest are 0. The distribution is:

p(x|μ) = ∏_{k=1}^{K} μ_k^{x_k}

where μ = (μ₁, …, μ_K) with μ_k ≥ 0 and ∑_k μ_k = 1. The ML estimate is μ_{k,ML} = m_k / N, where m_k is the count of category k.

The conjugate prior for the multinomial is the Dirichlet distribution:

Dir(μ|α) = C(α) ∏_{k=1}^{K} μ_k^{α_k−1}

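Here C(α) = Γ(α₀) / (Γ(α₁)⋯Γ(α_K)), with α₀ = ∑_k α_k, is the normalizing constant (not the binomial coefficient C(N, m) used earlier). The hyperparameters α_k play the same role as the beta's a and b: pseudo-counts for each category. The posterior after observing counts (m₁, …, m_K) is: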

p(μ|D) = Dir(μ| α₁+m₁, …, α_K+m_K)

The pattern repeats: Bernoulli/Beta in 2 categories. Multinomial/Dirichlet in K categories. Same idea: conjugate priors that absorb observed counts. The Dirichlet is a distribution over probability vectors — it lives on the (K−1)-simplex (the space where all components are non-negative and sum to 1).
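The count-absorbing update can be sketched in a few lines. The die rolls below are hypothetical data and the function names are illustrative. The posterior mean formula α_k / ∑_j α_j is the standard mean of a Dirichlet.

```python
def dirichlet_posterior(alpha, counts):
    """Conjugate update: add each category's count to its pseudo-count."""
    return [a + m for a, m in zip(alpha, counts)]

def dirichlet_mean(alpha):
    """Mean of Dir(alpha): alpha_k divided by the sum of all alpha."""
    total = sum(alpha)
    return [a / total for a in alpha]

alpha_prior = [1.0] * 6                 # one pseudo-count per die face
rolls = [6, 3, 6, 1, 6, 2, 6, 4]        # hypothetical rolls; face 5 never appears
counts = [rolls.count(face) for face in range(1, 7)]

ml_estimate = [m / len(rolls) for m in counts]    # m_k / N
alpha_post = dirichlet_posterior(alpha_prior, counts)
posterior_mean = dirichlet_mean(alpha_post)

print(ml_estimate[4], posterior_mean[4])   # ML gives 0.0 for face 5; the posterior mean does not
```

Just as with the coin, ML assigns probability zero to a face that has not yet been observed, while the Dirichlet prior keeps every face possible.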

Check: What is the Dirichlet distribution a prior over?

Individual real numbers
Probability vectors (that sum to 1)
Covariance matrices

Chapter 4: The Gaussian in Depth

We met the Gaussian in Chapter 1. Now we study it seriously. The D-dimensional Gaussian is:

N(x|μ, Σ) = (2π)^{−D/2} |Σ|^{−1/2} exp(−½ (x−μ)^T Σ⁻¹ (x−μ))

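The quadratic form Δ² = (x−μ)^T Σ⁻¹ (x−μ) in the exponent is the squared Mahalanobis distance from x to μ (Δ itself is the Mahalanobis distance), a distance that accounts for the shape and orientation of the distribution. It reduces to the Euclidean distance when Σ is the identity. Contours of constant density are ellipses (ellipsoids for D > 2) defined by constant Mahalanobis distance.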
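For intuition, here is a standard-library sketch restricted to D = 2, where the 2×2 inverse can be written out by hand. The covariance and test points are illustrative assumptions. In practice a linear-algebra library would replace the hand-rolled inverse.

```python
import math

def mahalanobis_sq_2d(x, mu, cov):
    """Squared Mahalanobis distance (x - mu)^T cov^{-1} (x - mu) for D = 2."""
    (s11, s12), (s21, s22) = cov
    det = s11 * s22 - s12 * s21
    # Closed-form inverse of a 2x2 matrix.
    inv = ((s22 / det, -s12 / det), (-s21 / det, s11 / det))
    d0, d1 = x[0] - mu[0], x[1] - mu[1]
    return (d0 * (inv[0][0] * d0 + inv[0][1] * d1)
            + d1 * (inv[1][0] * d0 + inv[1][1] * d1))

def gaussian_pdf_2d(x, mu, cov):
    """N(x | mu, cov) for D = 2: (2 pi)^{-1} |cov|^{-1/2} exp(-delta^2 / 2)."""
    (s11, s12), (s21, s22) = cov
    det = s11 * s22 - s12 * s21
    return math.exp(-0.5 * mahalanobis_sq_2d(x, mu, cov)) / (2 * math.pi * math.sqrt(det))

mu = (0.0, 0.0)
cov = ((4.0, 0.0), (0.0, 1.0))   # variance 4 along axis 0, variance 1 along axis 1

# Euclidean distances 2 and 1, but the same Mahalanobis distance of 1.
print(mahalanobis_sq_2d((2.0, 0.0), mu, cov), mahalanobis_sq_2d((0.0, 1.0), mu, cov))
```

Both points sit one standard deviation from the mean along their respective axes, so they lie on the same density contour.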

Three crucial operations on Gaussians (all yield Gaussians):

Operation	Result
Conditioning: p(x_a|x_b)	Gaussian with mean that depends linearly on x_b
Marginalizing: ∫ p(x_a, x_b) dx_b	Gaussian with mean μ_a and covariance Σ_aa
Multiplying: p(x_a|x_b) · p(x_b)	Joint Gaussian over (x_a, x_b)

Gaussians are closed under conditioning and marginalization. This is why they are so useful: every operation you need for Bayesian inference (conditioning on data, marginalizing over latent variables) keeps you in the Gaussian family. No approximations needed. This "closure" property is the engine behind Bayesian linear regression (Ch 3), Gaussian processes (Ch 6), and Kalman filters (Ch 13).
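For the simplest case, where x_a and x_b are both scalars, the conditioning and marginalizing rules reduce to a few arithmetic operations. The sketch below uses the standard partitioned-Gaussian formulas (conditional mean μ_a + Σ_ab Σ_bb⁻¹ (x_b − μ_b), conditional variance Σ_aa − Σ_ab Σ_bb⁻¹ Σ_ba), which are not spelled out in the text above. The numbers are illustrative.

```python
def condition_bivariate(mu_a, mu_b, s_aa, s_ab, s_bb, x_b):
    """Mean and variance of p(x_a | x_b) for a bivariate Gaussian."""
    mean = mu_a + (s_ab / s_bb) * (x_b - mu_b)   # linear in x_b
    var = s_aa - s_ab**2 / s_bb                  # does not depend on x_b
    return mean, var

def marginal_a(mu_a, s_aa):
    """Mean and variance of p(x_a): just read off the corresponding blocks."""
    return mu_a, s_aa

# Illustrative joint: means (1, 2), covariance [[2, 0.8], [0.8, 1]].
cond_mean, cond_var = condition_bivariate(1.0, 2.0, 2.0, 0.8, 1.0, x_b=3.0)
marg_mean, marg_var = marginal_a(1.0, 2.0)

print(cond_mean, cond_var)   # about 1.8 and 1.36
print(marg_mean, marg_var)   # 1.0 2.0
```

Observing x_b shifts the mean of x_a and shrinks its variance below the marginal variance, and both results are still Gaussian.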

The precision matrix Λ = Σ⁻¹ is often more natural than the covariance. Zero entries in the precision matrix indicate conditional independence: Λ_ij = 0 means x_i and x_j are independent given all other variables. This connects to graphical models (Ch 8).

Check: What does the Mahalanobis distance measure?

Ordinary Euclidean distance
Distance scaled by the covariance structure of the distribution
The number of standard deviations from the mode

Chapter 5: MLE for the Gaussian

Given N data points drawn i.i.d. from a D-dimensional Gaussian, the ML estimates are:

μ_ML = (1/N) ∑_{n=1}^{N} x_n

Σ_ML = (1/N) ∑_{n=1}^{N} (x_n − μ_ML)(x_n − μ_ML)^T

The ML mean is unbiased: E[μ_ML] = μ. But the ML covariance is biased: E[Σ_ML] = (N−1)/N · Σ. It underestimates the true covariance because it measures spread around the fitted mean rather than the true mean.
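In one dimension the two estimators are available directly in Python's `statistics` module: `pvariance` divides by N (the ML estimate) and `variance` divides by N − 1 (the bias-corrected estimate). The data below is made up for illustration.

```python
import statistics

data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]   # illustrative sample
n = len(data)

mu_ml = statistics.fmean(data)             # (1/N) * sum of x_n
var_ml = statistics.pvariance(data)        # divides by N: the ML estimate
var_unbiased = statistics.variance(data)   # divides by N - 1

print(mu_ml, var_ml)                        # 5.0 4.0
print(var_unbiased)                         # 32/7, about 4.571
print(var_ml / var_unbiased, (n - 1) / n)   # both about 0.875
```

The ratio of the two estimates is exactly (N − 1)/N, which is the factor by which the ML estimate falls short in expectation.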

Sequential estimation: We can update the ML mean one data point at a time: μ_ML^(N) = μ_ML^(N−1) + (1/N)(x_N − μ_ML^(N−1)). The correction is proportional to the "error" between the new observation and the current estimate. This pattern — update = current + learning_rate × error — appears throughout ML, from stochastic gradient descent to Kalman filters.
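The update rule translates almost line for line into code. This is a minimal sketch; the generator name `running_mean` and the data are illustrative.

```python
def running_mean(stream):
    """Yield the ML mean after each observation, without storing past data."""
    mean = 0.0
    for n, x in enumerate(stream, start=1):
        # new estimate = old estimate + (1/N) * (observation - old estimate)
        mean += (x - mean) / n
        yield mean

data = [2.0, 4.0, 6.0, 8.0]
estimates = list(running_mean(data))
print(estimates)   # [2.0, 3.0, 4.0, 5.0]
```

The "learning rate" 1/N shrinks as more data arrives, so each new point moves the estimate less than the one before it.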

Beyond the standard Gaussian, Bishop covers two useful variants:

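Student's t-distribution: A Gaussian with uncertain precision. It arises by placing a Gamma prior on the precision of a Gaussian and integrating the precision out, which amounts to an infinite mixture of Gaussians with the same mean but different precisions. It has heavier tails than the Gaussian, which makes it more robust to outliers. As the degrees of freedom ν → ∞, it becomes a Gaussian.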

Mixture of Gaussians: A weighted sum of K Gaussians. Can model multimodal distributions. The parameters (means, covariances, weights) are learned via the EM algorithm (Ch 9).

Check: The sequential update for the ML mean follows which pattern?

new_estimate = old_estimate + learning_rate * error
new_estimate = average of all past estimates
new_estimate = maximum of old and new data

Chapter 6: Bayesian Inference for the Gaussian

Instead of point estimates, the Bayesian approach maintains distributions over the Gaussian parameters.

Known variance, unknown mean: If σ² is known, the conjugate prior for μ is a Gaussian:

p(μ) = N(μ|μ₀, σ₀²)

After observing N points with sample mean x̄, the posterior is:

p(μ|D) = N(μ|μ_N, σ_N²)

where σ_N² = 1/(1/σ₀² + N/σ²) and μ_N = σ_N²(N x̄/σ² + μ₀/σ₀²). The posterior mean is a precision-weighted average of the prior mean and the data mean.

Precision addition: The posterior precision (1/σ_N²) equals the prior precision plus N times the data precision. More data → higher precision → tighter posterior. Each observation adds the same amount of precision. This is the "information accumulation" property of Bayesian updating.
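The two update formulas can be sketched directly. The function name, prior settings, and data are illustrative assumptions chosen so that the arithmetic is easy to follow by hand.

```python
def posterior_for_mean(mu0, var0, noise_var, data):
    """Posterior N(mu | mu_N, var_N) for a Gaussian mean with known noise variance."""
    n = len(data)
    xbar = sum(data) / n
    # Precisions add: prior precision plus N times the data precision.
    post_precision = 1.0 / var0 + n / noise_var
    post_var = 1.0 / post_precision
    # Precision-weighted average of the prior mean and the sample mean.
    post_mean = post_var * (n * xbar / noise_var + mu0 / var0)
    return post_mean, post_var

# Prior N(0, 1), known noise variance 4, four observations with sample mean 2.
mu_n, var_n = posterior_for_mean(0.0, 1.0, 4.0, [2.0, 3.0, 1.0, 2.0])
print(mu_n, var_n)   # 1.0 0.5
```

Here the prior precision is 1 and the four observations contribute 4 × (1/4) = 1 in total, so the posterior precision is 2 and the posterior mean lands halfway between the prior mean 0 and the sample mean 2.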

Known mean, unknown variance: The conjugate prior for the precision β = 1/σ² is a Gamma distribution. After N observations, the posterior is also Gamma with updated parameters.

Both unknown: The conjugate prior for (μ, β) jointly is the Normal-Gamma distribution. For the multivariate case, it's the Normal-Wishart (where the Wishart is the conjugate prior for the precision matrix).

Predictive distribution: Instead of plugging in point estimates, we can marginalize over the posterior: p(x_new|D) = ∫ p(x_new|μ) p(μ|D) dμ. This predictive distribution is wider than the one obtained by plugging in the ML estimate, because it also reflects our uncertainty about the parameters. For the Gaussian with known variance σ², the predictive is a Gaussian with mean μ_N and variance σ² + σ_N²: the noise variance plus the posterior variance of the mean.
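For the known-variance case, the integral can be written out explicitly. This is a sketch of the standard result, assuming the posterior N(μ|μ_N, σ_N²) from above: a new observation is x_new = μ + ε with noise ε ~ N(0, σ²) independent of μ, so the means and variances of the two Gaussian terms add.

```latex
\begin{aligned}
p(x_{\text{new}} \mid D)
  &= \int \mathcal{N}(x_{\text{new}} \mid \mu, \sigma^2)\,
          \mathcal{N}(\mu \mid \mu_N, \sigma_N^2)\, d\mu \\
  &= \mathcal{N}\!\left(x_{\text{new}} \mid \mu_N,\; \sigma^2 + \sigma_N^2\right)
\end{aligned}
```

As N grows, σ_N² shrinks toward zero and the predictive variance approaches the noise variance σ² alone.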

Check: In Bayesian inference for the Gaussian mean, what does each new observation add?

One unit of posterior mean
One unit of precision to the posterior
One unit of variance

Chapter 7: The Exponential Family

The Bernoulli, Gaussian, multinomial, Poisson, gamma, beta, and Dirichlet all look different. But they share a common structure. The exponential family is a unified framework that encompasses all of them:

p(x|η) = h(x) g(η) exp(η^T u(x))

where η is the natural parameter vector, u(x) is the sufficient statistic vector, g(η) is the normalizing factor, and h(x) is a scaling function.
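As a concrete instance, the Bernoulli fits this template with η = ln(μ/(1−μ)), u(x) = x, h(x) = 1, and g(η) = σ(−η), where σ is the logistic sigmoid. The sketch below checks numerically that the two forms agree. The value μ = 0.3 is arbitrary.

```python
import math

def sigmoid(z):
    """Logistic sigmoid 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def bernoulli_standard(x, mu):
    """Bern(x | mu) = mu^x (1 - mu)^(1 - x)."""
    return mu**x * (1 - mu) ** (1 - x)

def bernoulli_expfam(x, eta):
    """The same distribution written as h(x) g(eta) exp(eta * u(x))."""
    h = 1.0               # h(x) = 1
    g = sigmoid(-eta)     # g(eta) = sigma(-eta), which equals 1 - mu
    u = x                 # sufficient statistic u(x) = x
    return h * g * math.exp(eta * u)

mu = 0.3
eta = math.log(mu / (1 - mu))   # natural parameter: the log-odds

for x in (0, 1):
    print(x, bernoulli_standard(x, mu), bernoulli_expfam(x, eta))
```

The natural parameter here is the log-odds, and inverting the map gives μ = σ(η).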

Why this matters: Any member of the exponential family automatically has: (1) a conjugate prior, (2) sufficient statistics that compress the data, and (3) a maximum likelihood estimate in terms of those sufficient statistics. These properties let us build a general-purpose Bayesian inference engine that works for any exponential family distribution.

Key properties of exponential family members:

Property	What it gives you
Sufficient statistics	You can summarize the data without losing information for inference
Conjugate prior	Posterior = prior with updated pseudo-counts
−log g(η)	Generates all moments: ∇ (−log g) = E[u(x)]
Maximum entropy	Exponential family = max entropy distribution given constraints on E[u]

Noninformative priors: When we have no prior knowledge, we might want a prior that "says nothing." Two approaches: (1) flat priors that are constant everywhere (but can be improper, i.e., not normalizable), and (2) Jeffreys' prior, which is invariant under reparameterization. Jeffreys' prior is proportional to the square root of the determinant of the Fisher information matrix.

Check: What do all exponential family distributions have in common?

They are all continuous
They all have Gaussian shape
They share the form h(x)g(η)exp(η^Tu(x)), have conjugate priors and sufficient statistics

Chapter 8: Nonparametric Methods

Everything so far assumes a parametric form for the distribution (Gaussian, binomial, etc.). What if the true distribution has a shape that no parametric family can capture? Nonparametric methods let the data speak for itself.

The fundamental idea: in a region R of volume V containing K of N data points, the density is approximately:

p(x) ≈ K / (NV)

Two strategies arise from this formula, depending on what you fix:

Fix V, count K: This gives kernel density estimation (KDE, Parzen windows). Place a kernel (e.g., Gaussian) of fixed width at each data point. The density estimate is the average of all kernels. Smooth but sensitive to bandwidth choice.
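A one-dimensional Gaussian KDE fits in a few lines. This is a minimal sketch using only the standard library; the data points and the bandwidth h = 0.2 are illustrative.

```python
import math

def gaussian_kde(x, data, h):
    """Density estimate at x: the average of Gaussian bumps of width h."""
    norm = 1.0 / (len(data) * h * math.sqrt(2 * math.pi))
    return norm * sum(math.exp(-0.5 * ((x - xn) / h) ** 2) for xn in data)

data = [1.0, 1.2, 3.5, 3.7, 4.0]   # two loose clusters (illustrative)
h = 0.2                            # bandwidth

print(gaussian_kde(1.1, data, h))  # high: close to two data points
print(gaussian_kde(2.4, data, h))  # near zero: in the gap between clusters
```

Re-running with a larger h (say 1.0) fills in the gap between the clusters, while a much smaller h produces an isolated spike at each data point.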

Fix K, adapt V: This gives K-nearest neighbors (KNN). For each query point, find the K nearest data points. The density is K/(NV) where V is the volume of the sphere containing those K points. Adapts to local density automatically.

KNN classification: To classify a new point, find its K nearest neighbors in the training set and take a majority vote. Simple, no training phase, and surprisingly effective. The decision boundary adapts to the data density. The catch: it stores the entire training set and is slow at test time (must search through all N points). With K=1, the error rate is guaranteed to be at most twice the optimal Bayes error rate as N → ∞.
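A brute-force version of the classifier makes both the simplicity and the cost visible. This is a sketch, not an efficient implementation: it uses Euclidean distance via `math.dist`, scans every training point for each query, and the toy data and labels are made up.

```python
import math
from collections import Counter

def knn_classify(query, points, labels, k):
    """Majority vote among the k training points nearest to query."""
    # Brute force: sort all N training indices by distance to the query.
    order = sorted(range(len(points)), key=lambda i: math.dist(query, points[i]))
    votes = Counter(labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]

# Toy training set with two well-separated classes (illustrative).
points = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1), (3.0, 3.2), (3.1, 2.9), (2.9, 3.0)]
labels = ["a", "a", "a", "b", "b", "b"]

print(knn_classify((1.1, 1.0), points, labels, k=3))   # a
print(knn_classify((3.0, 3.0), points, labels, k=3))   # b
```

There is no fitting step at all: "training" is just storing `points` and `labels`, and all the work happens at query time. An odd K avoids tied votes in two-class problems.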

[Interactive demo: Kernel Density Estimation. Each data point contributes a small Gaussian bump. Adjust the bandwidth h (initially 0.20) to see how smoothness changes. Too small: spiky. Too large: over-smoothed.]

Check: In KNN, what happens as K increases?

The decision boundary becomes smoother (less sensitive to individual points)
The algorithm becomes faster
The training set shrinks

Chapter 9: Summary

Chapter 2 gave us the probabilistic toolkit. Every model in the rest of the book is built from these distributions.

Distribution	Conjugate Prior	Used in
Bernoulli	Beta	Binary classification, naive Bayes
Multinomial	Dirichlet	Text models, topic models, HMMs
Gaussian	Gaussian / Normal-Wishart	Regression, GPs, GMMs, Kalman
Exponential family	Generic conjugate	Unifying framework

The pattern: Choose a likelihood (how data is generated). Choose a conjugate prior (what you believe before seeing data). Multiply to get the posterior (updated beliefs). Marginalize to get the predictive (what you expect next). This four-step recipe is the backbone of Bayesian machine learning.

What comes next: Chapter 3 applies these distributions to the first real modeling task: linear regression. We'll see the Gaussian likelihood + Gaussian prior = Gaussian posterior pattern in action, giving us Bayesian linear regression with closed-form solutions.

"The posterior distribution for parameters combines the
prior information with the information from the observed data."
— Christopher Bishop, PRML

Check: What is the key advantage of conjugate priors?

The posterior has the same functional form as the prior, giving closed-form updates
They always give the most accurate predictions
They require no hyperparameters