Bishop PRML, Chapter 2

Probability Distributions

The building blocks of probabilistic models: binary, multinomial, Gaussian, exponential family, and nonparametric methods.

Prerequisites: Chapter 1 (probability basics, Bayes' theorem).
10 chapters · 2 simulations · 10 quizzes

Chapter 0: Why Distributions?

In Chapter 1, we treated the data as fixed and learned parameters. But where does the data come from? What assumptions are we making about how it was generated? To answer these questions, we need a language for describing random processes. That language is probability distributions.

A probability distribution is a mathematical recipe for generating data. If we believe our data was generated by some distribution, then learning amounts to figuring out which distribution (or which parameters of that distribution) best explains what we've observed.

The generative story: Behind every dataset, there is (at least conceptually) a process that generated it. "Nature picked parameters, then generated data from those parameters." Our job is to run this story in reverse: observe data, infer parameters. The distributions in this chapter are the building blocks for these generative stories.

This chapter covers a progression of increasingly rich distributions:

Distribution | What it models | Example
Bernoulli | Binary outcomes | Coin flips, spam/not-spam
Binomial | Count of successes in N trials | Number of heads in 10 flips
Beta | Prior over Bernoulli parameter | Belief about coin bias
Multinomial | Outcomes with K categories | Dice rolls, word counts
Dirichlet | Prior over multinomial parameters | Belief about dice bias
Gaussian | Continuous, bell-shaped data | Measurement noise, heights

A unifying theme: each distribution has a natural conjugate prior — a prior distribution that, when combined with the likelihood via Bayes' theorem, yields a posterior in the same family. This makes Bayesian inference analytically tractable. Conjugacy is elegant mathematics, but it's also practically useful: it gives us closed-form updates instead of intractable integrals.

Check: What is a conjugate prior?

Chapter 1: The Bernoulli Distribution

The simplest possible distribution: a single binary random variable x ∈ {0, 1}. A coin flip. The Bernoulli distribution has one parameter μ ∈ [0, 1], the probability of "heads" (x = 1):

$$\mathrm{Bern}(x \mid \mu) = \mu^{x} (1 - \mu)^{1 - x}$$

Its mean is E[x] = μ and its variance is var[x] = μ(1 − μ). Given N observations $D = \{x_1, \dots, x_N\}$, the likelihood is:

$$p(D \mid \mu) = \prod_{n=1}^{N} \mu^{x_n} (1 - \mu)^{1 - x_n} = \mu^{m} (1 - \mu)^{N - m}$$

where $m = \sum_{n} x_n$ is the number of "heads." The count m is the sufficient statistic: given m and N, the order of the individual observations doesn't matter.

Maximizing the log-likelihood gives the maximum likelihood estimate:

$$\mu_{\mathrm{ML}} = \frac{m}{N}$$

The sample proportion. If you flip 3 heads in 3 tries, ML says μ = 1. The coin is definitely always heads. This is clearly absurd for small N — ML overfits with limited data.
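
To make the overfitting concrete, here is a minimal sketch of the Bernoulli log-likelihood and its maximizer. The function names and the grid search are illustrative assumptions, not part of the original text; the grid is only there to confirm numerically that m/N is where the likelihood peaks.

```python
import math

def bernoulli_log_likelihood(mu, data):
    """Log-likelihood of i.i.d. binary data under Bern(x | mu)."""
    m = sum(data)        # number of heads: the sufficient statistic
    n = len(data)
    # A zero-probability outcome that was actually observed gives -inf.
    if (mu == 0.0 and m > 0) or (mu == 1.0 and m < n):
        return -math.inf
    log_lik = 0.0
    if m > 0:
        log_lik += m * math.log(mu)
    if m < n:
        log_lik += (n - m) * math.log(1.0 - mu)
    return log_lik

def bernoulli_mle(data):
    """Closed-form maximum likelihood estimate: the sample proportion m / N."""
    return sum(data) / len(data)

flips = [1, 1, 1]                      # three heads in three tries
mu_ml = bernoulli_mle(flips)

# Brute-force check on a grid of candidate values for mu (illustrative only).
grid = [i / 100 for i in range(101)]
mu_grid = max(grid, key=lambda mu: bernoulli_log_likelihood(mu, flips))

print(mu_ml, mu_grid)
```

Both the closed form and the grid search land on the boundary value for this dataset, which is exactly the extreme conclusion the text warns about.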

Key insight: With just 3 observations, ML estimates μ = 1 with certainty. But surely we wouldn't bet our life that the coin always lands heads. We need a way to express uncertainty about μ and to incorporate prior knowledge (like "most coins are roughly fair"). That's what the beta prior gives us.

The binomial distribution counts the total number m of heads in N independent Bernoulli trials:

$$\mathrm{Bin}(m \mid N, \mu) = C(N, m)\, \mu^{m} (1 - \mu)^{N - m}$$

where C(N, m) = N! / (m!(N−m)!) is the binomial coefficient. The mean is Nμ and variance is Nμ(1 − μ).
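
As a quick sanity check on these formulas, the sketch below evaluates the binomial probabilities directly and compares their mean and variance with Nμ and Nμ(1 − μ). The helper name `binomial_pmf` and the example values N = 10, μ = 0.25 are assumptions chosen for illustration.

```python
import math

def binomial_pmf(m, n, mu):
    """Probability of m heads in n independent flips with head probability mu."""
    return math.comb(n, m) * mu**m * (1.0 - mu)**(n - m)

n, mu = 10, 0.25
pmf = [binomial_pmf(m, n, mu) for m in range(n + 1)]

total = sum(pmf)                                         # should be 1
mean = sum(m * p for m, p in enumerate(pmf))             # should be N * mu
variance = sum((m - mean)**2 * p for m, p in enumerate(pmf))  # N * mu * (1 - mu)

print(total, mean, variance)
```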

Check: What is the ML estimate of μ from 3 heads out of 3 flips?

Chapter 2: The Beta Prior

To do Bayesian inference on the Bernoulli parameter μ, we need a prior distribution over μ ∈ [0, 1]. The beta distribution is the conjugate prior for the Bernoulli:

$$\mathrm{Beta}(\mu \mid a, b) = \frac{\Gamma(a + b)}{\Gamma(a)\,\Gamma(b)}\, \mu^{a - 1} (1 - \mu)^{b - 1}$$

The hyperparameters a and b control the shape. Think of them as "pseudo-counts": a is the number of imaginary heads, b is the number of imaginary tails we've seen before collecting any real data.

Interactive simulation: Beta Distribution & Bayesian Updating. Set your prior (a, b), then observe coin flips and watch the posterior update as evidence accumulates. Defaults: a = 2.0, b = 2.0, starting from 0 heads and 0 tails.

After observing m heads and l = N − m tails, the posterior is also a beta distribution:

$$p(\mu \mid D) = \mathrm{Beta}(\mu \mid a + m,\; b + l)$$

Conjugacy in action: The prior was Beta(a, b). The posterior is Beta(a+m, b+l). Same family! We just add the observed counts to the prior pseudo-counts. This is the beauty of conjugate priors: Bayesian updating reduces to simple arithmetic. No integrals needed.

The posterior mean is (a+m)/(a+b+N), which is a weighted average of the prior mean a/(a+b) and the ML estimate m/N. As N grows, the data overwhelms the prior and the posterior concentrates around the ML estimate. With small N, the prior keeps us from extreme conclusions.

Sequential updating: The posterior from one batch of data becomes the prior for the next. If you observe data one point at a time, the final posterior is identical to what you'd get from processing all the data at once. Bayesian inference is inherently sequential.
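
The update rule and the two claims above (weighted-average posterior mean, batch and sequential agreement) can be sketched in a few lines. The function names, the Beta(3, 3) prior and the flip sequence are assumptions made up for this example.

```python
def beta_update(a, b, heads, tails):
    """Conjugate update: add observed counts to the prior pseudo-counts."""
    return a + heads, b + tails

def beta_mean(a, b):
    """Mean of Beta(a, b)."""
    return a / (a + b)

a0, b0 = 3.0, 3.0                  # assumed prior pseudo-counts
flips = [1, 0, 1, 1, 0, 1]         # 4 heads, 2 tails
n = len(flips)
m = sum(flips)

# Batch: process all the data at once.
a_batch, b_batch = beta_update(a0, b0, m, n - m)

# Sequential: yesterday's posterior is today's prior.
a_seq, b_seq = a0, b0
for x in flips:
    a_seq, b_seq = beta_update(a_seq, b_seq, x, 1 - x)

# Posterior mean as a weighted average of the prior mean and the ML estimate.
w_prior = (a0 + b0) / (a0 + b0 + n)
blended = w_prior * beta_mean(a0, b0) + (1.0 - w_prior) * (m / n)

print((a_batch, b_batch), (a_seq, b_seq), beta_mean(a_batch, b_batch), blended)
```

The weight on the prior mean is (a + b)/(a + b + N), so it shrinks toward zero as N grows, which is the "data overwhelms the prior" behaviour described above.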
Check: Starting with Beta(2, 2) and observing 3 heads and 1 tail, what is the posterior?

Chapter 3: Multinomial Variables

Binary outcomes are a special case. Often we have K ≥ 2 possible outcomes: K sides of a die, K categories of email, K words in a vocabulary. The multinomial distribution generalizes the binomial to K categories.

Represent a single observation as a one-hot vector $x = (0, \dots, 0, 1, 0, \dots, 0)^T$ where the k-th element is 1 and the rest are 0. The distribution is:

$$p(x \mid \mu) = \prod_{k=1}^{K} \mu_k^{x_k}$$

where $\mu = (\mu_1, \dots, \mu_K)$ with $\sum_k \mu_k = 1$. The ML estimate is $\mu_{k,\mathrm{ML}} = m_k / N$ where $m_k$ is the count of category k.

The conjugate prior for the multinomial is the Dirichlet distribution:

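$$\mathrm{Dir}(\mu \mid \alpha) = C(\alpha) \prod_{k=1}^{K} \mu_k^{\alpha_k - 1}$$

Here $C(\alpha) = \Gamma(\alpha_0) / (\Gamma(\alpha_1) \cdots \Gamma(\alpha_K))$, with $\alpha_0 = \sum_k \alpha_k$, is the normalizing constant (not the binomial coefficient C(N, m) from Chapter 1). The hyperparameters $\alpha_k$ play the same role as the beta's a and b: pseudo-counts for each category. The posterior after observing counts $(m_1, \dots, m_K)$ is: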

$$p(\mu \mid D) = \mathrm{Dir}(\mu \mid \alpha_1 + m_1, \dots, \alpha_K + m_K)$$

The pattern repeats: Bernoulli/Beta in 2 categories. Multinomial/Dirichlet in K categories. Same idea: conjugate priors that absorb observed counts. The Dirichlet is a distribution over probability vectors. It lives on the (K−1)-simplex (the space where all components are non-negative and sum to 1).
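
The count-absorbing update looks the same in code as it did for the beta. The sketch below assumes a six-sided die, a uniform Dir(1, …, 1) prior and a made-up sequence of rolls; the function names are illustrative.

```python
def dirichlet_update(alpha, counts):
    """Conjugate update: add the observed category counts to the pseudo-counts."""
    return [a + m for a, m in zip(alpha, counts)]

def dirichlet_mean(alpha):
    """Mean of Dir(alpha): each component is alpha_k / sum(alpha)."""
    total = sum(alpha)
    return [a / total for a in alpha]

alpha_prior = [1.0] * 6                         # assumed uniform prior
rolls = [6, 3, 6, 1, 6, 2, 6, 5]                # made-up data: face 4 never appears
counts = [rolls.count(face) for face in range(1, 7)]

mu_ml = [m / len(rolls) for m in counts]        # ML estimate m_k / N
alpha_post = dirichlet_update(alpha_prior, counts)
mu_post = dirichlet_mean(alpha_post)

print(counts)
print(mu_ml[3], mu_post[3])   # probability assigned to the unseen face 4
```

The ML estimate gives the unseen face probability zero, the multi-category version of the "3 heads in 3 flips" problem, while the posterior mean keeps it strictly positive.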
Check: What is the Dirichlet distribution a prior over?

Chapter 4: The Gaussian in Depth

We met the Gaussian in Chapter 1. Now we study it seriously. The D-dimensional Gaussian is:

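$$\mathcal{N}(x \mid \mu, \Sigma) = (2\pi)^{-D/2}\, |\Sigma|^{-1/2} \exp\left(-\tfrac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu)\right)$$

The quadratic form $\Delta^2 = (x - \mu)^T \Sigma^{-1} (x - \mu)$ in the exponent is the squared Mahalanobis distance from x to μ: a distance that accounts for the shape and orientation of the distribution, and that reduces to the Euclidean distance when Σ is the identity. Contours of constant probability density are surfaces of constant Mahalanobis distance: ellipses in two dimensions, ellipsoids in general.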
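
A small two-dimensional sketch shows how this distance differs from the Euclidean one. The function name, the hand-written 2×2 inverse and the example covariance are assumptions for illustration; a real implementation would normally use a linear-algebra library.

```python
import math

def mahalanobis_sq_2d(x, mu, cov):
    """Squared Mahalanobis distance (x - mu)^T cov^{-1} (x - mu) in 2-D."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    # Closed-form inverse of a 2x2 matrix.
    inv = [[d / det, -b / det],
           [-c / det, a / det]]
    dx = [x[0] - mu[0], x[1] - mu[1]]
    return (dx[0] * (inv[0][0] * dx[0] + inv[0][1] * dx[1])
            + dx[1] * (inv[1][0] * dx[0] + inv[1][1] * dx[1]))

mu = [0.0, 0.0]
cov = [[4.0, 0.0],      # variance 4 along the first axis
       [0.0, 1.0]]      # variance 1 along the second axis

p = [2.0, 0.0]          # Euclidean distance 2 from the mean
q = [0.0, 2.0]          # also Euclidean distance 2 from the mean

d_p = math.sqrt(mahalanobis_sq_2d(p, mu, cov))
d_q = math.sqrt(mahalanobis_sq_2d(q, mu, cov))
print(d_p, d_q)
```

The two points are equally far from the mean in the Euclidean sense, but the point along the high-variance axis is closer in Mahalanobis terms, so it has higher density.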

Three crucial operations on Gaussians (all yield Gaussians):

Operation | Result
Conditioning: $p(x_a \mid x_b)$ | Gaussian with mean that depends linearly on $x_b$
Marginalizing: $\int p(x_a, x_b)\, dx_b$ | Gaussian with mean $\mu_a$ and covariance $\Sigma_{aa}$
Multiplying: $p(x_a \mid x_b) \cdot p(x_b)$ | Joint Gaussian over $(x_a, x_b)$

Gaussians are closed under conditioning and marginalization. This is why they are so useful: every operation you need for Bayesian inference (conditioning on data, marginalizing over latent variables) keeps you in the Gaussian family. No approximations needed. This "closure" property is the engine behind Bayesian linear regression (Ch 3), Gaussian processes (Ch 6), and Kalman filters (Ch 13).
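
For two scalar variables the conditioning and marginalizing results reduce to a few lines of arithmetic. The sketch below uses the standard bivariate formulas; the function name and all numerical values are assumptions chosen for illustration.

```python
def condition_bivariate(mu_a, mu_b, s_aa, s_ab, s_bb, x_b):
    """Mean and variance of p(x_a | x_b) for a jointly Gaussian pair (x_a, x_b)."""
    gain = s_ab / s_bb                        # how strongly x_b informs x_a
    cond_mean = mu_a + gain * (x_b - mu_b)    # linear in the observed x_b
    cond_var = s_aa - gain * s_ab             # does not depend on x_b
    return cond_mean, cond_var

# Assumed joint parameters: means, variances and covariance.
mu_a, mu_b = 1.0, 2.0
s_aa, s_ab, s_bb = 2.0, 1.2, 1.5

cond_mean, cond_var = condition_bivariate(mu_a, mu_b, s_aa, s_ab, s_bb, x_b=3.0)

# Marginalizing out x_b just reads off the a-block of the parameters.
marg_mean, marg_var = mu_a, s_aa

print(cond_mean, cond_var, marg_mean, marg_var)
```

Observing x_b shifts the mean of x_a and always shrinks (or at best preserves) its variance relative to the marginal.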

The precision matrix $\Lambda = \Sigma^{-1}$ is often more natural to work with than the covariance. Zero entries in the precision matrix indicate conditional independence: $\Lambda_{ij} = 0$ means $x_i$ and $x_j$ are independent given all the other variables. This connects to graphical models (Ch 8).

Check: What does the Mahalanobis distance measure?

Chapter 5: MLE for the Gaussian

Given N data points drawn i.i.d. from a D-dimensional Gaussian, the ML estimates are:

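$$\mu_{\mathrm{ML}} = \frac{1}{N} \sum_{n=1}^{N} x_n$$
$$\Sigma_{\mathrm{ML}} = \frac{1}{N} \sum_{n=1}^{N} (x_n - \mu_{\mathrm{ML}})(x_n - \mu_{\mathrm{ML}})^T$$

The ML mean is unbiased: $E[\mu_{\mathrm{ML}}] = \mu$. But the ML covariance is biased: $E[\Sigma_{\mathrm{ML}}] = \frac{N-1}{N} \Sigma$. It underestimates the true covariance because it measures spread around the fitted mean rather than the true mean.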
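
In one dimension the difference between the two estimators is just the divisor. The sketch below uses a made-up dataset and compares the ML variance (divide by N) with the bias-corrected one (divide by N − 1), checking both against Python's `statistics` module.

```python
import statistics

data = [2.0, 4.0, 4.0, 5.0, 7.0, 8.0]     # made-up 1-D sample
n = len(data)

mean_ml = sum(data) / n
var_ml = sum((x - mean_ml) ** 2 for x in data) / n    # ML estimate: divides by N
var_unbiased = var_ml * n / (n - 1)                   # rescaled to remove the bias

# statistics.pvariance divides by N, statistics.variance by N - 1.
print(var_ml, statistics.pvariance(data))
print(var_unbiased, statistics.variance(data))
```

The factor N/(N − 1) is exactly the inverse of the (N − 1)/N shrinkage in the expectation above, and it matters most when N is small.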

Sequential estimation: We can update the ML mean one data point at a time: $\mu_{\mathrm{ML}}^{(N)} = \mu_{\mathrm{ML}}^{(N-1)} + \frac{1}{N}\left(x_N - \mu_{\mathrm{ML}}^{(N-1)}\right)$. The correction is proportional to the "error" between the new observation and the current estimate. This pattern (update = current + learning_rate × error) appears throughout machine learning, from stochastic gradient descent to Kalman filters.
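
The sequential formula can be sketched as a generator that never stores past data. The name `running_mean` and the sample values are assumptions for illustration.

```python
def running_mean(stream):
    """Yield the ML mean after each new observation, without storing the data."""
    mu = 0.0
    for n, x in enumerate(stream, start=1):
        error = x - mu            # new observation minus current estimate
        mu = mu + error / n       # learning rate 1/N shrinks as data accumulates
        yield mu

data = [2.0, 4.0, 4.0, 5.0, 7.0, 8.0]     # made-up observations
estimates = list(running_mean(data))

batch_mean = sum(data) / len(data)
print(estimates[-1], batch_mean)
```

After the last point the running estimate matches the batch mean up to floating-point rounding, even though each step only touched one observation.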

Beyond the standard Gaussian, Bishop covers two useful variants:

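Student's t-distribution: A Gaussian with uncertain variance. It arises by placing a Gamma prior on the precision of a Gaussian and integrating the precision out, which makes it an infinite mixture of Gaussians with the same mean but different precisions. It has heavier tails than the Gaussian, so it is more robust to outliers. As the degrees of freedom ν → ∞, it becomes a Gaussian.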

Mixture of Gaussians: A weighted sum of K Gaussians. Can model multimodal distributions. The parameters (means, covariances, weights) are learned via the EM algorithm (Ch 9).

Check: The sequential update for the ML mean follows which pattern?

Chapter 6: Bayesian Inference for the Gaussian

Instead of point estimates, the Bayesian approach maintains distributions over the Gaussian parameters.

Known variance, unknown mean: If $\sigma^2$ is known, the conjugate prior for μ is a Gaussian:

$$p(\mu) = \mathcal{N}(\mu \mid \mu_0, \sigma_0^2)$$

After observing N points with sample mean x̄, the posterior is:

$$p(\mu \mid D) = \mathcal{N}(\mu \mid \mu_N, \sigma_N^2)$$

where $\sigma_N^2 = \left(\frac{1}{\sigma_0^2} + \frac{N}{\sigma^2}\right)^{-1}$ and $\mu_N = \sigma_N^2 \left(\frac{N \bar{x}}{\sigma^2} + \frac{\mu_0}{\sigma_0^2}\right)$. The posterior mean is a precision-weighted average of the prior mean and the data mean.

Precision addition: The posterior precision $1/\sigma_N^2$ equals the prior precision $1/\sigma_0^2$ plus N times the data precision $1/\sigma^2$. More data → higher precision → tighter posterior. Each observation adds the same amount of precision. This is the "information accumulation" property of Bayesian updating.
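
These two formulas translate directly into code. The sketch below assumes a scalar Gaussian with known variance; the function name, the prior N(0, 1) and the data values are illustrative assumptions.

```python
def posterior_for_mean(data, sigma2, mu0, sigma0_2):
    """Posterior N(mu | mu_n, sigma_n2) for a Gaussian mean with known variance."""
    n = len(data)
    x_bar = sum(data) / n
    precision_n = 1.0 / sigma0_2 + n / sigma2       # precisions add
    sigma_n2 = 1.0 / precision_n
    mu_n = sigma_n2 * (n * x_bar / sigma2 + mu0 / sigma0_2)
    return mu_n, sigma_n2

data = [1.0, 3.0, 2.0, 2.0]     # made-up observations, sample mean 2.0
sigma2 = 1.0                    # known data variance (assumed)
mu0, sigma0_2 = 0.0, 1.0        # assumed prior mean and variance

mu_n, sigma_n2 = posterior_for_mean(data, sigma2, mu0, sigma0_2)
print(mu_n, sigma_n2)

# Posterior precision after 1, 2, 3, 4 observations: grows by 1/sigma2 each time.
precisions = [1.0 / posterior_for_mean(data[:k], sigma2, mu0, sigma0_2)[1]
              for k in range(1, len(data) + 1)]
print(precisions)
```

The posterior mean sits between the prior mean and the sample mean, pulled toward whichever carries more precision.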

Known mean, unknown variance: The conjugate prior for the precision $\beta = 1/\sigma^2$ is a Gamma distribution. After N observations, the posterior is also Gamma with updated parameters.

Both unknown: The conjugate prior for (μ, β) jointly is the Normal-Gamma distribution. For the multivariate case, it's the Normal-Wishart (where the Wishart is the conjugate prior for the precision matrix).

Predictive distribution: Instead of plugging in point estimates, we can marginalize over the posterior: $p(x_{\mathrm{new}} \mid D) = \int p(x_{\mathrm{new}} \mid \mu)\, p(\mu \mid D)\, d\mu$. This predictive distribution is wider than the one obtained by plugging in the ML estimate, because it also reflects our uncertainty about the parameters. For the Gaussian with known variance, the predictive is a Gaussian with mean $\mu_N$ and variance $\sigma^2 + \sigma_N^2$: the data variance plus the posterior uncertainty about the mean.

Check: In Bayesian inference for the Gaussian mean, what does each new observation add?

Chapter 7: The Exponential Family

The Bernoulli, Gaussian, multinomial, Poisson, gamma, beta, and Dirichlet all look different. But they share a common structure. The exponential family is a unified framework that encompasses all of them:

$$p(x \mid \eta) = h(x)\, g(\eta) \exp\left(\eta^T u(x)\right)$$

where η is the natural parameter vector, u(x) is the sufficient statistic vector, g(η) is the normalizing factor, and h(x) is a scaling function.

Why this matters: Any member of the exponential family automatically has: (1) a conjugate prior, (2) sufficient statistics that compress the data, and (3) a maximum likelihood estimate in terms of those sufficient statistics. These properties let us build a general-purpose Bayesian inference engine that works for any exponential family distribution.

Key properties of exponential family members:

Property | What it gives you
Sufficient statistics | You can summarize the data without losing information for inference
Conjugate prior | Posterior = prior with updated pseudo-counts
−log g(η) | Generates all moments: ∇(−log g(η)) = E[u(x)]
Maximum entropy | Exponential family = max entropy distribution given constraints on E[u]
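
As a concrete instance, the Bernoulli fits this template with h(x) = 1, u(x) = x, natural parameter η = ln(μ/(1 − μ)) and g(η) = σ(−η), where σ is the logistic sigmoid. The sketch below checks that form numerically and verifies the moment property with a finite difference; the function names and the value μ = 0.3 are assumptions for illustration.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def bernoulli_pmf(x, mu):
    """Standard form: mu^x (1 - mu)^(1 - x)."""
    return mu**x * (1.0 - mu)**(1 - x)

def bernoulli_expfam(x, eta):
    """Exponential-family form with h(x) = 1, u(x) = x, g(eta) = sigmoid(-eta)."""
    return sigmoid(-eta) * math.exp(eta * x)

def neg_log_g(eta):
    return -math.log(sigmoid(-eta))

mu = 0.3
eta = math.log(mu / (1.0 - mu))       # natural parameter (the log-odds)

for x in (0, 1):
    print(x, bernoulli_pmf(x, mu), bernoulli_expfam(x, eta))

# Central finite difference of -log g(eta); it should approximate E[u(x)] = mu.
eps = 1e-6
grad = (neg_log_g(eta + eps) - neg_log_g(eta - eps)) / (2.0 * eps)
print(grad)
```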

Noninformative priors: When we have no prior knowledge, we might want a prior that "says nothing." Two approaches: (1) flat priors that are constant everywhere (but can be improper, i.e., not normalizable), and (2) Jeffreys' prior, which is invariant under reparameterization. Jeffreys' prior is proportional to the square root of the determinant of the Fisher information matrix.

Check: What do all exponential family distributions have in common?

Chapter 8: Nonparametric Methods

Everything so far assumes a parametric form for the distribution (Gaussian, binomial, etc.). What if the true distribution has a shape that no parametric family can capture? Nonparametric methods let the data speak for itself.

The fundamental idea: in a region R of volume V containing K of N data points, the density is approximately:

p(x) ≈ K / (NV)

Two strategies arise from this formula, depending on what you fix:

Fix V, count K: This gives kernel density estimation (KDE, Parzen windows). Place a kernel (e.g., Gaussian) of fixed width at each data point. The density estimate is the average of all kernels. Smooth but sensitive to bandwidth choice.

Fix K, adapt V: This gives K-nearest neighbors (KNN). For each query point, find the K nearest data points. The density is K/(NV) where V is the volume of the sphere containing those K points. Adapts to local density automatically.
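
Both strategies are short enough to sketch in one dimension, where the "sphere" is an interval of volume V = 2r. The function names, the data and the settings h = 0.2 and K = 2 are assumptions for illustration.

```python
import math

def kde_gaussian(x, data, h):
    """Fix V (bandwidth h): average of Gaussian bumps centred on the data."""
    norm = 1.0 / (len(data) * h * math.sqrt(2.0 * math.pi))
    return norm * sum(math.exp(-0.5 * ((x - xn) / h) ** 2) for xn in data)

def knn_density(x, data, k):
    """Fix K: grow an interval around x until it holds k points, then K / (N V)."""
    dists = sorted(abs(x - xn) for xn in data)
    r = dists[k - 1]              # distance to the k-th nearest neighbour
    volume = 2.0 * r              # 1-D interval; zero if x coincides with k data points
    return k / (len(data) * volume)

data = [0.1, 0.2, 0.25, 0.3, 0.9]     # made-up sample

p_kde = kde_gaussian(0.25, data, h=0.2)
p_knn = knn_density(0.0, data, k=2)
print(p_kde, p_knn)
```

The kernel estimate integrates to one by construction. The K-nearest-neighbour estimate does not define a properly normalized density, which is one reason it is used more for classification than for density estimation.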

KNN classification: To classify a new point, find its K nearest neighbors in the training set and take a majority vote. Simple, no training phase, and surprisingly effective. The decision boundary adapts to the data density. The catch: it stores the entire training set and is slow at test time (must search through all N points). With K=1, the error rate is guaranteed to be at most twice the optimal Bayes error rate as N → ∞.
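
A brute-force version of this classifier fits in a few lines. The sketch below is illustrative only: the function name, the toy 2-D points and the labels are made up, and ties in the vote are broken arbitrarily.

```python
import math
from collections import Counter

def knn_classify(query, points, labels, k):
    """Majority vote among the k training points nearest to the query."""
    # Brute force: sort all N training indices by Euclidean distance to the query.
    order = sorted(range(len(points)), key=lambda i: math.dist(query, points[i]))
    votes = Counter(labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]

# Toy training set: two well-separated clusters.
points = [(0.0, 0.0), (0.2, 0.1), (0.1, 0.3),
          (1.0, 1.0), (0.9, 1.2), (1.1, 0.8)]
labels = ["a", "a", "a", "b", "b", "b"]

print(knn_classify((0.1, 0.1), points, labels, k=3))
print(knn_classify((1.0, 0.9), points, labels, k=3))
```

There is no training step at all: the cost is paid at query time, where every prediction scans the full training set.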
Interactive simulation: Kernel Density Estimation. Each data point contributes a small Gaussian bump. Adjust the bandwidth h (default 0.20) to see how smoothness changes. Too small: spiky. Too large: over-smoothed.

Check: In KNN, what happens as K increases?

Chapter 9: Summary

Chapter 2 gave us the probabilistic toolkit. Every model in the rest of the book is built from these distributions.

Distribution | Conjugate prior | Used in
Bernoulli | Beta | Binary classification, naive Bayes
Multinomial | Dirichlet | Text models, topic models, HMMs
Gaussian | Gaussian / Normal-Wishart | Regression, GPs, GMMs, Kalman
Exponential family | Generic conjugate | Unifying framework

The pattern: Choose a likelihood (how data is generated). Choose a conjugate prior (what you believe before seeing data). Multiply to get the posterior (updated beliefs). Marginalize to get the predictive (what you expect next). This four-step recipe is the backbone of Bayesian machine learning.

What comes next: Chapter 3 applies these distributions to the first real modeling task: linear regression. We'll see the Gaussian likelihood + Gaussian prior = Gaussian posterior pattern in action, giving us Bayesian linear regression with closed-form solutions.

"The posterior distribution for parameters combines the
prior information with the information from the observed data."
— Christopher Bishop, PRML
Check: What is the key advantage of conjugate priors?