The building blocks of probabilistic models: binary, multinomial, Gaussian, exponential family, and nonparametric methods.
In Chapter 1, we treated the data as fixed and learned parameters. But where does the data come from? What assumptions are we making about how it was generated? To answer these questions, we need a language for describing random processes. That language is probability distributions.
A probability distribution is a mathematical recipe for generating data. If we believe our data was generated by some distribution, then learning amounts to figuring out which distribution (or which parameters of that distribution) best explains what we've observed.
This chapter covers a progression of increasingly rich distributions:
| Distribution | What it models | Example |
|---|---|---|
| Bernoulli | Binary outcomes | Coin flips, spam/not-spam |
| Binomial | Count of successes in N trials | Number of heads in 10 flips |
| Beta | Prior over Bernoulli parameter | Belief about coin bias |
| Multinomial | Outcomes with K categories | Dice rolls, word counts |
| Dirichlet | Prior over multinomial parameters | Belief about dice bias |
| Gaussian | Continuous, bell-shaped data | Measurement noise, heights |
A unifying theme: each distribution has a natural conjugate prior — a prior distribution that, when combined with the likelihood via Bayes' theorem, yields a posterior in the same family. This makes Bayesian inference analytically tractable. Conjugacy is elegant mathematics, but it's also practically useful: it gives us closed-form updates instead of intractable integrals.
The simplest possible distribution describes a single binary random variable $x \in \{0, 1\}$, such as a coin flip. The Bernoulli distribution has one parameter $\mu \in [0, 1]$, the probability of "heads" ($x = 1$):

$$\mathrm{Bern}(x \mid \mu) = \mu^{x}(1-\mu)^{1-x}$$
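Its mean is $E[x] = \mu$ and its variance is $\mathrm{var}[x] = \mu(1-\mu)$. Given $N$ independent observations $D = \{x_1, \dots, x_N\}$, the likelihood is:

$$p(D \mid \mu) = \prod_{n=1}^{N} \mu^{x_n}(1-\mu)^{1-x_n} = \mu^{m}(1-\mu)^{N-m}$$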
where $m = \sum_n x_n$ is the number of "heads." The count $m$ is a sufficient statistic: given $m$ and $N$, the order of the individual observations does not matter.
Maximizing the log-likelihood gives the maximum likelihood estimate:

$$\mu_{\mathrm{ML}} = \frac{m}{N}$$
The sample proportion. If you flip 3 heads in 3 tries, ML says μ = 1. The coin is definitely always heads. This is clearly absurd for small N — ML overfits with limited data.
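The small-sample failure mode is easy to reproduce. The following is a minimal sketch; the function name `bernoulli_mle` and the example data are illustrative assumptions, not part of the original text.

```python
def bernoulli_mle(observations):
    """Return the ML estimate m / N for a list of 0/1 outcomes."""
    n = len(observations)
    if n == 0:
        raise ValueError("need at least one observation")
    m = sum(observations)  # sufficient statistic: number of heads
    return m / n


# Three heads in three flips: ML concludes the coin always lands heads.
mu_small = bernoulli_mle([1, 1, 1])

# With more data the estimate is far less extreme.
mu_large = bernoulli_mle([1] * 55 + [0] * 45)

print(mu_small, mu_large)
```

Only the count of heads and the number of flips enter the estimate, which is the sufficiency property described above.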
The binomial distribution counts the total number $m$ of heads in $N$ independent Bernoulli trials:

$$\mathrm{Bin}(m \mid N, \mu) = C(N, m)\,\mu^{m}(1-\mu)^{N-m}$$
where C(N, m) = N! / (m!(N−m)!) is the binomial coefficient. The mean is Nμ and variance is Nμ(1 − μ).
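As a quick sanity check, the mean and variance formulas can be verified by summing over the probability mass function directly. This is a sketch using only the standard library; `binomial_pmf` and the chosen values of N and μ are illustrative assumptions.

```python
from math import comb


def binomial_pmf(m, n, mu):
    """Probability of exactly m successes in n trials with success rate mu."""
    return comb(n, m) * mu**m * (1 - mu) ** (n - m)


n, mu = 10, 0.3
pmf = [binomial_pmf(m, n, mu) for m in range(n + 1)]

total = sum(pmf)                                   # should be 1
mean = sum(m * p for m, p in enumerate(pmf))       # should match N * mu
variance = sum((m - mean) ** 2 * p for m, p in enumerate(pmf))  # N * mu * (1 - mu)

print(total, mean, variance)
```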
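To do Bayesian inference on the Bernoulli parameter $\mu$, we need a prior distribution over $\mu \in [0, 1]$. The beta distribution is the conjugate prior for the Bernoulli:

$$\mathrm{Beta}(\mu \mid a, b) = \frac{\Gamma(a+b)}{\Gamma(a)\,\Gamma(b)}\,\mu^{a-1}(1-\mu)^{b-1}$$

Here $\Gamma$ is the gamma function, and the leading ratio normalizes the density.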
The hyperparameters a and b control the shape. Think of them as "pseudo-counts": a is the number of imaginary heads, b is the number of imaginary tails we've seen before collecting any real data.
After observing $m$ heads and $l = N - m$ tails, the posterior is also a beta distribution:

$$p(\mu \mid m, l, a, b) = \mathrm{Beta}(\mu \mid a + m,\; b + l)$$
The posterior mean is (a+m)/(a+b+N), which is a weighted average of the prior mean a/(a+b) and the ML estimate m/N. As N grows, the data overwhelms the prior and the posterior concentrates around the ML estimate. With small N, the prior keeps us from extreme conclusions.
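The pseudo-count update and the weighted-average view of the posterior mean can be sketched in a few lines. The helper names and the prior Beta(2, 2) are assumptions chosen for illustration.

```python
def beta_posterior(a, b, heads, tails):
    """Conjugate update: add observed counts to the prior pseudo-counts."""
    return a + heads, b + tails


def beta_mean(a, b):
    """Mean of a Beta(a, b) distribution."""
    return a / (a + b)


a0, b0 = 2.0, 2.0   # assumed prior: two imaginary heads, two imaginary tails
m, l = 3, 0         # observed: three heads, no tails
n = m + l

a_n, b_n = beta_posterior(a0, b0, m, l)
post_mean = beta_mean(a_n, b_n)          # (a + m) / (a + b + N)

# Same number, written as a blend of the prior mean and the ML estimate.
w = n / (a0 + b0 + n)                    # weight on the data
blend = (1 - w) * beta_mean(a0, b0) + w * (m / n)

print(post_mean, blend)
```

With three heads in three flips the posterior mean is 5/7 rather than the ML value of 1, so the prior prevents the extreme conclusion.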
Binary outcomes are a special case. Often we have K ≥ 2 possible outcomes: K sides of a die, K categories of email, K words in a vocabulary. The multinomial distribution generalizes the binomial to K categories.
Represent a single observation as a one-hot vector $x = (0, \dots, 0, 1, 0, \dots, 0)^{T}$ where the $k$-th element is 1 and the rest are 0. The distribution is:

$$p(x \mid \mu) = \prod_{k=1}^{K} \mu_k^{x_k}$$
where $\mu = (\mu_1, \dots, \mu_K)$ with $\mu_k \ge 0$ and $\sum_k \mu_k = 1$. The ML estimate is $\mu_{k,\mathrm{ML}} = m_k / N$, where $m_k$ is the count of category $k$.
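The conjugate prior for the multinomial is the Dirichlet distribution:

$$\mathrm{Dir}(\mu \mid \alpha) = \frac{\Gamma(\alpha_0)}{\Gamma(\alpha_1)\cdots\Gamma(\alpha_K)} \prod_{k=1}^{K} \mu_k^{\alpha_k - 1}, \qquad \alpha_0 = \sum_{k=1}^{K} \alpha_k$$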
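The hyperparameters $\alpha_k$ play the same role as the beta's $a$ and $b$: they are pseudo-counts for each category. The posterior after observing counts $(m_1, \dots, m_K)$ is:

$$p(\mu \mid D, \alpha) = \mathrm{Dir}(\mu \mid \alpha_1 + m_1, \dots, \alpha_K + m_K)$$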
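The Dirichlet update is the multi-category version of the beta update: add the observed counts to the pseudo-counts. A minimal sketch follows, assuming a uniform prior with every pseudo-count equal to 1 and made-up die-roll counts.

```python
def dirichlet_posterior(alpha, counts):
    """Conjugate update: add observed category counts to the pseudo-counts."""
    if len(alpha) != len(counts):
        raise ValueError("alpha and counts must have the same length")
    return [a + m for a, m in zip(alpha, counts)]


def dirichlet_mean(alpha):
    """Mean of a Dirichlet: each pseudo-count divided by their sum."""
    total = sum(alpha)
    return [a / total for a in alpha]


alpha0 = [1.0] * 6              # assumed prior: one imaginary roll per face
counts = [3, 0, 1, 0, 0, 2]     # hypothetical data: six rolls of a die

ml_estimate = [m / sum(counts) for m in counts]
alpha_n = dirichlet_posterior(alpha0, counts)
post_mean = dirichlet_mean(alpha_n)

print(ml_estimate)
print(post_mean)
```

Faces that were never observed get probability zero under ML but keep a small positive probability under the posterior mean.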
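We met the Gaussian in Chapter 1. Now we study it seriously. The $D$-dimensional Gaussian is:

$$\mathcal{N}(x \mid \mu, \Sigma) = \frac{1}{(2\pi)^{D/2}\,|\Sigma|^{1/2}} \exp\left\{-\frac{1}{2}(x-\mu)^{T}\Sigma^{-1}(x-\mu)\right\}$$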
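The quadratic form in the exponent, $\Delta^2 = (x-\mu)^{T}\Sigma^{-1}(x-\mu)$, is the squared Mahalanobis distance from $x$ to $\mu$, a distance that accounts for the shape and orientation of the distribution. Contours of constant density are surfaces of constant Mahalanobis distance: ellipses in two dimensions and ellipsoids in general.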
Three crucial operations on Gaussians (all yield Gaussians):
| Operation | Result |
|---|---|
| Conditioning: $p(x_a \mid x_b)$ | Gaussian with mean that depends linearly on $x_b$ |
| Marginalizing: $\int p(x_a, x_b)\,dx_b$ | Gaussian with mean $\mu_a$ and covariance $\Sigma_{aa}$ |
| Multiplying: $p(x_a \mid x_b) \cdot p(x_b)$ | Joint Gaussian over $(x_a, x_b)$ |
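
For a bivariate Gaussian, the conditioning and marginalizing rules reduce to scalar formulas, which makes them easy to check by hand. The sketch below assumes scalar blocks; the function names and the example numbers are illustrative.

```python
def condition_2d(mu_a, mu_b, s_aa, s_ab, s_bb, x_b):
    """Mean and variance of p(x_a | x_b) for a bivariate Gaussian.

    s_aa and s_bb are the variances, s_ab is the covariance.
    """
    mean = mu_a + (s_ab / s_bb) * (x_b - mu_b)  # linear in x_b
    var = s_aa - s_ab**2 / s_bb                 # does not depend on x_b
    return mean, var


def marginal_2d(mu_a, s_aa):
    """Mean and variance of p(x_a): just read off the matching blocks."""
    return mu_a, s_aa


# Assumed example: zero means, unit variances, covariance 0.8.
cond_mean, cond_var = condition_2d(0.0, 0.0, 1.0, 0.8, 1.0, x_b=1.0)
marg_mean, marg_var = marginal_2d(0.0, 1.0)

print(cond_mean, cond_var)
print(marg_mean, marg_var)
```

Observing the second variable shifts the mean toward the observation and shrinks the variance below the marginal variance.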
The precision matrix $\Lambda = \Sigma^{-1}$ is often more natural than the covariance. Zero entries in the precision matrix indicate conditional independence: $\Lambda_{ij} = 0$ means $x_i$ and $x_j$ are conditionally independent given all other variables. This connects to graphical models (Ch 8).
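A small worked example makes the distinction between covariance and precision concrete. Assume a hypothetical chain in which x1 has unit variance, x2 = x1 + noise, and x3 = x2 + noise, with independent unit-variance noise. All three variables are correlated, yet x1 and x3 are conditionally independent given x2. The matrices below are written out by hand for that assumed chain, and the code only verifies that they are inverses of each other.

```python
def matmul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


# Covariance of the assumed chain x1 -> x2 -> x3.
sigma = [[1, 1, 1],
         [1, 2, 2],
         [1, 2, 3]]

# Candidate precision matrix for the same chain.
lam = [[2, -1, 0],
       [-1, 2, -1],
       [0, -1, 1]]

product = matmul(sigma, lam)
identity = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]

print(product == identity)       # lam really is the inverse of sigma
print(sigma[0][2], lam[0][2])    # nonzero covariance, zero precision entry
```

The covariance between x1 and x3 is nonzero because they are marginally dependent, while the matching precision entry is zero because x2 carries all the information that links them.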
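Given $N$ data points drawn i.i.d. from a $D$-dimensional Gaussian, the ML estimates are:

$$\mu_{\mathrm{ML}} = \frac{1}{N}\sum_{n=1}^{N} x_n, \qquad \Sigma_{\mathrm{ML}} = \frac{1}{N}\sum_{n=1}^{N} (x_n - \mu_{\mathrm{ML}})(x_n - \mu_{\mathrm{ML}})^{T}$$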
The ML mean is unbiased: $E[\mu_{\mathrm{ML}}] = \mu$. But the ML covariance is biased: $E[\Sigma_{\mathrm{ML}}] = \frac{N-1}{N}\,\Sigma$. It underestimates the true covariance because it measures spread around the fitted mean rather than the true mean.
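The bias is easy to see in simulation for the one-dimensional case. The sketch below is a Monte Carlo check under assumed settings (true variance 4, samples of size 5, a fixed seed); the averages are approximate, not exact.

```python
import random


def ml_variance(xs):
    """ML variance: average squared deviation from the fitted mean."""
    n = len(xs)
    mean = sum(xs) / n
    return sum((x - mean) ** 2 for x in xs) / n


rng = random.Random(0)          # fixed seed so the run is repeatable
true_var = 4.0
n_points = 5
trials = 20000

avg_ml_var = sum(
    ml_variance([rng.gauss(0.0, true_var ** 0.5) for _ in range(n_points)])
    for _ in range(trials)
) / trials

expected = (n_points - 1) / n_points * true_var     # (N - 1) / N * sigma^2
corrected = avg_ml_var * n_points / (n_points - 1)  # rescale to undo the bias

print(avg_ml_var, expected, corrected)
```

The averaged ML variance lands near 3.2 rather than 4, and rescaling by N/(N − 1) brings it back close to the true value.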
Beyond the standard Gaussian, Bishop covers two useful variants:
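Student's t-distribution: A Gaussian with uncertain precision. It arises by placing a Gamma prior on the precision of a Gaussian and integrating the precision out, which amounts to an infinite mixture of Gaussians with the same mean and different variances. It has heavier tails than a Gaussian, which makes it more robust to outliers. As the degrees of freedom ν → ∞, it becomes a Gaussian.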
Mixture of Gaussians: A weighted sum of K Gaussians. Can model multimodal distributions. The parameters (means, covariances, weights) are learned via the EM algorithm (Ch 9).
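Evaluating a mixture density takes only a few lines once the parameters are given. The sketch below assumes a one-dimensional mixture with two equally weighted, unit-variance components at −2 and +2; fitting the parameters with EM is not shown.

```python
from math import exp, pi, sqrt


def gaussian_pdf(x, mean, var):
    """Density of a univariate Gaussian."""
    return exp(-((x - mean) ** 2) / (2 * var)) / sqrt(2 * pi * var)


def mixture_pdf(x, weights, means, variances):
    """Weighted sum of Gaussian densities; the weights must sum to 1."""
    return sum(w * gaussian_pdf(x, m, v)
               for w, m, v in zip(weights, means, variances))


weights = [0.5, 0.5]
means = [-2.0, 2.0]
variances = [1.0, 1.0]

p_mode = mixture_pdf(2.0, weights, means, variances)    # near a component mean
p_valley = mixture_pdf(0.0, weights, means, variances)  # between the components

# Crude Riemann sum to confirm the mixture still integrates to about 1.
step = 0.01
area = step * sum(mixture_pdf(-10 + step * i, weights, means, variances)
                  for i in range(2001))

print(p_mode, p_valley, area)
```

The density is lower in the valley at 0 than at either component mean, which is the multimodal shape a single Gaussian cannot represent.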
Instead of point estimates, the Bayesian approach maintains distributions over the Gaussian parameters.
Known variance, unknown mean: If $\sigma^2$ is known, the conjugate prior for $\mu$ is a Gaussian:

$$p(\mu) = \mathcal{N}(\mu \mid \mu_0, \sigma_0^2)$$
After observing $N$ points with sample mean $\bar{x}$, the posterior is:

$$p(\mu \mid D) = \mathcal{N}(\mu \mid \mu_N, \sigma_N^2)$$
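where $\sigma_N^2 = \left(\frac{1}{\sigma_0^2} + \frac{N}{\sigma^2}\right)^{-1}$ and $\mu_N = \sigma_N^2\left(\frac{N\bar{x}}{\sigma^2} + \frac{\mu_0}{\sigma_0^2}\right)$. The posterior mean is a precision-weighted average of the prior mean and the data mean.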
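These two update formulas translate directly into code. The following is a sketch; the function name and the example values (prior mean 0, prior variance 1, noise variance 1, four observations equal to 2) are assumptions.

```python
def gaussian_mean_posterior(mu0, var0, var, xs):
    """Posterior over the mean of a Gaussian with known variance `var`.

    mu0 and var0 are the prior mean and prior variance.
    Returns the posterior mean and posterior variance.
    """
    n = len(xs)
    xbar = sum(xs) / n
    var_n = 1.0 / (1.0 / var0 + n / var)          # precisions add
    mu_n = var_n * (n * xbar / var + mu0 / var0)  # precision-weighted average
    return mu_n, var_n


mu_n, var_n = gaussian_mean_posterior(0.0, 1.0, 1.0, [2.0, 2.0, 2.0, 2.0])
print(mu_n, var_n)
```

With four observations against one unit of prior precision, the posterior mean sits four fifths of the way from the prior mean to the sample mean, and the posterior variance drops to one fifth.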
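Known mean, unknown variance: The conjugate prior for the precision $\beta = 1/\sigma^2$ is a Gamma distribution $\mathrm{Gam}(\beta \mid a_0, b_0)$ with shape $a_0$ and rate $b_0$. After $N$ observations the posterior is again a Gamma, with $a_N = a_0 + N/2$ and $b_N = b_0 + \frac{1}{2}\sum_n (x_n - \mu)^2$.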
Both unknown: The conjugate prior for (μ, β) jointly is the Normal-Gamma distribution. For the multivariate case, it's the Normal-Wishart (where the Wishart is the conjugate prior for the precision matrix).
The Bernoulli, Gaussian, multinomial, Poisson, gamma, beta, and Dirichlet all look different. But they share a common structure. The exponential family is a unified framework that encompasses all of them:

$$p(x \mid \eta) = h(x)\,g(\eta)\,\exp\left\{\eta^{T}u(x)\right\}$$
where η is the natural parameter vector, u(x) is the sufficient statistic vector, g(η) is the normalizing factor, and h(x) is a scaling function.
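To see the general form at work, the Bernoulli can be rewritten with natural parameter η = ln(μ/(1 − μ)), u(x) = x, h(x) = 1, and g(η) = σ(−η), where σ is the logistic sigmoid. The sketch below checks numerically that this reproduces the Bernoulli probabilities, and that the derivative of −ln g(η) equals E[x] = μ. The function names and the finite-difference step are illustrative choices.

```python
from math import exp, log


def sigmoid(a):
    """Logistic sigmoid."""
    return 1.0 / (1.0 + exp(-a))


def bernoulli_natural(x, eta):
    """Bernoulli in exponential-family form: h(x) = 1, u(x) = x."""
    g = sigmoid(-eta)          # normalizing factor g(eta)
    return g * exp(eta * x)


def neg_log_g(eta):
    return -log(sigmoid(-eta))


mu = 0.3
eta = log(mu / (1 - mu))       # natural parameter (the log-odds)

p1 = bernoulli_natural(1, eta)   # should recover mu
p0 = bernoulli_natural(0, eta)   # should recover 1 - mu

# Central finite difference for d/d(eta) of -ln g(eta).
step = 1e-6
grad = (neg_log_g(eta + step) - neg_log_g(eta - step)) / (2 * step)

print(p1, p0, grad)
```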
Key properties of exponential family members:
| Property | What it gives you |
|---|---|
| Sufficient statistics | You can summarize the data without losing information for inference |
| Conjugate prior | Posterior = prior with updated pseudo-counts |
| −log g(η) | Generates all moments: ∇ (−log g) = E[u(x)] |
| Maximum entropy | Exponential family = max entropy distribution given constraints on E[u] |
Noninformative priors: When we have no prior knowledge, we might want a prior that "says nothing." Two approaches: (1) flat priors that are constant everywhere (but can be improper, i.e., not normalizable), and (2) Jeffreys' prior, which is invariant under reparameterization. Jeffreys' prior is proportional to the square root of the determinant of the Fisher information matrix.
Everything so far assumes a parametric form for the distribution (Gaussian, binomial, etc.). What if the true distribution has a shape that no parametric family can capture? Nonparametric methods let the data speak for itself.
The fundamental idea: in a small region R of volume V around the point x, containing K of the N data points, the density is approximately:

$$p(x) \approx \frac{K}{N V}$$
Two strategies arise from this formula, depending on what you fix:
Fix V, count K: This gives kernel density estimation (KDE, Parzen windows). Place a kernel (e.g., Gaussian) of fixed width at each data point. The density estimate is the average of all kernels. Smooth but sensitive to bandwidth choice.
Fix K, adapt V: This gives K-nearest neighbors (KNN). For each query point, find the K nearest data points. The density is K/(NV) where V is the volume of the sphere containing those K points. Adapts to local density automatically.
In KDE with a Gaussian kernel, each data point contributes a small Gaussian bump, and the bandwidth h controls the smoothness of the estimate: if h is too small the estimate is spiky, and if it is too large the estimate is over-smoothed.
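Both strategies fit in a few lines for one-dimensional data. This is a minimal sketch, assuming a Gaussian kernel for KDE and an interval of half-width equal to the distance to the K-th neighbour for KNN; the data set and the function names are made up for illustration.

```python
from math import exp, pi, sqrt


def kde(x, data, h):
    """Fix V, count K: average of Gaussian bumps of bandwidth h."""
    norm = 1.0 / (len(data) * h * sqrt(2 * pi))
    return norm * sum(exp(-((x - xn) ** 2) / (2 * h * h)) for xn in data)


def knn_density(x, data, k):
    """Fix K, adapt V: K / (N V), with V the interval reaching the k-th neighbour."""
    dists = sorted(abs(x - xn) for xn in data)
    radius = dists[k - 1]
    if radius == 0:
        raise ValueError("k-th neighbour coincides with x; volume is zero")
    volume = 2 * radius        # a 1-D "sphere" is an interval
    return k / (len(data) * volume)


data = [1.0, 1.2, 1.9, 2.1, 2.4, 5.0]   # hypothetical sample

# In the gap between 2.4 and 5.0, a narrow bandwidth sees almost nothing.
spiky = kde(3.5, data, h=0.1)
smooth = kde(3.5, data, h=1.0)

knn = knn_density(2.0, data, k=3)

# Crude Riemann sum: the KDE integrates to about 1.
step = 0.01
area = step * sum(kde(-5 + step * i, data, 0.5) for i in range(1701))

print(spiky, smooth, knn, area)
```

The KDE is a proper density for any bandwidth, while the KNN estimate adapts its volume to the local spacing of the data.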
Chapter 2 gave us the probabilistic toolkit. Every model in the rest of the book is built from these distributions.
| Distribution | Conjugate Prior | Used in |
|---|---|---|
| Bernoulli | Beta | Binary classification, naive Bayes |
| Multinomial | Dirichlet | Text models, topic models, HMMs |
| Gaussian | Gaussian / Normal-Wishart | Regression, GPs, GMMs, Kalman |
| Exponential family | Generic conjugate | Unifying framework |
What comes next: Chapter 3 applies these distributions to the first real modeling task: linear regression. We'll see the Gaussian likelihood + Gaussian prior = Gaussian posterior pattern in action, giving us Bayesian linear regression with closed-form solutions.