Polynomial curve fitting, probability theory, Bayesian inference, the curse of dimensionality, decision theory, and information theory.
You are handed a pile of envelopes. Each one has a handwritten zip code. Your job: build a system that reads the digits automatically. Some people write "7" with a cross-stroke, others don't. Some "1"s look like "7"s. Some "4"s look like "9"s. How do you handle this?
One approach: write rules by hand. "If the digit has a loop at the top, it's probably a 9 or 6." But rules breed exceptions, exceptions breed sub-rules, and soon you have thousands of fragile heuristics that still fail on messy handwriting.
A better approach: show the system thousands of examples and let it learn the patterns. This is pattern recognition — the automatic discovery of regularities in data through algorithms, and the use of those regularities to take actions like classifying digits.
Bishop's textbook covers three types of learning:
| Type | What you have | Goal |
|---|---|---|
| Supervised | Inputs + labels | Predict labels for new inputs |
| Unsupervised | Inputs only | Discover structure (clusters, density) |
| Reinforcement | States + rewards | Learn actions that maximize reward |
Supervised learning itself splits into two sub-problems. When the output is a discrete label (like "this digit is a 7"), we call it classification. When the output is a continuous value (like "the house price is $350k"), we call it regression.
This chapter introduces the key concepts through a beautifully simple example: fitting a polynomial to noisy data. From there, we build up to probability theory, Bayesian reasoning, and information theory — the tools that power everything in the remaining 13 chapters.
Suppose you observe 10 data points generated from the function sin(2πx) plus some random noise. You don't know the underlying function. You only see the noisy points. Can you figure out what generated them?
A natural approach: fit a polynomial. A polynomial of order M takes the form:
The weights w0, w1, …, wM are unknown parameters we need to determine. We choose them to minimize an error function — the sum of squared differences between the polynomial's predictions and the actual data:
This is the sum-of-squares error. We pick the weights that make E as small as possible. Since E is a quadratic function of the weights, it has a unique minimum that can be found in closed form. (Later, we'll see that this error function arises naturally from a probabilistic perspective.)
10 data points from sin(2πx) + noise. Adjust the polynomial order M to see how the fit changes. Low M underfits, high M overfits.
| Symbol | Meaning |
|---|---|
| xn | The n-th input value |
| tn | The n-th target (observed output) |
| wj | The j-th polynomial coefficient |
| M | Polynomial order (model complexity) |
| E(w) | Sum-of-squares error function |
With M = 0 (a constant), the fit is terrible — a flat line that ignores the data. With M = 1 (a line), it's better but still misses the sine-wave shape. With M = 3, it captures the curve beautifully. So far, higher M means better fit.
But try M = 9. With 10 data points and 10 parameters, the polynomial passes exactly through every point (E = 0). The training error is perfect. Yet the fitted curve oscillates wildly between the data points. It has memorized the noise instead of learning the signal.
How do we detect overfitting? Use a separate test set — data the model has never seen. Compute the root-mean-square error (RMS) on both training and test sets:
For M = 9, the training RMS is near zero but the test RMS is enormous. For M = 3, both are small. The test error reveals the truth.
Two cures for overfitting:
The regularization parameter λ controls the trade-off. Large λ means heavy penalty (underfitting). λ = 0 means no penalty (back to raw least squares). The sweet spot is in between.
Imagine two boxes: a red box with 2 apples and 6 oranges, and a blue box with 3 apples and 1 orange. You pick a box at random (red 40% of the time, blue 60%) and then pick a fruit. What's the probability of getting an apple? If you got an orange, which box did it probably come from?
All of probability theory rests on two rules:
The sum rule says: to find the probability of X alone, sum (marginalize) the joint probability over all values of Y. The product rule says: the joint probability of X and Y equals the conditional probability of Y given X, times the probability of X.
For the fruit example: p(apple) = p(apple|red)p(red) + p(apple|blue)p(blue) = (1/4)(4/10) + (3/4)(6/10) = 11/20. The sum rule in action.
| Concept | Notation | Meaning |
|---|---|---|
| Joint | p(X, Y) | Probability of X and Y together |
| Marginal | p(X) | Probability of X, summing over Y |
| Conditional | p(Y|X) | Probability of Y given X is known |
| Prior | p(X) | Belief about X before seeing data |
| Posterior | p(X|Y) | Updated belief after observing Y |
For continuous variables, sums become integrals and probabilities become probability densities. A density p(x) is not a probability itself — it can exceed 1. The probability of x falling in an interval [a, b] is the integral ∫ab p(x) dx.
Two important summary statistics: the expectation (average value of a function under a distribution) and the variance (spread around the mean):
Combining the two rules of probability gives us the most important equation in this book:
In words: posterior = likelihood × prior / evidence. Let's unpack each piece.
The denominator p(D) = ∫ p(D|w) p(w) dw is the model evidence (or marginal likelihood). It normalizes the posterior but also plays a crucial role in model selection — comparing different models to see which best explains the data.
Back to our fruit example: we picked an orange. Which box? Before seeing the fruit, p(red) = 0.4 (the prior). After seeing an orange:
The prior favored blue (0.6 vs 0.4). But the likelihood strongly favors red (oranges are 75% of the red box vs 25% of blue). The posterior updates to 2/3 for red. Data has overridden the prior.
The single most important probability distribution in this book. For a single variable x:
The bell curve. Parameterized by the mean μ (center) and variance σ2 (spread). The precision β = 1/σ2 is the inverse variance — higher precision means tighter concentration around the mean.
Adjust μ and σ to see how the bell curve changes shape. The shaded area always integrates to 1.
Given N data points {x1, …, xN} drawn i.i.d. from a Gaussian, the maximum likelihood estimates are:
The ML mean is just the sample average — perfectly unbiased. But the ML variance is biased: it systematically underestimates the true variance by a factor of (N−1)/N. This bias comes from measuring spread relative to the sample mean (which itself was fitted to the data) instead of the true mean. The unbiased estimator uses N−1 instead of N.
For D-dimensional vectors, the Gaussian generalizes to the multivariate Gaussian:
where Σ is the D×D covariance matrix. Its shape defines the ellipsoidal contours of the distribution.
Consider a sphere of radius 1 in D dimensions. What fraction of its volume lies in a thin shell between radius r = 1 − ε and r = 1?
For D = 1, with ε = 0.01, the fraction is just 1%. For D = 100, it is 63%. For D = 1000, it is 99.99%. Almost all the volume of a high-dimensional sphere is concentrated in a thin shell near the surface.
This doesn't mean that machine learning is hopeless in high dimensions. Real data has two saving graces:
Low intrinsic dimensionality: Images of a face at different angles live on a low-dimensional manifold (parameterized by a few rotation angles) embedded in a high-dimensional pixel space.
Local smoothness: Small changes in input usually produce small changes in output. This lets us interpolate between nearby data points, making local methods work even when the ambient dimension is huge.
A Gaussian distribution in D dimensions concentrates most of its mass in a thin shell at a specific radius √D from the center. The mode (the peak at the center) is actually in a region of very low probability mass. This counterintuitive fact means that the most probable point is not where most of the probability is.
Probability theory tells us what we know. Decision theory tells us what to do about it. Together, they form a complete framework for making optimal predictions under uncertainty.
Suppose we have a classification problem with classes C1, …, CK. We know the posterior probabilities p(Ck|x) for each class given a new input x. How should we assign x to a class?
The simplest rule: pick the class with the highest posterior probability. This minimizes the misclassification rate:
But sometimes misclassifications have different costs. Misdiagnosing cancer (calling it healthy) is far worse than a false alarm. A loss function Lkj specifies the cost of assigning to class Cj when the truth is Ck. The optimal decision minimizes expected loss:
Bishop distinguishes three increasingly powerful approaches to classification:
| Approach | What you model | Flexibility |
|---|---|---|
| Generative | p(x|Ck) and p(Ck), then use Bayes | Can generate synthetic data, detect outliers |
| Discriminative | p(Ck|x) directly | Focused on the decision boundary |
| Discriminant | A function that maps x to class label | Simplest, no probabilities |
For regression, the decision problem is: given the posterior p(t|x), what single value should we predict? Under squared loss, the optimal prediction is the conditional mean E[t|x]. Under absolute loss, it's the conditional median. The loss function determines the optimal summary of the posterior.
How much "information" does a random variable carry? If someone tells you "the sun rose this morning," that's low information — you already expected it. If they say "a meteor hit downtown," that's high information — it was extremely unlikely. Information is inversely related to probability.
Shannon formalized this intuition. The information content of an event with probability p is:
Measured in bits. Certain events (p = 1) carry 0 bits. A fair coin flip (p = 0.5) carries 1 bit. A 1-in-a-million event carries about 20 bits.
The entropy is the expected information content — the average surprise:
The KL divergence (relative entropy) measures how different two distributions are:
KL(p||q) = 0 if and only if p = q. It is always non-negative (Gibbs' inequality). But it is not symmetric: KL(p||q) ≠ KL(q||p) in general. This asymmetry will matter greatly when we study variational inference in Chapter 10.
Mutual information measures how much knowing X tells you about Y:
If X and Y are independent, the joint equals the product of marginals, and the mutual information is zero. Any dependency between them makes it positive.
Chapter 1 establishes the conceptual foundation for everything that follows. Let's consolidate the key ideas.
| Concept | Key idea | Where it leads |
|---|---|---|
| Polynomial fitting | Minimize sum-of-squares to find weights | Least squares, linear models (Ch 3) |
| Overfitting | Complex models memorize noise | Regularization, Bayesian methods |
| Probability rules | Sum rule + product rule = everything | Foundation for all inference |
| Bayes' theorem | posterior ∝ likelihood × prior | Bayesian ML throughout the book |
| Gaussian | The bell curve, ML bias | Ch 2 (distributions), Ch 3-5 (models) |
| Curse of dimensionality | High-D space is mostly empty | Motivates kernels, manifold methods |
| Decision theory | Optimal actions under uncertainty | Classification, regression losses |
| Information theory | Entropy, KL divergence, mutual info | EM (Ch 9), variational (Ch 10) |
What comes next: Chapter 2 develops probability distributions in depth — the building blocks we'll use for everything: binary variables and the beta distribution, multinomial variables and the Dirichlet, the Gaussian in full detail, the exponential family, and nonparametric methods.