Ch 1: Introduction — Bishop PRML

Chapter 0: The Problem

You are handed a pile of envelopes. Each one has a handwritten zip code. Your job: build a system that reads the digits automatically. Some people write "7" with a cross-stroke, others don't. Some "1"s look like "7"s. Some "4"s look like "9"s. How do you handle this?

One approach: write rules by hand. "If the digit has a loop at the top, it's probably a 9 or 6." But rules breed exceptions, exceptions breed sub-rules, and soon you have thousands of fragile heuristics that still fail on messy handwriting.

A better approach: show the system thousands of examples and let it learn the patterns. This is pattern recognition — the automatic discovery of regularities in data through algorithms, and the use of those regularities to take actions like classifying digits.

The core idea of this book: Instead of programming rules, we program the ability to learn from data. Probability theory gives us a language for uncertainty. Decision theory tells us how to act on that uncertainty. Together, they form the mathematical backbone of modern machine learning.

Bishop's textbook covers three types of learning:

Type	What you have	Goal
Supervised	Inputs + labels	Predict labels for new inputs
Unsupervised	Inputs only	Discover structure (clusters, density)
Reinforcement	States + rewards	Learn actions that maximize reward

Supervised learning itself splits into two sub-problems. When the output is a discrete label (like "this digit is a 7"), we call it classification. When the output is a continuous value (like "the house price is $350k"), we call it regression.

This chapter introduces the key concepts through a beautifully simple example: fitting a polynomial to noisy data. From there, we build up to probability theory, Bayesian reasoning, and information theory — the tools that power everything in the remaining 13 chapters.

Feature extraction: Raw data (images, audio, text) is often preprocessed into a more useful representation before feeding it to a learning algorithm. These preprocessed representations are called features. Choosing good features can make or break a system — a theme that pervades all of machine learning.

Check: What distinguishes supervised from unsupervised learning?

The size of the dataset Whether target labels are provided during training The type of algorithm used

Chapter 1: Polynomial Curve Fitting

Suppose you observe 10 data points generated from the function sin(2πx) plus some random noise. You don't know the underlying function. You only see the noisy points. Can you figure out what generated them?

A natural approach: fit a polynomial. A polynomial of order M takes the form:

y(x, w) = w₀ + w₁x + w₂x² + … + w_Mx^M = ∑_j=0^M w_j x^j

The weights w₀, w₁, …, w_M are unknown parameters we need to determine. We choose them to minimize an error function — the sum of squared differences between the polynomial's predictions and the actual data:

E(w) = ½ ∑_n=1^N { y(x_n, w) − t_n }²

This is the sum-of-squares error. We pick the weights that make E as small as possible. Since E is a quadratic function of the weights, it has a unique minimum that can be found in closed form. (Later, we'll see that this error function arises naturally from a probabilistic perspective.)

Key insight: The order M of the polynomial is a choice we make before seeing how well the model fits. It is not learned from the data — it is a model complexity parameter. Choosing M wisely is one of the central challenges of machine learning.

Polynomial Curve Fitting

10 data points from sin(2πx) + noise. Adjust the polynomial order M to see how the fit changes. Low M underfits, high M overfits.

Order M:

Symbol	Meaning
x_n	The n-th input value
t_n	The n-th target (observed output)
w_j	The j-th polynomial coefficient
M	Polynomial order (model complexity)
E(w)	Sum-of-squares error function

Check: What does minimizing E(w) achieve?

It finds the polynomial weights that best fit the training data (in a least-squares sense) It determines the optimal polynomial order M It guarantees good predictions on new data

Chapter 2: The Overfitting Problem

With M = 0 (a constant), the fit is terrible — a flat line that ignores the data. With M = 1 (a line), it's better but still misses the sine-wave shape. With M = 3, it captures the curve beautifully. So far, higher M means better fit.

But try M = 9. With 10 data points and 10 parameters, the polynomial passes exactly through every point (E = 0). The training error is perfect. Yet the fitted curve oscillates wildly between the data points. It has memorized the noise instead of learning the signal.

Overfitting: A model that is too complex for the amount of available data will fit the training set perfectly but generalize terribly to new data. This is the fundamental tension in machine learning: you need enough model capacity to capture the true pattern, but not so much that you memorize noise.

How do we detect overfitting? Use a separate test set — data the model has never seen. Compute the root-mean-square error (RMS) on both training and test sets:

E_RMS = √(2E(w*)/N)

For M = 9, the training RMS is near zero but the test RMS is enormous. For M = 3, both are small. The test error reveals the truth.

Two cures for overfitting:

1. More data

With enough data, even complex models can't memorize — they must generalize. As N grows, the gap between M=3 and M=9 vanishes.

↓

2. Regularization

Add a penalty for large weights: Ẽ(w) = E(w) + (λ/2)||w||². This discourages the wild oscillations of overfitting.

The regularization parameter λ controls the trade-off. Large λ means heavy penalty (underfitting). λ = 0 means no penalty (back to raw least squares). The sweet spot is in between.

Key insight: The weights of the M=9 polynomial have enormous magnitudes (like 125,201 or −640,042). Regularization penalizes these big weights, forcing the model to find a smoother curve even when it has enough parameters to perfectly interpolate the data.

Looking ahead: Later we'll see that regularization arises naturally in the Bayesian framework as a prior distribution over the weights. The penalty term λ||w||² corresponds to a Gaussian prior centered at zero. This connection between regularization and Bayesian priors is one of the deepest themes of the book.

Check: A polynomial of order 9 fits 10 data points with zero training error. Why is this a problem?

The computation is too expensive It has memorized noise and will generalize poorly to new data Zero training error always means the model is correct

Chapter 3: Probability Theory

Imagine two boxes: a red box with 2 apples and 6 oranges, and a blue box with 3 apples and 1 orange. You pick a box at random (red 40% of the time, blue 60%) and then pick a fruit. What's the probability of getting an apple? If you got an orange, which box did it probably come from?

All of probability theory rests on two rules:

Sum rule: p(X) = ∑_Y p(X, Y)

Product rule: p(X, Y) = p(Y|X) p(X)

The sum rule says: to find the probability of X alone, sum (marginalize) the joint probability over all values of Y. The product rule says: the joint probability of X and Y equals the conditional probability of Y given X, times the probability of X.

That's it. These two rules, plus the requirement that probabilities are non-negative and sum to one, generate the entire calculus of probability. Every theorem, every formula, every Bayesian computation in this book flows from them.

For the fruit example: p(apple) = p(apple|red)p(red) + p(apple|blue)p(blue) = (1/4)(4/10) + (3/4)(6/10) = 11/20. The sum rule in action.

Concept	Notation	Meaning
Joint	p(X, Y)	Probability of X and Y together
Marginal	p(X)	Probability of X, summing over Y
Conditional	p(Y\|X)	Probability of Y given X is known
Prior	p(X)	Belief about X before seeing data
Posterior	p(X\|Y)	Updated belief after observing Y

For continuous variables, sums become integrals and probabilities become probability densities. A density p(x) is not a probability itself — it can exceed 1. The probability of x falling in an interval [a, b] is the integral ∫_a^b p(x) dx.

Two important summary statistics: the expectation (average value of a function under a distribution) and the variance (spread around the mean):

E[f] = ∑_x p(x) f(x) var[f] = E[(f − E[f])²] = E[f²] − E[f]²

Check: Using the product rule, how do you compute the joint probability p(X, Y)?

p(X) + p(Y) p(Y|X) p(X) p(X) / p(Y)

Chapter 4: Bayes' Theorem

Combining the two rules of probability gives us the most important equation in this book:

p(w|D) = p(D|w) p(w) / p(D)

In words: posterior = likelihood × prior / evidence. Let's unpack each piece.

Prior p(w)

What we believe about the parameters before seeing data. Encodes assumptions like "weights should be small."

↓ observe data D

Likelihood p(D|w)

How probable the observed data is, given specific parameter values. The "evidence for w."

↓ apply Bayes

Posterior p(w|D)

Updated belief about parameters after seeing data. Combines prior knowledge with evidence.

The denominator p(D) = ∫ p(D|w) p(w) dw is the model evidence (or marginal likelihood). It normalizes the posterior but also plays a crucial role in model selection — comparing different models to see which best explains the data.

The Bayesian worldview: Parameters are not fixed unknowns to be estimated. They are random variables with distributions. Before you see data, you have a prior. After you see data, you have a posterior. The posterior IS the answer — it tells you everything you know about the parameters. Point estimates like maximum likelihood are just summaries of the posterior (specifically, the mode).

Back to our fruit example: we picked an orange. Which box? Before seeing the fruit, p(red) = 0.4 (the prior). After seeing an orange:

p(red|orange) = p(orange|red) p(red) / p(orange) = (6/8)(4/10) / (9/20) = 2/3

The prior favored blue (0.6 vs 0.4). But the likelihood strongly favors red (oranges are 75% of the red box vs 25% of blue). The posterior updates to 2/3 for red. Data has overridden the prior.

Maximum likelihood vs. Bayesian: Maximum likelihood finds the single w that maximizes p(D|w). Bayesian inference maintains the full distribution p(w|D). ML is a point estimate; Bayes gives you the whole picture, including uncertainty. The price is computational — integrals over parameter space can be hard. Much of this book develops tools (Laplace approximation, variational methods, sampling) to make Bayesian inference tractable.

Check: In Bayes' theorem, what does the likelihood p(D|w) represent?

The probability of the parameters before seeing data The probability of the data averaged over all parameter values How probable the observed data is, given specific parameter values

Chapter 5: The Gaussian Distribution

The single most important probability distribution in this book. For a single variable x:

N(x|μ, σ²) = (1 / √(2πσ²)) exp(−(x − μ)² / 2σ²)

The bell curve. Parameterized by the mean μ (center) and variance σ² (spread). The precision β = 1/σ² is the inverse variance — higher precision means tighter concentration around the mean.

Gaussian Distribution

Adjust μ and σ to see how the bell curve changes shape. The shaded area always integrates to 1.

μ0.0

σ1.0

Given N data points {x₁, …, x_N} drawn i.i.d. from a Gaussian, the maximum likelihood estimates are:

μ_ML = (1/N) ∑ x_n σ²_ML = (1/N) ∑ (x_n − μ_ML)²

The ML mean is just the sample average — perfectly unbiased. But the ML variance is biased: it systematically underestimates the true variance by a factor of (N−1)/N. This bias comes from measuring spread relative to the sample mean (which itself was fitted to the data) instead of the true mean. The unbiased estimator uses N−1 instead of N.

Why bias matters: This seemingly minor issue — measuring variance from the sample mean — is the simplest example of a deep problem with maximum likelihood. When you use the data to fit parameters, then measure how well those parameters fit the same data, you systematically overestimate the quality. This is the root cause of overfitting. Bayesian methods automatically correct for this.

For D-dimensional vectors, the Gaussian generalizes to the multivariate Gaussian:

N(x|μ, Σ) = (1 / (2π)^D/2 |Σ|^1/2) exp(−½(x−μ)^TΣ⁻¹(x−μ))

where Σ is the D×D covariance matrix. Its shape defines the ellipsoidal contours of the distribution.

Check: Why is the maximum likelihood estimate of the Gaussian variance biased?

It measures spread relative to the fitted mean, not the true mean The sample size is always too small Gaussians cannot model real data accurately

Chapter 6: The Curse of Dimensionality

Consider a sphere of radius 1 in D dimensions. What fraction of its volume lies in a thin shell between radius r = 1 − ε and r = 1?

fraction = 1 − (1 − ε)^D

For D = 1, with ε = 0.01, the fraction is just 1%. For D = 100, it is 63%. For D = 1000, it is 99.99%. Almost all the volume of a high-dimensional sphere is concentrated in a thin shell near the surface.

The curse of dimensionality: In high dimensions, most of the space is "far from everything." Data points are sparse. Local neighborhoods are empty. The intuitions we build in 2D and 3D fail catastrophically. A grid-based approach that divides each axis into 10 bins needs 10^D cells — with D = 100, that's 10¹⁰⁰ cells, far more than the atoms in the universe.

This doesn't mean that machine learning is hopeless in high dimensions. Real data has two saving graces:

Low intrinsic dimensionality: Images of a face at different angles live on a low-dimensional manifold (parameterized by a few rotation angles) embedded in a high-dimensional pixel space.

Local smoothness: Small changes in input usually produce small changes in output. This lets us interpolate between nearby data points, making local methods work even when the ambient dimension is huge.

A Gaussian distribution in D dimensions concentrates most of its mass in a thin shell at a specific radius √D from the center. The mode (the peak at the center) is actually in a region of very low probability mass. This counterintuitive fact means that the most probable point is not where most of the probability is.

Key insight: The curse of dimensionality is not a law of nature — it's a property of naive methods that treat all dimensions equally. The entire field of machine learning can be seen as developing algorithms that exploit the structure of real data (manifolds, smoothness, sparsity) to beat the curse.

Check: In a 100-dimensional sphere, where is most of the volume?

Uniformly distributed throughout Concentrated in a thin shell near the surface At the center

Chapter 7: Decision Theory

Probability theory tells us what we know. Decision theory tells us what to do about it. Together, they form a complete framework for making optimal predictions under uncertainty.

Suppose we have a classification problem with classes C₁, …, C_K. We know the posterior probabilities p(C_k|x) for each class given a new input x. How should we assign x to a class?

The simplest rule: pick the class with the highest posterior probability. This minimizes the misclassification rate:

assign x to C_k where k = argmax_j p(C_j|x)

But sometimes misclassifications have different costs. Misdiagnosing cancer (calling it healthy) is far worse than a false alarm. A loss function L_kj specifies the cost of assigning to class C_j when the truth is C_k. The optimal decision minimizes expected loss:

assign x to the class that minimizes ∑_k L_kj p(C_k|x)

The reject option: In ambiguous regions where no class has high posterior probability, it may be better to say "I don't know" than to guess. This is the reject option — refuse to classify inputs where the uncertainty is too high. A doctor who says "we need more tests" is exercising the reject option.

Bishop distinguishes three increasingly powerful approaches to classification:

Approach	What you model	Flexibility
Generative	p(x\|C_k) and p(C_k), then use Bayes	Can generate synthetic data, detect outliers
Discriminative	p(C_k\|x) directly	Focused on the decision boundary
Discriminant	A function that maps x to class label	Simplest, no probabilities

For regression, the decision problem is: given the posterior p(t|x), what single value should we predict? Under squared loss, the optimal prediction is the conditional mean E[t|x]. Under absolute loss, it's the conditional median. The loss function determines the optimal summary of the posterior.

Check: Under squared loss, what is the optimal prediction for a regression problem?

The conditional mean E[t|x] The mode of p(t|x) The maximum likelihood estimate

Chapter 8: Information Theory

How much "information" does a random variable carry? If someone tells you "the sun rose this morning," that's low information — you already expected it. If they say "a meteor hit downtown," that's high information — it was extremely unlikely. Information is inversely related to probability.

Shannon formalized this intuition. The information content of an event with probability p is:

h(x) = −log₂ p(x)

Measured in bits. Certain events (p = 1) carry 0 bits. A fair coin flip (p = 0.5) carries 1 bit. A 1-in-a-million event carries about 20 bits.

The entropy is the expected information content — the average surprise:

H[x] = −∑_x p(x) log p(x)

Entropy measures uncertainty. A fair coin has maximum entropy (1 bit). A biased coin (say, 99% heads) has low entropy — you already know the outcome. A uniform distribution over K states has entropy log K, the maximum possible. Entropy is maximized when all outcomes are equally likely.

The KL divergence (relative entropy) measures how different two distributions are:

KL(p || q) = −∑_x p(x) log(q(x)/p(x)) ≥ 0

KL(p||q) = 0 if and only if p = q. It is always non-negative (Gibbs' inequality). But it is not symmetric: KL(p||q) ≠ KL(q||p) in general. This asymmetry will matter greatly when we study variational inference in Chapter 10.

Mutual information measures how much knowing X tells you about Y:

I[X, Y] = KL(p(X, Y) || p(X)p(Y)) = H[X] − H[X|Y] ≥ 0

If X and Y are independent, the joint equals the product of marginals, and the mutual information is zero. Any dependency between them makes it positive.

Connections ahead: Entropy appears everywhere in this book. Maximizing entropy gives the least-informative distribution consistent with constraints (MaxEnt). KL divergence is the objective for variational inference (Ch 10) and connects to the EM algorithm (Ch 9). Information theory and probability theory are two perspectives on the same mathematics.

Check: Which distribution over K outcomes has the maximum entropy?

The uniform distribution (all outcomes equally likely) The distribution with all mass on one outcome Any Gaussian distribution

Chapter 9: Summary

Chapter 1 establishes the conceptual foundation for everything that follows. Let's consolidate the key ideas.

Concept	Key idea	Where it leads
Polynomial fitting	Minimize sum-of-squares to find weights	Least squares, linear models (Ch 3)
Overfitting	Complex models memorize noise	Regularization, Bayesian methods
Probability rules	Sum rule + product rule = everything	Foundation for all inference
Bayes' theorem	posterior ∝ likelihood × prior	Bayesian ML throughout the book
Gaussian	The bell curve, ML bias	Ch 2 (distributions), Ch 3-5 (models)
Curse of dimensionality	High-D space is mostly empty	Motivates kernels, manifold methods
Decision theory	Optimal actions under uncertainty	Classification, regression losses
Information theory	Entropy, KL divergence, mutual info	EM (Ch 9), variational (Ch 10)

The Bayesian thread: The single most important theme of this book is the Bayesian perspective. Maximum likelihood gives point estimates that can overfit. The Bayesian approach maintains full distributions over parameters, automatically trading off model complexity against data fit. Chapters 3-14 develop this idea for increasingly powerful models.

What comes next: Chapter 2 develops probability distributions in depth — the building blocks we'll use for everything: binary variables and the beta distribution, multinomial variables and the Dirichlet, the Gaussian in full detail, the exponential family, and nonparametric methods.

"The problem of finding the best model is fundamentally a
problem of balancing model complexity against fit to the data."
— Christopher Bishop, PRML

Check: What is the central theme that unifies most of Bishop's PRML?

Neural network architectures The Bayesian approach to machine learning Support vector machines

Introduction to Pattern Recognition