Bishop PRML, Chapter 4

Linear Models for Classification

From Fisher's discriminant to logistic regression: how to draw decision boundaries and quantify classification uncertainty.

Prerequisites: Chapters 1–3 (probability, Gaussians, linear regression).
10
Chapters
2
Simulations
10
Quizzes

Chapter 0: Why Classification?

Regression predicts a continuous number. Classification predicts a discrete label: is this email spam or not? Is this tumor malignant or benign? Which of 10 digits is written here?

The goal is to assign input vectors x to one of K discrete classes C1, …, CK. The input space is divided into decision regions separated by decision boundaries (also called decision surfaces).

Three approaches to classification: (1) Discriminant functions — directly map x to a class label. (2) Discriminative models — model the posterior p(Ck|x) directly. (3) Generative models — model the class-conditional densities p(x|Ck) and priors p(Ck), then use Bayes' theorem. Each has tradeoffs in data efficiency, flexibility, and interpretability.

For two classes, the decision is governed by a linear function wTx + w0. The decision boundary is the hyperplane where this function equals zero. Points on one side are class 1, points on the other are class 2. For K classes, we need K linear functions, and the decision regions form a partition of the input space.

Check: What are the three approaches to classification Bishop describes?

Chapter 1: Discriminant Functions

A discriminant function directly assigns an input x to a class. The simplest form for two classes:

y(x) = wTx + w0

Classify as C1 if y(x) ≥ 0, else C2. The decision boundary is the set of points where y(x) = 0 — a hyperplane.

Key geometric facts about this hyperplane:

PropertyValue
Normal vectorw is perpendicular to the decision surface
Distance from origin−w0 / ||w||
Distance of point x to surfacey(x) / ||w||

For K > 2 classes, we use K discriminant functions yk(x) = wkTx + wk0 and assign x to the class with the highest yk. This creates convex, singly-connected decision regions — a nice property that prevents pathological boundary shapes.

Why not least squares for classification? We could minimize the sum-of-squares error with one-hot class targets. But this is sensitive to outliers: a data point very far on the correct side of the boundary still generates a large residual, pulling the boundary toward it. Least squares lacks robustness for classification.
Check: The weight vector w in a linear discriminant is:

Chapter 2: Fisher's Linear Discriminant

Fisher's idea: project high-dimensional data onto a line, then classify on that line. But which direction should we project? We want a direction that maximizes class separation while minimizing within-class scatter.

Fisher's Linear Discriminant

Two classes of 2D data projected onto different directions. The Fisher direction maximizes the ratio of between-class to within-class variance.

Angle45°

Define the between-class scatter: SB = (m2m1)(m2m1)T and the within-class scatter: SW = ∑n ∈ C1 (xnm1)(xnm1)T + same for C2.

The Fisher criterion maximizes the ratio:

J(w) = (wT SB w) / (wT SW w)

The solution is w ∝ SW−1(m2m1). If the within-class covariances are equal and isotropic, this reduces to the direction between the class means. Otherwise, SW−1 "unwarps" the space to account for the class shapes.

Fisher's insight: Simply projecting onto the line connecting the class means can fail badly if the classes have elongated shapes at an angle. The within-class scatter matrix rotates the projection direction to account for the shape of each class's distribution. It's the simplest example of a general principle: good classification requires understanding not just the class means, but also their covariance structure.
Check: What does Fisher's discriminant maximize?

Chapter 3: The Perceptron

The perceptron is the simplest learning algorithm for linear classification. It processes training examples one at a time and updates the weight vector whenever it makes a mistake:

w(new) = w(old) − η ∇ En

For the perceptron criterion, the error for a misclassified point is En = −wTφn tn, where tn ∈ {−1, +1}. The update rule becomes:

w(new) = w(old) + η φn tn

If the data is linearly separable, the perceptron convergence theorem guarantees the algorithm will find a separating hyperplane in finite steps.

Limitations: (1) There are many valid separating hyperplanes — the perceptron finds one but it depends on initialization and data order. (2) If the data isn't linearly separable, the perceptron never converges — it oscillates forever. (3) It gives no probabilistic output, no measure of confidence. These limitations motivate logistic regression (Ch 5) and SVMs (Ch 7).

Despite its simplicity, the perceptron is historically important: it's the ancestor of modern neural networks. The key ideas — weighted linear combination, activation function, learning by gradient descent — are all present, just in their simplest form.

Check: When does the perceptron algorithm fail to converge?

Chapter 4: Generative Models

The generative approach models the class-conditional densities p(x|Ck) and priors p(Ck), then uses Bayes' theorem to get the posterior:

p(C1|x) = σ(wTx + w0)

where σ(a) = 1/(1 + e−a) is the logistic sigmoid. This result falls out when the class-conditionals are Gaussian with shared covariance.

If p(x|Ck) = N(x|μk, Σ) (Gaussians with same covariance), the log-posterior ratio is linear in x:

a = ln(p(x|C1)p(C1) / p(x|C2)p(C2)) = wTx + w0

where w = Σ−1(μ1μ2). The decision boundary is a hyperplane — linear in x.

Shared vs. separate covariances: If the classes share a covariance matrix, the decision boundary is linear (the quadratic terms cancel). If each class has its own covariance, we get quadratic discriminant analysis (QDA) with curved boundaries. The shared-covariance assumption is a strong constraint but reduces the number of parameters dramatically.

For K classes with shared covariance, the posterior for each class uses the softmax function:

p(Ck|x) = exp(ak) / ∑j exp(aj)

where ak = wkTx + wk0. The softmax normalizes K linear functions into a valid probability distribution.

Check: Under what assumption on the class-conditionals is the decision boundary linear?

Chapter 5: Logistic Regression

Instead of modeling class-conditional densities and using Bayes' theorem (generative), logistic regression directly models the posterior p(C1|x). This is a discriminative approach.

p(C1|φ) = σ(wTφ) = 1 / (1 + exp(−wTφ))

The likelihood for N data points with targets tn ∈ {0, 1} is:

p(t|w) = ∏n yntn (1 − yn)1−tn

where yn = σ(wTφn). The negative log-likelihood is the cross-entropy error:

E(w) = −∑n [tn ln yn + (1−tn) ln(1−yn)]
Why cross-entropy, not least squares? The sigmoid squashes outputs into [0,1]. If we used sum-of-squares, the gradient would be tiny for confident (but wrong) predictions — the sigmoid saturates. Cross-entropy avoids this: wrong confident predictions produce huge gradients, driving fast learning. This is why modern neural networks use cross-entropy for classification, not MSE.

The gradient has an elegant form: ∇E = ∑n (yn − tn)φn. The gradient for each data point is the error (yn − tn) times the basis function — the same form as in linear regression. But here the problem is no longer convex in closed form (because yn depends nonlinearly on w), so we must optimize iteratively.

Check: Why is cross-entropy preferred over sum-of-squares for classification?

Chapter 6: IRLS Optimization

The cross-entropy error has no closed-form minimum. But it is convex, so any local minimum is the global minimum. We can use Newton's method, which uses second-order (Hessian) information for faster convergence.

The Hessian of the cross-entropy is:

H = ∇∇E = ΦTRΦ

where R is a diagonal matrix with Rnn = yn(1 − yn). Newton's update is:

w(new) = w(old) − (ΦTRΦ)−1 ΦT(yt)

This can be rewritten as:

w(new) = (ΦTRΦ)−1 ΦTRz

where z = Φw(old)R−1(yt) are "adjusted targets." This has the form of a weighted least-squares problem with weights R and targets z — hence the name iteratively reweighted least squares (IRLS).

Connection to weighted regression: Each IRLS step solves a weighted regression problem. The weights Rnn = yn(1−yn) are the variance of the Bernoulli at the current prediction. Points near the decision boundary (y ≈ 0.5) have high variance and high weight — they matter most. Points far from the boundary (y ≈ 0 or 1) have low variance and are effectively downweighted. IRLS focuses on the points that are hardest to classify.
Check: In IRLS, which data points receive the highest weight?

Chapter 7: Generative vs Discriminative

This is one of the deepest distinctions in machine learning. Both approaches end up with p(Ck|x), but they get there differently:

GenerativeDiscriminative
Modelsp(x|Ck) and p(Ck)p(Ck|x) directly
ParametersMore (class means, covariance, priors)Fewer (just weight vector)
Data efficiencyBetter with small data (if model is correct)Better with large data
Asymptotic accuracyCan be worse (if model is wrong)Often better
Missing dataCan handle naturallyRequires special treatment
Outlier detectionCan detect (p(x) is available)Cannot (only has p(C|x))
The crossover: Ng and Jordan (2001) showed empirically that generative models converge faster (needing fewer examples) but to a possibly worse asymptote. Discriminative models start worse but eventually overtake. With very little data, the generative approach can be better even if its modeling assumptions are slightly wrong. With lots of data, discriminative wins because it focuses on what matters: the decision boundary.

In practice, the choice depends on the problem. If you need to generate synthetic data, detect outliers, or handle missing values, go generative. If you just need the best possible classification accuracy with ample data, go discriminative.

Check: Which approach typically needs fewer training examples to work well?

Chapter 8: The Laplace Approximation

We want to do Bayesian logistic regression: put a prior on w and compute the posterior p(w|t). But unlike linear regression (Ch 3), the posterior is not Gaussian because the sigmoid likelihood breaks conjugacy.

The Laplace approximation approximates any unimodal distribution with a Gaussian centered at its mode:

q(w) = N(w|wMAP, H−1)

where wMAP is the mode of the posterior (found by IRLS with a prior term) and H = −∇∇ ln p(w|t)|wMAP is the Hessian of the negative log-posterior.

The Laplace strategy: (1) Find the mode of the true posterior (the MAP estimate). (2) Fit a Gaussian at that mode by matching the curvature (Hessian). This works well when the posterior is roughly symmetric and unimodal. It's the first approximation technique in the book; we'll see more sophisticated ones in Chapters 10 (variational) and 11 (sampling).

With the Laplace approximation, we can compute approximate predictive distributions by marginalizing over the Gaussian approximation to the posterior. For logistic regression, this involves a one-dimensional integral that can be approximated using the probit approximation: σ(a) ≈ Φ(λ a) where Φ is the cumulative normal and λ2 = π/8.

The Laplace approximation also lets us compute an approximate model evidence for hyperparameter selection, just as in linear regression.

Check: What does the Laplace approximation do?

Chapter 9: Summary

Chapter 4 showed that classification is fundamentally different from regression: the target is discrete, the likelihood is non-Gaussian, and exact Bayesian inference is intractable. This forced us to develop new tools:

MethodApproachProbabilistic?
Least squaresMinimize squared error with 0/1 targetsPoorly calibrated
Fisher's LDAMaximize class separation ratioNot directly
PerceptronIterative error-driven updatesNo
Generative (LDA/QDA)Model class-conditionals + BayesYes, fully
Logistic regressionDiscriminative, cross-entropyYes, calibrated
Bayesian logisticLaplace approximation to posteriorYes, with uncertainty
The central lesson: For classification, the sigmoid (or softmax) function is the natural link between linear functions and probabilities. Cross-entropy is the natural loss. And when we want Bayesian treatment, we must approximate because the posterior is no longer Gaussian. The Laplace approximation is our first tool for this — variational inference (Ch 10) and MCMC (Ch 11) will give us more powerful alternatives.

What comes next: Chapter 5 moves beyond linear models to neural networks — compositions of many linear-then-nonlinear layers. This dramatically increases modeling power but also raises new challenges for training (backpropagation) and Bayesian treatment.

"For classification, the posterior probabilities are given by
a logistic sigmoid acting on a linear function of the input."
— Christopher Bishop, PRML §4.2
Check: Why can't we do exact Bayesian inference for logistic regression?