From Fisher's discriminant to logistic regression: how to draw decision boundaries and quantify classification uncertainty.
Regression predicts a continuous number. Classification predicts a discrete label: is this email spam or not? Is this tumor malignant or benign? Which of 10 digits is written here?
The goal is to assign input vectors x to one of K discrete classes C1, …, CK. The input space is divided into decision regions separated by decision boundaries (also called decision surfaces).
For two classes, the decision is governed by a linear function wTx + w0. The decision boundary is the hyperplane where this function equals zero. Points on one side are class 1, points on the other are class 2. For K classes, we need K linear functions, and the decision regions form a partition of the input space.
A discriminant function directly assigns an input x to a class. The simplest form for two classes:
Classify as C1 if y(x) ≥ 0, else C2. The decision boundary is the set of points where y(x) = 0 — a hyperplane.
Key geometric facts about this hyperplane:
| Property | Value |
|---|---|
| Normal vector | w is perpendicular to the decision surface |
| Distance from origin | −w0 / ||w|| |
| Distance of point x to surface | y(x) / ||w|| |
For K > 2 classes, we use K discriminant functions yk(x) = wkTx + wk0 and assign x to the class with the highest yk. This creates convex, singly-connected decision regions — a nice property that prevents pathological boundary shapes.
Fisher's idea: project high-dimensional data onto a line, then classify on that line. But which direction should we project? We want a direction that maximizes class separation while minimizing within-class scatter.
Two classes of 2D data projected onto different directions. The Fisher direction maximizes the ratio of between-class to within-class variance.
Define the between-class scatter: SB = (m2 − m1)(m2 − m1)T and the within-class scatter: SW = ∑n ∈ C1 (xn − m1)(xn − m1)T + same for C2.
The Fisher criterion maximizes the ratio:
The solution is w ∝ SW−1(m2 − m1). If the within-class covariances are equal and isotropic, this reduces to the direction between the class means. Otherwise, SW−1 "unwarps" the space to account for the class shapes.
The perceptron is the simplest learning algorithm for linear classification. It processes training examples one at a time and updates the weight vector whenever it makes a mistake:
For the perceptron criterion, the error for a misclassified point is En = −wTφn tn, where tn ∈ {−1, +1}. The update rule becomes:
If the data is linearly separable, the perceptron convergence theorem guarantees the algorithm will find a separating hyperplane in finite steps.
Despite its simplicity, the perceptron is historically important: it's the ancestor of modern neural networks. The key ideas — weighted linear combination, activation function, learning by gradient descent — are all present, just in their simplest form.
The generative approach models the class-conditional densities p(x|Ck) and priors p(Ck), then uses Bayes' theorem to get the posterior:
where σ(a) = 1/(1 + e−a) is the logistic sigmoid. This result falls out when the class-conditionals are Gaussian with shared covariance.
If p(x|Ck) = N(x|μk, Σ) (Gaussians with same covariance), the log-posterior ratio is linear in x:
where w = Σ−1(μ1 − μ2). The decision boundary is a hyperplane — linear in x.
For K classes with shared covariance, the posterior for each class uses the softmax function:
where ak = wkTx + wk0. The softmax normalizes K linear functions into a valid probability distribution.
Instead of modeling class-conditional densities and using Bayes' theorem (generative), logistic regression directly models the posterior p(C1|x). This is a discriminative approach.
The likelihood for N data points with targets tn ∈ {0, 1} is:
where yn = σ(wTφn). The negative log-likelihood is the cross-entropy error:
The gradient has an elegant form: ∇E = ∑n (yn − tn)φn. The gradient for each data point is the error (yn − tn) times the basis function — the same form as in linear regression. But here the problem is no longer convex in closed form (because yn depends nonlinearly on w), so we must optimize iteratively.
The cross-entropy error has no closed-form minimum. But it is convex, so any local minimum is the global minimum. We can use Newton's method, which uses second-order (Hessian) information for faster convergence.
The Hessian of the cross-entropy is:
where R is a diagonal matrix with Rnn = yn(1 − yn). Newton's update is:
This can be rewritten as:
where z = Φw(old) − R−1(y − t) are "adjusted targets." This has the form of a weighted least-squares problem with weights R and targets z — hence the name iteratively reweighted least squares (IRLS).
This is one of the deepest distinctions in machine learning. Both approaches end up with p(Ck|x), but they get there differently:
| Generative | Discriminative | |
|---|---|---|
| Models | p(x|Ck) and p(Ck) | p(Ck|x) directly |
| Parameters | More (class means, covariance, priors) | Fewer (just weight vector) |
| Data efficiency | Better with small data (if model is correct) | Better with large data |
| Asymptotic accuracy | Can be worse (if model is wrong) | Often better |
| Missing data | Can handle naturally | Requires special treatment |
| Outlier detection | Can detect (p(x) is available) | Cannot (only has p(C|x)) |
In practice, the choice depends on the problem. If you need to generate synthetic data, detect outliers, or handle missing values, go generative. If you just need the best possible classification accuracy with ample data, go discriminative.
We want to do Bayesian logistic regression: put a prior on w and compute the posterior p(w|t). But unlike linear regression (Ch 3), the posterior is not Gaussian because the sigmoid likelihood breaks conjugacy.
The Laplace approximation approximates any unimodal distribution with a Gaussian centered at its mode:
where wMAP is the mode of the posterior (found by IRLS with a prior term) and H = −∇∇ ln p(w|t)|wMAP is the Hessian of the negative log-posterior.
With the Laplace approximation, we can compute approximate predictive distributions by marginalizing over the Gaussian approximation to the posterior. For logistic regression, this involves a one-dimensional integral that can be approximated using the probit approximation: σ(a) ≈ Φ(λ a) where Φ is the cumulative normal and λ2 = π/8.
The Laplace approximation also lets us compute an approximate model evidence for hyperparameter selection, just as in linear regression.
Chapter 4 showed that classification is fundamentally different from regression: the target is discrete, the likelihood is non-Gaussian, and exact Bayesian inference is intractable. This forced us to develop new tools:
| Method | Approach | Probabilistic? |
|---|---|---|
| Least squares | Minimize squared error with 0/1 targets | Poorly calibrated |
| Fisher's LDA | Maximize class separation ratio | Not directly |
| Perceptron | Iterative error-driven updates | No |
| Generative (LDA/QDA) | Model class-conditionals + Bayes | Yes, fully |
| Logistic regression | Discriminative, cross-entropy | Yes, calibrated |
| Bayesian logistic | Laplace approximation to posterior | Yes, with uncertainty |
What comes next: Chapter 5 moves beyond linear models to neural networks — compositions of many linear-then-nonlinear layers. This dramatically increases modeling power but also raises new challenges for training (backpropagation) and Bayesian treatment.