Find the one direction that maximally separates two classes — without assuming anything about their distributions.
You have high-dimensional data — say, N = 100 features per sample. Plotting it is impossible. Training a classifier directly is expensive and prone to overfitting. What if you could find ONE direction to project onto such that the two classes are maximally separated in that 1D projection?
That's the goal: find a vector w ∈ ℝN such that when you project all data onto w (computing z = wTx for each sample x), the resulting 1D values for class 1 and class 2 are as separated as possible.
A naive approach: project onto the line connecting the two class means, w = μ1 − μ2. This maximizes the distance between the projected means. But it can fail badly.
Imagine two long, thin clouds of points tilted at 45 degrees. The line between means might be perpendicular to the clouds' elongation — projecting onto it merges the two classes into overlapping distributions. A better direction would project along the elongation, where the classes don't overlap.
Given:
• Class 1 data: {x1, ..., xn1} with sample mean m1
• Class 2 data: {x1, ..., xn2} with sample mean m2
• Total: n = n1 + n2 samples in ℝN
We seek w ∈ ℝN that maximizes class separability after projection z = wTx. Fisher found the answer in 1936 — and it's elegant.
Two classes (teal/orange ellipses). Drag the projection angle to see how the 1D histograms overlap. Some directions separate well; others collapse both classes together.
After projecting onto w, each sample xi becomes a scalar zi = wTxi. The projected class means are:
The distance between projected means is |m̃1 − m̃2|2 = (wT(m1 − m2))2.
The within-class variance after projection is:
Fisher's criterion is the ratio:
The numerator: (m̃1 − m̃2)2 = (wT(m1 − m2))2 = wT(m1−m2)(m1−m2)Tw
Define the between-class scatter matrix:
Numerator = wTSBw.
The denominator: s̃12 + s̃22 = wT(∑(x−m1)(x−m1)T + ∑(x−m2)(x−m2)T)w
Define the within-class scatter matrix:
Denominator = wTSWw. Therefore:
This is a Rayleigh quotient (or generalized Rayleigh quotient) — a ratio of quadratic forms. Maximizing it is a classic eigenvalue problem.
Class 1: points at (1,2), (2,3), (3,3). Mean m1 = [2, 8/3]T ≈ [2, 2.67]T.
Class 2: points at (5,1), (6,2), (7,1). Mean m2 = [6, 4/3]T ≈ [6, 1.33]T.
m1 − m2 = [−4, 1.33]T
SB = [−4, 1.33]·[−4, 1.33]T = [[16, −5.33], [−5.33, 1.78]]
If we try w = [1, 0] (project onto x-axis): J = (2−6)2 / (within_x) = large/moderate. If we try w = [0, 1] (project onto y-axis): J = (2.67−1.33)2 / (within_y) = smaller. The x-direction separates better for this data.
Two classes in 2D. The polar plot shows J(w) for each projection direction. The optimal w maximizes J. Compare to naive "direction of means."
Let's build intuition for SB and SW — the two matrices that define Fisher's criterion.
This is a rank-1 matrix (outer product of a vector with itself). It captures how far apart the class centroids are and in what direction. Its only nonzero eigenvector is (m1 − m2) itself.
Properties:
• Rank 1 (for two classes). For K classes: rank ≤ K − 1.
• SBw = (m1−m2)(m1−m2)Tw = (m1−m2) · scalar. So SBw always points in direction (m1−m2).
where Σk = ∑x ∈ Ck(x−mk)(x−mk)T is the scatter matrix for class k. (This is (nk−1) times the sample covariance of class k.)
SW is full-rank (generically), symmetric, and positive definite. It measures the total "spread" of data around their respective class means.
Class 1 (3 points): [1,4], [2,6], [3,5]. Mean m1 = [2, 5].
Class 2 (3 points): [5,1], [6,3], [7,2]. Mean m2 = [6, 2].
SB = (m1−m2)(m1−m2)T = [−4, 3][−4, 3]T = [[16, −12], [−12, 9]]
Σ1 = ([1,4]−[2,5])([1,4]−[2,5])T + ... = [[2, 1], [1, 2]]
Σ2 = [[2, 1], [1, 2]]
SW = [[4, 2], [2, 4]]
Now we can compute J(w) for any direction. For w = [1, 0]:
• wTSBw = 16, wTSWw = 4, J = 4.0
For w = [0, 1]:
• wTSBw = 9, wTSWw = 4, J = 2.25
For w = [−4, 3]/5 (direction of means):
• wTSBw = (16·16 + 2·(−12)·(−4)·3 + 9·9)/25 = 25, wTSWw = (16·4 − 2·2·12 + 9·4)/25 = 52/25, J = 25/(52/25) = 625/52 ≈ 12.0
The direction of means isn't always optimal. The true optimum requires solving the eigenvalue problem.
The ellipses show the shape of SW (within-class spread). The arrow shows m1−m2 (between-class direction). Optimal Fisher w balances avoiding the ellipse's long axis while staying near the mean-difference direction.
We need to maximize J(w) = wTSBw / wTSWw. This is a generalized Rayleigh quotient. The standard approach: take the derivative, set to zero.
Differentiate J(w) with respect to w using the quotient rule:
Setting the numerator to zero:
Let λ = wTSBw / wTSWw = J(w). Then:
This is a generalized eigenvalue problem: SBw = λ SWw. The w that maximizes J is the generalized eigenvector corresponding to the largest eigenvalue λ.
But we can do better than solving an eigenvalue problem. Recall that SBw = (m1−m2)(m1−m2)Tw = (m1−m2) · c where c = (m1−m2)Tw is a scalar.
Substituting into SBw = λSWw:
Since we only care about the direction of w (not its magnitude), the scalar λ/c doesn't matter. The solution is:
From Chapter 2: SW = [[4, 2], [2, 4]], m1−m2 = [−4, 3].
SW−1 = (1/12)[[4, −2], [−2, 4]] (using det = 16 − 4 = 12)
w* = (1/12)[[4,−2],[−2,4]] · [−4, 3]T = (1/12)[−16−6, 8+12]T = (1/12)[−22, 20]T ∝ [−11, 10]
Normalizing: w* ≈ [−0.74, 0.67]. This is tilted compared to the naive m1−m2 = [−4, 3] ∝ [−0.80, 0.60], because SW−1 rotates the direction to account for within-class correlations.
Drag the angle to project 2D data onto different directions. The Fisher optimal direction (green arrow) maximizes J(w). Compare to naive mean-difference direction (gray dashed). Watch the 1D histograms overlap or separate.
We've derived Fisher's discriminant purely from a separation criterion — no probability distributions, no Gaussians, no Bayes' rule. Yet the result looks suspiciously like LDA from Lecture 15.
| Fisher's Discriminant | LDA (Gaussian Bayes) | |
|---|---|---|
| Assumption | None on distribution | Gaussian with shared Σ |
| Criterion | Max J(w) = between/within | Min Bayes risk R(f) |
| Solution | w = SW−1(m1−m2) | w = Σ−1(μ1−μ2) |
| Equivalence | If SW/(n−2) = Σ̂ (pooled sample covariance) and mk = μ̂k, they give the SAME direction | |
1. Robustness: Fisher's direction is sensible even for non-Gaussian data (bimodal, skewed, heavy-tailed). LDA's Bayes optimality only holds for Gaussians.
2. Same computation: In practice, you compute the same thing either way. But the interpretation differs — Fisher is a geometric projection; LDA is a probabilistic classifier.
3. Threshold differs: Fisher gives you the direction but doesn't specify where to threshold. LDA specifies the threshold via Bayes' rule. In practice, you often set the threshold by cross-validation regardless.
Fisher and LDA agree on the direction but can give different classifiers when:
• Unequal priors: LDA shifts the threshold; Fisher's criterion ignores priors.
• Non-Gaussian data: LDA's decision rule assumes Gaussian posteriors; Fisher's threshold should be set empirically.
• Unequal class sizes: Fisher uses pooled scatter; LDA uses the same but weighted differently depending on formulation.
Toggle between Gaussian data (where both are optimal) and non-Gaussian data (where Fisher still makes sense but LDA's probability model is wrong).
The Fisher solution w = SW−1(m1−m2) has a beautiful geometric interpretation: it simultaneously diagonalizes SB and SW.
For a single matrix A, we can diagonalize: find P such that PTAP = D (diagonal). For two matrices SB and SW, we want one transformation W such that:
This means: in the transformed space, the within-class scatter is isotropic (a sphere), and the between-class scatter is aligned with coordinate axes.
Step 1: Whiten with SW. Find SW = UΛWUT (eigendecomposition). Define A = ΛW−1/2UT. Then A SW AT = I. In the whitened space, within-class scatter is the identity.
Step 2: Diagonalize the whitened SB. Compute A SB AT and find its eigenvectors V. The columns of V are the Fisher directions in the whitened space.
Combined: W = ATV maps original data to the Fisher discriminant space.
The generalized eigenvalue problem SBw = λSWw can be rewritten:
This is a standard eigenvalue problem for the matrix SW−1SB. The eigenvector with largest eigenvalue is w*. For two classes, SB is rank-1, so there's only one nonzero eigenvalue and eigenvector: w* = SW−1(m1−m2).
Left: original data with correlated classes. Right: after whitening by SW−1/2, within-class scatter becomes spherical. In the whitened space, the optimal direction is simply the mean-difference direction.
With K > 2 classes, we can no longer project onto a single direction. We need up to K − 1 directions to separate K classes (since SB has rank at most K − 1).
Define the overall mean: m = (1/n)∑ixi.
Between-class scatter:
This is a sum of K rank-1 matrices, but since the means live in a (K−1)-dimensional affine subspace (they sum to n·m), SB has rank at most K − 1.
Within-class scatter:
We seek a projection matrix W ∈ ℝN × d (d ≤ K−1) that maximizes:
(where |·| is the determinant). The solution: the columns of W are the top d eigenvectors of SW−1SB.
With K = 3 classes in N = 3 dimensions, we get d = 2 Fisher directions. The data is projected from 3D to 2D — a plane that maximally separates the three classes.
If the original features are 100-dimensional (e.g., 100-point spectra), Fisher reduces to just 2 dimensions (for 3 classes) or K−1 dimensions in general. This is extreme dimensionality reduction guided by class structure.
| PCA | Fisher/LDA | |
|---|---|---|
| Criterion | Max total variance | Max between/within class ratio |
| Uses labels? | No (unsupervised) | Yes (supervised) |
| Max dimensions | rank(X) | K − 1 |
| Good for | Reconstruction, visualization | Classification |
| Failure mode | High-variance direction may not separate classes | May discard high-variance directions that don't help classification |
Three classes projected onto the two Fisher discriminant directions. Compare to PCA projection which ignores class labels.
| Result | Formula | Significance |
|---|---|---|
| Fisher criterion | J(w) = wTSBw / wTSWw | Generalized Rayleigh quotient; max = largest gen. eigenvalue |
| Optimal w (2-class) | w* = SW−1(m1−m2) | Closed form! No iterative optimization |
| Equivalence | Same direction as Gaussian LDA | Distribution-free justification for LDA |
| Multi-class | Top K−1 eigenvectors of SW−1SB | Supervised dimensionality reduction |
| Geometry | Whiten by SW, then project onto means | Simultaneous diagonalization of SB, SW |
• Lecture 14: Bayes classifiers — the probability-based framework Fisher avoids
• Lecture 15: LDA/QDA — the Gaussian classifiers that give the same direction as Fisher
• Lecture 9: Distance-based classification — Fisher reduces dimensions before computing distances
• At most K−1 dimensions. With 2 classes, you get one direction. If the optimal boundary is nonlinear, Fisher's linear projection loses information.
• SW must be invertible. Requires n > N (more samples than features). Otherwise, regularize: SW + εI.
• Only captures first/second order stats. If classes differ in higher moments (skewness, kurtosis) but have the same mean and variance structure, Fisher can't help.