EE269 Lecture 16 — Mert Pilanci, Stanford

Fisher's Discriminant

Find the one direction that maximally separates two classes — without assuming anything about their distributions.

Prerequisites: EE269 Lecture 15 (LDA) + Linear algebra (eigenvalues). That's it.
8
Chapters
6+
Simulations
0
Assumed Knowledge

Chapter 0: The Projection Problem

You have high-dimensional data — say, N = 100 features per sample. Plotting it is impossible. Training a classifier directly is expensive and prone to overfitting. What if you could find ONE direction to project onto such that the two classes are maximally separated in that 1D projection?

That's the goal: find a vector w ∈ ℝN such that when you project all data onto w (computing z = wTx for each sample x), the resulting 1D values for class 1 and class 2 are as separated as possible.

Why Not Just Use the Direction Between Means?

A naive approach: project onto the line connecting the two class means, w = μ1 − μ2. This maximizes the distance between the projected means. But it can fail badly.

Imagine two long, thin clouds of points tilted at 45 degrees. The line between means might be perpendicular to the clouds' elongation — projecting onto it merges the two classes into overlapping distributions. A better direction would project along the elongation, where the classes don't overlap.

The key insight: Good separation requires BOTH large distance between class means AND small spread within each class, measured in the projection direction. Fisher's criterion captures both: maximize (between-class variance) / (within-class variance) in the projected space.

Formal Setup

Given:

• Class 1 data: {x1, ..., xn1} with sample mean m1

• Class 2 data: {x1, ..., xn2} with sample mean m2

• Total: n = n1 + n2 samples in ℝN

We seek w ∈ ℝN that maximizes class separability after projection z = wTx. Fisher found the answer in 1936 — and it's elegant.

Projection Direction Matters

Two classes (teal/orange ellipses). Drag the projection angle to see how the 1D histograms overlap. Some directions separate well; others collapse both classes together.

Projection angle θ 45°
Why can projecting onto the direction between the two class means fail?

Chapter 1: Fisher Criterion J(w)

After projecting onto w, each sample xi becomes a scalar zi = wTxi. The projected class means are:

1 = wTm1,   m̃2 = wTm2

The distance between projected means is |m̃1 − m̃2|2 = (wT(m1 − m2))2.

The within-class variance after projection is:

12 = ∑x ∈ C1 (wTx − m̃1)2,   s̃22 = ∑x ∈ C2 (wTx − m̃2)2

Fisher's criterion is the ratio:

J(w) = (m̃1 − m̃2)2 / (s̃12 + s̃22)
Fisher's criterion J(w): The squared distance between projected means divided by the total within-class scatter in the projection. Maximize J(w) = find the direction where classes are far apart AND tight. It's a signal-to-noise ratio for class separability.

Expressing J(w) in Matrix Form

The numerator: (m̃1 − m̃2)2 = (wT(m1 − m2))2 = wT(m1−m2)(m1−m2)Tw

Define the between-class scatter matrix:

SB = (m1 − m2)(m1 − m2)T

Numerator = wTSBw.

The denominator: s̃12 + s̃22 = wT(∑(x−m1)(x−m1)T + ∑(x−m2)(x−m2)T)w

Define the within-class scatter matrix:

SW = ∑x ∈ C1(x−m1)(x−m1)T + ∑x ∈ C2(x−m2)(x−m2)T

Denominator = wTSWw. Therefore:

J(w) = wTSBw / wTSWw

This is a Rayleigh quotient (or generalized Rayleigh quotient) — a ratio of quadratic forms. Maximizing it is a classic eigenvalue problem.

Worked Example (2D)

Class 1: points at (1,2), (2,3), (3,3). Mean m1 = [2, 8/3]T ≈ [2, 2.67]T.

Class 2: points at (5,1), (6,2), (7,1). Mean m2 = [6, 4/3]T ≈ [6, 1.33]T.

m1 − m2 = [−4, 1.33]T

SB = [−4, 1.33]·[−4, 1.33]T = [[16, −5.33], [−5.33, 1.78]]

If we try w = [1, 0] (project onto x-axis): J = (2−6)2 / (within_x) = large/moderate. If we try w = [0, 1] (project onto y-axis): J = (2.67−1.33)2 / (within_y) = smaller. The x-direction separates better for this data.

Fisher Criterion J(w) vs Angle

Two classes in 2D. The polar plot shows J(w) for each projection direction. The optimal w maximizes J. Compare to naive "direction of means."

Fisher's criterion J(w) = wTSBw / wTSWw is maximized when:

Chapter 2: Between/Within Scatter

Let's build intuition for SB and SW — the two matrices that define Fisher's criterion.

Between-Class Scatter SB

SB = (m1 − m2)(m1 − m2)T

This is a rank-1 matrix (outer product of a vector with itself). It captures how far apart the class centroids are and in what direction. Its only nonzero eigenvector is (m1 − m2) itself.

Properties:

• Rank 1 (for two classes). For K classes: rank ≤ K − 1.

• SBw = (m1−m2)(m1−m2)Tw = (m1−m2) · scalar. So SBw always points in direction (m1−m2).

Within-Class Scatter SW

SW = Σ1 + Σ2

where Σk = ∑x ∈ Ck(x−mk)(x−mk)T is the scatter matrix for class k. (This is (nk−1) times the sample covariance of class k.)

SW is full-rank (generically), symmetric, and positive definite. It measures the total "spread" of data around their respective class means.

Geometric interpretation:
SB points toward the between-class direction. Making wTSBw large means w is aligned with (m1−m2).
SW captures within-class spread. Making wTSWw small means w avoids directions where classes are internally dispersed.
J(w) balances both: align with the inter-mean direction while avoiding the directions of high variance.

Worked Example (2D Numerical)

Class 1 (3 points): [1,4], [2,6], [3,5]. Mean m1 = [2, 5].

Class 2 (3 points): [5,1], [6,3], [7,2]. Mean m2 = [6, 2].

SB = (m1−m2)(m1−m2)T = [−4, 3][−4, 3]T = [[16, −12], [−12, 9]]

Σ1 = ([1,4]−[2,5])([1,4]−[2,5])T + ... = [[2, 1], [1, 2]]

Σ2 = [[2, 1], [1, 2]]

SW = [[4, 2], [2, 4]]

Now we can compute J(w) for any direction. For w = [1, 0]:

• wTSBw = 16, wTSWw = 4, J = 4.0

For w = [0, 1]:

• wTSBw = 9, wTSWw = 4, J = 2.25

For w = [−4, 3]/5 (direction of means):

• wTSBw = (16·16 + 2·(−12)·(−4)·3 + 9·9)/25 = 25, wTSWw = (16·4 − 2·2·12 + 9·4)/25 = 52/25, J = 25/(52/25) = 625/52 ≈ 12.0

The direction of means isn't always optimal. The true optimum requires solving the eigenvalue problem.

Scatter Matrices Visualized

The ellipses show the shape of SW (within-class spread). The arrow shows m1−m2 (between-class direction). Optimal Fisher w balances avoiding the ellipse's long axis while staying near the mean-difference direction.

The between-class scatter SB = (m1−m2)(m1−m2)T has rank:

Chapter 3: Optimal w Derivation

We need to maximize J(w) = wTSBw / wTSWw. This is a generalized Rayleigh quotient. The standard approach: take the derivative, set to zero.

The Derivation

Differentiate J(w) with respect to w using the quotient rule:

∂J/∂w = (2SBw · wTSWw − wTSBw · 2SWw) / (wTSWw)2 = 0

Setting the numerator to zero:

SBw · (wTSWw) = SWw · (wTSBw)

Let λ = wTSBw / wTSWw = J(w). Then:

SBw = λ SWw

This is a generalized eigenvalue problem: SBw = λ SWw. The w that maximizes J is the generalized eigenvector corresponding to the largest eigenvalue λ.

The Closed-Form Solution

But we can do better than solving an eigenvalue problem. Recall that SBw = (m1−m2)(m1−m2)Tw = (m1−m2) · c where c = (m1−m2)Tw is a scalar.

Substituting into SBw = λSWw:

(m1−m2) · c = λ SWw
w = (λ/c) · SW−1(m1−m2)

Since we only care about the direction of w (not its magnitude), the scalar λ/c doesn't matter. The solution is:

w* = SW−1(m1 − m2)
Fisher's optimal projection: w* = SW−1(m1 − m2). The within-class scatter inverse "de-correlates" and normalizes the features, then we project onto the mean difference. It's like LDA's w = Σ−11−μ2) but uses the SAMPLE scatter matrices. And crucially — Fisher derived this WITHOUT assuming Gaussian distributions.

Worked Example

From Chapter 2: SW = [[4, 2], [2, 4]], m1−m2 = [−4, 3].

SW−1 = (1/12)[[4, −2], [−2, 4]] (using det = 16 − 4 = 12)

w* = (1/12)[[4,−2],[−2,4]] · [−4, 3]T = (1/12)[−16−6, 8+12]T = (1/12)[−22, 20]T ∝ [−11, 10]

Normalizing: w* ≈ [−0.74, 0.67]. This is tilted compared to the naive m1−m2 = [−4, 3] ∝ [−0.80, 0.60], because SW−1 rotates the direction to account for within-class correlations.

Interactive Fisher Projection

Drag the angle to project 2D data onto different directions. The Fisher optimal direction (green arrow) maximizes J(w). Compare to naive mean-difference direction (gray dashed). Watch the 1D histograms overlap or separate.

Projection angle θ 45°
Within-class correlation ρ 0.70
The Fisher optimal projection w* = SW−1(m1−m2) differs from the naive mean-difference direction because:

Chapter 4: Connection to LDA

We've derived Fisher's discriminant purely from a separation criterion — no probability distributions, no Gaussians, no Bayes' rule. Yet the result looks suspiciously like LDA from Lecture 15.

The Remarkable Equivalence

Fisher's DiscriminantLDA (Gaussian Bayes)
AssumptionNone on distributionGaussian with shared Σ
CriterionMax J(w) = between/withinMin Bayes risk R(f)
Solutionw = SW−1(m1−m2)w = Σ−11−μ2)
EquivalenceIf SW/(n−2) = Σ̂ (pooled sample covariance) and mk = μ̂k, they give the SAME direction
The punchline: Fisher's discriminant and Gaussian LDA produce the same projection direction. But Fisher requires NO distributional assumption — it works for any data shape. It's optimal as a variance-ratio criterion regardless of the underlying distribution. LDA is Bayes-optimal only when the data is truly Gaussian.

Why This Matters

1. Robustness: Fisher's direction is sensible even for non-Gaussian data (bimodal, skewed, heavy-tailed). LDA's Bayes optimality only holds for Gaussians.

2. Same computation: In practice, you compute the same thing either way. But the interpretation differs — Fisher is a geometric projection; LDA is a probabilistic classifier.

3. Threshold differs: Fisher gives you the direction but doesn't specify where to threshold. LDA specifies the threshold via Bayes' rule. In practice, you often set the threshold by cross-validation regardless.

When They Diverge

Fisher and LDA agree on the direction but can give different classifiers when:

Unequal priors: LDA shifts the threshold; Fisher's criterion ignores priors.

Non-Gaussian data: LDA's decision rule assumes Gaussian posteriors; Fisher's threshold should be set empirically.

Unequal class sizes: Fisher uses pooled scatter; LDA uses the same but weighted differently depending on formulation.

Fisher vs LDA: Same Direction, Different Assumptions

Toggle between Gaussian data (where both are optimal) and non-Gaussian data (where Fisher still makes sense but LDA's probability model is wrong).

Fisher's discriminant and LDA give the same projection direction. What's the key advantage of Fisher's derivation?

Chapter 5: Simultaneous Diagonalization

The Fisher solution w = SW−1(m1−m2) has a beautiful geometric interpretation: it simultaneously diagonalizes SB and SW.

What Simultaneous Diagonalization Means

For a single matrix A, we can diagonalize: find P such that PTAP = D (diagonal). For two matrices SB and SW, we want one transformation W such that:

WTSWW = I   (within-class becomes identity)
WTSBW = Λ   (between-class becomes diagonal)

This means: in the transformed space, the within-class scatter is isotropic (a sphere), and the between-class scatter is aligned with coordinate axes.

Two-Step Procedure

Step 1: Whiten with SW. Find SW = UΛWUT (eigendecomposition). Define A = ΛW−1/2UT. Then A SW AT = I. In the whitened space, within-class scatter is the identity.

Step 2: Diagonalize the whitened SB. Compute A SB AT and find its eigenvectors V. The columns of V are the Fisher directions in the whitened space.

Combined: W = ATV maps original data to the Fisher discriminant space.

Geometric picture: SW−1 first "spheres" the data so that within-class scatter is uniform in all directions. Once within-class scatter is a sphere, the best separation direction is simply the one connecting the whitened means. This is why w = SW−1(m1−m2) works: SW−1 does the whitening, then (m1−m2) picks the direction.

Connection to Generalized Eigenvalue Problem

The generalized eigenvalue problem SBw = λSWw can be rewritten:

SW−1SBw = λw

This is a standard eigenvalue problem for the matrix SW−1SB. The eigenvector with largest eigenvalue is w*. For two classes, SB is rank-1, so there's only one nonzero eigenvalue and eigenvector: w* = SW−1(m1−m2).

Whitening Then Projecting

Left: original data with correlated classes. Right: after whitening by SW−1/2, within-class scatter becomes spherical. In the whitened space, the optimal direction is simply the mean-difference direction.

After whitening the data with SW−1/2, what is the optimal Fisher direction in the whitened space?

Chapter 6: Multi-Class Fisher

With K > 2 classes, we can no longer project onto a single direction. We need up to K − 1 directions to separate K classes (since SB has rank at most K − 1).

Multi-Class Scatter Matrices

Define the overall mean: m = (1/n)∑ixi.

Between-class scatter:

SB = ∑k=1K nk(mk − m)(mk − m)T

This is a sum of K rank-1 matrices, but since the means live in a (K−1)-dimensional affine subspace (they sum to n·m), SB has rank at most K − 1.

Within-class scatter:

SW = ∑k=1Kx ∈ Ck(x − mk)(x − mk)T

Multi-Class Fisher Criterion

We seek a projection matrix W ∈ ℝN × d (d ≤ K−1) that maximizes:

J(W) = |WTSBW| / |WTSWW|

(where |·| is the determinant). The solution: the columns of W are the top d eigenvectors of SW−1SB.

Multi-class Fisher: For K classes, solve the generalized eigenvalue problem SBw = λSWw. Take the top K−1 eigenvectors as projection directions. This gives a (K−1)-dimensional space where classes are maximally separated. For K=2, this reduces to the single vector w* = SW−1(m1−m2).

Example: 3 Classes in 3D

With K = 3 classes in N = 3 dimensions, we get d = 2 Fisher directions. The data is projected from 3D to 2D — a plane that maximally separates the three classes.

If the original features are 100-dimensional (e.g., 100-point spectra), Fisher reduces to just 2 dimensions (for 3 classes) or K−1 dimensions in general. This is extreme dimensionality reduction guided by class structure.

Fisher Dimensionality Reduction vs PCA

PCAFisher/LDA
CriterionMax total varianceMax between/within class ratio
Uses labels?No (unsupervised)Yes (supervised)
Max dimensionsrank(X)K − 1
Good forReconstruction, visualizationClassification
Failure modeHigh-variance direction may not separate classesMay discard high-variance directions that don't help classification
Multi-Class Fisher: 3 Classes → 2D

Three classes projected onto the two Fisher discriminant directions. Compare to PCA projection which ignores class labels.

For K = 5 classes in N = 100 dimensions, multi-class Fisher discriminant analysis projects to at most:

Chapter 7: Mastery

Fisher's Discriminant: Complete Picture

Data
K classes in ℝN, compute mk
SB
∑nk(mk−m)(mk−m)T — between-class scatter (rank ≤ K−1)
SW
∑∑(x−mk)(x−mk)T — within-class scatter
Solve
SW−1SBw = λw — top K−1 eigenvectors
Project
z = WTx ∈ ℝK−1 — classify in reduced space

Key Results

ResultFormulaSignificance
Fisher criterionJ(w) = wTSBw / wTSWwGeneralized Rayleigh quotient; max = largest gen. eigenvalue
Optimal w (2-class)w* = SW−1(m1−m2)Closed form! No iterative optimization
EquivalenceSame direction as Gaussian LDADistribution-free justification for LDA
Multi-classTop K−1 eigenvectors of SW−1SBSupervised dimensionality reduction
GeometryWhiten by SW, then project onto meansSimultaneous diagonalization of SB, SW

Connections

Lecture 14: Bayes classifiers — the probability-based framework Fisher avoids

Lecture 15: LDA/QDA — the Gaussian classifiers that give the same direction as Fisher

Lecture 9: Distance-based classification — Fisher reduces dimensions before computing distances

Limitations

At most K−1 dimensions. With 2 classes, you get one direction. If the optimal boundary is nonlinear, Fisher's linear projection loses information.

SW must be invertible. Requires n > N (more samples than features). Otherwise, regularize: SW + εI.

Only captures first/second order stats. If classes differ in higher moments (skewness, kurtosis) but have the same mean and variance structure, Fisher can't help.

"The best projection for classification is not the same as the best projection for visualization." PCA finds directions of maximum variance; Fisher finds directions of maximum discrimination. A high-variance direction with both classes mixed is useless for classification. Always use labels when you have them.
Fisher's discriminant requires no Gaussian assumption, yet gives the same direction as LDA. The practical implication is: