EE269 Lecture 16 — Fisher's Linear Discriminant

Chapter 0: The Projection Problem

You have high-dimensional data — say, N = 100 features per sample. Plotting it is impossible. Training a classifier directly is expensive and prone to overfitting. What if you could find ONE direction to project onto such that the two classes are maximally separated in that 1D projection?

That's the goal: find a vector w ∈ ℝ^N such that when you project all data onto w (computing z = w^Tx for each sample x), the resulting 1D values for class 1 and class 2 are as separated as possible.

Why Not Just Use the Direction Between Means?

A naive approach: project onto the line connecting the two class means, w = m₁ − m₂, where m₁ and m₂ are the class sample means. Among all unit-length directions, this one maximizes the distance between the projected means. But it can fail badly.

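Imagine two long, thin clouds of points tilted at 45 degrees and offset from each other horizontally. The line between the means then has a large component along the clouds' elongation, so projecting onto it picks up the large within-class spread and the two classes land on overlapping 1D distributions. A better direction is perpendicular to the elongation: the gap between the projected means is somewhat smaller, but each class collapses to a narrow peak and the classes no longer overlap.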

The key insight: Good separation requires BOTH large distance between class means AND small spread within each class, measured in the projection direction. Fisher's criterion captures both: maximize (between-class variance) / (within-class variance) in the projected space.
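This failure mode is easy to reproduce numerically. The sketch below is illustrative only: the synthetic covariance, the class means, and the helper name `separation` are assumptions, not part of the lecture. It builds two clouds elongated along 45° with a horizontal offset and compares a simple gap-over-spread score for two projection directions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: both classes are long along [1, 1] and thin along [1, -1].
long_axis = np.array([1.0, 1.0]) / np.sqrt(2)
thin_axis = np.array([1.0, -1.0]) / np.sqrt(2)
cov = 9.0 * np.outer(long_axis, long_axis) + 0.09 * np.outer(thin_axis, thin_axis)

X1 = rng.multivariate_normal([0.0, 0.0], cov, size=500)
X2 = rng.multivariate_normal([2.0, 0.0], cov, size=500)

def separation(w, X1, X2):
    """Gap between projected means divided by the average projected std."""
    z1, z2 = X1 @ w, X2 @ w
    return abs(z1.mean() - z2.mean()) / (0.5 * (z1.std() + z2.std()))

mean_dir = X1.mean(axis=0) - X2.mean(axis=0)
mean_dir /= np.linalg.norm(mean_dir)

sep_mean = separation(mean_dir, X1, X2)   # picks up the long-axis spread
sep_thin = separation(thin_axis, X1, X2)  # smaller gap, far smaller spread

print(f"mean-difference direction: {sep_mean:.2f}")
print(f"thin-axis direction:       {sep_thin:.2f}")
```

The thin-axis direction scores several times higher even though its projected means are closer together, which is the trade-off Fisher's criterion formalizes.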

Formal Setup

Given:

• Class 1 data: a set C₁ of n₁ samples with sample mean m₁

• Class 2 data: a set C₂ of n₂ samples with sample mean m₂

• Total: n = n₁ + n₂ samples in ℝ^N

We seek w ∈ ℝ^N that maximizes class separability after projection z = w^Tx. Fisher found the answer in 1936 — and it's elegant.

Projection Direction Matters

Interactive demo: two classes (teal and orange ellipses) with an adjustable projection angle θ (shown at 45°). Changing the angle shows how the 1D histograms overlap: some directions separate the classes well, others collapse both classes together.

Why can projecting onto the direction between the two class means fail?

• Because the means might be too close together

• Because the within-class spread in that direction might be large, causing the projected classes to overlap despite separated means

• Because you need at least 3 dimensions for good separation

Chapter 1: Fisher Criterion J(w)

After projecting onto w, each sample x_i becomes a scalar z_i = w^Tx_i. The projected class means are:

m̃₁ = w^Tm₁, m̃₂ = w^Tm₂

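The squared distance between the projected means is (m̃₁ − m̃₂)² = (w^T(m₁ − m₂))².

The within-class scatter after projection (the sum of squared deviations from the projected class mean, i.e. the variance without its 1/n normalization) is: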

s̃₁² = ∑_{x ∈ C₁} (w^Tx − m̃₁)², s̃₂² = ∑_{x ∈ C₂} (w^Tx − m̃₂)²

Fisher's criterion is the ratio:

J(w) = (m̃₁ − m̃₂)² / (s̃₁² + s̃₂²)

Fisher's criterion J(w): The squared distance between projected means divided by the total within-class scatter in the projection. Maximize J(w) = find the direction where classes are far apart AND tight. It's a signal-to-noise ratio for class separability.
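As a minimal sketch, the definition can be evaluated directly from the projected scalars. The function name `fisher_j_projected` and the toy data are assumptions made for illustration.

```python
import numpy as np

def fisher_j_projected(X1, X2, w):
    """J(w) computed directly from the 1D projections z = w^T x."""
    z1, z2 = X1 @ w, X2 @ w
    gap_sq = (z1.mean() - z2.mean()) ** 2      # squared gap between projected means
    s1_sq = np.sum((z1 - z1.mean()) ** 2)      # projected scatter, class 1
    s2_sq = np.sum((z2 - z2.mean()) ** 2)      # projected scatter, class 2
    return gap_sq / (s1_sq + s2_sq)

# Hypothetical toy data, one row per sample.
X1 = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]])
X2 = np.array([[4.0, 3.0], [5.0, 5.0], [6.0, 4.0]])

j_x = fisher_j_projected(X1, X2, np.array([1.0, 0.0]))
j_scaled = fisher_j_projected(X1, X2, np.array([10.0, 0.0]))
print(j_x, j_scaled)
```

Both calls return the same value: rescaling w multiplies numerator and denominator by the same factor, so J depends only on the direction of w.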

Expressing J(w) in Matrix Form

The numerator: (m̃₁ − m̃₂)² = (w^T(m₁ − m₂))² = w^T(m₁−m₂)(m₁−m₂)^Tw

Define the between-class scatter matrix:

S_B = (m₁ − m₂)(m₁ − m₂)^T

Numerator = w^TS_Bw.

The denominator: s̃₁² + s̃₂² = w^T(∑(x−m₁)(x−m₁)^T + ∑(x−m₂)(x−m₂)^T)w

Define the within-class scatter matrix:

S_W = ∑_{x ∈ C₁}(x−m₁)(x−m₁)^T + ∑_{x ∈ C₂}(x−m₂)(x−m₂)^T

Denominator = w^TS_Ww. Therefore:

J(w) = w^TS_Bw / w^TS_Ww

This is a Rayleigh quotient (or generalized Rayleigh quotient) — a ratio of quadratic forms. Maximizing it is a classic eigenvalue problem.
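The matrix form can be checked against the projection-based definition. The following sketch uses hypothetical random data; `scatter_matrices` and `fisher_j` are illustrative names, not functions from the lecture.

```python
import numpy as np

def scatter_matrices(X1, X2):
    """Return (S_B, S_W) for two classes, one sample per row."""
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    d = m1 - m2
    S_B = np.outer(d, d)
    S_W = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)
    return S_B, S_W

def fisher_j(w, S_B, S_W):
    """Rayleigh-quotient form of Fisher's criterion."""
    return (w @ S_B @ w) / (w @ S_W @ w)

rng = np.random.default_rng(1)
X1 = rng.normal(size=(20, 4))          # hypothetical class-1 samples
X2 = rng.normal(size=(30, 4)) + 1.0    # hypothetical class-2 samples
w = rng.normal(size=4)

S_B, S_W = scatter_matrices(X1, X2)

# The same quantity computed from the projected scalars z = w^T x.
z1, z2 = X1 @ w, X2 @ w
j_direct = (z1.mean() - z2.mean()) ** 2 / (
    np.sum((z1 - z1.mean()) ** 2) + np.sum((z2 - z2.mean()) ** 2)
)

print(np.isclose(fisher_j(w, S_B, S_W), j_direct))
```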

Worked Example (2D)

Class 1: points at (1,2), (2,3), (3,3). Mean m₁ = [2, 8/3]^T ≈ [2, 2.67]^T.

Class 2: points at (5,1), (6,2), (7,1). Mean m₂ = [6, 4/3]^T ≈ [6, 1.33]^T.

m₁ − m₂ = [−4, 1.33]^T

S_B = [−4, 1.33]·[−4, 1.33]^T = [[16, −5.33], [−5.33, 1.78]]

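If we try w = [1, 0] (project onto the x-axis): the x-coordinates 1, 2, 3 and 5, 6, 7 each have scatter 2, so J = (2 − 6)² / (2 + 2) = 16/4 = 4. If we try w = [0, 1] (project onto the y-axis): each class has y-scatter 2/3, so J = (8/3 − 4/3)² / (2/3 + 2/3) = (16/9)/(4/3) = 4/3 ≈ 1.33. Of these two axes, the x-direction separates better for this data.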

Fisher Criterion J(w) vs Angle

Two classes in 2D. The polar plot shows J(w) for each projection direction. The optimal w maximizes J. Compare to naive "direction of means."
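A rough stand-in for the J-versus-angle plot is a brute-force sweep over directions. This sketch reuses the six points of the worked example above. The 0.5° grid and the variable names are arbitrary choices for illustration.

```python
import numpy as np

# Worked-example data from this chapter.
X1 = np.array([[1.0, 2.0], [2.0, 3.0], [3.0, 3.0]])
X2 = np.array([[5.0, 1.0], [6.0, 2.0], [7.0, 1.0]])

m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
d = m1 - m2
S_B = np.outer(d, d)
S_W = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)

# Directions w(theta) = [cos theta, sin theta]; theta and theta + 180 deg give the same J.
thetas = np.deg2rad(np.arange(0.0, 180.0, 0.5))
W = np.stack([np.cos(thetas), np.sin(thetas)], axis=1)   # one direction per row

num = np.einsum("ij,jk,ik->i", W, S_B, W)   # w^T S_B w for every row
den = np.einsum("ij,jk,ik->i", W, S_W, W)   # w^T S_W w for every row
J = num / den

best = np.argmax(J)
mean_dir = d / np.linalg.norm(d)
j_mean = (mean_dir @ S_B @ mean_dir) / (mean_dir @ S_W @ mean_dir)

print(f"best angle on the grid: {np.rad2deg(thetas[best]):.1f} deg, J = {J[best]:.3f}")
print(f"mean-difference direction: J = {j_mean:.3f}")
```

On this data the best grid direction beats the mean-difference direction, which in turn beats both coordinate axes.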

Fisher's criterion J(w) = w^TS_Bw / w^TS_Ww is maximized when:

• w points toward the nearest data point

• w maximizes only the between-class distance

• w maximizes the ratio of between-class separation to within-class spread

Chapter 2: Between/Within Scatter

Let's build intuition for S_B and S_W — the two matrices that define Fisher's criterion.

Between-Class Scatter S_B

S_B = (m₁ − m₂)(m₁ − m₂)^T

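This is a rank-1 matrix (outer product of a vector with itself). It captures how far apart the class centroids are and in what direction. Its only eigenvector with a nonzero eigenvalue is (m₁ − m₂) itself, with eigenvalue ‖m₁ − m₂‖²; every direction orthogonal to (m₁ − m₂) has eigenvalue 0.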

Properties:

• Rank 1 (for two classes). For K classes: rank ≤ K − 1.

• S_Bw = (m₁−m₂)(m₁−m₂)^Tw = (m₁−m₂) · scalar. So S_Bw always points in direction (m₁−m₂).
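Both properties can be confirmed numerically in a few lines. The class means below are illustrative values chosen for this sketch.

```python
import numpy as np

m1 = np.array([2.0, 5.0])   # illustrative class means (assumed values)
m2 = np.array([6.0, 2.0])
d = m1 - m2
S_B = np.outer(d, d)

print(np.linalg.matrix_rank(S_B))   # rank of an outer product of a nonzero vector
print(np.linalg.eigvalsh(S_B))      # one zero eigenvalue, one equal to ||d||^2

rng = np.random.default_rng(2)
for _ in range(3):
    w = rng.normal(size=2)
    v = S_B @ w                      # equals d * (d^T w)
    # The 2D cross product vanishes when v is parallel to d.
    print(np.isclose(v[0] * d[1] - v[1] * d[0], 0.0))
```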

Within-Class Scatter S_W

S_W = Σ₁ + Σ₂

where Σ_k = ∑_{x ∈ C_k}(x−m_k)(x−m_k)^T is the scatter matrix for class k. (This is (n_k−1) times the sample covariance of class k.)

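S_W is always symmetric and positive semidefinite. When there are enough samples (generically n ≥ N + 2 for two classes) it is also full-rank and therefore positive definite. It measures the total "spread" of the data around their respective class means.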

Geometric interpretation:
• S_B encodes the between-class direction. Making w^TS_Bw large means w is aligned with (m₁−m₂).
• S_W captures within-class spread. Making w^TS_Ww small means w avoids directions where classes are internally dispersed.
• J(w) balances both: align with the inter-mean direction while avoiding the directions of high variance.

Worked Example (2D Numerical)

Class 1 (3 points): [1,4], [2,6], [3,5]. Mean m₁ = [2, 5].

Class 2 (3 points): [5,1], [6,3], [7,2]. Mean m₂ = [6, 2].

S_B = (m₁−m₂)(m₁−m₂)^T = [−4, 3][−4, 3]^T = [[16, −12], [−12, 9]]

Σ₁ = ([1,4]−[2,5])([1,4]−[2,5])^T + ... = [[2, 1], [1, 2]]

Σ₂ = [[2, 1], [1, 2]]

S_W = [[4, 2], [2, 4]]

Now we can compute J(w) for any direction. For w = [1, 0]:

• w^TS_Bw = 16, w^TS_Ww = 4, J = 4.0

For w = [0, 1]:

• w^TS_Bw = 9, w^TS_Ww = 4, J = 2.25

For w = [−4, 3]/5 (direction of means):

• w^TS_Bw = (16·16 + 2·(−12)·(−4)·3 + 9·9)/25 = 25, w^TS_Ww = (16·4 − 2·2·12 + 9·4)/25 = 52/25, J = 25/(52/25) = 625/52 ≈ 12.0

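The mean-difference direction does far better than either axis here, but it is still not optimal: the maximum over all directions is J = 37/3 ≈ 12.33, attained at w ∝ [−11, 10] (derived in Chapter 3). In general, finding the optimum requires solving an eigenvalue problem.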
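The numbers in this worked example can be checked directly. The sketch below rebuilds S_B and S_W from the six points and evaluates J for the three directions. The helper name `fisher_j` is an assumption for illustration.

```python
import numpy as np

X1 = np.array([[1.0, 4.0], [2.0, 6.0], [3.0, 5.0]])
X2 = np.array([[5.0, 1.0], [6.0, 3.0], [7.0, 2.0]])

m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
d = m1 - m2
S_B = np.outer(d, d)
S_W = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)

def fisher_j(w):
    """Fisher criterion for the scatter matrices built above."""
    return (w @ S_B @ w) / (w @ S_W @ w)

directions = {
    "x-axis": np.array([1.0, 0.0]),
    "y-axis": np.array([0.0, 1.0]),
    "mean difference": d / np.linalg.norm(d),
}
for name, w in directions.items():
    print(f"{name:>15}: J = {fisher_j(w):.4f}")
```

This should print 4.0000, 2.2500 and 12.0192, matching the hand calculation.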

Scatter Matrices Visualized

The ellipses show the shape of S_W (within-class spread). The arrow shows m₁−m₂ (between-class direction). Optimal Fisher w balances avoiding the ellipse's long axis while staying near the mean-difference direction.

The between-class scatter S_B = (m₁−m₂)(m₁−m₂)^T has rank:

• 1 (it's an outer product of a single vector)

• N (full rank)

• 2 (one for each class)

Chapter 3: Optimal w Derivation

We need to maximize J(w) = w^TS_Bw / w^TS_Ww. This is a generalized Rayleigh quotient. The standard approach: take the derivative, set to zero.

The Derivation

Differentiate J(w) with respect to w using the quotient rule:

∂J/∂w = (2S_Bw · w^TS_Ww − w^TS_Bw · 2S_Ww) / (w^TS_Ww)² = 0

Setting the numerator to zero:

S_Bw · (w^TS_Ww) = S_Ww · (w^TS_Bw)

Let λ = w^TS_Bw / w^TS_Ww = J(w). Then:

S_Bw = λ S_Ww

This is a generalized eigenvalue problem: S_Bw = λ S_Ww. The w that maximizes J is the generalized eigenvector corresponding to the largest eigenvalue λ.
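As a quick numerical check, the generalized problem can be solved for the Chapter 2 matrices by converting it to an ordinary eigenproblem. This assumes S_W is invertible, and it is only one of several ways to solve the problem.

```python
import numpy as np

S_B = np.array([[16.0, -12.0], [-12.0, 9.0]])   # from the Chapter 2 example
S_W = np.array([[4.0, 2.0], [2.0, 4.0]])

# S_B w = lam S_W w is equivalent to (S_W^{-1} S_B) w = lam w when S_W is invertible.
M = np.linalg.solve(S_W, S_B)
eigvals, eigvecs = np.linalg.eig(M)

top = np.argmax(eigvals.real)
lam = eigvals.real[top]
w = eigvecs[:, top].real

j_at_w = (w @ S_B @ w) / (w @ S_W @ w)
print(f"largest eigenvalue   = {lam:.4f}")
print(f"J at its eigenvector = {j_at_w:.4f}")
```

The two printed values agree: the largest generalized eigenvalue is the maximum of J.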

The Closed-Form Solution

But we can do better than solving an eigenvalue problem. Recall that S_Bw = (m₁−m₂)(m₁−m₂)^Tw = (m₁−m₂) · c where c = (m₁−m₂)^Tw is a scalar.

Substituting into S_Bw = λS_Ww:

(m₁−m₂) · c = λ S_Ww

w = (c/λ) · S_W⁻¹(m₁−m₂)

(assuming S_W is invertible and λ ≠ 0). Since we only care about the direction of w (not its magnitude), the scalar c/λ doesn't matter. The solution is:

w* = S_W⁻¹(m₁ − m₂)

Fisher's optimal projection: w* = S_W⁻¹(m₁ − m₂). The within-class scatter inverse "de-correlates" and normalizes the features, then we project onto the mean difference. It's like LDA's w = Σ⁻¹(μ₁−μ₂) but uses the SAMPLE scatter matrices. And crucially — Fisher derived this WITHOUT assuming Gaussian distributions.

Worked Example

From Chapter 2: S_W = [[4, 2], [2, 4]], m₁−m₂ = [−4, 3].

S_W⁻¹ = (1/12)[[4, −2], [−2, 4]] (using det = 16 − 4 = 12)

w* = (1/12)[[4,−2],[−2,4]] · [−4, 3]^T = (1/12)[−16−6, 8+12]^T = (1/12)[−22, 20]^T ∝ [−11, 10]

Normalizing: w* ≈ [−0.74, 0.67]. This is tilted compared to the naive m₁−m₂ = [−4, 3] ∝ [−0.80, 0.60], because S_W⁻¹ rotates the direction to account for within-class correlations.
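The same computation in code, as a sketch: it solves the linear system S_W w = m₁ − m₂ rather than forming the inverse, which is a common numerical preference and not something the lecture prescribes.

```python
import numpy as np

S_W = np.array([[4.0, 2.0], [2.0, 4.0]])
d = np.array([-4.0, 3.0])                 # m1 - m2
S_B = np.outer(d, d)

# Solve S_W w = d instead of forming the inverse explicitly.
w_star = np.linalg.solve(S_W, d)
w_unit = w_star / np.linalg.norm(w_star)

def fisher_j(w):
    """Fisher criterion for the matrices defined above."""
    return (w @ S_B @ w) / (w @ S_W @ w)

naive = d / np.linalg.norm(d)
angle = np.degrees(np.arccos(np.clip(w_unit @ naive, -1.0, 1.0)))

print("w* (unit):", np.round(w_unit, 2))
print(f"J(w*) = {fisher_j(w_star):.4f}, J(naive) = {fisher_j(naive):.4f}")
print(f"angle between the two directions: {angle:.1f} deg")
```

The tilt away from the mean difference is small here, and so is the gain in J (12.3333 versus 12.0192). The gain grows as the within-class correlation becomes stronger.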

Interactive Fisher Projection

Interactive demo: 2D data projected onto an adjustable direction, with controls for the projection angle θ (shown at 45°) and the within-class correlation ρ (shown at 0.70). The Fisher optimal direction (green arrow) maximizes J(w); the naive mean-difference direction (gray dashed) is drawn for comparison. The 1D histograms overlap or separate as the angle changes.

The Fisher optimal projection w* = S_W⁻¹(m₁−m₂) differs from the naive mean-difference direction because:

• S_W⁻¹ rotates the direction to avoid projecting along high-variance (low-information) directions

• Fisher uses more data points

• The mean difference is always perpendicular to Fisher's direction

Chapter 4: Connection to LDA

We've derived Fisher's discriminant purely from a separation criterion — no probability distributions, no Gaussians, no Bayes' rule. Yet the result looks suspiciously like LDA from Lecture 15.

The Remarkable Equivalence

	Fisher's Discriminant	LDA (Gaussian Bayes)
Assumption	None on distribution	Gaussian with shared Σ
Criterion	Max J(w) = between/within	Min Bayes risk R(f)
Solution	w = S_W⁻¹(m₁−m₂)	w = Σ⁻¹(μ₁−μ₂)
Equivalence	If S_W/(n−2) = Σ̂ (pooled sample covariance) and m_k = μ̂_k, they give the SAME direction

The punchline: Fisher's discriminant and Gaussian LDA produce the same projection direction. But Fisher requires NO distributional assumption — it works for any data shape. It's optimal as a variance-ratio criterion regardless of the underlying distribution. LDA is Bayes-optimal only when the data is truly Gaussian.
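The equivalence of directions amounts to a scalar factor, which a short sketch makes concrete. The data here are hypothetical, and `pooled_cov` denotes S_W/(n − 2) as in the table's last row.

```python
import numpy as np

rng = np.random.default_rng(3)
scale = np.diag([2.0, 1.0, 0.5])
X1 = rng.normal(size=(40, 3)) @ scale                      # hypothetical class 1
X2 = rng.normal(size=(60, 3)) @ scale + [1.0, 0.5, 0.0]    # hypothetical class 2

m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
S_W = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)

n = len(X1) + len(X2)
pooled_cov = S_W / (n - 2)                 # pooled sample covariance estimate

w_fisher = np.linalg.solve(S_W, m1 - m2)
w_lda = np.linalg.solve(pooled_cov, m1 - m2)

# The two vectors differ only by the scalar factor (n - 2).
print(np.allclose(w_lda, (n - 2) * w_fisher))
```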

Why This Matters

1. Robustness: Fisher's direction is sensible even for non-Gaussian data (bimodal, skewed, heavy-tailed). LDA's Bayes optimality only holds for Gaussians.

2. Same computation: In practice, you compute the same thing either way. But the interpretation differs — Fisher is a geometric projection; LDA is a probabilistic classifier.

3. Threshold differs: Fisher gives you the direction but doesn't specify where to threshold. LDA specifies the threshold via Bayes' rule. In practice, you often set the threshold by cross-validation regardless.

When They Diverge

Fisher and LDA agree on the direction but can give different classifiers when:

• Unequal priors: LDA shifts the threshold; Fisher's criterion ignores priors.

• Non-Gaussian data: LDA's decision rule assumes Gaussian class-conditional densities; when that assumption fails, the threshold along Fisher's direction should be set empirically.

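• Unequal class sizes: with the definitions used here, S_W is (n − 2) times the pooled sample covariance, so the direction is the same. Formulations that weight each class covariance differently (for example by assumed priors) can give a slightly different direction, and priors estimated from the class frequencies n_k/n shift LDA's threshold.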
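The effect of priors on the threshold can be sketched as follows. The threshold formula is the standard result for Gaussian class-conditionals with a shared covariance; it is stated here as an assumption because this lecture does not derive it. The function name `lda_threshold` and all numerical values are hypothetical.

```python
import numpy as np

def lda_threshold(w, m1, m2, prior1, prior2):
    """Threshold on z = w^T x under the shared-covariance Gaussian model.

    Assumes w = Sigma^{-1} (m1 - m2) with Sigma a covariance (not a scatter) matrix.
    Class 1 is chosen when z exceeds the returned value.
    """
    return 0.5 * w @ (m1 + m2) - np.log(prior1 / prior2)

# Hypothetical estimates for illustration.
m1 = np.array([2.0, 5.0])
m2 = np.array([6.0, 2.0])
pooled_cov = np.array([[1.0, 0.5], [0.5, 1.0]])
w = np.linalg.solve(pooled_cov, m1 - m2)

t_equal = lda_threshold(w, m1, m2, 0.5, 0.5)     # equal priors
t_skewed = lda_threshold(w, m1, m2, 0.9, 0.1)    # class 1 is much more common

midpoint = 0.5 * (w @ m1 + w @ m2)
print(np.isclose(t_equal, midpoint))   # equal priors: midpoint of projected means
print(t_skewed < t_equal)              # the bar for choosing class 1 drops
```

The direction w is identical in both calls; only the cut point along it moves.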

Fisher vs LDA: Same Direction, Different Assumptions

Toggle between Gaussian data (where both are optimal) and non-Gaussian data (where Fisher still makes sense but LDA's probability model is wrong).

Fisher's discriminant and LDA give the same projection direction. What's the key advantage of Fisher's derivation?

• It requires no Gaussian assumption: the direction maximizes separability for ANY data distribution

• It's computationally faster

• It works in higher dimensions

Chapter 5: Simultaneous Diagonalization

The Fisher solution has a beautiful geometric interpretation: the generalized eigenvectors of the pair (S_B, S_W), of which w = S_W⁻¹(m₁−m₂) is the leading one, simultaneously diagonalize S_B and S_W.

What Simultaneous Diagonalization Means

For a single matrix A, we can diagonalize: find P such that P^TAP = D (diagonal). For two matrices S_B and S_W, we want one transformation W such that:

W^TS_WW = I (within-class becomes identity)

W^TS_BW = Λ (between-class becomes diagonal)

This means: in the transformed space, the within-class scatter is isotropic (a sphere), and the between-class scatter is aligned with coordinate axes.

Two-Step Procedure

Step 1: Whiten with S_W. Find S_W = UΛ_WU^T (eigendecomposition). Define A = Λ_W^{−1/2}U^T. Then A S_W A^T = I. In the whitened space, within-class scatter is the identity.

Step 2: Diagonalize the whitened S_B. Compute A S_B A^T and find its eigenvectors V. The columns of V are the Fisher directions in the whitened space.

Combined: W = A^TV maps original data to the Fisher discriminant space.
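The two steps translate almost line for line into code. This sketch applies them to the Chapter 2 matrices, assuming S_W is positive definite so that Λ_W^{−1/2} exists.

```python
import numpy as np

S_B = np.array([[16.0, -12.0], [-12.0, 9.0]])   # Chapter 2 example
S_W = np.array([[4.0, 2.0], [2.0, 4.0]])

# Step 1: whiten. S_W = U diag(lam_w) U^T, A = diag(lam_w^{-1/2}) U^T.
lam_w, U = np.linalg.eigh(S_W)
A = np.diag(lam_w ** -0.5) @ U.T

# Step 2: diagonalize the whitened between-class scatter.
lam_b, V = np.linalg.eigh(A @ S_B @ A.T)

# Combined transform.
W = A.T @ V

print(np.round(W.T @ S_W @ W, 6))   # identity
print(np.round(W.T @ S_B @ W, 6))   # diagonal, entries are the generalized eigenvalues
```

Because `eigh` returns eigenvalues in ascending order, the last column of W is the Fisher direction; it is proportional to [−11, 10], the closed-form result from Chapter 3.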

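Geometric picture: S_W^{−1/2} "spheres" the data so that within-class scatter is uniform in all directions. Once within-class scatter is a sphere, the best separation direction is simply the one connecting the whitened means, S_W^{−1/2}(m₁−m₂). Expressing that direction in the original coordinates applies S_W^{−1/2} once more, giving S_W^{−1/2} · S_W^{−1/2}(m₁−m₂) = S_W⁻¹(m₁−m₂). This is why w = S_W⁻¹(m₁−m₂) works: one factor of S_W^{−1/2} whitens, the other maps the whitened mean difference back.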

Connection to Generalized Eigenvalue Problem

The generalized eigenvalue problem S_Bw = λS_Ww can be rewritten:

S_W⁻¹S_Bw = λw

This is a standard eigenvalue problem for the matrix S_W⁻¹S_B. The eigenvector with largest eigenvalue is w*. For two classes, S_B is rank-1, so there's only one nonzero eigenvalue, and its eigenvector is w* = S_W⁻¹(m₁−m₂).

Whitening Then Projecting

Left: original data with correlated classes. Right: after whitening by S_W^{−1/2}, within-class scatter becomes spherical. In the whitened space, the optimal direction is simply the mean-difference direction.

After whitening the data with S_W^{−1/2}, what is the optimal Fisher direction in the whitened space?

• The first principal component

• Simply the direction between the whitened class means (since within-class scatter is now isotropic)

• The eigenvector of the original covariance matrix

Chapter 6: Multi-Class Fisher

With K > 2 classes, we can no longer project onto a single direction. We need up to K − 1 directions to separate K classes (since S_B has rank at most K − 1).

Multi-Class Scatter Matrices

Define the overall mean: m = (1/n)∑_ix_i.

Between-class scatter:

S_B = ∑_{k=1}^{K} n_k(m_k − m)(m_k − m)^T

This is a sum of K rank-1 matrices, but the K vectors (m_k − m) are linearly dependent: since ∑_k n_k m_k = n·m, they satisfy ∑_k n_k(m_k − m) = 0. They therefore span at most K − 1 dimensions, and S_B has rank at most K − 1.

Within-class scatter:

S_W = ∑_{k=1}^{K} ∑_{x ∈ C_k}(x − m_k)(x − m_k)^T

Multi-Class Fisher Criterion

We seek a projection matrix W ∈ ℝ^{N × d} (d ≤ K−1) that maximizes:

J(W) = |W^TS_BW| / |W^TS_WW|

(where |·| is the determinant). The solution: the columns of W are the top d eigenvectors of S_W⁻¹S_B.

Multi-class Fisher: For K classes, solve the generalized eigenvalue problem S_Bw = λS_Ww. Take the top K−1 eigenvectors as projection directions. This gives a (K−1)-dimensional space where classes are maximally separated. For K=2, this reduces to the single vector w* = S_W⁻¹(m₁−m₂).
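A minimal multi-class sketch follows. Instead of forming S_W⁻¹S_B directly, it reduces the generalized problem to a symmetric one through the Cholesky factor S_W = LLᵀ. That is a swapped-in numerical technique, not the lecture's stated method, and it gives the same eigenvectors. The function name `multiclass_fisher` and the synthetic data are assumptions.

```python
import numpy as np

def multiclass_fisher(X, y, d):
    """Top-d Fisher directions (columns) and all generalized eigenvalues, descending.

    Assumes S_W is positive definite so that its Cholesky factor exists.
    """
    m = X.mean(axis=0)
    N = X.shape[1]
    S_B = np.zeros((N, N))
    S_W = np.zeros((N, N))
    for k in np.unique(y):
        Xk = X[y == k]
        mk = Xk.mean(axis=0)
        S_B += len(Xk) * np.outer(mk - m, mk - m)
        S_W += (Xk - mk).T @ (Xk - mk)

    # Reduce S_B w = lam S_W w to a symmetric problem via S_W = L L^T.
    L = np.linalg.cholesky(S_W)
    L_inv = np.linalg.inv(L)
    M = L_inv @ S_B @ L_inv.T
    eigvals, V = np.linalg.eigh(M)            # ascending order
    order = np.argsort(eigvals)[::-1]
    W = L_inv.T @ V[:, order[:d]]             # map back: w = L^{-T} v
    return W, eigvals[order]

# Hypothetical data: K = 3 classes in N = 6 dimensions.
rng = np.random.default_rng(4)
K, N, n_per_class = 3, 6, 50
centers = rng.normal(scale=3.0, size=(K, N))
X = np.vstack([c + rng.normal(size=(n_per_class, N)) for c in centers])
y = np.repeat(np.arange(K), n_per_class)

W, eigvals = multiclass_fisher(X, y, d=K - 1)
Z = X @ W                                     # projected data
print(Z.shape)
print(np.round(eigvals, 6))                   # only K - 1 values are clearly nonzero
```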

Example: 3 Classes in 3D

With K = 3 classes in N = 3 dimensions, we get d = 2 Fisher directions. The data is projected from 3D to 2D — a plane that maximally separates the three classes.

If the original features are 100-dimensional (e.g., 100-point spectra), Fisher reduces to just 2 dimensions (for 3 classes) or K−1 dimensions in general. This is extreme dimensionality reduction guided by class structure.

Fisher Dimensionality Reduction vs PCA

	PCA	Fisher/LDA
Criterion	Max total variance	Max between/within class ratio
Uses labels?	No (unsupervised)	Yes (supervised)
Max dimensions	rank(X)	K − 1
Good for	Reconstruction, visualization	Classification
Failure mode	High-variance direction may not separate classes	May discard high-variance directions that don't help classification
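The PCA failure mode in the last row is easy to reproduce. In this hypothetical dataset (all values are assumptions for illustration) the classes share a large variance along x but differ only along y.

```python
import numpy as np

rng = np.random.default_rng(5)

# Hypothetical data: large shared variance along x, class offset along y.
cov = np.diag([25.0, 0.25])
X1 = rng.multivariate_normal([0.0, 0.0], cov, size=300)
X2 = rng.multivariate_normal([0.0, 2.0], cov, size=300)
X = np.vstack([X1, X2])

# PCA: leading eigenvector of the total scatter (labels ignored).
Xc = X - X.mean(axis=0)
_, vecs = np.linalg.eigh(Xc.T @ Xc)
w_pca = vecs[:, -1]

# Fisher: S_W^{-1} (m1 - m2) (labels used).
m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
S_W = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)
S_B = np.outer(m1 - m2, m1 - m2)
w_fisher = np.linalg.solve(S_W, m1 - m2)

def fisher_j(w):
    """Fisher criterion for the matrices defined above."""
    return (w @ S_B @ w) / (w @ S_W @ w)

print(f"J along first principal component: {fisher_j(w_pca):.6f}")
print(f"J along Fisher direction:          {fisher_j(w_fisher):.6f}")
```

The first principal component lies almost exactly along x, where the classes are indistinguishable, so its J is orders of magnitude below that of the Fisher direction.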

Multi-Class Fisher: 3 Classes → 2D

Three classes projected onto the two Fisher discriminant directions. Compare to PCA projection which ignores class labels.

For K = 5 classes in N = 100 dimensions, multi-class Fisher discriminant analysis projects to at most:

• 100 dimensions

• 4 dimensions (K − 1 = 4), because S_B has rank at most K−1

• 5 dimensions (one per class)

Chapter 7: Mastery

Fisher's Discriminant: Complete Picture

Data

K classes in ℝ^N, compute m_k

↓

S_B

∑n_k(m_k−m)(m_k−m)^T — between-class scatter (rank ≤ K−1)

↓

S_W

∑∑(x−m_k)(x−m_k)^T — within-class scatter

↓

Solve

S_W⁻¹S_Bw = λw — top K−1 eigenvectors

↓

Project

z = W^Tx ∈ ℝ^{K−1}; classify in reduced space

Key Results

Result	Formula	Significance
Fisher criterion	J(w) = w^TS_Bw / w^TS_Ww	Generalized Rayleigh quotient; max = largest gen. eigenvalue
Optimal w (2-class)	w* = S_W⁻¹(m₁−m₂)	Closed form! No iterative optimization
Equivalence	Same direction as Gaussian LDA	Distribution-free justification for LDA
Multi-class	Top K−1 eigenvectors of S_W⁻¹S_B	Supervised dimensionality reduction
Geometry	Whiten by S_W, then project onto means	Simultaneous diagonalization of S_B, S_W

Connections

• Lecture 14: Bayes classifiers — the probability-based framework Fisher avoids

• Lecture 15: LDA/QDA — the Gaussian classifiers that give the same direction as Fisher

• Lecture 9: Distance-based classification — Fisher reduces dimensions before computing distances

Limitations

• At most K−1 dimensions. With 2 classes, you get one direction. If the optimal boundary is nonlinear, Fisher's linear projection loses information.

• S_W must be invertible. Since rank(S_W) ≤ n − K, this requires at least n ≥ N + K samples (and in practice many more samples than features). Otherwise, regularize: S_W + εI.

• Only captures first/second order stats. If classes differ in higher moments (skewness, kurtosis) but have the same mean and variance structure, Fisher can't help.
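The invertibility limitation and the S_W + εI fix can be illustrated with a small-sample sketch. The data are hypothetical, and the particular choice of ε (a small multiple of the average eigenvalue of S_W) is an assumed heuristic, not a recommendation from the lecture. In practice ε is usually tuned, for example by cross-validation.

```python
import numpy as np

rng = np.random.default_rng(6)

# Hypothetical small-sample setting: N = 50 features, only 10 samples per class.
N = 50
X1 = rng.normal(size=(10, N))
X2 = rng.normal(size=(10, N)) + 0.5

m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
S_W = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)

rank = np.linalg.matrix_rank(S_W)
print(rank, "<", N)                          # rank is at most n - 2 = 18, so S_W is singular

eps = 1e-2 * np.trace(S_W) / N               # assumed heuristic for the ridge term
w_reg = np.linalg.solve(S_W + eps * np.eye(N), m1 - m2)
print(np.all(np.isfinite(w_reg)))
```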

"The best projection for classification is not the same as the best projection for visualization." PCA finds directions of maximum variance; Fisher finds directions of maximum discrimination. A high-variance direction with both classes mixed is useless for classification. Always use labels when you have them.

Fisher's discriminant requires no Gaussian assumption, yet gives the same direction as LDA. The practical implication is:

• You should always assume Gaussian distributions

• The LDA projection direction is sensible even for non-Gaussian data, because it maximizes a distribution-free separability criterion

• Fisher and LDA give different results in practice
