EE269 Lecture 15 — Mert Pilanci, Stanford

Autocorrelation, LDA & QDA

When signals are random processes — characterizing them with autocorrelation, then building optimal Gaussian classifiers.

Prerequisites: EE269 Lecture 14 (Bayes classifiers) + Linear algebra (matrix inverse). That's it.
8
Chapters
6+
Simulations
0
Assumed Knowledge

Chapter 0: Random Signals

Every time you record the same vowel "ah," you get a different waveform. The pitch varies slightly, the amplitude fluctuates, the exact timing is never the same. Yet something is consistent — the overall "character" of the signal. How do we mathematically describe a signal that's different every time but statistically consistent?

A random process (or stochastic process) x[n] is not a single signal — it's a collection of all possible signals that could be produced by the same source. Each individual realization is called a sample path. The random process is defined by the statistics that all sample paths share.

First-Order Statistics: The Mean

The mean function mx[n] = E[x[n]] tells us the average value of the signal at each time index n. For the vowel "ah," the mean might oscillate at the fundamental frequency.

mx[n] = E[x[n]] = ∫ x · px[n](x) dx

Second-Order Statistics: The Covariance

The covariance function tells us how the signal at time k relates to the signal at time l:

Σx[k, l] = E[(x[k] − mx[k])(x[l] − mx[l])*]

When k = l, this is the variance at time k. When k ≠ l, it measures the correlation between different time points.

If you stack N samples into a vector x = [x[0], x[1], ..., x[N-1]]T, the covariance function becomes an N × N covariance matrix Σx where the (k,l)-th entry is Σx[k,l].

The challenge: A general covariance matrix for N samples has N2 entries to estimate. For a 1-second signal at 16 kHz, that's 160002 = 256 million parameters — far more than we could ever estimate from data. We need structure. That's where stationarity comes in.
Random Process: Multiple Realizations

Each click generates a new realization of the same random process (a sinusoid with random amplitude and phase plus noise). Notice: each is different, but the statistical character is the same.

Noise level σ 0.5
Why does a general covariance matrix for N=16000 samples have too many parameters?

Chapter 1: Stationarity & Autocorrelation

A random process is Wide-Sense Stationary (WSS) if two conditions hold:

1. Constant mean: mx[n] = mx for all n (the average doesn't change over time)

2. Covariance depends only on lag: Σx[k, l] depends only on the difference k − l, not on k and l individually

When these hold, the covariance matrix has a special structure called Toeplitz: every diagonal has the same value. Instead of N2 free parameters, we only need N.

Autocorrelation Function

For a WSS process, define the autocorrelation function:

rx[m] = E[x[n] · x*[n − m]]

This measures how similar the signal is to a time-shifted version of itself. Properties:

• rx[0] = E[|x[n]|2] = average power (always real, ≥ 0)

• rx[−m] = rx*[m] (conjugate symmetric)

• |rx[m]| ≤ rx[0] (maximum at zero lag)

Example: Sinusoid with Random Phase

x[n] = A · sin(nω0 + φ) where A is a constant amplitude and φ ~ Uniform[0, 2π).

Mean: mx[n] = E[A sin(nω0 + φ)] = A · E[sin(nω0 + φ)] = 0 (the uniform phase averages out).

Autocorrelation:

rx[m] = E[A sin(nω0 + φ) · A sin((n−m)ω0 + φ)]
= (A2/2) E[cos(mω0) − cos((2n−m)ω0 + 2φ)]
= (A2/2) cos(mω0)

The second cosine term vanishes because E[cos(... + 2φ)] = 0 when φ is uniform on [0, 2π).

Key insight: The autocorrelation of a sinusoid is itself a cosine at the same frequency! It reveals the periodicity of the signal regardless of the random phase. This is why autocorrelation is used for pitch detection — it finds periodicity hidden in noise.

Example: White Noise

White noise w[n] with variance σ2 has the simplest possible autocorrelation:

rw[m] = σ2 · δ[m]

where δ[m] = 1 if m = 0, else 0. Adjacent samples are completely uncorrelated. The covariance matrix is Σw = σ2I (a scaled identity).

Toeplitz Structure

For a WSS process with N samples, the covariance matrix is:

Σx = [rx[k−l]] = Toeplitz(rx[0], rx[1], ..., rx[N−1])

Worked example with N = 4, rx[0] = 2, rx[1] = 1, rx[2] = 0.5, rx[3] = 0.25:

matrix
Σ = | 2.00  1.00  0.50  0.25 |
    | 1.00  2.00  1.00  0.50 |
    | 0.50  1.00  2.00  1.00 |
    | 0.25  0.50  1.00  2.00 |

Only 4 unique values instead of 16! This dramatic parameter reduction is why stationarity matters for signal processing and classification.

Autocorrelation of WSS Processes

Choose a signal type and see its autocorrelation. Note: sinusoids have periodic autocorrelation, noise has an impulse, and exponential decay indicates a lowpass process.

Frequency ω0 0.50
The autocorrelation rx[m] of a WSS process always has its maximum at:

Chapter 2: Power Spectral Density

The autocorrelation tells us about temporal correlations. Its Fourier transform tells us about the frequency content — this is the Power Spectral Density (PSD).

Wiener-Khinchin Theorem

For a WSS process, the PSD is the DTFT of the autocorrelation:

Sx(ω) = ∑m=−∞ rx[m] e−jωm

Properties of the PSD:

• Sx(ω) ≥ 0 for all ω (power can't be negative)

• Sx(ω) is real-valued (since rx[m] is conjugate-symmetric)

• Total power = (1/2π) ∫ Sx(ω) dω = rx[0]

Physical meaning: Sx(ω)dω is the average power contributed by frequencies in the band [ω, ω + dω]. A peak in the PSD means that frequency is strongly present in the signal. White noise has Sw(ω) = σ2 (flat everywhere — equal power at all frequencies, hence "white" like white light).

Examples

Processrx[m]Sx(ω)Shape
White noiseσ2δ[m]σ2Flat
Sinusoid(A2/2)cos(mω0)(πA2/2)[δ(ω−ω0) + δ(ω+ω0)]Two impulses
AR(1)σ2a|m|/(1−a2)σ2/|1 − ae−jω|2Lowpass (if 0<a<1)

Worked Example: AR(1) Process

x[n] = 0.9 · x[n−1] + w[n] where w[n] ~ N(0, 1).

• Coefficient a = 0.9, σw2 = 1

• rx[m] = (1/(1−0.81)) · 0.9|m| = 5.26 · 0.9|m|

• rx[0] = 5.26 (total power)

• rx[1] = 4.74, rx[5] = 3.11 (slowly decaying — strong temporal correlation)

• Sx(ω) = 1/|1 − 0.9e−jω|2 peaks at ω = 0 (lowpass)

Contrast with a = −0.9: rx[m] = 5.26 · (−0.9)|m| alternates sign, and Sx(ω) peaks at ω = π (highpass).

Power Spectral Density of AR(1)

An AR(1) process x[n] = a·x[n-1] + w[n]. Adjust 'a' to see how the PSD shape changes. Positive a = lowpass, negative a = highpass.

AR coefficient a 0.80
White noise has a flat PSD Sw(ω) = σ2. What does this mean physically?

Chapter 3: Gaussian Classification

Now we connect random processes to classification. In Lecture 14, the Bayes classifier was f*(x) = argmaxk πk gk(x). The most important case in practice: Gaussian class-conditionals.

Multivariate Gaussian

Class k has distribution N(μk, Σk):

gk(x) = (2π)−N/2k|−1/2 exp(−(1/2)(x − μk)T Σk−1 (x − μk))

The Mahalanobis distance dk(x) = (x − μk)T Σk−1 (x − μk) generalizes Euclidean distance by accounting for the covariance structure. If features are correlated or have different scales, Mahalanobis handles it.

Log-Posterior for Gaussian Classes

Taking the log of πk gk(x) and dropping terms constant across all k:

log(πk gk(x)) = log πk − (1/2) log|Σk| − (1/2)(x − μk)T Σk−1 (x − μk)

This is the discriminant function δk(x). The Bayes classifier assigns x to the class with the largest δk(x).

Three cases emerge: The form of the decision boundary depends on whether the covariance matrices are the same or different across classes.
Shared Σ: boundaries are linear → LDA
Different Σk: boundaries are quadratic → QDA
Σ = σ2I: boundaries are perpendicular bisectors → nearest centroid classifier

Why Gaussian?

In signal processing, Gaussian models arise naturally from:

Central Limit Theorem: Sums of many independent effects → Gaussian

Thermal noise: Physical noise sources are well-modeled as Gaussian

Maximum entropy: Given only mean and covariance constraints, the Gaussian maximizes uncertainty (most "honest" distribution)

Mathematical tractability: Closed-form inverses, determinants, and Bayes classifiers

2D Gaussian Contours

Two Gaussian classes with different covariance structures. Contours of equal density (Mahalanobis distance) are ellipses. The Bayes boundary depends on whether covariances match.

The Mahalanobis distance (x−μ)TΣ−1(x−μ) differs from Euclidean distance by:

Chapter 4: LDA Derivation

Linear Discriminant Analysis (LDA) is the Bayes classifier when all classes share the same covariance Σ1 = Σ2 = ... = ΣK = Σ.

Derivation for Two Classes

Start from the log-likelihood ratio. With shared Σ:

log(π1g1(x)) − log(π2g2(x))
= log(π12) − (1/2)(x−μ1)TΣ−1(x−μ1) + (1/2)(x−μ2)TΣ−1(x−μ2)

The |Σ| terms cancel (same covariance). Expanding the quadratics:

= log(π12) − (1/2)[xTΣ−1x − 2μ1TΣ−1x + μ1TΣ−1μ1] + (1/2)[xTΣ−1x − 2μ2TΣ−1x + μ2TΣ−1μ2]

The xTΣ−1x terms cancel! We're left with:

= (μ1 − μ2)TΣ−1x − (1/2)(μ1TΣ−1μ1 − μ2TΣ−1μ2) + log(π12)

Declare class 1 if this is ≥ 0. Define the LDA weight vector:

w = Σ−11 − μ2)

The classifier is: assign x to class 1 if wTx ≥ threshold, else class 2.

LDA weight vector: w = Σ−11 − μ2). This is the direction that maximally separates the two class means while accounting for the within-class spread. If Σ = σ2I, then w = (μ1 − μ2)/σ2, which points from one mean to the other. But if the classes are elongated, Σ−1 rotates w to avoid projecting along the high-variance direction.

Worked Example (2D)

μ1 = [0, 0]T, μ2 = [2, 1]T, Σ = [[2, 1], [1, 2]].

Step 1: Compute Σ−1.

• det(Σ) = 4 − 1 = 3

• Σ−1 = (1/3)[[2, −1], [−1, 2]]

Step 2: Compute w = Σ−11 − μ2) = (1/3)[[2,−1],[−1,2]] · [−2, −1]T

• w = (1/3)[−4+1, 2−2]T = [−1, 0]T

Step 3: The decision is wTx ≥ threshold:

• Threshold = (1/2) wT1 + μ2) = (1/2)[−1, 0] · [2, 1]T = −1

• Classify as class 1 if −x1 ≥ −1, i.e., x1 ≤ 1

The boundary is a vertical line at x1 = 1. Even though the means differ in both coordinates, the covariance structure tells us that the x2 direction isn't informative (high correlation makes it redundant).

Interactive LDA Classifier

Two Gaussian classes with shared covariance. Drag the sliders to adjust means and correlation. The LDA boundary (solid line) is always linear. Samples are drawn from each class.

μ2 x-component 2.5
μ2 y-component 1.0
Correlation ρ 0.50
Variance σ2 1.0
In LDA, why does the quadratic term xTΣ−1x cancel between the two classes?

Chapter 5: QDA

When classes have different covariance matrices Σ1 ≠ Σ2, the xTΣ−1x terms DON'T cancel. The discriminant function becomes:

δk(x) = −(1/2)(x−μk)TΣk−1(x−μk) − (1/2)log|Σk| + log πk

The boundary δ1(x) = δ2(x) is a quadratic in x. This gives Quadratic Discriminant Analysis (QDA) with elliptical, parabolic, or hyperbolic boundaries.

Expanding QDA Boundary

Setting δ1(x) = δ2(x):

xT2−1 − Σ1−1)x − 2(μ2TΣ2−1 − μ1TΣ1−1)x + const = 0

The xTAx term makes it quadratic. When Σ1 = Σ2, the matrix A = Σ2−1 − Σ1−1 = 0, and we recover LDA.

Worked Example (2D)

Class 1: μ1 = [0,0]T, Σ1 = [[1, 0], [0, 4]] (tall ellipse)

Class 2: μ2 = [3,0]T, Σ2 = [[4, 0], [0, 1]] (wide ellipse)

Equal priors π1 = π2 = 0.5.

Class 1 is "tall" (spread in y), class 2 is "wide" (spread in x). A point with large |y| is more consistent with class 1. A point far from the origin in x is more consistent with class 2. The boundary will curve to reflect this.

At point x = [1.5, 2]T:

• Mahalanobis to μ1: [1.5, 2] [[1,0],[0,0.25]] [1.5, 2]T = 2.25 + 1 = 3.25

• Mahalanobis to μ2: [−1.5, 2] [[0.25,0],[0,1]] [−1.5, 2]T = 0.5625 + 4 = 4.5625

• Log-det correction: (1/2)(log(4)−log(4)) = 0 (both have det=4)

• Class 1 wins (lower Mahalanobis distance). Even though x1 = 1.5 is closer to μ2 in Euclidean terms, the large y = 2 is more consistent with class 1's vertical spread.

LDA vs QDA tradeoff:
LDA: N·K + N(N+1)/2 parameters (K means + one shared covariance). Low variance, high bias if covariances truly differ.
QDA: N·K + K·N(N+1)/2 parameters (K means + K covariances). More flexible, but needs more data to estimate reliably.
Rule of thumb: use LDA when training size per class < 5N, QDA when > 10N.
QDA: Quadratic Decision Boundaries

Two classes with different covariance structures. The QDA boundary is curved — it can be an ellipse, hyperbola, or parabola depending on the covariance matrices.

Class 1 variance x 1.0
Class 1 variance y 2.5
Class 2 variance x 2.5
Class 2 variance y 1.0
QDA has quadratic boundaries because:

Chapter 6: Scaled Identity Case

The simplest (and most intuitive) special case: when the shared covariance is a scaled identity Σ = σ2I. This means all features have equal variance and are uncorrelated.

The Nearest Centroid Classifier

With Σ = σ2I, the LDA weight vector becomes:

w = (σ2I)−11 − μ2) = (1/σ2)(μ1 − μ2)

And the Mahalanobis distance reduces to scaled Euclidean distance:

(x−μk)T2I)−1(x−μk) = ||x − μk||2 / σ2

The classifier becomes: assign x to the class with the nearest mean (centroid). No matrix inversions needed!

Nearest centroid classifier: When Σ = σ2I with equal priors, the Bayes-optimal classifier just computes ||x − μk||2 for each k and picks the smallest. The boundary between two classes is the perpendicular bisector of the line segment connecting their means.

Geometric Interpretation

The decision boundary between class 1 and class 2 satisfies:

||x − μ1||2 = ||x − μ2||2

Expanding:

xTx − 2μ1Tx + μ1Tμ1 = xTx − 2μ2Tx + μ2Tμ2
1 − μ2)Tx = (||μ1||2 − ||μ2||2) / 2

This is a hyperplane with normal vector (μ1 − μ2) passing through the midpoint (μ1 + μ2)/2. It's literally the perpendicular bisector.

When Priors Are Unequal

With unequal priors, the boundary shifts toward the less likely class:

||x − μ1||2 − ||x − μ2||2 ≤ 2σ2 log(π12)

The boundary is still a hyperplane (linear), just shifted from the midpoint.

Connection to Signal Detection

This is exactly the signal detection problem from Lecture 14! With H1: x ~ N(0, σ2I) and H2: x ~ N(μ, σ2I), the optimal test was xTμ ≥ ||μ||2/2. That's the perpendicular bisector between 0 and μ. Full circle.

Nearest Centroid vs LDA with Correlation

Compare Σ = σ2I (perpendicular bisector) vs Σ with correlation (tilted boundary). When features are correlated, the boundary rotates away from the perpendicular bisector.

Correlation ρ 0.00
When Σ = σ2I (equal variance, no correlation), the Bayes decision boundary between two classes is:

Chapter 7: Mastery

The Classification Hierarchy

MethodCovariance AssumptionBoundary ShapeParameters
Nearest CentroidΣ = σ2IPerpendicular bisectorK·N + 1
LDAΣ1 = Σ2 = ΣLinear (hyperplane)K·N + N(N+1)/2
QDAΣk differentQuadratic (conic)K·N + K·N(N+1)/2
Full BayesNon-Gaussian gkArbitrary

Signal Processing Connection

Random Process
Multiple realizations of same source
WSS Assumption
Σ is Toeplitz → only N autocorrelation values needed
Gaussian Model
Each class: N(μk, Σk)
Bayes Classifier
LDA (shared Σ) or QDA (different Σk)

What Comes Next

Lecture 14: The Bayes framework and ROC curves (prerequisite)

Lecture 16: Fisher's discriminant — finding the best projection WITHOUT assuming Gaussianity

Practical Considerations

Estimation: In practice, μk and Σk are estimated from training data. Sample mean x̄k and sample covariance Sk.

Regularization: When N > number of training samples, Σ is singular. Common fix: Σreg = (1−λ)Σ + λσ2I (shrink toward identity).

Dimensionality reduction: Before LDA/QDA, reduce dimensionality (PCA, MFCC, wavelets) to make covariance estimation feasible.

"The goal of a good feature representation is to make simple classifiers work well." If your features have equal variance and are uncorrelated, the nearest centroid classifier is optimal — no fancy algorithm needed. Good feature engineering makes the simple model correct.
Going from LDA to QDA, what changes and what stays the same?