EE269 Lecture 15 — Autocorrelation, LDA & QDA

Chapter 0: Random Signals

Every time you record the same vowel "ah," you get a different waveform. The pitch varies slightly, the amplitude fluctuates, the exact timing is never the same. Yet something is consistent — the overall "character" of the signal. How do we mathematically describe a signal that's different every time but statistically consistent?

A random process (or stochastic process) x[n] is not a single signal — it's a collection of all possible signals that could be produced by the same source. Each individual realization is called a sample path. The random process is defined by the statistics that all sample paths share.

First-Order Statistics: The Mean

The mean function m_x[n] = E[x[n]] tells us the average value of the signal at each time index n. For the vowel "ah," the mean might oscillate at the fundamental frequency.

m_x[n] = E[x[n]] = ∫ x · p_{x[n]}(x) dx

Second-Order Statistics: The Covariance

The covariance function tells us how the signal at time k relates to the signal at time l:

Σ_x[k, l] = E[(x[k] − m_x[k])(x[l] − m_x[l])*]

When k = l, this is the variance at time k. When k ≠ l, it measures the correlation between different time points.

If you stack N samples into a vector x = [x[0], x[1], ..., x[N-1]]^T, the covariance function becomes an N × N covariance matrix Σ_x where the (k,l)-th entry is Σ_x[k,l].
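As a rough illustration of the stacking idea, the sketch below estimates the mean function and the N × N covariance matrix from many realizations. The process (a random-phase sinusoid plus white Gaussian noise), the sizes N and M, and every variable name are assumptions made for this example, not values from the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)

N = 8         # samples per realization (assumed, kept small for display)
M = 5000      # number of independent realizations (assumed)
omega0 = 0.5  # sinusoid frequency in rad/sample (assumed)
sigma = 0.5   # noise standard deviation (assumed)

n = np.arange(N)
phi = rng.uniform(0.0, 2.0 * np.pi, size=(M, 1))   # one random phase per realization
X = np.sin(omega0 * n + phi) + sigma * rng.standard_normal((M, N))  # rows = sample paths

m_hat = X.mean(axis=0)               # estimate of m_x[n], one value per time index
Sigma_hat = np.cov(X, rowvar=False)  # N x N estimate of the covariance matrix

print(m_hat.round(2))
print(Sigma_hat.shape)
```

Even this toy case needs thousands of realizations to pin down an 8 × 8 matrix, which hints at why the unstructured estimate becomes impractical for large N.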

The challenge: A general covariance matrix for N samples has N² entries (N(N+1)/2 distinct ones, since the matrix is symmetric). For a 1-second signal at 16 kHz, that's 16000² = 256 million entries, about 128 million of them distinct, far more than we could ever estimate from data. We need structure. That's where stationarity comes in.

Random Process: Multiple Realizations

Each click generates a new realization of the same random process (a sinusoid with random amplitude and phase plus noise). Notice: each is different, but the statistical character is the same.

Why does a general covariance matrix for N=16000 samples have too many parameters?

• Because the mean is unknown
• Because it has N² = 256 million entries, far more than available training data
• Because the signal is continuous, not discrete

Chapter 1: Stationarity & Autocorrelation

A random process is Wide-Sense Stationary (WSS) if two conditions hold:

1. Constant mean: m_x[n] = m_x for all n (the average doesn't change over time)

2. Covariance depends only on lag: Σ_x[k, l] depends only on the difference k − l, not on k and l individually

When these hold, the covariance matrix has a special structure called Toeplitz: the entries are constant along each diagonal, because entry (k, l) depends only on k − l. Instead of N² free parameters, we only need N.

Autocorrelation Function

For a WSS process, define the autocorrelation function:

r_x[m] = E[x[n] · x*[n − m]]

This measures how similar the signal is to a time-shifted version of itself. Properties:

• r_x[0] = E[|x[n]|²] = average power (always real, ≥ 0)

• r_x[−m] = r_x*[m] (conjugate symmetric)

• |r_x[m]| ≤ r_x[0] (maximum at zero lag)

Example: Sinusoid with Random Phase

x[n] = A · sin(nω₀ + φ) where A is a constant amplitude and φ ~ Uniform[0, 2π).

Mean: m_x[n] = E[A sin(nω₀ + φ)] = A · E[sin(nω₀ + φ)] = 0 (the uniform phase averages out).

Autocorrelation:

r_x[m] = E[A sin(nω₀ + φ) · A sin((n−m)ω₀ + φ)]

= (A²/2) E[cos(mω₀) − cos((2n−m)ω₀ + 2φ)]

= (A²/2) cos(mω₀)

The second cosine term vanishes because E[cos(... + 2φ)] = 0 when φ is uniform on [0, 2π).

Key insight: The autocorrelation of a sinusoid is itself a cosine at the same frequency! It reveals the periodicity of the signal regardless of the random phase. This is why autocorrelation is used for pitch detection — it finds periodicity hidden in noise.
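The derivation above can be checked numerically. The sketch below is a Monte Carlo estimate of E[x[n] · x[n − m]] over many random phases, compared against (A²/2)cos(mω₀). The amplitude, frequency, time index, and number of trials are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

A = 2.0        # amplitude (assumed)
omega0 = 0.3   # frequency in rad/sample (assumed)
n = 10         # any fixed time index; the result should not depend on it
lags = np.arange(0, 6)
M = 200_000    # number of random phases drawn

phi = rng.uniform(0.0, 2.0 * np.pi, size=M)

def x(k):
    """Value of every realization at time index k."""
    return A * np.sin(k * omega0 + phi)

r_mc = np.array([np.mean(x(n) * x(n - m)) for m in lags])  # ensemble average
r_theory = (A**2 / 2.0) * np.cos(lags * omega0)

print(np.round(r_mc, 3))
print(np.round(r_theory, 3))
```

Changing n should leave the estimate essentially unchanged, which is the lag-only dependence that WSS requires.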

Example: White Noise

White noise w[n] with variance σ² has the simplest possible autocorrelation:

r_w[m] = σ² · δ[m]

where δ[m] = 1 if m = 0, else 0. Samples at any two different times are uncorrelated, not just adjacent ones. Since white noise is taken to be zero-mean, its covariance matrix is Σ_w = σ²I (a scaled identity).
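In practice only one long record is usually available, so the expectation is replaced by a time average. The sketch below applies a simple (biased) sample autocorrelation estimator to simulated white noise. It assumes Gaussian noise and that time averages stand in for ensemble averages; the function name `sample_autocorr` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)

sigma = 2.0     # noise standard deviation (assumed)
N = 100_000     # record length (assumed)
w = sigma * rng.standard_normal(N)

def sample_autocorr(x, max_lag):
    """Biased time-average estimate of r_x[m] for m = 0..max_lag (real signals)."""
    L = len(x)
    return np.array([np.dot(x[m:], x[:L - m]) / L for m in range(max_lag + 1)])

r_hat = sample_autocorr(w, max_lag=4)
print(np.round(r_hat, 2))   # first entry near sigma**2, the rest near 0
```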

Toeplitz Structure

For a zero-mean WSS process with N samples, covariance and autocorrelation coincide, so the covariance matrix is:

Σ_x = [r_x[k−l]] = Toeplitz(r_x[0], r_x[1], ..., r_x[N−1])

Worked example with N = 4, r_x[0] = 2, r_x[1] = 1, r_x[2] = 0.5, r_x[3] = 0.25:

Σ = | 2.00  1.00  0.50  0.25 |
    | 1.00  2.00  1.00  0.50 |
    | 0.50  1.00  2.00  1.00 |
    | 0.25  0.50  1.00  2.00 |

Only 4 unique values instead of 16! This dramatic parameter reduction is why stationarity matters for signal processing and classification.
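A minimal sketch of how such a matrix could be built from the autocorrelation values, assuming a real-valued r_x so that r_x[−m] = r_x[m]. The helper name `toeplitz_from_autocorr` is made up for this example.

```python
import numpy as np

def toeplitz_from_autocorr(r):
    """Build the N x N matrix whose (k, l) entry is r[|k - l|] (real-valued r)."""
    r = np.asarray(r, dtype=float)
    idx = np.arange(len(r))
    return r[np.abs(idx[:, None] - idx[None, :])]

Sigma = toeplitz_from_autocorr([2.0, 1.0, 0.5, 0.25])
print(Sigma)
```

SciPy's `scipy.linalg.toeplitz` builds the same kind of matrix; plain NumPy indexing is used here only to keep the example dependency-light.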

Autocorrelation of WSS Processes

Choose a signal type and see its autocorrelation. Note: sinusoids have periodic autocorrelation, noise has an impulse, and exponential decay indicates a lowpass process.

The autocorrelation r_x[m] of a WSS process always has its maximum at:

• Lag m = 0, because r_x[0] = average power, and |r_x[m]| ≤ r_x[0]
• Lag m = 1, because adjacent samples are most correlated
• It depends on the signal type

Chapter 2: Power Spectral Density

The autocorrelation tells us about temporal correlations. Its Fourier transform tells us about the frequency content — this is the Power Spectral Density (PSD).

Wiener-Khinchin Theorem

For a WSS process, the PSD is the DTFT of the autocorrelation:

S_x(ω) = ∑_{m=−∞}^{∞} r_x[m] e^{−jωm}

Properties of the PSD:

• S_x(ω) ≥ 0 for all ω (power can't be negative)

• S_x(ω) is real-valued (since r_x[m] is conjugate-symmetric)

• Total power = (1/2π) ∫_{−π}^{π} S_x(ω) dω = r_x[0]

Physical meaning: S_x(ω)dω is the average power contributed by frequencies in the band [ω, ω + dω]. A peak in the PSD means that frequency is strongly present in the signal. White noise has S_w(ω) = σ² (flat everywhere — equal power at all frequencies, hence "white" like white light).

Examples

Process	r_x[m]	S_x(ω)	Shape
White noise	σ²δ[m]	σ²	Flat
Sinusoid	(A²/2)cos(mω₀)	(πA²/2)[δ(ω−ω₀) + δ(ω+ω₀)]	Two impulses
AR(1)	σ²a^|m|/(1−a²)	σ²/|1 − ae^{−jω}|²	Lowpass (if 0<a<1)

Worked Example: AR(1) Process

x[n] = 0.9 · x[n−1] + w[n] where w[n] ~ N(0, 1).

• Coefficient a = 0.9, σ_w² = 1

• r_x[m] = (1/(1−0.81)) · 0.9^|m| = 5.26 · 0.9^|m|

• r_x[0] = 5.26 (total power)

• r_x[1] = 4.74, r_x[5] = 3.11 (slowly decaying — strong temporal correlation)

• S_x(ω) = 1/|1 − 0.9e^{−jω}|² peaks at ω = 0 (lowpass)

Contrast with a = −0.9: r_x[m] = 5.26 · (−0.9)^|m| alternates sign, and S_x(ω) peaks at ω = π (highpass).
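The numbers in this worked example can be reproduced with a short script. The sketch below evaluates the closed-form autocorrelation, sums the Wiener-Khinchin series over a truncated lag range, and compares it with the closed-form PSD. The truncation at |m| ≤ 400 and the 513-point frequency grid are choices made for the example.

```python
import numpy as np

a = 0.9
sigma_w2 = 1.0

def r_x(m):
    """Closed-form AR(1) autocorrelation from the table above."""
    return sigma_w2 * a ** np.abs(m) / (1.0 - a**2)

print(round(float(r_x(0)), 2), round(float(r_x(1)), 2), round(float(r_x(5)), 2))  # 5.26 4.74 3.11

# Wiener-Khinchin: sum r_x[m] e^{-j omega m} over a truncated lag range.
lags = np.arange(-400, 401)               # a**400 is negligible, so truncation is harmless
omega = np.linspace(-np.pi, np.pi, 513)
S_sum = np.real(np.exp(-1j * np.outer(omega, lags)) @ r_x(lags))
S_closed = sigma_w2 / np.abs(1.0 - a * np.exp(-1j * omega)) ** 2

print(np.allclose(S_sum, S_closed))       # series and closed form agree
peak_omega = omega[np.argmax(S_closed)]   # location of the PSD peak
power = S_closed[:-1].mean()              # (1/2pi) * integral over one period
print(peak_omega, power)
```

Setting `a = -0.9` in the same script moves the peak to the edge of the grid at ω = ±π, matching the highpass case.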

Power Spectral Density of AR(1)

An AR(1) process x[n] = a·x[n-1] + w[n]. Adjust 'a' to see how the PSD shape changes. Positive a = lowpass, negative a = highpass.

White noise has a flat PSD S_w(ω) = σ². What does this mean physically?

• Equal average power at every frequency: no frequency is preferred over another
• The signal has zero power
• Adjacent samples are perfectly correlated

Chapter 3: Gaussian Classification

Now we connect random processes to classification. In Lecture 14, the Bayes classifier was f*(x) = argmax_k π_k g_k(x). The most important case in practice: Gaussian class-conditionals.

Multivariate Gaussian

Class k has distribution N(μ_k, Σ_k):

g_k(x) = (2π)^{−N/2} |Σ_k|^{−1/2} exp(−(1/2)(x − μ_k)^T Σ_k⁻¹ (x − μ_k))

The squared Mahalanobis distance d_k²(x) = (x − μ_k)^T Σ_k⁻¹ (x − μ_k) generalizes squared Euclidean distance by accounting for the covariance structure: it rescales features that have different variances and decorrelates features that are correlated.

Log-Posterior for Gaussian Classes

Taking the log of π_k g_k(x) and dropping the term −(N/2) log(2π), which is the same for every k:

δ_k(x) = log π_k − (1/2) log|Σ_k| − (1/2)(x − μ_k)^T Σ_k⁻¹ (x − μ_k)

This is the discriminant function δ_k(x). The Bayes classifier assigns x to the class with the largest δ_k(x).
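A minimal sketch of this discriminant as code, assuming the class means, covariances, and priors are already known. The function names and the two-class example values are hypothetical.

```python
import numpy as np

def gaussian_discriminant(x, mu, Sigma, prior):
    """delta_k(x) = log(prior) - 0.5*log|Sigma| - 0.5*(x-mu)^T Sigma^{-1} (x-mu)."""
    diff = np.asarray(x, dtype=float) - np.asarray(mu, dtype=float)
    maha_sq = diff @ np.linalg.solve(Sigma, diff)   # avoids forming Sigma^{-1} explicitly
    sign, logdet = np.linalg.slogdet(Sigma)         # numerically stable log-determinant
    return np.log(prior) - 0.5 * logdet - 0.5 * maha_sq

def classify(x, mus, Sigmas, priors):
    """Index of the class with the largest discriminant."""
    scores = [gaussian_discriminant(x, m, S, p) for m, S, p in zip(mus, Sigmas, priors)]
    return int(np.argmax(scores))

# Hypothetical two-class example (values chosen only for illustration)
mus = [np.array([0.0, 0.0]), np.array([3.0, 0.0])]
Sigmas = [np.eye(2), np.eye(2)]
priors = [0.5, 0.5]
print(classify([0.5, 1.0], mus, Sigmas, priors))   # 0: nearer to the first mean
```

Using `solve` and `slogdet` rather than an explicit inverse and determinant is a numerical-stability choice, not something the derivation requires.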

Three cases emerge: The form of the decision boundary depends on whether the covariance matrices are the same or different across classes.
• Shared Σ: boundaries are linear → LDA
• Different Σ_k: boundaries are quadratic → QDA
• Σ = σ²I: boundaries are perpendicular bisectors (for equal priors) → nearest centroid classifier

Why Gaussian?

In signal processing, Gaussian models arise naturally from:

• Central Limit Theorem: Sums of many independent effects → Gaussian

• Thermal noise: Physical noise sources are well-modeled as Gaussian

• Maximum entropy: Given only mean and covariance constraints, the Gaussian maximizes uncertainty (most "honest" distribution)

• Mathematical tractability: Closed-form inverses, determinants, and Bayes classifiers

2D Gaussian Contours

Two Gaussian classes with different covariance structures. Contours of equal density (Mahalanobis distance) are ellipses. The Bayes boundary depends on whether covariances match.

The Mahalanobis distance (x−μ)^TΣ⁻¹(x−μ) differs from Euclidean distance by:

• Taking the square root
• Using only the diagonal of Σ
• Weighting by the inverse covariance, accounting for correlations and different scales

Chapter 4: LDA Derivation

Linear Discriminant Analysis (LDA) is the Bayes classifier when all classes share the same covariance Σ₁ = Σ₂ = ... = Σ_K = Σ.

Derivation for Two Classes

Start from the log-posterior ratio (the log-likelihood ratio plus the log prior ratio). With shared Σ:

log(π₁g₁(x)) − log(π₂g₂(x))

= log(π₁/π₂) − (1/2)(x−μ₁)^TΣ⁻¹(x−μ₁) + (1/2)(x−μ₂)^TΣ⁻¹(x−μ₂)

The |Σ| terms cancel (same covariance). Expanding the quadratics:

= log(π₁/π₂) − (1/2)[x^TΣ⁻¹x − 2μ₁^TΣ⁻¹x + μ₁^TΣ⁻¹μ₁] + (1/2)[x^TΣ⁻¹x − 2μ₂^TΣ⁻¹x + μ₂^TΣ⁻¹μ₂]

The x^TΣ⁻¹x terms cancel! We're left with:

= (μ₁ − μ₂)^TΣ⁻¹x − (1/2)(μ₁^TΣ⁻¹μ₁ − μ₂^TΣ⁻¹μ₂) + log(π₁/π₂)

Declare class 1 if this is ≥ 0. Define the LDA weight vector:

w = Σ⁻¹(μ₁ − μ₂)

The classifier is: assign x to class 1 if w^Tx ≥ threshold, else class 2, where threshold = (1/2)(μ₁^TΣ⁻¹μ₁ − μ₂^TΣ⁻¹μ₂) − log(π₁/π₂) = (1/2)w^T(μ₁ + μ₂) − log(π₁/π₂).

LDA weight vector: w = Σ⁻¹(μ₁ − μ₂). This is the direction that maximally separates the two class means while accounting for the within-class spread. If Σ = σ²I, then w = (μ₁ − μ₂)/σ², which points from one mean to the other. But if the classes are elongated, Σ⁻¹ rotates w to avoid projecting along the high-variance direction.

Worked Example (2D)

μ₁ = [0, 0]^T, μ₂ = [2, 1]^T, Σ = [[2, 1], [1, 2]].

Step 1: Compute Σ⁻¹.

• det(Σ) = 4 − 1 = 3

• Σ⁻¹ = (1/3)[[2, −1], [−1, 2]]

Step 2: Compute w = Σ⁻¹(μ₁ − μ₂) = (1/3)[[2,−1],[−1,2]] · [−2, −1]^T

• w = (1/3)[−4+1, 2−2]^T = [−1, 0]^T

Step 3: The decision is w^Tx ≥ threshold. Assuming equal priors, so that log(π₁/π₂) = 0:

• Threshold = (1/2) w^T(μ₁ + μ₂) = (1/2)[−1, 0] · [2, 1]^T = −1

• Classify as class 1 if −x₁ ≥ −1, i.e., x₁ ≤ 1

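The boundary is a vertical line at x₁ = 1. Even though the means differ in both coordinates, x₂ carries no extra information here: within each class, x₂ tends to follow x₁ with slope 1/2 (covariance 1 divided by variance 2), and the mean shift of 1 in x₂ is exactly what the shift of 2 in x₁ predicts. Once x₁ is known, x₂ is distributed identically under both classes.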
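The three steps can be reproduced numerically. The sketch below wraps the weight vector and threshold in a small helper (the name `lda_two_class` is hypothetical) and applies it to the example's means and covariance. Equal priors are the default, in which case the log-prior term is zero.

```python
import numpy as np

def lda_two_class(mu1, mu2, Sigma, pi1=0.5, pi2=0.5):
    """Return (w, threshold): decide class 1 when w @ x >= threshold."""
    mu1, mu2 = np.asarray(mu1, float), np.asarray(mu2, float)
    w = np.linalg.solve(Sigma, mu1 - mu2)                   # w = Sigma^{-1}(mu1 - mu2)
    threshold = 0.5 * w @ (mu1 + mu2) - np.log(pi1 / pi2)   # prior term vanishes if pi1 == pi2
    return w, threshold

Sigma = np.array([[2.0, 1.0], [1.0, 2.0]])
w, threshold = lda_two_class([0.0, 0.0], [2.0, 1.0], Sigma)
print(w, threshold)

x = np.array([0.5, 3.0])
print(w @ x >= threshold)   # True: x1 = 0.5 <= 1, so class 1 regardless of x2
```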

Interactive LDA Classifier

Two Gaussian classes with shared covariance. Drag the sliders to adjust means and correlation. The LDA boundary (solid line) is always linear. Samples are drawn from each class.

In LDA, why does the quadratic term x^TΣ⁻¹x cancel between the two classes?

• Because both classes share the same Σ, so the same quadratic appears in both discriminant functions and subtracts out
• Because the determinant is zero
• Because the priors are equal

Chapter 5: QDA

When classes have different covariance matrices Σ₁ ≠ Σ₂, the x^TΣ⁻¹x terms DON'T cancel. The discriminant function becomes:

δ_k(x) = −(1/2)(x−μ_k)^TΣ_k⁻¹(x−μ_k) − (1/2)log|Σ_k| + log π_k

The boundary δ₁(x) = δ₂(x) is a quadratic in x. This gives Quadratic Discriminant Analysis (QDA) with elliptical, parabolic, or hyperbolic boundaries.

Expanding QDA Boundary

Setting δ₁(x) = δ₂(x):

x^T(Σ₂⁻¹ − Σ₁⁻¹)x − 2(μ₂^TΣ₂⁻¹ − μ₁^TΣ₁⁻¹)x + const = 0

The x^TAx term makes it quadratic. When Σ₁ = Σ₂, the matrix A = Σ₂⁻¹ − Σ₁⁻¹ = 0, and we recover LDA.
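To make the "const" explicit, the sketch below computes coefficients A, b, c such that x^TAx + b^Tx + c equals 2(δ₁(x) − δ₂(x)), so the boundary is where the quadratic is zero and positive values favor class 1. The helper names and the example means, covariances, and priors are assumptions for illustration.

```python
import numpy as np

def qda_boundary_coeffs(mu1, Sigma1, pi1, mu2, Sigma2, pi2):
    """Coefficients (A, b, c) with x^T A x + b^T x + c = 2*(delta_1(x) - delta_2(x))."""
    P1, P2 = np.linalg.inv(Sigma1), np.linalg.inv(Sigma2)
    A = P2 - P1
    b = -2.0 * (P2 @ mu2 - P1 @ mu1)
    c = (mu2 @ P2 @ mu2 - mu1 @ P1 @ mu1
         + np.linalg.slogdet(Sigma2)[1] - np.linalg.slogdet(Sigma1)[1]
         + 2.0 * np.log(pi1 / pi2))
    return A, b, c

def delta(x, mu, Sigma, pi):
    """QDA discriminant for one class."""
    d = x - mu
    return -0.5 * d @ np.linalg.solve(Sigma, d) - 0.5 * np.linalg.slogdet(Sigma)[1] + np.log(pi)

# Hypothetical parameters, chosen only to exercise the formulas
rng = np.random.default_rng(4)
mu1, Sigma1, pi1 = np.array([0.0, 0.0]), np.array([[1.0, 0.3], [0.3, 2.0]]), 0.6
mu2, Sigma2, pi2 = np.array([2.0, 1.0]), np.array([[2.0, -0.5], [-0.5, 1.0]]), 0.4
A, b, c = qda_boundary_coeffs(mu1, Sigma1, pi1, mu2, Sigma2, pi2)

x = rng.standard_normal(2)
quad = x @ A @ x + b @ x + c
print(np.isclose(quad, 2.0 * (delta(x, mu1, Sigma1, pi1) - delta(x, mu2, Sigma2, pi2))))
```

Passing the same covariance for both classes makes A the zero matrix, which is the LDA case.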

Worked Example (2D)

Class 1: μ₁ = [0,0]^T, Σ₁ = [[1, 0], [0, 4]] (tall ellipse)

Class 2: μ₂ = [3,0]^T, Σ₂ = [[4, 0], [0, 1]] (wide ellipse)

Equal priors π₁ = π₂ = 0.5.

Class 1 is "tall" (spread in y), class 2 is "wide" (spread in x). A point with large |y| is more consistent with class 1. A point far from the origin in x is more consistent with class 2. The boundary will curve to reflect this.

At point x = [1.5, 2]^T:

• Mahalanobis to μ₁: [1.5, 2] [[1,0],[0,0.25]] [1.5, 2]^T = 2.25 + 1 = 3.25

• Mahalanobis to μ₂: [−1.5, 2] [[0.25,0],[0,1]] [−1.5, 2]^T = 0.5625 + 4 = 4.5625

• Log-det correction: (1/2)(log(4)−log(4)) = 0 (both have det=4)

• Class 1 wins (lower Mahalanobis distance). In Euclidean terms the point is equidistant from both means (distance 2.5 to each, since x₁ = 1.5 is midway between 0 and 3), but the large y = 2 is more consistent with class 1's vertical spread.
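The arithmetic at this test point can be verified with a few lines of NumPy. This is only a check of the worked example; the helper name `maha_sq` is made up.

```python
import numpy as np

mu1, Sigma1 = np.array([0.0, 0.0]), np.diag([1.0, 4.0])
mu2, Sigma2 = np.array([3.0, 0.0]), np.diag([4.0, 1.0])
x = np.array([1.5, 2.0])

def maha_sq(x, mu, Sigma):
    """Quadratic form (x - mu)^T Sigma^{-1} (x - mu)."""
    d = x - mu
    return d @ np.linalg.solve(Sigma, d)

d1, d2 = maha_sq(x, mu1, Sigma1), maha_sq(x, mu2, Sigma2)
logdet1, logdet2 = np.linalg.slogdet(Sigma1)[1], np.linalg.slogdet(Sigma2)[1]

# Full QDA discriminants with equal priors of 0.5
delta1 = -0.5 * d1 - 0.5 * logdet1 + np.log(0.5)
delta2 = -0.5 * d2 - 0.5 * logdet2 + np.log(0.5)

print(round(float(d1), 4), round(float(d2), 4))   # 3.25 4.5625
winner = 1 if delta1 > delta2 else 2
print(winner)                                     # 1
```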

LDA vs QDA tradeoff:
• LDA: N·K + N(N+1)/2 parameters (K means + one shared covariance). Low variance, high bias if covariances truly differ.
• QDA: N·K + K·N(N+1)/2 parameters (K means + K covariances). More flexible, but needs more data to estimate reliably.
Rule of thumb: use LDA when training size per class < 5N, QDA when > 10N.
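To get a feel for how quickly the two counts diverge, the sketch below evaluates both formulas for a few feature dimensions. The dimensions are arbitrary, and class priors are not counted, matching the formulas above.

```python
def lda_param_count(N, K):
    """K mean vectors plus one shared symmetric N x N covariance."""
    return N * K + N * (N + 1) // 2

def qda_param_count(N, K):
    """K mean vectors plus K symmetric N x N covariances."""
    return N * K + K * N * (N + 1) // 2

for N in (2, 13, 100):   # feature dimensions chosen only for illustration
    print(N, lda_param_count(N, K=2), qda_param_count(N, K=2))
```

With two classes and N = 100 the counts are 5250 versus 10300, and the gap widens further as K grows.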

QDA: Quadratic Decision Boundaries

Two classes with different covariance structures. The QDA boundary is curved — it can be an ellipse, hyperbola, or parabola depending on the covariance matrices.

QDA has quadratic boundaries because:

• The Gaussian density has a cubic term in x
• Different Σ_k means the x^TΣ⁻¹x terms DON'T cancel between classes
• We're using polynomial features

Chapter 6: Scaled Identity Case

The simplest (and most intuitive) special case: when the shared covariance is a scaled identity Σ = σ²I. This means all features have equal variance and are uncorrelated.

The Nearest Centroid Classifier

With Σ = σ²I, the LDA weight vector becomes:

w = (σ²I)⁻¹(μ₁ − μ₂) = (1/σ²)(μ₁ − μ₂)

And the Mahalanobis distance reduces to scaled Euclidean distance:

(x−μ_k)^T(σ²I)⁻¹(x−μ_k) = ||x − μ_k||² / σ²

The classifier becomes: assign x to the class with the nearest mean (centroid). No matrix inversions needed!

Nearest centroid classifier: When Σ = σ²I with equal priors, the Bayes-optimal classifier just computes ||x − μ_k||² for each k and picks the smallest. The boundary between two classes is the perpendicular bisector of the line segment connecting their means.
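A nearest centroid classifier is only a few lines of code. The sketch below assumes equal priors and Σ = σ²I, so σ² never needs to be known; the three class means are hypothetical.

```python
import numpy as np

def nearest_centroid(x, means):
    """Index of the class mean closest to x in squared Euclidean distance."""
    means = np.asarray(means, dtype=float)   # shape (K, N)
    d2 = np.sum((means - np.asarray(x, dtype=float)) ** 2, axis=1)
    return int(np.argmin(d2))

means = [[0.0, 0.0], [4.0, 0.0], [0.0, 4.0]]   # hypothetical class means
print(nearest_centroid([1.0, 0.5], means))     # 0
print(nearest_centroid([3.0, 1.0], means))     # 1
```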

Geometric Interpretation

The decision boundary between class 1 and class 2 satisfies:

||x − μ₁||² = ||x − μ₂||²

Expanding:

x^Tx − 2μ₁^Tx + μ₁^Tμ₁ = x^Tx − 2μ₂^Tx + μ₂^Tμ₂

(μ₁ − μ₂)^Tx = (||μ₁||² − ||μ₂||²) / 2

This is a hyperplane with normal vector (μ₁ − μ₂) passing through the midpoint (μ₁ + μ₂)/2. It's literally the perpendicular bisector.

When Priors Are Unequal

With unequal priors, the boundary shifts toward the less likely class. Decide class 1 when:

||x − μ₁||² − ||x − μ₂||² ≤ 2σ² log(π₁/π₂)

The boundary is still a hyperplane (linear), just shifted from the midpoint.
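The size of the shift can be made concrete. Writing a point on the line through the means as x = μ₁ + t(μ₂ − μ₁), the inequality above becomes an equality at t = 1/2 + σ² log(π₁/π₂) / ||μ₂ − μ₁||². The sketch below evaluates that position and the decision rule; the function names and example numbers are assumptions.

```python
import numpy as np

def boundary_position(mu1, mu2, sigma2, pi1, pi2):
    """Fraction t in x = mu1 + t*(mu2 - mu1) where the boundary crosses the line joining the means."""
    D2 = np.sum((np.asarray(mu2, float) - np.asarray(mu1, float)) ** 2)
    return 0.5 + sigma2 * np.log(pi1 / pi2) / D2

def decide_class1(x, mu1, mu2, sigma2, pi1, pi2):
    """True when ||x - mu1||^2 - ||x - mu2||^2 <= 2*sigma2*log(pi1/pi2)."""
    x, mu1, mu2 = (np.asarray(v, float) for v in (x, mu1, mu2))
    lhs = np.sum((x - mu1) ** 2) - np.sum((x - mu2) ** 2)
    return bool(lhs <= 2.0 * sigma2 * np.log(pi1 / pi2))

mu1, mu2 = [0.0, 0.0], [4.0, 0.0]                # hypothetical means
print(boundary_position(mu1, mu2, 1.0, 0.5, 0.5))   # 0.5: the midpoint
print(boundary_position(mu1, mu2, 1.0, 0.8, 0.2))   # above 0.5: pushed toward mu2
print(decide_class1([2.2, 0.0], mu1, mu2, 1.0, 0.8, 0.2))
```

With π₁ = 0.8, a point slightly past the midpoint is still assigned to class 1, which is what shifting toward the less likely class means.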

Connection to Signal Detection

This is exactly the signal detection problem from Lecture 14! With H₁: x ~ N(0, σ²I) and H₂: x ~ N(μ, σ²I), the optimal test was to decide H₂ when x^Tμ ≥ ||μ||²/2. That's the perpendicular bisector between 0 and μ. Full circle.

Nearest Centroid vs LDA with Correlation

Compare Σ = σ²I (perpendicular bisector) vs Σ with correlation (tilted boundary). When features are correlated, the boundary rotates away from the perpendicular bisector.

When Σ = σ²I (equal variance, no correlation), the Bayes decision boundary between two classes is:

• A circle centered at the midpoint of the means
• The perpendicular bisector of the line segment connecting the two means
• A parabola

Chapter 7: Mastery

The Classification Hierarchy

Method	Covariance Assumption	Boundary Shape	Parameters
Nearest Centroid	Σ = σ²I	Perpendicular bisector	K·N + 1
LDA	Σ₁ = Σ₂ = Σ	Linear (hyperplane)	K·N + N(N+1)/2
QDA	Σ_k different	Quadratic (conic)	K·N + K·N(N+1)/2
Full Bayes	Non-Gaussian g_k	Arbitrary	∞

Signal Processing Connection

Random Process: multiple realizations of the same source

↓

WSS Assumption: Σ is Toeplitz → only N autocorrelation values needed

↓

Gaussian Model: each class is N(μ_k, Σ_k)

↓

Bayes Classifier: LDA (shared Σ) or QDA (different Σ_k)

What Comes Next

• Lecture 14: The Bayes framework and ROC curves (prerequisite)

• Lecture 16: Fisher's discriminant — finding the best projection WITHOUT assuming Gaussianity

Practical Considerations

• Estimation: In practice, μ_k and Σ_k are unknown and are estimated from training data, using the sample mean x̄_k and sample covariance S_k of each class.

• Regularization: When N exceeds the number of training samples, the sample covariance estimate is singular and cannot be inverted. Common fix: Σ_reg = (1−λ)Σ + λσ²I with 0 < λ ≤ 1 (shrink toward a scaled identity).

• Dimensionality reduction: Before LDA/QDA, reduce dimensionality (PCA, MFCC, wavelets) to make covariance estimation feasible.
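The regularization bullet can be illustrated with simulated data. In the sketch below the sizes, the shrinkage weight λ, and the choice σ² = trace(S)/N are all assumptions made for the example; in practice λ would be tuned, for instance by cross-validation.

```python
import numpy as np

rng = np.random.default_rng(3)

N, M = 20, 10                     # more dimensions than training samples (assumed sizes)
X = rng.standard_normal((M, N))   # hypothetical training data, rows = samples
S = np.cov(X, rowvar=False)       # sample covariance, rank at most M - 1

lam = 0.1                         # shrinkage weight (assumed)
sigma2 = np.trace(S) / N          # average variance, one possible choice for the target scale
S_reg = (1.0 - lam) * S + lam * sigma2 * np.eye(N)

rank_S = np.linalg.matrix_rank(S)
min_eig_reg = np.linalg.eigvalsh(S_reg).min()
print(rank_S, N)                  # rank is below N, so S is singular
print(min_eig_reg > 0)            # the shrunk matrix is positive definite
```

Because every eigenvalue of S_reg is at least λσ², the inverse needed by LDA or QDA exists even when the raw estimate is rank-deficient.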

"The goal of a good feature representation is to make simple classifiers work well." If your features have equal variance and are uncorrelated, the nearest centroid classifier is optimal — no fancy algorithm needed. Good feature engineering makes the simple model correct.

Going from LDA to QDA, what changes and what stays the same?

• Same: both assume Gaussian classes. Different: LDA shares one covariance, QDA allows different covariances per class, giving quadratic boundaries
• Same: both have quadratic boundaries. Different: QDA uses more features
• Same: both assume equal priors. Different: QDA doesn't assume Gaussian