When signals are random processes — characterizing them with autocorrelation, then building optimal Gaussian classifiers.
Every time you record the same vowel "ah," you get a different waveform. The pitch varies slightly, the amplitude fluctuates, the exact timing is never the same. Yet something is consistent — the overall "character" of the signal. How do we mathematically describe a signal that's different every time but statistically consistent?
A random process (or stochastic process) x[n] is not a single signal — it's a collection of all possible signals that could be produced by the same source. Each individual realization is called a sample path. The random process is defined by the statistics that all sample paths share.
The mean function mx[n] = E[x[n]] tells us the average value of the signal at each time index n. For the vowel "ah," the mean might oscillate at the fundamental frequency.
The covariance function tells us how the signal at time k relates to the signal at time l:
When k = l, this is the variance at time k. When k ≠ l, it measures the correlation between different time points.
If you stack N samples into a vector x = [x[0], x[1], ..., x[N-1]]T, the covariance function becomes an N × N covariance matrix Σx where the (k,l)-th entry is Σx[k,l].
Each click generates a new realization of the same random process (a sinusoid with random amplitude and phase plus noise). Notice: each is different, but the statistical character is the same.
A random process is Wide-Sense Stationary (WSS) if two conditions hold:
1. Constant mean: mx[n] = mx for all n (the average doesn't change over time)
2. Covariance depends only on lag: Σx[k, l] depends only on the difference k − l, not on k and l individually
When these hold, the covariance matrix has a special structure called Toeplitz: every diagonal has the same value. Instead of N2 free parameters, we only need N.
For a WSS process, define the autocorrelation function:
This measures how similar the signal is to a time-shifted version of itself. Properties:
• rx[0] = E[|x[n]|2] = average power (always real, ≥ 0)
• rx[−m] = rx*[m] (conjugate symmetric)
• |rx[m]| ≤ rx[0] (maximum at zero lag)
x[n] = A · sin(nω0 + φ) where A is a constant amplitude and φ ~ Uniform[0, 2π).
Mean: mx[n] = E[A sin(nω0 + φ)] = A · E[sin(nω0 + φ)] = 0 (the uniform phase averages out).
Autocorrelation:
The second cosine term vanishes because E[cos(... + 2φ)] = 0 when φ is uniform on [0, 2π).
White noise w[n] with variance σ2 has the simplest possible autocorrelation:
where δ[m] = 1 if m = 0, else 0. Adjacent samples are completely uncorrelated. The covariance matrix is Σw = σ2I (a scaled identity).
For a WSS process with N samples, the covariance matrix is:
Worked example with N = 4, rx[0] = 2, rx[1] = 1, rx[2] = 0.5, rx[3] = 0.25:
matrix
Σ = | 2.00 1.00 0.50 0.25 |
| 1.00 2.00 1.00 0.50 |
| 0.50 1.00 2.00 1.00 |
| 0.25 0.50 1.00 2.00 |
Only 4 unique values instead of 16! This dramatic parameter reduction is why stationarity matters for signal processing and classification.
Choose a signal type and see its autocorrelation. Note: sinusoids have periodic autocorrelation, noise has an impulse, and exponential decay indicates a lowpass process.
The autocorrelation tells us about temporal correlations. Its Fourier transform tells us about the frequency content — this is the Power Spectral Density (PSD).
For a WSS process, the PSD is the DTFT of the autocorrelation:
Properties of the PSD:
• Sx(ω) ≥ 0 for all ω (power can't be negative)
• Sx(ω) is real-valued (since rx[m] is conjugate-symmetric)
• Total power = (1/2π) ∫ Sx(ω) dω = rx[0]
| Process | rx[m] | Sx(ω) | Shape |
|---|---|---|---|
| White noise | σ2δ[m] | σ2 | Flat |
| Sinusoid | (A2/2)cos(mω0) | (πA2/2)[δ(ω−ω0) + δ(ω+ω0)] | Two impulses |
| AR(1) | σ2a|m|/(1−a2) | σ2/|1 − ae−jω|2 | Lowpass (if 0<a<1) |
x[n] = 0.9 · x[n−1] + w[n] where w[n] ~ N(0, 1).
• Coefficient a = 0.9, σw2 = 1
• rx[m] = (1/(1−0.81)) · 0.9|m| = 5.26 · 0.9|m|
• rx[0] = 5.26 (total power)
• rx[1] = 4.74, rx[5] = 3.11 (slowly decaying — strong temporal correlation)
• Sx(ω) = 1/|1 − 0.9e−jω|2 peaks at ω = 0 (lowpass)
Contrast with a = −0.9: rx[m] = 5.26 · (−0.9)|m| alternates sign, and Sx(ω) peaks at ω = π (highpass).
An AR(1) process x[n] = a·x[n-1] + w[n]. Adjust 'a' to see how the PSD shape changes. Positive a = lowpass, negative a = highpass.
Now we connect random processes to classification. In Lecture 14, the Bayes classifier was f*(x) = argmaxk πk gk(x). The most important case in practice: Gaussian class-conditionals.
Class k has distribution N(μk, Σk):
The Mahalanobis distance dk(x) = (x − μk)T Σk−1 (x − μk) generalizes Euclidean distance by accounting for the covariance structure. If features are correlated or have different scales, Mahalanobis handles it.
Taking the log of πk gk(x) and dropping terms constant across all k:
This is the discriminant function δk(x). The Bayes classifier assigns x to the class with the largest δk(x).
In signal processing, Gaussian models arise naturally from:
• Central Limit Theorem: Sums of many independent effects → Gaussian
• Thermal noise: Physical noise sources are well-modeled as Gaussian
• Maximum entropy: Given only mean and covariance constraints, the Gaussian maximizes uncertainty (most "honest" distribution)
• Mathematical tractability: Closed-form inverses, determinants, and Bayes classifiers
Two Gaussian classes with different covariance structures. Contours of equal density (Mahalanobis distance) are ellipses. The Bayes boundary depends on whether covariances match.
Linear Discriminant Analysis (LDA) is the Bayes classifier when all classes share the same covariance Σ1 = Σ2 = ... = ΣK = Σ.
Start from the log-likelihood ratio. With shared Σ:
The |Σ| terms cancel (same covariance). Expanding the quadratics:
The xTΣ−1x terms cancel! We're left with:
Declare class 1 if this is ≥ 0. Define the LDA weight vector:
The classifier is: assign x to class 1 if wTx ≥ threshold, else class 2.
μ1 = [0, 0]T, μ2 = [2, 1]T, Σ = [[2, 1], [1, 2]].
Step 1: Compute Σ−1.
• det(Σ) = 4 − 1 = 3
• Σ−1 = (1/3)[[2, −1], [−1, 2]]
Step 2: Compute w = Σ−1(μ1 − μ2) = (1/3)[[2,−1],[−1,2]] · [−2, −1]T
• w = (1/3)[−4+1, 2−2]T = [−1, 0]T
Step 3: The decision is wTx ≥ threshold:
• Threshold = (1/2) wT(μ1 + μ2) = (1/2)[−1, 0] · [2, 1]T = −1
• Classify as class 1 if −x1 ≥ −1, i.e., x1 ≤ 1
The boundary is a vertical line at x1 = 1. Even though the means differ in both coordinates, the covariance structure tells us that the x2 direction isn't informative (high correlation makes it redundant).
Two Gaussian classes with shared covariance. Drag the sliders to adjust means and correlation. The LDA boundary (solid line) is always linear. Samples are drawn from each class.
When classes have different covariance matrices Σ1 ≠ Σ2, the xTΣ−1x terms DON'T cancel. The discriminant function becomes:
The boundary δ1(x) = δ2(x) is a quadratic in x. This gives Quadratic Discriminant Analysis (QDA) with elliptical, parabolic, or hyperbolic boundaries.
Setting δ1(x) = δ2(x):
The xTAx term makes it quadratic. When Σ1 = Σ2, the matrix A = Σ2−1 − Σ1−1 = 0, and we recover LDA.
Class 1: μ1 = [0,0]T, Σ1 = [[1, 0], [0, 4]] (tall ellipse)
Class 2: μ2 = [3,0]T, Σ2 = [[4, 0], [0, 1]] (wide ellipse)
Equal priors π1 = π2 = 0.5.
Class 1 is "tall" (spread in y), class 2 is "wide" (spread in x). A point with large |y| is more consistent with class 1. A point far from the origin in x is more consistent with class 2. The boundary will curve to reflect this.
At point x = [1.5, 2]T:
• Mahalanobis to μ1: [1.5, 2] [[1,0],[0,0.25]] [1.5, 2]T = 2.25 + 1 = 3.25
• Mahalanobis to μ2: [−1.5, 2] [[0.25,0],[0,1]] [−1.5, 2]T = 0.5625 + 4 = 4.5625
• Log-det correction: (1/2)(log(4)−log(4)) = 0 (both have det=4)
• Class 1 wins (lower Mahalanobis distance). Even though x1 = 1.5 is closer to μ2 in Euclidean terms, the large y = 2 is more consistent with class 1's vertical spread.
Two classes with different covariance structures. The QDA boundary is curved — it can be an ellipse, hyperbola, or parabola depending on the covariance matrices.
The simplest (and most intuitive) special case: when the shared covariance is a scaled identity Σ = σ2I. This means all features have equal variance and are uncorrelated.
With Σ = σ2I, the LDA weight vector becomes:
And the Mahalanobis distance reduces to scaled Euclidean distance:
The classifier becomes: assign x to the class with the nearest mean (centroid). No matrix inversions needed!
The decision boundary between class 1 and class 2 satisfies:
Expanding:
This is a hyperplane with normal vector (μ1 − μ2) passing through the midpoint (μ1 + μ2)/2. It's literally the perpendicular bisector.
With unequal priors, the boundary shifts toward the less likely class:
The boundary is still a hyperplane (linear), just shifted from the midpoint.
This is exactly the signal detection problem from Lecture 14! With H1: x ~ N(0, σ2I) and H2: x ~ N(μ, σ2I), the optimal test was xTμ ≥ ||μ||2/2. That's the perpendicular bisector between 0 and μ. Full circle.
Compare Σ = σ2I (perpendicular bisector) vs Σ with correlation (tilted boundary). When features are correlated, the boundary rotates away from the perpendicular bisector.
| Method | Covariance Assumption | Boundary Shape | Parameters |
|---|---|---|---|
| Nearest Centroid | Σ = σ2I | Perpendicular bisector | K·N + 1 |
| LDA | Σ1 = Σ2 = Σ | Linear (hyperplane) | K·N + N(N+1)/2 |
| QDA | Σk different | Quadratic (conic) | K·N + K·N(N+1)/2 |
| Full Bayes | Non-Gaussian gk | Arbitrary | ∞ |
• Lecture 14: The Bayes framework and ROC curves (prerequisite)
• Lecture 16: Fisher's discriminant — finding the best projection WITHOUT assuming Gaussianity
• Estimation: In practice, μk and Σk are estimated from training data. Sample mean x̄k and sample covariance Sk.
• Regularization: When N > number of training samples, Σ is singular. Common fix: Σreg = (1−λ)Σ + λσ2I (shrink toward identity).
• Dimensionality reduction: Before LDA/QDA, reduce dimensionality (PCA, MFCC, wavelets) to make covariance estimation feasible.