EE269 Lecture 9 — Mert Pilanci, Stanford

Distance-Based Classification

Classifying signals by measuring how "close" they are — nearest neighbors, inner products, and Hilbert spaces.

Prerequisites: EE269 Lecture 6 (DFT) + Basic linear algebra (vectors, norms). That's it.
8
Chapters
6+
Simulations
0
Assumed Knowledge

Chapter 0: The Classification Problem

You have 100 recordings of the word "yes" and 100 recordings of the word "no." A new recording arrives. Is it "yes" or "no"?

Each recording is a signal — a vector of, say, 8000 samples (0.5 seconds at 16 kHz). You need an algorithm that looks at the new signal and assigns it to one of two classes. This is binary classification.

The simplest possible approach: compare the new signal to every training signal. Find the one it's most similar to. Assign the same label. This is the Nearest Neighbor classifier — surprisingly powerful and requires zero model training.

The key question: What does "similar" mean for signals? Two recordings of "yes" by different speakers look completely different as waveforms. But they sound the same. We need a distance measure that captures perceptual similarity, not sample-by-sample identity.

Formal Setup

Training set: m labeled signals {(x1, y1), (x2, y2), ..., (xm, ym)} where:

• xi ∈ ℝN is a signal (vector of N samples)

• yi ∈ {0, 1} is its class label

Test signal: xnew ∈ ℝN. Goal: predict ynew.

Why Distance?

Classification algorithms range from simple (nearest neighbor) to complex (deep neural networks). But at their core, most rely on some notion of distance or similarity:

• Nearest neighbor: explicitly uses distance

• SVM: finds the hyperplane maximizing distance to nearest points

• Neural networks: learn a representation where classes are far apart

Understanding distance metrics for signals is foundational. The choice of metric can matter more than the classifier itself.

Two-Class Signal Classification

Class A (teal) and Class B (orange) signals. The test signal (white) gets classified based on which class is "closer." Click to see different representations.

What is the main challenge in classifying audio signals by distance?

Chapter 1: Distance Metrics

A distance metric d(x, y) is a function that measures how "far apart" two vectors are. To be a valid metric, it must satisfy four properties:

1. Non-negativity: d(x, y) ≥ 0

2. Identity: d(x, y) = 0 if and only if x = y

3. Symmetry: d(x, y) = d(y, x)

4. Triangle inequality: d(x, z) ≤ d(x, y) + d(y, z)

Euclidean Distance (L2)

The most natural distance — straight-line distance in N-dimensional space:

d2(x, y) = √(∑n=0N-1 |x[n] - y[n]|2) = ||x - y||2

For signals, this is the root-mean-square difference. Two signals that differ by loud bursts at a few samples can have large Euclidean distance even if they sound similar overall.

Physical meaning: ||x||22 = ∑|x[n]|2 is the signal energy. So Euclidean distance = square root of the energy of the difference signal. If d2(x,y) is small, then x-y is a low-energy signal (nearly zero everywhere).

Manhattan Distance (L1)

d1(x, y) = ∑n=0N-1 |x[n] - y[n]|

Sum of absolute differences. More robust to outliers than Euclidean — a single large difference at one sample doesn't dominate as much (linear vs. quadratic weighting).

Chebyshev Distance (L)

d(x, y) = maxn |x[n] - y[n]|

The maximum absolute difference across all samples. Only cares about the worst-case sample. Useful when you need to guarantee that NO sample deviates by more than a threshold.

General Lp Distance

dp(x, y) = (∑n=0N-1 |x[n] - y[n]|p)1/p

Euclidean is p=2, Manhattan is p=1, Chebyshev is p→∞. As p increases, larger differences are penalized more heavily.

Comparison

MetricFormulaSensitive ToUse When
L2 (Euclidean)√(∑|x-y|2)Overall energy differenceSignals are aligned, similar amplitude
L1 (Manhattan)∑|x-y|Average absolute differenceRobust to outlier samples
L (Chebyshev)max|x-y|Worst-case sampleBounded-error constraints
Distance Metric Comparison

Two signals shown. The "distance" under each metric is visualized differently. L2 emphasizes large deviations; L1 treats all equally; Linf only sees the max.

Signal x = [0, 0, 10, 0, 0] and signal y = [2, 2, 2, 2, 2]. Which metric gives the SMALLEST distance between them?

Chapter 2: Nearest Neighbor Classifier

The Nearest Neighbor (NN) classifier is the simplest classification algorithm possible. It requires no training phase — just store the training data and compare at test time.

Algorithm

Input
Test signal xnew, training set {(xi, yi)}
Compute distances
di = d(xnew, xi) for all i = 1,...,m
Find nearest
i* = argmini di
Classify
new = yi* (label of closest training signal)

Properties

No training: The "model" IS the training data. No parameters to learn, no optimization. Just store and compare.

Nonparametric: Makes no assumptions about the shape of the decision boundary. Can capture arbitrarily complex boundaries.

Consistent: As training set size m → ∞, the NN error rate converges to at most twice the optimal (Bayes) error rate. This is a remarkable guarantee for such a simple algorithm.

Expensive at test time: Must compute distance to ALL m training signals. For large m, this is O(mN) per test sample. (Compare to a trained model which is O(1) to evaluate.)

In Code

python
import numpy as np

def nearest_neighbor(x_new, X_train, y_train):
    """Classify x_new using 1-NN.
    X_train: (m, N) array of training signals
    y_train: (m,) array of labels {0, 1}
    """
    distances = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    nearest_idx = np.argmin(distances)
    return y_train[nearest_idx]

Worked Example

Training set (2D features for simplicity):

• x1 = [1, 2], y1 = 0 (class A)

• x2 = [3, 1], y2 = 0 (class A)

• x3 = [6, 5], y3 = 1 (class B)

• x4 = [7, 4], y4 = 1 (class B)

Test point: xnew = [4, 3]

Distances (Euclidean):

d(xnew, x1) = √((4-1)2 + (3-2)2) = √10 ≈ 3.16
d(xnew, x2) = √((4-3)2 + (3-1)2) = √5 ≈ 2.24 ← nearest!
d(xnew, x3) = √((4-6)2 + (3-5)2) = √8 ≈ 2.83
d(xnew, x4) = √((4-7)2 + (3-4)2) = √10 ≈ 3.16

Nearest neighbor is x2 with label 0 → classify xnew as class A.

Nearest Neighbor Classifier

Click anywhere to place a test point. It gets classified by the label of its nearest training point. Teal = class A, orange = class B.

Click to classify a test point.
What is the main disadvantage of the 1-NN classifier?

Chapter 3: Inner Products & Norms

Distance metrics are built from norms, and norms (at least the L2 norm) come from inner products. Understanding this hierarchy gives us powerful tools for signal comparison.

The Inner Product

For discrete signals x[n] and y[n] of length N, the inner product is:

<x, y> = ∑n=0N-1 x[n] · y*[n]

where y*[n] is the complex conjugate of y[n]. For real signals, this simplifies to the dot product: <x, y> = ∑ x[n] y[n].

What the inner product measures: It quantifies the "overlap" or "similarity" between two signals. If x and y point in the same direction (proportional), <x,y> is large and positive. If they're orthogonal (uncorrelated), <x,y> = 0. If they're opposite, <x,y> is large and negative.

The Norm (from the Inner Product)

The L2 norm (Euclidean length) of a signal is derived from the inner product:

||x||2 = √(<x, x>) = √(∑n=0N-1 |x[n]|2)

This equals the square root of the signal's energy. A loud signal has large norm; a quiet signal has small norm.

Distance from Norm

The Euclidean distance is just the norm of the difference:

d(x, y) = ||x - y||2 = √(<x-y, x-y>)

Expanding this reveals a connection to the inner product:

||x - y||2 = ||x||2 + ||y||2 - 2<x, y>

So distance is small when <x,y> is large relative to the norms — when the signals are well-correlated.

Cosine Similarity

To remove amplitude dependence and compare only "shape," we normalize:

cos(θ) = <x, y> / (||x|| · ||y||)

This equals 1 when signals are identical (up to scaling), 0 when orthogonal, -1 when opposite. It's amplitude-invariant: cos(2x, y) = cos(x, y).

Worked Example: Cosine vs. Euclidean

Signal a = [1, 1, 0, 0] and signal b = [3, 3, 0, 0] (same direction, different loudness).

• Euclidean: ||a-b|| = √(4+4) = 2√2 ≈ 2.83 (says "quite different")

• Cosine: <a,b>/(||a|| · ||b||) = 6/(√2 · 3√2) = 6/6 = 1.0 (says "identical shape")

Signal c = [0, 0, 1, 1] (orthogonal direction, same energy as a).

• Euclidean: ||a-c|| = √(1+1+1+1) = 2 (says "moderately different")

• Cosine: <a,c>/(||a|| · ||c||) = 0/(√2 · √2) = 0 (says "completely different shape")

Lesson: Euclidean distance is sensitive to amplitude. Cosine similarity only compares direction (shape). For audio classification where volume varies, cosine is often better.

Code: Inner Product and Friends

python
import numpy as np

def inner_product(x, y):
    """Inner product of two signals."""
    return np.sum(x * np.conj(y))

def norm(x):
    """L2 norm = sqrt(energy)."""
    return np.sqrt(np.sum(np.abs(x)**2))

def cosine_similarity(x, y):
    """Cosine of angle between signals."""
    return inner_product(x, y).real / (norm(x) * norm(y))
Inner Product Geometry

Two 2D vectors. Drag the angle slider to see how inner product, norm of difference, and cosine change. The inner product equals the projection of one onto the other times its length.

Angle 45°
If ||x - y||2 = ||x||2 + ||y||2 - 2<x,y>, what does it mean when <x,y> = 0?

Chapter 4: Hilbert Spaces & Parseval

We've been computing distances between signals in the time domain. But we also have the DFT. Can we compute distances in the frequency domain instead? Remarkably, the answer is yes — and the distance is IDENTICAL.

Hilbert Space

A Hilbert space is a vector space equipped with an inner product that is "complete" (every Cauchy sequence converges). The space of finite-length discrete signals ℝN with the inner product <x,y> = ∑ x[n]y*[n] is a Hilbert space.

Why does this matter? Hilbert spaces have a powerful property: they support orthonormal bases (like the DFT basis), and the geometry is preserved when you change basis.

Parseval's Theorem

Parseval's theorem states that the DFT preserves the inner product (up to a constant):

n=0N-1 |x[n]|2 = (1/N) ∑k=0N-1 |X[k]|2

More generally, for two signals:

n=0N-1 x[n] y*[n] = (1/N) ∑k=0N-1 X[k] Y*[k]
What Parseval means for classification: The Euclidean distance between two signals is THE SAME whether you compute it in the time domain or the frequency domain: ||x - y||2time = (1/N) ||X - Y||2freq. You can classify signals using their DFT representations and get identical results!

Why This Is Powerful

If distance is the same in both domains, why bother with frequency? Because:

Feature selection: In frequency domain, you can ignore irrelevant bands (e.g., only compare 300–3000 Hz for speech)

Robustness: Time shifts change the time-domain representation but only rotate the phase in frequency (magnitude is shift-invariant)

Dimensionality reduction: Most signal energy is in a few frequency bins — compute distance using only those bins

Proof of Parseval's Theorem

The DFT can be written as a matrix multiply: X = F·x, where F is the N×N DFT matrix with entries F[k,n] = e-j2πkn/N. The key property of F:

FH F = N · I

(F is N times a unitary matrix.) Therefore:

||X||2 = XHX = (Fx)H(Fx) = xHFHFx = xH(NI)x = N||x||2

Dividing both sides by N gives Parseval's theorem. The DFT is an isometry (distance-preserving map) up to the scale factor √N.

Worked Example

Signal: x = [1, -1, 1, -1] (N=4). DFT: X = [0, 0, 4, 0].

Time-domain energy: 1 + 1 + 1 + 1 = 4.

Freq-domain energy: (1/4)(0 + 0 + 16 + 0) = 4. ✓

Signal: y = [1, 1, 1, 1]. DFT: Y = [4, 0, 0, 0].

Time-domain distance: ||x-y||2 = |0|2 + |-2|2 + |0|2 + |-2|2 = 8.

Freq-domain distance: (1/4)(|-4|2 + |0|2 + |4|2 + |0|2) = (1/4)(16 + 16) = 8. ✓

Parseval: Distance Equivalence (Time vs. Frequency)

Two signals compared. Left: time-domain distance. Right: frequency-domain distance. They're always equal (up to 1/N scaling). Change the signals to verify.

Parseval's theorem says that the DFT preserves:

Chapter 5: k-NN & Decision Boundaries

The 1-NN classifier is sensitive to noise — a single mislabeled training point can flip a decision. The k-Nearest Neighbors (k-NN) algorithm fixes this by voting among the k closest training points.

Algorithm

1. Compute distances d(xnew, xi) for all training points

2. Find the k smallest distances

3. Among those k neighbors, take a majority vote of their labels

4. Assign the majority label to xnew

new = argmaxci ∈ Nk(xnew) δ(yi = c)

where Nk(xnew) is the set of k nearest neighbors.

Choosing k

k = 1: Most sensitive to noise. Jagged decision boundary.

k = 5–10: Smoother boundary. More robust.

k = m: Always predicts the majority class (useless).

Rule of thumb: k ≈ √m (square root of training set size). Use odd k for binary classification (avoids ties).

The bias-variance tradeoff: Small k = low bias (flexible boundary) but high variance (noisy). Large k = high bias (smooth boundary) but low variance (stable). Cross-validation picks the best k.

Decision Boundaries

The decision boundary is the set of points where the classifier switches from one class to another. For 1-NN, the boundary is the Voronoi diagram — the set of points equidistant from two training points of different classes.

For k-NN with k > 1, boundaries are smoother (averaging effect of multiple voters) but can be non-convex and disconnected.

Weighted k-NN

A refinement: weight each neighbor's vote by the inverse of its distance. Closer neighbors get more say:

ŷ = argmaxci ∈ Nk wi · δ(yi = c),   where wi = 1/d(x, xi)

This helps when the k neighbors span a large range of distances. Without weighting, a distant neighbor has the same vote as the nearest one. With weighting, the nearest neighbor dominates — a smooth interpolation between 1-NN (k=1) and unweighted k-NN.

Computational Tricks

For large training sets, brute-force k-NN is expensive (O(mN) per query). Practical speedups:

KD-trees: Partition space hierarchically. O(log m) average query time in low dimensions.

Ball trees: Better for high dimensions (d > 20). Partitions into hyperspheres.

Approximate NN: Locality-Sensitive Hashing (LSH) gives O(1) approximate queries.

Dimensionality reduction: Project to lower dimensions first (PCA, DFT, random projection).

In Code

python
import numpy as np

def knn_classify(x_new, X_train, y_train, k=5):
    """k-NN classifier."""
    distances = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    nearest_k = np.argsort(distances)[:k]
    labels = y_train[nearest_k]
    # Majority vote
    return np.bincount(labels).argmax()
k-NN Decision Boundaries

Adjust k to see how the decision boundary changes. k=1 is jagged (Voronoi). Higher k smooths the boundary. Background color = predicted class at that point.

k 1
Why should k be odd for binary classification?

Chapter 6: Template Matching & Cross-Correlation

Sometimes you don't want to classify between arbitrary classes — you want to find WHERE a known pattern occurs in a longer signal. This is template matching, and it's powered by the cross-correlation function.

Cross-Correlation

The cross-correlation of signals x[n] and template t[n] (length L) at lag τ:

Rxt[τ] = ∑n=0L-1 x[n + τ] · t[n]

This slides the template t across x and computes the inner product at each position. Where Rxt[τ] is large, the template matches well.

Think of it this way: Cross-correlation asks "how much does x look like t at position τ?" It's the inner product between the template and the corresponding segment of x. Peak in Rxt = best match location.

Normalized Cross-Correlation

Raw cross-correlation is biased by amplitude. A loud section of x will always have high correlation regardless of shape. Fix: normalize by energy:

NCC[τ] = (∑ x[n+τ] · t[n]) / (||xτ|| · ||t||)

where ||xτ|| is the norm of the segment x[τ..τ+L-1]. NCC ∈ [-1, 1], with 1 = perfect match.

Fast Correlation via DFT

Direct computation of Rxt for all lags is O(N·L). But the correlation theorem says:

Rxt = IDFT{X[k] · T*[k]}

Cross-correlation in time = pointwise multiply (with conjugate) in frequency. Using FFT, this is O(N log N) — much faster for long signals.

In Code

python
import numpy as np

def template_match(signal, template):
    """Find template location via FFT-based cross-correlation."""
    N = len(signal)
    L = len(template)
    # Zero-pad to avoid circular correlation
    nfft = N + L - 1
    X = np.fft.fft(signal, n=nfft)
    T = np.fft.fft(template, n=nfft)
    R = np.fft.ifft(X * np.conj(T)).real
    # Peak location = best match position
    best_lag = np.argmax(R[:N - L + 1])
    return best_lag, R

Applications

Keyword spotting: Store templates of "hey Siri" / "OK Google," cross-correlate with audio stream

Radar: Transmit known pulse, cross-correlate received signal to find echo delay (= target distance)

Music synchronization: Find where a clip occurs in a longer recording

Image registration: Find where a template patch matches in a larger image (same math, 2D)

Autocorrelation: Matching a Signal with Itself

A special case: cross-correlate a signal with ITSELF. This is the autocorrelation:

Rxx[τ] = ∑n x[n] · x[n + τ]

The autocorrelation is always maximized at lag τ = 0 (a signal matches itself perfectly at zero delay). For periodic signals, Rxx has peaks at the period. This is the basis of pitch detection in speech processing:

1. Compute autocorrelation of speech frame

2. Find the first peak after τ = 0

3. Period T = position of that peak, frequency f0 = fs/T

Correlation vs. Convolution

Cross-correlation and convolution are closely related:

Correlation: Rxt[τ] = ∑ x[n+τ] · t[n]
Convolution: (x * t)[τ] = ∑ x[n] · t[τ-n]

The difference: convolution flips the template before sliding. Correlation does not. For symmetric templates (like Gaussians), they're identical. In the DFT domain:

• Correlation: IDFT{X · T*} (conjugate of template)

• Convolution: IDFT{X · T} (no conjugate)

Template Matching via Cross-Correlation

A template (orange) is hidden somewhere in the signal (teal). The cross-correlation (bottom) peaks at the match location.

Why is FFT-based cross-correlation faster than direct computation?

Chapter 7: Mastery

Let's consolidate the key ideas from this lecture and see how they connect to modern signal processing and machine learning.

The Big Picture

Raw Signals
x[n] ∈ ℝN — high-dimensional
Feature Extraction
DFT, spectral descriptors, etc. → compact representation
Distance Computation
d(x, y) in time or frequency domain (Parseval: same result)
Classification
k-NN, template matching, or more complex classifiers

Summary Table

ConceptFormulaUse
Euclidean dist.||x-y||2General-purpose signal comparison
Inner product<x,y> = ∑x[n]y*[n]Similarity, projection, correlation
Cosine sim.<x,y>/(||x||·||y||)Shape comparison (amplitude-invariant)
Parseval||x||2 = (1/N)||X||2Compute distance in freq domain
k-NNMajority of k nearestRobust nonparametric classification
Cross-correlationIDFT{X·T*}Template matching, delay estimation

From k-NN to Deep Learning

Modern classifiers (neural networks) can be viewed as learning a representation where k-NN would work perfectly. The network maps signals into a space where same-class signals are close and different-class signals are far apart. The final linear layer is essentially a 1-NN classifier in the learned feature space.

This is the insight behind contrastive learning (SimCLR, CLIP): train a network to minimize distance between same-class pairs and maximize distance between different-class pairs.

The Curse of Dimensionality

k-NN works well in low dimensions but degrades as dimensionality increases. In high dimensions (N > 100), counterintuitive things happen:

• All points become roughly equidistant from each other (distances concentrate around the mean)

• The "nearest" neighbor is barely closer than the "farthest" — the concept of "nearest" loses meaning

• To maintain coverage, you need exponentially more training data as N grows

This is why feature extraction (DFT, spectral descriptors, PCA) is crucial: reduce dimensionality to the 10–50 most informative features before applying k-NN. Parseval's theorem guarantees we lose no information going to the frequency domain, and then we can select the most discriminative frequency bins.

Error Bounds for k-NN

Cover & Hart (1967) proved a remarkable theorem: as m → ∞, the error rate of the 1-NN classifier RNN satisfies:

R* ≤ RNN ≤ 2R* (1 - R*)

where R* is the Bayes-optimal error rate (the best any classifier can achieve). For the k-NN classifier with k → ∞ but k/m → 0, the error rate converges to R* itself. So k-NN is universally consistent — it achieves optimal performance given enough data, regardless of the underlying distribution.

Connections

Lecture 7 (Spectral Descriptors): Features that make distance meaningful for audio

Lecture 8 (STFT): Time-varying features for classification of nonstationary signals

Machine Learning: k-NN is the starting point; SVMs, random forests, and neural nets all refine the idea of "distance in feature space"

Practical Checklist for Signal Classification

When building a distance-based signal classifier:

1. Normalize signals (zero mean, unit variance) to remove amplitude/offset bias

2. Extract features (DFT, spectral descriptors, MFCCs) to reduce dimensionality

3. Choose distance metric (cosine for shape, Euclidean for energy, Manhattan for robustness)

4. Validate k via cross-validation (start with k=5, try k=1,3,5,7,11)

5. Report confusion matrix, not just accuracy (class imbalance can mislead)

Richard Hamming (1950): "The purpose of computing is insight, not numbers." — The Hamming distance (for binary signals) was designed to detect and correct errors in transmitted data. Distance as a tool for reliability.
You have 1000 training signals of length 8000. Using the DFT, you select only 50 frequency bins that contain most of the discriminative information. How does this help k-NN?