EE269 Lecture 9 — Distance-Based Classification

Chapter 0: The Classification Problem

You have 100 recordings of the word "yes" and 100 recordings of the word "no." A new recording arrives. Is it "yes" or "no"?

Each recording is a signal — a vector of, say, 8000 samples (0.5 seconds at 16 kHz). You need an algorithm that looks at the new signal and assigns it to one of two classes. This is binary classification.

The simplest possible approach: compare the new signal to every training signal. Find the one it's most similar to. Assign the same label. This is the Nearest Neighbor classifier — surprisingly powerful and requires zero model training.

The key question: What does "similar" mean for signals? Two recordings of "yes" by different speakers look completely different as waveforms. But they sound the same. We need a distance measure that captures perceptual similarity, not sample-by-sample identity.

Formal Setup

Training set: m labeled signals {(x₁, y₁), (x₂, y₂), ..., (x_m, y_m)} where:

• x_i ∈ ℝ^N is a signal (vector of N samples)

• y_i ∈ {0, 1} is its class label

Test signal: x_new ∈ ℝ^N. Goal: predict y_new.

Why Distance?

Classification algorithms range from simple (nearest neighbor) to complex (deep neural networks). But at their core, most rely on some notion of distance or similarity:

• Nearest neighbor: explicitly uses distance

• SVM: finds the hyperplane maximizing distance to nearest points

• Neural networks: learn a representation where classes are far apart

Understanding distance metrics for signals is foundational. The choice of metric can matter more than the classifier itself.

Two-Class Signal Classification

Class A (teal) and Class B (orange) signals. The test signal (white) gets classified based on which class is "closer." Click to see different representations.

What is the main challenge in classifying audio signals by distance?

Signals are too short to compare Perceptually similar signals (same word, different speaker) may have large sample-by-sample differences Computers cannot store audio signals

Chapter 1: Distance Metrics

A distance metric d(x, y) is a function that measures how "far apart" two vectors are. To be a valid metric, it must satisfy four properties:

1. Non-negativity: d(x, y) ≥ 0

2. Identity: d(x, y) = 0 if and only if x = y

3. Symmetry: d(x, y) = d(y, x)

4. Triangle inequality: d(x, z) ≤ d(x, y) + d(y, z)

Euclidean Distance (L²)

The most natural distance — straight-line distance in N-dimensional space:

d₂(x, y) = √(∑_n=0^N-1 |x[n] - y[n]|²) = ||x - y||₂

For signals, this is the root-mean-square difference. Two signals that differ by loud bursts at a few samples can have large Euclidean distance even if they sound similar overall.

Physical meaning: ||x||₂² = ∑|x[n]|² is the signal energy. So Euclidean distance = square root of the energy of the difference signal. If d₂(x,y) is small, then x-y is a low-energy signal (nearly zero everywhere).

Manhattan Distance (L¹)

d₁(x, y) = ∑_n=0^N-1 |x[n] - y[n]|

Sum of absolute differences. More robust to outliers than Euclidean — a single large difference at one sample doesn't dominate as much (linear vs. quadratic weighting).

Chebyshev Distance (L^∞)

d_∞(x, y) = max_n |x[n] - y[n]|

The maximum absolute difference across all samples. Only cares about the worst-case sample. Useful when you need to guarantee that NO sample deviates by more than a threshold.

General L^p Distance

d_p(x, y) = (∑_n=0^N-1 |x[n] - y[n]|^p)^1/p

Euclidean is p=2, Manhattan is p=1, Chebyshev is p→∞. As p increases, larger differences are penalized more heavily.

Comparison

Metric	Formula	Sensitive To	Use When
L² (Euclidean)	√(∑\|x-y\|²)	Overall energy difference	Signals are aligned, similar amplitude
L¹ (Manhattan)	∑\|x-y\|	Average absolute difference	Robust to outlier samples
L^∞ (Chebyshev)	max\|x-y\|	Worst-case sample	Bounded-error constraints

Distance Metric Comparison

Two signals shown. The "distance" under each metric is visualized differently. L2 emphasizes large deviations; L1 treats all equally; Linf only sees the max.

Signal x = [0, 0, 10, 0, 0] and signal y = [2, 2, 2, 2, 2]. Which metric gives the SMALLEST distance between them?

Euclidean (L2) Manhattan (L1) Chebyshev (L∞) — max difference is 8, while L1=16 and L2≈9.2

Chapter 2: Nearest Neighbor Classifier

The Nearest Neighbor (NN) classifier is the simplest classification algorithm possible. It requires no training phase — just store the training data and compare at test time.

Algorithm

Input

Test signal x_new, training set {(x_i, y_i)}

↓

Compute distances

d_i = d(x_new, x_i) for all i = 1,...,m

↓

Find nearest

i* = argmin_i d_i

↓

Classify

ŷ_new = y_i* (label of closest training signal)

Properties

No training: The "model" IS the training data. No parameters to learn, no optimization. Just store and compare.

Nonparametric: Makes no assumptions about the shape of the decision boundary. Can capture arbitrarily complex boundaries.

Consistent: As training set size m → ∞, the NN error rate converges to at most twice the optimal (Bayes) error rate. This is a remarkable guarantee for such a simple algorithm.

Expensive at test time: Must compute distance to ALL m training signals. For large m, this is O(mN) per test sample. (Compare to a trained model which is O(1) to evaluate.)

In Code

python
import numpy as np

def nearest_neighbor(x_new, X_train, y_train):
    """Classify x_new using 1-NN.
    X_train: (m, N) array of training signals
    y_train: (m,) array of labels {0, 1}
    """
    distances = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    nearest_idx = np.argmin(distances)
    return y_train[nearest_idx]

Worked Example

Training set (2D features for simplicity):

• x₁ = [1, 2], y₁ = 0 (class A)

• x₂ = [3, 1], y₂ = 0 (class A)

• x₃ = [6, 5], y₃ = 1 (class B)

• x₄ = [7, 4], y₄ = 1 (class B)

Test point: x_new = [4, 3]

Distances (Euclidean):

d(x_new, x₁) = √((4-1)² + (3-2)²) = √10 ≈ 3.16

d(x_new, x₂) = √((4-3)² + (3-1)²) = √5 ≈ 2.24 ← nearest!

d(x_new, x₃) = √((4-6)² + (3-5)²) = √8 ≈ 2.83

d(x_new, x₄) = √((4-7)² + (3-4)²) = √10 ≈ 3.16

Nearest neighbor is x₂ with label 0 → classify x_new as class A.

Nearest Neighbor Classifier

Click anywhere to place a test point. It gets classified by the label of its nearest training point. Teal = class A, orange = class B.

Click to classify a test point.

What is the main disadvantage of the 1-NN classifier?

It's sensitive to noise — a single mislabeled or outlier training point can cause misclassification It requires a long training phase It can only handle two classes

Chapter 3: Inner Products & Norms

Distance metrics are built from norms, and norms (at least the L² norm) come from inner products. Understanding this hierarchy gives us powerful tools for signal comparison.

The Inner Product

For discrete signals x[n] and y[n] of length N, the inner product is:

<x, y> = ∑_n=0^N-1 x[n] · y*[n]

where y*[n] is the complex conjugate of y[n]. For real signals, this simplifies to the dot product: <x, y> = ∑ x[n] y[n].

What the inner product measures: It quantifies the "overlap" or "similarity" between two signals. If x and y point in the same direction (proportional), <x,y> is large and positive. If they're orthogonal (uncorrelated), <x,y> = 0. If they're opposite, <x,y> is large and negative.

The Norm (from the Inner Product)

The L² norm (Euclidean length) of a signal is derived from the inner product:

||x||₂ = √(<x, x>) = √(∑_n=0^N-1 |x[n]|²)

This equals the square root of the signal's energy. A loud signal has large norm; a quiet signal has small norm.

Distance from Norm

The Euclidean distance is just the norm of the difference:

d(x, y) = ||x - y||₂ = √(<x-y, x-y>)

Expanding this reveals a connection to the inner product:

||x - y||² = ||x||² + ||y||² - 2<x, y>

So distance is small when <x,y> is large relative to the norms — when the signals are well-correlated.

Cosine Similarity

To remove amplitude dependence and compare only "shape," we normalize:

cos(θ) = <x, y> / (||x|| · ||y||)

This equals 1 when signals are identical (up to scaling), 0 when orthogonal, -1 when opposite. It's amplitude-invariant: cos(2x, y) = cos(x, y).

Worked Example: Cosine vs. Euclidean

Signal a = [1, 1, 0, 0] and signal b = [3, 3, 0, 0] (same direction, different loudness).

• Euclidean: ||a-b|| = √(4+4) = 2√2 ≈ 2.83 (says "quite different")

• Cosine: <a,b>/(||a|| · ||b||) = 6/(√2 · 3√2) = 6/6 = 1.0 (says "identical shape")

Signal c = [0, 0, 1, 1] (orthogonal direction, same energy as a).

• Euclidean: ||a-c|| = √(1+1+1+1) = 2 (says "moderately different")

• Cosine: <a,c>/(||a|| · ||c||) = 0/(√2 · √2) = 0 (says "completely different shape")

Lesson: Euclidean distance is sensitive to amplitude. Cosine similarity only compares direction (shape). For audio classification where volume varies, cosine is often better.

Code: Inner Product and Friends

python
import numpy as np

def inner_product(x, y):
    """Inner product of two signals."""
    return np.sum(x * np.conj(y))

def norm(x):
    """L2 norm = sqrt(energy)."""
    return np.sqrt(np.sum(np.abs(x)**2))

def cosine_similarity(x, y):
    """Cosine of angle between signals."""
    return inner_product(x, y).real / (norm(x) * norm(y))

Inner Product Geometry

Two 2D vectors. Drag the angle slider to see how inner product, norm of difference, and cosine change. The inner product equals the projection of one onto the other times its length.

Angle 45°

If ||x - y||² = ||x||² + ||y||² - 2<x,y>, what does it mean when <x,y> = 0?

The distance equals √(||x||² + ||y||²) — the Pythagorean theorem (x and y are orthogonal) The signals are identical The distance is zero

Chapter 4: Hilbert Spaces & Parseval

We've been computing distances between signals in the time domain. But we also have the DFT. Can we compute distances in the frequency domain instead? Remarkably, the answer is yes — and the distance is IDENTICAL.

Hilbert Space

A Hilbert space is a vector space equipped with an inner product that is "complete" (every Cauchy sequence converges). The space of finite-length discrete signals ℝ^N with the inner product <x,y> = ∑ x[n]y*[n] is a Hilbert space.

Why does this matter? Hilbert spaces have a powerful property: they support orthonormal bases (like the DFT basis), and the geometry is preserved when you change basis.

Parseval's Theorem

Parseval's theorem states that the DFT preserves the inner product (up to a constant):

∑_n=0^N-1 |x[n]|² = (1/N) ∑_k=0^N-1 |X[k]|²

More generally, for two signals:

∑_n=0^N-1 x[n] y*[n] = (1/N) ∑_k=0^N-1 X[k] Y*[k]

What Parseval means for classification: The Euclidean distance between two signals is THE SAME whether you compute it in the time domain or the frequency domain: ||x - y||²_time = (1/N) ||X - Y||²_freq. You can classify signals using their DFT representations and get identical results!

Why This Is Powerful

If distance is the same in both domains, why bother with frequency? Because:

• Feature selection: In frequency domain, you can ignore irrelevant bands (e.g., only compare 300–3000 Hz for speech)

• Robustness: Time shifts change the time-domain representation but only rotate the phase in frequency (magnitude is shift-invariant)

• Dimensionality reduction: Most signal energy is in a few frequency bins — compute distance using only those bins

Proof of Parseval's Theorem

The DFT can be written as a matrix multiply: X = F·x, where F is the N×N DFT matrix with entries F[k,n] = e^-j2πkn/N. The key property of F:

F^H F = N · I

(F is N times a unitary matrix.) Therefore:

||X||² = X^HX = (Fx)^H(Fx) = x^HF^HFx = x^H(NI)x = N||x||²

Dividing both sides by N gives Parseval's theorem. The DFT is an isometry (distance-preserving map) up to the scale factor √N.

Worked Example

Signal: x = [1, -1, 1, -1] (N=4). DFT: X = [0, 0, 4, 0].

Time-domain energy: 1 + 1 + 1 + 1 = 4.

Freq-domain energy: (1/4)(0 + 0 + 16 + 0) = 4. ✓

Signal: y = [1, 1, 1, 1]. DFT: Y = [4, 0, 0, 0].

Time-domain distance: ||x-y||² = |0|² + |-2|² + |0|² + |-2|² = 8.

Freq-domain distance: (1/4)(|-4|² + |0|² + |4|² + |0|²) = (1/4)(16 + 16) = 8. ✓

Parseval: Distance Equivalence (Time vs. Frequency)

Two signals compared. Left: time-domain distance. Right: frequency-domain distance. They're always equal (up to 1/N scaling). Change the signals to verify.

Parseval's theorem says that the DFT preserves:

The signal's phase The inner product (and therefore distances) between signals, up to a scale of 1/N The signal's time ordering

Chapter 5: k-NN & Decision Boundaries

The 1-NN classifier is sensitive to noise — a single mislabeled training point can flip a decision. The k-Nearest Neighbors (k-NN) algorithm fixes this by voting among the k closest training points.

Algorithm

1. Compute distances d(x_new, x_i) for all training points

2. Find the k smallest distances

3. Among those k neighbors, take a majority vote of their labels

4. Assign the majority label to x_new

ŷ_new = argmax_c ∑_{i ∈ N_k(x_new)} δ(y_i = c)

where N_k(x_new) is the set of k nearest neighbors.

Choosing k

• k = 1: Most sensitive to noise. Jagged decision boundary.

• k = 5–10: Smoother boundary. More robust.

• k = m: Always predicts the majority class (useless).

• Rule of thumb: k ≈ √m (square root of training set size). Use odd k for binary classification (avoids ties).

The bias-variance tradeoff: Small k = low bias (flexible boundary) but high variance (noisy). Large k = high bias (smooth boundary) but low variance (stable). Cross-validation picks the best k.

Decision Boundaries

The decision boundary is the set of points where the classifier switches from one class to another. For 1-NN, the boundary is the Voronoi diagram — the set of points equidistant from two training points of different classes.

For k-NN with k > 1, boundaries are smoother (averaging effect of multiple voters) but can be non-convex and disconnected.

Weighted k-NN

A refinement: weight each neighbor's vote by the inverse of its distance. Closer neighbors get more say:

ŷ = argmax_c ∑_{i ∈ N_k} w_i · δ(y_i = c), where w_i = 1/d(x, x_i)

This helps when the k neighbors span a large range of distances. Without weighting, a distant neighbor has the same vote as the nearest one. With weighting, the nearest neighbor dominates — a smooth interpolation between 1-NN (k=1) and unweighted k-NN.

Computational Tricks

For large training sets, brute-force k-NN is expensive (O(mN) per query). Practical speedups:

• KD-trees: Partition space hierarchically. O(log m) average query time in low dimensions.

• Ball trees: Better for high dimensions (d > 20). Partitions into hyperspheres.

• Approximate NN: Locality-Sensitive Hashing (LSH) gives O(1) approximate queries.

• Dimensionality reduction: Project to lower dimensions first (PCA, DFT, random projection).

In Code

python
import numpy as np

def knn_classify(x_new, X_train, y_train, k=5):
    """k-NN classifier."""
    distances = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    nearest_k = np.argsort(distances)[:k]
    labels = y_train[nearest_k]
    # Majority vote
    return np.bincount(labels).argmax()

k-NN Decision Boundaries

Adjust k to see how the decision boundary changes. k=1 is jagged (Voronoi). Higher k smooths the boundary. Background color = predicted class at that point.

k 1

Why should k be odd for binary classification?

Odd numbers compute faster Odd k guarantees a majority vote winner with no ties (since exactly one class must have >k/2 votes) Odd k is always more accurate

Chapter 6: Template Matching & Cross-Correlation

Sometimes you don't want to classify between arbitrary classes — you want to find WHERE a known pattern occurs in a longer signal. This is template matching, and it's powered by the cross-correlation function.

Cross-Correlation

The cross-correlation of signals x[n] and template t[n] (length L) at lag τ:

R_xt[τ] = ∑_n=0^L-1 x[n + τ] · t[n]

This slides the template t across x and computes the inner product at each position. Where R_xt[τ] is large, the template matches well.

Think of it this way: Cross-correlation asks "how much does x look like t at position τ?" It's the inner product between the template and the corresponding segment of x. Peak in R_xt = best match location.

Normalized Cross-Correlation

Raw cross-correlation is biased by amplitude. A loud section of x will always have high correlation regardless of shape. Fix: normalize by energy:

NCC[τ] = (∑ x[n+τ] · t[n]) / (||x_τ|| · ||t||)

where ||x_τ|| is the norm of the segment x[τ..τ+L-1]. NCC ∈ [-1, 1], with 1 = perfect match.

Fast Correlation via DFT

Direct computation of R_xt for all lags is O(N·L). But the correlation theorem says:

R_xt = IDFT{X[k] · T*[k]}

Cross-correlation in time = pointwise multiply (with conjugate) in frequency. Using FFT, this is O(N log N) — much faster for long signals.

In Code

python
import numpy as np

def template_match(signal, template):
    """Find template location via FFT-based cross-correlation."""
    N = len(signal)
    L = len(template)
    # Zero-pad to avoid circular correlation
    nfft = N + L - 1
    X = np.fft.fft(signal, n=nfft)
    T = np.fft.fft(template, n=nfft)
    R = np.fft.ifft(X * np.conj(T)).real
    # Peak location = best match position
    best_lag = np.argmax(R[:N - L + 1])
    return best_lag, R

Applications

• Keyword spotting: Store templates of "hey Siri" / "OK Google," cross-correlate with audio stream

• Radar: Transmit known pulse, cross-correlate received signal to find echo delay (= target distance)

• Music synchronization: Find where a clip occurs in a longer recording

• Image registration: Find where a template patch matches in a larger image (same math, 2D)

Autocorrelation: Matching a Signal with Itself

A special case: cross-correlate a signal with ITSELF. This is the autocorrelation:

R_xx[τ] = ∑_n x[n] · x[n + τ]

The autocorrelation is always maximized at lag τ = 0 (a signal matches itself perfectly at zero delay). For periodic signals, R_xx has peaks at the period. This is the basis of pitch detection in speech processing:

1. Compute autocorrelation of speech frame

2. Find the first peak after τ = 0

3. Period T = position of that peak, frequency f₀ = f_s/T

Correlation vs. Convolution

Cross-correlation and convolution are closely related:

Correlation: R_xt[τ] = ∑ x[n+τ] · t[n]

Convolution: (x * t)[τ] = ∑ x[n] · t[τ-n]

The difference: convolution flips the template before sliding. Correlation does not. For symmetric templates (like Gaussians), they're identical. In the DFT domain:

• Correlation: IDFT{X · T*} (conjugate of template)

• Convolution: IDFT{X · T} (no conjugate)

Template Matching via Cross-Correlation

A template (orange) is hidden somewhere in the signal (teal). The cross-correlation (bottom) peaks at the match location.

Why is FFT-based cross-correlation faster than direct computation?

The FFT uses less memory Correlation in time becomes pointwise multiplication in frequency (O(N log N) vs O(N·L)) The FFT compresses the signal first

Chapter 7: Mastery

Let's consolidate the key ideas from this lecture and see how they connect to modern signal processing and machine learning.

The Big Picture

Raw Signals

x[n] ∈ ℝ^N — high-dimensional

↓

Feature Extraction

DFT, spectral descriptors, etc. → compact representation

↓

Distance Computation

d(x, y) in time or frequency domain (Parseval: same result)

↓

Classification

k-NN, template matching, or more complex classifiers

Summary Table

Concept	Formula	Use
Euclidean dist.	\|\|x-y\|\|₂	General-purpose signal comparison
Inner product	<x,y> = ∑x[n]y*[n]	Similarity, projection, correlation
Cosine sim.	<x,y>/(\|\|x\|\|·\|\|y\|\|)	Shape comparison (amplitude-invariant)
Parseval	\|\|x\|\|² = (1/N)\|\|X\|\|²	Compute distance in freq domain
k-NN	Majority of k nearest	Robust nonparametric classification
Cross-correlation	IDFT{X·T*}	Template matching, delay estimation

From k-NN to Deep Learning

Modern classifiers (neural networks) can be viewed as learning a representation where k-NN would work perfectly. The network maps signals into a space where same-class signals are close and different-class signals are far apart. The final linear layer is essentially a 1-NN classifier in the learned feature space.

This is the insight behind contrastive learning (SimCLR, CLIP): train a network to minimize distance between same-class pairs and maximize distance between different-class pairs.

The Curse of Dimensionality

k-NN works well in low dimensions but degrades as dimensionality increases. In high dimensions (N > 100), counterintuitive things happen:

• All points become roughly equidistant from each other (distances concentrate around the mean)

• The "nearest" neighbor is barely closer than the "farthest" — the concept of "nearest" loses meaning

• To maintain coverage, you need exponentially more training data as N grows

This is why feature extraction (DFT, spectral descriptors, PCA) is crucial: reduce dimensionality to the 10–50 most informative features before applying k-NN. Parseval's theorem guarantees we lose no information going to the frequency domain, and then we can select the most discriminative frequency bins.

Error Bounds for k-NN

Cover & Hart (1967) proved a remarkable theorem: as m → ∞, the error rate of the 1-NN classifier R_NN satisfies:

R* ≤ R_NN ≤ 2R* (1 - R*)

where R* is the Bayes-optimal error rate (the best any classifier can achieve). For the k-NN classifier with k → ∞ but k/m → 0, the error rate converges to R* itself. So k-NN is universally consistent — it achieves optimal performance given enough data, regardless of the underlying distribution.

Connections

• Lecture 7 (Spectral Descriptors): Features that make distance meaningful for audio

• Lecture 8 (STFT): Time-varying features for classification of nonstationary signals

• Machine Learning: k-NN is the starting point; SVMs, random forests, and neural nets all refine the idea of "distance in feature space"

Practical Checklist for Signal Classification

When building a distance-based signal classifier:

1. Normalize signals (zero mean, unit variance) to remove amplitude/offset bias

2. Extract features (DFT, spectral descriptors, MFCCs) to reduce dimensionality

3. Choose distance metric (cosine for shape, Euclidean for energy, Manhattan for robustness)

4. Validate k via cross-validation (start with k=5, try k=1,3,5,7,11)

5. Report confusion matrix, not just accuracy (class imbalance can mislead)

Richard Hamming (1950): "The purpose of computing is insight, not numbers." — The Hamming distance (for binary signals) was designed to detect and correct errors in transmitted data. Distance as a tool for reliability.

You have 1000 training signals of length 8000. Using the DFT, you select only 50 frequency bins that contain most of the discriminative information. How does this help k-NN?

It reduces the distance computation from O(8000) to O(50) per pair, AND removes noisy dimensions that hurt classification It doesn't help — Parseval says the distance is the same It only speeds up computation without affecting accuracy

Distance-Based Classification

Chapter 0: The Classification Problem

Formal Setup

Why Distance?

Chapter 1: Distance Metrics

Euclidean Distance (L2)

Manhattan Distance (L1)

Chebyshev Distance (L∞)

General Lp Distance

Comparison

Chapter 2: Nearest Neighbor Classifier

Algorithm

Properties

In Code

Worked Example

Chapter 3: Inner Products & Norms

The Inner Product

The Norm (from the Inner Product)

Distance from Norm

Cosine Similarity

Worked Example: Cosine vs. Euclidean

Code: Inner Product and Friends

Chapter 4: Hilbert Spaces & Parseval

Hilbert Space

Parseval's Theorem

Why This Is Powerful

Proof of Parseval's Theorem

Worked Example

Chapter 5: k-NN & Decision Boundaries

Algorithm

Choosing k

Decision Boundaries

Weighted k-NN

Computational Tricks

In Code

Chapter 6: Template Matching & Cross-Correlation

Cross-Correlation

Normalized Cross-Correlation

Fast Correlation via DFT

In Code

Applications

Autocorrelation: Matching a Signal with Itself

Correlation vs. Convolution

Chapter 7: Mastery

The Big Picture

Summary Table

From k-NN to Deep Learning

The Curse of Dimensionality

Error Bounds for k-NN

Connections

Practical Checklist for Signal Classification

Euclidean Distance (L²)

Manhattan Distance (L¹)

Chebyshev Distance (L^∞)

General L^p Distance