Classifying signals by measuring how "close" they are — nearest neighbors, inner products, and Hilbert spaces.
You have 100 recordings of the word "yes" and 100 recordings of the word "no." A new recording arrives. Is it "yes" or "no"?
Each recording is a signal — a vector of, say, 8000 samples (0.5 seconds at 16 kHz). You need an algorithm that looks at the new signal and assigns it to one of two classes. This is binary classification.
The simplest possible approach: compare the new signal to every training signal. Find the one it's most similar to. Assign the same label. This is the Nearest Neighbor classifier — surprisingly powerful and requires zero model training.
Training set: m labeled signals {(x1, y1), (x2, y2), ..., (xm, ym)} where:
• xi ∈ ℝN is a signal (vector of N samples)
• yi ∈ {0, 1} is its class label
Test signal: xnew ∈ ℝN. Goal: predict ynew.
Classification algorithms range from simple (nearest neighbor) to complex (deep neural networks). But at their core, most rely on some notion of distance or similarity:
• Nearest neighbor: explicitly uses distance
• SVM: finds the hyperplane maximizing distance to nearest points
• Neural networks: learn a representation where classes are far apart
Understanding distance metrics for signals is foundational. The choice of metric can matter more than the classifier itself.
Class A (teal) and Class B (orange) signals. The test signal (white) gets classified based on which class is "closer." Click to see different representations.
A distance metric d(x, y) is a function that measures how "far apart" two vectors are. To be a valid metric, it must satisfy four properties:
1. Non-negativity: d(x, y) ≥ 0
2. Identity: d(x, y) = 0 if and only if x = y
3. Symmetry: d(x, y) = d(y, x)
4. Triangle inequality: d(x, z) ≤ d(x, y) + d(y, z)
The most natural distance — straight-line distance in N-dimensional space:
For signals, this is the root-mean-square difference. Two signals that differ by loud bursts at a few samples can have large Euclidean distance even if they sound similar overall.
Sum of absolute differences. More robust to outliers than Euclidean — a single large difference at one sample doesn't dominate as much (linear vs. quadratic weighting).
The maximum absolute difference across all samples. Only cares about the worst-case sample. Useful when you need to guarantee that NO sample deviates by more than a threshold.
Euclidean is p=2, Manhattan is p=1, Chebyshev is p→∞. As p increases, larger differences are penalized more heavily.
| Metric | Formula | Sensitive To | Use When |
|---|---|---|---|
| L2 (Euclidean) | √(∑|x-y|2) | Overall energy difference | Signals are aligned, similar amplitude |
| L1 (Manhattan) | ∑|x-y| | Average absolute difference | Robust to outlier samples |
| L∞ (Chebyshev) | max|x-y| | Worst-case sample | Bounded-error constraints |
Two signals shown. The "distance" under each metric is visualized differently. L2 emphasizes large deviations; L1 treats all equally; Linf only sees the max.
The Nearest Neighbor (NN) classifier is the simplest classification algorithm possible. It requires no training phase — just store the training data and compare at test time.
No training: The "model" IS the training data. No parameters to learn, no optimization. Just store and compare.
Nonparametric: Makes no assumptions about the shape of the decision boundary. Can capture arbitrarily complex boundaries.
Consistent: As training set size m → ∞, the NN error rate converges to at most twice the optimal (Bayes) error rate. This is a remarkable guarantee for such a simple algorithm.
Expensive at test time: Must compute distance to ALL m training signals. For large m, this is O(mN) per test sample. (Compare to a trained model which is O(1) to evaluate.)
python import numpy as np def nearest_neighbor(x_new, X_train, y_train): """Classify x_new using 1-NN. X_train: (m, N) array of training signals y_train: (m,) array of labels {0, 1} """ distances = np.sqrt(((X_train - x_new) ** 2).sum(axis=1)) nearest_idx = np.argmin(distances) return y_train[nearest_idx]
Training set (2D features for simplicity):
• x1 = [1, 2], y1 = 0 (class A)
• x2 = [3, 1], y2 = 0 (class A)
• x3 = [6, 5], y3 = 1 (class B)
• x4 = [7, 4], y4 = 1 (class B)
Test point: xnew = [4, 3]
Distances (Euclidean):
Nearest neighbor is x2 with label 0 → classify xnew as class A.
Click anywhere to place a test point. It gets classified by the label of its nearest training point. Teal = class A, orange = class B.
Distance metrics are built from norms, and norms (at least the L2 norm) come from inner products. Understanding this hierarchy gives us powerful tools for signal comparison.
For discrete signals x[n] and y[n] of length N, the inner product is:
where y*[n] is the complex conjugate of y[n]. For real signals, this simplifies to the dot product: <x, y> = ∑ x[n] y[n].
The L2 norm (Euclidean length) of a signal is derived from the inner product:
This equals the square root of the signal's energy. A loud signal has large norm; a quiet signal has small norm.
The Euclidean distance is just the norm of the difference:
Expanding this reveals a connection to the inner product:
So distance is small when <x,y> is large relative to the norms — when the signals are well-correlated.
To remove amplitude dependence and compare only "shape," we normalize:
This equals 1 when signals are identical (up to scaling), 0 when orthogonal, -1 when opposite. It's amplitude-invariant: cos(2x, y) = cos(x, y).
Signal a = [1, 1, 0, 0] and signal b = [3, 3, 0, 0] (same direction, different loudness).
• Euclidean: ||a-b|| = √(4+4) = 2√2 ≈ 2.83 (says "quite different")
• Cosine: <a,b>/(||a|| · ||b||) = 6/(√2 · 3√2) = 6/6 = 1.0 (says "identical shape")
Signal c = [0, 0, 1, 1] (orthogonal direction, same energy as a).
• Euclidean: ||a-c|| = √(1+1+1+1) = 2 (says "moderately different")
• Cosine: <a,c>/(||a|| · ||c||) = 0/(√2 · √2) = 0 (says "completely different shape")
Lesson: Euclidean distance is sensitive to amplitude. Cosine similarity only compares direction (shape). For audio classification where volume varies, cosine is often better.
python import numpy as np def inner_product(x, y): """Inner product of two signals.""" return np.sum(x * np.conj(y)) def norm(x): """L2 norm = sqrt(energy).""" return np.sqrt(np.sum(np.abs(x)**2)) def cosine_similarity(x, y): """Cosine of angle between signals.""" return inner_product(x, y).real / (norm(x) * norm(y))
Two 2D vectors. Drag the angle slider to see how inner product, norm of difference, and cosine change. The inner product equals the projection of one onto the other times its length.
We've been computing distances between signals in the time domain. But we also have the DFT. Can we compute distances in the frequency domain instead? Remarkably, the answer is yes — and the distance is IDENTICAL.
A Hilbert space is a vector space equipped with an inner product that is "complete" (every Cauchy sequence converges). The space of finite-length discrete signals ℝN with the inner product <x,y> = ∑ x[n]y*[n] is a Hilbert space.
Why does this matter? Hilbert spaces have a powerful property: they support orthonormal bases (like the DFT basis), and the geometry is preserved when you change basis.
Parseval's theorem states that the DFT preserves the inner product (up to a constant):
More generally, for two signals:
If distance is the same in both domains, why bother with frequency? Because:
• Feature selection: In frequency domain, you can ignore irrelevant bands (e.g., only compare 300–3000 Hz for speech)
• Robustness: Time shifts change the time-domain representation but only rotate the phase in frequency (magnitude is shift-invariant)
• Dimensionality reduction: Most signal energy is in a few frequency bins — compute distance using only those bins
The DFT can be written as a matrix multiply: X = F·x, where F is the N×N DFT matrix with entries F[k,n] = e-j2πkn/N. The key property of F:
(F is N times a unitary matrix.) Therefore:
Dividing both sides by N gives Parseval's theorem. The DFT is an isometry (distance-preserving map) up to the scale factor √N.
Signal: x = [1, -1, 1, -1] (N=4). DFT: X = [0, 0, 4, 0].
Time-domain energy: 1 + 1 + 1 + 1 = 4.
Freq-domain energy: (1/4)(0 + 0 + 16 + 0) = 4. ✓
Signal: y = [1, 1, 1, 1]. DFT: Y = [4, 0, 0, 0].
Time-domain distance: ||x-y||2 = |0|2 + |-2|2 + |0|2 + |-2|2 = 8.
Freq-domain distance: (1/4)(|-4|2 + |0|2 + |4|2 + |0|2) = (1/4)(16 + 16) = 8. ✓
Two signals compared. Left: time-domain distance. Right: frequency-domain distance. They're always equal (up to 1/N scaling). Change the signals to verify.
The 1-NN classifier is sensitive to noise — a single mislabeled training point can flip a decision. The k-Nearest Neighbors (k-NN) algorithm fixes this by voting among the k closest training points.
1. Compute distances d(xnew, xi) for all training points
2. Find the k smallest distances
3. Among those k neighbors, take a majority vote of their labels
4. Assign the majority label to xnew
where Nk(xnew) is the set of k nearest neighbors.
• k = 1: Most sensitive to noise. Jagged decision boundary.
• k = 5–10: Smoother boundary. More robust.
• k = m: Always predicts the majority class (useless).
• Rule of thumb: k ≈ √m (square root of training set size). Use odd k for binary classification (avoids ties).
The decision boundary is the set of points where the classifier switches from one class to another. For 1-NN, the boundary is the Voronoi diagram — the set of points equidistant from two training points of different classes.
For k-NN with k > 1, boundaries are smoother (averaging effect of multiple voters) but can be non-convex and disconnected.
A refinement: weight each neighbor's vote by the inverse of its distance. Closer neighbors get more say:
This helps when the k neighbors span a large range of distances. Without weighting, a distant neighbor has the same vote as the nearest one. With weighting, the nearest neighbor dominates — a smooth interpolation between 1-NN (k=1) and unweighted k-NN.
For large training sets, brute-force k-NN is expensive (O(mN) per query). Practical speedups:
• KD-trees: Partition space hierarchically. O(log m) average query time in low dimensions.
• Ball trees: Better for high dimensions (d > 20). Partitions into hyperspheres.
• Approximate NN: Locality-Sensitive Hashing (LSH) gives O(1) approximate queries.
• Dimensionality reduction: Project to lower dimensions first (PCA, DFT, random projection).
python import numpy as np def knn_classify(x_new, X_train, y_train, k=5): """k-NN classifier.""" distances = np.sqrt(((X_train - x_new) ** 2).sum(axis=1)) nearest_k = np.argsort(distances)[:k] labels = y_train[nearest_k] # Majority vote return np.bincount(labels).argmax()
Adjust k to see how the decision boundary changes. k=1 is jagged (Voronoi). Higher k smooths the boundary. Background color = predicted class at that point.
Sometimes you don't want to classify between arbitrary classes — you want to find WHERE a known pattern occurs in a longer signal. This is template matching, and it's powered by the cross-correlation function.
The cross-correlation of signals x[n] and template t[n] (length L) at lag τ:
This slides the template t across x and computes the inner product at each position. Where Rxt[τ] is large, the template matches well.
Raw cross-correlation is biased by amplitude. A loud section of x will always have high correlation regardless of shape. Fix: normalize by energy:
where ||xτ|| is the norm of the segment x[τ..τ+L-1]. NCC ∈ [-1, 1], with 1 = perfect match.
Direct computation of Rxt for all lags is O(N·L). But the correlation theorem says:
Cross-correlation in time = pointwise multiply (with conjugate) in frequency. Using FFT, this is O(N log N) — much faster for long signals.
python import numpy as np def template_match(signal, template): """Find template location via FFT-based cross-correlation.""" N = len(signal) L = len(template) # Zero-pad to avoid circular correlation nfft = N + L - 1 X = np.fft.fft(signal, n=nfft) T = np.fft.fft(template, n=nfft) R = np.fft.ifft(X * np.conj(T)).real # Peak location = best match position best_lag = np.argmax(R[:N - L + 1]) return best_lag, R
• Keyword spotting: Store templates of "hey Siri" / "OK Google," cross-correlate with audio stream
• Radar: Transmit known pulse, cross-correlate received signal to find echo delay (= target distance)
• Music synchronization: Find where a clip occurs in a longer recording
• Image registration: Find where a template patch matches in a larger image (same math, 2D)
A special case: cross-correlate a signal with ITSELF. This is the autocorrelation:
The autocorrelation is always maximized at lag τ = 0 (a signal matches itself perfectly at zero delay). For periodic signals, Rxx has peaks at the period. This is the basis of pitch detection in speech processing:
1. Compute autocorrelation of speech frame
2. Find the first peak after τ = 0
3. Period T = position of that peak, frequency f0 = fs/T
Cross-correlation and convolution are closely related:
The difference: convolution flips the template before sliding. Correlation does not. For symmetric templates (like Gaussians), they're identical. In the DFT domain:
• Correlation: IDFT{X · T*} (conjugate of template)
• Convolution: IDFT{X · T} (no conjugate)
A template (orange) is hidden somewhere in the signal (teal). The cross-correlation (bottom) peaks at the match location.
Let's consolidate the key ideas from this lecture and see how they connect to modern signal processing and machine learning.
| Concept | Formula | Use |
|---|---|---|
| Euclidean dist. | ||x-y||2 | General-purpose signal comparison |
| Inner product | <x,y> = ∑x[n]y*[n] | Similarity, projection, correlation |
| Cosine sim. | <x,y>/(||x||·||y||) | Shape comparison (amplitude-invariant) |
| Parseval | ||x||2 = (1/N)||X||2 | Compute distance in freq domain |
| k-NN | Majority of k nearest | Robust nonparametric classification |
| Cross-correlation | IDFT{X·T*} | Template matching, delay estimation |
Modern classifiers (neural networks) can be viewed as learning a representation where k-NN would work perfectly. The network maps signals into a space where same-class signals are close and different-class signals are far apart. The final linear layer is essentially a 1-NN classifier in the learned feature space.
This is the insight behind contrastive learning (SimCLR, CLIP): train a network to minimize distance between same-class pairs and maximize distance between different-class pairs.
k-NN works well in low dimensions but degrades as dimensionality increases. In high dimensions (N > 100), counterintuitive things happen:
• All points become roughly equidistant from each other (distances concentrate around the mean)
• The "nearest" neighbor is barely closer than the "farthest" — the concept of "nearest" loses meaning
• To maintain coverage, you need exponentially more training data as N grows
This is why feature extraction (DFT, spectral descriptors, PCA) is crucial: reduce dimensionality to the 10–50 most informative features before applying k-NN. Parseval's theorem guarantees we lose no information going to the frequency domain, and then we can select the most discriminative frequency bins.
Cover & Hart (1967) proved a remarkable theorem: as m → ∞, the error rate of the 1-NN classifier RNN satisfies:
where R* is the Bayes-optimal error rate (the best any classifier can achieve). For the k-NN classifier with k → ∞ but k/m → 0, the error rate converges to R* itself. So k-NN is universally consistent — it achieves optimal performance given enough data, regardless of the underlying distribution.
• Lecture 7 (Spectral Descriptors): Features that make distance meaningful for audio
• Lecture 8 (STFT): Time-varying features for classification of nonstationary signals
• Machine Learning: k-NN is the starting point; SVMs, random forests, and neural nets all refine the idea of "distance in feature space"
When building a distance-based signal classifier:
1. Normalize signals (zero mean, unit variance) to remove amplitude/offset bias
2. Extract features (DFT, spectral descriptors, MFCCs) to reduce dimensionality
3. Choose distance metric (cosine for shape, Euclidean for energy, Manhattan for robustness)
4. Validate k via cross-validation (start with k=5, try k=1,3,5,7,11)
5. Report confusion matrix, not just accuracy (class imbalance can mislead)