Where Fourier meets gradient descent — the mathematical bridge between classical DSP and modern deep learning.
You record someone saying "hey Siri." Your microphone captures 16,000 numbers per second — a raw waveform. You need a machine to recognize the word. How?
You could feed all 16,000 numbers directly into a neural network, but that is wasteful. The network would need to discover, from scratch, that sound is made of frequencies, that most of the energy that makes speech intelligible lies roughly between 300 and 3400 Hz (the classic telephone band), and that vowels have formant patterns. It would need millions of examples just to learn what a physics student knows from day one.
This is where signal processing enters. Instead of raw samples, you transform the audio into a spectrogram: a 2D image of frequency vs. time. Suddenly your ML model can see patterns that were invisible in the raw waveform. In practice, a CNN on a spectrogram can often match or beat a model trained on raw audio while needing far less training data.
Figure: a raw audio signal (top, hard to interpret) and its spectrogram (bottom, where patterns jump out), for several different signals.
Notice how in the spectrogram, speech shows clear horizontal bands (formants), music shows harmonic ladders, but noise is featureless static. The spectrogram makes structure visible — and what's visible to us is learnable by machines.
A signal is any quantity that varies over time (or space). Your voice is a signal. An image is a 2D signal. A stock price is a signal. An ECG is a signal. Formally: a function x(t) that maps time (or position) to amplitude.
In the digital world, we don't have continuous functions. We have samples — discrete measurements taken at regular intervals. Record audio at 16 kHz and you get 16,000 numbers per second. Each number is the air pressure at that instant.
Continuous signal: x(t) defined for all real t. Exists in nature (sound waves, voltage).
Discrete signal: x[n] defined only at integer indices n = 0, 1, 2, ... What computers actually store.
The sampling rate fs determines how many samples per second we take. Too low and we lose information (aliasing). Too high and we waste storage. The Nyquist-Shannon sampling theorem tells us where the line is: to avoid aliasing, fs must be greater than twice the highest frequency present in the signal.
The smooth curve is the true signal. The dots are samples. Reduce the sample rate to see aliasing.
When the sample rate drops below 2× the signal frequency, the samples trace out a different frequency — a ghost that doesn't exist in the original. This is aliasing, and it's irreversible. Once you've sampled too slowly, the information is gone forever.
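The ghost frequency can be reproduced numerically. The following is a minimal sketch with illustrative values chosen for this example (a 7 Hz sine sampled at 10 Hz); it checks that the samples are indistinguishable from those of a 3 Hz sine.

```python
import numpy as np

fs = 10.0          # sampling rate in Hz (illustrative value)
f_true = 7.0       # signal frequency, above the Nyquist limit fs / 2 = 5 Hz
n = np.arange(20)  # sample indices
t = n / fs         # sample times in seconds

samples = np.sin(2 * np.pi * f_true * t)

# The alias appears at |f_true - fs| = 3 Hz. Because 7 Hz folds around fs,
# the alias comes out with inverted sign.
f_alias = abs(f_true - fs)
alias = -np.sin(2 * np.pi * f_alias * t)

print(np.allclose(samples, alias))
```

Once only the samples are kept, nothing distinguishes the 7 Hz original from the 3 Hz alias, which is why the loss cannot be undone after the fact.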
Any finite sampled signal, no matter how complex, can be written as a sum of sinusoids at different frequencies, amplitudes, and phases. This is Fourier's insight from 1807, and it remains the most important idea in all of signal processing.
The Discrete Fourier Transform (DFT) takes N time-domain samples and produces N frequency-domain coefficients. Each coefficient X[k] is a complex number: its magnitude tells you how strongly frequency k is present in the signal, and its angle gives that component's phase.
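Under the most common convention (an assumption here, since sign and normalization conventions differ between texts), the DFT of the samples x[0], ..., x[N-1] is:

$$X[k] = \sum_{n=0}^{N-1} x[n]\, e^{-j 2\pi k n / N}, \qquad k = 0, 1, \dots, N-1$$

Bin k corresponds to k/N cycles per sample, or k·fs/N Hz at sampling rate fs.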
Don't let the complex exponential scare you. It's just asking: "how much does x[n] correlate with a sinusoid of frequency k/N?" High correlation = large |X[k]| = lots of energy at that frequency.
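The correlation view can be written out directly. The sketch below is a deliberately slow O(N²) implementation for illustration only; the function name `naive_dft` and the test signal are choices made for this example.

```python
import numpy as np

def naive_dft(x):
    """Direct O(N^2) DFT: correlate x with a complex sinusoid for each bin k."""
    x = np.asarray(x, dtype=complex)
    N = len(x)
    n = np.arange(N)
    X = np.empty(N, dtype=complex)
    for k in range(N):
        # Complex sinusoid completing k cycles over the N samples
        basis = np.exp(-2j * np.pi * k * n / N)
        X[k] = np.sum(x * basis)
    return X

N = 64
n = np.arange(N)
x = np.cos(2 * np.pi * 5 * n / N)   # exactly 5 cycles in the window

X = naive_dft(x)
peak_bin = int(np.argmax(np.abs(X[: N // 2])))
print(peak_bin)                      # bin with the largest magnitude
```

In practice you would call `np.fft.fft`, which computes the same coefficients far faster.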
The DFT gives you frequencies but throws away time. You know which frequencies exist but not when. For audio and speech, timing matters. The Short-Time Fourier Transform (STFT) solves this by chopping the signal into short overlapping windows and taking the DFT of each window.
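As a rough sketch of the windowing idea, the code below builds a magnitude STFT by hand. The frame length, hop size, Hann window, and test signal (a 100 Hz tone followed by a 300 Hz tone) are all assumptions picked for the example.

```python
import numpy as np

def stft(x, frame_len=128, hop=64):
    """Magnitude STFT: Hann-windowed frames, one real FFT per frame."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack([x[i * hop : i * hop + frame_len] * window
                       for i in range(n_frames)])
    # rfft keeps the frame_len // 2 + 1 non-negative frequency bins
    return np.abs(np.fft.rfft(frames, axis=1)).T   # shape: (bins, frames)

fs = 1000                       # Hz, illustrative
t = np.arange(1024) / fs
# First half: 100 Hz tone. Second half: 300 Hz tone.
x = np.where(t < 0.512, np.sin(2 * np.pi * 100 * t), np.sin(2 * np.pi * 300 * t))

S = stft(x)
freqs = np.fft.rfftfreq(128, d=1 / fs)
first_peak = freqs[S[:, 0].argmax()]    # dominant frequency in the first frame
last_peak = freqs[S[:, -1].argmax()]    # dominant frequency in the last frame
print(S.shape, first_peak, last_peak)
```

A single DFT of the whole signal would show both tones but not their order. The STFT columns recover it, at the cost of coarser frequency resolution (here about 7.8 Hz per bin).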
Mix up to 3 sinusoids and watch the DFT reveal them. Each spike in the magnitude spectrum corresponds to a frequency you added.
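The same experiment can be run offline. This sketch mixes three sinusoids with frequencies and amplitudes invented for the example, then reads them back from the magnitude spectrum.

```python
import numpy as np

fs = 1000                      # Hz
N = 1000                       # one second, so bin k corresponds to k Hz
t = np.arange(N) / fs
components = {50: 1.0, 120: 0.5, 300: 0.8}   # frequency (Hz) -> amplitude

x = sum(a * np.sin(2 * np.pi * f * t) for f, a in components.items())

mag = np.abs(np.fft.rfft(x)) * 2 / N   # scale so a unit sinusoid gives 1.0
top3 = sorted(int(k) for k in np.argsort(mag)[-3:])
print(top3)
print(np.round(mag[top3], 3))
```

Each frequency here completes a whole number of cycles in the window, so the spikes are clean. With non-integer cycle counts the energy would leak into neighboring bins.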
We've turned a waveform into a spectrogram. But a spectrogram is still a big matrix — hundreds of frequency bins times hundreds of time frames. We need to compress it into a small feature vector that captures the essential information for classification.
Humans don't perceive frequency linearly. The difference between 100 Hz and 200 Hz sounds huge. The difference between 8000 Hz and 8100 Hz is barely noticeable. The mel scale warps frequency to match human perception:
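One widely used form of this mapping, assumed here because several variants of the mel formula are in circulation, is m = 2595 · log10(1 + f / 700), with f in Hz. The sketch below evaluates it for the two 100 Hz steps mentioned above.

```python
import math

def hz_to_mel(f_hz):
    """Mel value for a frequency in Hz (2595 / 700 variant of the formula)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

low_gap = hz_to_mel(200) - hz_to_mel(100)      # 100 Hz step at low frequency
high_gap = hz_to_mel(8100) - hz_to_mel(8000)   # same step at high frequency
print(round(low_gap, 1), round(high_gap, 1))
```

Under this variant the low-frequency step is about ten times larger on the mel axis than the high-frequency one.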
A mel spectrogram applies triangular filter banks spaced evenly on the mel scale. This compresses 512 frequency bins down to 40 to 128 mel bands, keeping fine resolution at low frequencies, where hearing is most discriminating, and averaging away detail at high frequencies that listeners can barely tell apart.
Mel-Frequency Cepstral Coefficients (MFCCs) go one step further. Take the mel spectrogram, apply a log, then take the DCT (Discrete Cosine Transform) across the mel bands and keep only the first few coefficients. The result: typically 13 numbers per time frame that compactly describe the spectral envelope, the "shape" of the spectrum that encodes what phoneme is being spoken.
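A triangular mel filterbank can be sketched in a few lines. The helper names, the 2595 / 700 mel formula, the 1024-point FFT (513 bins), and the random stand-in spectrogram are assumptions for illustration; audio libraries differ in normalization and edge handling.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + np.asarray(f, dtype=float) / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (np.asarray(m, dtype=float) / 2595.0) - 1.0)

def mel_filterbank(n_mels, n_fft, fs):
    """Triangular filters with centers evenly spaced on the mel scale."""
    n_bins = n_fft // 2 + 1
    bin_freqs = np.linspace(0.0, fs / 2, n_bins)
    # n_mels + 2 edge points: each filter spans (left, center, right)
    edges = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(fs / 2), n_mels + 2))
    fb = np.zeros((n_mels, n_bins))
    for m in range(n_mels):
        left, center, right = edges[m], edges[m + 1], edges[m + 2]
        rising = (bin_freqs - left) / (center - left)
        falling = (right - bin_freqs) / (right - center)
        fb[m] = np.clip(np.minimum(rising, falling), 0.0, None)
    return fb

fb = mel_filterbank(n_mels=40, n_fft=1024, fs=16000)
power_spec = np.random.default_rng(0).random((513, 100))  # stand-in spectrogram
mel_spec = fb @ power_spec
print(fb.shape, mel_spec.shape)
```

The whole compression step is a single matrix product: each mel band is a weighted sum of neighboring frequency bins.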
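The log-then-DCT step can be sketched as follows. The unnormalized DCT-II written out by hand, the small `eps` guard inside the log, and the random stand-in mel spectrogram are assumptions for this example; library implementations usually add scaling and sometimes liftering.

```python
import numpy as np

def dct_ii(x, n_coeffs):
    """Unnormalized DCT-II along axis 0, keeping the first n_coeffs rows."""
    n = x.shape[0]
    k = np.arange(n_coeffs)[:, None]
    i = np.arange(n)[None, :]
    basis = np.cos(np.pi * k * (2 * i + 1) / (2 * n))   # (n_coeffs, n)
    return basis @ x

def mfcc_from_mel(mel_spec, n_coeffs=13, eps=1e-10):
    log_mel = np.log(mel_spec + eps)    # eps avoids log(0) in silent bands
    return dct_ii(log_mel, n_coeffs)

rng = np.random.default_rng(0)
mel_spec = rng.random((40, 100)) + 0.1  # stand-in: 40 mel bands, 100 frames
mfcc = mfcc_from_mel(mel_spec)
print(mfcc.shape)
```

Coefficient 0 is just the sum of the log energies in a frame. Higher coefficients describe progressively finer ripples across the mel bands, which is why truncating to the first few keeps the envelope and drops the detail.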
Signal processing + machine learning is not a niche — it's everywhere. Let's tour the major application domains where this marriage produces state-of-the-art results.
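In speech recognition, the pipeline is raw audio → mel spectrogram → neural network → text. Whisper takes a log-mel spectrogram as input, so its front end is classical signal processing. wav2vec instead learns its features directly from the raw waveform with a stack of 1D convolutions, but those convolutions still slide a short window along the signal, much as the STFT does, and the first-layer filters of such models often end up resembling band-pass filters.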
Is this sound a dog bark, a siren, or a gunshot? Environmental sound classification uses mel spectrograms as images and applies CNNs. The spectrogram is the image. A 2D convolution on a spectrogram is equivalent to a time-frequency filter — classical DSP in neural clothing.
Images are 2D signals. The 2D DFT reveals spatial frequencies: edges are high-frequency, smooth regions are low-frequency. Convolution in the spatial domain equals multiplication in the frequency domain. Each CNN filter can therefore be read as a learned spatial-frequency pattern, and first-layer filters often resemble oriented, Gabor-like edge detectors.
Stock prices, sensor readings, and weather data are all time series. Spectral analysis reveals periodicity (daily, weekly, seasonal cycles). Wavelet transforms capture multi-scale patterns. These features feed into LSTMs, Transformers, or gradient-boosted trees.
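The convolution theorem is easy to check numerically. One caveat, stated as an assumption of this sketch: the DFT version corresponds to circular convolution, where the image wraps around at its borders, whereas CNN layers typically use zero padding and compute cross-correlation. The image and kernel below are random stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)
image = rng.random((8, 8))
kernel = np.zeros((8, 8))
kernel[:3, :3] = rng.random((3, 3))   # 3x3 filter zero-padded to image size

# Frequency domain: multiply the 2D DFTs, then invert
via_fft = np.real(np.fft.ifft2(np.fft.fft2(image) * np.fft.fft2(kernel)))

# Spatial domain: circular convolution computed directly
direct = np.zeros_like(image)
for dy in range(3):
    for dx in range(3):
        direct += kernel[dy, dx] * np.roll(image, shift=(dy, dx), axis=(0, 1))

print(np.allclose(via_fft, direct))
```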
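As a small illustration of spectral periodicity detection, the sketch below builds a synthetic daily series with a weekly cycle (all values invented for the example) and recovers the period from the largest spectral peak.

```python
import numpy as np

rng = np.random.default_rng(0)
days = np.arange(364)                       # 52 weeks of daily samples
series = 10 + 2 * np.sin(2 * np.pi * days / 7) + rng.normal(0, 0.5, days.size)

# Remove the mean so the zero-frequency bin does not dominate
spectrum = np.abs(np.fft.rfft(series - series.mean()))
freqs = np.fft.rfftfreq(days.size, d=1.0)   # cycles per day

dominant_period = 1.0 / freqs[spectrum.argmax()]
print(dominant_period)
```

The recovered period, or the spectral magnitudes themselves, can then be passed to a downstream model as features.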
Radar sends a pulse, receives an echo. DSP extracts range (time delay), velocity (Doppler shift), and angle (beamforming). ML then classifies the target: car, pedestrian, cyclist. Self-driving cars fuse radar DSP features with lidar and camera.
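Range extraction from time delay can be sketched with a matched filter, which correlates the received signal against the known transmitted pulse. The chirp parameters, noise level, and delay below are hypothetical values for illustration, not those of any real radar.

```python
import numpy as np

rng = np.random.default_rng(0)
fs = 1e6                         # sample rate in Hz (illustrative)
c = 3e8                          # speed of light in m/s (rounded)

# Transmitted pulse: a 64-sample linear chirp sweeping roughly 50 to 300 kHz
tp = np.arange(64) / fs
pulse = np.cos(2 * np.pi * (50e3 * tp + 0.5 * 4e9 * tp ** 2))

# Received signal: the echo at a known delay plus white noise
true_delay = 200                 # in samples
received = rng.normal(0.0, 0.3, 1024)
received[true_delay : true_delay + pulse.size] += pulse

# Matched filter: slide the known pulse over the received signal
response = np.correlate(received, pulse, mode="valid")
est_delay = int(response.argmax())

range_m = c * (est_delay / fs) / 2   # divide by 2 for the round trip
print(est_delay, range_m)
```

Velocity and angle estimation follow the same pattern of correlating against a known template, but across pulses and across antennas respectively.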
| Domain | Signal Type | DSP Transform | ML Model |
|---|---|---|---|
| Speech | 1D audio | STFT / Mel | Transformer / CTC |
| Music | 1D audio | CQT / Chroma | CNN / RNN |
| Images | 2D spatial | 2D-DFT / Gabor | CNN / ViT |
| Radar | 1D RF | Matched filter / CFAR | CNN / PointNet |
| EEG/ECG | 1D bio | Wavelet / Bandpass | LSTM / 1D-CNN |
| Finance | 1D time series | Wavelet / AR | Transformer / GBT |
EE269 covers the mathematical foundations that connect signal processing to machine learning. Here's the intellectual arc of the course — each topic builds on the previous.
How do you represent continuous signals with finite bits? Quantization is the act of rounding a real number to the nearest allowed value. It introduces error — but how much? Optimal quantizers (Lloyd-Max) minimize distortion. This connects directly to vector quantization in VQ-VAEs and neural audio codecs.
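To make the distortion comparison concrete, here is a sketch of Lloyd's iteration for a scalar quantizer next to a uniform quantizer with the same number of levels. The Gaussian source, the 8 levels, the quantile initialization, and the fixed iteration count are assumptions for this example.

```python
import numpy as np

def lloyd_max(samples, n_levels, n_iters=50):
    """Lloyd's algorithm for a scalar quantizer (1D k-means)."""
    # Start from evenly spaced quantiles so every cell begins non-empty
    levels = np.quantile(samples, (np.arange(n_levels) + 0.5) / n_levels)
    for _ in range(n_iters):
        # Decision boundaries sit halfway between neighboring levels
        bounds = (levels[:-1] + levels[1:]) / 2
        cells = np.digitize(samples, bounds)
        # Each level moves to the mean of the samples assigned to it
        for i in range(n_levels):
            if np.any(cells == i):
                levels[i] = samples[cells == i].mean()
    return levels

def quantize(samples, levels):
    """Round each sample to its nearest reconstruction level."""
    bounds = (levels[:-1] + levels[1:]) / 2
    return levels[np.digitize(samples, bounds)]

rng = np.random.default_rng(0)
x = rng.normal(0, 1, 20000)

uniform_levels = np.linspace(x.min(), x.max(), 8)      # 3-bit uniform quantizer
lloyd_levels = lloyd_max(x, 8)

mse_uniform = np.mean((x - quantize(x, uniform_levels)) ** 2)
mse_lloyd = np.mean((x - quantize(x, lloyd_levels)) ** 2)
print(mse_uniform, mse_lloyd)
```

The optimized levels crowd toward zero, where a Gaussian source puts most of its samples, which is how the same 3 bits buy lower distortion.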
DFT, FFT, STFT, wavelets. These are the lenses through which we view signals. The FFT computes the DFT in O(N log N) instead of O(N²) — making real-time audio processing possible. Wavelets give multi-resolution analysis: coarse features at large scales, fine details at small scales.
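The O(N log N) saving comes from splitting a length-N DFT into two length-N/2 DFTs. Below is a minimal recursive radix-2 sketch, assuming the input length is a power of two; production FFT libraries are iterative and handle arbitrary lengths.

```python
import numpy as np

def fft_radix2(x):
    """Recursive radix-2 Cooley-Tukey FFT; len(x) must be a power of two."""
    x = np.asarray(x, dtype=complex)
    N = len(x)
    if N == 1:
        return x
    even = fft_radix2(x[0::2])     # DFT of even-indexed samples
    odd = fft_radix2(x[1::2])      # DFT of odd-indexed samples
    twiddle = np.exp(-2j * np.pi * np.arange(N // 2) / N)
    # Butterfly: combine two half-size DFTs into one full-size DFT
    return np.concatenate([even + twiddle * odd, even - twiddle * odd])

x = np.random.default_rng(0).random(256)
X = fft_radix2(x)
print(np.allclose(X, np.fft.fft(x)))
```

The recursion has log2(N) levels and does O(N) work per level, which is where the N log N total comes from.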
Given features extracted by DSP, how do we classify? Linear classifiers, SVMs, kernel methods. The kernel trick: implicitly map data to a high-dimensional space where it becomes linearly separable, using only inner products. Gaussian processes as the infinite-width limit of neural networks.
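A small worked case of the kernel trick, using a degree-2 polynomial kernel on 2D points. The feature map `phi` and the sample points are chosen for illustration.

```python
import numpy as np

def phi(v):
    """Explicit feature map for the degree-2 polynomial kernel in 2D."""
    x1, x2 = v
    return np.array([x1 * x1, np.sqrt(2) * x1 * x2, x2 * x2])

def poly_kernel(a, b):
    """k(a, b) = (a . b)^2, computed without ever forming phi."""
    return np.dot(a, b) ** 2

a = np.array([1.0, 2.0])
b = np.array([3.0, -1.0])
explicit = np.dot(phi(a), phi(b))
implicit = poly_kernel(a, b)
print(explicit, implicit)

# Points inside vs. outside the unit circle are not linearly separable in 2D,
# but in phi-space the linear score w . phi(v) = x1^2 + x2^2 separates them.
w = np.array([1.0, 0.0, 1.0])
inside, outside = np.array([0.3, -0.4]), np.array([1.5, 0.5])
print(w @ phi(inside) < 1.0, w @ phi(outside) > 1.0)
```

The kernel gives the same inner product as the explicit map at the cost of one dot product in the original space. That saving is what makes very high-dimensional (even infinite-dimensional) feature spaces usable.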
Neural networks as compositions of linear transforms + nonlinearities. Backpropagation as the chain rule. CNNs as learned filterbanks. Autoencoders as nonlinear PCA. GANs as implicit density estimation.
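Backpropagation as the chain rule can be checked by hand on a tiny network. The sketch below uses a two-layer network with a tanh nonlinearity and squared-error loss; the sizes and random values are arbitrary, and one gradient entry is compared with a finite-difference estimate.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=3)            # input
y = 1.0                           # target
W1 = rng.normal(size=(4, 3))      # first linear layer
w2 = rng.normal(size=4)           # second linear layer

def loss(W):
    h = np.tanh(W @ x)            # linear transform + nonlinearity
    return 0.5 * (w2 @ h - y) ** 2

# Backward pass: apply the chain rule layer by layer
h = np.tanh(W1 @ x)
d_out = w2 @ h - y                # dL/d(output)
d_h = d_out * w2                  # dL/dh
d_pre = d_h * (1 - h ** 2)        # through tanh: tanh'(z) = 1 - tanh(z)^2
grad_W1 = np.outer(d_pre, x)      # dL/dW1

# Check one entry against a central finite difference
eps = 1e-6
E = np.zeros_like(W1)
E[2, 1] = eps
numeric = (loss(W1 + E) - loss(W1 - E)) / (2 * eps)
print(grad_W1[2, 1], numeric)
```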
Gradient descent, stochastic variants, momentum, Adam. Convex vs. non-convex optimization. Why does SGD find good minima in non-convex landscapes? Implicit regularization. The loss landscape of neural networks.
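The effect of momentum shows up even on a convex toy problem. This sketch compares plain gradient descent with heavy-ball momentum on an ill-conditioned quadratic; the matrix, learning rate, and momentum coefficient are illustrative choices, and the example says nothing about the non-convex case.

```python
import numpy as np

A = np.diag([1.0, 50.0])                 # ill-conditioned quadratic bowl

def grad(w):
    return A @ w                         # gradient of 0.5 * w^T A w

def run(lr, beta, steps=100):
    """Return the distance from the minimum (the origin) after `steps` updates."""
    w = np.array([1.0, 1.0])
    v = np.zeros(2)
    for _ in range(steps):
        v = beta * v - lr * grad(w)      # beta = 0 gives plain gradient descent
        w = w + v
    return np.linalg.norm(w)

plain = run(lr=0.02, beta=0.0)
momentum = run(lr=0.02, beta=0.9)
print(plain, momentum)
```

The learning rate is capped by the steep direction, so plain gradient descent crawls along the shallow one. Momentum accumulates velocity there and ends much closer to the minimum in the same number of steps.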
Let's see the entire pipeline in action. We'll take a raw signal, transform it step by step, and watch a simple classifier make a decision. This is the full EE269 story in miniature.
Choose a signal class, adjust noise, and watch the pipeline process it in real time. The classifier's confidence bars show how separable each class is after DSP features.
| Stage | Input Shape | Output Shape | Operation |
|---|---|---|---|
| Raw signal | — | [512] | Generate + add noise |
| STFT | [512] | [65 × T] | Window, DFT per frame |
| Mel filter | [65 × T] | [16 × T] | Triangular filterbank |
| Features | [16 × T] | [4] | Mean + variance pooling |
| Classifier | [4] | [4 classes] | Softmax linear layer |
Notice what happens as you increase noise: the raw waveform becomes unrecognizable, but the mel spectrogram still shows the dominant frequency structure. The DSP representation is robust — it preserves the signal's identity even when the raw data looks hopeless.
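The robustness claim can be probed with a stripped-down version of the first three stages, using the shapes from the table. The sampling rate, tone frequency, noise level, and helper names are hypothetical, and the classifier is replaced by a simple check of which mel band carries the most energy.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def stft_power(x, frame_len=128, hop=64):
    """Power spectrogram from Hann-windowed frames, shape (65, T)."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack([x[i * hop : i * hop + frame_len] * window
                       for i in range(n_frames)])
    return (np.abs(np.fft.rfft(frames, axis=1)) ** 2).T

def mel_filterbank(n_mels, n_bins, fs):
    """Triangular filters evenly spaced on the mel scale, shape (n_mels, n_bins)."""
    bin_freqs = np.linspace(0.0, fs / 2, n_bins)
    edges = mel_to_hz(np.linspace(0.0, hz_to_mel(fs / 2), n_mels + 2))
    fb = np.zeros((n_mels, n_bins))
    for m in range(n_mels):
        rising = (bin_freqs - edges[m]) / (edges[m + 1] - edges[m])
        falling = (edges[m + 2] - bin_freqs) / (edges[m + 2] - edges[m + 1])
        fb[m] = np.clip(np.minimum(rising, falling), 0.0, None)
    return fb

rng = np.random.default_rng(0)
fs = 8000                                   # Hz, hypothetical
t = np.arange(512) / fs
clean = np.sin(2 * np.pi * 1000 * t)        # one signal "class": a 1 kHz tone
noisy = clean + rng.normal(0.0, 0.8, t.size)

fb = mel_filterbank(16, 65, fs)
mel_clean = fb @ stft_power(clean)          # (16, T)
mel_noisy = fb @ stft_power(noisy)

# Mean pooling over time gives one value per mel band
band_clean = int(mel_clean.mean(axis=1).argmax())
band_noisy = int(mel_noisy.mean(axis=1).argmax())
print(mel_clean.shape, band_clean, band_noisy)
```

The tone's energy is concentrated in a few frequency bins while white noise is spread across all of them, so the dominant mel band survives noise that visibly corrupts the waveform.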
This introduction sets the stage. Every subsequent lecture in EE269 deepens one piece of the pipeline we've seen. Here's how the remaining lessons connect:
| Topic | What It Deepens | Key Question |
|---|---|---|
| Quantization | How to discretize optimally | Minimum bits for target distortion? |
| FFT / STFT | Spectral decomposition | How to compute DFT fast? |
| Wavelets | Multi-resolution analysis | How to capture both time and frequency? |
| SVMs & Kernels | Classification in feature space | When is data linearly separable? |
| Autoencoders | Learned compression | What's the optimal low-dim representation? |
| NMF | Parts-based decomposition | Can we factor a spectrogram into sources? |
| Dictionary Learning | Sparse representation | Can we represent signals with few atoms? |
| Optimization | Training algorithms | Why does SGD work in non-convex landscapes? |
The next lesson dives into quantization — the first fundamental question: how do you represent a continuous world with discrete numbers, and how little information can you get away with?