Stanford EE269 — Mert Pilanci

Signal Processing for Machine Learning

Where Fourier meets gradient descent — the mathematical bridge between classical DSP and modern deep learning.

Prerequisites: Basic linear algebra + Python familiarity. That's it.
8 chapters · 5+ simulations · 0 assumed knowledge

Chapter 0: Why Signal Processing for ML?

You record someone saying "hey Siri." Your microphone captures 16,000 numbers per second — a raw waveform. You need a machine to recognize the word. How?

You could feed all 16,000 numbers directly into a neural network, but that is the hard way. The network would need to discover, from scratch, that sound is made of frequencies, that the speech energy most important for intelligibility sits in roughly 300–3400 Hz (the classic telephone band), and that vowels have formant patterns. It would need millions of examples just to learn what a physics student knows from day one.

This is where signal processing enters. Instead of raw samples, you transform the audio into a spectrogram: a 2D image of frequency vs. time. Suddenly your ML model can see patterns that were invisible in the raw waveform. A CNN on a spectrogram can often match or beat a model trained on raw audio while needing far less data.

The core idea: Signal processing gives ML models the right representation. Raw data is like reading a book character by character. DSP is like reading it word by word. The right representation makes learning trivial.
Raw Waveform vs. Spectrogram

Top: a raw audio signal (hard to interpret). Bottom: its spectrogram (patterns jump out). Click "Generate" to see different signals.

Notice how in the spectrogram, speech shows clear horizontal bands (formants), music shows harmonic ladders, but noise is featureless static. The spectrogram makes structure visible — and what's visible to us is learnable by machines.

Why do we transform raw audio before feeding it to an ML model?

Chapter 1: What Is a Signal?

A signal is any quantity that varies over time (or space). Your voice is a signal. An image is a 2D signal. A stock price is a signal. An ECG is a signal. Formally: a function x(t) that maps time (or position) to amplitude.

In the digital world, we don't have continuous functions. We have samples — discrete measurements taken at regular intervals. Record audio at 16 kHz and you get 16,000 numbers per second. Each number is the air pressure at that instant.

Continuous vs. Discrete

Continuous signal: x(t) defined for all real t. Exists in nature (sound waves, voltage).

Discrete signal: x[n] defined only at integer indices n = 0, 1, 2, ... What computers actually store.

x[n] = x(n · Ts),   where Ts = 1/fs is the sampling period

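The sampling rate fs determines how many samples per second we take. Too low and we lose information (aliasing). Too high and we waste storage. The Nyquist-Shannon sampling theorem tells us where the line is: to avoid aliasing, fs must be at least 2× the highest frequency in the signal (strictly more than 2× if the signal has energy exactly at that highest frequency).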

Key insight: A signal has two lives — its time-domain self (the wiggly line you see on an oscilloscope) and its frequency-domain self (which sinusoids compose it). Signal processing is largely about switching between these two views.
Sampling a Sine Wave

The smooth curve is the true signal. The dots are samples. Reduce the sample rate to see aliasing.

Signal Freq (Hz): 3.0
Sample Rate (Hz): 16

When the sample rate drops below 2× the signal frequency, the samples trace out a different frequency — a ghost that doesn't exist in the original. This is aliasing, and it's irreversible. Once you've sampled too slowly, the information is gone forever.
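
The ghost frequency can be checked numerically. The following is a minimal NumPy sketch, assuming the demo's 3 Hz tone; the helper names `sample_sine` and `apparent_frequency` are illustrative and not part of the course materials. It folds a frequency into the range [0, fs/2] and confirms that a 3 Hz sine sampled at 4 Hz produces exactly the samples of a sign-flipped 1 Hz sine.

```python
import numpy as np

def sample_sine(freq_hz, fs_hz, duration_s=1.0):
    """Sample x(t) = sin(2*pi*freq*t) at rate fs, i.e. x[n] = x(n * Ts) with Ts = 1/fs."""
    n = np.arange(int(round(duration_s * fs_hz)))
    return np.sin(2 * np.pi * freq_hz * n / fs_hz)

def apparent_frequency(freq_hz, fs_hz):
    """Frequency (Hz) that the samples appear to have after sampling at fs."""
    folded = freq_hz % fs_hz              # sampling cannot tell f from f + k*fs
    return min(folded, fs_hz - folded)    # fold into the range [0, fs/2]

# 3 Hz sampled at 16 Hz: above the 6 Hz minimum, so the tone is preserved.
print(apparent_frequency(3.0, 16.0))      # 3.0
# 3 Hz sampled at 4 Hz: below the 6 Hz minimum, so a 1 Hz ghost appears.
print(apparent_frequency(3.0, 4.0))       # 1.0

# The samples of the 3 Hz tone at fs = 4 Hz equal those of a sign-flipped
# 1 Hz tone, so the two signals are indistinguishable after sampling.
aliased = sample_sine(3.0, 4.0)
ghost = -sample_sine(1.0, 4.0)
print(np.allclose(aliased, ghost))        # True
```

Because both tones give the same samples, no later processing step can tell which one was recorded, which is why aliasing cannot be undone.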

A signal has its highest frequency component at 4 kHz. What's the minimum sampling rate to avoid aliasing?

Chapter 2: Spectral Analysis — Seeing Frequencies

Every signal, no matter how complex, is a sum of sinusoids at different frequencies, amplitudes, and phases. This is Fourier's insight from 1807, and it remains the most important idea in all of signal processing.

The Discrete Fourier Transform (DFT) takes N time-domain samples and produces N complex frequency-domain coefficients. The magnitude of each coefficient tells you how strongly that frequency is present in the signal; its angle gives the phase.

$$X[k] = \sum_{n=0}^{N-1} x[n]\, e^{-j 2\pi k n / N}, \qquad k = 0, 1, \dots, N-1$$

Don't let the complex exponential scare you. It's just asking: "how much does x[n] correlate with a sinusoid of frequency k/N?" High correlation = large |X[k]| = lots of energy at that frequency.
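
To make the correlation view concrete, here is a small sketch that evaluates the DFT sum directly as a matrix-vector product and compares it with NumPy's FFT. The test signal (N = 64 samples with 3 and 7 cycles per window, amplitudes 1.0 and 0.5) is an arbitrary choice for illustration, and `dft` is a hypothetical helper name.

```python
import numpy as np

def dft(x):
    """Direct O(N^2) DFT: X[k] = sum_n x[n] * exp(-j*2*pi*k*n/N)."""
    x = np.asarray(x, dtype=complex)
    N = len(x)
    n = np.arange(N)
    k = n.reshape(-1, 1)
    basis = np.exp(-2j * np.pi * k * n / N)   # row k is the k-th complex sinusoid
    return basis @ x                          # correlate x with every sinusoid

N = 64
n = np.arange(N)
x = 1.0 * np.sin(2 * np.pi * 3 * n / N) + 0.5 * np.sin(2 * np.pi * 7 * n / N)

X = dft(x)
magnitude = np.abs(X[: N // 2])               # non-negative frequencies only
peaks = sorted(int(i) for i in np.argsort(magnitude)[-2:])
print(peaks)                                  # [3, 7]
print(np.allclose(X, np.fft.fft(x)))          # True
```

The two largest magnitudes land on bins 3 and 7, the two frequencies that were mixed in, and every other bin is numerically zero.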

The STFT: Frequency Over Time

The DFT gives you frequencies but throws away time. You know which frequencies exist but not when. For audio and speech, timing matters. The Short-Time Fourier Transform (STFT) solves this by chopping the signal into short overlapping windows and taking the DFT of each window.

Signal x[n]: full time-domain waveform
↓ window of length L
Windowed frame: x[n] · w[n-m] for frame m
↓ DFT of each frame
Spectrogram: |X[k,m]|², the power at frequency k and time m

Think of it this way: The DFT is like asking "what notes are in this song?" The STFT is like asking "what notes are being played right now?" It trades frequency resolution for time resolution — shorter windows give better time precision but fuzzier frequency information. This is the Heisenberg uncertainty principle of signal processing.
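
The windowing recipe above can be sketched in a few lines of NumPy. This is an illustrative version under stated assumptions: a Hann window, a hop of half the window length, and a synthetic test signal that switches from 50 Hz to 200 Hz halfway through. The function name `stft_power` and all parameter values are choices made for this example, not the course's reference implementation.

```python
import numpy as np

def stft_power(x, win_len=128, hop=64):
    """Power spectrogram |X[k, m]|^2 from Hann-windowed, overlapping frames."""
    window = np.hanning(win_len)
    n_frames = 1 + (len(x) - win_len) // hop
    frames = np.stack([x[m * hop : m * hop + win_len] * window
                       for m in range(n_frames)])
    spectrum = np.fft.rfft(frames, axis=1)    # DFT of each frame
    return (np.abs(spectrum) ** 2).T          # shape: (freq bins, frames)

fs = 1024                                     # assumed sample rate in Hz
t = np.arange(fs) / fs                        # one second of samples
x = np.where(t < 0.5, np.sin(2 * np.pi * 50 * t), np.sin(2 * np.pi * 200 * t))

S = stft_power(x)
freqs = np.fft.rfftfreq(128, d=1.0 / fs)      # bin centers, 8 Hz apart
print(S.shape)                                # (65, 15)
print(freqs[S[:, 0].argmax()])                # within one 8 Hz bin of 50 Hz
print(freqs[S[:, -1].argmax()])               # 200.0
```

A plain DFT of the whole signal would show both tones but not their order; the spectrogram columns show the dominant frequency moving from about 50 Hz to 200 Hz. Shrinking `win_len` locates the switch more precisely in time, at the cost of wider frequency bins.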
Interactive DFT

Mix up to 3 sinusoids and watch the DFT reveal them. Each spike in the magnitude spectrum corresponds to a frequency you added.

Freq 1: 3
Freq 2: 7
Freq 3: 0

What does the STFT give you that the plain DFT does not?

Chapter 3: From Spectrogram to Features

We've turned a waveform into a spectrogram. But a spectrogram is still a big matrix — hundreds of frequency bins times hundreds of time frames. We need to compress it into a small feature vector that captures the essential information for classification.

Mel Scale: Hearing Like a Human

Humans don't perceive frequency linearly. The difference between 100 Hz and 200 Hz sounds huge. The difference between 8000 Hz and 8100 Hz is barely noticeable. The mel scale warps frequency to match human perception:

mel(f) = 2595 · log10(1 + f/700)

A mel spectrogram applies triangular filter banks spaced evenly on the mel scale. This compresses the 257 frequency bins of a 512-point FFT down to 40-128 mel bands, keeping the perceptually important information and discarding fine spectral detail that human hearing resolves poorly.
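
A quick worked check of the mel formula, using only the standard library. The inverse `mel_to_hz` is obtained by solving the formula above for f; both function names are illustrative.

```python
import math

def hz_to_mel(f_hz):
    """mel(f) = 2595 * log10(1 + f / 700)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(m):
    """Inverse of hz_to_mel, found by solving the formula for f."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

# The same 100 Hz step is much larger on the mel scale at low frequencies.
low_step = hz_to_mel(200) - hz_to_mel(100)      # about 133 mel
high_step = hz_to_mel(8100) - hz_to_mel(8000)   # about 13 mel
print(round(low_step), round(high_step))

# Centers of 40 bands spaced evenly in mel between 0 Hz and 8000 Hz:
# densely packed at low frequencies, spread out at high frequencies.
top = hz_to_mel(8000)
centers_hz = [mel_to_hz(top * (i + 1) / 41) for i in range(40)]
print(round(centers_hz[1] - centers_hz[0]), round(centers_hz[-1] - centers_hz[-2]))
```

The 100 Hz to 200 Hz step spans roughly ten times as many mels as the 8000 Hz to 8100 Hz step, which matches the perceptual claim above.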

MFCCs: The Classic Feature

Mel-Frequency Cepstral Coefficients (MFCCs) go one step further. Take the mel spectrogram, apply a log, then take the DCT (Discrete Cosine Transform). The result: 13 numbers per time frame that compactly describe the spectral envelope — the "shape" of the spectrum that encodes what phoneme is being spoken.

Waveform: x[n], 16000 samples/sec
↓ STFT (window=25ms, hop=10ms)
Spectrogram: |X[k,m]|², 257×T
↓ Mel filterbank (40 bands)
Mel Spectrogram: 40×T
↓ log + DCT (keep first 13)
MFCCs: 13×T, compact features

Why this pipeline works: Each step removes irrelevant variation. Taking the squared magnitude of the STFT discards phase (largely irrelevant for recognizing speech). The mel scale removes detail finer than the ear resolves. The log compresses dynamic range. The DCT decorrelates the features. What remains is a compact description of "what sound is this?"
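
The pipeline can be sketched end to end in NumPy. This is a simplified illustration, not a reference implementation: it assumes a Hann window, a 512-point FFT (which gives the 257 bins shown above), an unnormalized DCT-II, and a small constant added before the log to avoid log(0). Production libraries differ in details such as pre-emphasis, filter normalization, and DCT scaling, and every function name here is hypothetical.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_mels, n_fft, fs):
    """Triangular filters whose edges are evenly spaced on the mel scale."""
    edges_hz = mel_to_hz(np.linspace(0.0, hz_to_mel(fs / 2), n_mels + 2))
    bin_freqs = np.fft.rfftfreq(n_fft, d=1.0 / fs)
    fb = np.zeros((n_mels, len(bin_freqs)))
    for i in range(n_mels):
        lo, mid, hi = edges_hz[i], edges_hz[i + 1], edges_hz[i + 2]
        rising = (bin_freqs - lo) / (mid - lo)
        falling = (hi - bin_freqs) / (hi - mid)
        fb[i] = np.clip(np.minimum(rising, falling), 0.0, None)
    return fb

def mfcc(x, fs=16000, win_ms=25, hop_ms=10, n_fft=512, n_mels=40, n_coef=13):
    win, hop = fs * win_ms // 1000, fs * hop_ms // 1000      # 400 and 160 samples
    n_frames = 1 + (len(x) - win) // hop
    frames = np.stack([x[m * hop : m * hop + win] for m in range(n_frames)])
    frames = frames * np.hanning(win)
    power = np.abs(np.fft.rfft(frames, n=n_fft, axis=1)) ** 2  # (T, 257), phase dropped
    mel = power @ mel_filterbank(n_mels, n_fft, fs).T           # (T, 40)
    log_mel = np.log(mel + 1e-10)                               # compress dynamic range
    k = np.arange(n_coef).reshape(-1, 1)
    n = np.arange(n_mels)
    dct_basis = np.cos(np.pi * k * (n + 0.5) / n_mels)          # DCT-II rows, (13, 40)
    return (log_mel @ dct_basis.T).T                            # (13, T)

rng = np.random.default_rng(0)
x = rng.standard_normal(16000)     # one second of noise as a stand-in for audio
features = mfcc(x)
print(features.shape)              # (13, 98)
```

One second of 16 kHz audio becomes 98 frames of 13 numbers each, about 1,300 values in place of 16,000 raw samples.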
Feature Extraction Pipeline

Watch how each stage transforms the signal. Click "Step" to advance through the pipeline.

Why do we use the mel scale instead of linear frequency spacing?

Chapter 4: Applications — Where DSP Meets ML

Signal processing + machine learning is not a niche — it's everywhere. Let's tour the major application domains where this marriage produces state-of-the-art results.

Speech Recognition

Raw audio → mel spectrogram → neural network → text. Some modern systems (wav2vec 2.0) learn features directly from the raw waveform, while others (Whisper) still take a log-mel spectrogram as input. Either way, the initial windowing and spectral decomposition is signal processing: even raw-waveform models start with learned convolutional filters that play the role of a filterbank.

Audio Classification

Is this sound a dog bark, a siren, or a gunshot? Environmental sound classification uses mel spectrograms as images and applies CNNs. The spectrogram is the image. A 2D convolution on a spectrogram is equivalent to a time-frequency filter — classical DSP in neural clothing.

Image Processing

Images are 2D signals. The 2D DFT reveals spatial frequencies — edges are high-frequency, smooth regions are low-frequency. Convolution in spatial domain = multiplication in frequency domain. Every CNN filter is learning a spatial frequency pattern.
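
The convolution theorem is easy to verify numerically. One caveat the sketch makes explicit: multiplying DFTs corresponds to circular convolution (indices wrap around the image border), so linear convolution would need zero-padding first. The 8×8 random image, the 3×3 kernel, and the helper name `circular_conv2d` are arbitrary choices for illustration.

```python
import numpy as np

def circular_conv2d(a, b):
    """Direct circular convolution: out[i, j] = sum_{p,q} a[p, q] * b[i-p, j-q] (mod size)."""
    H, W = a.shape
    out = np.zeros((H, W))
    for p in range(H):
        for q in range(W):
            # np.roll shifts b so that entry [i, j] holds b[(i-p) % H, (j-q) % W]
            out += a[p, q] * np.roll(np.roll(b, p, axis=0), q, axis=1)
    return out

rng = np.random.default_rng(0)
image = rng.standard_normal((8, 8))
kernel = np.zeros((8, 8))
kernel[:3, :3] = rng.standard_normal((3, 3))   # 3x3 filter zero-padded to image size

spatial = circular_conv2d(image, kernel)
spectral = np.real(np.fft.ifft2(np.fft.fft2(image) * np.fft.fft2(kernel)))
print(np.allclose(spatial, spectral))          # True
```

The spatial-domain loop and the frequency-domain product give the same result up to floating-point error.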

Time Series & Finance

Stock prices, sensor readings, weather data. Spectral analysis reveals periodicity (daily, weekly, seasonal cycles). Wavelet transforms capture multi-scale patterns. These features feed into LSTMs, Transformers, or gradient-boosted trees.

Radar & Remote Sensing

Radar sends a pulse, receives an echo. DSP extracts range (time delay), velocity (Doppler shift), and angle (beamforming). ML then classifies the target: car, pedestrian, cyclist. Self-driving cars fuse radar DSP features with lidar and camera.

Domain | Signal Type | DSP Transform | ML Model
--- | --- | --- | ---
Speech | 1D audio | STFT / Mel | Transformer / CTC
Music | 1D audio | CQT / Chroma | CNN / RNN
Images | 2D spatial | 2D-DFT / Gabor | CNN / ViT
Radar | 1D RF | Matched filter / CFAR | CNN / PointNet
EEG/ECG | 1D bio | Wavelet / Bandpass | LSTM / 1D-CNN
Finance | 1D time series | Wavelet / AR | Transformer / GBT

The unifying pattern: In every domain, DSP provides a representation that makes the structure explicit. ML then learns the decision boundary in that structured space. DSP = representation. ML = decision.
Domain Comparison

Click a domain to see its signal, transform, and feature space side by side.

What role does DSP play in a modern ML pipeline?

Chapter 5: Course Roadmap — What We'll Learn

EE269 covers the mathematical foundations that connect signal processing to machine learning. Here's the intellectual arc of the course — each topic builds on the previous.

Block 1: Quantization & Representation

How do you represent continuous signals with finite bits? Quantization is the act of rounding a real number to the nearest allowed value. It introduces error — but how much? Optimal quantizers (Lloyd-Max) minimize distortion. This connects directly to vector quantization in VQ-VAEs and neural audio codecs.
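
As a rough illustration of the gap between a uniform quantizer and a Lloyd-Max design, the sketch below runs Lloyd's algorithm (alternating nearest-level assignment and centroid updates) starting from uniform levels. The Gaussian test data, the 8 levels (3 bits), and the function names are assumptions made for this example.

```python
import numpy as np

def nearest_level(x, levels):
    """Quantize: round each value to the nearest allowed level."""
    idx = np.abs(x[:, None] - levels[None, :]).argmin(axis=1)
    return levels[idx]

def lloyd_max(x, init_levels, n_iter=50):
    """Lloyd's algorithm: assign samples to nearest level, then move levels to cell means."""
    levels = np.array(init_levels, dtype=float)
    for _ in range(n_iter):
        idx = np.abs(x[:, None] - levels[None, :]).argmin(axis=1)
        for j in range(len(levels)):
            members = x[idx == j]
            if members.size:                   # leave empty cells unchanged
                levels[j] = members.mean()
    return levels

rng = np.random.default_rng(0)
x = 0.3 * rng.standard_normal(5000)            # amplitudes concentrated near zero

uniform_levels = np.linspace(-1.0, 1.0, 8)     # 3-bit uniform quantizer on [-1, 1]
trained_levels = lloyd_max(x, uniform_levels)

mse_uniform = np.mean((x - nearest_level(x, uniform_levels)) ** 2)
mse_lloyd = np.mean((x - nearest_level(x, trained_levels)) ** 2)
print(mse_lloyd < mse_uniform)                 # True
```

Neither step of the iteration can increase the mean squared error, so the trained levels do at least as well as the uniform starting point; here they end up crowded near zero, where most of the samples are. The same assign-then-average loop is the one-dimensional case of the codebook training used in vector quantization.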

Block 2: Spectral Analysis & Transforms

DFT, FFT, STFT, wavelets. These are the lenses through which we view signals. The FFT computes the DFT in O(N log N) instead of O(N²) — making real-time audio processing possible. Wavelets give multi-resolution analysis: coarse features at large scales, fine details at small scales.
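
The O(N log N) cost comes from splitting a length-N DFT into two length-N/2 DFTs of the even- and odd-indexed samples. Below is a minimal recursive radix-2 sketch, assuming the length is a power of two; `fft_radix2` is an illustrative name, and real libraries use far more optimized iterative and mixed-radix variants.

```python
import numpy as np

def fft_radix2(x):
    """Recursive radix-2 decimation-in-time FFT; len(x) must be a power of two."""
    x = np.asarray(x, dtype=complex)
    N = len(x)
    if N == 0 or N & (N - 1):
        raise ValueError("length must be a power of two")
    if N == 1:
        return x
    even = fft_radix2(x[0::2])                 # DFT of even-indexed samples
    odd = fft_radix2(x[1::2])                  # DFT of odd-indexed samples
    twiddle = np.exp(-2j * np.pi * np.arange(N // 2) / N)
    # Butterfly: combine the two half-size DFTs into the full-size DFT
    return np.concatenate([even + twiddle * odd, even - twiddle * odd])

rng = np.random.default_rng(0)
x = rng.standard_normal(256)
print(np.allclose(fft_radix2(x), np.fft.fft(x)))   # True
```

Each level of recursion does O(N) work in the butterfly step and there are log2(N) levels, which gives the O(N log N) total.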

Block 3: Statistical Learning & Classification

Given features extracted by DSP, how do we classify? Linear classifiers, SVMs, kernel methods. The kernel trick: map data to a high-dimensional space where it becomes linearly separable. Gaussian processes as infinite-width neural networks.

Block 4: Deep Learning Foundations

Neural networks as compositions of linear transforms + nonlinearities. Backpropagation as the chain rule. CNNs as learned filterbanks. Autoencoders as nonlinear PCA. GANs as implicit density estimation.

Block 5: Optimization & Convergence

Gradient descent, stochastic variants, momentum, Adam. Convex vs. non-convex optimization. Why does SGD find good minima in non-convex landscapes? Implicit regularization. The loss landscape of neural networks.
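
For the convex case, the basic update rules fit in a few lines. The sketch below minimizes a small quadratic f(w) = ½ wᵀAw − bᵀw with plain gradient descent and with heavy-ball momentum; the matrix A, vector b, step size, and momentum value are arbitrary illustrative choices, and stochastic variants and Adam are not covered here.

```python
import numpy as np

A = np.array([[3.0, 0.5],
              [0.5, 1.0]])                     # symmetric positive definite
b = np.array([1.0, -2.0])

def grad(w):
    """Gradient of f(w) = 0.5 * w^T A w - b^T w."""
    return A @ w - b

def gradient_descent(lr=0.1, momentum=0.0, n_steps=500):
    w = np.zeros(2)
    v = np.zeros(2)                            # velocity (stays -lr*grad if momentum = 0)
    for _ in range(n_steps):
        v = momentum * v - lr * grad(w)
        w = w + v
    return w

w_star = np.linalg.solve(A, b)                 # exact minimizer: A w = b
w_gd = gradient_descent(momentum=0.0)
w_mom = gradient_descent(momentum=0.9)
print(np.allclose(w_gd, w_star), np.allclose(w_mom, w_star))   # True True
```

On a convex quadratic both methods reach the unique minimizer. The harder questions listed above concern what happens when the landscape is non-convex and there is no single minimizer to converge to.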

Quantization: Continuous → Discrete, optimal codebooks
Spectral Analysis: DFT, FFT, STFT, wavelets
Classification: SVM, kernels, GPs
Deep Learning: CNNs, autoencoders, NMF
Optimization: SGD, convexity, convergence

The thread connecting everything: Representation + Optimization. Every problem in this course asks: (1) What's the right way to represent the data? (2) How do we find the best parameters? These two questions unify DSP, classical ML, and deep learning.
What two meta-questions unify all topics in EE269?

Chapter 6: Showcase — From Waveform to Classification

Let's see the entire pipeline in action. We'll take a raw signal, transform it step by step, and watch a simple classifier make a decision. This is the full EE269 story in miniature.

This interactive demo simulates the complete audio classification pipeline: raw signal → STFT → mel spectrogram → CNN features → classification. Each stage is visualized so you can see exactly what happens to the data.
Full Pipeline: Signal → Classification

Choose a signal class, adjust noise, and watch the pipeline process it in real time. The classifier's confidence bars show how separable each class is after DSP features.

Noise Level: 0.3
Window Size: 128

What's Happening at Each Stage

Stage | Input Shape | Output Shape | Operation
--- | --- | --- | ---
Raw signal | (none) | [512] | Generate + add noise
STFT | [512] | [65 × T] | Window, DFT per frame
Mel filter | [65 × T] | [16 × T] | Triangular filterbank
Features | [16 × T] | [4] | Mean + variance pooling
Classifier | [4] | [4 classes] | Softmax linear layer

Notice what happens as you increase noise: the raw waveform becomes unrecognizable, but the mel spectrogram still shows the dominant frequency structure. The DSP representation is robust — it preserves the signal's identity even when the raw data looks hopeless.

The payoff: In this demo, a classifier on raw samples fails once the noise level exceeds 0.5, while the same classifier on mel features stays accurate past noise level 1.5. That is a 3× gain in noise tolerance from nothing more than choosing the right representation.

Chapter 7: Beyond — Connections

This introduction sets the stage. Every subsequent lecture in EE269 deepens one piece of the pipeline we've seen. Here's how the remaining lessons connect:

Topic | What It Deepens | Key Question
--- | --- | ---
Quantization | How to discretize optimally | Minimum bits for target distortion?
FFT / STFT | Spectral decomposition | How to compute DFT fast?
Wavelets | Multi-resolution analysis | How to capture both time and frequency?
SVMs & Kernels | Classification in feature space | When is data linearly separable?
Autoencoders | Learned compression | What's the optimal low-dim representation?
NMF | Parts-based decomposition | Can we factor a spectrogram into sources?
Dictionary Learning | Sparse representation | Can we represent signals with few atoms?
Optimization | Training algorithms | Why does SGD work in non-convex landscapes?

The big picture: EE269 teaches you to see ML through the lens of signal processing. Every neural network is a sequence of linear transforms (like DFTs) and pointwise nonlinearities (like rectification). Every loss function is a signal distortion measure. Every training algorithm is an iterative filter converging to a fixed point.

The next lesson dives into quantization — the first fundamental question: how do you represent a continuous world with discrete numbers, and how little information can you get away with?

What analogy best captures the relationship between DSP and deep learning?