EE269 Lecture 24 — Mert Pilanci, Stanford

Deep Learning & CNNs

Stacking layers for hierarchical features, sharing weights for translation equivariance, and conquering spectrograms with convolutions.

Prerequisites: Single-layer neural networks + Backpropagation (chain rule). That's it.
8 chapters · 7+ simulations · 0 assumed knowledge

Chapter 0: Why Depth?

A single hidden layer can approximate any continuous function on a bounded domain to arbitrary accuracy, given enough hidden units; the universal approximation theorem guarantees it. So why bother with multiple layers?

Because the theorem says nothing about efficiency. Consider a function that detects whether a face is present in an image. A single-layer network would need to learn every possible face pattern directly from pixels. That's an astronomically large number of hidden units.

A deep network, by contrast, builds features hierarchically:

Layer 1: Edges
Detects oriented edges at each location
Layer 2: Parts
Combines edges into eyes, noses, mouths
Layer 3: Objects
Combines parts into faces

Each layer reuses the features from the layer below. An edge detector at layer 1 can be shared across many part detectors at layer 2. This compositionality means deep networks can represent complex functions with exponentially fewer parameters than shallow ones.

Depth = exponential efficiency. Some function classes require on the order of 2^n nodes with 1 hidden layer but only O(n) nodes with O(log n) layers. Parity (XOR over n bits) is the classic example when the nodes are AND/OR gates. Shallow is universal but impractical; deep is both universal and efficient for hierarchically structured problems, which covers most of the real world.
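The O(n) nodes / O(log n) layers count for parity can be checked with a balanced tree of 2-input XOR gates. This is a minimal sketch in plain Python; `parity_tree` is a hypothetical helper name, and each XOR gate is assumed to cost a small constant number of neurons.

```python
def parity_tree(bits):
    """Compute the XOR of all bits with a balanced tree of 2-input XOR gates.

    Returns (parity, number_of_gates, depth).
    """
    level = list(bits)
    gates = 0
    depth = 0
    while len(level) > 1:
        nxt = []
        # Pair up neighbours; each pair costs one XOR gate.
        for i in range(0, len(level) - 1, 2):
            nxt.append(level[i] ^ level[i + 1])
            gates += 1
        # An unpaired last element passes through to the next level unchanged.
        if len(level) % 2 == 1:
            nxt.append(level[-1])
        level = nxt
        depth += 1
    return level[0], gates, depth


bits = [1, 0, 1, 1, 0, 1, 0, 0]      # n = 8 input bits
parity, gates, depth = parity_tree(bits)
print(parity, gates, depth)          # 0 7 3: n - 1 gates, log2(n) levels
```

For n inputs the tree always uses n − 1 gates, and the depth grows like log2(n).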

The catch? Deep networks are harder to train. Gradients must flow through many layers via the chain rule, and they can vanish (multiply by numbers < 1 repeatedly) or explode (multiply by numbers > 1 repeatedly). This lecture addresses both why depth works and how to make it trainable.

Shallow vs. Deep: Learned Representations

A 1-layer (shallow) and a 3-layer (deep) network learn to classify the same XOR-like pattern. The deep network uses far fewer total parameters.

Check your understanding: what is the main advantage of depth (multiple layers) over width (many hidden units in one layer)?

Chapter 1: Multi-Layer Networks

A deep network with L layers computes a sequence of transformations. Layer l takes the output of layer l−1 and produces a new representation:

z(l) = σ(W(l) z(l−1) + b(l))    for l = 1, 2, ..., L

where z(0) = x (the input), W(l) is the weight matrix of layer l, b(l) is the bias vector, and σ is the activation function (typically ReLU for hidden layers).

Data Flow: Shapes and Dimensions

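If the input has dimension p, and layers have widths d_1, d_2, ..., d_L:

| Layer | Input dim | Output dim | W(l) shape | Parameters |
|---|---|---|---|---|
| 1 | p | d_1 | d_1 × p | d_1(p + 1) |
| 2 | d_1 | d_2 | d_2 × d_1 | d_2(d_1 + 1) |
| l | d_{l−1} | d_l | d_l × d_{l−1} | d_l(d_{l−1} + 1) |
| Output | d_L | K | K × d_L | K(d_L + 1) |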
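The per-layer counts in the table follow one rule: a layer with fan-in d_in and fan-out d_out has d_out · (d_in + 1) parameters (weights plus biases). A small sketch, with `dense_param_count` as a hypothetical helper name:

```python
def dense_param_count(input_dim, widths, num_outputs):
    """Count weights and biases of a fully connected network.

    widths: hidden-layer widths [d_1, ..., d_L]; num_outputs: K.
    """
    dims = [input_dim] + list(widths) + [num_outputs]
    # Each layer contributes d_out * (d_in + 1): a d_out x d_in matrix plus d_out biases.
    per_layer = [d_out * (d_in + 1) for d_in, d_out in zip(dims[:-1], dims[1:])]
    return per_layer, sum(per_layer)


# Example sizes (assumed for illustration): p = 2, hidden widths 3 and 2, K = 1.
per_layer, total = dense_param_count(input_dim=2, widths=[3, 2], num_outputs=1)
print(per_layer, total)   # [9, 8, 3] 20
```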

Backprop Through Multiple Layers

The chain rule extends naturally. Define δ(l) = ∂L/∂a(l) where a(l) = W(l)z(l−1) + b(l) (pre-activation). Then:

δ(l) = [(W(l+1))^T δ(l+1)] ⊙ σ'(a(l))

where ⊙ is element-wise multiplication. The gradient for each weight matrix:

∂L/∂W(l) = δ(l) (z(l−1))^T
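A short NumPy sketch of this recursion, checked against a finite-difference gradient. The sigmoid activation, the squared-error loss and the layer sizes are assumptions chosen for illustration; they are not fixed by the text.

```python
import numpy as np


def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))


def forward(x, weights, biases):
    """Return pre-activations and activations; zs[0] is the input x."""
    zs, pre = [x], []
    for W, b in zip(weights, biases):
        a = W @ zs[-1] + b
        pre.append(a)
        zs.append(sigmoid(a))
    return pre, zs


def backward(y, weights, pre, zs):
    """Gradients of the assumed loss 0.5 * ||z_L - y||^2 w.r.t. each weight matrix."""
    grads_W = [None] * len(weights)
    s = sigmoid(pre[-1])
    delta = (zs[-1] - y) * s * (1 - s)                  # delta at the last layer
    for l in reversed(range(len(weights))):
        grads_W[l] = np.outer(delta, zs[l])             # delta times (layer input)^T
        if l > 0:
            s = sigmoid(pre[l - 1])
            delta = (weights[l].T @ delta) * s * (1 - s)  # recursion from the text
    return grads_W


rng = np.random.default_rng(0)
sizes = [2, 3, 2, 1]
weights = [rng.normal(size=(m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
biases = [rng.normal(size=m) for m in sizes[1:]]
x, y = np.array([1.0, 2.0]), np.array([0.5])


def loss(ws):
    _, zs = forward(x, ws, biases)
    return 0.5 * np.sum((zs[-1] - y) ** 2)


pre, zs = forward(x, weights, biases)
grads = backward(y, weights, pre, zs)

# Central-difference check on one entry of the first weight matrix.
eps = 1e-6
w_plus = [W.copy() for W in weights]
w_plus[0][0, 0] += eps
w_minus = [W.copy() for W in weights]
w_minus[0][0, 0] -= eps
numeric = (loss(w_plus) - loss(w_minus)) / (2 * eps)
print(abs(numeric - grads[0][0, 0]) < 1e-6)
```

The list index `l` in the code is zero-based, so `weights[l]` plays the role of W(l+1) in the formulas.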
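The vanishing gradient problem. Each backward step multiplies by σ'(a(l)). For sigmoid, |σ'| ≤ 0.25, so after L layers the activation derivatives alone scale the gradient by at most 0.25^L (the weight matrices contribute further factors). For L = 8: 0.25^8 ≈ 0.000015 of the original. Early layers barely learn. ReLU avoids this shrinkage since σ'(a) = 1 for active neurons, but introduces dead neurons (units stuck at zero output, whose gradient is exactly zero). Residual connections (Chapter 6) largely remove the problem for very deep networks.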
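The size of the sigmoid bound is easy to tabulate. This sketch only evaluates the best-case activation-derivative factor; it ignores the weight matrices, which also multiply the gradient.

```python
import math


def sigmoid_grad(a):
    """Derivative of the logistic sigmoid at a."""
    s = 1.0 / (1.0 + math.exp(-a))
    return s * (1.0 - s)


max_slope = sigmoid_grad(0.0)   # the derivative peaks at a = 0
bounds = {depth: max_slope ** depth for depth in (2, 4, 8, 16)}
for depth, bound in bounds.items():
    print(f"L = {depth:2d}: factor <= {bound:.3e}")
```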

Worked Example: 3-Layer Forward Pass

Input: x = [1, 2]T. Three layers with widths 3, 2, 1. ReLU activation.

Layer 1: W(1) is 3×2, b(1) is 3×1.

a(1) = W(1)x + b(1) = [0.5 0.3; -0.2 0.7; 0.4 -0.1][1; 2] + [0.1; -0.1; 0.2] = [1.2; 1.1; 0.4]
z(1) = ReLU([1.2, 1.1, 0.4]) = [1.2, 1.1, 0.4]

Layer 2: W(2) is 2×3.

a(2) = W(2)z(1) + b(2) = [0.3 -0.5 0.8; 0.6 0.2 -0.4][1.2; 1.1; 0.4] + [0; 0] = [0.13; 0.78]
z(2) = ReLU([0.13, 0.78]) = [0.13, 0.78]

Output: W(3) is 1×2.

ŷ = [0.4 0.6] · [0.13; 0.78] + 0.1 = 0.052 + 0.468 + 0.1 = 0.620
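The same forward pass can be recomputed in NumPy as a check on the hand arithmetic. The weights and biases are copied from the example; the variable names are just for illustration.

```python
import numpy as np


def relu(a):
    return np.maximum(a, 0.0)


x = np.array([1.0, 2.0])

W1 = np.array([[0.5, 0.3], [-0.2, 0.7], [0.4, -0.1]])
b1 = np.array([0.1, -0.1, 0.2])
W2 = np.array([[0.3, -0.5, 0.8], [0.6, 0.2, -0.4]])
b2 = np.zeros(2)
W3 = np.array([[0.4, 0.6]])
b3 = np.array([0.1])

z1 = relu(W1 @ x + b1)     # layer 1: shape (3,)
z2 = relu(W2 @ z1 + b2)    # layer 2: shape (2,)
y_hat = W3 @ z2 + b3       # linear output: shape (1,)

print(np.round(z1, 3), np.round(z2, 3), np.round(y_hat, 3))
```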
Deep Network Layer-by-Layer Visualization

Adjust the depth. Watch how intermediate representations transform the 2D input into increasingly abstract features.

Check your understanding: in a deep network with L layers using sigmoid activations, by roughly what factor is the gradient at layer 1 scaled?

Chapter 2: Convolutional Layers

A fully connected layer connecting a 224×224 RGB image (224 × 224 × 3 = 150,528 input values) to just 1,000 hidden units needs about 150 million weights. That's absurd. Most of those connections are unnecessary because useful image features are local (an edge only depends on nearby pixels) and translation-equivariant (an edge detector should work everywhere in the image).

The Convolution Operation

A convolutional layer applies a small filter (kernel) K of size k × k to every spatial position in the input:

(x * K)[i, j] = ∑_{m=0}^{k−1} ∑_{n=0}^{k−1} K[m, n] · x[i+m, j+n]

The same filter is applied at every position. This is weight sharing: one filter has only k^2 parameters regardless of the input size.
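A direct, unoptimized NumPy version of the sliding-window sum makes the indexing concrete. It is a sketch for a single-channel input with no padding ("valid" output); `conv2d_valid` and the example filter are illustrative choices.

```python
import numpy as np


def conv2d_valid(x, K):
    """out[i, j] = sum over m, n of K[m, n] * x[i + m, j + n], without padding."""
    k_h, k_w = K.shape
    out_h = x.shape[0] - k_h + 1
    out_w = x.shape[1] - k_w + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # Element-wise product of the filter with the patch under it.
            out[i, j] = np.sum(K * x[i:i + k_h, j:j + k_w])
    return out


# 5x5 input: two dark columns on the left, three bright columns on the right.
x = np.zeros((5, 5))
x[:, 2:] = 1.0

# Filter that responds to a dark-to-bright change from left to right.
K = np.array([[-1.0, 0.0, 1.0],
              [-1.0, 0.0, 1.0],
              [-1.0, 0.0, 1.0]])

out = conv2d_valid(x, K)
print(out)   # every row is [3, 3, 0]
```

The filter has 9 parameters no matter how large `x` is; only the number of positions it visits changes.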

Parameter Savings: The Key Insight

Compare for a 100×100 input producing a 100×100 feature map:

| Layer Type | Parameters | Count |
|---|---|---|
| Fully connected | 10,000 × 10,000 | 100,000,000 |
| Conv (3×3 filter) | 3 × 3 | 9 |
| Conv (5×5 filter) | 5 × 5 | 25 |

That's a factor of 10 million reduction! In practice we use Cout filters (one per output channel), each looking at Cin input channels, so the parameter count is Cout × Cin × k × k. Still vastly smaller than fully connected.
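The counts above can be reproduced with two one-line formulas. The helper names are hypothetical, and the last call uses example channel counts (3 in, 64 out) that are not from the text.

```python
def fc_params(n_in, n_out, bias=True):
    """Parameters of a fully connected layer."""
    return n_out * (n_in + (1 if bias else 0))


def conv_params(c_in, c_out, k, bias=True):
    """Parameters of a conv layer: C_out filters of size C_in x k x k, plus one bias each."""
    return c_out * (c_in * k * k + (1 if bias else 0))


print(fc_params(100 * 100, 100 * 100, bias=False))   # 100000000
print(conv_params(1, 1, 3, bias=False))              # 9
print(conv_params(1, 1, 5, bias=False))              # 25
print(conv_params(3, 64, 3))                         # 1792 = 64 * (3 * 9 + 1)
```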

Translation Equivariance

If you shift the input by (dx, dy), the output shifts by exactly (dx, dy). Formally, with T_shift denoting the shift operator: T_shift(x * K) = T_shift(x) * K (ignoring border effects). This means the network doesn't need separate detectors for "cat in top-left" and "cat in bottom-right": one filter handles both.
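The shift identity can be verified numerically. To make it hold exactly, this sketch assumes wrap-around (circular) borders; with zero padding or no padding it holds only away from the borders. `circular_filter` is an illustrative helper, not a library function.

```python
import numpy as np


def circular_filter(x, K):
    """out[i, j] = sum over m, n of K[m, n] * x[(i + m) % H, (j + n) % W]."""
    out = np.zeros_like(x, dtype=float)
    for m in range(K.shape[0]):
        for n in range(K.shape[1]):
            # Rolling by (-m, -n) moves x[i + m, j + n] to position [i, j].
            out += K[m, n] * np.roll(x, shift=(-m, -n), axis=(0, 1))
    return out


rng = np.random.default_rng(0)
x = rng.normal(size=(8, 8))
K = rng.normal(size=(3, 3))
shift = (2, 3)

filter_then_shift = np.roll(circular_filter(x, K), shift, axis=(0, 1))
shift_then_filter = circular_filter(np.roll(x, shift, axis=(0, 1)), K)
print(np.allclose(filter_then_shift, shift_then_filter))   # True
```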

Why CNNs for signals. Audio signals share the same property: a phoneme sounds the same whether it starts at t = 0.5s or t = 2.3s. A 1D convolution (filter sliding along time) is translation equivariant for time-domain signals. For spectrograms (2D: time × frequency), 2D convolution captures patterns that repeat at different times and frequency bands.

1D Convolution for Signals

For a 1D signal x[n] and filter h of length k:

(x * h)[n] = ∑_{m=0}^{k−1} h[m] · x[n + m]

This is the FIR filter from earlier lectures, up to a time reversal of the taps (an FIR filter computes ∑ h[m] · x[n − m]). A CNN layer for audio is a bank of learnable FIR filters, trained end-to-end.
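The relation between the CNN-style sum and an FIR filter can be checked with NumPy: `np.correlate` uses the x[n + m] indexing directly, while `np.convolve` (the FIR form, with x[n − m]) gives the same numbers once the taps are reversed. The signal and filter values are arbitrary examples.

```python
import numpy as np

x = np.array([1.0, 2.0, 4.0, 7.0, 11.0])
h = np.array([1.0, 0.0, -1.0])

# CNN-style layer: y[n] = sum_m h[m] * x[n + m], over positions where the filter fits.
cnn_out = np.array([np.dot(h, x[n:n + len(h)]) for n in range(len(x) - len(h) + 1)])

# Cross-correlation computes the same sum.
corr = np.correlate(x, h, mode="valid")

# FIR filtering (true convolution) matches after reversing the taps.
fir = np.convolve(x, h[::-1], mode="valid")

print(cnn_out, corr, fir)   # all three are [-3. -5. -7.]
```

Because the taps are learned, the reversal has no practical effect on what a CNN layer can represent.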

2D Convolution Sliding Window

A 3×3 filter slides over the input. Watch the dot products produce the output feature map. Change the filter to see different edge detectors.

Check your understanding: a single 5×5 convolutional filter applied to a 1000×1000 input has how many learnable parameters?

Chapter 3: Pooling & Architecture

Convolution preserves spatial resolution (with padding). But for classification, we eventually need a fixed-size representation regardless of input size. Pooling progressively reduces spatial dimensions.

Max Pooling

Divide the feature map into non-overlapping 2×2 blocks. Take the maximum in each block:

pool(x)[i, j] = max(x[2i, 2j], x[2i+1, 2j], x[2i, 2j+1], x[2i+1, 2j+1])

This halves the spatial dimensions in each direction (100×100 → 50×50). It also provides a small amount of translation invariance: shifting the input by 1 pixel often doesn't change which element is the max.

Average Pooling

Same structure, but takes the mean instead of max. Smoother but loses the "which was strongest" information that max pooling preserves.
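Both pooling variants can be written with one reshape, which groups the map into 2×2 blocks. A minimal sketch for a single 2D feature map with even height and width; `pool2x2` is an illustrative name.

```python
import numpy as np


def pool2x2(x, mode="max"):
    """Non-overlapping 2x2 pooling of a 2D map with even height and width."""
    h, w = x.shape
    # blocks[i, di, j, dj] == x[2 * i + di, 2 * j + dj]
    blocks = x.reshape(h // 2, 2, w // 2, 2)
    if mode == "max":
        return blocks.max(axis=(1, 3))
    return blocks.mean(axis=(1, 3))


x = np.array([[1, 3, 2, 0],
              [4, 2, 1, 1],
              [0, 0, 5, 6],
              [1, 2, 7, 8]], dtype=float)

pooled_max = pool2x2(x, "max")
pooled_avg = pool2x2(x, "avg")
print(pooled_max)   # [[4. 2.] [2. 8.]]
print(pooled_avg)   # [[2.5 1.] [0.75 6.5]]
```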

The Standard CNN Architecture

Conv + ReLU (C1 filters)
Detect local features
Pool (2×2)
Reduce spatial size by 2×
Conv + ReLU (C2 filters)
Combine into higher-level features
Pool (2×2)
Reduce again
Flatten + FC + Softmax
Classify based on all features

A common pattern: as spatial dimensions decrease, the number of channels increases. This maintains roughly constant computational cost per layer.

Worked Example: LeNet-5 Dimensions

Input: 32×32×1 (grayscale digit). Architecture:

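| Layer | Operation | Output shape | Parameters |
|---|---|---|---|
| 1 | Conv 5×5, 6 filters | 28×28×6 | 6(1×25 + 1) = 156 |
| 2 | Pool 2×2 | 14×14×6 | 0 |
| 3 | Conv 5×5, 16 filters | 10×10×16 | 16(6×25 + 1) = 2,416 |
| 4 | Pool 2×2 | 5×5×16 | 0 |
| 5 | FC 400→120 | 120 | 48,120 |
| 6 | FC 120→10 | 10 | 1,210 |
| Total | | | ~52K |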

Compare: a fully connected network from 32×32 to 120 hidden units would need 32×32×120 = 122,880 parameters in the first layer alone — more than double the entire CNN.
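The shapes and parameter counts in the table can be traced programmatically. This sketch assumes unpadded ("valid") convolutions with one bias per filter, as the table's formulas imply; the helper names are hypothetical.

```python
def conv(shape, k, c_out):
    """Valid k x k convolution with biases: returns (new_shape, n_params)."""
    h, w, c_in = shape
    return (h - k + 1, w - k + 1, c_out), c_out * (c_in * k * k + 1)


def pool(shape, s=2):
    """Non-overlapping s x s pooling: no parameters."""
    h, w, c = shape
    return (h // s, w // s, c), 0


def fc(shape, n_out):
    """Flatten the input and apply a fully connected layer with biases."""
    n_in = 1
    for d in shape:
        n_in *= d
    return (n_out,), n_out * (n_in + 1)


layers = [
    ("Conv 5x5, 6 filters", lambda s: conv(s, 5, 6)),
    ("Pool 2x2", pool),
    ("Conv 5x5, 16 filters", lambda s: conv(s, 5, 16)),
    ("Pool 2x2", pool),
    ("FC -> 120", lambda s: fc(s, 120)),
    ("FC -> 10", lambda s: fc(s, 10)),
]

shape, total, trace = (32, 32, 1), 0, []
for name, layer in layers:
    shape, n_params = layer(shape)
    total += n_params
    trace.append((name, shape, n_params))
    print(f"{name:22s} {str(shape):14s} {n_params:>7,d}")
print("total parameters:", total)   # 51902
```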

CNN Dimension Reduction

Watch how spatial dimensions shrink and channel count grows through a CNN. Each rectangle represents a feature map.

Check your understanding: after applying 2×2 max pooling to a 64×64×32 feature map, what is the output shape?

Chapter 4: CNNs on Spectrograms

A spectrogram is a 2D image: the x-axis is time, the y-axis is frequency, and pixel intensity is energy. This is the representation we built in Lecture 8 (STFT). The crucial insight: once you have a spectrogram, audio classification becomes image classification, and CNNs excel at images.

Why CNNs Are Perfect for Spectrograms

Consider classifying speech commands ("yes", "no", "stop", "go"). Each word has a distinctive spectro-temporal pattern — specific frequency bands active at specific relative times. A 2D CNN filter detects exactly these kinds of localized time-frequency patterns:

| Filter orientation | What it detects | Example |
|---|---|---|
| Horizontal (time) | Sustained frequency band | Vowel formant |
| Vertical (frequency) | Broadband transient | Plosive consonant (p, t, k) |
| Diagonal (rising) | Rising pitch/formant | Question intonation |
| Diagonal (falling) | Falling pitch | Statement ending |

Pipeline: Raw Audio to Classification

Raw audio x[n]
16kHz, 1 second = 16,000 samples
↓ STFT
Spectrogram S(t, f)
e.g., 64 mel bins × 100 time frames = 64×100 image
↓ CNN
Conv1: 32 filters (3×3)
Detect local time-frequency patterns
↓ Pool + Conv2 + Pool
Feature vector
Flatten: compact representation
↓ FC + Softmax
Class probabilities
P(yes), P(no), P(stop), ...
Translation equivariance for audio. The word "yes" has the same spectral pattern whether it starts at t=0.1s or t=0.5s. A CNN handles this automatically because the same filters slide across time. Equivariance along the frequency axis helps with pitch variation (the same word spoken higher or lower), since on a mel or log-frequency axis a pitch change is approximately a vertical shift. This is a major reason CNNs work so well for audio classification.
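The first stage of the pipeline, turning 16,000 samples into a 2D array a CNN can consume, might look like the sketch below. The 25 ms frame (400 samples), the 10 ms hop (160 samples) and the Hann window are assumptions; the mel filterbank that would reduce the frequency axis to 64 bins is left out to keep the example short.

```python
import numpy as np


def log_spectrogram(x, frame_len=400, hop=160):
    """Log-magnitude STFT: rows are frequency bins, columns are time frames."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack([x[t * hop:t * hop + frame_len] * window
                       for t in range(n_frames)])
    mag = np.abs(np.fft.rfft(frames, axis=1))    # (n_frames, frame_len // 2 + 1)
    return np.log(mag + 1e-6).T                  # (frequency, time)


fs = 16000
t = np.arange(fs) / fs                  # 1 second of audio
x = np.sin(2 * np.pi * 1000 * t)        # synthetic 1 kHz tone
S = log_spectrogram(x)

print(S.shape)                          # (201, 98)
peak_bin = int(np.argmax(S[:, 0]))
print(peak_bin * fs / 400)              # 1000.0 Hz: strongest bin in the first frame
```

The resulting array is what the first convolutional layer would receive as a one-channel image.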

What Each Layer "Sees"

Layer 1 filters (3×3): Small time-frequency patches. With time on the x-axis, horizontal edges mark the upper and lower boundaries of sustained frequency bands, and vertical edges mark onsets and offsets in time.

Layer 2 filters (receptive field 7×7): Combinations of layer-1 features. A rising formant transition. A voiced-to-unvoiced boundary.

Layer 3 filters (receptive field 15×15): Entire phonemes or phoneme sequences. The specific pattern that makes "sh" different from "s".
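Receptive-field sizes follow a simple recurrence: each layer with kernel size k adds (k − 1) times the current spacing between neighbouring outputs, and a stride multiplies that spacing. The layer configurations below are assumptions, since the text does not state strides or pooling; three 3×3 layers that each halve the resolution are one configuration that gives 3, 7, 15.

```python
def receptive_fields(layers):
    """layers: list of (kernel_size, stride). Returns the receptive field after each layer."""
    rf, jump, sizes = 1, 1, []
    for k, s in layers:
        rf += (k - 1) * jump   # widen by (k - 1) input-space steps
        jump *= s              # stride increases the spacing between outputs
        sizes.append(rf)
    return sizes


# Assumed: three 3x3 convolutions, each with stride 2.
strided = receptive_fields([(3, 2), (3, 2), (3, 2)])
print(strided)   # [3, 7, 15]

# Assumed alternative: stride-1 3x3 convolutions separated by 2x2 pooling.
pooled = receptive_fields([(3, 1), (2, 2), (3, 1), (2, 2), (3, 1)])
print(pooled)    # [3, 4, 8, 10, 18]
```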

CNN on a Spectrogram

A synthetic spectrogram is processed by conv filters. Click through filter types to see what each detects. The highlighted regions show where the filter fires strongest.

Check your understanding: along which axes does a CNN with 2D filters achieve translation equivariance on a spectrogram?

Chapter 5: BatchNorm & Dropout

Deep networks are powerful but fragile. Two techniques make them dramatically easier to train.

Batch Normalization

Problem: as training progresses, the distribution of each layer's inputs shifts (internal covariate shift). Layer 5 has to constantly readjust to the changing outputs of layer 4.

Solution: normalize each layer's pre-activations to zero mean and unit variance, computed over the current mini-batch:

ẑ_i = (z_i − μ_B) / √(σ_B^2 + ε)

where μ_B = (1/B) ∑_i z_i and σ_B^2 = (1/B) ∑_i (z_i − μ_B)^2 are the batch statistics, and ε is a small constant for numerical stability.

Then apply learnable scale and shift:

y_i = γ ẑ_i + β

where γ and β are learned parameters. This lets the network "undo" the normalization if needed, but starts from a normalized baseline.

Why BatchNorm works. (1) Keeps pre-activations centered near 0, where sigmoid derivatives are largest and ReLU units are neither all active nor all dead, so gradients flow well. (2) Acts as regularization since batch statistics are noisy. (3) Allows much higher learning rates without divergence. One practical detail: at test time, use running averages of the mean and variance collected during training, not the statistics of the current batch.
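A forward-pass-only sketch of batch normalization for a (batch, features) array, including the switch between batch statistics and running averages. The class name `SimpleBatchNorm`, the momentum of 0.1 and ε = 1e-5 are assumptions for illustration.

```python
import numpy as np


class SimpleBatchNorm:
    """Minimal batch normalization over a (batch, features) array; forward pass only."""

    def __init__(self, num_features, eps=1e-5, momentum=0.1):
        self.gamma = np.ones(num_features)     # learnable scale
        self.beta = np.zeros(num_features)     # learnable shift
        self.running_mean = np.zeros(num_features)
        self.running_var = np.ones(num_features)
        self.eps = eps
        self.momentum = momentum

    def __call__(self, z, training):
        if training:
            mu = z.mean(axis=0)
            var = z.var(axis=0)                # biased (1/B) variance, as in the formula
            # Exponential moving averages, used later at test time.
            self.running_mean = (1 - self.momentum) * self.running_mean + self.momentum * mu
            self.running_var = (1 - self.momentum) * self.running_var + self.momentum * var
        else:
            mu, var = self.running_mean, self.running_var
        z_hat = (z - mu) / np.sqrt(var + self.eps)
        return self.gamma * z_hat + self.beta


rng = np.random.default_rng(0)
bn = SimpleBatchNorm(num_features=4)
z = 3.0 + 2.0 * rng.normal(size=(64, 4))       # batch with mean about 3, std about 2

y_train = bn(z, training=True)
print(y_train.mean(axis=0).round(6), y_train.var(axis=0).round(3))

y_eval = bn(z[:1], training=False)             # a single example works at test time
print(y_eval.shape)
```

In training mode the output has zero mean and (almost exactly) unit variance per feature; in evaluation mode the result depends only on the stored running statistics, so a batch of one is fine.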

Dropout

Problem: networks with millions of parameters can memorize training data perfectly (overfitting). We need regularization.

Solution: during training, randomly set each hidden unit to 0 with probability p (typically p = 0.5):

z̃_m = z_m · r_m / (1 − p)    where r_m ~ Bernoulli(1 − p)

The division by (1 − p) ensures the expected output magnitude is unchanged (inverted dropout).

At test time, use all units (no dropping). The network has learned to be robust — no single unit can memorize a pattern alone, because it might be dropped. This forces distributed representations.
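Inverted dropout takes only a few lines. The sketch below follows the formula above; the function name and the use of NumPy's `Generator` for the random mask are implementation choices, not something the text prescribes.

```python
import numpy as np


def dropout(z, p, rng, training=True):
    """Inverted dropout: zero each unit with probability p, rescale survivors by 1 / (1 - p)."""
    if not training or p == 0.0:
        return z                               # test time: use all units unchanged
    keep = rng.random(z.shape) < (1.0 - p)     # r_m ~ Bernoulli(1 - p)
    return z * keep / (1.0 - p)


rng = np.random.default_rng(0)
z = np.ones(100_000)

out = dropout(z, p=0.5, rng=rng)
print(np.unique(out))        # [0. 2.]: dropped units and rescaled survivors
print(out.mean())            # close to 1, so the expected activation is preserved

out_test = dropout(z, p=0.5, rng=rng, training=False)
print(np.array_equal(out_test, z))
```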

Dropout as ensemble. Each training step uses a different random subnetwork. A network with M hidden units has 2^M possible subnetworks. Dropout approximately averages predictions over all of them: an exponentially large ensemble, trained for the cost of one network.
Dropout Visualization

Each frame shows a different dropout mask. Dropped units (gray) don't contribute. The network must learn redundant representations.

Check your understanding: what happens to dropout at test time (inference)?

Chapter 6: Residual Connections

By 2014, people could train networks with ~20 layers. Going deeper (50, 100, 150 layers) caused degradation: training error actually increased with more layers, even without overfitting. This shouldn't happen — a deeper network could at least learn the identity function for extra layers.

The problem: it's hard to learn the identity mapping f(x) = x through a stack of convolutions and ReLUs. The optimization landscape makes it easier to learn small perturbations than exact pass-through.

The Residual Block

Instead of learning f(x) directly, learn the residual F(x) = f(x) − x. The output becomes:

y = F(x) + x = (Conv → BN → ReLU → Conv → BN)(x) + x

If the optimal transformation is close to identity, F(x) is close to zero — and learning small values near zero is much easier than learning an identity mapping through nonlinear layers.

Skip connections fix gradient flow. During backprop, the gradient of the loss with respect to early layers passes through the skip connection unchanged:

∂L/∂x = (∂L/∂y) · (∂y/∂x) = (∂L/∂y) · (1 + ∂F/∂x)

The "1" term (the identity matrix, for vector-valued x) gives the gradient a direct path through the skip, so it reaches early layers even when ∂F/∂x is very small. This is why ResNets can train 152+ layers.
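The effect of the identity term can be seen by pushing a gradient backward through 50 blocks, once with y = F(x) and once with y = F(x) + x. This is a linearized sketch: each block is represented only by a hypothetical Jacobian ∂F/∂x with small random entries, and the nonlinearities are ignored.

```python
import numpy as np

rng = np.random.default_rng(0)
d, depth = 16, 50

# Hypothetical per-block Jacobians dF/dx with small entries (linearized blocks).
jacobians = [0.2 * rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(depth)]

g_plain = np.ones(d) / np.sqrt(d)    # unit-norm gradient arriving at the output
g_skip = g_plain.copy()

for J in reversed(jacobians):
    g_plain = J.T @ g_plain              # y = F(x):      dL/dx = J^T dL/dy
    g_skip = g_skip + J.T @ g_skip       # y = F(x) + x:  dL/dx = (I + J)^T dL/dy

norm_plain = np.linalg.norm(g_plain)
norm_skip = np.linalg.norm(g_skip)
print(f"without skip: {norm_plain:.3e}")
print(f"with skip:    {norm_skip:.3e}")
```

Without the skip the gradient norm collapses toward zero after 50 blocks, while with the skip it stays at a usable magnitude.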

Why It Works: A River Analogy

Think of the data flow as a river. Each residual block is a tributary: it can add something useful to the main current, but the river (skip connection) keeps flowing regardless. Without skip connections, the river would have to pass through a series of dams (nonlinear layers), losing momentum at each one. With skip connections, the river is unobstructed, and each block just enriches it.

ResNet Architecture

| Model | Layers | Parameters | Top-5 Error (ImageNet) |
|---|---|---|---|
| VGG-16 | 16 | 138M | 7.3% |
| ResNet-18 | 18 | 11M | 10.9% |
| ResNet-50 | 50 | 25M | 6.7% |
| ResNet-152 | 152 | 60M | 5.7% |

ResNet-152 is nearly 10× deeper than VGG-16 but uses fewer parameters (60M vs 138M) and achieves lower error. Depth + residual connections + fewer parameters per layer = a powerful combination.

Residual Block Visualization

Data flows through the main path (transformations) and the skip connection simultaneously. The output is their sum. Toggle the skip to see how gradient flow changes.

Check your understanding: why is learning the residual F(x) = f(x) − x easier than learning f(x) directly?

Chapter 7: Mastery

We've gone from single-layer networks to deep architectures that dominate modern signal processing. Let's consolidate the pieces.

| Concept | What It Solves | Key Formula |
|---|---|---|
| Depth | Exponential efficiency for compositional functions | z(l) = σ(W(l)z(l−1)) |
| Convolution | Parameter sharing + translation equivariance | (x * K)[n] = ∑ K[m]x[n+m] |
| Pooling | Spatial reduction + local invariance | max over 2×2 blocks |
| BatchNorm | Stable training + regularization | ẑ = (z − μ)/σ |
| Dropout | Overfitting prevention via ensemble | z̃ = z · Bernoulli(1−p) |
| Residual | Gradient flow in deep networks | y = F(x) + x |

The Modern Recipe

A production CNN for audio classification in 2024: input mel-spectrogram → [Conv + BN + ReLU + Pool] × 4 → Global Average Pool → FC + Softmax. With residual connections if depth > 20. Trained with Adam optimizer, cosine learning rate schedule, dropout = 0.3 before FC layers.

What Comes Next

CNNs assume translation equivariance is the right inductive bias. But what about sequences where context matters (the meaning of a word depends on surrounding words)? The next lecture introduces attention — a mechanism that dynamically focuses on relevant parts of the input, no matter how far away they are.

Related lessons.
Lecture 23: Neural Networks — single hidden layer foundations
Lecture 8: STFT — how spectrograms are computed
Lecture 25: Attention & Transformers — beyond convolution

"Anything that a human can do with less than one second of thought, we can probably now or soon automate with deep learning." — Andrew Ng