EE269 Lecture 24 — Deep Learning & CNNs

Chapter 0: Why Depth?

A single hidden layer can approximate any continuous function on a bounded domain to arbitrary accuracy, given enough hidden units: that is what the universal approximation theorem guarantees. So why bother with multiple layers?

Because the theorem says nothing about efficiency. Consider a function that detects whether a face is present in an image. A single-layer network would need to learn every possible face pattern directly from pixels. That's an astronomically large number of hidden units.

A deep network, by contrast, builds features hierarchically:

Layer 1: Edges

Detects oriented edges at each location

↓

Layer 2: Parts

Combines edges into eyes, noses, mouths

↓

Layer 3: Objects

Combines parts into faces

Each layer reuses the features from the layer below. An edge detector at layer 1 can be shared across many part detectors at layer 2. This compositionality means deep networks can represent complex functions with exponentially fewer parameters than shallow ones.

Depth = exponential efficiency. Some function classes require O(2ⁿ) nodes with 1 hidden layer but only O(n) nodes with O(log n) layers. Parity (XOR over n bits) is the classic example. Shallow is universal but impractical; deep is both universal and efficient for hierarchically structured problems — which is most of the real world.
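
The parity claim can be made concrete with a small sketch. It assumes two-input XOR gates as the building block (each such gate is itself a tiny network), and the function names are illustrative. A balanced tree of gates needs n − 1 gates and about log₂ n levels, while a flat "one detector per pattern" approach has to cover every odd-parity input.

```python
from itertools import product

def parity_tree(bits):
    """Parity via a balanced tree of 2-input XOR gates.

    Returns (parity, gates_used, depth)."""
    level = list(bits)
    gates = 0
    depth = 0
    while len(level) > 1:
        nxt = []
        for i in range(0, len(level) - 1, 2):
            nxt.append(level[i] ^ level[i + 1])
            gates += 1
        if len(level) % 2 == 1:      # an odd leftover passes through unchanged
            nxt.append(level[-1])
        level = nxt
        depth += 1
    return level[0], gates, depth

def shallow_pattern_count(n):
    """Number of n-bit inputs with odd parity (one detector per pattern)."""
    return sum(1 for bits in product([0, 1], repeat=n) if sum(bits) % 2 == 1)

n = 8
value, gates, depth = parity_tree([1, 0, 1, 1, 0, 0, 1, 0])
print(value, gates, depth)          # parity, n - 1 gates, log2(n) levels
print(shallow_pattern_count(n))     # 2**(n - 1) patterns
```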

The catch? Deep networks are harder to train. Gradients must flow through many layers via the chain rule, and they can vanish (multiply by numbers < 1 repeatedly) or explode (multiply by numbers > 1 repeatedly). This lecture addresses both why depth works and how to make it trainable.

Shallow vs. Deep: Learned Representations

A 1-layer (shallow) and a 3-layer (deep) network learn to classify the same XOR-like pattern. The deep network uses far fewer total parameters.

The main advantage of depth (multiple layers) over width (many hidden units in one layer) is:

• Deeper networks always have more parameters
• Hierarchical feature reuse allows exponentially more efficient representations of compositional functions
• Deep networks don't need backpropagation

Chapter 1: Multi-Layer Networks

A deep network with L layers computes a sequence of transformations. Layer l takes the output of layer l−1 and produces a new representation:

z^(l) = σ(W^(l) z^(l−1) + b^(l)) for l = 1, 2, ..., L

where z⁽⁰⁾ = x (the input), W^(l) is the weight matrix of layer l, b^(l) is the bias vector, and σ is the activation function (typically ReLU for hidden layers).

Data Flow: Shapes and Dimensions

If the input has dimension p, and layers have widths d₁, d₂, ..., d_L:

Layer	Input dim	Output dim	W^(l) shape	Parameters
1	p	d₁	d₁ × p	d₁(p + 1)
2	d₁	d₂	d₂ × d₁	d₂(d₁ + 1)
l	d_{l−1}	d_l	d_l × d_{l−1}	d_l(d_{l−1} + 1)
Output	d_L	K	K × d_L	K(d_L + 1)
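
As a quick check of the table, here is a minimal sketch that counts weights and biases layer by layer. The function name and the example widths are illustrative choices, not part of the lecture.

```python
def mlp_param_count(p, widths, K):
    """Parameters per layer for input dim p, hidden widths d_1..d_L, K outputs."""
    dims = [p] + list(widths) + [K]
    # Each layer has d_out * d_in weights plus d_out biases.
    per_layer = [d_out * (d_in + 1) for d_in, d_out in zip(dims[:-1], dims[1:])]
    return per_layer, sum(per_layer)

per_layer, total = mlp_param_count(p=2, widths=[3, 2], K=1)
print(per_layer, total)
```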

Backprop Through Multiple Layers

The chain rule extends naturally. Define δ^(l) = ∂L/∂a^(l), where a^(l) = W^(l)z^(l−1) + b^(l) is the pre-activation and L here denotes the loss (not the number of layers). Then:

δ^(l) = (W^(l+1))^T δ^(l+1) ⊙ σ'(a^(l))

where ⊙ is element-wise multiplication. The gradient for each weight matrix:

∂L/∂W^(l) = δ^(l) (z^(l−1))^T
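
The recursion can be checked numerically. The sketch below is one possible implementation under stated assumptions: sigmoid activations in every layer, a squared-error loss, plain Python lists instead of a matrix library, and 0-based layer indices in the code. It compares one backprop gradient entry against a central finite difference.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def forward(Ws, bs, x):
    """Return activations zs, where zs[0] = x and zs[i + 1] is layer i's output."""
    zs = [x]
    for W, b in zip(Ws, bs):
        a = [sum(w * z for w, z in zip(row, zs[-1])) + bi for row, bi in zip(W, b)]
        zs.append([sigmoid(v) for v in a])
    return zs

def loss(Ws, bs, x, y):
    out = forward(Ws, bs, x)[-1]
    return 0.5 * sum((o - t) ** 2 for o, t in zip(out, y))

def backprop(Ws, bs, x, y):
    zs = forward(Ws, bs, x)
    out = zs[-1]
    # Output-layer delta: dLoss/dz = (z - y), times sigmoid'(a) = z (1 - z).
    delta = [(o - t) * o * (1 - o) for o, t in zip(out, y)]
    grads = [None] * len(Ws)
    for i in reversed(range(len(Ws))):
        # Weight gradient: outer product of this layer's delta and its input.
        grads[i] = [[d * z for z in zs[i]] for d in delta]
        if i > 0:
            # Push delta back: W^T delta, then multiply elementwise by sigmoid'.
            back = [sum(Ws[i][r][c] * delta[r] for r in range(len(delta)))
                    for c in range(len(zs[i]))]
            delta = [bk * z * (1 - z) for bk, z in zip(back, zs[i])]
    return grads

Ws = [[[0.5, 0.3], [-0.2, 0.7], [0.4, -0.1]], [[0.3, -0.5, 0.8]]]
bs = [[0.1, -0.1, 0.2], [0.0]]
x, y = [1.0, 2.0], [1.0]
grads = backprop(Ws, bs, x, y)

# Central finite difference on a single weight as an independent check.
eps = 1e-6
Ws[0][1][0] += eps
up = loss(Ws, bs, x, y)
Ws[0][1][0] -= 2 * eps
down = loss(Ws, bs, x, y)
Ws[0][1][0] += eps
numeric = (up - down) / (2 * eps)
print(grads[0][1][0], numeric)
```

The two printed numbers should agree to many decimal places; a mismatch would point to an error in the delta recursion.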

The vanishing gradient problem. Each backward step multiplies by σ'(a^(l)). For sigmoid, |σ'| ≤ 0.25, so after L layers the activation derivatives alone scale the gradient by at most 0.25^L (ignoring the weight matrices). For L = 8: 0.25⁸ ≈ 1.5 × 10⁻⁵ of the original. Early layers barely learn. ReLU mitigates this since σ'(a) = 1 for active neurons, but introduces dead neurons (units whose pre-activation stays negative, so their gradient is 0). ResNets (Chapter 6) attack the problem more directly with skip connections.
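
To see how quickly this bound shrinks, here is a small sketch that evaluates the best case for a chain of sigmoid derivatives. It deliberately ignores the weight matrices, and the function names are illustrative.

```python
import math

def sigmoid_prime(a):
    s = 1.0 / (1.0 + math.exp(-a))
    return s * (1.0 - s)

def max_gradient_scale(depth):
    """Largest possible product of `depth` sigmoid derivatives."""
    return sigmoid_prime(0.0) ** depth     # sigmoid'(a) peaks at a = 0

for depth in (1, 2, 4, 8, 16):
    print(depth, max_gradient_scale(depth))
```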

Worked Example: 3-Layer Forward Pass

Input: x = [1, 2]^T. Three layers with widths 3, 2, 1. ReLU activation.

Layer 1: W⁽¹⁾ is 3×2, b⁽¹⁾ is 3×1.

a⁽¹⁾ = W⁽¹⁾x + b⁽¹⁾ = [0.5 0.3; -0.2 0.7; 0.4 -0.1][1; 2] + [0.1; -0.1; 0.2] = [1.2; 1.1; 0.4]

z⁽¹⁾ = ReLU([1.2, 1.1, 0.4]) = [1.2, 1.1, 0.4]

Layer 2: W⁽²⁾ is 2×3.

a⁽²⁾ = W⁽²⁾z⁽¹⁾ + b⁽²⁾ = [0.3 -0.5 0.8; 0.6 0.2 -0.4][1.2; 1.1; 0.4] + [0; 0] = [0.13; 0.78]

z⁽²⁾ = ReLU([0.13, 0.78]) = [0.13, 0.78]

Output: W⁽³⁾ is 1×2.

ŷ = [0.4 0.6] · [0.13; 0.78] + 0.1 = 0.052 + 0.468 + 0.1 = 0.620
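
The forward pass can be recomputed directly from the stated weights. This is a minimal sketch using plain lists; the helper names are illustrative, and the output layer is assumed to be linear (no activation), as in the last step of the example.

```python
def relu(v):
    return [max(0.0, t) for t in v]

def affine(W, z, b):
    """Compute W z + b for a list-of-rows matrix W."""
    return [sum(w * t for w, t in zip(row, z)) + bi for row, bi in zip(W, b)]

x = [1.0, 2.0]
W1, b1 = [[0.5, 0.3], [-0.2, 0.7], [0.4, -0.1]], [0.1, -0.1, 0.2]
W2, b2 = [[0.3, -0.5, 0.8], [0.6, 0.2, -0.4]], [0.0, 0.0]
W3, b3 = [[0.4, 0.6]], [0.1]

z1 = relu(affine(W1, x, b1))
z2 = relu(affine(W2, z1, b2))
y_hat = affine(W3, z2, b3)[0]   # linear output layer
print([round(t, 3) for t in z1], [round(t, 3) for t in z2], round(y_hat, 3))
```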

Deep Network Layer-by-Layer Visualization

Adjust the depth. Watch how intermediate representations transform the 2D input into increasingly abstract features.

In a deep network with L layers using sigmoid activations, the gradient at layer 1 is approximately scaled by:

• 0.25^L, because each sigmoid derivative is at most 0.25 (this is the vanishing gradient problem)
• L × 0.25, growing linearly with depth
• 1, because gradients are normalized at each layer

Chapter 2: Convolutional Layers

A fully connected layer connecting a 224×224 RGB image (224 × 224 × 3 = 150,528 input values) to just 1,000 hidden units needs about 150 million parameters. That's wasteful. Most of those connections are unnecessary because useful image features are local (an edge only depends on nearby pixels) and translation-equivariant (an edge detector should work everywhere in the image).

The Convolution Operation

A convolutional layer applies a small filter (kernel) K of size k × k to every spatial position in the input:

(x * K)[i, j] = ∑_{m=0}^{k−1} ∑_{n=0}^{k−1} K[m, n] · x[i+m, j+n]

The same filter is applied at every position. This is weight sharing — one filter has only k² parameters regardless of the input size.
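
The sliding-window formula translates almost line for line into code. The sketch below assumes no padding and stride 1; the Sobel-style horizontal-edge filter and the toy image are illustrative choices.

```python
def conv2d_valid(x, K):
    """Slide a k x k filter K over x (no padding, stride 1)."""
    k = len(K)
    H, W = len(x), len(x[0])
    return [[sum(K[m][n] * x[i + m][j + n] for m in range(k) for n in range(k))
             for j in range(W - k + 1)]
            for i in range(H - k + 1)]

# 5x5 image: dark top rows, bright bottom rows (a horizontal edge).
image = [[0] * 5, [0] * 5, [1] * 5, [1] * 5, [1] * 5]
# Horizontal-edge filter: the same 9 weights are reused at every position.
K = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]
feature_map = conv2d_valid(image, K)
print(feature_map)
```

The response is large where the window straddles the dark-to-bright boundary and zero inside the uniform bright region.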

Parameter Savings: The Key Insight

Compare for a 100×100 input producing a 100×100 feature map:

Layer Type	Parameters	Count
Fully connected	10,000 × 10,000	100,000,000
Conv (3×3 filter)	3 × 3 = 9	9
Conv (5×5 filter)	5 × 5 = 25	25

That's a reduction by a factor of roughly 10 million (100,000,000 / 9 ≈ 1.1 × 10⁷ for the 3×3 filter). In practice we use C_out filters (one per output channel), each looking at C_in input channels, so the weight count is C_out × C_in × k × k (plus C_out biases). Still vastly smaller than fully connected.

Translation Equivariance

If you shift the input by (dx, dy), the output shifts by exactly (dx, dy). Formally: T_shift(x * K) = T_shift(x) * K. This means the network doesn't need separate detectors for "cat in top-left" and "cat in bottom-right" — one filter handles both.

Why CNNs for signals. Audio signals share the same property: a phoneme sounds the same whether it starts at t = 0.5s or t = 2.3s. A 1D convolution (filter sliding along time) is translation equivariant for time-domain signals. For spectrograms (2D: time × frequency), 2D convolution captures patterns that repeat at different times and frequency bands.
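
Equivariance is easy to verify on a toy 1D signal. The sketch below assumes circular (wrap-around) boundaries so that the identity holds exactly at every sample; with zero padding it holds away from the edges. Function names are illustrative.

```python
def circ_corr(x, h):
    """Circular version of the sliding dot product sum_m h[m] x[n + m]."""
    N = len(x)
    return [sum(h[m] * x[(n + m) % N] for m in range(len(h))) for n in range(N)]

def shift(x, d):
    """Delay x by d samples, wrapping around at the ends."""
    d %= len(x)
    return x[-d:] + x[:-d] if d else list(x)

x = [0, 0, 1, 3, 1, 0, 0, 0]
h = [1, -1]
lhs = shift(circ_corr(x, h), 3)     # filter, then shift
rhs = circ_corr(shift(x, 3), h)     # shift, then filter
print(lhs == rhs)
```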

1D Convolution for Signals

For a 1D signal x[n] and filter h of length k:

(x * h)[n] = ∑_{m=0}^{k−1} h[m] · x[n + m]

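This is the FIR filter from earlier lectures, up to a time reversal: with the usual FIR convention y[n] = ∑ h[m] · x[n − m], the CNN operation is a cross-correlation, i.e. an FIR filter with its coefficients reversed. Because the coefficients are learned, the reversal makes no practical difference. A CNN layer for audio is a bank of learnable FIR filters, trained end-to-end.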
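
The link to FIR filtering can be checked on a short signal. This sketch assumes the standard causal FIR convention y[n] = ∑ b[m] x[n − m] and keeps only the outputs where the filter fully overlaps the signal; the names and the example taps are illustrative.

```python
def correlate_valid(x, h):
    """CNN-style sliding dot product: sum_m h[m] x[n + m]."""
    k = len(h)
    return [sum(h[m] * x[n + m] for m in range(k)) for n in range(len(x) - k + 1)]

def fir_valid(x, b):
    """Causal FIR y[n] = sum_m b[m] x[n - m], for n >= len(b) - 1."""
    k = len(b)
    return [sum(b[m] * x[n - m] for m in range(k)) for n in range(k - 1, len(x))]

x = [1, 4, 2, 8, 5, 7]
h = [1, 0, -1]
cnn_out = correlate_valid(x, h)
fir_out = fir_valid(x, h[::-1])     # FIR with the taps time-reversed
print(cnn_out, fir_out)
```

The two outputs contain the same values; the FIR version simply produces them k − 1 samples later because it is causal.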

2D Convolution Sliding Window

A 3×3 filter slides over the input. Watch the dot products produce the output feature map. Change the filter to see different edge detectors.

A single 5×5 convolutional filter applied to a 1000×1000 input has how many learnable parameters?

• 1,000,000
• 25 (just the 5×5 filter weights, shared across all positions)
• 5,000

Chapter 3: Pooling & Architecture

Convolution preserves spatial resolution (with padding). But for classification, we eventually need a fixed-size representation regardless of input size. Pooling progressively reduces spatial dimensions.

Max Pooling

Divide the feature map into non-overlapping 2×2 blocks. Take the maximum in each block:

pool(x)[i, j] = max(x[2i, 2j], x[2i+1, 2j], x[2i, 2j+1], x[2i+1, 2j+1])

This halves the spatial dimensions in each direction (100×100 → 50×50). It also provides a small amount of translation invariance: shifting the input by 1 pixel often doesn't change which element is the max.
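
A minimal sketch of the pooling formula, assuming the feature map has even height and width (the toy values are illustrative):

```python
def max_pool_2x2(x):
    """Non-overlapping 2x2 max pooling; assumes even height and width."""
    return [[max(x[2 * i][2 * j], x[2 * i + 1][2 * j],
                 x[2 * i][2 * j + 1], x[2 * i + 1][2 * j + 1])
             for j in range(len(x[0]) // 2)]
            for i in range(len(x) // 2)]

fmap = [[1, 2, 0, 0],
        [3, 9, 0, 1],
        [0, 0, 5, 4],
        [0, 7, 6, 2]]
pooled = max_pool_2x2(fmap)
print(pooled)
```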

Average Pooling

Same structure, but takes the mean instead of max. Smoother but loses the "which was strongest" information that max pooling preserves.

The Standard CNN Architecture

Conv + ReLU (C₁ filters)

Detect local features

↓

Pool (2×2)

Reduce spatial size by 2×

↓

Conv + ReLU (C₂ filters)

Combine into higher-level features

↓

Pool (2×2)

Reduce again

↓

Flatten + FC + Softmax

Classify based on all features

A common pattern: as spatial dimensions decrease, the number of channels increases. This maintains roughly constant computational cost per layer.
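
The "roughly constant cost" claim follows from counting multiply-accumulates. The sketch below assumes stride-1, same-padded 3×3 convolutions and a stage that halves the resolution while doubling the channels; the specific sizes are illustrative.

```python
def conv_macs(h, w, c_in, c_out, k=3):
    """Multiply-accumulates for a stride-1, same-padded k x k conv layer."""
    return h * w * c_in * c_out * k * k

stage1 = conv_macs(64, 64, 32, 32)
stage2 = conv_macs(32, 32, 64, 64)   # half the resolution, double the channels
print(stage1, stage2)
```

Halving height and width cuts the number of positions by 4, while doubling both C_in and C_out multiplies the work per position by 4, so the two effects cancel.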

Worked Example: LeNet-5 Dimensions

Input: 32×32×1 (grayscale digit). Architecture (simplified: the original LeNet-5 also has an 84-unit fully connected layer before the output):

Layer	Operation	Output shape	Parameters
1	Conv 5×5, 6 filters	28×28×6	6(1×25 + 1) = 156
2	Pool 2×2	14×14×6	0
3	Conv 5×5, 16 filters	10×10×16	16(6×25 + 1) = 2,416
4	Pool 2×2	5×5×16	0
5	FC 400→120	120	48,120
6	FC 120→10	10	1,210
	Total		~52K

Compare: a fully connected network from 32×32 to 120 hidden units would need 32×32×120 = 122,880 parameters in the first layer alone — more than double the entire CNN.
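
The shapes and parameter counts in the table can be reproduced with a few lines. This sketch assumes unpadded stride-1 convolutions, 2×2 pooling, and one bias per filter or output unit; the helper names are illustrative.

```python
def conv(shape, c_out, k):
    """k x k conv, no padding, stride 1: new shape and parameter count."""
    h, w, c_in = shape
    return (h - k + 1, w - k + 1, c_out), c_out * (c_in * k * k + 1)

def pool(shape):
    """2x2 pooling: halves height and width, no parameters."""
    h, w, c = shape
    return (h // 2, w // 2, c), 0

def fc(n_in, n_out):
    """Fully connected layer: weights plus biases."""
    return n_out * (n_in + 1)

shape, counts = (32, 32, 1), []
for step in (lambda s: conv(s, 6, 5), pool, lambda s: conv(s, 16, 5), pool):
    shape, n = step(shape)
    counts.append(n)
    print(shape, n)

flat = shape[0] * shape[1] * shape[2]      # 5 * 5 * 16 = 400
counts += [fc(flat, 120), fc(120, 10)]
total = sum(counts)
print(counts, total)
```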

CNN Dimension Reduction

Watch how spatial dimensions shrink and channel count grows through a CNN. Each rectangle represents a feature map.

After applying 2×2 max pooling to a 64×64×32 feature map, the output shape is:

• 32×32×32 (spatial halved, channels unchanged)
• 32×32×16 (everything halved)
• 64×64×16 (only channels reduced)

Chapter 4: CNNs on Spectrograms

A spectrogram is a 2D image: the x-axis is time, the y-axis is frequency, and pixel intensity is energy. This is the representation we built in Lecture 8 (STFT). The crucial insight: once you have a spectrogram, audio classification becomes image classification, and CNNs excel at images.

Why CNNs Are Perfect for Spectrograms

Consider classifying speech commands ("yes", "no", "stop", "go"). Each word has a distinctive spectro-temporal pattern — specific frequency bands active at specific relative times. A 2D CNN filter detects exactly these kinds of localized time-frequency patterns:

Filter orientation	What it detects	Example
Horizontal (time)	Sustained frequency band	Vowel formant
Vertical (frequency)	Broadband transient	Plosive consonant (p, t, k)
Diagonal (rising)	Rising pitch/formant	Question intonation
Diagonal (falling)	Falling pitch	Statement ending

Pipeline: Raw Audio to Classification

Raw audio x[n]

16kHz, 1 second = 16,000 samples

↓ STFT

Spectrogram S(t, f)

e.g., 64 mel bins × 100 time frames = 64×100 image

↓ CNN

Conv1: 32 filters (3×3)

Detect local time-frequency patterns

↓ Pool + Conv2 + Pool

Feature vector

Flatten: compact representation

↓ FC + Softmax

Class probabilities

P(yes), P(no), P(stop), ...
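
The sizes in this pipeline can be sanity-checked with some arithmetic. The lecture does not state the STFT window, the hop, or the number of Conv2 filters, so the values below (25 ms window, 10 ms hop, 64 channels after Conv2) are assumptions chosen only to illustrate where a figure like "100 time frames" comes from.

```python
def num_frames(n_samples, win, hop):
    """Frame count when only full windows are kept (no padding)."""
    return 1 + (n_samples - win) // hop

def after_pools(bins, frames, n_pools):
    """Spatial size after n_pools rounds of 2x2 pooling."""
    for _ in range(n_pools):
        bins, frames = bins // 2, frames // 2
    return bins, frames

sr = 16000                                    # 1 second at 16 kHz
frames = num_frames(sr, win=400, hop=160)     # assumed 25 ms window, 10 ms hop
print(frames)                                 # close to the 100 frames used in the text

bins, t = after_pools(64, 100, n_pools=2)
flat_size = bins * t * 64                     # assumed 64 channels after Conv2
print(bins, t, flat_size)
```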

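Translation equivariance for audio. The word "yes" has the same spectral pattern whether it starts at t=0.1s or t=0.5s. A CNN handles this automatically because the same filters slide across time. Equivariance along the frequency axis helps with pitch variation (same word spoken higher or lower), though only approximately: a pitch change scales frequencies, which becomes a shift only on a logarithmic (or, roughly, mel) frequency axis. These two properties are a large part of why CNNs work so well for audio classification.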

What Each Layer "Sees"

Layer 1 filters (3×3): Small time-frequency patches. With time on the x-axis, vertical edges = onsets/offsets (abrupt changes in time). Horizontal edges = the upper and lower boundaries of sustained frequency bands.

Layer 2 filters (receptive field 7×7): Combinations of layer-1 features. A rising formant transition. A voiced-to-unvoiced boundary.

Layer 3 filters (receptive field 15×15): Entire phonemes or phoneme sequences. The specific pattern that makes "sh" different from "s".
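
The receptive field sizes quoted above follow from a simple recursion. This sketch assumes each layer is a 3×3 stride-1 convolution and that the resolution is halved by plain subsampling between layers; counting the 2×2 pooling window itself would give slightly larger values.

```python
def receptive_fields(n_layers, k=3, downsample=2):
    """Receptive field (in input pixels) of each conv layer's units."""
    rf, jump, out = 1, 1, []
    for _ in range(n_layers):
        rf += (k - 1) * jump      # a k x k conv adds (k - 1) steps of size `jump`
        out.append(rf)
        jump *= downsample        # after subsampling, one step spans more input
    return out

print(receptive_fields(3))
```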

CNN on a Spectrogram

A synthetic spectrogram is processed by conv filters. Click through filter types to see what each detects. The highlighted regions show where the filter fires strongest.

A CNN applied to a spectrogram with 2D filters achieves translation equivariance in:

• Both time and frequency: the same filter detects a pattern regardless of when or at what pitch it occurs
• Only time (horizontal axis)
• Only frequency (vertical axis)

Chapter 5: BatchNorm & Dropout

Deep networks are powerful but fragile. Two techniques make them dramatically easier to train.

Batch Normalization

Problem: as training progresses, the distribution of each layer's inputs shifts (internal covariate shift). Layer 5 has to constantly readjust to the changing outputs of layer 4.

Solution: normalize each layer's pre-activations to zero mean and unit variance, computed over the current mini-batch:

ẑ_i = (z_i − μ_B) / √(σ_B² + ε)

where μ_B = (1/B)∑z_i and σ_B² = (1/B)∑(z_i − μ_B)² are the batch statistics, and ε is a small constant for numerical stability.

Then apply learnable scale and shift:

y_i = γ ẑ_i + β

where γ and β are learned parameters. This lets the network "undo" the normalization if needed, but starts from a normalized baseline.

Why BatchNorm works. (1) Keeps pre-activations centered near 0, where sigmoid/tanh derivatives are largest and ReLU units are neither all active nor all dead. (2) Acts as regularization since batch statistics are noisy. (3) Allows much higher learning rates without divergence. One practical note: at test time, use running averages of μ and σ² collected during training (not batch statistics), so the output for one example does not depend on the rest of the batch.
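
Here is a minimal sketch of the two modes for a single feature, using plain lists. The function names, the default ε = 1e-5, and the toy batch are illustrative assumptions; how the running statistics get updated during training is left out.

```python
import math

def batchnorm_train(z, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one feature over a mini-batch z (list of floats)."""
    B = len(z)
    mu = sum(z) / B
    var = sum((v - mu) ** 2 for v in z) / B       # biased variance, as in the formula
    z_hat = [(v - mu) / math.sqrt(var + eps) for v in z]
    return [gamma * v + beta for v in z_hat], mu, var

def batchnorm_eval(z, running_mu, running_var, gamma=1.0, beta=0.0, eps=1e-5):
    """At test time, use stored running statistics instead of batch statistics."""
    return [gamma * (v - running_mu) / math.sqrt(running_var + eps) + beta for v in z]

batch = [2.0, 4.0, 6.0, 8.0]
y, mu, var = batchnorm_train(batch)
print(mu, var, [round(v, 3) for v in y])
print(batchnorm_eval([5.0], running_mu=mu, running_var=var))
```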

Dropout

Problem: networks with millions of parameters can memorize training data perfectly (overfitting). We need regularization.

Solution: during training, randomly set each hidden unit to 0 with probability p (typically p = 0.5):

z̃_m = z_m · r_m / (1 − p) where r_m ~ Bernoulli(1 − p)

The division by (1 − p) ensures the expected output magnitude is unchanged (inverted dropout).

At test time, use all units (no dropping). The network has learned to be robust — no single unit can memorize a pattern alone, because it might be dropped. This forces distributed representations.
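
A short sketch of inverted dropout shows why no rescaling is needed at test time. The seed, the layer size, and the function name are illustrative.

```python
import random

def inverted_dropout(z, p, rng):
    """Zero each unit with probability p and rescale survivors by 1 / (1 - p)."""
    return [v / (1.0 - p) if rng.random() >= p else 0.0 for v in z]

rng = random.Random(0)
z = [1.0] * 100_000
dropped = inverted_dropout(z, p=0.5, rng=rng)
mean_out = sum(dropped) / len(dropped)
print(mean_out)     # close to 1.0: the expected activation is preserved
```

Because the training-time average already matches the undropped activation, inference can simply use z unchanged.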

Dropout as ensemble. Each training step uses a different random subnetwork. A network with M hidden units has 2^M possible subnetworks. Dropout approximately averages predictions over all of them — an exponentially large ensemble, trained for the cost of one network.

Dropout Visualization

Each frame shows a different dropout mask. Dropped units (gray) don't contribute. The network must learn redundant representations.

During test time (inference), dropout:

• Continues to randomly drop units
• Is turned off: all units are active (with inverted dropout, no scaling needed at test time)
• Drops 50% more units than during training

Chapter 6: Residual Connections

By 2014, people could train networks with ~20 layers. Going deeper (50, 100, 150 layers) caused degradation: training error actually increased with more layers, even without overfitting. This shouldn't happen — a deeper network could at least learn the identity function for extra layers.

The problem: it's hard to learn the identity mapping f(x) = x through a stack of convolutions and ReLUs. The optimization landscape makes it easier to learn small perturbations than exact pass-through.

The Residual Block

Instead of learning f(x) directly, learn the residual F(x) = f(x) − x. The output becomes:

y = F(x) + x = (Conv → BN → ReLU → Conv → BN)(x) + x

If the optimal transformation is close to identity, F(x) is close to zero — and learning small values near zero is much easier than learning an identity mapping through nonlinear layers.
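
To illustrate the point, the sketch below replaces the Conv → BN → ReLU → Conv → BN branch with two small matrix multiplies and a ReLU (an assumption made for brevity). With all branch weights at zero, the residual block is exactly the identity, while the same layers without a skip lose the input entirely.

```python
def relu(v):
    return [max(0.0, t) for t in v]

def matvec(W, v):
    return [sum(w * t for w, t in zip(row, v)) for row in W]

def residual_block(x, W1, W2):
    """y = F(x) + x, with F(x) = W2 relu(W1 x) as a stand-in for the conv branch."""
    f = matvec(W2, relu(matvec(W1, x)))
    return [fi + xi for fi, xi in zip(f, x)]

def plain_block(x, W1, W2):
    """Same layers without the skip connection."""
    return matvec(W2, relu(matvec(W1, x)))

x = [1.5, -2.0, 0.3]
zeros = [[0.0] * 3 for _ in range(3)]
print(residual_block(x, zeros, zeros))   # input passes through unchanged
print(plain_block(x, zeros, zeros))      # all information about x is lost
```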

Skip connections fix gradient flow. During backprop, the gradient of the loss with respect to early layers passes through the skip connection unchanged:

∂L/∂x = (∂L/∂y) · (∂y/∂x) = (∂L/∂y) · (1 + ∂F/∂x)

The "1" term gives gradients a direct path through the skip: even when ∂F/∂x is close to zero, the gradient reaching x is still approximately ∂L/∂y instead of shrinking toward zero. This is why ResNets can train 152+ layers.
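
A scalar toy model makes the effect of the "1" term visible. It assumes every layer has the same small local derivative ∂F/∂x = 0.01 (a hypothetical value) and multiplies the per-layer factors along a 50-layer chain, with and without the skip.

```python
def chain_gradient(local_derivs, skip):
    """Product of per-layer derivatives dy/dx along a chain of scalar layers."""
    g = 1.0
    for d in local_derivs:
        g *= (1.0 + d) if skip else d
    return g

dF = [0.01] * 50                         # hypothetical small dF/dx in each layer
plain = chain_gradient(dF, skip=False)
resid = chain_gradient(dF, skip=True)
print(plain)    # vanishes
print(resid)    # stays of order 1
```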

Why It Works: A River Analogy

Think of the data flow as a river. Each residual block is a tributary: it can add something useful to the main current, but the river (skip connection) keeps flowing regardless. Without skip connections, the river would have to pass through a series of dams (nonlinear layers), losing momentum at each one. With skip connections, the river is unobstructed, and each block just enriches it.

ResNet Architecture

Model	Layers	Parameters	Top-5 Error (ImageNet)
VGG-16	16	138M	7.3%
ResNet-18	18	11M	10.9%
ResNet-50	50	25M	6.7%
ResNet-152	152	60M	5.7%

ResNet-152 is 10× deeper than VGG-16 but uses fewer parameters (60M vs 138M) and achieves lower error. Depth + residual connections + fewer parameters per layer = a powerful combination.

Residual Block Visualization

Data flows through the main path (transformations) and the skip connection simultaneously. The output is their sum. Toggle the skip to see how gradient flow changes.

The key insight of residual connections is that learning F(x) = f(x) − x is easier than learning f(x) directly because:

• When the optimal f(x) ≈ x (identity), F(x) ≈ 0, and learning near-zero weights is easier than learning the identity through nonlinear layers
• Residual networks have fewer parameters
• Subtraction makes the activation functions unnecessary

Chapter 7: Mastery

We've gone from single-layer networks to deep architectures that dominate modern signal processing. Let's consolidate the pieces.

Concept	What It Solves	Key Formula
Depth	Exponential efficiency for compositional functions	z^(l) = σ(W^(l)z^(l−1) + b^(l))
Convolution	Parameter sharing + translation equivariance	(x * K)[n] = ∑ K[m]x[n+m]
Pooling	Spatial reduction + local invariance	max over 2×2 blocks
BatchNorm	Stable training + regularization	ẑ = (z − μ_B)/√(σ_B² + ε)
Dropout	Overfitting prevention via ensemble	z̃ = z · r/(1−p), r ~ Bernoulli(1−p)
Residual	Gradient flow in deep networks	y = F(x) + x

The Modern Recipe

A production CNN for audio classification in 2024: input mel-spectrogram → [Conv + BN + ReLU + Pool] × 4 → Global Average Pool → FC + Softmax. With residual connections if depth > 20. Trained with Adam optimizer, cosine learning rate schedule, dropout = 0.3 before FC layers.

What Comes Next

CNNs assume translation equivariance is the right inductive bias. But what about sequences where context matters (the meaning of a word depends on surrounding words)? The next lecture introduces attention — a mechanism that dynamically focuses on relevant parts of the input, no matter how far away they are.

Related lessons.
• Lecture 23: Neural Networks — single hidden layer foundations
• Lecture 8: STFT — how spectrograms are computed
• Lecture 25: Attention & Transformers — beyond convolution

"Anything that a human can do with less than one second of thought, we can probably now or soon automate with deep learning." — Andrew Ng