Stacking layers for hierarchical features, sharing weights for translation equivariance, and conquering spectrograms with convolutions.
The universal approximation theorem guarantees that a single hidden layer, given enough units, can approximate any continuous function on a bounded domain to arbitrary accuracy. So why bother with multiple layers?
Because the theorem says nothing about efficiency. Consider a function that detects whether a face is present in an image. A single-layer network would need to learn every possible face pattern directly from pixels. That's an astronomically large number of hidden units.
A deep network, by contrast, builds features hierarchically, from edges to parts to whole objects such as faces.
Each layer reuses the features from the layer below. An edge detector at layer 1 can be shared across many part detectors at layer 2. This compositionality means that, for functions with this kind of hierarchical structure, deep networks can need exponentially fewer parameters than shallow ones.
The catch? Deep networks are harder to train. Gradients must flow through many layers via the chain rule, and they can vanish (multiply by numbers < 1 repeatedly) or explode (multiply by numbers > 1 repeatedly). This lecture addresses both why depth works and how to make it trainable.
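To get a feel for how quickly repeated multiplication shrinks or inflates a gradient, here is a minimal sketch. It assumes every layer scales the gradient by the same constant factor, which is a simplification; the function name `gradient_scale` is illustrative.

```python
def gradient_scale(factor: float, depth: int) -> float:
    """Magnitude of a unit gradient after flowing back through `depth` layers,
    assuming each layer multiplies it by the same `factor` (a simplification)."""
    scale = 1.0
    for _ in range(depth):
        scale *= factor
    return scale


for depth in (5, 20, 50):
    shrinking = gradient_scale(0.9, depth)   # factor < 1: vanishing
    growing = gradient_scale(1.1, depth)     # factor > 1: exploding
    print(f"depth={depth:3d}  0.9^depth={shrinking:.5f}  1.1^depth={growing:.2f}")
```

Even factors as mild as 0.9 and 1.1 change the gradient by orders of magnitude after 50 layers.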
A 1-layer (shallow) and a 3-layer (deep) network learn to classify the same XOR-like pattern. The deep network uses far fewer total parameters.
A deep network with $L$ layers computes a sequence of transformations. Layer $l$ takes the output of layer $l-1$ and produces a new representation:
$$z^{(l)} = \sigma\left(W^{(l)} z^{(l-1)} + b^{(l)}\right)$$
where $z^{(0)} = x$ (the input), $W^{(l)}$ is the weight matrix of layer $l$, $b^{(l)}$ is the bias vector, and $\sigma$ is the activation function (typically ReLU for hidden layers).
If the input has dimension $p$ and the layers have widths $d_1, d_2, \ldots, d_L$:
| Layer | Input dim | Output dim | $W^{(l)}$ shape | Parameters |
|---|---|---|---|---|
| 1 | $p$ | $d_1$ | $d_1 \times p$ | $d_1(p + 1)$ |
| 2 | $d_1$ | $d_2$ | $d_2 \times d_1$ | $d_2(d_1 + 1)$ |
| $l$ | $d_{l-1}$ | $d_l$ | $d_l \times d_{l-1}$ | $d_l(d_{l-1} + 1)$ |
| Output | $d_L$ | $K$ | $K \times d_L$ | $K(d_L + 1)$ |
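The per-layer counts in the table can be summed with a few lines of Python. This is a sketch; the function name `mlp_param_count` and the example sizes are illustrative, not taken from the lecture.

```python
def mlp_param_count(p, widths, num_outputs):
    """Total weights and biases of a fully connected network.

    Each layer with d_in inputs and d_out outputs contributes d_out * (d_in + 1):
    a d_out x d_in weight matrix plus d_out biases.
    """
    dims = [p, *widths, num_outputs]
    return sum(d_out * (d_in + 1) for d_in, d_out in zip(dims[:-1], dims[1:]))


# 2 inputs, hidden widths 3 and 2, 1 output: 3*(2+1) + 2*(3+1) + 1*(2+1) = 20
small = mlp_param_count(2, [3, 2], 1)
# 784 inputs, one hidden layer of 128 units, 10 outputs: 128*785 + 10*129
larger = mlp_param_count(784, [128], 10)
print(small, larger)
```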
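The chain rule extends naturally to deep networks. Define $\delta^{(l)} = \partial \mathcal{L} / \partial a^{(l)}$, where $\mathcal{L}$ is the loss and $a^{(l)} = W^{(l)} z^{(l-1)} + b^{(l)}$ is the pre-activation. Then:
$$\delta^{(l)} = \left( (W^{(l+1)})^T \delta^{(l+1)} \right) \odot \sigma'(a^{(l)})$$
where $\odot$ is element-wise multiplication. The gradient for each weight matrix:
$$\frac{\partial \mathcal{L}}{\partial W^{(l)}} = \delta^{(l)} \, (z^{(l-1)})^T$$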
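A backward pass like this is easy to get subtly wrong, so it helps to compare it against finite differences. The sketch below assumes a squared-error loss, a linear output layer, and randomly drawn weights. None of these are specified in the lecture, and all names are illustrative.

```python
import numpy as np


def forward(x, Ws, bs):
    """Return pre-activations a^(l) and outputs z^(l); hidden layers use ReLU,
    the last layer is left linear (an assumption made for this sketch)."""
    zs, pre = [x], []
    for l, (W, b) in enumerate(zip(Ws, bs)):
        a = W @ zs[-1] + b
        pre.append(a)
        zs.append(a if l == len(Ws) - 1 else np.maximum(a, 0.0))
    return pre, zs


def loss(x, y, Ws, bs):
    _, zs = forward(x, Ws, bs)
    return 0.5 * float(np.sum((zs[-1] - y) ** 2))


def backward(x, y, Ws, bs):
    """Gradients of the loss with respect to every weight matrix."""
    pre, zs = forward(x, Ws, bs)
    delta = zs[-1] - y                       # dL/da at the linear output layer
    grads = [None] * len(Ws)
    for l in reversed(range(len(Ws))):
        grads[l] = np.outer(delta, zs[l])    # delta^(l) times (layer input)^T
        if l > 0:
            # (W^T delta) elementwise-times ReLU'(a) of the layer below
            delta = (Ws[l].T @ delta) * (pre[l - 1] > 0)
    return grads


rng = np.random.default_rng(0)
sizes = [2, 3, 2, 1]
Ws = [rng.normal(size=(m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
bs = [rng.normal(size=m) for m in sizes[1:]]
x, y = np.array([1.0, 2.0]), np.array([0.5])

grads = backward(x, y, Ws, bs)

# Central finite differences on every weight, compared with backprop.
eps, max_err = 1e-6, 0.0
for l, W in enumerate(Ws):
    for i in range(W.shape[0]):
        for j in range(W.shape[1]):
            original = W[i, j]
            W[i, j] = original + eps
            up = loss(x, y, Ws, bs)
            W[i, j] = original - eps
            down = loss(x, y, Ws, bs)
            W[i, j] = original
            max_err = max(max_err, abs((up - down) / (2 * eps) - grads[l][i, j]))
print("largest disagreement:", max_err)
```

If the recursion were implemented with a wrong index or a missing transpose, the disagreement would be large rather than at the level of floating-point noise.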
Input: $x = [1, 2]^T$. Three layers with widths 3, 2, 1. ReLU activation.
Layer 1: $W^{(1)}$ is 3×2, $b^{(1)}$ is 3×1.
Layer 2: $W^{(2)}$ is 2×3.
Output: $W^{(3)}$ is 1×2.
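The numeric weights for this example are not reproduced here, so the sketch below uses hypothetical values chosen only to make the arithmetic easy to follow by hand. The shapes match the example (3×2, 2×3, 1×2); the entries of `W1`, `b1`, `W2`, `W3` and the linear output are assumptions.

```python
import numpy as np


def relu(a):
    return np.maximum(a, 0.0)


x = np.array([1.0, 2.0])

# Hypothetical weights (not the lecture's values), shapes as in the example.
W1 = np.array([[1.0, 0.0],
               [0.0, 1.0],
               [1.0, -1.0]])          # 3x2
b1 = np.array([0.5, -1.0, 0.0])       # 3x1
W2 = np.array([[1.0, 1.0, 1.0],
               [1.0, -1.0, 0.0]])     # 2x3
W3 = np.array([[2.0, 1.0]])           # 1x2

z1 = relu(W1 @ x + b1)   # pre-activation [1.5, 1.0, -1.0] -> [1.5, 1.0, 0.0]
z2 = relu(W2 @ z1)       # [1.5 + 1.0 + 0.0, 1.5 - 1.0] = [2.5, 0.5]
y = W3 @ z2              # 2 * 2.5 + 1 * 0.5 = 5.5 (output left linear)
print(z1, z2, y)
```

The third hidden unit has a negative pre-activation, so ReLU zeroes it and it contributes nothing downstream.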
Adjust the depth. Watch how intermediate representations transform the 2D input into increasingly abstract features.
A fully connected layer connecting a 224×224 RGB image (224 × 224 × 3 = 150,528 input values) to just 1,000 hidden units needs about 150 million weights. That's absurd. Most of those connections are unnecessary because useful image features are local (an edge only depends on nearby pixels) and translation-equivariant (an edge detector should work everywhere in the image).
A convolutional layer applies a small filter (kernel) $K$ of size $k \times k$ to every spatial position in the input:
$$(x * K)[i, j] = \sum_{m=0}^{k-1} \sum_{n=0}^{k-1} K[m, n] \, x[i + m, j + n]$$
The same filter is applied at every position. This is weight sharing: one filter has only $k^2$ weights regardless of the input size.
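A direct, deliberately slow implementation makes the sliding dot product explicit. This sketch assumes no padding and stride 1, and uses the index convention out[i, j] = Σ K[m, n] · x[i+m, j+n]; the Sobel kernel and test image are illustrative choices.

```python
import numpy as np


def conv2d_valid(x, kernel):
    """Slide `kernel` over `x` with no padding and stride 1.

    out[i, j] = sum over m, n of kernel[m, n] * x[i + m, j + n]
    """
    kh, kw = kernel.shape
    out_h, out_w = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    out = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # Same nine weights reused at every position: weight sharing.
            out[i, j] = np.sum(kernel * x[i:i + kh, j:j + kw])
    return out


# A vertical step edge: left half dark (0), right half bright (1).
image = np.zeros((5, 6))
image[:, 3:] = 1.0

sobel_x = np.array([[-1.0, 0.0, 1.0],
                    [-2.0, 0.0, 2.0],
                    [-1.0, 0.0, 1.0]])

response = conv2d_valid(image, sobel_x)
print(response)   # nonzero only in the columns whose window straddles the edge
```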
Compare for a 100×100 input producing a 100×100 feature map:
| Layer type | Weight matrix or filter size | Parameter count |
|---|---|---|
| Fully connected | 10,000 × 10,000 | 100,000,000 |
| Conv (3×3 filter) | 3 × 3 = 9 | 9 |
| Conv (5×5 filter) | 5 × 5 = 25 | 25 |
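That's a reduction by a factor of roughly 10 million (100,000,000 / 9 ≈ 1.1 × 10^7)! In practice we use $C_{\text{out}}$ filters (one per output channel), each looking at $C_{\text{in}}$ input channels, so the weight count is $C_{\text{out}} \times C_{\text{in}} \times k \times k$, plus one bias per filter. Still vastly smaller than fully connected.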
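The comparison can be reproduced in a few lines. This sketch counts weights only (biases omitted, as in the table); the helper names and the 3-to-64-channel example are illustrative.

```python
def fc_params(n_in, n_out):
    """Weights in a fully connected layer (biases omitted, as in the table)."""
    return n_in * n_out


def conv_params(c_in, c_out, k):
    """Weights in a conv layer: one k x k filter per (input, output) channel pair."""
    return c_out * c_in * k * k


fc = fc_params(100 * 100, 100 * 100)     # 100,000,000
conv = conv_params(1, 1, 3)              # 9
print(fc, conv, round(fc / conv))

# A multi-channel layer: 3 input channels, 64 filters, 3x3 kernels.
print(conv_params(3, 64, 3))             # 64 * 3 * 9 = 1,728
```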
Convolution is translation-equivariant: if you shift the input by (dx, dy), the output shifts by exactly (dx, dy), ignoring boundary effects. Formally: $T_{\text{shift}}(x * K) = T_{\text{shift}}(x) * K$. This means the network doesn't need separate detectors for "cat in top-left" and "cat in bottom-right": one filter handles both.
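Equivariance can be checked numerically. To sidestep edge effects, this sketch assumes circular (wrap-around) boundaries for both the shift and the convolution; with zero padding the identity holds only away from the borders. Function names are illustrative.

```python
import numpy as np


def conv2d_circular(x, kernel):
    """out[i, j] = sum_{m,n} kernel[m, n] * x[(i+m) mod H, (j+n) mod W]."""
    out = np.zeros_like(x, dtype=float)
    for m in range(kernel.shape[0]):
        for n in range(kernel.shape[1]):
            # Rolling by (-m, -n) brings x[i+m, j+n] to position [i, j].
            out += kernel[m, n] * np.roll(x, shift=(-m, -n), axis=(0, 1))
    return out


def shift(x, dx, dy):
    """Translate by dx columns and dy rows, wrapping around the edges."""
    return np.roll(x, shift=(dy, dx), axis=(0, 1))


rng = np.random.default_rng(0)
x = rng.normal(size=(8, 8))
kernel = rng.normal(size=(3, 3))

shift_then_conv = conv2d_circular(shift(x, 2, 3), kernel)
conv_then_shift = shift(conv2d_circular(x, kernel), 2, 3)
equivariant = np.allclose(shift_then_conv, conv_then_shift)
print("shift and convolution commute:", equivariant)
```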
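For a 1D signal $x[n]$ and filter $h$ of length $k$:
$$y[n] = \sum_{m=0}^{k-1} h[m] \, x[n + m]$$
This is the FIR filter from earlier lectures, up to a time reversal of the coefficients (an FIR filter computes $\sum_m h[m] \, x[n - m]$), which makes no difference when the coefficients are learned. A CNN layer for audio is a bank of learnable FIR filters, trained end-to-end.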
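The link to FIR filtering can be seen with NumPy's built-ins. The sketch assumes the cross-correlation convention y[n] = Σ h[m] · x[n+m] (the one used in the summary table), which equals textbook FIR convolution with the taps reversed. The signal and taps are arbitrary examples.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
h = np.array([1.0, 0.0, -1.0])          # taps of a simple difference filter

# CNN-style layer: y[n] = sum_m h[m] * x[n + m]
y_cnn = np.correlate(x, h, mode="valid")

# Textbook FIR filter with the taps reversed: sum_m h_rev[m] * x[n - m]
y_fir = np.convolve(x, h[::-1], mode="valid")

print(y_cnn, y_fir)   # both are x[n] - x[n + 2]
```

Because the taps are learned rather than designed, the network is free to learn either orientation, so the two conventions are interchangeable in practice.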
A 3×3 filter slides over the input. Watch the dot products produce the output feature map. Change the filter to see different edge detectors.
Convolution preserves spatial resolution (with padding). But for classification, we eventually need a fixed-size representation regardless of input size. Pooling progressively reduces spatial dimensions.
Max pooling divides the feature map into non-overlapping 2×2 blocks and takes the maximum in each block:
$$y[i, j] = \max_{0 \le m, n < 2} x[2i + m, 2j + n]$$
This halves the spatial dimensions in each direction (100×100 → 50×50). It also provides a small amount of translation invariance: shifting the input by 1 pixel often doesn't change which element is the max.
Average pooling has the same structure but takes the mean instead of the max. It is smoother but loses the "which was strongest" information that max pooling preserves.
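Both kinds of pooling reduce to a reshape followed by a reduction. This sketch assumes non-overlapping 2×2 blocks and even height and width; `pool2x2` is an illustrative name.

```python
import numpy as np


def pool2x2(x, reduce=np.max):
    """Non-overlapping 2x2 pooling; `reduce` is np.max or np.mean."""
    h, w = x.shape
    if h % 2 or w % 2:
        raise ValueError("height and width must be even")
    # blocks[i, a, j, b] == x[2*i + a, 2*j + b]
    blocks = x.reshape(h // 2, 2, w // 2, 2)
    return reduce(blocks, axis=(1, 3))


x = np.arange(16, dtype=float).reshape(4, 4)
max_pooled = pool2x2(x, np.max)     # [[5, 7], [13, 15]]
avg_pooled = pool2x2(x, np.mean)    # [[2.5, 4.5], [10.5, 12.5]]
print(max_pooled)
print(avg_pooled)
```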
A common pattern: as spatial dimensions decrease, the number of channels increases. This maintains roughly constant computational cost per layer.
Input: 32×32×1 (grayscale digit). Architecture:
| Layer | Operation | Output shape | Parameters |
|---|---|---|---|
| 1 | Conv 5×5, 6 filters | 28×28×6 | 6(1×25 + 1) = 156 |
| 2 | Pool 2×2 | 14×14×6 | 0 |
| 3 | Conv 5×5, 16 filters | 10×10×16 | 16(6×25 + 1) = 2,416 |
| 4 | Pool 2×2 | 5×5×16 | 0 |
| 5 | FC 400→120 | 120 | 48,120 |
| 6 | FC 120→10 | 10 | 1,210 |
| Total | | | ~52K |
Compare: a fully connected network from 32×32 to 120 hidden units would need 32×32×120 = 122,880 parameters in the first layer alone — more than double the entire CNN.
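The shapes and parameter counts in the table can be traced mechanically. This sketch assumes unpadded stride-1 convolutions and non-overlapping pooling, which is what the listed output shapes imply; the helper functions are illustrative.

```python
def conv(shape, k, filters):
    """Unpadded stride-1 convolution: each filter has c*k*k weights plus a bias."""
    h, w, c = shape
    return (h - k + 1, w - k + 1, filters), filters * (c * k * k + 1)


def pool(shape, size=2):
    h, w, c = shape
    return (h // size, w // size, c), 0


def fc(shape, units):
    n_in = 1
    for d in shape:
        n_in *= d                      # flatten whatever comes in
    return (units,), units * (n_in + 1)


layers = [
    (conv, {"k": 5, "filters": 6}),
    (pool, {}),
    (conv, {"k": 5, "filters": 16}),
    (pool, {}),
    (fc, {"units": 120}),
    (fc, {"units": 10}),
]

shape, total = (32, 32, 1), 0
for fn, kwargs in layers:
    shape, n_params = fn(shape, **kwargs)
    total += n_params
    print(f"{fn.__name__:4s} -> {shape}  params={n_params}")
print("total:", total)
```

The total comes to 51,902, which the table rounds to ~52K.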
Watch how spatial dimensions shrink and channel count grows through a CNN. Each rectangle represents a feature map.
A spectrogram is a 2D image: the x-axis is time, the y-axis is frequency, and pixel intensity is energy. This is the representation we built in Lecture 8 (STFT). The crucial insight: once you have a spectrogram, audio classification becomes image classification, and CNNs excel at images.
Consider classifying speech commands ("yes", "no", "stop", "go"). Each word has a distinctive spectro-temporal pattern — specific frequency bands active at specific relative times. A 2D CNN filter detects exactly these kinds of localized time-frequency patterns:
| Filter orientation | What it detects | Example |
|---|---|---|
| Horizontal (time) | Sustained frequency band | Vowel formant |
| Vertical (frequency) | Broadband transient | Plosive consonant (p, t, k) |
| Diagonal (rising) | Rising pitch/formant | Question intonation |
| Diagonal (falling) | Falling pitch | Statement ending |
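The first two rows of the table can be illustrated with hand-built line detectors on toy spectrograms. Everything here is a stand-in: the 12×12 arrays (rows are frequency bins, columns are time frames), the kernels, and the helper name are assumptions, and `sliding_window_view` requires NumPy 1.20 or later.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def correlate2d_valid(spec, kernel):
    """Unpadded sliding dot product of `kernel` over `spec`."""
    windows = sliding_window_view(spec, kernel.shape)   # (H-2, W-2, 3, 3)
    return np.einsum("ijmn,mn->ij", windows, kernel)


# Line detectors: respond to a bright line, not to flat regions.
horizontal = np.array([[-1.0, -1.0, -1.0],
                       [2.0, 2.0, 2.0],
                       [-1.0, -1.0, -1.0]])
vertical = horizontal.T

# Rows = frequency bins, columns = time frames.
tone = np.zeros((12, 12))
tone[5, :] = 1.0       # sustained energy in one frequency bin
click = np.zeros((12, 12))
click[:, 7] = 1.0      # energy in all bins at one instant

for name, spec in (("tone", tone), ("click", click)):
    h_max = correlate2d_valid(spec, horizontal).max()
    v_max = correlate2d_valid(spec, vertical).max()
    print(f"{name:5s}  horizontal={h_max:.0f}  vertical={v_max:.0f}")
```

The horizontal detector fires on the sustained tone and stays silent on the click, and the vertical detector does the opposite. A trained CNN would learn such filters from data rather than have them specified by hand.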
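Layer 1 filters (3×3): Small time-frequency patches. Horizontal edges mark where a frequency band begins or ends along the frequency axis (its upper and lower boundaries). Vertical edges mark abrupt spectral changes over time, such as the onset or offset of a sound.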
Layer 2 filters (receptive field 7×7): Combinations of layer-1 features. A rising formant transition. A voiced-to-unvoiced boundary.
Layer 3 filters (receptive field 15×15): Entire phonemes or phoneme sequences. The specific pattern that makes "sh" different from "s".
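Receptive-field sizes follow from kernel sizes and strides. The lecture does not state the strides; the quoted 3, 7, 15 progression is what you get if each 3×3 layer downsamples by a factor of 2, so the sketch below treats that as an assumption. The function name is illustrative.

```python
def receptive_fields(layers):
    """Receptive field (in input samples, per axis) after each layer.

    `layers` is a list of (kernel_size, stride) pairs. `jump` is the distance
    in input samples between neighbouring outputs of the current layer.
    """
    rf, jump, sizes = 1, 1, []
    for kernel_size, stride in layers:
        rf += (kernel_size - 1) * jump
        jump *= stride
        sizes.append(rf)
    return sizes


strided = receptive_fields([(3, 2), (3, 2), (3, 2)])     # assumed stride 2
unstrided = receptive_fields([(3, 1), (3, 1), (3, 1)])   # for comparison
print(strided, unstrided)
```

Without any downsampling, three 3×3 layers would see only 7×7 input patches, which is one reason pooling or striding matters for capturing phoneme-scale patterns.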
A synthetic spectrogram is processed by conv filters. Click through filter types to see what each detects. The highlighted regions show where the filter fires strongest.
Deep networks are powerful but fragile. Two techniques make them dramatically easier to train.
Problem: as training progresses, the distribution of each layer's inputs shifts (internal covariate shift). Layer 5 has to constantly readjust to the changing outputs of layer 4.
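Solution (batch normalization): normalize each layer's pre-activations to zero mean and unit variance, computed over the current mini-batch:
$$\hat{z}_i = \frac{z_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}$$
where $\mu_B = \frac{1}{B} \sum_i z_i$ and $\sigma_B^2 = \frac{1}{B} \sum_i (z_i - \mu_B)^2$ are the batch statistics, and $\epsilon$ is a small constant for numerical stability.
Then apply a learnable scale and shift:
$$y_i = \gamma \hat{z}_i + \beta$$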
where γ and β are learned parameters. This lets the network "undo" the normalization if needed, but starts from a normalized baseline.
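A training-mode forward pass takes only a few lines of NumPy. This sketch covers the batch-statistics step only; the running averages that frameworks keep for use at test time are omitted, and the batch size, feature count, and `eps` value are arbitrary choices.

```python
import numpy as np


def batch_norm(z, gamma, beta, eps=1e-5):
    """Training-mode batch norm over axis 0 (the mini-batch axis)."""
    mu = z.mean(axis=0)
    var = z.var(axis=0)                      # biased (1/B) variance
    z_hat = (z - mu) / np.sqrt(var + eps)
    return gamma * z_hat + beta


rng = np.random.default_rng(0)
z = rng.normal(loc=3.0, scale=5.0, size=(64, 4))   # batch of 64, 4 features

out = batch_norm(z, gamma=np.ones(4), beta=np.zeros(4))
print(out.mean(axis=0))   # close to 0 for every feature
print(out.var(axis=0))    # close to 1 for every feature

# With gamma = batch std and beta = batch mean, the layer nearly undoes itself.
restored = batch_norm(z, gamma=z.std(axis=0), beta=z.mean(axis=0))
```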
Problem: networks with millions of parameters can memorize training data perfectly (overfitting). We need regularization.
Solution (dropout): during training, randomly set each hidden unit to 0 with probability $p$ (typically $p = 0.5$):
$$\tilde{z} = \frac{m \odot z}{1 - p}, \qquad m_j \sim \text{Bernoulli}(1 - p)$$
The division by (1 − p) ensures the expected output magnitude is unchanged (inverted dropout).
At test time, use all units (no dropping). The network has learned to be robust — no single unit can memorize a pattern alone, because it might be dropped. This forces distributed representations.
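Inverted dropout is short enough to write out in full. In this sketch the function signature and the all-ones test vector are illustrative; the random generator is passed in explicitly so the mask is reproducible.

```python
import numpy as np


def dropout(z, p, rng, training=True):
    """Inverted dropout: zero each unit with probability p, rescale survivors."""
    if not training or p == 0.0:
        return z                              # test time: use all units as-is
    keep = rng.random(z.shape) >= p           # True with probability 1 - p
    return z * keep / (1.0 - p)


rng = np.random.default_rng(0)
z = np.ones(100_000)

dropped = dropout(z, p=0.5, rng=rng)
print(np.unique(dropped))     # survivors are scaled to 1 / (1 - p) = 2
print(dropped.mean())         # close to 1: expected magnitude is preserved

at_test_time = dropout(z, p=0.5, rng=rng, training=False)
```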
Each frame shows a different dropout mask. Dropped units (gray) don't contribute. The network must learn redundant representations.
By 2014, people could train networks with ~20 layers. Going deeper (50, 100, 150 layers) caused degradation: training error actually increased with more layers, even without overfitting. This shouldn't happen — a deeper network could at least learn the identity function for extra layers.
The problem: it's hard to learn the identity mapping f(x) = x through a stack of convolutions and ReLUs. The optimization landscape makes it easier to learn small perturbations than exact pass-through.
Instead of learning $f(x)$ directly, learn the residual $F(x) = f(x) - x$. The output becomes:
$$y = F(x) + x$$
If the optimal transformation is close to identity, F(x) is close to zero — and learning small values near zero is much easier than learning an identity mapping through nonlinear layers.
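The point about identity being the easy case shows up directly in code. This sketch uses a two-layer fully connected residual branch, F(x) = W2 · ReLU(W1 · x), as a stand-in for the convolutional blocks in a real ResNet; the names and sizes are illustrative.

```python
import numpy as np


def plain_block(x, W1, W2):
    """Two stacked layers with no skip connection."""
    return W2 @ np.maximum(W1 @ x, 0.0)


def residual_block(x, W1, W2):
    """y = F(x) + x, where F is the plain two-layer branch."""
    return plain_block(x, W1, W2) + x


x = np.array([1.0, -2.0, 3.0])
zero = np.zeros((3, 3))

# With all-zero weights the residual branch outputs 0, so the block is the identity.
y_residual = residual_block(x, zero, zero)
# The same weights in a plain block wipe the signal out entirely.
y_plain = plain_block(x, zero, zero)
print(y_residual, y_plain)
```

A residual block therefore starts out close to a pass-through when its weights are small, whereas a plain block has to learn specific nonzero weights just to avoid destroying its input.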
Think of the data flow as a river. Each residual block is a tributary: it can add something useful to the main current, but the river (skip connection) keeps flowing regardless. Without skip connections, the river would have to pass through a series of dams (nonlinear layers), losing momentum at each one. With skip connections, the river is unobstructed, and each block just enriches it.
| Model | Layers | Parameters | Top-5 Error (ImageNet) |
|---|---|---|---|
| VGG-16 | 16 | 138M | 7.3% |
| ResNet-18 | 18 | 11M | 10.9% |
| ResNet-50 | 50 | 25M | 6.7% |
| ResNet-152 | 152 | 60M | 5.7% |
ResNet-152 is nearly 10× deeper than VGG-16 (152 vs 16 layers) but uses fewer than half the parameters (60M vs 138M) and achieves lower error. Depth + residual connections + fewer parameters per layer = a powerful combination.
Data flows through the main path (transformations) and the skip connection simultaneously. The output is their sum. Toggle the skip to see how gradient flow changes.
We've gone from single-layer networks to deep architectures that dominate modern signal processing. Let's consolidate the pieces.
| Concept | What It Solves | Key Formula |
|---|---|---|
| Depth | Exponential efficiency for compositional functions | z^(l) = σ(W^(l) z^(l−1) + b^(l)) |
| Convolution | Parameter sharing + translation equivariance | (x * K)[n] = ∑ K[m]x[n+m] |
| Pooling | Spatial reduction + local invariance | max over 2×2 blocks |
| BatchNorm | Stable training + regularization | ẑ = (z − μ_B)/√(σ_B² + ε) |
| Dropout | Overfitting prevention via ensemble | z̃ = (m ⊙ z)/(1 − p), m ~ Bernoulli(1 − p) |
| Residual | Gradient flow in deep networks | y = F(x) + x |
A production CNN for audio classification in 2024: input mel-spectrogram → [Conv + BN + ReLU + Pool] × 4 → Global Average Pool → FC + Softmax. With residual connections if depth > 20. Trained with Adam optimizer, cosine learning rate schedule, dropout = 0.3 before FC layers.
CNNs assume translation equivariance is the right inductive bias. But what about sequences where context matters (the meaning of a word depends on surrounding words)? The next lecture introduces attention — a mechanism that dynamically focuses on relevant parts of the input, no matter how far away they are.
"Anything that a human can do with less than one second of thought, we can probably now or soon automate with deep learning." — Andrew Ng