Ch 9: Convolutional Networks — Goodfellow Deep Learning

Chapter 0: Why Convolution?

Imagine trying to recognize a cat in a 256×256 image using a fully-connected network. The input is 256 × 256 × 3 = 196,608 values (65,536 pixels × 3 color channels). If the first hidden layer has 1000 neurons, that is about 196 million weights in the first layer alone. This is wildly impractical and highly prone to overfitting.

The key insight: images have spatial structure. A pixel is strongly related to its neighbors, weakly related to distant pixels. A cat's ear looks the same whether it appears in the top-left or bottom-right. We can exploit these two properties, locality and translation symmetry, to build dramatically more efficient networks.

Three key ideas of CNNs:
1. Sparse connectivity: Each neuron connects only to a small local patch, not the entire input.
2. Parameter sharing: The same filter weights are used at every spatial position.
3. Equivariant representation: If the input shifts, the output shifts by the same amount.

Together, sparse connectivity and parameter sharing reduce the number of parameters by orders of magnitude. A 3×3 convolution filter has just 9 weights per input channel, regardless of the input size. The same 9 weights are reused across every position in the image, detecting the same local pattern everywhere.
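
The arithmetic above can be checked with a short sketch. The helper names are hypothetical and bias terms are ignored; it only counts weights under the sizes quoted in the text.

```python
# Hypothetical helpers for counting weights; biases are ignored.

def fc_weights(height, width, channels, hidden_units):
    """Weights in a fully-connected layer fed by a flattened image."""
    return height * width * channels * hidden_units


def conv_filter_weights(kernel_size, in_channels=1):
    """Weights in one square convolution filter; independent of image size."""
    return kernel_size * kernel_size * in_channels


fc = fc_weights(256, 256, 3, 1000)
conv = conv_filter_weights(3)

print(fc)    # 196608000
print(conv)  # 9
```

Note that the filter count does not depend on the image height or width at all, which is the point of the comparison.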

Fully connected: 196M parameters for one layer. Overfits. Ignores spatial structure.

Convolutional: 9 parameters per filter. Exploits locality. Translation equivariant.

Why do CNNs use far fewer parameters than fully-connected networks for images?

1. Each filter connects only to a small local patch (sparse connectivity) and the same filter weights are reused across all spatial positions (parameter sharing)
2. CNNs use smaller images
3. CNNs have fewer layers

Chapter 1: The Convolution Operation

A convolution slides a small filter (kernel) across the input and computes a dot product at each position. For a 2D input I and a 2D kernel K:

S(i, j) = (I * K)(i, j) = ∑_m ∑_n I(i + m, j + n) · K(m, n)

Technically, neural networks use cross-correlation (no kernel flip), but everyone calls it "convolution." The kernel slides over the input, computing a weighted sum at each position. The output is a feature map that activates wherever the input matches the pattern encoded by the kernel.

What kernels detect: A horizontal edge kernel like [[-1,-1,-1],[0,0,0],[1,1,1]] computes the vertical gradient — it activates wherever the image transitions from dark above to bright below. Different kernels detect edges, corners, textures, and eventually high-level features like eyes and wheels.
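
As a minimal sketch of the sliding dot product, the function below implements the formula for S(i, j) on plain nested lists, with no padding and stride 1. The function name and the toy image are assumptions made for illustration.

```python
def cross_correlate2d(image, kernel):
    """Valid 2D cross-correlation (no padding, stride 1) on nested lists."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    out = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            total = 0
            for m in range(kh):
                for n in range(kw):
                    total += image[i + m][j + n] * kernel[m][n]
            row.append(total)
        out.append(row)
    return out


horizontal_edge = [[-1, -1, -1], [0, 0, 0], [1, 1, 1]]

# Toy 5x4 image: dark (0) in the top rows, bright (1) in the bottom rows.
image = [
    [0, 0, 0, 0],
    [0, 0, 0, 0],
    [1, 1, 1, 1],
    [1, 1, 1, 1],
    [1, 1, 1, 1],
]

feature_map = cross_correlate2d(image, horizontal_edge)
print(feature_map)  # [[3, 3], [3, 3], [0, 0]]
```

The feature map is positive where the patch straddles the dark-to-bright transition and zero where the patch is uniformly bright.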

In practice, a convolutional layer has multiple filters, each producing its own feature map. The first layer might have 64 filters detecting 64 different low-level patterns. The input to the next layer is a stack of these 64 feature maps, and the next layer's filters operate on all of them simultaneously.
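
Because each filter in a later layer spans all incoming feature maps, its weight count scales with the number of input channels. The sketch below assumes 3×3 kernels, an RGB input, and 64 filters in the second layer as well; these sizes are illustrative and not stated in the text.

```python
def conv_layer_weights(kernel_size, in_channels, out_channels):
    """Weights in a conv layer (biases ignored): one k x k x C_in filter per output map."""
    return kernel_size * kernel_size * in_channels * out_channels


first = conv_layer_weights(3, in_channels=3, out_channels=64)    # RGB input
second = conv_layer_weights(3, in_channels=64, out_channels=64)  # 64 maps in

print(first, second)  # 1728 36864
```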

[Interactive figure: 2D Convolution. A 3×3 kernel slides across a 7×7 input; the output value at each position is the dot product of the kernel with the local patch.]

What does a convolutional layer compute at each spatial position?

1. A dot product between the kernel weights and the local input patch, producing one value in the output feature map
2. The maximum value in the local patch
3. The average of all input values

Chapter 2: Sparse Connectivity

In a fully-connected layer, every output is connected to every input. In a convolutional layer, each output depends on only a small receptive field of the input — the kernel size. A 3×3 kernel means each output sees only 9 input values.

But deeper layers see more. If layer 1 has a 3×3 kernel (receptive field = 3), and layer 2 also has 3×3, then each layer-2 output effectively "sees" a 5×5 region of the original input. After k layers with 3×3 kernels, the effective receptive field is (2k + 1) × (2k + 1).

Key insight: Stacking several small filters (3×3) is often better than using one large filter (7×7). Three 3×3 layers have a 7×7 receptive field but with more nonlinearities and fewer parameters: 3 × (3×3) = 27 vs 7×7 = 49 weights per input-output channel pair. This is a central insight of VGGNet (Simonyan & Zisserman, 2014).
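
The receptive-field and parameter arithmetic can be sketched as follows. The function names are hypothetical, and the counts assume stride-1 layers and a single input-output channel pair.

```python
def receptive_field(num_layers, kernel_size=3):
    """Side length of the receptive field after stacking stride-1 conv layers."""
    return num_layers * (kernel_size - 1) + 1


def stacked_weights(num_layers, kernel_size=3):
    """Weights per input-output channel pair across the whole stack."""
    return num_layers * kernel_size * kernel_size


for k in (1, 2, 3):
    print(k, receptive_field(k))  # 1 3, then 2 5, then 3 7

print(stacked_weights(3), 7 * 7)  # 27 49
```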

The indirect connections through multiple layers create a hierarchy: early layers detect local features (edges, textures), middle layers combine them into parts (eyes, wheels), and deep layers recognize objects (faces, cars). Each layer adds abstraction.

[Interactive figure: Receptive Field Growth. Each layer of 3×3 convolutions increases the receptive field by 2 in each direction. The highlighted region shows what one output neuron can "see."]

Why is stacking three 3×3 conv layers better than one 7×7 layer?

1. Same receptive field (7×7) but with more nonlinearities (better expressiveness) and fewer parameters (27 vs 49)
2. The 7×7 layer is too slow to compute
3. The 3×3 layers use more memory

Chapter 3: Parameter Sharing

The defining feature of convolution: the same kernel weights are used at every spatial position. A 3×3 kernel has 9 weights. Whether it is applied at position (0,0) or (100,100), the same 9 weights compute the output. This is a radical form of regularization.

Parameter sharing encodes the prior belief that if a feature is useful to detect at one location, it is useful at all locations. A vertical edge detector is equally valuable in every part of the image. This is translation equivariance: if the input shifts by (dx, dy), the feature map shifts by (dx, dy).

f(shift(x)) = shift(f(x))
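
A small 1D sketch can make the identity concrete. The helper names and the toy signal are assumptions; the shift fills with zeros, so the two sides agree exactly here only because the signal is zero near its borders.

```python
def cross_correlate1d(signal, kernel):
    """Valid 1D cross-correlation (no padding, stride 1)."""
    k = len(kernel)
    return [
        sum(signal[i + m] * kernel[m] for m in range(k))
        for i in range(len(signal) - k + 1)
    ]


def shift_right(values, amount):
    """Shift right by `amount`, filling with zeros and dropping the overflow."""
    return [0] * amount + values[:len(values) - amount]


signal = [0, 0, 1, 3, 2, 0, 0, 0]
kernel = [1, 0, -1]

f_then_shift = shift_right(cross_correlate1d(signal, kernel), 2)
shift_then_f = cross_correlate1d(shift_right(signal, 2), kernel)

print(f_then_shift)  # [0, 0, -1, -3, -1, 3]
print(shift_then_f)  # [0, 0, -1, -3, -1, 3]
```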

Parameter sharing is arguably the most powerful regularizer in CNNs. Without it, a layer that applies 64 filters of size 3×3 at every position of a single-channel 224×224 image would need 224 × 224 × 9 × 64 ≈ 29 million parameters in the first layer alone. With sharing, it needs 9 × 64 = 576. That is roughly a 50,000× reduction.

Not all problems benefit from full translation equivariance. Face detection, for instance, might want different features at different positions (eyes appear in the upper third, mouth in the lower third). In these cases, locally connected layers use different filters at each position but still enforce sparse connectivity. They are rarely used in practice because the efficiency gains of sharing are too large to give up.
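
The gap between shared and unshared (locally connected) weights can be sketched with two hypothetical helpers. The sketch assumes a single-channel input and one filter application per pixel, matching the 224×224 example in the text.

```python
def shared_weights(kernel_size, num_filters):
    """Convolutional layer: one k x k filter per output map, reused everywhere."""
    return kernel_size * kernel_size * num_filters


def locally_connected_weights(height, width, kernel_size, num_filters):
    """Locally connected layer: a separate k x k filter at every position."""
    return height * width * kernel_size * kernel_size * num_filters


shared = shared_weights(3, 64)
unshared = locally_connected_weights(224, 224, 3, 64)

print(shared)              # 576
print(unshared)            # 28901376
print(unshared // shared)  # 50176
```

The ratio is exactly the number of spatial positions, 224 × 224, since sharing removes one copy of the filter bank per position.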

What prior belief does parameter sharing encode?

1. That if a feature is useful to detect at one location, it is equally useful at all locations (translation equivariance)
2. That all features have the same importance
3. That the image is always centered

Chapter 4: Pooling

Pooling reduces the spatial dimensions of a feature map by summarizing local regions. The most common variant, max pooling, takes the maximum value in each non-overlapping window (typically 2×2 with stride 2), halving the spatial dimensions.

MaxPool_2×2: H × W → H/2 × W/2

Pooling provides local translation invariance. If a feature shifts by a pixel but stays within the same pooling window, the max value stays the same. This makes the representation robust to small translations, which is what we want for recognition tasks.
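
A minimal sketch of 2×2 max pooling on nested lists shows both the invariance and its limit. The function name and toy feature maps are assumptions, and the input is assumed to have even height and width.

```python
def max_pool2x2(fmap):
    """Non-overlapping 2x2 max pooling with stride 2 (even-sized input assumed)."""
    return [
        [
            max(fmap[i][j], fmap[i][j + 1], fmap[i + 1][j], fmap[i + 1][j + 1])
            for j in range(0, len(fmap[0]), 2)
        ]
        for i in range(0, len(fmap), 2)
    ]


original = [
    [9, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 5],
    [0, 0, 0, 0],
]
# Each activation moved by one pixel but stayed inside its 2x2 window.
shifted_within = [
    [0, 9, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 5],
]
# The 9 moved two pixels and crossed into the neighboring window.
shifted_across = [
    [0, 0, 9, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 5],
    [0, 0, 0, 0],
]

print(max_pool2x2(original))        # [[9, 0], [0, 5]]
print(max_pool2x2(shifted_within))  # [[9, 0], [0, 5]]
print(max_pool2x2(shifted_across))  # [[0, 9], [0, 5]]
```

The invariance is only local: once a feature crosses a window boundary, the pooled output changes.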

Pooling vs strided convolution: Modern architectures often replace pooling with strided convolutions (convolution with stride 2). Instead of applying a fixed max operation, the network learns how to downsample. This is more flexible and often performs equally well. ResNets and most modern architectures use strided convolutions for downsampling.

Average pooling takes the mean instead of the max. It is less common in intermediate layers but widely used at the end of a network: global average pooling averages the entire feature map to a single value per channel, replacing the fully-connected classifier. This eliminates a huge number of parameters and acts as a structural regularizer.
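
For comparison, here is a sketch of 2×2 average pooling and global average pooling on nested lists. The function names and the toy feature map are assumptions chosen so the means are exact.

```python
def avg_pool2x2(fmap):
    """Non-overlapping 2x2 average pooling with stride 2 (even-sized input assumed)."""
    return [
        [
            (fmap[i][j] + fmap[i][j + 1] + fmap[i + 1][j] + fmap[i + 1][j + 1]) / 4
            for j in range(0, len(fmap[0]), 2)
        ]
        for i in range(0, len(fmap), 2)
    ]


def global_avg_pool(feature_maps):
    """Collapse each channel (a 2D map) to a single mean value."""
    return [
        sum(sum(row) for row in fmap) / (len(fmap) * len(fmap[0]))
        for fmap in feature_maps
    ]


fmap = [
    [1, 3, 0, 0],
    [5, 7, 0, 4],
    [2, 2, 8, 8],
    [2, 2, 8, 8],
]

print(avg_pool2x2(fmap))        # [[4.0, 1.0], [2.0, 8.0]]
print(global_avg_pool([fmap]))  # [3.75]
```

Global average pooling has no weights of its own, which is where the parameter saving over a fully-connected classifier comes from.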

[Interactive figure: Max Pooling vs Average Pooling. Shows how 2×2 pooling reduces a feature map. Max pooling preserves the strongest activation; average pooling smooths.]

Why does max pooling provide local translation invariance?

1. If a feature shifts by a small amount within the pooling window, the max value stays the same, making the output robust to exact position
2. It removes all spatial information
3. It increases the number of parameters

Chapter 5: Convolution Variants

The basic convolution has several important variations that expand its capabilities.

Stride: Instead of sliding the kernel one pixel at a time, a stride-s convolution moves it s positions per step. Stride 2 roughly halves each output dimension. This is an efficient way to downsample without a separate pooling layer.

Padding: Without padding, each convolution layer shrinks the spatial dimensions by (kernel_size − 1). Same padding (p = (k − 1)/2 for an odd kernel size k at stride 1) preserves the spatial size. Valid padding (no padding) lets the output shrink. Same padding is standard in most architectures.

Output size formula: For input size W, kernel size k, padding p, and stride s:
Output = floor((W + 2p − k) / s) + 1
Example: W=32, k=3, p=1, s=1 → floor((32+2−3)/1)+1 = 32 (same size)
Example: W=32, k=3, p=0, s=2 → floor((32+0−3)/2)+1 = 14+1 = 15 (roughly halved, no padding)
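
The formula translates directly into a one-line function. The name `conv_output_size` is hypothetical; integer floor division plays the role of floor.

```python
def conv_output_size(width, kernel, padding=0, stride=1):
    """Output size along one dimension: floor((W + 2p - k) / s) + 1."""
    return (width + 2 * padding - kernel) // stride + 1


print(conv_output_size(32, 3, padding=1, stride=1))  # 32
print(conv_output_size(32, 3, padding=0, stride=2))  # 15
print(conv_output_size(32, 3, padding=1, stride=2))  # 16
```

The third call shows that adding padding of 1 at stride 2 gives an exact halving for this input size.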

Dilated (atrous) convolution: Inserts gaps between kernel elements. A 3×3 kernel with dilation 2 covers a 5×5 area with only 9 weights. This expands the receptive field without increasing parameters or reducing resolution. Used in semantic segmentation (DeepLab) and WaveNet.
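
The area covered by a dilated kernel follows from counting the gaps between its elements. A minimal sketch, with a hypothetical function name:

```python
def effective_kernel_size(kernel_size, dilation):
    """Side length spanned by a dilated kernel: d * (k - 1) + 1."""
    return dilation * (kernel_size - 1) + 1


print(effective_kernel_size(3, 1))  # 3 (ordinary convolution)
print(effective_kernel_size(3, 2))  # 5
print(effective_kernel_size(3, 4))  # 9
```

In every case the kernel still has only 3 × 3 = 9 weights; only the spacing between them changes.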

Depthwise separable convolution: Factorizes a standard convolution into a depthwise convolution (one filter per input channel) followed by a pointwise convolution (1×1, mixing channels). The cost relative to a standard convolution is about 1/C_out + 1/k², so with many output channels the saving approaches a factor of k² (roughly 8 to 9× for 3×3 kernels). Used in MobileNet, EfficientNet, and Xception for efficient inference.
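
The saving can be estimated by counting multiply-adds per output position. The sketch below uses hypothetical helper names and an assumed layer with 256 input and 256 output channels.

```python
def standard_cost(kernel_size, in_channels, out_channels):
    """Multiply-adds per output position for a standard convolution."""
    return kernel_size * kernel_size * in_channels * out_channels


def separable_cost(kernel_size, in_channels, out_channels):
    """Depthwise (k*k per input channel) plus pointwise (1x1 channel mixing)."""
    depthwise = kernel_size * kernel_size * in_channels
    pointwise = in_channels * out_channels
    return depthwise + pointwise


std = standard_cost(3, 256, 256)
sep = separable_cost(3, 256, 256)

print(std)                  # 589824
print(sep)                  # 67840
print(round(std / sep, 2))  # 8.69
```

With 256 output channels the ratio is already close to k² = 9; it moves further from 9 as the number of output channels shrinks.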

1×1 convolutions: A 1×1 kernel does not look at spatial neighbors at all — it mixes channels at each position. Think of it as a fully-connected layer applied independently to each pixel. Used to change the number of channels cheaply (e.g., 256 channels → 64 channels). Central to GoogLeNet/Inception and ResNet bottleneck blocks.
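
Viewed as code, a 1×1 convolution is a small matrix-vector product repeated at every pixel. This sketch assumes a channels-last layout (H × W × C_in nested lists) and a C_out × C_in weight matrix; both the layout and the names are illustrative choices.

```python
def conv1x1(fmap, weights):
    """Apply the same C_out x C_in linear map to the channel vector at each pixel."""
    return [
        [
            [sum(w * x for w, x in zip(w_row, pixel)) for w_row in weights]
            for pixel in row
        ]
        for row in fmap
    ]


# 1 x 2 image with 3 channels per pixel.
fmap = [[[1, 2, 3], [4, 5, 6]]]
# Reduce 3 channels to 2: keep channel 0, and sum all channels.
weights = [[1, 0, 0], [1, 1, 1]]

print(conv1x1(fmap, weights))  # [[[1, 6], [4, 15]]]
```

The spatial size is unchanged while the channel count drops from 3 to 2, the same mechanism used to go from 256 channels to 64.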

What does a depthwise separable convolution do differently from a standard convolution?

1. It factorizes the operation into per-channel spatial filtering (depthwise) followed by channel mixing (1×1 pointwise), reducing computation by ~k²
2. It uses larger kernels
3. It removes the bias term

Chapter 6: Classic Architectures

The history of CNNs is a story of going deeper and smarter. Each milestone architecture introduced a key innovation.

LeNet-5 (LeCun et al., 1998): The classic early CNN. Two conv layers, two pooling layers, three fully-connected layers. Designed for digit recognition (MNIST). Demonstrated that learned features could beat hand-engineered ones on this task.

AlexNet (Krizhevsky et al., 2012): The breakthrough. Won ImageNet by a landslide with 5 conv layers + 3 FC layers. Key innovations: ReLU activation, dropout, data augmentation, GPU training. Showed that deep CNNs scale to real-world images.

VGGNet (Simonyan & Zisserman, 2014): Proved that depth matters. Used only 3×3 filters stacked to 16-19 layers. Simple, uniform architecture. Showed that stacking small filters with more nonlinearities beats fewer large filters.

The depth revolution: Going from 8 layers (AlexNet) to 19 layers (VGG) to 152 layers (ResNet) dramatically improved accuracy. But naive depth fails: gradients vanish and training degrades. ResNet's skip connections addressed this, making networks with well over a hundred layers trainable.

GoogLeNet / Inception (Szegedy et al., 2015): Instead of choosing one filter size, use multiple (1×1, 3×3, 5×5) in parallel and concatenate. 1×1 convolutions reduce the channel count before expensive 3×3 and 5×5 operations. 22 layers but fewer parameters than AlexNet.

ResNet (He et al., 2015): The most influential CNN architecture. Added skip connections (residual connections) that let the gradient flow directly through the network: y = F(x) + x. Instead of learning the mapping H(x), the network learns the residual F(x) = H(x) − x. This makes it easier to learn identity-like mappings, enabling 152+ layers without degradation.
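
The relation y = F(x) + x can be sketched on plain lists. The residual functions below are stand-ins for the learned conv layers of a real block, and all names are hypothetical.

```python
def residual_block(x, residual_fn):
    """Compute y = F(x) + x elementwise, where F is `residual_fn`."""
    fx = residual_fn(x)
    return [f + xi for f, xi in zip(fx, x)]


def zero_residual(x):
    """Stand-in for a block whose learned residual is zero."""
    return [0.0 for _ in x]


def small_residual(x):
    """Stand-in for a block that learned a small correction."""
    return [0.5 * xi for xi in x]


x = [1.0, -2.0, 4.0]

print(residual_block(x, zero_residual))   # [1.0, -2.0, 4.0]
print(residual_block(x, small_residual))  # [1.5, -3.0, 6.0]
```

When the residual is zero the block passes its input through unchanged, which is why an identity-like mapping is easy for such a block to represent.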

What problem do ResNet's skip connections solve?

1. They allow gradients to flow directly through the network, preventing the degradation problem where deeper networks have higher training error than shallower ones
2. They reduce the number of parameters
3. They eliminate the need for pooling

Chapter 7: Modern Advances

CNNs continue to evolve beyond the textbook coverage. Here are the key developments since.

Batch normalization (Ioffe & Szegedy, 2015) normalizes activations within each mini-batch, enabling higher learning rates and faster convergence. Now standard in virtually all CNN architectures. (Covered in detail in Chapters 7 and 8.)

DenseNet (Huang et al., 2017) connects each layer to every later layer within a dense block: each layer receives the concatenated feature maps of all preceding layers. This maximizes feature reuse and gradient flow, but increases memory usage.

EfficientNet (Tan & Le, 2019) scales width, depth, and resolution together using a single compound scaling coefficient. Instead of ad hoc scaling, it searches for a good balance for a given compute budget. Its building blocks rely on depthwise separable convolutions.

CNNs vs Transformers: Vision Transformers (ViT) challenged CNNs' dominance starting in 2020. ViT splits images into patches and processes them with self-attention (no convolutions). With enough data, ViT matches or exceeds CNNs. Modern architectures like ConvNeXt (2022) show that CNNs can match ViT performance when modernized with transformer-era training recipes. The debate continues, but both paradigms remain relevant.

Transfer learning is the practical superpower of CNNs. Train on ImageNet (about 1.3M images, 1000 classes), then fine-tune on your specific task with far less data. The early layers learn general features (edges, textures) that transfer across nearly all visual tasks. This is why you rarely train a CNN from scratch: the usual starting point is a pretrained backbone.

Why does transfer learning work so well for CNNs?

1. Early convolutional layers learn universal visual features (edges, textures, patterns) that transfer across tasks; only the later task-specific layers need retraining
2. Pretrained models have fewer parameters
3. Transfer learning uses a different loss function

Chapter 8: Convolution Playground

Apply different kernels to a sample image and see the resulting feature maps. Try edge detection, sharpening, and blurring.

[Interactive figure: Interactive Convolution. Select a kernel and watch it slide across the input. The output feature map lights up wherever the pattern matches.]

Experiments: (1) Try the horizontal edge kernel — see how it activates on horizontal boundaries. (2) Switch to vertical edge — different boundaries light up. (3) Apply the blur kernel — the output is a smoothed version. (4) Try sharpen — edges become more pronounced. These are the building blocks that CNN layers learn automatically.
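
The experiments can also be approximated offline. The sketch below applies four commonly used 3×3 kernels to a toy image with a single vertical boundary. The kernel values and names are assumptions and may differ from those in the interactive widget.

```python
def cross_correlate2d(image, kernel):
    """Valid 2D cross-correlation (no padding, stride 1) on nested lists."""
    kh, kw = len(kernel), len(kernel[0])
    return [
        [
            sum(image[i + m][j + n] * kernel[m][n]
                for m in range(kh) for n in range(kw))
            for j in range(len(image[0]) - kw + 1)
        ]
        for i in range(len(image) - kh + 1)
    ]


# Illustrative 3x3 kernels (assumed, not taken from the widget).
KERNELS = {
    "horizontal_edge": [[-1, -1, -1], [0, 0, 0], [1, 1, 1]],
    "vertical_edge": [[-1, 0, 1], [-1, 0, 1], [-1, 0, 1]],
    "sharpen": [[0, -1, 0], [-1, 5, -1], [0, -1, 0]],
    "blur": [[1 / 9] * 3 for _ in range(3)],
}

# Toy image with one vertical boundary: dark left half, bright right half.
image = [[0, 0, 1, 1] for _ in range(4)]

results = {name: cross_correlate2d(image, k) for name, k in KERNELS.items()}
for name, fmap in results.items():
    print(name, fmap)
```

On this image the horizontal edge kernel gives all zeros, the vertical edge kernel responds with 3 at every position, the sharpen kernel overshoots on both sides of the boundary (−1 on the dark side, 2 on the bright side), and the blur kernel gives intermediate values of about 0.33 and 0.67.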

Why do CNN layers learn edge-detecting kernels in the first layer?

1. Edges are the most fundamental visual features; they are the building blocks from which all higher-level patterns (corners, textures, objects) are composed
2. Edge kernels are hard-coded into the architecture
3. The learning rate is set to prefer edges

Chapter 9: Connections

CNNs are the backbone of modern computer vision and have influenced architectures across all domains:

Concept	Where It Appears
Convolution	Image classification, object detection, segmentation, video understanding. Also 1D convolutions in audio (WaveNet) and text (TextCNN).
Skip connections	ResNet → Transformers (every transformer block has a residual connection). Also U-Net for segmentation, DenseNet.
Pooling / downsampling	Feature pyramids (FPN) for multi-scale detection. Global average pooling replaces FC layers in modern nets.
Depthwise separable	MobileNet, EfficientNet for mobile/edge deployment. Foundation of efficient neural architectures.
1×1 convolutions	Bottleneck blocks (ResNet), channel attention (SE-Net), feature compression everywhere.
Transfer learning	The default paradigm for all of applied deep learning. Pretrain on large data, fine-tune on your task.
Batch normalization	Standard in CNNs (Ch 7-8). LayerNorm variant used in transformers and RNNs (Ch 10).

What you should take away: CNNs exploit spatial structure through sparse connectivity, parameter sharing, and pooling. Modern CNNs stack small (3×3) filters with residual connections to depths of a hundred layers or more. Transfer learning from pretrained backbones is the default approach for most visual tasks.

Up next: Chapter 10: Recurrence & Sequence Modeling — networks that process sequential data by maintaining hidden state over time.

What is the most important practical innovation for applying CNNs to real-world tasks?

1. Transfer learning: pretrain on a large dataset (ImageNet), then fine-tune on the target task with less data
2. Using the largest possible batch size
3. Always training from random initialization