The convolution operation, sparse connectivity, parameter sharing, pooling, and the architectures that revolutionized computer vision.
Imagine trying to recognize a cat in a 256×256 image using a fully-connected network. The input has 256 × 256 × 3 = 196,608 values (65,536 pixels with three color channels each). If the first hidden layer has 1000 neurons, that is roughly 197 million weights in the first layer alone. This is wildly impractical and highly prone to overfitting.
The key insight: images have spatial structure. A pixel is strongly related to its neighbors, weakly related to distant pixels. A cat's ear looks the same whether it appears in the top-left or bottom-right. We can exploit these two properties — locality and translation invariance — to build dramatically more efficient networks.
Exploiting these properties reduces the number of parameters by orders of magnitude. A 3×3 convolution filter over a single-channel input has just 9 weights, regardless of the input size. The same 9 weights are reused across every position in the image, detecting the same local pattern everywhere.
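The arithmetic above can be checked with a few lines of Python. This is only a sketch: the helper names are made up for illustration, biases are ignored, and the choice of 64 filters over 3 input channels is an assumed example configuration.

```python
def dense_weights(height, width, channels, hidden_units):
    """Weights in a fully-connected layer over a flattened image (biases ignored)."""
    return height * width * channels * hidden_units


def conv_weights(kernel_size, in_channels, out_channels):
    """Weights in a conv layer (biases ignored); independent of image height and width."""
    return kernel_size * kernel_size * in_channels * out_channels


fc = dense_weights(256, 256, 3, 1000)   # fully-connected first layer from the example
conv = conv_weights(3, 3, 64)           # assumed: 64 filters of size 3x3 over RGB input

print(fc)    # 196608000
print(conv)  # 1728
```

Even with 64 filters, the convolutional layer needs about five orders of magnitude fewer weights, and that count stays the same if the image grows.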
A convolution slides a small filter (kernel) across the input and computes a dot product at each position. For a 2D input I and a 2D kernel K, the output S at position (i, j) is:

$$S(i, j) = \sum_m \sum_n I(i + m,\, j + n)\, K(m, n)$$
Technically, neural networks use cross-correlation (no kernel flip), but everyone calls it "convolution." The kernel slides over the input, computing a weighted sum at each position. The output is a feature map that activates wherever the input matches the pattern encoded by the kernel.
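The sliding dot product can be sketched directly in NumPy. This is a deliberately slow, loop-based illustration assuming no padding and stride 1; the function name, the toy image, and the edge kernel are all chosen here for the example.

```python
import numpy as np


def cross_correlate2d(image, kernel):
    """Valid (no padding), stride-1 cross-correlation of a 2D image with a 2D kernel."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i:i + kh, j:j + kw]
            out[i, j] = np.sum(patch * kernel)  # dot product of kernel and patch
    return out


# A 7x7 image with a vertical edge: columns 0-3 are dark (0), columns 4-6 are bright (1).
image = np.zeros((7, 7))
image[:, 4:] = 1.0

# A simple kernel that responds to dark-to-bright transitions from left to right.
kernel = np.array([[-1.0, 0.0, 1.0],
                   [-1.0, 0.0, 1.0],
                   [-1.0, 0.0, 1.0]])

feature_map = cross_correlate2d(image, kernel)
print(feature_map.shape)  # (5, 5)
print(feature_map[0])     # [0. 0. 3. 3. 0.]
```

The feature map is zero over flat regions and peaks where the kernel straddles the edge, which is the "activates wherever the input matches the pattern" behavior described above.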
In practice, a convolutional layer has multiple filters, each producing its own feature map. The first layer might have 64 filters detecting 64 different low-level patterns. The input to the next layer is a stack of these 64 feature maps, and the next layer's filters operate on all of them simultaneously.
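A multi-filter layer can be sketched the same way. The snippet below assumes a channels-first layout, no padding, and stride 1; the 32×32 input size and the 128 filters in the second layer are illustrative choices, not values from the text.

```python
import numpy as np


def conv_layer(x, filters):
    """x: (c_in, h, w); filters: (c_out, c_in, k, k). Returns (c_out, h-k+1, w-k+1)."""
    c_out, c_in, k, _ = filters.shape
    assert x.shape[0] == c_in, "filter depth must match the number of input channels"
    out_h = x.shape[1] - k + 1
    out_w = x.shape[2] - k + 1
    out = np.zeros((c_out, out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = x[:, i:i + k, j:j + k]  # (c_in, k, k)
            # Each filter sums over all input channels and the k x k window.
            out[:, i, j] = np.tensordot(filters, patch, axes=([1, 2, 3], [0, 1, 2]))
    return out


rng = np.random.default_rng(0)
rgb = rng.standard_normal((3, 32, 32))         # assumed toy RGB input
layer1 = rng.standard_normal((64, 3, 3, 3))    # 64 filters, each spanning 3 channels
layer2 = rng.standard_normal((128, 64, 3, 3))  # each filter spans all 64 feature maps

maps1 = conv_layer(rgb, layer1)
maps2 = conv_layer(maps1, layer2)
print(maps1.shape)  # (64, 30, 30)
print(maps2.shape)  # (128, 28, 28)
```

Note that every second-layer filter has depth 64: it reads all of the first layer's feature maps at once, which is what "operate on all of them simultaneously" means in practice.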
Interactive demo: a 3×3 kernel slides across a 7×7 input. The output value at each position is the dot product of the kernel with the local patch.
In a fully-connected layer, every output is connected to every input. In a convolutional layer, each output depends on only a small receptive field of the input — the kernel size. A 3×3 kernel means each output sees only 9 input values.
But deeper layers see more. If layer 1 has a 3×3 kernel (receptive field = 3), and layer 2 also has a 3×3 kernel, then each layer-2 output effectively "sees" a 5×5 region of the original input. After k stride-1 layers with 3×3 kernels, the effective receptive field is (2k + 1) × (2k + 1).
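The receptive-field growth can be computed with a standard recurrence: each layer adds (kernel_size − 1) times the product of all earlier strides. The sketch below uses a hypothetical helper name and ignores dilation.

```python
def receptive_field(layers):
    """layers: list of (kernel_size, stride) tuples, ordered from input to output.

    Returns the side length of the input region seen by one output unit.
    """
    rf, jump = 1, 1  # jump = spacing, in input pixels, between adjacent units
    for kernel_size, stride in layers:
        rf += (kernel_size - 1) * jump
        jump *= stride
    return rf


for k in range(1, 5):
    print(k, receptive_field([(3, 1)] * k))  # 3, 5, 7, 9, matching 2k + 1

# A stride-2 layer makes every later layer grow the receptive field faster.
print(receptive_field([(3, 2), (3, 1)]))     # 7
```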
The indirect connections through multiple layers create a hierarchy: early layers detect local features (edges, textures), middle layers combine them into parts (eyes, wheels), and deep layers recognize objects (faces, cars). Each layer adds abstraction.
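Interactive demo: receptive field growth. Each additional stride-1 layer of 3×3 convolutions widens the receptive field by 2 pixels along each spatial dimension (one pixel on each side). The highlighted region shows what one output neuron can "see."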
The defining feature of convolution: the same kernel weights are used at every spatial position. A 3×3 kernel has 9 weights. Whether it is applied at position (0,0) or (100,100), the same 9 weights compute the output. This is a radical form of regularization.
Parameter sharing encodes the prior belief that if a feature is useful to detect at one location, it is useful at all locations. A vertical edge detector is equally valuable in every part of the image. This is translation equivariance: if the input shifts by (dx, dy), the feature map shifts by (dx, dy).
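Translation equivariance can be checked numerically: convolving and then shifting should give the same result as shifting and then convolving. The sketch below is one way to test this, assuming a small pattern placed away from the border so that the wrap-around of `np.roll` only moves zeros; the helper and the sizes are illustrative.

```python
import numpy as np


def cross_correlate2d(image, kernel):
    """Valid (no padding), stride-1 cross-correlation."""
    kh, kw = kernel.shape
    out = np.zeros((image.shape[0] - kh + 1, image.shape[1] - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out


rng = np.random.default_rng(0)
kernel = rng.standard_normal((3, 3))

image = np.zeros((10, 10))
image[2:4, 2:4] = rng.standard_normal((2, 2))  # small pattern, away from the border

dy, dx = 3, 2
shifted_image = np.roll(image, shift=(dy, dx), axis=(0, 1))

conv_then_shift = np.roll(cross_correlate2d(image, kernel), shift=(dy, dx), axis=(0, 1))
shift_then_conv = cross_correlate2d(shifted_image, kernel)

print(np.allclose(conv_then_shift, shift_then_conv))  # True
```

Near the image border the equality holds only approximately, because padding or cropping breaks the symmetry there.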
Not all problems benefit from full translation equivariance. Face detection, for instance, might want different features at different positions (eyes appear in the upper third, mouth in the lower third). In these cases, locally connected layers use different filters at each position but still enforce sparse connectivity. They are rarely used in practice because the efficiency gains of sharing are too large to give up.
Pooling reduces the spatial dimensions of a feature map by summarizing local regions. The most common variant, max pooling, takes the maximum value in each non-overlapping window (typically 2×2 with stride 2), halving the spatial dimensions.
Pooling provides local translation invariance. If a feature shifts by a pixel but stays within the same pooling window, the max value is unchanged. This makes the representation robust to small translations, which is exactly what we want for recognition tasks.
Average pooling takes the mean instead of the max. It is less common in intermediate layers but widely used at the end of a network: global average pooling averages the entire feature map to a single value per channel, replacing the fully-connected classifier. This eliminates a huge number of parameters and acts as a structural regularizer.
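Max, average, and global average pooling can be sketched with a reshape trick. This is an illustrative version that assumes even height and width and non-overlapping 2×2 windows; the function name and the toy values are made up for the example.

```python
import numpy as np


def pool2x2(feature_map, mode="max"):
    """Non-overlapping 2x2 pooling with stride 2; assumes even height and width."""
    h, w = feature_map.shape
    # blocks[a, b, c, d] == feature_map[2a + b, 2c + d]
    blocks = feature_map.reshape(h // 2, 2, w // 2, 2)
    if mode == "max":
        return blocks.max(axis=(1, 3))
    return blocks.mean(axis=(1, 3))


fm = np.array([[1.0, 3.0, 2.0, 0.0],
               [4.0, 2.0, 1.0, 1.0],
               [0.0, 0.0, 5.0, 6.0],
               [1.0, 3.0, 7.0, 8.0]])

print(pool2x2(fm, "max"))   # [[4. 2.] [3. 8.]]
print(pool2x2(fm, "mean"))  # [[2.5 1. ] [1.  6.5]]

# Global average pooling: one value per channel, whatever the spatial size.
stack = np.stack([fm, 2 * fm])    # two channels, shape (2, 4, 4)
print(stack.mean(axis=(1, 2)))    # [2.75 5.5 ]
```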
Interactive demo: 2×2 pooling applied to a feature map. Max pooling preserves the strongest activation in each window; average pooling smooths.
The basic convolution has several important variations that expand its capabilities.
Stride: Instead of sliding the kernel one pixel at a time, a stride-s convolution moves it s positions per step. Stride 2 roughly halves each spatial dimension of the output. This is an efficient way to downsample without a separate pooling layer.
Padding: Without padding, each stride-1 convolution layer shrinks each spatial dimension by (kernel_size − 1). Same padding (padding = (k − 1)/2 on each side for odd k) preserves the spatial size. Valid padding (no padding) lets the output shrink. Same padding is standard in most architectures.
Dilated (atrous) convolution: Inserts gaps between kernel elements. A 3×3 kernel with dilation 2 covers a 5×5 area with only 9 weights. This expands the receptive field without increasing parameters or reducing resolution. Used in semantic segmentation (DeepLab) and WaveNet.
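Stride, padding, and dilation all feed into one output-size formula, which a short helper makes concrete. This sketch follows the usual convention of symmetric padding and floor division; the function name and the 32-pixel input are illustrative.

```python
def conv_output_size(n, kernel_size, stride=1, padding=0, dilation=1):
    """Output length along one spatial dimension for an input of length n."""
    # A dilated kernel spans dilation * (kernel_size - 1) + 1 input positions.
    effective_kernel = dilation * (kernel_size - 1) + 1
    return (n + 2 * padding - effective_kernel) // stride + 1


print(conv_output_size(32, 3))                         # 30: valid padding shrinks by k - 1
print(conv_output_size(32, 3, padding=1))              # 32: same padding for k = 3
print(conv_output_size(32, 3, stride=2, padding=1))    # 16: stride 2 halves the size
print(conv_output_size(32, 3, padding=2, dilation=2))  # 32: dilation 2 spans 5 positions
```

The last call shows why a dilation-2 3×3 kernel needs padding 2 to preserve the size: its effective extent is 5, the same as a dense 5×5 kernel.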
Depthwise separable convolution: Factorizes a standard convolution into a depthwise convolution (one filter per input channel) followed by a pointwise convolution (1×1, mixing channels). This reduces computation by a factor of roughly k² when the number of output channels is much larger than k² (about 8 to 9× for 3×3 kernels). Used in MobileNet, EfficientNet, and Xception for efficient inference.
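The savings from the factorization are easy to verify by counting weights; the multiply-add count per output position scales the same way. The channel counts below (128 in, 256 out) are assumed for illustration, and biases are ignored.

```python
def standard_conv_weights(k, c_in, c_out):
    """Weights in a standard k x k convolution (biases ignored)."""
    return k * k * c_in * c_out


def separable_conv_weights(k, c_in, c_out):
    """Weights in a depthwise separable convolution (biases ignored)."""
    depthwise = k * k * c_in   # one k x k filter per input channel
    pointwise = c_in * c_out   # 1x1 convolution that mixes channels
    return depthwise + pointwise


k, c_in, c_out = 3, 128, 256   # assumed example layer
std = standard_conv_weights(k, c_in, c_out)
sep = separable_conv_weights(k, c_in, c_out)

print(std)                  # 294912
print(sep)                  # 33920
print(round(std / sep, 2))  # 8.69
```

The ratio works out to 1 / (1/c_out + 1/k²), so it approaches k² = 9 as the number of output channels grows.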
1×1 convolutions: A 1×1 kernel does not look at spatial neighbors at all — it mixes channels at each position. Think of it as a fully-connected layer applied independently to each pixel. Used to change the number of channels cheaply (e.g., 256 channels → 64 channels). Central to GoogLeNet/Inception and ResNet bottleneck blocks.
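The claim that a 1×1 convolution is a fully-connected layer applied per pixel can be checked numerically. The sketch below assumes a channels-first layout and an 8×8 spatial size picked for the example; the 256 → 64 channel reduction mirrors the one mentioned above.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((256, 8, 8))  # (channels, height, width)
w = rng.standard_normal((64, 256))    # 1x1 kernel weights: (c_out, c_in)

# 1x1 convolution: mix channels independently at each spatial position.
y = np.einsum("oc,chw->ohw", w, x)

# The same result from a dense layer applied to each pixel's channel vector.
pixels = x.reshape(256, -1).T         # (64 pixels, 256 channels)
y_dense = (pixels @ w.T).T.reshape(64, 8, 8)

print(y.shape)                  # (64, 8, 8)
print(np.allclose(y, y_dense))  # True
```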
The history of CNNs is a story of going deeper and smarter. Each milestone architecture introduced a key innovation.
LeNet-5 (LeCun et al., 1998): The first widely known CNN. Two conv layers, two pooling layers, three fully-connected layers. Designed for digit recognition (MNIST). Demonstrated that learned features beat hand-engineered ones.
AlexNet (Krizhevsky et al., 2012): The breakthrough. Won ImageNet by a landslide with 5 conv layers + 3 FC layers. Key innovations: ReLU activation, dropout, data augmentation, GPU training. Showed that deep CNNs scale to real-world images.
VGGNet (Simonyan & Zisserman, 2014): Proved that depth matters. Used only 3×3 filters stacked to 16-19 layers. Simple, uniform architecture. Showed that stacking small filters with more nonlinearities beats fewer large filters.
GoogLeNet / Inception (Szegedy et al., 2015): Instead of choosing one filter size, use multiple (1×1, 3×3, 5×5) in parallel and concatenate. 1×1 convolutions reduce the channel count before expensive 3×3 and 5×5 operations. 22 layers but fewer parameters than AlexNet.
ResNet (He et al., 2015): Arguably the most influential CNN architecture. Added skip connections (residual connections) that let the gradient flow directly through the network: y = F(x) + x. Instead of learning the mapping H(x), the network learns the residual F(x) = H(x) − x. This makes it easier to learn identity-like mappings, enabling 152+ layers without degradation.
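The residual formulation y = F(x) + x can be sketched in a few lines. This is a simplified stand-in, not the ResNet block itself: dense layers replace the convolutions, and batch normalization and the final activation are omitted.

```python
import numpy as np


def relu(z):
    return np.maximum(z, 0.0)


def residual_block(x, w1, w2):
    """y = F(x) + x, where F is a two-layer transform (dense layers stand in for convs)."""
    f = relu(x @ w1) @ w2
    return f + x  # skip connection adds the input back


rng = np.random.default_rng(0)
x = rng.standard_normal((4, 16))  # batch of 4 feature vectors

# With the residual branch zeroed out, the block reduces to the identity mapping.
zeros = np.zeros((16, 16))
print(np.allclose(residual_block(x, zeros, zeros), x))  # True

# With small weights, the output stays close to the input.
w1 = 0.1 * rng.standard_normal((16, 16))
w2 = 0.1 * rng.standard_normal((16, 16))
y = residual_block(x, w1, w2)
print(y.shape)  # (4, 16)
```

The zero-weight case shows why identity-like mappings are easy to learn: the block only has to push F toward zero rather than fit the identity through a stack of nonlinear layers.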
CNNs continue to evolve beyond the textbook coverage. Here are the key developments since.
Batch normalization (Ioffe & Szegedy, 2015) normalizes activations within each mini-batch, enabling higher learning rates and faster convergence. Now standard in virtually all CNN architectures. (Covered in detail in Chapters 7 and 8.)
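For convolutional feature maps, the normalization is typically computed per channel over the batch and spatial axes. The sketch below shows only the training-time forward pass under that assumption; the running statistics used at inference are left out, and the function name is illustrative.

```python
import numpy as np


def batch_norm(x, gamma, beta, eps=1e-5):
    """x: (batch, channels, h, w). Normalizes each channel over batch and spatial axes."""
    mean = x.mean(axis=(0, 2, 3), keepdims=True)
    var = x.var(axis=(0, 2, 3), keepdims=True)
    x_hat = (x - mean) / np.sqrt(var + eps)
    # Learnable per-channel scale (gamma) and shift (beta).
    return gamma.reshape(1, -1, 1, 1) * x_hat + beta.reshape(1, -1, 1, 1)


rng = np.random.default_rng(0)
x = 5.0 + 3.0 * rng.standard_normal((8, 4, 6, 6))  # activations with mean 5, std 3
y = batch_norm(x, gamma=np.ones(4), beta=np.zeros(4))

print(np.allclose(y.mean(axis=(0, 2, 3)), 0.0, atol=1e-6))  # True
print(np.allclose(y.std(axis=(0, 2, 3)), 1.0, atol=1e-3))   # True
```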
DenseNet (Huang et al., 2017) connects each layer to every later layer within a dense block: each layer receives the concatenated feature maps of all preceding layers. This maximizes feature reuse and gradient flow, but increases memory usage.
EfficientNet (Tan & Le, 2019) systematically scales width, depth, and resolution together using a compound scaling coefficient. Instead of ad hoc scaling, it finds the optimal balance for a given compute budget. Uses depthwise separable convolutions throughout.
Transfer learning is the practical superpower of CNNs. Train on ImageNet (about 1.3M training images, 1000 classes), then fine-tune on your specific task with far less data. The early layers learn universal features (edges, textures) that transfer across nearly all visual tasks. This is why you rarely need to train a CNN from scratch: when a suitable pretrained backbone exists, start from it.
Interactive demo: apply edge-detection, sharpening, and blurring kernels to a sample image and compare the resulting feature maps. As the selected kernel slides across the input, the output feature map lights up wherever the pattern matches.
CNNs are the backbone of modern computer vision and have influenced architectures across all domains:
| Concept | Where It Appears |
|---|---|
| Convolution | Image classification, object detection, segmentation, video understanding. Also 1D convolutions in audio (WaveNet) and text (TextCNN). |
| Skip connections | ResNet → Transformers (every transformer block has a residual connection). Also U-Net for segmentation, DenseNet. |
| Pooling / downsampling | Feature pyramids (FPN) for multi-scale detection. Global average pooling replaces FC layers in modern nets. |
| Depthwise separable | MobileNet, EfficientNet for mobile/edge deployment. Foundation of efficient neural architectures. |
| 1×1 convolutions | Bottleneck blocks (ResNet), channel attention (SE-Net), feature compression everywhere. |
| Transfer learning | The default paradigm for all of applied deep learning. Pretrain on large data, fine-tune on your task. |
| Batch normalization | Standard in CNNs (Ch 7-8). LayerNorm variant used in transformers and RNNs (Ch 10). |
Up next: Chapter 10: Recurrence & Sequence Modeling — networks that process sequential data by maintaining hidden state over time.