Fully-connected layers destroy the spatial structure of images. Convolutions respect it — and that single change launched the deep learning revolution in computer vision.
You have a 32×32 color image — a photograph of a cat, say. You want to classify it. The approach from Lecture 4 was simple: stretch the image into a long vector of 32 × 32 × 3 = 3,072 numbers, then multiply by a weight matrix. That's a fully-connected layer.
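It works, technically. But it throws away something precious: the spatial structure of the image. Pixel (0, 0) still sits next to pixel (0, 1) in the vector, but its vertical neighbor, pixel (1, 0), lands a full row away. More importantly, the weight matrix treats every position in the vector the same way: as far as the layer is concerned, pixel (0, 0) is no more closely related to pixel (0, 1) than to pixel (31, 31) in the opposite corner. The network has no built-in notion that nearby pixels are related.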
A fully-connected layer connecting a 32×32×3 image to 100 hidden units needs 3,072 × 100 = 307,200 parameters — just for one layer. For a 224×224 image (standard ImageNet size), that explodes to 150,528 × 100 = 15 million parameters. Most of these parameters are wasted learning that distant pixels rarely interact.
There's a deeper problem too. A fully-connected layer learns one giant template per output unit: a single 32×32×3 pattern that it matches against the whole image. If the cat moves from the left side to the right side, a new template is needed. The network can't reuse what it learned about "cat ears" in one position for another position.
An operation is translation equivariant if shifting the input shifts the output by the same amount. If you move a cat 10 pixels right, the "cat detector" output should also shift 10 pixels right. Fully-connected layers don't have this property. Convolutions do.
Three properties would make image processing dramatically more efficient:
1. Local connectivity. Each output unit should only look at a small patch of the input (say 5×5), not the entire image. Nearby pixels matter most.
2. Weight sharing. The same small filter should be applied at every position. If it detects a vertical edge at position (3, 7), it should detect the same edge at position (20, 15).
3. Spatial preservation. The output should still be a 2D grid, preserving the spatial relationships of the input.
A 5×5 filter on a 3-channel image has 5 × 5 × 3 + 1 = 76 parameters. It slides across the entire 32×32 image, producing a 28×28 output map. Compare a fully-connected layer connecting 3,072 inputs to one output: 3,073 parameters, about 40× more, for a single output value. And because the conv filter is reused at every position, an FC layer producing the same 28 × 28 = 784 outputs would need 3,073 × 784 ≈ 2.4M parameters.
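The arithmetic in this comparison is easy to check by hand, but a short sketch makes the bookkeeping explicit. The helper names `fc_params` and `conv_filter_params` are illustrative, not from any library.

```python
def fc_params(n_in, n_out):
    """Fully-connected layer: one weight per (input, output) pair plus one bias per output."""
    return n_in * n_out + n_out


def conv_filter_params(c_in, k):
    """One k x k filter spanning c_in input channels, plus a single bias."""
    return c_in * k * k + 1


n_in = 32 * 32 * 3                       # flattened 32x32 RGB image: 3,072 values
conv = conv_filter_params(c_in=3, k=5)   # one 5x5 filter
fc_one = fc_params(n_in, 1)              # FC layer producing a single output
positions = (32 - 5 + 1) ** 2            # 28 x 28 filter placements
fc_map = fc_one * positions              # FC layer producing all 784 outputs

print(conv, fc_one, round(fc_one / conv, 1), fc_map)
# 76 3073 40.4 2409232
```

The filter's parameter count does not depend on the image size at all; only the FC count grows with the input.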
This is the idea behind convolutional layers: small learned filters that slide across the image, detecting local patterns everywhere. One filter might learn to detect horizontal edges. Another detects blobs of color. Another detects corners. Stack enough of these, and you build a hierarchy from edges to textures to parts to whole objects.
Hubel and Wiesel (1959) discovered that neurons in the cat's visual cortex respond to small local regions of the visual field — receptive fields. Some neurons detect edges at specific orientations. Deeper neurons respond to more complex patterns. The visual cortex is organized hierarchically, just like a CNN. Fukushima's Neocognitron (1980) formalized this into the first convolutional architecture.
Here's the core operation. You have an input image (or feature map) and a small filter (also called a kernel). The filter slides across the image, one position at a time. At each position, you compute the dot product between the filter and the patch of the image it covers. That dot product becomes one pixel in the output.
Given an input I of size H × W, a filter K of size KH × KW, and a bias b, the output at position (i, j) is:

$$O(i, j) = \sum_{m=0}^{K_H - 1} \sum_{n=0}^{K_W - 1} I(i + m,\, j + n)\, K(m, n) + b$$
In words: place the filter's top-left corner at position (i, j). Multiply each filter element by the corresponding image element. Sum all the products. Add the bias. That sum is the output at (i, j).
Before learning anything, let's see convolution in action with hand-crafted filters. Consider this 5×5 grayscale image (0 = black, 255 = white):
Input (5×5): a bright region on the left, dark on the right.
[[255, 255, 0, 0, 0],
[255, 255, 0, 0, 0],
[255, 255, 0, 0, 0],
[255, 255, 0, 0, 0],
[255, 255, 0, 0, 0]]
Filter (3×3) — the Sobel vertical edge detector:
[[-1, 0, 1],
[-2, 0, 2],
[-1, 0, 1]]
Position (0,0): 255×(-1) + 255×0 + 0×1 + 255×(-2) + 255×0 + 0×2 + 255×(-1) + 255×0 + 0×1 = -1020
Position (0,1): 255×(-1) + 0×0 + 0×1 + 255×(-2) + 0×0 + 0×2 + 255×(-1) + 0×0 + 0×1 = -1020
Position (0,2): 0×(-1) + 0×0 + 0×1 + 0×(-2) + 0×0 + 0×2 + 0×(-1) + 0×0 + 0×1 = 0
Large negative values appear exactly at the vertical edge between bright and dark. The filter detected it!
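The worked example above can be reproduced in a few lines. This is a minimal pure-Python sketch of the sliding dot product (no padding, stride 1); the function name `cross_correlate2d` is an assumption made for illustration, and real frameworks use far faster implementations.

```python
def cross_correlate2d(image, kernel, bias=0):
    """Slide `kernel` over `image` (no padding, stride 1) and take a dot product at each position."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[i + m][j + n] * kernel[m][n]
                for m in range(kh) for n in range(kw)) + bias
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


# The 5x5 image from the example: bright on the left, dark on the right.
image = [[255, 255, 0, 0, 0] for _ in range(5)]
sobel_x = [[-1, 0, 1],
           [-2, 0, 2],
           [-1, 0, 1]]

response = cross_correlate2d(image, sobel_x)
for row in response:
    print(row)
# [-1020, -1020, 0]
# [-1020, -1020, 0]
# [-1020, -1020, 0]
```

Every row of the output is identical because every row of the input is identical; the sign is negative because the image goes from bright to dark as the filter moves left to right.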
The key insight: different filters detect different features. A horizontal edge detector uses a different 3×3 pattern. A corner detector uses yet another. In a CNN, the network learns what filters to use — hundreds of them — by backpropagation.
Weight sharing: the same 3×3 filter (9 parameters) is used at every position. If it detects a vertical edge at the top-left, it detects the same edge at the bottom-right. Local connectivity: each output pixel depends on only a 3×3 patch of the input, not the entire image. These two properties are why CNNs are so parameter-efficient.
Mathematically, convolution flips the filter before sliding. In deep learning, we skip the flip and use cross-correlation instead. Every framework (PyTorch, TensorFlow) calls it "convolution" but implements cross-correlation. Since the filter weights are learned, flipping doesn't matter — the network just learns the already-flipped version.
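To make the flip concrete, here is a small sketch that builds true convolution out of cross-correlation by rotating the kernel 180 degrees first. The names `flip` and `convolve2d` are illustrative helpers, not framework functions.

```python
def cross_correlate2d(image, kernel):
    """Sliding dot product, no padding, stride 1 (what deep learning calls 'convolution')."""
    kh, kw = len(kernel), len(kernel[0])
    return [
        [sum(image[i + m][j + n] * kernel[m][n]
             for m in range(kh) for n in range(kw))
         for j in range(len(image[0]) - kw + 1)]
        for i in range(len(image) - kh + 1)
    ]


def flip(kernel):
    """Rotate the kernel by 180 degrees: reverse the row order and each row."""
    return [row[::-1] for row in kernel[::-1]]


def convolve2d(image, kernel):
    """Mathematical convolution: cross-correlation with the flipped kernel."""
    return cross_correlate2d(image, flip(kernel))


image = [[255, 255, 0, 0, 0] for _ in range(5)]
sobel_x = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]

print(cross_correlate2d(image, sobel_x)[0])  # [-1020, -1020, 0]
print(convolve2d(image, sobel_x)[0])         # [1020, 1020, 0]
```

For this particular kernel the flip only changes the sign of the response. A learned filter would simply converge to whichever orientation the training signal favors, which is why frameworks skip the flip.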
So far the filter moves one pixel at a time. But we have two knobs to turn: how far the filter jumps (stride) and whether we pad the edges of the input (padding).
Stride is the number of pixels the filter moves between positions. Stride 1 means slide one pixel at a time (the default). Stride 2 means skip every other position, producing an output roughly half the size. Stride is a hyperparameter, not a learned value.
Padding is extra pixels (usually zeros) added around the border of the input before convolution. Without padding, the output is always smaller than the input. With the right amount of padding, the output can be the same size.
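This is the single most important formula in CNN design. You'll use it constantly. For input size W, filter size K, padding P on each side, and stride S, the output size is:

$$O = \left\lfloor \frac{W - K + 2P}{S} \right\rfloor + 1$$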
Let's see where this comes from. Without padding or stride, a filter of size K on an input of size W can be placed at positions 0, 1, 2, ..., W − K. That's W − K + 1 positions. Adding P zeros on each side makes the effective input size W + 2P, so we get (W + 2P) − K + 1 = W − K + 2P + 1 positions. With stride S, we only use every S-th position, giving ⌊(W − K + 2P) / S⌋ + 1.
Case 1: W = 7, K = 3, P = 0, S = 1. Output = (7 − 3 + 0)/1 + 1 = 5.
Case 2: W = 7, K = 3, P = 1, S = 1. Output = (7 − 3 + 2)/1 + 1 = 7. Same size! This is "same" padding.
Case 3: W = 7, K = 3, P = 0, S = 2. Output = (7 − 3 + 0)/2 + 1 = 3. Downsampled by ~2×.
Case 4: W = 32, K = 5, P = 2, S = 1. Output = (32 − 5 + 4)/1 + 1 = 32. Same size with K=5.
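The four cases above can be checked with a one-line function. This is a sketch; `conv_output_size` is a hypothetical helper name, and the floor is implemented with integer division.

```python
def conv_output_size(w, k, p=0, s=1):
    """Output size along one dimension: floor((W - K + 2P) / S) + 1."""
    if w + 2 * p < k:
        raise ValueError("filter is larger than the padded input")
    return (w - k + 2 * p) // s + 1


cases = [
    (7, 3, 0, 1),    # Case 1: no padding, stride 1
    (7, 3, 1, 1),    # Case 2: "same" padding
    (7, 3, 0, 2),    # Case 3: stride 2
    (32, 5, 2, 1),   # Case 4: "same" padding with K = 5
]
sizes = [conv_output_size(w, k, p, s) for w, k, p, s in cases]
print(sizes)  # [5, 7, 3, 32]
```

Height and width are handled independently, so a non-square input just means calling the function once per dimension.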
| Name | Rule | Effect |
|---|---|---|
| Valid (no padding) | P = 0 | Output shrinks by K−1 pixels per layer |
| Same padding | P = (K − 1) / 2 | Output same size as input (when S = 1) |
| Full padding | P = K − 1 | Output larger than input (rarely used in practice) |
Same padding is by far the most common. With K = 3, use P = 1. With K = 5, use P = 2. This keeps the spatial dimensions constant through the conv layer, and you control downsampling explicitly with stride or pooling.
For any odd kernel size K, setting P = (K − 1)/2 with stride 1 gives output = input size. This is why almost all modern CNNs use odd-sized filters (3×3, 5×5, 7×7). Even-sized filters (2×2, 4×4) make "same" padding asymmetric and awkward.
If (W − K + 2P) is not evenly divisible by S, the division leaves a remainder. Most frameworks handle this by rounding down (floor), which means the filter never reaches the last few rows and columns of the padded input. Best practice: choose hyperparameters so the division is exact.
So far we've been convolving a single 2D filter over a single-channel (grayscale) input. Real images have 3 channels (RGB), and real conv layers use many filters. Let's handle both.
A color image has shape Cin × H × W (e.g., 3 × 32 × 32). The filter must match the depth of the input: it's Cin × KH × KW (e.g., 3 × 5 × 5). At each spatial position, you compute a dot product across all channels simultaneously — a 3 × 5 × 5 = 75-dimensional dot product — to produce a single number.
One filter detects one kind of feature (say, vertical edges). But we want to detect many features. So we use Cout different filters, each producing its own 2D output. Stack them, and the output is a 3D volume: Cout × H' × W'.
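As a rough illustration of how channels and multiple filters fit together, the sketch below implements a full conv layer forward pass on nested lists. It is deliberately slow and simple; the name `conv_layer` and the `[C_out][C_in][K][K]` list layout are assumptions chosen to match the shapes described here.

```python
def conv_layer(x, filters, biases, stride=1, pad=0):
    """x: [C_in][H][W], filters: [C_out][C_in][K][K] -> output: [C_out][H'][W']."""
    c_in, h, w = len(x), len(x[0]), len(x[0][0])
    k = len(filters[0][0])
    zero_row = [0] * (w + 2 * pad)
    # Zero-pad every channel by `pad` pixels on each side.
    padded = [
        [list(zero_row) for _ in range(pad)]
        + [[0] * pad + list(row) + [0] * pad for row in channel]
        + [list(zero_row) for _ in range(pad)]
        for channel in x
    ]
    out_h = (h - k + 2 * pad) // stride + 1
    out_w = (w - k + 2 * pad) // stride + 1
    output = []
    for f, b in zip(filters, biases):  # one feature map per filter
        fmap = [
            [
                # Dot product across all channels and the K x K window.
                b + sum(
                    padded[c][i * stride + m][j * stride + n] * f[c][m][n]
                    for c in range(c_in) for m in range(k) for n in range(k)
                )
                for j in range(out_w)
            ]
            for i in range(out_h)
        ]
        output.append(fmap)
    return output


# Toy input: 3 channels of 4x4 ones; two 3x3 filters (all +1 and all -1).
x = [[[1] * 4 for _ in range(4)] for _ in range(3)]
ones = [[[1] * 3 for _ in range(3)] for _ in range(3)]
neg = [[[-1] * 3 for _ in range(3)] for _ in range(3)]
out = conv_layer(x, [ones, neg], biases=[0, 10], stride=1, pad=1)

print(len(out), len(out[0]), len(out[0][0]))      # 2 4 4
print(out[0][1][1], out[0][0][0], out[1][1][1])   # 27 12 -17
```

The interior value 27 is the full 3 × 3 × 3 dot product; the corner value 12 is smaller because zero padding replaces part of the window.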
Input: 3 × 32 × 32 (RGB image)
Layer: 10 filters, each 5×5, stride 1, padding 2
Each filter shape: 3 × 5 × 5 = 75 weights + 1 bias = 76 parameters
Total parameters: 10 × 76 = 760
Output size: (32 − 5 + 4)/1 + 1 = 32. So output is 10 × 32 × 32.
Compare: An FC layer from 3,072 to 10,240 (same output size) would need 31+ million parameters. Conv uses 760. That's a 41,000× reduction.
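Each output pixel requires one dot product of size Cin × K × K. The total number of multiply-add operations (the FLOPs figure used below) is:

$$\text{FLOPs} = C_{out} \times H' \times W' \times C_{in} \times K \times K$$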
A feature map is the 2D output produced by one filter. If a layer has Cout filters, it produces Cout feature maps. Each map is a spatial grid where bright regions indicate where that filter's pattern was detected. At each spatial position, you can also think of the Cout values as a Cout-dimensional feature vector describing that location.
| Quantity | Formula | Example (3×32×32, 10 filters 5×5, P=2, S=1) |
|---|---|---|
| Filter shape | Cout × Cin × K × K | 10 × 3 × 5 × 5 |
| Bias shape | Cout | 10 |
| Parameters | Cout(Cin × K × K + 1) | 760 |
| Output shape | Cout × H' × W' | 10 × 32 × 32 |
| FLOPs | Cout × H' × W' × Cin × K × K | 768,000 |
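The rows of this table can be computed from the layer's hyperparameters. The sketch below uses a hypothetical helper, `conv_layer_stats`, and follows the table in counting one multiply-add as one operation.

```python
def conv_layer_stats(c_in, h, w, c_out, k, p=0, s=1):
    """Shapes, parameter count, and multiply-add count for one conv layer."""
    out_h = (h - k + 2 * p) // s + 1
    out_w = (w - k + 2 * p) // s + 1
    return {
        "filter_shape": (c_out, c_in, k, k),
        "bias_shape": (c_out,),
        "params": c_out * (c_in * k * k + 1),
        "output_shape": (c_out, out_h, out_w),
        "multiply_adds": c_out * out_h * out_w * c_in * k * k,
    }


stats = conv_layer_stats(c_in=3, h=32, w=32, c_out=10, k=5, p=2, s=1)
for name, value in stats.items():
    print(f"{name}: {value}")

# For comparison: an FC layer from 3,072 inputs to 10 * 32 * 32 = 10,240 outputs.
fc = 3072 * 10240 + 10240
print(fc, round(fc / stats["params"]))  # 31467520 41405
```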
In practice, inputs come in batches: N × Cin × H × W. The conv layer applies the same filters to every image in the batch independently, producing N × Cout × H' × W'. The filter weights are shared across all images in the batch — another level of weight sharing.
Conv layers can downsample using stride > 1. But there's another way to reduce spatial dimensions: pooling. Unlike conv layers, pooling has no learnable parameters — it just applies a fixed operation.
A pooling layer slides a window over each feature map independently and replaces each window with a single summary value. The most common: max pooling (take the maximum value in the window). Also used: average pooling (take the mean).
Input (4×4 feature map):
[[1, 1, 2, 4],
[5, 6, 7, 8],
[3, 2, 1, 0],
[1, 2, 3, 4]]
2×2 max pool, stride 2:
Top-left 2×2: max(1,1,5,6) = 6
Top-right 2×2: max(2,4,7,8) = 8
Bottom-left 2×2: max(3,2,1,2) = 3
Bottom-right 2×2: max(1,0,3,4) = 4
Output: [[6, 8], [3, 4]] — from 4×4 down to 2×2.
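The pooling example can be sketched as a sliding window with a pluggable summary function. The names `pool2d` and `mean` are illustrative; the window arithmetic is the same output-size formula used for convolution with P = 0.

```python
def pool2d(fmap, size=2, stride=2, op=max):
    """Summarize each size x size window of a 2D feature map with `op`."""
    out_h = (len(fmap) - size) // stride + 1
    out_w = (len(fmap[0]) - size) // stride + 1
    return [
        [
            op([fmap[i * stride + m][j * stride + n]
                for m in range(size) for n in range(size)])
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


def mean(values):
    return sum(values) / len(values)


fmap = [[1, 1, 2, 4],
        [5, 6, 7, 8],
        [3, 2, 1, 0],
        [1, 2, 3, 4]]

print(pool2d(fmap))            # [[6, 8], [3, 4]]
print(pool2d(fmap, op=mean))   # [[3.25, 5.25], [2.0, 2.0]]
```

Passing `size` equal to the full map size with `op=mean` gives global average pooling on a single channel.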
Pooling serves three purposes:
1. Reduce computation. A 2×2 pool with stride 2 halves both H and W, cutting the number of activations (and thus the compute for the next layer) by 4×.
2. Translation invariance. If a feature (like an edge) shifts by one pixel, the max pool output often stays the same, because the max value is still captured as long as it stays inside the same window. This small amount of invariance helps the network focus on "what" features are present, not exactly "where" they are.
3. Increase receptive field. After pooling, each pixel in the next layer effectively "sees" a larger region of the original input.
A pooling layer operates on each feature map independently. If the input has 64 channels, the output also has 64 channels — each one downsampled separately. The number of channels never changes through pooling. Only H and W shrink.
Global average pooling is a special case where the window covers the entire feature map. If the input is C × 7 × 7, global average pooling produces a C × 1 × 1 output (equivalently, a C-dimensional vector). This is often used just before the final classification layer, replacing a large fully-connected layer.
| Pooling Type | Operation | Learnable? | Use Case |
|---|---|---|---|
| Max Pool | Maximum in window | No | Standard downsampling |
| Average Pool | Mean of window | No | Smoother downsampling |
| Global Average Pool | Mean of entire map | No | Replace FC before classifier |
| Strided Conv | Conv with S > 1 | Yes | Modern alternative to pooling |
Modern architectures increasingly replace pooling with strided convolutions (stride 2). The argument: if we're going to downsample, we might as well learn how to downsample, rather than using a fixed max/average rule. ResNets and many modern CNNs use strided conv for downsampling.
Now we have all the building blocks. A CNN stacks them in a specific pattern:
The historical notation: [(CONV-RELU)*N - POOL?]*M - (FC-RELU)*K - SOFTMAX where N is usually 1–5 conv-relu pairs between each pool, M is the number of pool stages (typically 3–5), and K is 0–2 FC layers at the end.
Something remarkable happens as you stack more layers. The first conv layer learns to detect edges and color contrasts — simple local patterns. The second layer combines those edges into textures and corners. The third detects parts (eyes, wheels, windows). Deeper layers detect entire objects or scenes.
Layer 1: Edges, color gradients. Layer 2: Corners, textures, simple shapes. Layer 3: Object parts (eyes, ears, wheels). Layer 4+: Whole objects, scenes. This hierarchy emerges automatically from training — nobody programs it. It mirrors the hierarchical organization of the primate visual cortex.
Input: 3 × 32 × 32
Conv1: 6 filters 5×5, P=0, S=1. Output: 6 × 28 × 28. Params: 6 × (3×5×5+1) = 456.
Pool1: 2×2 max, S=2. Output: 6 × 14 × 14. Params: 0.
Conv2: 16 filters 5×5, P=0, S=1. Output: 16 × 10 × 10. Params: 16 × (6×5×5+1) = 2,416.
Pool2: 2×2 max, S=2. Output: 16 × 5 × 5 = 400 values.
Flatten: 400-dim vector.
FC1: 400 → 120. Params: 400×120 + 120 = 48,120.
FC2: 120 → 84. Params: 120×84 + 84 = 10,164.
FC3: 84 → 10 (classes). Params: 84×10 + 10 = 850.
Total: 62,006 parameters. (This is essentially LeNet-5, circa 1998.)
In the LeNet example above, the conv layers have 456 + 2,416 = 2,872 parameters (4.6% of total). The FC layers have 48,120 + 10,164 + 850 = 59,134 parameters (95.4%). FC layers dominate the parameter count. This is why modern architectures try to minimize or eliminate FC layers — using global average pooling directly before the final classifier.
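The shape and parameter trace for this network can be recomputed mechanically. The sketch below encodes the layer list as plain tuples (an ad hoc format invented for this example) and accumulates parameters by layer type.

```python
def out_size(w, k, p=0, s=1):
    """Spatial output size: floor((W - K + 2P) / S) + 1."""
    return (w - k + 2 * p) // s + 1


# (kind, ...) tuples describing the example network above.
layers = [
    ("conv", 6, 5), ("pool", 2),
    ("conv", 16, 5), ("pool", 2),
    ("fc", 120), ("fc", 84), ("fc", 10),
]

channels, size = 3, 32   # input: 3 x 32 x 32
features = None          # set when the feature maps are flattened
params = {"conv": 0, "pool": 0, "fc": 0}

for kind, *args in layers:
    if kind == "conv":
        c_out, k = args
        params["conv"] += c_out * (channels * k * k + 1)
        channels, size = c_out, out_size(size, k)
    elif kind == "pool":
        (window,) = args
        size = out_size(size, window, s=window)   # pooling has no parameters
    else:  # fully connected
        (n_out,) = args
        if features is None:
            features = channels * size * size     # flatten: 16 * 5 * 5 = 400
        params["fc"] += features * n_out + n_out
        features = n_out

total = sum(params.values())
print(params, total)
# {'conv': 2872, 'pool': 0, 'fc': 59134} 62006
print(f"conv share: {100 * params['conv'] / total:.1f}%")
# conv share: 4.6%
```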
Watch what happens to the spatial dimensions and channel count as data flows through:
| Layer | Output Shape | Spatial | Channels |
|---|---|---|---|
| Input | 3 × 32 × 32 | Large | Few (3) |
| Conv + Pool | 32 × 16 × 16 | ↓ | ↑ |
| Conv + Pool | 64 × 8 × 8 | ↓ | ↑ |
| Conv + Pool | 128 × 4 × 4 | ↓ | ↑ |
| Conv + Pool | 256 × 2 × 2 | Small | Many (256) |
The pattern is universal: spatial dimensions shrink, channel dimensions grow. Early layers have large spatial maps with few channels (capturing where). Deep layers have small spatial maps with many channels (capturing what). This is the fundamental trade-off of CNNs.
Think of a CNN as a funnel that compresses spatial information (32×32 → 2×2 in the table above, and 1×1 after global pooling) while expanding semantic information (3 channels → 256 channels). The input tells you the color of every pixel. The output tells you what object is there. The conv layers gradually trade "where" for "what."
When you convolve a 3×3 filter over the input, each output pixel depends on a 3×3 patch of the input. Now stack a second 3×3 conv. Each output pixel of the second layer depends on a 3×3 patch of the first layer's output — and each of those depends on a 3×3 patch of the input. So each second-layer output pixel effectively "sees" a 5×5 region of the original input.
The receptive field is the region of the original input that influences a particular output neuron. For the first conv layer, it's just the filter size. For deeper layers, it grows with each layer because each neuron's input depends on a region that itself depends on a region.
All 3×3 filters, stride 1, same padding:
After 1 layer: RF = 1 + 1×2 = 3×3
After 2 layers: RF = 1 + 2×2 = 5×5
After 3 layers: RF = 1 + 3×2 = 7×7
After 5 layers: RF = 1 + 5×2 = 11×11
With a 224×224 input, you'd need 1 + 2L ≥ 224, i.e. L = ⌈(224 − 1)/2⌉ = 112 layers of 3×3 conv for a single neuron to "see" the whole image. That's a lot of layers!
For image classification, the final output needs to consider the entire image. If the receptive field of the final layer neurons doesn't cover the whole input, the network is making decisions based on only a portion of the image — it might miss the cat's tail on the other side.
Three strategies: (1) Larger filters — 5×5 or 7×7 instead of 3×3 (more parameters). (2) Strided convolutions or pooling — downsampling before the next conv effectively magnifies each pixel's receptive field (the most common approach). (3) Dilated convolutions — skip pixels in the filter pattern, expanding the receptive field without extra parameters or downsampling.
When layers have different strides, the receptive field calculation becomes recursive. For layer ℓ with kernel Kℓ and stride Sℓ, starting from RF0 = 1:

$$RF_\ell = RF_{\ell-1} + (K_\ell - 1) \prod_{i=1}^{\ell-1} S_i$$
Layer 1: K=3, S=1. Layer 2: K=3, S=2. Layer 3: K=3, S=1.
RF1 = 1 + (3−1)×1 = 3
RF2 = 3 + (3−1)×1 = 5
RF3 = 5 + (3−1)×(1×2) = 5 + 4 = 9
The stride-2 layer at layer 2 made layer 3's growth count double. Stride multiplies all future RF growth.
After a stride-2 layer, each pixel in the output corresponds to a 2-pixel jump in the input. So when the next 3×3 filter covers 3 output pixels, those 3 pixels span 3×2 − 1 = 5 input pixels. The stride-2 acts like a magnifying glass: everything after it operates in a "zoomed out" coordinate system where each step covers more ground.
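The recursion is short enough to sketch directly. The function name `receptive_field` and the `(kernel, stride)` tuple format are assumptions for illustration; the running variable `jump` holds the product of the strides of all earlier layers.

```python
def receptive_field(layers):
    """layers: list of (kernel, stride) pairs. Returns the RF size after each layer."""
    rf, jump = 1, 1          # jump = product of strides of all earlier layers
    history = []
    for kernel, stride in layers:
        rf += (kernel - 1) * jump
        jump *= stride       # this layer's stride affects only later layers
        history.append(rf)
    return history


# Uniform case: five 3x3 layers with stride 1.
print(receptive_field([(3, 1)] * 5))               # [3, 5, 7, 9, 11]

# Mixed strides, as in the worked example.
print(receptive_field([(3, 1), (3, 2), (3, 1)]))   # [3, 5, 9]

# How many stride-1 3x3 layers to cover a 224-pixel-wide input?
print(receptive_field([(3, 1)] * 112)[-1])         # 225
```

Moving the stride-2 layer earlier in the stack makes every later layer's contribution double, which is why networks that downsample early reach large receptive fields with few layers.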
Standard convolution works well, but researchers have developed several important variants that trade off parameters, compute, and expressiveness in clever ways.
At first, a 1×1 filter sounds pointless — it covers just a single pixel. But remember: filters extend across all input channels. A 1×1 conv on a 64-channel input computes a 64-dimensional dot product at each spatial position. It's a per-pixel fully-connected layer across channels.
A 1×1 convolution is a convolution with K = 1. It doesn't mix spatial information; it mixes channels. If the input has Cin channels and you use Cout 1×1 filters, it's equivalent to applying a Cin → Cout linear projection independently at each spatial position. Parameters: Cout × Cin + Cout.
Input: 256 × 56 × 56 (256 channels, 56×56 spatial).
1×1 conv with 64 filters: Output = 64 × 56 × 56. Parameters: 64 × 256 + 64 = 16,448.
We reduced channels from 256 to 64 — a 4× reduction — without touching the spatial dimensions. This is a bottleneck: it compresses the channel dimension before an expensive 3×3 conv, saving massive compute.
Without the bottleneck, a 3×3 conv from 256→256 channels costs 256 × 256 × 9 = 589,824 params (weights only; biases add another 256). With a 1×1 bottleneck: 256→64 (16,448) + 3×3 64→64 (36,928) + 1×1 64→256 (16,640) = 70,016 params, biases included. An 8.4× reduction.
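The bottleneck arithmetic can be checked with a small helper. This is a sketch; `conv_params` is a hypothetical function, and the `bias` flag exists only so the direct 3×3 count can be computed weights-only, as in the text.

```python
def conv_params(c_in, c_out, k, bias=True):
    """Parameters of a standard conv layer with c_out filters of size c_in x k x k."""
    return c_out * c_in * k * k + (c_out if bias else 0)


# Direct 3x3 conv, 256 -> 256 channels (weights only).
direct = conv_params(256, 256, 3, bias=False)

# Bottleneck: squeeze with 1x1, filter with 3x3, expand with 1x1 (biases included).
squeeze = conv_params(256, 64, 1)
spatial = conv_params(64, 64, 3)
expand = conv_params(64, 256, 1)
bottleneck = squeeze + spatial + expand

print(direct)                          # 589824
print(squeeze, spatial, expand)        # 16448 36928 16640
print(bottleneck)                      # 70016
print(round(direct / bottleneck, 1))   # 8.4
```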
Standard convolution: every filter sees all Cin input channels. Grouped convolution splits the input channels into G groups, and each filter only sees Cin/G channels from its group.
Standard conv: Cin=64, Cout=64, K=3. Params: 64 × 64 × 3 × 3 = 36,864.
Grouped conv (G=4): 4 groups, each: 16 input channels, 16 output channels. Params per group: 16 × 16 × 3 × 3 = 2,304. Total: 4 × 2,304 = 9,216. A 4× reduction.
Fun fact: AlexNet (2012) used G=2 because the model had to be split across two GPUs, each handling half the channels. An engineering hack that turned out to be a useful architectural principle!
Take grouped convolution to the extreme: G = Cin. Each filter operates on a single channel independently. This is called a depthwise convolution. It captures spatial patterns within each channel but doesn't mix information across channels.
To mix channels, follow the depthwise conv with a 1×1 pointwise convolution. The combination — depthwise + pointwise — is a depthwise separable convolution.
Standard: Cin=64, Cout=64, K=3. Params: 64 × 64 × 9 = 36,864.
Depthwise separable: Depthwise: 64 × 9 = 576. Pointwise: 64 × 64 = 4,096. Total: 4,672.
Reduction: 36,864 / 4,672 ≈ 7.9× fewer parameters. Nearly 8× more efficient!
MobileNet (Howard et al., 2017) used depthwise separable convolutions throughout, achieving competitive accuracy with far fewer parameters — small enough to run on phones.
Modern efficient CNNs follow a common recipe: 1×1 pointwise (expand/compress channels) + depthwise K×K (spatial filtering per channel) + 1×1 pointwise (project back). This "inverted bottleneck" pattern appears in MobileNetV2, EfficientNet, and ConvNeXt. It separates "what to mix across channels" from "how to process spatially" — a factorization that's remarkably efficient.
| Conv Type | Params (C=64, K=3) | Relative Cost | Used In |
|---|---|---|---|
| Standard | 36,864 | 1.0× | VGG, early layers |
| Grouped (G=4) | 9,216 | 0.25× | AlexNet, ResNeXt |
| Depthwise Sep. | 4,672 | 0.13× | MobileNet, EfficientNet |
| 1×1 only | 4,096 | 0.11× | NiN, Inception bottlenecks |
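All four rows of this table follow from one formula once a `groups` argument is added. The sketch below counts weights only (no biases), matching the table; the helper name `conv_params` is an assumption.

```python
def conv_params(c_in, c_out, k, groups=1):
    """Weight count of a (possibly grouped) conv: each filter sees c_in / groups channels."""
    if c_in % groups or c_out % groups:
        raise ValueError("channel counts must be divisible by groups")
    return c_out * (c_in // groups) * k * k


c, k = 64, 3
standard = conv_params(c, c, k)              # every filter sees all 64 channels
grouped = conv_params(c, c, k, groups=4)     # each filter sees 16 channels
depthwise = conv_params(c, c, k, groups=c)   # each filter sees 1 channel
pointwise = conv_params(c, c, 1)             # 1x1 conv mixes channels
separable = depthwise + pointwise

for name, n in [("standard", standard), ("grouped", grouped),
                ("depthwise separable", separable), ("1x1 only", pointwise)]:
    print(f"{name}: {n} params, {n / standard:.2f}x")
# standard: 36864 params, 1.00x
# grouped: 9216 params, 0.25x
# depthwise separable: 4672 params, 0.13x
# 1x1 only: 4096 params, 0.11x
```

Depthwise convolution is just the `groups = c_in` case of the same formula, which is why it needs no separate code path.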
Time to see the full pipeline in action: an input image flows through a conv layer, a ReLU activation, and max pooling, and each stage transforms the data.
Different filter types detect different features. The ReLU clips negative values to zero. The pooling downsamples by taking the max in each 2×2 window.
Vertical edge filter on vertical edge input: strong activations along the edge. On horizontal edge input: almost no response — the filter and the pattern are orthogonal. After ReLU: all negative activations become zero (dark cells disappear). After pooling: the spatial size halves but the strongest features survive. This is how a CNN builds invariance while preserving the important stuff.
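The behavior described here can be reproduced with a small conv → ReLU → max-pool sketch on hand-made 6×6 images. All names (`conv2d`, `relu`, `max_pool`, `pipeline`) and the test images are illustrative assumptions, with a Sobel filter standing in for a learned one.

```python
SOBEL_X = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]   # vertical edge detector


def conv2d(image, kernel):
    """Sliding dot product, no padding, stride 1."""
    kh, kw = len(kernel), len(kernel[0])
    return [
        [sum(image[i + m][j + n] * kernel[m][n]
             for m in range(kh) for n in range(kw))
         for j in range(len(image[0]) - kw + 1)]
        for i in range(len(image) - kh + 1)
    ]


def relu(fmap):
    """Clip negative activations to zero."""
    return [[max(0, v) for v in row] for row in fmap]


def max_pool(fmap, size=2):
    """Non-overlapping size x size max pooling (stride = size)."""
    return [
        [max(fmap[i + m][j + n] for m in range(size) for n in range(size))
         for j in range(0, len(fmap[0]) - size + 1, size)]
        for i in range(0, len(fmap) - size + 1, size)
    ]


def pipeline(image, kernel=SOBEL_X):
    return max_pool(relu(conv2d(image, kernel)))


vertical_edge = [[0, 0, 0, 255, 255, 255] for _ in range(6)]   # dark -> bright
horizontal_edge = [[0] * 6] * 3 + [[255] * 6] * 3              # dark above bright
reversed_edge = [row[::-1] for row in vertical_edge]           # bright -> dark

print(conv2d(vertical_edge, SOBEL_X)[0])   # [0, 1020, 1020, 0]
print(pipeline(vertical_edge))             # [[1020, 1020], [1020, 1020]]
print(pipeline(horizontal_edge))           # [[0, 0], [0, 0]]
print(pipeline(reversed_edge))             # [[0, 0], [0, 0]]
```

The last case shows the ReLU at work: the bright-to-dark edge produces strongly negative responses, which are clipped to zero, so this particular filter only reports dark-to-bright edges.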
We've built a complete understanding of convolutional neural networks for image classification. Here's the whole picture in one view.
| Formula | Purpose |
|---|---|
| O = ⌊(W − K + 2P) / S⌋ + 1 | Output spatial size |
| Params = Cout(Cin × K² + 1) | Conv layer parameters |
| FLOPs = Cout × H' × W' × Cin × K² | Compute cost |
| RF = 1 + L(K − 1) | Receptive field (uniform stride=1) |
| $RF_\ell = RF_{\ell-1} + (K_\ell - 1) \prod_{i=1}^{\ell-1} S_i$ | General receptive field |
| Component | What It Does | Learnable? |
|---|---|---|
| Convolution | Detects local spatial patterns; weight sharing | Yes (filters + bias) |
| ReLU | Introduces nonlinearity; clips negatives to 0 | No |
| Pooling | Downsamples spatial dimensions; adds invariance | No |
| Fully Connected | Global reasoning; classification at the end | Yes (weights + bias) |
| 1×1 Conv | Channel mixing / dimension reduction | Yes |
| Depthwise Sep. | Efficient spatial + channel factorization | Yes |
This lecture covered the primitives of CNNs. The next lectures build on them:
Lecture 6 — CNN Architectures: How to arrange these building blocks. AlexNet, VGG, GoogLeNet (Inception), ResNet, and the evolution of modern architectures. The key innovation: skip connections and very deep networks.
Lecture 8 — Vision Transformers: Starting around 2021, transformers began replacing CNNs for vision tasks. The core idea: treat image patches as tokens and use self-attention instead of convolution. But even ViTs borrow from CNNs — patch embedding is essentially a strided convolution.
Despite the transformer revolution, convolutional layers remain fundamental. ConvNeXt (2022) showed that a pure CNN, modernized with transformer-era training techniques, matches Vision Transformer performance. Many hybrid architectures use conv layers in early stages and attention in later stages. The inductive biases of convolution — locality, translation equivariance, weight sharing — remain powerful, especially with limited data.
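Convolution with stride 1 is translation equivariant: Conv(Translate(X)) = Translate(Conv(X)). If you shift the input, the output shifts by the same amount (up to border effects). Strided convolution and pooling keep this property only for shifts that are a multiple of the stride; for smaller shifts, pooling provides approximate invariance instead. This is a key reason CNNs work for vision: features of images don't depend on their absolute position. A cat in the top-left corner has the same ears as a cat in the bottom-right.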
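The identity Conv(Translate(X)) = Translate(Conv(X)) can be checked numerically for a stride-1 convolution. In this sketch the helper names and the test image are made up for illustration, and the first output column is excluded from the comparison because it depends on the zeros that the shift brings in at the border.

```python
def cross_correlate2d(image, kernel):
    """Sliding dot product, no padding, stride 1."""
    kh, kw = len(kernel), len(kernel[0])
    return [
        [sum(image[i + m][j + n] * kernel[m][n]
             for m in range(kh) for n in range(kw))
         for j in range(len(image[0]) - kw + 1)]
        for i in range(len(image) - kh + 1)
    ]


def shift_right(image, pixels=1):
    """Translate the image right, filling the vacated columns with zeros."""
    return [[0] * pixels + row[:len(row) - pixels] for row in image]


# Arbitrary deterministic 6x8 test image and a Sobel kernel.
image = [[(7 * i + 3 * j) % 11 for j in range(8)] for i in range(6)]
kernel = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]

original = cross_correlate2d(image, kernel)
shifted = cross_correlate2d(shift_right(image), kernel)

# Away from the left border, the shifted output equals the original output moved one column right.
equivariant = all(
    shifted[i][j] == original[i][j - 1]
    for i in range(len(original))
    for j in range(1, len(original[0]))
)
print(equivariant)  # True
```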
From LeCun's handwritten digit recognizer (1998) to AlexNet's ImageNet breakthrough (2012) to MobileNet on your phone — the convolution operation is one of the most impactful ideas in all of deep learning. You now understand exactly how it works, why it works, and where it's headed.