How a small sliding window learns to see edges, textures, and objects.
Imagine you have a 256×256 color image and you want to classify it. If you flatten every pixel into a vector and feed it to a fully-connected neural network, the first layer alone needs 256 × 256 × 3 = 196,608 input weights per neuron. With 1,000 neurons in that layer, you're looking at ~200 million parameters. For a single layer. That's absurd.
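The arithmetic is easy to check by hand or in a few lines of Python. This is only a sketch; the variable names are illustrative, and biases are left out of the count.

```python
# Weight count for one fully-connected layer fed by a flattened image.
height, width, channels = 256, 256, 3
neurons = 1_000

inputs_per_neuron = height * width * channels  # one weight per input value
total_weights = inputs_per_neuron * neurons    # biases not counted

print(inputs_per_neuron)  # 196608
print(total_weights)      # 196608000, roughly 200 million
```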
Worse, a fully-connected layer treats pixel (0, 0) and pixel (255, 255) as equally related. But vision is local. An edge is a few adjacent pixels. A texture is a small patch. A whisker is a narrow region. A fully-connected network ignores this structure completely.
Left: every input connects to every neuron (explosion of wires). Right: a small filter slides across, reusing the same weights. Far fewer parameters, same spatial awareness.
Convolutional networks exploit two properties of images. Locality: useful patterns (edges, corners, textures) are small. Translation invariance: an edge in the top-left is still an edge in the bottom-right. By sharing weights across space, convnets learn to detect features regardless of position.
Here's how convolution works. You have a small grid of numbers called a kernel (or filter) — typically 3×3 or 5×5. You place it over a patch of the image, multiply each kernel value by the pixel underneath, sum everything up, and write the result to the output. Then you slide the kernel one step to the right and repeat.
In math, for a 2D input I and kernel K of size k × k, the output at position (i, j) is:

$$O(i, j) = \sum_{m=0}^{k-1} \sum_{n=0}^{k-1} I(i + m,\, j + n)\, K(m, n)$$
That's it. It's just "element-wise multiply, then sum" — a dot product between the kernel and the image patch. The kernel slides across the entire image, producing a 2D grid of outputs called a feature map (or activation map).
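The multiply-and-sum step can be sketched in a few lines of NumPy. This is a minimal, unoptimized version assuming no padding and stride 1; the function name `conv2d` and the toy inputs are made up for illustration. Like most deep learning libraries, it does not flip the kernel, so strictly speaking it computes a cross-correlation.

```python
import numpy as np

def conv2d(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1) and return the feature map."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i:i + kh, j:j + kw]
            out[i, j] = np.sum(patch * kernel)  # element-wise multiply, then sum
    return out

# Toy input: the numbers 0..15 laid out as a 4x4 grid, and a 2x2 kernel of ones.
image = np.arange(16, dtype=float).reshape(4, 4)
kernel = np.ones((2, 2))

feature_map = conv2d(image, kernel)
print(feature_map.shape)  # (3, 3)
print(feature_map[0, 0])  # 10.0, i.e. 0 + 1 + 4 + 5
```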
A 3-element kernel slides across a 1D signal. Watch the element-wise multiply and sum at each step.
In practice, convolution is applied to 3D volumes. An RGB image is W × H × 3, so the kernel is also 3D: k × k × 3. The kernel still produces a single number per position — it sums over all three channels. To detect multiple features, we use multiple kernels, each producing its own feature map.
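Here is a rough sketch of the multi-channel case, again in NumPy with no padding and stride 1. The function name `conv_layer`, the 8×8 input, and the choice of 4 filters are assumptions for the example, and the loops are written for clarity rather than speed.

```python
import numpy as np

def conv_layer(volume, filters):
    """volume: (H, W, C_in). filters: (N, k, k, C_in). Returns (H-k+1, W-k+1, N)."""
    n, k, _, c_in = filters.shape
    assert volume.shape[2] == c_in, "kernel depth must match input channels"
    out_h = volume.shape[0] - k + 1
    out_w = volume.shape[1] - k + 1
    out = np.zeros((out_h, out_w, n))
    for f in range(n):                      # one feature map per filter
        for i in range(out_h):
            for j in range(out_w):
                patch = volume[i:i + k, j:j + k, :]
                # Sum over height, width, AND channels: one number per position.
                out[i, j, f] = np.sum(patch * filters[f])
    return out

rng = np.random.default_rng(0)
rgb = rng.random((8, 8, 3))                  # stand-in for a tiny RGB image
filters = rng.standard_normal((4, 3, 3, 3))  # four 3x3x3 kernels

maps = conv_layer(rgb, filters)
print(maps.shape)  # (6, 6, 4): four 6x6 feature maps
```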
Different kernels detect different patterns. This isn't magic — it falls directly out of the math. Let's look at three classic hand-crafted kernels to build intuition about what learned kernels will discover on their own.
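Edge detection. The kernel [[-1, 0, 1], [-1, 0, 1], [-1, 0, 1]] computes the difference between the pixels to the right and the pixels to the left. Where the image is uniform, the sum is ~0. Where brightness changes sharply from left to right (a vertical edge), the sum is large in magnitude: positive for dark-to-bright, negative for bright-to-dark. Transposing the kernel detects horizontal edges instead.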
Blur. The kernel [[1,1,1],[1,1,1],[1,1,1]] (divided by 9) averages a 3×3 neighborhood. Each output pixel becomes the mean of its neighbors, smoothing out noise and sharp transitions.
Sharpen. The kernel [[0,-1,0],[-1,5,-1],[0,-1,0]] amplifies the center pixel and subtracts its neighbors. This enhances contrast at edges, making the image crisper.
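To see the three kernels at work, the sketch below applies each one to a tiny synthetic image with a single vertical step from dark to bright. The image and the helper `conv2d` are made up for the example (no padding, stride 1).

```python
import numpy as np

def conv2d(image, kernel):
    """Plain sliding-window multiply-and-sum, no padding, stride 1."""
    kh, kw = kernel.shape
    out = np.zeros((image.shape[0] - kh + 1, image.shape[1] - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

edge = np.array([[-1, 0, 1], [-1, 0, 1], [-1, 0, 1]], dtype=float)
blur = np.ones((3, 3)) / 9.0
sharpen = np.array([[0, -1, 0], [-1, 5, -1], [0, -1, 0]], dtype=float)

# 5x6 image: dark (0) in the left three columns, bright (1) in the right three.
image = np.zeros((5, 6))
image[:, 3:] = 1.0

edge_map = conv2d(image, edge)
blur_map = conv2d(image, blur)
sharp_map = conv2d(image, sharpen)

# Every output row is the same because the image does not vary vertically.
print(edge_map[0])   # values 0, 3, 3, 0: fires only where the step is
print(blur_map[0])   # values 0, 1/3, 2/3, 1: the step becomes a ramp
print(sharp_map[0])  # values 0, -1, 2, 1: undershoot and overshoot around the step
```

The sharpen output dips below 0 and rises above 1 on either side of the step, which is exactly the exaggerated contrast that makes edges look crisper.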
Pick a kernel and see its effect on a sample image. The output feature map highlights different structures.
When researchers visualize the first-layer filters of a trained CNN (like AlexNet), they see exactly what you'd expect: Gabor-like edge detectors at various orientations and frequencies, plus color-contrast detectors. The network reinvented what neuroscientists found in the visual cortex decades earlier.
So far we've slid the kernel one pixel at a time. But we have two knobs that control how the kernel moves and what happens at the edges.
Stride is how many pixels the kernel jumps between positions. Stride 1 means one pixel at a time (dense overlap). Stride 2 means skip every other position, cutting the output size roughly in half along each dimension (so about a quarter as many values for a 2D input). Larger strides produce smaller outputs and are a way to downsample without pooling.
Padding adds extra pixels (usually zeros) around the border of the input. Without padding, a 3×3 kernel on a 5×5 input only fits in 3×3 positions — the output shrinks. With one pixel of zero-padding, the input becomes 7×7, and the output stays 5×5 — same size as the input. This is called "same" padding.
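Both knobs are small changes to the basic sliding-window loop. The following is a sketch only, assuming a square kernel and zero-padding via `np.pad`; the function name and the 5×5 test input are illustrative.

```python
import numpy as np

def conv2d(image, kernel, stride=1, padding=0):
    """Sliding-window multiply-and-sum with zero-padding and stride (square kernel)."""
    if padding > 0:
        image = np.pad(image, padding, mode="constant", constant_values=0)
    k = kernel.shape[0]
    out_h = (image.shape[0] - k) // stride + 1
    out_w = (image.shape[1] - k) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            r, c = i * stride, j * stride  # top-left corner of the current patch
            out[i, j] = np.sum(image[r:r + k, c:c + k] * kernel)
    return out

image = np.arange(25, dtype=float).reshape(5, 5)
kernel = np.ones((3, 3))

valid = conv2d(image, kernel)                        # no padding
same = conv2d(image, kernel, padding=1)              # one pixel of zeros all round
strided = conv2d(image, kernel, stride=2, padding=1)

print(valid.shape)    # (3, 3): the output shrinks
print(same.shape)     # (5, 5): same size as the input
print(strided.shape)  # (3, 3): stride 2 roughly halves each dimension
```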
Adjust stride and padding to see how they affect the output size. Blue = input, orange = kernel position, teal = output.
After convolution + ReLU, we often apply a pooling layer. Pooling takes a small window (typically 2×2) and replaces it with a single summary value. The most common variant, max pooling, keeps only the maximum value in each window.
Why discard information? Three reasons. First, it reduces spatial size: a 2×2 max pool with stride 2 halves both width and height, cutting the number of values by 75%. Second, it introduces a small amount of translation invariance: if a feature shifts by one pixel, the max in the 2×2 window often stays the same. Third, fewer values means less computation in later conv layers and fewer parameters in any fully-connected layer that follows.
A 4×4 grid reduced to 2×2 by 2×2 max pooling. The maximum in each colored region becomes the output.
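A minimal NumPy sketch of 2×2 max pooling follows. The grid values are made up for illustration and are not taken from the figure.

```python
import numpy as np

def max_pool2d(x, size=2, stride=2):
    """Replace each size x size window with its maximum value."""
    out_h = (x.shape[0] - size) // stride + 1
    out_w = (x.shape[1] - size) // stride + 1
    out = np.zeros((out_h, out_w), dtype=x.dtype)
    for i in range(out_h):
        for j in range(out_w):
            r, c = i * stride, j * stride
            out[i, j] = x[r:r + size, c:c + size].max()
    return out

grid = np.array([[1, 3, 2, 0],
                 [4, 6, 1, 5],
                 [7, 2, 9, 8],
                 [0, 1, 3, 4]])

pooled = max_pool2d(grid)
print(pooled.tolist())  # [[6, 5], [7, 9]]
```

Sixteen values go in and four come out, which is the 75% reduction described above.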
Average pooling replaces the max with the mean. It's less common in hidden layers but widely used as global average pooling (GAP) at the very end of modern networks. GAP takes an entire feature map (say 7×7) and averages it to a single number, replacing the massive fully-connected layers that older architectures used.
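In NumPy, global average pooling is a single `mean` over the two spatial axes. The sketch below also compares classifier sizes with and without it; the 512 channels and 1,000 classes are assumed values chosen only to make the comparison concrete.

```python
import numpy as np

rng = np.random.default_rng(0)
features = rng.random((7, 7, 512))   # hypothetical final feature maps, H x W x C

pooled = features.mean(axis=(0, 1))  # one average per channel
print(pooled.shape)                  # (512,)

# Parameters (weights + biases) of the classifier that follows, for 1000 classes.
num_classes = 1000
flatten_then_fc = 7 * 7 * 512 * num_classes + num_classes
gap_then_fc = 512 * num_classes + num_classes
print(flatten_then_fc)  # 25089000
print(gap_then_fc)      # 513000
```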
There's one formula you'll use constantly when designing conv layers. Given an input of size W, kernel size K, padding P, and stride S, the output size is:

$$\text{output} = \left\lfloor \frac{W - K + 2P}{S} \right\rfloor + 1$$
Let's unpack it. W − K is how much room the kernel has to slide (it needs K pixels to fit). Adding 2P accounts for padding on both sides. Dividing by S counts how many stride-length jumps fit. The floor ⌊·⌋ discards fractional positions where the kernel would hang off the edge. Adding 1 counts the starting position.
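The rule translates directly into a one-line helper, with Python's integer division `//` playing the role of the floor. The function name is an assumption for this sketch, and the example sizes are illustrative.

```python
def conv_output_size(w, k, p=0, s=1):
    """Output size along one dimension: floor((W - K + 2P) / S) + 1."""
    return (w - k + 2 * p) // s + 1

print(conv_output_size(5, 3))                 # 3: a 3x3 kernel on 5x5, no padding
print(conv_output_size(5, 3, p=1))            # 5: "same" padding keeps the size
print(conv_output_size(32, 3, p=1))           # 32
print(conv_output_size(224, 7, p=3, s=2))     # 112: stride 2 halves the size
print(conv_output_size(6, 3, s=2))            # 2: the floor drops the half-step
```

In the last case (6 − 3) / 2 = 1.5, so the kernel fits at two positions and the third would hang off the edge.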
Set input size, kernel, padding, and stride. The output size updates in real time.
For a full conv layer, also count the number of parameters. Each filter has K × K × Cin weights plus 1 bias (where Cin is input channels). With N filters, that's N × (K × K × Cin + 1) total parameters. A 3×3 conv with 64 input and 128 output channels: 128 × (3 × 3 × 64 + 1) = 73,856 parameters. Compare that to a fully-connected layer connecting the same tensors — orders of magnitude smaller.
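The count is simple to script. To make the fully-connected comparison concrete, the sketch below assumes a 32×32 spatial size for both tensors; that size is an assumption for illustration, since the conv parameter count itself does not depend on spatial size at all.

```python
def conv_params(k, c_in, n_filters):
    """N x (K x K x C_in + 1): weights plus one bias per filter."""
    return n_filters * (k * k * c_in + 1)

def fc_params(n_in, n_out):
    """Every output connects to every input, plus one bias per output."""
    return n_out * (n_in + 1)

conv = conv_params(3, 64, 128)
print(conv)  # 73856

# Hypothetical 32x32 feature maps: 32x32x64 values in, 32x32x128 values out.
fc = fc_params(32 * 32 * 64, 32 * 32 * 128)
print(fc)          # 8590065664, about 8.6 billion
print(fc // conv)  # more than 100,000 times as many parameters
```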
A full convolutional network stacks three types of layers in a repeating pattern: a convolution (CONV), a ReLU nonlinearity, and a pooling layer (POOL).
After several CONV-ReLU-POOL blocks, the spatial dimensions are small but the channel depth is large. The final feature maps encode high-level concepts: "there's an eye here," "there's fur texture there." These are flattened into a vector and passed through one or two fully-connected layers to produce class scores.
Watch how spatial size shrinks while depth (number of channels) grows through the network. This is the fundamental tradeoff in conv architectures.
```python
import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(3, 32, 3, padding=1),   # 32x32x3  → 32x32x32
    nn.ReLU(),
    nn.MaxPool2d(2),                  # 32x32x32 → 16x16x32
    nn.Conv2d(32, 64, 3, padding=1),  # 16x16x32 → 16x16x64
    nn.ReLU(),
    nn.MaxPool2d(2),                  # 16x16x64 → 8x8x64
    nn.Flatten(),                     # 8*8*64 = 4096
    nn.Linear(4096, 10),              # 10 classes
)
```
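The shape comments in that model can be double-checked without installing PyTorch by tracing sizes through the output-size rule. This is a plain-Python sketch; the layer list is a hand-written description of the model above, not something read from it.

```python
def out_size(size, k, p=0, s=1):
    """floor((W - K + 2P) / S) + 1 along one dimension."""
    return (size - k + 2 * p) // s + 1

h = w = 32
c = 3
# (kind, output channels): 3x3 convs with padding 1, 2x2 pools with stride 2.
layers = [("conv", 32), ("pool", None), ("conv", 64), ("pool", None)]

shapes = [(h, w, c)]
for kind, out_channels in layers:
    if kind == "conv":
        h, w, c = out_size(h, 3, p=1), out_size(w, 3, p=1), out_channels
    else:  # pooling changes spatial size only, never the channel count
        h, w = out_size(h, 2, s=2), out_size(w, 2, s=2)
    shapes.append((h, w, c))

flattened = h * w * c
print(shapes)     # [(32, 32, 3), (32, 32, 32), (16, 16, 32), (16, 16, 64), (8, 8, 64)]
print(flattened)  # 4096, the input size of the final Linear layer
```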
Now let's put it all together. Below is a small image grid. Pick a kernel, set your stride and padding, and watch the convolution happen step by step. The kernel slides across the input, computing the dot product at each position, building the output feature map one cell at a time.
Pick a kernel. Hit Step to advance one position, or Play to animate. The orange overlay shows the kernel position. The teal grid is the output feature map.
For much of the 2010s, the history of deep learning was largely the history of ConvNet architectures. Each breakthrough came from a simple idea about how to stack layers better.
LeNet-5 (1998). Yann LeCun's pioneering network. Two conv layers, two pooling layers, three FC layers. Designed for 32×32 grayscale handwritten digits. Just ~60K parameters. Showed that learned features could beat hand-crafted ones on digit recognition.
AlexNet (2012). The revolution. Won ImageNet by a massive margin. Same idea as LeNet but bigger (5 conv layers, 60M parameters), trained on two GPUs, and used ReLU instead of tanh. It was not the first network trained on GPUs, but it was the one that made GPU training standard. Showed that scale + data + compute = breakthrough.
VGGNet (2014). The "deeper is better" insight. Used only 3×3 kernels everywhere, stacked very deep (16-19 layers). Two 3×3 layers have the same receptive field as one 5×5 but with fewer parameters and more nonlinearity. Clean, uniform design.
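The receptive-field and parameter claims are quick to verify. The sketch below assumes stride-1 layers and 64 channels in and out of every layer; the channel count is an arbitrary choice, and biases are ignored.

```python
def conv_weights(k, channels):
    """Weights in a k x k conv with `channels` inputs and `channels` outputs."""
    return k * k * channels * channels

def stacked_receptive_field(kernel_sizes):
    """Receptive field of stacked stride-1 convs: each layer adds k - 1 pixels."""
    rf = 1
    for k in kernel_sizes:
        rf += k - 1
    return rf

c = 64
two_3x3 = 2 * conv_weights(3, c)
one_5x5 = conv_weights(5, c)

print(stacked_receptive_field([3, 3]))  # 5
print(stacked_receptive_field([5]))     # 5
print(two_3x3)  # 73728
print(one_5x5)  # 102400
```

The ratio is 18 to 25 regardless of the channel count, and the stacked version gets a second nonlinearity between its two layers.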
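ResNet (2015). The biggest idea since convolution itself. Added skip connections (residual connections) that let the input bypass a layer: y = F(x) + x. This addressed the degradation problem: past a certain depth, plain networks got worse than shallower ones even on the training set, which points to an optimization difficulty rather than overfitting. (The ResNet authors argued that vanishing gradients were unlikely to be the cause, since their plain networks used batch normalization.) With skip connections, each block only has to learn a correction to the identity, and networks of 50, 101, even 152 layers trained easily.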
| Network | Year | Layers | Params | Key Idea |
|---|---|---|---|---|
| LeNet-5 | 1998 | 5 | 60K | Learned conv filters |
| AlexNet | 2012 | 8 | 60M | GPU training, ReLU, dropout |
| VGG-16 | 2014 | 16 | 138M | Small 3×3 filters, deeper |
| GoogLeNet | 2014 | 22 | 6.8M | Inception modules |
| ResNet-50 | 2015 | 50 | 25M | Skip connections |
In ResNet, the input x bypasses two conv layers. The output is F(x) + x. If F(x) = 0, the block passes input through unchanged.
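A toy version of the block makes the pass-through behavior concrete. As a simplifying assumption, two dense matrices stand in for the two conv layers, and the ReLU that real ResNet blocks apply after the addition is omitted so the code matches y = F(x) + x exactly. All names here are illustrative.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def residual_block(x, w1, w2):
    """y = F(x) + x, where F is two linear layers with a ReLU in between."""
    f = relu(x @ w1) @ w2
    return f + x  # the skip connection

rng = np.random.default_rng(0)
x = rng.standard_normal(8)

# With all-zero weights F(x) = 0, so the block is exactly the identity.
zeros = np.zeros((8, 8))
y_identity = residual_block(x, zeros, zeros)
print(np.allclose(y_identity, x))  # True

# With small weights the block makes a small correction to its input.
w1 = 0.1 * rng.standard_normal((8, 8))
w2 = 0.1 * rng.standard_normal((8, 8))
y = residual_block(x, w1, w2)
print(y.shape)  # (8,)
```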
Convolutional networks dominated computer vision for a decade, and they remain fundamental. But the landscape has evolved.
Vision Transformers (ViT) split images into patches and process them with self-attention instead of convolution. They achieve state-of-the-art results with enough data, but need more training data than ConvNets to learn spatial structure, since they lack convolution's built-in inductive biases of locality and translation equivariance.
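The patch-splitting step is just a reshape. The sketch below covers only that step, not the learned projection or the attention layers; the function name is made up, and the 224×224 image with 16×16 patches is an illustrative choice of sizes.

```python
import numpy as np

def image_to_patches(image, patch):
    """Cut an (H, W, C) image into non-overlapping patches, one flat row per patch."""
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0, "image must divide evenly into patches"
    gh, gw = h // patch, w // patch
    x = image.reshape(gh, patch, gw, patch, c)
    x = x.transpose(0, 2, 1, 3, 4)  # (grid_row, grid_col, patch_row, patch_col, c)
    return x.reshape(gh * gw, patch * patch * c)

image = np.arange(224 * 224 * 3, dtype=np.float32).reshape(224, 224, 3)
tokens = image_to_patches(image, 16)
print(tokens.shape)  # (196, 768): a 14x14 grid of patches, 16*16*3 values each
```

Each row then plays the role a word embedding plays in a language model: self-attention compares every patch with every other patch, with no notion of which patches are neighbors unless the network learns it.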
Hybrid architectures combine conv layers for early feature extraction with transformer layers for global reasoning. Separately, ConvNeXt showed that modernizing a pure ConvNet with transformer-era design choices (larger kernels, LayerNorm, GELU) and training recipes closes much of the gap.
| Approach | Locality Bias | Global Context | Data Efficiency |
|---|---|---|---|
| Pure ConvNet | Strong (built-in) | Limited (stacking) | High |
| Vision Transformer | None | Full (self-attention) | Low |
| Hybrid (Conv + Attn) | Early layers | Late layers | Medium |
Related lessons: Image Classification covers the data-driven approach and k-NN. Neural Networks covers backpropagation and optimization. GPT and Transformers cover the attention-based alternatives.
"What I cannot create, I do not understand." — Richard Feynman