Deep Learning Foundations

Convolutional Networks From Zero

How a small sliding window learns to see edges, textures, and objects.

Prerequisites: Basic linear algebra + Neural network basics. That's it.
10 Chapters · 8+ Simulations · 0 Assumed Knowledge

Chapter 0: Why Convolutions?

Imagine you have a 256×256 color image and you want to classify it. If you flatten every pixel into a vector and feed it to a fully-connected neural network, the first layer alone needs 256 × 256 × 3 = 196,608 input weights per neuron. With 1,000 neurons in that layer, you're looking at ~200 million parameters. For a single layer. That's absurd.
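The arithmetic above can be checked in a few lines. This is only a sketch: the 64-filter 3×3 convolution used for comparison is a hypothetical layer chosen for illustration, and biases are left out of both counts.

```python
# Parameter count for the fully-connected first layer described above.
height, width, channels = 256, 256, 3
inputs_per_neuron = height * width * channels   # one weight per input value
neurons = 1_000
fc_weights = inputs_per_neuron * neurons        # biases not counted

# Hypothetical comparison: 64 filters of size 3x3 spanning all 3 channels.
num_filters, k = 64, 3
conv_weights = num_filters * (k * k * channels)  # biases not counted

print(f"inputs per neuron: {inputs_per_neuron:,}")
print(f"fully-connected:   {fc_weights:,}")
print(f"3x3 conv, 64 maps: {conv_weights:,}")
```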

Worse, a fully-connected layer treats pixel (0, 0) and pixel (255, 255) as equally related. But vision is local. An edge is a few adjacent pixels. A texture is a small patch. A whisker is a narrow region. A fully-connected network ignores this structure completely.

The core idea: Instead of connecting every pixel to every neuron, use a small sliding window, called a filter, that looks at one patch at a time. The same filter slides across the entire image, detecting the same pattern everywhere. This is convolution, and it slashes parameters while respecting spatial structure.

Fully-Connected vs Convolution

Left: every input connects to every neuron (explosion of wires). Right: a small filter slides across, reusing the same weights. Far fewer parameters, and the spatial layout of the input is preserved.

Convolutional networks exploit two properties of images. Locality: useful patterns (edges, corners, textures) are small. Translation invariance: an edge in the top-left is still an edge in the bottom-right. By sharing weights across space, convnets learn to detect features regardless of position.

Why is a fully-connected layer wasteful for image input?

Chapter 1: The Convolution Operation

Here's how convolution works. You have a small grid of numbers called a kernel (or filter) — typically 3×3 or 5×5. You place it over a patch of the image, multiply each kernel value by the pixel underneath, sum everything up, and write the result to the output. Then you slide the kernel one step to the right and repeat.

In math, for a 2D input I and kernel K of size k × k, the output at position (i, j) is:

$$O(i, j) = \sum_{m=0}^{k-1} \sum_{n=0}^{k-1} K(m, n) \cdot I(i+m, j+n)$$

That's it. It's just "element-wise multiply, then sum" — a dot product between the kernel and the image patch. The kernel slides across the entire image, producing a 2D grid of outputs called a feature map (or activation map).
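As a minimal sketch, the sum above translates almost directly into nested loops. The function name `conv2d` and the sample values are illustrative assumptions; the code uses stride 1 and no padding, and applies the kernel exactly as the formula is written (no flipping).

```python
def conv2d(image, kernel):
    """Slide a k x k kernel over a 2D image (stride 1, no padding)."""
    k = len(kernel)
    out_h = len(image) - k + 1
    out_w = len(image[0]) - k + 1
    output = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            # Element-wise multiply the kernel with the patch at (i, j), then sum.
            total = 0
            for m in range(k):
                for n in range(k):
                    total += kernel[m][n] * image[i + m][j + n]
            row.append(total)
        output.append(row)
    return output

image = [
    [1, 2, 3, 0],
    [4, 5, 6, 1],
    [7, 8, 9, 2],
    [1, 0, 1, 3],
]
kernel = [
    [1, 0],
    [0, -1],
]
feature_map = conv2d(image, kernel)
for row in feature_map:
    print(row)
```

Each printed row is one row of the feature map; a 4×4 input with a 2×2 kernel leaves 3×3 valid positions.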

Key analogy: Think of the kernel as a stencil. It asks one question everywhere: "Does this patch match my pattern?" A high output means "yes, strong match." A low output means "no match here." The feature map is a heat map of where the pattern appears.

1D Convolution Step by Step

A 3-element kernel slides across a 1D signal. Watch the element-wise multiply and sum at each step.
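The same steps can be sketched in code for readers without the simulation. The signal and kernel values here are made up for illustration.

```python
def conv1d(signal, kernel):
    """Slide a 1D kernel across a 1D signal (stride 1, no padding)."""
    k = len(kernel)
    outputs = []
    for i in range(len(signal) - k + 1):
        window = signal[i:i + k]
        products = [kv * sv for kv, sv in zip(kernel, window)]
        outputs.append(sum(products))
        print(f"position {i}: window={window} products={products} sum={sum(products)}")
    return outputs

signal = [1, 3, 2, 5, 4]   # illustrative values
kernel = [1, 0, -1]        # left neighbor minus right neighbor
result = conv1d(signal, kernel)
```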

In practice, convolution is applied to 3D volumes. An RGB image is W × H × 3, so the kernel is also 3D: k × k × 3. The kernel still produces a single number per position — it sums over all three channels. To detect multiple features, we use multiple kernels, each producing its own feature map.
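A rough sketch of the multi-channel case, assuming the image is stored as nested lists indexed `[row][column][channel]` (that layout and the two example kernels are assumptions for illustration). Each kernel spans all channels and still yields one number per position, so two kernels give two feature maps.

```python
def conv2d_multichannel(image, kernel):
    """image: H x W x C nested lists; kernel: k x k x C. Stride 1, no padding."""
    k = len(kernel)
    channels = len(kernel[0][0])
    out_h = len(image) - k + 1
    out_w = len(image[0]) - k + 1
    return [
        [
            # Sum over the k x k window AND over every channel.
            sum(
                kernel[m][n][c] * image[i + m][j + n][c]
                for m in range(k)
                for n in range(k)
                for c in range(channels)
            )
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]

# 3x3 RGB image where every pixel is (r, g, b) = (1, 2, 3).
image = [[[1, 2, 3] for _ in range(3)] for _ in range(3)]

# Two hypothetical 2x2x3 kernels: one sums every channel, one looks only at red.
sum_all = [[[1, 1, 1] for _ in range(2)] for _ in range(2)]
red_only = [[[1, 0, 0] for _ in range(2)] for _ in range(2)]

feature_maps = [conv2d_multichannel(image, kern) for kern in (sum_all, red_only)]
print(feature_maps)
```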

What operation does the kernel perform at each position?

Chapter 2: Filters — What the Network Learns to See

Different kernels detect different patterns. This isn't magic — it falls directly out of the math. Let's look at three classic hand-crafted kernels to build intuition about what learned kernels will discover on their own.

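Edge detection. The kernel [[-1, 0, 1], [-1, 0, 1], [-1, 0, 1]] computes the difference between the pixels to the right and the pixels to the left of each position, so it responds to vertical edges. Where the image is uniform, the sum is ~0. Where brightness changes sharply from left to right (an edge), the sum is large in magnitude, with a sign that depends on which side is brighter.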

Blur. The kernel [[1,1,1],[1,1,1],[1,1,1]] (divided by 9) averages a 3×3 neighborhood. Each output pixel becomes the mean of its neighbors, smoothing out noise and sharp transitions.

Sharpen. The kernel [[0,-1,0],[-1,5,-1],[0,-1,0]] amplifies the center pixel and subtracts its neighbors. This enhances contrast at edges, making the image crisper.
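To make the three kernels concrete, here is a small sketch that applies each one to a tiny test image with a single vertical edge (dark on the left, bright on the right). The image values and the helper `conv2d` are assumptions for illustration; the kernels are the ones given above.

```python
def conv2d(image, kernel):
    """Stride-1, no-padding 2D convolution as defined in Chapter 1."""
    k = len(kernel)
    return [
        [
            sum(kernel[m][n] * image[i + m][j + n] for m in range(k) for n in range(k))
            for j in range(len(image[0]) - k + 1)
        ]
        for i in range(len(image) - k + 1)
    ]

# 5x5 image: two dark columns, then three bright columns.
image = [[0, 0, 10, 10, 10] for _ in range(5)]

edge = [[-1, 0, 1], [-1, 0, 1], [-1, 0, 1]]
blur = [[1 / 9] * 3 for _ in range(3)]
sharpen = [[0, -1, 0], [-1, 5, -1], [0, -1, 0]]

edges = conv2d(image, edge)        # large where brightness changes left to right
blurred = conv2d(image, blur)      # the hard step becomes a gradual ramp
sharpened = conv2d(image, sharpen) # undershoot and overshoot around the edge

print("edge:   ", edges[0])
print("blur:   ", [round(v, 2) for v in blurred[0]])
print("sharpen:", sharpened[0])
```

The edge kernel outputs zero in the uniform bright region and a large value where the window straddles the boundary, which is the behavior described above.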

The key realization: In a convolutional network, these kernels are not hand-crafted. They start as random numbers and are learned through backpropagation. The network discovers the most useful patterns for the task: edges in early layers, textures in middle layers, object parts in deep layers.

Kernel Effects

Pick a kernel and see its effect on a sample image. The output feature map highlights different structures.

When researchers visualize the first-layer filters of a trained CNN (like AlexNet), they see exactly what you'd expect: Gabor-like edge detectors at various orientations and frequencies, plus color-contrast detectors. The network reinvented what neuroscientists found in the visual cortex decades earlier.

In a trained CNN, what do the first-layer filters typically detect?

Chapter 3: Stride and Padding

So far we've slid the kernel one pixel at a time. But we have two knobs that control how the kernel moves and what happens at the edges.

Stride is how many pixels the kernel jumps between positions. Stride 1 means one pixel at a time (dense overlap). Stride 2 means skip every other position, cutting the output size roughly in half. Larger strides produce smaller outputs and are a way to downsample without pooling.

Padding adds extra pixels (usually zeros) around the border of the input. Without padding, a 3×3 kernel on a 5×5 input only fits in 3×3 positions — the output shrinks. With one pixel of zero-padding, the input becomes 7×7, and the output stays 5×5 — same size as the input. This is called "same" padding.
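Both knobs can be sketched by extending the basic loop: pad the input with zeros first, then step through it in jumps of `stride`. The helper names `pad2d` and `conv2d` and the all-ones test data are assumptions for illustration.

```python
def pad2d(image, p):
    """Surround a 2D image with p rows and columns of zeros."""
    width = len(image[0]) + 2 * p
    padded = [[0] * width for _ in range(p)]
    for row in image:
        padded.append([0] * p + list(row) + [0] * p)
    padded += [[0] * width for _ in range(p)]
    return padded

def conv2d(image, kernel, stride=1, padding=0):
    """2D convolution with configurable stride and zero-padding."""
    image = pad2d(image, padding)
    k = len(kernel)
    output = []
    for i in range(0, len(image) - k + 1, stride):
        row = []
        for j in range(0, len(image[0]) - k + 1, stride):
            row.append(
                sum(kernel[m][n] * image[i + m][j + n] for m in range(k) for n in range(k))
            )
        output.append(row)
    return output

image = [[1] * 5 for _ in range(5)]    # 5x5 input of ones
kernel = [[1] * 3 for _ in range(3)]   # 3x3 kernel of ones

for stride, padding in [(1, 0), (1, 1), (2, 0), (2, 1)]:
    out = conv2d(image, kernel, stride, padding)
    print(f"stride={stride} padding={padding} -> {len(out)}x{len(out[0])}")

same = conv2d(image, kernel, stride=1, padding=1)
```

With an all-ones kernel on an all-ones image, each output equals the number of real (non-padded) pixels under the kernel, so the corner of `same` is smaller than its center.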

Why padding matters: Without it, every conv layer shrinks the spatial dimensions. A 3×3 kernel trims 2 pixels per layer, so 10 stacked layers turn a 32×32 input into 12×12, and smaller inputs can vanish entirely. Padding preserves size, letting you build deep networks without spatial collapse. It also lets border pixels contribute to more outputs than they would without padding, although still fewer than center pixels.

Stride and Padding

Adjust stride and padding to see how they affect the output size. Blue = input, orange = kernel position, teal = output.

A 7×7 input, 3×3 kernel, stride 2, no padding. What is the output width?

Chapter 4: Pooling — Compress and Summarize

After convolution + ReLU, we often apply a pooling layer. Pooling takes a small window (typically 2×2) and replaces it with a single summary value. The most common variant, max pooling, keeps only the maximum value in each window.

Why discard information? Three reasons. First, it reduces spatial size: a 2×2 max pool with stride 2 halves both width and height, cutting the number of values by 75%. Second, it introduces a small amount of translation invariance: if a feature shifts by one pixel but stays inside the same 2×2 window, the max is unchanged. Third, fewer values means less computation in subsequent layers and fewer parameters in any fully-connected layer that follows (a conv layer's own parameter count does not depend on spatial size).
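A minimal sketch of max pooling, with made-up feature map values. It also illustrates the invariance point: moving the strongest activation by one pixel inside its window leaves the pooled output unchanged.

```python
def max_pool2d(fmap, size=2, stride=2):
    """Replace each size x size window with its maximum value."""
    output = []
    for i in range(0, len(fmap) - size + 1, stride):
        row = []
        for j in range(0, len(fmap[0]) - size + 1, stride):
            row.append(max(fmap[i + a][j + b] for a in range(size) for b in range(size)))
        output.append(row)
    return output

fmap = [
    [1, 3, 2, 0],
    [0, 9, 1, 1],
    [4, 0, 0, 7],
    [2, 1, 0, 0],
]
pooled = max_pool2d(fmap)
print(pooled)

# Move the 9 up by one pixel; it stays inside the same top-left 2x2 window.
shifted = [row[:] for row in fmap]
shifted[1][1], shifted[0][1] = 3, 9
pooled_shifted = max_pool2d(shifted)
print(pooled_shifted == pooled)
```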

Max pool vs Average pool: Max pooling asks "is this feature present anywhere in this region?" Average pooling asks "how strongly is this feature present on average?" Max pooling tends to work better for classification because the presence of a feature matters more than its average strength.

Max Pooling in Action

A 4×4 grid reduced to 2×2 by 2×2 max pooling. The maximum in each colored region becomes the output.

Average pooling replaces the max with the mean. It's less common in hidden layers but widely used as global average pooling (GAP) at the very end of modern networks. GAP takes an entire feature map (say 7×7) and averages it to a single number, replacing the massive fully-connected layers that older architectures used.
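Global average pooling is short enough to sketch directly. The feature map values are made up, and the 7×7×512 and 1,000-class figures in the weight comparison are hypothetical sizes chosen only to show the scale of the saving.

```python
def global_average_pool(feature_maps):
    """Collapse each 2D feature map (one per channel) to its mean."""
    return [
        sum(sum(row) for row in fmap) / (len(fmap) * len(fmap[0]))
        for fmap in feature_maps
    ]

feature_maps = [
    [[1, 2], [3, 4]],   # channel 0
    [[0, 0], [0, 8]],   # channel 1
]
pooled = global_average_pool(feature_maps)
print(pooled)

# Hypothetical sizes: 512 channels of 7x7 feeding a 1,000-class classifier.
channels, side, classes = 512, 7, 1_000
flatten_then_fc = side * side * channels * classes   # weights if flattened first
gap_then_fc = channels * classes                     # weights after GAP
print(f"{flatten_then_fc:,} vs {gap_then_fc:,}")
```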

What does 2×2 max pooling with stride 2 do to spatial dimensions?

Chapter 5: Output Sizes — The Formula

There's one formula you'll use constantly when designing conv layers. Given an input of size W, kernel size K, padding P, and stride S, the output size is:

O = ⌊(W − K + 2P) / S⌋ + 1

Let's unpack it. W − K is how much room the kernel has to slide (it needs K pixels to fit). Adding 2P accounts for padding on both sides. Dividing by S counts how many stride-length jumps fit. The floor ⌊·⌋ discards fractional positions where the kernel would hang off the edge. Adding 1 counts the starting position.
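The formula is a one-liner in code, using integer division for the floor. The function name `conv_output_size` is an assumption; the first two calls reproduce the 5×5 examples from Chapter 3, and the other two use illustrative sizes.

```python
def conv_output_size(w, k, p=0, s=1):
    """O = floor((W - K + 2P) / S) + 1"""
    return (w - k + 2 * p) // s + 1

print(conv_output_size(5, 3))            # 5x5 input, 3x3 kernel, no padding
print(conv_output_size(5, 3, p=1))       # same input with one pixel of padding
print(conv_output_size(8, 3, s=2))       # stride 2: the floor drops a partial step
print(conv_output_size(224, 7, p=3, s=2))
```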

Common recipes: To preserve spatial size with a 3×3 kernel at stride 1, set P = 1. For 5×5, set P = 2. The general rule: P = (K − 1) / 2 for odd-sized kernels.

Output Size Calculator

Set input size, kernel, padding, and stride. The output size updates in real time.


For a full conv layer, also count the number of parameters. Each filter has K × K × Cin weights plus 1 bias (where Cin is input channels). With N filters, that's N × (K × K × Cin + 1) total parameters. A 3×3 conv with 64 input and 128 output channels: 128 × (3 × 3 × 64 + 1) = 73,856 parameters. Compare that to a fully-connected layer connecting the same tensors — orders of magnitude smaller.
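The parameter count can be verified with a short sketch. The text does not give a spatial size for the fully-connected comparison, so the 32×32 feature maps used here are an assumption made only to put a number on it.

```python
def conv_params(k, c_in, n_filters):
    """N x (K x K x C_in + 1): weights plus one bias per filter."""
    return n_filters * (k * k * c_in + 1)

def fc_params(n_in, n_out):
    """One weight per input plus one bias, for each output unit."""
    return n_out * (n_in + 1)

conv = conv_params(k=3, c_in=64, n_filters=128)

# Assumed spatial size for the comparison: 32x32 feature maps on both sides.
side = 32
fc = fc_params(n_in=side * side * 64, n_out=side * side * 128)

print(f"conv layer: {conv:,} parameters")
print(f"FC layer:   {fc:,} parameters")
```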

Input 32×32, kernel 5×5, padding 2, stride 1. Output size?

Chapter 6: Building a ConvNet

A full convolutional network stacks three types of layers in a repeating pattern:

CONV
Apply K learned filters → K feature maps
ReLU
max(0, x) — introduce nonlinearity
POOL
Downsample spatial dimensions
↻ repeat N times

After several CONV-ReLU-POOL blocks, the spatial dimensions are small but the channel depth is large. The final feature maps encode high-level concepts: "there's an eye here," "there's fur texture there." These are flattened into a vector and passed through one or two fully-connected layers to produce class scores.

The hierarchy of abstraction: Early layers detect edges (3×3 regions). Middle layers combine edges into textures and parts (effective ~40×40 regions through stacking). Deep layers respond to whole objects. This hierarchy emerges automatically from training; nobody programs it.

ConvNet Architecture

Watch how spatial size shrinks while depth (number of channels) grows through the network. This is the fundamental tradeoff in conv architectures.

```python
import torch.nn as nn

# Shape comments read height x width x channels; PyTorch itself stores
# tensors as (batch, channels, height, width).
model = nn.Sequential(
    nn.Conv2d(3, 32, 3, padding=1),   # 32x32x3  → 32x32x32
    nn.ReLU(),
    nn.MaxPool2d(2),                  # 32x32x32 → 16x16x32
    nn.Conv2d(32, 64, 3, padding=1),  # 16x16x32 → 16x16x64
    nn.ReLU(),
    nn.MaxPool2d(2),                  # 16x16x64 → 8x8x64
    nn.Flatten(),                     # 8*8*64 = 4096
    nn.Linear(4096, 10),              # 10 classes
)
```

What is the typical pattern of spatial size and channel depth through a ConvNet?
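The shape comments in the PyTorch model above can be checked without installing PyTorch by tracing each layer through the output size formula from Chapter 5. This is a sketch: the `layers` list is a hand-written description of that model, assuming `nn.MaxPool2d(2)` uses a 2×2 window with stride 2 (its default when no stride is given).

```python
def conv_output_size(w, k, p=0, s=1):
    """O = floor((W - K + 2P) / S) + 1"""
    return (w - k + 2 * p) // s + 1

# (name, kernel, padding, stride, output channels or None to keep them)
layers = [
    ("conv1", 3, 1, 1, 32),
    ("pool1", 2, 0, 2, None),
    ("conv2", 3, 1, 1, 64),
    ("pool2", 2, 0, 2, None),
]

size, channels = 32, 3
shapes = [(size, size, channels)]
for name, k, p, s, out_channels in layers:
    size = conv_output_size(size, k, p, s)
    if out_channels is not None:
        channels = out_channels
    shapes.append((size, size, channels))
    print(f"{name}: {size}x{size}x{channels}")

flattened = size * size * channels
print("flattened length:", flattened)
```

The trace shows the pattern the chapter describes: spatial size falls from 32 to 8 while channel depth rises from 3 to 64.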

Chapter 7: Interactive Convolution Explorer

Now let's put it all together. Below is a small image grid. Pick a kernel, set your stride and padding, and watch the convolution happen step by step. The kernel slides across the input, computing the dot product at each position, building the output feature map one cell at a time.

The showcase: This is the payoff. You're going to watch convolution happen. See the kernel land on a patch, multiply element by element, sum the products, and write one output pixel. Then slide and repeat. Adjust the controls and notice how the output size changes.

2D Convolution: Step by Step

Pick a kernel. Hit Step to advance one position, or Play to animate. The orange overlay shows the kernel position. The teal grid is the output feature map.


Chapter 8: Famous Architectures

The history of deep learning is largely the history of ConvNet architectures. Each breakthrough came from a simple idea about how to stack layers better.

LeNet-5 (1998). Yann LeCun's pioneer. Two conv layers, two pooling layers, three FC layers. Designed for 32×32 grayscale handwritten digits. Just ~60K parameters. Showed that features learned end to end could beat hand-crafted ones on digit recognition.

AlexNet (2012). The revolution. Won ImageNet by a massive margin. Same idea as LeNet but bigger (5 conv layers, 60M parameters), trained on GPUs, which made a network of that size practical, and used ReLU instead of tanh. Showed that scale + data + compute = breakthrough.

VGGNet (2014). The "deeper is better" insight. Used only 3×3 kernels everywhere, stacked very deep (16-19 layers). Two 3×3 layers have the same receptive field as one 5×5 but with fewer parameters and more nonlinearity. Clean, uniform design.
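The VGG claim can be checked with a little arithmetic. This sketch assumes stride-1 layers and, for the weight comparison, the same number of input and output channels in every layer (64 here, a value picked for illustration); biases are ignored.

```python
def stacked_receptive_field(kernel_sizes):
    """Receptive field of stride-1 conv layers applied one after another."""
    field = 1
    for k in kernel_sizes:
        field += k - 1   # each layer widens the view by k - 1 pixels
    return field

def conv_weights(k, channels):
    """Weights in a k x k conv with `channels` inputs and outputs (no biases)."""
    return k * k * channels * channels

channels = 64  # illustrative channel count
two_3x3 = 2 * conv_weights(3, channels)
one_5x5 = conv_weights(5, channels)

print("receptive field, two 3x3:", stacked_receptive_field([3, 3]))
print("receptive field, one 5x5:", stacked_receptive_field([5]))
print(f"weights, two 3x3: {two_3x3:,}")
print(f"weights, one 5x5: {one_5x5:,}")
```

The stacked pair sees the same 5×5 region with 18 weights per channel pair instead of 25, and gets an extra nonlinearity between the two layers.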

ResNet (2015). The biggest idea since convolution itself. Added skip connections (residual connections) that let the input bypass a layer: y = F(x) + x. This addressed the degradation problem: beyond a certain depth, plain networks showed higher training error than shallower ones, an optimization difficulty rather than overfitting. With skip connections, networks of 50, 101, even 152 layers trained successfully.

Why ResNet matters: Without skip connections, a layer must learn the full desired mapping. With skip connections, it only needs to learn the residual (the difference from identity). Learning "change nothing" (all zeros) is easy. Learning "do a complex transform" is hard. Skip connections make the easy case the default.

| Network | Year | Layers | Params | Key Idea |
| --- | --- | --- | --- | --- |
| LeNet-5 | 1998 | 5 | 60K | Learned conv filters |
| AlexNet | 2012 | 8 | 60M | GPU training, ReLU, dropout |
| VGG-16 | 2014 | 16 | 138M | Small 3×3 filters, deeper |
| GoogLeNet | 2014 | 22 | 6.8M | Inception modules |
| ResNet-50 | 2015 | 50 | 25M | Skip connections |

Skip Connection Diagram

In ResNet, the input x bypasses two conv layers. The output is F(x) + x. If F(x) = 0, the block passes input through unchanged.
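A toy sketch of the idea, not the published ResNet block: two small matrix multiplies with a ReLU between them stand in for the two conv layers, and details such as normalization and the activation applied after the addition are left out. All names and values are illustrative assumptions.

```python
def relu(vector):
    return [max(0.0, v) for v in vector]

def linear(weights, vector):
    """Matrix-vector product; a stand-in for a conv layer in this toy example."""
    return [sum(w * v for w, v in zip(row, vector)) for row in weights]

def residual_block(x, w1, w2):
    """y = F(x) + x, where F is linear -> ReLU -> linear."""
    fx = linear(w2, relu(linear(w1, x)))
    return [f + xi for f, xi in zip(fx, x)]

x = [1.0, -2.0, 3.0]

# With all-zero weights F(x) = 0, so the block passes x through unchanged.
zeros = [[0.0] * 3 for _ in range(3)]
identity_out = residual_block(x, zeros, zeros)
print(identity_out)

# With nonzero weights the block adds a correction on top of x.
eye = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
half = [[0.5, 0.0, 0.0], [0.0, 0.5, 0.0], [0.0, 0.0, 0.5]]
adjusted_out = residual_block(x, eye, half)
print(adjusted_out)
```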

What problem do skip connections solve in deep networks?

Chapter 9: Beyond — Where ConvNets Go from Here

Convolutional networks dominated computer vision for a decade, and they remain fundamental. But the landscape has evolved.

Vision Transformers (ViT) split images into patches and process them with self-attention instead of convolution. They achieve state-of-the-art results with enough data, but require more training data than ConvNets to learn spatial structure (since they lack the inductive bias of locality and weight sharing).
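The patch-splitting step is simple to sketch; the self-attention that follows it is not shown here. The function name `split_into_patches` and the 4×4 single-channel test image are assumptions for illustration.

```python
def split_into_patches(image, patch):
    """Cut an H x W image into non-overlapping patch x patch tiles, each flattened."""
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image size must be a multiple of the patch size")
    patches = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            patches.append(
                [image[top + a][left + b] for a in range(patch) for b in range(patch)]
            )
    return patches

# 4x4 image whose pixel values are simply 0..15 in reading order.
image = [[row * 4 + col for col in range(4)] for row in range(4)]
patches = split_into_patches(image, 2)
for p in patches:
    print(p)
```

Each flattened patch becomes one input token, so a 4×4 image with 2×2 patches yields a sequence of four tokens.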

Hybrid architectures combine conv layers for early feature extraction with transformer layers for global reasoning. Separately, ConvNeXt showed that modernizing a pure ConvNet with transformer-era design choices and training recipes (larger kernels, LayerNorm, GELU) closes much of the gap.

| Approach | Locality Bias | Global Context | Data Efficiency |
| --- | --- | --- | --- |
| Pure ConvNet | Strong (built-in) | Limited (stacking) | High |
| Vision Transformer | None | Full (self-attention) | Low |
| Hybrid (Conv + Attn) | Early layers | Late layers | Medium |

The lasting legacy: Even if transformers replace ConvNets for some tasks, the ideas (local receptive fields, weight sharing, hierarchical feature extraction, residual connections) are permanent contributions to deep learning. You can't understand modern vision without understanding convolutions.

Related lessons: Image Classification covers the data-driven approach and k-NN. Neural Networks covers backpropagation and optimization. GPT and Transformers cover the attention-based alternatives.

"What I cannot create, I do not understand." — Richard Feynman