Stanford CS 231n · Lecture 5 · Image Classification with CNNs

Image Classification with Convolutional Neural Networks

Fully-connected layers destroy the spatial structure of images. Convolutions respect it — and that single change launched the deep learning revolution in computer vision.

Convolution operation · Stride & padding · Pooling layers · Feature hierarchies
Chapter 01

Why Convolutions?

You have a 32×32 color image — a photograph of a cat, say. You want to classify it. The approach from Lecture 4 was simple: stretch the image into a long vector of 32 × 32 × 3 = 3,072 numbers, then multiply by a weight matrix. That's a fully-connected layer.

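It works, technically. But it throws away something precious: the spatial structure of the image. Pixel (0, 0) ends up next to pixel (0, 1) in the vector, sure, but pixel (1, 0), its neighbor directly below, lands a full row of entries away. And the layer ignores the ordering anyway: every input connects to every output unit through its own independent weight, so pixel (0, 0) is treated no differently relative to pixel (0, 1) than relative to pixel (31, 31) in the opposite corner. The network has no idea that nearby pixels are related.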

The Core Problem

A fully-connected layer connecting a 32×32×3 image to 100 hidden units needs 3,072 × 100 = 307,200 parameters — just for one layer. For a 224×224 image (standard ImageNet size), that explodes to 150,528 × 100 = 15 million parameters. Most of these parameters are wasted learning that distant pixels rarely interact.

There's a deeper problem too. A fully-connected layer learns one giant template per output unit: a single 32×32×3 pattern that it matches against the whole image. If the cat moves from the left side to the right side, a new template is needed. The network can't reuse what it learned about "cat ears" in one position for another position.

Definition
Translation Equivariance

An operation is translation equivariant if shifting the input shifts the output by the same amount. If you move a cat 10 pixels right, the "cat detector" output should also shift 10 pixels right. Fully-connected layers don't have this property. Convolutions do.
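
The definition can be checked numerically. The following is a minimal 1D sketch, assuming NumPy; the signal, the 3-tap filter, and the helper name `slide` are illustrative choices, not part of the lecture.

```python
import numpy as np

# Hypothetical 1D "image": a small bright bump on a dark background.
x = np.zeros(12)
x[3:5] = 1.0

# Hypothetical 3-tap edge filter, applied as a sliding dot product.
k = np.array([1.0, 0.0, -1.0])

def slide(signal, kernel):
    # np.correlate in "valid" mode slides the kernel without flipping it.
    return np.correlate(signal, kernel, mode="valid")

shift = 2
y = slide(x, k)
y_from_shifted_input = slide(np.roll(x, shift), k)

# Equivariance: filtering the shifted input equals shifting the filtered output.
# np.roll wraps around, which is harmless here because the bump stays away from the ends.
print(np.array_equal(y_from_shifted_input, np.roll(y, shift)))  # True
```

A fully-connected layer with arbitrary weights would generally fail this check, because each output unit has its own weights for each input position.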

What We Want Instead

Three properties would make image processing dramatically more efficient:

1. Local connectivity. Each output unit should only look at a small patch of the input (say 5×5), not the entire image. Nearby pixels matter most.

2. Weight sharing. The same small filter should be applied at every position. If it detects a vertical edge at position (3, 7), it should detect the same edge at position (20, 15).

3. Spatial preservation. The output should still be a 2D grid, preserving the spatial relationships of the input.

Parameter Savings

A 5×5 filter on a 3-channel image has 5 × 5 × 3 + 1 = 76 parameters. It slides across the entire 32×32 image, producing a 28×28 output map. Versus a fully-connected layer connecting 3,072 inputs to one output: 3,073 parameters — 40× more, for a single output pixel. And the conv filter works at every position, so it's really 3,073 × 784 = 2.4M equivalent parameters if we wanted the FC layer to produce the same-sized output.
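
The arithmetic in this comparison can be reproduced in a few lines. This is just a sketch of the counting; the variable names are illustrative.

```python
# Shapes taken from the text: a 32x32 RGB image and one 5x5 filter.
H, W, C = 32, 32, 3
K = 5

conv_filter_params = K * K * C + 1      # filter weights + one bias
out_side = H - K + 1                    # no padding, stride 1
fc_params_one_output = H * W * C + 1    # one FC unit: weights + one bias
fc_params_same_output = fc_params_one_output * out_side * out_side

print(conv_filter_params)      # 76
print(out_side)                # 28
print(fc_params_one_output)    # 3073
print(fc_params_same_output)   # 2409232
```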

This is the idea behind convolutional layers: small learned filters that slide across the image, detecting local patterns everywhere. One filter might learn to detect horizontal edges. Another detects blobs of color. Another detects corners. Stack enough of these, and you build a hierarchy from edges to textures to parts to whole objects.

The Biological Inspiration

Hubel and Wiesel (1959) discovered that neurons in the cat's visual cortex respond to small local regions of the visual field — receptive fields. Some neurons detect edges at specific orientations. Deeper neurons respond to more complex patterns. The visual cortex is organized hierarchically, just like a CNN. Fukushima's Neocognitron (1980) formalized this into the first convolutional architecture.

Chapter 02

The Convolution Operation

Here's the core operation. You have an input image (or feature map) and a small filter (also called a kernel). The filter slides across the image, one position at a time. At each position, you compute the dot product between the filter and the patch of the image it covers. That dot product becomes one pixel in the output.

Definition
2D Convolution (Discrete)

Given an input I of size H × W and a filter K of size K_H × K_W, the output at position (i, j) is:

Convolution Formula $O(i, j) = \sum_{m=0}^{K_H-1} \sum_{n=0}^{K_W-1} I(i+m,\, j+n) \cdot K(m, n) + b$
Where b is a scalar bias term added to every output position.

In words: place the filter's top-left corner at position (i, j). Multiply each filter element by the corresponding image element. Sum all the products. Add the bias. That sum is the output at (i, j).
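
The verbal recipe above translates almost directly into loops. This is a minimal sketch assuming NumPy, no padding, and stride 1; the function name `conv2d_naive` and the toy input are illustrative, and real frameworks use far faster implementations.

```python
import numpy as np

def conv2d_naive(image, kernel, bias=0.0):
    """Slide `kernel` over `image` (no padding, stride 1) and take dot products."""
    H, W = image.shape
    KH, KW = kernel.shape
    out = np.zeros((H - KH + 1, W - KW + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            patch = image[i:i + KH, j:j + KW]   # filter's top-left corner at (i, j)
            out[i, j] = np.sum(patch * kernel) + bias
    return out

# Tiny illustrative input: a 4x4 ramp and a 2x2 filter of ones.
image = np.arange(16, dtype=float).reshape(4, 4)
kernel = np.ones((2, 2))
result = conv2d_naive(image, kernel, bias=1.0)
print(result.shape)   # (3, 3)
print(result[0, 0])   # 11.0, i.e. 0 + 1 + 4 + 5 plus the bias of 1
```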

Edge Detection: A Concrete Example

Before learning anything, let's see convolution in action with hand-crafted filters. Consider this 5×5 grayscale image (0 = black, 255 = white):

Worked Example — Vertical Edge Detector

Input (5×5): a bright region on the left, dark on the right.

[[255, 255, 0, 0, 0],
 [255, 255, 0, 0, 0],
 [255, 255, 0, 0, 0],
 [255, 255, 0, 0, 0],
 [255, 255, 0, 0, 0]]

Filter (3×3) — the Sobel vertical edge detector:

[[-1, 0, 1],
 [-2, 0, 2],
 [-1, 0, 1]]

Position (0,0): 255×(-1) + 255×0 + 0×1 + 255×(-2) + 255×0 + 0×2 + 255×(-1) + 255×0 + 0×1 = -1020

Position (0,1): 255×(-1) + 0×0 + 0×1 + 255×(-2) + 0×0 + 0×2 + 255×(-1) + 0×0 + 0×1 = -1020

Position (0,2): 0×(-1) + 0×0 + 0×1 + 0×(-2) + 0×0 + 0×2 + 0×(-1) + 0×0 + 0×1 = 0

Large negative values appear exactly at the vertical edge between bright and dark. The filter detected it!
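
The three hand computations above can be checked with a vectorized sketch. It assumes NumPy 1.20 or later (for `sliding_window_view`); the variable names are illustrative.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

image = np.array([[255, 255, 0, 0, 0]] * 5)
sobel_x = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]])

# windows[i, j] is the 3x3 patch whose top-left corner sits at (i, j).
windows = sliding_window_view(image, (3, 3))
# Dot product of every patch with the filter (no flip, no bias).
output = np.einsum("ijmn,mn->ij", windows, sobel_x)

print(output.shape)        # (3, 3)
print(output[0].tolist())  # [-1020, -1020, 0]
```

Every row of the output is identical here, because every row of the input is identical.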

The key insight: different filters detect different features. A horizontal edge detector uses a different 3×3 pattern. A corner detector uses yet another. In a CNN, the network learns what filters to use — hundreds of them — by backpropagation.

[Interactive demo: Convolution Operation. A 3×3 filter slides across a 7×7 input, computing the dot product at each position.]

What Makes Convolution Special

Weight sharing: the same 3×3 filter (9 parameters) is used at every position. If it detects a vertical edge at the top-left, it detects the same edge at the bottom-right. Local connectivity: each output pixel depends on only a 3×3 patch of the input, not the entire image. These two properties are why CNNs are so parameter-efficient.

Convolution vs. Cross-Correlation

Mathematically, convolution flips the filter before sliding. In deep learning, we skip the flip and use cross-correlation instead. Every framework (PyTorch, TensorFlow) calls it "convolution" but implements cross-correlation. Since the filter weights are learned, flipping doesn't matter — the network just learns the already-flipped version.
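
The difference is easy to see in one dimension. The sketch below uses NumPy's `np.convolve` (which flips the kernel) and `np.correlate` (which does not); the signal and kernel values are arbitrary illustrations.

```python
import numpy as np

signal = np.array([1, 2, 3, 4])
kernel = np.array([1, 0, -1])

true_conv = np.convolve(signal, kernel, mode="valid")            # flips the kernel
cross_corr = np.correlate(signal, kernel, mode="valid")          # no flip
corr_flipped = np.correlate(signal, kernel[::-1], mode="valid")  # flip by hand

print(true_conv.tolist())     # [2, 2]
print(cross_corr.tolist())    # [-2, -2]
print(corr_flipped.tolist())  # [2, 2]
```

Cross-correlating with the flipped kernel reproduces the true convolution, which is why a learned filter can absorb the flip.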

Chapter 03

Stride, Padding & Output Size

So far the filter moves one pixel at a time. But we have two knobs to turn: how far the filter jumps (stride) and whether we pad the edges of the input (padding).

Definition
Stride

The number of pixels the filter moves between positions. Stride 1 means slide one pixel at a time (the default). Stride 2 means skip every other position, producing an output roughly half the size. Stride is a hyperparameter — not learned.

Definition
Padding

Extra pixels (usually zeros) added around the border of the input before convolution. Without padding, the output is always smaller than the input. With the right amount of padding, the output can be the same size.

The Output Size Formula

This is the single most important formula in CNN design. You'll use it constantly.

Output Size Formula O = ⌊(W − K + 2P) / S⌋ + 1

W = input size    K = kernel (filter) size    P = padding    S = stride

Let's see where this comes from. Without padding or stride, a filter of size K on an input of size W can be placed at positions 0, 1, 2, ..., W − K. That's W − K + 1 positions. Adding P zeros on each side makes the effective input size W + 2P, so we get (W + 2P) − K + 1 = W − K + 2P + 1 positions. With stride S, we only use every S-th position, giving ⌊(W − K + 2P) / S⌋ + 1.

Worked Example — Output Size Calculations

Case 1: W = 7, K = 3, P = 0, S = 1. Output = (7 − 3 + 0)/1 + 1 = 5.

Case 2: W = 7, K = 3, P = 1, S = 1. Output = (7 − 3 + 2)/1 + 1 = 7. Same size! This is "same" padding.

Case 3: W = 7, K = 3, P = 0, S = 2. Output = (7 − 3 + 0)/2 + 1 = 3. Downsampled by ~2×.

Case 4: W = 32, K = 5, P = 2, S = 1. Output = (32 − 5 + 4)/1 + 1 = 32. Same size with K=5.
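
The four cases can be reproduced with a small helper. This is a sketch of the formula only; the function name `conv_output_size` is an illustrative choice.

```python
def conv_output_size(w, k, p=0, s=1):
    """Output size along one spatial dimension: floor((W - K + 2P) / S) + 1."""
    return (w - k + 2 * p) // s + 1

print(conv_output_size(7, 3, p=0, s=1))   # 5
print(conv_output_size(7, 3, p=1, s=1))   # 7
print(conv_output_size(7, 3, p=0, s=2))   # 3
print(conv_output_size(32, 5, p=2, s=1))  # 32
```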

Common Padding Strategies

Name | Rule | Effect
Valid (no padding) | P = 0 | Output shrinks by K−1 pixels per layer
Same padding | P = (K − 1) / 2 | Output same size as input (when S = 1)
Full padding | P = K − 1 | Output larger than input (rarely used in practice)

Same padding is by far the most common. With K = 3, use P = 1. With K = 5, use P = 2. This keeps the spatial dimensions constant through the conv layer, and you control downsampling explicitly with stride or pooling.

[Interactive demo: Stride & Padding Explorer. Adjust stride and padding to see how the output size changes.]

The "Same" Padding Trick

For any odd kernel size K, setting P = (K − 1)/2 with stride 1 gives output = input size. This is why almost all modern CNNs use odd-sized filters (3×3, 5×5, 7×7). Even-sized filters (2×2, 4×4) make "same" padding asymmetric and awkward.

Invalid Configurations

If (W − K + 2P) is not evenly divisible by S, the division (W − K + 2P) / S is not an integer. Most frameworks handle this by truncating (the floor in the output size formula), which means the last few rows and columns of the padded input are never covered by the filter. Best practice: choose hyperparameters so the division is exact.

Chapter 04

Multiple Filters & Channels

So far we've been convolving a single 2D filter over a single-channel (grayscale) input. Real images have 3 channels (RGB), and real conv layers use many filters. Let's handle both.

Multi-Channel Input

A color image has shape Cin × H × W (e.g., 3 × 32 × 32). The filter must match the depth of the input: it's Cin × KH × KW (e.g., 3 × 5 × 5). At each spatial position, you compute a dot product across all channels simultaneously — a 3 × 5 × 5 = 75-dimensional dot product — to produce a single number.

Multi-Channel Convolution $O(i, j) = \sum_{c=0}^{C_{in}-1} \sum_{m=0}^{K-1} \sum_{n=0}^{K-1} I(c,\, i+m,\, j+n) \cdot K(c, m, n) + b$
The filter extends the full depth of the input. One filter → one 2D output map.

Multiple Filters = Multiple Feature Maps

One filter detects one kind of feature (say, vertical edges). But we want to detect many features. So we use Cout different filters, each producing its own 2D output. Stack them, and the output is a 3D volume: Cout × H' × W'.
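
A multi-channel, multi-filter layer can be sketched compactly. The code assumes NumPy 1.20 or later, stride 1, and zero padding; the function name `conv_layer` and the random inputs are illustrative, not a framework API.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def conv_layer(x, weights, bias, padding=0):
    """x: (C_in, H, W); weights: (C_out, C_in, K, K); bias: (C_out,). Stride 1."""
    x = np.pad(x, ((0, 0), (padding, padding), (padding, padding)))
    k = weights.shape[-1]
    # windows: (C_in, H', W', K, K), one KxK patch per channel and position.
    windows = sliding_window_view(x, (k, k), axis=(1, 2))
    # Sum over input channels and both kernel axes for every filter o.
    out = np.einsum("chwmn,ocmn->ohw", windows, weights)
    return out + bias[:, None, None]

rng = np.random.default_rng(0)
x = rng.standard_normal((3, 32, 32))          # RGB-shaped input
weights = rng.standard_normal((10, 3, 5, 5))  # 10 filters, each 3 x 5 x 5
bias = np.zeros(10)
out = conv_layer(x, weights, bias, padding=2)
print(out.shape)  # (10, 32, 32)
```

Each of the 10 output maps comes from one filter that spans all 3 input channels.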

Worked Example — Parameter Count

Input: 3 × 32 × 32 (RGB image)

Layer: 10 filters, each 5×5, stride 1, padding 2

Each filter shape: 3 × 5 × 5 = 75 weights + 1 bias = 76 parameters

Total parameters: 10 × 76 = 760

Output size: (32 − 5 + 4)/1 + 1 = 32. So output is 10 × 32 × 32.

Compare: An FC layer from 3,072 to 10,240 (same output size) would need 31+ million parameters. Conv uses 760. That's a 41,000× reduction.

Compute Cost

Each output pixel requires one dot product of size Cin × K × K. The total number of multiply-add operations:

FLOPs for One Conv Layer FLOPs = Cout × H' × W' × Cin × K × K
For our example: 10 × 32 × 32 × 3 × 5 × 5 = 768,000 multiply-adds.
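
Both counts for the running example can be computed with one helper. This is a sketch of the formulas in this chapter; the function name `conv_layer_cost` is illustrative.

```python
def conv_layer_cost(c_in, c_out, k, h_out, w_out):
    """Return (parameters, multiply-adds) for a KxK conv layer with biases."""
    params = c_out * (c_in * k * k + 1)
    macs = c_out * h_out * w_out * c_in * k * k
    return params, macs

params, macs = conv_layer_cost(c_in=3, c_out=10, k=5, h_out=32, w_out=32)
print(params)  # 760
print(macs)    # 768000

# Fully-connected layer with the same input and output sizes, biases included.
fc_params = (3 * 32 * 32) * (10 * 32 * 32) + 10 * 32 * 32
print(fc_params)  # 31467520
```
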
Definition
Feature Map (Activation Map)

The 2D output produced by one filter. If a layer has Cout filters, it produces Cout feature maps. Each map is a spatial grid where bright regions indicate where that filter's pattern was detected. At each spatial position, you can also think of the Cout values as a Cout-dimensional feature vector describing that location.

Summary Table

Quantity | Formula | Example (3×32×32, 10 filters 5×5, P=2, S=1)
Filter shape | Cout × Cin × K × K | 10 × 3 × 5 × 5
Bias shape | Cout | 10
Parameters | Cout(Cin × K × K + 1) | 760
Output shape | Cout × H' × W' | 10 × 32 × 32
FLOPs | Cout × H' × W' × Cin × K × K | 768,000

Thinking About Batches

In practice, inputs come in batches: N × Cin × H × W. The conv layer applies the same filters to every image in the batch independently, producing N × Cout × H' × W'. The filter weights are shared across all images in the batch — another level of weight sharing.

Chapter 05

Pooling Layers

Conv layers can downsample using stride > 1. But there's another way to reduce spatial dimensions: pooling. Unlike conv layers, pooling has no learnable parameters — it just applies a fixed operation.

Definition
Pooling Layer

A pooling layer slides a window over each feature map independently and replaces each window with a single summary value. The most common: max pooling (take the maximum value in the window). Also used: average pooling (take the mean).

Max Pooling $O(i, j) = \max_{(m, n) \in \text{window}} I(i \cdot S + m,\; j \cdot S + n)$
Typical setting: 2×2 window with stride 2 → halves H and W.
Worked Example — 2×2 Max Pooling

Input (4×4 feature map):

[[1, 1, 2, 4],
 [5, 6, 7, 8],
 [3, 2, 1, 0],
 [1, 2, 3, 4]]

2×2 max pool, stride 2:

Top-left 2×2: max(1,1,5,6) = 6

Top-right 2×2: max(2,4,7,8) = 8

Bottom-left 2×2: max(3,2,1,2) = 3

Bottom-right 2×2: max(1,0,3,4) = 4

Output: [[6, 8], [3, 4]] — from 4×4 down to 2×2.
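
The same result falls out of a reshape trick. This sketch assumes NumPy and a single feature map with even height and width; the function name `max_pool_2x2` is illustrative.

```python
import numpy as np

def max_pool_2x2(x):
    """2x2 max pooling with stride 2 on a single (H, W) map; H and W must be even."""
    h, w = x.shape
    # Split each axis into (blocks, 2) and take the max inside every 2x2 block.
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

x = np.array([[1, 1, 2, 4],
              [5, 6, 7, 8],
              [3, 2, 1, 0],
              [1, 2, 3, 4]])
print(max_pool_2x2(x).tolist())  # [[6, 8], [3, 4]]
```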

[Interactive demo: Max Pooling Visualizer. Each 2×2 window picks the maximum value of the feature map it covers.]

Why Pool?

1. Reduce computation. A 2×2 pool with stride 2 halves both H and W, cutting the number of activations (and thus the compute for the next layer) by 4×.

2. Translation invariance. If a feature (like an edge) shifts by one pixel but stays inside the same pooling window, the max pool output stays the same, because the max value is still captured. This small amount of invariance helps the network focus on "what" features are present, not exactly "where" they are.

3. Increase receptive field. After pooling, each pixel in the next layer effectively "sees" a larger region of the original input.

Pooling is Applied Per-Channel

A pooling layer operates on each feature map independently. If the input has 64 channels, the output also has 64 channels — each one downsampled separately. The number of channels never changes through pooling. Only H and W shrink.

Global Average Pooling

A special case: the window covers the entire feature map. If the input is C × 7 × 7, global average pooling produces a C × 1 × 1 vector (or equivalently, a C-dimensional vector). This is often used just before the final classification layer, replacing a large fully-connected layer.
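
In array terms, global average pooling is a mean over the two spatial axes. A minimal sketch, assuming NumPy and a made-up 2-channel 7×7 feature map:

```python
import numpy as np

# Hypothetical feature map with C = 2 channels and 7x7 spatial size.
features = np.arange(2 * 7 * 7, dtype=float).reshape(2, 7, 7)

pooled = features.mean(axis=(1, 2))   # average over H and W, keep channels
print(pooled.shape)     # (2,)
print(pooled.tolist())  # [24.0, 73.0]
```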

Pooling Output Size H' = (H − K) / S + 1      W' = (W − K) / S + 1
No padding term — pooling rarely uses padding. No learnable parameters.

Pooling Type | Operation | Learnable? | Use Case
Max Pool | Maximum in window | No | Standard downsampling
Average Pool | Mean of window | No | Smoother downsampling
Global Average Pool | Mean of entire map | No | Replace FC before classifier
Strided Conv | Conv with S > 1 | Yes | Modern alternative to pooling

The Trend Away from Pooling

Modern architectures increasingly replace pooling with strided convolutions (stride 2). The argument: if we're going to downsample, we might as well learn how to downsample, rather than using a fixed max/average rule. ResNets and many modern CNNs use strided conv for downsampling.

Chapter 06

Full CNN Architecture

Now we have all the building blocks. A CNN stacks them in a specific pattern:

Classic CNN Architecture Pattern
  1. Input: Cin × H × W image (e.g., 3 × 32 × 32)
  2. Conv Block (repeat N times): Convolution → ReLU → (optional) Pooling
  3. Flatten: Reshape 3D feature volume into 1D vector
  4. FC Layers (repeat K times): Fully-Connected → ReLU
  5. Output: FC to C classes → Softmax

The historical notation: [(CONV-RELU)*N - POOL?]*M - (FC-RELU)*K - SOFTMAX where N is usually 1–5 conv-relu pairs between each pool, M is the number of pool stages (typically 3–5), and K is 0–2 FC layers at the end.

The Feature Hierarchy

Something remarkable happens as you stack more layers. The first conv layer learns to detect edges and color contrasts — simple local patterns. The second layer combines those edges into textures and corners. The third detects parts (eyes, wheels, windows). Deeper layers detect entire objects or scenes.

The Hierarchy

Layer 1: Edges, color gradients. Layer 2: Corners, textures, simple shapes. Layer 3: Object parts (eyes, ears, wheels). Layer 4+: Whole objects, scenes. This hierarchy emerges automatically from training — nobody programs it. It mirrors the hierarchical organization of the primate visual cortex.

Worked Example — A Simple CNN

Input: 3 × 32 × 32

Conv1: 6 filters 5×5, P=0, S=1. Output: 6 × 28 × 28. Params: 6 × (3×5×5+1) = 456.

Pool1: 2×2 max, S=2. Output: 6 × 14 × 14. Params: 0.

Conv2: 16 filters 5×5, P=0, S=1. Output: 16 × 10 × 10. Params: 16 × (6×5×5+1) = 2,416.

Pool2: 2×2 max, S=2. Output: 16 × 5 × 5 = 400 values.

Flatten: 400-dim vector.

FC1: 400 → 120. Params: 400×120 + 120 = 48,120.

FC2: 120 → 84. Params: 120×84 + 84 = 10,164.

FC3: 84 → 10 (classes). Params: 84×10 + 10 = 850.

Total: 62,006 parameters. (This is essentially LeNet-5, circa 1998.)
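
The per-layer counts above can be tallied programmatically. This is a sketch of the bookkeeping only; the helper names `conv_params` and `fc_params` are illustrative.

```python
def conv_params(c_in, c_out, k):
    return c_out * (c_in * k * k + 1)   # weights + one bias per filter

def fc_params(n_in, n_out):
    return n_in * n_out + n_out         # weights + one bias per output

layers = {
    "conv1": conv_params(3, 6, 5),
    "conv2": conv_params(6, 16, 5),
    "fc1": fc_params(16 * 5 * 5, 120),
    "fc2": fc_params(120, 84),
    "fc3": fc_params(84, 10),
}
total = sum(layers.values())
print(layers)  # {'conv1': 456, 'conv2': 2416, 'fc1': 48120, 'fc2': 10164, 'fc3': 850}
print(total)   # 62006
```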

Where the Parameters Live

In the LeNet example above, the conv layers have 456 + 2,416 = 2,872 parameters (4.6% of total). The FC layers have 48,120 + 10,164 + 850 = 59,134 parameters (95.4%). FC layers dominate the parameter count. This is why modern architectures try to minimize or eliminate FC layers — using global average pooling directly before the final classifier.

Spatial Dimensions Through a CNN

Watch what happens to the spatial dimensions and channel count as data flows through:

Layer | Output Shape | Spatial | Channels
Input | 3 × 32 × 32 | Large (32 × 32) | Few (3)
Conv + Pool | 32 × 16 × 16 | 16 × 16 | 32
Conv + Pool | 64 × 8 × 8 | 8 × 8 | 64
Conv + Pool | 128 × 4 × 4 | 4 × 4 | 128
Conv + Pool | 256 × 2 × 2 | Small (2 × 2) | Many (256)

The pattern is universal: spatial dimensions shrink, channel dimensions grow. Early layers have large spatial maps with few channels (capturing where). Deep layers have small spatial maps with many channels (capturing what). This is the fundamental trade-off of CNNs.

The Funnel Metaphor

Think of a CNN as a funnel that compresses spatial information (32×32 → 1×1) while expanding semantic information (3 channels → 256 channels). The input tells you the color of every pixel. The output tells you what object is there. The conv layers gradually trade "where" for "what."

Chapter 07

Receptive Field

When you convolve a 3×3 filter over the input, each output pixel depends on a 3×3 patch of the input. Now stack a second 3×3 conv. Each output pixel of the second layer depends on a 3×3 patch of the first layer's output — and each of those depends on a 3×3 patch of the input. So each second-layer output pixel effectively "sees" a 5×5 region of the original input.

Definition
Receptive Field

The region of the original input that influences a particular output neuron. For the first conv layer, it's just the filter size. For deeper layers, it grows with each layer because each neuron's input depends on a region that itself depends on a region.

Receptive Field with Stride 1 RF = 1 + L × (K − 1)
Where L = number of conv layers, K = kernel size (all same). With L=2, K=3: RF = 1 + 2×2 = 5.
Worked Example — Receptive Field Growth

All 3×3 filters, stride 1, same padding:

After 1 layer: RF = 1 + 1×2 = 3×3

After 2 layers: RF = 1 + 2×2 = 5×5

After 3 layers: RF = 1 + 3×2 = 7×7

After 5 layers: RF = 1 + 5×2 = 11×11

With a 224×224 input, you'd need ⌈(224 − 1)/2⌉ = 112 layers of 3×3 conv for a single neuron to "see" the whole image. That's a lot of layers!
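
The stride-1 formula and the layer count for a 224-pixel input can be checked with a short sketch; the function names are illustrative.

```python
import math

def receptive_field(num_layers, k):
    """Receptive field of a stack of stride-1 KxK convs: 1 + L * (K - 1)."""
    return 1 + num_layers * (k - 1)

def layers_needed(input_size, k):
    """Smallest L such that the receptive field covers `input_size` pixels."""
    return math.ceil((input_size - 1) / (k - 1))

print([receptive_field(L, 3) for L in (1, 2, 3, 5)])  # [3, 5, 7, 11]
print(layers_needed(224, 3))                          # 112
```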

Why Receptive Field Matters

For image classification, the final output needs to consider the entire image. If the receptive field of the final layer neurons doesn't cover the whole input, the network is making decisions based on only a portion of the image — it might miss the cat's tail on the other side.

Growing the Receptive Field Faster

Three strategies: (1) Larger filters — 5×5 or 7×7 instead of 3×3 (more parameters). (2) Strided convolutions or pooling — downsampling before the next conv effectively magnifies each pixel's receptive field (the most common approach). (3) Dilated convolutions — skip pixels in the filter pattern, expanding the receptive field without extra parameters or downsampling.

The General Formula

When layers have different strides, the receptive field calculation becomes recursive. For layer ℓ with kernel K and stride S:

General Receptive Field (Recursive) $RF_\ell = RF_{\ell-1} + (K - 1) \times \prod_{i=1}^{\ell-1} S_i$
$RF_0 = 1$. Stride in previous layers multiplies the RF growth of later layers.
Worked Example — With Stride

Layer 1: K=3, S=1. Layer 2: K=3, S=2. Layer 3: K=3, S=1.

RF1 = 1 + (3−1)×1 = 3

RF2 = 3 + (3−1)×1 = 5

RF3 = 5 + (3−1)×(1×2) = 5 + 4 = 9

The stride-2 layer at layer 2 made layer 3's growth count double. Stride multiplies all future RF growth.
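
The recursion is short enough to write as a loop. This sketch tracks the running product of earlier strides in a variable named `jump`, which is an illustrative name, not notation from the lecture.

```python
def receptive_field(layers):
    """layers: list of (kernel, stride) pairs, first layer first."""
    rf = 1      # RF_0: a single input pixel
    jump = 1    # product of the strides of all earlier layers
    for kernel, stride in layers:
        rf += (kernel - 1) * jump
        jump *= stride
    return rf

print(receptive_field([(3, 1)]))                  # 3
print(receptive_field([(3, 1), (3, 2)]))          # 5
print(receptive_field([(3, 1), (3, 2), (3, 1)]))  # 9
```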

Why Stride Multiplies Growth

After a stride-2 layer, neighboring output pixels sit 2 input pixels apart. So when the next 3×3 filter covers 3 adjacent output pixels, their positions span (3 − 1) × 2 + 1 = 5 input positions instead of 3, and the receptive field grows by (3 − 1) × 2 = 4 instead of 2. The stride-2 acts like a magnifying glass: everything after it operates in a "zoomed out" coordinate system where each step covers more ground.

Chapter 08

Advanced Convolutions

Standard convolution works well, but researchers have developed several important variants that trade off parameters, compute, and expressiveness in clever ways.

1×1 Convolutions

At first, a 1×1 filter sounds pointless — it covers just a single pixel. But remember: filters extend across all input channels. A 1×1 conv on a 64-channel input computes a 64-dimensional dot product at each spatial position. It's a per-pixel fully-connected layer across channels.

Definition
1×1 Convolution (Pointwise Convolution)

A convolution with K = 1. It doesn't mix spatial information — it mixes channels. If the input has Cin channels and you use Cout 1×1 filters, it's equivalent to applying a Cin → Cout linear projection independently at each spatial position. Parameters: Cout × Cin + Cout.
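
The equivalence with a per-pixel linear layer can be verified numerically. This is a sketch assuming NumPy; the 64-channel 8×8 input and the 16 filters are arbitrary illustrative sizes.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((64, 8, 8))    # hypothetical C_in x H x W feature map
w = rng.standard_normal((16, 64))      # 16 filters of shape 1x1 over 64 channels
b = rng.standard_normal(16)

# 1x1 convolution: a 64-dimensional dot product at every spatial position.
conv_out = np.einsum("oc,chw->ohw", w, x) + b[:, None, None]

# The same computation as a linear layer applied to each pixel's channel vector.
pixels = x.reshape(64, -1).T                          # (H*W, C_in)
linear_out = (pixels @ w.T + b).T.reshape(16, 8, 8)

print(conv_out.shape)                     # (16, 8, 8)
print(np.allclose(conv_out, linear_out))  # True
print(w.size + b.size)                    # 1040
```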

Worked Example — Dimension Reduction with 1×1 Conv

Input: 256 × 56 × 56 (256 channels, 56×56 spatial).

1×1 conv with 64 filters: Output = 64 × 56 × 56. Parameters: 64 × 256 + 64 = 16,448.

We reduced channels from 256 to 64 — a 4× reduction — without touching the spatial dimensions. This is a bottleneck: it compresses the channel dimension before an expensive 3×3 conv, saving massive compute.

Without the bottleneck, a 3×3 conv from 256→256 channels costs 256 × 256 × 9 = 589,824 params. With a 1×1 bottleneck: 256→64 (16,448) + 3×3 64→64 (36,928) + 1×1 64→256 (16,640) = 70,016 params. An 8.4× reduction.
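
The bottleneck arithmetic can be reproduced as follows. Like the text, this sketch counts the direct 3×3 conv without biases and the three bottleneck layers with biases; the helper name `conv_params` is illustrative.

```python
def conv_params(c_in, c_out, k, bias=True):
    return c_out * c_in * k * k + (c_out if bias else 0)

direct = conv_params(256, 256, 3, bias=False)   # as counted in the text
bottleneck = (
    conv_params(256, 64, 1)     # 1x1 reduce
    + conv_params(64, 64, 3)    # 3x3 at reduced width
    + conv_params(64, 256, 1)   # 1x1 expand
)
print(direct)                         # 589824
print(bottleneck)                     # 70016
print(round(direct / bottleneck, 1))  # 8.4
```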

Grouped Convolution

Standard convolution: every filter sees all Cin input channels. Grouped convolution splits the input channels into G groups, and each filter only sees Cin/G channels from its group.

Grouped Convolution Parameters Standard: Cout × Cin × K × K
Grouped (G groups): Cout × (Cin/G) × K × K
Parameter and compute reduction: G×. Each group processes independently.
Worked Example — Groups

Standard conv: Cin=64, Cout=64, K=3. Params: 64 × 64 × 3 × 3 = 36,864.

Grouped conv (G=4): 4 groups, each: 16 input channels, 16 output channels. Params per group: 16 × 16 × 3 × 3 = 2,304. Total: 4 × 2,304 = 9,216. A 4× reduction.

Fun fact: AlexNet (2012) used G=2 because the model had to be split across two GPUs, each handling half the channels. An engineering hack that turned out to be a useful architectural principle!
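
The grouped parameter count in this example can be checked with a one-line formula. The sketch counts weights only, as the example does; the function name `grouped_conv_params` is illustrative.

```python
def grouped_conv_params(c_in, c_out, k, groups=1):
    """Weight count (no biases) for a grouped KxK convolution."""
    assert c_in % groups == 0 and c_out % groups == 0
    return c_out * (c_in // groups) * k * k

print(grouped_conv_params(64, 64, 3, groups=1))  # 36864
print(grouped_conv_params(64, 64, 3, groups=4))  # 9216
```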

Depthwise Separable Convolution

Take grouped convolution to the extreme: G = Cin. Each filter operates on a single channel independently. This is called a depthwise convolution. It captures spatial patterns within each channel but doesn't mix information across channels.

To mix channels, follow the depthwise conv with a 1×1 pointwise convolution. The combination — depthwise + pointwise — is a depthwise separable convolution.

Depthwise Separable Convolution Depthwise params: Cin × K × K
Pointwise params: Cout × Cin
Total: Cin × K × K + Cout × Cin

Reduction vs standard: (Cin × K × K + Cout × Cin) / (Cout × Cin × K × K) = 1/Cout + 1/K²
Worked Example — Separable vs Standard

Standard: Cin=64, Cout=64, K=3. Params: 64 × 64 × 9 = 36,864.

Depthwise separable: Depthwise: 64 × 9 = 576. Pointwise: 64 × 64 = 4,096. Total: 4,672.

Reduction: 36,864 / 4,672 ≈ 7.9× fewer parameters. Nearly 8× more efficient!

MobileNet (Howard et al., 2017) used depthwise separable convolutions throughout, achieving competitive accuracy with far fewer parameters — small enough to run on phones.
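
The separable-versus-standard comparison can be sketched in a few lines, including a check that the cost ratio equals 1/Cout + 1/K². Biases are ignored, as in the example; the function name `separable_params` is illustrative.

```python
def separable_params(c_in, c_out, k):
    depthwise = c_in * k * k      # one KxK filter per input channel
    pointwise = c_out * c_in      # 1x1 conv that mixes channels
    return depthwise + pointwise

c_in, c_out, k = 64, 64, 3
standard = c_out * c_in * k * k
separable = separable_params(c_in, c_out, k)
print(standard, separable)             # 36864 4672
print(round(standard / separable, 2))  # 7.89

# The ratio separable / standard equals 1/C_out + 1/K^2.
print(abs(separable / standard - (1 / c_out + 1 / k**2)) < 1e-12)  # True
```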

The Design Pattern

Modern efficient CNNs follow a common recipe: 1×1 pointwise (expand/compress channels) + depthwise K×K (spatial filtering per channel) + 1×1 pointwise (project back). This "inverted bottleneck" pattern appears in MobileNetV2, EfficientNet, and ConvNeXt. It separates "what to mix across channels" from "how to process spatially" — a factorization that's remarkably efficient.

Conv Type | Params (C=64, K=3) | Relative Cost | Used In
Standard | 36,864 | 1.0× | VGG, early layers
Grouped (G=4) | 9,216 | 0.25× | AlexNet, ResNeXt
Depthwise Sep. | 4,672 | 0.13× | MobileNet, EfficientNet
1×1 only | 4,096 | 0.11× | NiN, Inception bottlenecks

Chapter 09

Convolution Visualizer

Time to see the full pipeline in action. This interactive visualizer shows an input image flowing through a conv layer, ReLU activation, and max pooling. Watch how each stage transforms the data.

Select different filter types to see how each one detects different features. The ReLU clips negative values to zero. The pooling downsamples by taking the max in each 2×2 window.

[Interactive demo: Full CNN Pipeline: Conv → ReLU → Pool. Activations propagate through the stages for a chosen filter and input pattern; colors represent activation magnitude: bright = high, dark = low.]

What to Notice

Vertical edge filter on vertical edge input: strong activations along the edge. On horizontal edge input: almost no response — the filter and the pattern are orthogonal. After ReLU: all negative activations become zero (dark cells disappear). After pooling: the spatial size halves but the strongest features survive. This is how a CNN builds invariance while preserving the important stuff.

Chapter 10

Summary & Connections

We've built a complete understanding of convolutional neural networks for image classification. Here's the whole picture in one view.

Key Formulas

Formula | Purpose
O = ⌊(W − K + 2P) / S⌋ + 1 | Output spatial size
Params = Cout(Cin × K² + 1) | Conv layer parameters
FLOPs = Cout × H' × W' × Cin × K² | Compute cost
RF = 1 + L(K − 1) | Receptive field (uniform stride=1)
$RF_\ell = RF_{\ell-1} + (K - 1) \prod_{i=1}^{\ell-1} S_i$ | General receptive field

The Building Blocks

Component | What It Does | Learnable?
Convolution | Detects local spatial patterns; weight sharing | Yes (filters + bias)
ReLU | Introduces nonlinearity; clips negatives to 0 | No
Pooling | Downsamples spatial dimensions; adds invariance | No
Fully Connected | Global reasoning; classification at the end | Yes (weights + bias)
1×1 Conv | Channel mixing / dimension reduction | Yes
Depthwise Sep. | Efficient spatial + channel factorization | Yes

What Comes Next

This lecture covered the primitives of CNNs. The next lectures build on them:

Lecture 6 — CNN Architectures: How to arrange these building blocks. AlexNet, VGG, GoogLeNet (Inception), ResNet, and the evolution of modern architectures. The key innovation: skip connections and very deep networks.

Lecture 8 — Vision Transformers: Starting around 2021, transformers began replacing CNNs for vision tasks. The core idea: treat image patches as tokens and use self-attention instead of convolution. But even ViTs borrow from CNNs — patch embedding is essentially a strided convolution.

CNNs Are Not Dead

Despite the transformer revolution, convolutional layers remain fundamental. ConvNeXt (2022) showed that a pure CNN, modernized with transformer-era training techniques, matches Vision Transformer performance. Many hybrid architectures use conv layers in early stages and attention in later stages. The inductive biases of convolution — locality, translation equivariance, weight sharing — remain powerful, especially with limited data.

Key Concept
Translation Equivariance (Revisited)

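Convolution with stride 1 is translation equivariant (ignoring border effects): Conv(Translate(X)) = Translate(Conv(X)). If you shift the input, the output shifts by the same amount. Strided convolution and pooling keep this property only for shifts that are a multiple of the stride; for smaller shifts, pooling instead gives approximate invariance. This is the formal reason CNNs work for vision: features of images don't depend on their absolute position. A cat in the top-left corner has the same ears as a cat in the bottom-right.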

From LeCun's handwritten digit recognizer (1998) to AlexNet's ImageNet breakthrough (2012) to MobileNet on your phone — the convolution operation is one of the most impactful ideas in all of deep learning. You now understand exactly how it works, why it works, and where it's headed.