Stanford CS 231n · Lecture 6 · Training CNNs and CNN Architectures

Training CNNs & the Architecture Revolution

You know what a convolution does. But how do you actually train a 152-layer network — and why does adding more layers sometimes make things worse? This is the story of the tricks and architectures that solved it.

Data augmentation · Batch normalization · Residual connections · AlexNet → ConvNeXt

Chapter 01

The Gap Between Theory and Practice

You understand convolutions, pooling, fully-connected layers. You can draw a CNN on a whiteboard. So you stack some layers, write model.fit(), and wait. The training loss barely moves. Or it plummets on the training set and the model hallucinates on test images. What went wrong?

Nothing conceptual. The architecture is fine. The problem is that deep neural networks are remarkably finicky to train. A 20-layer plain network can outperform an 8-layer one — but a 56-layer network often performs worse than the 20-layer network, even on the training set. This isn't overfitting. This is an optimization failure: deeper networks are harder for gradient descent to navigate.

The Central Tension

Deeper networks have strictly more representational capacity — a 56-layer network can represent everything a 20-layer one can, by setting the extra layers to identity mappings. But gradient descent can't find these solutions. The gap between what a network can represent and what training can actually reach is the entire subject of this lecture.

This lecture covers two sides of the same coin: training tricks that help gradient descent do its job (augmentation, dropout, batch normalization), and architectural innovations that reshape the loss landscape to make optimization easier (VGG's small filters, Inception's parallel paths, ResNet's skip connections).

The ImageNet Story

From 2012 to 2017, the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) drove a revolution. Error rates dropped from 26% (hand-crafted features) to 16% (AlexNet, 8 layers) to 7.3% (VGG, 19 layers) to 6.7% (GoogLeNet, 22 layers) to 3.6% (ResNet, 152 layers) — surpassing human performance (~5%). Each leap came from a specific insight about training or architecture.

| Year | Model | Layers | Top-5 Error | Key Insight |
|---|---|---|---|---|
| 2012 | AlexNet | 8 | 16.4% | GPUs + ReLU + dropout |
| 2013 | ZFNet | 8 | 11.7% | Visualization + tuning |
| 2014 | VGG | 19 | 7.3% | Small 3×3 filters, depth |
| 2014 | GoogLeNet | 22 | 6.7% | Inception modules, 1×1 conv |
| 2015 | ResNet | 152 | 3.6% | Skip connections |
| 2017 | SENet | 152 | 2.3% | Channel attention |

Why Not Just Go Deeper?

He et al. (2015) showed that a 56-layer plain CNN has higher training error than a 20-layer one. Not higher test error — higher training error. The network isn't overfitting; it's failing to optimize at all. This is the degradation problem, and it motivated the most important architectural innovation since the convolutional layer itself: the residual connection.

Chapter 02

Data Augmentation

Your dataset has 50,000 images. Your model has 25 million parameters. This mismatch is a recipe for overfitting. One of the most effective remedies is embarrassingly simple: show the network slightly different versions of the same image every time.

Definition
Data Augmentation

Applying random, label-preserving transformations to training images before feeding them to the network. A flipped cat is still a cat. A slightly darker photo of a dog is still a dog. Each epoch, the network sees a different version of each image — effectively multiplying the dataset size without collecting new data.

The Standard Toolkit

Horizontal flip. Mirror the image left-to-right. This doubles your effective dataset. A bird facing left and a bird facing right should both be classified as "bird." This is the single most common augmentation and is almost always beneficial.

Random crop and scale. Resize the image so the short side is a random length (say 256–480 pixels), then extract a random 224×224 patch. This teaches the network to recognize objects at different scales and positions. ResNet's training recipe: pick random L ∈ [256, 480], resize, crop 224×224.

Color jitter. Randomly adjust brightness, contrast, and saturation. The lighting conditions in your training set shouldn't dictate what the network learns. A slightly blue-tinted car is still a car.

Cutout. Mask a random rectangular region of the image with zeros. This forces the network to rely on multiple parts of the object, not just one discriminative patch. Works especially well on small datasets like CIFAR-10.

Mixup. Blend two training images and their labels: $\bar{x} = \lambda x_i + (1-\lambda) x_j$, $\bar{y} = \lambda y_i + (1-\lambda) y_j$, where $\lambda \sim \mathrm{Beta}(0.2, 0.2)$. A 70/30 blend of "cat" and "dog" gets the label [0.7 cat, 0.3 dog]. This regularizes the network by training it on convex combinations of examples.

CutMix. Instead of blending pixel values (Mixup), cut a rectangular patch from one image and paste it onto another. The label is proportional to the area ratio. This preserves local image statistics better than Mixup's ghostly blends.
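A few of these transformations are easy to sketch in plain Python. The helpers below (`horizontal_flip`, `random_crop`, `mixup`) are illustrative names rather than any library's API, and the 4×4 "image" is a toy stand-in; real pipelines would operate on tensors.

```python
import random

def horizontal_flip(image):
    """Mirror a 2-D image (list of rows) left to right."""
    return [row[::-1] for row in image]

def random_crop(image, size, rng=random):
    """Cut a random size x size patch out of a 2-D image."""
    height, width = len(image), len(image[0])
    top = rng.randint(0, height - size)      # randint is inclusive on both ends
    left = rng.randint(0, width - size)
    return [row[left:left + size] for row in image[top:top + size]]

def mixup(x_i, y_i, x_j, y_j, lam):
    """Blend two flattened images and their one-hot labels with weight lam."""
    x = [lam * a + (1 - lam) * b for a, b in zip(x_i, x_j)]
    y = [lam * a + (1 - lam) * b for a, b in zip(y_i, y_j)]
    return x, y

# A 4x4 toy "image" whose pixel values encode their position.
image = [[4 * r + c for c in range(4)] for r in range(4)]
flipped = horizontal_flip(image)
patch = random_crop(image, size=2)

# In training, lam would be drawn at random, e.g. random.betavariate(0.2, 0.2).
blend_x, blend_y = mixup([1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0], lam=0.7)
print(flipped[0], patch, blend_y)
```

Note that the flip and crop leave the label untouched, while Mixup has to transform the label together with the pixels.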

The Regularization Pattern

All augmentations follow the same template: add randomness during training, remove it at test time. During training, each image is randomly transformed. At test time, you either use the original image or average predictions over a fixed set of crops (test-time augmentation). The randomness prevents the network from memorizing specific pixel patterns.


Worked Example — Test-Time Augmentation

ResNet at test time uses 10-crop evaluation: resize the image to 5 scales {224, 256, 384, 480, 640}. At each scale, take 4 corner crops + 1 center crop, plus their horizontal flips = 10 crops per scale × 5 scales = 50 forward passes. Average the predictions. This ensemble of crops improves accuracy by ~1% over a single center crop.

When Augmentation Hurts

Not all augmentations are label-preserving. Vertical flip makes sense for satellite images (no "up" or "down") but not for digit recognition (a flipped 6 becomes a 9). Always verify that your augmentations don't change the correct label.

Chapter 03

Regularization Revisited

Data augmentation is one form of regularization. But CNNs have other tricks in their regularization arsenal, all following the same principle: inject noise during training, average it out at test time.

Dropout

During each forward pass, randomly set each neuron's output to zero with probability p (typically p = 0.5 for FC layers). At test time, keep all neurons active but multiply outputs by (1−p) to compensate for the missing activations.

Definition
Dropout

A regularization technique that randomly "drops" (zeros out) neurons during training. Think of it two ways: (1) it forces redundant representations: no single neuron can be essential, because it might be dropped. (2) It trains an exponentially large ensemble of sub-networks that share parameters. An FC layer with 4096 units has $2^{4096}$ possible dropout masks, far more configurations than atoms in the universe.

Dropout
Training:   $\hat{h} = m \odot h$,   $m_i \sim \mathrm{Bernoulli}(1 - p)$
Test:   $\hat{h} = (1 - p) \cdot h$
m is a binary mask, ⊙ is element-wise multiplication, p is the drop probability

Why does it work? Consider a "cat" classifier. Without dropout, the network might learn: "if neuron 327 (ear detector) AND neuron 891 (whisker detector) AND neuron 1204 (fur texture) are all active, it's a cat." This is co-adaptation — neurons that only work together. Dropout breaks co-adaptation by randomly removing some of these neurons, forcing the network to develop multiple independent lines of evidence.

Inverted Dropout

In practice, most implementations use inverted dropout: scale activations by 1/(1−p) during training instead of scaling by (1−p) at test time. This way, the test-time forward pass requires no modification at all — just remove the dropout.
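Here is a minimal sketch of inverted dropout on a plain list of activations. The function name and the `training` flag are illustrative choices, not a specific framework's API.

```python
import random

def inverted_dropout(h, p, training, rng=random):
    """Apply inverted dropout with drop probability p to a list of activations."""
    if not training or p == 0.0:
        return list(h)          # test time: identity, no rescaling needed
    keep = 1.0 - p
    # Keep each unit with probability (1 - p) and scale survivors by 1 / (1 - p).
    return [x / keep if rng.random() < keep else 0.0 for x in h]

rng = random.Random(0)
h = [1.0] * 10000
train_out = inverted_dropout(h, p=0.5, training=True, rng=rng)
test_out = inverted_dropout(h, p=0.5, training=False)

# The training-time mean stays close to the test-time mean of 1.0.
mean_train = sum(train_out) / len(train_out)
print(round(mean_train, 2), test_out[:3])
```

Because survivors are scaled up during training, the expected activation is the same in both modes, which is why the test path can skip dropout entirely.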

Label Smoothing

Instead of training with hard labels [1, 0, 0, ..., 0] (100% confident it's class 0), use soft labels such as [0.91, 0.01, 0.01, ..., 0.01] (the result for ε = 0.1 and K = 10 classes). This prevents the network from becoming overconfident and improves generalization. The smoothing parameter ε (typically 0.1) controls how much probability mass is taken from the true class and spread uniformly over all K classes.

Label Smoothing
$y_{\text{smooth}} = (1 - \epsilon) \cdot y_{\text{hard}} + \epsilon / K$
K = number of classes, ε = smoothing parameter (e.g. 0.1)
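The formula translates directly into code. This sketch builds the smoothed target for one example; `smooth_labels` is a hypothetical helper name.

```python
def smooth_labels(true_class, num_classes, eps=0.1):
    """Return (1 - eps) * one_hot + eps / K as a list of K probabilities."""
    uniform = eps / num_classes
    return [(1.0 - eps) * (1.0 if k == true_class else 0.0) + uniform
            for k in range(num_classes)]

target = smooth_labels(true_class=0, num_classes=10, eps=0.1)
print([round(t, 2) for t in target])   # true class gets 0.91, the rest 0.01
```

The target still sums to 1, so it can be used in a cross-entropy loss in place of the one-hot vector.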

Stochastic Depth (Drop Path)

A natural extension of dropout to residual networks: during training, randomly skip entire residual blocks (set their output to zero, keeping only the skip connection). This effectively trains an ensemble of networks with different depths. At test time, use all blocks but scale their contributions.
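As a rough sketch, assuming a fixed survival probability per block and a placeholder `residual_fn` standing in for the block's conv layers, drop path looks like this:

```python
import random

def stochastic_depth_block(x, residual_fn, survival_prob, training, rng=random):
    """One residual block with drop path, on a list of activations."""
    if training:
        if rng.random() >= survival_prob:
            return list(x)                       # block dropped: only the skip path
        return [xi + fi for xi, fi in zip(x, residual_fn(x))]
    # Test time: keep the block but scale its contribution by the survival probability.
    return [xi + survival_prob * fi for xi, fi in zip(x, residual_fn(x))]

def double(x):
    """Stand-in for the block's conv layers (hypothetical)."""
    return [2.0 * v for v in x]

rng = random.Random(0)
x = [1.0, 2.0]
train_outs = [stochastic_depth_block(x, double, 0.8, training=True, rng=rng)
              for _ in range(1000)]
kept = sum(out != x for out in train_outs) / len(train_outs)
test_out = stochastic_depth_block(x, double, 0.8, training=False)
print(round(kept, 2), test_out)
```

Roughly 80% of the training passes use the block. The test-time output sits between "always dropped" and "always kept", which is the averaging step of the regularization pattern.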

The Unifying Principle

Dropout, data augmentation, cutout, mixup, stochastic depth, label smoothing — they're all the same idea wearing different hats. Training: add randomness. Testing: average it out. The randomness prevents the network from memorizing the training data; the averaging recovers clean predictions.

| Technique | What Gets Randomized | Where Applied | Test-Time Handling |
|---|---|---|---|
| Dropout | Neuron activations | FC layers (sometimes conv) | Scale by (1−p) |
| Data augmentation | Input pixels | Before the network | Average over crops |
| Cutout | Input pixel regions | Input image | Use full image |
| Stochastic depth | Entire residual blocks | ResNet blocks | Scale block outputs |
| Label smoothing | Target distribution | Loss function | Hard labels (argmax) |

Worked Example — Dropout Co-adaptation

Suppose 3 neurons detect [ear, whisker, tail] and the network learns: cat = ear AND whisker AND tail. With dropout p = 0.5, on any given pass each neuron has a 50% chance of being dropped. The network can't rely on all three being present, so it learns something closer to: cat = (ear is evidence) OR (whisker is evidence) OR (tail is evidence). Each neuron becomes independently useful, which makes the prediction more robust at test time.

Chapter 04

Batch Normalization

You've initialized your weights carefully using Kaiming initialization (std = $\sqrt{2/D_{\text{in}}}$) so activations are well-scaled at the start. But as training progresses, the distribution of activations at each layer shifts. The inputs to layer 5 today look nothing like the inputs to layer 5 a thousand gradient steps ago. Layer 5 is constantly adapting to a moving target.

Definition
Internal Covariate Shift

The phenomenon where the distribution of inputs to a neural network layer changes during training, because the parameters of all preceding layers are changing. Each layer is trying to learn on shifting sand. Batch normalization directly addresses this by re-normalizing activations at each layer.

The BN Formula

For a mini-batch of N examples, batch normalization computes per-channel statistics and normalizes:

Batch Normalization: Forward Pass
1. Compute batch mean:   $\mu_B = \frac{1}{N} \sum_{i=1}^{N} x_i$
2. Compute batch variance:   $\sigma_B^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu_B)^2$
3. Normalize:   $\hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}$
4. Scale and shift:   $y_i = \gamma \hat{x}_i + \beta$
γ and β are learned parameters. $\epsilon \approx 10^{-5}$ prevents division by zero.

Step 3 forces the activations to have zero mean and unit variance. But that might be too restrictive: maybe the optimal activation distribution for this layer isn't standard normal. So step 4 adds learnable parameters γ (scale) and β (shift) that let the network undo the normalization if it wants to. If $\gamma = \sqrt{\sigma_B^2 + \epsilon}$ and $\beta = \mu_B$, the transform is the identity.

Why Learnable Scale and Shift?

Without γ and β, BN would force every layer's activations to be standard normal — which might not be what the network needs. The learnable parameters give the network the option to recover the original distribution if that's optimal. But the default is normalized, which is a much better starting point for optimization.

Training vs. Inference

During training, $\mu_B$ and $\sigma_B^2$ come from the current mini-batch. But at inference, you might have a single image, so there's no "batch" to compute statistics over. The solution: during training, maintain running averages of $\mu$ and $\sigma^2$ across all batches using exponential moving averages. At inference, use these fixed running statistics.
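The two modes can be sketched for a single channel as follows. This is a simplified illustration: the class name, the `momentum` value of 0.1, and the initial running statistics are assumptions, and it uses the biased batch variance for the running average (frameworks differ in such details).

```python
import math

class BatchNorm1Channel:
    """Batch norm for a single channel, operating on a list of scalars."""

    def __init__(self, gamma=1.0, beta=0.0, eps=1e-5, momentum=0.1):
        self.gamma, self.beta, self.eps = gamma, beta, eps
        self.momentum = momentum      # weight given to the newest batch statistic
        self.running_mean = 0.0
        self.running_var = 1.0
        self.training = True

    def __call__(self, x):
        if self.training:
            mean = sum(x) / len(x)
            var = sum((v - mean) ** 2 for v in x) / len(x)
            # Exponential moving averages, used later at inference.
            m = self.momentum
            self.running_mean = (1 - m) * self.running_mean + m * mean
            self.running_var = (1 - m) * self.running_var + m * var
        else:
            mean, var = self.running_mean, self.running_var
        scale = self.gamma / math.sqrt(var + self.eps)
        return [scale * (v - mean) + self.beta for v in x]

bn = BatchNorm1Channel(gamma=2.0, beta=1.0)
train_out = bn([2.0, 4.0, 6.0, 8.0])      # training mode: batch mean 5, variance 5
bn.training = False
eval_out = bn([6.0])                      # inference: a single example is fine
print([round(v, 2) for v in train_out], round(eval_out[0], 2))
```

In eval mode the output for an example no longer depends on what else is in the batch, which is what makes single-image inference well defined.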

The Training/Inference Trap

Forgetting to switch batch norm to evaluation mode (model.eval() in PyTorch) is one of the most common bugs in deep learning. In training mode, BN uses batch statistics; in eval mode, it uses the running averages. If you evaluate with batch statistics and a batch size of 1, the statistics come from a single example: for a fully connected layer the batch variance is exactly zero, every activation normalizes to 0, and the layer outputs β no matter what the input was.

Where to Place BN

The original paper placed BN before the activation function: Conv → BN → ReLU. Some later work suggests placing it after: Conv → ReLU → BN. In practice, both work well. Modern architectures typically use BN before activation, and many replace BN with Layer Normalization (which normalizes across features instead of across the batch, avoiding batch-size dependence).

Worked Example — BN on a Mini-Batch

Batch of 4 values from one channel: x = [2.0, 4.0, 6.0, 8.0].

Step 1: μB = (2+4+6+8)/4 = 5.0

Step 2: $\sigma_B^2 = ((2-5)^2 + (4-5)^2 + (6-5)^2 + (8-5)^2) / 4 = (9+1+1+9)/4 = 5.0$

Step 3: $\hat{x} = (x - 5) / \sqrt{5 + 10^{-5}} \approx [-1.34, -0.45, 0.45, 1.34]$

Step 4: With γ=2, β=1: $y = 2 \hat{x} + 1 \approx [-1.68, 0.11, 1.89, 3.68]$

The distribution is centered and scaled, then the learned affine transform reshapes it to whatever the network needs.

| Normalization | Statistics Over | Batch-Size Dependent? | Typical Use |
|---|---|---|---|
| Batch Norm | Batch dimension (N) | Yes | CNNs |
| Layer Norm | Feature dimension (D) | No | Transformers, RNNs |
| Instance Norm | Spatial dims (H, W) | No | Style transfer |
| Group Norm | Channel groups | No | Detection, small batches |

Chapter 05

AlexNet to VGG

In 2012, Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton entered a CNN in the ImageNet competition and won by a landslide. AlexNet cut the top-5 error rate from 26% to 16.4% — a gap so large that overnight, the entire computer vision community pivoted to deep learning.

AlexNet (2012)

AlexNet had 8 learned layers: 5 convolutional + 3 fully connected. By today's standards, it's simple. But it introduced key ideas that still matter:

| Innovation | Impact |
|---|---|
| ReLU activation | 6× faster convergence than tanh, no vanishing gradients in positive region |
| GPU training | Split the model across 2 GPUs; first large-scale GPU training |
| Dropout (p=0.5) | Applied to FC layers, reduced overfitting dramatically |
| Data augmentation | Random crops + horizontal flips + color jitter |
| Local response norm | Normalize across channels (later replaced by batch norm) |

AlexNet used large filters: 11×11 in the first layer, 5×5 in the second. These large filters seem necessary to cover enough spatial context. But VGG showed a better way.

VGG (2014): The Depth Revolution

Karen Simonyan and Andrew Zisserman asked a simple question: what if we replace all large filters with stacks of 3×3 filters?

The 3×3 Insight

Three stacked 3×3 convolutions (stride 1, padding 1) have the same effective receptive field as one 7×7 convolution. But they have three non-linearities instead of one, making the decision function more discriminative. And with C input and C output channels they use fewer parameters: $3 \times (3^2 C^2) = 27C^2$ vs. $7^2 C^2 = 49C^2$. Deeper and cheaper.

Receptive Field Calculation

How do we know that three 3×3 layers see 7×7 pixels of the input? Trace it backwards:

Receptive Field — Step by Step

Layer 3 (top): one output neuron sees a 3×3 patch of Layer 2's output.

Layer 2 (middle): each of those 9 neurons sees a 3×3 patch of Layer 1's output. The union covers a 5×5 region of Layer 1.

Layer 1 (bottom): each of those 25 neurons sees a 3×3 patch of the input. The union covers a 7×7 region of the input.

In general, L stacked 3×3 convolutions with stride 1 have a receptive field of (2L + 1) × (2L + 1).

Receptive Field Size RF = (2L + 1) × (2L + 1)   for L stacked 3×3 conv layers (stride 1)
L=1 → 3×3, L=2 → 5×5, L=3 → 7×7, L=5 → 11×11
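The backward trace generalizes to any stack of kernels and strides. This sketch uses the standard recurrence (each layer widens the field by (kernel − 1) times the current spacing between outputs); `receptive_field` is an illustrative helper name.

```python
def receptive_field(layers):
    """Receptive field of stacked conv layers given (kernel, stride) pairs, input first."""
    rf, jump = 1, 1          # jump = distance in input pixels between adjacent outputs
    for kernel, stride in layers:
        rf += (kernel - 1) * jump
        jump *= stride
    return rf

# Stacks of 3x3 stride-1 convs reproduce the (2L + 1) rule.
print([receptive_field([(3, 1)] * depth) for depth in (1, 2, 3, 5)])  # [3, 5, 7, 11]
print(receptive_field([(7, 1)]))                                      # 7
```

With stride 1 everywhere, `jump` stays at 1 and the formula collapses to 2L + 1; strided layers make later kernels cover more input pixels per step.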

VGG Architecture

VGG-16 uses 13 conv layers and 3 FC layers, all with 3×3 filters. The network is clean and regular: stack conv layers, pool to halve spatial dimensions, double the number of filters. This "stack and double" pattern became the default blueprint for CNNs.

Worked Example — VGG-16 Parameter Count

The first conv layer: 3×3×3×64 = 1,728 params. The last conv layer: 3×3×512×512 = 2,359,296 params. But the real cost is the first FC layer: 7×7×512×4096 = 102 million parameters. The 3 FC layers alone account for ~124M of VGG-16's 138M total parameters. This is why modern architectures eliminated FC layers in favor of global average pooling.
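To see where the parameters sit, the whole network can be tallied in a few lines. This sketch assumes the standard VGG-16 layout (configuration "D") and, unlike the weight-only products above, also counts bias terms.

```python
# VGG-16 layout: numbers are conv output channels, "M" is a 2x2 max pool.
VGG16 = [64, 64, "M", 128, 128, "M", 256, 256, 256, "M",
         512, 512, 512, "M", 512, 512, 512, "M"]

def vgg16_param_counts(num_classes=1000, in_channels=3, image_size=224):
    """Count weights and biases in the conv stack and the three FC layers."""
    conv_params, channels, size = 0, in_channels, image_size
    for item in VGG16:
        if item == "M":
            size //= 2                                     # pooling halves H and W
        else:
            conv_params += 3 * 3 * channels * item + item  # weights + biases
            channels = item
    fc_sizes = [size * size * channels, 4096, 4096, num_classes]
    fc_params = sum(n_in * n_out + n_out
                    for n_in, n_out in zip(fc_sizes, fc_sizes[1:]))
    return conv_params, fc_params

conv_params, fc_params = vgg16_param_counts()
print(f"conv: {conv_params / 1e6:.1f}M, fc: {fc_params / 1e6:.1f}M, "
      f"total: {(conv_params + fc_params) / 1e6:.1f}M")
# conv: 14.7M, fc: 123.6M, total: 138.4M
```

All thirteen conv layers together hold under 15M parameters; the three FC layers hold the rest.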

| Property | AlexNet | VGG-16 |
|---|---|---|
| Layers | 8 (5 conv + 3 FC) | 16 (13 conv + 3 FC) |
| Parameters | 62M | 138M |
| Filter sizes | 11×11, 5×5, 3×3 | All 3×3 |
| Top-5 error | 16.4% | 7.3% |
| Key lesson | CNNs work on GPUs | Depth > width, small filters win |

VGG's Achilles Heel

138 million parameters and 15.5 billion FLOPs for a single forward pass. VGG-16 requires ~528 MB just to store the model weights. This made it impractical for mobile and embedded deployment. Later architectures (GoogLeNet, MobileNet) focused on efficiency without sacrificing accuracy.

Chapter 06

GoogLeNet & Inception

While VGG went deeper by stacking identical 3×3 layers, Google took a different approach: go wider. Instead of choosing between a 1×1, 3×3, or 5×5 convolution, why not apply all of them in parallel and let the network decide which is useful?

The Inception Module

An Inception module applies four parallel operations to the same input: (1) 1×1 conv, (2) 3×3 conv, (3) 5×5 conv, and (4) 3×3 max pooling. The outputs are concatenated along the channel dimension. The network learns to weight these different spatial scales.

The Computational Problem

Naive concatenation is extremely expensive. If the input has 256 channels and each branch outputs 256 channels, the concatenated output has 1024 channels. The 5×5 conv branch alone would need 5×5×256×256 ≈ 1.6M parameters. Those weights are applied at every spatial position, and the channel count grows with each stacked module, so the FLOPs explode.

The 1×1 Convolution: The Bottleneck

The fix is brilliantly simple. Before each expensive convolution, apply a 1×1 convolution to reduce the number of channels. A 1×1 conv with 256 input channels and 64 output channels is just a per-pixel linear projection from 256-D to 64-D. It has 256×64 = 16,384 parameters, shared across all spatial positions, and costs 16,384 multiply-adds per position.

Definition
1×1 Convolution (Pointwise Convolution)

A convolution with a 1×1 spatial kernel. It doesn't look at neighboring pixels at all — it only mixes information across channels at each spatial location. Think of it as a per-pixel fully-connected layer. Used as a channel bottleneck: reduce 256 channels to 64, apply the expensive 5×5 conv on 64 channels, then the cost drops by 4×.

Worked Example — Inception Bottleneck Savings

Input: 28×28×256. Want: 5×5 conv with 128 output channels.

Without bottleneck: 5×5×256×128 = 819,200 params. FLOPs: 819,200 × 28×28 ≈ 642M.

With 1×1 bottleneck (reduce to 32 channels first):

• 1×1 conv: 256×32 = 8,192 params. FLOPs: 8,192 × 28×28 ≈ 6.4M.

• 5×5 conv: 5×5×32×128 = 102,400 params. FLOPs: 102,400 × 28×28 ≈ 80M.

Total: 86.7M FLOPs vs 642M, a 7.4× reduction.
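The arithmetic in this example is easy to re-run for other channel counts. The helper below is a sketch that counts weights and multiply-adds for a bias-free conv whose output has the same spatial size as its input; `conv_cost` is an illustrative name.

```python
def conv_cost(kernel, c_in, c_out, height, width):
    """Return (weights, multiply-adds) for a conv layer with 'same' padding, no bias."""
    weights = kernel * kernel * c_in * c_out
    return weights, weights * height * width

direct_w, direct_macs = conv_cost(5, 256, 128, 28, 28)
reduce_w, reduce_macs = conv_cost(1, 256, 32, 28, 28)    # 1x1 bottleneck
conv5_w, conv5_macs = conv_cost(5, 32, 128, 28, 28)      # 5x5 on reduced channels

bottleneck_macs = reduce_macs + conv5_macs
print(f"direct: {direct_macs / 1e6:.1f}M multiply-adds")
print(f"bottleneck: {bottleneck_macs / 1e6:.1f}M multiply-adds")
print(f"reduction: {direct_macs / bottleneck_macs:.1f}x")
```

Changing the bottleneck width from 32 to 64 in the two calls shows how quickly the saving shrinks as the reduced channel count grows.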

GoogLeNet Architecture

GoogLeNet stacks 9 Inception modules across 22 layers, uses global average pooling instead of FC layers (reducing the parameter count from VGG's 138M to just 5 million), and adds auxiliary classifiers at intermediate layers to provide additional gradient signal to early layers.

Auxiliary Classifiers

GoogLeNet attaches small classifier heads at intermediate points in the network. During training, these contribute to the loss (weighted by 0.3). The idea: provide gradient signal directly to early layers, combating the vanishing gradient problem. At test time, auxiliary classifiers are removed. Later work showed they function more as regularizers than gradient aids.

| Property | VGG-16 | GoogLeNet |
|---|---|---|
| Layers | 16 | 22 |
| Parameters | 138M | 5M |
| FLOPs | 15.5B | 1.5B |
| Top-5 error | 7.3% | 6.7% |
| Key idea | Depth + small filters | Parallel multi-scale + bottleneck |

Why Multi-Scale Processing Works

Different objects in an image exist at different scales. A 1×1 conv captures fine-grained per-pixel features (texture, color). A 3×3 conv captures local patterns (edges, corners). A 5×5 conv captures slightly larger structures (object parts). By processing all scales in parallel, the Inception module lets each spatial location combine information at multiple resolutions. The network learns which scales matter for each class.

Chapter 07

ResNet & Skip Connections

Here is the moment that changed everything. In 2015, Kaiming He and colleagues at Microsoft Research published a paper with a deceptively simple finding: a 56-layer plain CNN performs worse than a 20-layer one, even on the training set. This isn't overfitting — it's the degradation problem.

The Degradation Problem

If a 20-layer network achieves a certain training error, a 56-layer network should do at least as well. Why? Because the 56-layer network could learn to copy the 20-layer solution by setting the extra 36 layers to identity mappings (output = input). The fact that it doesn't find this solution means the optimizer can't navigate the loss landscape of deep plain networks.

Not Overfitting

The 56-layer network has higher training error than the 20-layer one. Both training and test error are worse. This rules out overfitting (which would show low training error but high test error). The problem is purely about optimization: gradient descent gets lost in the deep landscape.

The Residual Connection

The fix is startlingly elegant. Instead of learning H(x) directly, make each block learn the residual F(x) = H(x) − x, and add x back: H(x) = F(x) + x. This is implemented as a "skip connection" that adds the input directly to the block's output.

Residual Block H(x) = F(x) + x
F(x) is the "residual" learned by the conv layers. x is the identity shortcut.

Why does this help? If the optimal transformation is close to the identity (the extra layer isn't needed), the network just needs to push F(x) toward zero. Learning "change nothing" (F(x) = 0) is much easier than learning the identity mapping through a stack of nonlinear layers.

The Core Insight

Learning residuals is easier than learning full mappings. If the identity is a good starting point, the network only needs to learn small perturbations (F(x) ≈ 0). Without the skip connection, the network must learn to pass information through nonlinear layers unchanged — which SGD finds surprisingly difficult.

Why Skip Connections Fix Gradient Flow

Consider the gradient flowing backward through a residual block. By the chain rule:

Gradient Through a Residual Block ∂L/∂x = ∂L/∂H · ∂H/∂x = ∂L/∂H · (1 + ∂F/∂x)
The "1" is the gradient through the skip connection. It can never vanish.

In a plain network, the gradient is ∂L/∂x = ∂L/∂H · ∂F/∂x. If ∂F/∂x is small (which happens with deep networks), the gradient vanishes exponentially. With the skip connection, you always have the additive 1. Even if ∂F/∂x is zero, the gradient is ∂L/∂H · 1 = ∂L/∂H. The gradient has a highway through the skip connections that bypasses all the nonlinear layers.

Gradient Through L Residual Blocks

Consider L stacked residual blocks. The gradient from the loss to the input is:

$\frac{\partial L}{\partial x_0} = \frac{\partial L}{\partial x_L} \cdot \prod_{l=1}^{L} \left(1 + \frac{\partial F_l}{\partial x_{l-1}}\right)$

Expanding this product gives $2^L$ terms, each corresponding to a different path through the network. Some paths go through all residual functions; some skip them all (using only skip connections). Even if many $\partial F_l / \partial x_{l-1}$ terms are small, the path that uses all skip connections contributes a gradient of $\frac{\partial L}{\partial x_L} \cdot 1 \cdot 1 \cdots 1 = \frac{\partial L}{\partial x_L}$. There's always at least one high-gradient path.
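A scalar caricature makes the difference concrete. Real networks multiply Jacobian matrices, so treat this only as a sketch: each block's derivative is replaced by a single hypothetical number, 0.1.

```python
def backprop_scale(local_grads, residual):
    """Multiply per-block scalar derivatives; add the skip path's 1 if residual."""
    scale = 1.0
    for g in local_grads:
        scale *= (1.0 + g) if residual else g
    return scale

local_grads = [0.1] * 12      # hypothetical small dF/dx in each of 12 blocks
plain = backprop_scale(local_grads, residual=False)
resid = backprop_scale(local_grads, residual=True)
print(f"plain: {plain:.1e}, residual: {resid:.2f}")   # plain: 1.0e-12, residual: 3.14
```

Even when every block's own derivative is exactly zero, `backprop_scale([0.0] * 12, residual=True)` returns 1.0: that is the all-skip path from the expansion.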

ResNet Architecture Details

ResNet stacks residual blocks in four stages. Each stage doubles the number of channels and halves the spatial resolution (via stride-2 convolution). When dimensions change, a 1×1 conv with stride 2 is applied to the skip connection to match dimensions.

Worked Example — ResNet-34 Structure

Input: 224×224×3. Stem: 7×7 conv stride 2 → 112×112×64, then 3×3 max pool → 56×56×64.

Stage 1: 3 blocks of [3×3, 64; 3×3, 64] + skip. Output: 56×56×64.

Stage 2: 4 blocks of [3×3, 128; 3×3, 128] + skip. First block: stride 2. Output: 28×28×128.

Stage 3: 6 blocks of [3×3, 256; 3×3, 256] + skip. First block: stride 2. Output: 14×14×256.

Stage 4: 3 blocks of [3×3, 512; 3×3, 512] + skip. First block: stride 2. Output: 7×7×512.

Global average pool → 1×1×512 → FC 1000. Total: 21.8M params. No FC hidden layers, just one classification layer.
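The spatial sizes and the "34" in the name can be checked with the usual output-size formula. The padding values (3 for the 7×7 stem conv, 1 for the 3×3 layers) and the stride-2 max pool are assumptions here, chosen as the standard settings that reproduce the sizes listed above.

```python
def conv_out(size, kernel, stride, padding):
    """Spatial output size of a conv or pooling layer."""
    return (size + 2 * padding - kernel) // stride + 1

size = conv_out(224, kernel=7, stride=2, padding=3)    # stem conv -> 112
size = conv_out(size, kernel=3, stride=2, padding=1)   # stem max pool -> 56
blocks_per_stage = [3, 4, 6, 3]
channels_per_stage = [64, 128, 256, 512]

shapes = []
for stage, channels in enumerate(channels_per_stage):
    if stage > 0:
        # The first block of stages 2-4 uses a stride-2 3x3 conv.
        size = conv_out(size, kernel=3, stride=2, padding=1)
    shapes.append((size, size, channels))

# Stem conv + two convs per basic block + final FC layer.
weighted_layers = 1 + 2 * sum(blocks_per_stage) + 1
print(shapes, weighted_layers)
```

The layer count follows the usual convention of counting only the stem conv, the 3×3 convs inside blocks, and the classifier, not the 1×1 projections on the skip paths.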

Bottleneck Blocks (ResNet-50+)

For deeper networks (50, 101, 152 layers), each residual block uses a bottleneck design: 1×1 conv (reduce channels) → 3×3 conv → 1×1 conv (restore channels). This reduces computation while maintaining the same number of blocks.

Bottleneck Block
Input (256 ch) → 1×1 conv (64 ch) → 3×3 conv (64 ch) → 1×1 conv (256 ch) → + skip → Output
Without bottleneck: two 3×3 conv on 256 channels = $2 \times 3^2 \times 256^2$ ≈ 1.18M params
With bottleneck: $256 \times 64 + 3^2 \times 64^2 + 64 \times 256$ = 69,632 params, about 17× fewer

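| Model | Layers | Params | Top-5 Error | Block Type |
|---|---|---|---|---|
| ResNet-18 | 18 | 11.7M | 10.9% | Basic (two 3×3) |
| ResNet-34 | 34 | 21.8M | 7.4% | Basic |
| ResNet-50 | 50 | 25.6M | 6.7% | Bottleneck |
| ResNet-101 | 101 | 44.5M | 6.0% | Bottleneck |
| ResNet-152 | 152 | 60.2M | 5.7% | Bottleneck |

Chapter 08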

Modern Architectures

ResNet opened the floodgates. Once skip connections solved the degradation problem, researchers explored variations on the theme. Three post-ResNet architectures stand out for their distinct ideas.

ResNeXt (2017): Grouped Convolutions

ResNeXt replaces the single 3×3 conv in each block with grouped convolutions: split the channels into 32 groups, apply separate 3×3 convolutions to each group, then concatenate. This is equivalent to having 32 parallel pathways (or "cardinality" C=32) within each block.

Definition
Cardinality

The number of parallel transformation paths within a single block. ResNet has cardinality 1 (one path). ResNeXt uses cardinality 32. Increasing cardinality is more effective than increasing depth or width for the same parameter budget.

Worked Example — ResNeXt Block

Input: 256 channels. ResNeXt block with C=32, bottleneck width d=4:

32 parallel paths, each: 1×1 conv (256 → 4) → 3×3 conv (4 → 4) → 1×1 conv (4 → 256).

Each path has 256×4 + 9×16 + 4×256 = 2,192 params. Total: 32 × 2,192 = 70,144 params.

This is similar in cost to a standard ResNet bottleneck (69,632) but the multiple pathways provide richer representations.
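Both counts fall out of one formula if the 32 paths are viewed as a single bottleneck whose 3×3 conv is grouped. A small sketch (biases ignored, `bottleneck_params` is a hypothetical helper):

```python
def bottleneck_params(c_in, width, groups=1):
    """Weights in a 1x1 -> 3x3 (grouped) -> 1x1 bottleneck, biases ignored."""
    reduce = c_in * width
    # A grouped 3x3 conv connects each group of width/groups channels only to itself.
    per_group = width // groups
    conv3 = 9 * per_group * per_group * groups
    expand = width * c_in
    return reduce + conv3 + expand

resnet = bottleneck_params(c_in=256, width=64)                   # one path, 64 wide
resnext = bottleneck_params(c_in=256, width=32 * 4, groups=32)   # 32 paths, 4 wide
print(resnet, resnext)   # 69632 70144
```

The ResNeXt block spends far less on its 3×3 stage (4,608 weights instead of 36,864) and reinvests the savings in wider 1×1 projections.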

DenseNet (2017): Feature Reuse

DenseNet takes skip connections to the extreme: instead of adding the input to the output, it concatenates the input to the output. And not just from the previous layer — each layer receives the feature maps of all preceding layers in the same dense block.

DenseNet Connection
$x_l = H_l(\mathrm{concat}(x_0, x_1, \ldots, x_{l-1}))$
Each layer sees ALL previous feature maps, not just the most recent one.

This extreme connectivity has two benefits: (1) feature reuse — later layers can directly access early features without them being washed out by nonlinearities, and (2) parameter efficiency — because features are reused rather than re-learned, each layer can be very narrow (e.g., only 12 or 24 new channels per layer, called the "growth rate").
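Concatenation means the channel count grows linearly inside a block. A quick sketch with illustrative numbers (64 input channels, growth rate 32, 6 layers; these specific values are assumptions for the example):

```python
def dense_block_channels(c_in, growth_rate, num_layers):
    """Input channel count seen by each layer in a dense block, plus the block output."""
    seen = [c_in + growth_rate * i for i in range(num_layers)]
    return seen, c_in + growth_rate * num_layers

seen, c_out = dense_block_channels(c_in=64, growth_rate=32, num_layers=6)
print(seen, c_out)    # [64, 96, 128, 160, 192, 224] 256
```

Each layer only produces `growth_rate` new channels, however many it reads, which is where the parameter efficiency comes from.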

SE-Net (2018): Channel Attention

Squeeze-and-Excitation Networks add a lightweight attention mechanism that learns to weight channels differently. After a conv layer, SE-Net: (1) squeezes the spatial dimensions via global average pooling to get a per-channel descriptor, then (2) excites by passing this descriptor through two FC layers (bottleneck → expand) with a sigmoid, producing per-channel weights that rescale the original feature map.

Channel Attention Intuition

Not all channels are equally useful for a given input. An image of a red car might benefit from up-weighting color channels and down-weighting texture channels. SE-Net learns to dynamically re-weight channels based on the global content of the current input. The cost is negligible: two small FC layers (reduction ratio r=16) per block.

Worked Example — SE Block

Input: 7×7×512. Squeeze: Global average pool → 1×1×512. Excite: FC 512 → 32 (ReLU) → FC 32 → 512 (sigmoid) = channel weights $w \in [0,1]^{512}$. Scale: multiply each channel by its weight. Extra params: 512×32 + 32×512 = 32,768 for this block (~0.13% of ResNet-50's 25.6M).
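The three steps fit in a short function. This sketch uses tiny toy sizes (8 channels reduced to 2) with random weights and no biases, purely to show the data flow; the names are illustrative.

```python
import math
import random

def se_block(feature_map, w1, w2):
    """Squeeze-and-excitation on a feature map given as [channel][row][col]."""
    # Squeeze: global average pool gives one number per channel.
    z = [sum(map(sum, ch)) / (len(ch) * len(ch[0])) for ch in feature_map]
    # Excite: FC (reduce) + ReLU, then FC (expand) + sigmoid.
    hidden = [max(0.0, sum(w * v for w, v in zip(row, z))) for row in w1]
    weights = [1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
               for row in w2]
    # Scale: multiply every value in a channel by that channel's weight.
    scaled = [[[s * v for v in row] for row in ch]
              for s, ch in zip(weights, feature_map)]
    return scaled, weights

rng = random.Random(0)
channels, reduced = 8, 2                   # toy sizes for illustration
fmap = [[[rng.random() for _ in range(3)] for _ in range(3)] for _ in range(channels)]
w1 = [[rng.gauss(0, 1) for _ in range(channels)] for _ in range(reduced)]   # 8 -> 2
w2 = [[rng.gauss(0, 1) for _ in range(reduced)] for _ in range(channels)]   # 2 -> 8
scaled, weights = se_block(fmap, w1, w2)
print([round(w, 2) for w in weights])
```

The channel weights depend on the input through the squeeze step, so a different image produces a different re-weighting with the same learned `w1` and `w2`.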

| Architecture | Key Idea | Connectivity | Params (ImageNet) | Top-5 Error |
|---|---|---|---|---|
| ResNet-50 | Residual (add) | x + F(x) | 25.6M | 6.7% |
| ResNeXt-50 | Grouped (cardinality) | x + Σ F_i(x) | 25.0M | 5.6% |
| DenseNet-121 | Dense (concat) | concat(x_0..l) | 8.0M | 6.2% |
| SE-ResNet-50 | Channel attention | x + s · F(x) | 28.1M | 5.5% |

Chapter 09

CNN Architecture Explorer

Time to see the full picture. This interactive visualization lets you compare every major CNN architecture from AlexNet through ConvNeXt. Select an architecture to see its depth, parameter count, FLOPs, ImageNet accuracy, and a schematic block diagram showing the core building block.


The Pareto Frontier

There is no single "best" architecture. AlexNet is fast but inaccurate. VGG is accurate but huge. MobileNet is tiny but less accurate. EfficientNet and ConvNeXt represent the current Pareto frontier: the best accuracy for a given compute budget. The right choice depends on your deployment constraints — a self-driving car has different requirements than a phone app.

Chapter 10

Efficient & Modern CNNs

The architectures we've seen so far were designed for server-side inference with powerful GPUs. But what about deploying CNNs on phones, drones, or embedded devices with 100× less compute? This drove a wave of innovation in efficient architectures.

MobileNet (2017): Depthwise Separable Convolutions

A standard 3×3 convolution with $C_{\text{in}}$ input channels and $C_{\text{out}}$ output channels costs $3 \times 3 \times C_{\text{in}} \times C_{\text{out}}$ parameters. MobileNet splits this into two steps:

Definition
Depthwise Separable Convolution

Step 1 (Depthwise): Apply a separate 3×3 filter to each input channel independently. Cost: $3 \times 3 \times C_{\text{in}}$. This captures spatial patterns within each channel.
Step 2 (Pointwise): Apply a 1×1 convolution to mix channels. Cost: $C_{\text{in}} \times C_{\text{out}}$. This captures cross-channel patterns.
Total cost: $9 C_{\text{in}} + C_{\text{in}} C_{\text{out}}$ vs. $9 C_{\text{in}} C_{\text{out}}$ for standard conv. The ratio is $1/C_{\text{out}} + 1/9$, so for typical channel counts the separable version needs about 8–9× fewer parameters and FLOPs.

Worked Example — Depthwise Separable Cost

Input: 14×14 with 256 channels. Output: 256 channels. 3×3 conv.

Standard: 3×3×256×256 = 589,824 params. FLOPs: 589,824 × 14×14 ≈ 115.6M.

Depthwise separable: Depthwise: 9×256 = 2,304. Pointwise: 256×256 = 65,536. Total: 67,840 params — 8.7× fewer. FLOPs: 67,840 × 196 ≈ 13.3M.
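The same comparison as a pair of small functions, so other channel counts can be tried. This is a sketch with illustrative function names; biases are ignored.

```python
def standard_conv_params(c_in, c_out, kernel=3):
    """Weights in a standard k x k convolution."""
    return kernel * kernel * c_in * c_out

def depthwise_separable_params(c_in, c_out, kernel=3):
    """Weights in a depthwise k x k conv followed by a pointwise 1x1 conv."""
    depthwise = kernel * kernel * c_in    # one k x k filter per input channel
    pointwise = c_in * c_out              # 1x1 conv mixes channels
    return depthwise + pointwise

c_in, c_out = 256, 256
std = standard_conv_params(c_in, c_out)
sep = depthwise_separable_params(c_in, c_out)
print(std, sep, round(std / sep, 1))                      # 589824 67840 8.7
# The cost ratio equals 1/c_out + 1/9 regardless of c_in.
print(round(sep / std, 4), round(1 / c_out + 1 / 9, 4))
```

As `c_out` grows, the ratio approaches 1/9, which is why the saving for a 3×3 kernel tops out just below 9×.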

MobileNetV2: Inverted Residuals

MobileNetV2 (2018) introduced the inverted residual block: expand channels with 1×1 conv, apply depthwise 3×3 conv on the expanded channels, then project back down with 1×1 conv. This is "inverted" compared to the ResNet bottleneck (which compresses first, then expands). The skip connection goes between the narrow (compressed) representations, not the wide ones.

EfficientNet (2019): Compound Scaling

How should you scale up a CNN for better accuracy? You could go deeper (more layers), wider (more channels), or use higher-resolution inputs. Previous work scaled one dimension at a time. Tan and Le (2019) showed that scaling all three dimensions together with a fixed ratio works dramatically better.

Compound Scaling
depth: $d = \alpha^{\phi}$,   width: $w = \beta^{\phi}$,   resolution: $r = \gamma^{\phi}$
subject to: $\alpha \cdot \beta^2 \cdot \gamma^2 \approx 2$
φ is a user-specified compound coefficient. α, β, γ are found by grid search on a small baseline (EfficientNet-B0). The constraint ensures FLOPs roughly double per unit increase in φ.

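To get a feel for the constraint, here is a sketch that plugs in the coefficients reported in the EfficientNet paper (α = 1.2, β = 1.1, γ = 1.15). Those three values are not given in this lecture, so treat them as an outside assumption; the released B1 to B7 models also round and adjust the resulting sizes.

```python
# Grid-search coefficients reported for the EfficientNet-B0 baseline
# (assumption: taken from the EfficientNet paper, not from this lecture).
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15

def compound_scale(phi):
    """Return (depth, width, resolution) multipliers for compound coefficient phi."""
    return ALPHA ** phi, BETA ** phi, GAMMA ** phi

def flops_multiplier(phi):
    """FLOPs grow linearly with depth and quadratically with width and resolution."""
    d, w, r = compound_scale(phi)
    return d * w ** 2 * r ** 2

for phi in range(4):
    d, w, r = compound_scale(phi)
    print(f"phi={phi}: depth x{d:.2f}, width x{w:.2f}, "
          f"resolution x{r:.2f}, FLOPs x{flops_multiplier(phi):.2f}")
```

With these values α · β² · γ² is about 1.92, so each step in φ costs a little under 2× the FLOPs.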
Why Compound Scaling Works

If you make a network deeper without making it wider, the extra layers don't have enough channels to extract useful features. If you use higher resolution without going deeper, the network can't capture the larger-scale patterns that higher resolution reveals. All three dimensions are correlated: scaling them together lets each dimension complement the others.

EfficientNet-B0 starts from a MobileNetV2-like architecture (designed via neural architecture search) and scales it with φ = 0 through 7, producing B0 through B7. EfficientNet-B7 achieves 84.3% top-1 accuracy on ImageNet with 66M parameters — comparable to the much larger GPipe (557M params) that was state-of-the-art at the time.

ConvNeXt (2022): Modernizing Convolutions

Vision Transformers (ViTs) overtook CNNs on many benchmarks. But was this because of self-attention, or because of modern training techniques (larger datasets, stronger augmentation, different learning rate schedules)? ConvNeXt answers this question by modernizing a standard ResNet using only ideas from the Transformer literature — without self-attention.

ConvNeXt takes a ResNet-50 and applies a series of modifications, each borrowed from Transformers:

| Modification | Inspired By | Effect |
|---|---|---|
| Larger kernel (7×7 depthwise conv) | ViT's large attention receptive field | +0.5% accuracy |
| Fewer activation functions (GELU, only after pointwise) | Transformer blocks have fewer nonlinearities | +0.7% |
| Layer Norm instead of Batch Norm | Transformers use LayerNorm | +0.1% |
| Inverted bottleneck (expand then narrow) | MLP block in Transformers: 4× expansion | +0.2% |
| Fewer normalization layers (only one per block) | Transformer has one LN per sub-block | +0.1% |
| Patchify stem (4×4 non-overlapping conv stride 4) | ViT's patch embedding | +0.2% |

The result: ConvNeXt-T (29M params) reaches 82.1% top-1 on ImageNet using only convolutions, slightly ahead of Swin Transformer-T (29M params, 81.3%). ConvNeXt-XL scales to 87.8% top-1, competitive with the best Transformers.

The ConvNeXt Lesson

ViTs didn't win because self-attention is inherently superior to convolution. They won because the Transformer community developed better training recipes, scaling strategies, and architectural details. When you transplant these advances back to CNNs, convolutions are competitive again. The lesson: training methodology matters as much as architecture.

The Full Timeline

| Year | Architecture | Key Innovation | Params | Top-1 Accuracy |
|---|---|---|---|---|
| 2012 | AlexNet | GPU training + ReLU | 62M | 63.3% |
| 2014 | VGG-16 | Small 3×3 filters | 138M | 74.4% |
| 2014 | GoogLeNet | Inception + 1×1 conv | 5M | 74.8% |
| 2015 | ResNet-50 | Skip connections | 25.6M | 76.1% |
| 2017 | ResNeXt-50 | Grouped convolutions | 25.0M | 77.8% |
| 2017 | MobileNetV1 | Depthwise separable conv | 4.2M | 70.6% |
| 2017 | DenseNet-121 | Dense connectivity | 8.0M | 74.4% |
| 2018 | MobileNetV2 | Inverted residuals | 3.4M | 72.0% |
| 2019 | EfficientNet-B0 | Compound scaling | 5.3M | 77.1% |
| 2022 | ConvNeXt-T | Modernized ResNet | 29M | 82.1% |

Transfer Learning Strategy

Small dataset, similar domain: Freeze a pretrained ResNet/ConvNeXt backbone, train only a new classification head. This is a linear probe — the backbone is a fixed feature extractor.

Small dataset, different domain: Try a different pretrained model or collect more data. Features from ImageNet may not transfer well to medical imaging or satellite data.

Large dataset: Fine-tune the entire model end-to-end. Initialize from pretrained weights, use a small learning rate (10× smaller than training from scratch), and let all layers adapt to your data.

The One Sentence

From AlexNet to ConvNeXt, the story of CNNs is the story of making depth work: small filters for receptive fields, skip connections for gradient flow, bottlenecks for efficiency, and training tricks for everything else.