You know what a convolution does. But how do you actually train a 152-layer network — and why does adding more layers sometimes make things worse? This is the story of the tricks and architectures that solved it.
You understand convolutions, pooling, fully-connected layers. You can draw a CNN on a whiteboard. So you stack some layers, write model.fit(), and wait. The training loss barely moves. Or it plummets on the training set and the model fails badly on test images. What went wrong?
Nothing conceptual. The architecture is fine. The problem is that deep neural networks are remarkably finicky to train. A 20-layer plain network can outperform an 8-layer one — but a 56-layer network often performs worse than the 20-layer network, even on the training set. This isn't overfitting. This is an optimization failure: deeper networks are harder for gradient descent to navigate.
Deeper networks have at least as much representational capacity: a 56-layer network can represent everything a 20-layer one can, by setting the extra layers to identity mappings. But gradient descent often fails to find these solutions. The gap between what a network can represent and what training can actually reach is the entire subject of this lecture.
This lecture covers two sides of the same coin: training tricks that help gradient descent do its job (augmentation, dropout, batch normalization), and architectural innovations that reshape the loss landscape to make optimization easier (VGG's small filters, Inception's parallel paths, ResNet's skip connections).
From 2012 to 2017, the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) drove a revolution. Error rates dropped from 26% (hand-crafted features) to 16% (AlexNet, 8 layers) to 7.3% (VGG, 19 layers) to 6.7% (GoogLeNet, 22 layers) to 3.6% (ResNet, 152 layers) — surpassing human performance (~5%). Each leap came from a specific insight about training or architecture.
| Year | Model | Layers | Top-5 Error | Key Insight |
|---|---|---|---|---|
| 2012 | AlexNet | 8 | 16.4% | GPUs + ReLU + dropout |
| 2013 | ZFNet | 8 | 11.7% | Visualization + tuning |
| 2014 | VGG | 19 | 7.3% | Small 3×3 filters, depth |
| 2014 | GoogLeNet | 22 | 6.7% | Inception modules, 1×1 conv |
| 2015 | ResNet | 152 | 3.6% | Skip connections |
| 2017 | SENet | 152 | 2.3% | Channel attention |
He et al. (2015) showed that a 56-layer plain CNN has higher training error than a 20-layer one. Not higher test error — higher training error. The network isn't overfitting; it's failing to optimize at all. This is the degradation problem, and it motivated the most important architectural innovation since the convolutional layer itself: the residual connection.
Your dataset has 50,000 images. Your model has 25 million parameters. This mismatch is a recipe for overfitting. One of the most effective remedies is embarrassingly simple: show the network slightly different versions of the same image every time.
Data augmentation means applying random, label-preserving transformations to training images before feeding them to the network. A flipped cat is still a cat. A slightly darker photo of a dog is still a dog. Each epoch, the network sees a different version of each image, effectively multiplying the dataset size without collecting new data.
Horizontal flip. Mirror the image left-to-right. This doubles your effective dataset. A bird facing left and a bird facing right should both be classified as "bird." This is the single most common augmentation and is almost always beneficial.
Random crop and scale. Resize the image so the short side is a random length (say 256–480 pixels), then extract a random 224×224 patch. This teaches the network to recognize objects at different scales and positions. ResNet's training recipe: pick random L ∈ [256, 480], resize, crop 224×224.
Color jitter. Randomly adjust brightness, contrast, and saturation. The lighting conditions in your training set shouldn't dictate what the network learns. A slightly blue-tinted car is still a car.
Cutout. Mask a random rectangular region of the image with zeros. This forces the network to rely on multiple parts of the object, not just one discriminative patch. Works especially well on small datasets like CIFAR-10.
Mixup. Blend two training images and their labels: x̄ = λ·x_i + (1−λ)·x_j, ȳ = λ·y_i + (1−λ)·y_j, where λ ~ Beta(0.2, 0.2). A 70/30 blend of "cat" and "dog" gets the label [0.7 cat, 0.3 dog]. This regularizes the network by training it on convex combinations of examples.
CutMix. Instead of blending pixel values (Mixup), cut a rectangular patch from one image and paste it onto another. The label is proportional to the area ratio. This preserves local image statistics better than Mixup's ghostly blends.
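The Mixup blend described above can be sketched in a few lines. This is a minimal illustration on flat lists of pixel values, not a reference implementation; the function name `mixup` and the toy two-class data are assumptions made for the example.

```python
import random

def mixup(x_i, y_i, x_j, y_j, alpha=0.2, rng=random):
    """Blend two examples and their one-hot labels with lambda ~ Beta(alpha, alpha)."""
    lam = rng.betavariate(alpha, alpha)
    x = [lam * a + (1.0 - lam) * b for a, b in zip(x_i, x_j)]
    y = [lam * a + (1.0 - lam) * b for a, b in zip(y_i, y_j)]
    return x, y, lam

random.seed(0)
# Hypothetical 3-pixel "images" with one-hot labels [cat, dog].
cat_pixels, cat_label = [0.0, 0.5, 1.0], [1.0, 0.0]
dog_pixels, dog_label = [1.0, 0.5, 0.0], [0.0, 1.0]
mixed_x, mixed_y, lam = mixup(cat_pixels, cat_label, dog_pixels, dog_label)
```

The blended label always sums to 1, and its "cat" entry equals the sampled λ. With Beta(0.2, 0.2), most draws land near 0 or 1, so many blends are dominated by one of the two images.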
All augmentations follow the same template: add randomness during training, remove it at test time. During training, each image is randomly transformed. At test time, you either use the original image or average predictions over a fixed set of crops (test-time augmentation). The randomness prevents the network from memorizing specific pixel patterns.
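As a rough sketch of this "random at training time, deterministic at test time" template, the snippet below pairs a random crop and flip with a fixed center crop. The helper names and the 8×8 toy image are assumptions for illustration only.

```python
import random

def random_crop_flip(image, crop, rng):
    """Training-time transform: random crop, then a left-right flip half the time."""
    h, w = len(image), len(image[0])
    top = rng.randint(0, h - crop)
    left = rng.randint(0, w - crop)
    patch = [row[left:left + crop] for row in image[top:top + crop]]
    if rng.random() < 0.5:
        patch = [row[::-1] for row in patch]
    return patch

def center_crop(image, crop):
    """Test-time transform: deterministic center crop, no randomness."""
    h, w = len(image), len(image[0])
    top, left = (h - crop) // 2, (w - crop) // 2
    return [row[left:left + crop] for row in image[top:top + crop]]

# Toy 8x8 "image" whose pixel value encodes its position.
image = [[r * 8 + c for c in range(8)] for r in range(8)]
rng = random.Random(0)
train_view = random_crop_flip(image, 4, rng)  # changes from call to call
test_view = center_crop(image, 4)             # always the same patch
```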
ResNet at test time uses 10-crop evaluation: resize the image to 5 scales {224, 256, 384, 480, 640}. At each scale, take 4 corner crops + 1 center crop, plus their horizontal flips = 10 crops per scale × 5 scales = 50 forward passes. Average the predictions. This ensemble of crops improves accuracy by ~1% over a single center crop.
Not all augmentations are label-preserving. Vertical flip makes sense for satellite images (no "up" or "down") but not for digit recognition (a flipped 6 becomes a 9). Always verify that your augmentations don't change the correct label.
Data augmentation is one form of regularization. But CNNs have other tricks in their regularization arsenal, all following the same principle: inject noise during training, average it out at test time.
Dropout. During each forward pass, randomly set each neuron's output to zero with probability p (typically p = 0.5 for FC layers). At test time, keep all neurons active but multiply outputs by (1−p) so the expected activation matches what the next layer saw during training.
Dropout is a regularization technique that randomly "drops" (zeros out) neurons during training. Think of it two ways: (1) it forces redundant representations, since no single neuron can be essential when it might be dropped. (2) It trains an exponentially large ensemble of sub-networks that share parameters. An FC layer with 4096 units has 2^4096 possible dropout masks, more configurations than atoms in the universe.
Why does it work? Consider a "cat" classifier. Without dropout, the network might learn: "if neuron 327 (ear detector) AND neuron 891 (whisker detector) AND neuron 1204 (fur texture) are all active, it's a cat." This is co-adaptation — neurons that only work together. Dropout breaks co-adaptation by randomly removing some of these neurons, forcing the network to develop multiple independent lines of evidence.
In practice, most implementations use inverted dropout: scale activations by 1/(1−p) during training instead of scaling by (1−p) at test time. This way, the test-time forward pass requires no modification at all — just remove the dropout.
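A minimal sketch of inverted dropout, assuming p is the drop probability as in the text. The function name and the all-ones input are illustrative choices, not taken from any particular library.

```python
import random

def inverted_dropout(activations, p, training, rng):
    """Drop each activation with probability p; rescale survivors by 1/(1-p)."""
    if not training or p == 0.0:
        return list(activations)  # test time: identity, no rescaling needed
    keep = 1.0 - p
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

rng = random.Random(0)
x = [1.0] * 10000
train_out = inverted_dropout(x, 0.5, True, rng)
test_out = inverted_dropout(x, 0.5, False, rng)
mean_train = sum(train_out) / len(train_out)  # stays close to 1.0 in expectation
```

Because survivors are scaled up during training, the average activation is roughly the same in both modes, which is why the test-time pass needs no correction.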
Label smoothing. Instead of training with hard labels [1, 0, 0, ..., 0] (100% confident it's class 0), use soft labels such as [0.9, 0.01, 0.01, ..., 0.01] (here with 10 other classes sharing the remaining 0.1). This discourages the network from becoming overconfident and tends to improve generalization. The smoothing parameter ε (typically 0.1) controls how much probability mass shifts from the true class to the others.
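The soft-label construction can be sketched as follows. This version gives the true class 1−ε and splits ε evenly over the other classes, which reproduces the [0.9, 0.01, ...] example when there are 11 classes; that convention is an assumption here, and some implementations instead spread ε over all classes including the true one.

```python
def smooth_labels(true_class, num_classes, eps=0.1):
    """Soft target: 1 - eps on the true class, eps shared by the remaining classes."""
    off_value = eps / (num_classes - 1)
    return [1.0 - eps if k == true_class else off_value for k in range(num_classes)]

soft = smooth_labels(true_class=0, num_classes=11)
# soft[0] is 0.9 and every other entry is 0.01
```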
Stochastic depth is a natural extension of dropout to residual networks: during training, randomly skip entire residual blocks (set their output to zero, keeping only the skip connection). This effectively trains an ensemble of networks with different depths. At test time, use all blocks but scale their contributions.
Dropout, data augmentation, cutout, mixup, stochastic depth, label smoothing — they're all the same idea wearing different hats. Training: add randomness. Testing: average it out. The randomness prevents the network from memorizing the training data; the averaging recovers clean predictions.
| Technique | What Gets Randomized | Where Applied | Test-Time Handling |
|---|---|---|---|
| Dropout | Neuron activations | FC layers (sometimes conv) | Scale by (1−p) |
| Data augmentation | Input pixels | Before the network | Average over crops |
| Cutout | Input pixel regions | Input image | Use full image |
| Stochastic depth | Entire residual blocks | ResNet blocks | Scale block outputs |
| Label smoothing | Target distribution | Loss function | Hard labels (argmax) |
Suppose 3 neurons detect [ear, whisker, tail] and the network learns: cat = ear AND whisker AND tail. With dropout p = 0.5, on any given pass each neuron has a 50% chance of being dropped. The network can't rely on all three being present, so it learns something closer to: cat = (ear alone is evidence) OR (whisker alone is evidence) OR (tail alone is evidence). Each neuron becomes independently useful, which makes the classifier more robust at test time.
You've initialized your weights carefully using Kaiming initialization (std = √(2/D_in)) so activations are well-scaled at the start. But as training progresses, the distribution of activations at each layer shifts. The inputs to layer 5 today look nothing like the inputs to layer 5 a thousand gradient steps ago. Layer 5 is constantly adapting to a moving target.
Internal covariate shift is the phenomenon where the distribution of inputs to a neural network layer changes during training, because the parameters of all preceding layers are changing. Each layer is trying to learn on shifting sand. Batch normalization was designed to address this by re-normalizing activations at each layer.
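For a mini-batch of N examples, batch normalization computes per-channel statistics and normalizes in four steps:
Step 1 (batch mean): μ_B = (1/N) Σ_i x_i
Step 2 (batch variance): σ_B² = (1/N) Σ_i (x_i − μ_B)²
Step 3 (normalize): x̂_i = (x_i − μ_B) / √(σ_B² + ε)
Step 4 (scale and shift): y_i = γ · x̂_i + β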
Step 3 forces the activations to have zero mean and unit variance. But that might be too restrictive: maybe the optimal activation distribution for this layer isn't standard normal. So step 4 adds learnable parameters γ (scale) and β (shift) that let the network undo the normalization if it wants to. If γ = √(σ_B² + ε) ≈ σ_B and β = μ_B, the transform is the identity.
Without γ and β, BN would force every layer's activations to be standard normal — which might not be what the network needs. The learnable parameters give the network the option to recover the original distribution if that's optimal. But the default is normalized, which is a much better starting point for optimization.
During training, μ_B and σ_B² come from the current mini-batch. But at inference, you might have a single image, so there's no "batch" to compute statistics over. The solution: during training, maintain running averages of μ and σ² across all batches using exponential moving averages. At inference, use these fixed running statistics.
Forgetting to switch batch norm to evaluation mode (model.eval() in PyTorch) is one of the most common bugs in deep learning. In training mode, BN uses batch statistics; in eval mode, it uses the running averages. If you evaluate with batch statistics, each prediction depends on whatever else is in the batch, and with a batch size of 1 a fully connected layer sees zero variance per feature, so the normalized activations collapse to zero.
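The train/eval split can be illustrated with a single-channel sketch. The class name, the momentum of 0.1, and the use of the biased batch variance for the running estimate are simplifying assumptions; real frameworks differ in details.

```python
class BatchNorm1Channel:
    """Batch norm for one channel with running statistics (illustrative sketch)."""

    def __init__(self, momentum=0.1, eps=1e-5):
        self.gamma, self.beta = 1.0, 0.0
        self.running_mean, self.running_var = 0.0, 1.0
        self.momentum, self.eps = momentum, eps
        self.training = True

    def __call__(self, batch):
        if self.training:
            # Training mode: use batch statistics and update the running averages.
            mean = sum(batch) / len(batch)
            var = sum((v - mean) ** 2 for v in batch) / len(batch)
            m = self.momentum
            self.running_mean = (1 - m) * self.running_mean + m * mean
            self.running_var = (1 - m) * self.running_var + m * var
        else:
            # Eval mode: use the stored running averages, even for a batch of one.
            mean, var = self.running_mean, self.running_var
        scale = (var + self.eps) ** 0.5
        return [self.gamma * (v - mean) / scale + self.beta for v in batch]

bn = BatchNorm1Channel()
for _ in range(200):               # "train" on the same batch repeatedly
    bn([2.0, 4.0, 6.0, 8.0])
bn.training = False                # the equivalent of switching to eval mode
single = bn([5.0])                 # one example, normalized with running stats
```

After enough updates the running mean and variance converge to the batch values (5.0 and 5.0 here), so a lone input of 5.0 maps to approximately 0 in eval mode.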
The original paper placed BN before the activation function: Conv → BN → ReLU. Some later work suggests placing it after: Conv → ReLU → BN. In practice, both work well. Modern architectures typically use BN before activation, and many replace BN with Layer Normalization (which normalizes across features instead of across the batch, avoiding batch-size dependence).
Batch of 4 values from one channel: x = [2.0, 4.0, 6.0, 8.0].
Step 1: μ_B = (2+4+6+8)/4 = 5.0
Step 2: σ_B² = ((2−5)² + (4−5)² + (6−5)² + (8−5)²) / 4 = (9+1+1+9)/4 = 5.0
Step 3: x̂ = (x − 5) / √(5 + 10⁻⁵) ≈ [−1.34, −0.45, 0.45, 1.34]
Step 4: With γ=2, β=1: y = 2 · x̂ + 1 ≈ [−1.68, 0.11, 1.89, 3.68]
The distribution is centered and scaled, then the learned affine transform reshapes it to whatever the network needs.
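The worked example can be checked with a short script. This is a sketch of the training-mode forward pass for one channel only; the function name `batch_norm` is an assumption for illustration.

```python
def batch_norm(x, gamma, beta, eps=1e-5):
    """Training-mode batch norm for one channel, returning every intermediate."""
    n = len(x)
    mean = sum(x) / n                                      # step 1: batch mean
    var = sum((v - mean) ** 2 for v in x) / n              # step 2: batch variance
    x_hat = [(v - mean) / (var + eps) ** 0.5 for v in x]   # step 3: normalize
    y = [gamma * v + beta for v in x_hat]                  # step 4: scale and shift
    return mean, var, x_hat, y

mean, var, x_hat, y = batch_norm([2.0, 4.0, 6.0, 8.0], gamma=2.0, beta=1.0)
print([round(v, 2) for v in x_hat])  # [-1.34, -0.45, 0.45, 1.34]
print([round(v, 2) for v in y])      # [-1.68, 0.11, 1.89, 3.68]
```

Carrying full precision through step 4, rather than reusing the rounded x̂ values, gives 0.11 and 1.89 for the two middle entries.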
| Normalization | Statistics Over | Batch-Size Dependent? | Typical Use |
|---|---|---|---|
| Batch Norm | Batch and spatial dims (N, H, W), per channel | Yes | CNNs |
| Layer Norm | Feature dimension (D) | No | Transformers, RNNs |
| Instance Norm | Spatial dims (H, W) | No | Style transfer |
| Group Norm | Channel groups | No | Detection, small batches |
In 2012, Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton entered a CNN in the ImageNet competition and won by a landslide. AlexNet cut the top-5 error rate from 26% to 16.4% — a gap so large that overnight, the entire computer vision community pivoted to deep learning.
AlexNet had 8 learned layers: 5 convolutional + 3 fully connected. By today's standards, it's simple. But it introduced key ideas that still matter:
| Innovation | Impact |
|---|---|
| ReLU activation | 6× faster convergence than tanh, no vanishing gradients in positive region |
| GPU training | Split the model across 2 GPUs — first large-scale GPU training |
| Dropout (p=0.5) | Applied to FC layers, reduced overfitting dramatically |
| Data augmentation | Random crops + horizontal flips + color jitter |
| Local response norm | Normalize across channels (later replaced by batch norm) |
AlexNet used large filters: 11×11 in the first layer, 5×5 in the second. These large filters seem necessary to cover enough spatial context. But VGG showed a better way.
Karen Simonyan and Andrew Zisserman asked a simple question: what if we replace all large filters with stacks of 3×3 filters?
Three stacked 3×3 convolutions (stride 1, padding 1) have the same effective receptive field as one 7×7 convolution. But they have three non-linearities instead of one, making the decision function more discriminative. And they use fewer parameters: 3 × (3²C²) = 27C² vs. 7²C² = 49C² (for C input and C output channels). Deeper and cheaper.
How do we know that three 3×3 layers see 7×7 pixels of the input? Trace it backwards:
Layer 3 (top): one output neuron sees a 3×3 patch of Layer 2's output.
Layer 2 (middle): each of those 9 neurons sees a 3×3 patch of Layer 1's output. The union covers a 5×5 region of Layer 1.
Layer 1 (bottom): each of those 25 neurons sees a 3×3 patch of the input. The union covers a 7×7 region of the input.
In general, L stacked 3×3 convolutions with stride 1 have a receptive field of (2L + 1) × (2L + 1).
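Both the receptive-field formula and the parameter comparison are easy to verify numerically. The helper names below and the choice of C = 64 channels are assumptions for illustration; biases are ignored.

```python
def receptive_field(num_layers, kernel=3):
    """Receptive field of stacked stride-1 convolutions sharing one kernel size."""
    rf = 1
    for _ in range(num_layers):
        rf += kernel - 1   # each layer widens the view by (kernel - 1) pixels
    return rf

def stacked_params(num_layers, channels, kernel=3):
    """Weights in a stack of convs with `channels` inputs and outputs each."""
    return num_layers * kernel * kernel * channels * channels

C = 64
rf_three_3x3 = receptive_field(3)                  # 7, same as a single 7x7 conv
params_stack = stacked_params(3, C)                # 27 * C^2
params_single = stacked_params(1, C, kernel=7)     # 49 * C^2
```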
VGG-16 uses 13 conv layers and 3 FC layers, all with 3×3 filters. The network is clean and regular: stack conv layers, pool to halve spatial dimensions, double the number of filters. This "stack and double" pattern became the default blueprint for CNNs.
The first conv layer: 3×3×3×64 = 1,728 params. The last conv layer: 3×3×512×512 = 2,359,296 params. But the real cost is the first FC layer: 7×7×512×4096 = 102 million parameters. The 3 FC layers alone account for ~124M of VGG-16's 138M total parameters. This is why modern architectures eliminated FC layers in favor of global average pooling.
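The parameter split between conv and FC layers can be reproduced by walking through the layer list. The configuration below is the standard VGG-16 layout (channel counts per conv layer, "M" for 2×2 max pooling); it is not spelled out in the text above, so treat it as an assumption of this sketch. Biases are included, which the per-layer figures above omit.

```python
# Standard VGG-16 layout: numbers are conv output channels, "M" is 2x2 max pooling.
VGG16 = [64, 64, "M", 128, 128, "M", 256, 256, 256, "M",
         512, 512, 512, "M", 512, 512, 512, "M"]

def count_vgg16_params(input_size=224, num_classes=1000):
    conv_total, in_ch, size = 0, 3, input_size
    for v in VGG16:
        if v == "M":
            size //= 2                            # pooling halves height and width
        else:
            conv_total += 3 * 3 * in_ch * v + v   # 3x3 weights plus biases
            in_ch = v
    fc_dims = [size * size * in_ch, 4096, 4096, num_classes]
    fc_total = sum(a * b + b for a, b in zip(fc_dims, fc_dims[1:]))
    return conv_total, fc_total

conv_params, fc_params = count_vgg16_params()
total_params = conv_params + fc_params   # roughly 138M, about 124M of it in FC layers
```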
| Property | AlexNet | VGG-16 |
|---|---|---|
| Layers | 8 (5 conv + 3 FC) | 16 (13 conv + 3 FC) |
| Parameters | 62M | 138M |
| Filter sizes | 11×11, 5×5, 3×3 | All 3×3 |
| Top-5 error | 16.4% | 7.3% |
| Key lesson | CNNs work on GPUs | Depth > width, small filters win |
138 million parameters and 15.5 billion FLOPs for a single forward pass. VGG-16 requires ~528 MB just to store the model weights. This made it impractical for mobile and embedded deployment. Later architectures (GoogLeNet, MobileNet) focused on efficiency without sacrificing accuracy.
While VGG went deeper by stacking identical 3×3 layers, Google took a different approach: go wider. Instead of choosing between a 1×1, 3×3, or 5×5 convolution, why not apply all of them in parallel and let the network decide which is useful?
An Inception module applies four parallel operations to the same input: (1) 1×1 conv, (2) 3×3 conv, (3) 5×5 conv, and (4) 3×3 max pooling. The outputs are concatenated along the channel dimension. The network learns to weight these different spatial scales.
Naive concatenation is extremely expensive. If the input has 256 channels and each branch outputs 256 channels, the concatenated output has 1024 channels. The 5×5 conv branch alone would need 5×5×256×256 ≈ 1.6M parameters, and every one of those weights is applied at every spatial position, so the FLOPs explode.
The fix is brilliantly simple. Before each expensive convolution, apply a 1×1 convolution to reduce the number of channels. A 1×1 conv with 256 input channels and 64 output channels is just a per-pixel linear projection from 256-D to 64-D. It has 256×64 = 16,384 parameters, shared across all spatial positions, and costs 16,384 multiply-adds per position.
A 1×1 convolution is a convolution with a 1×1 spatial kernel. It doesn't look at neighboring pixels at all; it only mixes information across channels at each spatial location. Think of it as a per-pixel fully-connected layer. Used as a channel bottleneck, it reduces 256 channels to 64 so that the expensive 5×5 conv runs on 64 input channels, cutting that conv's cost by 4×.
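To make the "per-pixel fully-connected layer" picture concrete, here is a small sketch of a 1×1 convolution on nested lists. The function name, the 2×2×4 feature map, and the hand-picked weights are all hypothetical.

```python
def conv1x1(feature_map, weight):
    """feature_map: H x W x C_in nested lists; weight: C_out x C_in matrix."""
    # The same weight matrix is applied independently to every pixel's channel vector.
    return [[[sum(w * x for w, x in zip(w_row, pixel)) for w_row in weight]
             for pixel in row]
            for row in feature_map]

fmap = [[[1.0, 2.0, 3.0, 4.0], [0.0, 1.0, 0.0, 1.0]],
        [[4.0, 3.0, 2.0, 1.0], [1.0, 1.0, 1.0, 1.0]]]   # 2 x 2 pixels, 4 channels
weight = [[1.0, 0.0, 0.0, 0.0],       # output channel 0 copies input channel 0
          [0.25, 0.25, 0.25, 0.25]]   # output channel 1 averages all input channels
out = conv1x1(fmap, weight)           # 2 x 2 pixels, 2 channels
```

The spatial size is unchanged and only the channel count shrinks, from 4 to 2 in this toy case.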
Input: 28×28×256. Want: 5×5 conv with 128 output channels.
Without bottleneck: 5×5×256×128 = 819,200 params. FLOPs: 819,200 × 28×28 ≈ 642M.
With 1×1 bottleneck (reduce to 32 channels first):
• 1×1 conv: 256×32 = 8,192 params. FLOPs: 8,192 × 28×28 ≈ 6.4M.
• 5×5 conv: 5×5×32×128 = 102,400 params. FLOPs: 102,400 × 28×28 ≈ 80.3M.
Total: 86.7M FLOPs vs 642M, a 7.4× reduction.
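The same arithmetic as a small script, counting one multiply-accumulate per weight per output position (what the text calls FLOPs). The helper name `conv_cost` is an assumption; biases are ignored.

```python
def conv_cost(kernel, c_in, c_out, height, width):
    """Return (parameters, multiply-accumulates) for a stride-1, same-padded conv."""
    params = kernel * kernel * c_in * c_out
    return params, params * height * width

direct_params, direct_macs = conv_cost(5, 256, 128, 28, 28)   # no bottleneck
reduce_params, reduce_macs = conv_cost(1, 256, 32, 28, 28)    # 1x1 reduction
conv5_params, conv5_macs = conv_cost(5, 32, 128, 28, 28)      # 5x5 on 32 channels
bottleneck_macs = reduce_macs + conv5_macs
speedup = direct_macs / bottleneck_macs                       # about 7.4
```

The exact totals are 642,252,800 multiply-accumulates without the bottleneck and 86,704,128 with it (about 86.7M).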
GoogLeNet stacks 9 Inception modules across 22 layers, uses global average pooling instead of FC layers (reducing the parameter count from VGG's 138M to just 5 million), and adds auxiliary classifiers at intermediate layers to provide additional gradient signal to early layers.
GoogLeNet attaches small classifier heads at intermediate points in the network. During training, these contribute to the loss (weighted by 0.3). The idea: provide gradient signal directly to early layers, combating the vanishing gradient problem. At test time, auxiliary classifiers are removed. Later work showed they function more as regularizers than gradient aids.
| Property | VGG-16 | GoogLeNet |
|---|---|---|
| Layers | 16 | 22 |
| Parameters | 138M | 5M |
| FLOPs | 15.5B | 1.5B |
| Top-5 error | 7.3% | 6.7% |
| Key idea | Depth + small filters | Parallel multi-scale + bottleneck |
Different objects in an image exist at different scales. A 1×1 conv captures fine-grained per-pixel features (texture, color). A 3×3 conv captures local patterns (edges, corners). A 5×5 conv captures slightly larger structures (object parts). By processing all scales in parallel, the Inception module lets each spatial location combine information at multiple resolutions. The network learns which scales matter for each class.
Here is the moment that changed everything. In 2015, Kaiming He and colleagues at Microsoft Research published a paper with a deceptively simple finding: a 56-layer plain CNN performs worse than a 20-layer one, even on the training set. This isn't overfitting — it's the degradation problem.
If a 20-layer network achieves a certain training error, a 56-layer network should do at least as well. Why? Because the 56-layer network could learn to copy the 20-layer solution by setting the extra 36 layers to identity mappings (output = input). The fact that it doesn't find this solution means the optimizer can't navigate the loss landscape of deep plain networks.
The 56-layer network has higher training error than the 20-layer one. Both training and test error are worse. This rules out overfitting (which would show low training error but high test error). The problem is purely about optimization: gradient descent gets lost in the deep landscape.
The fix is startlingly elegant. Instead of learning H(x) directly, make each block learn the residual F(x) = H(x) − x, and add x back: H(x) = F(x) + x. This is implemented as a "skip connection" that adds the input directly to the block's output.
Why does this help? If the optimal transformation is close to the identity (the extra layer isn't needed), the network just needs to push F(x) toward zero. Learning "change nothing" (F(x) = 0) is much easier than learning the identity mapping through a stack of nonlinear layers.
Learning residuals is easier than learning full mappings. If the identity is a good starting point, the network only needs to learn small perturbations (F(x) ≈ 0). Without the skip connection, the network must learn to pass information through nonlinear layers unchanged — which SGD finds surprisingly difficult.
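A toy sketch of why "change nothing" is easy for a residual block: with the residual branch's weights at zero, the block is exactly the identity, while a plain block with the same weights destroys its input. The two-layer branch F(x) = W2·ReLU(W1·x) and the function names are assumptions, and the ReLU that ResNet applies after the addition is omitted for clarity.

```python
def matvec(W, x):
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def relu(x):
    return [max(0.0, v) for v in x]

def plain_block(x, W1, W2):
    """H(x) computed directly by the layers."""
    return matvec(W2, relu(matvec(W1, x)))

def residual_block(x, W1, W2):
    """H(x) = F(x) + x, where F is the same two-layer branch."""
    f = plain_block(x, W1, W2)
    return [fi + xi for fi, xi in zip(f, x)]

zeros = [[0.0] * 3 for _ in range(3)]
x = [1.0, -2.0, 3.0]
identity_out = residual_block(x, zeros, zeros)   # equals x: F(x) = 0
plain_out = plain_block(x, zeros, zeros)         # all zeros: the input is lost
```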
Consider the gradient flowing backward through a residual block. Since H(x) = F(x) + x, the chain rule gives ∂L/∂x = ∂L/∂H · (1 + ∂F/∂x).
In a plain network, the block computes H(x) = F(x), so the gradient is ∂L/∂x = ∂L/∂H · ∂F/∂x. If ∂F/∂x is small at each layer, these factors multiply across layers and the gradient shrinks exponentially with depth. With the skip connection, you always have the additive 1. Even if ∂F/∂x is zero, the gradient is ∂L/∂H · 1 = ∂L/∂H. The gradient has a highway through the skip connections that bypasses all the nonlinear layers.
Consider L stacked residual blocks. The gradient from the loss to the input is:
∂L/∂x_0 = ∂L/∂x_L · ∏_{l=1}^{L} (1 + ∂F_l/∂x_{l−1})
Expanding this product gives 2^L terms, each corresponding to a different path through the network. Some paths go through all residual functions; some skip them all (using only skip connections). Even if many ∂F_l/∂x_{l−1} terms are small, the path that uses all skip connections contributes a gradient of ∂L/∂x_L · 1 · 1 · ... · 1 = ∂L/∂x_L. There's always at least one high-gradient path.
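A scalar toy model makes the contrast concrete. Treating each ∂F_l/∂x_{l−1} as a single number is a strong simplification (in a real network these are Jacobian matrices), and the value 0.1 per layer is an arbitrary assumption.

```python
import math

def plain_gradient_scale(local_grads):
    """Product of per-layer derivatives in a plain network."""
    return math.prod(local_grads)

def residual_gradient_scale(local_grads):
    """Product of (1 + per-layer derivative) with skip connections."""
    return math.prod(1.0 + g for g in local_grads)

local = [0.1] * 20                                  # 20 layers, small local gradients
plain = plain_gradient_scale(local)                 # about 1e-20: vanished
resid = residual_gradient_scale(local)              # about 6.7: still usable
all_dead = residual_gradient_scale([0.0] * 20)      # exactly 1.0: the skip-only path
```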
ResNet stacks residual blocks in four stages. Each stage after the first doubles the number of channels and halves the spatial resolution (via a stride-2 convolution in its first block). When dimensions change, a 1×1 conv with stride 2 is applied to the skip connection to match dimensions.
ResNet-34, stage by stage. Input: 224×224×3. Stem: 7×7 conv stride 2 → 112×112×64, then 3×3 max pool stride 2 → 56×56×64.
Stage 1: 3 blocks of [3×3, 64; 3×3, 64] + skip. Output: 56×56×64.
Stage 2: 4 blocks of [3×3, 128; 3×3, 128] + skip. First block: stride 2. Output: 28×28×128.
Stage 3: 6 blocks of [3×3, 256; 3×3, 256] + skip. First block: stride 2. Output: 14×14×256.
Stage 4: 3 blocks of [3×3, 512; 3×3, 512] + skip. First block: stride 2. Output: 7×7×512.
Global average pool → 1×1×512 → FC 1000. Total: 21.8M params (about 21.3M in the convolutional backbone plus 0.5M in the classifier). No FC hidden layers, just one classification layer.
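The parameter total for this layout (3, 4, 6, 3 basic blocks, which is ResNet-34) can be tallied directly. The sketch assumes bias-free convolutions each followed by a batch norm with two parameters per channel, and a 1×1 projection with its own batch norm wherever the channel count changes; these are common implementation choices rather than details stated above.

```python
def conv(kernel, c_in, c_out):
    return kernel * kernel * c_in * c_out   # no bias: a batch norm follows each conv

def bn(channels):
    return 2 * channels                     # learnable gamma and beta

def resnet34_params(num_classes=1000):
    total = conv(7, 3, 64) + bn(64)         # stem
    c_in = 64
    for c_out, blocks in [(64, 3), (128, 4), (256, 6), (512, 3)]:
        for _ in range(blocks):
            total += conv(3, c_in, c_out) + bn(c_out)
            total += conv(3, c_out, c_out) + bn(c_out)
            if c_in != c_out:               # projection on the skip connection
                total += conv(1, c_in, c_out) + bn(c_out)
            c_in = c_out
    backbone = total
    classifier = 512 * num_classes + num_classes
    return backbone, backbone + classifier

backbone_params, total_params = resnet34_params()
```

Under these assumptions the backbone comes to about 21.3M parameters and the full network, including the 1000-way classifier, to about 21.8M.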
For deeper networks (50, 101, 152 layers), each residual block uses a bottleneck design: 1×1 conv (reduce channels) → 3×3 conv → 1×1 conv (restore channels). This reduces computation while maintaining the same number of blocks.
| Model | Layers | Params | Top-5 Error | Block Type |
|---|---|---|---|---|
| ResNet-18 | 18 | 11.7M | 10.9% | Basic (two 3×3) |
| ResNet-34 | 34 | 21.8M | 7.4% | Basic |
| ResNet-50 | 50 | 25.6M | 6.7% | Bottleneck |
| ResNet-101 | 101 | 44.5M | 6.0% | Bottleneck |
| ResNet-152 | 152 | 60.2M | 5.7% | Bottleneck |
ResNet opened the floodgates. Once skip connections solved the degradation problem, researchers explored variations on the theme. Three post-ResNet architectures stand out for their distinct ideas.
ResNeXt replaces the single 3×3 conv in each block with grouped convolutions: split the channels into 32 groups, apply separate 3×3 convolutions to each group, then concatenate. This is equivalent to having 32 parallel pathways (or "cardinality" C=32) within each block.
Cardinality is the number of parallel transformation paths within a single block. ResNet has cardinality 1 (one path). ResNeXt uses cardinality 32. The ResNeXt authors report that increasing cardinality is more effective than increasing depth or width for the same parameter budget.
Input: 256 channels. ResNeXt block with C=32, bottleneck width d=4:
32 parallel paths, each: 1×1 conv (256 → 4) → 3×3 conv (4 → 4) → 1×1 conv (4 → 256).
Each path has 256×4 + 9×16 + 4×256 = 2,192 params. Total: 32 × 2,192 = 70,144 params.
This is similar in cost to a standard ResNet bottleneck (69,632) but the multiple pathways provide richer representations.
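The two parameter counts follow from one formula applied with different widths. The helper name is an assumption; biases and batch-norm parameters are ignored, as in the arithmetic above.

```python
def bottleneck_params(c_in, width):
    """1x1 reduce, 3x3 at the reduced width, 1x1 restore."""
    return c_in * width + 3 * 3 * width * width + width * c_in

resnet_block = bottleneck_params(256, 64)         # one path of width 64
resnext_path = bottleneck_params(256, 4)          # one of 32 paths of width 4
resnext_block = 32 * resnext_path
```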
DenseNet takes skip connections to the extreme: instead of adding the input to the output, it concatenates the input to the output. And not just from the previous layer — each layer receives the feature maps of all preceding layers in the same dense block.
This extreme connectivity has two benefits: (1) feature reuse — later layers can directly access early features without them being washed out by nonlinearities, and (2) parameter efficiency — because features are reused rather than re-learned, each layer can be very narrow (e.g., only 12 or 24 new channels per layer, called the "growth rate").
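Because each layer concatenates its new feature maps onto everything before it, channel counts inside a dense block grow linearly. The sketch below tracks that growth; the initial width of 24 channels and the 6-layer block are hypothetical values, with growth rate 12 taken from the text.

```python
def dense_block_channels(c0, growth_rate, num_layers):
    """Channels seen by each layer's input, plus the block's final output width."""
    inputs = [c0 + layer * growth_rate for layer in range(num_layers)]
    return inputs, c0 + num_layers * growth_rate

layer_inputs, block_output = dense_block_channels(c0=24, growth_rate=12, num_layers=6)
# layer_inputs -> [24, 36, 48, 60, 72, 84]; block_output -> 96
```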
Squeeze-and-Excitation Networks add a lightweight attention mechanism that learns to weight channels differently. After a conv layer, SE-Net: (1) squeezes the spatial dimensions via global average pooling to get a per-channel descriptor, then (2) excites by passing this descriptor through two FC layers (bottleneck → expand) with a sigmoid, producing per-channel weights that rescale the original feature map.
Not all channels are equally useful for a given input. An image of a red car might benefit from up-weighting color channels and down-weighting texture channels. SE-Net learns to dynamically re-weight channels based on the global content of the current input. The cost is small: two small FC layers (reduction ratio r=16) per block.
Input: 7×7×512. Squeeze: Global average pool → 1×1×512. Excite: FC 512 → 32 (ReLU) → FC 32 → 512 (sigmoid) = channel weights w ∈ [0,1]^512. Scale: multiply each channel by its weight. Extra params for this block: 512×32 + 32×512 = 32,768 (~0.13% of ResNet-50's 25.6M).
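The squeeze, excite, and scale steps can be sketched on a tiny feature map. Everything concrete here is an assumption for illustration: the function name, the toy sizes (4 channels, reduction ratio 2), the random weights, and the omission of biases.

```python
import math
import random

def se_block(fmap, w_reduce, w_expand):
    """Squeeze-and-excitation on a C x H x W feature map (nested lists, no biases)."""
    # Squeeze: global average pool each channel down to one number.
    squeezed = [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in fmap]
    # Excite: bottleneck FC with ReLU, then expanding FC with sigmoid.
    hidden = [max(0.0, sum(w * s for w, s in zip(row, squeezed))) for row in w_reduce]
    scales = [1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
              for row in w_expand]
    # Scale: multiply every value in channel c by scales[c].
    scaled = [[[v * s for v in row] for row in ch] for ch, s in zip(fmap, scales)]
    return scaled, scales

rng = random.Random(0)
C, R = 4, 2   # toy sizes: 4 channels, reduction ratio 2
fmap = [[[rng.uniform(0, 1) for _ in range(2)] for _ in range(2)] for _ in range(C)]
w_reduce = [[rng.uniform(-1, 1) for _ in range(C)] for _ in range(C // R)]
w_expand = [[rng.uniform(-1, 1) for _ in range(C // R)] for _ in range(C)]
scaled, scales = se_block(fmap, w_reduce, w_expand)

extra_params = 512 * 32 + 32 * 512   # weights for the 7x7x512 example above
```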
| Architecture | Key Idea | Connectivity | Params (ImageNet) | Top-5 Error |
|---|---|---|---|---|
| ResNet-50 | Residual (add) | x + F(x) | 25.6M | 6.7% |
| ResNeXt-50 | Grouped (cardinality) | x + Σ Fi(x) | 25.0M | 5.6% |
| DenseNet-121 | Dense (concat) | concat(x0..l) | 8.0M | 6.2% |
| SE-ResNet-50 | Channel attention | x + s · F(x) | 28.1M | 5.5% |
There is no single "best" architecture. AlexNet is fast but inaccurate. VGG is accurate but huge. MobileNet is tiny but less accurate. EfficientNet and ConvNeXt represent the current Pareto frontier: the best accuracy for a given compute budget. The right choice depends on your deployment constraints — a self-driving car has different requirements than a phone app.
The architectures we've seen so far were designed for server-side inference with powerful GPUs. But what about deploying CNNs on phones, drones, or embedded devices with 100× less compute? This drove a wave of innovation in efficient architectures.
A standard 3×3 convolution with C_in input channels and C_out output channels costs 3×3×C_in×C_out parameters. MobileNet splits this into two steps:
Step 1 (Depthwise): Apply a separate 3×3 filter to each input channel independently. Cost: 3×3×C_in. This captures spatial patterns within each channel.
Step 2 (Pointwise): Apply a 1×1 convolution to mix channels. Cost: C_in×C_out. This captures cross-channel patterns.
Total cost: 9·C_in + C_in×C_out vs. 9×C_in×C_out for standard conv. The ratio is 1/C_out + 1/9, so for typical channel counts the separable version needs about 8–9× fewer parameters and FLOPs.
Input: 14×14 with 256 channels. Output: 256 channels. 3×3 conv.
Standard: 3×3×256×256 = 589,824 params. FLOPs: 589,824 × 14×14 ≈ 115.6M.
Depthwise separable: Depthwise: 9×256 = 2,304. Pointwise: 256×256 = 65,536. Total: 67,840 params — 8.7× fewer. FLOPs: 67,840 × 196 ≈ 13.3M.
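The comparison generalizes to any channel count with two one-line formulas. The function names are assumptions; biases are ignored.

```python
def standard_conv_params(c_in, c_out, kernel=3):
    return kernel * kernel * c_in * c_out

def depthwise_separable_params(c_in, c_out, kernel=3):
    depthwise = kernel * kernel * c_in   # one k x k filter per input channel
    pointwise = c_in * c_out             # 1x1 conv that mixes channels
    return depthwise + pointwise

standard = standard_conv_params(256, 256)
separable = depthwise_separable_params(256, 256)
reduction = standard / separable                  # about 8.7
cost_ratio = separable / standard                 # equals 1/C_out + 1/9
```

As the number of output channels grows, the reduction approaches but never exceeds 9× for a 3×3 kernel.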
MobileNetV2 (2018) introduced the inverted residual block: expand channels with 1×1 conv, apply depthwise 3×3 conv on the expanded channels, then project back down with 1×1 conv. This is "inverted" compared to the ResNet bottleneck (which compresses first, then expands). The skip connection goes between the narrow (compressed) representations, not the wide ones.
How should you scale up a CNN for better accuracy? You could go deeper (more layers), wider (more channels), or use higher-resolution inputs. Previous work scaled one dimension at a time. Tan and Le (2019) showed that scaling all three dimensions together with a fixed ratio works dramatically better.
If you make a network deeper without making it wider, the extra layers don't have enough channels to extract useful features. If you use higher resolution without going deeper, the network can't capture the larger-scale patterns that higher resolution reveals. All three dimensions are correlated: scaling them together lets each dimension complement the others.
EfficientNet-B0 starts from a MobileNetV2-like architecture (designed via neural architecture search). A single compound coefficient φ then scales depth, width, and resolution together, and setting φ = 0 through 7 produces B0 through B7. EfficientNet-B7 achieves 84.3% top-1 accuracy on ImageNet with 66M parameters, comparable to the much larger GPipe (557M params) that was state-of-the-art at the time.
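A rough sketch of compound scaling: a single coefficient φ raises fixed per-dimension bases to the same power, so depth, width, and resolution grow together. The base values α = 1.2, β = 1.1, γ = 1.15 are the ones reported in the EfficientNet paper and are not given in this text, so treat them as an outside assumption; they are chosen so that α·β²·γ² is close to 2, meaning compute roughly doubles with each unit of φ.

```python
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15   # depth, width, resolution bases (assumed)

def compound_scale(phi):
    """Multipliers for depth, width, resolution, and approximate FLOPs at a given phi."""
    depth = ALPHA ** phi
    width = BETA ** phi
    resolution = GAMMA ** phi
    # FLOPs grow linearly with depth and quadratically with width and resolution.
    flops = depth * width ** 2 * resolution ** 2
    return depth, width, resolution, flops

baseline = compound_scale(0)   # (1.0, 1.0, 1.0, 1.0): the unscaled network
one_step = compound_scale(1)   # FLOPs multiplier a little under 2
```

Published model variants round these multipliers to whole layers and channel counts, so the actual networks do not follow the formula exactly.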
Vision Transformers (ViTs) overtook CNNs on many benchmarks. But was this because of self-attention, or because of modern training techniques (larger datasets, stronger augmentation, different learning rate schedules)? ConvNeXt answers this question by modernizing a standard ResNet using only ideas from the Transformer literature — without self-attention.
ConvNeXt takes a ResNet-50 and applies a series of modifications, each borrowed from Transformers:
| Modification | Inspired By | Effect |
|---|---|---|
| Larger kernel (7×7 depthwise conv) | ViT's large attention receptive field | +0.5% accuracy |
| Fewer activation functions (GELU, only after pointwise) | Transformer blocks have fewer nonlinearities | +0.7% |
| Layer Norm instead of Batch Norm | Transformers use LayerNorm | +0.1% |
| Inverted bottleneck (expand then narrow) | MLP block in Transformers: 4× expansion | +0.2% |
| Fewer normalization layers (only one per block) | Transformer has one LN per sub-block | +0.1% |
| Patchify stem (4×4 non-overlapping conv stride 4) | ViT's patch embedding | +0.2% |
The result: ConvNeXt-T (29M params) reaches 82.1% top-1 on ImageNet using only convolutions, slightly ahead of the similarly sized Swin Transformer-T (81.3%). ConvNeXt-XL scales to 87.8% top-1, competitive with the best Transformers.
ViTs didn't win because self-attention is inherently superior to convolution. They won because the Transformer community developed better training recipes, scaling strategies, and architectural details. When you transplant these advances back to CNNs, convolutions are competitive again. The lesson: training methodology matters as much as architecture.
| Year | Architecture | Key Innovation | Params | Top-1 Accuracy |
|---|---|---|---|---|
| 2012 | AlexNet | GPU training + ReLU | 62M | 63.3% |
| 2014 | VGG-16 | Small 3×3 filters | 138M | 74.4% |
| 2014 | GoogLeNet | Inception + 1×1 conv | 5M | 74.8% |
| 2015 | ResNet-50 | Skip connections | 25.6M | 76.1% |
| 2017 | ResNeXt-50 | Grouped convolutions | 25.0M | 77.8% |
| 2017 | MobileNetV1 | Depthwise separable conv | 4.2M | 70.6% |
| 2017 | DenseNet-121 | Dense connectivity | 8.0M | 74.4% |
| 2018 | MobileNetV2 | Inverted residuals | 3.4M | 72.0% |
| 2019 | EfficientNet-B0 | Compound scaling | 5.3M | 77.1% |
| 2022 | ConvNeXt-T | Modernized ResNet | 29M | 82.1% |
Small dataset, similar domain: Freeze a pretrained ResNet/ConvNeXt backbone, train only a new classification head. This is a linear probe — the backbone is a fixed feature extractor.
Small dataset, different domain: Try a different pretrained model or collect more data. Features from ImageNet may not transfer well to medical imaging or satellite data.
Large dataset: Fine-tune the entire model end-to-end. Initialize from pretrained weights, use a small learning rate (10× smaller than training from scratch), and let all layers adapt to your data.
From AlexNet to ConvNeXt, the story of CNNs is the story of making depth work: small filters for receptive fields, skip connections for gradient flow, bottlenecks for efficiency, and training tricks for everything else.