Why can't we just stack more layers? This paper found the answer — and a breathtakingly simple fix that made 152-layer networks trainable.
By 2015, everyone agreed: deeper neural networks learn richer features. AlexNet had 8 layers. VGGNet pushed to 19. Each time we added layers, we got better results on ImageNet. The pattern seemed obvious — just stack more layers.
So researchers tried 56 layers. And something bizarre happened.
The 56-layer network performed worse than the 20-layer network. Not just on test data — on training data. This was not overfitting. The deeper network was genuinely worse at learning, even on the data it was directly optimizing on.
This paper by Kaiming He and colleagues at Microsoft Research diagnosed the problem and proposed a fix so elegant it fits in one line of math. Their residual networks (ResNets) trained with 152 layers, 8× deeper than VGG, and an ensemble of them won the ILSVRC 2015 classification task with 3.57% top-5 error. The idea is simple enough to implement in a single line of code: output = F(x) + x.
Let's build the intuition from scratch.
Let's be precise about what goes wrong. When you train a neural network, you use backpropagation: you compute the loss at the output, then propagate gradients backward through every layer so each layer knows how to adjust its weights.
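For a network with L layers, the gradient of the loss with respect to the first layer's weights passes through all L layers during backpropagation. At each layer, the gradient is multiplied by that layer's local derivative. If those derivatives are consistently less than 1, the gradient shrinks exponentially. Writing $a_k$ for the activations of layer k:

$$\frac{\partial \text{Loss}}{\partial W_1} = \frac{\partial \text{Loss}}{\partial a_L} \cdot \prod_{k=2}^{L} \frac{\partial a_k}{\partial a_{k-1}} \cdot \frac{\partial a_1}{\partial W_1}$$

Each factor $\partial a_k / \partial a_{k-1}$ is typically less than 1. Multiply roughly L of them together and the product vanishes. This is the vanishing gradient problem: the early layers receive essentially zero gradient signal and stop learning.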
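To get a feel for how quickly such a product collapses, here is a minimal sketch. It assumes a toy scalar chain in which every layer is a bare sigmoid with unit weight; that setup, the starting value `a0`, and the function name `chain_gradient` are illustrative choices, not anything taken from the paper. The sigmoid's derivative is never above 0.25, so the product shrinks at least that fast per layer.

```python
import math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def chain_gradient(depth: int, a0: float = 0.5) -> float:
    """Product of local derivatives da_k/da_{k-1} for a_k = sigmoid(a_{k-1})."""
    a, grad = a0, 1.0
    for _ in range(depth):
        a = sigmoid(a)         # a is now a_k
        grad *= a * (1.0 - a)  # sigmoid'(a_{k-1}) = a_k * (1 - a_k), at most 0.25
    return grad

for depth in (5, 10, 20):
    print(depth, chain_gradient(depth))
```

Real networks have weight matrices and other activations, so the actual decay rate differs, but the multiplicative structure is the same.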
By 2015, clever initialization (Xavier, He init) and batch normalization had largely tamed the vanishing gradient problem. Networks of 20-30 layers could train fine. But a subtler issue remained.
The paper's key empirical finding (its Figure 1): on CIFAR-10, a 56-layer plain network ended with higher training error and higher test error than a 20-layer plain network. Batch normalization and proper initialization were already in use and did not prevent this, so the degradation points to an optimization difficulty rather than to overfitting or a lack of representational capacity.
| Network | Layers | Training Error | Test Error |
|---|---|---|---|
| Plain-20 | 20 | Lower | Lower |
| Plain-56 | 56 | Higher | Higher |

The deeper plain network is worse on both training and test error.
Here is the argument that makes the degradation problem so puzzling. It is a proof by construction that deeper networks should never be worse.
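Suppose you have a trained 20-layer network that achieves 6% error. Now you want to build a 56-layer network. Here is one way to do it: copy the 20 trained layers unchanged, then set each of the 36 additional layers to the identity mapping, so that it passes its input through untouched.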
This constructed solution exists. The 56-layer network has at least one parameter setting that is as good as the 20-layer network. Therefore, the 56-layer network should achieve error at most 6%. In practice, it might do even better by using those 36 extra layers to learn useful refinements.
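The construction argument can be checked numerically. The sketch below is a toy version under stated assumptions: fully connected ReLU layers instead of convolutions, random weights standing in for the "trained" 20 layers, and a hypothetical width of 8. Because activations after a ReLU are non-negative, an extra layer with identity weights followed by ReLU leaves them unchanged.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(z):
    return np.maximum(z, 0.0)

def forward(x, weights):
    """Plain ReLU network: every layer computes relu(W @ h)."""
    h = x
    for W in weights:
        h = relu(W @ h)
    return h

width = 8
# Stand-in for the 20 trained layers (random weights, assumed for illustration).
shallow = [rng.normal(scale=np.sqrt(2.0 / width), size=(width, width)) for _ in range(20)]
# 36 extra layers whose weights are the identity matrix.
extra = [np.eye(width) for _ in range(36)]
deep = shallow + extra

x = rng.normal(size=width)
print(np.allclose(forward(x, shallow), forward(x, deep)))  # True
```

So a 56-layer solution matching the 20-layer one exists in parameter space; the open question is whether the optimizer reaches it.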
But the optimizer does not find this solution. It gets stuck somewhere worse. The problem is not that the solution doesn't exist — it's that stochastic gradient descent cannot find it.
This is the insight that drives the entire paper. The authors do not change the optimizer. They do not change the loss function. They change the architecture to make identity mappings trivially easy to represent.
The problem is clear: we need those extra layers to learn the identity function when that's optimal, and learn something useful when it's not. But learning the identity function with a stack of nonlinear layers (convolution → batch norm → ReLU) turns out to be surprisingly hard.
Here is the elegant fix. Instead of asking a block of layers to learn the desired output H(x) directly, ask it to learn the residual:

$$F(x) = H(x) - x$$

Then the actual output is:

$$H(x) = F(x) + x$$
Why does this help? Think about what happens when the identity mapping is optimal — when the best thing the block can do is pass the input through unchanged. In the original formulation, the layers must learn H(x) = x, which means fitting an identity function with nonlinear layers. In the residual formulation, the layers only need to learn F(x) = 0 — push all weights toward zero. That is dramatically easier for an optimizer.
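A small sketch makes the contrast concrete. It assumes a simplified block of two fully connected layers with one ReLU in between, and it omits the ReLU that the paper applies after the addition; the function names are illustrative. With all weights at zero, the plain block destroys its input while the residual block is exactly the identity.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def plain_block(x, W1, W2):
    """Learns H(x) directly: two weight layers, no shortcut."""
    return W2 @ relu(W1 @ x)

def residual_block(x, W1, W2):
    """Learns F(x) and adds the input back: H(x) = F(x) + x."""
    return plain_block(x, W1, W2) + x

x = np.array([1.0, -2.0, 3.0])
zeros = np.zeros((3, 3))

print(plain_block(x, zeros, zeros))     # all zeros: the input is lost
print(residual_block(x, zeros, zeros))  # equals x: the block is the identity
```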
Think of it like this. You are editing a document. The original formulation says: "rewrite the entire document from scratch every time." The residual formulation says: "here is the current document; tell me what to change." When no changes are needed, the editor does nothing. When small corrections are needed, it makes them. This is far more efficient.
The authors observed (their Figure 7) that in trained ResNets, the learned residual functions F(x) tend to have small magnitudes. The layers are learning small perturbations around the identity, not entirely new functions. The residual formulation matches the actual structure of the solution.
How do you actually implement H(x) = F(x) + x in a neural network? With a shortcut connection — a wire that skips over the block of layers and adds the input directly to the output.
The formal equation for a building block:

$$y = F(x, \{W_i\}) + x$$

where $F(x, \{W_i\})$ represents the stack of two or three convolutional layers. The shortcut connection is the "+ x" term. It adds zero extra parameters and essentially zero extra computation (just an element-wise addition).
There is one subtlety. The dimensions of x and F(x) must match for the addition to work. When they do match (same spatial size, same number of channels), the shortcut is a pure identity. When dimensions change (e.g., downsampling or increasing channels), the paper considers two options:
| Option | Approach | Extra Parameters? |
|---|---|---|
| A: Zero Padding | Pad the shortcut with zeros to match channels; stride 2 to match spatial size | None |
| B: Projection | Use a 1×1 convolution: y = F(x) + W_s x | Small (1×1 conv) |
The paper found that projection shortcuts are only marginally better than zero-padding. The identity shortcut is sufficient. This is important — it means the improvement comes from the architecture, not from extra parameters.
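The two shortcut options can be sketched on a single feature map. This assumes a channels-first layout of shape (channels, height, width), spatial subsampling by plain striding, and randomly drawn projection weights `Ws`; the function names are hypothetical. A strided 1×1 convolution is written here as subsampling followed by a per-pixel linear map over channels.

```python
import numpy as np

def shortcut_zero_pad(x, out_channels, stride=2):
    """Option A: subsample spatially, then pad the new channels with zeros."""
    x = x[:, ::stride, ::stride]
    pad = out_channels - x.shape[0]
    return np.pad(x, ((0, pad), (0, 0), (0, 0)))

def shortcut_projection(x, Ws, stride=2):
    """Option B: strided 1x1 convolution, a per-pixel linear map Ws over channels."""
    x = x[:, ::stride, ::stride]
    return np.einsum("oc,chw->ohw", Ws, x)

rng = np.random.default_rng(0)
x = rng.normal(size=(64, 8, 8))        # 64 channels, 8x8 feature map
Ws = rng.normal(size=(128, 64)) * 0.1  # 1x1 conv weights: 128 * 64 extra parameters

a = shortcut_zero_pad(x, out_channels=128)
b = shortcut_projection(x, Ws)
print(a.shape, b.shape)  # both (128, 4, 4), ready to be added to F(x)
```

Option A keeps the first 64 channels as an untouched copy of the subsampled input, which is why it needs no parameters.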
The paper introduces two types of residual blocks, used at different depths.
Basic Block (used in ResNet-18 and ResNet-34): Two 3×3 convolutional layers stacked together.
Bottleneck Block (used in ResNet-50, 101, and 152): Three layers — a 1×1 conv to reduce channels, a 3×3 conv to process, and a 1×1 conv to restore channels.
Why the bottleneck? A 3×3 convolution on 256 channels costs 256 × 256 × 3 × 3 ≈ 590K multiplies per spatial position. The bottleneck first reduces to 64 channels (1×1 conv), applies the expensive 3×3 conv on only 64 channels (64 × 64 × 9 ≈ 37K), then expands back to 256 (1×1 conv). Total: about 70K — an 8× reduction in computation.
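The arithmetic above is easy to reproduce. The helper below is a hypothetical convenience function that counts multiplies per output spatial position and ignores biases, batch normalization, and the addition itself.

```python
def conv_multiplies(c_in: int, c_out: int, k: int) -> int:
    """Multiplies per output spatial position for a k x k convolution (bias ignored)."""
    return c_in * c_out * k * k

plain_3x3 = conv_multiplies(256, 256, 3)
bottleneck = (
    conv_multiplies(256, 64, 1)    # 1x1 reduce: 256 -> 64
    + conv_multiplies(64, 64, 3)   # 3x3 on the narrow representation
    + conv_multiplies(64, 256, 1)  # 1x1 restore: 64 -> 256
)
print(plain_3x3, bottleneck, round(plain_3x3 / bottleneck, 1))  # 589824 69632 8.5
```

The two 1×1 convolutions together cost about as much as the narrow 3×3, yet the whole block is still roughly 8× cheaper than one full-width 3×3 convolution.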
| Model | Block Type | Layers | Parameters | FLOPs (billions) |
|---|---|---|---|---|
| ResNet-18 | Basic | 18 | 11.7M | 1.8 |
| ResNet-34 | Basic | 34 | 21.8M | 3.6 |
| ResNet-50 | Bottleneck | 50 | 25.6M | 3.8 |
| ResNet-101 | Bottleneck | 101 | 44.5M | 7.6 |
| ResNet-152 | Bottleneck | 152 | 60.2M | 11.3 |
| VGG-19 | N/A | 19 | 144M | 19.6 |
Notice: ResNet-152 has fewer parameters and FLOPs than VGG-19, despite being 8× deeper. The bottleneck design is extremely parameter-efficient.
The residual formulation has a beautiful consequence for gradient flow during backpropagation. Let's derive it.
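Consider a chain of residual blocks with identity shortcuts (ignoring any nonlinearity applied after the addition). The output of block l is:

$$x_{l+1} = x_l + F(x_l, W_l)$$

Unrolling this recursion from layer l to a deeper layer L:

$$x_L = x_l + \sum_{i=l}^{L-1} F(x_i, W_i)$$

Now compute the gradient of the loss with respect to $x_l$ using the chain rule:

$$\frac{\partial \text{Loss}}{\partial x_l} = \frac{\partial \text{Loss}}{\partial x_L} \cdot \left(1 + \frac{\partial}{\partial x_l} \sum_{i=l}^{L-1} F(x_i, W_i)\right)$$

The crucial term is that 1. Regardless of what happens in the F terms, the gradient always has a direct path from the loss to layer l through the identity connections. It always includes the unattenuated term $\partial \text{Loss} / \partial x_L \cdot 1$, so it can vanish only if the F terms happen to cancel that term exactly, which is unlikely in practice.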
In a plain network with L layers, the gradient from loss to the first layer is a product of L terms. If each term averages 0.9, the gradient after 50 layers is 0.9^50 ≈ 0.005. After 100 layers: 0.9^100 ≈ 0.00003.
In a residual network, the gradient is a sum that always includes a direct "1" term. Even if the F-block derivatives are small, the total gradient remains meaningful. This is the fundamental mechanism that lets ResNets train with 100+ layers.
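The contrast can be sketched for a scalar chain. In the scalar case the residual gradient is the product of the factors (1 + F'), which expands to 1 plus further terms, matching the "direct 1 term" described above. The assumptions here are loud ones: each layer's local derivative is drawn uniformly from ±0.1, the same draws are used for both wirings, and the function name `end_to_end_gradients` and the seed are illustrative.

```python
import random

def end_to_end_gradients(depth: int, seed: int = 0):
    """Return (plain, residual) end-to-end gradients for the same local derivatives."""
    rng = random.Random(seed)
    plain, residual = 1.0, 1.0
    for _ in range(depth):
        d = rng.uniform(-0.1, 0.1)  # small local derivative of one layer / one F block
        plain *= d                  # plain chain: x_{k+1} = f(x_k)
        residual *= 1.0 + d         # residual chain: x_{k+1} = x_k + F(x_k)
    return plain, residual

for depth in (10, 50, 100):
    plain, residual = end_to_end_gradients(depth)
    print(f"depth={depth:3d}  plain={abs(plain):.2e}  residual={residual:.3f}")
```

With identical per-layer derivatives, the plain product collapses toward zero while every residual factor stays within 10% of 1, so the end-to-end gradient remains of usable size.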
Now you can see the difference for yourself. The simulation below shows signal and gradient flowing through two networks: a plain network and a residual network with identity shortcuts.
Adjust the depth with the slider. In the plain network, watch the gradient magnitude collapse as depth increases. In the residual network, the shortcut connections preserve gradient flow no matter how deep the network gets.
As you push depth beyond 20 in the plain network, the gradient magnitude at the first layer drops to near zero — the early layers have stopped learning. The residual network maintains strong gradient signal at any depth, because the identity shortcuts act as gradient highways.
This is exactly what He et al. observed in their CIFAR-10 experiments: a 1202-layer ResNet trained successfully (though it overfit slightly due to the small dataset size). No plain network of comparable depth can even converge.
The paper is a masterclass in controlled experimentation. Every claim is backed by fair comparisons where only one variable changes at a time.
Experiment 1: Plain Networks Degrade, ResNets Don't.
On ImageNet, a 34-layer plain network has higher validation error than an 18-layer plain network. But a 34-layer ResNet has lower error than an 18-layer ResNet. The shortcut connections completely reverse the degradation phenomenon.
| Network | Top-1 Error | Trend |
|---|---|---|
| Plain-18 | 27.94% | ↑ worse with depth |
| Plain-34 | 28.54% | |
| ResNet-18 | 27.88% | ↓ better with depth |
| ResNet-34 | 25.03% | |
Experiment 2: Scaling to Extreme Depth.
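With bottleneck blocks, the authors scaled to 152 layers. The results on ImageNet are below (the single-model rows appear to mix evaluation protocols, with the VGG-16 figures matching the paper's 10-crop table, so treat the comparison as approximate):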
| Model | Top-1 Error | Top-5 Error |
|---|---|---|
| VGG-16 | 28.07% | 9.33% |
| ResNet-50 | 24.7% | 7.8% |
| ResNet-101 | 23.6% | 7.1% |
| ResNet-152 | 23.0% | 6.7% |
| ResNet ensemble | — | 3.57% |
The ensemble of ResNets achieved 3.57% top-5 error, winning ILSVRC 2015 by a large margin. For context, human performance on ImageNet is estimated at about 5.1% top-5 error — ResNet surpassed human performance.
Experiment 3: CIFAR-10 with 1000+ Layers.
On CIFAR-10, the authors trained a 1202-layer ResNet. It trained successfully (a plain network of this depth would be completely untrainable), though it slightly overfit due to the small dataset. The optimal model on CIFAR-10 was the 110-layer ResNet at 6.43% error.
Training curves for plain and residual networks at different depths. Watch how plain networks degrade with depth while ResNets consistently improve.
ResNet is not just a good ImageNet model. It is a design principle that reshaped how we think about deep networks.
ResNet and Highway Networks. Srivastava et al. proposed highway networks concurrently, using gated shortcuts: y = T(x) · H(x) + (1 − T(x)) · x, where T is a learned gate. ResNet's key simplification: remove the gate entirely. The shortcut is always open, always passing all information. This parameter-free design proved easier to scale: the ResNet paper notes that highway networks had not demonstrated accuracy gains at extreme depth (over roughly 100 layers).
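The difference between the two shortcut styles fits in a few lines. This is a sketch under simplifying assumptions: the gate logits are passed in directly rather than computed from x by a learned layer as in actual highway networks, and `transform` is a hypothetical stand-in for the learned layers.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def highway_layer(x, H, gate_logits):
    """y = T * H(x) + (1 - T) * x, with T = sigmoid(gate_logits) in (0, 1)."""
    T = sigmoid(gate_logits)
    return T * H(x) + (1.0 - T) * x

def residual_layer(x, F):
    """y = F(x) + x: the shortcut is always fully open."""
    return F(x) + x

def transform(v):
    return 0.1 * v  # hypothetical stand-in for the learned layers

x = np.array([1.0, 2.0, 3.0])
closed = highway_layer(x, transform, gate_logits=np.full(3, 20.0))   # T near 1: shortcut shut
opened = highway_layer(x, transform, gate_logits=np.full(3, -20.0))  # T near 0: pure identity
print(closed, opened, residual_layer(x, transform))
```

When the gate saturates toward 1, the highway layer behaves like a plain layer and the identity path is cut off; the residual layer has no such state to learn its way out of.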
ResNet and DenseNet. Huang et al. (2017) took the skip connection idea further: instead of adding, concatenate feature maps from all preceding layers. DenseNet-201 reaches validation error similar to ResNet-101 with roughly half the parameters (about 20M versus more than 40M). The core insight is the same, creating short paths for information flow, but DenseNet maximizes feature reuse.
ResNet and Transformers. The Transformer architecture (2017) uses residual connections around every attention and feed-forward block. Without them, transformers cannot train at scale. The "pre-norm" vs "post-norm" debate in transformers mirrors the discussion of where to place batch normalization relative to the shortcut. ResNet's influence on modern LLMs is direct and deep.
ResNet and the Unrolled Iterative Estimation View. Liao & Poggio (2016) showed that residual networks can be interpreted as unrolled iterative solvers: each block refines the representation by a small step, like one iteration of gradient descent on an implicit objective. The shortcut connection ensures the solution is stable under iteration.
Paper details. "Deep Residual Learning for Image Recognition," Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun. CVPR 2016 (Best Paper). arXiv:1512.03385. First submitted December 2015.