Deep Residual Learning

Chapter 0: The Puzzle

By 2015, everyone agreed: deeper neural networks learn richer features. AlexNet had 8 layers. VGGNet pushed to 19. Each time we added layers, we got better results on ImageNet. The pattern seemed obvious — just stack more layers.

So researchers tried 56 layers. And something bizarre happened.

The 56-layer network performed worse than the 20-layer network. Not just on test data — on training data. This was not overfitting. The deeper network was genuinely worse at learning, even on the data it was directly optimizing on.

The paradox: More layers should mean more capacity. More capacity should mean better fitting. But experiments showed the opposite — deeper plain networks had higher training error. Something fundamental was broken, and it was not overfitting.

This paper by Kaiming He and colleagues at Microsoft Research diagnosed the problem and proposed a fix so elegant it fits in one line of math. Their residual networks (ResNets) trained with 152 layers, 8× deeper than VGG, and an ensemble of them won the ILSVRC 2015 classification task with 3.57% top-5 error. The idea is simple enough to implement in a single line of code: output = F(x) + x.

Let's build the intuition from scratch.

A 56-layer plain network has higher training error than a 20-layer network on the same data. Is this caused by overfitting?

No: overfitting means low training error but high test error. Here the training error itself is higher, which means the optimizer cannot even fit the training data.

Yes: more layers means more parameters, so the model memorizes noise

Yes: the 56-layer network has too much capacity

Chapter 1: The Degradation Problem

Let's be precise about what goes wrong. When you train a neural network, you use backpropagation: you compute the loss at the output, then propagate gradients backward through every layer so each layer knows how to adjust its weights.

For a network with L layers, the gradient of the loss with respect to the first layer's weights passes through all L layers during backpropagation. At each layer, the gradient is multiplied by that layer's local derivative. If those derivatives are consistently less than 1, the gradient shrinks exponentially:

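$$\frac{\partial \text{Loss}}{\partial W_1} = \frac{\partial \text{Loss}}{\partial a_L} \cdot \frac{\partial a_L}{\partial a_{L-1}} \cdots \frac{\partial a_2}{\partial a_1} \cdot \frac{\partial a_1}{\partial W_1}$$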

Each factor ∂a_k/∂a_{k-1} often has magnitude below 1 (for example, with saturating activations or small weights). Multiply L of them together and the product shrinks exponentially with depth: this is the vanishing gradient problem. The early layers receive almost no gradient signal and effectively stop learning.
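To make the shrinking product concrete, here is a minimal sketch under simplifying assumptions: a scalar chain a_k = sigmoid(w · a_{k-1}) with one shared weight. The sigmoid activation, the weight value, and the function names are illustrative choices, not something the paper prescribes.

```python
import math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_grad(z: float) -> float:
    s = sigmoid(z)
    return s * (1.0 - s)  # peaks at 0.25 when z == 0

def chain_gradient(num_layers: int, weight: float = 1.0, x: float = 0.5) -> float:
    """d(a_L)/d(a_0) for the scalar chain a_k = sigmoid(weight * a_{k-1})."""
    grad = 1.0
    a = x
    for _ in range(num_layers):
        z = weight * a
        grad *= weight * sigmoid_grad(z)  # local derivative da_k / da_{k-1}
        a = sigmoid(z)
    return grad

for depth in (5, 20, 56):
    print(depth, chain_gradient(depth))
```

With weight = 1.0 every local derivative is at most 0.25, so the product falls off at least as fast as 0.25 raised to the depth.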

By 2015, clever initialization (Xavier, He init) and batch normalization had largely tamed the vanishing gradient problem. Networks of 20-30 layers could train fine. But a subtler issue remained.

Degradation ≠ vanishing gradients. Vanishing gradients stall training: early layers receive almost no signal, so the network barely improves. Degradation is more insidious: the network trains and converges, but to a higher training error than a shallower network reaches. The gradients are healthy, yet the optimizer still ends up at a worse solution.

The paper's key empirical finding (its Figure 1): on CIFAR-10, a 56-layer plain network ended with higher training error and higher test error than a 20-layer plain network. The deeper network was worse on both. The plain networks were already trained with batch normalization and proper initialization, so the authors argue that vanishing gradients are unlikely to be the cause: degradation looks like an optimization difficulty, not a lack of representational capacity.

Network	Layers	Training Error	Test Error
Plain-20	20	Lower	Lower
Plain-56	56	Higher	Higher
Deeper network is worse on BOTH training and test

What distinguishes the degradation problem from the vanishing gradient problem?

Vanishing gradients prevent convergence entirely; degradation allows convergence but to a worse solution

They are the same thing, just different names

Degradation is caused by overfitting; vanishing gradients are not

Chapter 2: A Thought Experiment

Here is the argument that makes the degradation problem so puzzling. It is a proof by construction that deeper networks should never be worse.

Suppose you have a trained 20-layer network that achieves 6% error. Now you want to build a 56-layer network. Here is one way to do it:

Copy

Take the first 20 layers from the trained 20-layer network. Copy their weights exactly.

↓

Add Identity Layers

Stack 36 more layers on top. Set each one to the identity function: output = input.

↓

Result

A 56-layer network that achieves exactly 6% error — the same as the 20-layer network.

This constructed solution exists. The 56-layer network has at least one parameter setting that is as good as the 20-layer network. Therefore, the 56-layer network should achieve error at most 6%. In practice, it might do even better by using those 36 extra layers to learn useful refinements.
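The construction above can be sketched in a few lines. This is only an illustration: the "trained" layers are hypothetical stand-ins (simple scalings), and `make_scale_layer`, `identity_layer`, and `run` are names invented for the example.

```python
from typing import Callable, List

Layer = Callable[[List[float]], List[float]]

def make_scale_layer(scale: float) -> Layer:
    """Hypothetical stand-in for a trained layer: multiplies every value."""
    return lambda x: [scale * v for v in x]

def identity_layer(x: List[float]) -> List[float]:
    """output = input"""
    return list(x)

def run(layers: List[Layer], x: List[float]) -> List[float]:
    for layer in layers:
        x = layer(x)
    return x

# Step 1: the "trained" 20-layer network.
shallow = [make_scale_layer(1.0 + 0.01 * k) for k in range(20)]
# Step 2: copy it and stack 36 identity layers on top.
deep = shallow + [identity_layer] * 36

x = [1.0, -2.0, 0.5]
print(len(deep))                         # 56
print(run(deep, x) == run(shallow, x))   # True
```

The 56-layer stack reproduces the 20-layer output exactly, which is why its best achievable error can be no higher than that of the shallow network.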

But the optimizer does not find this solution. It gets stuck somewhere worse. The problem is not that the solution doesn't exist — it's that stochastic gradient descent cannot find it.

The key insight: A deep network should never be worse than a shallow one, because the deep network always contains the shallow solution plus identity layers. The fact that it is worse means the optimizer is failing. The question becomes: how do we make it easier for the optimizer to learn identity mappings?

This is the insight that drives the entire paper. The authors do not change the optimizer. They do not change the loss function. They change the architecture to make identity mappings trivially easy to represent.

Why should a 56-layer network, in theory, always perform at least as well as a 20-layer network?

Because you can construct a 56-layer solution by copying the 20-layer weights and setting the extra layers to the identity function, matching the 20-layer performance

Because more parameters always means lower loss

Because deeper networks have better gradient flow

Chapter 3: Residual Learning

The problem is clear: we need those extra layers to learn the identity function when that's optimal, and learn something useful when it's not. But learning the identity function with a stack of nonlinear layers (convolution → batch norm → ReLU) turns out to be surprisingly hard.

Here is the elegant fix. Instead of asking a block of layers to learn the desired output H(x) directly, ask it to learn the residual:

F(x) := H(x) − x

Then the actual output is:

H(x) = F(x) + x

Why does this help? Think about what happens when the identity mapping is optimal — when the best thing the block can do is pass the input through unchanged. In the original formulation, the layers must learn H(x) = x, which means fitting an identity function with nonlinear layers. In the residual formulation, the layers only need to learn F(x) = 0 — push all weights toward zero. That is dramatically easier for an optimizer.
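A small sketch of that argument, assuming a toy block made of two linear layers with a ReLU between them (batch normalization and the final ReLU after the addition are left out to keep it short). The helper names are illustrative only.

```python
from typing import List

Matrix = List[List[float]]

def relu(v: List[float]) -> List[float]:
    return [max(0.0, a) for a in v]

def linear(weights: Matrix, x: List[float]) -> List[float]:
    return [sum(w * a for w, a in zip(row, x)) for row in weights]

def plain_block(w1: Matrix, w2: Matrix, x: List[float]) -> List[float]:
    """The layers must produce H(x) directly."""
    return linear(w2, relu(linear(w1, x)))

def residual_block(w1: Matrix, w2: Matrix, x: List[float]) -> List[float]:
    """H(x) = F(x) + x, where F is the same two-layer stack."""
    fx = plain_block(w1, w2, x)
    return [f + a for f, a in zip(fx, x)]

zeros = [[0.0] * 3 for _ in range(3)]
x = [1.5, -0.5, 2.0]
print(plain_block(zeros, zeros, x))     # [0.0, 0.0, 0.0]  input is lost
print(residual_block(zeros, zeros, x))  # [1.5, -0.5, 2.0] input passes through
```

With all weights at zero the plain block discards its input, while the residual block is exactly the identity. Note that the negative entry survives only because this sketch omits the ReLU that the paper applies after the addition.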

The residual trick: Don't ask "what should the output be?" Ask "what should I add to the input?" If the answer is "nothing," the network just sets F(x) ≈ 0. If the answer is "a small adjustment," the network learns that small perturbation. Either way, the identity mapping is the default, not something that has to be learned from scratch.

Think of it like this. You are editing a document. The original formulation says: "rewrite the entire document from scratch every time." The residual formulation says: "here is the current document; tell me what to change." When no changes are needed, the editor does nothing. When small corrections are needed, it makes them. This is far more efficient.

The authors observed (their Figure 7) that in trained ResNets, the learned residual functions F(x) tend to have small magnitudes. The layers are learning small perturbations around the identity, not entirely new functions. The residual formulation matches the actual structure of the solution.

In a residual block, what does F(x) represent?

The full output of the block

The change to make to the input: the residual that gets added to x

The gradient flowing backward through the block

Chapter 4: Identity Shortcuts

How do you actually implement H(x) = F(x) + x in a neural network? With a shortcut connection — a wire that skips over the block of layers and adds the input directly to the output.

Input x

The feature map entering the residual block

↓ splits into two paths

Main Path: F(x)

Conv → BN → ReLU → Conv → BN

Shortcut: x

Identity — just copy the input (no parameters!)

↓ element-wise add

Output: F(x) + x

Then apply ReLU

The formal equation for a building block:

y = F(x, {W_i}) + x

where F(x, {W_i}) represents the stack of two or three convolutional layers. The shortcut connection is the "+ x" term. It adds zero extra parameters and essentially zero extra computation (just an element-wise addition).

There is one subtlety. The dimensions of x and F(x) must match for the addition to work. When they do match (same spatial size, same number of channels), the shortcut is a pure identity. When dimensions change (e.g., downsampling or increasing channels), the paper considers two options:

Option	Approach	Extra Parameters?
A: Zero Padding	Pad the shortcut with zeros to match channels; use stride 2 to match spatial size	None
B: Projection	Use a 1×1 convolution: y = F(x) + W_s x	Small (1×1 conv)

The paper found that projection shortcuts are only marginally better than zero-padding. The identity shortcut is sufficient. This is important — it means the improvement comes from the architecture, not from extra parameters.
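The two shortcut options can be sketched for a single spatial position, where x is just a vector of channel values. This is a simplified illustration: it ignores the stride-2 spatial subsampling, and the projection weights `w_s` are made-up numbers standing in for a learned 1×1 convolution.

```python
from typing import List

def zero_pad_shortcut(x: List[float], out_channels: int) -> List[float]:
    """Option A: keep the existing channels, fill the new ones with zeros."""
    return x + [0.0] * (out_channels - len(x))

def projection_shortcut(x: List[float], w_s: List[List[float]]) -> List[float]:
    """Option B: W_s x, the per-position effect of a 1x1 convolution."""
    return [sum(w * a for w, a in zip(row, x)) for row in w_s]

def add(fx: List[float], shortcut: List[float]) -> List[float]:
    return [f + s for f, s in zip(fx, shortcut)]

x = [1.0, 2.0]                 # 2 input channels
fx = [0.1, 0.2, 0.3, 0.4]      # F(x) with 4 output channels
w_s = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [1.0, -1.0]]  # hypothetical weights

y_a = add(fx, zero_pad_shortcut(x, out_channels=4))
y_b = add(fx, projection_shortcut(x, w_s))
extra_params_a = 0
extra_params_b = len(w_s) * len(w_s[0])  # 4 * 2 = 8 weights
print(y_a, extra_params_a)
print(y_b, extra_params_b)
```

Option A leaves the original channels untouched and costs nothing; option B mixes channels but adds out_channels × in_channels weights per shortcut.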

No free parameters: The skip connection adds zero learnable parameters. Plain and residual networks with the same depth have (almost) exactly the same number of parameters and FLOPs. Any performance difference comes purely from the reformulation making optimization easier.

When the input and output of a residual block have different dimensions, what does the paper recommend?

Use a 1×1 projection convolution on the shortcut to match dimensions, though zero-padding works almost as well

Remove the shortcut connection for those blocks

Resize the input with interpolation

Chapter 5: The Building Blocks

The paper introduces two types of residual blocks, used at different depths.

Basic Block (used in ResNet-18 and ResNet-34): Two 3×3 convolutional layers stacked together.

Basic Block

3×3 Conv → BN → ReLU → 3×3 Conv → BN → (+x) → ReLU

Bottleneck Block (used in ResNet-50, 101, and 152): Three layers — a 1×1 conv to reduce channels, a 3×3 conv to process, and a 1×1 conv to restore channels.

Bottleneck Block

1×1 Conv (reduce) → BN → ReLU → 3×3 Conv → BN → ReLU → 1×1 Conv (expand) → BN → (+x) → ReLU

Why the bottleneck? A 3×3 convolution on 256 channels costs 256 × 256 × 3 × 3 ≈ 590K multiplies per spatial position. The bottleneck first reduces to 64 channels (1×1 conv), applies the expensive 3×3 conv on only 64 channels (64 × 64 × 9 ≈ 37K), then expands back to 256 (1×1 conv). Total: about 70K — an 8× reduction in computation.
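The arithmetic above is easy to check. The sketch below counts multiplies per spatial position for a convolution as in_channels × out_channels × kernel², ignoring biases and batch normalization; `conv_mults` is a helper name chosen for this example.

```python
def conv_mults(in_channels: int, out_channels: int, kernel: int) -> int:
    """Multiplies per spatial position for a kernel x kernel convolution."""
    return in_channels * out_channels * kernel * kernel

direct = conv_mults(256, 256, 3)        # single 3x3 conv on 256 channels

reduce_ = conv_mults(256, 64, 1)        # 1x1 conv: 256 -> 64
process = conv_mults(64, 64, 3)         # 3x3 conv on 64 channels
expand = conv_mults(64, 256, 1)         # 1x1 conv: 64 -> 256
bottleneck = reduce_ + process + expand

print(direct)                           # 589824
print(bottleneck)                       # 69632
print(round(direct / bottleneck, 2))    # 8.47
```

The 1×1 layers contribute about 16K multiplies each, so the total is roughly 70K and the saving is a little over 8×.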

Model	Block Type	Layers	Parameters	FLOPs (billions)
ResNet-18	Basic	18	11.7M	1.8
ResNet-34	Basic	34	21.8M	3.6
ResNet-50	Bottleneck	50	25.6M	3.8
ResNet-101	Bottleneck	101	44.5M	7.6
ResNet-152	Bottleneck	152	60.2M	11.3
VGG-19	N/A	19	144M	19.6

Notice: ResNet-152 has fewer parameters and FLOPs than VGG-19, despite being 8× deeper. The bottleneck design is extremely parameter-efficient.

The architectural hierarchy: The overall network has four stages. At each stage, the spatial resolution halves and the channel count doubles (64 → 128 → 256 → 512). Each stage contains multiple residual blocks. The network ends with global average pooling and a single fully-connected layer for classification.
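The depth in each model name can be recovered from the four-stage layout. The per-stage block counts below come from the paper's architecture table (its Table 1), not from the text above; the counting rule (one initial convolution, plus the convolutions inside each block, plus the final fully-connected layer) is the usual convention and leaves out projection shortcuts.

```python
# (block type, number of residual blocks in each of the four stages)
BLOCKS_PER_STAGE = {
    "ResNet-18": ("basic", [2, 2, 2, 2]),
    "ResNet-34": ("basic", [3, 4, 6, 3]),
    "ResNet-50": ("bottleneck", [3, 4, 6, 3]),
    "ResNet-101": ("bottleneck", [3, 4, 23, 3]),
    "ResNet-152": ("bottleneck", [3, 8, 36, 3]),
}
CONVS_PER_BLOCK = {"basic": 2, "bottleneck": 3}

def weighted_layers(name: str) -> int:
    """Initial conv + convs inside all residual blocks + final FC layer."""
    block_type, blocks = BLOCKS_PER_STAGE[name]
    return 1 + CONVS_PER_BLOCK[block_type] * sum(blocks) + 1

for name in BLOCKS_PER_STAGE:
    print(name, weighted_layers(name))
```

ResNet-34 and ResNet-50 share the same block counts; the jump from 34 to 50 layers comes entirely from swapping two-layer basic blocks for three-layer bottleneck blocks.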

Why does the bottleneck block use 1×1 convolutions before and after the 3×3 convolution?

To reduce the channel dimension before the expensive 3×3 conv and restore it after, cutting computation by roughly 8×

To add more nonlinearities for better feature extraction

To match the dimensions for the shortcut connection

Chapter 6: Why It Works: Gradients

The residual formulation has a beautiful consequence for gradient flow during backpropagation. Let's derive it.

Consider a chain of residual blocks. The output of block l is:

$$x_{l+1} = x_l + F(x_l)$$

Unrolling this recursion from layer l to layer L:

$$x_L = x_l + \sum_{i=l}^{L-1} F(x_i)$$

Now compute the gradient of the loss with respect to x_l:

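$$\frac{\partial \text{Loss}}{\partial x_l} = \frac{\partial \text{Loss}}{\partial x_L} \cdot \frac{\partial x_L}{\partial x_l} = \frac{\partial \text{Loss}}{\partial x_L} \cdot \left(1 + \frac{\partial}{\partial x_l} \sum_{i=l}^{L-1} F(x_i)\right)$$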

The crucial term is that 1. Regardless of what happens in the F terms, the gradient always has a direct path from the loss to layer l through the identity connections. The term ∂Loss/∂x_L · 1 reaches layer l unattenuated, so the total gradient is unlikely to vanish: the F-dependent term would have to cancel it exactly (equal −1) across a whole mini-batch.
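The derivation can be checked numerically on a toy case. The sketch below assumes a scalar residual chain with F(x) = W · tanh(x) and a small hypothetical weight W = 0.1. In the scalar case ∂x_L/∂x_l is a product of factors (1 + F′(x_i)), which expands to 1 plus further terms, so it matches the "1 + …" form of the gradient equation.

```python
import math

W = 0.1  # hypothetical small weight, so every F block has a small derivative

def f(x: float) -> float:
    return W * math.tanh(x)

def f_prime(x: float) -> float:
    return W * (1.0 - math.tanh(x) ** 2)

def residual_forward(x: float, depth: int) -> float:
    for _ in range(depth):
        x = x + f(x)               # x_{l+1} = x_l + F(x_l)
    return x

def residual_grad(x: float, depth: int) -> float:
    grad = 1.0
    for _ in range(depth):
        grad *= 1.0 + f_prime(x)   # identity contributes the "1"
        x = x + f(x)
    return grad

def plain_grad(x: float, depth: int) -> float:
    grad = 1.0
    for _ in range(depth):
        grad *= f_prime(x)         # no shortcut: only the small factor
        x = f(x)
    return grad

def finite_difference(x: float, depth: int, h: float = 1e-6) -> float:
    return (residual_forward(x + h, depth) - residual_forward(x - h, depth)) / (2 * h)

x0, depth = 0.5, 50
print("plain:   ", plain_grad(x0, depth))
print("residual:", residual_grad(x0, depth))
print("check:   ", finite_difference(x0, depth))
```

The plain chain's gradient collapses toward zero after 50 layers, while the residual chain's gradient stays at or above 1 in this toy setup, and the finite-difference estimate agrees with the analytic product.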

The gradient highway: In a plain network, the gradient must pass through every layer's weights and nonlinearities — it is a product of many small terms. In a residual network, the gradient has a shortcut that bypasses all intermediate layers. It is like having an express lane on a highway. Even if the local roads are congested (small derivatives in the F blocks), the express lane always delivers the gradient signal.

In a plain network with L layers, the gradient from loss to the first layer is a product of L terms. If each term averages 0.9, the gradient after 50 layers is 0.9⁵⁰ ≈ 0.005. After 100 layers: 0.9¹⁰⁰ ≈ 0.00003.
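As a rough check of these figures, assuming every factor equals exactly 0.9 (an idealization, since real per-layer derivatives vary):

```latex
\begin{align*}
0.9^{50}  &= e^{50 \ln 0.9} \approx e^{-5.27} \approx 5.2 \times 10^{-3}, \\
0.9^{100} &= \left(0.9^{50}\right)^{2} \approx 2.7 \times 10^{-5}.
\end{align*}
```

Doubling the depth squares the attenuation, which is what exponential decay in depth means in practice.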

In a residual network, the gradient is a sum that always includes a direct "1" term. Even if the F-block derivatives are small, the total gradient remains meaningful. This is the fundamental mechanism that lets ResNets train with 100+ layers.

In a residual network, what guarantees that the gradient signal reaches early layers?

Batch normalization prevents gradient vanishing

The learning rate is set very high for early layers

The identity shortcuts create a direct gradient path (the "1" term) that bypasses all intermediate F blocks

Chapter 7: Showcase — Plain vs Residual

Now you can see the difference for yourself. The simulation below shows signal and gradient flowing through two networks: a plain network and a residual network with identity shortcuts.

Adjust the depth with the slider. In the plain network, watch the gradient magnitude collapse as depth increases. In the residual network, the shortcut connections preserve gradient flow no matter how deep the network gets.

Plain vs Residual Network: Signal & Gradient Flow

Adjust Depth to add layers. Toggle Show Gradients to visualize backpropagation. Watch the gradient magnitude in early layers — the plain network fades, the residual network persists.

As you push depth beyond 20 in the plain network, the gradient magnitude at the first layer drops to near zero — the early layers have stopped learning. The residual network maintains strong gradient signal at any depth, because the identity shortcuts act as gradient highways.

This is consistent with what He et al. observed in their CIFAR-10 experiments: a 1202-layer ResNet trained successfully (though it overfit slightly due to the small dataset size). Given the degradation already visible at 56 layers, a plain network of comparable depth would be expected to train far worse.

In the simulation, what happens to the gradient magnitude at the first layer of the plain network as depth increases from 10 to 40?

It drops to near zero because the gradient is a product of many terms each less than 1

It stays constant because batch normalization preserves it

It explodes because of compounding multiplications

Chapter 8: The Experiments

The paper is a masterclass in controlled experimentation. Every claim is backed by fair comparisons where only one variable changes at a time.

Experiment 1: Plain Networks Degrade, ResNets Don't.

On ImageNet, a 34-layer plain network has higher validation error than an 18-layer plain network. But a 34-layer ResNet has lower error than an 18-layer ResNet. The shortcut connections completely reverse the degradation phenomenon.

Network	Top-1 Error	Trend
Plain-18	27.94%	↑ worse with depth
Plain-34	28.54%	↑ worse with depth
ResNet-18	27.88%	↓ better with depth
ResNet-34	25.03%	↓ better with depth

Experiment 2: Scaling to Extreme Depth.

With bottleneck blocks, the authors scaled to 152 layers. The results on ImageNet:

Model	Top-1 Error	Top-5 Error
VGG-16	28.07%	9.33%
ResNet-50	24.7%	7.8%
ResNet-101	23.6%	7.1%
ResNet-152	23.0%	6.7%
ResNet ensemble	—	3.57%

The ensemble of ResNets achieved 3.57% top-5 error on the ImageNet test set, winning the ILSVRC 2015 classification task. For context, human performance on ImageNet is estimated at about 5.1% top-5 error, so the ensemble's error is below that estimate.

Experiment 3: CIFAR-10 with 1000+ Layers.

On CIFAR-10, the authors trained a 1202-layer ResNet. It trained successfully (a plain network of this depth would be expected to be far harder to optimize), though it slightly overfit due to the small dataset. The best model on CIFAR-10 was the 110-layer ResNet at 6.43% error.

Beyond classification: ResNets did not just win ImageNet classification. The same team also took first place in ImageNet detection, ImageNet localization, COCO detection, and COCO segmentation in the ILSVRC & COCO 2015 competitions. The deep representations learned by ResNets transferred well to these tasks. On COCO object detection, switching to ResNet features gave a 28% relative improvement.

Depth vs Error: Plain Networks vs ResNets

Training curves for plain and residual networks at different depths. Watch how plain networks degrade with depth while ResNets consistently improve.

What was the top-5 error rate of the ResNet ensemble on the ImageNet test set?

5.1%

3.57%, surpassing estimated human performance of ~5.1%

6.7%

Chapter 9: Connections

ResNet is not just a good ImageNet model. It is a design principle that reshaped how we think about deep networks.

ResNet and Highway Networks. Srivastava et al. proposed highway networks concurrently, using gated shortcuts: y = T(x) · H(x) + (1 − T(x)) · x, where T is a learned gate. ResNet's key simplification: remove the gate entirely. The shortcut is always open, always passing all information, and it needs no parameters. The ResNet authors note that highway networks had not demonstrated accuracy gains at extreme depth (beyond ~100 layers).
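The difference between the two shortcuts shows up even in a scalar sketch. Here the gate T is modeled as a sigmoid of a hypothetical logit, and `h` and `f` stand for the outputs of the transform layers; these names and values are illustrative.

```python
import math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def highway(x: float, h: float, gate_logit: float) -> float:
    """y = T * H(x) + (1 - T) * x, with T = sigmoid(gate_logit)."""
    t = sigmoid(gate_logit)
    return t * h + (1.0 - t) * x

def residual(x: float, f: float) -> float:
    """y = F(x) + x: the shortcut has no gate and no parameters."""
    return f + x

x, h = 2.0, 5.0
print(highway(x, h, gate_logit=-50.0))  # gate closed on H: y is x
print(highway(x, h, gate_logit=50.0))   # gate fully on H: x is blocked
print(residual(x, f=0.0))               # x always passes through
```

When the highway gate saturates toward 1, the shortcut carries nothing; the residual form cannot close its shortcut, whatever the layers learn.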

ResNet and DenseNet. Huang et al. (2017) took the skip connection idea further: instead of adding, concatenate feature maps from all preceding layers. The DenseNet paper reports that DenseNet-201, with about 20M parameters, reaches validation error similar to a 101-layer ResNet with more than 40M parameters. The core insight is the same (create short paths for information flow), but DenseNet emphasizes feature reuse.
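Adding versus concatenating can be illustrated with channel vectors at one spatial position. This is a toy sketch: `new_features` is a hypothetical stand-in for a layer, and the channel counts are arbitrary small numbers chosen for readability.

```python
from typing import Callable, List

Merge = Callable[[List[float], List[float]], List[float]]

def residual_merge(x: List[float], fx: List[float]) -> List[float]:
    """ResNet: element-wise addition keeps the channel count fixed."""
    return [a + b for a, b in zip(x, fx)]

def dense_merge(x: List[float], fx: List[float]) -> List[float]:
    """DenseNet: concatenation appends the new channels to the old ones."""
    return x + fx

def new_features(x: List[float], channels: int) -> List[float]:
    """Hypothetical layer: emits `channels` copies of the input mean."""
    return [sum(x) / len(x)] * channels

def width_after(merge: Merge, in_channels: int, layers: int, new_channels: int) -> int:
    x = [1.0] * in_channels
    for _ in range(layers):
        x = merge(x, new_features(x, new_channels))
    return len(x)

# Addition needs matching sizes, so each layer emits as many channels as it receives.
print(width_after(residual_merge, in_channels=4, layers=3, new_channels=4))  # 4
# Concatenation grows the width by new_channels per layer: 4 + 3 * 2.
print(width_after(dense_merge, in_channels=4, layers=3, new_channels=2))     # 10
```

With concatenation every earlier feature map stays directly visible to later layers, at the cost of a width that grows with depth.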

ResNet and Transformers. The Transformer architecture (2017) uses residual connections around every attention and feed-forward block. Without them, deep transformers are very difficult to train. The "pre-norm" vs "post-norm" debate in transformers mirrors the discussion of where to place batch normalization relative to the shortcut. ResNet's influence on modern LLMs is direct and deep.
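The pre-norm and post-norm placements differ only in where normalization sits relative to the shortcut. The sketch below assumes a layer normalization without learned scale or bias, and uses a hypothetical `sublayer` in place of attention or the feed-forward network.

```python
import math
from typing import List

def layer_norm(x: List[float], eps: float = 1e-5) -> List[float]:
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def sublayer(x: List[float]) -> List[float]:
    """Hypothetical stand-in for attention or a feed-forward network."""
    return [0.5 * v for v in x]

def post_norm_block(x: List[float]) -> List[float]:
    """Normalize after the addition: y = norm(x + f(x))."""
    return layer_norm([a + f for a, f in zip(x, sublayer(x))])

def pre_norm_block(x: List[float]) -> List[float]:
    """Normalize inside the branch: y = x + f(norm(x))."""
    return [a + f for a, f in zip(x, sublayer(layer_norm(x)))]

x = [1.0, 2.0, 6.0]
print(post_norm_block(x))
print(pre_norm_block(x))
```

In the pre-norm form the shortcut carries x through unchanged, as in the identity shortcut described earlier; in the post-norm form the normalization also rescales the shortcut path.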

ResNet and the Unrolled Iterative Estimation View. Greff et al. (2017) argued that residual networks can be interpreted as unrolled iterative estimation: each block refines the representation by a small step instead of computing an entirely new one. Liao & Poggio (2016) made a related connection, showing that a ResNet with shared weights behaves like an unrolled recurrent network. In both views, the shortcut connection is what keeps successive blocks working on the same representation.

ResNet (2015)

Identity shortcuts enable 100+ layer training

↓ inspired

DenseNet, ResNeXt, SE-Net

Richer connectivity patterns with the same core insight

↓ essential for

Transformers & Modern LLMs

Every transformer block has a residual connection — directly inherited from this paper

The lasting impact. ResNet has over 200,000 citations. The idea that you should learn residual adjustments rather than full mappings has become a default in deep learning. Transformers, diffusion models, and most modern vision networks use skip connections. This paper did not just build a better image classifier: it helped unlock the era of very deep networks.

Paper details. "Deep Residual Learning for Image Recognition," Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun. CVPR 2016 (Best Paper). arXiv:1512.03385. First submitted December 2015.

Which modern architecture family uses residual (skip) connections around every block, directly inheriting this paper's design principle?

Recurrent Neural Networks (RNNs)

Transformers: every attention and feed-forward block uses a residual connection

Convolutional autoencoders
