He, Zhang, Ren & Sun — Microsoft Research, ECCV 2016 • arXiv:1603.05027

Identity Mappings in Deep Residual Networks

The original ResNet paper showed that skip connections work. This paper explains why — and proves that keeping the shortcut perfectly clean is the key to training 1000+ layer networks.

Prerequisites: ResNet basics (skip connections, residual blocks) + Backpropagation (we derive the rest)
10 Chapters • 1 Simulation • 10 Quizzes

Chapter 0: The Sequel

In 2015, ResNet showed that skip connections could train 152-layer networks and took first place in the ILSVRC 2015 and COCO 2015 competitions. The recipe was simple: add the input to the output of each block. The "+" operation created a shortcut for gradients to flow backward, solving the degradation problem that had blocked deeper architectures.

ResNet was a massive empirical success, but it left the theoretical story incomplete. Two questions remained unanswered.

First: why does the identity shortcut work? Is it the best choice, or would a gated shortcut (like Highway Networks) or a learned 1×1 convolution shortcut do even better? After all, those have more parameters and more representational power.

Second: the original ResNet places a ReLU after the addition. That ReLU sits directly on the shortcut path, blocking negative values from passing through. For a 100-layer network, this might not matter. But for a 1000-layer network, does this tiny disruption compound into a real problem? And if so, what is the right way to arrange BN and ReLU inside a residual block?

The surprise: Simpler is better. Any modification to the identity shortcut — scaling, gating, convolutions, dropout — makes training harder. And moving BN+ReLU to before the weight layers (pre-activation) unlocks clean 1001-layer training on CIFAR with 4.62% error.

This is a paper about the geometry of information flow. The authors derive exact formulas for how signals and gradients propagate through a stack of residual blocks, then show that the cleanest possible path — pure identity on the shortcut, identity after addition — is mathematically and empirically optimal.

The paper makes two distinct contributions. The first is theoretical: a formal analysis showing that when both the shortcut and the post-addition function are identity mappings, the signal can propagate directly between any two layers in both forward and backward passes. Any deviation from identity — scaling, gating, convolution, even dropout — introduces a multiplicative factor that compounds exponentially with depth.

The second contribution is architectural: a new residual unit where BN and ReLU are moved before the weight layers ("pre-activation"), making the after-addition function an identity. Together, these contributions unlock networks of 1000+ layers. The paper demonstrates this with a 1001-layer ResNet that trains smoothly and achieves 4.62% error on CIFAR-10 — at a time when most practitioners considered 50 layers to be "very deep."

The original ResNet (2015) places a ReLU after the element-wise addition in each residual block. What is the potential problem with this?

Chapter 1: The Propagation Equations

A residual unit in the original ResNet computes two things:

$$y_l = h(x_l) + F(x_l, W_l)$$
$$x_{l+1} = f(y_l)$$

Here $x_l$ is the input to block $l$, F is the residual function (the stack of weight layers, typically two 3×3 convolutions with BN and ReLU between them), h is the shortcut function, and f is whatever activation happens after the element-wise addition. In the original ResNet, h(x) = x (identity shortcut) and f = ReLU (post-addition activation).

These two equations completely describe a single residual unit. The first equation computes the sum of the shortcut h(xl) and the residual F(xl, Wl). The second equation applies the post-addition activation f. Everything about information propagation — forward signals, backward gradients, training dynamics — follows from these two lines.

The paper's approach is to ask: under what conditions on h and f do signals propagate most cleanly? The answer, as we will derive, is both must be identity mappings.

Now suppose both h and f are identity mappings. Then $x_{l+1} = y_l = x_l + F(x_l, W_l)$. This is the simplest possible residual unit: input plus residual, nothing else. Now apply this recursively, writing $F_i$ for $F(x_i, W_i)$. Block l+1: $x_{l+2} = x_{l+1} + F_{l+1} = x_l + F_l + F_{l+1}$. Block l+2: $x_{l+3} = x_l + F_l + F_{l+1} + F_{l+2}$. In general, from block l to block L:

$$x_L = x_l + \sum_{i=l}^{L-1} F(x_i, W_i)$$

This is remarkable. Read it again: the feature at any deep layer L is just the feature at any shallow layer l, plus a sum of residuals. There is no long chain of dependencies. Layer 100 can "see" layer 1 directly through the sum. Compare this to a plain network where xL is a long chain of matrix multiplications — a product, not a sum. In a plain network, layer 100 can only see layer 1 through 99 intermediate transformations.

Another way to read this equation: the output of the whole network, $x_L = x_0 + \sum_{i=0}^{L-1} F(x_i, W_i)$, is the input $x_0$ plus the accumulated residuals from all blocks. Each block contributes its own additive term. No block can "erase" what previous blocks wrote; it can only add its own correction.
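The unrolled sum can be checked numerically. The sketch below is a toy, not the paper's network: `residual` is a hypothetical scalar stand-in for F(x_i, W_i), and both h and f are taken to be identity mappings.

```python
import math

def residual(i, x):
    """Hypothetical residual function F(x_i, W_i); any function of x would do."""
    return 0.1 * math.sin(x + i)

def forward(x_l, l, L):
    """Run identity-shortcut blocks l..L-1 and record each residual output."""
    x, residuals = x_l, []
    for i in range(l, L):
        f_i = residual(i, x)   # F evaluated at the current block input x_i
        residuals.append(f_i)
        x = x + f_i            # x_{i+1} = x_i + F(x_i, W_i), with h and f identity
    return x, residuals

x_l = 0.7
x_L, residuals = forward(x_l, l=3, L=50)

# Unrolled form: the deep feature equals the shallow feature plus a sum of residuals.
print(math.isclose(x_L, x_l + sum(residuals), rel_tol=1e-9, abs_tol=1e-12))
```

Running the blocks one at a time and adding up the recorded residuals give the same value up to floating-point rounding, which is all the unrolled equation claims.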

This additive structure has another implication: deleting a single block from a trained ResNet should have only a minor impact, because removing one term from a sum changes the total by only that term's magnitude. Veit et al. (2016) later verified this experimentally — removing individual ResNet blocks at test time causes only small accuracy drops, unlike plain networks where removing any layer is catastrophic.

Sum vs product. In a plain network, features are products (ignoring nonlinearities): $x_L = W_{L-1} W_{L-2} \cdots W_l \, x_l$. In a residual network with identity mappings, features are sums: $x_L = x_l + \sum_i F_i$. Sums are far more stable than products: one bad factor can collapse a product to zero, but one bad term barely affects a sum.

Now differentiate with respect to the loss E:

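$$\frac{\partial E}{\partial x_l} = \frac{\partial E}{\partial x_L} \left(1 + \frac{\partial}{\partial x_l} \sum_{i=l}^{L-1} F(x_i, W_i)\right)$$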

Look at that 1. The gradient at layer l always includes a direct, unattenuated copy of the gradient at layer L. No matter how many blocks separate l from L, no matter what the weights are, this "1" is always there. It is the gradient highway — a clean, unobstructed path from the loss all the way back to the earliest layers.

The second term, involving the partial derivative of the sum of F terms, carries the gradient that flows through the weight layers. This term may be small, large, or even negative, but it is additive rather than multiplicative.

For the total gradient to vanish, the term involving F would have to equal exactly −1 for every sample in the mini-batch simultaneously. Because the F functions produce different values for different inputs, that is statistically implausible.

Products kill, sums survive. In a plain network, the gradient passes through a product of L Jacobians. One near-zero factor collapses the whole product. In a ResNet with identity mappings, the gradient is a sum: a direct "1" path plus residual contributions. The "1" acts as a floor — no matter how bad the residual gradients are, the direct path always carries signal. This is why ResNets train at depths where plain networks collapse.
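To make the contrast concrete, here is a minimal scalar sketch under stated assumptions: each block's residual is the hypothetical function a · tanh(x), with a deliberately small "weight" a. With an identity shortcut, each block's derivative is 1 + F'(x_i), and the product of those factors expands to the "1 + ..." form of the gradient equation. Without the shortcut, each factor is just F'(x_i).

```python
import math

def jacobian_factors(x0, scales, residual):
    """Per-block derivatives d x_{i+1} / d x_i for a scalar toy network."""
    x, factors = x0, []
    for a in scales:
        f = a * math.tanh(x)                  # toy F(x_i, W_i) with "weight" a
        df = a * (1.0 - math.tanh(x) ** 2)    # dF/dx_i
        if residual:
            x, d = x + f, 1.0 + df            # identity shortcut contributes the "1"
        else:
            x, d = f, df                      # plain layer: no shortcut
        factors.append(d)
    return factors

scales = [0.05] * 54                          # deliberately weak weight layers
res_grad = math.prod(jacobian_factors(0.3, scales, residual=True))
plain_grad = math.prod(jacobian_factors(0.3, scales, residual=False))
print(f"residual: {res_grad:.3f}  plain: {plain_grad:.3e}")
```

In this toy the residual gradient stays at or above 1 because every factor is 1 plus a small positive number, while the plain gradient is a product of 54 factors no larger than 0.05.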

Note that these equations only hold perfectly when both h and f are identity mappings. The original ResNet satisfies one condition — h(x) = x (identity shortcut) — but violates the other: f = ReLU (not identity). The "1" in the gradient equation becomes approximate, not exact. For 100-layer networks, the approximation is good enough. For 1000-layer networks, it is not.

The rest of the paper is about making these equations hold exactly. Chapters 2-3 show what happens when we break the first condition (identity shortcut). Chapters 4-5 show how to satisfy the second condition (identity after addition) by rearranging the components inside each block.

In the gradient equation ∂E/∂xl = ∂E/∂xL · (1 + ...), what does the "1" represent?

Chapter 2: Why Identity Shortcuts

The equations in Chapter 1 only hold when h(x) = x. But what if we use a "better" shortcut? After all, a 1×1 convolution or a gating mechanism has more parameters. Shouldn't more capacity help?

The paper compares five alternative shortcut types against the identity baseline on ResNet-110 (CIFAR-10), a network with 54 residual units. Every one performs worse than the plain identity shortcut, which undercuts the "more parameters is better" hypothesis:

| Shortcut type | On shortcut | On F | Error (%) |
|---|---|---|---|
| Original (identity) | 1 | 1 | 6.61 |
| Constant scaling (0.5, 0.5) | 0.5 | 0.5 | 12.35 |
| Exclusive gating | 1−g(x) | g(x) | 8.70 |
| Shortcut-only gating (bg=0) | 1−g(x) | 1 | 12.86 |
| 1×1 conv shortcut | 1×1 conv | 1 | 12.22 |
| Dropout shortcut (0.5) | dropout | 1 | fail |

The results are striking. The identity shortcut — the one with zero extra parameters — beats all alternatives. Some alternatives fail entirely (error > 20%). The pattern is unambiguous: any modification to the shortcut hurts.

The paper also reports a plain-network case (shortcut = 0, F = 1), which fails entirely. This is just a plain network with no skip connections at all, confirming that the shortcut is essential, not optional. Then look at the progression in the table: constant scaling (12.35%), exclusive gating (8.70%), 1×1 conv (12.22%), dropout (fails). Every attempt to "improve" the shortcut makes things worse.

The principle: Any operation on the shortcut path impedes information propagation. Scaling multiplies gradients by a factor that can vanish or explode across hundreds of layers. Gating introduces a multiplicative product $\prod_i (1 - g_i)$ that tends toward zero. Even 1×1 convolutions introduce a product of derivatives $\prod_i h'_i$ that destabilizes training. The shortcut must be clean.

This is perhaps the deepest insight. More parameters do not always help. The gated and convolutional shortcuts contain the identity shortcut as a special case (just set the gates to 1 or the 1×1 conv to the identity matrix). But the optimizer cannot find that solution. The extra flexibility becomes a trap.

The dropout shortcut result is particularly instructive. Dropout with probability 0.5 is statistically equivalent to scaling the shortcut by 0.5 in expectation. The network fails to train at all — the stochastic scaling destroys the gradient highway just as badly as deterministic scaling. Even dropout, which is normally a beneficial regularizer, becomes harmful when applied to the shortcut path.

There is an interesting asymmetry in the gating results. When the gating bias bg is initialized to −6 (so σ(bg) ≈ 0.0025, meaning the gate starts near identity), shortcut-only gating achieves 6.91% — nearly matching the baseline. When bg = 0 (gate starts at 0.5), it degrades to 12.86%. The network wants the shortcut to be the identity, and initialization that starts closer to identity starts closer to the good solution.

The exclusive gating results tell a complementary story. In exclusive gating, the shortcut gets factor (1−g) and the F path gets factor g. When (1−g) ≈ 1 (good for the shortcut), g ≈ 0 (bad for the residual function — it is suppressed). The gate creates a zero-sum competition between the shortcut and the residual path. The identity shortcut avoids this dilemma entirely: both paths operate at full strength simultaneously.

A 1×1 convolution shortcut has strictly more representational capacity than an identity shortcut (it can represent the identity by learning the identity matrix). Why does it perform worse?

Chapter 3: Breaking the Shortcut

Chapter 2 showed the experiments. Now let's see the math that predicts exactly why non-identity shortcuts fail. This analysis is one of the most elegant parts of the paper.

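Replace h(x) = x with a simple scaling $h(x_l) = \lambda_l x_l$. This is the mildest possible modification: just multiply the shortcut by a constant. Unrolling the recursion now gives:

$$x_L = \Big(\prod_{i=l}^{L-1} \lambda_i\Big) x_l + \sum_{i=l}^{L-1} \hat{F}(x_i, W_i)$$

where $\hat{F}$ absorbs the scaling factors into the residual functions. The gradient becomes:

$$\frac{\partial E}{\partial x_l} = \frac{\partial E}{\partial x_L} \Big(\prod_{i=l}^{L-1} \lambda_i + \frac{\partial}{\partial x_l} \sum_{i=l}^{L-1} \hat{F}(x_i, W_i)\Big)$$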

That clean "1" from the identity shortcut is now replaced by $\prod_i \lambda_i$. If λ = 0.5 and you have 54 blocks, the product is $0.5^{54} \approx 5.5 \times 10^{-17}$. That is a factor of roughly $10^{-17}$ on the gradient, below double-precision machine epsilon and effectively zero next to any other gradient term. The gradient highway is destroyed, and all gradient signal must detour through the weight layers, where it faces further attenuation from the chain of Jacobians.

The exponential trap. Any constant scaling λ ≠ 1 on the shortcut creates an exponential in the gradient: $\lambda^L$. For λ < 1, this vanishes. For λ > 1, it explodes. Only λ = 1 (identity) gives a stable gradient of exactly 1 through the shortcut path, regardless of depth.
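The arithmetic is easy to reproduce. The helper name `shortcut_gradient_factor` below is an illustrative assumption; it just multiplies the same λ once per block, using the 54 blocks of the running example.

```python
def shortcut_gradient_factor(lam, num_blocks):
    """Product of per-block shortcut scalings, i.e. lam ** num_blocks."""
    factor = 1.0
    for _ in range(num_blocks):
        factor *= lam
    return factor

# 54 blocks, as in the ResNet-110 example from the text.
for lam in (0.5, 0.9, 1.0, 1.1, 2.0):
    print(f"lambda={lam}: {shortcut_gradient_factor(lam, 54):.3e}")
```

Only λ = 1.0 leaves the factor at exactly 1; every other value drifts toward zero or toward very large numbers as the block count grows.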

For more complex shortcuts like gating g(x) or 1×1 convolutions, the product becomes $\prod_i h'_i$, where $h'_i$ is the derivative of the shortcut function at block i. This product is no better behaved: it is a long chain of multiplications, exactly the kind of thing that causes vanishing or exploding gradients in plain networks. We have come full circle: putting anything on the shortcut path re-introduces the very problem that skip connections were meant to solve.

Here is a concrete worked example. Consider exclusive gating with g(x) = σ(Wx + b). The shortcut is scaled by 1 − g(x), and the F path by g(x). During backpropagation, the shortcut gradient picks up a factor of 1 − g at each block. If g ≈ 0.5 (as it is when bg = 0), then the factor is 0.5 at each block. Through 54 blocks: $0.5^{54} \approx 5.5 \times 10^{-17}$, which is effectively zero. The gradient highway is annihilated.

But if we initialize bg = −6, then σ(−6) ≈ 0.0025, so 1 − g ≈ 0.9975. Through 54 blocks: $0.9975^{54} \approx 0.87$. The gradient highway retains 87% of the signal. This is why initialization matters so much for gated networks. The same shortcut factor applies to shortcut-only gating, where the paper finds that bg = −6 gets 6.91%, close to the identity baseline at 6.61%.
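A quick way to see how sensitive the highway is to the gate bias is to compute the retained fraction directly. This sketch assumes the gate sits at g = σ(bias) in every block (that is, the Wx term is zero, roughly the situation at initialization); `highway_retained` is a hypothetical helper name.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def highway_retained(bias, num_blocks=54):
    """Shortcut factor (1 - g) ** num_blocks, assuming g = sigmoid(bias) in every block."""
    return (1.0 - sigmoid(bias)) ** num_blocks

for bias in (0.0, -3.0, -6.0):
    print(f"bias={bias}: gate={sigmoid(bias):.4f}  retained={highway_retained(bias):.3e}")
```

The retained fraction moves from essentially nothing at bias 0 to most of the signal at bias −6, which mirrors the gap between the two initializations discussed above.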

The paper's experiments confirm this perfectly. The closer a gating function is to the identity (by initializing bias bg very negatively, so 1−g(x) ≈ 1), the closer the result gets to the baseline. The shortcut wants to be the identity.

If you scale every shortcut by 0.5 in a network with 54 residual blocks, by what factor is the gradient highway attenuated?

Chapter 4: The Activation Problem

So the shortcut h(x) must be the identity. But what about f — the activation after the addition? In the original ResNet, f is ReLU. That means:

$$x_{l+1} = \mathrm{ReLU}(x_l + F(x_l, W_l))$$

This ReLU sits directly on the information highway. It clips all negative values to zero. When we try to unroll the recursion, we no longer get the clean summation $x_L = x_l + \sum_i F_i$, because the ReLU modifies the signal at every step.

For shallow ResNets (up to ~164 layers), this is manageable. After some initial training, the learned weights adjust so that yl = xl + F(xl) tends to be positive most of the time. (Since xl is already non-negative from the previous ReLU, yl is only negative when F has a large negative magnitude.) The post-addition ReLU rarely clips, and training proceeds normally.

But for very deep networks — hundreds or thousands of layers — even occasional clipping accumulates. If 5% of blocks clip on average, and you have 333 blocks, the expected number of clips per forward pass is about 17. Each clip zeroes a gradient path. The optimization landscape becomes increasingly rough, and the network struggles to find good solutions.
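The back-of-envelope numbers can be reproduced in a few lines. The 5% clip rate is the illustrative figure from the paragraph above, not a measurement, and the second helper additionally assumes that blocks clip independently, which a real network need not satisfy.

```python
def expected_clips(p_clip, num_blocks):
    """Expected number of clipping blocks per forward pass."""
    return p_clip * num_blocks

def prob_at_least_one_clip(p_clip, num_blocks):
    """Chance that some block clips, assuming independent clipping (an assumption)."""
    return 1.0 - (1.0 - p_clip) ** num_blocks

print(expected_clips(0.05, 333))
print(prob_at_least_one_clip(0.05, 333))
print(prob_at_least_one_clip(0.05, 54))
```

Under these assumptions a 333-block network is all but certain to clip somewhere on every pass, while a 54-block network still has a small chance of passing cleanly.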

The goal: Make f an identity mapping too. If both h(x) = x and f(y) = y, then xl+1 = xl + F(xl, Wl) exactly, and the clean propagation equations hold perfectly for any depth.

But we still need nonlinearity somewhere — without ReLU and BN, the network cannot learn anything useful. A linear network of any depth is equivalent to a single linear layer. The question is not whether to use activations, but where to put them so they affect only the residual path F, not the shortcut path.

This is a constrained design problem. We have six components that must appear in each block: two weight layers, two batch normalizations, and two ReLUs. We have one constraint: nothing should sit on the shortcut path between the input and the addition. The question is: what ordering of {BN, ReLU, Weight, BN, ReLU, Weight} satisfies this constraint while preserving representational power?

The paper tries five arrangements. All use the same components (BN, ReLU, weight layers) — only the order differs:

(a) Original: Weight → BN → ReLU → Weight → BN → Add → ReLU
(b) BN after add: Weight → BN → ReLU → Weight → Add → BN → ReLU
(c) ReLU before add: Weight → BN → ReLU → Weight → BN → ReLU → Add
(d) ReLU-only pre-act: ReLU → Weight → BN → ReLU → Weight → BN → Add
(e) Full pre-activation: BN → ReLU → Weight → BN → ReLU → Weight → Add

Option (e) is the winner. By placing BN and ReLU before each weight layer, the addition becomes the last operation in the block. Nothing sits on the shortcut path after the add, so the after-addition activation f is the identity. Option (d) also ends with the addition, but its first ReLU runs on un-normalized input. Option (e) is the only arrangement that simultaneously satisfies: (1) nonlinear activations exist for representational power, (2) the residual F can output both positive and negative values, (3) nothing impedes the shortcut path, and (4) every ReLU is preceded by BN.
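The selection logic can be written as a small check over the five orderings. This is a sketch under stated assumptions: each arrangement is transcribed as a list using exactly two BNs, two ReLUs and two weight layers, and the three property names are illustrative. The third property (every ReLU directly follows a BN) is an added stand-in for "BN and ReLU act together", which is what separates (e) from (d).

```python
ARRANGEMENTS = {
    "a_original":         ["Weight", "BN", "ReLU", "Weight", "BN", "Add", "ReLU"],
    "b_bn_after_add":     ["Weight", "BN", "ReLU", "Weight", "Add", "BN", "ReLU"],
    "c_relu_before_add":  ["Weight", "BN", "ReLU", "Weight", "BN", "ReLU", "Add"],
    "d_relu_only_preact": ["ReLU", "Weight", "BN", "ReLU", "Weight", "BN", "Add"],
    "e_full_preact":      ["BN", "ReLU", "Weight", "BN", "ReLU", "Weight", "Add"],
}

def properties(ops):
    add = ops.index("Add")
    relus = [i for i, op in enumerate(ops) if op == "ReLU"]
    return {
        # Nothing after the addition means f is the identity.
        "f_is_identity": add == len(ops) - 1,
        # A ReLU right before the addition forces F >= 0.
        "residual_can_be_negative": ops[add - 1] != "ReLU",
        # Every ReLU operates on a BN-normalized input.
        "every_relu_follows_bn": all(i > 0 and ops[i - 1] == "BN" for i in relus),
    }

passing = [name for name, ops in ARRANGEMENTS.items() if all(properties(ops).values())]
print(passing)
```

Arrangements (a) and (b) fail the first property, (c) fails the second, and (d) fails the third, leaving only the full pre-activation ordering.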

Why is option (c) — moving ReLU before the addition — a bad idea?

Chapter 5: Pre-Activation Design

We have established that the shortcut must be the identity. Now we turn to the second condition: making the post-addition function f also an identity. This is where the paper's most influential architectural contribution emerges.

The key insight is a shift in perspective. In the original design, BN and ReLU are thought of as post-activation — they come after each weight layer. But when you have a branching structure with addition, the position relative to the addition matters more than the position relative to the weights.

The paper proposes a beautiful reinterpretation. Take the post-addition ReLU from block l and conceptually "push" it to the beginning of block l+1. What was the output activation of one block becomes the input activation of the next. Now it is a pre-activation — it processes the input before it enters the weight layers. The shortcut path remains untouched.

This is not adding or removing anything. It is a rewriting of the same computation. The same BN and ReLU operations happen, in the same order, on the same data. We are just drawing the block boundaries differently — and this change in perspective reveals that the activation can be moved entirely onto the F path.

Post-activation (original)
x → [Weight → BN → ReLU → Weight → BN] → Add(+x) → ReLU → output
↓ rearrange
Pre-activation (proposed)
x → [BN → ReLU → Weight → BN → ReLU → Weight] → Add(+x) → output

The two designs use exactly the same components. The only difference is the ordering. But the consequences are profound:

| Property | Post-activation | Pre-activation |
|---|---|---|
| After-addition function f | ReLU (clips negatives) | Identity (clean pass-through) |
| Shortcut path | Passes through ReLU | Pure identity, unmodified |
| BN normalization | Normalizes F, then adds to un-normalized shortcut | Normalizes input to ALL weight layers |
| Gradient highway | Approximate (ReLU can clip) | Exact: $\partial E / \partial x_l$ always includes $\partial E / \partial x_L \cdot 1$ |

Two benefits at once. Pre-activation gives both easier optimization (clean gradient highway) and better regularization (BN normalizes all inputs to weight layers, including the signal coming through the shortcut). In the original design, the shortcut signal bypasses BN and arrives un-normalized at the next block's weights.

In code, the difference is remarkably small. A post-activation block does out = relu(x + F(x)). A pre-activation block does out = x + F(bn_relu(x)). Same components, different order. But the second version unlocks 1000-layer training.
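Spelled out on plain Python lists, the two blocks might look like the sketch below. It is a toy under stated assumptions: `linear` is a dense layer standing in for the convolutions, and `normalize` standardizes a single vector across its features as a stand-in for batch normalization (real BN uses batch statistics plus a learned scale and shift).

```python
import math

def relu(v):
    return [max(0.0, a) for a in v]

def normalize(v, eps=1e-5):
    """Stand-in for BN: zero-mean, unit-variance across features (assumption)."""
    mean = sum(v) / len(v)
    var = sum((a - mean) ** 2 for a in v) / len(v)
    return [(a - mean) / math.sqrt(var + eps) for a in v]

def linear(W, v):
    return [sum(w * a for w, a in zip(row, v)) for row in W]

def add(u, v):
    return [a + b for a, b in zip(u, v)]

def post_act_block(x, W1, W2):
    # Weight -> BN -> ReLU -> Weight -> BN -> Add -> ReLU
    r = normalize(linear(W2, relu(normalize(linear(W1, x)))))
    return relu(add(x, r))

def pre_act_block(x, W1, W2):
    # BN -> ReLU -> Weight -> BN -> ReLU -> Weight -> Add
    r = linear(W2, relu(normalize(linear(W1, relu(normalize(x))))))
    return add(x, r)

x = [-1.0, 0.5, 2.0]
Z = [[0.0] * 3 for _ in range(3)]   # zero weights, so the residual F is zero
post_out = post_act_block(x, Z, Z)  # shortcut is clipped by the final ReLU
pre_out = pre_act_block(x, Z, Z)    # shortcut passes through unchanged
print(post_out, pre_out)
```

With the residual switched off, the pre-activation block returns its input exactly, negative entry included, while the post-activation block still clips it. That is the whole difference between f = ReLU and f = identity.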

The pre-activation design also has an elegant mathematical form:

$$x_{l+1} = x_l + F(\hat{f}(x_l), W_l)$$

where f̂ is the BN-ReLU pre-activation. The activation only affects the F path. The shortcut is a pure additive identity. This is exactly the condition needed for the clean propagation equations to hold.

Think about why this matters at 1000 layers. In the original design, x passes through 333 post-addition ReLUs on the shortcut path. Even if each ReLU clips only 5% of the time, the probability that at least one clips grows rapidly — and each clip zeroes out the gradient along that path. In the pre-activation design, x passes through 333 additions and nothing else. The gradient highway is guaranteed to carry signal, regardless of depth.

The distinction between post-activation and pre-activation is only meaningful because of the branching structure. In a plain (non-residual) network with N layers, there are N−1 activations, and it doesn't matter whether you call them "post" or "pre" — the sequence is the same either way. But once you introduce a skip connection that branches around the weight layers, the position of the activation relative to the addition determines whether it sits on the shortcut path or not. This is why the paper's figure showing the equivalence between asymmetric output activation and pre-activation (their Figure 5) is so clarifying: what looks like a different design is really just a different way of drawing the same components, but the change in perspective reveals which path the activation affects.

Besides cleaner gradient flow, what is the second benefit of the pre-activation design?

Chapter 6: Showcase — Post vs Pre

This simulation shows both architectures simultaneously. The top half displays the post-activation (original) design; the bottom half displays the pre-activation (proposed) design. Each bar represents one residual block's gradient magnitude during backpropagation.

Watch how the post-activation gradients become uneven and weakened as you increase depth, while the pre-activation gradients remain strong and stable. The difference is especially dramatic at 20+ blocks.

Press "Run Simulation" to generate forward signals, then "Show Gradients" to compare

At depth 30+, notice how some post-activation gradient bars drop to near zero — those are blocks where the ReLU clipped the signal on the shortcut path. Meanwhile, every pre-activation bar stays healthy. This is the gradient highway at work.

What to look for. Run the simulation at depth 10, then increase to 30 or 40. At low depth, both architectures look similar. At high depth, the post-activation design develops "dead" blocks with near-zero gradients, while pre-activation maintains uniform gradient flow. This matches the paper's finding that the benefit of pre-activation grows with depth.

Also notice the block diagram at the top of the canvas. It shows the internal structure of each architecture — where BN, ReLU, and the weight layers sit relative to the addition. In the post-activation design, the ReLU (shown in orange) sits after the addition, on the shortcut path. In the pre-activation design, BN+ReLU (shown in teal) sit before the weight layers, keeping the path after addition completely clean.

Try this experiment: run the simulation at depth 10 and note the gradient ratio (first/last block). Then increase to 20 and run again. Then 30. Then 40. In the post-activation column, the ratio drops — meaning the first block gets less gradient relative to the last block. In the pre-activation column, the ratio stays near 1.0 regardless of depth. This is the gradient highway in action.

The forward signal view is also informative. In the post-activation design, the signal is always non-negative (because ReLU clips after every addition). In the pre-activation design, the signal can take any value — the residuals are free to add or subtract. This flexibility is important for representational power, because a good residual function should be able to both increase and decrease activations.

The block diagrams at the top of the canvas highlight the key structural difference. In the post-activation diagram, the highlighted ReLU box sits after the addition — on the shortcut path. In the pre-activation diagram, the highlighted BN+ReLU boxes sit before the weight layers, leaving the addition as the final, clean operation. Same components, but in the pre-activation arrangement, the shortcut path touches nothing except the element-wise add.

In the simulation, what happens to gradient flow at the earliest layers as you increase depth in post-activation mode?

Chapter 7: The Experiments

Theory is only as good as its empirical validation. The paper runs systematic ablations on CIFAR-10 using two architectures: ResNet-110 (54 two-layer residual units) and ResNet-164 (54 three-layer bottleneck units).

Every result is the median of 5 independent training runs to reduce the impact of random initialization and data ordering. This is unusually rigorous for a 2016 paper and makes the comparisons highly trustworthy.

Activation placement (keeping identity shortcuts):

| Arrangement | ResNet-110 | ResNet-164 |
|---|---|---|
| (a) Original (post-act) | 6.61% | 5.93% |
| (b) BN after addition | 8.17% | 6.50% |
| (c) ReLU before addition | 7.84% | 6.14% |
| (d) ReLU-only pre-act | 6.71% | 5.91% |
| (e) Full pre-activation | 6.37% | 5.46% |

The full pre-activation design wins on both architectures. The improvement is modest at 110-164 layers: about 0.24 percentage points on ResNet-110 and 0.47 on ResNet-164. These depths are not extreme enough for the post-addition ReLU to cause serious optimization problems.

But notice that the improvement is larger on the deeper network (164 vs 110). This trend continues dramatically at 1001 layers, as we will see in the next chapter.

BN after addition is worst. Option (b) puts BN on the shortcut path, which modifies the signal passing through. This is exactly the kind of disruption the theory predicts will harm optimization. The training loss struggles to decrease at the start of training, consistent with impeded information flow.

Why ReLU-only pre-activation (d) barely helps: This arrangement moves ReLU before the weights but keeps BN after them. The ReLU alone does not enjoy the benefits of batch normalization — it operates on un-normalized inputs. The full pre-activation (e) places BN and ReLU together before the weights, ensuring every weight layer receives normalized, activated input. This gives both the optimization benefit (clean shortcut) and the regularization benefit (uniform normalization).

Why option (c) fails: Moving ReLU before the addition seems logical, since it gets the ReLU off the shortcut path. But it also constrains the residual function F to output only non-negative values (because ReLU is the last thing F computes before the add). A residual should be a correction that can go in either direction: positive to increase activations, negative to decrease them. If F can only add, never subtract, the forward signal is monotonically increasing through the network: each block can only make things bigger, never smaller. The test error is consistent with this representational loss: 7.84% vs the 6.61% baseline on ResNet-110.

ImageNet results: On ImageNet, the original ResNet-200 overfits — it has lower training error but higher test error than ResNet-152. The pre-activation ResNet-200 fixes this, achieving 20.7% top-1 error (vs. 21.8% for the original ResNet-200 and 21.3% for ResNet-152). The pre-activation BN regularization prevents the overfitting.

Why does putting BN after the addition (option b) perform the worst?

Chapter 8: 1001 Layers

The ultimate test of any theory is a dramatic experiment. If identity mappings really matter, then the benefit should be most visible at extreme depth, where any impurity on the shortcut path compounds across hundreds of blocks.

So the authors build a 1001-layer ResNet. This network has 333 bottleneck residual units (111 per feature map size, across three stages at 32×32, 16×16, and 8×8 spatial resolution). With 10.2M parameters, it is not even particularly large — just extremely deep.
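The layer counts follow from simple arithmetic. The sketch below assumes the usual CIFAR ResNet convention: three stages of residual units, two weight layers per basic unit or three per bottleneck unit, plus one initial convolution and one final classifier layer. The function name is illustrative.

```python
def cifar_resnet_depth(units_per_stage, layers_per_unit, stages=3):
    """Weight layers in the residual units, plus the first conv and the classifier."""
    return stages * units_per_stage * layers_per_unit + 2

print(cifar_resnet_depth(18, 2))    # basic units, 54 in total
print(cifar_resnet_depth(18, 3))    # bottleneck units, 54 in total
print(cifar_resnet_depth(111, 3))   # bottleneck units, 333 in total
```

Under this convention the three configurations come out to 110, 164 and 1001 layers, matching the network names used throughout.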

With the original post-activation design, ResNet-1001 on CIFAR-10 gets 7.61% error — worse than the 164-layer version (5.93%). The training loss decreases very slowly at the beginning, suggesting severe optimization difficulties. The post-addition ReLU, compounded over 333 blocks, creates enough gradient disruption to cripple learning.

With the pre-activation design, the same ResNet-1001 achieves 4.92% error — an improvement of 2.69 percentage points from the exact same architecture, differing only in the ordering of BN and ReLU. The training loss drops quickly and smoothly from the start. There is no initial period of stagnation. The clean gradient highway means every layer — all 1001 of them — receives useful gradient signal from the very first iteration.

With a smaller mini-batch size of 64 (which provides more gradient noise and updates per epoch, acting as additional regularization), it reaches 4.62% — state-of-the-art on CIFAR-10 at the time, using only basic data augmentation (random flips and translations). No dropout, no special regularization, no ensemble. Just depth.

| Network | Design | CIFAR-10 | CIFAR-100 |
|---|---|---|---|
| ResNet-164 | Original | 5.93% | 25.16% |
| ResNet-164 | Pre-activation | 5.46% | 24.33% |
| ResNet-1001 | Original | 7.61% | 27.82% |
| ResNet-1001 | Pre-activation | 4.92% | 22.71% |

The reversal. Original ResNet-1001 is worse than original ResNet-164: the degradation problem returns at extreme depth even with skip connections! But pre-activation ResNet-1001 is much better than pre-activation ResNet-164. The clean gradient highway makes depth purely beneficial again. This is the paper's strongest empirical evidence: the same architecture that fails at 1001 layers with post-activation thrives with pre-activation.

On CIFAR-100 (a harder dataset with 100 classes instead of 10), the gap is even more dramatic. Original ResNet-1001 gets 27.82%, worse than ResNet-164's 25.16%. The depth is actively harmful. But pre-activation ResNet-1001 gets 22.71%, a 5 percentage point improvement over the original and about 1.6 points better than pre-activation ResNet-164 (24.33%). The pre-activation design does not just enable training at extreme depth; it enables the network to productively use all 1001 layers, extracting genuine benefit from the additional capacity.
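The gaps quoted in this chapter can be recomputed from the table. The dictionary below copies the four rows as given; `gap` is a hypothetical helper that returns the difference in percentage points.

```python
# (network, design) -> error rates in percent, copied from the table above
CIFAR_ERROR = {
    ("ResNet-164", "original"): {"CIFAR-10": 5.93, "CIFAR-100": 25.16},
    ("ResNet-164", "pre-act"): {"CIFAR-10": 5.46, "CIFAR-100": 24.33},
    ("ResNet-1001", "original"): {"CIFAR-10": 7.61, "CIFAR-100": 27.82},
    ("ResNet-1001", "pre-act"): {"CIFAR-10": 4.92, "CIFAR-100": 22.71},
}

def gap(worse, better, dataset):
    """Error difference in percentage points (positive means `better` wins)."""
    return round(CIFAR_ERROR[worse][dataset] - CIFAR_ERROR[better][dataset], 2)

deep_pre = ("ResNet-1001", "pre-act")
for dataset in ("CIFAR-10", "CIFAR-100"):
    print(dataset,
          "| original-1001 minus pre-act-1001:", gap(("ResNet-1001", "original"), deep_pre, dataset),
          "| pre-act-164 minus pre-act-1001:", gap(("ResNet-164", "pre-act"), deep_pre, dataset))
```

Both comparisons favor the 1001-layer pre-activation network on both datasets, and the margins are wider on CIFAR-100.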

On ImageNet, the original ResNet-200 overfits (21.8% top-1, worse than ResNet-152's 21.3%). Pre-activation ResNet-200 achieves 20.7%, beating both. With scale and aspect ratio augmentation, it reaches 20.1% top-1 / 4.8% top-5, competitive with Inception v3.

| Method | Top-1 error | Top-5 error |
|---|---|---|
| ResNet-152, original | 21.3% | 5.5% |
| ResNet-200, original | 21.8% | 6.0% |
| ResNet-200, pre-act | 20.7% | 5.3% |
| ResNet-200, pre-act + aug | 20.1% | 4.8% |
| Inception v3 | 21.2% | 5.6% |

The ImageNet results reveal something subtle: the original ResNet-200 has lower training error than ResNet-152 but higher test error. It is overfitting. The pre-activation design fixes this because BN normalizes every input to every weight layer — including the shortcut signal that bypasses BN in the original design. This regularization effect is separate from the optimization benefit.

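Implementation details. The first and last residual units require special handling in the pre-activation design. The first unit follows a stand-alone convolution, so its first BN-ReLU is applied right after that convolution, before the signal splits into the shortcut and residual paths. The last unit gets an extra BN-ReLU right after its element-wise addition, before average pooling and the classifier.

These boundary conditions arise naturally when you "push" activations from block outputs to block inputs and handle the edges of the network.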

The computational cost of pre-activation is identical to post-activation. No extra layers, no extra parameters, no extra FLOPs. It is purely a reordering. ResNet-1001 takes about 27 hours on 2 GPUs for CIFAR; ResNet-200 takes about 3 weeks on 8 GPUs for ImageNet — comparable to VGGNet training. The cost is linear in depth: a 1001-layer net is roughly 10× the cost of a 100-layer net, as you would expect.

One practical note: the original ResNet required learning rate warmup (starting at 0.01 for 400 iterations before jumping to 0.1). The pre-activation design does not need this warmup, though the authors used it anyway for fair comparison. The smoother optimization landscape means the network can handle the full learning rate from the start.
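The warmup described here amounts to a two-level step function. The sketch below covers only the warmup phase with the values quoted above; any later learning-rate decay is outside its scope, and the function name is an illustrative assumption.

```python
def learning_rate(iteration, warmup_iters=400, warmup_rate=0.01, base_rate=0.1):
    """Small rate for the first warmup_iters iterations, then the full rate."""
    return warmup_rate if iteration < warmup_iters else base_rate

for it in (0, 399, 400, 1000):
    print(it, learning_rate(it))
```

Dropping the warmup in this sketch would simply mean calling it with `warmup_iters=0`, so that the full rate applies from the first iteration.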

Why does the original ResNet-1001 perform worse than the original ResNet-164, despite having far more layers?

Chapter 9: Connections

Pre-norm Transformers. When the Transformer arrived in 2017, it used post-norm: LayerNorm after the addition, directly on the residual stream. This is analogous to BN-after-addition (option b) in this paper, the worst-performing variant. By 2020, researchers had found that pre-norm Transformers (LayerNorm before the attention/FFN sublayers) trained much more stably, especially at scale. This is essentially the pre-activation insight from this paper. GPT-2, GPT-3, LLaMA, and most modern LLMs use pre-norm. The original Transformer's post-norm design required careful learning rate warmup; pre-norm greatly reduces that need.
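The parallel with residual blocks shows up directly in how a sublayer is wrapped. This is a minimal sketch, assuming a LayerNorm without learned gain or bias and an arbitrary `sublayer` callable standing in for attention or the FFN; none of these names come from a specific library.

```python
import math

def layer_norm(v, eps=1e-5):
    """LayerNorm without learned gain or bias (a simplifying assumption)."""
    mean = sum(v) / len(v)
    var = sum((a - mean) ** 2 for a in v) / len(v)
    return [(a - mean) / math.sqrt(var + eps) for a in v]

def post_norm(x, sublayer):
    # Normalization sits on the residual stream, after the addition.
    return layer_norm([a + b for a, b in zip(x, sublayer(x))])

def pre_norm(x, sublayer):
    # Normalization sits on the sublayer branch only; the stream is untouched.
    return [a + b for a, b in zip(x, sublayer(layer_norm(x)))]

def zero_sublayer(v):
    return [0.0] * len(v)

x = [3.0, 5.0, 10.0]
print(pre_norm(x, zero_sublayer))    # the stream passes through unchanged
print(post_norm(x, zero_sublayer))   # the stream itself is re-normalized
```

With the sublayer switched off, pre-norm returns the stream exactly, while post-norm still rescales it, which is the same distinction as f = identity versus f ≠ identity in the residual unit.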

The Residual Stream. The "residual stream" interpretation of transformers, popularized by Anthropic's interpretability research, views the skip connections as a persistent information highway, with attention and MLP blocks writing to it additively. This perspective mirrors the propagation analysis in this paper: x_L = x_0 + ∑_i F_i.
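The additive form can be checked numerically. In the sketch below, `residual_fn` is a hypothetical toy residual function (a scaled tanh) and the weights are arbitrary illustrative values; the point is only that the final state equals the input plus the sum of everything the blocks wrote.

```python
import math

def residual_fn(x, w):
    """Toy residual function F: a scaled tanh (stand-in for conv, attention, or MLP)."""
    return [w * math.tanh(v) for v in x]

def run_stack(x0, weights):
    """Apply x <- x + F(x) once per weight; also record what each block wrote."""
    x = list(x0)
    writes = []
    for w in weights:
        f = residual_fn(x, w)
        writes.append(f)
        x = [a + b for a, b in zip(x, f)]
    return x, writes

x0 = [0.5, -1.0, 2.0]
weights = [0.1, -0.3, 0.2, 0.05]   # illustrative values only
xL, writes = run_stack(x0, weights)

# Unrolled form: the input plus the sum of all block outputs.
reconstructed = [x0[j] + sum(f[j] for f in writes) for j in range(len(x0))]
```

Each entry of `writes` depends on the stream state at that depth, so the blocks are not independent, but their contributions still combine by plain addition.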

DenseNet. DenseNet (Huang et al., 2017) took the information-flow insight further: instead of adding residuals, it concatenates features. Within a dense block, every layer receives the feature maps of every earlier layer as input, creating maximally direct gradient paths. The trade-off is memory: concatenation grows the channel count linearly with depth, requiring careful "growth rate" control.
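The linear channel growth is simple arithmetic. In this sketch the function name is hypothetical and the numbers (16 input channels, growth rate 12) are illustrative, not a claim about any specific published configuration.

```python
def dense_layer_input_channels(k0, growth_rate, layer_index):
    """Input channels seen by the layer_index-th layer (1-based) of a dense block.

    k0 is the channel count entering the block; each earlier layer appends
    `growth_rate` new feature maps through concatenation.
    """
    return k0 + growth_rate * (layer_index - 1)

k0, k = 16, 12   # illustrative values
channels = [dense_layer_input_channels(k0, k, i) for i in range(1, 14)]
# A residual block with the same input would keep 16 channels at every layer,
# because addition does not change the channel count.
```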

Highway Networks. This paper's experiments weigh in on the choice between identity shortcuts (ResNet) and gated shortcuts (Highway Networks). Gates may help in some regimes, but in the paper's ablations on very deep networks, every gated or scaled shortcut trained worse than the plain identity. The simplest design wins.

Batch normalization placement. The finding that BN should go before weight layers (pre-activation) rather than after them has influenced architecture design across domains. In modern practice, normalization-before-transformation is the default pattern. The paper showed that BN after addition hurts because it modifies the shortcut signal, while BN before weights provides regularization to all inputs uniformly.

Depth as a dimension. One of the paper's conclusions is often overlooked: with the right architecture, added depth keeps paying off. Pre-activation ResNet-1001 outperforms pre-activation ResNet-164 on CIFAR, whereas with the original unit the 1001-layer network is worse than the 164-layer one. This suggests that the earlier limits on depth came from optimization rather than from depth itself, and it complicates the simple view that "wider is better than deeper." The caveat is that the architecture must maintain clean information highways.

Ensemble interpretation. The summation form x_L = x_0 + ∑_i F_i hints that a ResNet behaves like an implicit ensemble. Veit et al. (2016) later showed that ResNets can be viewed as ensembles of many relatively shallow networks of different lengths: deleting individual blocks at test time barely affects performance. This paper's propagation analysis provides the mathematical starting point for that interpretation.
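A quick way to see why the paths are mostly shallow is to count them. The sketch below assumes the unraveled view from Veit et al., in which a signal either passes through or skips each residual block; `path_counts` is a hypothetical helper name.

```python
from math import comb

def path_counts(num_blocks):
    """Number of paths of each length through `num_blocks` residual blocks.

    A path of length k passes through k residual functions and takes the
    identity shortcut around the remaining blocks, so there are C(n, k) of them.
    """
    return [comb(num_blocks, k) for k in range(num_blocks + 1)]

counts = path_counts(10)
total_paths = sum(counts)          # every block doubles the number of paths
most_common_length = counts.index(max(counts))
```

For 10 blocks there is only one path that uses every block, while the most common path length is half the depth.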

From ResNet to GPT. The architectural DNA of this paper lives in most modern LLMs. GPT-style models use: (1) residual connections around every sublayer (from ResNet, 2015), (2) normalization placed before attention and FFN (the pre-activation idea from this paper, 2016), (3) learned residual functions (attention, MLP) added to an identity stream. The "residual stream" interpretation of transformers, where attention heads and MLP layers write to a persistent information highway, has the same form as the x_L = x_0 + ∑_i F_i equation derived here. A 96-layer GPT model has 192 residual additions (one for attention and one for MLP in each layer). The clean shortcut principle is what makes that depth trainable.

The lasting principle. "Keep the residual stream clean" has become a foundational design principle in deep learning. Most modern architectures (transformers, diffusion U-Nets, vision transformers, state-space models) follow this rule. This paper provided the mathematical justification and empirical evidence that the identity mapping is not just one valid choice among many: in the paper's analysis and experiments, it is the shortcut that propagates signals and gradients most directly. Any complexity belongs in the residual function F, never on the highway.

Paper details. "Identity Mappings in Deep Residual Networks," Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun. ECCV 2016. arXiv:1603.05027. Submitted March 2016. Code: github.com/KaimingHe/resnet-1k-layers

Modern LLMs like GPT-3 use "pre-norm" Transformers (LayerNorm before attention/FFN). How does this relate to the pre-activation design in this paper?