Initialization — Xavier to Kaiming

Chapter 0: Why Initialization Matters

Initialize a 20-layer network with weights drawn from N(0, 1) — standard normal. Run a single forward pass. Print the activation statistics at each layer. Layer 1: mean 0, std 1. Layer 5: std 50. Layer 10: std 10,000. Layer 20: overflow to infinity. Training hasn't even started and your network is already dead.

This isn't an edge case. This is what happens by default when you don't think about initialization. The very first time data flows through your network — before a single gradient is computed, before a single weight is updated — the initialization alone determines whether the network can learn or not.

Let's see exactly why, by hand.

Hand Calculation: The First Forward Pass

We'll trace a forward pass through a tiny 3-layer network with 2 neurons per layer. All weights are drawn from N(0, 1) — the naive default. No bias, ReLU activation.

Setup. Input x = [1.0, 0.5]. Three 2×2 weight matrices, all sampled from a standard normal distribution. We'll use specific draws to show what happens in a concrete case.

Layer 1: W₁ = [[0.8, -1.2], [0.5, 1.1]]

Compute h₁ = W₁ · x:

h₁[0] = 0.8 × 1.0 + (-1.2) × 0.5 = 0.8 - 0.6 = 0.2
h₁[1] = 0.5 × 1.0 + 1.1 × 0.5 = 0.5 + 0.55 = 1.05

Apply ReLU: max(0, [0.2, 1.05]) = [0.2, 1.05]. Both positive, so unchanged.

Layer 2: W₂ = [[1.5, -0.3], [0.9, 1.8]]

Compute h₂ = W₂ · [0.2, 1.05]:

h₂[0] = 1.5 × 0.2 + (-0.3) × 1.05 = 0.30 - 0.315 = -0.015
h₂[1] = 0.9 × 0.2 + 1.8 × 1.05 = 0.18 + 1.89 = 2.07

Apply ReLU: max(0, [-0.015, 2.07]) = [0, 2.07]. One neuron is already dead.

Layer 3: W₃ = [[1.3, 0.7], [-0.4, 1.6]]

Compute h₃ = W₃ · [0, 2.07]:

h₃[0] = 1.3 × 0 + 0.7 × 2.07 = 0 + 1.449 = 1.449
h₃[1] = -0.4 × 0 + 1.6 × 2.07 = 0 + 3.312 = 3.312

Apply ReLU: [1.449, 3.312].

Input scale was ~1. After just 3 layers, output values have tripled. In a real network with 512 neurons per layer, the effect is exponentially worse. Why? Because each layer multiplies the signal by a matrix whose entries have variance 1 — far too large.

The Variance Explosion

Here is the core mathematical fact. A single linear layer computes y = W · x, where W is a d×d weight matrix and x is a d-dimensional input. If the weights have variance σ_w² and the input has variance σ_x², then the output variance is:

Var(y) = d · σ_w² · σ_x²

With d = 512 neurons and σ_w² = 1 (standard normal), a single layer multiplies the variance by 512. After two layers: 512² = 262,144. After 10 layers: 512¹⁰ ≈ 10²⁷. The numbers have long since overflowed to infinity.

Layers	Variance Factor (512ⁿ)	What Happens
1	512	10× too large
2	262,144	Activations saturate
5	3.5 × 10¹³	float32 overflow
10	1.3 × 10²⁷	Infinity / NaN
20	1.6 × 10⁵⁴	Beyond any float

And if we make the weights too small — say σ_w = 0.01, so σ_w² = 0.0001 — each layer multiplies variance by 512 × 0.0001 = 0.0512. After 10 layers: 0.0512¹⁰ ≈ 10^-13. All activations collapse to zero. The network outputs the same thing regardless of input.

You might think "the optimizer will fix bad initialization." It can't. If activations at layer 20 are infinity or zero, the gradients are also infinity or zero. The optimizer never gets a useful signal. Initialization doesn't just affect WHERE training starts — it determines WHETHER training can happen at all. A network with exploded activations has infinite loss and NaN gradients on the very first step. There is no recovery.

See It: Activation Propagation

The simulation below shows a 10-layer network. Input enters at the top. As it flows through each layer, the activation magnitude is shown as a colored bar. Green means healthy (magnitude near 1), yellow means concerning, red means exploding or vanishing. Drag the slider to control the initial weight standard deviation.

Activation Propagation Through 10 Layers

Drag the weight std slider. At std≈1.0, activations explode. At std≈0.01, they vanish. Find the sweet spot where all 10 layers stay green.

Weight Std 1.00

Notice the pattern. At std = 1.0 (the default in many frameworks), the bars turn red almost immediately — variance explodes through layers. At std = 0.01, the bars shrink to nothing — variance vanishes. There's a narrow sweet spot around std ≈ 0.04-0.07 (for 512-dim layers) where all bars stay green.

That sweet spot is not something you find by trial and error. It has an exact mathematical formula. The next chapter derives it.

From Scratch: Watching the Explosion

python
import torch

torch.manual_seed(42)
x = torch.randn(64, 512)  # batch of 64, 512 features

print("=== N(0, 1) weights: EXPLOSION ===")
for i in range(20):
    W = torch.randn(512, 512)  # std = 1.0
    x = x @ W
    x = torch.relu(x)
    print(f"Layer {i:2d}: mean={x.mean():12.2f}  std={x.std():12.2f}")

# Reset and try too-small weights
torch.manual_seed(42)
x = torch.randn(64, 512)

print("\n=== N(0, 0.01) weights: VANISHING ===")
for i in range(20):
    W = torch.randn(512, 512) * 0.01
    x = x @ W
    x = torch.relu(x)
    print(f"Layer {i:2d}: mean={x.mean():12.6f}  std={x.std():12.6f}")

# Reset and try the sweet spot
torch.manual_seed(42)
x = torch.randn(64, 512)

print("\n=== N(0, sqrt(2/512)) weights: STABLE ===")
for i in range(20):
    W = torch.randn(512, 512) * (2/512)**0.5
    x = x @ W
    x = torch.relu(x)
    print(f"Layer {i:2d}: mean={x.mean():.4f}  std={x.std():.4f}")

Run this yourself. The first block overflows by layer 10. The second collapses to zero by layer 5. The third — using √(2/512) as the weight standard deviation — stays stable across all 20 layers. That √(2/n) is He initialization, which we'll derive in Chapter 3. But first, we need to understand the underlying principle: variance preservation.

If you initialize all weights from N(0, 1) in a 20-layer network with 512 neurons per layer and no normalization, what happens on the first forward pass?

Training converges slowly but eventually works Activations explode exponentially through layers, reaching infinity long before layer 20 Activations stay roughly the same at each layer The network automatically adjusts its weights to compensate

Chapter 1: The Variance Preservation Principle

The explosion/collapse problem from Chapter 0 has a clean mathematical solution. If we can make each layer preserve the variance of its input — variance in equals variance out — then even a 100-layer network will have healthy activations at every layer. The question is: what weight variance achieves this?

The answer turns out to be beautifully simple. And once you see it, the entire zoo of initialization methods — Xavier, He, Lecun — becomes obvious.

Deriving the Variance Formula

Consider a single neuron with n_in input connections. It computes:

y = w₁x₁ + w₂x₂ + … + w_nx_n = ∑_i=1^n_in w_i x_i

We need three assumptions (all standard and approximately true in practice):

Weights and inputs are independent. This holds because weights are initialized randomly before seeing any data.
Weights have zero mean. E[w_i] = 0. True for all standard initialization schemes.
Inputs have zero mean. E[x_i] = 0. Approximately true after centering or batch normalization.

Under these assumptions, the variance of a product of two independent zero-mean random variables is:

Var(w_i · x_i) = Var(w_i) · Var(x_i)

And the variance of a sum of independent random variables is:

Var(y) = ∑_i=1^n_in Var(w_i) · Var(x_i)

Since all weights share the same variance Var(w) and all inputs share the same variance Var(x), this simplifies to:

Var(y) = n_in · Var(w) · Var(x)

This is the key equation. It says the output variance equals the input variance multiplied by n_in · Var(w). If this product is greater than 1, variance grows at each layer. If less than 1, it shrinks. For stability, we need n_in · Var(w) = 1, which gives us Var(w) = 1/n_in.

Hand Calculation: The Numbers

Example 1: fan_in = 512, naive initialization N(0, 1)

Weight variance Var(w) = 1. Input variance Var(x) = 1 (standardized input).

Var(y) = 512 × 1 × 1 = 512

Output variance is 512× the input. After 10 layers: 512¹⁰ ≈ 1.3 × 10²⁷. Catastrophic explosion.

Example 2: fan_in = 512, correct initialization Var(w) = 1/512

Var(w) = 1/512 = 0.00195 Std(w) = √(1/512) = 0.0442

Var(y) = 512 × (1/512) × 1 = 1

Output variance equals input variance. After 10 layers: 1¹⁰ = 1. After 100 layers: 1¹⁰⁰ = 1. Perfectly preserved.

Example 3: fan_in = 64

Var(w) = 1/64 = 0.01563 Std(w) = √(1/64) = 0.125

Notice the weights are larger for smaller layers. This makes intuitive sense: with fewer inputs, each weight contributes more to the output, so each individual weight can be larger without blowing up the sum.

See It: Variance at Each Layer

The simulation below shows two panels. Left: variance at each layer for naive N(0,1) initialization (exponential growth). Right: variance for Var(w) = 1/n_in (flat line at 1). Use the fan_in slider to see how layer width affects the explosion rate, and the weight variance slider to find the sweet spot yourself.

Variance Propagation: Naive vs Correct

Left panel uses your chosen weight variance. Right panel uses the theoretically correct 1/fan_in. Adjust fan_in to see how wider layers make the explosion worse.

fan_in 512

Var(w) 0.0020

Play with this. When fan_in = 512 and Var(w) = 1/512 ≈ 0.00195, the left panel shows a flat line at variance = 1 — perfect preservation. Increase Var(w) even slightly, and the left panel curves upward exponentially. Decrease it, and it curves downward toward zero.

The right panel always shows the correct initialization. Notice how the theoretically correct Var(w) changes with fan_in: wider layers need smaller weights.

From Scratch: Empirical Verification

python
import numpy as np

np.random.seed(42)
fan_in = 512
n_layers = 10
batch_size = 1000

# --- Naive: Var(w) = 1 ---
print("Naive N(0,1):")
x = np.random.randn(batch_size, fan_in)
for i in range(n_layers):
    W = np.random.randn(fan_in, fan_in)  # Var = 1
    x = x @ W
    print(f"  Layer {i}: Var = {np.var(x):.2e}")

# --- Correct: Var(w) = 1/fan_in ---
print(f"\nCorrect Var(w) = 1/{fan_in} = {1/fan_in:.6f}:")
x = np.random.randn(batch_size, fan_in)
for i in range(n_layers):
    W = np.random.randn(fan_in, fan_in) * np.sqrt(1/fan_in)
    x = x @ W
    print(f"  Layer {i}: Var = {np.var(x):.4f}")

Run this. The naive version shows variance doubling at every layer: 500, 260000, 10⁸, ... until overflow. The correct version shows variance hovering near 1.0 at every layer, with small random fluctuations.

Variance preservation is an AVERAGE property, not a guarantee. Individual activations can still grow or shrink — we're only ensuring the expected variance stays constant. In practice, this is good enough because the law of large numbers kicks in with hundreds of neurons per layer. With fan_in = 512, the actual output variance will be within a few percent of the theoretical prediction on any given forward pass.

The Fan-In Rule

We've arrived at the fundamental principle behind all modern initialization:

The Fan-In Rule. To preserve variance through a linear layer with n_in input connections, set the weight variance to Var(w) = 1/n_in. This makes the output variance equal to the input variance, preventing both explosion and collapse.

This single equation — Var(w) = 1/n_in — is the seed from which Xavier, He, and every other initialization method grows. Xavier extends it to account for the backward pass. He extends it to account for ReLU. But the core idea is always: make each layer preserve variance.

For a layer with 256 input neurons, what should the weight variance be to preserve input variance?

256 1/256 ≈ 0.0039 1.0 256/2 = 128

Chapter 2: Xavier/Glorot Initialization

Chapter 1 derived Var(w) = 1/fan_in for the forward pass. But gradients flow backward. For gradients to preserve variance during backpropagation, we need a different condition: Var(w) = 1/fan_out. These are different numbers when the layer isn't square.

Xavier Glorot and Yoshua Bengio noticed this problem in 2010. Their solution is elegant: use the average of the two requirements.

The Backward Pass Problem

During backpropagation, the gradient at layer l flows from layer l+1. The gradient with respect to the input is:

∂L/∂x = W^T · ∂L/∂y

This is another matrix multiplication, but now the matrix is W^T, which has shape fan_out × fan_in. The "effective fan_in" from the gradient's perspective is fan_out. By the same variance argument from Chapter 1:

Var(∂L/∂x) = fan_out · Var(w) · Var(∂L/∂y)

For gradient variance preservation, we need fan_out · Var(w) = 1, so Var(w) = 1/fan_out.

The Compromise

We have two conflicting requirements:

Direction	Requirement	Preserves
Forward	Var(w) = 1/fan_in	Activation variance
Backward	Var(w) = 1/fan_out	Gradient variance

We can satisfy both simultaneously only when fan_in = fan_out (a square weight matrix). For non-square layers, Xavier uses the harmonic mean — the average of the two:

Var(w) = 2 / (fan_in + fan_out)

This doesn't perfectly satisfy either requirement, but it comes close to both. The forward pass variance is slightly off, and the backward pass variance is slightly off, but neither explodes or vanishes.

Two Distributions

Xavier initialization can be implemented with either a normal or uniform distribution:

Xavier Normal:

W ~ N(0, 2 / (fan_in + fan_out))

Xavier Uniform:

W ~ U(-a, a) where a = √(6 / (fan_in + fan_out))

Why √6? Because the variance of a uniform distribution U(-a, a) is a²/3. Setting a²/3 = 2/(fan_in + fan_out) gives a² = 6/(fan_in + fan_out), so a = √(6/(fan_in + fan_out)).

Hand Calculation: Concrete Numbers

Layer: fan_in = 512, fan_out = 256.

Xavier Var = 2 / (512 + 256) = 2 / 768 = 0.00260

Xavier Std = √0.00260 = 0.0510

Xavier Uniform: a = √(6 / 768) = √0.00781 = 0.0884

So W ~ U(-0.0884, 0.0884). Every weight is between -0.09 and +0.09.

Compare with naive N(0, 1): that's weights with std = 1.0 — over 20× too large. It's like setting the volume on a guitar amp to 11 before checking if the mic is plugged in.

Layer: fan_in = fan_out = 512 (square).

Xavier Var = 2 / (512 + 512) = 1/512 = 0.00195

When the layer is square, Xavier reduces to exactly the 1/fan_in rule from Chapter 1. The forward and backward requirements agree.

See It: Xavier vs Naive

The simulation below shows a 10-layer tanh network. We show activation histograms at layers 1, 5, and 10. With Xavier, all layers have healthy bell-curve distributions. With naive N(0,1), the later layers saturate — all outputs crowd near ±1, where tanh is flat and gradients vanish.

Activation Distributions: Xavier vs Naive (tanh)

Toggle between Xavier and naive initialization. Watch how the activation histograms change at deep layers. Switch between Normal and Uniform Xavier variants.

fan_in 512

The naive initialization produces a devastating pattern: at layer 1, the histogram looks fine. By layer 5, values are pushed toward the extremes. By layer 10, every activation is saturated at -1 or +1. The tanh function is completely flat at these extremes, so the gradient is essentially zero. Learning stops.

With Xavier, the histogram at layer 10 looks nearly identical to layer 1 — a smooth bell curve well within the sensitive region of tanh. Gradients flow freely. Learning works.

From Scratch: Implementation

python
import torch
import torch.nn as nn
import numpy as np

# --- Xavier from scratch ---
def xavier_normal(fan_in, fan_out):
    std = np.sqrt(2.0 / (fan_in + fan_out))
    return torch.randn(fan_out, fan_in) * std

def xavier_uniform(fan_in, fan_out):
    a = np.sqrt(6.0 / (fan_in + fan_out))
    return torch.empty(fan_out, fan_in).uniform_(-a, a)

# --- Compare activations through 10 tanh layers ---
torch.manual_seed(42)
fan_in = 512

# Naive
x = torch.randn(64, fan_in)
print("Naive N(0,1) + tanh:")
for i in range(10):
    W = torch.randn(fan_in, fan_in)
    x = torch.tanh(x @ W)
    print(f"  Layer {i}: mean={x.mean():.4f}  std={x.std():.4f}")

# Xavier
torch.manual_seed(42)
x = torch.randn(64, fan_in)
print("\nXavier Normal + tanh:")
for i in range(10):
    W = xavier_normal(fan_in, fan_in)
    x = torch.tanh(x @ W)
    print(f"  Layer {i}: mean={x.mean():.4f}  std={x.std():.4f}")

# PyTorch built-in (identical)
W = torch.empty(512, 512)
nn.init.xavier_uniform_(W)   # in-place initialization
nn.init.xavier_normal_(W)    # normal variant

With naive initialization, you'll see std drop to ~0.05 by layer 5 — the network is saturated and nearly dead. With Xavier, std stays around 0.5-0.65 across all 10 layers. The slight decrease below 1.0 is expected: tanh squishes values, so it always reduces variance slightly. Xavier accounts for this by starting each layer's output at the right scale.

Xavier was derived assuming LINEAR activations (or symmetric activations like tanh where the derivative at zero is 1). It does NOT work well with ReLU. ReLU kills half the neurons — negative inputs map to zero — which effectively halves the fan_in. Xavier underestimates the needed variance for ReLU networks, causing gradients to slowly vanish. You'll see this in the simulation: with ReLU and Xavier init, later layers have progressively smaller activations. The fix is He initialization (next chapter).

Why does Xavier use 2/(fan_in + fan_out) instead of just 1/fan_in?

Because it's computationally cheaper Because it was discovered empirically to work better To preserve variance in BOTH the forward pass (needs 1/fan_in) and backward pass (needs 1/fan_out) — the average satisfies both approximately Because PyTorch requires it

Chapter 3: He/Kaiming Initialization

Xavier assumes the activation function is linear or symmetric. ReLU is neither — it kills everything below zero. If half your neurons output zero, the effective fan_in is halved, and Xavier's variance estimate is too small by a factor of 2. He initialization adds that missing factor.

Published by Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun in 2015 — the same team behind ResNets — this initialization made it possible to train networks with over 100 layers using ReLU activations. Before He init, deep ReLU networks suffered from the slow variance collapse that Xavier couldn't prevent.

The ReLU Problem

Recall from Chapter 1: Var(y) = n_in · Var(w) · Var(x). This assumed linear activations. Now add ReLU:

y = ReLU(w₁x₁ + w₂x₂ + … + w_nx_n)

ReLU(z) = max(0, z). If the pre-activation z follows a symmetric distribution around zero (which it does with zero-mean weights and inputs), then exactly half the values are negative and get zeroed out. The other half pass through unchanged.

What does this do to variance? If we zero out half the values and keep the other half, the expected squared value is halved:

E[ReLU(z)²] = 0.5 × E[z²]

Since E[z] = 0 (zero-mean), E[z²] = Var(z). So:

Var(ReLU(z)) = 0.5 × Var(z)

ReLU halves the variance at every layer. With Xavier's Var(w) = 1/fan_in, the variance after ReLU is:

Var(output) = 0.5 × n_in × (1/n_in) × Var(x) = 0.5 × Var(x)

Each layer halves the variance. After 10 layers: 0.5¹⁰ = 0.001. After 20 layers: 0.5²⁰ = 10^-6. The signal slowly bleeds out. Not as dramatic as N(0,1) explosion, but equally fatal for deep networks.

The Fix: Multiply by 2

The fix is almost comically simple. If ReLU halves the variance, double the weight variance to compensate:

Var(w) = 2 / fan_in

Check it:

Var(output) = 0.5 × n_in × (2/n_in) × Var(x) = 0.5 × 2 × Var(x) = Var(x) ✓

The factor of 2 from He exactly cancels the factor of 0.5 from ReLU. Variance is preserved.

Hand Calculation: Xavier vs He

Layer with fan_in = 512, ReLU activation.

Method	Var(w)	Std(w)	After ReLU: Var(out)
Xavier	1/512 = 0.00195	0.0442	0.5 × 512 × 0.00195 × 1 = 0.5
He	2/512 = 0.00391	0.0625	0.5 × 512 × 0.00391 × 1 = 1.0

With Xavier, each layer loses half its variance. With He, variance is perfectly preserved despite ReLU zeroing out half the neurons.

After 10 ReLU layers:

Method	Cumulative Variance Factor	Signal Strength
Xavier	0.5¹⁰ = 0.001	0.1% of original — nearly dead
He	1.0¹⁰ = 1.0	100% of original — perfectly healthy

For Other Activations

He initialization assumes standard ReLU, where exactly half the values are zeroed. For variants:

Activation	Fraction Passed	Correction Factor	Var(w)
ReLU	50%	2	2/fan_in
Leaky ReLU (α=0.01)	~50% at full + 50% at α	2/(1+α²)	≈ 2/fan_in
Leaky ReLU (α=0.2)	mixed	2/(1+0.04) ≈ 1.92	1.92/fan_in
GELU / SiLU	~57%	~1.7	~1.7/fan_in
Linear / tanh	100%	1 (use Xavier)	1/fan_in

In practice, He init (factor = 2) works well enough for GELU, SiLU, and Leaky ReLU too. The correction for these activations is between 1.6 and 2.0, so using exactly 2 is close enough.

See It: Xavier vs He with ReLU

The simulation below shows a 10-layer ReLU network. Side-by-side activation histograms at layers 1, 5, and 10. With Xavier, variance dies across layers (histograms narrow progressively). With He, variance is preserved (histograms keep the same spread).

Xavier vs He: 10-Layer ReLU Network

Toggle between Xavier and He initialization. Watch the activation histograms at layers 1, 5, and 10 — Xavier slowly dies, He stays healthy.

fan_in 512

The difference is striking at layer 10. Xavier's histogram has collapsed to a narrow spike near zero — the signal has been halved 10 times, leaving only 0.1% of the original variance. He's histogram at layer 10 looks nearly identical to layer 1. The ReLU is still zeroing half the values (visible as a tall bar at 0), but the surviving values maintain their spread.

From Scratch: Implementation

python
import torch
import torch.nn as nn
import numpy as np

# --- He init from scratch ---
def he_normal(fan_in, fan_out):
    std = np.sqrt(2.0 / fan_in)
    return torch.randn(fan_out, fan_in) * std

def he_uniform(fan_in, fan_out):
    a = np.sqrt(6.0 / fan_in)  # Var = a²/3 = 2/fan_in
    return torch.empty(fan_out, fan_in).uniform_(-a, a)

# --- Compare Xavier vs He through ReLU layers ---
torch.manual_seed(42)
fan_in = 512

# Xavier + ReLU (slow death)
x = torch.randn(64, fan_in)
print("Xavier + ReLU:")
for i in range(10):
    W = torch.randn(fan_in, fan_in) * np.sqrt(1/fan_in)
    x = torch.relu(x @ W)
    print(f"  Layer {i}: mean={x.mean():.4f}  std={x.std():.4f}")

# He + ReLU (stable)
torch.manual_seed(42)
x = torch.randn(64, fan_in)
print("\nHe + ReLU:")
for i in range(10):
    W = he_normal(fan_in, fan_in)
    x = torch.relu(x @ W)
    print(f"  Layer {i}: mean={x.mean():.4f}  std={x.std():.4f}")

# PyTorch built-in
W = torch.empty(512, 512)
nn.init.kaiming_normal_(W, mode='fan_in', nonlinearity='relu')
nn.init.kaiming_uniform_(W, mode='fan_in', nonlinearity='relu')

# mode='fan_in' (default) preserves forward pass variance
# mode='fan_out' preserves backward pass variance
# nonlinearity='relu' uses gain=√2; 'leaky_relu' adjusts for slope

With Xavier + ReLU, std drops to ~0.15 by layer 10 — the network is dying. With He + ReLU, std stays near 0.58 (the expected ReLU output std for unit input) across all 10 layers. The network lives.

He initialization uses fan_in (not fan_out) by default because it's designed for the forward pass with ReLU. PyTorch's kaiming_normal_ has a mode parameter: 'fan_in' preserves forward variance, 'fan_out' preserves backward variance. For most networks, 'fan_in' is the right choice — it keeps activations healthy, which matters more than perfect gradient flow in practice. Use 'fan_out' only if you're specifically worried about backward pass stability (rare with modern optimizers like Adam).

The Complete Picture

Linear / tanh / sigmoid

Use Xavier: Var(w) = 2 / (fan_in + fan_out)

↓

ReLU / Leaky ReLU

Use He: Var(w) = 2 / fan_in

↓

GELU / SiLU / Swish

He works well enough; or use gain ≈ 1.7 instead of 2

↓

Transformers

Usually Xavier or scaled variants; normalization layers compensate

He initialization is the default for modern deep learning. PyTorch uses Kaiming uniform (He uniform) by default for nn.Linear and nn.Conv2d layers. When you write nn.Linear(512, 256), the weights are automatically initialized with He uniform. You've been using He init all along — you just didn't know it.

Summary so far. Chapter 1: each layer should preserve variance, giving Var(w) = 1/fan_in. Chapter 2: Xavier averages forward and backward requirements, giving 2/(fan_in+fan_out). Chapter 3: He corrects for ReLU's 50% kill rate, giving 2/fan_in. Each method builds on the last. The core principle never changes: make each layer preserve the scale of its signal.

What factor does He initialization add compared to Xavier, and why?

A factor of 0.5, because ReLU doubles the effective neurons A factor of 2, because ReLU zeros out half the neurons, halving the effective variance — the ×2 compensates A factor of π, derived from the Gaussian integral No factor — He and Xavier are identical

Chapter 4: Orthogonal Initialization

Xavier and He use statistical arguments — on average, variance is preserved. Roll the dice a million times and the mean comes out right. But averages are cold comfort when a single unlucky sample explodes layer 47 and kills your training run.

What if we could guarantee norm preservation exactly? Not on average. Not in expectation. For every single input vector, the output has exactly the same magnitude as the input. No stretching, no shrinking, ever.

Such a guarantee exists, and it comes from linear algebra, not statistics. The tool is the orthogonal matrix.

What Is an Orthogonal Matrix?

A square matrix Q is orthogonal if its transpose is its inverse:

Q^T Q = I (identity matrix)

This single property has a powerful geometric consequence: multiplying any vector by Q preserves its length. Here's why. Take any vector x and compute the squared norm of Qx:

||Qx||² = (Qx)^T(Qx) = x^TQ^TQx = x^TIx = x^Tx = ||x||²

So ||Qx|| = ||x||. The matrix Q rotates (and possibly reflects) the vector, but never stretches or shrinks it. Think of it like spinning a ball on a table — the ball's size doesn't change, only its orientation.

The geometric intuition. Xavier says "on average, the variance is preserved." Orthogonal init says "for every single vector, the norm is preserved exactly." It's the difference between "on average, the bridge holds" and "the bridge holds, guaranteed." The guarantee comes from geometry, not statistics.

How to Build One: QR Decomposition

You can't just write down a random orthogonal matrix — random matrices are almost never orthogonal. But you can extract one from any random matrix using QR decomposition.

Every matrix A can be factored as A = QR, where Q is orthogonal and R is upper-triangular. We want Q — we throw away R. The process is:

Step 1

Generate a random matrix A with entries from N(0, 1)

↓

Step 2

Compute QR decomposition: A = QR

↓

Step 3

Take Q — it's orthogonal by construction

↓

Step 4

Optionally multiply by a gain factor g: W = g · Q

For rectangular weight matrices (fan_in ≠ fan_out), we generate a larger square matrix, decompose it, and take the appropriate slice of Q. The rows (or columns) of Q are still orthonormal, so the norm-preserving property holds for the dimensions we keep.

Hand Calculation: 2×2 Case

Let's walk through every step with a concrete 2×2 example.

Step 1: Random matrix.

A = [[0.3, -0.8], [1.2, 0.5]]

Step 2: QR decomposition. We use Gram-Schmidt. First column of A is a₁ = [0.3, 1.2]. Normalize it:

||a₁|| = √(0.3² + 1.2²) = √(0.09 + 1.44) = √1.53 ≈ 1.237

q₁ = [0.3/1.237, 1.2/1.237] = [0.243, 0.970]

Second column of A is a₂ = [-0.8, 0.5]. Remove the component along q₁:

proj = (a₂ · q₁) = (-0.8)(0.243) + (0.5)(0.970) = -0.194 + 0.485 = 0.291

a₂′ = a₂ - proj · q₁ = [-0.8 - 0.291(0.243), 0.5 - 0.291(0.970)] = [-0.871, 0.218]

||a₂′|| = √(0.871² + 0.218²) = √(0.759 + 0.047) = √0.806 ≈ 0.898

q₂ = [-0.871/0.898, 0.218/0.898] = [-0.970, 0.243]

Step 3: Our orthogonal matrix Q:

Q = [[0.243, -0.970], [0.970, 0.243]]

Verification: Q^TQ = I?

	Check	Result
Row 1 · Row 1	0.243² + 0.970²	0.059 + 0.941 = 1.000
Row 2 · Row 2	(-0.970)² + 0.243²	0.941 + 0.059 = 1.000
Row 1 · Row 2	0.243(-0.970) + 0.970(0.243)	-0.236 + 0.236 = 0.000

Rows are unit-length and perpendicular. That's orthogonality.

Now test norm preservation. Input x = [1, 0], so ||x|| = 1.

Qx = [0.243(1) + (-0.970)(0), 0.970(1) + 0.243(0)] = [0.243, 0.970]

||Qx|| = √(0.243² + 0.970²) = √(0.059 + 0.941) = √1.000 = 1.000

The vector rotated from pointing along the x-axis to pointing mostly along the y-axis, but its length is exactly 1. Try any other input — the norm will always be preserved.

The Gain Parameter

Pure orthogonal initialization preserves norms with gain = 1. But sometimes you want controlled scaling. PyTorch's nn.init.orthogonal_ accepts a gain parameter that multiplies Q:

W = gain · Q

With gain = √2 (appropriate for ReLU), the output norm is √2 times the input norm. This compensates for ReLU zeroing half the activations — exactly like He initialization, but with the rotational structure of an orthogonal matrix.

When to Use Orthogonal Init

Orthogonal initialization shines in three specific scenarios:

Scenario	Why Orthogonal Works
RNNs / LSTMs	The hidden state is multiplied by the recurrence matrix at every timestep. 100 timesteps = 100 multiplications by the same matrix. Orthogonal ensures no explosion or vanishing through time.
Very deep networks without normalization	If you can't use BatchNorm or LayerNorm (rare today), orthogonal init is the strongest guarantee of signal preservation.
GAN training	GANs are notoriously unstable. Orthogonal init on the discriminator helps prevent mode collapse by ensuring gradients flow reliably.

Misconception: "Orthogonal init is always better than He/Xavier." It's not. Orthogonal matrices preserve norms for linear transformations. But neural networks have nonlinear activations. After applying ReLU (which zeroes half the values), the norm guarantee breaks — you lose about half the energy regardless of the weight matrix. For ReLU networks, He initialization with its statistical factor-of-2 correction is usually better than orthogonal init. Orthogonal's real advantage is in recurrent networks, where the same weight matrix is applied many times without an interleaved correction.

Simulation: Rotation vs. Stretching

The widget below shows what happens to a 2D vector when multiplied by a weight matrix. With random init, the vector stretches or shrinks. With orthogonal init, it rotates but keeps the same length. Adjust the gain slider to see controlled scaling. Below: activation variance through 20 layers.

Orthogonal vs. Random Init

Watch how orthogonal init preserves vector norms exactly, while random init causes drift.

Gain 1.0

Implementation: From Scratch & PyTorch

python
import numpy as np

def orthogonal_init(shape, gain=1.0):
    """Orthogonal initialization via QR decomposition."""
    # Step 1: Random matrix from N(0, 1)
    rows, cols = shape
    # Make square matrix (larger dimension)
    n = max(rows, cols)
    A = np.random.randn(n, n)

    # Step 2: QR decomposition
    Q, R = np.linalg.qr(A)

    # Fix sign ambiguity (ensures uniform distribution)
    d = np.diag(R)
    Q *= np.sign(d)  # flip columns where R's diagonal is negative

    # Step 3: Slice to desired shape
    Q = Q[:rows, :cols]

    # Step 4: Apply gain
    return gain * Q


# Test: verify norm preservation
W = orthogonal_init((4, 4), gain=1.0)
x = np.random.randn(4)
print(f"||x||  = {np.linalg.norm(x):.4f}")
print(f"||Wx|| = {np.linalg.norm(W @ x):.4f}")
# They will be identical (up to floating-point precision)

python
# PyTorch built-in
import torch.nn as nn

layer = nn.Linear(256, 256)
nn.init.orthogonal_(layer.weight, gain=1.0)
# For ReLU networks, use gain=sqrt(2):
nn.init.orthogonal_(layer.weight, gain=nn.init.calculate_gain('relu'))

What geometric property makes orthogonal matrices ideal for weight initialization?

They have the largest possible determinant They minimize the condition number of the Jacobian They preserve vector norms exactly: ||Wx|| = ||x|| for every input They ensure all eigenvalues are positive

Chapter 5: Transformer & LLM Init

Transformers have a structural feature that changes the initialization game entirely: residual connections. Every sublayer — attention and feedforward — adds its output to a running residual stream. And addition accumulates.

Picture a river. Each transformer layer is a tributary that pours into it. With 96 layers (GPT-3), that's 192 tributaries (attention + FFN each). Even if each tributary carries a modest flow, 192 of them together create a flood.

The Residual Accumulation Problem

In a standard transformer block, the computation is:

x = x + Attention(x) then x = x + FFN(x)

After one block, x contains the original signal plus two additive terms. After N blocks, x has accumulated 2N additive terms:

x_final = x₀ + ∑_i=1^2N f_i(x)

If the original x₀ has variance 1, and each sublayer output f_i(x) has variance σ², then the total variance of x_final is approximately:

Var(x_final) ≈ 1 + 2N · σ²

We want this to stay near 1. Solving for σ²:

σ² ≈ 1/(2N) → σ ≈ 1/√(2N)

Each sublayer's output projection must be initialized with a standard deviation scaled down by 1/√(2N). More layers → smaller initial contributions. This is the core insight behind every modern LLM initialization recipe.

The river analogy. Think of the residual stream as a river and each sublayer as a tributary. If you have 192 tributaries (96-layer GPT-3), each one needs to be a trickle, not a torrent. The scaling factor 1/√(2N) ensures the river doesn't flood. Deeper models need weaker tributaries.

The GPT-2 Recipe

GPT-2's initialization code (and GPT-3, which follows the same pattern) uses a base standard deviation of 0.02 for most weights. But the output projections in attention and FFN get a special scaling:

σ_residual = 0.02 / √(2 · n_layers)

This 0.02 base isn't arbitrary. For GPT-2's d_model = 768:

1/√d_model = 1/√768 ≈ 0.036

GPT-2 uses 0.02, which is conservatively smaller than 0.036. This extra conservatism gives a margin of safety — it's better to start too small (slow initial learning) than too large (unstable training).

Let's compute the residual scaling for different GPT-2 variants:

Model	Layers	d_model	Base σ	Residual σ
GPT-2 Small	12	768	0.02	0.02 / √24 = 0.00408
GPT-2 Medium	24	1024	0.02	0.02 / √48 = 0.00289
GPT-2 Large	36	1280	0.02	0.02 / √72 = 0.00236
GPT-2 XL	48	1600	0.02	0.02 / √96 = 0.00204

The pattern is clear: deeper models use dramatically smaller initialization for their residual projections. A 48-layer model's residual weights start 5× smaller than a 12-layer model's.

Hand Calculation: 48-Layer Transformer

Let's verify the math for GPT-2 XL (48 layers). Does residual scaling actually keep the variance stable?

Without scaling:

Each sublayer output has variance σ² = 0.02² = 0.0004
After 96 sublayers (48 attention + 48 FFN): accumulated variance = 96 × 0.0004 = 0.0384
Total variance = 1 + 0.0384 = 1.038 — this doesn't look bad!

Wait — that's because 0.02 is already small. The real problem is what happens inside each sublayer. The output projection multiplies by a d_model × d_model matrix. With standard init (σ = 0.02 per element), the output variance is approximately d_model × 0.02² × input_variance. For d_model = 1600:

σ_sublayer² ≈ 1600 × 0.0004 = 0.64

After 96 sublayers: accumulated variance = 96 × 0.64 = 61.4
Total variance = 1 + 61.4 = 62.4 — the residual stream has 62× the variance it should!

With scaling:

Residual σ = 0.02 / √96 = 0.00204
Per-element variance = 0.00204² = 4.16 × 10^-6
Sublayer output variance ≈ 1600 × 4.16 × 10^-6 = 0.00666
After 96 sublayers: accumulated variance = 96 × 0.00666 = 0.639
Total variance = 1 + 0.639 = 1.64 — much more stable!

Reality check. Without residual scaling, a 48-layer transformer's residual stream blows up to 62× its intended variance. With scaling, it stays around 1.6×. Not perfect, but stable enough for training to begin. Layer normalization handles the rest.

μP (Maximal Update Parametrization)

The GPT-2 recipe works but has a limitation: when you change model size, you need to re-tune learning rate and other hyperparameters. μP (Maximal Update Parametrization) from Yang & Hu (2021) fixes this by adjusting initialization, learning rate, and activation scaling together in a principled way.

The key insight: μP sets init and LR so that the size of parameter updates relative to parameter magnitudes stays constant across model widths. You tune hyperparameters on a small model and they transfer directly to the large model. Cerebras and some research labs use μP for efficient hyperparameter search.

Simulation: Residual Stream Variance

Adjust the number of layers and toggle residual scaling on/off. Watch how variance grows through the network.

Residual Stream Variance Through a Transformer

Without scaling, variance grows linearly with depth. With scaling, it stays controlled.

Layers 48

d_model 768

Implementation: GPT-2 Style Init

python
import torch
import torch.nn as nn

class GPT2Block(nn.Module):
    def __init__(self, d_model, n_heads, n_layers):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads)
        self.ln2 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),  # output projection
        )
        self.n_layers = n_layers

    def _init_weights(self, module):
        if isinstance(module, nn.Linear):
            # Standard init: N(0, 0.02)
            torch.nn.init.normal_(module.weight, std=0.02)
            if module.bias is not None:
                torch.nn.init.zeros_(module.bias)

        # Special: scale down residual projections
        # These are the LAST linear layers in attention & FFN
        if hasattr(module, '_is_residual_proj'):
            std = 0.02 / ((2 * self.n_layers) ** 0.5)
            torch.nn.init.normal_(module.weight, std=std)

# For GPT-2 XL (48 layers, d_model=1600):
# Standard layers: std = 0.02
# Residual projections: std = 0.02 / sqrt(96) = 0.00204

Misconception: "0.02 works for all model sizes." The 0.02 standard deviation in GPT-2 is approximately 1/√d_model for d_model = 768 (since 1/√768 ≈ 0.036, and 0.02 is conservatively smaller). Different model sizes need different base standard deviations. Using 0.02 for a 4096-dimensional model is too small (1/√4096 = 0.0156, already below 0.02 — so 0.02 is actually too large). Using 0.02 for a 128-dimensional model is far too small (1/√128 ≈ 0.088). Always think about 1/√d_model as the baseline.

Why do transformer output projections need to be initialized with smaller weights than other layers?

Because attention heads need small initial values to be stable Because residual connections accumulate contributions from many layers — without scaling, the residual stream's variance grows linearly with depth Because the softmax function is sensitive to large inputs Because layer normalization requires small inputs to work correctly

Chapter 6: The Deep Network Explorer

Time to put everything together. This is the showcase simulation — a full deep network where you pick the initialization method, activation function, number of layers, and layer width, then watch what happens to activations and gradients as signals propagate through the network.

You've learned the theory: Xavier for tanh, He for ReLU, orthogonal for exact preservation, residual scaling for transformers. Now break things. Pick the wrong init for the wrong activation. Make the network absurdly deep. Watch activations collapse to zero or explode to infinity. Then fix it with the right initialization and see everything turn green.

How to read the explorer. The canvas shows three things. Left column: mini activation histograms at each layer — green means healthy (variance near 1), yellow means drifting, red means dead or exploding. Center: a pipe diagram where pipe width shows signal magnitude — healthy pipes are wide and green, dying pipes are thin and red, exploding pipes burst out of the frame. Right column: gradient magnitudes at each layer — same color coding. The stats bar at bottom shows exact numbers.

Experiments to Try

Here's what to look for with each combination. Try these in order:

Init	Activation	What Happens
Zeros	Any	Everything is dead immediately. All activations are zero, all gradients are zero. The network is a brick wall.
N(0,1)	None (Linear)	Activations explode exponentially — variance doubles (approximately) at each layer. By layer 10, values are in the thousands.
N(0,1)	Sigmoid	Activations saturate. Everything collapses to ~0.5. Gradients vanish because sigmoid's gradient is near zero in the saturated region.
Xavier	Tanh	Healthy! Green all the way through. This is what Xavier was designed for.
Xavier	ReLU	Gradients slowly die. Xavier doesn't account for ReLU's zeroing — variance drops by ~half each layer. By layer 16+, things are dim.
He	ReLU	Healthy! Green all the way through. He's factor-of-2 correction fixes Xavier's shortcoming.
Orthogonal	None (Linear)	Perfect preservation. Every layer has variance exactly 1.0. The flattest line you'll ever see.

Deep Network Explorer

Pick an init method, activation, depth, and width. Then run the forward and backward pass.

Layers 16

Width 128

Initialization:

Activation:

No quiz for the showcase chapter — the simulation is the test. If you can predict what will happen before clicking each combination, you've mastered initialization.

Chapter 7: Zero Init, Bias Init & Practical Recipes

We've covered the general strategies: Xavier for tanh/sigmoid, He for ReLU, orthogonal for exact preservation, residual scaling for transformers. Now the specific tricks. What do you initialize to zero? What do you initialize to a small constant? And what does the actual init code in Llama look like?

Zero Init for Residual Branches

Here's a counterintuitive idea: initialize the last layer of each residual block to all zeros.

In a residual block, the computation is:

y = x + f(x)

If f's last layer has weights W = 0, then f(x) = 0, so:

y = x + 0 = x

The network starts as the identity function. Every residual block passes its input straight through, unchanged. The network then gradually learns to deviate from identity — each block making small, incremental modifications rather than random large ones.

Why this helps. At initialization, a randomly initialized residual block adds random noise to the signal. Stack 96 of these and the residual stream is a mess. With zero init on the residual branch, the network starts as a clean identity: x goes in, x comes out. Training then gradually adds useful transformations. It's like starting with a blank canvas instead of a canvas splattered with random paint.

This technique is used in:

GPT-2/GPT-3: Output projections of attention and FFN are scaled down (not pure zero, but toward zero)
ResNet-v2 (Pre-activation ResNet): BatchNorm's γ parameter initialized to 0 for the last norm in each block
FixUp initialization: Explicitly zeros the last layer of each residual branch, allowing training without normalization layers

Bias Initialization

Biases are simpler than weights. The standard rule: initialize all biases to zero. Here's why this works:

A bias adds a constant to every neuron's pre-activation: z = Wx + b. If the weights are properly initialized (Xavier, He, etc.), the pre-activations already have zero mean. A zero bias preserves this property. A nonzero bias would shift all neurons in the same direction, which the network would need to unlearn.

There's one historical exception. With ReLU activation, some practitioners initialized biases to a small positive value like 0.01:

b = 0.01 (ensures neurons fire initially)

The argument: if a neuron's pre-activation is slightly negative at init, ReLU kills it (output = 0). A small positive bias shifts the distribution so more neurons fire. In practice, this made little difference when combined with proper He initialization, and modern networks use zero biases universally.

One more case: many modern LLMs have no biases at all. Llama, PaLM, and other recent architectures remove biases entirely from linear layers. The argument: biases add parameters but contribute little when combined with layer normalization, which re-centers activations anyway.

The Complete LLM Recipe

Let's look at what modern LLMs actually do. Here's the initialization recipe for three major architectures:

Component	GPT-2	Llama	BERT
Embeddings	N(0, 0.02)	N(0, 1/√d)	N(0, 0.02)
Linear layers	N(0, 0.02)	He normal	N(0, 0.02)
Normalization γ	ones	ones (RMSNorm)	ones (LayerNorm)
Normalization β	zeros	N/A (RMSNorm has no β)	zeros
Output projections	N(0, 0.02/√2N)	scaled by 1/√2N	N(0, 0.02)
Biases	zeros	none (removed)	zeros

The pattern across all three: (1) most weights use a small normal distribution, (2) normalization layers start as identity (γ=1, β=0), (3) output projections are scaled down for deep models.

Hand Calculation: Llama's Init

Llama 2 7B has d_model = 4096 and 32 layers. Let's compute every init value:

Embeddings:

σ = 1/√4096 = 1/64 = 0.015625

Linear layers (Q, K, V, gate, up projections):

He normal: σ = √(2/fan_in). For a 4096 → 4096 layer:

σ = √(2/4096) = √(0.000488) = 0.0221

Output projections (attn_out, ffn_down):

σ_base × 1/√(2 × 32) = 0.0221 / √64 = 0.0221 / 8 = 0.00276

RMSNorm weights:

γ = [1, 1, 1, ..., 1] (all ones, 4096 values)

That's the entire init. No biases (Llama removes them). No β (RMSNorm doesn't have one). Clean and minimal.

The Symmetry Breaking Problem

Misconception: "Any nonzero constant init works." Don't initialize everything to the same constant (even a nonzero one). If all weights are identical, all neurons compute the same function, receive the same gradient, and update identically forever. The network effectively has one neuron per layer, regardless of width. This is called the symmetry breaking problem — random initialization is essential for giving neurons different roles. Even tiny random perturbations (adding N(0, 10^-6) to a constant) break the symmetry, but it's much better to use proper Xavier/He init from the start.

Let's verify this with a quick hand calculation. Take a 2-layer network with 2 neurons each, all weights initialized to 0.5, bias 0:

Forward pass with input [1, 1]:

Neuron 1: 0.5(1) + 0.5(1) = 1.0
Neuron 2: 0.5(1) + 0.5(1) = 1.0

Both neurons produce identical output: 1.0.

Backward pass:

Both neurons receive identical gradients (because their outputs were identical and they have identical weights to the next layer)
Both neurons update identically: w_new = 0.5 - η × (same gradient)

After the update, the weights are still identical. Repeat forever — the two neurons are permanently locked in sync. You paid for two neurons but only got one.

Simulation: Recipe Builder

Select a component and an architecture to see its initialization value. Toggle between GPT-2, Llama, and BERT recipes.

LLM Init Recipe Builder

Select a model architecture and see the initialization recipe for each component.

Implementation: Complete LLM Init

python
import torch
import torch.nn as nn
import math

def init_llama(model, d_model, n_layers):
    """Initialize a Llama-style model."""
    for name, param in model.named_parameters():
        if param.dim() < 2:
            continue  # skip 1D params (norms)

        if 'embed' in name:
            # Embeddings: N(0, 1/sqrt(d_model))
            nn.init.normal_(param, std=1.0 / math.sqrt(d_model))

        elif 'out_proj' in name or 'down_proj' in name:
            # Residual projections: He normal / sqrt(2*n_layers)
            fan_in = param.shape[1]
            std = math.sqrt(2.0 / fan_in) / math.sqrt(2 * n_layers)
            nn.init.normal_(param, std=std)

        else:
            # All other linear layers: He normal
            nn.init.kaiming_normal_(param, mode='fan_in')

    # RMSNorm weights: all ones (PyTorch default)
    for name, param in model.named_parameters():
        if 'norm' in name and param.dim() == 1:
            nn.init.ones_(param)

# Llama 2 7B: d_model=4096, n_layers=32
# Embedding std   = 1/sqrt(4096) = 0.0156
# Linear std      = sqrt(2/4096) = 0.0221
# Residual std    = 0.0221 / sqrt(64) = 0.00276
# RMSNorm weights = [1, 1, 1, ..., 1]
# Biases          = none (Llama has no biases)

Why do some architectures initialize the last layer of each residual block to zero?

To reduce the total number of parameters So the residual block computes f(x)=0, making y=x+0=x — the network starts as identity and gradually learns useful modifications To prevent the weights from being updated during the first epoch Because zero-initialized layers converge faster than randomly initialized ones

Chapter 8: The Initialization Arena

Let's race them all.

You've learned five initialization strategies: Zero, Random, Xavier, He/Kaiming, and Orthogonal. Each has strengths and weaknesses — but reading about failure modes is one thing. Watching them happen in real time is another.

This simulation trains five identical networks on the same task, differing only in initialization. Drag the sliders to create the conditions where each method fails. You'll discover that He + ReLU isn't always king — context matters.

How to use the Arena. Hit Play to start training. Adjust depth, fan-in width, learning rate, and activation function with the controls. Watch the loss curves diverge. Try these experiments: (1) Set depth=32 with ReLU and compare Xavier vs He. (2) Set activation=tanh and watch He overshoot. (3) Set depth=4 with large fan_in and see that all methods converge. (4) Set depth=64 and watch everything except Orthogonal collapse.

Initialization Racing Arena

Five methods train simultaneously. Each has characteristic dynamics. Find each method's failure mode.

Zero Random Xavier He Orthogonal

Depth 16

Fan-in 256

Log LR 0.010

Activation

Speed 3

What to Discover

Experiment 1: Depth = 32, ReLU. Xavier's loss flatlines or oscillates wildly — its variance halves every layer through ReLU (the ½ factor). He stays stable because it compensates with 2/fan_in. This is the textbook case.

Experiment 2: Tanh activation. Switch to tanh. Now Xavier is the winner — it was designed for symmetric activations. He overshoots because its larger variance pushes activations into tanh's saturation regions.

Experiment 3: Depth = 64. At extreme depth, even He struggles. Orthogonal initialization keeps gradients flowing because it preserves norms exactly (no expansion, no shrinkage). This is why Orthogonal is popular in RNNs and very deep networks.

Experiment 4: Shallow (depth = 4), wide (fan_in = 1024). All methods converge to similar performance. Initialization matters less for shallow networks — the problem is self-correcting over a few layers. This is why you rarely hear about initialization for 3-layer MLPs.

The Arena reveals a crucial insight: there is NO universally best initialization. Zero always fails (symmetry). Random Gaussian is a dice roll. Xavier is optimal for linear/tanh but dies with ReLU at depth. He fixes ReLU but overshoots for tanh. Orthogonal is the most robust at extreme depth but gains nothing for shallow networks. Know the failure modes, choose accordingly.

Failure Mode Summary

Method	Fails when...	Symptom	Fix
Zero	Always	All neurons identical forever	Use any non-zero init
Random N(0,0.01)	Depth > 8	Activations vanish exponentially	Scale with fan_in
Xavier	ReLU + depth > 16	Variance halves per layer	Switch to He/Kaiming
He/Kaiming	Tanh/sigmoid activations	Saturation, slow convergence	Switch to Xavier
Orthogonal	Never truly "fails"	Same as Xavier/He in shallow nets	Use Xavier/He for simplicity

Chapter 9: Cheat Sheet & Connections

You now understand the complete initialization toolkit — from the symmetry-breaking problem to the recipes used in every modern LLM. This chapter is your practical reference. No new concepts. Just the formulas, the decision guide, and the connections to where you go next.

Every Formula at a Glance

Method	Distribution	Variance	Best for
Xavier Normal	N(0, σ²)	σ² = 2 / (fan_in + fan_out)	Tanh, sigmoid, linear
Xavier Uniform	U(-a, a)	a = √(6 / (fan_in + fan_out))	Same as Xavier Normal
He Normal	N(0, σ²)	σ² = 2 / fan_in	ReLU, Leaky ReLU
He Uniform	U(-a, a)	a = √(6 / fan_in)	Same as He Normal
Orthogonal	QR decomposition × gain	Preserves norms exactly	Very deep nets, RNNs
Truncated Normal	N(0, σ²), \|x\| < 2σ	Varies	Transformers (avoids outliers)

The Decision Flowchart

Follow the path that matches your situation:

What's your activation function?

The first branch point

↓

ReLU / Leaky ReLU / PReLU

Use He/Kaiming init. Normal or uniform, fan_in mode. This is the default.

↓

Tanh / Sigmoid / Linear

Use Xavier/Glorot init. Preserves variance for symmetric activations.

↓

GELU / SiLU / Swish

Use He init with a correction factor ~1.7 instead of 2.0 (empirical). Most frameworks use standard He.

↓

Transformer / LLM?

Use scaled init: output projections × 1/√(2N), residual scaling, small embeddings. See Chapter 7.

↓

Very deep (> 48 layers) or RNN?

Use Orthogonal init. Preserves gradient norms through many steps.

Symbol Glossary

Symbol	Meaning	Typical values
fan_in	Number of input connections to a neuron	64–12288
fan_out	Number of output connections from a neuron	64–12288
σ²	Variance of the weight distribution	~0.001–0.1
gain	Scaling factor for the activation function	1.0 (linear), √2 (ReLU)
d_model	Hidden dimension in transformers	256–8192
N	Number of transformer layers	6–128

PyTorch One-Liner Reference

python
import torch.nn as nn

# Xavier for tanh/sigmoid
nn.init.xavier_normal_(layer.weight)
nn.init.xavier_uniform_(layer.weight)

# He/Kaiming for ReLU
nn.init.kaiming_normal_(layer.weight, mode='fan_in', nonlinearity='relu')
nn.init.kaiming_uniform_(layer.weight, mode='fan_in', nonlinearity='relu')

# Orthogonal for deep nets / RNNs
nn.init.orthogonal_(layer.weight, gain=1.0)

# Truncated normal for transformer embeddings
nn.init.trunc_normal_(layer.weight, std=0.02)

# LLM residual scaling (GPT-2 style)
nn.init.normal_(layer.weight, std=0.02 / math.sqrt(2 * n_layers))

Chapter Recap

Ch 0: Why

Bad init kills training. Variance explodes or vanishes.

↓

Ch 1: Variance

Each layer multiplies variance by fan_in × Var(w). Must stay ~1.

↓

Ch 2: Xavier

Var(w) = 2/(fan_in + fan_out). Preserves variance for linear activations.

↓

Ch 3: He/Kaiming

Var(w) = 2/fan_in. Corrects for ReLU's factor-of-2 kill.

↓

Ch 4: Orthogonal

QR decomposition preserves norms perfectly. Best for extreme depth.

↓

Ch 5: Transformers

Residual scaling, small embeddings, 1/√(2N) output projections.

↓

Ch 6: Explorer

Interactive lab: any init × any activation × any depth.

↓

Ch 7: Recipes

Real init configs from GPT-2, LLaMA, BERT, ViT.

↓

Ch 8: Arena

Race them all. Know the failure modes.

Where to Go Next

If you want to learn...	Go to
How normalization keeps activations stable during training	Normalization
How optimizers adapt learning rates per-parameter	Optimizers
How loss functions shape the gradient landscape	Loss Functions
How attention mechanisms use scaled init	Transformer
How residual connections interact with init	Universal Architecture

"What I cannot create, I do not understand." — Richard Feynman. You now understand initialization from first principles. You can derive Xavier's formula from variance propagation. You know why He adds a factor of 2. You know when Orthogonal wins. You've raced them in the Arena. Go build something.

You're building a 96-layer Transformer with SwiGLU activations for an LLM. Which initialization strategy should you use for the MLP weight matrices?

Xavier uniform — it's the standard for transformers He normal with fan_out mode — SwiGLU is ReLU-like He normal (fan_in) with residual scaling 1/√(2N) on output projections Orthogonal with gain=√2 — best for deep networks