Stanford CS 231n · Lecture 3 · Regularization and Optimization

Training Neural Nets: Regularization & Optimization

You have a loss function. You have weights. Now how do you actually find the best weights — and prevent them from memorizing your training data?

L1 & L2 Regularization · Softmax & Cross-Entropy · SGD, Momentum, Adam · Learning Rate Schedules
Roadmap

What You'll Master

Chapter 01

The Overfitting Problem

You've trained a classifier. On your training set, it nails 99% of the images. You proudly test it on new images — and it drops to 65%. What happened?

Your model didn't learn the underlying pattern. It memorized the training data — every quirk, every noise artifact, every coincidence. This is called overfitting: the model fits the training data too well, and that tight fit doesn't generalize to data it hasn't seen.

A Visual Intuition

Imagine fitting a curve to a handful of data points. A straight line (simple model) might miss some points, but it captures the general trend. A high-degree polynomial (complex model) passes through every point perfectly — but oscillates wildly between them, giving terrible predictions for new inputs.

The Core Tension

Training is a tug-of-war between two goals: (1) fit the training data well, and (2) don't fit it so well that you're fitting noise instead of signal. Every technique in this lecture — regularization, careful optimization, learning rate schedules — is about managing this tension.

Bias vs. Variance

This tension has a formal name: the bias-variance tradeoff.

Bias is systematic error — the model is too simple to capture the true relationship. A linear classifier trying to learn a circular decision boundary has high bias. It underfits.

Variance is sensitivity to the specific training data. A degree-100 polynomial changes dramatically if you add or remove one data point. It has high variance. It overfits.

Definition
Generalization

A model generalizes when it performs well on data it has never seen during training. The gap between training performance and test performance is the generalization gap. Our goal is to minimize this gap while keeping training performance high.

| Regime | Training Loss | Test Loss | Problem |
| --- | --- | --- | --- |
| Underfitting | High | High | Model too simple (high bias) |
| Good fit | Low | Low | None (the sweet spot) |
| Overfitting | Very low | High | Model memorizes (high variance) |

Worked Example — Spotting Overfitting

You train a linear classifier on CIFAR-10: training accuracy 40%, test accuracy 38%. That's underfitting — both are bad. You switch to a deep neural network: training accuracy 99%, test accuracy 72%. The 27-point gap screams overfitting. You need regularization.

It Gets Worse in High Dimensions

A linear classifier on 32×32×3 images has 3,072 input dimensions. A neural network with millions of parameters has enormous capacity to memorize. Without explicit constraints, deep networks will overfit almost every dataset given enough training time.

Chapter 02

Regularization

Regularization is the antidote to overfitting. The idea is beautifully simple: add a penalty to the loss function that punishes complex models. The total loss becomes:

Regularized Loss: $L = L_{\text{data}} + \lambda R(W)$
$L_{\text{data}}$ = how well we fit the training data. $R(W)$ = how "complex" the weights are. $\lambda$ = how much we care about simplicity.

The hyperparameter λ (lambda) controls the tradeoff. Too small: no effect, we still overfit. Too large: we over-penalize and underfit. Just right: the model learns the true pattern and ignores noise.

L2 Regularization (Weight Decay)

The most common form. L2 penalizes the sum of squared weights:

L2 Regularization: $R(W) = \sum_i \sum_j W_{ij}^2$

What does this do? It pushes all weights toward zero — but not to zero. More precisely, it prefers weight vectors where the values are spread out rather than concentrated in a few large entries.

Worked Example — L2 Prefers Spread Weights

Consider input x = [1, 1, 1, 1]. Two weight vectors produce the same score:

$w_1 = [1, 0, 0, 0]$ → dot product = 1

$w_2 = [0.25, 0.25, 0.25, 0.25]$ → dot product = 1

L2 penalty: $\|w_1\|_2^2 = 1.0$, $\|w_2\|_2^2 = 0.25$. L2 prefers $w_2$ by a factor of 4. Why? Because $w_2$ uses all features equally rather than relying on a single feature. This makes the model more robust: if feature 1 is noisy, $w_1$ breaks, but $w_2$ barely notices.
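The arithmetic in this worked example can be checked with a few lines of Python. This is a minimal sketch; the helper names `dot` and `l2_penalty` are illustrative, not part of any library.

```python
def dot(w, x):
    """Score of a linear classifier for one class: sum of w_i * x_i."""
    return sum(wi * xi for wi, xi in zip(w, x))

def l2_penalty(w):
    """L2 regularizer R(w): sum of squared weights."""
    return sum(wi * wi for wi in w)

x = [1.0, 1.0, 1.0, 1.0]
w1 = [1.0, 0.0, 0.0, 0.0]
w2 = [0.25, 0.25, 0.25, 0.25]

for name, w in (("w1", w1), ("w2", w2)):
    print(name, "score:", dot(w, x), "L2 penalty:", l2_penalty(w))
```

Both vectors give the same score, so the data loss cannot tell them apart; only the penalty term separates them.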

The Occam's Razor Analogy

William of Ockham (c. 1287–1347) is credited with the principle often paraphrased as "Among competing hypotheses, the simplest is best." L2 regularization is Occam's Razor in equation form. By penalizing large weights, you're saying: "I'd rather have a simple model that's slightly wrong on the training data than a complex model that's perfect on training but fails on test."

L1 Regularization (Sparsity)

L1 penalizes the sum of absolute values:

L1 Regularization: $R(W) = \sum_i \sum_j |W_{ij}|$

L1 has a dramatically different effect: it drives many weights exactly to zero. The resulting weight matrix is sparse — most entries are zero, and only a few features are used.

Definition
Sparsity

A weight vector is sparse when most of its entries are exactly zero. L1 regularization encourages sparsity because its gradient has constant magnitude regardless of the weight's size — it pushes small weights all the way to zero rather than just shrinking them.

| Property | L2 (Weight Decay) | L1 (Sparsity) |
| --- | --- | --- |
| Penalty | $\sum W^2$ | $\sum \lvert W \rvert$ |
| Gradient contribution | $2W$ (proportional to weight) | $\operatorname{sign}(W)$ (constant magnitude) |
| Effect on weights | Shrinks toward zero, never reaches it | Drives many to exactly zero |
| Preference | Spread-out, small weights | Sparse, few nonzero weights |
| Use case | Default for neural networks | Feature selection |
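The contrast between the two penalties can be seen on a single weight with the data loss ignored. The sketch below is illustrative only: the L2 step is a plain gradient step, while the L1 step clips at zero (soft-thresholding), which is an assumption of this example because a raw sign-gradient step would overshoot and bounce around zero. The values of `lr` and `lam` are arbitrary.

```python
def l2_step(w, lr, lam):
    # Gradient of lam * w^2 is 2 * lam * w: the pull shrinks as w shrinks.
    return w - lr * 2.0 * lam * w

def l1_step(w, lr, lam):
    # Gradient of lam * |w| is lam * sign(w): a constant-size pull toward zero.
    # Clipping at zero (soft-thresholding) is an assumption of this sketch.
    shrink = lr * lam
    if w > shrink:
        return w - shrink
    if w < -shrink:
        return w + shrink
    return 0.0

w_l2 = w_l1 = 0.5
for _ in range(100):
    w_l2 = l2_step(w_l2, lr=0.1, lam=0.1)
    w_l1 = l1_step(w_l1, lr=0.1, lam=0.1)

print("after 100 steps  L2:", w_l2, " L1:", w_l1)
```

The L2 weight decays geometrically and stays positive, while the L1 weight lands on exactly 0.0 and stays there.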

Elastic Net

Why choose? Elastic net combines both: $R(W) = \sum_i \sum_j \left( \beta_1 |W_{ij}| + \beta_2 W_{ij}^2 \right)$. You get sparsity from L1 plus the smoothness of L2.

Beyond Weight Penalties

L1 and L2 are just the simplest forms of regularization. Modern deep learning uses many other tricks: dropout (randomly zero out neurons during training), batch normalization (normalize layer activations), data augmentation (artificially expand the training set), and early stopping (stop training before overfitting kicks in). All serve the same purpose: prevent memorization, encourage generalization.

Chapter 03

Softmax & Cross-Entropy

Last lecture introduced the SVM loss (hinge loss). Now we'll meet the other major loss for classification: cross-entropy loss, built on the softmax function. It's the standard loss for modern neural networks.

From Scores to Probabilities

Your linear classifier outputs a vector of raw scores — one per class. For a 3-class problem, you might get scores [3.2, 1.3, 2.2]. But these are just numbers. How confident is the model? Is 3.2 way better than 2.2, or barely?

The softmax function converts raw scores into a probability distribution:

Softmax: $P(Y = k \mid x) = \dfrac{e^{s_k}}{\sum_j e^{s_j}}$
where $s_k$ is the score for class k. The exponentials make everything positive. The division makes them sum to 1.
Worked Example — Softmax in Action

Scores: [3.2, 1.3, 2.2]

Exponentials: $[e^{3.2}, e^{1.3}, e^{2.2}] \approx [24.53, 3.67, 9.03]$

Sum: 24.53 + 3.67 + 9.03 = 37.23

Probabilities: [24.53/37.23, 3.67/37.23, 9.03/37.23] = [0.659, 0.099, 0.243]

The model is 65.9% confident in class 0. These probabilities are meaningful — you can compare them, threshold them, and use them downstream.
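The same computation in Python, as a minimal sketch (the function name `softmax` is illustrative, and this direct version is only safe for moderate score values). Working with unrounded exponentials, the last probability comes out near 0.242; the 0.243 above results from dividing the rounded intermediate values.

```python
import math

def softmax(scores):
    """Direct softmax: exponentiate each score, then normalize."""
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([3.2, 1.3, 2.2])
print([round(p, 3) for p in probs])
```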

The Cross-Entropy Loss

Now that we have probabilities, how do we penalize wrong predictions? Take the negative log of the probability assigned to the correct class:

Cross-Entropy Loss: $L_i = -\log P(Y = y_i \mid x_i) = -\log\left( \dfrac{e^{s_{y_i}}}{\sum_j e^{s_j}} \right)$
$y_i$ is the correct class label. We want P(correct class) to be close to 1, so $-\log P$ is close to 0.

Why negative log? When P(correct) = 1.0, the loss is −log(1) = 0 (perfect). When P(correct) = 0.01, the loss is −log(0.01) = 4.6 (terrible). When P(correct) → 0, the loss → ∞. The negative log penalizes confidently wrong predictions very harshly.
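A short sketch of how the loss grows as the probability of the correct class falls. The function name `cross_entropy_loss` is illustrative, and the natural logarithm is assumed, which matches the 4.6 figure above.

```python
import math

def cross_entropy_loss(p_correct):
    """Negative natural log of the probability assigned to the correct class."""
    return -math.log(p_correct)

for p in (1.0, 0.659, 0.1, 0.01, 1e-6):
    print(f"P(correct) = {p:<8g} loss = {cross_entropy_loss(p):.3f}")
```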

Cross-Entropy vs. SVM Loss

SVM (hinge) loss says: "I only care that the correct class score beats the others by a margin of 1. Beyond that, I'm happy." Cross-entropy says: "I always want more probability on the correct class. Even if you're at 99%, I'll keep pushing toward 100%." Cross-entropy never stops caring — it always produces gradients, which makes it better for training deep networks.

Numerical Stability

There's a practical trap. If scores are large (say s = [1000, 999, 998]), then $e^{1000}$ overflows to infinity in floating point. The fix is simple and elegant: subtract the maximum score from all scores before exponentiating.

Stable Softmax: $s'_k = s_k - \max_j s_j$
$P(Y = k) = \dfrac{e^{s'_k}}{\sum_j e^{s'_j}}$
This is mathematically identical (the max cancels in the ratio) but numerically safe because the largest exponential is $e^0 = 1$.
Why Subtracting the Max Works

$\dfrac{e^{s_k - C}}{\sum_j e^{s_j - C}} = \dfrac{e^{s_k} e^{-C}}{\sum_j e^{s_j} e^{-C}} = \dfrac{e^{s_k}}{\sum_j e^{s_j}}$. The $e^{-C}$ factors cancel. Setting $C = \max_j s_j$ ensures all exponents are ≤ 0, so every exponential lies in (0, 1]. Overflow is impossible.

Always Use Stable Softmax

Never implement softmax without the max-subtraction trick. In deep networks, score magnitudes can be unpredictable. One NaN from an overflow will corrupt your entire training run. PyTorch's F.cross_entropy and TensorFlow's tf.nn.softmax_cross_entropy_with_logits handle this automatically — they take raw scores (logits) and apply the stable version internally.
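For illustration, here is a minimal pure-Python sketch of the max-subtraction trick, along with a cross-entropy computed directly from raw scores. The function names are illustrative and are not the PyTorch or TensorFlow implementations. In plain Python, `math.exp(1000.0)` raises `OverflowError` rather than returning infinity, so the unshifted version would fail outright on these scores.

```python
import math

def stable_softmax(scores):
    """Softmax with the maximum score subtracted before exponentiating."""
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]  # every exponent is <= 0
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy_from_scores(scores, correct_class):
    """-log softmax(scores)[correct_class], computed without forming e^s directly."""
    shift = max(scores)
    log_sum = math.log(sum(math.exp(s - shift) for s in scores))
    return log_sum - (scores[correct_class] - shift)

big_scores = [1000.0, 999.0, 998.0]
probs = stable_softmax(big_scores)
loss = cross_entropy_from_scores(big_scores, correct_class=0)
print(probs, loss)
```

Because only score differences matter, the result matches the softmax of [2, 1, 0].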

Chapter 04

Gradient Descent

You have a loss function L(W). You want to find the weights W that minimize it. How do you navigate the loss landscape — a surface in million-dimensional space — toward the lowest point?

Imagine standing on a foggy hillside. You can't see the valley floor. But you can feel the slope beneath your feet. The steepest downhill direction is the negative gradient. Take a step in that direction. Repeat.

Gradient Descent Update: $W \leftarrow W - \alpha \nabla_W L(W)$
$\alpha$ = learning rate (step size). $\nabla_W L$ = gradient of the loss with respect to the weights.

That's it. The entire optimization algorithm fits in one line. Everything else in this lecture is about making this one step work better.
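As a concrete sketch, the update rule can be run on a small two-parameter quadratic. The loss function, starting point, and learning rates below are made up for illustration; the second call shows a learning rate that is too large for the steep direction.

```python
def loss(w):
    # A toy bowl with its minimum at (3, -1); steeper along the second axis.
    return (w[0] - 3.0) ** 2 + 10.0 * (w[1] + 1.0) ** 2

def grad(w):
    return [2.0 * (w[0] - 3.0), 20.0 * (w[1] + 1.0)]

def gradient_descent(w, lr, steps):
    for _ in range(steps):
        g = grad(w)
        w = [wi - lr * gi for wi, gi in zip(w, g)]  # W <- W - lr * grad
    return w

w_final = gradient_descent([0.0, 0.0], lr=0.05, steps=200)
w_diverged = gradient_descent([0.0, 0.0], lr=0.11, steps=50)
print("lr=0.05:", w_final, "loss", loss(w_final))
print("lr=0.11:", w_diverged, "loss", loss(w_diverged))
```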

Definition
Learning Rate (α)

The learning rate controls how big each step is. Too large: you overshoot the minimum and the loss explodes. Too small: training takes forever and you get stuck in bad local optima. Choosing the right learning rate is one of the most important decisions in training neural networks.

Loss Landscape & Gradient Descent

A 2D contour plot of a loss function. Watch gradient descent trace a path from the starting point toward the minimum. Adjust the learning rate to see how it affects convergence.

The Loss Landscape

For a linear classifier with, say, 3072 input features and 10 classes, the weight matrix has 30,720 entries. The loss is a function of all 30,720 numbers — a surface in 30,720-dimensional space. We can't visualize that, but we can build intuition from 2D slices.

In 2D, the loss landscape looks like a topographic map. Contour lines connect points of equal loss. The gradient at any point is perpendicular to the contour line, pointing uphill. We walk in the opposite direction — downhill.

Problems in the Landscape

Local minima: A valley that isn't the deepest valley. Gradient descent can get trapped — the gradient is zero, so it stops. In practice, local minima are less of a problem than you'd think in high dimensions, because most "flat" spots are saddle points, not true minima.

Saddle points: Points where the surface curves up in some directions and down in others — like a mountain pass. The gradient is zero, but it's not a minimum. In high-dimensional spaces, saddle points vastly outnumber local minima (Dauphin et al., 2014). This is actually a bigger concern than local minima.

Ill-conditioning: When the loss changes quickly in one direction and slowly in another, gradient descent oscillates along the steep direction while barely making progress along the flat one. Think of a narrow, elongated valley — the gradient zigzags side to side instead of heading straight for the bottom.

Analytic vs. Numeric Gradients

You could compute gradients by bumping each weight by a tiny amount h and measuring how the loss changes: $\partial L / \partial W_i \approx (L(W + h e_i) - L(W)) / h$, where $e_i$ is the unit vector for weight i. This numerical gradient is slow (one forward pass per weight!) but easy to implement. The analytic gradient uses calculus (backpropagation) to compute all gradients in one backward pass. In practice: always use analytic gradients, but verify them against numerical gradients as a debugging check.
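A gradient check along these lines might look like the following sketch. The test function `f`, its hand-derived gradient, and the step size `h = 1e-5` are assumptions chosen for illustration.

```python
def numerical_gradient(f, w, h=1e-5):
    """Forward-difference estimate: one extra evaluation of f per weight."""
    base = f(w)
    grad = []
    for i in range(len(w)):
        bumped = list(w)
        bumped[i] += h
        grad.append((f(bumped) - base) / h)
    return grad

def f(w):
    return w[0] ** 2 + 3.0 * w[0] * w[1]

def analytic_gradient(w):
    # Derived by hand: df/dw0 = 2*w0 + 3*w1, df/dw1 = 3*w0.
    return [2.0 * w[0] + 3.0 * w[1], 3.0 * w[0]]

w = [1.0, 2.0]
numeric = numerical_gradient(f, w)
analytic = analytic_gradient(w)
max_error = max(abs(n - a) for n, a in zip(numeric, analytic))
print(numeric, analytic, max_error)
```

A small but nonzero `max_error` is expected, since the forward difference has an error proportional to h.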

Chapter 05

SGD & Mini-Batch

There's a problem with vanilla gradient descent. The loss is a sum over all N training examples:

Full-Batch Loss: $L(W) = \dfrac{1}{N} \sum_{i=1}^{N} L_i(x_i, y_i, W) + \lambda R(W)$

With N = 1,000,000 images, computing the gradient requires a forward and backward pass through every single example before you can take one step. That's absurdly expensive.

The Stochastic Shortcut

Stochastic Gradient Descent (SGD) approximates the full gradient using a small random subset — a mini-batch — of the training data:

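Mini-Batch SGD: $\nabla_W L \approx \dfrac{1}{B} \sum_{i \in \text{batch}} \nabla_W L_i$
B = mini-batch size (typically 32, 64, or 128). Each step uses a fresh random batch. The regularization gradient $\lambda \nabla_W R(W)$ does not depend on the data, so it is added unchanged.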

Instead of one expensive step using all 1,000,000 examples, you take many cheap steps using 64 examples each. Each individual step is noisy — the mini-batch gradient only approximates the true gradient — but the noise averages out over many steps, and you make progress much faster in wall-clock time.
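A minimal sketch of mini-batch SGD on a toy problem: fitting y = 2x + 1 with a squared-error loss. The dataset, learning rate, step count, and seed are all made up for illustration, and there is no regularization term.

```python
import random

def sgd_linear_fit(xs, ys, batch_size=32, lr=0.1, steps=2000, seed=0):
    """Fit y ~ w*x + b with mini-batch SGD on the mean squared error."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        batch = rng.sample(range(n), batch_size)  # fresh random batch each step
        grad_w = grad_b = 0.0
        for i in batch:
            err = w * xs[i] + b - ys[i]
            grad_w += 2.0 * err * xs[i]
            grad_b += 2.0 * err
        # Average over the batch, then take one gradient step.
        w -= lr * grad_w / batch_size
        b -= lr * grad_b / batch_size
    return w, b

xs = [i / 1000.0 for i in range(1000)]
ys = [2.0 * x + 1.0 for x in xs]
w, b = sgd_linear_fit(xs, ys)
print(w, b)
```

Each step touches only 32 of the 1,000 examples, yet the estimates settle close to the true slope and intercept.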

Noise Is a Feature, Not a Bug

The stochasticity in SGD isn't just a compromise for speed — it actually helps. The random fluctuations can kick the optimizer out of sharp, narrow minima (which generalize poorly) and into broad, flat minima (which generalize well). Adding noise to optimization is one of the few cases where imprecision makes things better.

Batch Size Tradeoffs

| Batch Size | Gradient Quality | Speed | Generalization |
| --- | --- | --- | --- |
| B = 1 (pure SGD) | Very noisy | Slow (poor GPU utilization) | Good (lots of noise) |
| B = 32–128 | Good approximation | Fast (GPU parallelism) | Good sweet spot |
| B = N (full batch) | Exact gradient | Slow per step | Can overfit to sharp minima |

Worked Example — How Much Faster?

Dataset: 50,000 images (CIFAR-10). Full-batch GD: one gradient step requires forward and backward passes over all 50,000 examples. SGD with B = 64: one step requires 64. In the compute budget of one full-batch step, SGD takes 50,000 / 64 ≈ 781 steps. Each step is noisier, but in practice 781 noisy steps make far more progress than one precise step.

Three Problems with Plain SGD

(1) Ill-conditioning: When the loss surface is an elongated valley, SGD oscillates across the narrow dimension and crawls along the long one. (2) Saddle points: The gradient is near-zero, so SGD stalls. (3) Noisy gradients: Mini-batch randomness causes the path to jitter. All three problems motivate the next chapter: momentum.

Chapter 06

Momentum & Nesterov

Think of optimization as rolling a ball down a hilly landscape. Plain SGD is like placing a ball on the surface and teleporting it in the steepest downhill direction at each step. There's no inertia — the ball has no memory of which direction it was going.

Momentum gives the ball mass. Instead of teleporting, the ball accelerates. If the gradient keeps pointing in the same direction, the ball speeds up. If the gradient oscillates, the ball's inertia smooths out the zigzag.

SGD + Momentum: $v \leftarrow \rho v - \alpha \nabla_W L(W)$
$W \leftarrow W + v$
v = velocity (accumulated gradient direction). $\rho$ = momentum coefficient (friction), typically 0.9 or 0.99.

The velocity v is an exponentially decaying sum of recent gradient steps. The parameter ρ (rho) acts like friction: it determines how much of the previous velocity is retained. With ρ = 0.9, a gradient's contribution falls by a factor of 0.9 each step, so the ball effectively "remembers" about the last 1/(1 − ρ) = 10 gradients.
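The two update rules can be compared on an elongated bowl. This is a sketch under simplifying assumptions: the loss is a made-up quadratic, the gradients are exact (no mini-batch noise), and the learning rate and ρ are illustrative.

```python
def loss(w):
    # Elongated bowl: 100x steeper along w[1] than along w[0].
    return 0.5 * (w[0] ** 2 + 100.0 * w[1] ** 2)

def grad(w):
    return [w[0], 100.0 * w[1]]

def run_plain(w, lr, steps):
    for _ in range(steps):
        w = [wi - lr * gi for wi, gi in zip(w, grad(w))]
    return w

def run_momentum(w, lr, rho, steps):
    v = [0.0 for _ in w]
    for _ in range(steps):
        g = grad(w)
        v = [rho * vi - lr * gi for vi, gi in zip(v, g)]  # v <- rho*v - lr*grad
        w = [wi + vi for wi, vi in zip(w, v)]              # W <- W + v
    return w

start = [1.0, 1.0]
w_plain = run_plain(start, lr=0.015, steps=100)
w_mom = run_momentum(start, lr=0.015, rho=0.9, steps=100)
print("plain   :", w_plain, loss(w_plain))
print("momentum:", w_mom, loss(w_mom))
```

With the same learning rate and step budget, the plain update is still far from the minimum along the flat direction, while the momentum run has built up speed there.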

SGD vs. Momentum

Side-by-side optimization on a narrow, elongated valley. SGD oscillates wildly; momentum smooths the path. Click Run to watch them race.

Why Momentum Solves All Three SGD Problems

(1) Ill-conditioning: Oscillations across the narrow dimension cancel out in the velocity; consistent motion along the long dimension accumulates. (2) Saddle points: Even if the gradient is near zero, the accumulated velocity carries the ball through. (3) Noise: Random fluctuations average out in the running mean, producing smoother updates.

Nesterov Momentum

Standard momentum computes the gradient at the current position, then combines it with velocity. Nesterov momentum does something smarter: first "look ahead" to where the velocity would take you, then compute the gradient there.

Nesterov Momentum: $v \leftarrow \rho v - \alpha \nabla_W L(W + \rho v)$
$W \leftarrow W + v$
The gradient is evaluated at $W + \rho v$ (the "lookahead" position), not at W.

The intuition: if the velocity is about to carry you past the minimum, the lookahead gradient "sees" the uphill slope and applies a corrective force before you get there. It's like a driver who looks at the road ahead, not just under the hood.

Nesterov: The Change of Variables Trick

In practice, we can't easily evaluate the gradient at $W + \rho v$ if our model expects inputs at W. The standard trick: define $\tilde{W} = W + \rho v$. Then the update in terms of $\tilde{W}$ becomes $v_{\text{new}} = \rho v_{\text{old}} - \alpha \nabla L(\tilde{W})$, then $\tilde{W} \leftarrow \tilde{W} - \rho v_{\text{old}} + (1 + \rho) v_{\text{new}}$. This way we always evaluate gradients at the current "effective" position $\tilde{W}$.
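To check that the two forms describe the same trajectory, the sketch below runs both side by side on a one-dimensional toy quadratic and compares the reparameterized variable against W + ρv. The loss, learning rate, and ρ are illustrative choices.

```python
def grad(w):
    return 4.0 * w  # gradient of the toy loss L(w) = 2 * w^2

def nesterov_lookahead(w, v, lr, rho):
    """Original form: gradient evaluated at the lookahead point w + rho*v."""
    v_new = rho * v - lr * grad(w + rho * v)
    return w + v_new, v_new

def nesterov_reparam(w_tilde, v, lr, rho):
    """Change-of-variables form: gradient evaluated at w_tilde itself."""
    v_new = rho * v - lr * grad(w_tilde)
    return w_tilde - rho * v + (1.0 + rho) * v_new, v_new

lr, rho = 0.05, 0.9
w, v = 1.0, 0.0
w_tilde, v_t = w + rho * v, 0.0
for _ in range(20):
    w, v = nesterov_lookahead(w, v, lr, rho)
    w_tilde, v_t = nesterov_reparam(w_tilde, v_t, lr, rho)

print(w_tilde, w + rho * v)
```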

| Method | Gradient Evaluated At | Behavior |
| --- | --- | --- |
| SGD | Current position W | Teleports downhill; no memory |
| SGD + Momentum | Current position W | Rolls downhill with inertia |
| Nesterov | Lookahead position $W + \rho v$ | Rolls with anticipation; better braking |

Chapter 07

Adaptive Methods

Momentum smooths the update direction, but it still applies one global learning rate to every parameter. What if different parameters need different learning rates? A weight that rarely gets large gradients needs a bigger step size. A weight that gets huge gradients needs a smaller one.

Adaptive methods give each parameter its own effective learning rate, automatically tuned based on the history of gradients for that parameter.

AdaGrad

The simplest adaptive method. Keep a running sum of squared gradients for each parameter, then divide the learning rate by the square root of that sum:

AdaGrad: $h \leftarrow h + (\nabla_W L)^2$ (element-wise square, accumulate)
$W \leftarrow W - \alpha \, \nabla_W L / (\sqrt{h} + \varepsilon)$
$\varepsilon \approx 10^{-7}$ prevents division by zero. Operations are element-wise.

Parameters with large accumulated gradients (steep directions) get smaller effective learning rates. Parameters with small accumulated gradients (flat directions) get larger ones. This equalizes progress across all directions.

AdaGrad's Fatal Flaw

The denominator h only grows — it never shrinks. Over time, the effective learning rate for every parameter decays toward zero. Training eventually stalls. AdaGrad works well for sparse problems (NLP, recommender systems) but poorly for deep learning's long training runs.

RMSProp

RMSProp fixes AdaGrad's decay problem with a simple change: use a leaky (exponentially weighted) running average of squared gradients instead of the cumulative sum:

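RMSProp: $h \leftarrow \beta_2 h + (1 - \beta_2)(\nabla_W L)^2$
$W \leftarrow W - \alpha \, \nabla_W L / (\sqrt{h} + \varepsilon)$
$\beta_2$ is the decay rate (0.999 here, matching Adam; values of 0.9 to 0.99 are also common for RMSProp). Recent gradients matter more than old ones.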

The "leaky" average means old gradient information gradually fades. If a parameter stops getting large gradients, its learning rate recovers. No more death spiral.
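The difference between the cumulative sum and the leaky average shows up clearly when a parameter receives the same gradient at every step. The sketch below tracks the size of each update for a single parameter; the constant gradient of 1.0, the learning rate, and `beta2 = 0.99` are assumptions chosen for illustration.

```python
import math

def adagrad_step_sizes(grads, lr=0.1, eps=1e-7):
    """Update magnitudes when h is a cumulative sum of squared gradients."""
    h, sizes = 0.0, []
    for g in grads:
        h += g * g
        sizes.append(lr * abs(g) / (math.sqrt(h) + eps))
    return sizes

def rmsprop_step_sizes(grads, lr=0.1, beta2=0.99, eps=1e-7):
    """Update magnitudes when h is a leaky (exponentially weighted) average."""
    h, sizes = 0.0, []
    for g in grads:
        h = beta2 * h + (1.0 - beta2) * g * g
        sizes.append(lr * abs(g) / (math.sqrt(h) + eps))
    return sizes

grads = [1.0] * 1000
ada = adagrad_step_sizes(grads)
rms = rmsprop_step_sizes(grads)
print("AdaGrad first/last:", ada[0], ada[-1])
print("RMSProp first/last:", rms[0], rms[-1])
```

AdaGrad's step shrinks like 1/√t, while RMSProp's step settles near the base learning rate. RMSProp's first steps are oversized here because h starts at zero and no bias correction is applied.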

Adam: The King of Optimizers

Adam (Adaptive Moment Estimation) combines the best of both worlds: momentum's velocity accumulation and RMSProp's per-parameter scaling. It maintains two running averages:

First moment (m): an exponentially weighted average of gradients. This is momentum.

Second moment (v): an exponentially weighted average of squared gradients. This is RMSProp.

Adam Optimizer (Kingma & Ba, 2015)
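  1. Initialize: m = 0, v = 0, t = 0
  2. At each step, increment t (t ← t + 1), then:
    Compute gradient: $g = \nabla_W L(W)$
    Update first moment (momentum): $m \leftarrow \beta_1 m + (1 - \beta_1) g$
    Update second moment (RMSProp): $v \leftarrow \beta_2 v + (1 - \beta_2) g^2$
    Bias correction: $\hat{m} = m / (1 - \beta_1^t)$, $\hat{v} = v / (1 - \beta_2^t)$
    Update: $W \leftarrow W - \alpha \, \hat{m} / (\sqrt{\hat{v}} + \varepsilon)$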
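The steps above can be sketched as a single function operating on plain Python lists. This is an illustrative implementation, not the code of any particular framework; the example weights and gradients are made up.

```python
import math

def adam_step(w, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update. w, g, m, v are equal-length lists; t is the 1-based step."""
    m = [beta1 * mi + (1.0 - beta1) * gi for mi, gi in zip(m, g)]
    v = [beta2 * vi + (1.0 - beta2) * gi * gi for vi, gi in zip(v, g)]
    m_hat = [mi / (1.0 - beta1 ** t) for mi in m]   # bias-corrected first moment
    v_hat = [vi / (1.0 - beta2 ** t) for vi in v]   # bias-corrected second moment
    w = [wi - lr * mh / (math.sqrt(vh) + eps)
         for wi, mh, vh in zip(w, m_hat, v_hat)]
    return w, m, v

w0 = [1.0, -2.0]
g = [0.5, -40.0]  # gradients that differ in scale by 80x
w1, m, v = adam_step(w0, g, m=[0.0, 0.0], v=[0.0, 0.0], t=1)
print(w1)
```

On the first step both weights move by almost exactly the learning rate, even though their gradients differ by a factor of 80. That is the per-parameter scaling at work.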

Why Bias Correction?

Both m and v are initialized to zero. At the first step (t = 1), $m = (1 - \beta_1) g$, which is much smaller than g: with $\beta_1 = 0.9$ you keep 90% of nothing and add only 10% of the gradient. The estimates are biased toward zero in early training.

Derivation — Bias Correction

After t steps: $m_t = (1 - \beta_1) \sum_{i=1}^{t} \beta_1^{t-i} g_i$. Taking expectations, and assuming the gradients share a common mean $E[g]$: $E[m_t] = E[g] \, (1 - \beta_1) \sum_{i=1}^{t} \beta_1^{t-i} = E[g] \, (1 - \beta_1^t)$. So $m_t$ underestimates $E[g]$ by a factor of $(1 - \beta_1^t)$. Dividing by this factor corrects the bias. At t = 1 with $\beta_1 = 0.9$: correction = 1/(1 − 0.9) = 10×. At t = 100: correction ≈ 1.0 (bias is negligible). The same logic applies to $v_t$ with $\beta_2$.
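The size of the correction factor can be tabulated directly. A small sketch (the function name `bias_correction` is illustrative):

```python
def bias_correction(beta, t):
    """Factor 1 / (1 - beta^t) that undoes the zero-initialization bias."""
    return 1.0 / (1.0 - beta ** t)

for t in (1, 10, 100, 1000):
    print(t, round(bias_correction(0.9, t), 4), round(bias_correction(0.999, t), 4))
```

With β = 0.999 the factor is still about 1.58 after 1,000 steps, so the second-moment estimate needs the correction for much longer than the first.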

Why Adam Dominates in Practice

Adam combines three critical ingredients. (1) Momentum smooths noisy gradients and accelerates through flat regions. (2) Adaptive scaling gives each parameter its own learning rate, handling ill-conditioning. (3) Bias correction ensures sensible updates from the very first step. The default hyperparameters ($\beta_1 = 0.9$, $\beta_2 = 0.999$, $\alpha$ = 1e-3 or 5e-4) work well across a huge range of architectures. You rarely need to tune them.

AdamW: Decoupled Weight Decay

There's a subtle interaction between L2 regularization and Adam. Standard Adam adds the L2 gradient (λW) to g, which then gets scaled by the adaptive learning rate. This means weight decay is also adapted per-parameter — not what we want.

AdamW (Loshchilov & Hutter, 2019) fixes this by applying weight decay after the Adam update, directly to the weights:

AdamW Update: $W \leftarrow W - \alpha \, \hat{m} / (\sqrt{\hat{v}} + \varepsilon) - \alpha \lambda W$
The weight decay term $\alpha \lambda W$ is subtracted separately, not through the gradient. This is "decoupled" weight decay.
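The difference is easiest to see for a single scalar weight whose data gradient is zero, so that only the decay acts. The sketch below is illustrative (function names and values are made up), and it uses λW as the L2 gradient, following the text above.

```python
import math

def adam_direction(g, m, v, t, beta1=0.9, beta2=0.999, eps=1e-8):
    """Bias-corrected Adam direction m_hat / (sqrt(v_hat) + eps) for a scalar."""
    m = beta1 * m + (1.0 - beta1) * g
    v = beta2 * v + (1.0 - beta2) * g * g
    m_hat = m / (1.0 - beta1 ** t)
    v_hat = v / (1.0 - beta2 ** t)
    return m_hat / (math.sqrt(v_hat) + eps), m, v

def adam_l2_step(w, data_grad, m, v, t, lr, lam):
    # L2 penalty folded into the gradient, so it is rescaled by 1 / sqrt(v_hat).
    direction, m, v = adam_direction(data_grad + lam * w, m, v, t)
    return w - lr * direction, m, v

def adamw_step(w, data_grad, m, v, t, lr, lam):
    # Decay applied directly to the weight, outside the adaptive scaling.
    direction, m, v = adam_direction(data_grad, m, v, t)
    return w - lr * direction - lr * lam * w, m, v

w, lr, lam = 2.0, 1e-3, 0.01
w_l2, _, _ = adam_l2_step(w, 0.0, 0.0, 0.0, t=1, lr=lr, lam=lam)
w_adamw, _, _ = adamw_step(w, 0.0, 0.0, 0.0, t=1, lr=lr, lam=lam)
print("Adam + L2 moved by:", w - w_l2)
print("AdamW moved by    :", w - w_adamw)
```

With Adam + L2 the weight moves by roughly the full learning rate regardless of λ, because the adaptive scaling normalizes the penalty gradient. With AdamW it moves by αλW, as intended.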

AdamW consistently outperforms standard Adam + L2 regularization, especially for training large models like BERT and GPT. It's the default in modern deep learning.

| Optimizer | First Moment | Second Moment | Bias Correction | Key Feature |
| --- | --- | --- | --- | --- |
| SGD | | | | Simple, needs tuning |
| Momentum | ✓ | | | Smooths oscillations |
| AdaGrad | | ✓ (cumulative) | | Adaptive LR, dies over time |
| RMSProp | | ✓ (leaky) | | Adaptive LR, doesn't die |
| Adam | ✓ | ✓ (leaky) | ✓ | Best of momentum + RMSProp |
| AdamW | ✓ | ✓ (leaky) | ✓ | Adam + decoupled weight decay |

Chapter 08

Learning Rate Schedules

The learning rate α doesn't have to stay constant. In fact, the best strategy is almost always to change it during training. Start with a relatively large learning rate to make fast initial progress, then decay it so the optimizer can settle into a precise minimum.

Step Decay

The simplest schedule: multiply the learning rate by a factor (e.g., 0.1) at specific epochs. For ResNets on ImageNet, the classic recipe is: start at α = 0.1, multiply by 0.1 at epochs 30, 60, and 90.

Step Decay: $\alpha_t = \alpha_0 \cdot 0.1^{\lfloor t/30 \rfloor}$
Sudden drops in learning rate cause sudden drops in loss — the optimizer "settles in" to a more precise region after each step.

Cosine Annealing

A smooth, gradual decay following a cosine curve from the initial learning rate down to near zero:

Cosine Annealing: $\alpha_t = \tfrac{1}{2} \alpha_0 \left(1 + \cos(\pi t / T)\right)$
$\alpha_0$ = initial LR, t = current epoch, T = total epochs. Smoothly decays from $\alpha_0$ to 0.

Cosine annealing has become a common default for training transformers and vision models. It avoids the discontinuities of step decay while still reaching very small learning rates by the end of training.
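Both schedules are short enough to write as plain functions of the epoch. This is a minimal sketch with illustrative names and defaults (initial rate 0.1, a 10× drop every 30 epochs).

```python
import math

def step_decay(t, lr0=0.1, drop=0.1, every=30):
    """Multiply the rate by `drop` once every `every` epochs."""
    return lr0 * drop ** (t // every)

def cosine_annealing(t, total, lr0=0.1):
    """Half-cosine from lr0 at t = 0 down to 0 at t = total."""
    return 0.5 * lr0 * (1.0 + math.cos(math.pi * t / total))

for epoch in (0, 29, 30, 60, 90):
    print(epoch, step_decay(epoch), round(cosine_annealing(epoch, total=100), 5))
```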

Linear Warmup

Starting with a large learning rate can destabilize early training. Linear warmup gradually increases the learning rate from 0 to α0 over the first few thousand steps:

Linear Warmup: $\alpha_t = \alpha_0 \cdot (t / T_{\text{warmup}})$ for $t < T_{\text{warmup}}$
$T_{\text{warmup}}$ is typically 1,000–5,000 steps. After warmup, switch to your chosen decay schedule.

Why does warmup help? In early training, the gradients can be very large and erratic because the model hasn't learned anything yet. A large learning rate amplifies this chaos. Warmup gives the optimizer time to "find its footing" before taking big steps.

The Modern Recipe

A common schedule for transformers in 2024-2025: linear warmup for ~5,000 steps, then cosine decay to zero. This combines the stability of warmup with the smooth convergence of cosine annealing. An empirical rule of thumb: if you increase the batch size by a factor of k, also scale the initial learning rate by k (the linear scaling rule of Goyal et al., 2017).
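A warmup-plus-cosine schedule can be sketched as one function of the step count. The choice to measure cosine progress from the end of warmup is an assumption of this sketch, and the step counts and peak rate in the usage lines are illustrative.

```python
import math

def warmup_cosine(step, total_steps, warmup_steps, peak_lr):
    """Linear ramp from 0 to peak_lr, then cosine decay from peak_lr to 0."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    # Fraction of the post-warmup phase completed, from 0.0 to 1.0.
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))

for step in (0, 2500, 5000, 52500, 100000):
    print(step, warmup_cosine(step, total_steps=100000, warmup_steps=5000, peak_lr=5e-4))
```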

Learning Rate Schedules

See how different schedules change the learning rate over training. Toggle between constant, step decay, cosine annealing, and warmup+cosine.

Other Schedules

| Schedule | Formula | Used By |
| --- | --- | --- |
| Linear decay | $\alpha_t = \alpha_0 (1 - t/T)$ | BERT, GPT-2 |
| Inverse sqrt | $\alpha_t = \alpha_0 / \sqrt{t}$ | Original Transformer (Vaswani et al.) |
| Cosine + restarts | Cosine with periodic resets | SGDR (Loshchilov & Hutter) |
| 1-cycle | Warmup to high LR, then cosine down | Super-convergence (Smith & Topin) |

Chapter 09

Showcase: Optimizer Race

Time to see everything in action. Four optimizers — SGD, Momentum, RMSProp, and Adam — all start from the same point and race to minimize the same loss function. Watch how their paths differ. Who reaches the minimum first? Who oscillates? Who takes the smoothest path?

Optimizer Race — Interactive Loss Landscape

Four optimizers race across a 2D loss landscape. Adjust learning rates and watch their behavior change. The contour lines show constant-loss curves; darker = lower loss.

What to Look For

SGD (red): oscillates in narrow valleys, gets stuck at saddle points. Momentum (blue): smoother path, can overshoot but recovers. RMSProp (green): adapts step sizes per dimension, good at narrow valleys but no momentum to escape saddle points. Adam (gold): combines both advantages — typically fastest convergence with the smoothest path.

Try increasing the learning rate and watch SGD diverge while Adam stays stable. Try the saddle point landscape and watch SGD stall while momentum carries through. Try the narrow valley (Beale) and watch RMSProp navigate it smoothly while SGD zigzags.

Pay attention to the number of steps each optimizer needs. Adam typically converges in far fewer steps, but each step is slightly more expensive (maintaining two running averages). In practice, the per-step overhead is negligible compared to the forward and backward passes, so Adam's faster convergence translates directly to faster training.

Also notice how the paths diverge more dramatically as you increase the learning rate. SGD is the first to blow up; Momentum can overshoot but usually recovers; RMSProp's adaptive scaling keeps it stable; Adam is the last to lose stability. This robustness to learning rate choice is a major practical advantage.

Experiment Guide

Beale landscape: A narrow valley that curves. SGD oscillates across the valley walls. Momentum helps but can overshoot the curve. Adam handles both the narrowness (via RMSProp component) and the curve (via momentum component). Rosenbrock: A long, banana-shaped valley. Progress is easy in the steep direction but agonizingly slow along the flat bottom. Saddle point: Flat in one direction, curved in another. Pure SGD stalls; momentum pushes through.

Chapter 10

Weight Initialization & Connections

Before optimization even begins, you have to initialize the weights. This choice matters far more than you might expect. Bad initialization can make training impossible.

Why Not All Zeros?

If every weight starts at zero, every neuron in each layer computes the exact same output. Gradients are identical too, so all weights update identically. They stay equal forever. The network has effectively collapsed to a single neuron per layer — it can never learn diverse features. This is called the symmetry problem.

Never Initialize to Zero

All-zero weight initialization permanently breaks neural networks (biases are the exception and can safely start at zero). You must break symmetry by using random initialization. But how much randomness?

Too Small vs. Too Large

If weights are initialized too small, signals shrink as they pass through layers. By the time you reach layer 50, activations are essentially zero. Gradients vanish. Training dies.

If weights are initialized too large, signals explode. Activations saturate, gradients explode, and the loss becomes NaN after a few steps.

The sweet spot: initialize so that the variance of activations stays roughly constant across layers. This is the principle behind both Xavier and He initialization.

Xavier Initialization

For layers with tanh or sigmoid activations, Glorot & Bengio (2010) derived:

Xavier Initialization: $W \sim \mathcal{N}\left(0, \dfrac{2}{n_{\text{in}} + n_{\text{out}}}\right)$
$n_{\text{in}}$ = number of input units, $n_{\text{out}}$ = number of output units. The second argument is the variance, calibrated so that signal magnitude is preserved through the layer.

He/Kaiming Initialization

ReLU activations kill half the signal (all negative values become zero). He et al. (2015) compensated by doubling the variance:

Kaiming (He) Initialization: $W \sim \mathcal{N}(0, 2 / n_{\text{in}})$
The factor of 2 compensates for ReLU zeroing out ~half the activations. Use this for networks with ReLU or its variants.
Rule of Thumb

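Using ReLU (or GELU, Swish, etc.)? Use Kaiming/He initialization. Using tanh or sigmoid? Use Xavier/Glorot. In PyTorch: nn.init.kaiming_normal_(layer.weight, nonlinearity='relu') or nn.init.xavier_normal_(layer.weight). PyTorch's nn.Linear also uses a Kaiming-style uniform initializer by default, but with a setting (a = √5) that yields a smaller variance than 2/n_in, so call the initializer explicitly when you want He initialization.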

Derivation Sketch — Kaiming Init

For a layer y = Wx with independent, zero-mean entries, if entries of x have variance Var(x) and entries of W have variance Var(W), then each output $y_j = \sum_i W_{ji} x_i$ has variance $\mathrm{Var}(y) = n_{\text{in}} \cdot \mathrm{Var}(W) \cdot \mathrm{Var}(x)$. For Var(y) = Var(x), we need $\mathrm{Var}(W) = 1/n_{\text{in}}$. After ReLU, half the outputs are zeroed, halving the signal's mean square. So we need $\mathrm{Var}(W) = 2/n_{\text{in}}$ to compensate.
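The variance argument can be checked empirically by pushing a random vector through a stack of ReLU layers and measuring the mean square of the final activations. This is a rough sketch: the width, depth, and seed are arbitrary, and the weights are drawn on the fly rather than stored.

```python
import math
import random

def relu_forward_mean_square(width, depth, weight_std, seed=0):
    """Mean square of activations after `depth` random linear + ReLU layers."""
    rng = random.Random(seed)
    a = [rng.gauss(0.0, 1.0) for _ in range(width)]
    for _ in range(depth):
        # Each output unit: fresh Gaussian weights dotted with the input, then ReLU.
        a = [
            max(0.0, sum(rng.gauss(0.0, weight_std) * ai for ai in a))
            for _ in range(width)
        ]
    return sum(ai * ai for ai in a) / width

width, depth = 256, 8
kaiming = relu_forward_mean_square(width, depth, math.sqrt(2.0 / width))
too_small = relu_forward_mean_square(width, depth, math.sqrt(1.0 / width))
print("Var(W) = 2/n_in:", kaiming)
print("Var(W) = 1/n_in:", too_small)
```

With Var(W) = 2/n_in the signal stays on the order of 1 after eight layers. With Var(W) = 1/n_in it is about 2^8 = 256 times smaller, one halving per layer.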

The Full Training Recipe

Putting It All Together
  1. Initialize weights: Kaiming/He for ReLU networks.
  2. Choose optimizer: AdamW (β1=0.9, β2=0.999, ε=1e-8) is the safe default.
  3. Choose learning rate: Start with α = 1e-3 or 5e-4 for Adam, 0.1 for SGD+Momentum.
  4. Choose schedule: Linear warmup (5,000 steps) + cosine decay for transformers. Step decay for CNNs.
  5. Choose regularization: Weight decay λ ≈ 0.01–0.1 (in AdamW). Possibly dropout.
  6. Train and monitor both training and validation loss. If the gap grows, increase regularization.

Practical Advice from CS 231n

| Situation | Recommendation |
| --- | --- |
| Starting a new project | AdamW with default hyperparameters. It often works without tuning. |
| Squeezing out last 1% accuracy | SGD + Momentum with careful LR schedule. Can outperform Adam with tuning. |
| Full-batch optimization (small dataset) | Consider L-BFGS, a second-order method that converges very fast when you can afford full gradients. |
| Training is unstable (loss spikes) | Reduce learning rate. Add warmup. Check initialization. |
| Large generalization gap | Increase weight decay. Add dropout. Use data augmentation. |

Second-Order Methods (A Brief Mention)

Everything we've discussed is first-order optimization: we only use the gradient (first derivative). Second-order methods also use the Hessian (matrix of second derivatives) to model the curvature of the loss surface. Newton's method jumps directly to the estimated minimum of a quadratic approximation. The problem? The Hessian has $O(N^2)$ entries and inverting it takes $O(N^3)$ operations. With N = 100 million parameters, that's utterly infeasible.

Approximations like L-BFGS avoid storing the full Hessian, but they still require deterministic full-batch gradients. In the stochastic mini-batch world of deep learning, first-order methods (especially Adam) reign supreme.

Key Equations Summary

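SGD: $W \leftarrow W - \alpha \nabla L$
SGD + Momentum: $v \leftarrow \rho v - \alpha \nabla L$, $\quad W \leftarrow W + v$
Adam: $m \leftarrow \beta_1 m + (1 - \beta_1) g$, $\quad v \leftarrow \beta_2 v + (1 - \beta_2) g^2$
$\hat{m} = m / (1 - \beta_1^t)$, $\quad \hat{v} = v / (1 - \beta_2^t)$
$W \leftarrow W - \alpha \, \hat{m} / (\sqrt{\hat{v}} + \varepsilon)$
Cross-Entropy Loss: $L = -\log\left( e^{s_y} / \sum_j e^{s_j} \right)$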

What Comes Next

Backpropagation (Lecture 4): How do you actually compute ∇L? The chain rule applied recursively through the network. Without backprop, gradient descent is impractical: numerical differentiation needs one forward pass per parameter, which is hopeless for networks with millions of parameters.

Neural Networks: We've been optimizing linear classifiers (f = Wx + b). Next: stack multiple layers with nonlinearities to learn arbitrary functions. The optimization tools from this lecture — Adam, momentum, learning rate schedules — carry over directly.

Batch Normalization, Dropout, Data Augmentation: More regularization techniques, designed specifically for deep networks. These complement weight decay and are covered in later lectures.

Quick Reference: Default Hyperparameters

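| Hyperparameter | Adam/AdamW | SGD+Momentum |
| --- | --- | --- |
| Learning rate $\alpha$ | 1e-3 or 5e-4 | 0.1 |
| Momentum $\rho$ | 0.9 ($\beta_1$) | 0.9 |
| RMSProp decay $\beta_2$ | 0.999 | N/A |
| $\varepsilon$ | 1e-8 | N/A |
| Weight decay $\lambda$ | 0.01–0.1 | 1e-4 |
| Warmup steps | 1,000–5,000 | Optional |
| LR schedule | Cosine | Step decay or cosine |
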
Worked Example — The PyTorch Recipe

Here is what training setup looks like in practice for a typical vision transformer:

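import torch
from torch.optim.lr_scheduler import CosineAnnealingLR

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-4, weight_decay=0.05)  # model: an existing nn.Module

scheduler = CosineAnnealingLR(optimizer, T_max=300, eta_min=1e-6)  # T_max counts scheduler.step() calls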

The optimizer handles the per-step weight updates (momentum, adaptive scaling, bias correction, decoupled weight decay). The scheduler handles the epoch-level learning rate decay (cosine annealing from 5e-4 down to 1e-6 over 300 epochs), provided scheduler.step() is called once per epoch. Together they implement the core of the modern training recipe; warmup would be added as a separate schedule stage.

The Historical Arc

| Year | Milestone | Key Idea |
| --- | --- | --- |
| 1847 | Cauchy proposes gradient descent | Follow the negative gradient |
| 1951 | Robbins & Monro: stochastic approximation | Use noisy gradients |
| 1983 | Nesterov: accelerated gradient | Lookahead momentum |
| 2010 | Glorot & Bengio: Xavier initialization | Preserve signal variance |
| 2011 | Duchi et al.: AdaGrad | Per-parameter learning rates |
| 2012 | Tieleman & Hinton: RMSProp | Leaky AdaGrad |
| 2015 | Kingma & Ba: Adam | Momentum + RMSProp + bias correction |
| 2015 | He et al.: Kaiming initialization | Compensate for ReLU |
| 2017 | Loshchilov & Hutter: cosine annealing, AdamW | Smooth LR decay, decoupled weight decay |

The One Sentence

Regularization prevents overfitting by penalizing complexity; optimization finds good weights by following gradients intelligently. Adam + weight decay + cosine schedule is the modern default.