Goodfellow et al., Chapter 8

Optimization for Training

SGD, momentum, adaptive learning rates, and the landscape of loss surfaces. How we actually minimize the loss function in practice.

Prerequisites: Chapter 4 (gradient-based optimization), Chapter 6 (feedforward networks, backprop).
10 chapters · 4+ simulations · 10 quizzes

Chapter 0: Training vs Pure Optimization

Optimization in machine learning is fundamentally different from optimization in mathematics. In pure optimization, you want to minimize a function exactly. In deep learning, you want to minimize the expected loss on data you have never seen — the test set. The training loss is only a proxy.

This means we are doing empirical risk minimization: minimizing the average loss over the training set, hoping it approximates the true expected loss. The gap between training loss and test loss is the generalization error (Chapter 7).

Key distinction: We do not actually want to reach the minimum of the training loss. A model at the global minimum of training loss is almost certainly overfitting. Our optimization algorithms are tuned for good generalization, not perfect training fit. This is why techniques like early stopping, noise in SGD, and learning rate decay all help.
Pure Optimization: Find the exact minimum of f(x). That is the answer.
ML Optimization: Minimize training loss as a proxy for test loss. Stop before overfitting.
Challenges: Saddle points, ill-conditioning, local minima, stochastic gradients.

Another critical difference: we can only afford approximate gradients. Computing the true gradient requires a pass over the entire dataset. Instead, we estimate it from a small mini-batch, introducing noise into the optimization. This noise is not just tolerable — it actually helps generalization.

Why don't we want to reach the exact minimum of the training loss?

Chapter 1: Stochastic Gradient Descent

The most important optimization algorithm in deep learning is stochastic gradient descent (SGD). Instead of computing the gradient over the entire dataset, SGD estimates the gradient from a random mini-batch of m examples:

ĝ = (1/m) ∇_θ ∑_{i=1}^{m} L(f(x^(i); θ), y^(i))

Then the parameters are updated: θ ← θ − ε ĝ, where ε is the learning rate — the single most important hyperparameter in deep learning.

Why does this work? The mini-batch gradient is an unbiased estimator of the true gradient. On average, it points in the right direction. The noise from sampling introduces variance, but this variance actually helps escape sharp minima and find flatter, better-generalizing regions.
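As a rough illustration of the update rule above, here is a minimal sketch of mini-batch SGD on a one-parameter linear regression. The toy data (y = 3x plus noise), the squared-error loss, and all hyperparameter values are assumptions chosen for the example, not part of the original text.

```python
import random

def minibatch_gradient(w, batch):
    """Average gradient of the per-example loss 0.5 * (w*x - y)**2 over the batch."""
    return sum((w * x - y) * x for x, y in batch) / len(batch)

def sgd(data, lr=0.1, batch_size=8, steps=500, seed=0):
    rng = random.Random(seed)
    w = 0.0
    for _ in range(steps):
        batch = rng.sample(data, batch_size)  # random mini-batch of m examples
        g_hat = minibatch_gradient(w, batch)  # noisy estimate of the full gradient
        w -= lr * g_hat                       # theta <- theta - eps * g_hat
    return w

# Hypothetical toy data: y = 3x plus Gaussian noise.
data_rng = random.Random(1)
xs = [data_rng.uniform(-1.0, 1.0) for _ in range(200)]
data = [(x, 3.0 * x + data_rng.gauss(0.0, 0.1)) for x in xs]

w_final = sgd(data)
print(w_final)  # close to 3, with some jitter from mini-batch noise
```

Each step touches only 8 of the 200 examples, yet the estimate still drifts toward the true slope because the mini-batch gradient is correct on average.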

Batch size tradeoff: Larger batches give lower-variance gradient estimates but cost more compute per step. Smaller batches give noisier estimates but more parameter updates per epoch. The sweet spot is typically 32-256 for most problems. Very large batches (thousands) tend to converge to sharp minima that generalize worse.

The learning rate ε must decrease over time. If it stays constant, SGD will never converge — it will oscillate around the minimum forever, pushed by the mini-batch noise. Common decay schedules include linear decay, step decay, and cosine annealing.
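One common form of linear decay interpolates from an initial rate to a small final rate over the first τ iterations and then holds it constant. The sketch below assumes that form; the names `eps0`, `eps_tau`, and `tau` and their default values are illustrative choices.

```python
def linear_decay(k, eps0=0.1, eps_tau=0.001, tau=100):
    """Learning rate at iteration k: linear from eps0 to eps_tau, then constant."""
    alpha = min(k / tau, 1.0)
    return (1.0 - alpha) * eps0 + alpha * eps_tau

for k in (0, 50, 100, 500):
    print(k, linear_decay(k))
```

A typical heuristic is to pick `eps_tau` at roughly 1% of `eps0`, but the best values depend on the problem.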

SGD on a 2D Loss Surface

Watch SGD navigate a loss landscape. Notice the noisy path compared to full-batch gradient descent.

Controls: learning rate 0.10, batch noise 0.8
Why must the learning rate decrease over training?

Chapter 2: Momentum

Vanilla SGD struggles on loss surfaces with high curvature in some directions and low curvature in others. The gradient points down the steep direction, causing oscillation, while making painfully slow progress along the shallow ravine.

Momentum fixes this by accumulating a velocity vector v that smooths out the oscillations:

v ← β v − ε ∇J(θ)
θ ← θ + v

Here β is the momentum coefficient (typically 0.9). Think of it as a ball rolling downhill: it builds up speed in directions of consistent gradient (the ravine) and the back-and-forth oscillations cancel out.

Physical analogy: Momentum adds inertia. A ball rolling through a narrow valley does not zigzag — its velocity along the valley accumulates while the perpendicular oscillations dampen. The parameter β controls how much "memory" the velocity has. β = 0 is vanilla SGD. β = 0.99 means the ball has heavy inertia and takes longer to change direction.

Nesterov momentum is a refinement: instead of computing the gradient at the current position, it computes the gradient at the anticipated next position (θ + βv). This "lookahead" gives a corrective signal that reduces overshooting. For convex problems with exact (full-batch) gradients, it improves the convergence rate from O(1/k) to O(1/k²). With stochastic gradients the theoretical rate does not improve, and in practice the gain over standard momentum is modest.
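To make the two update rules concrete, here is a small sketch comparing plain gradient descent, momentum, and Nesterov momentum on an elongated quadratic. It uses exact gradients (no mini-batch noise), and the surface, starting point, learning rate, and step count are all assumptions picked for illustration.

```python
def grad(theta):
    """Gradient of the elongated quadratic J(x, y) = 0.5 * (x**2 + 25 * y**2)."""
    x, y = theta
    return [x, 25.0 * y]

def loss(theta):
    x, y = theta
    return 0.5 * (x * x + 25.0 * y * y)

def run(beta, lr=0.03, steps=100, nesterov=False):
    theta, v = [10.0, 1.0], [0.0, 0.0]
    for _ in range(steps):
        # Nesterov evaluates the gradient at the lookahead point theta + beta * v.
        point = [t + beta * vi for t, vi in zip(theta, v)] if nesterov else theta
        g = grad(point)
        v = [beta * vi - lr * gi for vi, gi in zip(v, g)]  # v <- beta*v - eps*grad
        theta = [t + vi for t, vi in zip(theta, v)]        # theta <- theta + v
    return theta

print("plain GD :", loss(run(beta=0.0)))
print("momentum :", loss(run(beta=0.9)))
print("nesterov :", loss(run(beta=0.9, nesterov=True)))
```

With these settings both momentum variants end at a much lower loss than plain gradient descent, because velocity builds up along the shallow x direction where plain steps are tiny.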

SGD vs Momentum vs Nesterov

Compare convergence on an elongated quadratic. Momentum accumulates velocity along the ravine.

Controls: momentum β = 0.90
How does momentum help SGD on ill-conditioned loss surfaces?

Chapter 3: Adaptive Learning Rates

The learning rate is the most important hyperparameter, but a single global rate treats all parameters equally. Parameters connected to frequently-active features may need smaller rates; parameters connected to rare features may need larger ones. Adaptive methods give each parameter its own effective learning rate.

AdaGrad divides each parameter's learning rate by the square root of the sum of all past squared gradients for that parameter:

r ← r + g ⊙ g
Δθ = − (ε / √(r + δ)) ⊙ g

Parameters with large accumulated gradients get their learning rate reduced. Parameters with small accumulated gradients keep a relatively large rate. This is great for sparse data (NLP, recommender systems) where some features appear rarely.

AdaGrad's fatal flaw: The accumulator r only grows. It never shrinks. Over long training runs, the effective learning rate shrinks to near zero and training stalls. This makes AdaGrad unsuitable for deep learning where training runs are long.
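The shrinking effective rate is easy to see numerically. The sketch below applies the AdaGrad update to two parameters, one that receives a gradient on every step and one that receives a gradient only every tenth step. The gradient pattern and the values of ε and δ are made up for the example.

```python
import math

def adagrad_step(theta, g, r, lr=0.1, delta=1e-8):
    """One AdaGrad update; returns the new parameters and accumulator."""
    r = [ri + gi * gi for ri, gi in zip(r, g)]  # r <- r + g * g
    theta = [t - lr / math.sqrt(ri + delta) * gi
             for t, ri, gi in zip(theta, r, g)]
    return theta, r

theta, r = [0.0, 0.0], [0.0, 0.0]
for t in range(1, 101):
    # Parameter 0 sees a gradient every step; parameter 1 only every 10th step.
    g = [1.0, 1.0 if t % 10 == 0 else 0.0]
    theta, r = adagrad_step(theta, g, r)

effective_lr = [0.1 / math.sqrt(ri + 1e-8) for ri in r]
print(effective_lr)  # roughly [0.01, 0.0316]
```

After 100 steps the frequently updated parameter is already down to a tenth of the base rate, and the accumulator can only keep growing from there.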

RMSProp fixes this by using an exponentially decaying average of squared gradients instead of the sum:

r ← ρ r + (1 − ρ) g ⊙ g
Δθ = − (ε / √(r + δ)) ⊙ g

The decay rate ρ (typically 0.9 or 0.99) means old gradients are forgotten. This gives RMSProp a "sliding window" view of recent gradient magnitudes, preventing the learning rate from decaying to zero. RMSProp was proposed by Hinton in a Coursera lecture — not a paper — and quickly became one of the most popular optimizers.
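For comparison, here is a sketch of the RMSProp update under a constant gradient of 1. The single-parameter setup and the values of ε, ρ, and δ are illustrative assumptions.

```python
import math

def rmsprop_step(theta, g, r, lr=0.1, rho=0.9, delta=1e-8):
    """One RMSProp update; returns the new parameters and accumulator."""
    r = [rho * ri + (1.0 - rho) * gi * gi for ri, gi in zip(r, g)]
    theta = [t - lr / math.sqrt(ri + delta) * gi
             for t, ri, gi in zip(theta, r, g)]
    return theta, r

theta, r = [0.0], [0.0]
for _ in range(100):
    theta, r = rmsprop_step(theta, [1.0], r)

print(r[0])                          # approaches 1.0, the recent mean of g * g
print(0.1 / math.sqrt(r[0] + 1e-8))  # effective rate stays near 0.1
```

Because the accumulator is an average rather than a sum, it levels off at the typical squared gradient, so the effective rate settles instead of decaying toward zero.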

Key insight: Adaptive methods solve a different problem than momentum. Momentum helps with direction (smoothing oscillations). Adaptive rates help with scale (adjusting step size per parameter). The two ideas can be combined, which leads us to Adam.
Why does AdaGrad fail on long training runs?

Chapter 4: Adam

Adam (Adaptive Moment Estimation) combines the best of momentum and RMSProp. It maintains two running averages: a first moment (mean of gradients, like momentum) and a second moment (mean of squared gradients, like RMSProp).

s ← β₁ s + (1 − β₁) g     (first moment)
r ← β₂ r + (1 − β₂) g ⊙ g     (second moment)
ŝ = s / (1 − β₁^t),   r̂ = r / (1 − β₂^t)     (bias correction)
θ ← θ − ε · ŝ / (√r̂ + δ)

The bias correction (dividing by 1 − β^t, where t is the step count) is critical. Since s and r are initialized to zero, the early estimates are biased toward zero. The correction compensates for this, especially in the first few steps when β^t is still close to 1.

Default hyperparameters: The Adam paper recommends β₁ = 0.9, β₂ = 0.999, ε = 10⁻³, δ = 10⁻⁸. These defaults work remarkably well across a wide range of problems. In practice, the learning rate ε is the only one you usually need to tune.
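A minimal sketch of one Adam step, following the four update equations above with the default hyperparameters, may help show what the bias correction does. The function name `adam_step` and the example gradient are assumptions for illustration.

```python
import math

def adam_step(theta, g, s, r, t, lr=1e-3, beta1=0.9, beta2=0.999, delta=1e-8):
    """One Adam update at step t (t starts at 1). Returns (theta, s, r)."""
    s = [beta1 * si + (1.0 - beta1) * gi for si, gi in zip(s, g)]
    r = [beta2 * ri + (1.0 - beta2) * gi * gi for ri, gi in zip(r, g)]
    s_hat = [si / (1.0 - beta1 ** t) for si in s]  # bias-corrected first moment
    r_hat = [ri / (1.0 - beta2 ** t) for ri in r]  # bias-corrected second moment
    theta = [th - lr * sh / (math.sqrt(rh) + delta)
             for th, sh, rh in zip(theta, s_hat, r_hat)]
    return theta, s, r

theta, s, r = adam_step([0.0, 0.0], [2.0, -0.5], [0.0, 0.0], [0.0, 0.0], t=1)
print(s)      # raw first moment is only 10% of the gradient: about [0.2, -0.05]
print(theta)  # after correction the first step has size about lr: [-0.001, 0.001]
```

Note that both parameters move by about the same amount on the first step even though their gradients differ by a factor of four. The per-parameter scaling removes the gradient magnitude and keeps only its sign and consistency.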

AdamW is a critical variant. When weight decay is implemented in standard Adam as an L2 penalty, the decay term λθ is added to the gradient and then rescaled by the adaptive denominator, so parameters with large gradient magnitudes receive less regularization. AdamW decouples weight decay from the gradient step and applies it directly to the parameters: θ ← (1 − λ)θ − update (many implementations scale the decay by the learning rate, giving (1 − ελ)θ). This is the default optimizer for training transformers and most modern architectures.
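The difference between the two styles of weight decay shows up clearly when the loss gradient is zero, so that only the decay acts. The sketch below is a scalar toy version under two assumptions: the decoupled decay is scaled by the learning rate (as many implementations do), and the hypothetical helper `adam_like_step` and its hyperparameter values are chosen only for illustration.

```python
import math

def adam_like_step(theta, g, s, r, t, lr=0.1, wd=0.01, decoupled=True,
                   beta1=0.9, beta2=0.999, delta=1e-8):
    """Scalar Adam step with weight decay applied in one of two ways."""
    if not decoupled:
        g = g + wd * theta               # L2 penalty folded into the gradient
    s = beta1 * s + (1.0 - beta1) * g
    r = beta2 * r + (1.0 - beta2) * g * g
    s_hat = s / (1.0 - beta1 ** t)
    r_hat = r / (1.0 - beta2 ** t)
    step = lr * s_hat / (math.sqrt(r_hat) + delta)
    if decoupled:
        theta = theta - lr * wd * theta  # decay applied directly to the weights
    return theta - step, s, r

# Zero loss gradient, so any change in theta comes from the decay term alone.
theta_l2, _, _ = adam_like_step(1.0, 0.0, 0.0, 0.0, t=1, decoupled=False)
theta_decoupled, _, _ = adam_like_step(1.0, 0.0, 0.0, 0.0, t=1, decoupled=True)
print(theta_l2)         # about 0.9: adaptive scaling turns the small penalty into a full step
print(theta_decoupled)  # about 0.999: shrinks by the factor (1 - lr * wd)
```

In the L2 version the strength of the regularization is set by the adaptive normalization rather than by λ. The decoupled version avoids this.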

Optimizer Comparison

Watch SGD, Momentum, RMSProp, and Adam converge on the same surface. Adam combines the benefits of momentum and RMSProp.

Controls: learning rate 0.050
What does Adam's bias correction fix?

Chapter 5: Learning Rate Schedules

Even with adaptive optimizers, the base learning rate matters enormously. Starting too high causes divergence. Starting too low wastes compute on timid steps. Learning rate schedules prescribe how ε changes over training.

Step decay multiplies the learning rate by a factor (e.g., 0.1) at fixed epochs. Simple, effective, requires knowing roughly when to drop. Used in classic ImageNet training recipes (drop at epoch 30 and 60 of 90).
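Step decay takes only a few lines. This sketch assumes a base rate of 0.1 and drops at epochs 30 and 60, mirroring the recipe described above; the function and parameter names are illustrative.

```python
def step_decay(epoch, lr0=0.1, factor=0.1, milestones=(30, 60)):
    """Multiply lr0 by `factor` once for every milestone already reached."""
    drops = sum(1 for m in milestones if epoch >= m)
    return lr0 * factor ** drops

print([step_decay(e) for e in (0, 29, 30, 59, 60, 89)])
```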

Cosine annealing smoothly decreases the learning rate following a cosine curve from ε_max to ε_min:

ε_t = ε_min + ½(ε_max − ε_min)(1 + cos(π t / T))

Cosine annealing has become the default in modern training. The rate stays near its maximum early on, falls fastest midway through training, and flattens out near its minimum at the end, which gives a smooth cool-down for fine-tuning.

Warmup: Large models (especially transformers) benefit from a warmup phase: start with a tiny learning rate and linearly ramp up to the target over the first few hundred to few thousand steps. Without warmup, the randomly initialized parameters receive large gradient updates that can destabilize training. After warmup, follow a cosine decay.
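A warmup-then-cosine schedule can be sketched as a single function of the step index. The linear ramp shape, the function name `lr_at`, and the step counts used below are assumptions for illustration. Real training recipes differ in details such as whether warmup starts from exactly zero.

```python
import math

def lr_at(step, total_steps, warmup_steps, lr_max, lr_min=0.0):
    """Linear warmup to lr_max, then cosine decay from lr_max to lr_min."""
    if step < warmup_steps:
        return lr_max * (step + 1) / warmup_steps
    T = max(1, total_steps - warmup_steps)  # length of the cosine phase
    t = min(step - warmup_steps, T)         # clamp so late steps stay at lr_min
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * t / T))

for step in (0, 99, 100, 550, 1000):
    print(step, lr_at(step, total_steps=1000, warmup_steps=100, lr_max=1e-3))
```

The rate ramps from 1% of the peak to the peak over the first 100 steps, passes through half the peak at the midpoint of the cosine phase, and reaches `lr_min` at the end.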

Cyclical learning rates and warm restarts periodically reset the learning rate to a high value, exploring multiple basins of the loss landscape. This can be combined with snapshot ensembles: save the model at each cycle's end and ensemble the snapshots.

Learning Rate Schedules

Visualize different schedules over 100 epochs. The shaded area shows the "productive" training zone.

Why do large transformer models need learning rate warmup?

Chapter 6: Loss Surfaces

The loss function of a deep network is a complex, high-dimensional surface. Understanding its geometry helps explain why optimization works (and when it fails).

Local minima were long feared as traps. In low dimensions, a random function has many isolated local minima. But in high dimensions, most critical points are saddle points: minima in some directions and maxima in others. For a random function in n dimensions, if each of the n Hessian eigenvalues at a critical point is treated as equally likely to be positive or negative, the probability that all n are positive (a true local minimum) is exponentially small: ~2^(−n).

Saddle points, not local minima, are the real obstacle. Gradient descent slows dramatically near saddle points because the gradient is near zero. But gradient descent still eventually escapes (SGD noise helps). Newton's method can actually be attracted to saddle points, which is one reason second-order methods are tricky in deep learning.

Ill-conditioning is the most common problem. When the Hessian has a very large condition number (ratio of largest to smallest eigenvalue), the loss surface looks like a narrow ravine. The gradient points mostly across the ravine (steep direction), not along it (useful direction). This causes oscillation and slow progress.
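A back-of-the-envelope calculation shows how the condition number κ limits gradient descent. On a quadratic, a step with rate ε multiplies the error along each Hessian eigenvector by (1 − ελ). If ε is set to 1/λ_max to stay stable in the steepest direction, the shallowest direction contracts by only 1 − 1/κ per step. The sketch below counts the steps needed under that assumption; the helper name and the target factor are illustrative.

```python
import math

def steps_to_shrink(kappa, target=1e-3):
    """Steps for gradient descent with lr = 1/lambda_max to shrink the error
    along the lowest-curvature direction of a quadratic by `target`."""
    if kappa <= 1.0:
        return 1  # perfectly conditioned: one step reaches the minimum
    contraction = 1.0 - 1.0 / kappa  # per-step factor along the shallow direction
    return math.ceil(math.log(target) / math.log(contraction))

for kappa in (2, 25, 1000):
    print(kappa, steps_to_shrink(kappa))
```

The step count grows roughly in proportion to κ, which is why a narrow ravine makes plain gradient descent so slow.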

Flat vs sharp minima: Empirically, flat minima (low curvature around the minimum) tend to generalize better than sharp minima (high curvature). SGD with small batches tends to find flatter minima, which may explain why small-batch training generalizes better. This connects optimization to generalization in a deep way.

Cliffs and exploding gradients: Some loss surfaces have regions of very high curvature — "cliffs" where the gradient magnitude jumps by orders of magnitude. A normal gradient step from a cliff can launch the parameters into a bad region. Gradient clipping caps the gradient norm to prevent this: if ||g|| > threshold, set g ← g · threshold / ||g||.
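The clipping rule above translates directly into code. This is a minimal sketch for a gradient stored as a flat list; the function name `clip_by_norm` is an illustrative choice.

```python
import math

def clip_by_norm(g, threshold):
    """Rescale g to have norm `threshold` if its norm exceeds the threshold."""
    norm = math.sqrt(sum(gi * gi for gi in g))
    if norm > threshold:
        scale = threshold / norm
        return [gi * scale for gi in g]
    return list(g)

print(clip_by_norm([3.0, 4.0], threshold=1.0))  # norm 5 is rescaled to norm 1
print(clip_by_norm([0.3, 0.4], threshold=1.0))  # norm 0.5 is left unchanged
```

Clipping the norm keeps the direction of the gradient intact and only limits how far a single step can travel.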
In high-dimensional loss surfaces, why are saddle points more problematic than local minima?

Chapter 7: Batch Normalization for Optimization

We covered batch normalization as a regularizer in the lesson on the book's Chapter 7 (regularization). Here we look at its optimization benefits, which are arguably even more important.

BatchNorm reparameterizes the network in a way that makes the loss surface smoother. By normalizing each layer's inputs, it reduces the dependence between layers — updating one layer's weights does not dramatically shift the input distribution of the next layer.

Why BatchNorm helps optimization: The original paper attributed the benefit to reducing "internal covariate shift." Later research (Santurkar et al., 2018) showed this is not quite right. The real benefit is that BatchNorm makes the loss surface smoother — the gradients become more predictive of the loss change, and the loss changes more smoothly as parameters move. This allows larger learning rates and faster convergence.

Layer normalization normalizes across features within a single example (rather than across the batch). It does not depend on batch statistics, making it suitable for RNNs, transformers, and small-batch settings. LayerNorm is the standard normalization in transformers.
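The difference between the two comes down to which axis the statistics are computed over. The sketch below is a simplified illustration for a batch stored as a list of feature rows. It omits the learnable scale and shift and the running statistics used at inference time, and the function names are assumptions.

```python
import math

def normalize(values, eps=1e-5):
    """Shift and scale a list of numbers to zero mean and roughly unit variance."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / math.sqrt(var + eps) for v in values]

def batch_norm(x):
    """Normalize each feature (column) across the examples in the batch."""
    cols = [normalize(list(col)) for col in zip(*x)]
    return [list(row) for row in zip(*cols)]

def layer_norm(x):
    """Normalize each example (row) across its own features."""
    return [normalize(row) for row in x]

x = [[1.0, 2.0, 3.0], [3.0, 6.0, 9.0]]
print(batch_norm(x))
print(layer_norm(x))
print(layer_norm([x[0]]) == [layer_norm(x)[0]])  # True: independent of the batch
```

The last line is the key property: a layer-normalized example gives the same result whatever else is in the batch, while a batch-normalized one does not.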

Weight normalization reparameterizes each weight vector as w = g · v/||v||, decoupling the magnitude g from the direction v/||v||. This is simpler than BatchNorm but less effective in practice.
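As a small sketch of the reparameterization, the function below builds w from a direction vector v and a scalar magnitude g. The name `weight_norm` and the example values are illustrative.

```python
import math

def weight_norm(v, g):
    """Return w = g * v / ||v||, so that ||w|| = |g| whatever the scale of v."""
    norm = math.sqrt(sum(vi * vi for vi in v))
    return [g * vi / norm for vi in v]

w = weight_norm([3.0, 4.0], g=2.0)
print(w)  # direction of v, length 2: about [1.2, 1.6]
```

Rescaling v leaves w unchanged, so the optimizer adjusts the length of the weight vector only through g.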

Group normalization divides channels into groups and normalizes within each group. It works well with small batch sizes (common in detection and segmentation tasks where large images limit batch size). Instance normalization (the special case with one channel per group) is used in style transfer.

What is the primary optimization benefit of batch normalization?

Chapter 8: Optimizer Playground

Watch four optimizers race on a challenging 2D surface with a narrow valley. Drag the starting point to experiment with different initializations.

Optimizer Race

All optimizers start from the same point. SGD oscillates in the valley; momentum smooths the path; Adam adapts per-parameter and converges fastest.

Controls: learning rate 0.030, steps 80
Experiments: (1) Set LR=0.1 and watch SGD diverge while Adam stays stable. (2) Reduce LR to 0.01 and watch SGD converge but slowly. (3) Increase steps to 200 and compare final positions. Adam reaches the minimum first in almost every configuration.
Why does Adam converge faster than SGD on surfaces with different curvature in different directions?

Chapter 9: Connections

Optimization is the engine that drives all of deep learning. Here is where each concept connects:

Concept | Where It Appears
SGD | Still the foundation. Large-scale vision training often uses SGD + momentum over Adam for better generalization.
Adam / AdamW | Default for transformers (Ch 10, NLP), fine-tuning pretrained models, GANs, and most new architectures.
Learning rate warmup | Essential for transformer training. Also used in large-batch distributed training.
Cosine schedule | Standard in modern training recipes: warmup + cosine decay. Used in GPT, BERT, ViT, and nearly all foundation models.
Gradient clipping | Critical for RNNs (Ch 10) to prevent exploding gradients. Also used in transformer training.
BatchNorm / LayerNorm | BatchNorm in CNNs (Ch 9), LayerNorm in transformers. Both smooth the loss surface and enable higher LR.
Loss surface geometry | Connects to generalization (Ch 7): flat minima generalize better. Connects to architecture: skip connections (ResNet) smooth the surface.
What you should take away: Start with AdamW + cosine schedule + warmup for most problems. If generalization matters more than convergence speed (large-scale vision), consider SGD + momentum. Always use gradient clipping for sequence models. The learning rate is the most important hyperparameter — tune it first.

Up next: Chapter 9: Convolutional Networks — architectures that exploit spatial structure in data.

For training a transformer model, what is the recommended optimizer setup?