SGD, momentum, adaptive learning rates, and the landscape of loss surfaces. How we actually minimize the loss function in practice.
Optimization in machine learning differs from pure mathematical optimization in its goal. In pure optimization, minimizing the function is the goal in itself. In deep learning, you want to minimize the expected loss on data you have never seen, which the test set only estimates. The training loss is only a proxy.
This means we are doing empirical risk minimization: minimizing the average loss over the training set, hoping it approximates the true expected loss. The gap between training loss and test loss is the generalization gap (Chapter 7).
Another critical difference: we can only afford approximate gradients. Computing the exact gradient of the training loss requires a pass over the entire dataset. Instead, we estimate it from a small mini-batch, which introduces noise into the optimization. This noise is not just tolerable: in practice it often helps generalization.
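The most important optimization algorithm in deep learning is stochastic gradient descent (SGD). Instead of computing the gradient over the entire dataset, SGD estimates the gradient from a random mini-batch of m examples: ĝ = (1/m) Σᵢ ∇θ L(f(xᵢ; θ), yᵢ), where the sum runs over the m sampled examples (xᵢ, yᵢ), f is the model, and L is the per-example loss.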
Then the parameters are updated: θ ← θ − ε ĝ, where ε is the learning rate — the single most important hyperparameter in deep learning.
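As a minimal sketch of the estimate-then-update loop, here is mini-batch SGD on a toy least-squares problem. The data, the squared-error loss, and the names `minibatch_gradient` and `sgd` are assumptions made for illustration; they are not part of the chapter.

```python
import numpy as np

def minibatch_gradient(theta, X, y, idx):
    """Gradient of 0.5 * mean((X @ theta - y)**2) over the rows in idx."""
    Xb, yb = X[idx], y[idx]
    return Xb.T @ (Xb @ theta - yb) / len(idx)

def sgd(X, y, lr=0.1, batch_size=32, steps=500, seed=0):
    rng = np.random.default_rng(seed)
    theta = np.zeros(X.shape[1])
    for _ in range(steps):
        # Random mini-batch of m = batch_size examples
        idx = rng.choice(len(X), size=batch_size, replace=False)
        g_hat = minibatch_gradient(theta, X, y, idx)
        theta = theta - lr * g_hat  # theta <- theta - eps * g_hat
    return theta

# Toy data (made up): y = X @ [2, -3] plus a little noise
rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 2))
true_theta = np.array([2.0, -3.0])
y = X @ true_theta + 0.01 * rng.normal(size=1000)

theta_sgd = sgd(X, y)
print(theta_sgd)  # should land close to true_theta
```

Each step touches only 32 of the 1000 rows, which is the whole point: the cost per update does not grow with the dataset.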
Why does this work? When the mini-batch is sampled uniformly at random, its gradient is an unbiased estimator of the true gradient: on average, it points in the right direction. Sampling adds variance, but this variance is widely believed to help SGD escape sharp minima and find flatter, better-generalizing regions.
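The unbiasedness claim can be checked numerically. The sketch below, on made-up data with a squared-error loss, averages many mini-batch gradients at a fixed θ and compares the result with the full-batch gradient; the batch size of 16 and the 5000 repetitions are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 3))
y = rng.normal(size=2000)
theta = np.array([0.5, -1.0, 2.0])  # arbitrary fixed parameter vector

def grad(Xb, yb, theta):
    """Gradient of 0.5 * mean((Xb @ theta - yb)**2)."""
    return Xb.T @ (Xb @ theta - yb) / len(yb)

full_grad = grad(X, y, theta)

estimates = []
for _ in range(5000):
    idx = rng.choice(len(X), size=16, replace=False)
    estimates.append(grad(X[idx], y[idx], theta))
estimates = np.array(estimates)

mean_estimate = estimates.mean(axis=0)    # close to full_grad
per_batch_spread = estimates.std(axis=0)  # the noise in any single estimate

print("full gradient:     ", full_grad)
print("mean of estimates: ", mean_estimate)
print("std of one estimate:", per_batch_spread)
```

Any single estimate is noisy, but the average lines up with the full gradient, which is what "unbiased" means here.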
The learning rate ε should usually decrease over time. With a constant rate, SGD does not settle at the minimum: mini-batch noise keeps pushing it around, so it hovers in a neighborhood whose size grows with ε. Shrinking ε shrinks that neighborhood. Common decay schedules include linear decay, step decay, and cosine annealing.
Interactive demo: SGD navigating a loss landscape. Its path is noisier than that of full-batch gradient descent.
Vanilla SGD struggles on loss surfaces with high curvature in some directions and low curvature in others. The gradient points down the steep direction, causing oscillation, while making painfully slow progress along the shallow ravine.
Momentum fixes this by accumulating a velocity vector v that smooths out the oscillations: v ← βv − ε ĝ, then θ ← θ + v.
Here β is the momentum coefficient (typically 0.9). Think of it as a ball rolling downhill: it builds up speed in directions of consistent gradient (the ravine) and the back-and-forth oscillations cancel out.
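To make the ravine picture concrete, here is a small sketch comparing plain gradient descent (β = 0) with momentum (β = 0.9) on an elongated quadratic. The Hessian `diag(1, 50)`, the starting point, and the learning rate are illustrative assumptions, and exact gradients are used instead of mini-batch estimates to isolate the effect of the velocity term.

```python
import numpy as np

H = np.diag([1.0, 50.0])  # f(theta) = 0.5 * theta^T H theta: shallow in x, steep in y

def grad(theta):
    return H @ theta

def run(lr, beta, steps=100):
    theta = np.array([10.0, 1.0])
    v = np.zeros_like(theta)
    for _ in range(steps):
        v = beta * v - lr * grad(theta)  # accumulate velocity
        theta = theta + v                # move along the velocity
    return 0.5 * theta @ H @ theta       # final loss

loss_plain = run(lr=0.03, beta=0.0)
loss_momentum = run(lr=0.03, beta=0.9)
print(loss_plain, loss_momentum)  # momentum ends at a much lower loss
```

With β = 0 the steep coordinate flips sign every step while the shallow one crawls; with β = 0.9 the consistent gradient along the shallow direction builds up speed.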
Nesterov momentum is a refinement: instead of computing the gradient at the current position θ, it computes the gradient at the anticipated next position θ + βv. This "lookahead" gives a corrective signal that reduces overshooting. Its provable speedup is for convex problems with full-batch gradients; with stochastic gradients it typically gives a modest practical improvement over standard momentum.
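The lookahead is a one-line change to the momentum update. The sketch below uses an assumed quadratic with Hessian `diag(1, 50)` and hand-picked hyperparameters; `nesterov_step` is a name chosen for illustration.

```python
import numpy as np

H = np.diag([1.0, 50.0])  # elongated quadratic: f(theta) = 0.5 * theta^T H theta

def grad(theta):
    return H @ theta

def nesterov_step(theta, v, lr=0.02, beta=0.9):
    lookahead = theta + beta * v         # anticipated next position
    v = beta * v - lr * grad(lookahead)  # gradient evaluated at the lookahead point
    return theta + v, v

theta = np.array([10.0, 1.0])
v = np.zeros_like(theta)
initial_loss = 0.5 * theta @ H @ theta
for _ in range(100):
    theta, v = nesterov_step(theta, v)
final_loss = 0.5 * theta @ H @ theta
print(initial_loss, final_loss)
```

The only difference from standard momentum is where `grad` is evaluated: at `theta + beta * v` rather than at `theta`.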
Interactive demo: convergence on an elongated quadratic, with and without momentum. Momentum accumulates velocity along the ravine.
The learning rate is the most important hyperparameter, but a single global rate treats all parameters equally. Parameters connected to frequently-active features may need smaller rates; parameters connected to rare features may need larger ones. Adaptive methods give each parameter its own effective learning rate.
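AdaGrad divides each parameter's learning rate by the square root of the sum of all past squared gradients for that parameter: r ← r + g ⊙ g, then θ ← θ − (ε / (δ + √r)) ⊙ g, where r is the per-parameter accumulator, ⊙ is element-wise multiplication, and δ is a small constant for numerical stability.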
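Parameters with large accumulated gradients get their learning rate reduced. Parameters with small accumulated gradients keep a relatively large rate. This is great for sparse data (NLP, recommender systems) where some features appear rarely. The drawback is that the accumulated sum only grows, so every effective learning rate shrinks monotonically and can become too small long before training is done.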
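A minimal AdaGrad sketch on a synthetic gradient stream shows the per-parameter effect. The first coordinate gets a gradient on every step and the second only on every tenth step; this gradient pattern and the name `adagrad_step` are assumptions for illustration.

```python
import numpy as np

def adagrad_step(theta, r, g, lr=0.1, delta=1e-7):
    r = r + g * g                                  # running sum of squared gradients
    theta = theta - lr * g / (delta + np.sqrt(r))  # per-parameter scaled step
    return theta, r

theta = np.zeros(2)
r = np.zeros(2)
step_sizes = []
for t in range(100):
    # Coordinate 0 is "frequent", coordinate 1 is "rare" (active every 10th step)
    g = np.array([1.0, 1.0 if t % 10 == 0 else 0.0])
    new_theta, r = adagrad_step(theta, r, g)
    step_sizes.append(np.abs(new_theta - theta))
    theta = new_theta
step_sizes = np.array(step_sizes)

print("step 0: ", step_sizes[0])   # both coordinates start with the same step
print("step 90:", step_sizes[90])  # the rare coordinate now takes the larger step
```

By step 90 the frequent coordinate has accumulated far more squared gradient, so its step is smaller than the rare coordinate's even though both see the same gradient value on that step.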
RMSProp addresses AdaGrad's ever-shrinking learning rate by using an exponentially decaying average of squared gradients instead of the sum: r ← ρr + (1 − ρ) g ⊙ g.
The decay rate ρ (typically 0.9 or 0.99) means old gradients are forgotten. This gives RMSProp a "sliding window" view of recent gradient magnitudes, preventing the learning rate from decaying to zero. RMSProp was proposed by Hinton in a Coursera lecture — not a paper — and quickly became one of the most popular optimizers.
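The difference from a running sum shows up clearly under a constant gradient. This sketch, with made-up inputs, runs an RMSProp update for 1000 steps and compares its last step with the step an AdaGrad-style sum would give at the same base rate. The placement of δ inside the square root is one common convention, assumed here.

```python
import numpy as np

def rmsprop_step(theta, r, g, lr=0.01, rho=0.9, delta=1e-6):
    r = rho * r + (1 - rho) * g * g              # exponentially decaying average
    theta = theta - lr * g / np.sqrt(delta + r)
    return theta, r

theta, r = np.array([0.0]), np.array([0.0])
r_sum = np.array([0.0])  # AdaGrad-style running sum, tracked only for comparison
g = np.array([1.0])      # constant gradient, chosen for illustration

for _ in range(1000):
    prev = theta
    theta, r = rmsprop_step(theta, r, g)
    r_sum = r_sum + g * g

rmsprop_last_step = abs((theta - prev)[0])
adagrad_last_step = 0.01 / np.sqrt(r_sum[0])  # what a summed accumulator would allow

print(rmsprop_last_step, adagrad_last_step)
```

The decaying average settles at the squared gradient magnitude, so the RMSProp step stays near the base rate, while the summed accumulator keeps shrinking the step as 1/√t.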
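Adam (Adaptive Moment Estimation) combines the best of momentum and RMSProp. It maintains two running averages: a first moment s (mean of gradients, like momentum) and a second moment r (mean of squared gradients, like RMSProp). At step t these are updated as s ← β₁s + (1 − β₁)g and r ← β₂r + (1 − β₂) g ⊙ g, bias-corrected as ŝ = s / (1 − β₁^t) and r̂ = r / (1 − β₂^t), and applied as θ ← θ − ε ŝ / (√r̂ + δ).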
The bias correction (dividing by 1 − β^t, where t is the step count) is critical. Since s and r are initialized to zero, the early estimates are biased toward zero. The correction compensates for this, especially in the first few steps when β^t is still close to 1.
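A single Adam step makes the role of the bias correction visible. The sketch below assumes the commonly used defaults (β₁ = 0.9, β₂ = 0.999, δ = 1e-8) and a made-up gradient; `adam_step` is an illustrative name.

```python
import numpy as np

def adam_step(theta, s, r, g, t, lr=1e-3, beta1=0.9, beta2=0.999, delta=1e-8):
    s = beta1 * s + (1 - beta1) * g      # first moment (mean of gradients)
    r = beta2 * r + (1 - beta2) * g * g  # second moment (mean of squared gradients)
    s_hat = s / (1 - beta1 ** t)         # bias correction; t starts at 1
    r_hat = r / (1 - beta2 ** t)
    theta = theta - lr * s_hat / (np.sqrt(r_hat) + delta)
    return theta, s, r

theta0 = np.array([1.0, 1.0])
g = np.array([0.5, -2.0])  # gradients of very different magnitude
theta1, s1, r1 = adam_step(theta0, np.zeros(2), np.zeros(2), g, t=1)

print("raw first moment:", s1)      # only a tenth of g: biased toward zero
print("update:", theta1 - theta0)   # about lr in magnitude for both coordinates
```

After one step the raw first moment is only (1 − β₁)g, but the corrected estimate recovers g, and the normalized update has magnitude close to the learning rate for both coordinates despite the 4× difference in gradient size.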
AdamW is an important variant. Standard Adam with L2 regularization adds the weight decay term to the gradient, so it passes through the adaptive rescaling, and parameters with large gradients get less regularization. AdamW decouples weight decay from the gradient step, applying it directly to the parameters: θ ← (1 − ελ)θ − update, where λ is the weight decay coefficient (common implementations scale the decay by the learning rate ε, as written here). It is a common default optimizer for training transformers and many modern architectures.
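The difference between the two forms of weight decay is easiest to see when the loss gradient is zero, so only the decay acts. This is a sketch under assumed hyperparameters; the helper names are illustrative, and the decay is scaled by the learning rate as in common implementations.

```python
import numpy as np

def adam_direction(g, s, r, t, beta1=0.9, beta2=0.999, delta=1e-8):
    s = beta1 * s + (1 - beta1) * g
    r = beta2 * r + (1 - beta2) * g * g
    s_hat = s / (1 - beta1 ** t)
    r_hat = r / (1 - beta2 ** t)
    return s_hat / (np.sqrt(r_hat) + delta), s, r

def adam_l2_step(theta, s, r, g, t, lr=1e-3, wd=1e-2):
    # L2 penalty folded into the gradient: it gets rescaled by the adaptive denominator
    direction, s, r = adam_direction(g + wd * theta, s, r, t)
    return theta - lr * direction, s, r

def adamw_step(theta, s, r, g, t, lr=1e-3, wd=1e-2):
    # Decay applied directly to the parameters, outside the adaptive rescaling
    direction, s, r = adam_direction(g, s, r, t)
    return theta - lr * direction - lr * wd * theta, s, r

theta = np.array([2.0, -4.0])
zero_grad = np.zeros(2)  # no loss gradient, so only weight decay acts

theta_l2, _, _ = adam_l2_step(theta, np.zeros(2), np.zeros(2), zero_grad, t=1)
theta_w, _, _ = adamw_step(theta, np.zeros(2), np.zeros(2), zero_grad, t=1)

print("Adam + L2 change:", theta_l2 - theta)  # same-size step for both weights
print("AdamW change:    ", theta_w - theta)   # shrinkage proportional to each weight
```

With the penalty inside the gradient, Adam's normalization erases the weight's magnitude from the decay step. With decoupled decay, each weight shrinks by the same fraction.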
Interactive demo: SGD, momentum, RMSProp, and Adam converging on the same surface. Adam combines the benefits of momentum and RMSProp.
Even with adaptive optimizers, the base learning rate matters enormously. Starting too high causes divergence. Starting too low wastes compute on timid steps. Learning rate schedules prescribe how ε changes over training.
Step decay multiplies the learning rate by a factor (e.g., 0.1) at fixed epochs. Simple, effective, requires knowing roughly when to drop. Used in classic ImageNet training recipes (drop at epoch 30 and 60 of 90).
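Step decay fits in a few lines. The sketch below uses the drop points mentioned above (epochs 30 and 60); the base rate of 0.1 and the function name are assumptions.

```python
def step_decay(epoch, base_lr=0.1, factor=0.1, milestones=(30, 60)):
    """Multiply base_lr by `factor` once for every milestone already reached."""
    drops = sum(epoch >= m for m in milestones)
    return base_lr * factor ** drops

for epoch in (0, 29, 30, 59, 60, 89):
    print(epoch, step_decay(epoch))
```

The rate is piecewise constant: it holds for 30 epochs, drops by 10×, holds again, and drops once more at epoch 60.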
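Cosine annealing smoothly decreases the learning rate following a cosine curve from ε_max to ε_min: ε_t = ε_min + ½ (ε_max − ε_min)(1 + cos(πt / T)), where t is the current step and T is the total number of steps.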
Cosine annealing has become a common default in modern training. The rate falls slowly at first, so early training stays near the maximum rate; it drops fastest mid-training and then flattens out near the minimum, which smoothly cools the model down for fine-tuning at the end.
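A cosine schedule is a direct evaluation of a half cosine between a maximum and a minimum rate. The parameter names and default values below are illustrative assumptions.

```python
import math

def cosine_lr(t, T, lr_max=0.1, lr_min=0.0):
    """Learning rate at step t of T, following half a cosine period."""
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * t / T))

for t in (0, 10, 50, 90, 100):
    print(t, round(cosine_lr(t, T=100), 5))
```

The printed values change little between steps 0 and 10 and between 90 and 100, and change fastest around the midpoint, where the rate is halfway between the two extremes.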
Cyclical learning rates and warm restarts periodically reset the learning rate to a high value, exploring multiple basins of the loss landscape. This can be combined with snapshot ensembles: save the model at each cycle's end and ensemble the snapshots.
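One simple way to sketch warm restarts is to run a cosine schedule inside each cycle and reset it at the cycle boundary. The fixed cycle length here is a simplifying assumption (published variants often lengthen the cycles over time), and the snapshot rule is just "last epoch of each cycle".

```python
import math

def cosine_with_restarts(epoch, cycle_len, lr_max=0.1, lr_min=0.0):
    """Cosine decay within each cycle; the rate jumps back to lr_max at each restart."""
    t_cur = epoch % cycle_len  # position inside the current cycle
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * t_cur / cycle_len))

cycle_len = 25
schedule = [cosine_with_restarts(e, cycle_len) for e in range(100)]

# Epochs at which a snapshot would be saved for a snapshot ensemble:
# the last epoch of each cycle, where the learning rate is lowest.
snapshot_epochs = [e for e in range(100) if (e + 1) % cycle_len == 0]
print(snapshot_epochs)
```

Each snapshot is taken just before the rate resets, when the model has settled into whatever basin that cycle reached.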
Interactive demo: learning rate schedules over 100 epochs, with the "productive" training zone shaded.
The loss function of a deep network is a complex, high-dimensional surface. Understanding its geometry helps explain why optimization works (and when it fails).
Local minima were long feared as traps. In low dimensions, a random function has many isolated local minima. But in high dimensions, most critical points are saddle points: minima in some directions and maxima in others. Heuristically, if the sign of each of the n Hessian eigenvalues at a critical point were an independent coin flip, the probability that all n are positive (a true local minimum) would be 2^(−n), which is exponentially small in n.
Ill-conditioning is the most common problem. When the Hessian has a very large condition number (ratio of largest to smallest eigenvalue), the loss surface looks like a narrow ravine. The gradient points mostly across the ravine (steep direction), not along it (useful direction). This causes oscillation and slow progress.
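The cost of a large condition number can be measured on a toy quadratic. In this sketch the Hessian is `diag(1, kappa)`, the step size is set to 1/λ_max so the steep direction stays stable, and we count gradient descent steps until the loss falls below a tolerance. All of these choices are assumptions for illustration.

```python
import numpy as np

def gd_steps_to_converge(H, tol=1e-6, max_steps=100_000):
    eigvals = np.linalg.eigvalsh(H)
    lr = 1.0 / eigvals.max()  # step size is capped by the steepest direction
    theta = np.ones(H.shape[0])
    for step in range(max_steps):
        if 0.5 * theta @ H @ theta < tol:
            return step
        theta = theta - lr * (H @ theta)
    return max_steps

steps_needed = {}
for kappa in (10, 100, 1000):
    H = np.diag([1.0, float(kappa)])
    steps_needed[kappa] = gd_steps_to_converge(H)
    print(f"condition number {kappa}: {steps_needed[kappa]} steps")
```

The step size that is safe for the steep direction is far too small for the shallow one, so the step count grows roughly in proportion to the condition number.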
Flat vs sharp minima: Empirically, flat minima (low curvature around the minimum) tend to generalize better than sharp minima (high curvature). SGD with small batches tends to find flatter minima, which may explain why small-batch training generalizes better. This connects optimization to generalization in a deep way.
We covered batch normalization as a regularizer in Chapter 7. Here we look at its optimization benefits, which are arguably even more important.
BatchNorm reparameterizes the network in a way that makes the loss surface smoother. By normalizing each layer's inputs, it reduces the dependence between layers — updating one layer's weights does not dramatically shift the input distribution of the next layer.
Layer normalization normalizes across features within a single example (rather than across the batch). It does not depend on batch statistics, making it suitable for RNNs, transformers, and small-batch settings. LayerNorm is the standard normalization in transformers.
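The difference between BatchNorm and LayerNorm comes down to which axis the statistics are taken over. The sketch below shows training-mode forward passes only, on a made-up `(batch, features)` array, and omits the learnable scale and shift that both layers normally include.

```python
import numpy as np

def batch_norm(x, eps=1e-5):
    # Statistics over the batch axis (axis 0): one mean/variance per feature
    mean = x.mean(axis=0, keepdims=True)
    var = x.var(axis=0, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def layer_norm(x, eps=1e-5):
    # Statistics over the feature axis (axis 1): one mean/variance per example
    mean = x.mean(axis=1, keepdims=True)
    var = x.var(axis=1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
x = 3.0 * rng.normal(size=(8, 4)) + 5.0  # 8 examples, 4 features

# LayerNorm output for an example does not depend on the rest of the batch...
ln_alone = layer_norm(x[:1])
ln_in_batch = layer_norm(x)[:1]
# ...while BatchNorm output does.
bn_small_batch = batch_norm(x[:2])[:1]
bn_full_batch = batch_norm(x)[:1]
print(np.allclose(ln_alone, ln_in_batch), np.allclose(bn_small_batch, bn_full_batch))
```

This batch independence is why LayerNorm behaves the same at batch size 1 as at batch size 1000.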
Weight normalization reparameterizes each weight vector as w = g · v/||v||, decoupling the magnitude g from the direction v/||v||. This is simpler than BatchNorm but is generally reported to be less effective in practice.
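The reparameterization itself is a one-liner. This sketch, with made-up values, shows that the norm of w is set entirely by g and that rescaling v leaves w unchanged.

```python
import numpy as np

def weight_norm(v, g):
    """w = g * v / ||v||: g sets the length, v only sets the direction."""
    return g * v / np.linalg.norm(v)

v = np.array([3.0, 4.0])
w = weight_norm(v, g=2.0)
w_rescaled = weight_norm(10.0 * v, g=2.0)  # scaling v does not change w

print(w, np.linalg.norm(w))
```

In training, g and v would both be learned, with gradients flowing through this expression.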
Group normalization divides channels into groups and normalizes within each group. It works well with small batch sizes (common in detection and segmentation tasks where large images limit batch size). Instance normalization (the limiting case with one channel per group) is used in style transfer.
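Group normalization can be sketched as a reshape followed by per-group statistics. The `(N, C, H, W)` layout is an assumption, and the learnable per-channel scale and shift are omitted.

```python
import numpy as np

def group_norm(x, num_groups, eps=1e-5):
    n, c, h, w = x.shape
    assert c % num_groups == 0, "channels must divide evenly into groups"
    xg = x.reshape(n, num_groups, c // num_groups, h, w)
    # Statistics per example and per group, over that group's channels and pixels
    mean = xg.mean(axis=(2, 3, 4), keepdims=True)
    var = xg.var(axis=(2, 3, 4), keepdims=True)
    return ((xg - mean) / np.sqrt(var + eps)).reshape(n, c, h, w)

rng = np.random.default_rng(0)
x = rng.normal(size=(2, 4, 3, 3))

gn = group_norm(x, num_groups=2)             # two groups of two channels
instance_like = group_norm(x, num_groups=4)  # one channel per group
layer_like = group_norm(x, num_groups=1)     # all channels in one group
print(gn.shape)
```

No statistic here involves the batch axis, which is why the method is insensitive to batch size.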
Interactive demo: four optimizers race on a challenging 2D surface with a narrow valley, all starting from the same point. SGD oscillates in the valley; momentum smooths the path; Adam adapts per-parameter and converges fastest.
Optimization is the engine that drives all of deep learning. Here is where each concept connects:
| Concept | Where It Appears |
|---|---|
| SGD | Still the foundation. Vision training with CNNs (e.g., ResNets on ImageNet) often uses SGD + momentum, which is frequently reported to generalize better than Adam there; LLM training typically uses AdamW instead. |
| Adam / AdamW | Default for transformers (Ch 10, NLP), fine-tuning pretrained models, GANs, and most new architectures. |
| Learning rate warmup | Essential for transformer training. Also used in large-batch distributed training. |
| Cosine schedule | Standard in modern training recipes: warmup + cosine decay. Used in GPT-3, ViT, and many other foundation models; BERT used warmup with linear decay instead. |
| Gradient clipping | Critical for RNNs (Ch 10) to prevent exploding gradients. Also used in transformer training. |
| BatchNorm / LayerNorm | BatchNorm in CNNs (Ch 9), LayerNorm in transformers. Both smooth the loss surface and enable higher LR. |
| Loss surface geometry | Connects to generalization (Ch 7): flat minima generalize better. Connects to architecture: skip connections (ResNet) smooth the surface. |
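Gradient clipping, listed in the table, is commonly done by rescaling all gradients so that their combined norm does not exceed a threshold. A minimal sketch of that idea follows; the function name, threshold, and gradient values are made up for illustration.

```python
import numpy as np

def clip_by_global_norm(grads, max_norm):
    """Scale all gradients by one factor so their joint L2 norm is at most max_norm."""
    total_norm = np.sqrt(sum(np.sum(g * g) for g in grads))
    scale = min(1.0, max_norm / (total_norm + 1e-12))
    return [g * scale for g in grads], total_norm

# An "exploding" gradient and a small one
big = [np.array([30.0, 40.0]), np.array([[0.0, 120.0]])]
small = [np.array([0.3, 0.4])]

clipped_big, norm_big = clip_by_global_norm(big, max_norm=1.0)
clipped_small, norm_small = clip_by_global_norm(small, max_norm=1.0)

print(norm_big, np.sqrt(sum(np.sum(g * g) for g in clipped_big)))
print(norm_small, clipped_small)
```

Because a single scale factor is applied to every array, the direction of the overall gradient is preserved and only its length is capped.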
Up next: Chapter 9: Convolutional Networks — architectures that exploit spatial structure in data.