Gradient checks, optimizers, learning rate schedules, and hyperparameter search — everything between "I have a network" and "it actually works."
You've built a neural network. You've implemented backpropagation. You hit "train." The loss goes... nowhere. Or it explodes. Or it drops beautifully for 10 epochs then flatlines. What went wrong?
Training a neural network is not like compiling code. There's no error message that says "your learning rate is too high." There's no compiler warning for a broken gradient. You stare at a loss curve and try to read the tea leaves. And the difference between a network that works and one that doesn't is often a single hyperparameter choice.
Same network, same data, three different learning rates. Click "Run" to see how dramatically a single number changes the outcome. Green = good, orange = slow, red = diverged.
This lesson covers the full pipeline: verifying your gradients are correct, sanity-checking your setup before training, reading loss curves like a doctor reads vital signs, choosing and tuning optimizers, scheduling learning rates, and searching the hyperparameter space efficiently.
You've written backpropagation code. It runs. But is it correct? A subtle bug in the gradient computation won't crash your program — it'll just silently produce a model that learns slowly or learns the wrong thing. You need a way to verify that your analytic gradients match reality.
The idea is simple: compare your backprop gradient to a numerical gradient. The numerical gradient uses the definition of a derivative directly, with no chain rule and no clever math. It's slow (two forward passes per parameter with centered differences) but almost impossible to get wrong.
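Written out for a scalar function f and a small step size h, the numerical estimate takes this standard form (reconstructed from the surrounding description; f, x, and h are the only symbols assumed):

$$
\frac{df(x)}{dx} \approx \frac{f(x+h) - f(x-h)}{2h}
$$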
This is the centered difference formula. It's more accurate than the one-sided version (f(x+h) - f(x))/h because the error is O(h²) instead of O(h). Use h ≈ 1e-5.
To check if two gradient values agree, compute the relative error:
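One common definition, and the one used in the cs231n notes, divides the absolute difference by the larger of the two magnitudes. Here $g_a$ and $g_n$ are assumed names for the analytic and numerical gradient values:

$$
\text{relative error} = \frac{|g_a - g_n|}{\max(|g_a|,\ |g_n|)}
$$

As a rough rule of thumb from the same notes, a relative error above 1e-2 usually points to a wrong gradient, while 1e-7 or below is a comfortable match.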
We compute the analytic gradient (via backprop) and numerical gradient (via centered differences) for a simple function. Adjust h to see how step size affects accuracy.
A few more practical tips from cs231n: use double precision (float64) during gradient checks for better numerical accuracy. Start from a random point (not zero — symmetry can hide bugs). And always gradient-check with a small network on a tiny dataset — it's slow, so you only do it once during development.
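A gradient check along these lines can be sketched in a few lines of plain Python. This is a minimal illustration, not a general-purpose tool: the `loss` function and its hand-derived `analytic_grad` are hypothetical stand-ins for a real network and its backprop output, and Python floats are already double precision.

```python
import math
import random

def loss(w):
    # Hypothetical scalar loss for illustration: sum of w_i^2 plus sin(w_0 * w_1).
    return sum(x * x for x in w) + math.sin(w[0] * w[1])

def analytic_grad(w):
    # Hand-derived gradient of `loss`, standing in for a backprop result.
    g = [2.0 * x for x in w]
    c = math.cos(w[0] * w[1])
    g[0] += w[1] * c
    g[1] += w[0] * c
    return g

def numerical_grad(f, w, h=1e-5):
    # Centered differences: two evaluations of f per parameter.
    grad = []
    for i in range(len(w)):
        w_plus = list(w)
        w_plus[i] += h
        w_minus = list(w)
        w_minus[i] -= h
        grad.append((f(w_plus) - f(w_minus)) / (2.0 * h))
    return grad

def relative_error(a, n):
    # |a - n| / max(|a|, |n|), with a guard for the case where both are zero.
    denom = max(abs(a), abs(n))
    return 0.0 if denom == 0.0 else abs(a - n) / denom

random.seed(0)
w = [random.uniform(-1.0, 1.0) for _ in range(4)]  # random point, not zero
ga = analytic_grad(w)
gn = numerical_grad(loss, w)
max_err = max(relative_error(a, n) for a, n in zip(ga, gn))
print(f"max relative error: {max_err:.2e}")
```

If you deliberately break `analytic_grad` (for example, drop the cosine term), the reported error should jump by many orders of magnitude, which is exactly the signal a gradient check is meant to give.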
Before you invest hours of GPU time, two cheap sanity checks can catch many setup bugs in under a minute.
With small random weights, the class scores are nearly equal, so you can predict what the loss should be. For a softmax classifier with C classes, the initial loss should be −log(1/C) = log(C), using the natural logarithm. With 10 classes (CIFAR-10), that's log(10) ≈ 2.303. If your initial loss is 8.5 or 0.3, something is broken.
Add regularization and the initial loss should be slightly higher. If adding regularization doesn't change the loss at all, your regularization code is dead.
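The initial-loss check can be sketched without any framework. The snippet below is an illustration under stated assumptions: the "untrained network" is simulated by drawing tiny random scores, and the function names are made up for this example.

```python
import math
import random

def softmax_cross_entropy(scores, label):
    # Numerically stable: subtract the max score before exponentiating.
    m = max(scores)
    log_sum = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_sum - scores[label]

def expected_initial_loss(num_classes):
    # Uniform prediction over C classes gives -log(1/C) = log(C).
    return math.log(num_classes)

random.seed(0)
C = 10
losses = []
for _ in range(1000):
    # Hypothetical untrained model: scores are tiny random numbers.
    scores = [random.gauss(0.0, 0.01) for _ in range(C)]
    label = random.randrange(C)
    losses.append(softmax_cross_entropy(scores, label))

mean_loss = sum(losses) / len(losses)
print(f"measured {mean_loss:.3f}, expected {expected_initial_loss(C):.3f}")
```

In a real project you would replace the simulated scores with one forward pass of your actual model on a batch, with regularization switched off, and compare against `expected_initial_loss`.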
Take 5–10 training examples. Turn off regularization. Train until the loss hits zero (or near-zero). If your network can't memorize 5 examples, something is fundamentally broken — a bug in the data pipeline, a wrong loss function, or a broken backward pass.
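As a rough sketch of the overfitting test, here is a linear softmax classifier memorizing five random examples with plain gradient descent. The dataset, sizes, learning rate, and step count are all assumptions chosen for the illustration; a real network would use your own model and training loop.

```python
import math
import random

random.seed(0)
NUM_EXAMPLES, NUM_FEATURES, NUM_CLASSES = 5, 8, 3

# Hypothetical tiny dataset: random inputs with random labels.
X = [[random.gauss(0.0, 1.0) for _ in range(NUM_FEATURES)] for _ in range(NUM_EXAMPLES)]
y = [random.randrange(NUM_CLASSES) for _ in range(NUM_EXAMPLES)]

# Linear softmax classifier with small random weights and no regularization.
W = [[random.gauss(0.0, 0.01) for _ in range(NUM_FEATURES)] for _ in range(NUM_CLASSES)]

def probabilities(x):
    scores = [sum(w_j * x_j for w_j, x_j in zip(row, x)) for row in W]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def mean_loss():
    return -sum(math.log(probabilities(x)[t]) for x, t in zip(X, y)) / NUM_EXAMPLES

def predict(x):
    p = probabilities(x)
    return max(range(NUM_CLASSES), key=lambda c: p[c])

initial_loss = mean_loss()
lr = 0.2
for step in range(3000):
    # Full-batch gradient of the mean cross-entropy loss.
    grad = [[0.0] * NUM_FEATURES for _ in range(NUM_CLASSES)]
    for x, t in zip(X, y):
        p = probabilities(x)
        for c in range(NUM_CLASSES):
            coeff = (p[c] - (1.0 if c == t else 0.0)) / NUM_EXAMPLES
            for j in range(NUM_FEATURES):
                grad[c][j] += coeff * x[j]
    for c in range(NUM_CLASSES):
        for j in range(NUM_FEATURES):
            W[c][j] -= lr * grad[c][j]

final_loss = mean_loss()
accuracy = sum(predict(x) == t for x, t in zip(X, y)) / NUM_EXAMPLES
print(f"loss {initial_loss:.3f} -> {final_loss:.4f}, accuracy {accuracy:.0%}")
```

The point is the shape of the result rather than the exact numbers: the loss should fall far below its starting value and every training example should be classified correctly.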
Adjust the number of classes. The expected initial loss for softmax is log(C). If your actual initial loss is far from this, you have a bug.
Here's the checklist practitioners run through before every training job:
| Check | Expected | If Wrong |
|---|---|---|
| Initial loss (no reg) | log(C) for softmax | Bug in loss or model output |
| Initial loss (with reg) | Slightly higher | Regularization is dead code |
| Overfit 5 examples | Loss → 0, accuracy → 100% | Bug in backward pass or data |
| Full train, low reg | Loss decreases steadily | Learning rate issue |
Once training starts, the loss curve is your primary diagnostic tool. A healthy loss curve drops quickly at first, then gradually flattens. But real loss curves are rarely this clean. Learning to read them is like learning to read an EKG — each shape tells a story.
Select a scenario to see its characteristic loss curve shape, and learn to diagnose what's happening.
Good training: Loss drops steeply in the first few epochs, then curves smoothly toward a plateau. The gap between training loss and validation loss stays small.
Learning rate too high: Loss oscillates wildly or explodes to NaN. The optimizer is overshooting minima, bouncing around the loss landscape like a pinball.
Learning rate too low: Loss decreases, but at a glacial pace. You're taking tiny steps downhill. The curve is almost flat — it looks like nothing is happening. You'll run out of compute budget before converging.
Overfitting: Training loss keeps dropping, but validation loss bottoms out and starts increasing. The network is memorizing training examples instead of learning general patterns. The growing gap between the two curves is the telltale sign.
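The overfitting pattern in particular is easy to check programmatically. The sketch below uses a simple heuristic that is an assumption of this example, not a standard rule: flag a run when validation loss has not improved for a few epochs while training loss keeps falling. The curves are made-up numbers.

```python
def best_validation_epoch(val_losses):
    # Index of the lowest validation loss (first occurrence on ties).
    return min(range(len(val_losses)), key=lambda i: val_losses[i])

def looks_overfit(train_losses, val_losses, patience=3):
    # Heuristic: validation loss stalled for `patience` epochs while
    # training loss is still lower than at the best validation epoch.
    best = best_validation_epoch(val_losses)
    stalled = len(val_losses) - 1 - best >= patience
    still_fitting = train_losses[-1] < train_losses[best]
    return stalled and still_fitting

# Hypothetical curves for illustration only.
train = [2.3, 1.6, 1.1, 0.8, 0.6, 0.45, 0.33, 0.25]
val = [2.3, 1.7, 1.3, 1.1, 1.05, 1.10, 1.18, 1.27]

print("best epoch:", best_validation_epoch(val))
print("overfitting suspected:", looks_overfit(train, val))
```

The epoch returned by `best_validation_epoch` is also a natural checkpoint to keep if you stop training early.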
If you could tune only one hyperparameter, it should be the learning rate. Too high and you overshoot, too low and you waste days going nowhere. The right value depends on your model, your data, and your optimizer — but the symptoms of a wrong choice are universal.
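For reference, the vanilla gradient descent update in which the learning rate appears is usually written as follows, with θ standing for the parameters and L for the loss (both assumed symbols here):

$$
\theta \leftarrow \theta - \eta \, \nabla_\theta L(\theta)
$$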
That η (eta) is the learning rate. It controls the step size in parameter space. Think of it as the stride length of someone walking downhill blindfolded. Too long and you step over the valley. Too short and you'll be walking forever.
Watch gradient descent on a simple 1D quadratic loss. Drag the learning rate to see the three regimes: too low (crawling), just right (smooth convergence), too high (divergence).
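The three regimes can be reproduced in a few lines. This sketch assumes the loss f(x) = x², whose gradient is 2x, so each step multiplies x by (1 − 2η); the starting point and the specific learning rates are illustrative choices.

```python
def gradient_descent_1d(lr, x0=5.0, steps=50):
    # Minimize f(x) = x^2, whose gradient is 2x.
    x = x0
    for _ in range(steps):
        x -= lr * 2.0 * x
    return x

for lr in (0.001, 0.3, 1.1):
    print(f"lr={lr}: x after 50 steps = {gradient_descent_1d(lr):.4g}")
```

With the smallest rate, x has barely moved from its starting value; with the middle rate it is essentially at the minimum; with the largest rate each step overshoots by more than it corrects, so |x| grows without bound.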
A practical tip: if your loss plateaus but hasn't converged, try reducing the learning rate by a factor of 10. This is the simplest form of learning rate scheduling, and it works surprisingly often. We'll formalize this in Chapter 6.
Vanilla SGD has a problem: it treats every direction equally. If the loss landscape is shaped like a long narrow valley (and in high dimensions, it usually is), SGD oscillates back and forth across the narrow direction while making slow progress along the long direction. We need a smarter way to step.
Momentum fixes this by adding a "velocity" to the parameter updates. Instead of stepping directly in the gradient direction, you accumulate a running average of past gradients. This smooths out the oscillations and accelerates progress in consistent directions.
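One common way to write the momentum update, following the cs231n formulation, is shown below. The velocity symbol v and the step index t are assumptions of this reconstruction:

$$
v_{t+1} = \mu \, v_t - \eta \, \nabla_\theta L(\theta_t), \qquad \theta_{t+1} = \theta_t + v_{t+1}
$$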
Here μ is the momentum coefficient (typically 0.9). Think of a ball rolling downhill: it builds speed in the downhill direction and its inertia carries it through small bumps. The gradient is the slope acting on the ball, and μ controls how much of the previous velocity survives each step, so a lower μ behaves like more friction.
Nesterov accelerated gradient is a clever twist: instead of computing the gradient at the current position, you first take a step in the direction of the accumulated velocity, then compute the gradient at that "lookahead" point. This gives a better correction — you're correcting based on where you're about to be, not where you are.
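To make the difference concrete, here is a small sketch comparing plain gradient descent, momentum, and Nesterov momentum on an elongated quadratic bowl. The loss surface, starting point, and hyperparameters are assumptions picked for the illustration.

```python
import math

# f(x, y) = 0.5 * (1 * x^2 + 100 * y^2): a long, narrow valley.
CURVATURE = (1.0, 100.0)

def grad(p):
    return [c * v for c, v in zip(CURVATURE, p)]

def run(method, lr=0.01, mu=0.9, steps=200, start=(10.0, 1.0)):
    p = list(start)
    v = [0.0, 0.0]
    for _ in range(steps):
        if method == "sgd":
            g = grad(p)
            p = [pi - lr * gi for pi, gi in zip(p, g)]
        elif method == "momentum":
            g = grad(p)
            v = [mu * vi - lr * gi for vi, gi in zip(v, g)]
            p = [pi + vi for pi, vi in zip(p, v)]
        elif method == "nesterov":
            # Evaluate the gradient at the lookahead point p + mu * v.
            lookahead = [pi + mu * vi for pi, vi in zip(p, v)]
            g = grad(lookahead)
            v = [mu * vi - lr * gi for vi, gi in zip(v, g)]
            p = [pi + vi for pi, vi in zip(p, v)]
        else:
            raise ValueError(f"unknown method: {method}")
    # Distance from the minimum at the origin.
    return math.hypot(p[0], p[1])

distances = {m: run(m) for m in ("sgd", "momentum", "nesterov")}
for name, d in distances.items():
    print(f"{name:9s} distance to minimum after 200 steps: {d:.3g}")
```

The learning rate here is limited by the steep direction, which leaves plain gradient descent crawling along the shallow one; both momentum variants end up much closer to the minimum in the same number of steps.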
Adam combines the best of two ideas: momentum (first moment) and RMSProp (second moment). RMSProp keeps a running average of squared gradients, and divides the update by their square root. This means dimensions with large gradients get smaller steps, and dimensions with small gradients get larger steps — adapting the learning rate per-parameter.
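The Adam update is commonly written as follows. Apart from β₁, the symbols here (the gradient g, the second-moment decay β₂, the small constant ε, and the step count t) are assumptions of this reconstruction, chosen to match the standard presentation:

$$
\begin{aligned}
m &\leftarrow \beta_1 m + (1 - \beta_1)\, g \\
v &\leftarrow \beta_2 v + (1 - \beta_2)\, g^2 \\
\hat{m} &= \frac{m}{1 - \beta_1^{\,t}}, \qquad \hat{v} = \frac{v}{1 - \beta_2^{\,t}} \\
\theta &\leftarrow \theta - \eta \, \frac{\hat{m}}{\sqrt{\hat{v}} + \epsilon}
\end{aligned}
$$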
The hat notation (m̂, v̂) means bias-corrected estimates: m̂ = m / (1 − β₁^t) and v̂ = v / (1 − β₂^t), where t is the step count. Without this correction, the estimates are biased toward zero at the start of training because m and v are initialized to zero.
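A minimal sketch of Adam with bias correction, in plain Python, might look like this. The function name, the test surface, and the hyperparameter values are assumptions for illustration; the defaults for `beta1`, `beta2`, and `eps` are the commonly quoted ones.

```python
import math

def adam_minimize(grad_fn, start, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=500):
    theta = list(start)
    m = [0.0] * len(theta)  # first moment (running mean of gradients)
    v = [0.0] * len(theta)  # second moment (running mean of squared gradients)
    for t in range(1, steps + 1):
        g = grad_fn(theta)
        for i in range(len(theta)):
            m[i] = beta1 * m[i] + (1.0 - beta1) * g[i]
            v[i] = beta2 * v[i] + (1.0 - beta2) * g[i] * g[i]
            # Bias correction: m and v start at zero, so early values are too small.
            m_hat = m[i] / (1.0 - beta1 ** t)
            v_hat = v[i] / (1.0 - beta2 ** t)
            theta[i] -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta

# Hypothetical elongated bowl: f(x, y) = 0.5 * (x^2 + 100 * y^2).
def bowl_loss(p):
    return 0.5 * (p[0] ** 2 + 100.0 * p[1] ** 2)

def bowl_grad(p):
    return [p[0], 100.0 * p[1]]

start = [10.0, 1.0]
result = adam_minimize(bowl_grad, start)
print(f"loss {bowl_loss(start):.1f} -> {bowl_loss(result):.3g}")
```

Note that on the very first step the bias-corrected ratio is simply the sign of the gradient, so every parameter moves by about `lr` regardless of how large its gradient is. That is the per-parameter adaptation in its most visible form.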
Watch SGD, Momentum, and Adam navigate an elongated loss surface. Momentum smooths oscillations. Adam adapts per-dimension.
| Optimizer | Key Idea | Typical Use |
|---|---|---|
| SGD | Raw gradient step | Rarely used alone |
| SGD + Momentum | Accumulate velocity | ConvNets, when tuned well |
| RMSProp | Adapt per-parameter LR via squared gradients | RNNs (historically) |
| Adam | Momentum + RMSProp + bias correction | Default for most tasks |
A fixed learning rate is a compromise. Early in training, you want large steps to make rapid progress. Late in training, you want small steps to fine-tune into a precise minimum. Learning rate schedules give you both: they start high and decrease over time.
The simplest schedule: multiply the learning rate by a factor (like 0.1) every N epochs. Typical recipe: reduce by 10x at epochs 30, 60, and 90 for a 100-epoch run. You can see this in the loss curve as sudden drops after each LR reduction.
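In formula form, step decay is typically written as below, where η₀ is the initial learning rate and t is the current epoch (both assumed symbols in this reconstruction):

$$
\eta_t = \eta_0 \cdot \gamma^{\lfloor t / S \rfloor}
$$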
Where γ is the decay factor (e.g., 0.1) and S is the step size in epochs.
Cosine annealing smoothly reduces the learning rate following a half-cosine curve from η_max to η_min. No sharp drops, no hyperparameters to tune beyond the total number of epochs. It's become the default schedule in modern practice.
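The usual closed form for this schedule is shown below, with t the current epoch and T the total number of epochs (assumed symbols):

$$
\eta_t = \eta_{\min} + \tfrac{1}{2}\,(\eta_{\max} - \eta_{\min})\left(1 + \cos\!\left(\frac{\pi t}{T}\right)\right)
$$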
Learning rate warmup starts training with a very small LR and linearly increases it to the target value over the first few epochs. Why? With random weights, the early gradients can be unreliable. Large steps based on bad gradients can put the model in a region of parameter space from which it never recovers. Warmup lets the model "find its footing" first.
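The three schedules discussed here can each be written as a small function of the epoch number. This is a sketch under assumptions: the function names, default values, and the choice to combine warmup with cosine decay are illustrative, not a prescribed recipe.

```python
import math

def step_decay(epoch, base_lr=0.1, gamma=0.1, step_size=30):
    # Multiply by gamma once every `step_size` epochs.
    return base_lr * gamma ** (epoch // step_size)

def cosine_annealing(epoch, total_epochs, lr_max=0.1, lr_min=0.0):
    # Half-cosine from lr_max (epoch 0) down to lr_min (epoch == total_epochs).
    progress = epoch / total_epochs
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))

def warmup_then_cosine(epoch, total_epochs, warmup_epochs=5, lr_max=0.1, lr_min=0.0):
    if epoch < warmup_epochs:
        # Linear ramp; epoch 0 gets lr_max / warmup_epochs rather than zero.
        return lr_max * (epoch + 1) / warmup_epochs
    return cosine_annealing(epoch - warmup_epochs, total_epochs - warmup_epochs, lr_max, lr_min)

for epoch in (0, 4, 5, 30, 60, 99):
    print(
        f"epoch {epoch:3d}: step={step_decay(epoch):.5f} "
        f"cosine={cosine_annealing(epoch, 100):.5f} "
        f"warmup+cosine={warmup_then_cosine(epoch, 100):.5f}"
    )
```

In a training loop you would call one of these at the start of each epoch and pass the result to the optimizer as its current learning rate.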
Compare different schedules over 100 epochs. The y-axis is the learning rate at each epoch.
Time to see everything in action. Three optimizers — vanilla SGD, SGD with Momentum, and Adam — all start at the same point on the same 2D loss surface. Watch their trajectories diverge as each follows its own strategy to reach the minimum.
Three optimizers, one loss surface. Click "Race!" to watch them compete. Change the surface and speed to explore different landscapes.
Notice how the differences become most dramatic on the elongated bowl. This shape is common in real neural networks: the loss landscape has very different curvatures in different directions. SGD wastes energy oscillating across the steep direction. Momentum dampens the oscillation. Adam recognizes that one dimension has large gradients and automatically shrinks its step there.
On the Rosenbrock valley, all optimizers struggle with the flat, curved bottom. But Adam and Momentum make steady progress while SGD barely moves. On the Beale function (with its irregular shape), Adam's adaptive per-parameter rates shine.
You have a dozen knobs: learning rate, regularization strength, batch size, number of layers, neurons per layer, dropout rate, momentum... How do you find good values? You search.
The naive approach is grid search: try every combination on a regular grid. LR in {0.1, 0.01, 0.001} times reg in {0.1, 0.01, 0.001} = 9 experiments. But this is wasteful. If one hyperparameter matters much more than the other (and one usually does), grid search wastes most of its budget testing irrelevant values of the unimportant one.
Random search (Bergstra & Bengio, 2012) is better: sample each hyperparameter independently from a distribution. With the same budget of 9 experiments, random search tests 9 different values of each hyperparameter instead of 3. You get more coverage where it matters.
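The coverage argument is easy to verify by counting. In this sketch the ranges and the log-scale sampling are assumptions chosen to mirror the grid values; the point is only how many distinct values of each hyperparameter get tested.

```python
import random

random.seed(0)

# Grid search: 3 x 3 = 9 experiments.
grid_values = [0.1, 0.01, 0.001]
grid = [(lr, reg) for lr in grid_values for reg in grid_values]

# Random search: 9 experiments, each hyperparameter sampled independently.
random_points = [
    (10 ** random.uniform(-3, -1), 10 ** random.uniform(-3, -1))
    for _ in range(9)
]

distinct_lr_grid = len({lr for lr, _ in grid})
distinct_lr_random = len({lr for lr, _ in random_points})
print(f"grid:   {len(grid)} runs, {distinct_lr_grid} distinct learning rates")
print(f"random: {len(random_points)} runs, {distinct_lr_random} distinct learning rates")
```

If the learning rate is the hyperparameter that matters, the grid has effectively run three experiments three times each, while random search has run nine different ones.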
Both methods try 25 points. The horizontal axis is the "important" hyperparameter, the vertical is less important. Random search covers the important axis much better. Click to resample.
Learning rates span orders of magnitude (1e-5 to 1e-1). Sampling uniformly over that range gives you mostly large values: about 90% of the draws land between 1e-2 and 1e-1. Instead, sample on a log scale: pick a random number between -5 and -1, then raise 10 to that power.
```python
import numpy as np

lr = 10 ** np.random.uniform(-5, -1)   # e.g. 3.2e-4
reg = 10 ** np.random.uniform(-5, -1)  # e.g. 1.7e-3
```
Don't spend your entire budget on a single search. Run a coarse search over a wide range (e.g., LR from 1e-5 to 1e-1) with a small epoch budget. Look at the results. Then zoom in on the promising region (e.g., LR from 1e-4 to 1e-3) and run a finer search with a bigger epoch budget.
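A coarse-to-fine search can be sketched as two rounds of log-scale random search. Everything specific here is an assumption for illustration: `toy_validation_loss` is a made-up stand-in for "train briefly and measure validation loss", and the zoom width of half a decade on each side is an arbitrary choice.

```python
import math
import random

def toy_validation_loss(lr):
    # Hypothetical stand-in for a short training run:
    # a bowl in log10(lr) with its minimum at lr = 3e-4.
    return (math.log10(lr) - math.log10(3e-4)) ** 2

def random_search(low_exp, high_exp, trials, rng):
    # Sample learning rates log-uniformly and return the best one found.
    samples = [10 ** rng.uniform(low_exp, high_exp) for _ in range(trials)]
    return min(samples, key=toy_validation_loss)

rng = random.Random(0)

# Stage 1: coarse search over the full range.
coarse_best = random_search(-5, -1, trials=20, rng=rng)

# Stage 2: zoom in around the coarse winner.
center = math.log10(coarse_best)
fine_best = random_search(center - 0.5, center + 0.5, trials=20, rng=rng)

# Keep whichever stage produced the better result.
best_lr = min((coarse_best, fine_best), key=toy_validation_loss)
print(f"coarse best: {coarse_best:.2e}, fine best: {fine_best:.2e}, chosen: {best_lr:.2e}")
```

With a real model, the coarse stage would use a small epoch budget per trial and the fine stage a larger one, since the fine stage is where small differences between candidates start to matter.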
You've learned the full training pipeline: verify, train, and tune. But there are a few more tricks that can squeeze out additional performance without changing your architecture at all.
Ensembling is the simplest way to improve any model: train several models independently and average their predictions. Different random initializations lead to different local minima, and averaging their outputs smooths out individual errors. A 5-model ensemble typically improves accuracy by 1–2%.
Variations include: training the same architecture with different random seeds, using models from different points during training (snapshot ensembles via cyclic LR), or averaging the weights of the last few checkpoints (Polyak averaging).
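The basic prediction-averaging form of ensembling is only a few lines. In this sketch the per-model probabilities are made-up numbers for a single example with three classes, and the function names are assumptions.

```python
def average_predictions(per_model_probs):
    # per_model_probs[m][c] is model m's probability for class c.
    num_models = len(per_model_probs)
    num_classes = len(per_model_probs[0])
    return [
        sum(probs[c] for probs in per_model_probs) / num_models
        for c in range(num_classes)
    ]

def predict(probs):
    # Index of the most probable class.
    return max(range(len(probs)), key=lambda c: probs[c])

# Hypothetical outputs from three models on one example.
model_probs = [
    [0.50, 0.40, 0.10],
    [0.20, 0.70, 0.10],
    [0.45, 0.40, 0.15],
]

ensemble = average_predictions(model_probs)
print("individual predictions:", [predict(p) for p in model_probs])
print("ensemble prediction:", predict(ensemble))
```

In this example two of the three models narrowly prefer class 0, but the averaged probabilities favor class 1 because the one dissenting model is much more confident. Averaging probabilities, rather than taking a majority vote, lets confidence count.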
Why train from scratch when someone else already trained a massive model on millions of images? Transfer learning takes a pretrained model (e.g., ResNet trained on ImageNet), freezes the early layers (which learn generic features like edges and textures), and fine-tunes only the last few layers on your specific dataset. This works spectacularly well even with tiny datasets.
| Technique | What It Does | When to Use |
|---|---|---|
| Gradient Check | Verifies backprop correctness | Once, during development |
| Sanity Check | Catches setup bugs early | Before every training run |
| LR Schedules | Starts fast, fine-tunes slowly | Always (cosine is the default) |
| Adam | Adaptive per-parameter LR | Default optimizer for most tasks |
| Random Search | Efficient hyperparameter exploration | When tuning more than 2 hyperparams |
| Ensembles | Average multiple models | When you need every last percent |
| Transfer Learning | Start from pretrained weights | When your dataset is small |