Deep Learning Foundations

Training Neural Nets
The Art of Learning

Gradient checks, optimizers, learning rate schedules, and hyperparameter search — everything between "I have a network" and "it actually works."

Prerequisites: Neural Networks Part 1 & 2 (architecture + backprop). That's it.
10 Chapters · 8+ Simulations · 0 Assumed Knowledge

Chapter 0: Why Training Is an Art

You've built a neural network. You've implemented backpropagation. You hit "train." The loss goes... nowhere. Or it explodes. Or it drops beautifully for 10 epochs then flatlines. What went wrong?

Training a neural network is not like compiling code. There's no error message that says "your learning rate is too high." There's no compiler warning for a broken gradient. You stare at a loss curve and try to read the tea leaves. And the difference between a network that works and one that doesn't is often a single hyperparameter choice.

The uncomfortable truth: Training neural networks is part science, part craft. The math tells you what to compute; experience tells you how to make it actually converge. This lesson gives you the practitioner's toolkit — the checks, tricks, and intuitions that turn broken training runs into working models.
The Training Lottery

Same network, same data, three different learning rates. Click "Run" to see how dramatically a single number changes the outcome. Green = good, orange = slow, red = diverged.

This lesson covers the full pipeline: verifying your gradients are correct, sanity-checking your setup before training, reading loss curves like a doctor reads vital signs, choosing and tuning optimizers, scheduling learning rates, and searching the hyperparameter space efficiently.

1. Verify: Gradient checks & sanity checks
2. Train: Optimizer + learning rate + schedule
3. Tune: Hyperparameter search & ensembles
Why is training neural networks often called "an art"?

Chapter 1: Gradient Checking — Trust but Verify

You've written backpropagation code. It runs. But is it correct? A subtle bug in the gradient computation won't crash your program — it'll just silently produce a model that learns slowly or learns the wrong thing. You need a way to verify that your analytic gradients match reality.

The idea is simple: compare your backprop gradient to a numerical gradient. The numerical gradient uses the definition of a derivative directly: no chain rule, no clever math. It's slow (two forward passes per parameter with the centered formula below) but almost impossible to get wrong.

∂f/∂x ≈ (f(x + h) − f(x − h)) / (2h)

This is the centered difference formula. It's more accurate than the one-sided version (f(x+h) - f(x))/h because the error is O(h²) instead of O(h). Use h ≈ 1e-5.

To check if two gradient values agree, compute the relative error:

relative error = |f'_analytic − f'_numeric| / max(|f'_analytic|, |f'_numeric|)

Rules of thumb: Relative error < 1e-7 means you're golden. Between 1e-7 and 1e-5 is okay but suspicious: check for kinks (ReLU). Between 1e-5 and 1e-3 deserves a hard look, especially if the function has no kinks. Above 1e-3 means your gradient is almost certainly wrong.
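Here is a minimal sketch of the whole check in NumPy. The toy function `f`, its hand-derived gradient `analytic_grad`, and the helper names are assumptions made for illustration; in practice `f` would be your network's loss and `analytic_grad` your backprop code.

```python
import numpy as np

def f(x):
    # Toy scalar function of a vector: sum of sigmoid(x)^2
    s = 1.0 / (1.0 + np.exp(-x))
    return np.sum(s ** 2)

def analytic_grad(x):
    # Hand-derived gradient: d/dx s^2 = 2 * s * s * (1 - s)
    s = 1.0 / (1.0 + np.exp(-x))
    return 2.0 * s * s * (1.0 - s)

def numerical_grad(f, x, h=1e-5):
    # Centered difference, one coordinate at a time (two evaluations of f each)
    grad = np.zeros_like(x)
    for i in range(x.size):
        old = x[i]
        x[i] = old + h
        f_plus = f(x)
        x[i] = old - h
        f_minus = f(x)
        x[i] = old  # restore the original value
        grad[i] = (f_plus - f_minus) / (2.0 * h)
    return grad

def relative_error(a, b):
    # Elementwise |a - b| / max(|a|, |b|), guarded against division by zero
    denom = np.maximum(np.abs(a), np.abs(b))
    return np.abs(a - b) / np.maximum(denom, 1e-12)

rng = np.random.default_rng(0)
x = rng.standard_normal(5)  # random starting point, float64 by default
g_analytic = analytic_grad(x)
g_numeric = numerical_grad(f, x)
max_rel_err = relative_error(g_analytic, g_numeric).max()
print(max_rel_err)
```

For a smooth function like this one, the largest relative error should land comfortably in the "golden" range. If you deliberately break `analytic_grad` (say, drop the factor of 2), the error jumps by many orders of magnitude, which is exactly the signal you are looking for.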
Gradient Check Simulator

We compute the analytic gradient (via backprop) and numerical gradient (via centered differences) for a simple function. Adjust h to see how step size affects accuracy.

When gradient checks fail: If you're using ReLU, the gradient is undefined at exactly 0. A numerical check that straddles x=0 will get the wrong answer — that's the function's fault, not yours. Also: always disable dropout and use a fixed random seed during gradient checking. And don't forget to check the bias gradients too, not just weights.

A few more practical tips from cs231n: use double precision (float64) during gradient checks for better numerical accuracy. Start from a random point (not zero — symmetry can hide bugs). And always gradient-check with a small network on a tiny dataset — it's slow, so you only do it once during development.

Why use the centered difference (f(x+h) - f(x-h)) / 2h instead of the one-sided (f(x+h) - f(x)) / h?

Chapter 2: Sanity Checks — Before You Train

Before you invest hours of GPU time, two cheap sanity checks can catch a large share of bugs in under a minute.

Check 1: Loss at initialization

With small random weights, the network's predictions are roughly uniform across classes, so you can predict what the loss should be. For a softmax classifier with C classes, the initial loss should be −log(1/C) = log(C). With 10 classes (CIFAR-10), that's log(10) ≈ 2.303. If your initial loss is 8.5 or 0.3, something is broken.
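A quick way to run this check, sketched with a linear softmax classifier on made-up data. The shapes, the 0.001 weight scale, and the function name `softmax_cross_entropy` are assumptions for illustration, not part of any particular library.

```python
import numpy as np

def softmax_cross_entropy(logits, labels):
    # Mean cross-entropy; subtract the row max for numerical stability
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

rng = np.random.default_rng(0)
num_classes, dim, n = 10, 32, 256
X = rng.standard_normal((n, dim))                    # made-up inputs
y = rng.integers(0, num_classes, size=n)             # made-up labels
W = 0.001 * rng.standard_normal((dim, num_classes))  # small random init

initial_loss = softmax_cross_entropy(X @ W, y)
expected = np.log(num_classes)
print(initial_loss, expected)
```

The two printed numbers should agree to a couple of decimal places. A large mismatch usually points at the loss implementation, the output layer, or an initialization scale that is far too big.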

Add regularization and the initial loss should be slightly higher. If adding regularization doesn't change the loss at all, your regularization code is dead.

Check 2: Overfit a tiny batch

Take 5–10 training examples. Turn off regularization. Train until the loss hits zero (or near-zero). If your network can't memorize 5 examples, something is fundamentally broken — a bug in the data pipeline, a wrong loss function, or a broken backward pass.

The overfit-tiny-batch test is sacred. It costs almost nothing and catches a staggering number of bugs: wrong label encoding, mismatched dimensions, accidentally frozen layers, bad data augmentation. Do this every single time before launching a full training run.
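The test is easy to script. The sketch below memorizes five made-up examples with a linear softmax classifier and plain gradient descent; the data, sizes, learning rate, and step count are all assumptions chosen to keep the example tiny. With a real network you would swap in your own model and training step.

```python
import numpy as np

rng = np.random.default_rng(0)
n, dim, num_classes = 5, 20, 3
X = rng.standard_normal((n, dim))         # 5 made-up examples
y = rng.integers(0, num_classes, size=n)  # arbitrary labels
W = np.zeros((dim, num_classes))          # zero init is fine for a linear model
b = np.zeros(num_classes)

def forward(X, W, b):
    # Softmax probabilities, with the row max subtracted for stability
    logits = X @ W + b
    logits = logits - logits.max(axis=1, keepdims=True)
    probs = np.exp(logits)
    return probs / probs.sum(axis=1, keepdims=True)

lr = 0.1
for step in range(1000):
    probs = forward(X, W, b)
    # Gradient of mean cross-entropy w.r.t. logits: (probs - one_hot) / n
    dlogits = probs.copy()
    dlogits[np.arange(n), y] -= 1.0
    dlogits /= n
    W -= lr * (X.T @ dlogits)   # no regularization on purpose
    b -= lr * dlogits.sum(axis=0)

probs = forward(X, W, b)
final_loss = -np.log(probs[np.arange(n), y]).mean()
accuracy = (probs.argmax(axis=1) == y).mean()
print(final_loss, accuracy)
```

If the loss refuses to approach zero on a batch this small, stop and debug before touching the full dataset.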
Sanity Check: Expected Initial Loss

Adjust the number of classes. The expected initial loss for softmax is log(C). If your actual initial loss is far from this, you have a bug.

Here's the checklist practitioners run through before every training job:

Check | Expected | If Wrong
Initial loss (no reg) | log(C) for softmax | Bug in loss or model output
Initial loss (with reg) | Slightly higher | Regularization is dead code
Overfit 5 examples | Loss → 0, accuracy → 100% | Bug in backward pass or data
Full train, low reg | Loss decreases steadily | Learning rate issue
Your 10-class softmax classifier starts with an initial loss of 7.5. What does this tell you?

Chapter 3: Loss Curves — Reading the Vital Signs

Once training starts, the loss curve is your primary diagnostic tool. A healthy loss curve drops quickly at first, then gradually flattens. But real loss curves are rarely this clean. Learning to read them is like learning to read an EKG — each shape tells a story.

Loss Curve Gallery

Select a scenario to see its characteristic loss curve shape, and learn to diagnose what's happening.

What to look for

Good training: Loss drops steeply in the first few epochs, then curves smoothly toward a plateau. The gap between training loss and validation loss stays small.

Learning rate too high: Loss oscillates wildly or explodes to NaN. The optimizer is overshooting minima, bouncing around the loss landscape like a pinball.

Learning rate too low: Loss decreases, but at a glacial pace. You're taking tiny steps downhill. The curve is almost flat — it looks like nothing is happening. You'll run out of compute budget before converging.

Overfitting: Training loss keeps dropping, but validation loss bottoms out and starts increasing. The network is memorizing training examples instead of learning general patterns. The growing gap between the two curves is the telltale sign.

Read the train/val gap together with the loss level. Small gap with both losses high = underfitting (increase model capacity or train longer). Large gap = overfitting (add regularization, get more data, or reduce capacity). Small gap with both losses low = healthy training. Both losses stuck near their initial value = the model isn't learning at all (check your setup).
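You can automate the most common diagnosis. The helper below is a hypothetical sketch (the name `diagnose`, the `patience` threshold, and the curves are all made up for illustration): it finds the epoch with the best validation loss, measures the final gap, and flags likely overfitting.

```python
def diagnose(train_loss, val_loss, patience=3):
    """Return (best_epoch, final_gap, overfitting).

    best_epoch:  index of the lowest validation loss
    final_gap:   last validation loss minus last training loss
    overfitting: True if the best validation loss is at least `patience`
                 epochs old and validation loss has risen since then
    """
    best_epoch = min(range(len(val_loss)), key=lambda i: val_loss[i])
    epochs_since_best = len(val_loss) - 1 - best_epoch
    final_gap = val_loss[-1] - train_loss[-1]
    overfitting = (epochs_since_best >= patience
                   and val_loss[-1] > val_loss[best_epoch])
    return best_epoch, final_gap, overfitting

# Made-up curves for illustration only
train = [2.30, 1.60, 1.10, 0.80, 0.60, 0.45, 0.33, 0.25]
val = [2.31, 1.70, 1.30, 1.15, 1.12, 1.18, 1.27, 1.40]
best_epoch, final_gap, overfitting = diagnose(train, val)
print(best_epoch, final_gap, overfitting)
```

Here validation loss bottoms out at epoch 4 while training loss keeps falling, so the helper flags overfitting. The checkpoint from the best epoch is the one worth keeping.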
Your training loss is decreasing nicely, but validation loss stopped improving and is now rising. What's happening?

Chapter 4: Learning Rate — The Most Important Hyperparameter

If you could tune only one hyperparameter, it should be the learning rate. Too high and you overshoot, too low and you waste days going nowhere. The right value depends on your model, your data, and your optimizer — but the symptoms of a wrong choice are universal.

w ← w − η · ∇L(w)

That η (eta) is the learning rate. It controls the step size in parameter space. Think of it as the stride length of someone walking downhill blindfolded. Too long and you step over the valley. Too short and you'll be walking forever.
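The three regimes are easy to reproduce on a toy problem. This sketch assumes the loss L(w) = w², whose gradient is 2w, so each update multiplies w by (1 − 2η). The function name `run_gd` and the specific learning rates are illustrative choices.

```python
def run_gd(lr, w0=5.0, steps=20):
    # Gradient descent on L(w) = w^2, whose gradient is 2w
    w = w0
    for _ in range(steps):
        w = w - lr * 2.0 * w
    return w

for lr in (0.01, 0.3, 1.1):
    print(lr, run_gd(lr))
```

With η = 0.01 the parameter has barely moved after 20 steps, with η = 0.3 it is essentially at the minimum, and with η = 1.1 each step overshoots by more than it corrects, so |w| grows without bound. For this particular loss, any η above 1 diverges.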

Learning Rate Explorer

Watch gradient descent on a simple 1D quadratic loss. Drag the learning rate to see the three regimes: too low (crawling), just right (smooth convergence), too high (divergence).

The Goldilocks zone: Good learning rates typically live between 1e-4 and 1e-2 for SGD, and around 3e-4 for Adam. But these are just starting points. The learning rate finder (by Leslie Smith) sweeps the LR from tiny to huge over one epoch, plotting loss vs LR. A good LR sits where the loss is falling fastest, which is comfortably below the point where the curve bottoms out and the loss starts exploding.

A practical tip: if your loss plateaus but hasn't converged, try reducing the learning rate by a factor of 10. This is the simplest form of learning rate scheduling, and it works surprisingly often. We'll formalize this in Chapter 6.

Your loss oscillates wildly during training, sometimes spiking to very large values. The most likely cause is:

Chapter 5: Momentum & Adam — Smarter Steps

Vanilla SGD has a problem: it treats every direction equally. If the loss landscape is shaped like a long narrow valley (and in high dimensions, it usually is), SGD oscillates back and forth across the narrow direction while making slow progress along the long direction. We need a smarter way to step.

SGD + Momentum

Momentum fixes this by adding a "velocity" to the parameter updates. Instead of stepping directly in the gradient direction, you accumulate an exponentially decaying sum of past gradients. This smooths out the oscillations and accelerates progress in consistent directions.

v ← μ · v − η · ∇L(w)
w ← w + v

Here μ is the momentum coefficient (typically 0.9). Think of a ball rolling downhill: it builds speed in the downhill direction and its inertia carries it through small bumps. The gradient is the slope, and μ plays the role of friction: it decays the velocity a little at every step, so a smaller μ means more damping.

Nesterov Momentum

Nesterov accelerated gradient is a clever twist: instead of computing the gradient at the current position, you first take a step in the direction of the accumulated velocity, then compute the gradient at that "lookahead" point. This gives a better correction — you're correcting based on where you're about to be, not where you are.

v ← μ · v − η · ∇L(w + μ · v)
w ← w + v
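Both update rules fit in a few lines. The sketch below compares plain SGD, momentum, and Nesterov on an elongated quadratic bowl; the curvatures (1 and 50), starting point, learning rate, and step count are assumptions picked for illustration.

```python
import numpy as np

curv = np.array([1.0, 50.0])  # L(w) = 0.5 * sum(curv * w**2): shallow and steep axes

def loss(w):
    return 0.5 * np.sum(curv * w ** 2)

def grad(w):
    return curv * w

def sgd(w, lr, steps):
    for _ in range(steps):
        w = w - lr * grad(w)
    return w

def momentum(w, lr, mu, steps):
    v = np.zeros_like(w)
    for _ in range(steps):
        v = mu * v - lr * grad(w)           # accumulate velocity
        w = w + v
    return w

def nesterov(w, lr, mu, steps):
    v = np.zeros_like(w)
    for _ in range(steps):
        v = mu * v - lr * grad(w + mu * v)  # gradient at the lookahead point
        w = w + v
    return w

w0 = np.array([10.0, 1.0])
lr, mu, steps = 0.02, 0.9, 100
loss_sgd = loss(sgd(w0, lr, steps))
loss_mom = loss(momentum(w0, lr, mu, steps))
loss_nag = loss(nesterov(w0, lr, mu, steps))
print(loss_sgd, loss_mom, loss_nag)
```

The learning rate is capped by the steep axis, so plain SGD crawls along the shallow one. Both momentum variants end with a much lower loss after the same number of steps because velocity keeps building in the consistent direction.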

Adam (Adaptive Moment Estimation)

Adam combines the best of two ideas: momentum (first moment) and RMSProp (second moment). RMSProp keeps a running average of squared gradients, and divides the update by their square root. This means dimensions with large gradients get smaller steps, and dimensions with small gradients get larger steps — adapting the learning rate per-parameter.

m ← β1 · m + (1 − β1) · g     (first moment / mean)
v ← β2 · v + (1 − β2) · g²     (second moment / variance)
w ← w − η · m̂ / (√v̂ + ε)

The hat notation (m̂, v̂) means bias-corrected estimates: m̂ = m / (1 − β1^t) and v̂ = v / (1 − β2^t), where t is the step count starting at 1. Without this correction, the estimates are biased toward zero at the start of training because m and v are initialized to zero.
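To make the update concrete, here is a sketch of a single Adam step for one scalar parameter, written directly from the equations above. The function name `adam_step` and its argument layout are illustrative, not any library's API.

```python
import math

def adam_step(w, g, m, v, t, lr=3e-4, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter. `t` is the 1-based step count."""
    m = beta1 * m + (1.0 - beta1) * g      # first moment (running mean of g)
    v = beta2 * v + (1.0 - beta2) * g * g  # second moment (running mean of g^2)
    m_hat = m / (1.0 - beta1 ** t)         # bias correction
    v_hat = v / (1.0 - beta2 ** t)
    w = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    return w, m, v

# First step from zero-initialized moments, with two very different gradient scales
w_small, _, _ = adam_step(w=1.0, g=0.001, m=0.0, v=0.0, t=1)
w_large, _, _ = adam_step(w=1.0, g=1000.0, m=0.0, v=0.0, t=1)
print(1.0 - w_small, 1.0 - w_large)
```

On the very first step the bias correction gives m̂ = g and v̂ = g², so the step size is roughly η whether the gradient is 0.001 or 1000. That scale invariance is the per-parameter adaptation at work.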

Common Adam settings: β1 = 0.9, β2 = 0.999, ε = 1e-8. The original paper and most libraries default to η = 1e-3; η = 3e-4 is a popular starting point in practice. These work surprisingly well across a huge range of problems. Adam is the "just works" optimizer: it's what you should reach for first unless you have a reason not to.
Optimizer Comparison: 2D Contour

Watch SGD, Momentum, and Adam navigate an elongated loss surface. Momentum smooths oscillations. Adam adapts per-dimension.

Optimizer | Key Idea | Typical Use
SGD | Raw gradient step | Rarely used alone
SGD + Momentum | Accumulate velocity | ConvNets, when tuned well
RMSProp | Adapt per-parameter LR via squared gradients | RNNs (historically)
Adam | Momentum + RMSProp + bias correction | Default for most tasks
What problem does momentum solve that vanilla SGD suffers from?

Chapter 6: Learning Rate Schedules — Changing Speed Mid-Race

A fixed learning rate is a compromise. Early in training, you want large steps to make rapid progress. Late in training, you want small steps to fine-tune into a precise minimum. Learning rate schedules give you both: they start high and decrease over time.

Step Decay

The simplest schedule: multiply the learning rate by a factor (like 0.1) every N epochs. Typical recipe: reduce by 10x at epochs 30, 60, and 90 for a 100-epoch run. You can see this in the loss curve as sudden drops after each LR reduction.

η(t) = η0 · γ^⌊t / S⌋

Where γ is the decay factor (e.g., 0.1) and S is the step size in epochs.
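As a sketch, the formula is a one-liner. The defaults below (η0 = 0.1, γ = 0.1, S = 30) are assumptions chosen to match the "drop 10x at epochs 30, 60, 90" recipe.

```python
def step_decay(epoch, lr0=0.1, gamma=0.1, step_size=30):
    # lr0 * gamma ** floor(epoch / step_size)
    return lr0 * gamma ** (epoch // step_size)

for epoch in (0, 29, 30, 60, 90):
    print(epoch, step_decay(epoch))
```

The rate stays flat within each 30-epoch block and drops by a factor of 10 at each boundary.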

Cosine Annealing

Cosine annealing smoothly reduces the learning rate following a half-cosine curve from ηmax to ηmin. No sharp drops, and nothing to tune beyond the two endpoints and the total number of epochs. It has become a common default schedule in modern practice.

η(t) = ηmin + ½(ηmax − ηmin)(1 + cos(π · t / T))
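In code, a direct transcription of the formula looks like this. The function name and the default endpoints (1e-3 and 0) are placeholders, not recommendations.

```python
import math

def cosine_annealing(t, total, lr_max=1e-3, lr_min=0.0):
    # Half-cosine from lr_max at t = 0 down to lr_min at t = total
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * t / total))

for t in (0, 25, 50, 75, 100):
    print(t, cosine_annealing(t, total=100))
```

The schedule starts at ηmax, passes through the midpoint of the two endpoints halfway through training, and lands on ηmin at the end. It changes slowly at both ends and fastest in the middle.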

Warmup

Learning rate warmup starts training with a very small LR and linearly increases it to the target value over the first few epochs. Why? With random weights, the early gradients can be unreliable. Large steps based on bad gradients can put the model in a region of parameter space from which it never recovers. Warmup lets the model "find its footing" first.

Modern recipe: Linear warmup for 5–10% of training, then cosine decay to zero. Warmup followed by decay (cosine or linear, depending on the model) is the schedule family used by most large-scale models (BERT, GPT, ViT). It's simple, robust, and requires minimal tuning.
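A minimal sketch of linear warmup followed by cosine decay to zero, indexed by training step. The function name, the peak rate of 3e-4, and the 5% warmup fraction are illustrative assumptions.

```python
import math

def warmup_cosine(step, total_steps, lr_peak=3e-4, warmup_frac=0.05):
    warmup_steps = max(1, int(warmup_frac * total_steps))
    if step < warmup_steps:
        # Linear ramp that reaches lr_peak on the last warmup step
        return lr_peak * (step + 1) / warmup_steps
    # Cosine decay from lr_peak to zero over the remaining steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * lr_peak * (1.0 + math.cos(math.pi * progress))

total = 1000
schedule = [warmup_cosine(s, total) for s in range(total + 1)]
print(schedule[0], schedule[49], schedule[500], schedule[-1])
```

The rate climbs linearly for the first 50 steps, peaks, and then follows the half-cosine down to zero at the final step.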
Learning Rate Schedule Viewer

Compare different schedules over 100 epochs. The y-axis is the learning rate at each epoch.

Why do modern training recipes start with a learning rate warmup?

Chapter 7: Optimizer Race — The Showcase

Time to see everything in action. Three optimizers — vanilla SGD, SGD with Momentum, and Adam — all start at the same point on the same 2D loss surface. Watch their trajectories diverge as each follows its own strategy to reach the minimum.

What to watch for: SGD (orange) will oscillate across the narrow valley. Momentum (teal) will overshoot but correct and converge faster. Adam (purple) will adapt its step size per-dimension and typically find the smoothest path. Try different surfaces and speed settings.
Optimizer Race: SGD vs Momentum vs Adam

Three optimizers, one loss surface. Click "Race!" to watch them compete. Change the surface and speed to explore different landscapes.


Notice how the differences become most dramatic on the elongated bowl. This shape is common in real neural networks: the loss landscape has very different curvatures in different directions. SGD wastes energy oscillating across the steep direction. Momentum dampens the oscillation. Adam recognizes that one dimension has large gradients and automatically shrinks its step there.

On the Rosenbrock valley, all optimizers struggle with the flat, curved bottom. But Adam and Momentum make steady progress while SGD barely moves. On the Beale function (with its irregular shape), Adam's adaptive per-parameter rates shine.

The takeaway: Adam is not always the fastest to the exact minimum, but it's almost always the most robust. It's hard to make Adam fail catastrophically, while SGD requires careful LR tuning. This is why Adam is the default — it's the optimizer that "just works."

Chapter 8: Hyperparameter Search — Finding the Sweet Spot

You have a dozen knobs: learning rate, regularization strength, batch size, number of layers, neurons per layer, dropout rate, momentum... How do you find good values? You search.

Grid Search vs Random Search

The naive approach is grid search: try every combination on a regular grid. LR in {0.1, 0.01, 0.001} times reg in {0.1, 0.01, 0.001} = 9 experiments. But this is wasteful. If one hyperparameter matters much more than the other (and one usually does), grid search wastes most of its budget testing irrelevant values of the unimportant one.

Random search (Bergstra & Bengio, 2012) is better: sample each hyperparameter independently from a distribution. With the same budget of 9 experiments, random search tests 9 different values of each hyperparameter instead of 3. You get more coverage where it matters.
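The coverage argument is easy to check by counting. This sketch builds both sets of nine trials; the ranges and the choice to sample on a log scale over [1e-3, 1e-1] are assumptions made to mirror the grid above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Grid: 3 x 3 = 9 trials, but only 3 distinct values per hyperparameter
grid_lr = [0.1, 0.01, 0.001]
grid_reg = [0.1, 0.01, 0.001]
grid_trials = [(lr, reg) for lr in grid_lr for reg in grid_reg]

# Random: 9 trials, each with its own independently drawn lr and reg
random_trials = [
    (10 ** rng.uniform(-3, -1), 10 ** rng.uniform(-3, -1)) for _ in range(9)
]

distinct_grid_lrs = len({lr for lr, _ in grid_trials})
distinct_random_lrs = len({lr for lr, _ in random_trials})
print(distinct_grid_lrs, distinct_random_lrs)
```

Same budget, three times as many distinct learning rates. If the learning rate is the hyperparameter that matters, that is three times as many useful probes.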

Random beats grid. This isn't just folklore: Bergstra & Bengio backed it with both a theoretical argument and experiments. When some hyperparameters are more important than others (which is almost always the case), random search finds good values with fewer trials.
Grid vs Random Search

Both methods try 25 points. The horizontal axis is the "important" hyperparameter, the vertical is less important. Random search covers the important axis much better. Click to resample.

Log-scale for learning rates

Learning rates span orders of magnitude (1e-5 to 1e-1). Sampling uniformly gives you mostly large values: about 90% of uniform draws from [1e-5, 1e-1] land above 1e-2. Instead, sample on a log scale: pick a random number between -5 and -1, then raise 10 to that power.

```python
import numpy as np

# Sample the exponent uniformly, then exponentiate: uniform on a log scale
lr = 10 ** np.random.uniform(-5, -1)   # e.g. 3.2e-4
reg = 10 ** np.random.uniform(-5, -1)  # e.g. 1.7e-3
```

Coarse-to-fine

Don't spend your entire budget on a single search. Run a coarse search over a wide range (e.g., LR from 1e-5 to 1e-1) with a small epoch budget. Look at the results. Then zoom in on the promising region (e.g., LR from 1e-4 to 1e-3) and run a finer search with a bigger epoch budget.

Coarse Search: wide range, few epochs, many trials
↓ identify promising region
Fine Search: narrow range, more epochs, fewer trials
↓ pick the best
Final Run: best hyperparameters, full training
Watch the boundaries. If the best learning rate from your search is at the edge of the range you searched (e.g., 1e-5 when you searched [1e-5, 1e-1]), your optimal value might be outside your range. Widen the search and try again.
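The coarse-to-fine loop, including the boundary check, can be sketched as follows. `fake_val_loss` is a hypothetical stand-in for "train briefly and return validation loss"; its minimum, the trial counts, and the half-decade zoom window are all assumptions for illustration.

```python
import numpy as np

def fake_val_loss(lr):
    # Hypothetical objective with its minimum placed at lr = 10**-3.3
    return (np.log10(lr) + 3.3) ** 2

def random_search(low_exp, high_exp, trials, rng):
    # Sample learning rates on a log scale and return the best one found
    lrs = 10 ** rng.uniform(low_exp, high_exp, size=trials)
    losses = [fake_val_loss(lr) for lr in lrs]
    return lrs[int(np.argmin(losses))]

rng = np.random.default_rng(0)

# Coarse pass over the full range
coarse_best = random_search(-5, -1, trials=20, rng=rng)
center = np.log10(coarse_best)

# Boundary check: a winner near the edge of the range means "widen and retry"
near_edge = center < -5 + 0.25 or center > -1 - 0.25

# Fine pass: half a decade on either side of the coarse winner
fine_best = random_search(center - 0.5, center + 0.5, trials=20, rng=rng)

# Keep whichever candidate scored better
best = min([coarse_best, fine_best], key=fake_val_loss)
print(coarse_best, fine_best, best, near_edge)
```

In a real search each call to the objective would be a short training run, and the fine pass would get a larger epoch budget than the coarse one.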
Why is random search better than grid search for hyperparameter tuning?

Chapter 9: Connections — Beyond a Single Model

You've learned the full training pipeline: verify, train, and tune. But there are a few more tricks that can squeeze out additional performance without changing your architecture at all.

Model Ensembles

Ensembling is the simplest way to improve any model: train several models independently and average their predictions. Different random initializations lead to different local minima, and averaging their outputs smooths out individual errors. A 5-model ensemble often improves accuracy by a percent or two, though the gain varies by task.

Variations include: training the same architecture with different random seeds, using models from different points during training (snapshot ensembles via cyclic LR), or averaging the weights of the last few checkpoints (Polyak averaging).
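Prediction averaging itself is only a couple of lines. The probabilities below are made up for three hypothetical models on four examples with three classes; in practice each slice would be the softmax output of one trained model.

```python
import numpy as np

# Shape: (models, examples, classes). All values are made up for illustration.
model_probs = np.array([
    [[0.6, 0.3, 0.1], [0.2, 0.5, 0.3], [0.40, 0.35, 0.25], [0.1, 0.2, 0.7]],
    [[0.5, 0.4, 0.1], [0.3, 0.4, 0.3], [0.30, 0.45, 0.25], [0.2, 0.2, 0.6]],
    [[0.7, 0.2, 0.1], [0.2, 0.3, 0.5], [0.30, 0.40, 0.30], [0.1, 0.1, 0.8]],
])

ensemble_probs = model_probs.mean(axis=0)      # average over models
ensemble_pred = ensemble_probs.argmax(axis=1)  # predicted class per example
single_pred = model_probs[0].argmax(axis=1)    # first model on its own
print(ensemble_pred, single_pred)
```

On the third example the first model alone picks class 0, but the other two outvote it and the ensemble picks class 1. Averaging probabilities rather than hard votes lets confident models count for more.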

Transfer Learning

Why train from scratch when someone else already trained a massive model on millions of images? Transfer learning takes a pretrained model (e.g., ResNet trained on ImageNet), freezes the early layers (which learn generic features like edges and textures), and fine-tunes only the last few layers on your specific dataset. This works spectacularly well even with tiny datasets.

Technique | What It Does | When to Use
Gradient Check | Verifies backprop correctness | Once, during development
Sanity Check | Catches setup bugs early | Before every training run
LR Schedules | Starts fast, fine-tunes slowly | Always (cosine is the default)
Adam | Adaptive per-parameter LR | Default optimizer for most tasks
Random Search | Efficient hyperparameter exploration | When tuning more than 2 hyperparams
Ensembles | Average multiple models | When you need every last percent
Transfer Learning | Start from pretrained weights | When your dataset is small
The full training recipe: (1) Gradient check your backprop. (2) Sanity check the loss at init. (3) Overfit a tiny batch. (4) Start with Adam + cosine schedule. (5) Random search over LR and regularization on log scale. (6) Coarse-to-fine. (7) Train the best config to convergence. (8) Ensemble if you have budget.

John Tukey: "An approximate answer to the right problem is worth a good deal more than an exact answer to an approximate problem." Good training practice — the right optimizer, the right schedule, the right search strategy — is how you make sure you're solving the right problem well.
What's the simplest way to improve a trained model's performance without changing the architecture?