Architecture is ready — now make it actually trainable.
In Part 1, we built a neural network: layers of neurons, activation functions, forward passes. The architecture is ready. But if you just throw raw data at it with random weights and hit "train," you'll likely get garbage.
Training a neural network is like tuning a complex instrument. The architecture is the instrument itself — but you also need to tune the strings (weight initialization), prepare the sheet music (data preprocessing), keep the sound balanced (batch normalization), prevent feedback loops (regularization), and choose how to score the performance (loss function).
Both networks have the same architecture (2 hidden layers, 16 neurons each). Left: poor setup (no preprocessing, zero init). Right: good setup (normalized data, He init). Watch the loss curves diverge.
This lesson covers every decision you need to make before calling your training loop: how to prepare the data, how to set initial weights, how to keep activations healthy, how to prevent overfitting, and how to measure error. These aren't minor details — they're the foundation of successful training.
Before a single gradient is computed, your data needs to be cleaned up. Raw features come in all shapes: pixel intensities from 0 to 255, prices in the thousands, ages from 0 to 100. If one feature is 100x larger than another, the loss landscape becomes a long, narrow valley that gradient descent struggles to navigate.
Mean subtraction is the simplest fix. Compute the mean of each feature across your training set, then subtract it from every data point. This centers the data cloud around the origin. For images, you might subtract a mean image (a separate mean for each pixel position, computed over the training set) or a single global mean.
Normalization goes further: after centering, divide by the standard deviation. Now each feature has zero mean and unit variance. The loss landscape becomes more circular, and gradient descent converges faster.
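The two steps above (center, then scale) can be sketched in NumPy. This is a minimal illustration, not a library API: the function names `fit_standardizer` and `standardize`, the small `eps` guard, and the synthetic three-feature dataset are all assumptions made for the example. The statistics are computed on the training set only and then reused for the test set.

```python
import numpy as np

def fit_standardizer(X_train, eps=1e-8):
    """Per-feature mean and std, computed on the training set only."""
    mean = X_train.mean(axis=0)
    std = X_train.std(axis=0) + eps  # eps guards against constant features
    return mean, std

def standardize(X, mean, std):
    """Center each feature, then scale it to unit variance."""
    return (X - mean) / std

rng = np.random.default_rng(0)

def make_data(n):
    # Hypothetical features on very different scales:
    # pixel intensity (0-255), price (thousands), age (0-100).
    return np.column_stack([
        rng.uniform(0, 255, n),
        rng.normal(5000, 1500, n),
        rng.uniform(0, 100, n),
    ])

X_train, X_test = make_data(1000), make_data(200)

mean, std = fit_standardizer(X_train)
X_train_n = standardize(X_train, mean, std)
X_test_n = standardize(X_test, mean, std)  # reuse the training statistics

print("train mean:", X_train_n.mean(axis=0).round(6))
print("train std: ", X_train_n.std(axis=0).round(6))
```

The test set will not come out at exactly zero mean and unit variance, and that is expected: it is transformed with the training statistics so that both sets live in the same coordinate system.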
Left: raw data creates an elongated loss contour. Right: normalized data creates circular contours. The gradient points directly toward the minimum.
PCA whitening is a more aggressive transform. First, PCA rotates the data so axes align with the directions of greatest variance. Then whitening divides each axis by the square root of its eigenvalue (the standard deviation along that axis), making all directions have unit variance. The result: a spherical data cloud with an identity covariance matrix. In practice, standard normalization is almost always sufficient. PCA whitening is rarely needed for deep networks because the first layer can learn its own rotation.
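A minimal NumPy sketch of PCA whitening follows, assuming the data fits in memory and the covariance matrix is small enough to eigendecompose directly. The function name `pca_whiten`, the `eps` term, and the synthetic correlated data are illustrative choices. Each rotated axis is divided by the square root of its eigenvalue, which is the standard deviation along that axis.

```python
import numpy as np

def pca_whiten(X, eps=1e-5):
    """Rotate data onto its principal axes, then scale each axis to unit variance."""
    X_centered = X - X.mean(axis=0)
    cov = X_centered.T @ X_centered / X_centered.shape[0]
    # eigh is the appropriate routine for a symmetric matrix such as a covariance.
    eigvals, eigvecs = np.linalg.eigh(cov)
    X_rot = X_centered @ eigvecs            # decorrelate (PCA rotation)
    return X_rot / np.sqrt(eigvals + eps)   # eps avoids dividing by ~0

rng = np.random.default_rng(0)
# Synthetic correlated 2-D data: a stretched, tilted Gaussian cloud.
A = np.array([[3.0, 0.0],
              [2.0, 0.5]])
X = rng.normal(size=(5000, 2)) @ A

X_white = pca_whiten(X)
cov_white = X_white.T @ X_white / X_white.shape[0]
print(cov_white.round(3))  # close to the identity matrix
```

The `eps` term also hints at the main practical risk: directions with very small eigenvalues are mostly noise, and whitening amplifies them.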
You've preprocessed your data. Now, what values do the weights start at? This matters enormously. The wrong initialization can kill training before it begins.
All zeros: If every weight is zero, every neuron computes the same output. They all receive the same gradient. They all update identically. They stay identical forever. This is called the symmetry problem — the network has many neurons but they all do the same thing, as if you had just one.
Small random: Initialize weights from a small Gaussian, say N(0, 0.01). This works for shallow networks, but for deep ones the activations shrink layer by layer — by layer 10, they're effectively zero. Gradients vanish. Training stalls.
Too large: Initialize from N(0, 1.0). Now activations explode — they saturate sigmoid/tanh or blow up with ReLU. Gradients explode or vanish. Also unusable.
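Xavier initialization (Glorot & Bengio, 2010): Set the weight variance to $1/n_{in}$, where $n_{in}$ is the number of inputs to the layer. (The original paper averages fan-in and fan-out, $2/(n_{in} + n_{out})$; $1/n_{in}$ is the common simplification.) This keeps the variance of activations roughly constant across layers. For a layer with $n_{in}$ inputs:

$$\mathrm{Var}(w) = \frac{1}{n_{in}}$$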
He initialization (He et al., 2015): For ReLU networks, half the activations are zeroed out, so we need to compensate by doubling the variance:

$$\mathrm{Var}(w) = \frac{2}{n_{in}}$$
Watch how activation variance changes layer by layer with different initializations. Ideal: the bars stay roughly the same height.
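As a minimal sketch of the two schemes, assuming Gaussian weights and the 1/n_in form of Xavier initialization described above, the helpers below draw a weight matrix and check its empirical variance. The names `xavier_init` and `he_init` and the layer size of 512 are illustrative.

```python
import numpy as np

def xavier_init(n_in, n_out, rng):
    """Gaussian weights with Var(w) = 1 / n_in."""
    return rng.normal(0.0, np.sqrt(1.0 / n_in), size=(n_in, n_out))

def he_init(n_in, n_out, rng):
    """Gaussian weights with Var(w) = 2 / n_in, intended for ReLU layers."""
    return rng.normal(0.0, np.sqrt(2.0 / n_in), size=(n_in, n_out))

rng = np.random.default_rng(0)
n_in, n_out = 512, 512
W_xavier = xavier_init(n_in, n_out, rng)
W_he = he_init(n_in, n_out, rng)

print("Xavier: empirical var * n_in =", W_xavier.var() * n_in)  # near 1
print("He:     empirical var * n_in =", W_he.var() * n_in)      # near 2
```

Note that `rng.normal` takes a standard deviation, not a variance, which is why the square root appears in both helpers.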
Even with good initialization, activations can drift during training. As weights update, the distribution of inputs to each layer shifts — a problem called internal covariate shift. Later layers must constantly adapt to a moving target.
Batch normalization (Ioffe & Szegedy, 2015) fixes this by normalizing activations within each mini-batch. For each feature in a layer, compute the batch mean $\mu$ and variance $\sigma^2$, then normalize (a small constant $\epsilon$ keeps the division stable):

$$\hat{x} = \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}}$$

But wait: if we always force activations to zero mean and unit variance, we're limiting what the network can represent. Maybe the optimal activation distribution isn't standard normal. So batch norm adds two learnable parameters, a scale γ and a shift β:

$$y = \gamma \hat{x} + \beta$$
The network can learn to undo the normalization entirely (by setting γ = σ and β = μ), or it can keep the normalized version, or anything in between. The key insight: the network chooses its activation distribution rather than having it imposed by upstream weight changes.
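The forward pass can be sketched in NumPy as below. This is a simplified illustration, not a framework implementation: the class name `BatchNorm1D` is hypothetical, there is no backward pass, and the running-average convention (`momentum` weights the old value) is an assumption of this sketch, since libraries differ on it. The second half checks the claim above that setting γ = σ and β = μ undoes the normalization.

```python
import numpy as np

class BatchNorm1D:
    """Forward-only batch norm over inputs of shape (batch, features)."""

    def __init__(self, num_features, momentum=0.9, eps=1e-5):
        self.gamma = np.ones(num_features)    # learnable scale
        self.beta = np.zeros(num_features)    # learnable shift
        self.running_mean = np.zeros(num_features)
        self.running_var = np.ones(num_features)
        self.momentum = momentum              # weight on the old running value
        self.eps = eps

    def forward(self, x, training):
        if training:
            mean, var = x.mean(axis=0), x.var(axis=0)
            m = self.momentum
            self.running_mean = m * self.running_mean + (1 - m) * mean
            self.running_var = m * self.running_var + (1 - m) * var
        else:
            # Inference: use statistics accumulated during training.
            mean, var = self.running_mean, self.running_var
        x_hat = (x - mean) / np.sqrt(var + self.eps)
        return self.gamma * x_hat + self.beta

rng = np.random.default_rng(0)
x = rng.normal(5.0, 3.0, size=(128, 4))  # activations far from zero mean, unit variance

bn = BatchNorm1D(4)
out = bn.forward(x, training=True)
print("normalized mean:", out.mean(axis=0).round(6))
print("normalized std: ", out.std(axis=0).round(4))

# Undo the normalization: gamma = sigma, beta = mu.
bn_undo = BatchNorm1D(4)
bn_undo.gamma = np.sqrt(x.var(axis=0) + bn_undo.eps)
bn_undo.beta = x.mean(axis=0)
recovered = bn_undo.forward(x, training=True)

# Inference mode works even on a single example, because no batch statistics are needed.
out_eval = bn.forward(x[:1], training=False)
```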
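At inference time there may be no batch to compute statistics from, so batch norm uses running averages of the mean and variance accumulated during training. Remember to call model.eval() in PyTorch before inference: it switches batch norm from batch statistics to running statistics.

Without batch norm, activations drift as training progresses (left). With batch norm, they stay centered and stable (right). Drag the epoch slider to see how distributions evolve.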
| Benefit | Why |
|---|---|
| Faster training | Allows higher learning rates without divergence |
| Reduces sensitivity to init | Normalization dampens the effect of bad starting weights |
| Slight regularization | Batch statistics add noise, acting like mini-dropout |
A neural network with enough parameters can memorize the entire training set — every image, every label, every noise artifact. It achieves near-zero training loss but fails catastrophically on new data. This is overfitting: the model has learned the training data's noise instead of its underlying patterns.
Regularization is any technique that fights overfitting by constraining the model's complexity. The most common approach: add a penalty to the loss function that discourages large weights.
L2 regularization (weight decay) adds the sum of squared weights to the loss:

$$L = L_{\text{data}} + \lambda \sum_i w_i^2$$
The gradient of the penalty is 2λw — it pushes every weight toward zero proportionally to its magnitude. Large weights get penalized heavily. The result: the network prefers many small weights over a few large ones, which produces smoother, more generalizable functions.
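The 2λw gradient can be checked numerically. The sketch below assumes the penalty is λ ∑ w² with no factor of 1/2 (which is what makes the gradient 2λw), and the helper names are illustrative. It also shows why L2 is called weight decay: one plain gradient step on the penalty alone multiplies every weight by (1 − 2·lr·λ).

```python
import numpy as np

def l2_penalty(W, lam):
    """L2 penalty: lam * sum of squared weights."""
    return lam * np.sum(W ** 2)

def l2_penalty_grad(W, lam):
    """Analytic gradient of the penalty: 2 * lam * W."""
    return 2.0 * lam * W

def numerical_grad(f, W, h=1e-5):
    """Central-difference gradient of a scalar function f at W."""
    grad = np.zeros_like(W)
    for idx in np.ndindex(W.shape):
        old = W[idx]
        W[idx] = old + h
        f_plus = f(W)
        W[idx] = old - h
        f_minus = f(W)
        W[idx] = old  # restore the original value
        grad[idx] = (f_plus - f_minus) / (2 * h)
    return grad

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 3))
lam, lr = 1e-2, 0.1

analytic = l2_penalty_grad(W, lam)
numeric = numerical_grad(lambda M: l2_penalty(M, lam), W)
print("max abs difference:", np.abs(analytic - numeric).max())

# One gradient step on the penalty alone shrinks every weight by the same factor.
W_new = W - lr * analytic
print("shrink factor:", 1 - 2 * lr * lam)
```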
L1 regularization adds the sum of absolute values: λ ∑ |wᵢ|. Unlike L2, L1 pushes weights all the way to exactly zero, producing sparse networks where many weights are inactive. This is useful for feature selection but less common in deep learning.
Max norm constraints clip the weight vector of each neuron if its norm exceeds a threshold c: if ||w|| > c, rescale w → c · w / ||w||. This bounds the maximum capacity directly.
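A minimal sketch of the max norm projection, assuming weights are stored with shape (n_in, n_out) so that each column holds one neuron's incoming weights. The function name `apply_max_norm` and the example threshold are illustrative. In practice this would be applied after each optimizer step.

```python
import numpy as np

def apply_max_norm(W, c):
    """Rescale any column of W whose L2 norm exceeds c back onto the ball of radius c."""
    norms = np.linalg.norm(W, axis=0, keepdims=True)
    # Scale factor is 1 for columns already inside the ball, c / norm otherwise.
    scale = np.minimum(1.0, c / np.maximum(norms, 1e-12))
    return W * scale

rng = np.random.default_rng(0)
# Five neurons with increasingly large incoming weights.
W = rng.normal(size=(8, 5)) * np.array([0.1, 0.5, 1.0, 2.0, 4.0])
c = 2.0

W_clipped = apply_max_norm(W, c)
norms_before = np.linalg.norm(W, axis=0)
norms_after = np.linalg.norm(W_clipped, axis=0)
print("before:", norms_before.round(2))
print("after: ", norms_after.round(2))
```

Only the direction of an oversized weight vector is preserved; its length is set to c. Neurons already under the threshold are left untouched.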
A network fits noisy data. Increase λ to see L2 regularization smoothing the fit. Too little: overfitting. Too much: underfitting.
| Method | Penalty | Effect on Weights | Use Case |
|---|---|---|---|
| L2 | λ ∑ w² | Shrinks toward zero | Default for deep learning |
| L1 | λ ∑ \|w\| | Drives to exactly zero | Feature selection, sparsity |
| Max norm | Clip if ‖w‖ > c | Bounds magnitude | Stability with high LR |
Dropout (Srivastava et al., 2014) is a beautifully simple idea: during each training step, randomly "turn off" each neuron with probability p (typically 0.5). Set its output to zero. Gone. The remaining neurons must learn to be useful on their own, without relying on any specific partner.
During training, we create a random binary mask for each layer and element-wise multiply it with the activations. But this introduces a problem at test time: if we used all neurons (no dropout), the expected activation would be 1/(1−p) times larger than during training, because every neuron is active instead of only a fraction (1−p) of them.
Inverted dropout is the standard fix: during training, divide the surviving activations by (1−p) to compensate. Now the expected value stays the same whether dropout is on or off, and at test time you simply use all neurons without modification.
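```python
import numpy as np

p = 0.5                      # drop probability
h = np.random.randn(4, 8)    # example hidden activations

# Inverted dropout during training
mask = np.random.rand(*h.shape) > p   # keep each unit with probability 1 - p
h = h * mask / (1 - p)                # scale up survivors

# At test time: just use h as-is. No scaling needed.
```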
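To see that inverted dropout keeps the expected activation unchanged, the sketch below wraps the same logic in a function with a training flag and compares the mean activation in both modes. The function name `dropout_forward` and the all-ones input are illustrative choices that make the effect easy to read.

```python
import numpy as np

def dropout_forward(h, p, training, rng):
    """Inverted dropout: drop with probability p in training, identity at test time."""
    if not training or p == 0.0:
        return h
    mask = rng.random(h.shape) > p   # True for units that survive
    return h * mask / (1.0 - p)      # rescale so the expected value matches test time

rng = np.random.default_rng(0)
h = np.ones((1000, 1000))
p = 0.5

h_train = dropout_forward(h, p, training=True, rng=rng)
h_test = dropout_forward(h, p, training=False, rng=rng)

print("fraction dropped:   ", (h_train == 0).mean())
print("mean while training:", h_train.mean())  # close to the test-time mean
print("mean at test time:  ", h_test.mean())
```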
A 3-layer network with dropout. Click "Drop" to randomly zero out neurons (gray). Each forward pass uses a different random subset of the network.
The loss function tells the network how wrong it is. Every choice of loss function encodes a different belief about what "wrong" means. Pick the wrong one, and you're optimizing for the wrong thing.
For classification, the two dominant losses are:
Softmax cross-entropy (the standard for multi-class classification): First, convert raw scores (logits) $s$ into probabilities via the softmax function. Then measure how far the predicted distribution is from the true label $y$ using cross-entropy:

$$p_j = \frac{e^{s_j}}{\sum_k e^{s_k}}, \qquad L = -\log p_y$$

If the network is confident and correct ($p_y$, the probability of the correct class, close to 1), the loss is near zero. If it's wrong or uncertain, $-\log p_y$ grows steeply. This loss has a beautiful probabilistic interpretation: it's the negative log-likelihood under a categorical distribution.
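A minimal NumPy sketch of softmax cross-entropy, assuming integer class labels and logits of shape (batch, classes). The function name is illustrative. Subtracting the row maximum before exponentiating is the usual stability trick: it leaves the softmax unchanged but prevents overflow for large logits.

```python
import numpy as np

def softmax_cross_entropy(logits, labels):
    """Mean cross-entropy loss and its gradient with respect to the logits."""
    n = logits.shape[0]
    shifted = logits - logits.max(axis=1, keepdims=True)  # stability shift
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    loss = -log_probs[np.arange(n), labels].mean()
    # Gradient: softmax probabilities minus the one-hot label, averaged over the batch.
    grad = np.exp(log_probs)
    grad[np.arange(n), labels] -= 1.0
    return loss, grad / n

# Uniform scores over 10 classes: the loss equals log(10).
loss_uniform, _ = softmax_cross_entropy(np.zeros((4, 10)), np.array([0, 3, 5, 9]))

# Same confident logits, scored against a right and a wrong label.
confident = np.array([[10.0, 0.0, 0.0]])
loss_right, _ = softmax_cross_entropy(confident, np.array([0]))
loss_wrong, grad_wrong = softmax_cross_entropy(confident, np.array([1]))

# Very large logits do not overflow thanks to the shift.
loss_big, _ = softmax_cross_entropy(np.array([[1000.0, 0.0]]), np.array([0]))

print(loss_uniform, loss_right, loss_wrong, loss_big)
```

The uniform case is a handy sanity check when training starts: with C classes and near-zero initial logits, the first reported loss should be about log(C).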
SVM / hinge loss (multi-class SVM): For each wrong class $j$, penalize if its score $s_j$ is within a margin $\Delta$ (commonly 1) of the correct class score $s_y$:

$$L = \sum_{j \neq y} \max(0,\; s_j - s_y + \Delta)$$
This only cares about getting the ordering right with a margin — it doesn't try to push probabilities to 0 or 1. In practice, cross-entropy is almost always preferred for neural networks because its gradients are smoother and it works naturally with softmax.
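The multi-class hinge loss can be sketched as follows, assuming a margin of 1 and integer class labels. The function name is illustrative. The second example shows the behaviour described above: once the correct class wins by at least the margin, the loss is exactly zero and pushing scores further apart changes nothing.

```python
import numpy as np

def multiclass_hinge_loss(scores, labels, margin=1.0):
    """Per-example multi-class SVM loss for scores of shape (batch, classes)."""
    n = scores.shape[0]
    correct = scores[np.arange(n), labels][:, None]   # score of the true class
    margins = np.maximum(0.0, scores - correct + margin)
    margins[np.arange(n), labels] = 0.0               # the true class is not penalized
    return margins.sum(axis=1)

scores = np.array([
    [3.0, 1.0, 2.5],   # class 2 is within the margin of the correct class 0
    [1.0, 5.0, 1.0],   # correct class 1 wins by a wide margin
])
labels = np.array([0, 1])

per_example = multiclass_hinge_loss(scores, labels)
print(per_example)         # [0.5, 0.0]
print(per_example.mean())  # 0.25
```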
Predicted score for the correct class on the x-axis, loss on the y-axis. Cross-entropy grows sharply as confidence in the wrong answer increases. Hinge loss is piecewise linear.
| Loss | Task | Gradient Behavior | When to Use |
|---|---|---|---|
| Cross-entropy | Classification | Smooth, probabilistic | Default for classification |
| Hinge | Classification | Piecewise linear | SVMs, margin-based models |
| MSE (L2) | Regression | Proportional to error | Default for regression |
| MAE (L1) | Regression | Constant magnitude | Outlier-robust regression |
This is the payoff. We'll train the same 5-layer ReLU network on the same data, changing only the weight initialization. Watch how the distribution of activations evolves layer by layer. Healthy training keeps the histograms roughly bell-shaped with consistent spread. Bad initialization causes them to collapse to zero or explode to the extremes.
Four initialization strategies, same architecture. Each row shows activation histograms across 5 layers. Green = healthy, Red = collapsed/exploded. Click an init method, then "Forward Pass" to push random data through the network.
Try clicking "Forward Pass" multiple times for each initialization. He init produces consistent, healthy distributions every time. Small random sometimes works for early layers but always collapses by layer 5. Zeros are dead on arrival.
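The same experiment can be approximated offline. The sketch below pushes random Gaussian inputs through a 5-layer ReLU network (no biases) and records the standard deviation of the activations after each layer. The layer width of 256, the sample count, and the specific standard deviations used for "small" (0.01) and "large" (1.0) are assumptions for illustration, as is the function name `layer_stds`.

```python
import numpy as np

def layer_stds(init, n_layers=5, width=256, n_samples=1000, seed=0):
    """Std of ReLU activations after each layer for a given init scheme."""
    rng = np.random.default_rng(seed)
    h = rng.normal(size=(n_samples, width))  # random unit-variance inputs
    stds = []
    for _ in range(n_layers):
        if init == "zeros":
            W = np.zeros((width, width))
        elif init == "small":
            W = rng.normal(0.0, 0.01, size=(width, width))
        elif init == "large":
            W = rng.normal(0.0, 1.0, size=(width, width))
        elif init == "he":
            W = rng.normal(0.0, np.sqrt(2.0 / width), size=(width, width))
        else:
            raise ValueError(f"unknown init: {init}")
        h = np.maximum(0.0, h @ W)  # linear layer followed by ReLU
        stds.append(float(h.std()))
    return stds

for name in ["zeros", "small", "large", "he"]:
    print(f"{name:>5}:", ["%.2e" % s for s in layer_stds(name)])
```

With these settings the "small" run should shrink by roughly an order of magnitude per layer, the "large" run should grow by a similar factor, and the He run should stay at a roughly constant scale, which mirrors the histograms in the demo.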
Every technique we've covered comes with knobs to turn. Setting these hyperparameters — values not learned by gradient descent but chosen by you — is both art and science. Here's a practical reference.
| Hyperparameter | Typical Range | Start With | Effect of Too High | Effect of Too Low |
|---|---|---|---|---|
| Learning rate | 1e-5 to 1e-1 | 1e-3 | Divergence, loss explodes | Training too slow, stuck |
| Reg strength λ | 1e-5 to 1e-1 | 1e-4 | Underfitting | Overfitting |
| Dropout rate p | 0.0 to 0.5 | 0.0 (no dropout) | Underfitting, noisy gradients | Overfitting |
| Batch size | 32 to 512 | 64 or 128 | Less noise, fewer updates | Noisy gradients, slow |
Learning rate is the single most important hyperparameter. Too high and the loss oscillates or diverges. Too low and training takes forever or gets stuck in a bad local minimum. A common technique: learning rate warmup (start low, increase linearly) followed by cosine decay (gradually decrease to near zero).
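Linear warmup followed by cosine decay can be written as a pure function of the step number. This is a minimal sketch: the function name `lr_at_step`, the peak rate of 1e-3, and the 100-step warmup over 1000 total steps are illustrative values, not recommendations.

```python
import math

def lr_at_step(step, total_steps, warmup_steps, peak_lr=1e-3, min_lr=0.0):
    """Learning rate at a given step: linear warmup, then cosine decay to min_lr."""
    if step < warmup_steps:
        # Ramp linearly from peak_lr / warmup_steps up to peak_lr.
        return peak_lr * (step + 1) / warmup_steps
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, progress)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

schedule = [lr_at_step(s, total_steps=1000, warmup_steps=100) for s in range(1000)]

print("first step:   ", schedule[0])
print("end of warmup:", schedule[99])
print("midpoint:     ", schedule[550])
print("last step:    ", schedule[-1])
```

Because the schedule depends only on the step index, it can be plotted or unit-tested before any training run starts.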
Batch size controls the noise in gradient estimates. Smaller batches give noisier but more frequent updates. Larger batches give more accurate gradients but fewer updates per epoch. There's evidence that moderate noise (batch size 32-256) actually helps generalization by preventing the optimizer from settling into sharp minima.
We've covered everything between designing an architecture and training it. Here's where these techniques fit in the bigger picture:
| This Lesson | What Comes Next |
|---|---|
| Data preprocessing | Data augmentation (flips, crops, color jitter) |
| Weight initialization | Residual connections (skip connections fix deep init) |
| Batch normalization | Layer norm, group norm, RMS norm (transformer variants) |
| Dropout | DropPath, stochastic depth (modern regularization) |
| Loss functions | Contrastive loss, focal loss, triplet loss |
| Hyperparameter tuning | Learning rate schedules, optimizers (Adam, AdamW) |
The techniques in this lesson scale remarkably well. Batch normalization was designed for convolutional networks in 2015, and its descendants (layer norm, RMS norm) normalize activations in today's transformers. He initialization, together with batch normalization and residual connections, helps make training 100-layer ResNets possible. Dropout, though less common in modern architectures, remains a reliable tool when data is scarce.
Related lessons: Neural Networks Part 1 (the architecture), Linear Classification (where it all started), Image Classification (data pipeline in practice).